Enhancing Fact Retrieval in PLMs through Truthfulness

Paul Youssef† Jörg Schlötterer†‡ Christin Seifert†
†University of Marburg, ‡University of Mannheim
{paul.youssef, joerg.schloetterer, christin.seifert}@uni-marburg.de
Abstract

Pre-trained Language Models (PLMs) encode various facts about the world during pre-training, as they are trained to predict the next or missing word in a sentence. There has been interest in quantifying and improving the amount of facts that can be extracted from PLMs, as they have been envisioned to act as soft knowledge bases that can be queried in natural language. Different approaches exist to enhance fact retrieval from PLMs. Recent work shows that the hidden states of PLMs can be leveraged to determine the truthfulness of the PLMs' inputs. Leveraging this finding to improve factual knowledge retrieval remains unexplored. In this work, we investigate the use of a helper model to improve fact retrieval. The helper model assesses the truthfulness of an input based on the corresponding hidden state representations from the PLMs. We evaluate this approach on several masked PLMs and show that it enhances fact retrieval by up to 33%. Our findings highlight the potential of hidden state representations from PLMs in improving their factual knowledge retrieval.

1 Introduction

Pre-trained Language Models (PLMs) absorb numerous facts about the world from their pre-training data Petroni et al. (2019); Roberts et al. (2020). This has sparked the interest of the NLP community in examining and improving the amount of knowledge that can be extracted from PLMs Youssef et al. (2023). Indeed, several enhancements have been proposed, which go beyond manual prompts Petroni et al. (2019) to improve fact retrieval by directly optimizing prompts Shin et al. (2020); Zhong et al. (2021); Li et al. (2022), re-writing prompts with other PLMs Haviv et al. (2021); Zhang et al. (2022), finetuning the PLMs themselves Roberts et al. (2020); Fichtel et al. (2021) or debiasing the outputs of PLMs Zhao et al. (2021); Dong et al. (2022); Wang et al. (2023).

Figure 1: An overview of our method for factual knowledge retrieval. A helper model decides which of the proposed answers is correct based on hidden state representations of the answer from the probed PLM.

Recent work Burns et al. (2022) shows that representations from PLMs can be leveraged to determine whether the provided inputs are truthful, i.e., these representations can be used to answer yes/no questions, or to conduct binary classification in an unsupervised manner. However, the utility of these representations for improving fact retrieval has not been examined yet. In this work, we close this gap by investigating how a helper model, which classifies which of the top-$k$ retrieved answers is correct based on the corresponding representations, improves fact retrieval. Figure 1 shows an overview of our approach. Our results show improved performance on several masked PLMs.

In summary, our contributions are the following: i) We investigate the use of a helper model in improving fact retrieval based on hidden representations from PLMs; ii) We show that our approach improves fact retrieval performance by up to 33%; iii) We analyze how increasing the number of the considered predictions affects the final performance.

2 Related Work

Fact retrieval.

Despite the incoherence of factual knowledge in PLMs Youssef et al. (2024), many works aim to improve fact retrieval from PLMs. One improvement direction concerns the prompts used to retrieve facts, which have undergone many refinements. After the use of manual prompts by Petroni et al. (2019), the focus shifted to optimizing prompts, either by automatically finding paraphrases that perform better Qin and Eisner (2021), by optimizing the prompts in a discrete space Shin et al. (2020) or in a continuous space Zhong et al. (2021), or by having other PLMs re-write the prompts Haviv et al. (2021); Zhang et al. (2022). Another direction for improvement has been the PLMs themselves, which have been finetuned for better fact retrieval Roberts et al. (2020); Fichtel et al. (2021) or to become more robust to changes in the prompts Elazar et al. (2021); Newman et al. (2022). Other works have focused on debiasing the outputs from PLMs in different ways Zhao et al. (2021); Dong et al. (2022); Malkin et al. (2022); Wang et al. (2023); Yoshikawa and Okazaki (2023). Our work also aims to improve the outputs from PLMs by leveraging information about the truthfulness of the inputs, which can be derived from the hidden states. For a comprehensive review of factual knowledge retrieval from PLMs, we refer the interested reader to Youssef et al. (2023).

Truthfulness in Language Models.

Burns et al. (2022) show that the hidden states from LLMs can be used in an unsupervised learning setting to distinguish between truthful and untruthful statements. Similarly, Azaria and Mitchell (2023) leverage the hidden states of LLMs to train a feedforward neural network to predict the truthfulness of the LLMs’ inputs, and show its effectiveness on several LLMs. Pacchiardi et al. (2024) show that it is possible to detect untruthful answers from LLMs with no access to the hidden states from the LLMs by asking several simple yes/no questions and feeding the LLMs’ outputs into a logistic regression model. Despite its simplicity, their approach is shown to generalize to different architectures and LLMs that are finetuned to output lies. In summary, many works exist that show that the hidden states of LLMs can be leveraged to predict the truthfulness of their inputs. In this work, we leverage the truthfulness signal in hidden states to improve fact retrieval.

3 Methodology

Factual knowledge in PLMs is estimated by evaluating how often PLMs can correctly predict an object entity $o$, given a subject entity $s$ and a relation $r$ that are expressed through a prompt $p(s,r)$. Given a PLM $\mathcal{M}$, its predicted object $\hat{o} = \arg\max_{o} \mathbb{P}_{\mathcal{M},p(s,r)}[o]$ is the token with the highest probability given the prompt $p(s,r)$ that contains the subject and the relation.

In this work, we consider the top-$k$ outputs from $\mathcal{M}$ instead of using only the top-1 prediction $\hat{o}$. In order to decide which of the $k$ outputs is the final prediction, we leverage a helper model $\mathcal{H}$. $\mathcal{H}$ takes as input the hidden state that corresponds to the final token of the input, taken from the last encoder layer of the PLM $\mathcal{M}$ after inputting the prompt $p(s,r,o_i)$, where $o_i$ refers to the $i$-th top prediction and $i \in \{1, \dots, k\}$. Assuming that $p(s,r,o_i)$ consists of $j$ tokens and $\mathcal{M}$ has $l$ layers, we use the hidden state $h_{j,l}$ from $\mathcal{M}(p(s,r,o_i))$, i.e., the hidden state that corresponds to the $j$-th token at the $l$-th layer, as input to the helper model $\mathcal{H}$. $\mathcal{H}$ classifies $h_{j,l}$ as either truthful or not. Since Burns et al. (2022) show that the hidden states of PLMs contain information about the truthfulness of their inputs, we expect the helper model $\mathcal{H}$ to positively affect the factual knowledge retrieval performance.
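The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `get_hidden_state` and `helper_predicts_truthful` are hypothetical stand-ins for extracting $h_{j,l}$ from $\mathcal{M}$ and for the helper model $\mathcal{H}$. Returning the first candidate that is classified as truthful follows the description in the analysis of the Neg. Index; falling back to the top-1 prediction when no candidate is accepted is an assumption.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def select_answer(
    candidates: List[str],
    get_hidden_state: Callable[[str], Vector],
    helper_predicts_truthful: Callable[[Vector], bool],
) -> str:
    """Return the first top-k candidate that the helper classifies as truthful.

    candidates: top-k objects o_1..o_k, ordered by PLM probability.
    get_hidden_state: hypothetical stand-in that maps a candidate o_i to the
        hidden state h_{j,l} of the filled prompt p(s, r, o_i).
    helper_predicts_truthful: hypothetical stand-in for the helper model H.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    for candidate in candidates:
        if helper_predicts_truthful(get_hidden_state(candidate)):
            return candidate
    # Assumption: fall back to the PLM's top-1 prediction if nothing is accepted.
    return candidates[0]


# Toy usage with made-up 2-d "hidden states" and a threshold rule as "helper".
toy_states = {"London": [0.1, -0.4], "Paris": [0.9, 0.3], "Lyon": [0.7, 0.2]}
prediction = select_answer(
    ["London", "Paris", "Lyon"],
    get_hidden_state=toy_states.__getitem__,
    helper_predicts_truthful=lambda h: h[0] > 0.5,
)
print(prediction)
```

In the toy call, the top-1 candidate is rejected by the stand-in helper, so the second-ranked candidate is returned. Note that each candidate requires its own forward pass to obtain a hidden state, which is the source of the extra computational cost of the approach.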

4 Experiments

Here, we describe the data and PLMs used in our experiments in detail.

4.1 Datasets and PLMs

Datasets.

We use two test sets in our experiments to evaluate the fact retrieval performance:

  • LAMA: we use the T-REx Elsahar et al. (2018) subset of LAMA Petroni et al. (2019), which is often used to estimate factual knowledge in PLMs.

  • WIKIUNI: A second dataset for estimating factual knowledge in PLMs Cao et al. (2021). In contrast to LAMA, in this dataset the ground truth objects are uniformly distributed.

To train the helper model $\mathcal{H}$, we consider two training sets:

  • AUTOPROMPT: The dataset used by Shin et al. (2020) to optimize prompts in a discrete space.

  • WIKIUNI: The same as the dataset mentioned above for testing.

Training the helper model.

Since the training sets contain only the ground truth object, we sample untruthful examples from $\mathcal{M}$'s outputs to train $\mathcal{H}$. The untruthful examples correspond to the top $k+1$ outputs. We report the optimal top $k+1$ for each setting, and refer to this as the Neg. Index. We also report the accuracy of the helper model. As the helper model, we use a simple logistic regression model with L1 regularization. Since the frequency distribution of the object entities in the training and test sets might have an impact on the performance, we report the Pearson correlation coefficient Corr between these two distributions.
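To make the helper model concrete, the following sketch fits an L1-regularized logistic regression with proximal gradient descent in plain Python. Only the model family (logistic regression with L1 regularization) comes from the description above; the optimizer, the hyperparameters `l1`, `lr` and `epochs`, and the two-dimensional toy features are assumptions made for illustration. In practice one would more likely use a library implementation on the extracted hidden states.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w: float, t: float) -> float:
    # Proximal operator of t * |w|; this is what produces exact zeros under L1.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def train_l1_logreg(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    l1: float = 0.01,
    lr: float = 0.5,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Minimize mean log loss + l1 * ||w||_1 by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for i in range(d):
                grad_w[i] += err * x[i] / n
            grad_b += err / n
        # Gradient step on the smooth loss, then soft-thresholding for the L1 term.
        w = [soft_threshold(wi - lr * g, lr * l1) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b  # the bias is not regularized
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    # 1 = "truthful", 0 = "untruthful".
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5)


# Toy data: feature 0 separates the classes, feature 1 carries no signal.
X = [[1.0, 0.0], [0.8, 0.0], [1.2, 0.0], [-1.0, 0.0], [-0.9, 0.0], [-1.1, 0.0]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_l1_logreg(X, y)
accuracy = sum(predict(w, b, x) == t for x, t in zip(X, y)) / len(y)
print(w, b, accuracy)
```

The soft-thresholding step is what distinguishes L1 from L2 regularization here: weights of uninformative dimensions are set exactly to zero, which yields a sparse classifier over the hidden state dimensions.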

PLMs.

We experiment with four models: BERT-base, BERT-large Devlin et al. (2019), T5-base and T5-large Raffel et al. (2020). We summarize the number of parameters and the architecture of each model in Table 1. For BERT, we exclude examples with objects that consist of more than one token. For the T5 models, we keep both one-token and multiple-token objects, and use Typed Querying (TyQ) Kassner et al. (2021) to extract the top-$k$ predictions. In TyQ, the objects to be considered are limited to a subset, in contrast to the normal case where the whole vocabulary is considered. TyQ makes it easier to consider objects consisting of multiple tokens. We augment the set of objects for each relation with predictions from BERT. We replace the subject with NA to extract the predictions, keeping only the relation in the prompt. As a baseline, we use the object with the highest probability from the PLMs. Following Burns et al. (2022), we extract the hidden representations from the last encoder layer for both model types.
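The core idea of typed querying, ranking only a restricted candidate set instead of the whole vocabulary, can be sketched as follows. This is an illustrative simplification: `scores` is a hypothetical mapping from objects to log-probabilities that is assumed to be given, and how scores for multi-token objects are obtained from the PLM is not covered here.

```python
from typing import Dict, Iterable, List


def typed_top_k(
    scores: Dict[str, float], candidate_set: Iterable[str], k: int
) -> List[str]:
    """Rank only the objects in candidate_set by score and return the k best.

    scores: hypothetical object -> log-probability mapping from the PLM.
    candidate_set: the typed subset of objects allowed for this relation.
    """
    allowed = [obj for obj in set(candidate_set) if obj in scores]
    # Sort by score (descending); ties are broken alphabetically for determinism.
    allowed.sort(key=lambda obj: (-scores[obj], obj))
    return allowed[:k]


# Made-up scores for a "capital of" style prompt.
scores = {"Paris": -0.4, "the": -0.2, "Lyon": -1.7, "New York City": -2.3, "it": -0.9}
cities = {"Paris", "Lyon", "New York City", "Berlin"}
print(typed_top_k(scores, cities, k=2))
```

Although the function word "the" has the highest score in the toy example, it is never returned because it is not in the candidate set, and a multi-token object such as "New York City" is handled like any other candidate.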

| Model | #Parameters | Architecture |
|---|---|---|
| BERT-base | 110M | encoder-only |
| BERT-large | 345M | encoder-only |
| T5-base | 220M | encoder-decoder |
| T5-large | 770M | encoder-decoder |
Table 1: Models with number of parameters and architectures.

4.2 Results & Discussions

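| Training Set | Test Set | Model | Baseline | Ours | Diff.% | Corr | Neg. Index | Helper Acc. |
|---|---|---|---|---|---|---|---|---|
| AUTOPROMPT | LAMA | BERT-base | 29.29 | 38.99 | 33.12 | 0.74 | 21 | 87.33 |
| AUTOPROMPT | LAMA | BERT-large | 31.37 | 40.04 | 27.64 | 0.74 | 21 | 87.54 |
| AUTOPROMPT | LAMA | T5-base | 14.52 | 16.59 | 14.26 | 0.74 | 31 | 89.22 |
| AUTOPROMPT | LAMA | T5-large | 17.63 | 19.50 | 10.61 | 0.74 | 101 | 88.09 |
| AUTOPROMPT | WIKIUNI | BERT-base | 14.94 | 16.93 | 13.32 | 0.04 | 91 | 90.30 |
| AUTOPROMPT | WIKIUNI | BERT-large | 16.79 | 19.14 | 14.00 | 0.04 | 81 | 90.91 |
| AUTOPROMPT | WIKIUNI | T5-base | 5.90 | 6.15 | 4.24 | 0.04 | 21 | 88.20 |
| AUTOPROMPT | WIKIUNI | T5-large | 7.44 | 7.43 | -0.13 | 0.04 | 11 | 89.45 |
| WIKIUNI | LAMA | BERT-base | 29.29 | 34.23 | 16.87 | -0.03 | 101 | 82.53 |
| WIKIUNI | LAMA | BERT-large | 31.37 | 36.26 | 15.59 | -0.03 | 71 | 81.39 |
| WIKIUNI | LAMA | T5-base | 14.53 | 16.21 | 11.56 | -0.03 | 101 | 79.82 |
| WIKIUNI | LAMA | T5-large | 17.63 | 18.94 | 7.43 | -0.03 | 101 | 79.08 |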
Table 2: Factual knowledge retrieval performance (accuracy) on several test sets. Training set is only used with helper approach. Diff.% refers to the percentage improvement in performance.

Table 2 shows the results of our experiments. In general, the top-1 accuracy improves when the helper model is used. The improvements range from 4% (T5-base with AUTOPROMPT and WIKIUNI) to more than 33% (BERT-base with AUTOPROMPT and LAMA). One exception is T5-large with AUTOPROMPT and WIKIUNI, where the performance is essentially unchanged (-0.13%).
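As a quick check of how Diff.% relates to the accuracy columns, the snippet below assumes that Diff.% is the change relative to the baseline, which reproduces the values reported in Table 2. The function name `relative_improvement` is chosen for illustration only.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Percentage change of `ours` relative to `baseline`."""
    return (ours - baseline) / baseline * 100.0


# Accuracy values taken from Table 2 (AUTOPROMPT -> LAMA, BERT-base).
print(round(relative_improvement(29.29, 38.99), 2))  # 33.12
# AUTOPROMPT -> WIKIUNI, T5-large: a slight decrease.
print(round(relative_improvement(7.44, 7.43), 2))  # -0.13
```

The 33% figure is therefore a relative gain; in absolute terms, the accuracy of BERT-base on LAMA increases by 9.70 points.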

Even though the improvements in performance are highest when the correlation between the training and test sets is high (AUTOPROMPT and LAMA), we still observe improvements when there is no correlation between the training and test sets (WIKIUNI and LAMA). This supports the findings of Burns et al. (2022) that the hidden states contain information about the truthfulness of the inputs, as does the high accuracy of the helper model (> 80% in most cases). We also notice that both the fact retrieval performance and the gains in performance after using a helper model are higher for BERT models than for T5 models. This suggests that encoder-only models are not only better at fact retrieval Lewis et al. (2020); Zhang et al. (2021); Youssef et al. (2023), but that their hidden states also contain more information about the truthfulness of the inputs.
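The Corr values discussed above measure how similar the object frequency distributions of the training and test sets are. A minimal sketch of such a computation is given below; aligning the two count vectors over the union of objects, with a count of zero for objects missing from one set, is an assumption, and the object lists are made up.

```python
import math
from collections import Counter
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def object_frequency_corr(train_objects: List[str], test_objects: List[str]) -> float:
    """Correlate object counts over the union of objects (assumed alignment)."""
    train_counts, test_counts = Counter(train_objects), Counter(test_objects)
    vocab = sorted(set(train_counts) | set(test_counts))
    # Counter returns 0 for objects that do not occur in one of the sets.
    return pearson(
        [train_counts[o] for o in vocab], [test_counts[o] for o in vocab]
    )


# Made-up object lists: the test set has the same skew as the training set.
train = ["Paris"] * 4 + ["Rome"] * 2 + ["Oslo"]
test = ["Paris"] * 8 + ["Rome"] * 4 + ["Oslo"] * 2
print(object_frequency_corr(train, test))
```

A value close to 1 means that frequent training objects are also frequent test objects, in which case a helper model could in principle profit from frequency cues; values near 0, as for the settings involving WIKIUNI, rule out this shortcut.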

Figure 2: Relation between Neg. Index ($k+1$) and gain in accuracy. Top (a): BERT models; bottom (b): T5 models.

Effect of Neg. Index ($k+1$)

We also investigate the relation between the Neg. Index and the gain in performance. Note that as we increase the Neg. Index, we also increase the number of predictions $k$ to be considered by the helper model. For example, a Neg. Index of 21 corresponds to $k=20$, i.e., the helper model assesses the top $k=20$ predictions, and returns the first one that is predicted to be truthful.

Figure 2 shows the relation between the Neg. Index ($k+1$) and gains in performance for BERT models (Figure 2a) and T5 models (Figure 2b). We notice that the gains in performance for BERT-base and BERT-large look similar. The gains on AUTOPROMPT-LAMA reach their highest at 21, and start dropping slightly after that. On WIKIUNI-LAMA, the gains steadily increase with the Neg. Index and reach their highest at 71 (BERT-large) and 101 (BERT-base). A similar trend can be noticed on AUTOPROMPT-WIKIUNI. In general, when there is a high correlation between the training and test sets (e.g., AUTOPROMPT-LAMA), the gains in performance are high and are attained at small Neg. Index ($k$) values. Conversely, when there is no correlation between the training and test sets, the gains in performance are lower and are reached at larger Neg. Index ($k$) values. This can be attributed either to the availability of more predictions to choose from (with the increase of $k$), or to a potential improvement in the accuracy of the helper model. Further investigation is needed to disentangle the effects of these two factors.

The gains in performance for T5 models are smaller than those of BERT models (see Figure 2b). Here, we notice that the gains on WIKIUNI are the smallest and do not vary much. For T5-large, the performance even degrades slightly. On LAMA, there is more variance, and we notice that when there is no correlation between the training and test sets (WIKIUNI-LAMA), the best performance is reached at high Neg. Index values (101 for both T5-base and T5-large). When the correlation is high (AUTOPROMPT-LAMA), the best performance is reached at lower values (e.g., T5-base at Neg. Index $=31$). An exception here is T5-large, where the best performance is reached at Neg. Index $=101$. However, the difference to the peak at the lower Neg. Index $=31$ is small (a gain of 1.56 at 31 vs. 1.87 at 101). We believe the differences in performance between BERT and T5 can be attributed to their different architectures.

5 Conclusion

In this work, we investigated the use of a helper model to improve fact retrieval. The helper model relies on the hidden state representations from PLMs to determine the truthfulness of the corresponding inputs. We showed the effectiveness of this approach in improving fact retrieval on several masked PLMs. Furthermore, we showed that increasing the number of considered predictions affects the performance positively, especially in cases where the answer frequencies of the training and test sets are not correlated. Even though our approach leads to improved fact retrieval performance, it is more computationally demanding, since it requires extracting the hidden states for all potential predictions. In future work, we aim to optimize this method and extend our evaluation to LLMs.

6 Limitations

Our experiments do not include any LLMs due to the high computational costs associated with LLMs under our approach.

References