Enhancing Fact Retrieval in PLMs through Truthfulness
Abstract
Pre-trained Language Models (PLMs) encode various facts about the world during their pre-training phase, as they are trained to predict the next or missing word in a sentence. There has been an interest in quantifying and improving the amount of facts that can be extracted from PLMs, as they have been envisioned to act as soft knowledge bases that can be queried in natural language. Different approaches exist to enhance fact retrieval from PLMs. Recent work shows that the hidden states of PLMs can be leveraged to determine the truthfulness of the PLMs’ inputs. Leveraging this finding to improve factual knowledge retrieval remains unexplored. In this work, we investigate the use of a helper model to improve fact retrieval. The helper model assesses the truthfulness of an input based on the corresponding hidden state representations from the PLMs. We evaluate this approach on several masked PLMs and show that it enhances fact retrieval by up to 33%. Our findings highlight the potential of hidden state representations from PLMs in improving their factual knowledge retrieval.
1 Introduction
Pre-trained Language Models (PLMs) absorb numerous facts about the world from their pre-training data Petroni et al. (2019); Roberts et al. (2020). This has sparked the interest of the NLP community in examining and improving the amount of knowledge that can be extracted from PLMs Youssef et al. (2023). Indeed, several enhancements have been proposed, which go beyond manual prompts Petroni et al. (2019) to improve fact retrieval by directly optimizing prompts Shin et al. (2020); Zhong et al. (2021); Li et al. (2022), re-writing prompts with other PLMs Haviv et al. (2021); Zhang et al. (2022), finetuning the PLMs themselves Roberts et al. (2020); Fichtel et al. (2021) or debiasing the outputs of PLMs Zhao et al. (2021); Dong et al. (2022); Wang et al. (2023).
Recent work Burns et al. (2022) shows that representations from PLMs can be leveraged to determine if the provided inputs are truthful or not, i.e., these representations can be utilized to answer yes/no questions, or to conduct binary classification in an unsupervised manner. However, the utility of these representations for improving fact retrieval has not been examined yet. In this work, we close this gap by investigating how a helper model, which classifies which of the top-$k$ retrieved answers is correct based on the corresponding representations, improves fact retrieval. Figure 1 gives an overview of our approach. Our results show an improved performance on several masked PLMs.
In summary, our contributions are the following: i) We investigate the use of a helper model in improving fact retrieval based on hidden representations from PLMs; ii) We show that our approach improves fact retrieval performance by up to 33%; iii) We analyze how increasing the number of the considered predictions affects the final performance.
2 Related Work
Fact retrieval.
Despite the incoherence of factual knowledge in PLMs Youssef et al. (2024), many works exist that aim to improve fact retrieval from PLMs. One improvement direction is the prompts used to retrieve facts, which have undergone many refinements. After the use of manual prompts by Petroni et al. (2019), the focus shifted to optimizing prompts either through automatically finding paraphrases that perform better Qin and Eisner (2021), by optimizing the prompts in discrete space Shin et al. (2020) or in a continuous space Zhong et al. (2021), or re-writing the prompts by other PLMs Haviv et al. (2021); Zhang et al. (2022). Another direction for improvement has been the PLMs themselves, which have been finetuned for better fact retrieval Roberts et al. (2020); Fichtel et al. (2021) or to become more robust to changes in the prompts Elazar et al. (2021); Newman et al. (2022). Other works have focused on debiasing the outputs from PLMs in different ways Zhao et al. (2021); Dong et al. (2022); Malkin et al. (2022); Wang et al. (2023); Yoshikawa and Okazaki (2023). Our work also aims to improve the outputs from PLMs by leveraging information about the truthfulness of the inputs, which can be derived from the hidden states. For a comprehensive review of factual knowledge retrieval from PLMs, we refer the interested reader to Youssef et al. (2023).
Truthfulness in Language Models.
Burns et al. (2022) show that the hidden states from LLMs can be used in an unsupervised learning setting to distinguish between truthful and untruthful statements. Similarly, Azaria and Mitchell (2023) leverage the hidden states of LLMs to train a feedforward neural network to predict the truthfulness of the LLMs’ inputs, and show its effectiveness on several LLMs. Pacchiardi et al. (2024) show that it is possible to detect untruthful answers from LLMs with no access to the hidden states from the LLMs by asking several simple yes/no questions and feeding the LLMs’ outputs into a logistic regression model. Despite its simplicity, their approach is shown to generalize to different architectures and LLMs that are finetuned to output lies. In summary, many works exist that show that the hidden states of LLMs can be leveraged to predict the truthfulness of their inputs. In this work, we leverage the truthfulness signal in hidden states to improve fact retrieval.
3 Methodology
Factual knowledge in PLMs is estimated by evaluating how often PLMs can correctly predict an object entity $o$, given a subject entity $s$ and a relation $r$ that are expressed through a prompt $t(s, r)$. Given a PLM $M$, its predicted object $\hat{o}$ is the token with the highest probability given the prompt $t(s, r)$ that contains the subject $s$ and the relation $r$.
In this work, we consider the top-$k$ outputs $\hat{o}_1, \dots, \hat{o}_k$ from $M$ instead of using only the top-1 prediction $\hat{o}_1$. In order to decide which of the outputs is the final prediction, we leverage a helper model $H$. $H$ takes as input the hidden state that corresponds to the final token in the input from the last encoder layer of the PLM after inputting the prompt $t(s, r, \hat{o}_i)$, where $\hat{o}_i$ refers to the top $i$-th prediction and $i \in \{1, \dots, k\}$. Assuming that the input consists of $n$ tokens and $M$ has $L$ layers, we use the hidden state $h_n^L$, where $h_j^l$ represents the hidden state that corresponds to the $j$-th token from the $l$-th layer, as input to the helper model $H$. $H$ classifies the input as either truthful or not. Since Burns et al. (2022) show that the hidden states of PLMs contain information about the truthfulness of their inputs, we expect the helper model to positively affect the factual knowledge retrieval performance.
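The selection step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy weights, and the fallback to the top-1 prediction when no candidate is classified as truthful are assumptions, and the hidden states are assumed to have been extracted from the PLM beforehand.

```python
import math
from typing import Callable, List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_logistic_helper(
    weights: Sequence[float], bias: float, threshold: float = 0.5
) -> Callable[[Sequence[float]], bool]:
    """Build a hypothetical helper that labels a hidden state as truthful or not."""

    def helper(hidden_state: Sequence[float]) -> bool:
        z = bias + sum(w * h for w, h in zip(weights, hidden_state))
        return sigmoid(z) >= threshold

    return helper


def select_prediction(
    candidates: List[str],
    hidden_states: List[Sequence[float]],
    helper: Callable[[Sequence[float]], bool],
) -> str:
    """Return the first candidate (in PLM rank order) judged truthful.

    candidates[i] is the i-th ranked prediction; hidden_states[i] is the
    final-token, last-layer hidden state of the prompt filled with it.
    """
    if not candidates or len(candidates) != len(hidden_states):
        raise ValueError("need one hidden state per candidate")
    for candidate, state in zip(candidates, hidden_states):
        if helper(state):
            return candidate
    # Assumption: fall back to the top-1 prediction if nothing is accepted.
    return candidates[0]


# Toy usage with made-up two-dimensional hidden states.
helper = make_logistic_helper(weights=[1.0, -1.0], bias=0.0)
ranked = ["Berlin", "Paris", "Lyon"]
states = [[-1.0, 1.0], [2.0, 0.5], [3.0, 0.0]]
print(select_prediction(ranked, states, helper))  # Paris
```

Iterating in rank order keeps the PLM's own ranking as a prior, so the helper only overrides the top-1 prediction when it rejects it.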
4 Experiments
Here, we describe the data and PLMs used in our experiments in detail.
4.1 Datasets and PLMs
Datasets.
We use two test sets in our experiments to evaluate the fact retrieval performance:
- LAMA: A dataset for estimating factual knowledge in PLMs Petroni et al. (2019).
- WIKIUNI: A second dataset for estimating factual knowledge in PLMs Cao et al. (2021). In contrast to LAMA, in this dataset the ground truth objects are uniformly distributed.
To train the helper model, we consider two training sets:
- AUTOPROMPT: The dataset used by Shin et al. (2020) to optimize prompts in a discrete space.
- WIKIUNI: The same as the dataset mentioned above for testing.
Training the helper model.
Since the training sets contain only the ground truth object, we sample untruthful examples from the PLM’s outputs to train the helper model. The untruthful examples correspond to top-ranked outputs of the PLM. We report the optimal rank for each setting, and refer to it as the Neg. Index. We also report the accuracy of the helper model. As the helper model, we use a simple logistic regression model with L1 regularization. Since the frequency distribution of the object entities in the training and test sets might have an impact on the performance, we report the Pearson correlation coefficient (Corr) between these two distributions.
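To make the helper model concrete, the following is a minimal pure-Python sketch of a logistic regression classifier with L1 regularization. The source states only the model family and the regularizer; the solver used here (proximal gradient descent with soft-thresholding), the unregularized bias, the function names, and all hyperparameter values are assumptions chosen for illustration. Inputs stand in for hidden states, with label 1 for truthful and 0 for untruthful examples.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value: float, amount: float) -> float:
    """Proximal operator of the L1 norm: shrink towards zero by `amount`."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def train_l1_logistic_regression(
    X: List[Sequence[float]],
    y: List[int],
    l1_strength: float = 0.01,
    learning_rate: float = 0.1,
    epochs: int = 300,
) -> Tuple[List[float], float]:
    """Fit weights and bias by proximal gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for features, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, features))
            err = sigmoid(z) - label
            for j in range(d):
                grad_w[j] += err * features[j] / n
            grad_b += err / n
        # Gradient step on the smooth loss, then L1 shrinkage on the weights.
        w = [
            soft_threshold(wj - learning_rate * gj, learning_rate * l1_strength)
            for wj, gj in zip(w, grad_w)
        ]
        b -= learning_rate * grad_b  # assumption: the bias is not regularized
    return w, b


def predict_truthful(w: Sequence[float], b: float, features: Sequence[float]) -> bool:
    """Classify an input as truthful when the predicted probability is >= 0.5."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, features))) >= 0.5


# Toy data: only the first feature carries the label signal.
X = [[2.0, 0.1], [1.5, -0.1], [-2.0, 0.1], [-1.5, -0.1]]
y = [1, 1, 0, 0]
w, b = train_l1_logistic_regression(X, y)
print([predict_truthful(w, b, x) for x in X])  # [True, True, False, False]
```

The L1 penalty drives the weights of uninformative dimensions to exactly zero, which is one reason it is a common choice for high-dimensional inputs such as hidden states.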
PLMs.
We experiment with four models: BERT-base, BERT-large Devlin et al. (2019), T5-base and T5-large Raffel et al. (2020). We summarize the number of parameters and architecture for each model in Table 1. For BERT, we exclude examples with objects that consist of more than one token. For the T5 models, we keep both one-token and multiple-token objects, and use Typed Querying (TyQ) Kassner et al. (2021) to extract the top-$k$ predictions. In TyQ, the number of objects to be considered is limited to a subset, in contrast to the normal case where the whole vocabulary is considered. TyQ makes it easier to consider objects consisting of multiple tokens. We augment the set of objects for each relation with predictions from BERT. We replace the subject with NA to extract the predictions, keeping only the relation in the prompt. As a baseline, we use the object with the highest probability from the PLMs. Following Burns et al. (2022), we extract the hidden representations from the last encoder layer for both model types.
Model | #Parameters | Architecture |
BERT-base | 110M | encoder-only |
BERT-large | 345M | encoder-only |
T5-base | 220M | encoder-decoder |
T5-large | 770M | encoder-decoder |
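The idea behind Typed Querying, ranking only a restricted candidate set instead of the whole vocabulary, can be sketched as below. This is a simplified illustration under stated assumptions: the function name is hypothetical, and each candidate object (including multi-token ones) is assumed to already have a score from the PLM, which hides the scoring step itself.

```python
from typing import Dict, Iterable, List


def typed_top_k(scores: Dict[str, float], candidates: Iterable[str], k: int) -> List[str]:
    """Return the k highest-scoring objects among the allowed candidates.

    scores maps an object string to its (assumed precomputed) PLM score;
    objects outside the candidate set are ignored even if they score higher.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort candidates first so ties are broken deterministically by name.
    typed = [(obj, scores[obj]) for obj in sorted(set(candidates)) if obj in scores]
    typed.sort(key=lambda pair: pair[1], reverse=True)
    return [obj for obj, _ in typed[:k]]


# Toy usage: "the" has the highest score but is not a valid object type.
scores = {"the": 0.40, "Paris": 0.30, "New York": 0.20, "Rome": 0.05}
candidates = {"Paris", "New York", "Rome", "Tokyo"}
print(typed_top_k(scores, candidates, k=2))  # ['Paris', 'New York']
```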
4.2 Results & Discussions
Training Set | Test Set | Model | Baseline | Ours | Diff.% | Corr | Neg. Index | Helper Acc. |
AUTOPROMPT | LAMA | BERT-base | 29.29 | 38.99 | 33.12 | 0.74 | 21 | 87.33 |
AUTOPROMPT | LAMA | BERT-large | 31.37 | 40.04 | 27.64 | 0.74 | 21 | 87.54 |
AUTOPROMPT | LAMA | T5-base | 14.52 | 16.59 | 14.26 | 0.74 | 31 | 89.22 |
AUTOPROMPT | LAMA | T5-large | 17.63 | 19.50 | 10.61 | 0.74 | 101 | 88.09 |
AUTOPROMPT | WIKIUNI | BERT-base | 14.94 | 16.93 | 13.32 | 0.04 | 91 | 90.30 |
AUTOPROMPT | WIKIUNI | BERT-large | 16.79 | 19.14 | 14.00 | 0.04 | 81 | 90.91 |
AUTOPROMPT | WIKIUNI | T5-base | 5.90 | 6.15 | 4.24 | 0.04 | 21 | 88.20 |
AUTOPROMPT | WIKIUNI | T5-large | 7.44 | 7.43 | -0.13 | 0.04 | 11 | 89.45 |
WIKIUNI | LAMA | BERT-base | 29.29 | 34.23 | 16.87 | -0.03 | 101 | 82.53 |
WIKIUNI | LAMA | BERT-large | 31.37 | 36.26 | 15.59 | -0.03 | 71 | 81.39 |
WIKIUNI | LAMA | T5-base | 14.53 | 16.21 | 11.56 | -0.03 | 101 | 79.82 |
WIKIUNI | LAMA | T5-large | 17.63 | 18.94 | 7.43 | -0.03 | 101 | 79.08 |
Table 2 shows the results of our experiments. In general, the top-1 accuracy improves when the helper model is used. The relative improvements range from about 4% (T5-base with AUTOPROMPT and WIKIUNI) to more than 33% (BERT-base with AUTOPROMPT and LAMA). One exception is T5-large with AUTOPROMPT and WIKIUNI, where the performance remains essentially unchanged (-0.13%).
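The Diff.% column in Table 2 is consistent with a relative improvement over the baseline, i.e., (Ours - Baseline) / Baseline × 100. The short sketch below recomputes two of the reported values; the function name is an illustrative choice.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Relative change of `ours` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline accuracy must be non-zero")
    return (ours - baseline) / baseline * 100.0


# BERT-base, trained on AUTOPROMPT, tested on LAMA (Table 2).
print(round(relative_improvement(29.29, 38.99), 2))  # 33.12
# T5-large, trained on AUTOPROMPT, tested on WIKIUNI (Table 2).
print(round(relative_improvement(7.44, 7.43), 2))  # -0.13
```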
Even though the improvements in performance are the highest when the correlation between the training and test sets is high (AUTOPROMPT and LAMA), we still notice improvements when there is no correlation between the training and test sets (WIKIUNI and LAMA). This supports the findings of Burns et al. (2022) that the hidden states contain information about the truthfulness of the inputs. This is further supported by the high accuracy of the helper model (> 80% in most cases). We also notice that the fact retrieval performance and the gains in performance after using a helper model are higher for BERT models than for T5 models. This suggests that encoder-only models are not only better at fact retrieval Lewis et al. (2020); Zhang et al. (2021); Youssef et al. (2023), but that their hidden states also contain more information about the truthfulness of the inputs.
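For reference, the Corr values discussed above are Pearson correlation coefficients between the object frequency distributions of the training and test sets. A minimal sketch of such a computation is shown below; aligning the two distributions over the union of objects, with a count of zero for objects missing from one set, is an assumption, as are the function name and the toy data.

```python
import math
from collections import Counter
from typing import Iterable


def object_frequency_correlation(train_objects: Iterable[str], test_objects: Iterable[str]) -> float:
    """Pearson correlation between per-object counts in two datasets."""
    train_counts = Counter(train_objects)
    test_counts = Counter(test_objects)
    # Assumption: compare counts over the union of objects seen in either set.
    objects = sorted(set(train_counts) | set(test_counts))
    x = [train_counts[o] for o in objects]
    y = [test_counts[o] for o in objects]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant distribution")
    return cov / math.sqrt(var_x * var_y)


# Toy usage: the same object dominates both sets, so the correlation is high.
train = ["Paris", "Paris", "Paris", "London", "Rome"]
test = ["Paris", "Paris", "London", "Rome"]
print(object_frequency_correlation(train, test))
```

A uniformly distributed test set such as WIKIUNI gives nearly constant counts, which is consistent with the near-zero Corr values reported for it.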
Effect of Neg. Index
We also investigate the relation between the Neg. Index and the gain in performance. Note that as we increase the Neg. Index, we also increase the number of predictions to be considered by the helper model. For example, a Neg. Index of 21 corresponds to a , i.e., the helper model assesses the top predictions, and returns the first one that is predicted to be truthful.
Figure 2 shows the relation between Neg. Index and gains in performance for BERT models (Figure 2a) and T5 models (Figure 2b). We notice that the gains in performance for BERT-base and BERT-large look similar. The gains on AUTOPROMPT-LAMA reach their highest at 21, and start dropping slightly after that. On WIKIUNI-LAMA, the gains steadily increase with Neg. Index and reach their highest at 71 (BERT-large) and 101 (BERT-base). A similar trend can be noticed on AUTOPROMPT-WIKIUNI. In general, one can notice that when there is a high correlation between the train and test sets (e.g., AUTOPROMPT-LAMA), the gains in performance are high and are attained at small Neg. Index values. Conversely, when there is no correlation between the train and test sets, the gains in performance are lower and are reached at larger Neg. Index values. This can either be attributed to the availability of more predictions to choose from (as the Neg. Index increases), or to a potential improvement in the accuracy of the helper model. Further investigation is needed to disentangle the effects of the two factors.
The gains in performance for T5 models are smaller (see Figure 2b) than those of BERT models. Here, we notice that the gains on WIKIUNI are the smallest and do not vary much. For T5-large, the performance even degrades slightly. On LAMA, there is more variance, and we notice that when there is no correlation between the training and test sets (WIKIUNI-LAMA), the best performance is reached at high Neg. Index values (101 for both T5-base and T5-large). When the correlation is high (AUTOPROMPT-LAMA), the best performance is reached at lower values (e.g., T5-base at Neg. Index 31). An exception here is T5-large, where the best performance is reached at Neg. Index 101. However, the difference is small compared to a peak at a lower Neg. Index (1.56 vs. 1.87). We believe the differences in performance between BERT and T5 can be attributed to their different architectures.
5 Conclusion
In this work, we investigated the use of a helper model to improve fact retrieval. The helper model relies on the hidden state representations from PLMs to determine the truthfulness of the corresponding inputs. We showed the effectiveness of this approach in improving fact retrieval on several masked PLMs. Furthermore, we showed that increasing the number of the considered predictions affects the performance positively, especially in cases where the answer frequencies between the training and test sets are not correlated. Although our approach improves fact retrieval performance, it is more computationally demanding, since it requires extracting the hidden states for all potential predictions. In future work, we aim to optimize this method and extend our evaluation to LLMs.
6 Limitations
Our experiments do not include any LLMs due to the high computational costs associated with LLMs under our approach.
References
- Azaria and Mitchell (2023) Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore. Association for Computational Linguistics.
- Burns et al. (2022) Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. 2022. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations.
- Cao et al. (2021) Boxi Cao, Hongyu Lin, Xianpei Han, Le Sun, Lingyong Yan, Meng Liao, Tong Xue, and Jin Xu. 2021. Knowledgeable or educated guess? revisiting language models as knowledge bases. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1860–1874, Online. Association for Computational Linguistics.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Dong et al. (2022) Qingxiu Dong, Damai Dai, Yifan Song, Jingjing Xu, Zhifang Sui, and Lei Li. 2022. Calibrating factual knowledge in pretrained language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5937–5947, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Elazar et al. (2021) Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav Goldberg. 2021. Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics, 9:1012–1031.
- Elsahar et al. (2018) Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. 2018. T-REx: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
- Fichtel et al. (2021) Leandra Fichtel, Jan-Christoph Kalo, and Wolf-Tilo Balke. 2021. Prompt tuning or fine-tuning - investigating relational knowledge in pre-trained language models. In 3rd Conference on Automated Knowledge Base Construction.
- Haviv et al. (2021) Adi Haviv, Jonathan Berant, and Amir Globerson. 2021. BERTese: Learning to speak to BERT. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3618–3623, Online. Association for Computational Linguistics.
- Kassner et al. (2021) Nora Kassner, Philipp Dufter, and Hinrich Schütze. 2021. Multilingual LAMA: Investigating knowledge in multilingual pretrained language models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3250–3258, Online. Association for Computational Linguistics.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- Li et al. (2022) Yiyuan Li, Tong Che, Yezhen Wang, Zhengbao Jiang, Caiming Xiong, and Snigdha Chaturvedi. 2022. SPE: Symmetrical prompt enhancement for fact probing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11689–11698, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Malkin et al. (2022) Nikolay Malkin, Zhen Wang, and Nebojsa Jojic. 2022. Coherence boosting: When your pretrained language model is not paying enough attention. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8214–8236, Dublin, Ireland. Association for Computational Linguistics.
- Newman et al. (2022) Benjamin Newman, Prafulla Kumar Choubey, and Nazneen Rajani. 2022. P-adapters: Robustly extracting factual information from language models with diverse prompts. In International Conference on Learning Representations.
- Pacchiardi et al. (2024) Lorenzo Pacchiardi, Alex James Chan, Sören Mindermann, Ilan Moscovitz, Alexa Yue Pan, Yarin Gal, Owain Evans, and Jan M. Brauner. 2024. How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions. In The Twelfth International Conference on Learning Representations.
- Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.
- Qin and Eisner (2021) Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
- Roberts et al. (2020) Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426, Online. Association for Computational Linguistics.
- Shin et al. (2020) Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.
- Wang et al. (2023) Yuhang Wang, Dongyuan Lu, Chao Kong, and Jitao Sang. 2023. Towards alleviating the object bias in prompt tuning-based factual knowledge extraction. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4420–4432, Toronto, Canada. Association for Computational Linguistics.
- Yoshikawa and Okazaki (2023) Hiyori Yoshikawa and Naoaki Okazaki. 2023. Selective-LAMA: Selective prediction for confidence-aware evaluation of language models. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2017–2028, Dubrovnik, Croatia. Association for Computational Linguistics.
- Youssef et al. (2023) Paul Youssef, Osman Koraş, Meijie Li, Jörg Schlötterer, and Christin Seifert. 2023. Give me the facts! a survey on factual knowledge probing in pre-trained language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15588–15605, Singapore. Association for Computational Linguistics.
- Youssef et al. (2024) Paul Youssef, Jörg Schlötterer, and Christin Seifert. 2024. The queen of England is not England’s queen: On the lack of factual coherency in PLMs. In Findings of the Association for Computational Linguistics: EACL 2024, pages 2342–2354, St. Julian’s, Malta. Association for Computational Linguistics.
- Zhang et al. (2021) Yian Zhang, Alex Warstadt, Xiaocheng Li, and Samuel R. Bowman. 2021. When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1112–1125, Online. Association for Computational Linguistics.
- Zhang et al. (2022) Yue Zhang, Hongliang Fei, Dingcheng Li, and Ping Li. 2022. PromptGen: Automatically generate prompts using generative models. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 30–37, Seattle, United States. Association for Computational Linguistics.
- Zhao et al. (2021) Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12697–12706. PMLR.
- Zhong et al. (2021) Zexuan Zhong, Dan Friedman, and Danqi Chen. 2021. Factual probing is [MASK]: Learning vs. learning to recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5017–5033, Online. Association for Computational Linguistics.