The trustworthiness assessment of GPT models reveals potential vulnerabilities and privacy risks.

2025-07-13 06:51:33

Abstract generation in progress

Research on the Credibility Assessment of Language Models

The University of Illinois at Urbana-Champaign, in collaboration with several universities and research institutions, has launched a large language model (LLMs) comprehensive trustworthiness assessment platform, which is introduced in the latest paper "Decoding Trust: A Comprehensive Assessment of the Trustworthiness of GPT Models."

The research team conducted a comprehensive credibility assessment of the GPT model and discovered some previously unpublished vulnerabilities. For example, the GPT model is prone to generating toxic and biased outputs and may leak private information from training data and conversation history. Although GPT-4 is generally more reliable than GPT-3.5 in standard tests, it is more susceptible to attacks when faced with maliciously designed prompts, possibly because it adheres more strictly to misleading instructions.

The study conducted a comprehensive evaluation of the GPT model from eight credibility perspectives, including its robustness against text adversarial attacks, adaptability to different task instructions, and system prompts. The evaluation utilized various scenarios, tasks, metrics, and datasets.

The results show that the GPT model performs excellently in certain aspects, such as being unmisled by counterfactual examples. However, there are also some issues, such as being susceptible to misleading system prompts that can produce biased content, especially GPT-4 is more easily influenced. The degree of bias in the model is also related to specific topics, with less bias on certain sensitive topics.

In terms of privacy protection, research has found that GPT models may leak sensitive information from training data, such as email addresses. GPT-4 is more robust than GPT-3.5 in protecting personal identification information, but may still leak privacy in certain cases.

This study provides a comprehensive benchmark for assessing the credibility of language models, helping to identify potential vulnerabilities and driving the development of more reliable models. The research team hopes that this work will encourage academia to continue in-depth research based on this foundation, working together to create more powerful and trustworthy language models.

GPT-5.61%

View Original

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

20 Likes