OpenAI puts GPT-4 to work on math: a reward model that flags wrong steps lifts problem solving to a new level
Source: Heart of the Machine
Large language models have entered an era in which they can do a bit of everything, and their ability to perform complex multi-step reasoning has improved greatly. Still, even state-of-the-art large models produce logical errors, often called hallucinations. Mitigating hallucinations is therefore a crucial step toward building aligned AGI.
To train a more reliable model, there are currently two approaches to training a reward model: outcome supervision and process supervision. Outcome-supervised reward models (ORMs) are trained using only the final result of the model's chain of thought, while process-supervised reward models (PRMs) receive a reward signal for each step in the chain of thought.
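To make the distinction concrete, here is a minimal sketch, using illustrative data structures rather than OpenAI's actual code, of how the two labeling schemes differ for a single model-generated solution:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    text: str
    is_correct: bool            # per-step judgment (in practice, from a human labeler)

@dataclass
class Solution:
    steps: List[Step]
    final_answer_correct: bool  # checked automatically against the reference answer

def orm_label(solution: Solution) -> int:
    """Outcome supervision: a single label for the entire chain of thought."""
    return int(solution.final_answer_correct)

def prm_labels(solution: Solution) -> List[int]:
    """Process supervision: one label per reasoning step."""
    return [int(step.is_correct) for step in solution.steps]
```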
Given how important reliable models are and how costly human feedback is, it is worth comparing outcome supervision and process supervision carefully. Recent work has made this comparison, but many questions remain.
In this paper, OpenAI investigates and finds that process supervision significantly outperforms outcome supervision when training models to solve problems on the MATH dataset. OpenAI solved 78% of the problems on a representative subset of the MATH test set using its own PRM model.
In addition, to support related research, OpenAI is also open-sourcing PRM800K, a complete dataset of 800K step-level human feedback labels used to train its best reward model.
Dataset address:
Method Overview
The study compares outcome supervision with process supervision following an approach similar to Uesato et al. (2022). Notably, outcome supervision can be provided without humans here, because every question in the MATH dataset has an automatically checkable answer. Process supervision, in contrast, is not easy to automate, so the study relies on human data labelers to judge the correctness of each step in the model-generated solutions. Experiments were run in both large-scale and small-scale settings.
Scope
For each model size, the study uses a single fixed model to generate all solutions. This model is called the generator, and OpenAI does not improve it with reinforcement learning (RL).
Base model
All large models are fine-tuned from the base GPT-4 model. The study also adds an extra pre-training step: fine-tuning all models on MathMix, a dataset of about 1.5B math-related tokens. Similar to Lewkowycz et al. (2022), OpenAI's team found that this step improves the models' mathematical reasoning ability.
Generator
To make it easier to parse individual steps, the study trains the generator to produce solutions with steps separated by newlines. Specifically, the study uses few-shot prompting to generate solutions to MATH training problems, keeps only the solutions that reach the correct final answer, and fine-tunes the base model on this dataset for one epoch.
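A hedged sketch of that data-preparation loop is below. The helpers `few_shot_sample` and `check_final_answer`, and the `samples_per_problem` value, are hypothetical stand-ins for the few-shot sampling and automatic answer checking; they are not from OpenAI's code.

```python
def build_generator_finetune_set(problems, few_shot_sample, check_final_answer,
                                 samples_per_problem=16):
    """Collect correct few-shot solutions to fine-tune the generator on."""
    finetune_examples = []
    for problem in problems:
        for _ in range(samples_per_problem):
            # Sample a step-by-step solution; steps are separated by newlines
            # so that individual steps are easy to parse later.
            solution_text = few_shot_sample(problem)
            # Keep only solutions whose final answer is correct.
            if check_final_answer(problem, solution_text):
                finetune_examples.append({"prompt": problem, "completion": solution_text})
    # The base model is then fine-tuned on these examples for one epoch.
    return finetune_examples
```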
Data collection
To collect process-supervision data, the study shows human data labelers step-by-step solutions to math problems sampled from the large-scale generator. The labeler's task is to assign a positive, negative, or neutral label to each step in the solution, as shown in Figure 1 below.
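For illustration, a single labeled solution might look like the record below. The field names and the example problem are made up and do not reflect the actual PRM800K schema.

```python
labeled_solution = {
    "problem": "Simplify (2x + 3) + (4x - 5).",
    "steps": [
        {"text": "Combine the x terms: 2x + 4x = 6x.",      "label": "positive"},
        {"text": "Combine the constants: 3 - 5 = -2.",      "label": "positive"},
        {"text": "So the expression simplifies to 6x - 2.", "label": "positive"},
    ],
}
# Each step receives exactly one of "positive", "negative", or "neutral".
```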
Outcome-Supervised Reward Model (ORM)
Following an approach similar to Cobbe et al. (2021), the study samples a fixed number of solutions per problem from the generator and trains the ORM to predict whether each solution is correct. In practice, correctness of the final answer is checked automatically, although in principle human labelers could provide these labels. At test time, the study uses the ORM's prediction at the final token as the overall score for each solution.
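A rough sketch of that scoring step is shown below, assuming a hypothetical `orm_model` that returns a per-token probability of the solution being correct and a Hugging-Face-style tokenizer; this is an illustration, not OpenAI's pipeline.

```python
import torch

def orm_score(orm_model, tokenizer, problem: str, solution: str) -> float:
    """Score a solution by the ORM's correctness prediction at the final token."""
    ids = tokenizer(problem + "\n" + solution, return_tensors="pt").input_ids
    with torch.no_grad():
        # Assume the model returns a per-token probability that the solution is correct.
        per_token_correct_prob = orm_model(ids)   # shape: [1, seq_len]
    # The prediction at the final token is used as the solution's overall score.
    return per_token_correct_prob[0, -1].item()
```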
Process-Supervised Reward Model (PRM)
The PRM is trained to predict the correctness of each step, making its prediction after the last token of that step. The prediction takes the form of a single token, and OpenAI maximizes the log-likelihood of these target tokens during training. PRMs can therefore be trained with a standard language-model pipeline, without any special adaptations.
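At inference time, the per-step probabilities have to be combined into a single solution-level score. The sketch below uses the product of step probabilities, one natural choice; treat the aggregation as an assumption for illustration rather than the paper's exact formula.

```python
import math
from typing import List

def prm_solution_score(step_correct_probs: List[float]) -> float:
    """Combine per-step correctness probabilities into one solution-level score."""
    # Product of step probabilities: the score is the probability that every
    # step is correct, treating the steps as independent.
    return math.prod(step_correct_probs)

# Example: a four-step solution where the PRM is confident in every step.
print(prm_solution_score([0.98, 0.95, 0.99, 0.97]))  # ≈ 0.894
```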
Figure 2 shows two solutions to the same problem: the one on the left is correct, and the one on the right is wrong. A green background indicates a high PRM score and a red background indicates a low score. The PRM correctly identifies the mistakes in the incorrect solution.
Large-Scale Supervision
OpenAI trains the PRM on PRM800K, the full process-supervision dataset. To make the ORM baseline stronger, OpenAI also trains it on 100 samples per problem; these samples all come from the generator, so there is no overlap between the ORM training set and PRM800K.
The figure below compares the outcome-supervised and process-supervised reward models as well as majority voting, and shows that the PRM is more effective than both the ORM and majority voting at searching over model-generated solutions.
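For reference, here is a small sketch of the two selection strategies being compared, where each candidate solution is reduced to a (final_answer, reward_score) pair; the structure is illustrative and not taken from the paper's code.

```python
from collections import Counter
from typing import List, Tuple

def best_of_n_by_prm(solutions: List[Tuple[str, float]]) -> str:
    """Pick the final answer of the highest-scoring solution (reward-model search)."""
    answer, _ = max(solutions, key=lambda pair: pair[1])
    return answer

def majority_vote(solutions: List[Tuple[str, float]]) -> str:
    """Pick the most common final answer, ignoring reward-model scores."""
    counts = Counter(answer for answer, _ in solutions)
    return counts.most_common(1)[0][0]
```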
To compare outcome supervision and process supervision fairly, note first that the ORM and PRM training sets are not directly comparable: the PRM training set was constructed with active learning, is biased towards solutions with wrong answers, and is an order of magnitude smaller.
Process Supervision vs. Outcome Supervision
First, OpenAI samples between 1 and 200 solutions per problem from the small-scale generator. For each dataset, OpenAI provides three forms of supervision: process supervision from PRM_large, outcome supervision from PRM_large, and outcome supervision from final-answer checking.
Figure 4a shows that process supervision is significantly better than both forms of outcome supervision; Figure 4b shows that outcome supervision with PRM_large is significantly more effective than outcome supervision with final-answer checking.
To measure out-of-distribution (OOD) generalization, OpenAI evaluates the large-scale ORM and PRM on a held-out set of 224 STEM problems drawn from the most recent AP Physics, AP Calculus, AP Chemistry, AMC10, and AMC12 exams (AP is the American Advanced Placement program; AMC10 and AMC12 are mathematics competitions); the models have not seen these questions. Table 1 reports the best-of-100 performance for the ORM, the PRM, and majority voting. The PRM outperforms both the ORM and majority voting, suggesting that its advantage carries over to new test problems.