OpenAI tackles math for GPT-4: a reward model that points out erroneous steps lifts problem solving to a new level

Source: Heart of the Machine

For challenging step-by-step mathematical reasoning problems, is it more effective to give a reward at each step or a single reward at the end? New research from OpenAI provides an answer.


Large language models are entering an era in which they seem able to do almost anything, and their ability to perform complex multi-step reasoning has improved greatly. Still, even large, state-of-the-art models produce logical errors, often called hallucinations. Mitigating hallucinations is therefore a crucial step toward building aligned AGI.

To train a more reliable model, there are currently two options for training a reward model: outcome supervision and process supervision. Outcome-supervised reward models (ORMs) are trained using only the final result of the model's chain of thought, while process-supervised reward models (PRMs) receive feedback for each step in the chain of thought.
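To make the difference concrete, here is a minimal sketch contrasting the two supervision signals. The field names and labels are illustrative assumptions, not OpenAI's actual data format: the ORM sees one label for the whole solution, while the PRM sees one label per step.

```python
# Hypothetical sketch of the two supervision signals (field names are illustrative).

solution_steps = [
    "Step 1: Rewrite the expression using a known identity.",
    "Step 2: Simplify the resulting fraction.",
    "Step 3: Therefore the answer is 3.",
]

# Outcome supervision: a single label for the entire solution (sparse signal).
orm_example = {
    "problem": "Simplify the expression ...",
    "solution": "\n".join(solution_steps),
    "label": 1,  # 1 = final answer correct, 0 = incorrect
}

# Process supervision: one label per step (dense signal).
prm_example = {
    "problem": "Simplify the expression ...",
    "steps": solution_steps,
    "step_labels": [1, 1, 0],  # e.g. the last step contains the error
}
```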

Given the importance of training reliable models and the high cost of human feedback, outcome supervision and process supervision deserve careful comparison. Recent work has begun this comparison, but many questions remain.

In this paper, OpenAI finds that process supervision significantly outperforms outcome supervision when training models to solve problems from the MATH dataset. Using its PRM, OpenAI solves 78% of the problems on a representative subset of the MATH test set.

In addition, to support related research, OpenAI open-sources PRM800K, the complete dataset of 800K step-level human feedback labels used to train its best reward model.

The following is a true-positive example. This problem and the other examples OpenAI cites were answered by GPT-4. This challenging trigonometry problem requires applying several identities in succession in a non-obvious way. Most solution attempts fail, because it is hard to know which identities are actually useful. Although GPT-4 usually fails to solve this problem (only 0.1% of attempts are correct), the reward model correctly recognizes that this solution is valid.

Here is another example, this time a false positive. In the fourth step, GPT-4 wrongly claims that the sequence repeats every 12 terms, when it actually repeats every 10 terms. This kind of counting error occasionally fools the reward model.

"The really interesting result of using LLMs to do math problems is that it's more effective to supervise each step than to just check the answer," said Jan Leike, one of the authors of the paper and head of the OpenAI Alignment team.

According to Jim Fan, an AI scientist at Nvidia, "the point of this paper is simple: For challenging step-by-step problems, rewards are given at each step, rather than a single reward at the end. Fundamentally, dense reward signals > sparse."

Let's take a closer look at the methods and results of the OpenAI paper.

Paper address:

Dataset address:

Method Overview

The study compares outcome supervision with process supervision following an approach similar to Uesato et al. (2022). Notably, outcome supervision can be provided without humans, because every problem in the MATH dataset has an automatically checkable answer. Process supervision, in contrast, is not easily automated: the study relies on human data labelers, who mark the correctness of each step in the model-generated solutions. Experiments are conducted in both large-scale and small-scale settings.

Scope

For each model size, the study uses a single fixed model to generate all solutions. This model is called the generator; OpenAI does not attempt to improve the generator with reinforcement learning (RL).

Base Model

All large-scale models are fine-tuned from the base GPT-4 model. The study also adds an extra pre-training step: all models are fine-tuned on MathMix, a dataset containing roughly 1.5B math-related tokens. Similar to Lewkowycz et al. (2022), OpenAI's team found that this step improves the model's mathematical reasoning ability.

Generator

To make individual steps easier to parse, the study trains the generator to produce solutions with steps separated by newlines. Specifically, the study uses few-shot prompting to generate solutions to MATH training problems, keeps only the solutions that reach the correct final answer, and fine-tunes the base model on this dataset for a single epoch, as sketched below.
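A rough sketch of that filtering step, under the assumption that solutions and reference answers are plain strings; `sample_solutions` and `extract_final_answer` are hypothetical helpers standing in for few-shot sampling and answer parsing.

```python
def build_generator_finetune_set(problems, sample_solutions, extract_final_answer):
    """Keep only few-shot-sampled solutions whose final answer matches the reference.

    `sample_solutions(question)` and `extract_final_answer(solution)` are
    hypothetical helpers, not part of any released OpenAI code.
    """
    finetune_examples = []
    for problem in problems:
        for solution in sample_solutions(problem["question"]):
            # Solutions are formatted with one step per line (newline-delimited),
            # which makes later step-level parsing straightforward.
            if extract_final_answer(solution) == problem["answer"]:
                finetune_examples.append(
                    {"prompt": problem["question"], "completion": solution}
                )
    return finetune_examples
```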

Data Collection

To collect process-supervised data, the study shows human data labelers step-by-step solutions to math problems sampled from the large-scale generator. The labeler's task is to assign a positive, negative, or neutral label to each step in the solution, as shown in Figure 1 below.

The study labels only solutions produced by the large-scale generator, to get the most value out of limited human labeling resources. The full dataset of step-level labels collected this way is called PRM800K. The PRM800K training set contains 800K step labels covering 75K solutions to 12K problems. To minimize overfitting, the PRM800K training set includes data from 4.5K MATH test problems, and models are evaluated only on the remaining 500 MATH test problems.
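To make the labeling format concrete, here is a hypothetical step-labeled record in the spirit of PRM800K; the field names and numeric ratings are illustrative, not the released dataset's actual schema.

```python
# Illustrative step-labeled record (not the real PRM800K schema).
labeled_solution = {
    "problem": "What is the 100th term of the repeating sequence ...?",
    "steps": [
        {"text": "The sequence repeats with some fixed period.",      "rating": 0},   # neutral
        {"text": "Listing the terms shows the period is 10.",         "rating": 1},   # positive
        {"text": "So term 100 equals term 100 mod 12, i.e. term 4.",  "rating": -1},  # negative
    ],
}
```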

Outcome-Supervised Reward Model (ORM)

Following an approach similar to Cobbe et al. (2021), the study trains the ORM by sampling a fixed number of solutions per problem from the generator and training the ORM to predict whether each solution is correct. In practice, correctness is usually determined by automatically checking the final answer, although in principle these labels could be provided by humans. At test time, the ORM's prediction at the final token is used as the total score for each solution.
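As a rough illustration of that scoring rule, the sketch below reads a solution-level score off a reward model at the final token. The `reward_model` and `tokenizer` interfaces are assumptions made for illustration, not OpenAI's actual implementation.

```python
import torch

def orm_score(reward_model, tokenizer, problem, solution):
    """Score a full solution with an ORM-style model.

    Assumes `reward_model(input_ids)` returns per-token logits for a
    "solution is correct" prediction (a hypothetical interface). The
    prediction at the final token is taken as the solution's total score.
    """
    text = problem + "\n" + solution
    input_ids = torch.tensor([tokenizer.encode(text)])
    with torch.no_grad():
        per_token_logits = reward_model(input_ids)        # shape: (1, seq_len)
    return torch.sigmoid(per_token_logits[0, -1]).item()  # score at final token
```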

Process-Supervised Reward Model (PRM)

The PRM is trained to predict the correctness of each step at the last token of that step. The prediction takes the form of a single token, and OpenAI maximizes the log-likelihood of these target tokens during training. PRMs can therefore be trained in a standard language-model pipeline without any special accommodation.
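A minimal sketch of step-level scoring in this spirit follows, again assuming a hypothetical `reward_model` that returns per-token logits for a "this step is correct" target. Aggregating the per-step scores by taking their product (the probability that every step is correct) is one common choice, stated here as an assumption rather than the paper's exact rule.

```python
import torch

def prm_step_scores(reward_model, tokenizer, problem, steps):
    """Score each step of a newline-delimited solution with a PRM-style model.

    Each step is scored at its last token; the interfaces are hypothetical.
    """
    scores, prefix = [], problem
    for step in steps:
        prefix = prefix + "\n" + step
        input_ids = torch.tensor([tokenizer.encode(prefix)])
        with torch.no_grad():
            logits = reward_model(input_ids)               # shape: (1, seq_len)
        scores.append(torch.sigmoid(logits[0, -1]).item())
    return scores

def prm_solution_score(step_scores):
    # One way to aggregate: the probability that every step is correct,
    # i.e. the product of the per-step scores.
    score = 1.0
    for s in step_scores:
        score *= s
    return score
```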

Figure 2 shows two solutions to the same problem: the one on the left is correct, the one on the right is wrong. A green background indicates a high PRM score and a red background a low one. The PRM correctly identifies the mistake in the incorrect solution.

When providing process supervision, OpenAI deliberately supervises only up to the first erroneous step, which makes the comparison between outcome supervision and process supervision more straightforward. For correct solutions, both methods provide the same information, since every step is correct. For incorrect solutions, both methods reveal that at least one error exists, and process supervision additionally reveals the exact location of that error.
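That "supervise only up to the first error" convention can be expressed as a tiny label-truncation helper. This is an illustrative sketch, with 1, -1, and 0 standing in for positive, negative, and neutral step labels.

```python
def truncate_at_first_error(step_labels):
    """Keep step labels only up to and including the first incorrect step.

    For a correct solution nothing is dropped; for an incorrect one,
    steps after the first error go unused.
    """
    truncated = []
    for label in step_labels:
        truncated.append(label)
        if label == -1:  # -1 = incorrect step
            break
    return truncated

# Example: the third step is the first error, so the fourth step is discarded.
assert truncate_at_first_error([1, 1, -1, 1]) == [1, 1, -1]
```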

Large-Scale Supervision

OpenAI trains the large-scale PRM on the step-level labels in the process-supervised dataset PRM800K. To make the ORM baseline as strong as possible, OpenAI trains it on 100 samples per problem drawn from the generator, so there is no overlap between the ORM training set and PRM800K.

The figure below compares outcome-supervised and process-supervised reward models with majority voting. It shows that the PRM is more effective than both the ORM and majority voting at searching over the solutions generated by the model.
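For reference, best-of-N search with a reward model and the majority-voting baseline amount to the following minimal sketch; the inputs are hypothetical lists of sampled solutions, their scores, and their extracted final answers.

```python
from collections import Counter

def best_of_n_by_reward(solutions, scores):
    """Pick the sampled solution the reward model ranks highest (best-of-N search)."""
    best_index = max(range(len(solutions)), key=lambda i: scores[i])
    return solutions[best_index]

def majority_vote(final_answers):
    """Pick the most common final answer among the samples (self-consistency baseline)."""
    return Counter(final_answers).most_common(1)[0][0]
```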

Small-Scale Synthetic Supervision

To compare outcome supervision and process supervision more fairly, note first that the ORM and PRM training sets are not directly comparable: the PRM training set was constructed with active learning, is biased toward solutions with wrong answers, and is an order of magnitude smaller.

Process Supervision vs. Outcome Supervision

First, OpenAI samples between 1 and 200 solutions per problem from the small-scale generator. For each dataset, OpenAI provides three forms of supervision: process supervision from PRM_large, outcome supervision from PRM_large, and outcome supervision from final-answer checking.

Figure 4a shows that process supervision is significantly better than both forms of outcome supervision; Figure 4b shows that outcome supervision from PRM_large is significantly more effective than outcome supervision from final-answer checking.

OOD Generalization

To measure out-of-distribution (OOD) generalization, OpenAI evaluates the large-scale ORM and PRM on a held-out set of 224 STEM problems drawn from the most recent AP Physics, AP Calculus, AP Chemistry, AMC10, and AMC12 exams (AP refers to the American Advanced Placement program; AMC10 and AMC12 are mathematics competitions). The model has not seen these questions. Table 1 reports the best-of-100 performance of the ORM, the PRM, and majority voting. The PRM outperforms both the ORM and majority voting, suggesting that its advantage carries over to new test problems.
