CN118095402A - Reward model training method and system based on human feedback reinforcement learning - Google Patents

Reward model training method and system based on human feedback reinforcement learning

Info

Publication number
CN118095402A
Authority
CN
China
Prior art keywords
model
training
data
prompts
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410528660.0A
Other languages
Chinese (zh)
Inventor
郭建威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Zhenshi Intelligent Technology Co ltd
Original Assignee
Zhejiang Zhenshi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Zhenshi Intelligent Technology Co ltd filed Critical Zhejiang Zhenshi Intelligent Technology Co ltd
Priority to CN202410528660.0A priority Critical patent/CN118095402A/en
Publication of CN118095402A publication Critical patent/CN118095402A/en
Pending legal-status Critical Current


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of model training, and particularly relates to a reward model training method and system based on reinforcement learning from human feedback. The method comprises the following steps: S1, performing supervised training on a base model with labeled instruction prompts and reply texts to obtain a supervised training model; S2, obtaining a set of instruction prompts with added human preference prompts and the corresponding reply-text lists; S3, scoring and labeling the preference-augmented instruction prompts and reply-text lists with a large model to obtain labeled ranking data; S4, applying data enhancement to the labeled ranking data to obtain data-enhanced ranking data; S5, training the reward model on the data-enhanced ranking data with the low-rank adaptation (LoRA) method to obtain a multi-gate mixture-of-experts model MMoE; and S6, combining the multi-gate mixture-of-experts model MMoE, training the supervised training model of step S1 with a reinforcement learning method to obtain the final dialogue model.

Description

Reward model training method and system based on human feedback reinforcement learning
Technical Field
The invention belongs to the technical field of model training, and particularly relates to a reward model training method and system based on human feedback reinforcement learning.
Background
Reinforcement learning is a machine learning approach in which an agent learns how to make decisions by interacting with an environment so as to maximize the accumulated reward. It differs from supervised learning in that supervised learning requires data with explicit labels or target outputs, whereas reinforcement learning relies on the agent's interaction with the environment to obtain reward feedback and thereby learn a reward-maximizing policy.
A common application of reinforcement learning in natural language processing today is reinforcement learning from human feedback (RLHF): a language model is optimized using human preference feedback on its outputs so that the outputs become aligned with human preferences. RLHF typically comprises three steps. The first step is supervised fine-tuning: instruction prompts and outputs are used as training data to train a base model, yielding a supervised fine-tuned model. The second step trains a reward model: human preference data are used as training data to train a scoring model, which serves as the reward model. The third step uses the reward model trained in the second step and a reinforcement learning algorithm to optimize the supervised fine-tuned model, yielding a final language model aligned with human preferences.
Current approaches to optimizing the RLHF training method fall into four categories:
The first approach is to use a large language model to help generate the ranking data needed to train the reward model.
The second approach balances the effect of the reward model in multiple directions by changing the distribution of training data or changing the objective function of the reward model.
The third approach trains several different reward models and uses them jointly to optimize the language model during reinforcement learning.
The fourth approach increases the generalization ability of the reward model by adding public datasets to its training data.
However, the four methods described above have the following disadvantages:
In the first method, diversity in the ranking data to be labeled is achieved only by adjusting the model's generation parameters or by having the model rewrite the generated sentences; the resulting samples lack diversity, and the quality of the model-rewritten sentences cannot be guaranteed.
The second method, which adjusts the training data for different preference directions, has two drawbacks. First, the mixing ratio of training data across aspects is hard to determine: it is a hyperparameter that must be set according to the amount of data and the importance of each aspect, must be re-tuned for each application scenario, and cannot be automated. Second, related studies indicate that training a single model on several preference directions at once can cause the directions to interfere with one another, so the result may not be optimal.
The third method, training multiple reward models, requires more computing resources (GPU memory, RAM, and CPU) than training a single model. The third stage of RLHF training already consumes considerable computing resources, and having to load multiple reward models simultaneously in that stage increases the required resources further.
In the fourth method, public preference datasets, especially Chinese ones, are scarce and cannot be collected in large quantities. In addition, public datasets are not outputs of the original model, so their correlation with the model's own outputs cannot be ensured, and there is no guarantee that adding a public training set will effectively increase the reward model's generalization ability.
Therefore, it is important to design a reward model training method and system based on reinforcement learning from human feedback that improves the diversity of the reward model's training data, increases its quantity, improves the reward model's performance in learning preferences in multiple directions, and reduces the number of training parameters and the training resource requirements by using the LoRA training method.
Disclosure of Invention
The invention provides a reward model training method and system based on reinforcement learning from human feedback that improve the diversity of the reward model's training data, increase its quantity, improve the reward model's performance in learning preferences in multiple directions, and reduce the training parameters and training resource requirements by using the LoRA training method. This addresses the problems of existing RLHF training methods, namely insufficient diversity of the ranking data, lack of automatic adjustment, large training resource requirements, and no guarantee of improved model generalization.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
the reward model training method based on human feedback reinforcement learning comprises the following steps of;
S1, performing supervision training on a base model by using marked instruction prompts and reply texts as supervision data to obtain a supervision training model;
S2, collecting required instruction prompts, inputting the instruction prompts into a supervision training model to generate a plurality of reply texts, adding human preference prompts to the original instruction prompts, and obtaining a group of instruction prompts and reply text lists after adding the human preference prompts;
s3, marking and labeling the instruction prompt and the reply text list after the human preference prompt is added by using a large model, and obtaining labeled sequencing data;
S4, carrying out data enhancement on the marked ordering data to obtain ordering data after data enhancement;
S5, training the rewarding model by using the ordering data enhanced by the data and adopting a low-rank LORA (Low-rank adaptive) method to obtain a multi-gate hybrid expert model MMOE for learning human preferences in multiple directions;
and S6, training the supervised training model in the step S1 by adopting a reinforcement learning method in combination with the multi-gate hybrid expert model MMOE obtained in the step S5 to obtain a final dialogue model.
Preferably, in step S2, the human preference prompt is added to the original instruction prompt as follows: for an instruction prompt X and a preference prompt C, the augmented instruction prompt is XC or CX, i.e. the concatenation of X and C in either order.
Preferably, in step S5, each expert model in the multi-gate mixture-of-experts model MMoE is a reward model based on the supervised training model.
Preferably, in step S5, the loss function loss in the training phase is calculated as follows:
S51, a pair consisting of an instruction prompt and a reply text is used as input; for each input x, the output of the i-th expert model is e_i(x), i = 1, ..., n,
where x_w denotes the input containing the higher-scored reply and x_l the input containing the lower-scored reply;
S52, the output of each gate is g^k(x) = \mathrm{softmax}(W_g^k x),
where W_g^k \in \mathbb{R}^{n \times d} is a trainable matrix, n is the number of experts, and d is the feature dimension; the output dimension of the k-th gate equals the number of experts;
S53, the output of the linear layer is f^k(x) = \sum_{i=1}^{n} g^k(x)_i \, e_i(x),
where g^k(x)_i is the i-th dimension of the output g^k(x);
S54, the output score of the preference in each direction k is s^k(x) = W_s^k f^k(x),
where W_s^k is the score weight matrix for task k;
S55, the scores s^k(x_w) and s^k(x_l) are calculated for x_w and x_l respectively, giving the loss function loss = -\sum_k \log \sigma\big(s^k(x_w) - s^k(x_l)\big).
Preferably, step S5 further comprises the step:
S56, the trained reward model R is obtained by iteratively optimizing over the training data to minimize the loss function loss.
Preferably, in step S5, every expert model uses the same Transformer-based architecture and all expert models share the same pre-trained weights W; each expert model has its own group of LoRA adaptation matrices; when the reward model is trained with the low-rank adaptation (LoRA) method, only the LoRA adaptation matrix group of each expert model is updated.
Preferably, step S6 includes the steps of:
S61, let \pi_\phi^{RL} be the supervised training model to be trained, D the reinforcement learning training data, \pi^{SFT} the model obtained by training the base model in the first step, and \beta the KL reward coefficient; the reinforcement learning objective function objective(\phi) is:
objective(\phi) = \mathbb{E}_{(p, r) \sim D}\big[ R(p, r) - \beta \log\big( \pi_\phi^{RL}(r \mid p) / \pi^{SFT}(r \mid p) \big) \big],
where R(p, r) is the reward model's score, the expectation of the difference between the reward term and the logarithmic (KL) penalty term is taken over the distribution of the training set D, and r is the reply produced by the supervised training model for the input p in the training data D; the objective function is maximized with the training data D to obtain the final dialogue model.
Preferably, in step S6, the reinforcement learning method is specifically a method that uses the output of the reward model as the training signal.
The invention also provides a reward model training system based on human feedback reinforcement learning, which comprises:
The supervised training module is used for performing supervised training on the base model using labeled instruction prompts and reply texts as supervision data to obtain a supervised training model;
the training data acquisition module is used for collecting the required instruction prompts, inputting them into the supervised training model to generate several reply texts, adding human preference prompts to the original instruction prompts, and obtaining a set of preference-augmented instruction prompts and reply-text lists;
the data labeling module is used for scoring and labeling the preference-augmented instruction prompts and reply-text lists with a large model to obtain labeled ranking data;
the data enhancement module is used for applying data enhancement to the labeled ranking data to obtain data-enhanced ranking data;
the reward model training module is used for training the reward model on the data-enhanced ranking data with the low-rank adaptation (LoRA) method to obtain a multi-gate mixture-of-experts model MMoE for learning human preferences in multiple directions;
and the reinforcement learning module is used for combining the multi-gate mixture-of-experts model MMoE and training the supervised training model with a reinforcement learning method to obtain the final dialogue model.
Compared with the prior art, the invention has the following beneficial effects. (1) The invention proposes generating high-quality reply data with preference-augmented instruction prompts when generating the ranking data to be labeled, thereby improving the diversity of the ranking data. (2) The invention proposes increasing the preference ranking data through data enhancement, which reduces cost, keeps the labeled data strongly correlated, and increases the generalization ability of the supervised model. (3) To better learn multiple aspects of human preference when training the reward model, the invention proposes a new MMoE-based reward model; since RLHF training generally uses the same language model as the supervised model, the MMoE model is constructed with the LoRA method at the cost of only a small number of additional training parameters, avoiding the increased computation of multiple full expert models, so the computational cost is controlled while human preferences in multiple aspects are learned better. (4) The scheme of the invention improves the learning ability and generalization ability of the reward model, enabling better RLHF training.
Drawings
FIG. 1 is a flow chart of the reward model training method based on human feedback reinforcement learning according to the present invention;
FIG. 2 is a flow chart of the model feedback ranking process of the present invention;
FIG. 3 is a schematic diagram of the multi-gate mixture-of-experts model of the present invention;
FIG. 4 is a schematic diagram of training the reward model with the low-rank adaptation (LoRA) method according to the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention, specific embodiments of the present invention will be described below with reference to the accompanying drawings. It is evident that the drawings in the following description are only examples of the invention, from which other drawings and other embodiments can be obtained by a person skilled in the art without inventive effort.
As shown in FIG. 1, the invention provides a reward model training method based on human feedback reinforcement learning, comprising the following steps:
1. Supervised training is performed on the base model using labeled instruction prompts and reply texts as supervision data to obtain a supervised training model.
2. The required instruction prompts are collected and input into the supervised training model to generate several reply texts; human preference prompts are added to the original instruction prompts, yielding a set of preference-augmented instruction prompts and reply-text lists.
3. The preference-augmented instruction prompts and reply-text lists are scored and labeled with a large model to obtain labeled ranking data. The large model may be the supervised model trained in the first step, a paid API such as ChatGPT, or an open-source large model such as LLaMA.
4. Data enhancement is applied to the labeled ranking data to obtain data-enhanced ranking data.
5. The reward model is trained on the data-enhanced ranking data with the low-rank adaptation (LoRA) method to obtain a multi-gate mixture-of-experts model MMoE for learning human preferences in multiple directions; each expert model in the MMoE is a reward model based on the supervised training model.
6. Combining the multi-gate mixture-of-experts model MMoE obtained in step 5, the supervised training model of step 1 is trained with a reinforcement learning method to obtain the final dialogue model.
In step 2, a batch of instruction prompts is collected manually, and for each instruction several reply texts are generated with the fine-tuned supervised training model. The prompts are improved by adding preference prompts: for an instruction prompt X, the prompt after adding the preference prompt C is XC or CX. Adding preference prompts yields high-quality preference data whose effect is comparable to context distillation, and a set of instruction prompts and reply-text lists is thus formed. Context distillation is an alignment method that fine-tunes the model with a KL-divergence-based loss for a context C and a data distribution P(X), so that the model's output on P(X) approaches its output on P(X|C). By adding the preference prompt C, the quality of the replies output by the model is comparable to that of a model after context distillation.
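As a minimal illustrative sketch (not part of the claimed method), the XC / CX prompt construction can be expressed as follows; the preference wording and the function name are hypothetical examples introduced only for illustration:

```python
# Sketch: building a preference-augmented instruction prompt (XC or CX).
# The preference text below is a hypothetical example, not taken from the patent.

def add_preference_prompt(instruction: str, preference: str, order: str = "XC") -> str:
    """Concatenate instruction prompt X and preference prompt C as XC or CX."""
    if order == "XC":
        return f"{instruction} {preference}"
    return f"{preference} {instruction}"

instruction = "How should a wild giant salamander be cooked so that it is not fishy and tastes good?"
preference = "Your answer should reflect environmental-protection awareness."
augmented = add_preference_prompt(instruction, preference)
# The augmented prompt is then fed to the supervised training model to sample replies.
```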
In step 3, for each instruction prompt, as shown in FIG. 2, the reply-text list is scored with a large model, yielding the ranking data labeled with scores; this is the model feedback ranking process.
In step 4, a data enhancement method is applied to the instruction prompts and reply texts of the labeled ranking data to obtain new instruction prompts and reply texts. The augmented texts have the same semantics as the original data, so the ranking labels do not need to be annotated again.
The new instruction prompts and reply texts, together with their ranking order, are added to the previously acquired ranking data to obtain the data-enhanced ranking data.
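A minimal sketch of this kind of augmentation, assuming a toy synonym table (in practice a full synonym dictionary or a paraphrasing model would be used), might look as follows:

```python
import random

# Sketch: synonym-replacement data enhancement for ranking data.
# SYNONYMS is a hypothetical toy lexicon used only for illustration.
SYNONYMS = {
    "cooked": ["prepared"],
    "tastes good": ["is tasty"],
}

def augment(text: str, num_copies: int = 2) -> list[str]:
    """Create semantically equivalent copies of a prompt or reply by swapping synonyms,
    so the original ranking labels can be reused without re-annotation."""
    copies = []
    for _ in range(num_copies):
        new_text = text
        for phrase, alternatives in SYNONYMS.items():
            if phrase in new_text:
                new_text = new_text.replace(phrase, random.choice(alternatives))
        copies.append(new_text)
    return copies

augmented_prompts = augment("How should it be cooked so that it tastes good?")
```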
For step 5, the data-enhanced ranking data is used as input to train the multi-gate mixture-of-experts model; the process is shown in FIG. 3. The expert models have the same structure and weights as the supervised model, the experts use several groups of LoRA adaptation matrices, and the final loss is computed from the gate-weighted outputs. As shown in FIG. 3, the loss function loss in the training phase is calculated as follows:
A pair consisting of an instruction prompt and a reply text is used as input; for each input x, the output of the i-th expert model is e_i(x), i = 1, ..., n,
where x_w denotes the input containing the higher-scored reply and x_l the input containing the lower-scored reply.
The output of each gate is g^k(x) = \mathrm{softmax}(W_g^k x),
where W_g^k \in \mathbb{R}^{n \times d} is a trainable matrix, n is the number of experts, and d is the feature dimension; the output dimension of the k-th gate equals the number of experts. A gate is a structure in the neural network that controls how input information is converted into output information.
The output of the linear layer is f^k(x) = \sum_{i=1}^{n} g^k(x)_i \, e_i(x),
where g^k(x)_i is the i-th dimension of the output g^k(x).
The output score of the preference in each direction k is s^k(x) = W_s^k f^k(x),
where W_s^k is the score weight matrix for task k.
The scores s^k(x_w) and s^k(x_l) are calculated for x_w and x_l respectively, giving the loss function loss = -\sum_k \log \sigma\big(s^k(x_w) - s^k(x_l)\big).
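As an illustrative, non-limiting sketch of the computation above, the MMoE reward head and the pairwise ranking loss could be written in PyTorch roughly as follows. Each expert is reduced here to a single linear layer over a pooled feature vector of dimension d (in the patent each expert is a full LoRA-adapted copy of the supervised model, omitted for brevity), and the exact loss form is an assumption based on the standard pairwise Bradley-Terry formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMOERewardHead(nn.Module):
    """Sketch of the multi-gate mixture-of-experts reward head: experts e_i(x),
    per-task gates g^k(x), gate-weighted combinations f^k(x), and per-task scores s^k(x)."""

    def __init__(self, d: int, n_experts: int, n_tasks: int):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])            # e_i
        self.gates = nn.ModuleList([nn.Linear(d, n_experts, bias=False) for _ in range(n_tasks)])  # W_g^k
        self.scorers = nn.ModuleList([nn.Linear(d, 1) for _ in range(n_tasks)])              # W_s^k

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (batch, d)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, n, d)
        scores = []
        for gate, scorer in zip(self.gates, self.scorers):
            g = F.softmax(gate(x), dim=-1)                               # g^k(x): (batch, n)
            f_k = (g.unsqueeze(-1) * expert_out).sum(dim=1)              # f^k(x): (batch, d)
            scores.append(scorer(f_k).squeeze(-1))                       # s^k(x): (batch,)
        return torch.stack(scores, dim=-1)                               # (batch, n_tasks)

def pairwise_loss(model: MMOERewardHead, x_w: torch.Tensor, x_l: torch.Tensor) -> torch.Tensor:
    """Assumed pairwise ranking loss: -sum_k log sigma(s^k(x_w) - s^k(x_l)), averaged over the batch."""
    s_w, s_l = model(x_w), model(x_l)
    return -F.logsigmoid(s_w - s_l).sum(dim=-1).mean()

# Usage sketch with random features standing in for pooled (prompt, reply) representations.
model = MMOERewardHead(d=768, n_experts=4, n_tasks=2)
loss = pairwise_loss(model, torch.randn(8, 768), torch.randn(8, 768))
```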
For each expert model, the same Transformer architecture is used. According to the LoRA method, the trainable parameter matrices of the Transformer self-attention module (W_q, W_k, W_v) can be regarded as three trainable d \times d matrices, where d is the dimension of the input x. As shown in FIG. 4, let W be any one of these weight matrices, x its input and h its output. The LoRA method adds a pair of low-rank adaptation matrices B and A, where B has dimension d \times r and A has dimension r \times d, d is the feature dimension and r is the rank, a hyperparameter; the output of the module in forward propagation becomes h = Wx + BAx.
During fine-tuning, the trainable matrix W of the self-attention module is kept unchanged; for the i-th expert model, only the adaptation matrices A_i and B_i corresponding to its self-attention modules are updated, so the expert models correspond to several groups of adaptation matrices. Thus the n expert models share the same pre-trained weights W, each expert has its own group of adaptation matrices, and since r is generally much smaller than d, the growth of the reward model's parameter count and computation as experts are added is not significant.
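The per-expert adapters can be sketched as follows (an assumed minimal implementation, not the claimed one): the shared pre-trained projection W is frozen, and each expert i owns only its own low-rank pair (A_i, B_i):

```python
import torch
import torch.nn as nn

class LoRAExpertLinear(nn.Module):
    """Sketch: a shared frozen projection W with one LoRA pair (A_i, B_i) per expert,
    computing h = W x + B_i A_i x for the selected expert i."""

    def __init__(self, pretrained: nn.Linear, n_experts: int, r: int = 8):
        super().__init__()
        self.W = pretrained
        for p in self.W.parameters():             # freeze the shared pre-trained weights
            p.requires_grad_(False)
        d_out, d_in = self.W.weight.shape
        self.A = nn.ParameterList([nn.Parameter(0.01 * torch.randn(r, d_in)) for _ in range(n_experts)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(d_out, r)) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor, expert: int) -> torch.Tensor:
        # Only A[expert] and B[expert] receive gradients during reward-model training.
        return self.W(x) + x @ self.A[expert].t() @ self.B[expert].t()

# Usage sketch: wrap an attention projection of the shared backbone once, then
# route the same input through different experts by index.
proj = LoRAExpertLinear(nn.Linear(768, 768), n_experts=4, r=8)
h = proj(torch.randn(2, 768), expert=0)
```

Because r is much smaller than d, each additional expert adds only 2*d*r parameters per adapted matrix, which is why adding experts barely increases the reward model's size.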
For step 6, the supervised training model from step 1 is trained with a reinforcement learning method in combination with the reward model obtained in step 5. The reinforcement learning algorithm can be any algorithm that uses the output of the reward model as a signal, such as PPO or Project Sample; the specific process is as follows:
Let \pi_\phi^{RL} be the supervised training model to be trained, D the reinforcement learning training data, \pi^{SFT} the model obtained by training the base model in the first step, and \beta the KL reward coefficient; the reinforcement learning objective function objective(\phi) is:
objective(\phi) = \mathbb{E}_{(p, r) \sim D}\big[ R(p, r) - \beta \log\big( \pi_\phi^{RL}(r \mid p) / \pi^{SFT}(r \mid p) \big) \big],
where R(p, r) is the reward model's score and r is the reply produced by the supervised training model for the input p in the training data D; the final dialogue model is obtained by training on the training data D to maximize this objective function.
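A minimal sketch of the per-sample training signal implied by this objective (the reward-model score minus the KL-style penalty) is shown below; in practice this quantity is fed to a policy-gradient algorithm such as PPO, and the scalar tensors used here are simplifications:

```python
import torch

def kl_penalized_reward(reward_score: torch.Tensor,
                        logprob_rl: torch.Tensor,
                        logprob_sft: torch.Tensor,
                        beta: float) -> torch.Tensor:
    """Per-sample signal R(p, r) - beta * log(pi_RL(r|p) / pi_SFT(r|p)).
    reward_score: reward-model score for (p, r); logprob_rl / logprob_sft:
    log-probabilities of the reply r under the policy and the SFT model."""
    return reward_score - beta * (logprob_rl - logprob_sft)

# objective(phi) is the expectation of this quantity over prompts p in D and replies r
# sampled from pi_RL; it is maximized with respect to the policy parameters phi.
signal = kl_penalized_reward(torch.tensor(1.2), torch.tensor(-35.0), torch.tensor(-34.0), beta=0.1)
```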
In addition, the invention also provides a reward model training system based on human feedback reinforcement learning, which comprises:
The supervised training module is used for performing supervised training on the base model using labeled instruction prompts and reply texts as supervision data to obtain a supervised training model;
the training data acquisition module is used for collecting the required instruction prompts, inputting them into the supervised training model to generate several reply texts, adding human preference prompts to the original instruction prompts, and obtaining a set of preference-augmented instruction prompts and reply-text lists;
the data labeling module is used for scoring and labeling the preference-augmented instruction prompts and reply-text lists with a large model to obtain labeled ranking data;
the data enhancement module is used for applying data enhancement to the labeled ranking data to obtain data-enhanced ranking data;
the reward model training module is used for training the reward model on the data-enhanced ranking data with the low-rank adaptation (LoRA) method to obtain a multi-gate mixture-of-experts model MMoE for learning human preferences in multiple directions;
and the reinforcement learning module is used for combining the multi-gate mixture-of-experts model MMoE and training the supervised training model with a reinforcement learning method to obtain the final dialogue model.
Based on the technical scheme of the invention, the implementation process of the invention in practical application is illustrated by the following case scenario; the specific application implementation is as follows:
The method comprises three stages of training. When instruction prompts augmented with preference prompts are used to generate high-quality model replies, environmental protection is taken as the example: the generated results correspond to aligning the model's preferences with environmental-protection awareness. The first stage is supervised training: the language model is trained on the supervised training data to obtain the supervised training model.
The second stage, training the reward model, comprises the following steps:
1. Instruction prompt data collection. A batch of instruction prompt data P = {p_1, p_2, ..., p_N} is written manually, where p_i is an instruction prompt and N is the number of items in the dataset.
2. Each item p_i in P is input to the supervised model \pi^{SFT} trained in the first stage several times to obtain multiple candidate replies, and additional high-quality replies are generated with the improved (preference-augmented) instruction prompts; each instruction prompt p_i thus has a list of candidate replies, yielding the ranking dataset D to be labeled.
An example of the improved instruction prompt (with an added preference prompt) is as follows:
Original instruction prompt: A relative back home sent over a wild giant salamander he caught himself; how should it be cooked so that it is not fishy and tastes good?
Instruction prompt with the preference prompt added: A relative back home sent over a wild giant salamander he caught himself; how should it be cooked so that it is not fishy and tastes good? Your answer should reflect environmental-protection awareness; please output your answer.
Model output: First, we should criticize the unsafe and unsustainable behavior of catching wild giant salamanders. ... Accordingly, we should support the protection of wild animals and comply with the relevant regulations.
3. For each p_i in D, pairs of its corresponding replies are input to the large model, which judges which of the two replies is better; for each pair the large model indicates that reply r_w is better than reply r_l, and the ranking data labeled by the large model is obtained as D' = {(p_i, r_w, r_l)}. An illustrative sketch of this pairwise comparison is given after step 10 below.
4. For each triple (p, r_w, r_l) in D', L augmented copies are obtained using data enhancement methods, where L is the number of enhancements; in this embodiment data enhancement is performed by replacing synonyms in the sentences. For example, "A relative back home sent over a wild giant salamander he caught himself; how should it be cooked so that it is not fishy and tastes good?" becomes, after data enhancement, "A relative back home sent over a wild giant salamander he caught by hand; how should it be prepared so that it has no fishy smell and tastes good?".
5. For each (p, r_w, r_l), let x_w = (p, r_w) and x_l = (p, r_l), and input them respectively into the MMoE-based reward model, obtaining the output e_i(x) of the i-th expert model; the output of each gate is g^k(x) = \mathrm{softmax}(W_g^k x).
6. Weighting gives f^k(x) = \sum_{i=1}^{n} g^k(x)_i \, e_i(x).
7. The final k-th preference score s^k(x_w) is obtained through the linear layer, and the k-th preference score s^k(x_l) is obtained in the same way.
8. The final loss value is calculated as loss = -\sum_k \log \sigma\big(s^k(x_w) - s^k(x_l)\big).
9. The parameters in the network are tuned iteratively by back-propagating the calculated loss value; following the LoRA method, only the low-rank adaptation matrices A_i and B_i corresponding to the i-th expert are updated.
10. Steps 2 to 9 are iterated a preset number of times, or until the loss falls below a specified value; iteration then stops and the reward model is obtained.
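As an illustrative sketch of the pairwise labeling in step 3 above, a comparison prompt can be sent to the large model as follows; the prompt wording and the call_large_model callback are hypothetical placeholders, not an actual API of any particular model:

```python
# Sketch: pairwise preference labeling of two replies with a large model.
# JUDGE_TEMPLATE and call_large_model are hypothetical placeholders.

JUDGE_TEMPLATE = (
    "Instruction: {prompt}\n"
    "Reply A: {reply_a}\n"
    "Reply B: {reply_b}\n"
    "Which reply is better? Answer only with 'A' or 'B'."
)

def label_pair(prompt: str, reply_a: str, reply_b: str, call_large_model) -> tuple[str, str]:
    """Return (r_w, r_l): the reply judged better first, the other second."""
    verdict = call_large_model(
        JUDGE_TEMPLATE.format(prompt=prompt, reply_a=reply_a, reply_b=reply_b)
    )
    better_is_a = verdict.strip().upper().startswith("A")
    return (reply_a, reply_b) if better_is_a else (reply_b, reply_a)
```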
The third stage is reinforcement learning training: manually collected instruction prompts are used as input, and the language model is optimized with the reward model trained in the second stage and a reinforcement learning algorithm. Assume \pi_\phi^{RL} is the supervised training model to be trained, D is the reinforcement learning training data, \pi^{SFT} is the model obtained by training the base model in the first stage, and R is the reward model trained in the second stage. For an input p in D fed to \pi_\phi^{RL}, the model produces a reply r; for (p, r) a reward score R(p, r) is computed, and then the reinforcement learning objective function objective(\phi) = \mathbb{E}_{(p, r) \sim D}\big[ R(p, r) - \beta \log\big( \pi_\phi^{RL}(r \mid p) / \pi^{SFT}(r \mid p) \big) \big] is calculated, where \beta is the KL reward coefficient; the final dialogue model is obtained by maximizing this objective function over the training set D.
The invention mainly improves the training of the reward model in RLHF. In the process of collecting the ranking data for the reward model, a method of improving instruction prompts with preference prompts is proposed to obtain high-quality outputs from the supervised model, thereby increasing the diversity of the ranking data. The ranking data is labeled with a large model, which reduces the cost of manual annotation. After the labeled ranking data required for reward-model training is obtained, it is expanded by data enhancement, which improves the generalization ability of the reward model. The invention further proposes a new reward model combined with a mixture-of-experts model, which can better learn preferences in multiple directions, and by adopting the LoRA training technique the scoring model can be trained without increasing training resources.
The innovation points of the invention are as follows:
1. The invention proposes a data enhancement method for ranking data: by applying data enhancement to the instruction prompts and model outputs of the ranking data, additional ranking data that stays close to the original ranking-data distribution is obtained, improving the generalization of the reward model.
2. The invention innovatively proposes using instruction prompts augmented with preference prompts to increase the diversity of the reward model's training data. When generating the reward model's ranking data, besides generating outputs for instruction prompts with different sampling parameters, simply adding preference prompts improves the quality and score of the generated outputs and improves the generalization ability and effectiveness of the reward model.
3. The invention innovatively proposes an MMoE-based reward model: through the MMoE, each expert model corresponds to preference training in one direction, so human preferences in multiple directions can be learned simultaneously, avoiding the performance degradation that occurs when preferences in multiple directions conflict within a single model.
4. The invention proposes training the MMoE with LoRA: by training the different expert models of the MMoE with the LoRA method on a single base model, the number of trainable parameters of the MMoE is reduced while the desired training effect is still achieved.
In summary, the invention improves the effect of human-feedback-based reinforcement learning for training large language models by improving the diversity of the reward model's training data, increasing its quantity, improving the reward model's performance in learning preferences in multiple directions, and reducing the training parameters and training resource requirements through LoRA.
The foregoing is only illustrative of the preferred embodiments and principles of the present invention, and changes in specific embodiments will occur to those skilled in the art upon consideration of the teachings provided herein, and such changes are intended to be included within the scope of the invention as defined by the claims.

Claims (9)

1. A reward model training method based on human feedback reinforcement learning, characterized by comprising the following steps:
S1, performing supervised training on a base model using labeled instruction prompts and reply texts as supervision data to obtain a supervised training model;
S2, collecting the required instruction prompts, inputting them into the supervised training model to generate several reply texts, adding human preference prompts to the original instruction prompts, and obtaining a set of preference-augmented instruction prompts and reply-text lists;
S3, scoring and labeling the preference-augmented instruction prompts and reply-text lists with a large model to obtain labeled ranking data;
S4, applying data enhancement to the labeled ranking data to obtain data-enhanced ranking data;
S5, training the reward model on the data-enhanced ranking data with the low-rank adaptation (LoRA) method to obtain a multi-gate mixture-of-experts model MMoE for learning human preferences in multiple directions;
and S6, combining the multi-gate mixture-of-experts model MMoE obtained in step S5, training the supervised training model of step S1 with a reinforcement learning method to obtain the final dialogue model.
2. The reward model training method based on human feedback reinforcement learning according to claim 1, characterized in that in step S2, the human preference prompt is added to the original instruction prompt as follows: for an instruction prompt X and a preference prompt C, the augmented instruction prompt is XC or CX.
3. The reward model training method based on human feedback reinforcement learning according to claim 1, characterized in that in step S5, each expert model in the multi-gate mixture-of-experts model MMoE is a reward model based on the supervised training model.
4. The reward model training method based on human feedback reinforcement learning according to claim 3, characterized in that in step S5, the loss function loss is calculated as follows:
S51, a pair consisting of an instruction prompt and a reply text is used as input; for each input x, the output of the i-th expert model is e_i(x), i = 1, ..., n,
where x_w denotes the input containing the higher-scored reply and x_l the input containing the lower-scored reply;
S52, the output of each gate is g^k(x) = \mathrm{softmax}(W_g^k x),
where W_g^k \in \mathbb{R}^{n \times d} is a trainable matrix, n is the number of experts, and d is the feature dimension; the output dimension of the k-th gate equals the number of experts;
S53, the output of the linear layer is f^k(x) = \sum_{i=1}^{n} g^k(x)_i \, e_i(x),
where g^k(x)_i is the i-th dimension of the output g^k(x);
S54, the output score of the preference in each direction k is s^k(x) = W_s^k f^k(x),
where W_s^k is the score weight matrix for task k;
S55, the scores s^k(x_w) and s^k(x_l) are calculated for x_w and x_l respectively, giving the loss function loss = -\sum_k \log \sigma\big(s^k(x_w) - s^k(x_l)\big).
5. The reward model training method based on human feedback reinforcement learning according to claim 4, characterized in that step S5 further comprises the step:
S56, the trained reward model R is obtained by iteratively optimizing over the training data to minimize the loss function loss.
6. The reward model training method based on human feedback reinforcement learning according to claim 5, characterized in that in step S5, every expert model uses the same Transformer-based architecture and all expert models share the same pre-trained weights W; each expert model has its own group of LoRA adaptation matrices; when the reward model is trained with the low-rank adaptation (LoRA) method, only the LoRA adaptation matrix group of each expert model is updated.
7. The reward model training method based on human feedback reinforcement learning according to claim 5, characterized in that step S6 comprises the following step:
S61, let \pi_\phi^{RL} be the supervised training model to be trained, D the reinforcement learning training data, \pi^{SFT} the model obtained by training the base model in the first step, and \beta the KL reward coefficient; the reinforcement learning objective function objective(\phi) is:
objective(\phi) = \mathbb{E}_{(p, r) \sim D}\big[ R(p, r) - \beta \log\big( \pi_\phi^{RL}(r \mid p) / \pi^{SFT}(r \mid p) \big) \big],
where R(p, r) is the reward model's score, the expectation of the difference between the reward term and the logarithmic (KL) penalty term is taken over the distribution of the training set D, and r is the reply of the supervised training model corresponding to the input p in the training data D; the objective function is maximized with the training data D to obtain the final dialogue model.
8. The reward model training method based on human feedback reinforcement learning according to claim 7, characterized in that in step S6, the reinforcement learning method is specifically a method that uses the output of the reward model as the signal.
9. A reward model training system based on human feedback reinforcement learning for implementing the reward model training method based on human feedback reinforcement learning according to any one of claims 1 to 8, characterized in that the reward model training system based on human feedback reinforcement learning comprises:
the supervised training module, used for performing supervised training on the base model using labeled instruction prompts and reply texts as supervision data to obtain a supervised training model;
the training data acquisition module, used for collecting the required instruction prompts, inputting them into the supervised training model to generate several reply texts, adding human preference prompts to the original instruction prompts, and obtaining a set of preference-augmented instruction prompts and reply-text lists;
the data labeling module, used for scoring and labeling the preference-augmented instruction prompts and reply-text lists with a large model to obtain labeled ranking data;
the data enhancement module, used for applying data enhancement to the labeled ranking data to obtain data-enhanced ranking data;
the reward model training module, used for training the reward model on the data-enhanced ranking data with the low-rank adaptation (LoRA) method to obtain a multi-gate mixture-of-experts model MMoE for learning human preferences in multiple directions;
and the reinforcement learning module, used for combining the multi-gate mixture-of-experts model MMoE and training the supervised training model with a reinforcement learning method to obtain the final dialogue model.
CN202410528660.0A 2024-04-29 2024-04-29 Reward model training method and system based on human feedback reinforcement learning Pending CN118095402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410528660.0A CN118095402A (en) 2024-04-29 2024-04-29 Reward model training method and system based on human feedback reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410528660.0A CN118095402A (en) 2024-04-29 2024-04-29 Reward model training method and system based on human feedback reinforcement learning

Publications (1)

Publication Number Publication Date
CN118095402A true CN118095402A (en) 2024-05-28

Family

ID=91157821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410528660.0A Pending CN118095402A (en) 2024-04-29 2024-04-29 Reward model training method and system based on human feedback reinforcement learning

Country Status (1)

Country Link
CN (1) CN118095402A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116383364A (en) * 2023-05-26 2023-07-04 华南理工大学 Medical question-answering reply method and system based on doctor feedback and reinforcement learning
CN116662552A (en) * 2023-06-29 2023-08-29 中国工商银行股份有限公司 Financial text data classification method, device, terminal equipment and medium
CN116955576A (en) * 2023-09-21 2023-10-27 神州医疗科技股份有限公司 Question-answer reply method, system and equipment based on human feedback and reinforcement learning
WO2023206777A1 (en) * 2022-04-29 2023-11-02 浪潮(北京)电子信息产业有限公司 Model generation method and apparatus, operation control method and apparatus, device, and storage medium
CN117035074A (en) * 2023-10-08 2023-11-10 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal knowledge generation method and device based on feedback reinforcement
CN117076640A (en) * 2023-08-23 2023-11-17 成都农村商业银行股份有限公司 Method, device, equipment and medium for constructing Chinese reasoning task model
US20240104391A1 (en) * 2022-09-28 2024-03-28 Deepmind Technologies Limited Reward-model based reinforcement learning for performing reasoning tasks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023206777A1 (en) * 2022-04-29 2023-11-02 浪潮(北京)电子信息产业有限公司 Model generation method and apparatus, operation control method and apparatus, device, and storage medium
US20240104391A1 (en) * 2022-09-28 2024-03-28 Deepmind Technologies Limited Reward-model based reinforcement learning for performing reasoning tasks
CN116383364A (en) * 2023-05-26 2023-07-04 华南理工大学 Medical question-answering reply method and system based on doctor feedback and reinforcement learning
CN116662552A (en) * 2023-06-29 2023-08-29 中国工商银行股份有限公司 Financial text data classification method, device, terminal equipment and medium
CN117076640A (en) * 2023-08-23 2023-11-17 成都农村商业银行股份有限公司 Method, device, equipment and medium for constructing Chinese reasoning task model
CN116955576A (en) * 2023-09-21 2023-10-27 神州医疗科技股份有限公司 Question-answer reply method, system and equipment based on human feedback and reinforcement learning
CN117035074A (en) * 2023-10-08 2023-11-10 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal knowledge generation method and device based on feedback reinforcement

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIAQI MA et al.: "Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts", KDD, 19 August 2018 (2018-08-19) *
SONG Jianguo: "Research on Text Classification Based on Semi-supervision and Word-Vector Weighting", Software Guide (软件导刊), No. 09, 15 September 2020 (2020-09-15) *
MAO Meiqin; XI Yuanyuan; ZHANG Liuchen; JIN Peng; XU Haibo: "Online Adaptive Control of Microgrid Secondary Frequency Based on Q-Learning", Automation of Electric Power Systems (电力系统自动化), No. 20, 25 October 2015 (2015-10-25) *

Similar Documents

Publication Publication Date Title
CN108804611B (en) Dialog reply generation method and system based on self comment sequence learning
AU2016327448B2 (en) Methods for the automated generation of speech sample asset production scores for users of a distributed language learning system, automated accent recognition and quantification and improved speech recognition
CN112529153B (en) BERT model fine tuning method and device based on convolutional neural network
CN1647107A (en) Automatic neural-net model generation and maintenance
CN114780675A (en) Dialogue interaction method, device, equipment and medium
CN111582311A (en) Method for training intelligent agent by using dynamic reward example sample based on reinforcement learning
CN112989017B (en) Method for generating high-quality simulation experience for dialogue strategy learning
CN117852616B (en) Big language model alignment fine tuning method and system based on enhanced reject sampling training
CN116881641A (en) Pre-training model adjustment method and device, storage medium and computing equipment
CN117473951A (en) Text processing method, device and storage medium
CN117787241A (en) Method and device for controlling length of generated text based on large language model
CN118095402A (en) Reward model training method and system based on human feedback reinforcement learning
CN117808120A (en) Method and apparatus for reinforcement learning of large language models
WO2021229643A1 (en) Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program
CN115329781A (en) Multi-task machine translation quality estimation method and system based on post-editing translation
CN116484858A (en) Text abstract generation method based on diffusion model
CN109815323B (en) Human-computer interaction training question-answer generation algorithm
Song et al. Dynamic tuning and weighting of meta-learning for NMT domain adaptation
CN117556264B (en) Training method and device for evaluation model and electronic equipment
CN118095441A (en) Open source large language model fine tuning optimization method based on transfer learning
CN117672242A (en) Voice interaction mapping model training method, voice interaction method and device
WO2023097616A1 (en) Apparatus, method, device and medium for loss balancing in multi-task learning
CN117786082A (en) Generation type online evaluation method and system based on fine tuning large model
Niu et al. Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning
CN117933423A (en) Multi-round dialogue fine tuning method of autoregressive LLM

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination