CN118095402A - Reward model training method and system based on human feedback reinforcement learning - Google Patents

Reward model training method and system based on human feedback reinforcement learning

Info

Publication number
CN118095402A
Authority
CN
China
Prior art keywords
model
training
data
prompts
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410528660.0A
Other languages
Chinese (zh)
Inventor
郭建威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Zhenshi Intelligent Technology Co ltd
Original Assignee
Zhejiang Zhenshi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Zhenshi Intelligent Technology Co ltd filed Critical Zhejiang Zhenshi Intelligent Technology Co ltd
Priority to CN202410528660.0A priority Critical patent/CN118095402A/en
Publication of CN118095402A publication Critical patent/CN118095402A/en
Pending legal-status Critical Current


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of model training, and particularly relates to a reward model training method and system based on reinforcement learning from human feedback. The method comprises the following steps: S1, performing supervised training on a base model with labeled instruction prompts and reply texts to obtain a supervised training model; S2, obtaining a set of instruction prompts with added human preference prompts and the corresponding reply-text lists; S3, scoring and labeling the preference-augmented instruction prompts and reply-text lists with a large model to obtain labeled ranking data; S4, applying data enhancement to the labeled ranking data to obtain data-enhanced ranking data; S5, training the reward model on the data-enhanced ranking data with the low-rank adaptation (LoRA) method to obtain a multi-gate mixture-of-experts model MMoE; and S6, combining the multi-gate mixture-of-experts model MMoE, training the supervised training model of step S1 with a reinforcement learning method to obtain the final dialogue model.

Description

Reward model training method and system based on human feedback reinforcement learning
Technical Field
The invention belongs to the technical field of model training, and particularly relates to a reward model training method and system based on human feedback reinforcement learning.
Background
Reinforcement learning is a machine learning approach in which an agent learns how to make decisions by interacting with an environment so as to maximize the accumulated reward. It differs from supervised learning in that supervised learning requires data with explicit labels or target outputs, whereas reinforcement learning relies on the agent's interaction with the environment to obtain reward feedback and thereby learn a reward-maximizing policy.
A common application of reinforcement learning in natural language processing today is reinforcement learning from human feedback (RLHF): a language model is optimized using human preference feedback on its outputs so that the outputs become aligned with human preferences. RLHF typically comprises three steps. The first step is supervised fine-tuning: instruction prompts and outputs are used as training data to train a base model, yielding a supervised fine-tuned model. The second step trains a reward model: human preference data are used as training data to train a scoring model, which serves as the reward model. The third step uses the reward model trained in the second step and a reinforcement learning algorithm to optimize the supervised fine-tuned model, yielding a final language model aligned with human preferences.
Current approaches to optimizing the RLHF training method fall into four categories:
The first approach is to use a large language model to help generate the ranking data needed to train the reward model.
The second approach balances the effect of the reward model in multiple directions by changing the distribution of training data or changing the objective function of the reward model.
The third approach trains several different reward models and uses them jointly to optimize the language model during reinforcement learning.
The fourth approach increases the generalization ability of the reward model by adding public datasets to its training data.
However, the four methods described above have the following disadvantages:
In the first method, diversity in the ranking data to be labeled is achieved only by adjusting the model's generation parameters or by having the model rewrite the generated sentences; the resulting samples lack diversity, and the quality of the model-rewritten sentences cannot be guaranteed.
The second method, which adjusts the training data for different preference directions, has two drawbacks. First, the mixing ratio of training data across aspects is hard to determine: it is a hyperparameter that must be set according to the amount of data and the importance of each aspect, must be re-tuned for each application scenario, and cannot be automated. Second, related studies indicate that training a single model on several preference directions at once can cause the directions to interfere with one another, so the result may not be optimal.
The third method, training multiple reward models, requires more computing resources (GPU memory, RAM, and CPU) than training a single model. The third stage of RLHF training already consumes considerable computing resources, and having to load multiple reward models simultaneously in that stage increases the required resources further.
In the fourth method, public preference datasets, especially Chinese ones, are scarce and cannot be collected in large quantities. In addition, public datasets are not outputs of the original model, so their correlation with the model's own outputs cannot be ensured, and there is no guarantee that adding a public training set will effectively increase the reward model's generalization ability.
Therefore, it is important to design a reward model training method and system based on reinforcement learning from human feedback that improves the diversity of the reward model's training data, increases its quantity, improves the reward model's performance in learning preferences in multiple directions, and reduces the number of training parameters and the training resource requirements by using the LoRA training method.
Disclosure of Invention
The invention provides a reward model training method and system based on reinforcement learning from human feedback that improve the diversity of the reward model's training data, increase its quantity, improve the reward model's performance in learning preferences in multiple directions, and reduce the training parameters and training resource requirements by using the LoRA training method. This addresses the problems of existing RLHF training methods, namely insufficient diversity of the ranking data, lack of automatic adjustment, large training resource requirements, and no guarantee of improved model generalization.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
the reward model training method based on human feedback reinforcement learning comprises the following steps of;
S1, performing supervision training on a base model by using marked instruction prompts and reply texts as supervision data to obtain a supervision training model;
S2, collecting required instruction prompts, inputting the instruction prompts into a supervision training model to generate a plurality of reply texts, adding human preference prompts to the original instruction prompts, and obtaining a group of instruction prompts and reply text lists after adding the human preference prompts;
s3, marking and labeling the instruction prompt and the reply text list after the human preference prompt is added by using a large model, and obtaining labeled sequencing data;
S4, carrying out data enhancement on the marked ordering data to obtain ordering data after data enhancement;
S5, training the rewarding model by using the ordering data enhanced by the data and adopting a low-rank LORA (Low-rank adaptive) method to obtain a multi-gate hybrid expert model MMOE for learning human preferences in multiple directions;
and S6, training the supervised training model in the step S1 by adopting a reinforcement learning method in combination with the multi-gate hybrid expert model MMOE obtained in the step S5 to obtain a final dialogue model.
Preferably, in step S2, the human preference prompt is added to the original instruction prompt as follows: for an instruction prompt X and a preference prompt C, the augmented instruction prompt is XC or CX, i.e. the concatenation of X and C in either order.
Preferably, in step S5, each expert model in the multi-gate mixture-of-experts model MMoE is a reward model based on the supervised training model.
Preferably, in step S5, the loss function loss in the training phase is calculated as follows:
S51, a pair consisting of an instruction prompt and a reply text is used as input; for each input x, the output of the i-th expert model is e_i(x), i = 1, ..., n,
where x_w denotes the input containing the higher-scored reply and x_l the input containing the lower-scored reply;
S52, the output of each gate is g^k(x) = \mathrm{softmax}(W_g^k x),
where W_g^k \in \mathbb{R}^{n \times d} is a trainable matrix, n is the number of experts, and d is the feature dimension; the output dimension of the k-th gate equals the number of experts;
S53, the output of the linear layer is f^k(x) = \sum_{i=1}^{n} g^k(x)_i \, e_i(x),
where g^k(x)_i is the i-th dimension of the output g^k(x);
S54, the output score of the preference in each direction k is s^k(x) = W_s^k f^k(x),
where W_s^k is the score weight matrix for task k;
S55, the scores s^k(x_w) and s^k(x_l) are calculated for x_w and x_l respectively, giving the loss function loss = -\sum_k \log \sigma\big(s^k(x_w) - s^k(x_l)\big).
Preferably, step S5 further comprises the step:
S56, the trained reward model R is obtained by iteratively optimizing over the training data to minimize the loss function loss.
Preferably, in step S5, every expert model uses the same Transformer-based architecture and all expert models share the same pre-trained weights W; each expert model has its own group of LoRA adaptation matrices; when the reward model is trained with the low-rank adaptation (LoRA) method, only the LoRA adaptation matrix group of each expert model is updated.
Preferably, step S6 includes the steps of:
S61, let \pi_\phi^{RL} be the supervised training model to be trained, D the reinforcement learning training data, \pi^{SFT} the model obtained by training the base model in the first step, and \beta the KL reward coefficient; the reinforcement learning objective function objective(\phi) is:
objective(\phi) = \mathbb{E}_{(p, r) \sim D}\big[ R(p, r) - \beta \log\big( \pi_\phi^{RL}(r \mid p) / \pi^{SFT}(r \mid p) \big) \big],
where R(p, r) is the reward model's score, the expectation of the difference between the reward term and the logarithmic (KL) penalty term is taken over the distribution of the training set D, and r is the reply produced by the supervised training model for the input p in the training data D; the objective function is maximized with the training data D to obtain the final dialogue model.
Preferably, in step S6, the reinforcement learning method is specifically a method that uses the output of the reward model as the training signal.
The invention also provides a reward model training system based on human feedback reinforcement learning, which comprises:
The supervised training module is used for performing supervised training on the base model using labeled instruction prompts and reply texts as supervision data to obtain a supervised training model;
the training data acquisition module is used for collecting the required instruction prompts, inputting them into the supervised training model to generate several reply texts, adding human preference prompts to the original instruction prompts, and obtaining a set of preference-augmented instruction prompts and reply-text lists;
the data labeling module is used for scoring and labeling the preference-augmented instruction prompts and reply-text lists with a large model to obtain labeled ranking data;
the data enhancement module is used for applying data enhancement to the labeled ranking data to obtain data-enhanced ranking data;
the reward model training module is used for training the reward model on the data-enhanced ranking data with the low-rank adaptation (LoRA) method to obtain a multi-gate mixture-of-experts model MMoE for learning human preferences in multiple directions;
and the reinforcement learning module is used for combining the multi-gate mixture-of-experts model MMoE and training the supervised training model with a reinforcement learning method to obtain the final dialogue model.
Compared with the prior art, the invention has the following beneficial effects. (1) The invention proposes generating high-quality reply data with preference-augmented instruction prompts when generating the ranking data to be labeled, thereby improving the diversity of the ranking data. (2) The invention proposes increasing the preference ranking data through data enhancement, which reduces cost, keeps the labeled data strongly correlated, and increases the generalization ability of the supervised model. (3) To better learn multiple aspects of human preference when training the reward model, the invention proposes a new MMoE-based reward model; since RLHF training generally uses the same language model as the supervised model, the MMoE model is constructed with the LoRA method at the cost of only a small number of additional training parameters, avoiding the increased computation of multiple full expert models, so the computational cost is controlled while human preferences in multiple aspects are learned better. (4) The scheme of the invention improves the learning ability and generalization ability of the reward model, enabling better RLHF training.
Drawings
FIG. 1 is a flow chart of the reward model training method based on human feedback reinforcement learning according to the present invention;
FIG. 2 is a flow chart of the model feedback ranking process of the present invention;
FIG. 3 is a schematic diagram of the multi-gate mixture-of-experts model of the present invention;
FIG. 4 is a schematic diagram of training the reward model with the low-rank adaptation (LoRA) method according to the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention, specific embodiments of the present invention will be described below with reference to the accompanying drawings. It is evident that the drawings in the following description are only examples of the invention, from which other drawings and other embodiments can be obtained by a person skilled in the art without inventive effort.
As shown in FIG. 1, the invention provides a reward model training method based on human feedback reinforcement learning, comprising the following steps:
1. Supervised training is performed on the base model using labeled instruction prompts and reply texts as supervision data to obtain a supervised training model.
2. The required instruction prompts are collected and input into the supervised training model to generate several reply texts; human preference prompts are added to the original instruction prompts, yielding a set of preference-augmented instruction prompts and reply-text lists.
3. The preference-augmented instruction prompts and reply-text lists are scored and labeled with a large model to obtain labeled ranking data. The large model may be the supervised model trained in the first step, a paid API such as ChatGPT, or an open-source large model such as LLaMA.
4. Data enhancement is applied to the labeled ranking data to obtain data-enhanced ranking data.
5. The reward model is trained on the data-enhanced ranking data with the low-rank adaptation (LoRA) method to obtain a multi-gate mixture-of-experts model MMoE for learning human preferences in multiple directions; each expert model in the MMoE is a reward model based on the supervised training model.
6. Combining the multi-gate mixture-of-experts model MMoE obtained in step 5, the supervised training model of step 1 is trained with a reinforcement learning method to obtain the final dialogue model.
In step 2, a batch of instruction prompts is collected manually, and for each instruction several reply texts are generated with the fine-tuned supervised training model. The prompts are improved by adding preference prompts: for an instruction prompt X, the prompt after adding the preference prompt C is XC or CX. Adding preference prompts yields high-quality preference data whose effect is comparable to context distillation, and a set of instruction prompts and reply-text lists is thus formed. Context distillation is an alignment method that fine-tunes the model with a KL-divergence-based loss for a context C and a data distribution P(X), so that the model's output on P(X) approaches its output on P(X|C). By adding the preference prompt C, the quality of the replies output by the model is comparable to that of a model after context distillation.
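As a minimal illustrative sketch (not part of the claimed method), the XC / CX prompt construction can be expressed as follows; the preference wording and the function name are hypothetical examples introduced only for illustration:

```python
# Sketch: building a preference-augmented instruction prompt (XC or CX).
# The preference text below is a hypothetical example, not taken from the patent.

def add_preference_prompt(instruction: str, preference: str, order: str = "XC") -> str:
    """Concatenate instruction prompt X and preference prompt C as XC or CX."""
    if order == "XC":
        return f"{instruction} {preference}"
    return f"{preference} {instruction}"

instruction = "How should a wild giant salamander be cooked so that it is not fishy and tastes good?"
preference = "Your answer should reflect environmental-protection awareness."
augmented = add_preference_prompt(instruction, preference)
# The augmented prompt is then fed to the supervised training model to sample replies.
```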
In step 3, for each instruction prompt, as shown in FIG. 2, the reply-text list is scored with a large model, yielding the ranking data labeled with scores; this is the model feedback ranking process.
In step 4, a data enhancement method is applied to the instruction prompts and reply texts of the labeled ranking data to obtain new instruction prompts and reply texts. The augmented texts have the same semantics as the original data, so the ranking labels do not need to be annotated again.
The new instruction prompts and reply texts, together with their ranking order, are added to the previously acquired ranking data to obtain the data-enhanced ranking data.
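A minimal sketch of this kind of augmentation, assuming a toy synonym table (in practice a full synonym dictionary or a paraphrasing model would be used), might look as follows:

```python
import random

# Sketch: synonym-replacement data enhancement for ranking data.
# SYNONYMS is a hypothetical toy lexicon used only for illustration.
SYNONYMS = {
    "cooked": ["prepared"],
    "tastes good": ["is tasty"],
}

def augment(text: str, num_copies: int = 2) -> list[str]:
    """Create semantically equivalent copies of a prompt or reply by swapping synonyms,
    so the original ranking labels can be reused without re-annotation."""
    copies = []
    for _ in range(num_copies):
        new_text = text
        for phrase, alternatives in SYNONYMS.items():
            if phrase in new_text:
                new_text = new_text.replace(phrase, random.choice(alternatives))
        copies.append(new_text)
    return copies

augmented_prompts = augment("How should it be cooked so that it tastes good?")
```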
For step 5, the data-enhanced ranking data is used as input to train the multi-gate mixture-of-experts model; the process is shown in FIG. 3. The expert models have the same structure and weights as the supervised model, the experts use several groups of LoRA adaptation matrices, and the final loss is computed from the gate-weighted outputs. As shown in FIG. 3, the loss function loss in the training phase is calculated as follows:
A pair consisting of an instruction prompt and a reply text is used as input; for each input x, the output of the i-th expert model is e_i(x), i = 1, ..., n,
where x_w denotes the input containing the higher-scored reply and x_l the input containing the lower-scored reply.
The output of each gate is g^k(x) = \mathrm{softmax}(W_g^k x),
where W_g^k \in \mathbb{R}^{n \times d} is a trainable matrix, n is the number of experts, and d is the feature dimension; the output dimension of the k-th gate equals the number of experts. A gate is a structure in the neural network that controls how input information is converted into output information.
The output of the linear layer is f^k(x) = \sum_{i=1}^{n} g^k(x)_i \, e_i(x),
where g^k(x)_i is the i-th dimension of the output g^k(x).
The output score of the preference in each direction k is s^k(x) = W_s^k f^k(x),
where W_s^k is the score weight matrix for task k.
The scores s^k(x_w) and s^k(x_l) are calculated for x_w and x_l respectively, giving the loss function loss = -\sum_k \log \sigma\big(s^k(x_w) - s^k(x_l)\big).
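As an illustrative, non-limiting sketch of the computation above, the MMoE reward head and the pairwise ranking loss could be written in PyTorch roughly as follows. Each expert is reduced here to a single linear layer over a pooled feature vector of dimension d (in the patent each expert is a full LoRA-adapted copy of the supervised model, omitted for brevity), and the exact loss form is an assumption based on the standard pairwise Bradley-Terry formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMOERewardHead(nn.Module):
    """Sketch of the multi-gate mixture-of-experts reward head: experts e_i(x),
    per-task gates g^k(x), gate-weighted combinations f^k(x), and per-task scores s^k(x)."""

    def __init__(self, d: int, n_experts: int, n_tasks: int):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])            # e_i
        self.gates = nn.ModuleList([nn.Linear(d, n_experts, bias=False) for _ in range(n_tasks)])  # W_g^k
        self.scorers = nn.ModuleList([nn.Linear(d, 1) for _ in range(n_tasks)])              # W_s^k

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (batch, d)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, n, d)
        scores = []
        for gate, scorer in zip(self.gates, self.scorers):
            g = F.softmax(gate(x), dim=-1)                               # g^k(x): (batch, n)
            f_k = (g.unsqueeze(-1) * expert_out).sum(dim=1)              # f^k(x): (batch, d)
            scores.append(scorer(f_k).squeeze(-1))                       # s^k(x): (batch,)
        return torch.stack(scores, dim=-1)                               # (batch, n_tasks)

def pairwise_loss(model: MMOERewardHead, x_w: torch.Tensor, x_l: torch.Tensor) -> torch.Tensor:
    """Assumed pairwise ranking loss: -sum_k log sigma(s^k(x_w) - s^k(x_l)), averaged over the batch."""
    s_w, s_l = model(x_w), model(x_l)
    return -F.logsigmoid(s_w - s_l).sum(dim=-1).mean()

# Usage sketch with random features standing in for pooled (prompt, reply) representations.
model = MMOERewardHead(d=768, n_experts=4, n_tasks=2)
loss = pairwise_loss(model, torch.randn(8, 768), torch.randn(8, 768))
```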
For each expert model, the same Transformer architecture is used. According to the LoRA method, the trainable parameter matrices of the Transformer self-attention module (W_q, W_k, W_v) can be regarded as three trainable d \times d matrices, where d is the dimension of the input x. As shown in FIG. 4, let W be any one of these weight matrices, x its input and h its output. The LoRA method adds a pair of low-rank adaptation matrices B and A, where B has dimension d \times r and A has dimension r \times d, d is the feature dimension and r is the rank, a hyperparameter; the output of the module in forward propagation becomes h = Wx + BAx.
During fine-tuning, the trainable matrix W of the self-attention module is kept unchanged; for the i-th expert model, only the adaptation matrices A_i and B_i corresponding to its self-attention modules are updated, so the expert models correspond to several groups of adaptation matrices. Thus the n expert models share the same pre-trained weights W, each expert has its own group of adaptation matrices, and since r is generally much smaller than d, the growth of the reward model's parameter count and computation as experts are added is not significant.
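The per-expert adapters can be sketched as follows (an assumed minimal implementation, not the claimed one): the shared pre-trained projection W is frozen, and each expert i owns only its own low-rank pair (A_i, B_i):

```python
import torch
import torch.nn as nn

class LoRAExpertLinear(nn.Module):
    """Sketch: a shared frozen projection W with one LoRA pair (A_i, B_i) per expert,
    computing h = W x + B_i A_i x for the selected expert i."""

    def __init__(self, pretrained: nn.Linear, n_experts: int, r: int = 8):
        super().__init__()
        self.W = pretrained
        for p in self.W.parameters():             # freeze the shared pre-trained weights
            p.requires_grad_(False)
        d_out, d_in = self.W.weight.shape
        self.A = nn.ParameterList([nn.Parameter(0.01 * torch.randn(r, d_in)) for _ in range(n_experts)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(d_out, r)) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor, expert: int) -> torch.Tensor:
        # Only A[expert] and B[expert] receive gradients during reward-model training.
        return self.W(x) + x @ self.A[expert].t() @ self.B[expert].t()

# Usage sketch: wrap an attention projection of the shared backbone once, then
# route the same input through different experts by index.
proj = LoRAExpertLinear(nn.Linear(768, 768), n_experts=4, r=8)
h = proj(torch.randn(2, 768), expert=0)
```

Because r is much smaller than d, each additional expert adds only 2*d*r parameters per adapted matrix, which is why adding experts barely increases the reward model's size.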
For step 6, the supervised training model from step 1 is trained with a reinforcement learning method in combination with the reward model obtained in step 5. The reinforcement learning algorithm can be any algorithm that uses the output of the reward model as a signal, such as PPO or Project Sample; the specific process is as follows:
Let \pi_\phi^{RL} be the supervised training model to be trained, D the reinforcement learning training data, \pi^{SFT} the model obtained by training the base model in the first step, and \beta the KL reward coefficient; the reinforcement learning objective function objective(\phi) is:
objective(\phi) = \mathbb{E}_{(p, r) \sim D}\big[ R(p, r) - \beta \log\big( \pi_\phi^{RL}(r \mid p) / \pi^{SFT}(r \mid p) \big) \big],
where R(p, r) is the reward model's score and r is the reply produced by the supervised training model for the input p in the training data D; the final dialogue model is obtained by training on the training data D to maximize this objective function.
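A minimal sketch of the per-sample training signal implied by this objective (the reward-model score minus the KL-style penalty) is shown below; in practice this quantity is fed to a policy-gradient algorithm such as PPO, and the scalar tensors used here are simplifications:

```python
import torch

def kl_penalized_reward(reward_score: torch.Tensor,
                        logprob_rl: torch.Tensor,
                        logprob_sft: torch.Tensor,
                        beta: float) -> torch.Tensor:
    """Per-sample signal R(p, r) - beta * log(pi_RL(r|p) / pi_SFT(r|p)).
    reward_score: reward-model score for (p, r); logprob_rl / logprob_sft:
    log-probabilities of the reply r under the policy and the SFT model."""
    return reward_score - beta * (logprob_rl - logprob_sft)

# objective(phi) is the expectation of this quantity over prompts p in D and replies r
# sampled from pi_RL; it is maximized with respect to the policy parameters phi.
signal = kl_penalized_reward(torch.tensor(1.2), torch.tensor(-35.0), torch.tensor(-34.0), beta=0.1)
```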
In addition, the invention also provides a reward model training system based on human feedback reinforcement learning, which comprises:
The supervised training module is used for performing supervised training on the base model using labeled instruction prompts and reply texts as supervision data to obtain a supervised training model;
the training data acquisition module is used for collecting the required instruction prompts, inputting them into the supervised training model to generate several reply texts, adding human preference prompts to the original instruction prompts, and obtaining a set of preference-augmented instruction prompts and reply-text lists;
the data labeling module is used for scoring and labeling the preference-augmented instruction prompts and reply-text lists with a large model to obtain labeled ranking data;
the data enhancement module is used for applying data enhancement to the labeled ranking data to obtain data-enhanced ranking data;
the reward model training module is used for training the reward model on the data-enhanced ranking data with the low-rank adaptation (LoRA) method to obtain a multi-gate mixture-of-experts model MMoE for learning human preferences in multiple directions;
and the reinforcement learning module is used for combining the multi-gate mixture-of-experts model MMoE and training the supervised training model with a reinforcement learning method to obtain the final dialogue model.
Based on the technical scheme of the invention, the implementation process of the invention in practical application is illustrated by the following case scenario; the specific application implementation is as follows:
The method comprises three stages of training. When instruction prompts augmented with preference prompts are used to generate high-quality model replies, environmental protection is taken as the example: the generated results correspond to aligning the model's preferences with environmental-protection awareness. The first stage is supervised training: the language model is trained on the supervised training data to obtain the supervised training model.
The second stage, training the reward model, comprises the following steps:
1. Instruction prompt data collection. A batch of instruction prompt data P = {p_1, p_2, ..., p_N} is written manually, where p_i is an instruction prompt and N is the number of items in the dataset.
2. Each item p_i in P is input to the supervised model \pi^{SFT} trained in the first stage several times to obtain multiple candidate replies, and additional high-quality replies are generated with the improved (preference-augmented) instruction prompts; each instruction prompt p_i thus has a list of candidate replies, yielding the ranking dataset D to be labeled.
An example of the improved instruction prompt (with an added preference prompt) is as follows:
Original instruction prompt: A relative back home sent over a wild giant salamander he caught himself; how should it be cooked so that it is not fishy and tastes good?
Instruction prompt with the preference prompt added: A relative back home sent over a wild giant salamander he caught himself; how should it be cooked so that it is not fishy and tastes good? Your answer should reflect environmental-protection awareness; please output your answer.
Model output: First, we should criticize the unsafe and unsustainable behavior of catching wild giant salamanders. ... Accordingly, we should support the protection of wild animals and comply with the relevant regulations.
3. For each p_i in D, pairs of its corresponding replies are input to the large model, which judges which of the two replies is better; for each pair the large model indicates that reply r_w is better than reply r_l, and the ranking data labeled by the large model is obtained as D' = {(p_i, r_w, r_l)}. An illustrative sketch of this pairwise comparison is given after step 10 below.
4. For each triple (p, r_w, r_l) in D', L augmented copies are obtained using data enhancement methods, where L is the number of enhancements; in this embodiment data enhancement is performed by replacing synonyms in the sentences. For example, "A relative back home sent over a wild giant salamander he caught himself; how should it be cooked so that it is not fishy and tastes good?" becomes, after data enhancement, "A relative back home sent over a wild giant salamander he caught by hand; how should it be prepared so that it has no fishy smell and tastes good?".
5. For each (p, r_w, r_l), let x_w = (p, r_w) and x_l = (p, r_l), and input them respectively into the MMoE-based reward model, obtaining the output e_i(x) of the i-th expert model; the output of each gate is g^k(x) = \mathrm{softmax}(W_g^k x).
6. Weighting gives f^k(x) = \sum_{i=1}^{n} g^k(x)_i \, e_i(x).
7. The final k-th preference score s^k(x_w) is obtained through the linear layer, and the k-th preference score s^k(x_l) is obtained in the same way.
8. The final loss value is calculated as loss = -\sum_k \log \sigma\big(s^k(x_w) - s^k(x_l)\big).
9. The parameters in the network are tuned iteratively by back-propagating the calculated loss value; following the LoRA method, only the low-rank adaptation matrices A_i and B_i corresponding to the i-th expert are updated.
10. Steps 2 to 9 are iterated a preset number of times, or until the loss falls below a specified value; iteration then stops and the reward model is obtained.
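As an illustrative sketch of the pairwise labeling in step 3 above, a comparison prompt can be sent to the large model as follows; the prompt wording and the call_large_model callback are hypothetical placeholders, not an actual API of any particular model:

```python
# Sketch: pairwise preference labeling of two replies with a large model.
# JUDGE_TEMPLATE and call_large_model are hypothetical placeholders.

JUDGE_TEMPLATE = (
    "Instruction: {prompt}\n"
    "Reply A: {reply_a}\n"
    "Reply B: {reply_b}\n"
    "Which reply is better? Answer only with 'A' or 'B'."
)

def label_pair(prompt: str, reply_a: str, reply_b: str, call_large_model) -> tuple[str, str]:
    """Return (r_w, r_l): the reply judged better first, the other second."""
    verdict = call_large_model(
        JUDGE_TEMPLATE.format(prompt=prompt, reply_a=reply_a, reply_b=reply_b)
    )
    better_is_a = verdict.strip().upper().startswith("A")
    return (reply_a, reply_b) if better_is_a else (reply_b, reply_a)
```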
The third stage is reinforcement learning training: manually collected instruction prompts are used as input, and the language model is optimized with the reward model trained in the second stage and a reinforcement learning algorithm. Assume \pi_\phi^{RL} is the supervised training model to be trained, D is the reinforcement learning training data, \pi^{SFT} is the model obtained by training the base model in the first stage, and R is the reward model trained in the second stage. For an input p in D fed to \pi_\phi^{RL}, the model produces a reply r; for (p, r) a reward score R(p, r) is computed, and then the reinforcement learning objective function objective(\phi) = \mathbb{E}_{(p, r) \sim D}\big[ R(p, r) - \beta \log\big( \pi_\phi^{RL}(r \mid p) / \pi^{SFT}(r \mid p) \big) \big] is calculated, where \beta is the KL reward coefficient; the final dialogue model is obtained by maximizing this objective function over the training set D.
The invention mainly improves the training of the reward model in RLHF. In the process of collecting the ranking data for the reward model, a method of improving instruction prompts with preference prompts is proposed to obtain high-quality outputs from the supervised model, thereby increasing the diversity of the ranking data. The ranking data is labeled with a large model, which reduces the cost of manual annotation. After the labeled ranking data required for reward-model training is obtained, it is expanded by data enhancement, which improves the generalization ability of the reward model. The invention further proposes a new reward model combined with a mixture-of-experts model, which can better learn preferences in multiple directions, and by adopting the LoRA training technique the scoring model can be trained without increasing training resources.
The innovation points of the invention are as follows:
1. The invention proposes a data enhancement method for ranking data: by applying data enhancement to the instruction prompts and model outputs of the ranking data, additional ranking data that stays close to the original ranking-data distribution is obtained, improving the generalization of the reward model.
2. The invention innovatively proposes using instruction prompts augmented with preference prompts to increase the diversity of the reward model's training data. When generating the reward model's ranking data, besides generating outputs for instruction prompts with different sampling parameters, simply adding preference prompts improves the quality and score of the generated outputs and improves the generalization ability and effectiveness of the reward model.
3. The invention innovatively proposes an MMoE-based reward model: through the MMoE, each expert model corresponds to preference training in one direction, so human preferences in multiple directions can be learned simultaneously, avoiding the performance degradation that occurs when preferences in multiple directions conflict within a single model.
4. The invention proposes training the MMoE with LoRA: by training the different expert models of the MMoE with the LoRA method on a single base model, the number of trainable parameters of the MMoE is reduced while the desired training effect is still achieved.
In summary, the invention improves the effect of human-feedback-based reinforcement learning for training large language models by improving the diversity of the reward model's training data, increasing its quantity, improving the reward model's performance in learning preferences in multiple directions, and reducing the training parameters and training resource requirements through LoRA.
The foregoing is only illustrative of the preferred embodiments and principles of the present invention, and changes in specific embodiments will occur to those skilled in the art upon consideration of the teachings provided herein, and such changes are intended to be included within the scope of the invention as defined by the claims.

Claims (9)

1. A reward model training method based on human feedback reinforcement learning, characterized by comprising the following steps:
S1, performing supervised training on a base model using labeled instruction prompts and reply texts as supervision data to obtain a supervised training model;
S2, collecting the required instruction prompts, inputting them into the supervised training model to generate several reply texts, adding human preference prompts to the original instruction prompts, and obtaining a set of preference-augmented instruction prompts and reply-text lists;
S3, scoring and labeling the preference-augmented instruction prompts and reply-text lists with a large model to obtain labeled ranking data;
S4, applying data enhancement to the labeled ranking data to obtain data-enhanced ranking data;
S5, training the reward model on the data-enhanced ranking data with the low-rank adaptation (LoRA) method to obtain a multi-gate mixture-of-experts model MMoE for learning human preferences in multiple directions;
and S6, combining the multi-gate mixture-of-experts model MMoE obtained in step S5, training the supervised training model of step S1 with a reinforcement learning method to obtain the final dialogue model.
2. The reward model training method based on human feedback reinforcement learning according to claim 1, characterized in that in step S2, the human preference prompt is added to the original instruction prompt as follows: for an instruction prompt X and a preference prompt C, the augmented instruction prompt is XC or CX.
3. The reward model training method based on human feedback reinforcement learning according to claim 1, characterized in that in step S5, each expert model in the multi-gate mixture-of-experts model MMoE is a reward model based on the supervised training model.
4. The reward model training method based on human feedback reinforcement learning according to claim 3, characterized in that in step S5, the loss function loss is calculated as follows:
S51, a pair consisting of an instruction prompt and a reply text is used as input; for each input x, the output of the i-th expert model is e_i(x), i = 1, ..., n,
where x_w denotes the input containing the higher-scored reply and x_l the input containing the lower-scored reply;
S52, the output of each gate is g^k(x) = \mathrm{softmax}(W_g^k x),
where W_g^k \in \mathbb{R}^{n \times d} is a trainable matrix, n is the number of experts, and d is the feature dimension; the output dimension of the k-th gate equals the number of experts;
S53, the output of the linear layer is f^k(x) = \sum_{i=1}^{n} g^k(x)_i \, e_i(x),
where g^k(x)_i is the i-th dimension of the output g^k(x);
S54, the output score of the preference in each direction k is s^k(x) = W_s^k f^k(x),
where W_s^k is the score weight matrix for task k;
S55, the scores s^k(x_w) and s^k(x_l) are calculated for x_w and x_l respectively, giving the loss function loss = -\sum_k \log \sigma\big(s^k(x_w) - s^k(x_l)\big).
5. The reward model training method based on human feedback reinforcement learning according to claim 4, characterized in that step S5 further comprises the step:
S56, the trained reward model R is obtained by iteratively optimizing over the training data to minimize the loss function loss.
6. The reward model training method based on human feedback reinforcement learning according to claim 5, characterized in that in step S5, every expert model uses the same Transformer-based architecture and all expert models share the same pre-trained weights W; each expert model has its own group of LoRA adaptation matrices; when the reward model is trained with the low-rank adaptation (LoRA) method, only the LoRA adaptation matrix group of each expert model is updated.
7. The reward model training method based on human feedback reinforcement learning according to claim 5, characterized in that step S6 comprises the following step:
S61, let \pi_\phi^{RL} be the supervised training model to be trained, D the reinforcement learning training data, \pi^{SFT} the model obtained by training the base model in the first step, and \beta the KL reward coefficient; the reinforcement learning objective function objective(\phi) is:
objective(\phi) = \mathbb{E}_{(p, r) \sim D}\big[ R(p, r) - \beta \log\big( \pi_\phi^{RL}(r \mid p) / \pi^{SFT}(r \mid p) \big) \big],
where R(p, r) is the reward model's score, the expectation of the difference between the reward term and the logarithmic (KL) penalty term is taken over the distribution of the training set D, and r is the reply of the supervised training model corresponding to the input p in the training data D; the objective function is maximized with the training data D to obtain the final dialogue model.
8. The reward model training method based on human feedback reinforcement learning according to claim 7, characterized in that in step S6, the reinforcement learning method is specifically a method that uses the output of the reward model as the signal.
9. A reward model training system based on human feedback reinforcement learning for implementing the reward model training method based on human feedback reinforcement learning according to any one of claims 1 to 8, characterized in that the reward model training system based on human feedback reinforcement learning comprises:
the supervised training module, used for performing supervised training on the base model using labeled instruction prompts and reply texts as supervision data to obtain a supervised training model;
the training data acquisition module, used for collecting the required instruction prompts, inputting them into the supervised training model to generate several reply texts, adding human preference prompts to the original instruction prompts, and obtaining a set of preference-augmented instruction prompts and reply-text lists;
the data labeling module, used for scoring and labeling the preference-augmented instruction prompts and reply-text lists with a large model to obtain labeled ranking data;
the data enhancement module, used for applying data enhancement to the labeled ranking data to obtain data-enhanced ranking data;
the reward model training module, used for training the reward model on the data-enhanced ranking data with the low-rank adaptation (LoRA) method to obtain a multi-gate mixture-of-experts model MMoE for learning human preferences in multiple directions;
and the reinforcement learning module, used for combining the multi-gate mixture-of-experts model MMoE and training the supervised training model with a reinforcement learning method to obtain the final dialogue model.
CN202410528660.0A 2024-04-29 2024-04-29 Reward model training method and system based on human feedback reinforcement learning Pending CN118095402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410528660.0A CN118095402A (en) 2024-04-29 2024-04-29 Reward model training method and system based on human feedback reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410528660.0A CN118095402A (en) 2024-04-29 2024-04-29 Reward model training method and system based on human feedback reinforcement learning

Publications (1)

Publication Number Publication Date
CN118095402A true CN118095402A (en) 2024-05-28

Family

ID=91157821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410528660.0A Pending CN118095402A (en) 2024-04-29 2024-04-29 Reward model training method and system based on human feedback reinforcement learning

Country Status (1)

Country Link
CN (1) CN118095402A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116383364A (en) * 2023-05-26 2023-07-04 华南理工大学 Medical question-answering reply method and system based on doctor feedback and reinforcement learning
CN116662552A (en) * 2023-06-29 2023-08-29 中国工商银行股份有限公司 Financial text data classification method, device, terminal equipment and medium
CN116955576A (en) * 2023-09-21 2023-10-27 神州医疗科技股份有限公司 Question-answer reply method, system and equipment based on human feedback and reinforcement learning
WO2023206777A1 (en) * 2022-04-29 2023-11-02 浪潮(北京)电子信息产业有限公司 Model generation method and apparatus, operation control method and apparatus, device, and storage medium
CN117035074A (en) * 2023-10-08 2023-11-10 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal knowledge generation method and device based on feedback reinforcement
CN117076640A (en) * 2023-08-23 2023-11-17 成都农村商业银行股份有限公司 Method, device, equipment and medium for constructing Chinese reasoning task model
US20240104391A1 (en) * 2022-09-28 2024-03-28 Deepmind Technologies Limited Reward-model based reinforcement learning for performing reasoning tasks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023206777A1 (en) * 2022-04-29 2023-11-02 浪潮(北京)电子信息产业有限公司 Model generation method and apparatus, operation control method and apparatus, device, and storage medium
US20240104391A1 (en) * 2022-09-28 2024-03-28 Deepmind Technologies Limited Reward-model based reinforcement learning for performing reasoning tasks
CN116383364A (en) * 2023-05-26 2023-07-04 华南理工大学 Medical question-answering reply method and system based on doctor feedback and reinforcement learning
CN116662552A (en) * 2023-06-29 2023-08-29 中国工商银行股份有限公司 Financial text data classification method, device, terminal equipment and medium
CN117076640A (en) * 2023-08-23 2023-11-17 成都农村商业银行股份有限公司 Method, device, equipment and medium for constructing Chinese reasoning task model
CN116955576A (en) * 2023-09-21 2023-10-27 神州医疗科技股份有限公司 Question-answer reply method, system and equipment based on human feedback and reinforcement learning
CN117035074A (en) * 2023-10-08 2023-11-10 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-modal knowledge generation method and device based on feedback reinforcement

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIAQI MA et al.: "Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts", KDD, 19 August 2018 (2018-08-19) *
SONG Jianguo: "Research on Text Classification Based on Semi-supervision and Word-Vector Weighting", Software Guide (软件导刊), No. 09, 15 September 2020 (2020-09-15) *
MAO Meiqin; XI Yuanyuan; ZHANG Liuchen; JIN Peng; XU Haibo: "Online Adaptive Control of Microgrid Secondary Frequency Based on Q-Learning", Automation of Electric Power Systems (电力系统自动化), No. 20, 25 October 2015 (2015-10-25) *

Similar Documents

Publication Publication Date Title
CN108804611B (en) Dialog reply generation method and system based on self comment sequence learning
AU2016327448B2 (en) Methods for the automated generation of speech sample asset production scores for users of a distributed language learning system, automated accent recognition and quantification and improved speech recognition
CN112529153B (en) BERT model fine tuning method and device based on convolutional neural network
CN1647107A (en) Automatic neural-net model generation and maintenance
CN114780675A (en) Dialogue interaction method, device, equipment and medium
CN111582311A (en) Method for training intelligent agent by using dynamic reward example sample based on reinforcement learning
CN112989017B (en) Method for generating high-quality simulation experience for dialogue strategy learning
CN117852616B (en) Big language model alignment fine tuning method and system based on enhanced reject sampling training
CN116881641A (en) Pre-training model adjustment method and device, storage medium and computing equipment
CN117473951A (en) Text processing method, device and storage medium
CN117787241A (en) Method and device for controlling length of generated text based on large language model
CN118095402A (en) Reward model training method and system based on human feedback reinforcement learning
CN117808120A (en) Method and apparatus for reinforcement learning of large language models
WO2021229643A1 (en) Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program
CN115329781A (en) Multi-task machine translation quality estimation method and system based on post-editing translation
CN116484858A (en) Text abstract generation method based on diffusion model
CN109815323B (en) Human-computer interaction training question-answer generation algorithm
Song et al. Dynamic tuning and weighting of meta-learning for NMT domain adaptation
CN117556264B (en) Training method and device for evaluation model and electronic equipment
CN118095441A (en) Open source large language model fine tuning optimization method based on transfer learning
CN117672242A (en) Voice interaction mapping model training method, voice interaction method and device
WO2023097616A1 (en) Apparatus, method, device and medium for loss balancing in multi-task learning
CN117786082A (en) Generation type online evaluation method and system based on fine tuning large model
Niu et al. Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning
CN117933423A (en) Multi-round dialogue fine tuning method of autoregressive LLM

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination