CN118036757A - Training method and device for large language model - Google Patents
Training method and device for large language model
- Publication number
- CN118036757A (application number CN202410444737.6A)
- Authority
- CN
- China
- Prior art keywords
- model
- target
- data
- language model
- rewarding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Machine Translation (AREA)
Abstract
The present disclosure relates to the technical field of large language models, and in particular to a large language model training method and apparatus. A first data set for a reward model based on human feedback is acquired, where the first data set includes a plurality of manually labeled first data; a pre-trained language model is trained with the first data set to obtain a target reward model; the error rate of the target reward model is calculated from a test result obtained by testing the first data with the target reward model; when the error rate is greater than or equal to an error threshold, the target reward model is trained again with the first data set to obtain a new target reward model and the error rate is recalculated, until the error rate of the latest target reward model is less than the error threshold; the pre-trained language model is then trained with all of the target reward models and a second data set for the large language model to obtain a trained large language model. By using the efficiently trained reward models, an LLM with a high performance gain is obtained, improving the accuracy and effect of the LLM.
Description
Technical Field
The disclosure relates to the technical field of large language models, and in particular relates to a large language model training method and device.
Background
A large language model (Large Language Model, LLM) is a deep learning model trained on massive amounts of text data. It can understand and generate natural language text and is applied in the field of natural language processing to tasks such as text generation, text summarization, and machine translation; in the field of artificial intelligence it helps machines better understand human language and enables more natural human-computer interaction; and it is also applied in other fields such as intelligent customer service, intelligent writing, and intelligent recommendation. For example, in intelligent customer service, a large language model can automatically answer user questions, improving customer-service efficiency; in intelligent writing, it can assist an author in generating high-quality text content; and in intelligent recommendation, it can recommend content that better matches a user's needs based on the user's historical behavior and preferences.
In the related art, RLHF (Reinforcement Learning from Human Feedback) is used to train large language models, but the inefficiency of the trained reward model leads to low accuracy and poor effect in the finally trained large language model.
Disclosure of Invention
In view of this, the present disclosure proposes a large language model training method and apparatus.
According to an aspect of the present disclosure, there is provided a large language model training method, the method including:
obtaining a first data set for a reward model obtained based on human feedback, the first data set comprising a plurality of manually labeled first data;
training a pre-trained language model with the first data set to obtain a target reward model;
testing part or all of the first data in the first data set with the target reward model to obtain a test result, and calculating an error rate of the target reward model according to the test result;
in the case that the error rate is greater than or equal to an error threshold, training the target reward model with the first data set to obtain a new target reward model and recalculating the error rate, until the error rate of the latest target reward model is less than the error threshold;
training the pre-trained language model with all of the target reward models obtained through training and a second data set for the large language model to obtain a trained large language model, wherein the trained large language model is used to feed back an output result to a user based on user input in the course of executing tasks in a target field, and the target field comprises at least one of the natural language processing field and the artificial intelligence field.
In one possible implementation, the first data is represented as a triplet comprising an input, a preferred response to the input, and a non-preferred response; the test result comprises a confidence difference for each first data tested, the confidence difference representing the difference, derived by the target reward model from the first data, between the preferred response and the non-preferred response of that first data.
Wherein the method further comprises: for the first data tested in the first data set, correcting the preferred response and/or the non-preferred response of any first data whose confidence difference is less than a confidence threshold.
In one possible implementation, calculating the error rate of the target reward model according to the test result comprises:
counting, according to the confidence difference corresponding to each first data, the amount of erroneous data for which the target reward model made an incorrect prediction;
and determining the error rate of the target reward model according to the ratio between the amount of erroneous data and the total amount of first data tested by the target reward model.
In one possible implementation, the second data set comprises the first data set, and training the pre-trained language model with all of the target reward models obtained through training and the second data set for the large language model to obtain a trained large language model comprises:
inputting each first data in the first data set into a current large language model to obtain a prediction result, wherein the current large language model is initially the pre-trained language model;
in the case that the prediction results do not meet a preset condition, performing a model policy update based on the feedback of each target reward model on each prediction result to obtain an updated large language model, and
determining the updated large language model as the new current large language model, and iteratively executing the step of inputting each first data into the current large language model to obtain a prediction result and the subsequent steps;
and in the case that the prediction results meet the preset condition, stopping training and determining the current large language model as the trained large language model.
In one possible implementation, performing the model policy update based on the feedback of each target reward model on each prediction result to obtain an updated large language model comprises:
scoring each prediction result with each target reward model to obtain a reward signal fed back by each target reward model;
and updating the parameters of the current large language model with a preset algorithm according to the reward signals and the corresponding weights of the target reward models, thereby completing the model policy update and obtaining the updated large language model, wherein the preset algorithm comprises a reinforcement learning algorithm.
According to another aspect of the present disclosure, there is provided a large language model training apparatus, the apparatus comprising:
a data set acquisition module, configured to acquire a first data set for a reward model obtained based on human feedback, the first data set comprising a plurality of manually labeled first data;
a first training module, configured to train a pre-trained language model with the first data set to obtain a target reward model;
an error rate calculation module, configured to test part or all of the first data in the first data set with the target reward model to obtain a test result, and to calculate an error rate of the target reward model according to the test result;
a second training module, configured to, in the case that the error rate is greater than or equal to an error threshold, train the target reward model with the first data set to obtain a new target reward model and recalculate the error rate, until the error rate of the latest target reward model is less than the error threshold;
and a third training module, configured to train the pre-trained language model with all of the target reward models obtained through training and a second data set for the large language model to obtain a trained large language model, wherein the trained large language model is used to feed back an output result to a user based on user input in the course of executing tasks in a target field, and the target field comprises at least one of the natural language processing field and the artificial intelligence field.
In one possible implementation, the first data is represented as a triplet comprising an input, a preferred response to the input, and a non-preferred response; the test result comprises a confidence difference for each first data tested, the confidence difference representing the difference, derived by the target reward model from the first data, between the preferred response and the non-preferred response of that first data.
Wherein the apparatus further comprises:
a data correction module, configured to, for the first data tested in the first data set, correct the preferred response and/or the non-preferred response of any first data whose confidence difference is less than a confidence threshold.
In one possible implementation, the error rate calculation module comprises:
an error statistics sub-module, configured to count, according to the confidence difference corresponding to each first data, the amount of erroneous data for which the target reward model made an incorrect prediction;
and a calculation sub-module, configured to determine the error rate of the target reward model according to the ratio between the amount of erroneous data and the total amount of first data tested by the target reward model.
In one possible implementation, the second data set comprises the first data set, and training the pre-trained language model with all of the target reward models obtained through training and the second data set for the large language model to obtain a trained large language model comprises:
inputting each first data in the first data set into a current large language model to obtain a prediction result, wherein the current large language model is initially the pre-trained language model;
in the case that the prediction results do not meet a preset condition, performing a model policy update based on the feedback of each target reward model on each prediction result to obtain an updated large language model, and
determining the updated large language model as the new current large language model, and iteratively executing the step of inputting each first data into the current large language model to obtain a prediction result and the subsequent steps;
and in the case that the prediction results meet the preset condition, stopping training and taking the current large language model as the trained large language model.
In one possible implementation, performing the model policy update based on the feedback of each target reward model on each prediction result to obtain an updated large language model comprises:
scoring each prediction result with each target reward model to obtain a reward signal fed back by each target reward model;
and updating the parameters of the current large language model with a preset algorithm according to the reward signals and the corresponding weights of the target reward models, thereby completing the model policy update and obtaining the updated large language model, wherein the preset algorithm comprises a reinforcement learning algorithm.
According to another aspect of the present disclosure, there is provided a large language model training apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above-described method.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
According to the large language model training method and apparatus provided by the embodiments of the present disclosure, a first data set for a reward model obtained based on human feedback is acquired, the first data set comprising a plurality of manually labeled first data; a pre-trained language model is trained with the first data set to obtain a target reward model; part or all of the first data in the first data set is tested with the target reward model to obtain a test result, and an error rate of the target reward model is calculated according to the test result; in the case that the error rate is greater than or equal to an error threshold, the target reward model is trained with the first data set to obtain a new target reward model and the error rate is recalculated, until the error rate of the latest target reward model is less than the error threshold; and a trained large language model is obtained by training with all of the target reward models obtained through training and a second data set for the large language model. By training efficient reward models, an LLM with a high performance gain is finally obtained, improving the accuracy and effect of the LLM.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates a flow chart of a large language model training method according to an embodiment of the present disclosure.
FIG. 2 illustrates a schematic diagram of model training in a large language model training method according to an embodiment of the present disclosure.
FIG. 3 illustrates a block diagram of a large language model training apparatus, according to an embodiment of the present disclosure.
FIG. 4 is a block diagram illustrating an apparatus 1900 for large language model training, according to an example embodiment.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
Large language models (LLMs) typically capture the features of their data during pre-training, and this data usually contains both high-quality and low-quality content, so large language models sometimes produce undesirable behavior, such as fabricating facts, generating biased or toxic text, or even generating content that is harmful to human beings. It is therefore very important to align an LLM with human values (such as helpful, honest, and harmless, i.e., 3H). The reinforcement learning from human feedback (RLHF) technique adopted in the related art trains large language models in a general process that includes the following core stages:
Stage one: pre-training Language Model (Pre-TRAINING THE Language Model)
At this stage, a classical pre-trained language model is first selected as the initial model. For example, OpenAI used a smaller version of GPT-3 in its first RLHF model, InstructGPT, while DeepMind used its own 280-billion-parameter model Gopher. These pre-trained language models are typically trained on large amounts of unlabeled data to learn the structure and rules of language.
Stage two: data is collected and a reward Model (Collecting DATA AND TRAINING THE REWARD Model) is trained
At this stage, human quality assessments of the model's outputs need to be collected. Such data is typically obtained through manual annotation or user interaction. These data are then used to train a reward model that can predict the quality of the model output, or the degree of human preference, for a given input. The reward model is typically a simple machine learning model such as a linear regression or a neural network.
Stage three: fine tuning language model by reinforcement learning (Fine-tuning the Language Model via Reinforcement Learning)
At this stage, the pre-trained language model is fine-tuned using a reinforcement learning algorithm. The key here is to use the reward signal provided by the reward model to guide the training of the model.
However, this approach has the following problems. If only human feedback data is used, collecting the human feedback data takes a long time and is costly; moreover, goals are inconsistent among different people, so the evaluators may pursue the wrong goals; and because of limited time, attention, or carelessness, humans may make simple mistakes when handling larger amounts of data, resulting in lower-quality feedback. Acquiring more human feedback data may therefore result in low data quality and an inefficient reward model, while reducing the amount of human feedback data to avoid simple human errors also fails to produce an efficient reward model; in either case, the inability to obtain an efficient reward model leads to low accuracy and poor effect in the final LLM. If only AI (artificial intelligence) is used to provide feedback data, most of the feedback results can be aligned with human feedback, but some of the data cannot be aligned with humans, so the overall data quality is low, an efficient reward model cannot be obtained, and the final LLM again has low accuracy and poor effect.
To solve the above technical problems, embodiments of the present disclosure provide a large language model training method and apparatus, which acquire a first data set obtained based on human feedback, the first data set comprising a plurality of manually labeled first data; train a pre-trained language model with the first data set to obtain a target reward model; test part or all of the first data in the first data set with the target reward model to obtain a test result, and calculate an error rate of the target reward model according to the test result; in the case that the error rate is greater than or equal to an error threshold, train the target reward model with the first data set to obtain a new target reward model and recalculate the error rate, until the error rate of the latest target reward model is less than the error threshold; and train the pre-trained language model with all of the target reward models obtained through training and a second data set for the large language model to obtain a trained large language model. By training efficient reward models, an LLM with a high performance gain is finally obtained, improving the accuracy and effect of the LLM.
FIG. 1 illustrates a flow chart of a large language model training method according to an embodiment of the present disclosure. As shown in fig. 1, the method includes steps S101 to S105.
In step S101, a first data set for a reward model obtained based on human feedback is acquired, the first data set comprising a plurality of manually labeled first data.
In some embodiments, each first data in the first data set D is represented as a triplet d comprising the input x_i, the preferred response y_w to the input x_i, and the non-preferred response y_l. The first data set D may then be represented as D = {d | d = (x_i, y_w, y_l)}, where x_i denotes the input of the i-th first data, the preferred response y_w denotes a response that conforms to normal human preferences, and the non-preferred response y_l denotes a response that is not preferred according to normal human preferences.
In step S102, the pre-trained language model is trained with the first data set to obtain a target reward model.
In some embodiments, the reward model may use a pre-trained language model with its last unembedding layer removed as the backbone. For example, the final embedding of the last token may be fed into a linear layer to obtain a scalar value, i.e., the reward value. In some embodiments, the reward model may also be a machine learning model such as a linear regression or a neural network. The implementation of the reward model may be set according to actual needs, and is not limited by the present disclosure.
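As an illustration only, the following minimal sketch shows the reward-model architecture just described: a pre-trained backbone with its unembedding layer removed, followed by a linear head that maps the final token's hidden state to a scalar reward. The class and parameter names (RewardModel, backbone, hidden_size) are assumptions for illustration, not identifiers from the disclosure.

```python
# Hedged sketch: a reward model built from a pre-trained LM backbone whose
# unembedding (LM head) has been discarded and replaced by a scalar linear head.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # pre-trained LM without its unembedding layer
        self.value_head = nn.Linear(hidden_size, 1)   # maps the last token's embedding to a scalar

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Assumption: backbone returns hidden states of shape (batch, seq_len, hidden_size)
        # and attention_mask contains 0/1 integers marking real tokens.
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        last_index = attention_mask.sum(dim=1) - 1                        # position of the final token
        last_hidden = hidden[torch.arange(hidden.size(0)), last_index]    # final token's embedding
        return self.value_head(last_hidden).squeeze(-1)                   # scalar reward per sequence
```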
In this embodiment, the loss function used to train the reward model is defined in terms of the following quantities:
Here σ is the sigmoid function, and λ and β_ri are hyper-parameters, where β_ri can be set separately for the i-th target reward model during training. The expectation is taken over the most recent first data set D_ri (D_ri is the set D above if the first data does not need to be corrected, or the most recent corrected first data set D_ri described below if it does), where D_ri is determined based on the i-th target reward model. r denotes the reward model (herein, multiple target reward models are obtained by training the reward model; to distinguish them, r_i, also written r̂_i, denotes the i-th target reward model); r(x, y_w) denotes the score for the input x and the preferred response y_w, and r(x, y_l) denotes the score for the input x and the non-preferred response y_l, where x is the input x_i of each first data. r' is the same model as r except for the top linear layer (the dimension of the linear layer of r' is the size of the dictionary), and r'(x, y_w) is the likelihood of the preferred reply y_w given the prompt x. Here the prompt x may be an instruction or context provided to the reward model to direct it to generate a particular text output; the prompt x may be a question, a description, an instruction, or any other form of text input. It should be appreciated that other suitable loss functions may also be employed to train the reward model.
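As a non-authoritative illustration, a loss consistent with the quantities defined above (a pairwise comparison of r(x, y_w) and r(x, y_l) through σ, plus a likelihood term on r'(x, y_w) weighted by the hyper-parameters λ and β_ri) could take a form such as the following; the exact expression used in the disclosure may differ:

```latex
% Illustrative reconstruction only; the precise form in the disclosure may differ.
\mathcal{L}(r) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D_{r_i}}
\Big[\log \sigma\big(r(x, y_w) - r(x, y_l)\big)
      + \lambda\,\beta_{r_i}\,\log r'(x, y_w)\Big]
```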
In step S103, a part or all of the first data in the first data set is tested by using the target reward model to obtain a test result, and an error rate of the target reward model is calculated according to the test result.
In one possible implementation, the test result may include a confidence difference P_r for each first data under test, where P_r represents the difference, derived by the target reward model from the first data, between the preferred response and the non-preferred response of that first data. Each first data in the first data set is collected through manual feedback.
To save time over the whole model training and to improve efficiency, only part of the first data in the first data set may be tested in step S103 to obtain a test result, provided that the tested first data are sufficient to determine human preferences and feedback; the part of the first data used in each test may be the same or different, and this is not limited by the present disclosure.
Also, since simple mistakes may be made during manual feedback due to limited time, attention, or carelessness, there may be errors in the first data set that do not truly reflect human preferences and feedback. In that case, all of the first data in the first data set may be tested in step S103 to obtain a test result, and then, to further refine the first data set, the method may further include: correcting the preferred response and/or the non-preferred response of the tested first data whose confidence difference P_r is less than the confidence threshold threshold_1, and combining these with the first data whose confidence difference P_r is greater than or equal to threshold_1 to obtain a first data set D_ri corresponding to the current i-th target reward model. That is, in addition to the first data set D initially obtained for the reward model based on human feedback, the first data in each first data set D_ri include the following types: first data whose confidence difference P_r has always been greater than or equal to the confidence threshold threshold_1 and that have never been corrected, and first data whose confidence difference P_r fell below threshold_1 in one or more tests and that have been corrected.
To distinguish the confidence differences P_r of different target reward models, the confidence difference of the i-th target reward model is denoted P_i. If the test result of the current i-th target reward model satisfies P_i ≥ threshold_1, the preferred response and the non-preferred response obtained by the i-th target reward model differ clearly and the preferred response conforms to normal human preferences; the quality of that first data can then be considered high, and no modification by manual work, a correction model, or the like is needed. Conversely, if P_i < threshold_1 in the test result of the i-th target reward model, the difference between the preferred response and the non-preferred response obtained by the i-th target reward model is small, or they deviate substantially from normal human preferences; the quality of that first data can then be considered low, and it needs to be modified manually, the modification covering the preferred response and/or the non-preferred response of the first data. The confidence threshold threshold_1 may be set according to actual needs, which is not limited by the present disclosure.
In this way, based on the initial manual feedback (the initial first data set D) and the test results of each target reward model on the first data in the first data set, the preferred responses and/or non-preferred responses of the first data are continuously corrected, so that a high-quality, up-to-date first data set can be obtained that truly reflects human preferences and aligns with human feedback.
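As an illustration of the data-refinement step described above, the sketch below flags first data whose confidence difference falls below the confidence threshold threshold_1 for manual correction and keeps the rest unchanged. The function name refine_dataset, the triplet layout, and the scoring convention (confidence difference taken as the score gap between the preferred and non-preferred responses) are assumptions, not the disclosure's implementation.

```python
# Hedged sketch of the confidence-based data refinement described above.
# `reward_model(x, y)` is assumed to return a scalar score; each first data
# follows the triplet layout (x, y_w, y_l) defined earlier.
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (input x, preferred response y_w, non-preferred response y_l)

def refine_dataset(data: List[Triplet],
                   reward_model: Callable[[str, str], float],
                   threshold_1: float) -> Tuple[List[Triplet], List[Triplet]]:
    kept, needs_correction = [], []
    for x, y_w, y_l in data:
        confidence_diff = reward_model(x, y_w) - reward_model(x, y_l)
        if confidence_diff >= threshold_1:
            kept.append((x, y_w, y_l))               # clear preference: keep as-is
        else:
            needs_correction.append((x, y_w, y_l))   # ambiguous or reversed: send for manual correction
    return kept, needs_correction
```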
In this embodiment, step S103 is performed after the first training of the reward model is completed and the first target reward model r_1 is obtained. In some embodiments, the amount of erroneous data for which the target reward model made an incorrect prediction may be counted according to the confidence difference corresponding to each first data, and the error rate of the target reward model may then be determined according to the ratio between the amount of erroneous data and the total amount of first data tested by the target reward model.
In some embodiments, the error rate e_i of the i-th target reward model may be calculated as a weighted count of its prediction errors over the tested first data, in terms of the following quantities:
Here r_i denotes the i-th target reward model; P(r_i(x_j) ≠ y_w) denotes the confidence difference that the output of the j-th first data x_j input to r_i is not equal to y_w; I(·) denotes an indicator function that takes the value 1 when the expression in brackets (i.e., r_i(x_j) ≠ y_w) holds and 0 otherwise; w_{i,j} denotes the weight corresponding to the j-th first data x_j when calculating the error rate of the i-th target reward model; and N denotes the total amount of first data under test (N is the total amount of the first data set when all of the first data in the first data set are tested).
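As a non-authoritative illustration, a weighted error rate consistent with the definitions above (per-data weights w_{i,j}, the indicator I(·), and N tested first data) could take a form such as:

```latex
% Illustrative form only, built from the quantities defined above.
% With uniform initial weights w_{1,j} = 1/N this reduces to the ratio of
% erroneous data to the total amount of tested data described in the claims.
e_i = \sum_{j=1}^{N} w_{i,j}\, \mathbb{I}\big(r_i(x_j) \neq y_w\big)
```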
In step S104, in the case that the error rate is greater than or equal to the error threshold threshold_2, the target reward model is trained with the first data set (here, the first data set D or the latest first data set D_ri) to obtain a new target reward model, and the error rate is recalculated, until the error rate of the latest target reward model is less than threshold_2. In this way, training of the reward model is performed on the basis of the latest first data set D_ri, so that the feedback obtained from the target reward model can be essentially aligned with real human preferences, giving the target reward model advantages such as high accuracy.
In this embodiment, if the error rate of the i-th target reward model r_i satisfies e_i ≥ threshold_2, the effect of r_i has not yet reached expectations, and the reward model must be trained again on the basis of the current target reward model r_i, using the first data set D or the updated first data set D_ri, to obtain a new reward model r_{i+1}.
In some embodiments, for the large language model training of the subsequent step S105, the coefficient of r_i, namely the weight α_i of the i-th target reward model, may be calculated, so that the weights {α_i, i = 1, 2, …, m} corresponding to the 1st to m-th target reward models can be obtained, where m is the total number of target reward models.
The weights corresponding to each first data under the new (m+1)-th target reward model can then be obtained:
W_{m+1} = (w_{m+1,1}, w_{m+1,2}, …, w_{m+1,N})
where w_{m+1,j} (j = 1, 2, …, N) denotes the weight of the j-th first data for the (m+1)-th target reward model, computed from the weights of the m-th round together with the following quantities:
P_m denotes the confidence difference of the m-th target reward model, and Z_m is a normalization factor for the m-th target reward model, chosen so that the weights of all the first data sum to 1;
P_j denotes the confidence difference corresponding to the j-th first data, and r_m(x_j) denotes the output obtained by inputting the j-th first data x_j into the m-th target reward model r_m.
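As a non-authoritative illustration, an AdaBoost-style weighting scheme consistent with the quantities described above (the error rate e_m, the indicator I(·), and the normalization factor Z_m) is sketched below; the disclosure additionally involves the confidence differences P in these updates, and its exact expressions may differ:

```latex
% Illustrative AdaBoost-style reconstruction only; not necessarily the
% exact expressions of the disclosure, which also involve the confidence
% differences P_m and P_j in the update and the normalization factor.
\alpha_m = \tfrac{1}{2}\,\ln\frac{1 - e_m}{e_m},
\qquad
w_{m+1,j} = \frac{w_{m,j}\,\exp\!\big(\alpha_m\,\mathbb{I}\big(r_m(x_j)\neq y_w\big)\big)}{Z_m},
\qquad
Z_m = \sum_{j=1}^{N} w_{m,j}\,\exp\!\big(\alpha_m\,\mathbb{I}\big(r_m(x_j)\neq y_w\big)\big)
```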
If the error rate e m<threshold2 of the mth target bonus model r m indicates that the effect of the mth target bonus model r m is already expected by humans, the first data set D rm obtained using the mth target bonus model can be considered to be of higher quality and more data, the first data set D rm being more data and most of it being able to align with normal human preferences.
At this point, since m iterations have been completed, m target reward models have been obtained, where the error rates of the 1st to (m-1)-th target reward models are greater than or equal to threshold_2 and the error rate of the m-th target reward model is less than threshold_2.
In step S105, the pre-trained language model is trained with all of the target reward models obtained through training and the second data set for the large language model, so as to obtain a trained large language model. The trained large language model is used to feed back an output result to a user based on user input in the course of executing tasks in a target field, and the target field comprises at least one of the natural language processing field and the artificial intelligence field. In some embodiments, the second data set may be the first data set described above (i.e., the first data set D if no correction was needed, or the latest first data set D_rm), another data set, or a combination of other data sets and the first data set, which is not limited by the present disclosure.
In one possible implementation, in the case that the second data set includes the first data set (i.e., the first data set D that needed no correction or the latest first data set D_rm), step S105 may include: inputting each first data in the first data set into the current large language model to obtain a prediction result, where the current large language model is initially the pre-trained language model; determining whether the prediction results meet a predetermined condition for ending training; in the case that the prediction results do not meet the predetermined condition, performing a model policy update based on the feedback of each target reward model on each prediction result to obtain an updated large language model, determining the updated large language model as the new current large language model, and iteratively executing the step of inputting each first data into the current large language model to obtain a prediction result and the subsequent steps; and in the case that the prediction results meet the predetermined condition, stopping training and determining the current large language model as the trained large language model. The predetermined condition may be set according to actual needs, for example the number of iterations exceeding a preset value or the error rate of the prediction results being less than an error threshold, which is not limited by the present disclosure.
In this implementation, performing the model policy update based on the feedback of each target reward model on each prediction result to obtain an updated large language model may include: scoring each prediction result with each target reward model to obtain a reward signal fed back by each target reward model; and updating the parameters of the current large language model with a preset algorithm according to the reward signals and the corresponding weights of the target reward models, thereby completing the model policy update and obtaining the updated large language model, where the preset algorithm comprises a reinforcement learning algorithm. Reinforcement learning algorithms, such as the policy gradient RL algorithm or the Proximal Policy Optimization (PPO) algorithm, are used to adjust the model parameters to maximize the expected reward, and this is not a limitation of the present disclosure.
For example, as shown in fig. 2, the following problem 1 is addressed:
"input Prompt (Prompt means instructions or context input to the large language model for directing the large language model to generate a particular text output, referred to herein for brevity as a" Prompt "):
What are the three most common gases in the earth's atmosphere?
Output LM Output (LM Output refers to the text Output generated by the large language model from a given promt, referred to herein for simplicity as "Output". This "Output" is the text result generated by the large language model from information in promt and its internal knowledge base in RLHF LM Output is used to compare with human provided preference feedback to calculate a reward signal, which in turn directs the training process of the model):
The earth's atmosphere is a layer of gas retained by the earth's gravitational force, the most common gas being nitrogen, second oxygen, and third carbon dioxide "calculated as dry gas"
For the same prompt-output pair, the m target reward models score the output differently because their effects differ; for example, one target reward model r̂_1 may score answers that do not meet expectations higher and answers that meet expectations lower, while another target reward model r̂_2 may score answers that meet expectations higher and answers that do not meet expectations lower. Different weights α_1, α_2, …, α_m are therefore needed to adjust the scores, yielding a score for each of the m target reward models (for example, -0.5, 1.5, …, 2.2), so that penalties or rewards are applied accordingly.
After the score from each target reward model is obtained, the model parameters are updated using the PPO algorithm, realizing the process of updating the policy based on rewards. The optimization objective of the PPO algorithm is defined in terms of the following quantities:
Here π_SFT denotes the SFT (supervised fine-tuning) model; π_RL is the current large language model to be adjusted, initialized to π_SFT; x is a question from the reinforcement learning (RL) data set, and y is the answer obtained by passing x through the current large language model π_RL, i.e., y = π_RL(x), where π_RL is initialized to π_SFT so that initially y = π_SFT(x); r̂ denotes the target reward model used to score the question x and the answer y; π_RL(y|x) denotes the probability that the question x yields the answer y through π_RL; π_SFT(y|x) denotes the probability that the question x yields the answer y through π_SFT; x ~ D_pretrain indicates that x is data from the pre-training stage of the large language model; one expectation is taken over the first data set D_ri and the other over x ~ D_pretrain; and β and γ denote adjustment coefficients. It will be appreciated that any other suitable optimization objective may be employed, and the present application is not limited in this regard.
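As a non-authoritative illustration, an optimization objective consistent with the terms defined above (a reward term, a penalty on the ratio π_RL(y|x)/π_SFT(y|x) weighted by β, and a pre-training term weighted by γ, as in standard RLHF objectives) could take a form such as the following; the exact expression used in the disclosure may differ:

```latex
% Illustrative reconstruction of the PPO optimization objective from the defined terms.
% In this disclosure, \hat{r}(x, y) may stand for the weighted combination
% \sum_{i=1}^{m} \alpha_i \hat{r}_i(x, y) of the m target reward models.
\text{objective} =
\mathbb{E}_{x\sim D_{r_i},\; y\sim \pi_{RL}(\cdot\mid x)}
\Big[\hat{r}(x, y) - \beta\,\log\frac{\pi_{RL}(y\mid x)}{\pi_{SFT}(y\mid x)}\Big]
+ \gamma\,\mathbb{E}_{x\sim D_{\text{pretrain}}}\big[\log \pi_{RL}(x)\big]
```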
In the RLHF algorithm, the task of the large language model is to produce high-quality text output, and the target reward models assign a reward signal based on feedback. In each training iteration, the reward signal returned by the target reward models is calculated and used as the reward feedback of the PPO algorithm, and this reward signal is used to update the policy of the large language model (i.e., to update the model parameters) so that it produces better results the next time it generates a text output. By continuously iterating this process, the performance of the large language model can be gradually optimized, making it more accurate and natural in generating text, until model training is completed.
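Putting the pieces together, the sketch below shows how the m target reward models and their weights α_i could be combined into a single reward signal that feeds a PPO-style policy update. The names combined_reward, rlhf_step, policy.generate, and ppo_update are hypothetical placeholders rather than a concrete API.

```python
# Hedged end-to-end sketch: score an output with all m target reward models,
# combine the reward signals using the weights alpha_i, and pass the result to
# a PPO-style policy update. `policy`, `reward_models`, and `ppo_update` are
# illustrative placeholders, not identifiers from the disclosure.
from typing import Callable, List, Sequence

def combined_reward(prompt: str,
                    output: str,
                    reward_models: Sequence[Callable[[str, str], float]],
                    alphas: Sequence[float]) -> float:
    # Weighted sum of the reward signals fed back by each target reward model.
    return sum(a * rm(prompt, output) for rm, a in zip(reward_models, alphas))

def rlhf_step(policy, prompts: List[str], reward_models, alphas, ppo_update) -> None:
    outputs = [policy.generate(p) for p in prompts]                  # prediction results
    rewards = [combined_reward(p, o, reward_models, alphas)
               for p, o in zip(prompts, outputs)]
    ppo_update(policy, prompts, outputs, rewards)                    # model policy update
```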
As shown in fig. 3, an embodiment of the present disclosure further provides a large language model training apparatus, including:
a data set acquisition module 41, configured to acquire a first data set for a reward model obtained based on human feedback, the first data set comprising a plurality of manually labeled first data;
a first training module 42, configured to train the pre-trained language model with the first data set to obtain a target reward model;
an error rate calculation module 43, configured to test part or all of the first data in the first data set with the target reward model to obtain a test result, and to calculate an error rate of the target reward model according to the test result;
a second training module 44, configured to, in the case that the error rate is greater than or equal to an error threshold, train the target reward model with the first data set to obtain a new target reward model and recalculate the error rate, until the error rate of the latest target reward model is less than the error threshold;
and a third training module 45, configured to train the pre-trained language model with all of the target reward models obtained through training and the second data set for the large language model to obtain a trained large language model, where the trained large language model is used to feed back an output result to a user based on user input in the course of executing tasks in a target field, and the target field comprises at least one of the natural language processing field and the artificial intelligence field.
In one possible implementation, the first data is represented as a triplet comprising an input, a preferred response to the input, and a non-preferred response; the test result comprises a confidence difference for each first data tested, the confidence difference representing the difference, derived by the target reward model from the first data, between the preferred response and the non-preferred response of that first data.
Wherein the apparatus further comprises:
a data correction module, configured to, for the first data tested in the first data set, correct the preferred response and/or the non-preferred response of any first data whose confidence difference is less than a confidence threshold.
In one possible implementation, the error rate calculation module comprises:
an error statistics sub-module, configured to count, according to the confidence difference corresponding to each first data, the amount of erroneous data for which the target reward model made an incorrect prediction;
and a calculation sub-module, configured to determine the error rate of the target reward model according to the ratio between the amount of erroneous data and the total amount of first data tested by the target reward model.
In one possible implementation, the second data set comprises the first data set, and training the pre-trained language model with all of the target reward models obtained through training and the second data set for the large language model to obtain a trained large language model comprises:
inputting each first data in the first data set into a current large language model to obtain a prediction result, wherein the current large language model is initially the pre-trained language model;
in the case that the prediction results do not meet a preset condition, performing a model policy update based on the feedback of each target reward model on each prediction result to obtain an updated large language model, and
determining the updated large language model as the new current large language model, and iteratively executing the step of inputting each first data into the current large language model to obtain a prediction result and the subsequent steps;
and in the case that the prediction results meet the preset condition, stopping training and determining the current large language model as the trained large language model.
In one possible implementation, performing the model policy update based on the feedback of each target reward model on each prediction result to obtain an updated large language model comprises:
scoring each prediction result with each target reward model to obtain a reward signal fed back by each target reward model;
and updating the parameters of the current large language model with a preset algorithm according to the reward signals and the corresponding weights of the target reward models, thereby completing the model policy update and obtaining the updated large language model, wherein the preset algorithm comprises a reinforcement learning algorithm.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
It should be noted that, although the foregoing embodiments describe the large language model training method and apparatus as above by way of example, those skilled in the art will understand that the present disclosure should not be limited thereto. In fact, the user can flexibly set each step and each module according to personal preference and/or actual application scene, so long as the technical scheme of the disclosure is met.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
FIG. 4 is a block diagram illustrating an apparatus 1900 for large language model training, according to an example embodiment. For example, the apparatus 1900 may be provided as a server or terminal device. Referring to fig. 4, the apparatus 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that are executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The apparatus 1900 may further comprise a power component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of apparatus 1900 to perform the above-described methods.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses through a fiber-optic cable), or electrical signals transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, the electronic circuitry being able to execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (12)
1. A method for training a large language model, the method comprising:
obtaining a first data set for a reward model based on human feedback, the first data set comprising a plurality of manually labeled first data;
training a pre-trained language model by using the first data set to obtain a target reward model;
testing part or all of the first data in the first data set by using the target reward model to obtain a test result, and calculating an error rate of the target reward model according to the test result;
in a case that the error rate is greater than or equal to an error threshold, training the target reward model by using the first data set to obtain a new target reward model and calculating the error rate, until the error rate of the latest target reward model is less than the error threshold; and
training the pre-trained language model by using all target reward models obtained through training and a second data set for the large language model, to obtain a trained large language model, wherein the trained large language model is configured to feed back an output result to a user based on user input while executing a task in a target field, and the target field comprises at least one of a natural language processing field and an artificial intelligence field.
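For readers outside patent practice, the control flow of claim 1 can be illustrated with a short, non-limiting Python sketch. All helper names (train_reward_model, evaluate_error_rate, train_llm_with_rewards) and the toy data are invented for this illustration and are not part of the claimed method; the sketch only mirrors the loop: retrain the reward model until its error rate drops below the threshold, keep every intermediate target reward model, and then use them all in the final large-language-model training stage.

```python
# Minimal sketch of the claim 1 flow (hypothetical helper names, toy stand-in logic).

def train_reward_model(base_model, first_data_set):
    """Placeholder: fine-tune the pre-trained language model into a reward model."""
    return {"base": base_model, "version": base_model.get("version", 0) + 1}

def evaluate_error_rate(reward_model, first_data_set):
    """Placeholder: test some or all first data and return the fraction mispredicted."""
    return 0.5 / reward_model["version"]          # shrinks as retraining repeats

def train_llm_with_rewards(base_model, reward_models, second_data_set):
    """Placeholder: RLHF-style training of the large language model (see claim 4)."""
    return {"policy": base_model, "rewards_used": len(reward_models)}

pre_trained_lm = {"name": "pretrained-lm", "version": 0}
first_data_set = [("prompt", "preferred answer", "non-preferred answer")]
second_data_set = list(first_data_set)            # claim 4: the second set may contain the first
error_threshold = 0.2

target_reward_models = []
reward_model = train_reward_model(pre_trained_lm, first_data_set)
target_reward_models.append(reward_model)
error_rate = evaluate_error_rate(reward_model, first_data_set)

# Retrain while the error rate is still at or above the threshold (claim 1).
while error_rate >= error_threshold:
    reward_model = train_reward_model(reward_model, first_data_set)
    target_reward_models.append(reward_model)
    error_rate = evaluate_error_rate(reward_model, first_data_set)

# All target reward models obtained so far drive the final LLM training stage.
trained_llm = train_llm_with_rewards(pre_trained_lm, target_reward_models, second_data_set)
print(f"reward models kept: {len(target_reward_models)}, final error rate: {error_rate:.3f}")
```

Keeping every intermediate target reward model, rather than only the last one, is what later allows the weighted multi-reward update described in claim 5.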
2. The method of claim 1, wherein the first data is represented as a triplet comprising an input, a preferred response to the input, and a non-preferred response, and the test result comprises a confidence difference for each first data tested, the confidence difference representing a difference between the confidences assigned by the target reward model to the preferred response and the non-preferred response of the first data,
wherein the method further comprises:
correcting, for the first data tested in the first data set, the preferred response and/or the non-preferred response of any first data whose confidence difference is less than a confidence threshold.
3. The method of claim 2, wherein calculating the error rate of the target reward model according to the test result comprises:
counting, according to the confidence difference corresponding to each first data, an amount of error data for which the target reward model makes a prediction error; and
determining the error rate of the target reward model according to a ratio between the amount of error data and a total amount of first data tested by the target reward model.
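The confidence difference and error rate of claims 2 and 3 can be made concrete with the following non-limiting sketch. The triplet layout (input, preferred response, non-preferred response) follows claim 2; the scoring function, the example triplets, and the numeric thresholds are invented stand-ins, since the claims do not prescribe a particular reward-model architecture or threshold value.

```python
# Hedged sketch of claims 2-3: score both responses of each triplet, take the difference,
# count non-positive differences as prediction errors, and flag low-confidence triplets
# for manual correction of their preferred and/or non-preferred response.

def reward_score(prompt, response):
    """Stand-in scoring function; a real reward model would return a learned scalar."""
    return len(set(prompt.split()) & set(response.split())) / (len(response.split()) or 1)

triplets = [
    ("how do plants make food", "plants make food by photosynthesis", "plants eat soil"),
    ("capital of France", "the capital of France is Paris", "France has no capital"),
]
confidence_threshold = 0.1

errors, flagged_for_correction = 0, []
for prompt, preferred, non_preferred in triplets:
    diff = reward_score(prompt, preferred) - reward_score(prompt, non_preferred)
    if diff <= 0:                      # the reward model prefers the wrong response
        errors += 1
    if diff < confidence_threshold:    # low confidence: route back for manual correction
        flagged_for_correction.append((prompt, preferred, non_preferred))

error_rate = errors / len(triplets)    # claim 3: errors over total first data tested
print(f"error rate: {error_rate:.2f}, flagged triplets: {len(flagged_for_correction)}")
```

Treating a non-positive difference as an error and a small difference as a correction trigger is one straightforward reading of the claims; the exact decision rule is left open by the claim language.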
4. The method according to claim 1 or 2, wherein the second data set comprises the first data set, and wherein training the pre-trained language model by using all target reward models obtained through training and the second data set for the large language model to obtain the trained large language model comprises:
inputting each first data in the first data set into a current large language model to obtain a prediction result, wherein the current large language model is initially the pre-trained language model;
in a case that the prediction result does not meet a preset condition, performing a model policy update based on feedback from each target reward model on each prediction result to obtain an updated large language model, and
determining the updated large language model as a new current large language model, and iteratively executing the step of inputting each first data into the current large language model to obtain a prediction result and the subsequent steps; and
in a case that the prediction result meets the preset condition, stopping training and determining the current large language model as the trained large language model.
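The iteration in claim 4 amounts to a generate-check-update loop. The sketch below is illustrative only: the helper names, the stopping test, and the toy round counter are assumptions made for readability, because the claim deliberately leaves the preset condition and the policy-update algorithm open (the latter is detailed in claim 5).

```python
# Illustrative sketch of the claim 4 iteration with stand-in helpers.

def generate_predictions(llm, first_data_set):
    """Stand-in: the current LLM produces one prediction per first-data input."""
    return [f"{llm['name']}-v{llm['round']} answer to: {inp}" for inp, _, _ in first_data_set]

def meets_preset_condition(llm, predictions):
    """Stand-in condition: pretend quality becomes acceptable after a few updates."""
    return llm["round"] >= 3

def policy_update(llm, predictions, reward_models):
    """Stand-in for the reward-model-guided policy update detailed in claim 5."""
    return {**llm, "round": llm["round"] + 1}

first_data_set = [("summarise RLHF", "good summary", "bad summary")]
reward_models = ["rm-1", "rm-2"]                       # all target reward models from claim 1
current_llm = {"name": "pretrained-lm", "round": 0}    # initially the pre-trained model

while True:
    predictions = generate_predictions(current_llm, first_data_set)
    if meets_preset_condition(current_llm, predictions):
        break                                          # stop: the current model is the trained LLM
    current_llm = policy_update(current_llm, predictions, reward_models)

print(f"training stopped after {current_llm['round']} policy updates")
```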
5. The method of claim 4, wherein performing the model policy update based on feedback from each of the target reward models on each of the prediction results to obtain the updated large language model comprises:
scoring each prediction result by using each target reward model to obtain a reward signal fed back by each target reward model; and
updating parameters of the current large language model by using a preset algorithm according to the reward signals and corresponding weights of the target reward models, thereby completing the model policy update to obtain the updated large language model, wherein the preset algorithm comprises a reinforcement learning algorithm.
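Claim 5 combines the reward signals of all target reward models with per-model weights and feeds the result to a reinforcement learning algorithm. The sketch below uses a toy single-parameter policy and a plain REINFORCE-style update as a stand-in for whatever preset RL algorithm (for example, PPO) an implementation might choose; the reward models, weights, and candidate answers are all invented for illustration.

```python
# Hedged sketch of claim 5: weighted combination of several reward signals driving a
# toy REINFORCE-style parameter update on a single scalar logit.
import math, random

def combined_reward(prediction, reward_models, weights):
    """Weighted sum of the reward signals fed back by each target reward model."""
    signals = [rm(prediction) for rm in reward_models]
    return sum(w * s for w, s in zip(weights, signals))

# Stand-in reward models: one scores length, the other keyword coverage.
reward_models = [
    lambda text: min(len(text.split()) / 10.0, 1.0),
    lambda text: 1.0 if "paris" in text.lower() else 0.0,
]
weights = [0.3, 0.7]                       # corresponding weights of the reward models

# Toy policy: one logit deciding between a good and a bad candidate answer.
candidates = ["The capital of France is Paris.", "I do not know."]
logit, learning_rate = 0.0, 0.5

for _ in range(200):
    p_good = 1.0 / (1.0 + math.exp(-logit))             # probability of the first candidate
    action = 0 if random.random() < p_good else 1
    reward = combined_reward(candidates[action], reward_models, weights)
    # REINFORCE: grad log pi(action) * reward, for a Bernoulli policy on the logit.
    grad_log_pi = (1.0 - p_good) if action == 0 else -p_good
    logit += learning_rate * grad_log_pi * reward

print(f"probability of preferred answer after updates: {1.0 / (1.0 + math.exp(-logit)):.2f}")
```

A weighted sum is one simple way to reconcile several reward models; an implementation could equally normalize each signal before weighting, and the claim language does not preclude either choice.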
6. A large language model training apparatus, the apparatus comprising:
a data set acquisition module for acquiring a first data set for a reward model based on human feedback, the first data set comprising a plurality of manually labeled first data;
a first training module for training a pre-trained language model by using the first data set to obtain a target reward model;
an error rate calculation module for testing part or all of the first data in the first data set by using the target reward model to obtain a test result, and calculating an error rate of the target reward model according to the test result;
a second training module for, in a case that the error rate is greater than or equal to an error threshold, training the target reward model by using the first data set to obtain a new target reward model and calculating the error rate, until the error rate of the latest target reward model is less than the error threshold; and
a third training module for training the pre-trained language model by using all target reward models obtained through training and a second data set for the large language model to obtain a trained large language model, wherein the trained large language model is configured to feed back an output result to a user based on user input while executing a task in a target field, and the target field comprises at least one of a natural language processing field and an artificial intelligence field.
7. The apparatus of claim 6, wherein the first data is represented as a triplet comprising an input, a preferred response to the input, and a non-preferred response, and the test result comprises a confidence difference for each first data tested, the confidence difference representing a difference between the confidences assigned by the target reward model to the preferred response and the non-preferred response of the first data,
wherein the apparatus further comprises:
a data correction module for correcting, for the first data tested in the first data set, the preferred response and/or the non-preferred response of any first data whose confidence difference is less than a confidence threshold.
8. The apparatus of claim 7, wherein the error rate calculation module comprises:
an error statistics sub-module for counting, according to the confidence difference corresponding to each first data, an amount of error data for which the target reward model makes a prediction error; and
a calculation sub-module for determining the error rate of the target reward model according to a ratio between the amount of error data and a total amount of first data tested by the target reward model.
9. The apparatus of claim 6 or 7, wherein the second data set comprises the first data set, and wherein training the pre-trained language model by using all target reward models obtained through training and the second data set for the large language model to obtain the trained large language model comprises:
inputting each first data in the first data set into a current large language model to obtain a prediction result, wherein the current large language model is initially the pre-trained language model;
in a case that the prediction result does not meet a preset condition, performing a model policy update based on feedback from each target reward model on each prediction result to obtain an updated large language model, and
determining the updated large language model as a new current large language model, and iteratively executing the step of inputting each first data into the current large language model to obtain a prediction result and the subsequent steps; and
in a case that the prediction result meets the preset condition, stopping training and determining the current large language model as the trained large language model.
10. The apparatus of claim 9, wherein performing the model policy update based on feedback from each of the target reward models on each of the prediction results to obtain the updated large language model comprises:
scoring each prediction result by using each target reward model to obtain a reward signal fed back by each target reward model; and
updating parameters of the current large language model by using a preset algorithm according to the reward signals and corresponding weights of the target reward models, thereby completing the model policy update to obtain the updated large language model, wherein the preset algorithm comprises a reinforcement learning algorithm.
11. A large language model training apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1 to 5 when executing the instructions stored by the memory.
12. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410444737.6A CN118036757B (en) | 2024-04-15 | 2024-04-15 | Training method and device for large language model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118036757A (en) | 2024-05-14
CN118036757B (en) | 2024-07-16
Family
ID=91000849
Family Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---|
CN202410444737.6A Active CN118036757B (en) | 2024-04-15 | 2024-04-15 | Training method and device for large language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118036757B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118569348A (en) * | 2024-07-10 | 2024-08-30 | 中国科学技术大学 | Direct preference optimization method and device |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116226334A (en) * | 2023-03-03 | 2023-06-06 | 北京百度网讯科技有限公司 | Method for training generated large language model and searching method based on model |
CN116501843A (en) * | 2023-02-21 | 2023-07-28 | 清华大学 | Efficient network retrieval enhancement answer method and system for human preference |
US20230326212A1 (en) * | 2021-12-09 | 2023-10-12 | Kpmg Llp | System and method for implementing a multimodal assistant using large language models |
CN116894880A (en) * | 2023-07-11 | 2023-10-17 | 北京百度网讯科技有限公司 | Training method, training model, training device and electronic equipment for text-to-graphic model |
CN117217289A (en) * | 2023-10-09 | 2023-12-12 | 北银金融科技有限责任公司 | Banking industry large language model training method |
CN117540205A (en) * | 2023-11-06 | 2024-02-09 | 北京瑞莱智慧科技有限公司 | Model training method, related device and storage medium |
CN117633184A (en) * | 2023-12-04 | 2024-03-01 | 中国农业银行股份有限公司 | Model construction and intelligent reply method, device and medium |
CN117688158A (en) * | 2023-12-26 | 2024-03-12 | 科大讯飞股份有限公司 | Training method of reward model, answer evaluation method, device and equipment |
KR102647511B1 (en) * | 2023-08-23 | 2024-03-14 | 주식회사 액션파워 | Method for reinforce learning on large language model |
CN117808120A (en) * | 2023-12-29 | 2024-04-02 | 北京百川智能科技有限公司 | Method and apparatus for reinforcement learning of large language models |
Also Published As
Publication number | Publication date |
---|---|
CN118036757B (en) | 2024-07-16 |
Similar Documents
Publication | Title
---|---
JP6712642B2 | Model learning device, method and program
US20210390271A1 | Neural machine translation systems
US11264044B2 | Acoustic model training method, speech recognition method, acoustic model training apparatus, speech recognition apparatus, acoustic model training program, and speech recognition program
Higuchi et al. | Improved Mask-CTC for non-autoregressive end-to-end ASR
EP3398117B1 | Augmenting neural networks with external memory
JP6222821B2 | Error correction model learning device and program
WO2023197613A1 | Small sample fine-turning method and system and related apparatus
WO2019164744A1 | Dialogue state tracking using a global-local encoder
US20220092416A1 | Neural architecture search through a graph search space
CN110766142A | Model generation method and device
CN116909532B | Code generation and defect repair method and device
US20050021334A1 | Information-processing apparatus, information-processing method and information-processing program
US20210004689A1 | Training neural networks using posterior sharpening
JP2020506488A | Batch renormalization layer
US20190130251A1 | Neural question answering system
US20200035223A1 | Acoustic model learning apparatus, method of the same and program
CN114860915A | Model prompt learning method and device, electronic equipment and storage medium
WO2017083742A1 | Neural network programmer
CN110399454B | Text coding representation method based on transformer model and multiple reference systems
WO2020118408A1 | Regularization of recurrent machine-learned architectures
Chae et al. | Convolutional sequence to sequence model with non-sequential greedy decoding for grapheme to phoneme conversion
Thomas et al. | Chatbot using gated end-to-end memory networks
CN114492451B | Text matching method, device, electronic equipment and computer readable storage medium
CN117828049A | Data processing method and related device
CN118153659A | Question-answering model training method and device, electronic equipment, storage medium and product
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant