CN117688158A - Training method of reward model, answer evaluation method, device and equipment


Info

Publication number
CN117688158A
Authority
CN
China
Prior art keywords
sample
answer
knowledge
model
target
Prior art date
Legal status
Pending
Application number
CN202311828971.0A
Other languages
Chinese (zh)
Inventor
李亚
梁佳鑫
缪磊
刘权
王士进
魏思
刘聪
胡国平
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN202311828971.0A
Publication of CN117688158A

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a training method for a reward model, an answer evaluation method, a device and equipment. The training method comprises: obtaining a plurality of sample pairs, wherein each sample pair comprises a first sample and a second sample, the first sample comprises a sample question and a first sample answer, the second sample comprises the sample question and a second sample answer, the target score of the first sample answer is higher than the target score of the second sample answer, the target score is related to knowledge correctness, and the knowledge correctness is determined based on the target answer matched with the sample question in a knowledge graph; inputting the first sample and the second sample of each sample pair into an initial reward model to obtain a first score of the first sample and a second score of the second sample output by the initial reward model; and adjusting model parameters of the initial reward model based on the first score and the second score to obtain the reward model. The invention improves the accuracy with which the factual correctness of text output by a large language model is identified.

Description

Training method of reward model, answer evaluation method, device and equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a training method of a reward model, an answer evaluation method, an answer evaluation device and equipment.
Background
A knowledge-enhanced reward model enables a large language model, while being aligned with human requirements, to output scoring results that take knowledge correctness into account, thereby alleviating the knowledge hallucination problem of large language models.
Conventional reward models typically give comprehensive feedback at the semantic level of the text, for example feedback based on how fluent, helpful, and safe the text output by the large language model is. However, such reward models fall short in evaluating the knowledge correctness of the text output by the large language model, which leads to low accuracy when the large language model handles factual content.
Disclosure of Invention
The invention provides a training method for a reward model, an answer evaluation method, a device and equipment, which are intended to overcome the defect in the prior art that reward models identify the knowledge correctness of text output by a large language model with low accuracy, and to improve the accuracy of that identification.
The invention provides a training method of a reward model, which comprises the following steps:
obtaining a plurality of sample pairs, wherein each sample pair comprises a first sample and a second sample, the first sample comprises a sample question and a first sample answer, the second sample comprises the sample question and a second sample answer, the target score of the first sample answer is higher than the target score of the second sample answer, the target score is related to knowledge correctness, and the knowledge correctness is determined based on a target answer matched with the sample question in a knowledge graph;
for each sample pair, inputting the first sample and the second sample in the sample pair into an initial reward model to obtain a first score of the first sample and a second score of the second sample output by the initial reward model;
and adjusting model parameters of the initial reward model based on the first score and the second score to obtain a reward model, wherein the reward model is used for evaluating answers output by a large language model.
According to the training method of the reward model provided by the invention, the adjusting of the model parameters of the initial reward model based on the first score and the second score to obtain the reward model comprises:
determining a first knowledge correctness judgment result of the first sample answer relative to the target answer and a second knowledge correctness judgment result of the second sample answer relative to the target answer;
determining a quantization gap between the first knowledge correctness judgment result and the second knowledge correctness judgment result;
determining loss information based on the first score, the second score, and the quantified gap;
and adjusting model parameters of the initial reward model based on the loss information to obtain the reward model.
According to the training method of the reward model provided by the invention, the method further comprises:
obtaining a plurality of sample answers corresponding to the sample questions;
searching a target answer corresponding to the sample question in the knowledge graph;
and determining the target score of each sample answer based on the target answer.
According to the training method of the reward model provided by the invention, the determining of the target score of each sample answer based on the target answer comprises:
determining a target knowledge correctness judgment result of each sample answer relative to the target answer;
determining the target score of each sample answer based on the target knowledge correctness judgment result relative to the target answer and a target evaluation dimension; the target evaluation dimension includes at least one of: the fluency of the sample answer, the satisfaction of the user with the sample answer, and the safety of the sample answer.
According to the training method of the reward model provided by the invention, the determining the target knowledge correctness judgment result of each sample answer relative to the target answer comprises the following steps:
for each sample answer, determining a first sub-answer and a second sub-answer other than the first sub-answer in the sample answer, wherein the first sub-answer comprises the answer corresponding to the sample question;
based on the target answer, determining a knowledge correctness judgment result of the first sub-answer and a knowledge correctness judgment result of the second sub-answer respectively;
and determining a target knowledge correctness judgment result corresponding to the sample answer based on the knowledge correctness judgment result of the first sub-answer and the knowledge correctness judgment result of the second sub-answer.
According to the training method of the reward model provided by the invention, the determining the target knowledge correctness judgment result corresponding to the sample answer based on the knowledge correctness judgment result of the first sub-answer and the knowledge correctness judgment result of the second sub-answer comprises the following steps:
determining that the target knowledge correctness judgment result corresponding to the sample answer is correct under the condition that the knowledge correctness judgment result of the first sub answer is correct and the knowledge correctness judgment result of the second sub answer is correct;
determining that the target knowledge correctness judgment result corresponding to the sample answer is partially correct under the condition that the knowledge correctness judgment result of the first sub-answer is correct and the knowledge correctness judgment result of the second sub-answer is incorrect;
and under the condition that the knowledge correctness judgment result of the first sub-answer is wrong, determining that the target knowledge correctness judgment result corresponding to the sample answer is wrong.
The invention provides an answer evaluation method, which comprises the following steps:
obtaining a predicted answer to a target question output by a large language model;
and inputting the target question and the predicted answer into a reward model to obtain an evaluation result of the predicted answer output by the reward model, wherein the reward model is trained based on the training method of the reward model according to any one of the above.
The invention also provides a training device for a reward model, comprising:
an acquisition module, configured to acquire a plurality of sample pairs, wherein each sample pair comprises a first sample and a second sample, the first sample comprises a sample question and a first sample answer, the second sample comprises the sample question and a second sample answer, the target score of the first sample answer is higher than the target score of the second sample answer, the target score is related to knowledge correctness, and the knowledge correctness is determined based on a target answer matched with the sample question in a knowledge graph;
an input module, configured to, for each sample pair, input the first sample and the second sample in the sample pair into an initial reward model to obtain a first score of the first sample and a second score of the second sample output by the initial reward model;
and an adjustment module, configured to adjust model parameters of the initial reward model based on the first score and the second score to obtain a reward model, wherein the reward model is used for evaluating answers output by the large language model.
The invention also provides an answer evaluation device, which comprises:
an obtaining module, configured to obtain a predicted answer to a target question output by the large language model;
and an input module, configured to input the target question and the predicted answer into a reward model to obtain an evaluation result of the predicted answer output by the reward model, wherein the reward model is trained based on the above training method of the reward model.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the training method of the reward model according to any one of the above or implements the answer evaluation method according to any one of the above.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a training method for a reward model as described in any one of the above or implements an answer evaluation method as described in any one of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a training method for a reward model as described in any one of the above or implements an answer evaluation method as described in any one of the above.
According to the training method of the reward model, the answer evaluation method, the device and the equipment provided by the invention, a plurality of sample pairs are obtained, wherein each sample pair comprises a first sample and a second sample, the first sample comprises a sample question and a first sample answer, the second sample comprises the sample question and a second sample answer, the target score of the first sample answer is higher than the target score of the second sample answer, the target score is related to knowledge correctness, and the knowledge correctness is determined based on the target answer matched with the sample question in a knowledge graph; for each sample pair, the first sample and the second sample are input into an initial reward model to obtain a first score of the first sample and a second score of the second sample output by the initial reward model; and model parameters of the initial reward model are adjusted based on the first score and the second score to obtain the reward model, which is used for evaluating answers output by a large language model. During training of the reward model, the knowledge correctness of the first sample answer and of the second sample answer in a sample pair is determined based on the target answer in the knowledge graph, and the target answer in the knowledge graph is a knowledge-correct answer, so the target score determined from it reflects the knowledge correctness of each sample answer: the higher the knowledge correctness, the higher the target score. Therefore, when the reward model is trained on such sample pairs, the gap between the first sample answer with the higher target score and the second sample answer with the lower target score can be widened, and the trained reward model pays more attention to the knowledge correctness of an answer when evaluating answers output by the large language model, which improves the accuracy with which the knowledge correctness of answers is identified.
Drawings
In order to more clearly illustrate the technical solutions of the invention or of the prior art, the drawings required in the description of the embodiments or of the prior art are briefly introduced below. It is apparent that the drawings described below show some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a training method of a reward model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a training process of a reward model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a training process and a fine tuning process of a reward model according to an embodiment of the present invention;
FIG. 4 is a flowchart of an answer evaluation method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a training device for a reward model according to an embodiment of the invention;
FIG. 6 is a schematic structural diagram of an answer evaluation device according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the physical structure of an electronic device.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
A reward model is an important component in the field of deep learning. It is used to measure the quality of the answers generated by a large language model and provides feedback and guidance that help the large language model improve its performance and output quality.
A traditional reward model usually gives comprehensive feedback at the semantic level, such as on the fluency, helpfulness and safety of the text output by the large language model; however, such feedback falls short when evaluating the knowledge correctness of the output text, which leads to low accuracy when the large language model handles factual content. Therefore, how to evaluate the output of the large language model more accurately through a knowledge-enhanced reward model without affecting the original learning process and model structure, and how to make full use of the enhanced reward model to align the answers output by the large language model with human requirements so that they conform to human preference, remain problems to be solved urgently.
In the method provided by the embodiment of the invention, knowledge correctness judgment can be performed on a first sample answer and a second sample answer corresponding to a sample question based on the target answer matched with the sample question in a knowledge graph, and the corresponding target scores are determined based on the knowledge correctness, so that a plurality of sample pairs are obtained. Each sample pair comprises a first sample and a second sample, the first sample comprises the sample question and the first sample answer, the second sample comprises the sample question and the second sample answer, the target score of the first sample answer is higher than the target score of the second sample answer, the target score is related to knowledge correctness, and the higher the knowledge correctness, the higher the target score. After the obtained sample pairs are input into the initial reward model, the first sample and the second sample can be scored by the initial reward model, and the initial reward model is trained based on the first score of the first sample and the second score of the second sample to obtain the reward model. During training of the reward model, the knowledge correctness of the first sample answer and of the second sample answer in a sample pair is determined based on the target answer in the knowledge graph, and the target answer in the knowledge graph is a knowledge-correct answer, so the target score determined from it reflects the knowledge correctness of each sample answer. Therefore, when the reward model is trained on such sample pairs, the gap between the first sample answer with the higher target score and the second sample answer with the lower target score can be widened, and the trained reward model pays more attention to the knowledge correctness of an answer when evaluating answers output by the large language model, which improves the accuracy with which the knowledge correctness of answers is identified.
The following describes the training method of the reward model provided by the embodiments of the present invention with reference to FIGS. 1 to 4. The embodiments of the invention are applicable to evaluating text output by any model, and in particular to evaluating the factual correctness or knowledge correctness of answers output by a large language model. The method may be executed by an electronic device such as a computer, a terminal device, a server cluster, or dedicated reward model training equipment; it may also be executed by a reward model training apparatus arranged in the electronic device, and the training apparatus may be implemented by software, hardware, or a combination of the two.
Fig. 1 is a flow chart of a training method of a reward model according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 101: obtaining a plurality of sample pairs, wherein each sample pair comprises a first sample and a second sample, the first sample comprises a sample question and a first sample answer, the second sample comprises a sample question and a second sample answer, the target score of the first sample answer is higher than the target score of the second sample answer, the target score is related to knowledge accuracy, and the knowledge accuracy is determined based on the target answer matched with the sample question in the knowledge graph.
In this step, the sample question is a question collected from queries that users input into a large language model; it may also be another question collected from the network. Different large language models may be selected, or the model parameters of the same large language model may be varied, and the collected sample question is then input into each large language model to obtain the sample answer output by each one. For example, two different large language models may be selected, the sample question input into each of them separately, and a first sample answer obtained from one and a second sample answer obtained from the other. These large language models may be models developed by different institutions, or variants of the same large language model with different temperature parameters, so as to maximize diversity. The large language model may be, for example, the iFlytek Spark large model or another model capable of knowledge question answering.
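Purely as an illustration of this collection step (the patent does not prescribe any implementation), the following Python sketch gathers candidate answers for one sample question from several model configurations with varied temperatures; the generate callables, model names and temperature values are hypothetical placeholders.

```python
# Hypothetical sketch: collect candidate sample answers from several large language model
# configurations so that the answers are as diverse as possible. Nothing here is specified
# by the patent; the generators and temperature values are placeholders.
from typing import Callable, Dict, List, Sequence

def collect_sample_answers(
    sample_question: str,
    generators: Dict[str, Callable[[str, float], str]],   # model name -> generate(question, temperature)
    temperatures: Sequence[float] = (0.2, 0.7, 1.0),
) -> List[Dict[str, object]]:
    answers = []
    for model_name, generate in generators.items():
        for temperature in temperatures:
            answers.append({
                "model": model_name,
                "temperature": temperature,
                "answer": generate(sample_question, temperature),
            })
    return answers
```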
In order to enhance the model's ability to judge knowledge correctness, the embodiment of the invention introduces an external knowledge base in which knowledge is stored as a knowledge graph. The knowledge graph is structured and represents knowledge as entities and relations; the invention uses this structural characteristic, employing the entities and relations in the graph to strengthen the reward mechanism of the model, so that the reward model can more explicitly identify and extract various knowledge concepts. Therefore, triples can be extracted by combining the sample question with the knowledge graph; the content extracted from the knowledge graph in the triple is the target answer matched with the sample question, which can be understood as the knowledge-correct answer determined based on the knowledge graph.
Further, after the first sample answer and the second sample answer are obtained, knowledge correctness of the first sample answer and the second sample answer may be determined based on the determined target answer, and a target score of the first sample answer and a target score of the second sample answer may be determined based on the knowledge correctness. The target score of the first sample answer is higher than the target score of the second sample answer, which indicates that the first sample answer is closer to the target answer, or the fact correctness of the first sample answer is higher than the fact correctness of the second sample answer, or the knowledge correctness of the first sample answer is higher than the knowledge correctness of the second sample answer.
After determining the target scores of the first sample answer and the second sample answer, the sample question and the first sample answer may be taken as a first sample, and the sample question and the second sample answer may be taken as a second sample, thereby forming a sample pair.
Step 102: for each sample pair, inputting the first sample and the second sample in the sample pair into the initial reward model to obtain a first score of the first sample and a second score of the second sample output by the initial reward model.
In this step, after the plurality of sample pairs is obtained, the first sample and the second sample in each sample pair may be input into the initial reward model. Through the initial reward model, the first sample answer may be scored based on the sample question in the first sample to obtain the first score, and similarly the second sample answer may be scored based on the sample question in the second sample to obtain the second score; the sample question is the same in the first sample and the second sample. The first score characterizes the initial reward model's evaluation of the first sample answer, and the second score characterizes its evaluation of the second sample answer.
In a specific implementation, for each sample pair, the sample question in the first sample may be spliced with the first sample answer, the sample question in the second sample spliced with the second sample answer, and the spliced contents input into the initial reward model to score the first sample and the second sample. In addition, the annotation scores corresponding to the first sample answer and the second sample answer may be converted into a binary label format, so that a score gap between the two is enforced.
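For illustration only, a minimal sketch of one way to realize this scoring step is given below, assuming a Hugging Face style encoder backbone ("bert-base-chinese" is an arbitrary choice) with a scalar value head; the patent does not specify the backbone, tokenizer, pooling strategy or batching.

```python
# Minimal reward-model sketch under the stated assumptions: splice question and answer into
# one sequence, encode it, and map the pooled representation to a single scalar score.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, backbone_name: str = "bert-base-chinese"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(backbone_name)
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, question: str, answer: str) -> torch.Tensor:
        # Splice the sample question with the sample answer (one example, no batching).
        inputs = self.tokenizer(question, answer, return_tensors="pt", truncation=True)
        hidden = self.backbone(**inputs).last_hidden_state        # (1, seq_len, hidden_size)
        return self.value_head(hidden[:, 0, :]).squeeze(-1)       # scalar score for the pair
```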
Step 103: adjusting model parameters of the initial reward model based on the first score and the second score to obtain the reward model, wherein the reward model is used for evaluating answers output by the large language model.
In this step, loss information may be determined based on the first score and the second score, and the model parameters of the initial reward model are adjusted based on the loss information. The above process is iterated until the model converges or the number of iterations reaches a preset number, and the finally obtained model is taken as the reward model. The trained reward model can be used to evaluate answers output by the large language model, and the evaluation result includes an evaluation of the knowledge correctness of the answer.
Illustratively, the loss information may be a binary ranking loss, as shown in formula (1):
Loss = -log(σ(r_θ(x, y_h) - r_θ(x, y_l)))    (1)
where Loss represents the loss information, x represents the sample question (including the triple extracted by combining the knowledge graph), y_h represents the first sample answer, y_l represents the second sample answer, r_θ(x, y_h) represents the first score given to the first sample by the initial reward model with weights θ, r_θ(x, y_l) represents the second score given to the second sample, and σ(·) represents the sigmoid normalization.
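A short PyTorch sketch of the binary ranking loss in formula (1) is given below; treating the scores as a batch of sample pairs is an assumption made only for illustration.

```python
# Binary ranking loss of formula (1): push the score of the first sample above that of the
# second sample. logsigmoid is the numerically stable form of log(sigmoid(.)).
import torch
import torch.nn.functional as F

def ranking_loss(first_scores: torch.Tensor, second_scores: torch.Tensor) -> torch.Tensor:
    # Loss = -log(sigmoid(r_theta(x, y_h) - r_theta(x, y_l)))
    return -F.logsigmoid(first_scores - second_scores).mean()

# Example with two sample pairs in a batch.
first_scores = torch.tensor([1.2, 0.3])    # scores of the first samples, r_theta(x, y_h)
second_scores = torch.tensor([0.7, 0.9])   # scores of the second samples, r_theta(x, y_l)
print(ranking_loss(first_scores, second_scores))
```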
According to the training method of the reward model provided by the embodiment of the invention, a plurality of sample pairs are obtained, wherein each sample pair comprises a first sample and a second sample, the first sample comprises a sample question and a first sample answer, the second sample comprises the sample question and a second sample answer, the target score of the first sample answer is higher than the target score of the second sample answer, the target score is related to knowledge correctness, and the knowledge correctness is determined based on the target answer matched with the sample question in a knowledge graph; for each sample pair, the first sample and the second sample are input into an initial reward model to obtain a first score of the first sample and a second score of the second sample output by the initial reward model; and the model parameters of the initial reward model are adjusted based on the first score and the second score to obtain the reward model, which is used for evaluating answers output by the large language model. During training of the reward model, the knowledge correctness of the first sample answer and of the second sample answer in a sample pair is determined based on the target answer in the knowledge graph, and the target answer in the knowledge graph is a knowledge-correct answer, so the target score determined from it reflects the knowledge correctness of each sample answer: the higher the knowledge correctness, the higher the target score. Therefore, when the reward model is trained on such sample pairs, the gap between the first sample answer with the higher target score and the second sample answer with the lower target score can be widened, and the trained reward model pays more attention to the knowledge correctness of an answer when evaluating answers output by the large language model, which improves the accuracy with which the knowledge correctness of answers is identified.
Illustratively, when the model parameters of the initial reward model are adjusted based on the first score and the second score to obtain the reward model, a first knowledge correctness judgment result of the first sample answer relative to the target answer and a second knowledge correctness judgment result of the second sample answer relative to the target answer may be determined; the quantization gap between the first and second knowledge correctness judgment results may then be determined, and the loss information determined based on the first score, the second score and the quantization gap, so that the model parameters of the initial reward model are adjusted based on the loss information to obtain the reward model.
In this step, in order to introduce the evaluation of knowledge correctness into the reward model, the embodiment of the invention adds an auxiliary task of knowledge judgment; specifically, the binary ranking loss of the foregoing embodiment is further modified so that the reward model can better learn the ability to judge the correctness of an answer in combination with knowledge. It is therefore necessary to determine a first knowledge correctness judgment result of the first sample answer relative to the target answer and a second knowledge correctness judgment result of the second sample answer relative to the target answer. These judgment results fall into three categories: correct, partially correct and wrong. Taking the first knowledge correctness judgment result as an example, the judgment needs to be strict for the part of the answer that addresses the core of the sample question, while the supplementary information output by the large language model can be treated more leniently. For example, the first sample answer can be compared with the target answer: when the part of the first sample answer corresponding to the sample question input by the user is correct and the additional supplementary information of the large language model is also correct, the first knowledge correctness judgment result is determined to be "correct"; when the part corresponding to the sample question is correct but the additional supplementary information of the large language model is wrong, the first knowledge correctness judgment result is determined to be "partially correct"; and when the part corresponding to the sample question is wrong, the first knowledge correctness judgment result is determined to be "wrong". Likewise, the second knowledge correctness judgment result of the second sample answer relative to the target answer can be determined in the same way.
Further, the quantization gap between the first knowledge correctness judgment result and the second knowledge correctness judgment result may be determined. For example, the size of the quantization gap r may be determined according to the three knowledge labels "correct", "partially correct" and "wrong", with r taking values in [0, 2]: if the first knowledge correctness judgment result is "correct" and the second is "wrong", r is 2; if the first is "correct" and the second is "partially correct", r is 1; if the first is "partially correct" and the second is "wrong", r is 0.8; and so on.
After determining the quantization gap, the loss information may be determined based on the following equation (2):
Loss = -log(σ(r_θ(x, y_h) - r_θ(x, y_l) - m_k(r)))    (2)
where m_k(r) represents a discrete function of the reply-correctness deviation, used to measure the knowledge correctness difference between the first knowledge correctness judgment result and the second knowledge correctness judgment result, and r represents the quantization gap.
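The sketch below extends the previous loss with the margin m_k(r) of formula (2). The mapping from label pairs to the quantization gap r reproduces the example values given above; taking m_k as the identity on r and using a fallback value of 0 for other label pairs are assumptions, since the patent leaves the exact discrete function unspecified.

```python
# Margin-augmented ranking loss of formula (2): knowledge-correct answers are pushed above
# knowledge-incorrect ones by a margin that grows with the knowledge correctness gap.
import torch
import torch.nn.functional as F

QUANTIZATION_GAP = {                       # r, as in the examples above
    ("correct", "wrong"): 2.0,
    ("correct", "partially correct"): 1.0,
    ("partially correct", "wrong"): 0.8,
}

def margin_ranking_loss(first_score: torch.Tensor, second_score: torch.Tensor,
                        first_label: str, second_label: str) -> torch.Tensor:
    # m_k(r) is taken as the identity here; unlisted label pairs fall back to 0 (assumption).
    margin = QUANTIZATION_GAP.get((first_label, second_label), 0.0)
    # Loss = -log(sigmoid(r_theta(x, y_h) - r_theta(x, y_l) - m_k(r)))
    return -F.logsigmoid(first_score - second_score - margin).mean()
```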
It will be appreciated that adding m_k(r) to the loss information amounts to training the reward model in a multi-task learning manner; the model parameters of the initial reward model can then be adjusted based on the determined loss information. This widens the gap between the scores of knowledge-correct and knowledge-incorrect sample answers, so that the trained reward model learns more explicitly how to judge the knowledge correctness of the answers output by the large language model in combination with the knowledge graph.
In the embodiment of the invention, adding the knowledge correctness judgment result enhances the reward model's judgment of knowledge by combining the knowledge graph in the external knowledge base. Without the external knowledge base, the reward model can still make judgments, but it is usually small and unstable, its stored knowledge is often not as extensive as that of the large language model, and when it encounters unseen or fuzzy knowledge it easily produces knowledge hallucinations, such as attributing facts to the wrong subject or fabricating information out of nothing. After the external knowledge base is added, the reward model still needs to learn how to use this knowledge to judge the correctness of a reply; therefore, adding m_k(r) to the loss assists the model's judgment, allowing the reward model to judge the knowledge of a sample answer more directly against the three standards of correct, partially correct and wrong, and preventing it from mistakenly giving high scores to knowledge-incorrect replies and thereby aggravating the knowledge hallucination of the large language model.
Further, after the reward model is trained, the large language model can be fine-tuned based on the evaluation results produced by the reward model on its answers. For example, reinforcement learning fine-tuning may be performed by means of proximal policy optimization (PPO) to align the large language model with humans. Human feedback is used as the reinforcement learning signal, so that the answer finally generated by the large language model is optimized in its relevance to the question input by the user, while better meeting the needs and preferences of human users.
In this embodiment, the first knowledge correctness judgment result of the first sample answer relative to the target answer and the second knowledge correctness judgment result of the second sample answer relative to the target answer may be determined, and after the quantization gap between them is determined, the loss information may be determined based on the first score, the second score and the quantization gap, so that the model parameters of the initial reward model are adjusted based on the loss information to obtain the final reward model. Adding the quantization gap between the first and second knowledge correctness judgment results to the loss information not only lets the reward model learn more explicitly how to judge the correctness of the answers output by the large language model in combination with the knowledge graph, but also lets it judge the knowledge of an answer more directly against the three standards of correct, partially correct and wrong, which can prevent the reward model from mistakenly giving high scores to knowledge-incorrect replies and thereby aggravating the knowledge hallucination of the large language model.
In addition, evaluating the answers of the large language model with the reward model provided by the embodiment of the invention makes the large language model conform better to human preference. This plays an important role in fields such as human-computer interaction, automatic question-answering systems and information retrieval, and is of great significance for improving the application value of artificial intelligence in current social development.
It should be appreciated that the target score of the first sample answer characterizes the knowledge correctness of the first sample answer; it is therefore important to be able to determine an accurate target score. Next, the manner of determining the target score of the sample answers included in each sample pair is described in detail. The sample answers described in the following embodiments may include the first sample answer or the second sample answer, and may also include sample answers in other sample pairs.
Illustratively, the target score of a sample answer may be determined as follows: obtaining a plurality of sample answers corresponding to the sample question, searching the knowledge graph for the target answer corresponding to the sample question, and determining the target score of each sample answer based on the target answer.
Specifically, FIG. 2 is a schematic diagram of the training process of the reward model provided by an embodiment of the present invention. As shown in FIG. 2, Prompts may be obtained, for example, from open-source data or by self-built collection; the collected Prompts generally have wide coverage and varied forms. After the Prompts are collected, the obtained sample question and the collected Prompts may be input into different large language models to obtain the different sample answers output by each large language model. For each Prompt, the user may make preference labels among the sample answers output by these different large language models to provide the reward model with learning signals.
For example, if the sample question is "how many nanometers is the chip mounted on the newly released XX mobile phone", then after the sample question and the collected Prompt are input into four different large language models, four sample answers can be obtained, namely A: "The newly released XX mobile phone carries the M1 Pro chip; according to the official data of the XX mobile phone, the chip uses a 3 nm process ……", B: "XX company has not yet released the XX mobile phone, therefore it is impossible to determine the nanometer process ……", C: "The newly released XX mobile phone carries the M1 Pro or M1 Max chip, both of which are produced with TSMC's 5 nm process.", and D: "The A17 Pro chip carried by the newly released XX mobile phone adopts a 3 nm (nanometer) process, and the design of the chip is very complex ……".
Further, for the sample question, entities and relations may be extracted by combining the sample question with the knowledge graph to form a triple, such as entity-attribute-value, corresponding to the sample question. The extraction of entities and relations may be performed in any manner, for example using named entity recognition (Named Entity Recognition, NER) and relation extraction (Relation Extraction, RE) techniques, where NER is used to identify the entities in the sample question and RE is used to determine the relations between them. It can be understood that the entity and the attribute in the extracted triple are extracted from the sample question, while the attribute value is obtained from the knowledge graph; the attribute value can be understood as the target answer corresponding to the sample question. For example, for the sample question "how many nanometers is the chip mounted on the newly released XX mobile phone", the triple (XX mobile phone, chip, A17 Pro chip, 3 nm) can be extracted by combining the knowledge graph, where "XX mobile phone" is the entity extracted from the sample question, "chip" is the attribute extracted from the sample question, and "A17 Pro chip" and "3 nm" are the attribute values obtained from the knowledge graph, which constitute the target answer corresponding to the sample question.
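As a toy illustration of this step (not the patent's implementation), the sketch below replaces the NER and RE models with simple string matching and represents the knowledge graph as a dictionary; the entity, attribute and attribute values are those of the example above.

```python
# Toy sketch: extract (entity, attribute) from the sample question and look up the attribute
# value, i.e. the target answer, in a dictionary standing in for the knowledge graph.
KNOWLEDGE_GRAPH = {
    ("XX mobile phone", "chip"): ("A17 Pro chip", "3 nm"),
}

def extract_entity_and_attribute(sample_question: str):
    # Placeholder for NER + relation extraction over the sample question.
    if "XX mobile phone" in sample_question and "chip" in sample_question:
        return "XX mobile phone", "chip"
    return None, None

def lookup_target_answer(sample_question: str):
    entity, attribute = extract_entity_and_attribute(sample_question)
    values = KNOWLEDGE_GRAPH.get((entity, attribute))
    # The triple combines what was extracted from the question with the value from the graph.
    return (entity, attribute, values) if values else None

print(lookup_target_answer(
    "How many nanometers is the chip mounted on the newly released XX mobile phone?"))
# -> ('XX mobile phone', 'chip', ('A17 Pro chip', '3 nm'))
```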
Further, the target score of each sample answer may be determined based on the target answer. In this way, the reward model can indirectly learn how severe different knowledge errors are, and the scores it gives when evaluating the large language model will vary accordingly.
In this embodiment, the target answer corresponding to the sample question may be searched from the knowledge graph, so that the target score of each sample answer is determined based on the target answer.
For example, in the above embodiment, when the target score of each sample answer is determined based on the target answer, the target knowledge correctness judgment result of each sample answer relative to the target answer may be determined, and the target score of each sample answer may be determined based on the target knowledge correctness judgment result and the target evaluation dimension; the target evaluation dimension includes at least one of: the fluency of the sample answer, the satisfaction of the user with the sample answer, and the safety of the sample answer.
Specifically, the sample answer and the target answer may be compared to determine the target knowledge correctness judgment result of each sample answer relative to the target answer. In one possible implementation, for each sample answer, a first sub-answer and a second sub-answer other than the first sub-answer are determined in the sample answer, wherein the first sub-answer comprises the answer corresponding to the sample question; based on the target answer, the knowledge correctness judgment result of the first sub-answer and that of the second sub-answer are determined respectively; and the target knowledge correctness judgment result corresponding to the sample answer is determined based on the knowledge correctness judgment results of the first sub-answer and the second sub-answer.
In order to increase the richness of answers and the fullness of content, a large language model usually answers the sample question and then adds further supplements. Therefore, a first sub-answer and a second sub-answer can be determined in each sample answer, wherein the first sub-answer is the answer corresponding to the sample question and the second sub-answer is the content additionally added by the large language model. For example, in sample answer A of the above example, "The newly released XX mobile phone carries the M1 Pro chip; according to the official data of the XX mobile phone, the chip uses a 3 nm process ……", the first sub-answer is "3 nm" and the second sub-answer is "the XX mobile phone carries the M1 Pro chip".
Further, the first sub-answer and the second sub-answer can each be compared with the target answer to obtain the knowledge correctness judgment result of the first sub-answer and that of the second sub-answer, and the target knowledge correctness judgment result corresponding to the sample answer can then be determined from these two results.
Determining the knowledge correctness judgment result of the first sub-answer and that of the second sub-answer separately, and then deriving the target knowledge correctness judgment result of the sample answer, makes the determination more fine-grained and further improves its accuracy; in addition, the fine-grained target knowledge correctness judgment result enables the finally trained reward model to indirectly learn to distinguish the severity of different knowledge errors.
In addition, when the target knowledge correctness judgment result corresponding to the sample answer is determined based on the knowledge correctness judgment results of the first sub-answer and the second sub-answer: when the knowledge correctness judgment result of the first sub-answer is correct and that of the second sub-answer is correct, the target knowledge correctness judgment result corresponding to the sample answer is determined to be correct; when the knowledge correctness judgment result of the first sub-answer is correct and that of the second sub-answer is wrong, the target knowledge correctness judgment result corresponding to the sample answer is determined to be partially correct; and when the knowledge correctness judgment result of the first sub-answer is wrong, the target knowledge correctness judgment result corresponding to the sample answer is determined to be wrong.
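The combination rule above amounts to a few lines of logic; the following sketch is illustrative only, with label strings chosen to match the three categories used in this description.

```python
# Combine the judgments of the two sub-answers into the target judgment of the sample answer.
def combine_judgments(first_sub_answer_correct: bool, second_sub_answer_correct: bool) -> str:
    if not first_sub_answer_correct:
        # If the core answer to the sample question is wrong, the whole sample answer is
        # judged wrong, regardless of the supplementary information.
        return "wrong"
    return "correct" if second_sub_answer_correct else "partially correct"

# Sample answer A above: the core answer "3 nm" is correct, the supplementary "M1 Pro chip" is not.
print(combine_judgments(first_sub_answer_correct=True, second_sub_answer_correct=False))
# -> partially correct
```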
For example, as shown in FIG. 2, when the first sub-answer and the second sub-answer in sample answers A, B, C and D are compared with the target answer "A17 Pro chip" and "3 nm" in the triple, it can be determined that the target knowledge correctness judgment result of sample answer A is "partially correct", that of sample answer B is "wrong", that of sample answer C is "wrong", and that of sample answer D is "correct".
By determining the knowledge correctness judgment result of the first sub-answer and that of the second sub-answer separately, a more accurate target knowledge correctness judgment result can be obtained, so that when the reward model is trained with the target knowledge correctness judgment result incorporated into the loss information, the trained reward model can evaluate the knowledge correctness of the answers output by the large language model more accurately.
Further, after the target knowledge correctness judgment result of a sample answer is determined, the target score of the sample answer can be determined based on the target knowledge correctness judgment result and the target evaluation dimensions, so as to comprehensively score and rank the sample answers against standards from all aspects. This process is an important part of aligning the large language model with human preference, and therefore a comprehensive standard needs to be formulated. In the embodiment of the invention, besides the target knowledge correctness judgment result, the target evaluation dimensions mainly include the fluency of the sample answer, the satisfaction of the user with the sample answer, and the safety of the sample answer.
The target knowledge correctness judgment result mainly characterizes the knowledgeability of the sample answer. The knowledge evaluation standard focuses on whether the sample answers generated by the large language model are accurate, relevant, and of practical reference value. For example, if the target knowledge correctness judgment result of a sample answer is wrong, its knowledge score will not be higher than that of a sample answer whose target knowledge correctness judgment result is correct.
The satisfaction of the user with the sample answer can also be understood as the helpfulness of the sample answer. The helpfulness evaluation standard focuses on whether the sample answer generated by the large language model can meet the user's needs, helping the user solve the problem or providing useful advice. Under this criterion, it is evaluated whether the response of the large language model is closely related to the sample question posed by the user and whether a practically feasible solution is provided for the user.
The safety degree of the sample answer can also be understood as its safety. The safety evaluation standard focuses on whether the sample answer generated by the large language model contains content that may lead to undesirable consequences such as misleading, discriminatory or offensive statements. In order to ensure the safety of the large language model, the output sample answers are screened to eliminate information containing potential risks, and undesirable content is marked so that corresponding adjustments can be made during training.
As shown in FIG. 2, in the foregoing example, after the target scores are determined based on the target knowledge correctness judgment results and the target evaluation dimensions, the sample answers A, B, C and D are ranked by target score as D > A > B = C.
After the ranking among the sample answers is determined, a first sample answer with a higher target score and a second sample answer with a lower target score can be determined based on the ranking result, so that the first sample and the second sample are formed by combining them with the sample question, and the initial reward model is trained to obtain the final reward model.
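Purely as an illustration of how the ranked, labeled answers could be turned into the sample pairs of step 101, the sketch below combines the knowledge judgment with the other evaluation dimensions using assumed numeric values and equal weights; the patent does not fix a particular scoring formula.

```python
# Assumed scoring scheme: a numeric value per knowledge label plus the other evaluation
# dimensions; answers with different target scores form (first sample, second sample) pairs.
from itertools import combinations
from typing import Dict, List, Tuple

KNOWLEDGE_SCORE = {"correct": 2.0, "partially correct": 1.0, "wrong": 0.0}   # assumption

def target_score(judgment: str, fluency: float, helpfulness: float, safety: float) -> float:
    return KNOWLEDGE_SCORE[judgment] + fluency + helpfulness + safety        # equal weights (assumption)

def build_sample_pairs(sample_question: str,
                       labeled_answers: List[Tuple[str, str, float, float, float]]) -> List[Dict]:
    scored = [(answer, target_score(j, flu, hel, saf)) for answer, j, flu, hel, saf in labeled_answers]
    pairs = []
    for (ans_a, score_a), (ans_b, score_b) in combinations(scored, 2):
        if score_a == score_b:
            continue   # tied answers (such as B and C above) form no pair
        high, low = (ans_a, ans_b) if score_a > score_b else (ans_b, ans_a)
        pairs.append({"first_sample": (sample_question, high),
                      "second_sample": (sample_question, low)})
    return pairs
```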
In this embodiment, the target knowledge correctness judgment result characterizing the knowledgeability of the answer is added on top of the target evaluation dimensions, so that the trained reward model can evaluate the knowledge correctness of the answers output by the large language model more accurately. This effectively improves the performance of the large language model in application and increases its efficiency and effect, giving it stronger adaptability and coping capability, especially when processing complex and variable human language. In addition, the target score of a sample answer is determined by combining the target evaluation dimensions with the target knowledge correctness judgment result; since the target score is a comprehensive score, sample answers with knowledge errors are prevented from receiving high scores, which further alleviates the knowledge hallucination problem that alignment schemes for large language models may aggravate. The effect of evaluating based on the target evaluation dimensions is retained, while the judgment of knowledge correctness is significantly improved and the knowledge hallucination of the large language model during alignment is alleviated, so the method has good effectiveness and practicability.
It should be noted that the labeling process described above, that is, the process of determining the target score, may also be performed by a labeling person.
FIG. 3 is a schematic diagram of the training process and the fine-tuning process of the reward model provided by an embodiment of the present invention. As shown in FIG. 3, the process mainly includes a labeling stage, a reward-model training stage and a reinforcement-learning fine-tuning stage.
The labeling stage mainly comprises four steps: first, collecting sample questions and generating a plurality of sample answers for comparison, where the sample questions may be questions preferred by users; second, extracting the triple for the sample question based on the knowledge graph; third, judging the correctness of the sample answers based on the extracted triple; and fourth, comprehensively ranking the sample answers. This process may be performed by an electronic device or by human annotators.
Reinforcement learning from human feedback (Reinforcement Learning with Human Feedback, RLHF) is a very important part of current large language model training, and it is particularly important for aligning the model with human preferences and instructions. Specifically, RLHF is a model training method that first samples data representative of human preferences, after which annotators select which of two model outputs they prefer. This human feedback is then used to train a reward model which, once aligned with human preferences, can automatically score preference decisions in place of humans. It follows that the effect of RLHF depends heavily on the reward model, and the scoring mechanism of the reward model must be consistent with human preferences, so the labeling process is the most important and complex part of the overall procedure.
In the reward-model training stage, the reward model may be trained using the sample answers collected and ranked in the previous stage as training samples. Through training and optimization, the large language model can then generate better answers when it receives new questions.
In addition, reinforcement-learning fine-tuning may be performed to further improve accuracy. At this stage, the system may be fine-tuned and optimized using reinforcement learning methods in combination with the trained reward model. Human feedback is used as the reinforcement learning signal, so that the answer finally generated by the large language model is optimized in its relevance to the question input by the user, while better meeting the needs and preferences of human users.
Fig. 4 is a flow chart of an answer evaluation method according to an embodiment of the present invention, as shown in fig. 4, the method includes:
step 401: and obtaining a predicted answer to the target problem output by the large language model.
In this step, after the target question input by the user is obtained, the target question may be input into the large language model, so as to obtain a predicted answer output by the large language model, where the predicted answer may be understood as an answer to the target question output by the large language model.
Step 402: and inputting the target questions and the predicted answers into a reward model to obtain evaluation results of the predicted answers output by the reward model.
Wherein the reward model is trained based on the training method of the reward model described in any of the above embodiments. The specific training process may refer to any of the foregoing embodiments, and the specific training process is not described herein.
In this step, the predicted answer output by the large language model may contain knowledge errors, so the target question and the obtained predicted answer need to be input into the reward model, which judges the predicted answer to determine whether it contains knowledge errors, thereby producing the evaluation result output by the reward model. The evaluation result includes a judgment of the knowledge correctness of the predicted answer, such as correct, partially correct or wrong.
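For illustration, the evaluation flow of steps 401 and 402 can be sketched as follows, reusing the RewardModel sketch from the training section; the large_language_model callable is a hypothetical placeholder for whatever model produces the predicted answer.

```python
# Answer evaluation sketch: obtain the predicted answer (step 401), then score the
# (target question, predicted answer) pair with the trained reward model (step 402).
def evaluate_predicted_answer(reward_model, large_language_model, target_question: str):
    predicted_answer = large_language_model(target_question)      # step 401
    score = reward_model(target_question, predicted_answer)       # step 402
    return predicted_answer, float(score)
```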
According to the answer evaluation method provided by the embodiment of the invention, the predicted answer to the target question output by the large language model is obtained, and the target question and the predicted answer are input into the reward model to obtain the evaluation result of the predicted answer output by the reward model. During training of the reward model, the knowledge correctness of the first sample answer and of the second sample answer in each sample pair is determined based on the target answer in the knowledge graph, and the target answer in the knowledge graph is a knowledge-correct answer, so the target score determined from it reflects the knowledge correctness of each sample answer: the higher the knowledge correctness, the higher the target score. Therefore, when the reward model is trained on such sample pairs, the gap between the first sample answer with the higher target score and the second sample answer with the lower target score can be widened, and the trained reward model pays more attention to the knowledge correctness of the predicted answer when evaluating predicted answers output by the large language model, which improves the accuracy with which the knowledge correctness of predicted answers is identified.
The training device of the reward model provided by the invention is described below; the training device of the reward model described below and the training method of the reward model described above may be referred to correspondingly.
Fig. 5 is a schematic structural diagram of a training device for a reward model according to an embodiment of the invention. Referring to fig. 5, the training device 500 for the reward model includes:
an obtaining module 501, configured to obtain a plurality of sample pairs, where each sample pair includes a first sample and a second sample, where the first sample includes a sample question and a first sample answer, and the second sample includes the sample question and a second sample answer, and a target score of the first sample answer is higher than a target score of the second sample answer, where the target score is related to a knowledge accuracy, and the knowledge accuracy is determined based on a target answer matched with the sample question in a knowledge graph;
an input module 502, configured to input, for each of the sample pairs, the first sample and the second sample in the sample pair into an initial reward model, and obtain a first score of the first sample and a second score of the second sample output by the initial reward model;
And the adjusting module 503 is configured to adjust model parameters of the initial rewarding model based on the first score and the second score, so as to obtain a rewarding model, where the rewarding model is used for evaluating answers output by the large language model.
In an exemplary embodiment, the adjusting module 503 is specifically configured to:
determining a first knowledge correctness judgment result of the first sample answer relative to the target answer and a second knowledge correctness judgment result of the second sample answer relative to the target answer;
determining a quantization gap between the first knowledge correctness judgment result and the second knowledge correctness judgment result;
determining loss information based on the first score, the second score, and the quantization gap;
and adjusting model parameters of the initial rewarding model based on the loss information to obtain the rewarding model.
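One possible realization of the adjustment described above is a margin-based pairwise loss in which the quantization gap between the two knowledge correctness judgment results widens the required score difference; the specific loss form below is an assumption for illustration, not a limitation.

    import torch.nn.functional as F

    def pairwise_loss_with_gap(first_score, second_score, quantization_gap):
        # The larger the quantization gap between the two knowledge correctness
        # judgment results, the larger the score margin that is enforced.
        return -F.logsigmoid(first_score - second_score - quantization_gap).mean()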
In an example embodiment, the apparatus further comprises a lookup module and a determination module, wherein:
the obtaining module 501 is further configured to obtain a plurality of sample answers corresponding to the sample questions;
the searching module is used for searching, in the knowledge graph, a target answer corresponding to the sample question (a lookup sketch is given after this list);
And the determining module is used for determining the target score of each sample answer based on the target answers.
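A minimal sketch of the searching module referred to above, assuming the knowledge graph is stored as subject-relation-object triples and that the sample question has already been parsed into a (subject, relation) query; both assumptions are illustrative only.

    def lookup_target_answer(knowledge_graph, subject, relation):
        # Return the object of the matching triple as the target answer,
        # or None when the knowledge graph holds no matching entry.
        for s, r, o in knowledge_graph:
            if s == subject and r == relation:
                return o
        return None

    # Example: lookup_target_answer([("Paris", "capital_of", "France")],
    #                               "Paris", "capital_of") returns "France".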
In an example embodiment, the determining module is specifically configured to:
determining a target knowledge correctness judgment result of each sample answer relative to the target answer;
determining a target score of each sample answer based on a target knowledge correctness judgment result and a target evaluation dimension of the target answer; the target evaluation dimension includes at least one of: the fluency of the sample answer, the satisfaction of the user on the sample answer and the safety of the sample answer.
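The combination of the target knowledge correctness judgment result with the listed evaluation dimensions may be sketched as a weighted sum; the particular label mapping and weights below are assumptions for illustration only.

    def target_score(correctness, fluency, satisfaction, safety,
                     weights=(0.7, 0.1, 0.1, 0.1)):
        # Map the knowledge correctness judgment result to a base value and
        # blend in fluency, user satisfaction and safety, all taken in [0, 1].
        base = {"correct": 1.0, "partially correct": 0.5, "incorrect": 0.0}[correctness]
        w_c, w_f, w_s, w_sa = weights
        return w_c * base + w_f * fluency + w_s * satisfaction + w_sa * safety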
In an example embodiment, the determining module is specifically configured to:
for each sample answer, determining that the sample answer comprises a first sub-answer and a second sub-answer other than the first sub-answer, wherein the first sub-answer comprises the answer corresponding to the sample question;
based on the target answers, determining knowledge correctness judgment results of the first sub-answers and knowledge correctness judgment results of the second sub-answers respectively;
and determining a target knowledge correctness judgment result corresponding to the sample answer based on the knowledge correctness judgment result of the first sub-answer and the knowledge correctness judgment result of the second sub-answer.
In an example embodiment, the determining module is specifically configured to:
determining that the target knowledge correctness judgment result corresponding to the sample answer is correct under the condition that the knowledge correctness judgment result of the first sub answer is correct and the knowledge correctness judgment result of the second sub answer is correct;
determining that the target knowledge correctness judgment result corresponding to the sample answer is partially correct under the condition that the knowledge correctness judgment result of the first sub-answer is correct and the knowledge correctness judgment result of the second sub-answer is incorrect;
and under the condition that the knowledge correctness judgment result of the first sub-answer is wrong, determining that the target knowledge correctness judgment result corresponding to the sample answer is wrong.
The apparatus of this embodiment may be used to execute the method of any one of the embodiments on the training method side of the reward model. Its specific implementation process and technical effects are similar to those of the training method embodiments of the reward model; reference may be made to the detailed description of those embodiments, which is not repeated here.
Fig. 6 is a schematic structural diagram of an answer evaluation device according to an embodiment of the present invention, and referring to fig. 6, an answer evaluation device 600 includes:
The obtaining module 601 is configured to obtain a predicted answer for a target question output by the large language model;
and the input module 602 is configured to input the target question and the predicted answer into a reward model, and obtain an evaluation result of the predicted answer output by the reward model, where the reward model is trained based on the training method of the reward model in any embodiment.
The device of the present embodiment may be used to execute the method of any one of the embodiments of the answer evaluation method side, and the specific implementation process and technical effects thereof are similar to those of the embodiment of the answer evaluation method side, and specific reference may be made to the detailed description of the embodiment of the answer evaluation method side, which is not repeated herein.
Fig. 7 illustrates a physical schematic diagram of an electronic device, as shown in fig. 7, which may include: processor 710, communication interface (Communications Interface) 720, memory 730, and communication bus 740, wherein processor 710, communication interface 720, memory 730 communicate with each other via communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a training method for a reward model, the method comprising: obtaining a plurality of sample pairs, wherein each sample pair comprises a first sample and a second sample, the first sample comprises a sample question and a first sample answer, the second sample comprises the sample question and a second sample answer, the target score of the first sample answer is higher than the target score of the second sample answer, the target score is related to knowledge accuracy, and the knowledge accuracy is determined based on the target answer matched with the sample question in a knowledge graph; inputting the first sample and the second sample in the sample pair into an initial rewarding model for each sample pair to obtain a first score of the first sample and a second score of the second sample output by the initial rewarding model; and adjusting model parameters of the initial rewarding model based on the first score and the second score to obtain a rewarding model, wherein the rewarding model is used for evaluating answers output by the large language model.
In addition, processor 710 may also invoke logic instructions in memory 730 to perform an answer evaluation method comprising: obtaining a predicted answer to the target question output by the large language model; and inputting the target question and the predicted answer into a reward model to obtain an evaluation result of the predicted answer output by the reward model, wherein the reward model is trained based on the training method of the reward model in any embodiment.
Further, the logic instructions in the memory 730 may be implemented in the form of software functional units and, when sold or used as a standalone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing a method of training a reward model provided by the methods described above, the method comprising: obtaining a plurality of sample pairs, wherein each sample pair comprises a first sample and a second sample, the first sample comprises a sample question and a first sample answer, the second sample comprises the sample question and a second sample answer, the target score of the first sample answer is higher than the target score of the second sample answer, the target score is related to knowledge accuracy, and the knowledge accuracy is determined based on the target answer matched with the sample question in a knowledge graph; inputting the first sample and the second sample in the sample pair into an initial rewarding model for each sample pair to obtain a first score of the first sample and a second score of the second sample output by the initial rewarding model; and adjusting model parameters of the initial rewarding model based on the first score and the second score to obtain a rewarding model, wherein the rewarding model is used for evaluating answers output by the large language model.
The computer program, when executed by the processor, is further capable of executing the answer evaluation method provided by the methods above, the method comprising: obtaining a predicted answer to the target question output by the large language model; and inputting the target question and the predicted answer into a reward model to obtain an evaluation result of the predicted answer output by the reward model, wherein the reward model is trained based on the training method of the reward model in any embodiment.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the training method of a reward model provided by the above methods, the method comprising: obtaining a plurality of sample pairs, wherein each sample pair comprises a first sample and a second sample, the first sample comprises a sample question and a first sample answer, the second sample comprises the sample question and a second sample answer, the target score of the first sample answer is higher than the target score of the second sample answer, the target score is related to knowledge accuracy, and the knowledge accuracy is determined based on the target answer matched with the sample question in a knowledge graph; inputting the first sample and the second sample in the sample pair into an initial rewarding model for each sample pair to obtain a first score of the first sample and a second score of the second sample output by the initial rewarding model; and adjusting model parameters of the initial rewarding model based on the first score and the second score to obtain a rewarding model, wherein the rewarding model is used for evaluating answers output by the large language model.
In addition, the computer program, when executed by a processor, implements the answer evaluation method provided by the methods described above, the method comprising: obtaining a predicted answer to the target question output by the large language model; and inputting the target question and the predicted answer into a reward model to obtain an evaluation result of the predicted answer output by the reward model, wherein the reward model is trained based on the training method of the reward model in any embodiment.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A method of training a reward model, comprising:
obtaining a plurality of sample pairs, wherein each sample pair comprises a first sample and a second sample, the first sample comprises a sample question and a first sample answer, the second sample comprises the sample question and a second sample answer, the target score of the first sample answer is higher than the target score of the second sample answer, the target score is related to knowledge accuracy, and the knowledge accuracy is determined based on the target answer matched with the sample question in a knowledge graph;
inputting the first sample and the second sample in the sample pair into an initial rewarding model for each sample pair to obtain a first score of the first sample and a second score of the second sample output by the initial rewarding model;
And adjusting model parameters of the initial rewarding model based on the first score and the second score to obtain a rewarding model, wherein the rewarding model is used for evaluating answers output by the large language model.
2. The method of claim 1, wherein adjusting model parameters of the initial reward model based on the first score and the second score to obtain a reward model comprises:
determining a first knowledge correctness judgment result of the first sample answer relative to the target answer and a second knowledge correctness judgment result of the second sample answer relative to the target answer;
determining a quantization gap between the first knowledge correctness judgment result and the second knowledge correctness judgment result;
determining loss information based on the first score, the second score, and the quantization gap;
and adjusting model parameters of the initial rewarding model based on the loss information to obtain the rewarding model.
3. The method of training a reward model of claim 1, further comprising:
obtaining a plurality of sample answers corresponding to the sample questions;
Searching a target answer corresponding to the sample question in the knowledge graph;
and determining the target score of each sample answer based on the target answers.
4. A method of training a reward model according to claim 3, wherein said determining a target score for each of said sample answers based on said target answers comprises:
determining a target knowledge correctness judgment result of each sample answer relative to the target answer;
determining a target score of each sample answer based on a target knowledge correctness judgment result and a target evaluation dimension of the target answer; the target evaluation dimension includes at least one of: the fluency of the sample answer, the satisfaction of the user on the sample answer and the safety of the sample answer.
5. The method of claim 4, wherein determining a target knowledge correctness determination result of each of the sample answers with respect to the target answer comprises:
for each sample answer, determining that the sample answer comprises a first sub-answer and a second sub-answer other than the first sub-answer, wherein the first sub-answer comprises the answer corresponding to the sample question;
Based on the target answers, determining knowledge correctness judgment results of the first sub-answers and knowledge correctness judgment results of the second sub-answers respectively;
and determining a target knowledge correctness judgment result corresponding to the sample answer based on the knowledge correctness judgment result of the first sub-answer and the knowledge correctness judgment result of the second sub-answer.
6. The method according to claim 5, wherein determining the target knowledge correctness determination result corresponding to the sample answer based on the knowledge correctness determination result of the first sub-answer and the knowledge correctness determination result of the second sub-answer comprises:
determining that the target knowledge correctness judgment result corresponding to the sample answer is correct under the condition that the knowledge correctness judgment result of the first sub answer is correct and the knowledge correctness judgment result of the second sub answer is correct;
determining that the target knowledge correctness judgment result corresponding to the sample answer is partially correct under the condition that the knowledge correctness judgment result of the first sub-answer is correct and the knowledge correctness judgment result of the second sub-answer is incorrect;
And under the condition that the knowledge correctness judgment result of the first sub-answer is wrong, determining that the target knowledge correctness judgment result corresponding to the sample answer is wrong.
7. An answer evaluation method, comprising:
obtaining a predicted answer to the target question output by the large language model;
inputting the target question and the predicted answer into a reward model, and obtaining an evaluation result of the predicted answer output by the reward model, wherein the reward model is trained based on the training method of the reward model according to any one of claims 1-6.
8. A training device for a reward model, comprising:
an obtaining module, configured to obtain a plurality of sample pairs, where each sample pair includes a first sample and a second sample, where the first sample includes a sample question and a first sample answer, and the second sample includes the sample question and a second sample answer, and a target score of the first sample answer is higher than a target score of the second sample answer, where the target score is related to a knowledge accuracy, where the knowledge accuracy is determined based on a target answer matched with the sample question in a knowledge graph;
The input module is used for inputting the first sample and the second sample in the sample pair into an initial rewarding model for each sample pair to obtain a first score of the first sample and a second score of the second sample output by the initial rewarding model;
and the adjustment module is used for adjusting the model parameters of the initial rewarding model based on the first score and the second score to obtain a rewarding model, and the rewarding model is used for evaluating the answers output by the large language model.
9. An answer evaluation device, comprising:
the obtaining module is used for obtaining a predicted answer to the target question output by the large language model;
and the input module is used for inputting the target question and the predicted answer into a reward model to obtain an evaluation result of the predicted answer output by the reward model, wherein the reward model is trained based on the training method of the reward model according to any one of claims 1-6.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the training method of the reward model according to any one of claims 1-6 or the answer evaluation method according to claim 7.
11. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements a training method of a reward model according to any one of claims 1-6 or implements an answer evaluation method according to claim 7.
CN202311828971.0A 2023-12-26 2023-12-26 Training method of reward model, answer evaluation method, device and equipment Pending CN117688158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311828971.0A CN117688158A (en) 2023-12-26 2023-12-26 Training method of reward model, answer evaluation method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311828971.0A CN117688158A (en) 2023-12-26 2023-12-26 Training method of reward model, answer evaluation method, device and equipment

Publications (1)

Publication Number Publication Date
CN117688158A true CN117688158A (en) 2024-03-12

Family

ID=90137131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311828971.0A Pending CN117688158A (en) 2023-12-26 2023-12-26 Training method of reward model, answer evaluation method, device and equipment

Country Status (1)

Country Link
CN (1) CN117688158A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118036757A (en) * 2024-04-15 2024-05-14 Tsinghua University Training method and device for large language model


Similar Documents

Publication Publication Date Title
CN116059646B (en) Interactive expert guidance system
CN117688158A (en) Training method of reward model, answer evaluation method, device and equipment
CN116992005B (en) Intelligent dialogue method, system and equipment based on large model and local knowledge base
CN115905487A (en) Document question and answer method, system, electronic equipment and storage medium
CN116822633B (en) Model reasoning method and device based on self-cognition and electronic equipment
CN112256576A (en) Man-machine dialogue corpus testing method, device, equipment and storage medium
CN114579606B (en) Pre-training model data processing method, electronic device and computer storage medium
CN116451646A (en) Standard draft detection method, system, electronic equipment and storage medium
CN115358219A (en) Chinese spelling error correction method integrating unsupervised learning and self-supervised learning
CN112632265A (en) Intelligent machine reading understanding method and device, electronic equipment and storage medium
CN114298042A (en) Entity linking method, entity linking model training method and electronic equipment
CN113569112A (en) Tutoring strategy providing method, system, device and medium based on question
CN117972434B (en) Training method, training device, training equipment, training medium and training program product for text processing model
CN117540012B (en) Text generation method and system
CN117436825A (en) Intelligent interview method, device, equipment and computer storage medium
CN112861523B (en) Body-oriented multi-element credibility assessment method
CN117972044A (en) Visual question-answering method and platform based on knowledge enhancement
CN118228820A (en) Speech recognition result processing method, electronic device and computer readable storage medium
CN117807964A (en) Chat corpus labeling method and device, storage medium and computer equipment
CN117932020A (en) Training method of scoring model, scoring method and scoring device
Machhout et al. Enhanced BERT Approach to Score Arabic Essay’s Relevance to the Prompt
CN117349674A (en) Method, device, equipment and storage medium for fine tuning of pre-training model
CN114037571A (en) Test question expansion method and related device, electronic equipment and storage medium
CN117952185A (en) Financial field large model training method and system based on multidimensional data evaluation
CN118113831A (en) Question-answer data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination