CN117852616A - Large language model alignment fine-tuning method and system based on enhanced rejection sampling training - Google Patents

Large language model alignment fine-tuning method and system based on enhanced rejection sampling training

Info

Publication number
CN117852616A
CN117852616A (application number CN202410229872.9A)
Authority
CN
China
Prior art keywords
language model
response
text
fine
large language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410229872.9A
Other languages
Chinese (zh)
Other versions
CN117852616B (en)
Inventor
陈科海
江睿立
白雪峰
杨沐昀
赵铁军
张民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202410229872.9A priority Critical patent/CN117852616B/en
Publication of CN117852616A publication Critical patent/CN117852616A/en
Application granted granted Critical
Publication of CN117852616B publication Critical patent/CN117852616B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a large language model alignment fine-tuning method and system based on enhanced rejection sampling training, relating to the technical field of artificial intelligence. The method comprises the following steps: generating N response texts for a preset instruction request text based on a supervised fine-tuned large language model; evaluating each response text with a trained reward model to obtain a reward score; sorting the N response texts from high to low by reward score and selecting the first k response texts to form a target sample set; calculating, based on a preset weighting function, the data weight corresponding to each response text; and constructing a weighted fine-tuning dataset from the preset instruction request text, the response texts in the target sample set and the data weights, and performing alignment fine-tuning on the supervised fine-tuned large language model based on the weighted fine-tuning dataset to obtain the target large language model. The method and system address the technical problems of high overfitting risk and susceptibility to noisy reward scores in the prior art.

Description

Large language model alignment fine-tuning method and system based on enhanced rejection sampling training
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a large language model alignment fine-tuning method and system based on enhanced rejection sampling training.
Background
Large-scale pre-trained language models are the basis of the chat robots in wide use today. These large-scale generative deep neural network models learn the probability distribution of human natural language through self-supervised pre-training on large text corpora from many fields, predicting upcoming text from its known preceding context. Given a preceding context, sampling from the model's output word distribution yields fluent, natural text, and such models perform excellently on a wide range of natural language understanding and generation tasks. However, as the training corpus grows it inevitably contains harmful or biased content as well as factual errors, and these negative text segments may cause the pre-trained large model to generate text that does not meet expectations or that violates human values when responding to a user's instruction request. For the large model to generate text consistent with human expectations and values (e.g., helpfulness, honesty and harmlessness), additional alignment fine-tuning is required so that the large model correctly follows the various instruction requests of human users and only generates response text that meets expectations.
Currently, a typical technique for alignment fine-tuning of large models is reinforcement learning from human feedback (RLHF), which is generally considered one of the most effective alignment fine-tuning techniques; large models fine-tuned with it have been widely used in applications such as chat robots. RLHF generally comprises three stages: a supervised fine-tuning stage, in which the large model imitates high-quality instruction-following example texts using supervised learning; a reward modeling stage, in which a scalar-output reward model is trained on human feedback about how good the large model's responses to requests are; and a reinforcement learning stage, in which the large model is trained with a reinforcement learning algorithm whose goal is to maximize the reward score output by the reward model. In the reinforcement learning stage, a typical scheme can be summarized as iterating over steps such as response text generation, reward score evaluation and model parameter updating. Although this scheme achieves alignment fine-tuning of the large model well, it requires several neural network models, including the generative large model and the reward model, to be loaded simultaneously, and therefore suffers from a complex structure and high computational resource requirements. Some existing large-model alignment fine-tuning techniques therefore attempt to improve on these shortcomings by simplifying the algorithm structure and reducing the computational requirements. In the prior art, the rejection sampling fine-tuning technique keeps the first two stages of RLHF and, in the last stage, performs supervised fine-tuning of the large model on the best sample among the response texts that the large model itself generated multiple times for each instruction request, achieving self-improvement in meeting human expectations and values. The algorithmic flow of rejection sampling fine-tuning is simple and intuitive, the training process only occupies the computational resources required by supervised learning, and a considerable part of the prior art builds improvements on this technique.
However, the rejection sampling fine-tuning technique selects only the single response text with the highest reward score for each instruction request text in the instruction request dataset for subsequent fine-tuning of the large model. It therefore suffers from a high overfitting risk and is easily disturbed by noisy reward scores, so the improvement in human instruction compliance obtained by aligning a large model with this technique is limited and the fundamental purpose of alignment fine-tuning cannot be well achieved.
Disclosure of Invention
In order to solve the above technical problems in the prior art, embodiments of the invention provide a large language model alignment fine-tuning method and system based on enhanced rejection sampling training. The technical scheme is as follows:
In one aspect, a large language model alignment fine-tuning method based on enhanced rejection sampling training is provided, the method comprising: generating N response texts for a preset instruction request text based on a supervised fine-tuned large language model, N being a positive integer; evaluating each response text based on a trained reward model to obtain the reward score corresponding to each response text; sorting the N response texts from high to low according to the corresponding reward scores, and selecting the first k response texts to form a target sample set, where 1 < k ≤ N; calculating, based on a preset weighting function, the data weight corresponding to each response text in the target sample set, the preset weighting function being a function of the reward score; and constructing a weighted fine-tuning dataset based on the preset instruction request text, the response texts in the target sample set and the data weights, and performing alignment fine-tuning on the supervised fine-tuned large language model based on the weighted fine-tuning dataset to obtain a target large language model.
Optionally, before generating N response texts for the preset instruction request text based on the supervised fine-tuned large language model, the method further includes: performing supervised imitation learning on a preset large language model using high-quality instruction-following samples that conform to human expectations and values, to obtain the supervised fine-tuned large language model; and training a scalar-output reward model using human feedback data on the degree to which different instruction response texts generated by the supervised fine-tuned large language model conform to human expectations and values, to obtain the trained reward model.
Optionally, the preset weighting function includes:

f_w(r_{i,n}) = exp(r_{i,n} − r_{i,max})

where f_w denotes the preset weighting function, r_{i,n} denotes the reward score corresponding to the n-th response text generated for the i-th preset instruction request text, r_{i,max} denotes the maximum of the reward scores corresponding to the N response texts generated for the i-th preset instruction request text, and exp denotes the exponential function with the natural constant e as its base.
Optionally, the objective function for performing the alignment fine-tuning process on the supervised fine-tuned large language model based on the weighted fine-tuning dataset includes:

θ_ARS = argmax_θ Σ_{j=1}^{NI} w_j · log P_θ(Y_j | X_j)

where θ_ARS denotes the network parameters of the target large language model, θ denotes the network parameters of the large language model being optimized, initialized to those of the supervised fine-tuned large language model, NI denotes the total number of data items in the weighted fine-tuning dataset, w_j denotes the data weight of the j-th data item in the weighted fine-tuning dataset, and P_θ(Y_j | X_j) denotes the probability of predicting the corresponding response text Y_j given the preset instruction request text X_j.
In another aspect, a large language model alignment fine-tuning system based on enhanced rejection sampling training is provided, the system comprising: a generation module, an evaluation module, a selection module, a calculation module and a fine-tuning module; the generation module is configured to generate N response texts for a preset instruction request text based on a supervised fine-tuned large language model, N being a positive integer; the evaluation module is configured to evaluate each response text based on a trained reward model to obtain the reward score corresponding to each response text; the selection module is configured to sort the N response texts from high to low according to the corresponding reward scores and to select the first k response texts to form a target sample set, where 1 < k ≤ N; the calculation module is configured to calculate, based on a preset weighting function, the data weight corresponding to each response text in the target sample set, the preset weighting function being a function of the reward score; and the fine-tuning module is configured to construct a weighted fine-tuning dataset based on the preset instruction request text, the response texts in the target sample set and the data weights, and to perform alignment fine-tuning on the supervised fine-tuned large language model based on the weighted fine-tuning dataset to obtain a target large language model.
Optionally, the system further comprises: a learning module and a training module; the learning module is configured to perform supervised imitation learning on a preset large language model using high-quality instruction-following samples that conform to human expectations and values, to obtain the supervised fine-tuned large language model; and the training module is configured to train a scalar-output reward model using human feedback data on the degree to which different instruction response texts generated by the supervised fine-tuned large language model conform to human expectations and values, to obtain the trained reward model.
Optionally, the preset weighting function includes:

f_w(r_{i,n}) = exp(r_{i,n} − r_{i,max})

where f_w denotes the preset weighting function, r_{i,n} denotes the reward score corresponding to the n-th response text generated for the i-th preset instruction request text, r_{i,max} denotes the maximum of the reward scores corresponding to the N response texts generated for the i-th preset instruction request text, and exp denotes the exponential function with the natural constant e as its base.
Optionally, the objective function for performing the alignment fine-tuning process on the supervised fine-tuned large language model based on the weighted fine-tuning dataset includes:

θ_ARS = argmax_θ Σ_{j=1}^{NI} w_j · log P_θ(Y_j | X_j)

where θ_ARS denotes the network parameters of the target large language model, θ denotes the network parameters of the large language model being optimized, initialized to those of the supervised fine-tuned large language model, NI denotes the total number of data items in the weighted fine-tuning dataset, w_j denotes the data weight of the j-th data item in the weighted fine-tuning dataset, and P_θ(Y_j | X_j) denotes the probability of predicting the corresponding response text Y_j given the preset instruction request text X_j.
In another aspect, there is provided an electronic device comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to the first aspect described above when executing the computer program.
In another aspect, a computer readable storage medium is provided, in which a program code is stored, which program code is callable by a processor for performing the method according to the first aspect described above.
The technical scheme provided by the embodiments of the invention has at least the following beneficial effects: it retains the advantage of the existing rejection sampling fine-tuning technique of requiring little parallel computational resource, occupying only the parallel computational resource needed for supervised learning; and by adding training samples in the rejection sampling training stage it lowers the risk that the fine-tuned large model overfits when responding to human instruction requests outside the training data, while remaining effective in the realistic case where human preference annotation noise cannot be completely avoided, so that the text generated by the large model is better aligned with human expectations and values.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a large language model alignment fine-tuning method based on enhanced rejection sampling training provided by an embodiment of the present invention;
FIG. 2 is a schematic flow diagram of the enhanced rejection sampling training stage provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a large language model alignment fine-tuning system based on enhanced rejection sampling training provided by an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is described below with reference to the accompanying drawings.
In embodiments of the invention, words such as "exemplary" and "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs; rather, such terms are intended to present concepts in a concrete fashion. Furthermore, in embodiments of the present invention, "and/or" may mean both items, or either one of the two.
In order to make the technical problems to be solved, the technical solutions and the advantages clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
Example 1
FIG. 1 is a flow chart of a large language model alignment fine-tuning method based on enhanced rejection sampling training according to an embodiment of the present invention. As shown in fig. 1, the method specifically includes the following steps:
step S102, generating N pieces of response texts for a preset instruction request text based on the supervised and fine-tuned large language model; n is a positive integer.
Step S104, evaluating each response text based on the trained reward model to obtain the reward score corresponding to each response text.
Step S106, sorting the N response texts from high to low according to the corresponding reward scores, and selecting the first k response texts to form a target sample set; where 1 < k ≤ N.
Step S108, calculating the data weight corresponding to each response text in the target sample set based on a preset weighting function; the preset weighting function is a function of the reward score.
As an alternative embodiment of the present invention, the preset weighting function includes:

f_w(r_{i,n}) = exp(r_{i,n} − r_{i,max})

where f_w denotes the preset weighting function, r_{i,n} denotes the reward score corresponding to the n-th response text generated for the i-th preset instruction request text, r_{i,max} denotes the maximum of the reward scores corresponding to the N response texts generated for the i-th preset instruction request text, and exp denotes the exponential function with the natural constant e as its base.
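Purely for illustration, the following minimal Python sketch implements such a weighting function; the helper name compute_weights and the use of NumPy are assumptions made for this example and are not part of the disclosed method.

```python
import numpy as np

def compute_weights(reward_scores):
    """Exponential weighting: the highest-scoring response gets weight 1,
    lower-scoring responses get weights in (0, 1) that decay with the score."""
    scores = np.asarray(reward_scores, dtype=np.float64)
    return np.exp(scores - scores.max())

# Reward scores of the k selected responses for one instruction request text.
print(compute_weights([2.1, 1.4, 0.3]))  # approx. [1.0, 0.50, 0.17]
```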
Step S110, constructing a weighted fine-tuning dataset based on the preset instruction request text, the response texts in the target sample set and the data weights, and performing alignment fine-tuning on the supervised fine-tuned large language model based on the weighted fine-tuning dataset to obtain the target large language model.
As an alternative embodiment of the present invention, the objective function for performing the alignment fine-tuning process on the supervised fine-tuned large language model based on the weighted fine-tuning dataset includes:

θ_ARS = argmax_θ Σ_{j=1}^{NI} w_j · log P_θ(Y_j | X_j)

where θ_ARS denotes the network parameters of the target large language model, θ denotes the network parameters of the large language model being optimized, initialized to those of the supervised fine-tuned large language model, NI denotes the total number of data items in the weighted fine-tuning dataset, w_j denotes the data weight of the j-th data item in the weighted fine-tuning dataset, and P_θ(Y_j | X_j) denotes the probability of predicting the corresponding response text Y_j given the preset instruction request text X_j.
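As a non-authoritative sketch of how this weighted objective might be implemented, the PyTorch-style function below computes the negative of the weighted log-likelihood (so that minimizing it maximizes the objective). The tensor layout, the normalization by the sum of the weights and the helper name weighted_nll_loss are assumptions of this example.

```python
import torch
import torch.nn.functional as F

def weighted_nll_loss(logits, token_ids, response_mask, weights):
    """Weighted negative log-likelihood over response tokens.

    logits:        (batch, seq_len, vocab) causal LM outputs for X_j + Y_j
    token_ids:     (batch, seq_len) input token ids
    response_mask: (batch, seq_len) 1 on response-text (Y_j) positions, 0 elsewhere
    weights:       (batch,) data weight w_j of each sample
    """
    # Position t predicts token t + 1.
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    targets = token_ids[:, 1:]
    mask = response_mask[:, 1:].float()
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    seq_logp = (token_logp * mask).sum(dim=-1)      # log P_theta(Y_j | X_j)
    # Normalizing by the sum of weights is one possible choice.
    return -(weights * seq_logp).sum() / weights.sum()
```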
Specifically, in the embodiment of the present invention, a supervised fine-tuning stage and a reward modeling stage are further included before step S102, comprising the following steps:
step S100, a high-quality instruction conforming to human expectation and value is used for requesting compliance samples to perform supervised imitation learning on a preset large language model, and the supervised and fine-tuned large language model is obtained.
Step S101, training a scalar-output reward model using human feedback data on the degree to which different instruction response texts generated by the supervised fine-tuned large language model conform to human expectations and values, to obtain the trained reward model.
The implementation of the first two stages of the large language model alignment fine-tuning method based on enhanced rejection sampling training provided by the embodiment of the invention, i.e. the supervised fine-tuning stage (step S100) and the reward modeling stage (step S101), is basically the same as in the existing rejection sampling fine-tuning technique, where the large model used in step S101 to generate instruction request responses is the supervised fine-tuned large language model obtained in step S100. Specifically, the training dataset of the supervised fine-tuning stage may be expressed as

D_SFT = {(X_l, Y_l)}_{l=1}^{L}

where (X_l, Y_l) denotes the l-th sample of the training set, consisting of an instruction request text and a corresponding high-quality response text conforming to human expectations and values; this dataset is typically written or selected by human experts, and L denotes the total number of data items of the supervised fine-tuning training dataset. The training goal of this stage is to maximize the probability P_θ(Y_l | X_l) that the large model predicts the corresponding response text Y_l given the instruction request text X_l in the supervised fine-tuning dataset D_SFT, as shown in the following formula:

θ_SFT = argmax_θ Σ_{l=1}^{L} log P_θ(Y_l | X_l)

where θ denotes the network parameters of the large language model and θ_SFT denotes the network parameters of the supervised fine-tuned large language model, which are used for initialization in the subsequent stages.
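Under the same assumptions as the weighted-loss sketch above, the supervised fine-tuning objective can be viewed, purely for illustration, as the special case in which every sample carries weight 1:

```python
import torch

def sft_loss(logits, token_ids, response_mask):
    """Supervised fine-tuning loss: the weighted loss with every w_j = 1."""
    ones = torch.ones(logits.size(0), device=logits.device)
    return weighted_nll_loss(logits, token_ids, response_mask, ones)
```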
The training dataset of the reward modeling stage may generally be expressed as

D_RM = {(X_m, Y_m^+, Y_m^-)}_{m=1}^{M}

where Y_m^+ and Y_m^- denote, in the m-th sample of the training set, the instruction request response text that humans relatively prefer and consider to conform to human expectations and values, and the response text that humans prefer less and consider to conform less well, respectively. Both types of response text in this dataset are typically generated by the supervised fine-tuned large model, i.e. Y_m^+, Y_m^- ~ P_{θ_SFT}(Y | X_m), and M denotes the total number of data items of the reward modeling training dataset. The reward model trained in this stage is used to evaluate the degree to which the response text Y of a given instruction request text X conforms to human expectations and values, expressed as a scalar reward score r_ϕ(X, Y). The training goal is to maximize the probability that, for the instruction request text X_m in the reward modeling dataset D_RM, the reward score of the human-preferred response text Y_m^+ is greater than that of the non-preferred response text Y_m^-, which can be expressed as:

ϕ_RM = argmax_ϕ Σ_{m=1}^{M} log σ( r_ϕ(X_m, Y_m^+) − r_ϕ(X_m, Y_m^-) )

where ϕ denotes the network parameters of the reward model reflecting the degree of human preference, ϕ_RM denotes the reward model network parameters after reward modeling training, σ denotes the Sigmoid function, and r_ϕ(X, Y) denotes the scalar reward score that the reward model assigns to the response text Y of an instruction request text X.
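This pairwise reward-modeling objective can be sketched as the following PyTorch-style loss; the way the scalar reward scores are produced (e.g. by a reward head on a language model) is left abstract, and the function name reward_pair_loss is an assumption of this example.

```python
import torch.nn.functional as F

def reward_pair_loss(score_preferred, score_rejected):
    """Maximize sigma(r(X, Y+) - r(X, Y-)) by minimizing -log sigma(...).

    score_preferred / score_rejected: (batch,) scalar reward scores of the
    human-preferred and less-preferred response texts for the same instruction.
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```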
Fig. 2 is a flow chart of the enhanced rejection sampling training stage provided according to an embodiment of the present invention. As shown in fig. 2, in the final enhanced rejection sampling training stage, the process of generating N response texts for the required instruction request dataset D_Inst using the supervised fine-tuned large language model obtained in step S100 (step S102), and the process of evaluating the reward score of each instruction request's response texts using the trained reward model obtained in step S101 (step S104), are consistent with the rejection sampling training stage of the existing rejection sampling fine-tuning technique. In practical applications, the supervised fine-tuned large language model, the trained reward model and the instruction request dataset may use existing publicly available models and data, or may be trained and collected by the practitioner.
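The generation and scoring steps (S102 and S104) could, for example, be driven by a HuggingFace-style causal language model as sketched below; the wrapper name sample_and_score, the reward_model callable and all default values are assumptions of this illustration, not part of the patent.

```python
def sample_and_score(prompt, sft_model, tokenizer, reward_model,
                     n=8, max_new_tokens=256, temperature=1.0):
    """Generate N candidate responses for one instruction request and score each."""
    inputs = tokenizer(prompt, return_tensors="pt").to(sft_model.device)
    outputs = sft_model.generate(
        **inputs,
        do_sample=True,                # stochastic sampling for diverse responses
        num_return_sequences=n,        # N responses per instruction request text
        max_new_tokens=max_new_tokens,
        temperature=temperature,
    )
    prompt_len = inputs["input_ids"].shape[1]
    responses = [
        tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
        for seq in outputs
    ]
    # reward_model is assumed to be a callable returning a scalar reward score.
    scores = [reward_model(prompt, r) for r in responses]
    return list(zip(responses, scores))
```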
Considering that the reward scores should, even when noisy, reflect the relative quality of the instruction request response texts to a certain extent, and that a suboptimal response text should have a smaller effect on the fine-tuning of the large model than the optimal response text, step S106 sorts the N response texts generated in step S102 for each preset instruction request text from high to low according to their reward scores and selects the first k of them, and an appropriate weighting function f_w is used to calculate the data weights of these response texts, w_{i,n} = f_w(r_{i,n}), where r_{i,n} denotes the reward score corresponding to the selected response text Y_{i,n} of the instruction request text X_i, and r_{i,max} denotes the maximum of these reward scores.
In an embodiment of the invention, the weighting function should satisfy the following properties: the largest reward score is assigned a weight of 1, so that the corresponding response text acts on the fine-tuning of the large model exactly as in the conventional case; the weights of the remaining reward scores are smaller than 1 and larger than 0, decrease as the reward score decreases, and approach 0 as the reward score becomes smaller, indicating that the effect of the corresponding response text on the fine-tuning of the large model is attenuated as its reward score decreases. Preferably, one available weighting function that satisfies the above conditions can be expressed as:

f_w(r_{i,n}) = exp(r_{i,n} − r_{i,max})

For the response texts that are not selected, the corresponding data weight is 0, indicating that these response texts do not contribute to the fine-tuning of the large model. In summary, for all response texts Y_{i,n} of an instruction request text X_i, the corresponding data weight can be expressed as:

w_{i,n} = exp(r_{i,n} − r_{i,max}) if Y_{i,n} is among the first k response texts, and w_{i,n} = 0 otherwise.
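Putting the selection and weighting together, one possible construction of the weighted samples is sketched below (responses with weight 0 are simply omitted); the data layout and function name build_weighted_dataset are assumptions of the example.

```python
import numpy as np

def build_weighted_dataset(scored_responses_per_instruction, k):
    """Build (instruction, response, weight) triples for weighted fine-tuning.

    scored_responses_per_instruction: dict mapping each instruction request
    text X_i to its N (response_text, reward_score) pairs.
    Only the top-k responses per instruction are kept; the rest would receive
    weight 0 and are therefore dropped.
    """
    dataset = []
    for prompt, scored in scored_responses_per_instruction.items():
        top_k = sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
        r_max = top_k[0][1]
        for response, score in top_k:
            dataset.append((prompt, response, float(np.exp(score - r_max))))
    return dataset
```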
the data weight obtained in step S108A new weighted fine tuning data set may be formed in step S110 together with the corresponding preset command request text and response textAnd for supervised weight fine-tuning of large models, wherein the preset instructions request text X j Can take pass D Inst Sample X in (B) i ,Y j ,w j Then take pass and X i Corresponding all response text Y i,n And data weight w i,n . The training goal of the supervised weighted fine tuning training is to maximize the data weight w j Weighted big model given D ARS Middle instruction request text X j Is to predict the corresponding response text Y under the condition of (1) j Probability of->Specifically, the method can be expressed as:
wherein θ represents the network parameters of the large language model, and is initialized to the network parameters θ of the large language model after supervised fine tuning obtained in step S100 SFT ,θ ARS Representing the enhanced rejection samples obtained in step S110 fine-tuning the network parameters of the large model (i.e., the target large language model).
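A minimal training loop for the supervised weighted fine-tuning, reusing the weighted_nll_loss sketch from above and assuming a HuggingFace-style model whose parameters are initialized to θ_SFT, might look as follows; the batch field names are assumptions of the example.

```python
def weighted_finetune(model, dataloader, optimizer, num_epochs=1):
    """Supervised weighted fine-tuning of the large model on D_ARS."""
    model.train()
    for _ in range(num_epochs):
        for batch in dataloader:
            logits = model(input_ids=batch["input_ids"],
                           attention_mask=batch["attention_mask"]).logits
            loss = weighted_nll_loss(logits, batch["input_ids"],
                                     batch["response_mask"], batch["weights"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```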
In practical applications, since training samples with data weight w_j = 0 have no direct effect on the parameter updates during fine-tuning of the large model, they may be removed from the fine-tuning dataset D_ARS to reduce unnecessary computational overhead. Several hyperparameters also need to be set during training, including the number N of response texts generated per instruction request text, the number k of response texts selected per instruction request text, and parameters such as the initial learning rate of the parameter update algorithm used in training and the number of samples per batch; their specific values should be adjusted appropriately according to the training effect.
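For concreteness, these hyperparameters might be collected in a configuration such as the following; all values are illustrative placeholders, not values prescribed by the disclosure.

```python
# Illustrative values only; tune according to the observed training effect.
hyperparams = {
    "N": 8,                 # responses generated per instruction request text
    "k": 4,                 # responses selected per instruction (1 < k <= N)
    "learning_rate": 1e-5,  # initial learning rate of the parameter update algorithm
    "batch_size": 32,       # number of samples per batch
}
```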
From the above description, the embodiment of the present invention provides a large language model alignment fine-tuning method based on enhanced rejection sampling training which, building on the existing rejection sampling fine-tuning technique, improves the rejection sampling training stage into an enhanced rejection sampling training stage: for each instruction request text in the instruction request dataset D_Inst, more than one response text is selected for the subsequent fine-tuning training. This increases the number of samples in the supervised fine-tuning dataset and the diversity of the instruction request response texts, reduces the risk of overfitting, and strengthens the ability of the aligned large model to respond correctly to new human instruction requests. At the same time, selecting more samples means that the truly optimal sample is selected with higher probability, which reduces the negative influence of reward-score noise on the alignment fine-tuning process and keeps the fine-tuning training effective when the reward scores are noisy.
Example two
FIG. 3 is a schematic diagram of a large language model alignment fine-tuning system based on enhanced rejection sampling training according to an embodiment of the present invention. As shown in fig. 3, the system includes: a generation module 10, an evaluation module 20, a selection module 30, a calculation module 40 and a fine-tuning module 50.
Specifically, the generation module 10 is configured to generate N response texts for a preset instruction request text based on the supervised fine-tuned large language model; N is a positive integer.
The evaluation module 20 is configured to evaluate each response text based on the trained reward model to obtain the reward score corresponding to each response text.
The selection module 30 is configured to sort the N response texts from high to low according to the corresponding reward scores and to select the first k response texts to form a target sample set; where 1 < k ≤ N.
The calculation module 40 is configured to calculate the data weight corresponding to each response text in the target sample set based on a preset weighting function; the preset weighting function is a function of the reward score.
The fine-tuning module 50 is configured to construct a weighted fine-tuning dataset based on the preset instruction request text, the response texts in the target sample set and the data weights, and to perform alignment fine-tuning on the supervised fine-tuned large language model based on the weighted fine-tuning dataset to obtain the target large language model.
Specifically, as shown in fig. 3, the system further includes: a learning module 60 and a training module 70.
Specifically, the learning module 60 is configured to perform supervised imitation learning on a preset large language model using high-quality instruction-following samples that conform to human expectations and values, to obtain the supervised fine-tuned large language model;
The training module 70 is configured to train the scalar-output reward model using human feedback data on the degree to which different instruction response texts generated by the supervised fine-tuned large language model conform to human expectations and values, to obtain the trained reward model.
Preferably, the preset weighting function includes:

f_w(r_{i,n}) = exp(r_{i,n} − r_{i,max})

where f_w denotes the preset weighting function, r_{i,n} denotes the reward score corresponding to the n-th response text generated for the i-th preset instruction request text, r_{i,max} denotes the maximum of the reward scores corresponding to the N response texts generated for the i-th preset instruction request text, and exp denotes the exponential function with the natural constant e as its base.
Preferably, the objective function for performing the alignment fine-tuning process on the supervised fine-tuned large language model based on the weighted fine-tuning dataset includes:

θ_ARS = argmax_θ Σ_{j=1}^{NI} w_j · log P_θ(Y_j | X_j)

where θ_ARS denotes the network parameters of the target large language model, θ denotes the network parameters of the large language model being optimized, initialized to those of the supervised fine-tuned large language model, NI denotes the total number of data items in the weighted fine-tuning dataset, w_j denotes the data weight of the j-th data item in the weighted fine-tuning dataset, and P_θ(Y_j | X_j) denotes the probability of predicting the corresponding response text Y_j given the preset instruction request text X_j.
The embodiment of the invention also provides electronic equipment, which comprises: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as in embodiment one when executing the computer program.
The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores program code, and the program code can be called by a processor to execute the method in the first embodiment.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A large language model alignment fine-tuning method based on enhanced rejection sampling training, the method comprising:
generating N response texts for a preset instruction request text based on a supervised fine-tuned large language model; N is a positive integer;
evaluating each response text based on a trained reward model to obtain a reward score corresponding to each response text;
sorting the N response texts from high to low according to the corresponding reward scores, and selecting the first k response texts to form a target sample set; wherein 1 < k ≤ N;
calculating, based on a preset weighting function, the data weight corresponding to each response text in the target sample set; the preset weighting function is a function of the reward score;
and constructing a weighted fine-tuning dataset based on the preset instruction request text, the response texts in the target sample set and the data weights, and performing alignment fine-tuning on the supervised fine-tuned large language model based on the weighted fine-tuning dataset to obtain a target large language model.
2. The method of claim 1, wherein, before generating N response texts for the preset instruction request text based on the supervised fine-tuned large language model, the method further comprises:
performing supervised imitation learning on a preset large language model using high-quality instruction-following samples that conform to human expectations and values, to obtain the supervised fine-tuned large language model;
and training a scalar-output reward model using human feedback data on the degree to which different instruction response texts generated by the supervised fine-tuned large language model conform to human expectations and values, to obtain the trained reward model.
3. The method of claim 1, wherein the preset weighting function comprises:

f_w(r_{i,n}) = exp(r_{i,n} − r_{i,max})

wherein f_w denotes the preset weighting function, r_{i,n} denotes the reward score corresponding to the n-th response text generated for the i-th preset instruction request text, r_{i,max} denotes the maximum of the reward scores corresponding to the N response texts generated for the i-th preset instruction request text, and exp denotes the exponential function with the natural constant e as its base.
4. The method of claim 1, wherein the objective function for performing the alignment fine-tuning process on the supervised fine-tuned large language model based on the weighted fine-tuning dataset comprises:

θ_ARS = argmax_θ Σ_{j=1}^{NI} w_j · log P_θ(Y_j | X_j)

wherein θ_ARS denotes the network parameters of the target large language model, θ denotes the network parameters of the large language model being optimized, initialized to those of the supervised fine-tuned large language model, NI denotes the total number of data items in the weighted fine-tuning dataset, w_j denotes the data weight of the j-th data item in the weighted fine-tuning dataset, and P_θ(Y_j | X_j) denotes the probability of predicting the corresponding response text Y_j given the preset instruction request text X_j.
5. A large language model alignment fine-tuning system based on enhanced rejection sampling training, the system comprising: a generation module, an evaluation module, a selection module, a calculation module and a fine-tuning module; wherein,
the generation module is configured to generate N response texts for a preset instruction request text based on a supervised fine-tuned large language model; N is a positive integer;
the evaluation module is configured to evaluate each response text based on a trained reward model to obtain the reward score corresponding to each response text;
the selection module is configured to sort the N response texts from high to low according to the corresponding reward scores, and to select the first k response texts to form a target sample set; wherein 1 < k ≤ N;
the calculation module is configured to calculate, based on a preset weighting function, the data weight corresponding to each response text in the target sample set; the preset weighting function is a function of the reward score;
and the fine-tuning module is configured to construct a weighted fine-tuning dataset based on the preset instruction request text, the response texts in the target sample set and the data weights, and to perform alignment fine-tuning on the supervised fine-tuned large language model based on the weighted fine-tuning dataset to obtain a target large language model.
6. The system of claim 5, further comprising: a learning module and a training module; wherein,
the learning module is configured to perform supervised imitation learning on a preset large language model using high-quality instruction-following samples that conform to human expectations and values, to obtain the supervised fine-tuned large language model;
and the training module is configured to train the scalar-output reward model using human feedback data on the degree to which different instruction response texts generated by the supervised fine-tuned large language model conform to human expectations and values, to obtain the trained reward model.
7. The system of claim 5, wherein the preset weighting function comprises:

f_w(r_{i,n}) = exp(r_{i,n} − r_{i,max})

wherein f_w denotes the preset weighting function, r_{i,n} denotes the reward score corresponding to the n-th response text generated for the i-th preset instruction request text, r_{i,max} denotes the maximum of the reward scores corresponding to the N response texts generated for the i-th preset instruction request text, and exp denotes the exponential function with the natural constant e as its base.
8. The system of claim 5, wherein the objective function for performing the alignment fine-tuning process on the supervised fine-tuned large language model based on the weighted fine-tuning dataset comprises:

θ_ARS = argmax_θ Σ_{j=1}^{NI} w_j · log P_θ(Y_j | X_j)

wherein θ_ARS denotes the network parameters of the target large language model, θ denotes the network parameters of the large language model being optimized, initialized to those of the supervised fine-tuned large language model, NI denotes the total number of data items in the weighted fine-tuning dataset, w_j denotes the data weight of the j-th data item in the weighted fine-tuning dataset, and P_θ(Y_j | X_j) denotes the probability of predicting the corresponding response text Y_j given the preset instruction request text X_j.
9. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any of claims 1-4 when the computer program is executed.
10. A computer readable storage medium having stored therein program code which is callable by a processor to perform the method of any one of claims 1 to 4.
CN202410229872.9A 2024-02-29 2024-02-29 Large language model alignment fine-tuning method and system based on enhanced rejection sampling training Active CN117852616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410229872.9A CN117852616B (en) 2024-02-29 2024-02-29 Large language model alignment fine-tuning method and system based on enhanced rejection sampling training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410229872.9A CN117852616B (en) 2024-02-29 2024-02-29 Large language model alignment fine-tuning method and system based on enhanced rejection sampling training

Publications (2)

Publication Number Publication Date
CN117852616A true CN117852616A (en) 2024-04-09
CN117852616B CN117852616B (en) 2024-05-31

Family

ID=90536512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410229872.9A Active CN117852616B (en) 2024-02-29 2024-02-29 Large language model alignment fine-tuning method and system based on enhanced rejection sampling training

Country Status (1)

Country Link
CN (1) CN117852616B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117952185A (en) * 2024-03-15 2024-04-30 中国科学技术大学 Financial field large model training method and system based on multidimensional data evaluation
CN118035751A (en) * 2024-04-12 2024-05-14 清华大学 Data construction method and device for large language model fine tuning training

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122455A (en) * 2017-04-26 2017-09-01 中国人民解放军国防科学技术大学 A kind of network user's enhancing method for expressing based on microblogging
CN116127045A (en) * 2023-03-03 2023-05-16 北京百度网讯科技有限公司 Training method for generating large language model and man-machine voice interaction method based on model
CN116127020A (en) * 2023-03-03 2023-05-16 北京百度网讯科技有限公司 Method for training generated large language model and searching method based on model
CN116796765A (en) * 2023-07-13 2023-09-22 沈阳雅译网络技术有限公司 Assessment method for machine translation learning from large language model
CN117236410A (en) * 2023-11-13 2023-12-15 北京微点科学技术有限公司 Trusted electronic file large language model training and reasoning method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122455A (en) * 2017-04-26 2017-09-01 中国人民解放军国防科学技术大学 A kind of network user's enhancing method for expressing based on microblogging
CN116127045A (en) * 2023-03-03 2023-05-16 北京百度网讯科技有限公司 Training method for generating large language model and man-machine voice interaction method based on model
CN116127020A (en) * 2023-03-03 2023-05-16 北京百度网讯科技有限公司 Method for training generated large language model and searching method based on model
CN116796765A (en) * 2023-07-13 2023-09-22 沈阳雅译网络技术有限公司 Assessment method for machine translation learning from large language model
CN117236410A (en) * 2023-11-13 2023-12-15 北京微点科学技术有限公司 Trusted electronic file large language model training and reasoning method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117952185A (en) * 2024-03-15 2024-04-30 中国科学技术大学 Financial field large model training method and system based on multidimensional data evaluation
CN118035751A (en) * 2024-04-12 2024-05-14 清华大学 Data construction method and device for large language model fine tuning training

Also Published As

Publication number Publication date
CN117852616B (en) 2024-05-31

Similar Documents

Publication Publication Date Title
CN117852616B (en) Large language model alignment fine-tuning method and system based on enhanced rejection sampling training
CN107342078B (en) Conversation strategy optimized cold start system and method
CN113962315A (en) Model pre-training method, device, equipment, storage medium and program product
CN112529153B (en) BERT model fine tuning method and device based on convolutional neural network
CN117313789A (en) Black box optimization using neural networks
CN111178486B (en) Super-parameter asynchronous parallel search method based on population evolution
CN111582311B (en) Method for training intelligent agent by using dynamic reward example sample based on reinforcement learning
CN113784410B (en) Heterogeneous wireless network vertical switching method based on reinforcement learning TD3 algorithm
CN113011570A (en) Adaptive high-precision compression method and system of convolutional neural network model
CN110442721A (en) Neural network language model, training method, device and storage medium
CN113826125A (en) Training machine learning models using unsupervised data enhancement
Wehenkel et al. Diffusion priors in variational autoencoders
CN110991621A (en) Method for searching convolutional neural network based on channel number
Szwarcman et al. Quantum-inspired evolutionary algorithm applied to neural architecture search
CN110222816B (en) Deep learning model establishing method, image processing method and device
CN117390450A (en) Large language model training method, device and related equipment
CN113613332B (en) Spectrum resource allocation method and system based on cooperative distributed DQN (differential signal quality network) joint simulated annealing algorithm
CN111445024B (en) Medical image recognition training method
CN117093684A (en) Method and system for constructing pretrained conversational large language model in enterprise service field
CN116010832A (en) Federal clustering method, federal clustering device, central server, federal clustering system and electronic equipment
CN113159168B (en) Pre-training model accelerated reasoning method and system based on redundant word deletion
CN108388942A (en) Information intelligent processing method based on big data
JP2020030674A (en) Information processing apparatus, information processing method, and program
CN115066689A (en) Fine-grained stochastic neural architecture search
CN111210009A (en) Information entropy-based multi-model adaptive deep neural network filter grafting method, device and system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant