CN117808120A - Method and apparatus for reinforcement learning of large language models


Info

Publication number
CN117808120A
Authority
CN
China
Prior art keywords
model
large language
reinforcement learning
language model
actor
Legal status
Pending
Application number
CN202311866241.XA
Other languages
Chinese (zh)
Inventors
阎栋
李佳莲
Current Assignee
Beijing Baichuan Intelligent Technology Co ltd
Original Assignee
Beijing Baichuan Intelligent Technology Co ltd
Application filed by Beijing Baichuan Intelligent Technology Co ltd

Landscapes

  • Machine Translation (AREA)

Abstract

The present disclosure provides a method, apparatus, device, and computer-readable storage medium for reinforcement learning of large language models. The method employs a reward model, a reviewer model, an actor model, and an initialized large language model to cooperatively perform reinforcement learning training on the large language model, wherein the actor model is used to generate the next action of the large language model, and the reviewer model is used to evaluate the quality of that action so as to update the behavior policy of the large language model. The KL divergence between the initial policy distribution and the policy distribution of the actor model is added to the objective function as a regularization term, which prevents the policy of the large language model from deviating too far from the original policy and thereby improves the stability of reinforcement learning training. The method can optimize reinforcement learning training in a targeted manner for the various problems that arise when training large language models with reinforcement learning, so as to improve the performance of the large language model, and can achieve stable training across model scales by supplementing the training with various stability strategies.

Description

Method and apparatus for reinforcement learning of large language models
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to a method, apparatus, device, and storage medium for reinforcement learning of large language models.
Background
In the field of artificial intelligence today, large language models are increasingly widely used, and reinforcement learning is an important technical means for them: by learning an optimal behavior policy through interaction with the environment, it is of great significance to the application of large language models. Reinforcement learning can be used to train a model to generate smoother and more reasonable language, improve its dialogue generation and text generation capabilities, and help the model continuously improve its own language generation ability so that it better meets user needs and the overall performance of the model is improved. Through interactive learning with the environment, a large language model can automatically learn and improve its own expressive power, thereby generating more natural and fluent text. This is of great significance for applications such as dialogue systems, intelligent customer service, and intelligent authoring.
Therefore, there is a need for an effective method to optimize reinforcement learning of large language models.
Disclosure of Invention
In order to solve the above problems, the method of the present disclosure performs reinforcement learning on a large language model using a proximal policy optimization algorithm, employs four models that jointly participate in the reinforcement learning process, and performs targeted optimization of the reinforcement learning of the large language model for problems that arise during that process, thereby improving the performance of the large language model.
Embodiments of the present disclosure provide a method, apparatus, device, and computer-readable storage medium for reinforcement learning of large language models.
Embodiments of the present disclosure provide a method for reinforcement learning of a large language model, the large language model being initialized as a supervised fine-tuned large language model, the method comprising: acquiring a hint instruction; generating a hint response to the hint instruction by using the large language model; and performing multiple iterations of reinforcement learning for the large language model on the hint instruction and the hint response to complete reinforcement learning of the large language model, wherein in each iteration: the initialized large language model, and the latest reward model, reviewer model, and actor model for reinforcement learning of the large language model, are invoked in parallel to obtain, respectively, an initial policy distribution of the large language model, a reward for the hint instruction and the hint response, a value corresponding to each action of the large language model, and a policy distribution of the actor model, wherein a policy distribution indicates the probability that the large language model performs each action, and each action corresponds to the large language model generating a token (word element); an objective function is determined based on the reward, the value, the initial policy distribution, and the policy distribution of the actor model; and the actor model and the reviewer model are parameter-updated in parallel by optimizing the determined objective function, so as to parameter-update the large language model, wherein the objective function measures the performance of the current policy of the actor model, the KL divergence between the initial policy distribution and the policy distribution of the actor model is added to the objective function as a regularization term, and the coefficient of the KL divergence is dynamically adjusted based on the value of the KL divergence.
Embodiments of the present disclosure provide an apparatus for reinforcement learning of a large language model, comprising: a data acquisition module configured to acquire a hint instruction; a response generation module configured to generate a hint response to the hint instruction using the large language model; and a reinforcement learning module configured to perform multiple iterations of reinforcement learning for the large language model on the hint instruction and the hint response to complete reinforcement learning of the large language model, wherein in each iteration: the initialized large language model, and the latest reward model, reviewer model, and actor model for reinforcement learning of the large language model, are invoked in parallel to obtain, respectively, an initial policy distribution of the large language model, a reward for the hint instruction and the hint response, a value corresponding to each action of the large language model, and a policy distribution of the actor model, wherein a policy distribution indicates the probability that the large language model performs each action, and each action corresponds to the large language model generating a token (word element); an objective function is determined based on the reward, the value, the initial policy distribution, and the policy distribution of the actor model; and the actor model and the reviewer model are parameter-updated in parallel by optimizing the determined objective function, so as to parameter-update the large language model, wherein the objective function measures the performance of the current policy of the actor model, the KL divergence between the initial policy distribution and the policy distribution of the actor model is added to the objective function as a regularization term, and the coefficient of the KL divergence is dynamically adjusted based on the value of the KL divergence.
Embodiments of the present disclosure provide an apparatus for reinforcement learning of a large language model, comprising: one or more processors; and one or more memories, wherein the one or more memories have stored therein a computer executable program that, when executed by the processor, performs the method for reinforcement learning of a large language model as described above.
Embodiments of the present disclosure provide a computer readable storage medium having stored thereon computer executable instructions which, when executed by a processor, are for implementing a method for reinforcement learning of large language models as described above.
Embodiments of the present disclosure provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from a computer-readable storage medium, the processor executing the computer instructions, causing the computer device to perform a method for reinforcement learning of a large language model according to an embodiment of the present disclosure.
The method provided by the embodiments of the present disclosure employs a reward model, a reviewer model, an actor model, and an initialized large language model to cooperatively perform reinforcement learning training on the large language model, wherein the actor model is used to generate the next action of the large language model, and the reviewer model is used to evaluate the quality of that action so as to update the behavior policy of the large language model. The KL divergence between the initial policy distribution and the policy distribution of the actor model is added to the objective function as a regularization term, which prevents the policy of the large language model from deviating too far from the original policy and thereby improves the stability of reinforcement learning training. The method of the embodiments of the present disclosure can optimize reinforcement learning training in a targeted manner for the various problems that arise in reinforcement learning training of large language models, so as to improve the performance of the large language model, and can achieve stable training across model scales by supplementing the reinforcement learning training with various training stability strategies.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are used in the description of the embodiments will be briefly described below. It should be apparent that the drawings in the following description are only some exemplary embodiments of the present disclosure, and that other drawings may be obtained from these drawings by those of ordinary skill in the art without undue effort.
FIG. 1 is a flow chart illustrating a method for reinforcement learning of a large language model according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating training of a large language model using an RLHF method according to an embodiment of the present disclosure;
FIG. 3 is a flow chart illustrating each iteration of reinforcement learning for a large language model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating each round of iterations of reinforcement learning for a large language model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram showing a relationship between supported model parameters and the number of GPUs under a training architecture according to an embodiment of the disclosure;
FIG. 6 is a graph illustrating a relationship between rewards and iteration round in the event of unstable reinforcement learning training in accordance with an embodiment of the present disclosure;
FIG. 7 is a graph illustrating the relationship between entropy of a strategy for individual tokens and iteration round in the event reinforcement learning training is unstable, according to an embodiment of the present disclosure;
FIG. 8 is a graph illustrating various performances of an optimized reinforcement learning model in the event reinforcement learning training is unstable, according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram illustrating an apparatus for reinforcement learning of a large language model according to an embodiment of the present disclosure;
FIG. 10 illustrates a schematic diagram of an apparatus for reinforcement learning of a large language model, according to an embodiment of the present disclosure; and
FIG. 11 illustrates a schematic diagram of an architecture of an exemplary computing device, according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
In the present specification and drawings, steps and elements that are substantially the same or similar are denoted by the same or similar reference numerals, and repeated descriptions of these steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance or order.
In embodiments of the present disclosure, the term "module" or "unit" refers to a computer program or a portion of a computer program having a predetermined function and working with other related portions to achieve a predetermined objective, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
For purposes of describing the present disclosure, the following presents concepts related to the present disclosure.
The method of reinforcement learning for large language models of the present disclosure may be based on artificial intelligence (Artificial Intelligence, AI). Artificial intelligence is the theory, method, technique, and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. By studying the design principles and implementation methods of various intelligent machines, artificial intelligence enables the method for reinforcement learning of a large language model of the present disclosure to learn, in a manner similar to the way a human learns from feedback, an optimal behavior policy through interaction with the environment, thereby guiding the improvement of the text generated by the large language model.
The method of reinforcement learning for large language models of the present disclosure may be based on reinforcement learning (Reinforcement Learning, RL). Reinforcement learning is a branch of machine learning that aims to let an agent learn how to make decisions through interaction with the environment so that it can obtain the maximum cumulative reward in the future. In reinforcement learning, the agent observes the feedback of the environment by trying different behaviors and gradually learns an optimal behavior policy. The core concepts of reinforcement learning may include: (1) Agent: an agent is an entity that performs the learning task; it affects the environment by observing the state of the environment and selecting behaviors, and in a large language model the model itself can be regarded as the agent; (2) Environment: the environment is the outside world in which the agent is located; it gives feedback on the agent's behavior and determines the agent's next state, and in a large language model the process of generating text can be seen as the agent's interaction with the environment; (3) Reward: the agent obtains rewards or penalties based on the feedback of the environment; a reward is the numerical feedback obtained by the agent after performing a certain action in a certain state, and the goal is to enable the agent to learn an optimal action policy by maximizing the cumulative reward; (4) Action: different environments allow different kinds of actions; in a given environment the set of valid actions is often referred to as the action space, which may be a discrete action space (discrete action spaces) or a continuous action space (continuous action spaces); (5) Policy: a policy defines the rules by which the agent chooses behaviors under certain conditions. The goal of reinforcement learning is to learn an optimal policy so that the behaviors the agent selects in different states maximize the cumulative reward. In large language models, reinforcement learning may be used to train the model to generate text that better matches expectations. By defining an appropriate reward function, the model can adjust its parameters based on the reward signal, thereby generating higher-quality, more diverse, and more relevant text. This approach can help the model adapt to a particular task or scenario and improve the quality and reliability of the generated text.
Further, the method of reinforcement learning for large language models of the present disclosure may be based on reinforcement learning from human feedback (Reinforcement Learning from Human Feedback, RLHF). RLHF refers to reinforcement learning of a model using feedback information from human experts to improve the performance of the model. During the alignment process, the quality of the model generated text may be improved by collecting human expert evaluation of the model generated text and then using the evaluation information to guide training of the model. For example, in a conversation generation task, a user's assessment of conversation content generated by the model may be collected and then used to guide training of the model so that the model generates conversation content that more closely meets the user's expectations.
The method for reinforcement learning of large language models of the present disclosure may be based on a proximal policy optimization (Proximal Policy Optimization, PPO) algorithm. The core idea of the PPO algorithm is to improve the stability of training by limiting the magnitude of policy updates so as to ensure that each update is not too large. The PPO algorithm uses two important concepts: clipping and a surrogate objective. Clipping refers to limiting the ratio between the new policy and the old policy to prevent an excessive update magnitude; the surrogate objective is an alternative optimization objective used to measure the policy improvement at update time.
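As an illustration of the clipping and surrogate-objective ideas described above, the following is a minimal sketch of the PPO clipped surrogate loss in PyTorch; the function and variable names are illustrative and do not come from the disclosed implementation.

```python
import torch

def clipped_surrogate_loss(logp_new: torch.Tensor,
                           logp_old: torch.Tensor,
                           advantages: torch.Tensor,
                           clip_ratio: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective: bounds how far the new policy may move
    away from the old policy in a single update."""
    ratio = torch.exp(logp_new - logp_old)                  # probability ratio per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    # PPO maximizes the minimum of the two terms; negate to use as a loss.
    return -torch.min(unclipped, clipped).mean()
```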
In view of the foregoing, embodiments of the present disclosure relate to techniques of artificial intelligence, reinforcement learning, PPO algorithms, and the like, and are further described below with reference to the accompanying drawings.
Large language model reinforcement learning training based on the PPO algorithm is a complex and challenging task. During this process, problems may occur, including unstable training, a biased reward function, loopholes in the reward function, and so on. For example, reinforcement learning training may become unstable when the reward does not rise during training or suddenly drops significantly. As another example, during reinforcement learning training the reward function may be biased, i.e., the reward grows steadily but the final effect of the large language model is poor. As another example, a reward function with loopholes may cause the large language model to learn incorrect policies and fail to effectively achieve the goals of the task.
The present disclosure provides a reinforcement learning method for a large language model, which performs reinforcement learning on the large language model using a proximal policy optimization algorithm, employs four models that jointly participate in the reinforcement learning process, and performs targeted optimization of the reinforcement learning of the large language model for problems that occur during that process, thereby improving the performance of the large language model.
Specifically, the method provided by the embodiments of the present disclosure employs a reward model, a reviewer model, an actor model, and an initialized large language model to cooperatively perform reinforcement learning training on the large language model, wherein the actor model is used to generate the next action of the large language model, and the reviewer model is used to evaluate the quality of that action so as to update the behavior policy of the large language model. The KL divergence between the initial policy distribution and the policy distribution of the actor model is added to the objective function as a regularization term, which prevents the policy of the large language model from deviating too far from the original policy and thereby improves the stability of reinforcement learning training. The method of the embodiments of the present disclosure can optimize reinforcement learning training in a targeted manner for the various problems that arise in reinforcement learning training of large language models, so as to improve the performance of the large language model, and can achieve stable training across model scales by supplementing the reinforcement learning training with various training stability strategies.
FIG. 1 is a flow chart illustrating a method 100 for reinforcement learning of a large language model according to an embodiment of the present disclosure.
In step S101, a hint instruction may be acquired. Alternatively, the hint instructions may be sampled from a hint instruction dataset, which may be a collection of pre-collected hint instructions of sufficient diversity to ensure that the reinforcement-learning trained large language model can well cover a variety of application scenarios. After the collection of the prompt instruction set is completed, the prompt instructions can be randomly selected from the prompt instruction set for reinforcement learning training of the large language model.
In step S102, a hint response may be generated for the hint instruction using a large language model. According to embodiments of the present disclosure, the large language model may be initialized to a supervised fine-tuned large language model.
Alternatively, in the method for reinforcement learning of a large language model of the present disclosure, in order to secure the stability of training, a model having a certain basic ability is required as an initial model of reinforcement learning, for example, the initial state of the large language model may be a supervised fine-tuned large language model. The large language model with supervised fine tuning can have better performance on various natural language processing tasks, and the fine tuning can be achieved by supervised learning the model on a data set of a specific task. By supervised tuning, a large language model can achieve better performance on a particular task because the model is already pre-trained on large-scale data, with rich language representation capabilities. In the fine tuning process, the model can be adjusted according to a task-specific objective function so as to adapt to the requirements of a specific task.
Next, in the method for reinforcement learning of a large language model of the present disclosure, reinforcement learning may be performed on the large language model by reinforcement learning from human feedback (RLHF) method to further optimize the model training process. Training a large language model using RLHF methods may include training of a reward model (RW model) and training of a reinforcement learning model (RL model) based on the trained reward model, among others.
Fig. 2 is a schematic diagram illustrating training of a large language model using an RLHF method according to an embodiment of the disclosure.
As shown in fig. 2, first, for hinting instructions obtained from a data pool, a current large language model can be used to generate a hinting response text and the generated hinting response can be presented to a human expert for evaluation to collect their feedback. For example, the feedback may be binary (e.g., good/bad), multidimensional (e.g., text scored in multiple dimensions), or a good or bad ordering of individual prompt responses. Thus, these data, labeled based on feedback from human experts, may be used to train the reward model such that the knowledge learned by the trained reward model and its scoring is consistent with human value and interests. In the RLHF process, the human expert plays an important role. They help the large language model learn behavior strategies conforming to human preferences by providing feedback, thereby effectively improving the quality and relevance of the large language model generation.
After training of the reward model for the large language model is completed, a training phase of the reinforcement learning model may be entered. During the training phase of the RL model, the initial model of the RL model may employ a large language model with supervised fine tuning. Thus, the RL model may generate a hint response/action based on hint instructions/states obtained from the data pool, which may then be input into the trained RW model for scoring to evaluate the quality of the generated hint response in the form of a return reward. Thus, the training goal of the RL model may be to have the generated hint response scored as high as possible on the RW model. Such a trained RL model may be saved for subsequent iterative optimization until a predetermined end condition is met, such as the large language model reaching a desired performance or reaching a predetermined number of training rounds, etc.
Thus, based on the acquired hint instructions, a corresponding hint response may be generated from the initial large language model for subsequent reinforcement learning iterative optimization.
In step S103, a plurality of iterations of reinforcement learning for the large language model may be performed for the hint instruction and the hint response to complete reinforcement learning for the large language model.
Alternatively, in embodiments of the present disclosure, the RL model may be trained using the PPO algorithm, i.e., by applying stochastic gradient descent to the set objective function. Proximal policy optimization is a deep reinforcement learning algorithm used to train an agent (i.e., the RL model) to learn and perform tasks in complex environments. Training enables the agent to maximize its cumulative return in interactions with the environment, thereby achieving the specified task goals.
The RL model can learn the optimal policy by interacting with the environment: at each time step, the RL model perceives the state of the environment and selects an action based on its current policy. The RL model then performs the action, receives a reward signal, updates its value function based on that reward signal, and selects the optimal action policy based on the value function; by learning the value function, the RL model can evaluate the long-term value of each state and action. As the number of interactions between the RL model and the environment increases (i.e., the number of iterations increases), the value function and the policy are continuously optimized, enabling the RL model to learn an optimal behavior policy.
The reinforcement learning training process of the present disclosure is described below with reference to fig. 3. Wherein fig. 3 is a flowchart illustrating each iteration of a round of reinforcement learning for a large language model according to an embodiment of the present disclosure. FIG. 4 is a schematic diagram illustrating each iteration of reinforcement learning for a large language model, according to an embodiment of the present disclosure.
As shown in fig. 3, in step S1031, an initialized large language model, and the latest reward model, reviewer model, and actor model for reinforcement learning of the large language model, may be invoked in parallel to obtain, respectively, an initial policy distribution of the large language model, a reward for the hint instruction and the hint response, a value corresponding to each action of the large language model, and a policy distribution of the actor model, wherein a policy distribution may indicate the probability that the large language model performs each action, and each action corresponds to the large language model generating a token (word element).
Optionally, the tuning task of the initialized large language model is modeled as a Reinforcement Learning (RL) problem, and basic elements such as policy (policy), action space (action space), and reward function (reward function) need to be defined. Wherein, the strategy can be based on the large language model, receiving a prompt instruction as input to output a prompt response text (or probability distribution of the text); the action space may be an arrangement combination of all the tokens (token) in the vocabulary at all the output positions; the observation space may be a possible sequence of input tokens (i.e., hint instructions), which is a permutation and combination of all tokens in the vocabulary at all input locations; and the reward function can be an RM model trained in the last stage to coordinate with some policy-level constraints for reward calculation.
Alternatively, in large language model reinforcement learning training using the PPO algorithm, four models may be used simultaneously to jointly train the large language model so as to achieve stable policy optimization and training. In particular, the four models may include a reward model, a reviewer model, an actor model, and an initialized large language model (i.e., the supervised fine-tuned large language model), as shown in FIG. 4. Designing the architecture of the four models as independent components improves the flexibility of model training: multi-party calls can be designed in which each of the four models acts as a component, and each component can run as an independent server, so that each node can be deployed in a different manner and resources can be allocated reasonably. Thus, when the models are used, it is possible to switch flexibly between the training and inference processes and to realize convenient and flexible invocation.
Alternatively, the initialized large language model may output its initial policy profile based on the hint instruction, as shown in FIG. 4. The initial policy distribution may represent that the large language model outputs a probability distribution for each token, i.e., generates a probability for each token, based on the hinting instructions. It should be noted that the initial policy distribution of the large language model output may not be optimal. The RL model needs to continually update its policy distribution through interactions with the environment to find the optimal policy.
Obtaining the initial policy distribution from the initialized large language model can help the RL model converge to the optimal policy more quickly and can prevent the RL model from getting stuck in a local optimum. In addition, the RL model can also be used for transfer learning on different tasks.
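For illustration only, the following sketch shows one way the initial (reference) policy distribution can be obtained as per-token log-probabilities of a response under the frozen supervised fine-tuned model, using the Hugging Face transformers library; the checkpoint name "my-sft-model" is a placeholder, and this is an assumed implementation rather than the one disclosed here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "my-sft-model" is a placeholder checkpoint name for the supervised fine-tuned model.
tokenizer = AutoTokenizer.from_pretrained("my-sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("my-sft-model")
ref_model.eval()

@torch.no_grad()
def initial_policy_logprobs(prompt: str, response: str) -> torch.Tensor:
    """Per-token log-probabilities of the response under the frozen SFT model,
    i.e. the initial policy distribution used as the KL reference."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = ref_model(full_ids).logits                      # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    token_logps = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the response positions (approximate: assumes the prompt tokenizes
    # identically when concatenated with the response).
    return token_logps[:, prompt_ids.shape[1] - 1:]
```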
Alternatively, in the current iteration, the trained, up-to-date reward model may be used to obtain rewards for the current hint instruction and hint response, as shown in FIG. 4.
Optionally, the value of each state is calculated using the reviewer model. The goal of the reviewer model is to learn a value function with which to evaluate the quality of each state. This value function may be learned and estimated by various methods, such as value iteration, policy iteration, or a deep learning model.
Alternatively, the actor model may be used to generate the policy distribution, i.e., the probability distribution over taking different actions in a given state. The actor model may learn the optimal policy in cooperation with the reviewer model. For example, the input of the actor model may be the current state and the output may be the action probability distribution (i.e., the policy distribution). The actor model may generate a policy distribution as follows: the current state is input into the actor model to compute the probability of each action, and based on these probabilities one action is randomly selected.
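As a simple, hedged illustration of selecting an action from the actor model's output (not taken from the disclosure itself), the next token can be sampled from a categorical distribution over the vocabulary:

```python
import torch
from torch.distributions import Categorical

def sample_action(actor_logits: torch.Tensor):
    """Sample the next token (action) from the actor's policy distribution for the
    current state; `actor_logits` has shape [vocab_size]."""
    dist = Categorical(logits=actor_logits)
    action = dist.sample()                    # randomly selected token id
    return action, dist.log_prob(action)      # log-prob is reused in the PPO ratio
```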
In step S1032, an objective function may be determined based on the reward, the value, the initial policy distribution, and the policy distribution of the actor model.
According to an embodiment of the present disclosure, determining an objective function based on the reward, the value, the initial policy distribution, and the policy distribution of the actor model may include: determining an advantage value corresponding to each token and a target value of the reviewer model from the reward, the value, the initial policy distribution, and the policy distribution of the actor model, wherein the advantage value may indicate how good the action of generating that token is, in the current state, relative to the actions of generating other tokens, and the target value of the reviewer model may be determined from the reward in the current state and the value of the next state; and determining the objective function based on the initial policy distribution and the policy distribution of the actor model, the determined advantage value corresponding to each token, and the target value of the reviewer model.
Alternatively, from the reward (an evaluation of the model taking a particular action in a particular state), the value (the expected return of a state), the initial policy distribution (the action distribution of the model at the start), and the policy distribution of the actor model (the probability distribution with which the model generates actions), the advantage value corresponding to each token may be determined, which represents how good the action of generating that token is, relative to the actions of generating other tokens, in the current state.
Alternatively, the target value of the reviewer model may be determined from the reward in the current state and the value of the next state. The target value is an index the reviewer model uses to evaluate the current state in order to guide the learning process of the actor model. In reinforcement learning, the target value is generally used to guide the learning process of the actor model, and its calculation typically involves factors such as the reward, a discount factor, and the value of the next state. As an example, the value of the next state may be computed from the value function learned by the reviewer model and represents the expected return the model can obtain in that next state; then, given the uncertainty of future rewards, a discount factor (commonly denoted γ, with 0 ≤ γ ≤ 1) may be introduced to weigh the importance of future rewards. On this basis, the target value can be calculated as: target value = immediate reward + γ × value of the next state.
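Under the assumption of a simple one-step target of the form just described, the reviewer (critic) targets for a generated sequence could be computed as in the following sketch (illustrative only; the value after the final token is assumed to be zero):

```python
import torch

def critic_targets(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 0.99) -> torch.Tensor:
    """One-step targets for the reviewer model: target_t = r_t + gamma * V(s_{t+1}).
    `rewards` and `values` are per-token tensors of equal length."""
    next_values = torch.cat([values[1:], values.new_zeros(1)])  # V after the last token = 0
    return rewards + gamma * next_values
```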
As described above, the goal of the reviewer model may be to learn a value function with which to evaluate the quality of each state, guiding the model's learning process.
Specifically, the value of each action may be calculated from the value of the state and the policy distribution of the actor model. This may be accomplished, for example, by multiplying the value of the state with the policy distribution of the actor model to obtain an "expected value" for each action. After obtaining the expected value of each action, an advantage value for each action may be calculated, which represents how good selecting a certain action is relative to selecting other actions in the current state. As an example, one common way to compute the advantage value is to subtract the value of the state from the expected value of each action, i.e., advantage value = expected value of the action − value of the state. Finally, the computed advantage value can be adjusted according to the actually observed reward, so that it better reflects how advantageous each action is under real conditions.
Through the above operations, the advantage value corresponding to each token (action) can be determined using the reward, the value, the initial policy distribution, and the policy distribution of the actor model, as shown in fig. 4, to help the model evaluate the superiority of each action and thereby guide the learning process of the model.
Further, based on the initial policy distribution and the policy distribution of the actor model, as well as the determined advantage value for each token and the target value of the reviewer model, an objective function may be determined to guide the learning process of the RL model, helping the RL model continually adjust its policy to maximize the expected return.
Optionally, to achieve stable training of RL models, the method of reinforcement learning for large language models of the present disclosure may also introduce some training stability strategies in model training.
Alternatively, considering that the entire training process of the PPO algorithm revolves around optimizing the scores, the variables related to scoring include the rewards and the advantage values. Thus, for both variables a "normalization" and "clipping" process may be performed.
According to an embodiment of the present disclosure, determining the advantage value corresponding to each token and the target value of the reviewer model from the reward, the value, the initial policy distribution, and the policy distribution of the actor model may include normalizing the reward.
Optionally, the rewards generated by the RM model during training may be normalized to address the problem that existing large language models do not constrain the absolute value of the reward.
According to an embodiment of the present disclosure, determining the objective function based on the initial policy distribution and the policy distribution of the actor model, the determined advantage value corresponding to each token, and the target value of the reviewer model may include normalizing the advantage values. Alternatively, the advantage values may be normalized at mini-batch (batch size) granularity.
According to an embodiment of the present disclosure, determining the advantage value corresponding to each token and the target value of the reviewer model from the reward, the value, the initial policy distribution, and the policy distribution of the actor model may include removing the hint instruction and hint response corresponding to a reward if that reward is greater than a predetermined threshold. Optionally, excessively large reward values may be clipped in this way to prevent instabilities during the training process.
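The following sketch illustrates, under assumed threshold values, what such normalization and clipping of the scoring variables could look like in PyTorch; it is not the disclosed implementation.

```python
import torch

def stabilize_scores(rewards: torch.Tensor, advantages: torch.Tensor,
                     reward_threshold: float = 5.0, eps: float = 1e-8):
    """Normalize rewards, mask out samples whose reward exceeds a threshold,
    and whiten advantages at mini-batch granularity."""
    norm_rewards = (rewards - rewards.mean()) / (rewards.std() + eps)
    keep_mask = rewards.abs() <= reward_threshold            # drop abnormal samples
    norm_advantages = (advantages - advantages.mean()) / (advantages.std() + eps)
    return norm_rewards, norm_advantages, keep_mask
```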
According to an embodiment of the present disclosure, determining the advantage value corresponding to each token and the target value of the reviewer model from the reward, the value, the initial policy distribution, and the policy distribution of the actor model includes: using a generalized advantage estimation (GAE) method, calculating the advantage value of each token by using the rewards over a number of iterations.
Alternatively, in reinforcement learning, the advantage value for each state (or token) may be calculated using the reward information obtained over multiple training iterations. By accumulating reward information across multiple iterations, the contribution of each state (or token) to the overall reward can be better reflected, so the advantage value can be computed more accurately. Moreover, the environment may change over multiple iterations, and computing the advantage value from rewards collected over multiple iterations adapts better to such changes, improving the robustness and generalization ability of the model. In addition, computing the advantage value from rewards over multiple iterations can help the RL model converge to the optimal policy more quickly, improving training efficiency.
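As a hedged illustration of generalized advantage estimation over one generated response (a standard formulation, not necessarily the exact one used in this disclosure), each token's advantage can accumulate discounted TD errors from later steps:

```python
import torch

def gae_advantages(rewards: torch.Tensor, values: torch.Tensor,
                   gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """Generalized advantage estimation: A_t = sum_k (gamma*lam)^k * delta_{t+k},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    next_value = 0.0            # value after the final token is assumed to be 0
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```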
Next, in step S1033, the actor model and the reviewer model may be parameter-updated in parallel by optimizing the determined objective function, which is used to measure the performance of the current policy of the actor model, wherein the KL divergence between the initial policy distribution and the policy distribution of the actor model is added to the objective function as a regularization term, and the coefficient of the KL divergence is dynamically adjusted based on the value of the KL divergence.
Alternatively, a policy-gradient-based approach may be used to maximize the expected return. By maximizing the objective function, the actor model can learn a better policy distribution so as to maximize the expected return. At the same time, the introduction of the advantage value helps the model better evaluate the quality of each action, guiding the learning process more effectively. Alternatively, the value predicted by the reviewer model may be compared with the target value to calculate an error, which is used to update the parameters of the reviewer model, as shown in FIG. 4.
Therefore, under the parallel update of the actor model and the reviewer model, the update efficiency of the RL model can be improved.
It should be noted that the actual selected objective function may vary depending on the particular task and algorithm. In reinforcement learning, there are many different algorithms and methods that can be used to determine the objective function, so selecting an appropriate objective function requires trade-offs and adjustments for the particular situation, and the present disclosure is not limited to the design of the objective function.
Optionally, to avoid the determined policy deviating too far from the original policy, the KL divergence between the initial policy distribution and the policy distribution of the actor model may be added as a regularization term to the objective function to improve the stability of RL model training. The KL divergence is an index for measuring the difference between two probability distributions; by adding it to the objective function as a regularization term, the difference between the new policy distribution and the initial policy distribution can be constrained, which limits how much the policy can change and improves training stability.
By introducing the regularization term, the objective function not only can consider maximizing expected return in the optimization process, but also can maintain the similarity between the new strategy distribution and the initial strategy distribution as much as possible, which is helpful for avoiding excessive strategy change, reducing instability in the training process and improving convergence and generalization performance of the model.
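One common way to realize this regularization, shown here only as an assumed sketch, is to subtract a KL penalty from the surrogate objective and to adjust its coefficient according to the observed KL value:

```python
import torch

def kl_regularized_objective(surrogate: torch.Tensor,
                             logp_actor: torch.Tensor,
                             logp_init: torch.Tensor,
                             beta: float):
    """Subtract a KL penalty between the actor policy and the initial (SFT) policy
    from the surrogate objective; `logp_*` are log-probs of the sampled tokens."""
    kl = (logp_actor - logp_init).mean()     # sampled estimate of KL(actor || init)
    return surrogate - beta * kl, kl

def update_kl_coefficient(beta: float, observed_kl: float,
                          kl_target: float = 0.1, factor: float = 1.5) -> float:
    """Dynamically adjust the KL coefficient based on the observed KL value:
    tighten the penalty if the policy drifts too far, relax it otherwise."""
    if observed_kl > 1.5 * kl_target:
        beta *= factor
    elif observed_kl < kl_target / 1.5:
        beta /= factor
    return beta
```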
According to an embodiment of the present disclosure, parameter updating the actor model and the reviewer model in parallel by optimizing the determined objective function may include: in the first N iterations of the training for the hint instruction and the hint response, keeping the parameters of the actor model fixed and updating the parameters of the reviewer model, where N is a positive integer. Alternatively, during reinforcement learning training with the PPO algorithm, only the reviewer model may be trained in the first N iterations, while the parameters of the actor model are kept fixed, to prevent the not-yet-trained reviewer model from disrupting the effectiveness of the initial actor model.
According to an embodiment of the present disclosure, parameter updating the actor model and the reviewer model in parallel by optimizing the determined objective function may include: in the process of updating parameters of the reviewer model, data used for updating parameters of the reviewer model in the previous iteration is accumulated into the current iteration.
Optionally, during the training of the reviewer model, training data of the reviewer model in the previous iteration may be accumulated into the current iteration so that the reviewer model can see more training data, which is beneficial to the stability of the reinforcement learning training process.
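The following is a hedged sketch, in Python, of how such an update schedule could be organized: the actor is frozen for an assumed number of warm-up iterations, and the reviewer model's training data is accumulated across iterations. The callables and the warm-up length are assumptions, not the disclosed implementation.

```python
from typing import Callable, Iterable, List, Tuple

def ppo_update_schedule(rollout_batches: Iterable[List[Tuple]],
                        update_reviewer: Callable[[List[Tuple]], None],
                        update_actor: Callable[[List[Tuple]], None],
                        warmup_iters: int = 20) -> None:
    """Keep the actor fixed for the first `warmup_iters` iterations and let the
    reviewer (critic) model train on data accumulated from all previous iterations."""
    reviewer_buffer: List[Tuple] = []
    for iteration, batch in enumerate(rollout_batches):
        reviewer_buffer.extend(batch)         # accumulate earlier iterations' data
        update_reviewer(reviewer_buffer)      # reviewer is updated every iteration
        if iteration >= warmup_iters:
            update_actor(batch)               # actor parameters stay fixed during warm-up
```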
Alternatively, in embodiments of the present disclosure, a Megatron + DeepSpeed training architecture may be employed. Megatron is a framework for training large-scale models that implements efficient parallelism strategies, including model parallelism, data parallelism, and pipeline parallelism, while DeepSpeed is an open-source library for accelerating deep learning model training that provides an efficient training framework supporting distributed training, model parallelism, and data parallelism. DeepSpeed also includes memory optimization techniques such as gradient accumulation and activation checkpointing to reduce memory requirements, and it integrates seamlessly with popular deep learning frameworks such as PyTorch. Thus, DeepSpeed can be employed to improve the efficiency and scalability of large-scale model training, providing an optimizer and several means of accelerating execution.
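For illustration, a minimal DeepSpeed setup along these lines might look like the sketch below; the configuration keys, values, and the model being wrapped are assumptions and would need to be adapted to the actual DeepSpeed version, batch-size constraints, and hardware.

```python
import deepspeed
import torch

# Illustrative configuration: mixed precision, gradient accumulation, and ZeRO
# optimizer-state partitioning are among the memory optimizations mentioned above.
ds_config = {
    "train_batch_size": 64,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

def wrap_with_deepspeed(model: torch.nn.Module):
    """Wrap an actor or reviewer model in a DeepSpeed engine for distributed training."""
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    return engine, optimizer
```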
Fig. 5 is a schematic diagram illustrating a relationship between supported model parameters and the number of GPUs under a training architecture according to an embodiment of the present disclosure. As shown in fig. 5, with the Megatron-deep training architecture, the supported model parameters appear to increase almost linearly with the number of GPUs.
According to embodiments of the present disclosure, for errors occurring during the reinforcement learning due to scoring of the reward model, the reward model may be retrained with training data corresponding to the errors, and the retrained latest reward model is applied to reinforcement learning of the large language model.
The large language model reinforcement learning training process based on the PPO algorithm involves the cooperative training of the actor model and the reviewer model: the actor model needs to provide high-quality sampling results for the reviewer model, while accurate evaluation by the reviewer model helps the actor model optimize its policy. If, in this process, the sampling results of the actor model are poor, the reviewer model may learn inaccurately or be unable to give high-scoring evaluations; if the reviewer model learns inaccurately, the update direction of the actor model may be noisy, resulting in a poor model. Therefore, the performance of the actor model and the reviewer model during training needs to be monitored closely, and the two models need to cooperate with each other so that the reward value improves.
In the following, a process for targeted optimization of reinforcement learning of a large language model in the present disclosure is described with respect to some problems occurring in the reinforcement learning training process of a large language model based on the PPO algorithm as described above.
As described above, the reward may fail to rise, or may suddenly drop significantly, during reinforcement learning training of a large language model based on the PPO algorithm. FIG. 6 is a graph showing the relationship between the reward and the number of iteration rounds when reinforcement learning training is unstable, according to an embodiment of the present disclosure. As shown in fig. 6, (a) shows a case in which the reward does not increase as the number of iteration rounds increases, and (b) shows a case in which the reward suddenly drops significantly as the number of iteration rounds increases, where the light gray curves in (a) and (b) correspond to the baseline SFT model.
In view of the above, the cause of this problem may be that, during reinforcement learning training of the large language model based on the PPO algorithm, negative rewards cause the entropy of the policy to increase, so that the gradually learned policy deviates too far from the original policy and the effect eventually deteriorates. In reinforcement learning, the entropy of a policy refers to the uncertainty of the policy distribution. In the PPO algorithm, each token (e.g., an action or a state) has a corresponding policy distribution, and the entropy of the policy reflects the uncertainty of this distribution. Specifically, the greater the entropy of the policy, the more uniform the policy distribution, and the more randomly the agent chooses actions rather than making targeted decisions based on the state of the environment. Conversely, the smaller the entropy of the policy, the more concentrated the policy distribution, the more targeted the agent's choice of actions, and the better it adapts to changes in the environment.
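As a small illustrative helper (an assumption of how such a quantity might be monitored, not taken from the disclosure), the per-token entropy of the policy can be computed directly from the model's logits:

```python
import torch

def per_token_policy_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Entropy of the policy distribution at each token position; higher entropy
    means a more uniform (more uncertain) distribution over the vocabulary.
    `logits` has shape [seq_len, vocab_size]."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)
```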
FIG. 7 is a graph illustrating the relationship between the entropy of the policy of each token and the number of iteration rounds in the case where reinforcement learning training is unstable, according to an embodiment of the present disclosure. As shown in fig. 7, in the large language model reinforcement learning training process based on the PPO algorithm, the number of tokens with high entropy gradually increases as the number of iteration rounds increases.
Thus, in embodiments of the present disclosure, for the above-described problems, the policy may be restricted from deviating too far from the original policy by adding a KL-divergence regularization term between the new policy and the original SFT model.
In accordance with an embodiment of the present disclosure, the method for reinforcement learning of a large language model of the present disclosure may further include: generating a plurality of different hint responses for each hint instruction; and selecting, using the reward model, the hint response corresponding to the highest reward from the plurality of different hint responses for parameter updating of the actor model. Alternatively, reinforcement learning training can be made to see more positive examples, for example by sampling multiple hint responses and selecting the one with the highest reward when training the actor model, as sketched below.
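A hedged sketch of this best-of-N style selection follows; `generate` and `reward_model` are assumed callables wrapping the large language model and the reward model, and the sample count is arbitrary.

```python
from typing import Callable, List

def best_of_n_response(prompt: str,
                       generate: Callable[[str], str],
                       reward_model: Callable[[str, str], float],
                       n: int = 4) -> str:
    """Sample several candidate responses for one prompt and keep the one that the
    reward model scores highest, so actor training sees more positive examples."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```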
FIG. 8 is a graph illustrating various performances of the optimized reinforcement learning model in the case where reinforcement learning training is unstable, according to an embodiment of the present disclosure, where (a) shows the relationship between the entropy of the policy for each token and the iteration round, (b) shows the relationship between the average entropy of the policy for each token and the iteration round, and (c) shows the relationship between the reward and the iteration round. As shown in fig. 8, the curves in (a) and (b) both reflect that the reinforcement learning model optimized as described above can achieve stable training, with the direction of change of the policy entropy (as indicated by the arrows in (a) of fig. 7 and fig. 8) continuing to converge, and (c) shows that the reward function of the optimized reinforcement learning model achieves positive growth.
Furthermore, regarding the situation in which the reward function is biased, i.e., the reward grows steadily but the final effect of the large language model deteriorates, analysis in the present disclosure finds that the model may develop a tendency to refuse to answer during reinforcement learning, e.g., a hint response stating that an answer cannot be given due to a lack of definition of keywords. Thus, according to embodiments of the present disclosure, in the case where the reward model assigns a higher score to hint responses corresponding to refusal answers, negative example data of such refusal-answer hint responses may be added to the training of the reward model.
Optionally, negative examples of refusal answers can be added to reward model training to help the reward model more quickly learn accurate scores for this class of hint responses, thereby preventing reinforcement learning training from converging to this class of hint responses.
In addition, the content and structure of the reward model's training data can be optimized so that the reward model aligns with human preferences and so that defects in the reward model do not cause the reinforcement learning training process to learn incorrectly.
As described above, the deficiencies of the current model can be analyzed in connection with the problems that occur in specific reinforcement learning training to provide targeted optimization through data adjustment and model parameter adjustment. Thus, in the reinforcement learning training process of the present disclosure, cyclic iterative training of the RM model and the RL model may be implemented to overcome the problems that occur in reinforcement learning training.
Furthermore, ablation experiments can be performed on different variables, and a deeper analysis of the underlying training mechanisms can be carried out at the same time, so that optimization can be performed in a more targeted manner.
Therefore, as described above with reference to FIGS. 1-8, the method for reinforcement learning of a large language model of the present disclosure employs a reward model, a reviewer model, an actor model, and an initialized large language model to cooperatively perform reinforcement learning training on the large language model, wherein the actor model is used to generate the next action of the large language model, and the reviewer model is used to evaluate the quality of that action in order to update the behavior policy of the large language model, and wherein the KL divergence between the initial policy distribution and the policy distribution of the actor model is added to the objective function as a regularization term, so that the policy of the large language model is prevented from deviating too far from the original policy and the stability of reinforcement learning training is improved. The method of the present disclosure can optimize reinforcement learning training in a targeted manner for the various problems that arise in reinforcement learning training of large language models, so as to improve the performance of the large language model, and can achieve stable training across model scales by supplementing the reinforcement learning training with various training stability strategies.
Fig. 9 is a schematic diagram illustrating an apparatus 900 for reinforcement learning of a large language model according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the apparatus 900 for reinforcement learning of a large language model may include a data acquisition module 901, a response generation module 902, and a reinforcement learning module 903.
The data acquisition module 901 may be configured to acquire hint instructions. Alternatively, the data acquisition module 901 may perform the operations described above with reference to step S101.
For example, hint instructions may be sampled from a hint instruction data set, which may be a collection of hint instructions gathered in advance with sufficient diversity to ensure that the reinforcement-learning-trained large language model covers a variety of application scenarios well. After the prompt instruction set has been collected, prompt instructions can be randomly selected from it for reinforcement learning training of the large language model.
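As a minimal, illustrative sketch (the function and variable names are assumptions, not part of the present disclosure), such random selection from a pre-collected prompt instruction set could be implemented as follows:

```python
import random

def sample_prompts(prompt_pool, batch_size, seed=None):
    """Randomly select a batch of prompt instructions from a pre-collected,
    diverse list of prompts for one round of reinforcement learning training."""
    rng = random.Random(seed)
    return rng.sample(prompt_pool, k=min(batch_size, len(prompt_pool)))
```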
Response generation module 902 may be configured to generate a hint response for the hint instruction using a large language model. Alternatively, the response generation module 902 may perform the operations described above with reference to step S102.
For example, based on the acquired hint instructions, a corresponding hint response may be generated by the initial large language model for subsequent reinforcement learning iterative optimization.
Reinforcement learning module 903 may be configured to perform multiple iterations of reinforcement learning on the large language model for the hint instruction and the hint response so as to complete reinforcement learning of the large language model, wherein in each iteration: the initialized large language model and the latest reward model, reviewer model and actor model used for reinforcement learning of the large language model are invoked in parallel to obtain, respectively, the initial policy distribution of the large language model, the reward of the prompt instruction and prompt response, the value corresponding to each action of the large language model, and the policy distribution of the actor model, wherein a policy distribution indicates the probability that the large language model performs each action, and each action corresponds to the large language model generating a word element; an objective function is determined based on the reward, the value, the initial policy distribution and the policy distribution of the actor model; and the actor model and the reviewer model are parameter-updated in parallel by optimizing the determined objective function so as to update the parameters of the large language model, wherein the objective function is used to measure the performance of the current policy of the actor model, the KL divergence between the initial policy distribution and the policy distribution of the actor model is added to the objective function as a regularization term, and the coefficient of the KL divergence is dynamically adjusted based on the value of the KL divergence. Alternatively, the reinforcement learning module 903 may perform the operations described above with reference to step S103.
For example, the RL model may be trained using the PPO algorithm, i.e., the set objective function is optimized by stochastic gradient descent. Proximal policy optimization (PPO) is a deep reinforcement learning algorithm for training an agent (i.e., the RL model) to learn and perform tasks in complex environments. Training enables the agent to maximize the cumulative return in its interactions with the environment, thereby achieving the specified task goal.
For example, in large language model reinforcement learning training using the PPO algorithm, four models may be used simultaneously to cooperatively train the large language model so as to achieve stable policy optimization and training. In particular, the four models may include a reward model, a reviewer model, an actor model, and an initialized large language model (i.e., the supervised fine-tuned large language model). Designing the architecture of the four models as independent components improves the flexibility of model training: a multi-party call can be designed in which each of the four models serves as a component, and each component can run as an independent server, so that each node can be deployed differently and resources can be allocated reasonably. In this way, the training and inference processes can be switched flexibly at use time, enabling convenient and flexible invocation.
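As a non-limiting sketch of this independent-component design, the four components could be invoked in parallel roughly as follows; the client objects and their methods (logprobs, score, values) are hypothetical stand-ins for whatever server interface is actually deployed:

```python
from concurrent.futures import ThreadPoolExecutor

def call_components(reference_lm, reward_model, critic, actor, prompt, response):
    """Query the four independently deployed components in parallel for one
    PPO iteration: the frozen initialized LLM, the reward model, the reviewer
    (critic) model, and the actor model."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        ref = pool.submit(reference_lm.logprobs, prompt, response)  # initial policy distribution
        rew = pool.submit(reward_model.score, prompt, response)     # reward of prompt/response
        val = pool.submit(critic.values, prompt, response)          # value for each state
        act = pool.submit(actor.logprobs, prompt, response)         # actor policy distribution
        return ref.result(), rew.result(), val.result(), act.result()
```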
For example, the initialized large language model may output its initial policy distribution based on the hint instruction. The initial policy distribution represents the probability distribution over word elements that the large language model outputs given the hint instruction, i.e., the probability of generating each word element. It should be noted that the initial policy distribution output by the large language model may not be optimal; the RL model needs to continually update its policy distribution through interaction with the environment to find the optimal policy.
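For illustration only, assuming a PyTorch causal language model whose forward pass exposes next-token logits (a HuggingFace-style interface), the per-token log-probabilities of the initialized model over a generated sequence could be read off as follows:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def initial_policy_logprobs(frozen_model, input_ids):
    """Log-probabilities the frozen, initialized model assigns to each token
    actually present in input_ids (one next-token distribution per position)."""
    logits = frozen_model(input_ids).logits                 # [batch, seq_len, vocab]
    logprobs = F.log_softmax(logits[:, :-1, :], dim=-1)     # predictions for positions 1..T
    targets = input_ids[:, 1:].unsqueeze(-1)                # the tokens that were generated
    return logprobs.gather(-1, targets).squeeze(-1)         # [batch, seq_len - 1]
```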
For example, in the current iteration, the trained, up-to-date reward model may be used to obtain rewards for current hint instructions and hint responses.
For example, the value of each state is calculated using the reviewer model. The goal of the reviewer model is to learn a value function that evaluates the quality of each state. This value function may be learned and estimated by various methods, such as value iteration, policy iteration, or a deep learning model.
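One common realization of such a value estimator (shown here only as an assumed sketch, not the specific architecture of the present disclosure) is a scalar value head placed on top of a transformer backbone's hidden states:

```python
import torch.nn as nn

class ValueHead(nn.Module):
    """Maps each hidden state of the reviewer model's backbone to a scalar
    value estimate for the corresponding state (token position)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.v_head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):                  # [batch, seq_len, hidden]
        return self.v_head(hidden_states).squeeze(-1)  # [batch, seq_len]
```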
For example, the actor model may be used to generate a policy distribution, i.e., a probability distribution over the actions that can be taken in a given state. The actor model may learn the optimal policy in cooperation with the reviewer model. For example, the input of the actor model may be the current state and the output may be the action probability distribution (i.e., the policy distribution). The actor model may generate a policy distribution as follows: the current state is input into the actor model, the actor model calculates the probability of each action, and one action is randomly selected based on these probabilities.
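A minimal sketch of this sampling step, again assuming a model that exposes next-token logits, might look like the following:

```python
import torch
from torch.distributions import Categorical

@torch.no_grad()
def sample_action(actor_model, state_ids):
    """Compute the policy distribution over the vocabulary for the current
    state and randomly draw one action (the next token) from it."""
    logits = actor_model(state_ids).logits[:, -1, :]  # distribution for the next token
    dist = Categorical(logits=logits)
    action = dist.sample()                            # sampled token id
    return action, dist.log_prob(action)
```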
Thus, from the reward (i.e., an evaluation of the model taking a particular action in a particular state), the value (representing the expected return of a state), the initial policy distribution (the action distribution of the model at the beginning), and the policy distribution of the actor model (the probability distribution over actions generated by the model), a corresponding dominance (advantage) value can be determined for each word element, which represents how good or bad it is for the model to generate that word element in the current state relative to generating other word elements. Further, the target value of the reviewer model may be determined based on the reward in the current state and the value of the next state.
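As a hedged, non-limiting example (the discount and smoothing parameters are illustrative), the per-token dominance (advantage) values and the reviewer model's regression targets can be computed with generalized advantage estimation as follows:

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards, values: 1-D tensors over the generated tokens of one response.
    Returns per-token advantages and the reviewer (critic) targets (returns)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0      # no bootstrap after the final token
        delta = rewards[t] + gamma * next_value - values[t]   # TD error at step t
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    returns = advantages + values        # target values for the reviewer model
    return advantages, returns
```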
Further, based on the initial policy distribution and the policy distribution of the actor model, as well as the determined dominance value for each word element and the target value of the reviewer model, an objective function may be determined for guiding the learning process of the RL model, helping the RL model to continually adjust the policies to maximize the expected return.
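To make the shape of such an objective concrete, the following sketch combines a PPO-style clipped policy term with the KL divergence between the actor's policy distribution and the initial policy distribution as a regularization term, and dynamically adjusts the KL coefficient from the observed KL value; the clipping threshold, KL target, and adjustment rule are illustrative assumptions rather than the specific choices of the present disclosure:

```python
import torch

def ppo_kl_objective(logprobs, old_logprobs, ref_logprobs, advantages,
                     kl_coef, clip_eps=0.2):
    """Clipped surrogate policy loss plus a KL(actor || initial policy) penalty.
    advantages are assumed to be precomputed (e.g., by GAE) and detached."""
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    kl = (logprobs - ref_logprobs).mean()   # sample-based estimate of the KL term
    return policy_loss + kl_coef * kl, kl

def adapt_kl_coef(kl_coef, observed_kl, target_kl=0.1, factor=1.5):
    """Dynamically adjust the KL coefficient based on the observed KL value."""
    if observed_kl > 1.5 * target_kl:
        kl_coef *= factor   # policy drifting too far from the initial policy
    elif observed_kl < target_kl / 1.5:
        kl_coef /= factor   # policy too conservative; relax the penalty
    return kl_coef
```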
In accordance with yet another aspect of the present disclosure, an apparatus for reinforcement learning of a large language model is also provided. Fig. 10 shows a schematic diagram of an apparatus 2000 for reinforcement learning of large language models, according to an embodiment of the present disclosure.
As shown in fig. 10, the apparatus 2000 for reinforcement learning of large language models may include one or more processors 2010, and one or more memories 2020. Wherein the memory 2020 has stored therein computer readable code which, when executed by the one or more processors 2010, can perform the method for reinforcement learning of large language models as described above.
The processor in embodiments of the present disclosure may be an integrated circuit chip having signal processing capabilities. The processor may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may be of the X86 architecture or the ARM architecture.
In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
For example, a method or apparatus according to embodiments of the present disclosure may also be implemented by means of the architecture of computing device 3000 shown in fig. 11. As shown in fig. 11, computing device 3000 may include a bus 3010, one or more CPUs 3020, a Read Only Memory (ROM) 3030, a Random Access Memory (RAM) 3040, a communication port 3050 connected to a network, an input/output component 3060, a hard disk 3070, and the like. A storage device in the computing device 3000, such as the ROM 3030 or the hard disk 3070, may store various data or files used in the processing and/or communication of the method for reinforcement learning of a large language model provided by the present disclosure, as well as program instructions executed by the CPU. The computing device 3000 may also include a user interface 3080. Of course, the architecture shown in fig. 11 is merely exemplary, and one or more components of the computing device shown in fig. 11 may be omitted according to practical needs when implementing different devices.
According to yet another aspect of the present disclosure, a computer-readable storage medium is also provided. The computer storage medium has computer readable instructions stored thereon. When executed by a processor, the computer-readable instructions may perform a method for reinforcement learning of large language models according to embodiments of the present disclosure described with reference to the above figures. The computer readable storage medium in embodiments of the present disclosure may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory may be random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM). It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from a computer-readable storage medium, the processor executing the computer instructions, causing the computer device to perform a method for reinforcement learning of a large language model according to an embodiment of the present disclosure.
Embodiments of the present disclosure provide a method, apparatus, device, and computer-readable storage medium for reinforcement learning of large language models.
The method provided by the embodiments of the present disclosure employs a reward model, a reviewer model, an actor model and an initialized large language model to cooperatively perform reinforcement learning training on the large language model. The actor model is used to generate the next action of the large language model, and the reviewer model is used to evaluate the quality of that action in order to update the behavior policy of the large language model. The KL divergence between the initial policy distribution and the policy distribution of the actor model is added to the objective function as a regularization term, so that the policy of the large language model is prevented from deviating too far from the original policy, which further improves the stability of reinforcement learning training. The method of the embodiments of the present disclosure can optimize the reinforcement learning training in a targeted manner for the various problems arising in reinforcement learning training of large language models so as to improve the performance of the large language model, and can achieve stable training across model scales by supplementing various training stability strategies in the reinforcement learning training.
It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The exemplary embodiments of the present disclosure described in detail above are illustrative only and are not limiting. Those skilled in the art will understand that various modifications and combinations of these embodiments or features thereof may be made without departing from the principles and spirit of the disclosure, and such modifications should fall within the scope of the disclosure.

Claims (15)

1. A method for reinforcement learning of a large language model initialized to a supervised fine-tuned large language model, the method comprising:
acquiring a prompt instruction;
generating a prompt response to the prompt instruction by using the large language model;
performing multiple iterations of reinforcement learning for the large language model for the hint instruction and the hint response to complete reinforcement learning for the large language model, wherein in each iteration:
invoking an initialized large language model, and a latest rewards model, reviewer model and actor model for reinforcement learning of the large language model in parallel to obtain an initial policy distribution of the large language model, rewards of the prompt instructions and prompt responses, a value corresponding to each action of the large language model, and a policy distribution of the actor model, respectively, wherein the policy distribution indicates a probability that the large language model performs each action, and each action generates each word element corresponding to the large language model;
Determining an objective function based on the rewards, the value, the initial policy distribution and the policy distribution of the actor model; and
parameter updating the actor model and the reviewer model in parallel by optimizing the determined objective function to update parameters of the large language model, wherein the objective function is used for measuring the performance of the current strategy of the actor model, and KL divergences of the initial strategy distribution and the strategy distribution of the actor model are added to the objective function as regular terms, and coefficients of the KL divergences are dynamically adjusted based on values of the KL divergences.
2. The method of claim 1, wherein for errors occurring during the reinforcement learning due to scoring of the reward model, retraining the reward model with training data corresponding to the errors and applying a retrained, up-to-date reward model to reinforcement learning of the large language model.
3. The method of claim 2, wherein negative example data of prompt responses corresponding to refusal answers is added in training of the reward model in the case where the reward model gives a higher score to a prompt response corresponding to a refusal answer.
4. The method of claim 1, wherein determining an objective function based on the rewards, the value, the initial policy distribution, and a policy distribution of the actor model comprises:
determining a dominance value corresponding to each term and a target value of the reviewer model according to the rewards, the values, the initial policy distribution and the policy distribution of the actor model, wherein the dominance value indicates the goodness of the action of generating the term by the big language model in the current state relative to the action of generating other terms, and the target value of the reviewer model is determined according to the rewards in the current state and the values of the next state; and
the objective function is determined based on the initial policy distribution and the policy distribution of the actor model, the determined dominance value corresponding to each word element, and the objective value of the reviewer model.
5. The method of claim 4, wherein determining a dominance value for each of the tokens and a target value for the reviewer model based on the rewards, the values, the initial policy distribution and the policy distribution of the actor model comprises normalizing the rewards.
6. The method of claim 4, wherein determining a dominance value for each of the tokens and a target value for the reviewer model based on the reward, the value, the initial policy distribution, and a policy distribution of the actor model comprises removing hint instructions and hint responses corresponding to the reward if the reward is greater than a predetermined threshold.
7. The method of claim 4, wherein determining a dominance value for each of the tokens and a target value for the reviewer model based on the rewards, the values, the initial policy distribution, and a policy distribution of the actor model comprises: using a generalized dominance estimation method, a dominance value for each token is calculated by using the rewards in a number of iterations.
8. The method of claim 4, wherein determining the objective function based on the initial policy distribution and the policy distribution of the actor model, the determined dominance value corresponding to each word element, and the target value of the reviewer model comprises normalizing the dominance values.
9. The method of claim 1, wherein parameter updating the actor model and the reviewer model in parallel by optimizing the determined objective function to parameter update the large language model comprises: in the first N iterations of the training for the prompt instruction and the prompt response, keeping parameters of the actor model fixed and updating parameters of the reviewer model, wherein N is a positive integer.
10. The method of claim 1, wherein parameter updating the actor model and the reviewer model in parallel by optimizing the determined objective function to parameter update the large language model comprises: in the process of updating parameters of the reviewer model, data used for updating parameters of the reviewer model in the previous iteration is accumulated into the current iteration.
11. The method of claim 1, further comprising:
generating a plurality of different hint responses for each hint instruction; and
selecting, by using the reward model, a prompt response corresponding to a highest reward from the plurality of different prompt responses for updating parameters of the actor model.
12. An apparatus for reinforcement learning of a large language model, comprising:
the data acquisition module is configured to acquire a prompt instruction;
a response generation module configured to generate a hint response to the hint instruction using the large language model; and
a reinforcement learning module configured to:
performing multiple iterations of reinforcement learning for the large language model for the hint instruction and the hint response to complete reinforcement learning for the large language model, wherein in each iteration:
Invoking an initialized large language model, and a latest rewards model, reviewer model and actor model for reinforcement learning of the large language model in parallel to obtain an initial policy distribution of the large language model, rewards of the prompt instructions and prompt responses, a value corresponding to each action of the large language model, and a policy distribution of the actor model, respectively, wherein the policy distribution indicates a probability that the large language model performs each action, and each action generates each word element corresponding to the large language model;
determining an objective function based on the rewards, the value, the initial policy distribution and the policy distribution of the actor model; and
parameter updating the actor model and the reviewer model in parallel by optimizing the determined objective function to update parameters of the large language model, wherein the objective function is used for measuring the performance of the current strategy of the actor model, and KL divergences of the initial strategy distribution and the strategy distribution of the actor model are added to the objective function as regular terms, and coefficients of the KL divergences are dynamically adjusted based on values of the KL divergences.
13. An apparatus for reinforcement learning of a large language model, comprising:
one or more processors; and
one or more memories in which a computer executable program is stored which, when executed by the processor, performs the method of any of claims 1-11.
14. A computer program product stored on a computer readable storage medium and comprising computer instructions which, when executed by a processor, cause a computer device to perform the method of any of claims 1-11.
15. A computer readable storage medium having stored thereon computer executable instructions for implementing the method of any of claims 1-11 when executed by a processor.
CN202311866241.XA 2023-12-29 2023-12-29 Method and apparatus for reinforcement learning of large language models Pending CN117808120A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311866241.XA CN117808120A (en) 2023-12-29 2023-12-29 Method and apparatus for reinforcement learning of large language models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311866241.XA CN117808120A (en) 2023-12-29 2023-12-29 Method and apparatus for reinforcement learning of large language models

Publications (1)

Publication Number Publication Date
CN117808120A true CN117808120A (en) 2024-04-02

Family

ID=90427482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311866241.XA Pending CN117808120A (en) 2023-12-29 2023-12-29 Method and apparatus for reinforcement learning of large language models

Country Status (1)

Country Link
CN (1) CN117808120A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118036757A (en) * 2024-04-15 2024-05-14 清华大学 Training method and device for large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination