CN114021635A - Method, apparatus, device and storage medium for training a model - Google Patents

Method, apparatus, device and storage medium for training a model

Info

Publication number
CN114021635A
Authority
CN
China
Prior art keywords
model
training
parameters
candidate
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111280587.2A
Other languages
Chinese (zh)
Inventor
何凤翔
王子铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Zhenshi Information Technology Co Ltd
Original Assignee
Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Zhenshi Information Technology Co Ltd filed Critical Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority to CN202111280587.2A
Publication of CN114021635A
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

Embodiments of the present disclosure disclose methods and apparatus for training a model. One embodiment of the method comprises: acquiring a training sample set; performing the following training steps: taking the state information of the object at each time point, which is included in a training sample, as model input and the action information of the object as model output, and solving for candidate parameters of the model based on a loss function and an objective function, wherein the loss function is constructed based on an adversarial learning algorithm and the candidate parameters are the set of optimal parameters of the two adversarial parties obtained by using the objective function; when the candidate parameters meet a preset condition, determining the candidate parameters as the final parameters of the model and outputting the trained final model, wherein the preset condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle; and when the candidate parameters do not meet the preset condition, repeating the training steps. The scheme realizes a brand-new method and apparatus for training a model.

Description

Method, apparatus, device and storage medium for training a model
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, in particular to the field of artificial intelligence, and more particularly to a method and apparatus for training a model.
Background
In recent years, adversarial reinforcement learning has achieved great success in fields such as autonomous driving and competitive games. The introduction of adversarial neural networks enables a model to better handle problems with adversarial scenarios and improves the robustness of the model and the efficiency of acquiring training data. However, existing adversarial reinforcement learning models lack a theoretical foundation, which is insufficient to support the design of new algorithms. It is hoped that, by modeling with optimal control, problems such as those in autonomous driving can be addressed and a mathematical basis can be provided for the design of new algorithms.
Disclosure of Invention
Embodiments of the present disclosure propose a method, apparatus, device, and storage medium for training a model and a method, apparatus, device, and storage medium for generating information.
In a first aspect, an embodiment of the present disclosure provides a method for training a model, the method including: acquiring a training sample set, wherein the training samples in the training sample set comprise state information of an object at each time point and action information of the object; performing the following training steps: taking the state information of the object at each time point, which is included in a training sample, as model input and the action information of the object corresponding to the input state information at each time point as model output, and solving for candidate parameters of the model based on a loss function and an objective function, wherein the loss function is constructed based on an adversarial learning algorithm, and the candidate parameters are the set of optimal parameters of the two adversarial parties obtained by using the objective function; in response to the candidate parameters meeting a preset condition, determining the candidate parameters as the final parameters of the model and outputting the trained final model, wherein the preset condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle; and in response to the candidate parameters not meeting the preset condition, repeating the training steps.
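By way of illustration only, the training procedure of the first aspect can be sketched as the following loop. All function and variable names here (train_model, solve_candidate_parameters, meets_preset_condition, set_final_parameters) are assumptions introduced for the sketch and do not appear in the disclosure.

# Illustrative sketch of the training procedure (assumed names, not from the disclosure).
def train_model(training_samples, solve_candidate_parameters,
                meets_preset_condition, set_final_parameters):
    """Repeat the training step until the candidate parameters satisfy the preset condition."""
    while True:
        # Training step: take the state information at each time point as model input,
        # the action information as model output, and solve for candidate parameters
        # based on the loss function (adversarial learning) and the objective function.
        candidate_params = solve_candidate_parameters(training_samples)
        # Preset condition: the candidate parameters (the set of optimal parameters of
        # the two adversarial parties) are characterized by the two-player extremum principle.
        if meets_preset_condition(candidate_params):
            # Determine the candidate parameters as final parameters and output the trained model.
            return set_final_parameters(candidate_params)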
In some embodiments, each time point is a time point of the object in a continuous state, and the model is used for solving the optimal control problem of the object in the continuous state; and the loss function is constructed based on a combination of an adversarial learning algorithm and a reinforcement learning algorithm.
In some embodiments, determining the candidate parameters as the final parameters of the model in response to the candidate parameters meeting a preset condition, and outputting the trained final model, includes: in response to the candidate parameters meeting the preset condition, determining the candidate parameters as the final parameters of the model and outputting the trained final model, wherein the preset condition is that the loss value corresponding to the set of optimal parameters of the two adversarial parties equals the unique solution of a first equation, and the solution of the first equation is used to characterize and solve a problem arising in training the model.
In some embodiments, the preset condition includes a first sub-condition and a second sub-condition; determining the candidate parameters as the final parameters of the model in response to the candidate parameters meeting the preset condition, and outputting the trained final model, includes: in response to the candidate parameters meeting the first sub-condition and the second sub-condition at the same time, determining the candidate parameters as the final parameters of the model and outputting the trained final model, wherein the first sub-condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle, and the second sub-condition is that the loss value corresponding to the set of optimal parameters of the two adversarial parties equals the unique solution of the first equation.
In some embodiments, the problem is a saddle point problem.
In some embodiments, the first equation is constructed based on nonlinear control theory; the first equation is a Hamilton-Jacobi-Isaacs equation or a Hamilton-Jacobi-Bellman equation.
In a second aspect, an embodiment of the present disclosure provides a method for generating information, including: acquiring state information of each time point of the object, wherein each time point is each time point of the object in a continuous state; the state information of the object at each time point is input into a pre-trained information generation model, and the action information of the object is generated, wherein the information generation model is obtained by training through the method of any embodiment of the method for training the model.
In a third aspect, an embodiment of the present disclosure provides an apparatus for training a model, including: an acquisition unit, a training unit and a repeating unit, wherein the acquisition unit is configured to acquire a training sample set, the training samples in the training sample set comprising state information of an object at each time point and action information of the object; the training unit is configured to perform the following training steps: taking the state information of the object at each time point, which is included in a training sample, as model input and the action information of the object corresponding to the input state information at each time point as model output, and solving for candidate parameters of the model based on a loss function and an objective function, wherein the loss function is constructed based on an adversarial learning algorithm, and the candidate parameters are the set of optimal parameters of the two adversarial parties obtained by using the objective function; and in response to the candidate parameters meeting a preset condition, determining the candidate parameters as the final parameters of the model and outputting the trained final model, wherein the preset condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle; and the repeating unit is configured to repeat the training steps in response to the candidate parameters not meeting the preset condition.
In some embodiments, each time point is a time point of the object in a continuous state, and the model is used for solving the optimal control problem of the object in the continuous state; and the loss function is constructed based on a combination of an adversarial learning algorithm and a reinforcement learning algorithm.
In some embodiments, the training unit is further configured to determine the candidate parameters as the final parameters of the model in response to the candidate parameters meeting a preset condition, and to output the trained final model, wherein the preset condition is that the loss value corresponding to the set of optimal parameters of the two adversarial parties equals the unique solution of a first equation, and the solution of the first equation is used to characterize and solve a problem arising in training the model.
In some embodiments, the preset condition includes a first sub-condition and a second sub-condition; the training unit is further configured to, in response to the candidate parameters meeting the first sub-condition and the second sub-condition at the same time, determine the candidate parameters as the final parameters of the model and output the trained final model, wherein the first sub-condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle, and the second sub-condition is that the loss value corresponding to the set of optimal parameters of the two adversarial parties equals the unique solution of the first equation.
In some embodiments, the problem in the training unit is a saddle point problem.
In some embodiments, the first equation in the training unit is constructed based on nonlinear control theory; the first equation is a Hamilton-Jacobi-Isaacs equation or a Hamilton-Jacobi-Bellman equation.
In a fourth aspect, an embodiment of the present disclosure provides an apparatus for generating information, comprising: an information acquisition unit configured to acquire state information of the object at each time point, each time point being a time point at which the object is in a continuous state; and an information generation unit configured to input the state information of the object at each time point to a pre-trained information generation model and generate the action information of the object, wherein the information generation model is obtained by training according to the method of any embodiment of the method for training a model.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect or the second aspect.
In a sixth aspect, an embodiment of the present disclosure provides a non-transitory computer readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the method described in any one of the implementation manners of the first aspect or the second aspect.
The method and apparatus for training a model provided by the embodiments of the present disclosure acquire a training sample set, wherein the training samples in the training sample set comprise state information of an object at each time point and action information of the object, and perform the following training steps: taking the state information of the object at each time point, which is included in a training sample, as model input and the action information of the object corresponding to the input state information at each time point as model output, and solving for candidate parameters of the model based on a loss function and an objective function, wherein the loss function is constructed based on an adversarial learning algorithm and the candidate parameters are the set of optimal parameters of the two adversarial parties obtained by using the objective function; in response to the candidate parameters meeting a preset condition, determining the candidate parameters as the final parameters of the model and outputting the trained final model, wherein the preset condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle; and in response to the candidate parameters not meeting the preset condition, repeating the training steps. By judging necessary and sufficient conditions for the optimality of the model, the training of reinforcement learning is guided in reverse, and the two-player extremum principle of optimal control theory is thereby generalized to model training, so that a brand-new method and apparatus for training a model are realized.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure.
FIG. 1 is a schematic diagram of a first embodiment of a method for training a model according to the present disclosure;
FIG. 2 is a scenario diagram of a method for training a model in which an embodiment of the present disclosure may be implemented;
FIG. 3 is a schematic diagram of a second embodiment of a method for training a model according to the present disclosure;
FIG. 4 is a schematic diagram of a first embodiment of a method for generating information in accordance with the present disclosure;
FIG. 5 is a schematic block diagram of one embodiment of an apparatus for training models according to the present disclosure;
FIG. 6 is a schematic block diagram illustrating one embodiment of an apparatus for generating information according to the present disclosure;
FIG. 7 is a block diagram of an electronic device used to implement an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows a schematic diagram 100 of a first embodiment of a method for training a model according to the present disclosure. The method for training the model comprises the following steps:
step 101, a training sample set is obtained.
In this embodiment, the executing entity of the method for training the model may obtain a training sample set from other electronic devices or locally by means of a wired or wireless connection, where the training samples in the training sample set include state information of the object at each time point and action information of the object; the state of the object may be, for example, the motion state of a vehicle in autonomous driving, the motion state of the limbs of a game character, or various other possible states of an object.
It should be noted that the above-mentioned wireless connection means may include, but is not limited to, 3G, 4G and 5G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (Ultra-WideBand) connections, and other now known or later developed wireless connection means.
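Purely as an illustration of how the training samples described above might be organized, one possible in-memory layout is sketched below; the class and field names are assumptions and are not part of the disclosure.

from dataclasses import dataclass
from typing import List

# Illustrative layout of one training sample (assumed field names).
@dataclass
class TrainingSample:
    time_points: List[float]        # the time points of the object
    state_info: List[List[float]]   # state information of the object at each time point
    action_info: List[List[float]]  # action information of the object at each time point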
Step 102, the following training steps are performed.
Step 1021, using the state information of each time point of the object included in the training sample as model input, using the motion information of the object corresponding to the input state information of each time point of the object as model output, and solving to obtain candidate parameters of the model based on the loss function and the objective function.
In this embodiment, the executing entity may use a deep learning algorithm, taking the state information of the object at each time point included in the training samples of the training sample set acquired in step 101 as input and the action information of the object corresponding to that state information as output, to solve for candidate parameters of the model based on the loss function and the objective function, wherein the loss function is constructed based on an adversarial learning algorithm and the candidate parameters are the set of optimal parameters of the two adversarial parties obtained by using the objective function.
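One possible way to realize this solving step, sketched here only as an assumption (the disclosure does not prescribe a particular optimizer), is a simple alternating gradient scheme in which one party descends and the other ascends on the adversarial loss:

# Illustrative min-max solve for the candidate parameters (all names and the update rule are assumptions).
def solve_candidate_parameters(samples, loss_fn, theta_z, theta_d, grad_z, grad_d,
                               lr=1e-3, num_iters=1000):
    """theta_z: parameters of the original model; theta_d: parameters of the adversarial model."""
    for _ in range(num_iters):
        gz = grad_z(loss_fn, theta_z, theta_d, samples)       # gradient of the loss w.r.t. theta_z
        gd = grad_d(loss_fn, theta_z, theta_d, samples)       # gradient of the loss w.r.t. theta_d
        theta_z = [p - lr * g for p, g in zip(theta_z, gz)]   # one adversarial party minimizes the loss
        theta_d = [p + lr * g for p, g in zip(theta_d, gd)]   # the other adversarial party maximizes it
    return theta_z, theta_d  # candidate set of parameters of the two adversarial parties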
Step 1022, in response to the candidate parameters meeting the preset conditions, determining the candidate parameters as final parameters of the model, and outputting the final model after training.
In this embodiment, the executing entity may judge the candidate parameters of the model according to the preset condition, determine the candidate parameters as the final parameters of the model when they satisfy the preset condition, and output the trained final model, thereby completing the training of the model. The preset condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle. The condition may be set based on a threshold value, by analyzing, comparing or calculating the set of optimal parameters of the two adversarial parties against previous optimal parameters, or based on various existing techniques or theories for solving problems in the model. The condition may be a single defining condition on the set of optimal parameters or a combination of several such conditions, and the preset condition may classify and constrain the set of optimal parameters according to different service requirements.
In some optional implementations of this embodiment, determining the candidate parameters as the final parameters of the model in response to the candidate parameters meeting a preset condition, and outputting the trained final model, includes: in response to the candidate parameters meeting the preset condition, determining the candidate parameters as the final parameters of the model and outputting the trained final model, wherein the preset condition is that the loss value corresponding to the set of optimal parameters of the two adversarial parties equals the unique solution of a first equation, the solution of the first equation characterizes and solves a problem arising in model training, the problem may be a saddle point problem, the first equation may be constructed based on nonlinear control theory, and the first equation is a Hamilton-Jacobi-Isaacs equation (HJI equation) or a Hamilton-Jacobi-Bellman equation (HJB equation). This addresses the fact that various problems brought by the adversarial neural network (such as the saddle point problem) cannot be explained in existing models, and optimizes existing models by setting necessary conditions for model optimality so as to achieve the optimality of the model.
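For reference, the classical single-system forms of the two equations named above can be written as follows; this is standard textbook notation rather than the specific mean-field formulation of the present disclosure, which is given in the second embodiment below.

HJB: \partial_t v(t, x) + \min_{\theta} \big[ \nabla_x v(t, x) \cdot f(x, \theta) + L(x, \theta) \big] = 0, \qquad v(t_f, x) = \Phi(x),

HJI: \partial_t v(t, x) + \min_{\theta_z} \max_{\theta_d} \big[ \nabla_x v(t, x) \cdot f(x, \theta_z, \theta_d) + L(x, \theta_z, \theta_d) \big] = 0, \qquad v(t_f, x) = \Phi(x).

The HJB equation corresponds to a single controlled system, while the HJI equation corresponds to the game between the two adversarial parties.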
Step 103, in response to the candidate parameters not meeting the preset conditions, repeatedly executing the training steps.
In this embodiment, when the executing entity determines that the candidate parameters do not satisfy the preset condition, the process jumps to step 102, and step 102 is executed again.
It should be noted that the training process of the above model is a well-known technique that is widely researched and applied at present, and is not described herein again.
With continued reference to FIG. 2, the method 200 for training a model of the present embodiment runs in a server 201. The server 201 first obtains a training sample set 202, where the training samples in the training sample set include state information of the object at each time point and action information of the object. The server 201 then performs the following training steps: taking the state information of the object at each time point, which is included in a training sample, as model input and the action information of the object corresponding to the input state information at each time point as model output, and solving for candidate parameters of the model based on a loss function and an objective function, wherein the loss function is constructed based on an adversarial learning algorithm and the candidate parameters are the set of optimal parameters of the two adversarial parties obtained by using the objective function; in response to the candidate parameters meeting a preset condition, determining the candidate parameters as the final parameters of the model and outputting the trained final model 203, wherein the preset condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle; and in response to the candidate parameters not meeting the preset condition, repeating the training steps.
The method for training a model provided by the above embodiment of the present disclosure acquires a training sample set, wherein the training samples in the training sample set comprise state information of an object at each time point and action information of the object, and performs the following training steps: taking the state information of the object at each time point, which is included in a training sample, as model input and the action information of the object corresponding to the input state information at each time point as model output, and solving for candidate parameters of the model based on a loss function and an objective function, wherein the loss function is constructed based on an adversarial learning algorithm and the candidate parameters are the set of optimal parameters of the two adversarial parties obtained by using the objective function; in response to the candidate parameters meeting a preset condition, determining the candidate parameters as the final parameters of the model and outputting the trained final model, wherein the preset condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle; and in response to the candidate parameters not meeting the preset condition, repeating the training steps. By judging necessary and sufficient conditions for the optimality of the model, the training of reinforcement learning is guided in reverse, and the two-player extremum principle of optimal control theory is thereby generalized to model training, so that a brand-new method for training a model is realized.
With further reference to FIG. 3, a schematic diagram 300 of a second embodiment of a method for training a model is shown. The process of the method comprises the following steps:
step 301, a training sample set is obtained.
In this embodiment, the executing entity of the method for training the model may obtain a training sample set from other electronic devices or locally through a wired or wireless connection, where the training samples in the training sample set include state information of the object at each time point and action information of the object, each time point being a time point of the object in a continuous state; the continuous state of the object may be, for example, the continuous motion state of a vehicle in autonomous driving, the continuous motion state of the limbs of a game character, or various other possible continuous states of an object.
Step 302, the following training steps are performed.
Step 3021, taking the state information of the object at each time point included in the training sample as model input and the action information of the object corresponding to the input state information at each time point as model output, and solving for candidate parameters of the model based on the loss function and the objective function.
In this embodiment, the executing entity may use a deep learning algorithm, taking the state information of the object at each time point included in the training samples of the training sample set in step 301 as input and the action information of the object corresponding to the input state information at each time point as output, to solve for candidate parameters of the model based on the loss function and the objective function. The model is used for solving the optimal control problem of the object in the continuous state; the loss function is constructed based on a combination of an adversarial learning algorithm and a reinforcement learning algorithm; the model parameters of the two adversarial parties in the loss function are parameter functions, with respect to time, of the object in the continuous state; and the training goal of the model is to find the set of optimal parameters of the two adversarial parties, i.e. the candidate parameters are the set of optimal parameters of the two adversarial parties obtained by using the objective function.
As a further example, the trained model is an adversarial reinforcement learning model combining reinforcement learning and adversarial learning; there is an original model Z and an adversarial model D, and the loss function is jointly determined by the state x and the parameters \theta_z, \theta_d of the two models. The model parameters of the two adversarial parties are parameter functions \theta_z(t), \theta_d(t) of the object in the continuous state with respect to time t. Adversarial reinforcement learning is then a game between two dynamical systems, i.e. a quantitative differential game, and the loss function of the model can be written as

J(\theta_z, \theta_d) = \mathbb{E}_{(x_0, y_0) \sim \mu} \Big[ \int_0^{t_f} L\big(x(t), \theta_z(t), \theta_d(t)\big) \, \mathrm{d}t + \Phi\big(x(t_f)\big) \Big],

subject to the dynamics

\dot{x}(t) = f\big(x(t), \theta_z(t), \theta_d(t)\big), \qquad x(0) = x_0,

where L represents the loss brought by each state and action during training, \Phi represents the termination loss brought by the last state, and the joint distribution of the initial values (x_0, y_0) obeys \mu. The training goal is to find the optimal (\theta_z^{*}, \theta_d^{*}) such that

J(\theta_z^{*}, \theta_d) \le J(\theta_z^{*}, \theta_d^{*}) \le J(\theta_z, \theta_d^{*})

for all admissible \theta_z and \theta_d.
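A minimal numerical sketch of evaluating such a loss for one initial state, obtained by discretizing the dynamics with an Euler scheme, is given below; the scheme and all function names are illustrative assumptions rather than the method prescribed by the disclosure.

# Illustrative evaluation of J(theta_z, theta_d) for a single initial state x0 (assumed names).
def game_loss(x0, theta_z, theta_d, f, L, Phi, t_grid):
    """theta_z(t), theta_d(t) are the parameter functions of time; f is the dynamics, L the running
    loss, Phi the termination loss, t_grid the discretized time points (Euler scheme, illustrative only)."""
    x, total = x0, 0.0
    for k in range(len(t_grid) - 1):
        dt = t_grid[k + 1] - t_grid[k]
        total += L(x, theta_z(t_grid[k]), theta_d(t_grid[k])) * dt   # accumulate the running loss L
        x = x + dt * f(x, theta_z(t_grid[k]), theta_d(t_grid[k]))    # step the dynamics x' = f(x, theta_z, theta_d)
    return total + Phi(x)  # add the termination loss at the final state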
Step 3022, in response to the candidate parameters meeting the preset conditions, determining the candidate parameters as final parameters of the model, and outputting the final model after training.
In this embodiment, the preset condition may include a first sub-condition and a second sub-condition. The executing entity may judge the candidate parameters of the model according to the first sub-condition and the second sub-condition, and, when the candidate parameters of the model satisfy both the first sub-condition and the second sub-condition, determine the candidate parameters as the final parameters of the model and output the trained final model, thereby completing the training of the model. The first sub-condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle, and the second sub-condition is that the loss value corresponding to the set of optimal parameters of the two adversarial parties equals the unique solution of the first equation, where the solution of the first equation characterizes and solves the saddle point problem arising in model training and the first equation is an HJI equation. The first sub-condition and the second sub-condition may each be a single defining condition on the set of optimal parameters or a combination of several such conditions.
Methods for the optimal control of neural networks generally include the mean-field optimal control method, which uses mean-field optimal control to give sufficient conditions for the convergence of the neural network and to analyze the generalization error, and the method of successive approximation, which, based on the conditions for neural network convergence, makes the different parameters satisfy the convergence conditions one by one by solving partial differential equations.
Based on the above scenario of the optimal control of neural networks, the setting of the first sub-condition among the preset conditions is further exemplified. First, define the Hamiltonian H := -L + \psi^{T} f. Based on the generalization of the traditional two-player extremum principle to the mean-field sense, for the model in this example the preset condition includes: assume that f is bounded, that f and L are continuous with respect to \theta_z and \theta_d, that f, L and \Phi are continuously differentiable with respect to x, and that \mu has compact support. Let (\theta_z^{*}, \theta_d^{*}) be the optimal parameters of the model and x^{*}(t) the corresponding optimal state trajectory; then there exists a co-state \psi^{*} such that

\dot{x}^{*}(t) = f\big(x^{*}(t), \theta_z^{*}(t), \theta_d^{*}(t)\big), \qquad x^{*}(0) = x_0,

\dot{\psi}^{*}(t) = -\nabla_x H\big(x^{*}(t), \theta_z^{*}(t), \theta_d^{*}(t), \psi^{*}(t)\big), \qquad \psi^{*}(t_f) = -\nabla_x \Phi\big(x^{*}(t_f)\big),

and, for almost every t,

\mathbb{E}_{\mu}\big[H\big(x^{*}(t), \theta_z(t), \theta_d^{*}(t), \psi^{*}(t)\big)\big] \le \mathbb{E}_{\mu}\big[H\big(x^{*}(t), \theta_z^{*}(t), \theta_d^{*}(t), \psi^{*}(t)\big)\big] \le \mathbb{E}_{\mu}\big[H\big(x^{*}(t), \theta_z^{*}(t), \theta_d(t), \psi^{*}(t)\big)\big],

wherein H\big(x(t), \theta_z(t), \theta_d(t), \psi(t)\big) := -L\big(x(t), \theta_z(t), \theta_d(t)\big) + \psi^{T}(t) f\big(x(t), \theta_z(t), \theta_d(t)\big).
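For illustration only, the state and co-state trajectories appearing in this first sub-condition could be approximated by a forward-backward Euler sweep such as the following; the discretization and all function names are assumptions, not part of the disclosure.

# Illustrative forward pass for x(t) and backward pass for the co-state psi(t) (assumed names).
def costate_sweep(x0, theta_z, theta_d, f, grad_x_H, grad_x_Phi, t_grid):
    """Euler scheme: forward integrate x' = f, then backward integrate psi' = -grad_x H
    from the terminal condition psi(t_f) = -grad_x Phi(x(t_f))."""
    xs = [x0]
    for k in range(len(t_grid) - 1):
        dt = t_grid[k + 1] - t_grid[k]
        xs.append(xs[-1] + dt * f(xs[-1], theta_z(t_grid[k]), theta_d(t_grid[k])))
    psi = -grad_x_Phi(xs[-1])  # terminal condition of the co-state
    psis = [psi]
    for k in reversed(range(len(t_grid) - 1)):
        dt = t_grid[k + 1] - t_grid[k]
        # Backward Euler step for psi' = -grad_x H, evaluated at the right endpoint of the interval.
        psi = psi + dt * grad_x_H(xs[k + 1], theta_z(t_grid[k + 1]), theta_d(t_grid[k + 1]), psi)
        psis.append(psi)
    return xs, list(reversed(psis))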
Furthermore, any tail sub-strategy of the optimal strategy is also optimal; that is, the strategy from an intermediate time to the terminal time is optimal for the control problem from that intermediate time to the terminal time. This is the dynamic programming principle.
Based on this principle, for the setting of the second sub-condition among the preset conditions, consider the control problem starting from an arbitrary intermediate time s up to the terminal time t_f, and its corresponding optimal loss function v(s, \mu). Assume that f, L and \Phi are bounded, that f, L and \Phi are Lipschitz continuous with respect to x, and that the Lipschitz constants of f and L are independent of \theta_z and \theta_d. Let (\theta_z^{*}, \theta_d^{*}) be the optimal parameters of the model and v^{*}(s, \mu) the corresponding optimal loss function of the game started at time s from the distribution \mu; then v^{*} is the unique solution of the mean-field HJI equation

\frac{\partial v}{\partial s}(s, \mu) + \min_{\theta_z} \max_{\theta_d} \mathbb{E}_{x \sim \mu} \Big[ \partial_{\mu} v(s, \mu)(x) \cdot f\big(x, \theta_z, \theta_d\big) + L\big(x, \theta_z, \theta_d\big) \Big] = 0,

v(t_f, \mu) = \mathbb{E}_{x \sim \mu}\big[\Phi(x)\big].
wherein the second condition represents the generalization of the conventional HJI equation in the mean-field sense.
The above example can provide a corresponding optimal solution for a model started from an arbitrary initial value, so that the quantitative differential game is studied in the mean-field sense; this differs from the traditional quantitative differential game problem, which only studies the optimality conditions of a dynamical system with a fixed initial value.
The preset condition, the first sub-condition and the second sub-condition are not limited to the above-mentioned examples.
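Purely as a sketch of how such a stopping rule might be implemented, the check below tests the first sub-condition through a Hamiltonian saddle residual and the second sub-condition through the gap between the achieved loss and a value returned by a mean-field HJI solver; saddle_residual, achieved_loss, solve_mean_field_hji and the tolerance are hypothetical placeholders, not components defined by the disclosure.

# Illustrative combined check of the two sub-conditions (all callables are hypothetical placeholders).
def meets_preset_condition(candidate_params, saddle_residual, achieved_loss,
                           solve_mean_field_hji, mu, tol=1e-4):
    # First sub-condition: the candidate parameters (approximately) satisfy the two-player
    # extremum principle, measured here by a small saddle residual of the Hamiltonian.
    first_ok = saddle_residual(candidate_params) < tol
    # Second sub-condition: the loss value of the candidate parameters equals (here: within
    # tolerance) the unique solution of the mean-field HJI equation.
    second_ok = abs(achieved_loss(candidate_params) - solve_mean_field_hji(mu)) < tol
    return first_ok and second_ok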
Step 303, in response to the candidate parameters not meeting the preset condition, repeatedly executing the training steps.
In the present embodiment, when the executing entity determines that the candidate parameters do not satisfy the preset condition (i.e., the candidate parameters do not simultaneously satisfy the first sub-condition and the second sub-condition), the process jumps to step 302, and step 302 is executed again.
In some optional implementations of this embodiment, the first equation may be a Hamilton-Jacobi-Bellman equation, i.e. an HJB equation, which realizes another training method for the adversarial reinforcement learning model based on optimal control theory.
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 1, the schematic diagram 300 of the method for training the model in this embodiment converts the model parameters of the prior art from parameters at each time step into functions of time in the continuous sense, and the optimal control problem is to find the set of optimal parameters of the two adversarial parties that minimizes the value of the loss function; that is, the optimization problem over discrete time steps is replaced by a mean-field optimal control problem for a continuous dynamical system. This addresses the problem that the prior art can only use a single dynamical system for optimal-control-based reinforcement learning and cannot explain the saddle point problem introduced by the adversarial neural network; the optimality of the model is achieved by satisfying, during modeling, the two-player extremum principle and the HJI equation in the mean-field sense, thereby realizing a brand-new training method for adversarial reinforcement learning models based on optimal control theory. Meanwhile, the problem that, in the prior art, a large number of parameters need to be learned because each time step contains a deep neural network is alleviated, the training process of the model is simplified, and the training efficiency of the model is improved. This also improves the performance of existing adversarial reinforcement learning techniques in specific application scenarios (e.g., autonomous driving). The model is optimized based on the preset condition, and a series of problems in scenarios such as autonomous driving can be further optimized.
With further reference to fig. 4, a schematic diagram 400 of a first embodiment of a method for generating information according to the present disclosure is presented. The method for generating information comprises the following steps:
step 401, acquiring status information of each time point of the object.
In this embodiment, the execution subject (e.g., a server or a terminal device) may obtain status information of each time point of the object from other electronic devices or locally by means of wired connection or wireless connection, where each time point is each time point of the object in a continuous state.
Step 402, the state information of the object at each time point is input to the information generation model trained in advance, and the motion information of the object is generated.
In this embodiment, the executing entity may input the state information of the object at each time point acquired in step 401 to a pre-trained information generation model and generate the action information of the object by model prediction. The information generation model is trained by the method of any one of the embodiments of the method for training a model described above.
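As a minimal illustration (the method name predict is an assumption about the model interface, not part of the disclosure), the generation step can be sketched as:

# Illustrative wrapper for the generation step (assumed interface).
def generate_action_info(state_info_per_time_point, information_generation_model):
    """Feed the state information of the object at each time point to the trained model
    and return the predicted action information of the object."""
    return information_generation_model.predict(state_info_per_time_point)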
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 1, the flow 400 of the method for generating information in the present embodiment highlights the step of generating the motion information of the object using the trained information generation model. Therefore, the scheme described by the embodiment can realize more efficient and accurate information prediction.
With further reference to fig. 5, as an implementation of the method shown in fig. 1 to 3, the present disclosure provides an embodiment of an apparatus for training a model, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 1, and besides the features described below, the embodiment of the apparatus may further include the same or corresponding features as the embodiment of the method shown in fig. 1, and produce the same or corresponding effects as the embodiment of the method shown in fig. 1, and the apparatus may be applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for training a model of the present embodiment includes: an acquisition unit 501, a training unit 502 and a repeating unit 503. The acquisition unit is configured to acquire a training sample set, wherein the training samples in the training sample set comprise state information of an object at each time point and action information of the object; the training unit is configured to perform the following training steps: taking the state information of the object at each time point, which is included in a training sample, as model input and the action information of the object corresponding to the input state information at each time point as model output, and solving for candidate parameters of the model based on a loss function and an objective function, wherein the loss function is constructed based on an adversarial learning algorithm, and the candidate parameters are the set of optimal parameters of the two adversarial parties obtained by using the objective function; and in response to the candidate parameters meeting a preset condition, determining the candidate parameters as the final parameters of the model and outputting the trained final model, wherein the preset condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle; and the repeating unit is configured to repeat the training steps in response to the candidate parameters not meeting the preset condition.
In this embodiment, for the specific processing of the acquisition unit 501, the training unit 502 and the repeating unit 503 of the apparatus 500 for training a model and the technical effects thereof, reference may be made to the related descriptions of steps 101 to 103 in the embodiment corresponding to fig. 1, which are not repeated here.
In some optional implementations of this embodiment, each time point is a time point of the object in a continuous state, and the model is used to solve the optimal control problem of the object in the continuous state; and the loss function is constructed based on a combination of an adversarial learning algorithm and a reinforcement learning algorithm.
In some optional implementations of this embodiment, the training unit is further configured to determine the candidate parameters as the final parameters of the model in response to the candidate parameters meeting a preset condition, and to output the trained final model, wherein the preset condition is that the loss value corresponding to the set of optimal parameters of the two adversarial parties equals the unique solution of a first equation, and the solution of the first equation is used to characterize and solve a problem arising in training the model.
In some optional implementations of this embodiment, the preset condition includes a first sub-condition and a second sub-condition; the training unit is further configured to, in response to the candidate parameters meeting the first sub-condition and the second sub-condition at the same time, determine the candidate parameters as the final parameters of the model and output the trained final model, wherein the first sub-condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle, and the second sub-condition is that the loss value corresponding to the set of optimal parameters of the two adversarial parties equals the unique solution of the first equation.
In some alternative implementations of the present embodiment, the problem in the training unit is a saddle point problem.
In some optional implementations of this embodiment, the first equation in the training unit is constructed based on nonlinear control theory; the first equation is a Hamilton-Jacobi-Isaacs equation or a Hamilton-Jacobi-Bellman equation.
With continuing reference to fig. 6, as an implementation of the method shown in fig. 4 described above, the present disclosure provides an embodiment of an apparatus for generating information, the apparatus embodiment corresponds to the method embodiment shown in fig. 4, and in addition to the features described below, the apparatus embodiment may further include the same or corresponding features as the method embodiment shown in fig. 4, and produce the same or corresponding effects as the method embodiment shown in fig. 4, and the apparatus may be applied to various electronic devices.
As shown in fig. 6, the apparatus 600 for generating information of the present embodiment includes: an information acquisition unit 601 and an information generation unit 602, wherein the information acquisition unit is configured to acquire state information of each time point of the object, each time point being each time point of the object in a continuous state; and an information generating unit configured to input the state information of the object at each time point to a pre-trained information generating model, and generate the motion information of the object, wherein the information generating model is obtained by training according to the method of any one embodiment of the methods for training the model.
In this embodiment, specific processes of the information obtaining unit 601 and the information generating unit 602 of the apparatus 600 for generating information and technical effects brought by the processes may respectively refer to the related descriptions of step 401 to step 402 in the embodiment corresponding to fig. 4, and are not described herein again.
The present disclosure also provides an electronic device and a readable storage medium according to an embodiment of the present disclosure.
As shown in fig. 7, a block diagram of an electronic device for a method of training a model according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic apparatus includes: one or more processors 701, a memory 702, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 7, one processor 701 is taken as an example.
The memory 702 is a non-transitory computer readable storage medium provided by the present disclosure. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for training a model provided by the present disclosure. A non-transitory computer readable storage medium of the present disclosure stores computer instructions for causing a computer to perform the method for training a model provided by the present disclosure.
The memory 702, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method for training a model in the embodiments of the present disclosure (e.g., the obtaining unit 501, the training unit 502, and the repeating unit 503 shown in fig. 5). The processor 701 executes various functional applications of the server and data processing by executing non-transitory software programs, instructions, and modules stored in the memory 702, that is, implements the method for training the model in the above method embodiment.
The memory 702 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the electronic device for training the model, and the like. Further, the memory 702 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 702 may optionally include memory located remotely from processor 701, which may be connected to an electronic device for training models via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the method of training a model may further comprise: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or other means, and fig. 7 illustrates an example of a connection by a bus.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic apparatus used to train the model, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer, one or more mouse buttons, a track ball, a joystick, or other input device. The output devices 704 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiments of the present disclosure, a training sample set is obtained, wherein the training samples in the training sample set comprise state information of an object at each time point and action information of the object, and the following training steps are executed: taking the state information of the object at each time point, which is included in a training sample, as model input and the action information of the object corresponding to the input state information at each time point as model output, and solving for candidate parameters of the model based on a loss function and an objective function, wherein the loss function is constructed based on an adversarial learning algorithm and the candidate parameters are the set of optimal parameters of the two adversarial parties obtained by using the objective function; in response to the candidate parameters meeting a preset condition, determining the candidate parameters as the final parameters of the model and outputting the trained final model, wherein the preset condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle; and in response to the candidate parameters not meeting the preset condition, repeating the training steps. By judging necessary and sufficient conditions for the optimality of the model, the training of reinforcement learning is guided in reverse, and the two-player extremum principle of optimal control theory is thereby generalized to model training, so that a brand-new method and apparatus for training a model are realized.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (12)

1. A method for training a model, comprising:
acquiring a training sample set, wherein the training samples in the training sample set comprise state information of an object at each time point and action information of the object;
performing the following training steps: taking the state information of the object at each time point, which is included in a training sample, as model input and the action information of the object corresponding to the input state information at each time point as model output, and solving for candidate parameters of the model based on a loss function and an objective function, wherein the loss function is constructed based on an adversarial learning algorithm, and the candidate parameters are the set of optimal parameters of the two adversarial parties obtained by using the objective function; and in response to the candidate parameters meeting a preset condition, determining the candidate parameters as the final parameters of the model and outputting the trained final model, wherein the preset condition is used to characterize that the set of optimal parameters of the two adversarial parties is set based on the two-player extremum principle;
and in response to the candidate parameters not meeting the preset condition, repeatedly executing the training steps.
2. The method according to claim 1, wherein the time points are time points at which the object is in a continuous state, the model is used for solving an optimal control problem of the object in the continuous state, and the loss function is constructed based on a combination of an adversarial learning algorithm and a reinforcement learning algorithm.
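One common way to realize the combination of an adversarial term and a reinforcement-learning term described in claim 2 (in the spirit of robust adversarial reinforcement learning, not necessarily the formulation of this disclosure) is to let a policy minimize the accumulated cost of a rolled-out trajectory while an adversary injects worst-case disturbances; a hedged sketch with toy dynamics `f` and hypothetical cost weights follows.

```python
# Illustrative only: a loss combining a reinforcement-learning objective
# (accumulated cost along a rolled-out trajectory in a continuous state space)
# with an adversarial disturbance, in the spirit of robust adversarial RL.
import torch
import torch.nn as nn

policy = nn.Linear(2, 1)     # action from state (minimizing party)
adversary = nn.Linear(2, 1)  # disturbance from state (maximizing party)

def f(x, u, d, dt=0.05):
    """Toy double-integrator dynamics, discretized with one Euler step."""
    A = torch.tensor([[0.0, 1.0], [0.0, 0.0]])
    B = torch.tensor([[0.0], [1.0]])
    return x + dt * (x @ A.T + u @ B.T + d @ B.T)

def rollout_loss(x0, horizon=50, dt=0.05):
    """Accumulated cost: the policy pays for state error and control effort,
    the adversary is rewarded for disturbance but pays a quadratic penalty,
    giving a single min-max (game-theoretic) objective."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        u, d = policy(x), adversary(x)
        cost = cost + dt * (x.pow(2).sum(-1) + 0.1 * u.pow(2).sum(-1)
                            - 1.0 * d.pow(2).sum(-1)).mean()
        x = f(x, u, d, dt)
    return cost  # minimized over policy parameters, maximized over adversary parameters
```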
3. The method according to claim 1 or 2, wherein the determining the candidate parameters as final parameters of the model in response to the candidate parameters satisfying a preset condition, and outputting the trained final model, comprises:
in response to the candidate parameters meeting a preset condition, determining the candidate parameters as final parameters of the model, and outputting the trained final model, wherein the preset condition is: the loss value corresponding to the set of optimal parameters of the two adversarial parties is equal to the unique solution of a first equation, and the solution of the first equation is used to characterize and solve the problem present in the model training.
4. The method of claim 3, wherein the preset condition comprises a first sub-condition and a second sub-condition; and the determining the candidate parameters as final parameters of the model in response to the candidate parameters meeting the preset condition, and outputting the trained final model, comprises:
in response to the candidate parameters meeting both the first sub-condition and the second sub-condition, determining the candidate parameters as final parameters of the model, and outputting the trained final model, wherein the first sub-condition characterizes that the set of optimal parameters of the two adversarial parties is set based on the two-party extremum principle, and the second sub-condition is: the loss value corresponding to the set of optimal parameters of the two adversarial parties is equal to the unique solution of the first equation.
5. The method of claim 3, wherein the problem is a saddle point problem.
6. The method according to claim 3 or 4, wherein the first equation is constructed based on nonlinear control theory, and the first equation is a Hamilton-Jacobi-Isaacs equation or a Hamilton-Jacobi-Bellman equation.
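For orientation only, the standard textbook forms of the two equations named in claim 6, as written in nonlinear optimal control and zero-sum differential game theory for a value function V(x,t), dynamics \dot{x} = f(x,u,d), running cost L, and terminal cost \Phi, are reproduced below; these are the conventional forms, not necessarily the exact "first equation" of the disclosure.

Hamilton-Jacobi-Isaacs:
\[
\frac{\partial V}{\partial t}(x,t) + \min_{u}\max_{d}\Big[\nabla_x V(x,t)^{\top} f(x,u,d) + L(x,u,d)\Big] = 0,
\qquad V(x,T) = \Phi(x).
\]

Hamilton-Jacobi-Bellman (the single-player special case, with no maximizing party):
\[
\frac{\partial V}{\partial t}(x,t) + \min_{u}\Big[\nabla_x V(x,t)^{\top} f(x,u) + L(x,u)\Big] = 0,
\qquad V(x,T) = \Phi(x).
\]

When the Isaacs condition holds, the min and max in the first equation commute, which is exactly the saddle-point (two-party extremum) situation referenced in claims 4 and 5.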
7. A method for generating information, comprising:
acquiring state information of an object at each time point, wherein the time points are time points at which the object is in a continuous state;
inputting the state information of the object at each time point into a pre-trained information generation model to generate action information of the object, wherein the information generation model is trained by the method according to any one of claims 1 to 6.
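At inference time, claim 7 amounts to a single forward pass of the trained model over the per-time-point state information; a minimal sketch, in which `generation_model` and the tensor shapes are hypothetical stand-ins, is:

```python
# Illustrative only: generating action information from state information at
# each time point with a trained information generation model.
import torch
import torch.nn as nn

generation_model = nn.Linear(4, 2)  # stand-in for the pre-trained model
generation_model.eval()

# state information of the object at each of 100 time points (100 x state_dim)
states = torch.randn(100, 4)

with torch.no_grad():
    actions = generation_model(states)  # action information, one row per time point
print(actions.shape)  # torch.Size([100, 2])
```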
8. An apparatus for training a model, comprising:
an acquisition unit configured to acquire a training sample set, wherein training samples in the training sample set comprise state information of an object at each time point and action information of the object;
a training unit configured to perform the following training steps: taking the state information of the object at each time point included in the training sample as model input, taking the action information of the object corresponding to the input state information as model output, and solving, based on a loss function and an objective function, to obtain candidate parameters of the model, wherein the loss function is constructed based on an adversarial learning algorithm, and the candidate parameters are a set of optimal parameters of the two adversarial parties obtained by using the objective function; and in response to the candidate parameters meeting a preset condition, determining the candidate parameters as final parameters of the model, and outputting the trained final model, wherein the preset condition characterizes that the set of optimal parameters of the two adversarial parties is set based on the two-party extremum principle; and
a repeating unit configured to repeatedly execute the training steps in response to the candidate parameters not meeting the preset condition.
9. The apparatus of claim 8, wherein the training unit is further configured to determine the candidate parameters as final parameters of the model in response to the candidate parameters satisfying a preset condition, and to output the trained final model, wherein the preset condition is: the loss value corresponding to the set of optimal parameters of the two adversarial parties is equal to the unique solution of a first equation, and the solution of the first equation is used to characterize and solve the problem present in the model training.
10. An apparatus for generating information, comprising:
an information acquisition unit configured to acquire state information of an object at each time point, wherein the time points are time points at which the object is in a continuous state; and
an information generation unit configured to input the state information of the object at each time point into a pre-trained information generation model to generate action information of the object, wherein the information generation model is trained by the method according to any one of claims 1 to 6.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6 or the method of claim 7.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-6 or the method of claim 7.
CN202111280587.2A 2021-10-29 2021-10-29 Method, apparatus, device and storage medium for training a model Pending CN114021635A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111280587.2A CN114021635A (en) 2021-10-29 2021-10-29 Method, apparatus, device and storage medium for training a model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111280587.2A CN114021635A (en) 2021-10-29 2021-10-29 Method, apparatus, device and storage medium for training a model

Publications (1)

Publication Number Publication Date
CN114021635A true CN114021635A (en) 2022-02-08

Family

ID=80059367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111280587.2A Pending CN114021635A (en) 2021-10-29 2021-10-29 Method, apparatus, device and storage medium for training a model

Country Status (1)

Country Link
CN (1) CN114021635A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109131348A (en) * 2018-07-24 2019-01-04 大连理工大学 A kind of intelligent vehicle Driving Decision-making method based on production confrontation network
WO2020183609A1 (en) * 2019-03-12 2020-09-17 三菱電機株式会社 Moving body control device and moving body control method
CN111860850A (en) * 2019-04-28 2020-10-30 第四范式(北京)技术有限公司 Model training method, information processing method and device and electronic equipment
CN111461226A (en) * 2020-04-01 2020-07-28 深圳前海微众银行股份有限公司 Countermeasure sample generation method, device, terminal and readable storage medium
CN113780548A (en) * 2021-01-21 2021-12-10 北京沃东天骏信息技术有限公司 Method, apparatus, device and storage medium for training a model
CN112884131A (en) * 2021-03-16 2021-06-01 浙江工业大学 Deep reinforcement learning strategy optimization defense method and device based on simulation learning
US20220335711A1 (en) * 2021-07-29 2022-10-20 Beijing Baidu Netcom Science Technology Co., Ltd. Method for generating pre-trained model, electronic device and storage medium
CN115204387A (en) * 2022-07-21 2022-10-18 法奥意威(苏州)机器人系统有限公司 Learning method and device under layered target condition and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JACKSON, LUCY et al.: "HARL-A: Hardware Agnostic Reinforcement Learning Through Adversarial Selection", 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 16 December 2021 (2021-12-16) *
林嘉豪; 章宗长; 姜冲; 郝建业: "A Survey of Imitation Learning Based on Generative Adversarial Networks" (基于生成对抗网络的模仿学习综述), Chinese Journal of Computers (计算机学报), no. 02, 30 September 2019 (2019-09-30) *
蒋胤傑; 况琨; 吴飞: "Big Data Intelligence: From Optimal Solutions of Data Fitting to Game-Theoretic Adversarial Equilibrium Solutions" (大数据智能: 从数据拟合最优解到博弈对抗均衡解), CAAI Transactions on Intelligent Systems (智能系统学报), no. 01, 31 March 2020 (2020-03-31) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052412A (en) * 2022-11-23 2023-05-02 兰州大学 Automatic driving vehicle control method integrating physical information and deep reinforcement learning
CN116052412B (en) * 2022-11-23 2023-08-18 兰州大学 Automatic driving vehicle control method integrating physical information and deep reinforcement learning

Similar Documents

Publication Publication Date Title
Kim et al. Emi: Exploration with mutual information
JP6926203B2 (en) Reinforcement learning with auxiliary tasks
CN110795569B (en) Method, device and equipment for generating vector representation of knowledge graph
CN111639710A (en) Image recognition model training method, device, equipment and storage medium
CN111695699B (en) Method, apparatus, electronic device, and readable storage medium for model distillation
CN110337016B (en) Short video personalized recommendation method and system based on multimodal graph convolution network, readable storage medium and computer equipment
CN111695698B (en) Method, apparatus, electronic device, and readable storage medium for model distillation
CN112001180A (en) Multi-mode pre-training model acquisition method and device, electronic equipment and storage medium
CN111832701B (en) Model distillation method, model distillation device, electronic equipment and storage medium
CN111275190A (en) Neural network model compression method and device, image processing method and processor
CN111798114A (en) Model training and order processing method, device, equipment and storage medium
CN111860769A (en) Method and device for pre-training neural network
CN111931520B (en) Training method and device of natural language processing model
CN113409898B (en) Molecular structure acquisition method and device, electronic equipment and storage medium
CN109460813B (en) Acceleration method, device and equipment for convolutional neural network calculation and storage medium
JP2022165395A (en) Method for optimizing neural network model and method for providing graphical user interface for neural network model
CN114021635A (en) Method, apparatus, device and storage medium for training a model
CN111783949A (en) Deep neural network training method and device based on transfer learning
CN112580723B (en) Multi-model fusion method, device, electronic equipment and storage medium
CN114648103A (en) Automatic multi-objective hardware optimization for processing deep learning networks
CN112686381A (en) Neural network model, method, electronic device, and readable medium
KR20220003444A (en) Optimizer learning method and apparatus, electronic device and readable storage medium
CN112819497B (en) Conversion rate prediction method, conversion rate prediction device, conversion rate prediction apparatus, and storage medium
Wang et al. Inverse reinforcement learning with graph neural networks for iot resource allocation
CN112669861B (en) Audio data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination