CN113721456A - Control model training method and device, computer equipment and storage medium - Google Patents

Control model training method and device, computer equipment and storage medium

Info

Publication number
CN113721456A
Authority
CN
China
Prior art keywords
sample data
control model
model
data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110237069.6A
Other languages
Chinese (zh)
Other versions
CN113721456B (en)
Inventor
张玥
詹仙园
霍雨森
朱翔宇
殷宏磊
郑宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong City Beijing Digital Technology Co Ltd
Original Assignee
Jingdong City Beijing Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong City Beijing Digital Technology Co Ltd filed Critical Jingdong City Beijing Digital Technology Co Ltd
Priority to CN202110237069.6A priority Critical patent/CN113721456B/en
Publication of CN113721456A publication Critical patent/CN113721456A/en
Application granted granted Critical
Publication of CN113721456B publication Critical patent/CN113721456B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The application provides a control model training method and apparatus, a computer device and a storage medium. The method includes: obtaining a sample data set, where the sample data set includes a plurality of offline sample data; training an initial control model with the plurality of offline sample data to obtain a first control model; generating, by combining the first control model and an initial value model, a plurality of confrontation sample data respectively corresponding to the plurality of offline sample data; and training the first control model with the plurality of offline sample data and the plurality of confrontation sample data to obtain a target control model. With the method and the apparatus, the robustness of the trained target control model can be effectively improved, so that the target control model can be effectively applied to real application scenes, and the control accuracy and control effect of the target control model are effectively improved.

Description

Control model training method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of control technologies, and in particular, to a method and an apparatus for training a control model, a computer device, and a storage medium.
Background
The control field is one of the origins of the reinforcement learning idea and the field in which reinforcement learning technology is most maturely applied; in the control field, artificial intelligence technology can be adopted to tune machine equipment. For example, reinforcement learning techniques have been employed to help search engines reduce energy consumption in their data centers; driving control in the field of automatic driving is a sequential decision-making process and a typical application direction of reinforcement learning; other application directions include the control of various aircraft and unmanned underwater vehicles.
The robustness of a control model refers to its ability to maintain certain performance under uncertain disturbances; the system is considered robust if the control model can maintain this performance when disturbed. The reinforcement learning algorithm used by the control model relies on dynamic characteristics describing the process, and these are typically not fully captured by the model because of errors in measuring the features or changes in the features themselves over time. For example, in the field of robot control, since training a model directly in a real environment is extremely costly, training is generally performed in a purpose-built simulation environment; but it is difficult for the simulation environment to accurately reflect the mechanical damping, equipment aging and wear levels of the actual system. Therefore, in practical applications involving complex robot and mechanical-arm scenes, the robustness of the control model must be improved so that it can adapt to the change from the simulation environment to the real environment.
In the related art, the control model is usually trained with an offline data set. Trained in this way, the control model is limited by a fixed data set and is sensitive to random disturbance, so it cannot adapt to a real application scene; this impairs the practical effect of the reinforcement learning technology in the control model and affects the control accuracy of the control model.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present application is to provide a training method, an apparatus, a computer device, and a storage medium for a control model, in which a plurality of offline sample data are used to generate a plurality of corresponding countermeasure sample data, and the plurality of offline sample data and the plurality of countermeasure sample data are combined to train to obtain a target control model, so that robustness of the target control model obtained by training is effectively improved, the target control model can be effectively applied to a real application scenario, and control accuracy and control effect of the target control model are effectively improved.
In order to achieve the above object, a method for training a control model according to an embodiment of a first aspect of the present application includes: obtaining a sample data set, wherein the sample data set comprises: a plurality of offline sample data; training an initial control model by adopting the plurality of offline sample data to obtain a first control model; combining the first control model and the initial value model to respectively generate a plurality of confrontation sample data respectively corresponding to the plurality of offline sample data; and training the first control model by adopting the plurality of offline sample data and the plurality of confrontation sample data to obtain a target control model.
In the training method for a control model provided in the embodiment of the first aspect of the present application, a sample data set is obtained, where the sample data set includes a plurality of offline sample data; an initial control model is trained with the plurality of offline sample data to obtain a first control model; a plurality of countermeasure sample data respectively corresponding to the plurality of offline sample data are generated by combining the first control model and an initial value model; and the first control model is trained with the plurality of offline sample data and the plurality of countermeasure sample data to obtain a target control model. In this way, the plurality of offline sample data are used to generate the corresponding countermeasure sample data, and the target control model is obtained by training with the offline sample data and the countermeasure sample data combined, so that the robustness of the trained target control model is effectively improved, the target control model can be effectively applied to a real application scene, and the control accuracy and the control effect of the target control model are effectively improved.
In order to achieve the above object, an embodiment of the second aspect of the present application provides a training apparatus for controlling a model, including: an obtaining module, configured to obtain a sample data set, where the sample data set includes: a plurality of offline sample data; the first training module is used for training an initial control model by adopting the plurality of offline sample data to obtain a first control model; the generating module is used for respectively generating a plurality of confrontation sample data corresponding to the plurality of offline sample data by combining the first control model and the initial value model; and the second training module is used for training the first control model by adopting the plurality of off-line sample data and the plurality of confrontation sample data to obtain a target control model.
The training apparatus for a control model provided in the embodiment of the second aspect of the present application obtains a sample data set, where the sample data set includes a plurality of offline sample data; trains an initial control model with the plurality of offline sample data to obtain a first control model; generates a plurality of countermeasure sample data respectively corresponding to the plurality of offline sample data by combining the first control model and an initial value model; and trains the first control model with the plurality of offline sample data and the plurality of countermeasure sample data to obtain a target control model. In this way, the plurality of offline sample data are used to generate the corresponding countermeasure sample data, and the target control model is obtained by training with the offline sample data and the countermeasure sample data combined, so that the robustness of the trained target control model is effectively improved, the target control model can be effectively applied to a real application scene, and the control accuracy and the control effect of the target control model are effectively improved.
An embodiment of the third aspect of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the method for training a control model as set forth in the embodiment of the first aspect of the present application is implemented.
An embodiment of a fourth aspect of the present application proposes a non-transitory computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method for training a control model as proposed in an embodiment of the first aspect of the present application.
An embodiment of a fifth aspect of the present application proposes a computer program product; when instructions in the computer program product are executed by a processor, the training method of a control model as proposed in the embodiment of the first aspect of the present application is performed.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a training method for a control model according to an embodiment of the present application;
fig. 2 is a flowchart illustrating a training method for a control model according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of an application in an embodiment of the present application;
FIG. 4 is a flow chart illustrating a method for training a control model according to another embodiment of the present application;
FIG. 5 is a schematic diagram of an embodiment of a training apparatus for controlling a model;
FIG. 6 is a schematic diagram of an exercise apparatus for controlling a model according to another embodiment of the present application;
FIG. 7 illustrates a block diagram of an exemplary computer device suitable for use to implement embodiments of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application. On the contrary, the embodiments of the application include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
Fig. 1 is a flowchart illustrating a training method for a control model according to an embodiment of the present application.
It should be noted that an execution subject of the training method for the control model in this embodiment is a training device for the control model, the device may be implemented in a software and/or hardware manner, the device may be configured in an electronic device, and the electronic device may include but is not limited to a terminal, a server, and the like.
As shown in fig. 1, the training method of the control model includes:
s101: acquiring a sample data set, wherein the sample data set comprises: a plurality of offline sample data.
The set of sample data used for training the control model may be referred to as a sample data set.
The control model may be any artificial intelligence model, and when the artificial intelligence model is used for control processing logic in the control field to make a control decision, the artificial intelligence model may be referred to as a control model.
The artificial intelligence model, such as a neural network model, or a machine learning model, etc., is not limited thereto.
It can be understood that, in a usage scenario of the control model, such as the intelligent robot control, state data in a real environment where the intelligent robot is located may be input to the control model, so as to obtain action data output by the control model, so as to control the intelligent robot to make a corresponding action based on the action data, and of course, the input and output of the control model may also be used in any other possible control scenario, which is not limited thereto.
The embodiment of the application provides a method for optimizing the robustness of a control model. A sample data set may first be obtained, where the sample data set includes a plurality of offline sample data; the offline sample data may be, for example, sample data acquired in a simulation control environment, or sample data acquired in a real control environment, which is not limited.
S102: and training an initial control model by adopting a plurality of off-line sample data to obtain a first control model.
The embodiment of the application first obtains a sample data set, where the sample data set includes a plurality of offline sample data, which may be, for example, environment state data and action data (which may be referred to as first action data) in a simulated control environment or a real control environment; the plurality of offline sample data are then used to train the initial control model.
The initial control model may be, for example, a neural network model, a machine learning model or a reinforcement learning model. The first action data may be regarded as labeled action data; when the initial control model is trained with the environment state data and the first action data, the environment state data may be input into the initial control model to obtain predicted action data output by the initial control model, training continuing until a loss value between the first action data and the predicted action data satisfies a set condition, at which point the trained control model is taken as the first control model.
In the embodiment of the present application, a separate value model may be adopted to evaluate the convergence timing during the process of training the control model. For example, after the predicted action data and the first action data are input to the value model, the value of the predicted action data and the value of the first action data output by the value model are compared; if the difference between the value of the predicted action data and the value of the first action data is less than a difference threshold, it is determined that the control model has converged, which is not limited.
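As a rough illustration of this step, the sketch below fits an initial control model to offline (state, action) pairs and uses a separate value model to judge convergence. The network sizes, threshold and function names are assumptions for illustration only, not part of the application:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))       # initial control model
q_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # initial value model
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def train_first_control_model(offline_states, offline_actions, value_gap=1e-2, max_epochs=1000):
    """offline_states: [N, state_dim], offline_actions: [N, action_dim] (the first action data)."""
    for _ in range(max_epochs):
        pred_actions = policy(offline_states)                          # predicted action data
        loss = nn.functional.mse_loss(pred_actions, offline_actions)   # loss between first and predicted actions
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                                          # value model judges the convergence timing
            q_pred = q_model(torch.cat([offline_states, policy(offline_states)], dim=-1)).mean()
            q_label = q_model(torch.cat([offline_states, offline_actions], dim=-1)).mean()
        if (q_label - q_pred).abs() < value_gap:                       # value gap below threshold -> converged
            break
    return policy                                                      # first control model
```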
The method comprises the steps of obtaining a sample data set, training an initial control model by adopting a plurality of offline sample data to obtain a first control model, and then processing the offline sample data by combining the value model and the first control model to obtain corresponding countermeasure sample data.
S103: and generating a plurality of confrontation sample data respectively corresponding to the plurality of offline sample data by combining the first control model and the initial value model.
The challenge sample data may be regarded as the sample data to which the noise disturbance is added.
That is to say, in the embodiment of the present application, it is considered that, with only the collected plurality of offline sample data, the performance of a control model trained with an off-policy reinforcement learning method may be greatly affected when sensor noise exists in the scene to which the control model is applied, or when the environment of the control model changes. Therefore, in the embodiment of the present application, noise disturbance is introduced to generate countermeasure sample data and thereby optimize the training process of the control model; the robustness of the control model is improved by optimizing its worst case under uncertain noise interference, and a countermeasure attack strategy is introduced to improve the control accuracy and the control effect of the control model.
In some embodiments, the first control model and the initial value model may be fused, the fused model is used as a generative model, and the offline sample data is input into the generative model, so as to obtain a plurality of confrontation sample data respectively corresponding to the plurality of offline sample data output by the generative model.
In other embodiments, any other possible manner may be adopted to generate the sample data after the noise disturbance is added, such as a mathematical method, an engineering method, and the like, which is not limited herein.
S104: and training the first control model by adopting a plurality of off-line sample data and a plurality of confrontation sample data to obtain the target control model.
After the plurality of offline sample data are obtained and the plurality of challenge sample data respectively corresponding to the plurality of offline sample data are generated, the first control model can be trained by adopting the plurality of offline sample data and the plurality of challenge sample data to obtain the target control model.
In some embodiments, the trained first control model may be re-optimized and trained again by using a plurality of offline sample data and a plurality of antagonistic sample data, so as to obtain the target control model, or the control processing logic of the trained first control model may be optimized by using a plurality of offline sample data and a plurality of antagonistic sample data, so as to obtain the target control model, which is not limited.
In the embodiment of the present application, as shown in fig. 2, training the first control model by using a plurality of offline sample data and a plurality of confrontation sample data to obtain the target control model includes:
s201: and training an initial value model by adopting a plurality of confrontation sample data and a first control model to obtain a target value model.
In the embodiment of the present application, the case where the initial control model and the value model are trained based on a reinforcement learning method is taken as an example. A Markov Decision Process (MDP) in reinforcement learning is defined by a quadruple M = (S, A, R, T), where S represents the state space of the environment, A represents the action space of the control object, R(s, a) represents the reward function whose return value is the reward obtained by executing action data a under environment state data s, and T(s' | s, a) is the state transition probability function representing the probability that the environment transitions to environment state data s' after action data a is executed under environment state data s. The goal of reinforcement learning is to find a mapping from environment states to actions, i.e., the strategy π(a | s), that maximizes the expectation of future rewards
E[ Σ_t γ^t R_t ]
where R_t is the reward at time t and γ is the discount factor. The control object selects the optimal action data a under the current environment state data s according to the strategy π, executes the action corresponding to the action data a, observes the reward r fed back by the environment and the next environment state data s', adjusts and updates the strategy π based on the reward fed back, and iterates continuously until an optimal strategy is found, so that positive feedback can be obtained to the maximum extent.
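For reference, the expectation above is taken over trajectories; a minimal sketch of the discounted return for one trajectory (illustrative only, not part of the application) is:

```python
# Illustrative sketch: discounted return of one trajectory in the MDP M = (S, A, R, T).
def discounted_return(rewards, gamma=0.99):
    """rewards: the sequence R_0, R_1, ... observed along a trajectory; returns sum_t gamma^t * R_t."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

print(discounted_return([1.0, 0.5, 0.5, 2.0], gamma=0.9))  # rewards observed while following pi(a|s)
```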
For example, referring to fig. 3, which is a schematic diagram of an application process in an embodiment of the present application in which the initial control model and the value model are trained based on a reinforcement learning method: a real sample pool may be constructed from the plurality of offline sample data in the sample data set, and the plurality of countermeasure sample data may be added into a countermeasure sample pool. In the process of training the first control model with the plurality of offline sample data, the plurality of countermeasure sample data and the target value model to obtain the target control model, offline sample data (s, a, r, c, s') is sampled from the real sample pool and countermeasure sample data (s, a_adv) is sampled from the countermeasure sample pool; the initial value model is then trained using the offline sample data (s, a, r, c, s') and the first control model to obtain the target value model, and the first control model is then trained using the offline sample data (s, a, r, c, s'), the countermeasure sample data (s, a_adv) and the target value model to obtain the target control model.
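The two-pool procedure of fig. 3 can be summarized by the following sketch; the pool interface and the two update helpers are hypothetical names used only to mirror the flow described above, not the claimed implementation:

```python
# Hypothetical outline of the fig. 3 flow: a real sample pool of offline samples (s, a, r, c, s')
# and a countermeasure sample pool of (s, a_adv) are sampled in each iteration.
def train_target_control_model(real_pool, adv_pool, policy, q_model, iterations=10000):
    for _ in range(iterations):
        batch = real_pool.sample()       # offline sample data (s, a, r, c, s')
        adv_batch = adv_pool.sample()    # countermeasure sample data (s, a_adv)
        q_model = update_value_model(q_model, batch, policy)               # initial -> target value model
        policy = update_control_model(policy, batch, adv_batch, q_model)   # first -> target control model
    return policy, q_model               # target control model and target value model
```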
S202: and training the first control model by adopting a plurality of offline sample data, a plurality of confrontation sample data and the target value model to obtain the target control model.
That is, the target value model is obtained by optimizing an initial value model (a value model corresponding to the first control model and used for determining the convergence timing of the first control model, which may be referred to as an initial value model) using a plurality of countermeasure sample data and the first control model, so that the first control model can be iteratively optimized based on the optimized target value model to obtain the target control model.
In some embodiments, the offline sample data (s, a, r, c, s'), the countermeasure sample data (s, a_adv) and the target value model are used to train the first control model to obtain the target control model. Specifically, a gradient value of the value function corresponding to the target value model is obtained (for example, the value function may calculate a value mapping between the first action data and the predicted action data output by the first control model, and the gradient value is, for example, the gradient value corresponding to that value function); the gradient corresponding to the strategy function of the first control model is then updated with this gradient value to obtain an updated control model, and the updated control model is iteratively trained with the plurality of offline sample data and the plurality of countermeasure sample data until it converges, at which point the trained control model is taken as the target control model.
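One way such a gradient-based update could look is a DDPG-style step; the sketch below shows the core of one policy update and is an assumption about the concrete form, not necessarily the exact claimed procedure:

```python
import torch

def policy_gradient_step(policy, states, q_model, policy_opt):
    """Update the strategy function of the control model along the gradient of the value function."""
    actions = policy(states)                                  # actions proposed by the current policy
    q_values = q_model(torch.cat([states, actions], dim=-1))  # their value under the target value model
    policy_loss = -q_values.mean()                            # maximizing value = minimizing its negative
    policy_opt.zero_grad()
    policy_loss.backward()                                    # value-function gradient flows into the policy
    policy_opt.step()
    return policy
```

Within the flow sketched after fig. 3, a step of this kind would be applied to both the offline states and the countermeasure states sampled in each iteration.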
Based on the embodiment illustrated in fig. 2, the initial value model is trained with the plurality of countermeasure sample data and the first control model to obtain the target value model, and the first control model is then trained with the plurality of offline sample data, the plurality of countermeasure sample data and the target value model to obtain the target control model. This effectively realizes bidirectional iterative training of the initial value model and the first control model with the plurality of offline sample data and the plurality of countermeasure sample data until both converge; the trained control model is taken as the target control model, and the trained target value model serves as the value model for judging the convergence timing of the target control model. The timing at which the target control model converges is thus determined effectively and accurately, and the control accuracy and control effect of the target control model are guaranteed in multiple dimensions.
In this embodiment, a sample data set is obtained, where the sample data set includes a plurality of offline sample data; an initial control model is trained with the plurality of offline sample data to obtain a first control model; a plurality of countermeasure sample data respectively corresponding to the plurality of offline sample data are generated by combining the first control model and an initial value model; and the first control model is trained with the plurality of offline sample data and the plurality of countermeasure sample data to obtain a target control model. In this way, the plurality of offline sample data are used to generate the corresponding countermeasure sample data, and the target control model is obtained by training with the offline sample data and the countermeasure sample data combined, so that the robustness of the trained target control model is effectively improved, the target control model can be effectively applied to a real application scene, and the control accuracy and the control effect of the target control model are effectively improved.
Fig. 4 is a flowchart illustrating a training method for a control model according to another embodiment of the present application.
As shown in fig. 4, the training method of the control model includes:
s401: acquiring a sample data set, wherein the sample data set comprises: a plurality of offline sample data.
S402: and training an initial control model by adopting a plurality of off-line sample data to obtain a first control model.
For an example of S401 to S402, reference may be made to the foregoing embodiments, and details are not described herein.
S403: random disturbance noise is added to the environmental state data to obtain noise state data.
For example, assuming offline sample data (s, a, r, c, s'), random disturbance noise may be added to the environment state data s, and the environment state data s after the noise is added is taken as the noise state data s_noised. The random disturbance noise may be random noise or gradient noise.
For example, a specific method for adding random disturbance noise may be as follows:
the method comprises the following steps: the noise with mathematical distribution such as gaussian noise or OU noise in reinforcement learning is used.
The method 2 comprises the following steps: with gradient noise, the gradient noise can be calculated as follows:
noise_grad = noise * grad_dir
where: noise ~ Gaussian(μ, σ);
grad_dir = ∇_s Q(s, U(s))
wherein Q is the initial value model, and U is the first control model.
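A sketch of how this gradient noise could be computed with automatic differentiation is shown below; treating grad_dir as the gradient of Q(s, U(s)) with respect to the state s, and the noise scale used, are assumptions based on the definitions above:

```python
import torch

def gradient_noise(state, policy, q_model, mu=0.0, sigma=0.01):
    """noise_grad = noise * grad_dir, with noise ~ Gaussian(mu, sigma) and
    grad_dir assumed to be the gradient of Q(s, U(s)) with respect to the state s."""
    s = state.clone().detach().requires_grad_(True)
    q = q_model(torch.cat([s, policy(s)], dim=-1)).sum()
    q.backward()                                        # populates s.grad with dQ/ds
    noise = torch.randn_like(s) * sigma + mu            # Gaussian(mu, sigma) noise
    return noise * s.grad                               # noise_grad

# s_noised = s + gradient_noise(s, policy, q_model)     # noise state data s_noised
```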
S404: and inputting the noise state data into the first control model to obtain second action data output by the first control model.
After the random disturbance noise is added to the environmental state data to obtain the noise state data, the noise state data may be input to the first control model to obtain the second action data output by the first control model.
The noise state data is input to the first control model, and the action data output by the first control model may be referred to as second action data; it can be regarded as the action data output by the control model after being disturbed by noise.
For example, the noise state data s_noised may be input into the first control model to obtain the second action data a_adv disturbed by noise.
S405: and inputting the first action data and the second action data into the initial value model to obtain a target result output by the initial value model.
The noise state data may be input to the first control model to obtain the second motion data output by the first control model, and the first motion data and the second motion data may be input to the initial value model to obtain the target result output by the initial value model.
In this embodiment, the first action data is the action data output when the first control model is not disturbed by noise, and the second action data is the action data output after the first control model is disturbed by noise. Therefore, the initial value model may be used to evaluate the second value Q(s, a_adv) of the disturbed second action data and the first value (optimal value) Q(s, a) of the undisturbed first action data; the above target result can be used to describe this change, so as to assist in determining the timing of generating countermeasure sample data.
S406: and if the target result meets the preset judgment condition, taking the environmental state data and the second action data as countermeasure sample data corresponding to the offline sample data.
In some embodiments, if the target result output by the initial value model is that the second value corresponding to the second action data is smaller than the first value corresponding to the first action data, it is determined that the target result satisfies the preset judgment condition.
For example, if the second value Q(s, a_adv) of the disturbed second action data is smaller than the first value (optimal value) Q(s, a) of the undisturbed first action data, the interference is considered successful, and (s, a_adv) is taken as the countermeasure sample data corresponding to the offline sample data and added into the countermeasure sample pool.
For example, when Q(s, a_adv) < Q(s, a), (s, a_adv) is added to the countermeasure sample pool.
In other embodiments, if the target result output by the initial value model is that the second value corresponding to the second action data is smaller than a preset value threshold, it is determined that the target result satisfies the preset judgment condition; the initial preset value threshold is the first value corresponding to the first action data, and after it is determined that the target result satisfies the preset judgment condition, the initial preset value threshold is updated according to the second value.
For example, an intermediate variable Q_minimum (which may be referred to as the preset value threshold) is cached. Q_minimum is initially set to the first value Q(s, a) of the first action data; when Q(s, a_adv) < Q_minimum, (s, a_adv) is taken as countermeasure sample data corresponding to the offline sample data, and Q_minimum is then set to the second value Q(s, a_adv) of the second action data. The process is repeated while the number n of countermeasure sample data in the countermeasure sample pool satisfies n < N, where n and N are positive integers greater than 1.
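Putting the pieces together, a compact sketch of this acceptance loop is given below; it reuses the hypothetical gradient_noise helper from the earlier sketch, and the attempt count and threshold handling are illustrative assumptions rather than the claimed procedure:

```python
import torch

def generate_countermeasure_samples(s, a, policy, q_model, max_attempts=10):
    """Accept (s, a_adv) whenever its value falls below the running threshold Q_minimum."""
    adv_samples = []
    q_minimum = q_model(torch.cat([s, a], dim=-1)).detach()      # initial threshold: first value Q(s, a)
    for _ in range(max_attempts):
        s_noised = s + gradient_noise(s, policy, q_model)        # noise state data
        a_adv = policy(s_noised).detach()                        # second action data
        q_adv = q_model(torch.cat([s, a_adv], dim=-1)).detach()  # second value Q(s, a_adv)
        if (q_adv < q_minimum).all():                            # interference successful
            adv_samples.append((s.detach(), a_adv))              # countermeasure sample for the pool
            q_minimum = q_adv                                    # update the preset value threshold
    return adv_samples
```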
The first action data and the second action data are input into the initial value model to obtain a target result output by the initial value model; if the target result meets the preset judgment condition, the noise state data and the second action data are taken as the countermeasure sample data corresponding to the offline sample data. The timing for generating countermeasure sample data is thereby judged and determined accurately, and a flexible method for determining that timing is provided, so that the generated countermeasure sample data can effectively reflect the data changes in a real application scene and the optimization effect of the control model is guaranteed.
S407: and training the first control model by adopting a plurality of off-line sample data and a plurality of confrontation sample data to obtain the target control model.
For an example of S407, reference may be specifically made to the above embodiments, and details are not described herein.
In this embodiment, a sample data set is obtained, where the sample data set includes a plurality of offline sample data; an initial control model is trained with the plurality of offline sample data to obtain a first control model; a plurality of countermeasure sample data respectively corresponding to the plurality of offline sample data are generated by combining the first control model and an initial value model; and the first control model is trained with the plurality of offline sample data and the plurality of countermeasure sample data to obtain a target control model. In this way, the robustness of the trained target control model is effectively improved, the target control model can be effectively applied to a real application scene, and the control accuracy and the control effect of the target control model are effectively improved. Furthermore, the first action data and the second action data are input into the initial value model to obtain a target result output by the initial value model; if the target result meets the preset judgment condition, the noise state data and the second action data are taken as the countermeasure sample data corresponding to the offline sample data. The timing for generating countermeasure sample data is thereby judged and determined accurately, and a flexible method for determining that timing is provided, so that the generated countermeasure sample data can effectively reflect the data changes in a real application scene and the optimization effect of the control model is guaranteed.
Application scenarios for the above embodiments of the present application may be exemplified as follows:
scene one: the application can be applied to a thermal power generation optimization system in the field of industrial control.
The state space of a thermal power generating unit is huge, many internal physical and chemical reaction mechanisms are not clearly understood, external inputs can be disturbed by the environment, characteristics differ between units, and operators cannot completely know the operating principle of the unit. A thermal power plant is equipped with sensors of different types throughout the plant; their readings can be collected in real time and aggregated into a DCS, and since the sampling frequency of the sensors is dense, a large amount of data is generated. An offline reinforcement learning model is constructed by mining internal rules from this big data, the optimization process is defined as a multi-objective, multi-constraint, highly dynamic decision process, and an artificially defined objective function is optimized by adjusting control variables.
The most direct goal is to increase boiler efficiency and reduce pollutant emissions: generate more electricity with less coal and produce less pollution. The offline data set used comprises data collected by sensors during the boiler combustion process, which can describe the working condition and combustion condition of the boiler (such as furnace pressure, steam flow, exhaust gas temperature, NOx content and the like), as well as the adjustable control variables in the combustion process (such as fan movable blade adjusting valves, flue gas baffles, primary air adjusting valves and secondary air adjusting valves).
In this application scene, an offline reinforcement learning model trained directly on such a data set is sensitive to input disturbance and cannot guarantee control stability. When a disturbance is input, the control variables recommended by the strategy deviate, and the deviation increases as the disturbance increases. For example, when the input disturbance accounts for five percent of the signal value, the control quantity deviation reaches five percent; when the input disturbance is ten percent, the deviation reaches fourteen percent. After the method described in the application is used, the deviation of the control quantity is obviously reduced, and in most cases the deviation can be reduced by about forty percent.
For example, when the input disturbance is five percent of the signal value, the controlled variable deviation is less than three percent. As mentioned above, the method and the device are successfully applied to the thermal power generation optimization system, the deviation of the control quantity caused by external disturbance is reduced, and the robustness of the control system is improved. Similarly, other off-line optimization control systems in the control field may be used in the present application, including but not limited to motion control (e.g., industrial robots, servo systems, stepper systems, etc.), system control.
Scene two: the application can be applied to intelligent traffic-light control in the intelligent transportation field.
Traffic light control relates to a safety-critical field, and real-time interactive training is costly and difficult to carry out. In the prior art, offline learning is mainly performed based on offline log data sets, or interactive training is performed with a high-precision simulator. The input states of the system are mainly of two types: video and image information collected by cameras, and the frequency change signal of an oscillator collected by a ground induction coil. The former can be converted, through image recognition and segmentation techniques, into features such as the number of waiting cars per lane, the queue length and the passing speed; the latter can be converted into features such as the number of passing cars and the passing speed. The output is the timing of each phase; if there are three phases (red, yellow and green lights), the output is the timing of the three lights.
The pictures collected by the camera have image noise, the noise is unpredictable, and random errors can be recognized only by a probability statistical method. The neural network is sensitive to image noise, and the image added with random Gaussian noise can be identified as an error type by the classifier. The ground induction coil is essentially an oscillating circuit, circuit noise exists, in actual engineering, the problems of different mechanical strength and high and low temperature ageing resistance of a lead exist, and the problem of acid and alkali corrosion resistance must be considered in certain severe environments. The parameters of the actual environment are not consistent with the environment of the simulator and may change over time. In the prior art, offline data set training or interactive training with a simulator is used, and the method is sensitive to external disturbance in practical application and cannot adapt to actual environmental parameters, so that the method described in the embodiment of the application can be adopted to improve the robustness of the system. In a similar way, the method and the device can be used for offline optimization scheduling problems in the field of decision scheduling, including network appointment vehicle dispatching, shared vehicle scheduling, unmanned delivery and the like.
Fig. 5 is a schematic structural diagram of a training device for controlling a model according to an embodiment of the present application.
As shown in fig. 5, the training device 50 for the control model includes:
an obtaining module 501, configured to obtain a sample data set, where the sample data set includes: a plurality of offline sample data;
a first training module 502, configured to train an initial control model using a plurality of offline sample data to obtain a first control model;
a generating module 503, configured to generate, in combination with the first control model and the initial value model, a plurality of challenge sample data corresponding to the plurality of offline sample data, respectively; and
the second training module 504 is configured to train the first control model by using a plurality of offline sample data and a plurality of confrontation sample data to obtain a target control model.
In some embodiments of the present application, the offline sample data comprises: environment state data and first action data. As shown in fig. 6, the generating module 503 comprises:
a first generation submodule 5031, configured to add random disturbance noise to the environmental state data to obtain noise state data;
the second generating submodule 5032 is configured to input the noise state data to the first control model to obtain second action data output by the first control model;
the third generating sub-module 5033 is configured to generate a plurality of challenge sample data corresponding to the plurality of offline sample data, respectively, according to the noise status data, the first action data, the second action data, and the initial value model.
In some embodiments of the present application, the third generation submodule 5033 is specifically configured to:
inputting the first action data and the second action data into the initial value model to obtain a target result output by the initial value model;
and if the target result meets the preset judgment condition, taking the environmental state data and the second action data as countermeasure sample data corresponding to the offline sample data.
In some embodiments of the present application, the third generation submodule 5033 is specifically configured to:
if the target result output by the initial value model is that the second value corresponding to the second action data is smaller than the first value corresponding to the first action data, it is judged that the target result meets the preset judgment condition.
In some embodiments of the present application, the third generation submodule 5033 is specifically configured to:
if the target result output by the initial value model is that the second value corresponding to the second action data is smaller than a preset value threshold, judging that the target result meets the preset judgment condition;
the initial preset value threshold is a first value corresponding to the first action data;
and after it is judged that the target result meets the preset judgment condition, updating the initial preset value threshold according to the second value.
In some embodiments of the present application, the second training module 504 is specifically configured to:
training an initial value model by adopting a plurality of confrontation sample data and a first control model to obtain a target value model;
and training the first control model by adopting a plurality of offline sample data, a plurality of confrontation sample data and the target value model to obtain the target control model.
In some embodiments of the present application, the second training module 504 is specifically configured to:
obtaining a gradient value of a value function corresponding to the target value model;
updating the gradient value corresponding to the strategy function of the first control model by adopting the gradient value to obtain an updated control model;
and iteratively training the updated control model by adopting a plurality of off-line sample data and a plurality of confrontation sample data until the updated control model converges, and taking the trained control model as the target control model.
Corresponding to the training method for the control model provided in the embodiment of fig. 1 to 4, the present application also provides a training apparatus for the control model, and since the training apparatus for the control model provided in the embodiment of the present application corresponds to the training method for the control model provided in the embodiment of fig. 1 to 4, the embodiment of the training method for the control model is also applicable to the training apparatus for the control model provided in the embodiment of the present application, and will not be described in detail in the embodiment of the present application.
In this embodiment, a sample data set is obtained, where the sample data set includes a plurality of offline sample data; an initial control model is trained with the plurality of offline sample data to obtain a first control model; a plurality of countermeasure sample data respectively corresponding to the plurality of offline sample data are generated by combining the first control model and an initial value model; and the first control model is trained with the plurality of offline sample data and the plurality of countermeasure sample data to obtain a target control model. In this way, the plurality of offline sample data are used to generate the corresponding countermeasure sample data, and the target control model is obtained by training with the offline sample data and the countermeasure sample data combined, so that the robustness of the trained target control model is effectively improved, the target control model can be effectively applied to a real application scene, and the control accuracy and the control effect of the target control model are effectively improved.
In order to implement the foregoing embodiments, the present application also provides a computer device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the training method of a control model provided by the foregoing embodiments of the application is implemented.
In order to achieve the above embodiments, the present application also proposes a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a training method of a control model as proposed in the previous embodiments of the present application.
In order to implement the foregoing embodiments, the present application also proposes a computer program product; when instructions in the computer program product are executed by a processor, the training method of a control model as proposed in the foregoing embodiments of the application is performed.
FIG. 7 illustrates a block diagram of an exemplary computer device suitable for use to implement embodiments of the present application. The computer device 12 shown in fig. 7 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present application.
As shown in FIG. 7, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 7, and commonly referred to as a "hard drive"). Although not shown in FIG. 7, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only Memory (CD-ROM), a Digital versatile disk Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public Network such as the Internet) via Network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing, such as implementing the training method of the control model mentioned in the foregoing embodiments, by executing programs stored in the system memory 28.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (16)

1. A method of training a control model, the method comprising:
obtaining a sample data set, wherein the sample data set comprises: a plurality of offline sample data;
training an initial control model by adopting the plurality of offline sample data to obtain a first control model;
combining the first control model and the initial value model to respectively generate a plurality of confrontation sample data respectively corresponding to the plurality of offline sample data; and
and training the first control model by adopting the plurality of offline sample data and the plurality of confrontation sample data to obtain a target control model.
2. The method of claim 1, wherein the offline sample data comprises: environmental state data and first action data; and wherein the generating, by combining the first control model and the initial value model, a plurality of challenge sample data respectively corresponding to the plurality of offline sample data includes:
adding random disturbance noise to the environmental state data to obtain noise state data;
inputting the noise state data into the first control model to obtain second action data output by the first control model;
and respectively generating a plurality of confrontation sample data respectively corresponding to the plurality of offline sample data according to the noise state data, the first action data, the second action data and the initial value model.
3. The method of claim 2, wherein generating a plurality of challenge sample data corresponding to the plurality of offline sample data, respectively, based on the noise state data, the first motion data, the second motion data, and the initial value model, respectively, comprises:
inputting the first action data and the second action data into the initial value model to obtain a target result output by the initial value model;
and if the target result meets a preset judgment condition, taking the environmental state data and the second action data as countermeasure sample data corresponding to the offline sample data.
4. The method of claim 3, wherein,
if the target result output by the initial value model is that a second value corresponding to the second action data is smaller than a first value corresponding to the first action data, judging that the target result meets the preset judgment condition.
5. The method of claim 3, wherein,
if the target result output by the initial value model is that the second value corresponding to the second action data is smaller than a preset value threshold, judging that the target result meets the preset judgment condition;
the initial preset value threshold is a first value corresponding to the first action data;
and after the target result is judged to meet the preset judgment condition, updating the initial preset value threshold according to the second price.
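Claim 5 swaps claim 4's pairwise comparison for a running threshold that starts at the first action's value and is updated from the second value whenever the condition fires. The helper below is one possible reading of that rule, with all names hypothetical; here the threshold is simply lowered to the latest qualifying second value.

def meets_threshold_condition(first_value, second_value, threshold_state):
    """Threshold form of the preset judgment condition (claim 5).

    `threshold_state` is a dict carrying the current threshold between calls;
    it is initialised to the value of the first action data.
    """
    if "threshold" not in threshold_state:
        threshold_state["threshold"] = first_value
    if second_value < threshold_state["threshold"]:
        # After the condition is met, update the threshold according to the second value.
        threshold_state["threshold"] = second_value
        return True
    return False

In the generation loop sketched after claim 3, this call would take the place of the direct comparison between the second and first values.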
6. The method of claim 1, wherein said training the first control model using the plurality of offline sample data and the plurality of adversarial sample data to obtain a target control model comprises:
training the initial value model using the plurality of adversarial sample data and the first control model to obtain a target value model; and
training the first control model using the plurality of offline sample data, the plurality of adversarial sample data, and the target value model to obtain the target control model.
7. The method of claim 6, wherein said training the first control model using the plurality of offline sample data, the plurality of adversarial sample data, and the target value model to obtain the target control model comprises:
obtaining a gradient value of a value function corresponding to the target value model;
updating a gradient value corresponding to a policy function of the first control model using the gradient value to obtain an updated control model; and
iteratively training the updated control model using the plurality of offline sample data and the plurality of adversarial sample data until the updated control model converges, and taking the converged control model as the target control model.
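Claims 6 and 7 first fit a target value model on the adversarial samples and then push the control model along the gradient of that value function, which amounts to a deterministic policy-gradient step. One such update might look like the sketch below, again assuming PyTorch modules and hypothetical names.

import torch


def policy_update_step(actor, actor_optimizer, target_critic, states):
    """One policy update driven by the gradient of the target value model."""
    actions = actor(states)
    # Maximising the value is written as minimising its negative, so the gradient
    # of the value function flows back into the policy parameters.
    loss = -target_critic(states, actions).mean()
    actor_optimizer.zero_grad()
    loss.backward()
    actor_optimizer.step()
    return loss.item()

Claim 7's iteration would repeat this step over mini-batches drawn from both the offline and the adversarial sample data until the updated control model converges.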
8. A training apparatus for a control model, the apparatus comprising:
an obtaining module, configured to obtain a sample data set, where the sample data set includes: a plurality of offline sample data;
a first training module, configured to train an initial control model using the plurality of offline sample data to obtain a first control model;
a generating module, configured to generate, by combining the first control model and an initial value model, a plurality of adversarial sample data respectively corresponding to the plurality of offline sample data; and
a second training module, configured to train the first control model using the plurality of offline sample data and the plurality of adversarial sample data to obtain a target control model.
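Claim 8 restates the method as four cooperating modules. In software this maps naturally onto a thin orchestration class whose injected callables play the roles of the first training, generating, and second training modules; the sketch below is only an illustrative wiring, with every name hypothetical.

class ControlModelTrainer:
    """Apparatus-style wrapper mirroring the module split of claim 8."""

    def __init__(self, actor, critic, first_train, generate, second_train):
        self.actor = actor                  # initial control model
        self.critic = critic                # initial value model
        self._first_train = first_train     # first training module
        self._generate = generate           # generating module
        self._second_train = second_train   # second training module

    def obtain(self, source):
        # Obtaining module: collect the offline sample data set.
        return list(source)

    def run(self, source):
        offline = self.obtain(source)
        first_model = self._first_train(self.actor, offline)
        adversarial = self._generate(first_model, self.critic, offline)
        return self._second_train(first_model, self.critic, offline, adversarial)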
9. The apparatus of claim 8, wherein the offline sample data comprises environmental state data and first action data, and the generating module comprises:
a first generation submodule, configured to add random disturbance noise to the environmental state data to obtain noise state data;
a second generation submodule, configured to input the noise state data into the first control model to obtain second action data output by the first control model; and
a third generation submodule, configured to generate, according to the noise state data, the first action data, the second action data, and the initial value model, the plurality of adversarial sample data respectively corresponding to the plurality of offline sample data.
10. The apparatus of claim 9, wherein the third generation submodule is specifically configured to:
input the first action data and the second action data into the initial value model to obtain a target result output by the initial value model; and
if the target result meets a preset judgment condition, take the environmental state data and the second action data as the adversarial sample data corresponding to the offline sample data.
11. The apparatus of claim 10, wherein the third generation submodule is specifically configured to:
if the target result output by the initial value model is that a second value corresponding to the second action data is smaller than a first value corresponding to the first action data, determine that the target result meets the preset judgment condition.
12. The apparatus of claim 10, wherein the third generation submodule is specifically configured to:
if the target result output by the initial value model is that a second value corresponding to the second action data is smaller than a preset value threshold, determine that the target result meets the preset judgment condition;
wherein an initial value of the preset value threshold is a first value corresponding to the first action data;
and after the target result is determined to meet the preset judgment condition, update the preset value threshold according to the second value.
13. The apparatus of claim 8, wherein the second training module is specifically configured to:
train the initial value model using the plurality of adversarial sample data and the first control model to obtain a target value model; and
train the first control model using the plurality of offline sample data, the plurality of adversarial sample data, and the target value model to obtain the target control model.
14. The apparatus of claim 13, wherein the second training module is specifically configured to:
obtain a gradient value of a value function corresponding to the target value model;
update a gradient value corresponding to a policy function of the first control model using the gradient value to obtain an updated control model; and
iteratively train the updated control model using the plurality of offline sample data and the plurality of adversarial sample data until the updated control model converges, and take the converged control model as the target control model.
15. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any one of claims 1-7 when executing the program.
16. A storage medium having instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-7.
CN202110237069.6A 2021-03-03 2021-03-03 Training method and device for control model, computer equipment and storage medium Active CN113721456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110237069.6A CN113721456B (en) 2021-03-03 2021-03-03 Training method and device for control model, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110237069.6A CN113721456B (en) 2021-03-03 2021-03-03 Training method and device for control model, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113721456A true CN113721456A (en) 2021-11-30
CN113721456B CN113721456B (en) 2024-07-16

Family

ID=78672538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110237069.6A Active CN113721456B (en) 2021-03-03 2021-03-03 Training method and device for control model, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113721456B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200334493A1 (en) * 2018-06-15 2020-10-22 Tencent Technology (Shenzhen) Company Limited Method , apparatus, and storage medium for annotating image
CN109036389A (en) * 2018-08-28 2018-12-18 出门问问信息科技有限公司 The generation method and device of a kind of pair of resisting sample
CN111461226A (en) * 2020-04-01 2020-07-28 深圳前海微众银行股份有限公司 Countermeasure sample generation method, device, terminal and readable storage medium
CN111832614A (en) * 2020-06-04 2020-10-27 北京百度网讯科技有限公司 Training method and device of target detection model, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115319741A (en) * 2022-08-05 2022-11-11 美的集团(上海)有限公司 Method for training robot control model and robot control method
CN115319741B (en) * 2022-08-05 2023-10-10 美的集团(上海)有限公司 Robot control model training method and robot control method

Also Published As

Publication number Publication date
CN113721456B (en) 2024-07-16

Similar Documents

Publication Publication Date Title
Zhang et al. Recurrent neural network‐based model predictive control for multiple unmanned quadrotor formation flight
CN111178545B (en) Dynamic reinforcement learning decision training system
CN113723615A (en) Training method and device of deep reinforcement learning model based on hyper-parametric optimization
CN114281103B (en) Aircraft cluster collaborative search method with zero interaction communication
CN109188907A (en) A kind of genetic Annealing Particle Swarm Mixed Algorithm and its Control System of Stable Platform applied to Control System of Stable Platform
CN116991068A (en) Motor control method and system based on distributed preset time gradient descent method
CN113721456B (en) Training method and device for control model, computer equipment and storage medium
EP3847583A1 (en) Determining control policies by minimizing the impact of delusion
CN111832911A (en) Underwater combat effectiveness evaluation method based on neural network algorithm
CN110851911B (en) Terminal state calculation model training method, control sequence searching method and device
CN118031246B (en) Combustion control optimization method and device for industrial boiler
CN117787384A (en) Reinforced learning model training method for unmanned aerial vehicle air combat decision
CN116859713A (en) Control method, device, equipment and medium of underwater robot based on fuzzy PID
CN105955029B (en) A kind of pid control parameter optimization method for protecting robustness
CN115542901A (en) Deformable robot obstacle avoidance method based on near-end strategy training
Hu et al. Intercept Guidance of Maneuvering Targets with Deep Reinforcement Learning
Yao et al. State space representation and phase analysis of gradient descent optimizers
CN115935773A (en) Layered identification method for target tactical intentions in air combat simulation environment
CN115293334B (en) Model-based unmanned equipment control method for high-sample-rate deep reinforcement learning
Hao et al. Flight Trajectory Prediction Using an Enhanced CNN-LSTM Network
CN111222718A (en) Maximum power point tracking method and device of wind energy conversion system
Zhang et al. Docking ship heave compensation system for loading operations based on a DDPG and PID hybrid control method using a judge network
CN118034355B (en) Network training method, unmanned aerial vehicle obstacle avoidance method and device
CN112119404B (en) Sample efficient reinforcement learning
Malikopoulos et al. A State-Space Representation Model and Learning Algorithm for Real-Time Decision-Making Under Uncertainty

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant