WO2023246819A1

WO2023246819A1 - Model training method and related device

Info

Publication number: WO2023246819A1
Application number: PCT/CN2023/101527
Authority: WO
Inventors: 和煦; 李栋
Original assignee: 华为技术有限公司
Priority date: 2022-06-21
Filing date: 2023-06-20
Publication date: 2023-12-28
Also published as: CN115293227A

Abstract

A model training method, relating to the field of artificial intelligence. The method comprises: processing first data by means of a first reinforcement learning model to obtain a first processing result; processing the first data by means of a first target neural network selected from among a plurality of first neural networks to obtain a second processing result, wherein each first neural network is an iteration result obtained by performing iterative training on a first initial neural network; and updating the first reinforcement learning model according to the first processing result and the second processing result. According to the present application, the interference for a target task is output by utilizing a historical training result of a historical adversarial agent (an adversarial agent obtained in a historical iteration process), such that more effective interference for the target task under different scenarios can be obtained, thereby improving the training effect and generalization of a model.

Description

A model training method and related equipment

This application claims priority to Chinese patent application No. 202210705971.0, entitled "A model training method and related equipment" and filed with the China Patent Office on June 21, 2022, the entire content of which is incorporated into this application by reference.

Technical field

This application relates to the field of artificial intelligence, and in particular to a model training method and related equipment.

Background

Artificial Intelligence (AI) is a theory, method, technology and application system that simulates, extends and expands human intelligence through digital computers or machines controlled by digital computers, perceives the environment, acquires knowledge and uses knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and produce a new class of intelligent machines that can respond in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.

Reinforcement learning (RL) is an important machine learning method in the field of artificial intelligence. It has many applications in fields such as autonomous driving, intelligent control of robots, and analysis and prediction. Specifically, the main problem to be solved through reinforcement learning is how smart devices directly interact with the environment to learn the skills used to perform specific tasks in order to maximize long-term rewards for specific tasks. In the application process of reinforcement learning algorithms, it is often necessary to interact with the online environment to obtain data and conduct training. The general approach is to model real scenes in the real world and generate an online environment for virtual simulation. In this case, if there is a slight difference between the training environment and the real environment that needs to be deployed, it is likely to cause the trained algorithm to fail, causing the performance in the real scenario to be lower than expected.

The above problems can be alleviated by improving the robustness of reinforcement learning algorithms. One method is to introduce imaginary interference in the virtual environment and train the reinforcement learning algorithm under that interference, which improves its ability to cope with interference and enhances the robustness and generalization of the algorithm. In other words, an adversarial agent can be set up for the reinforcement learning model to be trained. The data output by the adversarial agent is applied together with the data output by the reinforcement learning model when the task is performed, and serves as interference in executing the target task. However, because the differences between the training environment and the deployment environment cannot be foreseen, in existing training methods the adversarial agent can only output one specific kind of interference (for example, in robot control, a force within a specific range applied to a certain joint). When the changes in the real environment are inconsistent with the imaginary interference (that is, the interference output by the adversarial agent), the algorithm becomes less effective and less robust.

Summary of the invention

This application provides a model training method that can improve the training effect and generalization of the model.

In a first aspect, this application provides a model training method. The method includes: processing first data through a first reinforcement learning model to obtain a first processing result, wherein the first data indicates the state of a target object and the first processing result is used as control information when performing a target task on the target object; processing the first data through a first target neural network to obtain a second processing result, wherein the second processing result is used as interference information when executing the target task, the first target neural network is selected from a plurality of first neural networks, and each of the first neural networks is an iteration result obtained in the process of iteratively training a first initial neural network; executing the target task according to the first processing result and the second processing result to obtain a third processing result; and updating the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model.
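For ease of understanding only, one update step of the method of the first aspect can be sketched in Python as follows. The sketch is illustrative and non-limiting. The proportional controller, the fixed-force adversaries, the one-dimensional environment, and the hill-climbing update are all assumptions introduced here as stand-ins for the reinforcement learning model, the first neural networks, the target task, and the model update; none of the names below appear in this application.

```python
import random

def make_policy(gain):
    # Toy stand-in for the first reinforcement learning model (assumption).
    return lambda state: -gain * state

def make_adversary(force):
    # Toy stand-in for one historical adversary checkpoint (assumption).
    return lambda state: force

def execute_task(state, control, interference):
    # Toy environment: the next state plays the role of the third processing result.
    return state + control + interference

def reward_of(outcome):
    # The task is to drive the state to zero.
    return -abs(outcome)

def train_step(gain, adversary_pool, state, rng=random, step=0.05):
    """One illustrative update; hill climbing replaces a real RL update rule."""
    adversary = rng.choice(adversary_pool)   # select the first target neural network
    interference = adversary(state)          # second processing result
    best_gain, best_reward = gain, None
    for candidate in (gain - step, gain, gain + step):
        control = make_policy(candidate)(state)               # first processing result
        outcome = execute_task(state, control, interference)  # third processing result
        reward = reward_of(outcome)
        if best_reward is None or reward > best_reward:
            best_gain, best_reward = candidate, reward
    return best_gain                         # parameter of the updated model
```

Because the adversary is redrawn from the pool at every step, the controller is exposed to each of the stored interference behaviours rather than only the most recent one.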

In a possible implementation, the first reinforcement learning model may be an initialized model or the output of an iteration in the model training process. It should be understood that the reinforcement learning models in the embodiments of this application include but are not limited to deep neural networks, Bayesian neural networks, etc.

In a possible implementation, during the feedforward process of model training, the first data can be processed through the first reinforcement learning model to obtain the first processing result. The first processing result is used as control information when performing a target task on the target object. For example, the target task is the attitude control of a robot and the first processing result is the attitude control information of the robot; or the target task is automatic driving of a vehicle and the first processing result is the driving control information of the vehicle.

In the existing implementation, one adversarial agent for outputting interference information can be trained, and the interference information applies only one kind of interference to the target task. In the embodiment of the present application, on the one hand, multiple adversarial agents for outputting interference information can be trained, and the interference information output by different adversarial agents can apply different types of interference to the target task. On the other hand, when training the adversarial agent, the interference for the target task is output not only by the adversarial agent obtained in the latest iteration but also by the historical training results of the adversarial agent (adversarial agents obtained during earlier iterations). In this way, more effective interference for the target task, adapted to different scenarios, can be obtained, thereby improving the training effect and generalization of the model.

In a possible implementation, the first data is status information related to the robot; the target task is attitude control of the robot, and the first processing result is attitude control information of the robot.

In a possible implementation, the robot-related status information may include but is not limited to the robot's position and speed and information related to the scene it is in (such as obstacle information). The robot's position and speed may include the status of each joint (position, angle, speed, acceleration, etc.) and other information.

In a possible implementation, the first reinforcement learning model can obtain the attitude control information of the robot based on the input data. The attitude control information can include the control information of each joint of the robot, and the attitude control task of the robot can be performed based on the attitude control information.

In a possible implementation, the first data is vehicle-related status information; the target task is automatic driving of the vehicle, and the first processing result is the driving control information of the vehicle.

In a possible implementation, the vehicle-related status information may include but is not limited to the vehicle's position, speed, and information related to the scene in which it is located (such as driving road information, obstacle information, pedestrian information, and surrounding vehicle information).

In a possible implementation, the first reinforcement learning model can obtain the driving control information of the vehicle based on the input data. The driving control information can include the vehicle's speed, direction, driving trajectory and other information.

In a possible implementation, the method further includes: selecting the first target neural network from the plurality of first neural networks.

In a possible implementation, the first target neural network is selected from the plurality of first neural networks based on a first selection probability corresponding to each of the plurality of first neural networks. That is to say, each first neural network can be configured with a corresponding probability (that is, the first selection probability described above). When selecting the first target neural network from the plurality of first neural networks, sampling can be performed according to the probability distribution corresponding to the plurality of first neural networks, and the network is selected based on the sampling result.
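As a non-limiting illustration, the probability-based selection described above can be sketched with the Python standard library. The function name select_adversary and the representation of the first neural networks as a list of checkpoints are assumptions made here for illustration.

```python
import random

def select_adversary(checkpoints, selection_probs, rng=random):
    """Draw one historical adversary checkpoint according to its selection probability."""
    if len(checkpoints) != len(selection_probs):
        raise ValueError("one selection probability is required per checkpoint")
    # random.choices accepts relative weights, so the values need not sum to 1.
    return rng.choices(checkpoints, weights=selection_probs, k=1)[0]
```

A checkpoint with a small but non-zero probability can still be drawn, which is what keeps weaker historical adversaries in play.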

In a possible implementation, the processing result obtained by each first neural network processing data is used as interference when executing the target task, and the first selection probability is positively correlated with the degree to which the processing result output by the corresponding first neural network interferes with the target task. The first selection probability can be a trainable parameter. When updating the reinforcement learning model and the model of the adversarial agent, a reward value can be obtained. The reward value can represent how well the data output by the reinforcement learning model performs in executing the target task, and can also represent the degree to which the interference information output by the adversarial agent interferes with the target task. The probability distribution corresponding to the first neural networks can be updated based on the reward value, so that the first selection probability is positively correlated with the degree to which the processing result output by the corresponding first neural network interferes with the target task. In this way, on the one hand, an adversarial agent whose output interferes strongly has a larger sampling probability and is more likely to be sampled, which increases the degree of interference to the reinforcement learning model. On the other hand, an adversarial agent whose output interferes weakly has a smaller sampling probability but may still be sampled, which increases the richness of interference to the reinforcement learning model and improves the generalization of the network.
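One simple way to keep the selection probability positively correlated with the degree of interference is sketched below, for illustration only. The running-average score and the softmax mapping are assumptions chosen for brevity; the application itself only requires the positive correlation and does not prescribe this rule.

```python
import math

def update_selection_probs(scores, chosen, reward, lr=0.1, temperature=1.0):
    """Update the interference score of the sampled adversary and re-derive probabilities."""
    scores = list(scores)
    # A low task reward indicates strong interference, so the score tracks -reward.
    scores[chosen] += lr * (-reward - scores[chosen])
    # Softmax: higher score gives higher probability, and no probability reaches zero.
    peak = max(scores)
    weights = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(weights)
    return scores, [w / total for w in weights]
```

Under this assumed rule, an adversary that lowered the task reward is sampled more often, while the others keep a strictly positive probability.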

In a possible implementation, the above probability distribution can be a Nash equilibrium distribution. The probability distribution can be obtained by computing a Nash equilibrium from the reward values obtained when the target task is executed using the data output by the feedforward process of the reinforcement learning model together with the interference information. The probability distribution can be updated during the iteration process.
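For illustration only, a Nash equilibrium distribution over a matrix of reward values can be approximated with fictitious play, as sketched below. The application does not specify a solver; fictitious play, the zero-sum assumption (the adversary minimises the task reward), and the name fictitious_play are assumptions made here.

```python
def fictitious_play(payoff, iterations=5000):
    """Approximate a Nash equilibrium of a zero-sum matrix game.

    payoff[i][j] is the task reward of policy i (row, maximiser)
    against adversary j (column, minimiser).
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_counts[0] = col_counts[0] = 1  # arbitrary initial play
    for _ in range(iterations):
        # Each side best-responds to the opponent's empirical play so far.
        row_values = [sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
                      for i in range(n_rows)]
        col_values = [sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
                      for j in range(n_cols)]
        best_row = max(range(n_rows), key=row_values.__getitem__)
        best_col = min(range(n_cols), key=col_values.__getitem__)
        row_counts[best_row] += 1
        col_counts[best_col] += 1
    row_total, col_total = sum(row_counts), sum(col_counts)
    return ([c / row_total for c in row_counts],
            [c / col_total for c in col_counts])
```

The second returned list can serve as the sampling distribution over adversary checkpoints, and the first as the distribution over stored reinforcement learning policies.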

The embodiment of the present application controls the behavior space of the adversarial agent and changes the interference intensity of the adversarial agent, making the reinforcement learning strategy robust to both strong and weak interference. In addition, by introducing a game theory optimization framework and using historical strategies to increase the diversity of adversarial agents, the reinforcement learning strategy is more robust to interference from different strategies.

In a possible implementation, updating the first reinforcement learning model according to the third processing result includes:

According to the third processing result, the reward value corresponding to the target task is obtained;

According to the reward value, update the first reinforcement learning model;

The method also includes:

According to the reward value, the first selection probability corresponding to the first target neural network is updated.

In a possible implementation, after an adversarial agent is sampled for each adversarial task, the updated strategies of the reinforcement learning model and of the adversarial agent can be added to the Nash equilibrium matrix, and the Nash equilibrium can be calculated to obtain the Nash equilibrium distributions of the reinforcement learning model and the adversarial agents. Specifically, updating the first reinforcement learning model according to the first processing result and the second processing result includes: obtaining the reward value corresponding to the target task according to the first processing result and the second processing result; and updating the first reinforcement learning model according to the reward value. Further, the first selection probability corresponding to the first target neural network can be updated according to the reward value.
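Adding updated strategies to the Nash equilibrium matrix can be pictured as appending one row and one column of reward values, as in the following non-limiting sketch. The function extend_payoff_matrix and the callback evaluate, which is assumed to return the task reward of a policy against an adversary, are hypothetical names introduced for illustration.

```python
def extend_payoff_matrix(payoff, policies, adversaries,
                         new_policy, new_adversary, evaluate):
    """Append the updated policy (row) and the updated adversary (column)."""
    # Existing rows gain one entry: each stored policy against the new adversary.
    extended = [row + [evaluate(policy, new_adversary)]
                for row, policy in zip(payoff, policies)]
    all_adversaries = adversaries + [new_adversary]
    # The new policy contributes one row against every adversary, old and new.
    extended.append([evaluate(new_policy, adversary)
                     for adversary in all_adversaries])
    return extended, policies + [new_policy], all_adversaries
```

The extended matrix can then be passed to any Nash equilibrium solver to refresh the selection probabilities.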

In a possible implementation, in order to improve the richness of interference to the reinforcement learning model, multiple adversarial agents can be trained, and for each of the multiple adversarial agents, the adversarial agent that interferes with the reinforcement learning model can be selected from its multiple iteration results.

In a possible implementation, the method further includes:

The first data is processed through a second target neural network to obtain a fourth processing result, wherein the fourth processing result is used as interference information when executing the target task, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iteration result obtained in the process of iteratively training a second initial neural network; the first initial neural network and the second initial neural network are different;

Executing the target task according to the first processing result and the second processing result to obtain a third processing result includes:

According to the first processing result, the fourth processing result and the second processing result, the target task is executed to obtain a third processing result.

In a possible implementation, the interference types of the second processing result and the fourth processing result are different.

For example, the interference type may be a category of interference applied when performing the target task, such as applying force, applying torque, adding obstacles, changing road conditions, changing weather, etc.

In a possible implementation, the interference objects of the second processing result and the fourth processing result are different.

For example, the robot may include multiple joints, and applying force to different joints or different joint groups may be considered to be different interference objects. That is, the second processing result and the fourth processing result are forces applied to different joints or different joint groups.

In a possible implementation, the first target neural network is used to determine the second processing result from a first numerical range according to the first data, the second target neural network is used to determine the fourth processing result from a second numerical range according to the first data, and the second numerical range is different from the first numerical range.

For example, the second processing result and the fourth processing result are both forces exerted on the robot joints. The maximum value of the force determined by the first target neural network is A1, the maximum value of the force determined by the second target neural network is A2, and A1 and A2 are different.
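The different interference objects and numerical ranges described above can be illustrated as follows. This is a sketch under stated assumptions: the tanh squashing, the joint indices, and the function names bounded_interference and combine are introduced here for illustration and are not taken from this application.

```python
import math

def bounded_interference(raw_outputs, joint_indices, max_force, num_joints):
    """Map raw adversary outputs to forces on selected joints within [-max_force, max_force]."""
    forces = [0.0] * num_joints
    for raw, joint in zip(raw_outputs, joint_indices):
        # tanh squashes any real value into [-1, 1] before scaling by the bound.
        forces[joint] = max_force * math.tanh(raw)
    return forces

def combine(*force_vectors):
    """Sum the interference from several adversaries joint by joint."""
    return [sum(values) for values in zip(*force_vectors)]
```

For instance, one adversary may act on joints 0 and 1 with bound A1 and another on joint 3 with bound A2, and the combined vector is applied together with the control information when the target task is executed.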

In a possible implementation, during the iterative training process of the adversarial agent, the reinforcement learning model participating in the current round of training can also be selected from the historical iteration results of the reinforcement learning model, for example by probability-based sampling. For details, refer to the process of sampling the adversarial agent in the above embodiment.

In a possible implementation, second data can be processed through a second reinforcement learning model to obtain a fifth processing result, wherein the second reinforcement learning model is selected from a plurality of reinforcement learning models including the updated first reinforcement learning model, each of the reinforcement learning models is an iteration result obtained in the process of iteratively training an initial reinforcement learning model, the second data indicates the state of the target object, and the fifth processing result is used as control information when performing the target task on the target object; the second data is processed through a third target neural network to obtain a sixth processing result, wherein the third target neural network belongs to the plurality of first neural networks and the sixth processing result is used as interference information when executing the target task; the target task is executed according to the fifth processing result and the sixth processing result to obtain a seventh processing result; and the third target neural network is updated according to the seventh processing result to obtain an updated third target neural network.

In a possible implementation, the second reinforcement learning model may be selected from the plurality of reinforcement learning models.

In a possible implementation, selecting the second reinforcement learning model from the plurality of reinforcement learning models includes: selecting the second reinforcement learning model from the plurality of reinforcement learning models based on the second selection probability corresponding to each reinforcement learning model in the plurality of reinforcement learning models.

In a possible implementation, the second selection probability is positively correlated with the positive execution effect of the processing result output by the corresponding reinforcement learning model when executing the target task. When updating the reinforcement learning model and the model of the adversarial agent, a reward value can be obtained. The reward value can represent how well the data output by the reinforcement learning model performs in executing the target task, and the probability distribution corresponding to the reinforcement learning models can be updated based on the reward value, so that the second selection probability is positively correlated with the positive execution effect of the processing result output by the corresponding reinforcement learning model when executing the target task.

In a possible implementation, a historical strategy of the reinforcement learning agent can be sampled from the set of historical strategies of the reinforcement learning agent according to the Nash equilibrium distribution, for use in updating the strategy of the adversarial agent. The selected reinforcement learning strategy and the current adversarial agent strategy are deployed in the training environment and sampled to obtain the required training samples, and the obtained training samples are used to train the adversarial agent strategy.
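The sample collection for the adversarial agent described above can be sketched as follows, for illustration only. The sketch assumes a zero-sum setting in which the adversary's reward is the negative of the task reward; that assumption, the function name collect_adversary_samples, and the callbacks execute_task and task_reward are hypothetical and not taken from this application.

```python
import random

def collect_adversary_samples(policy_pool, policy_probs, adversary, execute_task,
                              task_reward, initial_state, horizon, rng=random):
    """Roll out a sampled historical policy against the current adversary."""
    # Sample one historical reinforcement learning policy from the given distribution.
    policy = rng.choices(policy_pool, weights=policy_probs, k=1)[0]
    samples, state = [], initial_state
    for _ in range(horizon):
        control = policy(state)           # fifth processing result
        interference = adversary(state)   # sixth processing result
        next_state = execute_task(state, control, interference)  # seventh processing result
        # Zero-sum assumption: the adversary is rewarded when the task goes badly.
        samples.append((state, interference, -task_reward(next_state)))
        state = next_state
    return samples
```

The returned tuples of state, interference, and adversary reward can then be fed to whichever update rule is used to train the adversarial agent strategy.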

In a second aspect, this application provides a model training device, which includes:

The data processing module is used to process the first data through the first reinforcement learning model to obtain the first processing result, wherein the first data indicates the state of the target object, and the first processing result is used as control information when performing a target task on the target object;

The first data is processed through a first target neural network to obtain a second processing result, wherein the second processing result is used as interference information when executing the target task, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained in the process of iteratively training the first initial neural network;

According to the first processing result and the second processing result, execute the target task and obtain a third processing result;

A model update module, configured to update the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model.

In the existing implementation, one adversarial agent for outputting interference information can be trained, and the interference information applies only one kind of interference to the target task. In the embodiment of the present application, on the one hand, multiple adversarial agents for outputting interference information can be trained, and the interference information output by different adversarial agents can apply different types of interference to the target task. On the other hand, when training the adversarial agent, the interference for the target task is output not only by the adversarial agent obtained in the latest iteration but also by the historical training results of the adversarial agent (adversarial agents obtained during earlier iterations). In this way, more effective interference for the target task, adapted to different scenarios, can be obtained, thereby improving the training effect and generalization of the model.

In one possible implementation,

The target object is a robot; the target task is the attitude control of the robot, and the first processing result is the attitude control information of the robot; or,

The target object is a vehicle; the target task is automatic driving of the vehicle; and the first processing result is the driving control information of the vehicle.

In a possible implementation, the device further includes:

A network selection module, configured to select the first target neural network from the plurality of first neural networks.

In a possible implementation, the first target neural network is selected from the plurality of first neural networks based on a first selection probability corresponding to each of the plurality of first neural networks.

In a possible implementation, the processing result obtained by each first neural network processing data is used as interference when executing the target task, and the first selection probability is positively correlated with the degree to which the processing result output by the corresponding first neural network interferes with the target task.

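In a possible implementation, the model update module is specifically used to:

According to the third processing result, obtain the reward value corresponding to the target task;

According to the reward value, update the first reinforcement learning model;

The model update module is also used to:

According to the reward value, update the first selection probability corresponding to the first target neural network.

In a possible implementation, the data processing module is also used to:

Process the first data through a second target neural network to obtain a fourth processing result, wherein the fourth processing result is used as interference information when executing the target task, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iteration result obtained in the process of iteratively training a second initial neural network; the first initial neural network and the second initial neural network are different;

The data processing module is specifically used for:

According to the first processing result, the fourth processing result and the second processing result, executing the target task to obtain a third processing result.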

In one possible implementation,

The interference types of the second processing result and the fourth processing result are different; or,

The interference objects of the second processing result and the fourth processing result are different; or,

The first target neural network is used to determine the second processing result from a first numerical range according to the first data, the second target neural network is used to determine the fourth processing result from a second numerical range according to the first data, and the second numerical range is different from the first numerical range.

In a possible implementation, the data processing module is also used to:

Process the second data through the second reinforcement learning model to obtain a fifth processing result, wherein the second reinforcement learning model is selected from a plurality of reinforcement learning models including the updated first reinforcement learning model, each of the reinforcement learning models is an iteration result obtained in the process of iteratively training the initial reinforcement learning model, the second data indicates the state of the target object, and the fifth processing result is used as control information when performing the target task on the target object;

The second data is processed through a third target neural network to obtain a sixth processing result; the third target neural network belongs to the plurality of first neural networks; the sixth processing result is used as interference information when executing the target task;

According to the fifth processing result and the sixth processing result, execute the target task and obtain a seventh processing result;

The model update module is also used to:

According to the seventh processing result, the third target neural network is updated to obtain an updated third target neural network.

In a possible implementation, the network selection module is also used to:

The second reinforcement learning model is selected from the plurality of reinforcement learning models.

In a possible implementation, the network selection module is specifically used to:

Select the second reinforcement learning model from the plurality of reinforcement learning models based on the second selection probability corresponding to each reinforcement learning model in the plurality of reinforcement learning models.

In the third aspect, this application provides a data processing method, including:

Obtaining first data, the first data indicating the status of the target object;

The first data is processed through the first reinforcement learning model to obtain a first processing result; the first processing result is used as the control information of the target object; wherein,

The first reinforcement learning model is updated according to a reward value during an iteration of training; the reward value is obtained by applying interference information when executing the target task according to the control information output by the feedforward process of the first reinforcement learning model; the interference information is obtained through the feedforward process of a target neural network; the target neural network is selected from multiple neural networks; and each of the neural networks is an iteration result obtained in the process of iteratively training an initial neural network;

According to the first processing result, a target task is performed on the target object.

In one possible implementation,

In a possible implementation, the target neural network is selected from the plurality of neural networks based on a first selection probability corresponding to each neural network in the plurality of neural networks.

In a possible implementation, the processing result obtained by each neural network processing data is used as interference when executing the target task, and the first selection probability is positively correlated with the degree to which the processing result output by the corresponding neural network interferes with the target task.

In the fourth aspect, embodiments of the present application provide a model training device, which may include a memory, a processor, and a bus system, wherein the memory is used to store programs, and the processor is used to execute the programs in the memory to perform the method of the first aspect described above and any optional method thereof.

In the fifth aspect, embodiments of the present application provide a data processing device, which may include a memory, a processor, and a bus system, wherein the memory is used to store programs, and the processor is used to execute the programs in the memory to perform the method of the third aspect described above and any optional method thereof.

In a sixth aspect, embodiments of the present application provide a computer-readable storage medium. A computer program is stored in the computer-readable storage medium. When the computer program is run on a computer, it causes the computer to execute the above-mentioned first aspect and any optional method thereof, or the above-mentioned third aspect and any optional method thereof.

In a seventh aspect, embodiments of the present application provide a computer program product including instructions that, when run on a computer, cause the computer to execute the above-mentioned first aspect and any optional method thereof, or the above-mentioned third aspect and any optional method thereof.

In an eighth aspect, this application provides a chip system that includes a processor configured to support the model training device in implementing some or all of the functions involved in the above aspects, for example, sending or processing the data or information involved in the above methods. In a possible design, the chip system also includes a memory, which is used to save the program instructions and data necessary for the execution device or the training device. The chip system may be composed of chips, or may include chips and other discrete devices.

Embodiments of the present application provide a model training method. The method includes: processing first data through a first reinforcement learning model to obtain a first processing result, wherein the first data indicates the state of a target object and the first processing result is used as control information when performing a target task on the target object; processing the first data through a first target neural network to obtain a second processing result, wherein the second processing result is used as interference information when executing the target task, the first target neural network is selected from a plurality of first neural networks, and each of the first neural networks is an iteration result obtained in the process of iteratively training a first initial neural network; executing the target task according to the first processing result and the second processing result to obtain a third processing result; and updating the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model. Through the above method, when training the adversarial agent, the interference for the target task is output not only by the adversarial agent obtained in the latest iteration but also by the historical training results of the adversarial agent (adversarial agents obtained during earlier iterations), so that more effective interference for the target task, adapted to different scenarios, can be obtained, thereby improving the training effect and generalization of the model.

Description of the drawings

Figure 1 is a schematic diagram of an application architecture;

Figure 2 is a schematic diagram of an application architecture;

Figure 3 is a schematic diagram of an application architecture;

Figure 4 is a schematic diagram of an embodiment of a model training method provided by an embodiment of the present application;

Figure 5 is a schematic diagram of a software architecture provided by an embodiment of the present application;

Figure 6 is a schematic diagram of an embodiment of a model training method provided by an embodiment of the present application;

Figure 7 is a schematic diagram of an embodiment of a model training device provided by an embodiment of the present application;

Figure 8 is a schematic structural diagram of an execution device provided by an embodiment of the present application;

Figure 9 is a schematic structural diagram of a server provided by an embodiment of the present application;

Figure 10 is a schematic structural diagram of a chip provided by an embodiment of the present application.

Detailed description

The embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention. The terms used in the embodiments of the present invention are only used to explain specific embodiments of the present invention and are not intended to limit the present invention.

The embodiments of the present application are described below with reference to the accompanying drawings. Persons of ordinary skill in the art know that with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.

The terms "first", "second", etc. in the description and claims of this application and the above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the terms so used are interchangeable under appropriate circumstances, and are merely a way of distinguishing objects with the same attributes in describing the embodiments of the present application. Furthermore, the terms "include" and "having" and any variations thereof are intended to cover non-exclusive inclusions, such that a process, method, system, product or apparatus comprising a series of elements need not be limited to those elements, but may include other elements not explicitly listed or inherent to such processes, methods, products or equipment.

As used herein, the terms "substantially", "about" and similar terms are used as terms of approximation, not as terms of degree, and are intended to take into account the inherent deviations in measured or calculated values that would be known to one of ordinary skill in the art. In addition, the use of "may" when describing embodiments of the present invention refers to "one or more possible embodiments." As used herein, the terms "use", "using", and "used" may be deemed synonymous with the terms "utilize", "utilizing", and "utilized", respectively. Additionally, the term "exemplary" is intended to refer to an example or illustration.

First, the overall workflow of the artificial intelligence system is described. Please refer to Figure 1, which shows a structural schematic diagram of the main framework of artificial intelligence. The framework is elaborated below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data goes through the condensation process of "data-information-knowledge-wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementation) to the industrial ecology of the system.

(1) Infrastructure

Infrastructure provides computing power support for artificial intelligence systems, enables communication with the external world, and provides support through basic platforms. Communication with the outside is performed through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, and FPGA); the basic platform includes related platform guarantees and support such as distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, etc. For example, sensors communicate with the outside world to obtain data, and the data is provided to smart chips in the distributed computing system provided by the basic platform for calculation.

(2) Data

Data at the layer above the infrastructure is used to represent the data sources in the field of artificial intelligence. The data involves graphics, images, voice, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.

(3) Data processing

Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.

Among them, machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, etc. on data.

Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formal information to perform machine thinking and problem solving based on reasoning control strategies. Typical functions are search and matching.

Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.

(4) General capabilities

After the data is processed as mentioned above, some general capabilities can be formed based on the results of the data processing, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.

(5) Intelligent products and industry applications

Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. Their application fields mainly include intelligent terminals, intelligent transportation, smart healthcare, autonomous driving, smart cities, etc.

With the development of artificial intelligence, many tasks that previously had to be completed by humans are gradually being taken over by smart terminals. The skills used to complete the tasks, as well as the neural networks for those skills, need to be configured on the smart terminal so that specific tasks can be completed through the smart terminal. Specifically, this can be applied to mobile smart terminals. As an example, in the field of autonomous driving, driving operations originally performed by humans can be performed by smart cars instead, and the smart cars need to be equipped with a large number of driving skills and neural networks for the driving skills; as another example, in the field of freight transportation, the handling operations originally performed by humans can be performed by handling robots, and the handling robots need to be equipped with a large number of handling skills and neural networks for the handling skills. This can also be applied to smart terminals that do not move. As an example, on a parts processing assembly line, the parts grasping operation originally completed by humans can be completed by an intelligent robotic arm, and the intelligent robotic arm needs to be equipped with grasping skills and neural networks for the grasping skills, where different grasping skills can have different grasping angles, displacements of the intelligent robotic arm, etc.; as another example, in the field of automatic cooking, the cooking operation originally completed by humans can be completed by an intelligent robotic arm, and the intelligent robotic arm needs to be equipped with cooking skills such as raw material grasping skills and stir-frying skills, as well as neural networks for the cooking skills. Other application scenarios are not exhaustively listed here.

In order to better understand the solution of the embodiment of the present application, the possible implementation architecture of the embodiment of the present application will be briefly introduced below with reference to Figure 2 and Figure 3.

Figure 2 is a schematic diagram of a computing system that performs model training in an embodiment of the present application. The computing system includes a terminal device 102 (optionally; the terminal device 102 may be omitted) and a server 130 (which may also be called a central node) communicatively coupled through a network. The terminal device 102 may be any type of computing device, for example, a personal computing device (e.g., a laptop or desktop computer), a mobile computing device (e.g., a smartphone or tablet), a game console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.

The terminal device 102 may include a processor 112 and a memory 114. The processor 112 may be any suitable processing device (e.g., a processor core, microprocessor, application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), controller, microcontroller, etc.). The memory 114 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM). The memory 114 may store data 116 and instructions 118 executed by the processor 112 to cause the terminal device 102 to perform operations.

In some implementations, the memory 114 may store one or more models 120. For example, the model 120 may be or may additionally include various machine learning models, such as neural networks (e.g., deep neural networks) or other types of machine learning models, including nonlinear models and/or linear models. Neural networks may include feedforward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks, or other forms of neural networks.

In some implementations, one or more models 120 may be received from server 130 over network 180, stored in memory 114, and then used by one or more processors 112 or otherwise implemented.

Terminal device 102 may also include one or more user input components 122 that receive user input. For example, user input component 122 may be a touch-sensitive component (eg, a touch-sensitive display screen or touch pad) that is sensitive to the touch of a user input object (eg, a finger or stylus). Touch-sensitive components can be used to implement virtual keyboards. Other example user input components include a microphone, a traditional keyboard, or other device through which a user can provide user input.

The terminal device 102 may also include a communication interface 123. The terminal device 102 may be communicatively connected to the server 130 through the communication interface 123. The server 130 may include a communication interface 133. The terminal device 102 may be communicatively connected to the communication interface 133 of the server 130 through the communication interface 123. In this way, data interaction between the terminal device 102 and the server 130 is achieved.

The server 130 may include a processor 132 and a memory 134. The processor 132 may be any suitable processing device (e.g., a processor core, microprocessor, application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), controller, microcontroller, etc.). The memory 134 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM). The memory 134 may store data 136 and instructions 138 for execution by the processor 132 to cause the server 130 to perform operations.

As mentioned above, memory 134 may store one or more machine learning models 140. For example, model 140 may be or may additionally include various machine learning models. Example machine learning models include neural networks or other multi-layered nonlinear models. Example neural networks include feedforward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks.

It should be understood that the model training method in the embodiment of the present application involves AI-related operations. When performing AI operations, the instruction execution architecture of the terminal device and server is not limited to the processor-memory architecture shown in Figure 2. The system architecture provided by the embodiment of the present application will be introduced in detail below with reference to Figure 3.

Figure 3 is a schematic diagram of the system architecture provided by an embodiment of the present application. As shown in Figure 3, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550 and a data collection device 560.

The execution device 510 includes a computing module 511, an I/O interface 512, a preprocessing module 513 and a preprocessing module 514. The target model/rule 501 may be included in the calculation module 511, and the preprocessing module 513 and the preprocessing module 514 are optional.

The data collection device 560 is used to collect training samples. The training samples may be first data, second data, etc., wherein the first data and the second data may be state information related to the target object (such as a robot, a vehicle, etc.). After collecting the training samples, the data collection device 560 stores the training samples into the database 530.

The training device 520 can train the neural network to be trained (such as the reinforcement learning model and the target neural network in the embodiment of the present application, where the target neural network is used as an adversarial agent of the reinforcement learning model) based on the training samples maintained in the database 530, to obtain the target model/rule 501.

It should be noted that in actual applications, the training samples maintained in the database 530 are not necessarily all collected by the data collection device 560, and may also be received from other devices. In addition, it should be noted that the training device 520 does not necessarily train the target model/rule 501 based entirely on the training samples maintained in the database 530. It may also obtain training samples from the cloud or other places for model training. The above description should not be construed as a limitation on the embodiments of this application.

The target model/rule 501 trained by the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in Figure 3. The execution device 510 can be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, and can also be a server, etc.

Among them, the target model/rule 501 can be used to achieve target tasks, such as driving control in autonomous driving, attitude control on robots, etc.

Specifically, the training device 520 can transfer the trained model to the execution device 510. The execution device 510 may be the above-mentioned target object.

In Figure 3, the execution device 510 is configured with an input/output (I/O) interface 512 for data interaction with external devices. The user can input data to the I/O interface 512 through the client device 540, or the execution device 510 can automatically collect input data.

The preprocessing module 513 and the preprocessing module 514 are used to perform preprocessing according to the input data received by the I/O interface 512. It should be understood that there may be no preprocessing module 513 and 514 or only one preprocessing module. When the preprocessing module 513 and the preprocessing module 514 do not exist, the computing module 511 can be directly used to process the input data.

When the execution device 510 preprocesses input data, or when the calculation module 511 of the execution device 510 performs calculations and other related processes, the execution device 510 can call data, code, etc. in the data storage system 550 for corresponding processing, and the data, instructions, etc. obtained by the corresponding processing can also be stored in the data storage system 550.

Finally, the I/O interface 512 provides the processing results to the client device 540, thereby providing them to the user, or performing control operations based on the processing results.

In the situation shown in Figure 3, the user can manually specify the input data, and this manual specification can be performed through the interface provided by the I/O interface 512. In another case, the client device 540 can automatically send input data to the I/O interface 512. If the user's authorization is required for the client device 540 to automatically send the input data, the user can set corresponding permissions in the client device 540. The user can view the results output by the execution device 510 on the client device 540, and the specific presentation form may be display, sound, action, etc. The client device 540 can also be used as a data collection terminal to collect the input data input to the I/O interface 512 and the output results output by the I/O interface 512, as shown in the figure, as new sample data and store them in the database 530. Of course, collection may also be performed without going through the client device 540; instead, the I/O interface 512 directly stores the input data input to the I/O interface 512 and the output results output by the I/O interface 512, as shown in the figure, into the database 530 as new sample data.

It is worth noting that Figure 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application. The positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation. For example, in Figure 3, the data storage system 550 is an external memory relative to the execution device 510. In other cases, the data storage system 550 can also be placed in the execution device 510. It should be understood that the above execution device 510 may be deployed in the client device 540.

From the training side of the model:

In the embodiment of the present application, the above-mentioned training device 520 can obtain the code stored in the memory (not shown in Figure 3, which can be integrated with the training device 520 or deployed separately from the training device 520) to implement the steps related to model training in the embodiment of the present application.

In the embodiment of the present application, the training device 520 may include hardware circuits (such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller), or a combination of these hardware circuits. For example, the training device 520 can be a hardware system with the function of executing instructions, such as a CPU or DSP; or a hardware system without the function of executing instructions, such as an ASIC or FPGA; or a combination of a hardware system without the function of executing instructions and a hardware system with the function of executing instructions.

It should be understood that the training device 520 can be a combination of a hardware system that does not have the function of executing instructions and a hardware system that has the function of executing instructions, and some of the steps related to model training provided by the embodiments of the present application can also be implemented by the hardware system in the training device 520 that does not have the function of executing instructions, which is not limited here.

Since the embodiments of the present application involve the application of a large number of neural networks, in order to facilitate understanding, the relevant terms involved in the embodiments of the present application and related concepts such as neural networks are first introduced below.

(1) Neural network

A neural network can be composed of neural units. A neural unit can refer to an operation unit that takes $x_s$ (i.e., input data) and an intercept of 1 as input. The output of the operation unit can be:

$$\text{output} = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function. A neural network is a network formed by connecting multiple such single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field. The local receptive field can be an area composed of several neural units.
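As an illustration only, the output of a single neural unit described above can be sketched in Python as follows. The function names `sigmoid` and `neural_unit` and the numeric values are assumptions introduced for this example and do not appear in the application.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic sigmoid activation f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Output of one neural unit: f(sum over s of W_s * x_s, plus b)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(w_s * x_s for w_s, x_s in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical inputs and weights, chosen only for illustration.
output = neural_unit(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```

Here the weighted sum is 0.5 - 0.5 + 0 = 0, so the sigmoid activation returns 0.5.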

(2) Deep neural network

A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers. There is no special metric for "many" here. Dividing a DNN according to the positions of its layers, the layers inside the DNN fall into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is to say, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer. Although a DNN looks very complicated, the work of each layer is actually not complicated. Simply put, it is the following linear relationship expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called coefficients), and $\alpha()$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also large. These parameters are defined in a DNN as follows, taking the coefficient $W$ as an example: assume that in a three-layer DNN, the linear coefficient from the fourth neuron in the second layer to the second neuron in the third layer is defined as $W^{3}_{24}$. The superscript 3 represents the layer in which the coefficient $W$ is located, and the subscript corresponds to the output third-layer index 2 and the input second-layer index 4. In summary, the coefficient from the k-th neuron in layer L-1 to the j-th neuron in layer L is defined as $W^{L}_{jk}$. It should be noted that the input layer has no $W$ parameter. In deep neural networks, more hidden layers make the network more capable of describing complex situations in the real world. Theoretically, a model with more parameters has higher complexity and greater "capacity", which means it can complete more complex learning tasks. Training a deep neural network is the process of learning the weight matrices. The ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (weight matrices formed by the vectors W of many layers).
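The layer-by-layer operation described above can be sketched as follows. This is a minimal illustration that assumes a sigmoid activation and uses 0-based list indices (the text uses 1-based indices); the names `layer_forward` and `dnn_forward` and all parameter values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(W: Matrix, b: Sequence[float], x: Sequence[float]) -> List[float]:
    """Compute y = alpha(W x + b) for one fully connected layer.

    W[j][k] is the coefficient from neuron k of the previous layer
    to neuron j of this layer.
    """
    return [
        sigmoid(sum(W[j][k] * x[k] for k in range(len(x))) + b[j])
        for j in range(len(W))
    ]


def dnn_forward(layers: Sequence[Tuple[Matrix, Sequence[float]]],
                x: Sequence[float]) -> List[float]:
    """Apply each (W, b) pair in order; the input layer itself has no W."""
    out = list(x)
    for W, b in layers:
        out = layer_forward(W, b, out)
    return out


# Hypothetical network: 2 inputs, 3 hidden neurons, 1 output neuron.
layers = [
    ([[2.0, 1.0], [0.0, 0.0], [-2.0, -1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, -2.0, 1.0]], [0.0]),
]
y = dnn_forward(layers, [1.0, -2.0])
```

The example weights are chosen so that every pre-activation value is 0, which makes the single output equal to 0.5 and easy to check by hand.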

(3) Reinforcement learning (RL), also known as evaluative learning, is one of the paradigms and methodologies of machine learning. It is used to describe and solve the problem of an agent learning a strategy, during its interaction with the environment, to maximize returns or achieve a specific goal.

A common model for reinforcement learning is the standard Markov decision process (MDP). According to given conditions, reinforcement learning can be divided into model-based reinforcement learning (model-based RL) and model-free reinforcement learning (model-free RL), as well as active reinforcement learning (active RL) and passive reinforcement learning (passive RL). Variants of reinforcement learning include inverse reinforcement learning, hierarchical reinforcement learning, and reinforcement learning for partially observable systems. The algorithms used to solve reinforcement learning problems can be divided into two categories: policy search algorithms and value function algorithms. Deep learning models can be used in reinforcement learning to form deep reinforcement learning.

(4) Loss function

In the process of training a deep neural network, because we hope that the output of the deep neural network is as close as possible to the value that we really want to predict, we can compare the predicted value of the current network with the really desired target value, and then update the weight vector of each layer of the neural network according to the difference between the two (of course, there is usually an initialization process before the first update, that is, preconfiguring parameters for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the really desired target value or a value that is very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the loss function or objective function, which are important equations used to measure the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference, so the training of the deep neural network becomes a process of reducing this loss as much as possible.
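As a minimal sketch of this idea, the following example uses a squared-error loss and gradient descent to adjust a single weight. The one-weight model `predicted = w * x`, the squared-error loss, the learning rate, and the data values are all assumptions made for illustration; the application does not prescribe a particular loss function.

```python
from typing import List, Tuple


def squared_loss(predicted: float, target: float) -> float:
    """Loss grows with the difference between prediction and target."""
    return (predicted - target) ** 2


def train_weight(x: float, target: float, w: float = 0.0,
                 lr: float = 0.1, steps: int = 50) -> Tuple[float, List[float]]:
    """Gradient descent on w for the model predicted = w * x."""
    history = []
    for _ in range(steps):
        predicted = w * x
        history.append(squared_loss(predicted, target))
        grad = 2.0 * (predicted - target) * x  # d(loss)/dw
        w -= lr * grad                         # lower w if the prediction is too high
        # (and raise it if the prediction is too low)
    return w, history


# Hypothetical data: the weight that maps x = 2 to target = 6 is w = 3.
w, history = train_weight(x=2.0, target=6.0)
```

With these values the loss starts at 36 and shrinks toward 0 as `w` approaches 3.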

(5) Back propagation algorithm

The convolutional neural network can use the error back propagation (BP) algorithm to modify the size of the parameters in the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forward propagation of the input signal until the output will produce an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the super-resolution model, such as the weight matrix.
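The forward pass and backward propagation of the error loss can be illustrated with a deliberately tiny network. The sketch below is not the super-resolution model mentioned above; it assumes a hypothetical network with one sigmoid hidden unit and one linear output, and the names `forward`, `loss_and_grads`, and `train` are introduced only for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)   # hidden activation
    y = w2 * h + b2            # linear output
    return h, y


def loss_and_grads(params, x, target):
    """Forward pass, then propagate the error loss backward via the chain rule."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    loss = 0.5 * (y - target) ** 2
    dy = y - target            # d(loss)/dy
    dw2, db2 = dy * h, dy      # output-layer gradients
    dh = dy * w2               # error propagated back to the hidden unit
    dz = dh * h * (1.0 - h)    # through the sigmoid derivative
    dw1, db1 = dz * x, dz      # hidden-layer gradients
    return loss, [dw1, db1, dw2, db2]


def train(params, x, target, lr=0.5, steps=200):
    """Repeatedly update the parameters against the gradient of the loss."""
    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(params, x, target)
        losses.append(loss)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params, losses


# Hypothetical starting parameters and a single training sample.
params, losses = train([0.1, 0.0, 0.1, 0.0], x=1.0, target=0.8)
```

Each update moves the parameters against the gradient, so the error loss on the sample decreases over the iterations.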

(6) Nash equilibrium

Also known as non-cooperative game equilibrium, it is an important term in game theory. In a game, if one party will choose a certain strategy regardless of the other party's strategy choice, that strategy is called a dominant strategy. If every player's strategy is optimal given that the strategies of all other players are fixed, then this strategy combination is defined as a Nash equilibrium.

A strategy combination is called a Nash equilibrium when each player's equilibrium strategy maximizes his or her expected return given that all other players also follow their equilibrium strategies.
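To make the definition concrete, the following sketch enumerates the pure-strategy Nash equilibria of a two-player game given as payoff matrices. The function name `pure_nash_equilibria` and the prisoner's dilemma payoffs are assumptions chosen for illustration and are not part of the application.

```python
from itertools import product


def pure_nash_equilibria(payoff_a, payoff_b):
    """Return all (row, col) pairs where neither player gains by deviating alone.

    payoff_a[r][c] and payoff_b[r][c] are the payoffs of players A and B
    when A plays row r and B plays column c.
    """
    rows, cols = len(payoff_a), len(payoff_a[0])
    equilibria = []
    for r, c in product(range(rows), range(cols)):
        best_for_a = all(payoff_a[r][c] >= payoff_a[r2][c] for r2 in range(rows))
        best_for_b = all(payoff_b[r][c] >= payoff_b[r][c2] for c2 in range(cols))
        if best_for_a and best_for_b:
            equilibria.append((r, c))
    return equilibria


# Prisoner's dilemma: action 0 = cooperate, action 1 = defect.
payoff_a = [[-1, -3], [0, -2]]
payoff_b = [[-1, 0], [-3, -2]]
equilibria = pure_nash_equilibria(payoff_a, payoff_b)
```

In this game, defecting is a dominant strategy for both players, and mutual defection `(1, 1)` is the only pure-strategy equilibrium found.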

(7) Reinforcement learning model

Reinforcement learning (RL), also known as evaluative learning, is one of the paradigms and methodologies of machine learning. It is used to describe and solve the problem of how an agent learns, during its interaction with the environment, strategies to maximize returns or achieve specific goals.

(8) Agent

An agent is a concept in the field of artificial intelligence. Any entity that can think independently and interact with the environment can be abstracted into an agent. The basic characteristics of an agent are that it can react to changes in the environment and then automatically adjust its behavior and status. Different agents can also interact with other agents according to their own intentions.

In the application process of reinforcement learning algorithms, it is often necessary to interact with the online environment to obtain data and conduct training. The general approach is to model real scenes in the real world and generate an online environment for virtual simulation. In this case, if there is a slight difference between the training environment and the real environment that needs to be deployed, it is likely to cause the trained algorithm to fail, causing the performance in the real scenario to be lower than expected.

The above problems can be alleviated by improving the robustness of reinforcement learning algorithms. One method is to introduce imaginary interference in the virtual environment and train the reinforcement learning algorithm under that interference, improving its ability to deal with interference and enhancing the robustness and generalization of the algorithm. In other words, an adversarial agent can be set up for the reinforcement learning model to be trained. The data output by the adversarial agent is applied to the task together with the output data of the reinforcement learning model, and the data output by the adversarial agent serves as interference in executing the target task. Because the differences between the training environment and the deployment environment cannot be foreseen, and existing training methods mainly resist certain specific interferences, the algorithm's performance degrades when the changes in the real environment are inconsistent with the imaginary interference.
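The role of the adversarial agent can be illustrated with a toy one-dimensional task. In the sketch below, the task dynamics, the fixed `controller` rule standing in for the reinforcement learning model, and the `adversary` rule are all hypothetical; they only show how control output and interference are applied together when executing a task.

```python
def step(state: float, control: float, interference: float) -> tuple:
    """Toy task: drive the state toward 0 despite interference."""
    next_state = state + control + interference
    reward = -abs(next_state)  # closer to 0 is better
    return next_state, reward


def controller(state: float) -> float:
    """Stand-in for the reinforcement learning model (a fixed rule here)."""
    return -0.5 * state


def adversary(state: float, strength: float = 0.2) -> float:
    """Stand-in for the adversarial agent: pushes the state away from 0."""
    return strength if state >= 0 else -strength


def rollout(initial_state: float, use_adversary: bool, steps: int = 20) -> float:
    """Run the task and return the accumulated reward."""
    state, total = initial_state, 0.0
    for _ in range(steps):
        d = adversary(state) if use_adversary else 0.0
        state, reward = step(state, controller(state), d)
        total += reward
    return total


clean_return = rollout(1.0, use_adversary=False)
disturbed_return = rollout(1.0, use_adversary=True)
```

The return obtained under interference is lower than the return without it, which is the gap that training under interference is meant to reduce.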

In order to solve the above problem, refer to Figure 4, which is a flow diagram of a model training method provided by an embodiment of the present application. As shown in Figure 4, a model training method provided by an embodiment of the present application includes:

401. Process the first data through the first reinforcement learning model to obtain a first processing result, wherein the first data indicates the state of the target object, and the first processing result is used as control information when executing the target task on the target object.

The execution subject of step 401 may be a training device (for example, the training device may be a terminal device or a server). For details, reference may be made to the description in the above embodiments, which will not be described again here.

In a possible implementation, the training device can obtain the model training object (first reinforcement learning model) and the training sample (first data).

In a possible implementation, the first reinforcement learning model may be an initialized model or the output of an iteration in the model training process.

Optionally, in a possible implementation, the first processing result can be used as a hard constraint imposed on the target object when performing the target task.

It should be understood that the reinforcement learning models in the embodiments of this application include but are not limited to deep neural networks, Bayesian neural networks, etc.

402. Process the first data through the first target neural network to obtain a second processing result, where the first processing result is used to execute the target task, the second processing result is used as interference when executing the target task, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained in the process of iteratively training the first initial neural network.

In one possible implementation, the training device can obtain an adversarial agent for the reinforcement learning model, and the adversarial agent can output interference information for the target task.

It should be understood that the first target neural network in the embodiment of this application includes but is not limited to deep neural network, Bayesian neural network, etc.

In a possible implementation, when determining the adversarial agent that outputs interference information for the first reinforcement learning model, the first target neural network may be selected from the plurality of first neural networks, where each first neural network is an iteration result obtained in the process of iteratively training the first initial neural network.

For example, in the process of iteratively training the first initial neural network, neural network 1, neural network 2, neural network 3, neural network 4, neural network 5, neural network 6, neural network 7, neural network 8 and neural network 9 can be obtained. When determining the adversarial agent that outputs interference information for the first reinforcement learning model, one neural network can be selected from the set [neural network 1, neural network 2, neural network 3, neural network 4, neural network 5, neural network 6, neural network 7, neural network 8, neural network 9].

In a possible implementation, selecting the first target neural network from a plurality of first neural networks includes: selecting the first target neural network from the plurality of first neural networks based on a first selection probability corresponding to each first neural network in the plurality of first neural networks. That is to say, each first neural network can be configured with a corresponding probability (that is, the first selection probability described above). When the first target neural network is selected from the plurality of first neural networks, sampling can be performed according to the probability distribution corresponding to the plurality of first neural networks, and the network is selected based on the sampling result.
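The sampling described above can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the function name `select_target_network`, the string placeholders standing in for network checkpoints, and the uniform starting distribution are all assumptions made for the example.

```python
import random
from typing import Sequence


def select_target_network(pool: Sequence[str],
                          selection_probs: Sequence[float],
                          rng: random.Random) -> str:
    """Sample one historical iteration result according to its selection probability."""
    if len(pool) != len(selection_probs):
        raise ValueError("each candidate needs exactly one selection probability")
    # random.choices draws with replacement using the given weights.
    return rng.choices(pool, weights=selection_probs, k=1)[0]


# Hypothetical pool: nine checkpoints saved while training the first initial network.
pool = [f"neural_network_{i}" for i in range(1, 10)]
uniform_probs = [1.0 / len(pool)] * len(pool)
chosen = select_target_network(pool, uniform_probs, random.Random(0))
```

Because the choice is drawn from a distribution instead of always taking the latest checkpoint, any checkpoint with non-zero probability can be used as the adversarial agent in a given training round.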

Next, the first selection probability is introduced:

In a possible implementation, the processing result obtained by each first neural network from processing data is used as interference when executing the target task, and the first selection probability is positively correlated with the degree to which the processing result output by the corresponding first neural network interferes with the target task. When the reinforcement learning model and the adversarial agent are updated, a reward value can be obtained. The reward value can represent how well the data output by the reinforcement learning model performs the target task, and can also represent the degree to which the interference information output by the adversarial agent interferes with the target task. The probability distribution corresponding to the first neural networks can be updated based on the reward value, so that the first selection probability is positively correlated with the degree to which the processing result output by the corresponding first neural network interferes with the target task. In this way, on the one hand, an adversarial agent whose output causes interference of large magnitude has a larger sampling probability and is more easily sampled, which increases the degree of interference applied to the reinforcement learning model. On the other hand, an adversarial agent whose output causes interference of small magnitude has a small sampling probability but may still be sampled, which increases the richness of the interference applied to the reinforcement learning model and improves the generalization of the network.
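One way to obtain selection probabilities that grow with the degree of interference while never reaching zero is a softmax over per-network interference scores. The sketch below is only an assumption chosen for illustration: the text does not prescribe a softmax, and the score definition (for example, the negative of the reward obtained by the reinforcement learning model against that network) and the `temperature` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def update_selection_probs(interference_scores: Sequence[float],
                           temperature: float = 1.0) -> List[float]:
    """Map interference scores to probabilities; a higher score gives a higher probability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(interference_scores)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp((s - top) / temperature) for s in interference_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical scores for three historical adversaries (larger = stronger interference).
probs = update_selection_probs([0.2, 1.5, 3.0])
```

With this choice, the strongest adversary is sampled most often, and weak adversaries keep a small but non-zero chance of being sampled, which matches the two effects described above.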

In a possible implementation, during the feedforward process of model training, the first data can be processed through the first target neural network to obtain a second processing result, and the second processing result is used as interference information when executing the target task.

For example, in a robot control scenario, the second processing result may be a force or moment applied to at least one joint of the robot. In an autonomous driving scenario, the second processing result may be an obstacle placed in the road conditions of the vehicle, or other obstacle information that can affect the driving strategy.

In a possible implementation, in order to improve the richness of the interference applied to the reinforcement learning model, multiple adversarial agents can be trained, and for each of the multiple adversarial agents, the adversarial agent that interferes with the reinforcement learning model can be selected from among its multiple iteration results.

For example, in the process of iteratively training the first initial neural network, neural network A1, neural network A2, neural network A3, neural network A4, neural network A5, neural network A6, neural network A7, neural network A8 and neural network A9 can be obtained. When determining the adversarial agent that outputs interference information for the first reinforcement learning model, one neural network can be selected from the set [neural network A1, neural network A2, neural network A3, neural network A4, neural network A5, neural network A6, neural network A7, neural network A8, neural network A9]; this is the first target neural network in the above embodiment. For the second initial neural network, which is different from the first initial neural network, neural network B1, neural network B2, neural network B3, neural network B4, neural network B5, neural network B6, neural network B7, neural network B8 and neural network B9 can be obtained in the process of iteratively training the second initial neural network. When determining the adversarial agent that outputs interference information for the first reinforcement learning model, one neural network can be selected from the set [neural network B1, neural network B2, neural network B3, neural network B4, neural network B5, neural network B6, neural network B7, neural network B8, neural network B9]; this is the second target neural network. The data output by the first target neural network and the second target neural network can be used as interference information applied to the first reinforcement learning model.

Specifically, in a possible implementation, during the feedforward process, the first data can be processed through the second target neural network to obtain a fourth processing result, where the fourth processing result is used as interference when executing the target task, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iteration result obtained in the process of iteratively training the second initial neural network; the first initial neural network and the second initial neural network are different.

403. According to the first processing result and the second processing result, execute the target task to obtain a third processing result.

The first processing result can be used as a hard constraint when executing the target task, that is, the first processing result can be used as the control information that the target object needs to satisfy when executing the target task, and the second processing result can be applied to the target object as interference when executing the target task. The third processing result can be the state of the target object when (or after) it performs the target task, and the third processing result can be used to determine the reward value.

It should be understood that the first processing result and the second processing result may be only part of the data used to determine the third processing result. Other interference information in addition to the second processing result (such as the fourth processing result introduced in the above embodiment) may also be included, and the target task can be executed based on the first processing result, the second processing result and the other processing results to obtain the third processing result.
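Step 403 can be pictured with a toy environment in which the control output and all interference outputs act on the target object together. The sketch below is an assumption for illustration only: the `PointMass` class, its one-dimensional dynamics, and the use of forward speed as the reward are hypothetical stand-ins for the simulated robot.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class PointMass:
    """Hypothetical 1-D target object: unit mass moved by the net applied force."""
    position: float = 0.0
    velocity: float = 0.0
    dt: float = 0.1

    def step(self, control: float,
             interferences: Sequence[float]) -> Tuple[Tuple[float, float], float]:
        # First processing result (control) plus every interference
        # (second, fourth, ... processing results) act at the same time.
        net_force = control + sum(interferences)
        self.velocity += net_force * self.dt
        self.position += self.velocity * self.dt
        next_state = (self.position, self.velocity)  # plays the role of the third processing result
        reward = self.velocity                        # assumed reward: forward speed
        return next_state, reward


env = PointMass()
next_state, reward = env.step(control=1.0, interferences=[-0.4, 0.1])
```

The returned state corresponds to the third processing result, and the reward derived from it is what would drive the update in step 404.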

404. Update the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model.

In a possible implementation, when updating the model, the first reinforcement learning model can be updated according to the third processing result to obtain an updated first reinforcement learning model.

For example, when updating the first reinforcement learning model, the cumulative reward obtained can be maximized. The update can use a reinforcement learning algorithm for continuous action spaces; optionally, the trust region policy optimization (TRPO) algorithm can be used.

The first target neural network and the second target neural network can perform different adversarial tasks. In a possible implementation, the adversarial task required for the current training can be selected from multiple adversarial tasks (for example, in order).

In a possible implementation, a historical strategy of the adversarial agent can be sampled from the adversarial agent's set of historical strategies according to the Nash equilibrium distribution and used against the reinforcement learning strategy. The selected adversarial agent strategy and the current reinforcement learning strategy are deployed in the training environment and sampled to obtain the required training samples, and the obtained training samples are used to train the reinforcement learning strategy. That is to say, the first reinforcement learning model is updated according to the first processing result, the second processing result and the fourth processing result to obtain an updated first reinforcement learning model (that is, the target task is executed according to the first processing result, the second processing result and the fourth processing result to obtain the third processing result, and the first reinforcement learning model is updated according to the third processing result).

In one possible implementation, after an adversarial agent has been sampled for each adversarial task, the reinforcement learning strategy and the updated strategy of the adversarial agent can be added to the Nash equilibrium matrix, and the Nash equilibrium can be calculated to obtain the Nash equilibrium distributions of the reinforcement learning agent and the adversarial agents (that is, the first selection probability described above and the second selection probability introduced later). Specifically, updating the first reinforcement learning model according to the first processing result and the second processing result includes: obtaining the reward value corresponding to the target task according to the first processing result and the second processing result; and updating the first reinforcement learning model according to the reward value. Further, the first selection probability corresponding to the first target neural network can be updated according to the reward value.
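As an illustration of how a Nash equilibrium distribution might be computed from such a value matrix, the sketch below uses fictitious play on a two-player zero-sum matrix game. Fictitious play is a technique swapped in for the example; the text does not state which solver is used. The function name and the convention that `payoff[i][j]` is the return of reinforcement learning strategy `i` against adversarial strategy `j` are assumptions.

```python
from typing import List, Sequence, Tuple


def fictitious_play(payoff: Sequence[Sequence[float]],
                    iterations: int = 5000) -> Tuple[List[float], List[float]]:
    """Approximate the mixed equilibrium of a zero-sum game by repeated best responses.

    The row player (reinforcement learning strategies) maximizes the payoff,
    and the column player (adversarial strategies) minimizes it.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_counts[0] = col_counts[0] = 1  # arbitrary initial plays
    for _ in range(iterations):
        # Value of each pure strategy against the opponent's empirical play so far.
        row_values = [sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
                      for i in range(n_rows)]
        col_values = [sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
                      for j in range(n_cols)]
        row_counts[row_values.index(max(row_values))] += 1
        col_counts[col_values.index(min(col_values))] += 1
    row_total, col_total = sum(row_counts), sum(col_counts)
    return ([c / row_total for c in row_counts],
            [c / col_total for c in col_counts])


# Matching pennies: the equilibrium mixes both strategies about equally.
rl_dist, adv_dist = fictitious_play([[1.0, -1.0], [-1.0, 1.0]])
```

The column distribution can then serve as the selection probabilities over historical adversarial strategies, and the row distribution as those over historical reinforcement learning strategies.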

Next, taking the target object as a robot and the target task as robot control as an example, we will introduce a software architecture of the embodiment of this application:

Referring to Figure 5, Figure 5 shows a robot control system. As shown in Figure 5, the robot control system may include a state sensing and processing module, a robust decision-making module, and a robot control module.

Regarding the state sensing and processing module: the function of this module is to sense the information of the robot (such as the information used to describe the state of the target object introduced in the above embodiment, for example the first data and the second data). Specifically, it combines the information transmitted by each sensor to determine the robot's own status, including the robot's basic information (position, speed), the status of each joint (position, angle, speed, acceleration) and other information, and passes this information to the decision-making module.

Regarding the robust decision-making module: the function of this module is to output upper-level behavioral decisions for a future period based on the current robot status and the task being performed (such as the control information used when performing the target task on the target object, as introduced in the above embodiment). Specifically, based on the current state of the robot output by the state sensing and processing module, this module can output behavioral decisions for a period of time in the future through the method corresponding to Figure 4, and pass them to the robot control module.

About the robot control module: This module controls the movement of the robot by controlling the joints of the robot and executing the behavior output by the robust decision-making module.
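The data flow between the three modules can be sketched as below. All class names, field names, and the simple proportional policy are hypothetical and serve only to show how sensor readings become a state, the state becomes a decision, and the decision becomes joint commands; in the described system the decision would come from the trained reinforcement learning strategy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RobotState:
    position: float
    speed: float
    joint_angles: List[float]


class StateSensingModule:
    def fuse(self, readings: Dict[str, object]) -> RobotState:
        """Combine raw sensor readings into one state description."""
        return RobotState(position=float(readings["position"]),
                          speed=float(readings["speed"]),
                          joint_angles=[float(a) for a in readings["joint_angles"]])


class RobustDecisionModule:
    def __init__(self, policy: Callable[[RobotState], List[float]]):
        self.policy = policy  # stands in for the trained reinforcement learning strategy

    def decide(self, state: RobotState) -> List[float]:
        return self.policy(state)


@dataclass
class RobotControlModule:
    applied: List[List[float]] = field(default_factory=list)

    def execute(self, joint_commands: List[float]) -> List[float]:
        """Record the commands that would be sent to the joint actuators."""
        self.applied.append(list(joint_commands))
        return self.applied[-1]


sensing = StateSensingModule()
decision = RobustDecisionModule(lambda s: [-0.5 * a for a in s.joint_angles])
control = RobotControlModule()

state = sensing.fuse({"position": 0.0, "speed": 1.2, "joint_angles": [0.2, -0.4]})
commands = control.execute(decision.decide(state))
```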

Specifically, refer to FIG. 6, which is a flowchart of applying the model training method in the embodiment of the present application to a robot control simulation scenario. Using a multi-task framework and game-theoretic optimization, the robot adopts the model training method in the embodiment of this application and finally outputs behavioral decisions that maximize the forward speed of the robot and obtain more reward. The implementation is introduced in detail below.

S1. Input the parameters of multi-task learning Φ = [φ₁, φ₂, ...], initialize the reinforcement learning strategy π, and initialize the adversarial agent strategy μ_i for each task i. The i-th parameter in Φ is selected as the action space parameter of the adversarial agent strategy μ_i, thereby constructing multiple tasks. In each task, the adversarial agent can exert an interference force on the body of the simulated robot. The initial Nash equilibrium distribution can be a uniform distribution.

S2. Select the corresponding adversarial agent in sequence according to Φ as the current adversarial task.

S3. Based on the Nash equilibrium distribution of the adversarial agent, sample a historical strategy of the adversarial agent as the adversarial strategy for the current adversarial task, and deploy the adversarial strategy to the training environment.

S4. According to the reinforcement learning strategy π and the adversarial agent strategy μ_{i,t}, control the robot to sample in the training environment to obtain M samples (s, a_pro, a_adv, s′, r), where s is the state, r is the reward, and a_pro and a_adv are the behavior output by the reinforcement learning strategy and the behavior output by the adversarial agent, respectively.

S5. Update the reinforcement learning strategy π. The objective function of the update is:

That is, to maximize the cumulative reward obtained, the update method can adopt the reinforcement learning algorithm of the continuous action space, and optionally, the trust region policy optimization algorithm (TRPO) can be adopted.

S6. Based on the Nash equilibrium distribution over historical reinforcement learning strategies, sample a historical strategy of the reinforcement learning agent as the reinforcement learning strategy that the current adversarial agent strategy needs to interfere with, and deploy that reinforcement learning strategy to the training environment.

S7. According to the reinforcement learning strategy π_{i,t} and the adversarial agent strategy μ_i, control the robot to sample in the training environment to obtain M samples (s, a_pro, a_adv, s′, r).

S8. Update the adversarial agent strategy μ_i. The objective function of the update is:

That is, to minimize the cumulative reward obtained by the reinforcement learning strategy and prevent the reinforcement learning agent from achieving its goal. The update can use a reinforcement learning algorithm for continuous action spaces; optionally, the trust region policy optimization (TRPO) algorithm can be used.

S9. Every k steps, for each task, add the reinforcement learning strategy and the updated strategy of the adversarial agent to the Nash equilibrium matrix, test the performance of the newly added strategies against the existing historical strategies in the training environment to obtain the entries of the Nash equilibrium value matrix for the newly added strategies, and calculate the Nash equilibrium based on the value matrix to obtain the Nash equilibrium distributions of the reinforcement learning agent and the adversarial agents.

Determine whether the current task is over. If it is not over, execute step S2. Otherwise, execute step S10.

S10. Deploy the trained reinforcement learning strategy to a test environment that is different from the training environment to test the robustness.
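The control flow of steps S1 to S8 can be sketched with a deliberately small toy problem. Everything concrete here is an assumption for illustration: the strategies are single scalars, `episode_return` replaces sampling M transitions in a simulator, a random-search `hill_climb` is swapped in for TRPO, and uniform sampling over the history stands in for the Nash equilibrium distribution (so the recomputation in S9 is omitted).

```python
import random
from typing import Callable, List, Sequence, Tuple


def episode_return(theta: float, omega: float, phi: float) -> float:
    """Toy rollout: reward is highest when the net forward force equals 1."""
    a_adv = max(-phi, min(phi, omega))  # adversary limited by its action-space parameter phi_i
    return -((theta + a_adv) - 1.0) ** 2


def hill_climb(param: float, objective: Callable[[float], float],
               rng: random.Random, step: float = 0.1, trials: int = 8) -> float:
    """Random-search update, a placeholder for a policy-gradient method such as TRPO."""
    best, best_value = param, objective(param)
    for _ in range(trials):
        candidate = param + rng.uniform(-step, step)
        value = objective(candidate)
        if value > best_value:
            best, best_value = candidate, value
    return best


def train(phis: Sequence[float], rounds: int = 50,
          seed: int = 0) -> Tuple[float, List[float], List[List[float]]]:
    rng = random.Random(seed)
    theta = 0.0                                   # S1: initialize pi
    theta_history = [theta]
    adversary_history = [[0.0] for _ in phis]     # S1: initialize mu_i for each task
    for _ in range(rounds):
        for i, phi in enumerate(phis):            # S2: take the tasks in order
            omega_old = rng.choice(adversary_history[i])   # S3: sample a historical adversary
            theta = hill_climb(                   # S4-S5: maximize the return of pi
                theta, lambda t: episode_return(t, omega_old, phi), rng)
            theta_old = rng.choice(theta_history)          # S6: sample a historical pi
            omega = hill_climb(                   # S7-S8: minimize the return of pi
                adversary_history[i][-1],
                lambda w: -episode_return(theta_old, w, phi), rng)
            adversary_history[i].append(omega)
            theta_history.append(theta)
    return theta, theta_history, adversary_history


theta, theta_history, adversary_history = train(phis=[0.2, 0.5])
```

Keeping every intermediate strategy in `theta_history` and `adversary_history` is what allows later rounds to train against historical opponents instead of only the most recent one.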

The above-described embodiment uses a robust reinforcement learning control framework based on multi-task learning and game theory, and constructs multiple adversarial tasks by changing the action space of the adversarial agent to improve the robustness of the reinforcement learning algorithm. In addition, an optimization framework based on game theory is introduced to select the most appropriate adversarial strategy based on the performance of historical strategies during the training process of each task, making the reinforcement learning strategy more robust.

It should be understood that the game theory optimization framework in the embodiments of this application includes but is not limited to policy-space response oracles (PSRO), etc.; the training of reinforcement learning models includes but is not limited to sampling-based reinforcement learning algorithms, such as the trust region policy optimization (TRPO) algorithm and the proximal policy optimization (PPO) algorithm.

The present application provides a model training method, which includes: processing first data through a first reinforcement learning model to obtain a first processing result, where the first data indicates the state of the target object, and the first processing result is used as control information when performing a target task on the target object; processing the first data through a first target neural network to obtain a second processing result, where the second processing result is used as interference information when performing the target task, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained in the process of iteratively training the first initial neural network; executing the target task according to the first processing result and the second processing result to obtain a third processing result; and updating the first reinforcement learning model according to the third processing result to obtain the updated first reinforcement learning model. In existing implementations, a single adversarial agent for outputting interference information can be trained, and the interference information applies only one kind of interference to the target task. In the embodiment of the present application, on the one hand, multiple adversarial agents for outputting interference information can be trained, and the interference information output by different adversarial agents can apply different types of interference to the target task. On the other hand, when training against adversarial agents, not only is the adversarial agent obtained in the latest iteration used to output interference for the target task, but the historical training results of the adversarial agents (adversarial agents obtained during historical iterations) can also be used to output interference for the target task, so that more effective interference for the target task can be obtained in different scenarios, thereby improving the training effect and generalization of the model.

Referring to Figure 7, Figure 7 is a schematic structural diagram of a model training device provided by an embodiment of the present application. As shown in Figure 7, the device 700 includes:

The data processing module 701 is used to process the first data through the first reinforcement learning model to obtain a first processing result, where the first data indicates the state of the target object, and the first processing result is used as control information when performing the target task on the target object;

For the specific description of the data processing module 701, reference may be made to the description of step 401, step 402, and step 403 in the above embodiment, which will not be described again here.

The model update module 702 is configured to update the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model.

For a specific description of the model update module 702, reference may be made to the description of step 404 in the above embodiment, which will not be described again here.

In one possible implementation,

In a possible implementation, the model update module is specifically used to:

According to the reward value, update the first reinforcement learning model;

The model update module is also used to:

In a possible implementation, the data processing module is also used to:

The data processing module is specifically used for:

In a possible implementation, the interference types of the second processing result and the fourth processing result are different; or,

In a possible implementation, the data processing module is also used to:

The model update module is also used to:

In a possible implementation, the second reinforcement learning model is selected from multiple reinforcement learning models based on the second selection probability corresponding to each reinforcement learning model in the multiple reinforcement learning models.

Next, an execution device provided by an embodiment of the present application is introduced. Please refer to Figure 8, which is a schematic structural diagram of an execution device provided by an embodiment of the present application. The execution device 800 can be embodied as a mobile phone, a tablet, a notebook computer, a smart wearable device, etc., which is not limited here. Specifically, the execution device 800 includes: a receiver 801, a transmitter 802, a processor 803 and a memory 804 (the number of processors 803 in the execution device 800 can be one or more; one processor is taken as an example in Figure 8), where the processor 803 may include an application processor 8031 and a communication processor 8032. In some embodiments of the present application, the receiver 801, the transmitter 802, the processor 803, and the memory 804 may be connected through a bus or other means.

Memory 804 may include read-only memory and random access memory, and provides instructions and data to processor 803. A portion of memory 804 may also include non-volatile random access memory (NVRAM). Memory 804 stores processor-executable operation instructions, executable modules or data structures, or subsets thereof, or extended sets thereof, where the operation instructions may include various operation instructions for implementing various operations.

Processor 803 controls execution of operations of the device. In specific applications, various components of the execution device are coupled together through a bus system. In addition to the data bus, the bus system may also include a power bus, a control bus, a status signal bus, etc. However, for the sake of clarity, various buses are called bus systems in the figure.

The methods disclosed in the above embodiments of the present application can be applied to the processor 803 or implemented by the processor 803. The processor 803 may be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by integrated logic circuits in hardware or by instructions in the form of software in the processor 803. The above-mentioned processor 803 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and can further include an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The processor 803 can implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc. The steps of the method disclosed in conjunction with the embodiments of the present application can be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module can be located in a storage medium that is mature in this field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 804. The processor 803 reads the information in the memory 804 and completes the steps of the above method in combination with its hardware.

The receiver 801 may be used to receive input numeric or character information and generate signal inputs related to performing relevant settings and functional controls of the device. The transmitter 802 can be used to output numeric or character information; the transmitter 802 can also be used to send instructions to the disk group to modify data in the disk group.

In the embodiment of the present application, in one case, the processor 803 is configured to run the model obtained through the model training method in the embodiment corresponding to FIG. 4.

The embodiment of the present application also provides a server. Please refer to Figure 9, which is a schematic structural diagram of the server provided by the embodiment of the present application. Specifically, the server 900 is implemented by one or more servers. The server 900 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPU) 99 (for example, one or more processors), memory 932, and one or more storage media 930 (for example, one or more mass storage devices) that store application programs 942 or data 944. The memory 932 and the storage medium 930 may be short-term storage or persistent storage. The program stored in the storage medium 930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Furthermore, the central processor 99 may be configured to communicate with the storage medium 930 and execute, on the server 900, the series of instruction operations in the storage medium 930.

The server 900 may also include one or more power supplies 99, one or more wired or wireless network interfaces 950, one or more input and output interfaces 958, and/or one or more operating systems 941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ and so on.

In this embodiment of the present application, the central processor 99 is used to execute the steps of the model training method in the corresponding embodiment of FIG. 4 .

An embodiment of the present application also provides a computer program product including computer-readable instructions which, when run on a computer, cause the computer to perform the steps performed by the foregoing execution device, or cause the computer to perform the steps performed by the foregoing training device.

Embodiments of the present application also provide a computer-readable storage medium. The computer-readable storage medium stores a program for performing signal processing. When the program is run on a computer, it causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.

The execution device, training device or terminal device provided by the embodiment of the present application may specifically be a chip. The chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins or circuits. The processing unit can execute the computer-executable instructions stored in the storage unit, so that the chip in the execution device executes the model training method described in the above embodiment, or so that the chip in the training device executes the steps related to model training in the above embodiment. Optionally, the storage unit is a storage unit within the chip, such as a register or cache. The storage unit may also be a storage unit located outside the chip in the wireless access device, such as read-only memory (ROM) or other types of static storage devices that can store static information and instructions, or random access memory (RAM).

Specifically, please refer to Figure 10, which is a schematic structural diagram of a chip provided by an embodiment of the present application. The chip can be embodied as a neural network processor NPU 1000. The NPU 1000 is mounted on the main CPU (Host CPU) as a co-processor, and the Host CPU allocates tasks to it. The core part of the NPU is the arithmetic circuit 1003. The controller 1004 controls the arithmetic circuit 1003 to extract matrix data from memory and perform multiplication operations.

In some implementations, the arithmetic circuit 1003 includes multiple processing units (Process Engine, PE). In some implementations, arithmetic circuit 1003 is a two-dimensional systolic array. The arithmetic circuit 1003 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1003 is a general-purpose matrix processor.

For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data of matrix B from the weight memory 1002 and caches it on each PE in the arithmetic circuit. The arithmetic circuit then fetches the data of matrix A from the input memory 1001, performs the matrix operation with matrix B, and stores the partial or final results of the resulting matrix in an accumulator 1008.
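The accumulation described above can be illustrated with a minimal software sketch. It is not a model of the actual circuit: the function name `matmul_with_accumulator`, the `tile` parameter, and the choice to split the shared dimension into slices are assumptions made only for illustration. Each pass multiplies a cached slice of B by the matching slice of A and adds the partial result into an accumulator.

```python
def matmul_with_accumulator(a, b, tile=2):
    """Compute C = A x B by adding partial products into an accumulator.

    a is an M x K list of rows, b is a K x N list of rows.
    `tile` (hypothetical) is how many rows of B are cached per pass.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions must match")
    acc = [[0.0] * n for _ in range(m)]      # plays the role of accumulator 1008
    for start in range(0, k, tile):
        stop = min(start + tile, k)
        b_tile = b[start:stop]               # weight slice cached on the PEs
        for i in range(m):
            a_slice = a[i][start:stop]       # matching input slice of row i
            for j in range(n):
                # partial result for C[i][j], added to what is already stored
                acc[i][j] += sum(x * w[j] for x, w in zip(a_slice, b_tile))
    return acc

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
C = matmul_with_accumulator(A, B)
print(C)  # [[7.0, 8.0], [16.0, 17.0]]
```

After the first pass the accumulator holds only partial results; it holds the final matrix C once every slice of the shared dimension has been processed.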

The unified memory 1006 is used to store input data and output data. Weight data is transferred directly to the weight memory 1002 through the direct memory access controller (DMAC) 1005. Input data is also transferred to the unified memory 1006 through the DMAC.

The bus interface unit (BIU) 1010 is used for interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 1009.

The bus interface unit 1010 is used by the instruction fetch buffer 1009 to obtain instructions from the external memory, and is also used by the DMAC 1005 to obtain the original data of the input matrix A or the weight matrix B from the external memory.

The DMAC is mainly used to transfer input data from the external memory (DDR) to the unified memory 1006, weight data to the weight memory 1002, or input data to the input memory 1001.
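The data paths described above can be summarized in a short sketch. The class `NpuMemories`, the function `dmac_transfer`, and the string keys used to pick a destination are hypothetical names introduced only for illustration; the sketch models each on-chip memory as a dictionary and the DMAC as a copy from external memory into one of them.

```python
from dataclasses import dataclass, field

@dataclass
class NpuMemories:
    unified: dict = field(default_factory=dict)    # unified memory 1006
    weight: dict = field(default_factory=dict)     # weight memory 1002
    input_mem: dict = field(default_factory=dict)  # input memory 1001

def dmac_transfer(ddr, mems, name, target):
    """Copy one named buffer from external DDR into the chosen on-chip memory."""
    destinations = {
        "unified": mems.unified,
        "weight": mems.weight,
        "input": mems.input_mem,
    }
    if target not in destinations:
        raise ValueError(f"unknown target memory: {target}")
    destinations[target][name] = list(ddr[name])   # copy; the DDR data is unchanged
    return destinations[target][name]

ddr = {"A": [1, 2, 3], "B": [4, 5, 6]}
mems = NpuMemories()
dmac_transfer(ddr, mems, "A", "input")    # input matrix data to input memory
dmac_transfer(ddr, mems, "B", "weight")   # weight matrix data to weight memory
```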

The vector calculation unit 1007 includes multiple arithmetic processing units and, if necessary, further processes the output of the arithmetic circuit, for example by vector multiplication, vector addition, exponential operation, logarithmic operation, or magnitude comparison. It is mainly used for the computation of layers other than convolutional and fully connected layers in a neural network, such as batch normalization, pixel-level summation, and upsampling of feature planes.

In some implementations, the vector calculation unit 1007 can store the processed output vectors in the unified memory 1006. For example, the vector calculation unit 1007 can apply a linear or nonlinear function to the output of the arithmetic circuit 1003, such as linear interpolation of the feature planes extracted by a convolutional layer, or to a vector of accumulated values in order to generate activation values. In some implementations, the vector calculation unit 1007 generates normalized values, pixel-wise summed values, or both. In some implementations, the processed output vectors can be used as activation inputs to the arithmetic circuit 1003, for example for use in a subsequent layer of a neural network.
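As a rough illustration of this kind of post-processing, the sketch below normalizes a vector of accumulated values and then applies a nonlinear function to produce activation values. The function names are hypothetical, and the choice of ReLU as the nonlinear function is an assumption; the text above does not name a specific function.

```python
import math

def batch_normalize(values, eps=1e-5):
    """Shift and scale a vector to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(values):
    """Nonlinear function assumed for this sketch: negative entries become 0."""
    return [max(0.0, v) for v in values]

def vector_unit(accumulated, eps=1e-5):
    """Post-process accumulator output into activation values."""
    return relu(batch_normalize(accumulated, eps))

# Accumulated values as they might come out of the arithmetic circuit
activations = vector_unit([7.0, 8.0, 16.0, 17.0])
```

The resulting activation values could then be fed back as inputs for a subsequent layer.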

The instruction fetch buffer 1009, which is connected to the controller 1004, is used to store instructions used by the controller 1004.

The unified memory 1006, the input memory 1001, the weight memory 1002, and the instruction fetch buffer 1009 are all on-chip memories. The external memory is private to the NPU hardware architecture.

The processor mentioned anywhere above can be a general-purpose central processing unit, a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the above programs.

In addition, it should be noted that the device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the device embodiments provided in this application, the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.

Through the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware. Of course, it can also be implemented by dedicated hardware, including dedicated integrated circuits, dedicated CPUs, dedicated memories, and dedicated components. In general, any function performed by a computer program can easily be implemented with corresponding hardware. Moreover, the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, a software implementation is the better implementation in most cases. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, and includes several instructions that cause a computer device (which can be a personal computer, a training device, a network device, or the like) to execute the methods described in the various embodiments of this application.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, they may be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium on which a computer can store data, or a data storage device, such as a training device or a data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., solid state disk (SSD)).

Claims

A model training method, characterized in that the method includes:

The first data is processed through a first reinforcement learning model to obtain a first processing result; wherein the first data indicates the state of a target object, and the first processing result is used as control information when executing a target task on the target object;

The first data is processed through a first target neural network to obtain a second processing result; wherein the second processing result is used as interference information when executing the target task, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iterative result obtained during the iterative training of a first initial neural network;

According to the first processing result and the second processing result, execute the target task and obtain a third processing result;

According to the third processing result, the first reinforcement learning model is updated to obtain an updated first reinforcement learning model.
The method according to claim 1, characterized in that:

The target object is a robot; the target task is the attitude control of the robot, and the first processing result is the attitude control information of the robot; or,

The target object is a vehicle; the target task is automatic driving of the vehicle; and the first processing result is the driving control information of the vehicle.
The method according to claim 1 or 2, characterized in that,

The first target neural network is selected from a plurality of first neural networks based on a first selection probability corresponding to each first neural network in the plurality of first neural networks.
The method according to claim 3, characterized in that the processing result obtained by each first neural network processing data is used as interference when executing the target task, and the first selection probability is positively correlated with the degree of interference that the processing result output by the corresponding first neural network causes to the target task.
The method according to claim 3 or 4, characterized in that, updating the first reinforcement learning model according to the third processing result includes:

According to the third processing result, the reward value corresponding to the target task is obtained;

According to the reward value, update the first reinforcement learning model;

The method also includes:

According to the reward value, the first selection probability corresponding to the first target neural network is updated.
The method according to any one of claims 1 to 5, characterized in that the method further includes:

The first data is processed through a second target neural network to obtain a fourth processing result; wherein the fourth processing result is used as interference information when executing the target task, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iterative result obtained during the iterative training of a second initial neural network; the first initial neural network and the second initial neural network are different;

Executing the target task according to the first processing result and the second processing result to obtain a third processing result includes:

According to the first processing result, the fourth processing result and the second processing result, the target task is executed to obtain a third processing result.
The method according to claim 6, characterized in that:

The interference types of the second processing result and the fourth processing result are different; or,

The interference objects of the second processing result and the fourth processing result are different; or,

The first target neural network is used to determine the second processing result from a first numerical range according to the first data, the second target neural network is used to determine the fourth processing result from a second numerical range according to the first data, and the second numerical range is different from the first numerical range.
The method according to any one of claims 1 to 7, characterized in that the method further includes:

The second data is processed through a second reinforcement learning model to obtain a fifth processing result; wherein the second reinforcement learning model is selected from a plurality of reinforcement learning models including the updated first reinforcement learning model, and each of the plurality of reinforcement learning models is an iterative result obtained during the iterative training of an initial reinforcement learning model; the second data indicates the state of the target object, and the fifth processing result is used as control information when executing the target task on the target object;

The second data is processed through a third target neural network to obtain a sixth processing result; the third target neural network belongs to the plurality of first neural networks; the sixth processing result is used as interference information when executing the target task;

According to the fifth processing result and the sixth processing result, execute the target task and obtain a seventh processing result;

According to the seventh processing result, the third target neural network is updated to obtain an updated third target neural network.
The method according to claim 8, characterized in that the second reinforcement learning model is selected from the plurality of reinforcement learning models based on a second selection probability corresponding to each reinforcement learning model in the plurality of reinforcement learning models.
A model training device, characterized in that the device includes:

A data processing module, configured to process first data through a first reinforcement learning model to obtain a first processing result; wherein the first data indicates the state of a target object, and the first processing result is used as control information when executing a target task on the target object;

The first data is processed through a first target neural network to obtain a second processing result; wherein the second processing result is used as interference information when executing the target task, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iterative result obtained during the iterative training of a first initial neural network;

According to the first processing result and the second processing result, execute the target task and obtain a third processing result;

A model update module, configured to update the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model.
The device according to claim 10, characterized in that:

The target object is a robot; the target task is the attitude control of the robot, and the first processing result is the attitude control information of the robot; or,

The target object is a vehicle; the target task is automatic driving of the vehicle; and the first processing result is the driving control information of the vehicle.
The device according to claim 10 or 11, characterized in that,

The first target neural network is selected from a plurality of first neural networks based on a first selection probability corresponding to each first neural network in the plurality of first neural networks.
The device according to claim 12, characterized in that the processing result obtained by each first neural network processing data is used as interference when executing the target task, and the first selection probability is positively correlated with the degree of interference that the processing result output by the corresponding first neural network causes to the target task.
The device according to claim 12 or 13, characterized in that the model update module is specifically used for:

According to the third processing result, the reward value corresponding to the target task is obtained;

According to the reward value, update the first reinforcement learning model;

The model update module is also used to:

According to the reward value, the first selection probability corresponding to the first target neural network is updated.
The device according to any one of claims 10 to 14, characterized in that the data processing module is also used to:

The first data is processed through a second target neural network to obtain a fourth processing result; wherein the fourth processing result is used as interference information when executing the target task, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iterative result obtained during the iterative training of a second initial neural network; the first initial neural network and the second initial neural network are different;

The data processing module is specifically used for:

According to the first processing result, the fourth processing result and the second processing result, the target task is executed to obtain a third processing result.
The device according to claim 15, characterized in that:

The interference types of the second processing result and the fourth processing result are different; or,

The interference objects of the second processing result and the fourth processing result are different; or,

The first target neural network is used to determine the second processing result from a first numerical range according to the first data, the second target neural network is used to determine the fourth processing result from a second numerical range according to the first data, and the second numerical range is different from the first numerical range.
The device according to any one of claims 10 to 16, characterized in that the data processing module is also used to:

The second data is processed through a second reinforcement learning model to obtain a fifth processing result; wherein the second reinforcement learning model is selected from a plurality of reinforcement learning models including the updated first reinforcement learning model, and each of the plurality of reinforcement learning models is an iterative result obtained during the iterative training of an initial reinforcement learning model; the second data indicates the state of the target object, and the fifth processing result is used as control information when executing the target task on the target object;

The second data is processed through a third target neural network to obtain a sixth processing result; the third target neural network belongs to the plurality of first neural networks; the sixth processing result is used as interference information when executing the target task;

According to the fifth processing result and the sixth processing result, execute the target task and obtain a seventh processing result;

The model update module is also used to:

According to the seventh processing result, the third target neural network is updated to obtain an updated third target neural network.
The device according to claim 17, characterized in that the second reinforcement learning model is selected from the plurality of reinforcement learning models based on a second selection probability corresponding to each reinforcement learning model in the plurality of reinforcement learning models.
A model training device, characterized in that the device includes a memory and a processor; the memory stores code, and the processor is configured to obtain the code and execute the method described in any one of claims 1 to 9.
A computer-readable storage medium, characterized by comprising computer-readable instructions that, when run on a computer device, cause the computer device to execute the method described in any one of claims 1 to 9.
A computer program product, characterized in that it includes computer-readable instructions that, when run on a computer device, cause the computer device to execute the method according to any one of claims 1 to 9.
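
For illustration only, the training loop recited in the method claims (selecting one historical adversary according to a selection probability, executing the task under its interference, then updating both the reinforcement learning model and the selection probability from the reward) can be sketched as follows. Everything concrete in this sketch is an assumption rather than part of the application: the one-dimensional toy task, the single-parameter policy, the fixed pool of adversary gains, the finite-difference update, and the softmax over running scores are hypothetical stand-ins.

```python
import math
import random

def softmax(scores):
    """Turn selection scores into selection probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def run_task(state, control, interference):
    """Toy target task: the reward is higher the closer the next state is to 0."""
    next_state = state + control + interference   # stands in for the third processing result
    return -abs(next_state)                       # reward value

def train(steps=500, seed=0, lr=0.05, delta=1e-3):
    rng = random.Random(seed)
    policy_gain = 0.0                      # single parameter of the toy RL model
    adversary_pool = [0.1, 0.3, 0.5]       # hypothetical snapshots from past adversary iterations
    scores = [0.0] * len(adversary_pool)   # one selection score per snapshot
    for _ in range(steps):
        state = rng.uniform(-1.0, 1.0)                 # first data
        probs = softmax(scores)                        # first selection probabilities
        idx = rng.choices(range(len(adversary_pool)), weights=probs)[0]
        interference = adversary_pool[idx] * state     # second processing result
        control = -policy_gain * state                 # first processing result
        reward = run_task(state, control, interference)
        # Toy stand-in for an RL update: finite-difference ascent on the reward.
        nudged = run_task(state, -(policy_gain + delta) * state, interference)
        policy_gain += lr * (nudged - reward) / delta
        # Snapshots that hurt the policy more (lower reward) gain selection weight.
        scores[idx] += 0.1 * (-reward - scores[idx])
    return policy_gain, softmax(scores)

gain, selection_probs = train()
```

In this toy setup the interference is proportional to the state, so the learned gain should settle between 1.0 and 1.6, which is where the control roughly cancels both the state and the interference of the sampled adversaries.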