WO2023246819A1 - Model training method and related device

Model training method and related device

Info

Publication number
WO2023246819A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing result
neural network
target
reinforcement learning
data
Prior art date
Application number
PCT/CN2023/101527
Other languages
English (en)
Chinese (zh)
Inventor
和煦
李栋
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023246819A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/08 - Learning methods

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a model training method and related equipment.
  • Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
  • In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new class of intelligent machines that can respond in a manner similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, fundamental AI theory, and so on.
  • Reinforcement learning is an important machine learning method in the field of artificial intelligence, with many applications in fields such as autonomous driving, intelligent robot control, and analysis and prediction. The main problem that reinforcement learning addresses is how an intelligent device learns, by interacting directly with the environment, the skills needed to perform a specific task so as to maximize the long-term reward for that task. In practice, reinforcement learning algorithms usually need to interact with an online environment to obtain data for training. The common approach is to model a real-world scene and generate a virtual simulation environment for training. In this case, even a slight difference between the training environment and the real environment in which the model is deployed is likely to cause the trained algorithm to fail, so that its performance in the real scenario falls below expectations.
  • The above problem can be alleviated by improving the robustness of reinforcement learning algorithms.
  • One method is to introduce imaginary interference into the virtual environment and train the reinforcement learning algorithm under that interference, improving its ability to cope with interference and enhancing the robustness and generalization of the algorithm.
  • For example, an adversarial agent can be set up.
  • The data output by the adversarial agent and the data output by the reinforcement learning model are used together to perform the task, with the data output by the adversarial agent serving as interference when executing the target task.
  • However, the adversarial agent can only output one specific kind of interference (for example, for robot control, a force within a specific range applied to a certain joint as interference). When the changes in the real environment are inconsistent with the imaginary interference (that is, the interference output by the adversarial agent), the algorithm becomes less effective and less robust.
  • This application provides a model training method that can improve the training effect and generalization of the model.
  • this application provides a model training method.
  • The method includes: processing first data through a first reinforcement learning model to obtain a first processing result, where the first data indicates the state of a target object and the first processing result is used as control information when performing a target task on the target object; and processing the first data through a first target neural network to obtain a second processing result, where the second processing result is used as interference information when executing the target task.
  • The first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained from the process of iteratively training a first initial neural network.
  • The first reinforcement learning model may be an initialized model or the output of one iteration in the model training process.
  • The reinforcement learning models in the embodiments of this application include, but are not limited to, deep neural networks, Bayesian neural networks, and the like.
  • The first data can be processed through the first reinforcement learning model to obtain the first processing result, and the first processing result is used as control information when performing the target task on the target object.
  • For example, the target task is attitude control of a robot, and the first processing result is attitude control information of the robot; or,
  • the target task is automatic driving of a vehicle, and the first processing result is driving control information of the vehicle.
  • In one implementation, a single adversarial agent for outputting interference information is trained, and the interference information applies only one kind of interference to the target task.
  • In the embodiments of this application, multiple adversarial agents for outputting interference information can be trained, and the adversarial agents are different from one another.
  • The interference information output by different adversarial agents can apply different types of interference to the target task.
  • In addition, when training the adversarial agents, not only the adversarial agent obtained in the latest iteration is used to output interference for the target task; the historical training results (adversarial agents obtained during earlier iterations) can also be used to output interference for the target task. In this way, interference that is more effective for the target task and adapted to different scenarios can be obtained, thereby improving the training effect and generalization of the model, as illustrated by the sketch below.
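  • As a purely illustrative sketch of this idea (the names adversary_pool, selection_probs, rl_model, and env below are assumptions of the sketch, not part of the application), one training step in which the interference comes from an adversarial agent sampled from the pool of historical iterations could look as follows in Python:

        import random

        adversary_pool = []    # historical iterations of the adversarial neural network
        selection_probs = []   # first selection probability for each historical iteration

        def pick_adversary():
            # Fall back to uniform sampling if no selection probabilities are available yet.
            if not selection_probs:
                return random.choice(adversary_pool)
            return random.choices(adversary_pool, weights=selection_probs, k=1)[0]

        def training_step(rl_model, env, first_data):
            control = rl_model.act(first_data)        # first processing result (control information)
            adversary = pick_adversary()              # first target neural network
            interference = adversary.act(first_data)  # second processing result (interference information)
            reward, next_state = env.step(control, interference)  # outcome of the target task
            rl_model.update(first_data, control, reward, next_state)
            return reward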
  • In one possible implementation, the first data is status information related to a robot, the target task is attitude control of the robot, and the first processing result is attitude control information of the robot.
  • The robot-related status information may include, but is not limited to, the robot's position and speed and information related to the scene it is in (such as obstacle information).
  • The robot's position and speed may include information such as the status of each joint (position, angle, speed, acceleration, etc.).
  • The first reinforcement learning model can obtain the attitude control information of the robot based on the input data.
  • The attitude control information can include the control information of each joint of the robot, and the attitude control task of the robot can be performed based on the attitude control information.
  • In another possible implementation, the first data is vehicle-related status information, the target task is automatic driving of the vehicle, and the first processing result is driving control information of the vehicle.
  • The vehicle-related status information may include, but is not limited to, the vehicle's position and speed and information related to the scene in which it is located (such as road information, obstacle information, pedestrian information, and surrounding-vehicle information).
  • The first reinforcement learning model can obtain the driving control information of the vehicle based on the input data.
  • The driving control information can include information such as the vehicle's speed, direction, and driving trajectory. One possible way to structure this state and control information is sketched below.
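  • As an illustrative sketch only (the field names below are assumptions, not part of the application), the robot state used as the first data and the attitude control information used as the first processing result could be structured as follows:

        from dataclasses import dataclass, field
        from typing import List

        @dataclass
        class JointState:          # per-joint status
            position: float
            angle: float
            velocity: float
            acceleration: float

        @dataclass
        class RobotState:          # "first data": state of the target object
            joints: List[JointState]
            base_position: List[float]
            base_velocity: List[float]
            obstacles: List[List[float]] = field(default_factory=list)  # scene-related information

        @dataclass
        class AttitudeControl:     # "first processing result": control information
            joint_torques: List[float]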
  • In one possible implementation, the method further includes: selecting the first target neural network from the plurality of first neural networks.
  • In one possible implementation, the first target neural network is selected from the plurality of first neural networks based on a first selection probability corresponding to each of the plurality of first neural networks. That is to say, each first neural network can be configured with a corresponding probability (i.e., the first selection probability described above).
  • During selection, sampling can be performed based on the probability distribution corresponding to the multiple first neural networks, and the network is selected according to the sampling result.
  • The processing result obtained by each first neural network processing the data is used as interference when executing the target task, and the first selection probability is positively correlated with the degree of interference that the processing result output by the corresponding first neural network imposes on the target task.
  • The first selection probability can be a trainable parameter.
  • During model training, a reward value can be obtained. The reward value can represent how well the data output by the reinforcement learning model performs the target task, and can also represent the degree of interference of the interference information output by the adversarial agent on the target task. The probability distribution corresponding to the first neural networks can be updated based on the reward value, so that the first selection probability is positively correlated with the degree of interference that the processing result output by the corresponding first neural network imposes on the target task.
  • The above probability distribution can be a Nash equilibrium distribution.
  • Specifically, the probability distribution can be calculated through a Nash equilibrium based on the reward values obtained when the target task is executed using the data and interference information produced during the feedforward of the reinforcement learning model, and the probability distribution can be updated during the iteration process.
  • In this way, the embodiment of the present application controls the behavior space of the adversarial agent and changes the interference intensity of the adversarial agent, making the reinforcement learning strategy robust to both strong and weak interference and, in turn, more robust to interference from different strategies.
  • In one possible implementation, updating the first reinforcement learning model according to the third processing result includes: obtaining the reward value corresponding to the target task according to the third processing result, and updating the first reinforcement learning model according to the reward value.
  • The method also includes: updating, according to the reward value, the first selection probability corresponding to the first target neural network.
  • During each iteration, the updated reinforcement learning strategy and the updated adversarial agent strategy can be added to the Nash equilibrium matrix, and the Nash equilibrium can be calculated to obtain the Nash equilibrium distributions of the reinforcement learning agent and the adversarial agents.
  • Specifically, updating the first reinforcement learning model according to the first processing result and the second processing result includes: obtaining the reward value corresponding to the target task according to the first processing result and the second processing result, and updating the first reinforcement learning model according to the reward value; further, the first selection probability corresponding to the first target neural network can be updated according to the reward value. A sketch of computing such a Nash-equilibrium distribution from a payoff matrix is given below.
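  • As a hedged illustration of how such a Nash equilibrium distribution could be computed (using scipy; the function name and the example payoff matrix are assumptions of this sketch, not part of the application), the maximin mixed strategy for a zero-sum payoff matrix, whose entries could be the reward values collected when a reinforcement learning strategy plays against an adversarial strategy, can be obtained with a linear program:

        import numpy as np
        from scipy.optimize import linprog

        def nash_mixed_strategy(M):
            # Row player maximizes v subject to M^T p >= v, sum(p) = 1, p >= 0.
            n_rows, n_cols = M.shape
            c = np.zeros(n_rows + 1)
            c[-1] = -1.0                                   # minimize -v  <=>  maximize v
            A_ub = np.hstack([-M.T, np.ones((n_cols, 1))]) # v - sum_i p_i * M[i, j] <= 0 for each column j
            b_ub = np.zeros(n_cols)
            A_eq = np.append(np.ones(n_rows), 0.0).reshape(1, -1)  # probabilities sum to 1
            b_eq = np.array([1.0])
            bounds = [(0, None)] * n_rows + [(None, None)]
            res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
            return res.x[:n_rows], res.x[-1]               # equilibrium distribution and game value

        # Example payoff matrix: rows = reinforcement learning strategies, columns = adversarial strategies.
        probs, value = nash_mixed_strategy(np.array([[3.0, 1.0],
                                                     [0.0, 2.0]]))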
  • In the embodiments of this application, multiple adversarial agents can be trained, and for each of the multiple adversarial agents, the adversarial agent used to interfere with the reinforcement learning model can be selected from that agent's multiple iteration results.
  • the method further includes:
  • processing the first data through a second target neural network to obtain a fourth processing result, where the fourth processing result is used as interference information when executing the target task, the second target neural network is selected from a plurality of second neural networks, each second neural network is an iteration result obtained from the process of iteratively training a second initial neural network, and the first initial neural network and the second initial neural network are different neural networks.
  • Executing the target task according to the first processing result and the second processing result to obtain the third processing result then includes:
  • executing the target task according to the first processing result, the second processing result, and the fourth processing result to obtain the third processing result.
  • In one possible implementation, the interference types of the second processing result and the fourth processing result are different.
  • The interference type may be the category of interference applied when performing the target task, such as applying force, applying torque, adding obstacles, changing road conditions, changing weather, and so on.
  • In one possible implementation, the interference objects of the second processing result and the fourth processing result are different.
  • For example, the robot may include multiple joints, and applying force to different joints or different joint groups can be regarded as acting on different interference objects; that is, the second processing result and the fourth processing result are forces applied to different joints or different joint groups.
  • In one possible implementation, the first target neural network is used to determine the second processing result from a first numerical range according to the first data, the second target neural network is used to determine the fourth processing result from a second numerical range according to the first data, and the second numerical range is different from the first numerical range.
  • For example, the second processing result and the fourth processing result are both forces exerted on the robot joints, the maximum value of the force determined by the first target neural network is A1, the maximum value of the force determined by the second target neural network is A2, and A1 and A2 are different, as illustrated by the sketch below.
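  • The following is a minimal sketch (the class name JointForceAdversary and the placeholder policy network are assumptions of the sketch, not taken from the application) of two adversarial networks whose interference objects (joints) and numerical ranges (maximum forces A1 and A2) differ:

        import numpy as np

        class JointForceAdversary:
            def __init__(self, policy_net, target_joints, max_force):
                self.policy_net = policy_net        # e.g. an iteration result of an initial neural network
                self.target_joints = target_joints  # interference object: which joints receive the force
                self.max_force = max_force          # bound of the numerical range

            def act(self, state, num_joints=4):
                raw = np.asarray(self.policy_net(state))   # unconstrained network output
                force = np.zeros(num_joints)
                force[self.target_joints] = np.clip(raw[:len(self.target_joints)],
                                                    -self.max_force, self.max_force)
                return force

        dummy_net = lambda state: np.ones(4)               # placeholder for a trained policy network
        adversary_a = JointForceAdversary(dummy_net, target_joints=[0, 1], max_force=10.0)  # maximum force A1
        adversary_b = JointForceAdversary(dummy_net, target_joints=[2, 3], max_force=50.0)  # maximum force A2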
  • In the embodiments of this application, the reinforcement learning model participating in the current round of training can also be selected from the historical iteration results of the reinforcement learning model, for example, based on probability sampling; for details, refer to the process of sampling the adversarial agent in the above embodiment.
  • In one possible implementation, second data can be processed through a second reinforcement learning model to obtain a fifth processing result, where the second reinforcement learning model is derived from the updated first reinforcement learning model and each of a plurality of reinforcement learning models is an iteration result obtained from the process of iteratively training the initial reinforcement learning model; the second data indicates the state of the target object, and the fifth processing result is used as control information when performing the target task on the target object; the second data is processed through a third target neural network to obtain a sixth processing result, where the third target neural network belongs to the plurality of first neural networks and the sixth processing result is used as interference information when executing the target task; the target task is executed according to the fifth processing result and the sixth processing result to obtain a seventh processing result; and the third target neural network is updated according to the seventh processing result to obtain an updated third target neural network.
  • In one possible implementation, the second reinforcement learning model may be selected from the plurality of reinforcement learning models.
  • In one possible implementation, selecting the second reinforcement learning model from the plurality of reinforcement learning models includes: selecting the second reinforcement learning model from the plurality of reinforcement learning models based on a second selection probability corresponding to each reinforcement learning model in the plurality of reinforcement learning models.
  • The second selection probability is positively correlated with the execution effect achieved by the processing result output by the corresponding reinforcement learning model when executing the target task.
  • During model training, a reward value can be obtained. The reward value can represent how well the data output by the reinforcement learning model performs the target task, and the probability distribution corresponding to the reinforcement learning models can be updated based on the reward value, so that the second selection probability is positively correlated with the execution effect achieved by the processing result output by the corresponding reinforcement learning model when executing the target task.
  • In each iteration, a historical strategy of the reinforcement learning agent can be sampled from the historical strategy collection of the reinforcement learning agent according to the Nash equilibrium distribution and used for the strategy update of the adversarial agent.
  • The selected reinforcement learning strategy and the current adversarial agent strategy are then deployed to perform sampling and obtain the required training samples, and the obtained training samples are used to train the adversarial agent strategy, as in the sketch below.
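  • As an illustrative sketch of this adversarial-agent update round (names such as rl_policy_pool, env, and adversary.update are assumptions of the sketch), a historical reinforcement learning strategy can be sampled from its Nash equilibrium distribution and used to collect the training samples for the adversary:

        import numpy as np

        def update_adversary(adversary, rl_policy_pool, rl_nash_probs, env, num_steps=1000):
            # Sample a historical reinforcement learning strategy (second reinforcement learning model).
            rl_policy = np.random.choice(rl_policy_pool, p=rl_nash_probs)
            batch = []
            state = env.reset()
            for _ in range(num_steps):
                control = rl_policy.act(state)                         # fifth processing result
                interference = adversary.act(state)                    # sixth processing result
                reward, next_state = env.step(control, interference)   # seventh processing result
                # The adversary is rewarded for degrading task performance, i.e. it minimizes the task reward.
                batch.append((state, interference, -reward, next_state))
                state = next_state
            adversary.update(batch)                                    # update the third target neural network
            return adversary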
  • This application further provides a model training device, which includes:
  • a data processing module, configured to process first data through a first reinforcement learning model to obtain a first processing result, where the first data indicates the state of the target object and the first processing result is used as control information when performing a target task on the target object;
  • and configured to process the first data through a first target neural network to obtain a second processing result, where the second processing result is used as interference information when executing the target task, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained from the process of iteratively training a first initial neural network; and
  • a model update module configured to update the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model.
  • In one implementation, a single adversarial agent for outputting interference information is trained, and the interference information applies only one kind of interference to the target task.
  • In the embodiments of this application, multiple adversarial agents for outputting interference information can be trained, and the interference information output by different adversarial agents can apply different types of interference to the target task.
  • In addition, when training the adversarial agents, not only the adversarial agent obtained in the latest iteration is used to output interference for the target task; the historical training results (adversarial agents obtained during earlier iterations) can also be used to output interference for the target task, so that interference that is more effective for the target task and adapted to different scenarios can be obtained, thereby improving the training effect and generalization of the model.
  • the target object is a robot; the target task is the attitude control of the robot, and the first processing result is the attitude control information of the robot; or,
  • the target object is a vehicle; the target task is automatic driving of the vehicle; and the first processing result is the driving control information of the vehicle.
  • the device further includes:
  • a network selection module configured to select the first target neural network from the plurality of first neural networks.
  • the first target neural network is selected from a plurality of first neural networks based on a first selection probability corresponding to each of the plurality of first neural networks.
  • The processing result obtained by each first neural network processing the data is used as interference when executing the target task, and the first selection probability is positively correlated with the degree of interference that the processing result output by the corresponding first neural network imposes on the target task.
  • The model update module is specifically configured to:
  • obtain the reward value corresponding to the target task, and update the first reinforcement learning model according to the reward value.
  • The model update module is further configured to:
  • update, according to the reward value, the first selection probability corresponding to the first target neural network.
  • The data processing module is further configured to:
  • process the first data through a second target neural network to obtain a fourth processing result, where the fourth processing result is used as interference information when executing the target task, the second target neural network is selected from a plurality of second neural networks, each second neural network is an iteration result obtained from the process of iteratively training a second initial neural network, and the first initial neural network and the second initial neural network are different neural networks.
  • The data processing module is specifically configured to:
  • execute the target task according to the first processing result, the second processing result, and the fourth processing result to obtain the third processing result.
  • In one possible implementation, the interference types of the second processing result and the fourth processing result are different; or,
  • the interference objects of the second processing result and the fourth processing result are different; or,
  • the first target neural network is used to determine the second processing result from a first numerical range according to the first data, the second target neural network is used to determine the fourth processing result from a second numerical range according to the first data, and the second numerical range is different from the first numerical range.
  • The data processing module is further configured to: process second data through a second reinforcement learning model to obtain a fifth processing result, where
  • each of the reinforcement learning models is an iteration result obtained from the process of iteratively training the initial reinforcement learning model;
  • the second data indicates the state of the target object, and the fifth processing result is used as control information when performing the target task on the target object;
  • process the second data through a third target neural network to obtain a sixth processing result, where
  • the third target neural network belongs to the plurality of first neural networks; and
  • the sixth processing result is used as interference information when executing the target task.
  • The model update module is further configured to:
  • update the third target neural network according to the seventh processing result to obtain an updated third target neural network.
  • The network selection module is further configured to:
  • select the second reinforcement learning model from the plurality of reinforcement learning models.
  • The network selection module is specifically configured to: select the second reinforcement learning model from the plurality of reinforcement learning models based on the second selection probability corresponding to each reinforcement learning model in the plurality of reinforcement learning models.
  • In a third aspect, this application provides a data processing method, including:
  • processing first data through a first reinforcement learning model to obtain a first processing result, where the first processing result is used as the control information of the target object;
  • the first reinforcement learning model is updated using a reward value during an iteration of training, the reward value is obtained according to the interference information applied when the target task is executed based on the control information output by the feedforward process of the first reinforcement learning model, the interference information is obtained through the feedforward process of a target neural network, the target neural network is selected from a plurality of neural networks, and each of the neural networks is an iteration result obtained by iteratively training an initial neural network; and
  • performing the target task on the target object.
  • the target object is a robot; the target task is the attitude control of the robot, and the first processing result is the attitude control information of the robot; or,
  • the target object is a vehicle; the target task is automatic driving of the vehicle; and the first processing result is the driving control information of the vehicle.
  • In one possible implementation, the target neural network is selected from the plurality of neural networks based on a first selection probability corresponding to each neural network in the plurality of neural networks.
  • The processing result obtained by each neural network processing the data is used as interference when executing the target task, and the first selection probability is positively correlated with the degree of interference that the processing result output by the corresponding neural network imposes on the target task.
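  • Purely as a deployment-time sketch (observe, act, and apply_control are hypothetical method names), the trained reinforcement learning model is used on its own at inference time, while the adversarial networks are only needed during training:

        def control_loop(trained_rl_model, target_object):
            while True:
                state = target_object.observe()              # first data: state of the robot or vehicle
                control_info = trained_rl_model.act(state)   # first processing result
                target_object.apply_control(control_info)    # attitude control / driving control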
  • In a further aspect, embodiments of the present application provide a model training device, which may include a memory, a processor, and a bus system, where the memory is used to store a program, and the processor is used to execute the program in the memory to perform the method of the first aspect described above and any of its optional implementations.
  • Embodiments of the present application further provide a data processing device, which may include a memory, a processor, and a bus system, where the memory is used to store a program, and the processor is used to execute the program in the memory to perform the method of the third aspect described above and any of its optional implementations.
  • Embodiments of the present application further provide a computer-readable storage medium.
  • A computer program is stored in the computer-readable storage medium, and when it is run on a computer, it causes the computer to execute the method of the first aspect described above and any of its optional implementations, or the method of the third aspect described above and any of its optional implementations.
  • Embodiments of the present application further provide a computer program product including instructions that, when run on a computer, cause the computer to execute the method of the first aspect described above and any of its optional implementations, or the method of the third aspect described above and any of its optional implementations.
  • This application further provides a chip system that includes a processor for supporting the model training device in implementing some or all of the functions involved in the above aspects, for example, sending or processing the data or information involved in the above methods.
  • The chip system may also include a memory, which is used to store the program instructions and data necessary for the execution device or the training device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • Embodiments of the present application provide a model training method.
  • The method includes: processing first data through a first reinforcement learning model to obtain a first processing result, where the first data indicates the state of the target object and the first processing result is used as control information when performing a target task on the target object; and processing the first data through the first target neural network to obtain a second processing result, where the second processing result is used as interference information when executing the target task.
  • The first target neural network is selected from a plurality of first neural networks, and each of the first neural networks is an iteration result obtained by iteratively training the first initial neural network.
  • Figure 1 is a schematic diagram of an application architecture
  • Figure 2 is a schematic diagram of an application architecture
  • Figure 3 is a schematic diagram of an application architecture
  • Figure 4 is a schematic diagram of an embodiment of a model training method provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of a software architecture provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of an embodiment of a model training method provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of an embodiment of a model training device provided by an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • Figure 10 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • The terms “substantially”, “about”, and similar terms are used as terms of approximation rather than as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art.
  • In addition, the use of “may” when describing embodiments of the present invention refers to “one or more possible embodiments”.
  • The terms “use”, “using”, and “used” may be deemed synonymous with the terms “utilize”, “utilizing”, and “utilized”, respectively.
  • The term “exemplary” is intended to refer to an example or illustration.
  • Figure 1 shows a structural schematic diagram of the artificial intelligence main framework.
  • The above artificial intelligence main framework is elaborated below along two dimensions: the “intelligent information chain” (horizontal axis) and the “IT value chain” (vertical axis).
  • The “intelligent information chain” reflects a series of processes from data acquisition to processing, for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process of “data-information-knowledge-wisdom”.
  • The “IT value chain” reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (provision and processing technology implementation) of artificial intelligence to the industrial ecological process of the system.
  • Infrastructure provides computing power support for artificial intelligence systems, enables communication with the external world, and supports it through basic platforms.
  • computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA, etc.);
  • The basic platform includes related platform guarantees and support such as a distributed computing framework and a network, and can include cloud storage and computing, interconnection networks, and so on.
  • sensors communicate with the outside world to obtain data, which are provided to smart chips in the distributed computing system provided by the basic platform for calculation.
  • Data from the upper layer of the infrastructure is used to represent data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formal information to perform machine thinking and problem solving based on reasoning control strategies. Typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • Some general capabilities can be formed based on the results of further data processing, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. The application fields mainly include intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, and so on.
  • For example, a parts-grabbing operation originally performed by humans can be completed by an intelligent robotic arm.
  • In this case, the intelligent robotic arm needs to be equipped with grasping skills and the neural networks for those grasping skills, where different grasping skills can correspond to different grabbing angles, displacements of the robotic arm, and so on. As another example, in the field of automatic cooking, a cooking operation originally performed by humans can be completed by an intelligent robotic arm.
  • In this case, the intelligent robotic arm needs to be equipped with cooking skills, such as raw-material grabbing skills and stir-frying skills, and the neural networks for those cooking skills. Other application scenarios are not exhaustively listed here.
  • FIG. 2 is a schematic diagram of a computing system that performs model training in an embodiment of the present application.
  • the computing system includes a terminal device 102 (exemplarily, the terminal device 102 may not be included) and a server 130 (which may also be called a central node) communicatively coupled through a network.
  • the terminal device 102 may be any type of computing device, such as, for example, a personal computing device (eg, a laptop or desktop computer), a mobile computing device (eg, a smartphone or tablet), a game console or controller , wearable computing devices, embedded computing devices, or any other type of computing device.
  • the terminal device 102 may include a processor 112 and a memory 114.
  • Processor 112 may be any suitable processing device (e.g., processor core, microprocessor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), controller , microcontroller, etc.).
  • The memory 114 may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM), or portable read-only memory (Compact Disc Read-Only Memory, CD-ROM).
  • the memory 114 may store data 116 and instructions 118 executed by the processor 112 to cause the terminal device 102 to perform operations.
  • memory 114 may store one or more models 120 .
  • model 120 may be or may additionally include various machine learning models, such as neural networks (eg, deep neural networks) or other types of machine learning models, including nonlinear models and/or linear models.
  • Neural networks may include feedforward neural networks, recurrent neural networks (eg, long short-term memory recurrent neural networks), convolutional neural networks, or other forms of neural networks.
  • one or more models 120 may be received from server 130 over network 180, stored in memory 114, and then used by one or more processors 112 or otherwise implemented.
  • Terminal device 102 may also include one or more user input components 122 that receive user input.
  • user input component 122 may be a touch-sensitive component (eg, a touch-sensitive display screen or touch pad) that is sensitive to the touch of a user input object (eg, a finger or stylus).
  • Touch-sensitive components can be used to implement virtual keyboards.
  • Other example user input components include a microphone, a traditional keyboard, or other device through which a user can provide user input.
  • the terminal device 102 may also include a communication interface 123.
  • the terminal device 102 may be communicatively connected to the server 130 through the communication interface 123.
  • the server 130 may include a communication interface 133.
  • the terminal device 102 may be communicatively connected to the communication interface 133 of the server 130 through the communication interface 123. In this way, data interaction between the terminal device 102 and the server 130 is achieved.
  • Server 130 may include processor 132 and memory 134.
  • the processor 132 may be any suitable processing device (eg, processor core, microprocessor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), controller, microcontroller, etc.).
  • The memory 134 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or portable read-only memory (Compact Disc Read-Only Memory, CD-ROM).
  • Memory 134 may store data 136 and instructions 138 for execution by processor 132 to cause server 130 to perform operations.
  • memory 134 may store one or more machine learning models 140.
  • model 140 may be or may additionally include various machine learning models.
  • Example machine learning models include neural networks or other multi-layered nonlinear models.
  • Example neural networks include feedforward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks.
  • The model training method in the embodiments of the present application involves AI-related operations.
  • The instruction execution architecture of the terminal device and the server is not limited to the processor-memory architecture shown in Figure 2.
  • the system architecture provided by the embodiment of the present application will be introduced in detail below with reference to Figure 3 .
  • FIG 3 is a schematic diagram of the system architecture provided by an embodiment of the present application.
  • the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550 and a data collection system 560.
  • the execution device 510 includes a computing module 511, an I/O interface 512, a preprocessing module 513 and a preprocessing module 514.
  • the target model/rule 501 may be included in the calculation module 511, and the preprocessing module 513 and the preprocessing module 514 are optional.
  • Data collection device 560 is used to collect training samples.
  • The training samples may be the first data, the second data, and the like, where the first data and the second data may be status information related to the target object (such as a robot or a vehicle).
  • the data collection device 560 stores the training samples into the database 530 .
  • The training device 520 trains the neural network to be trained (such as the reinforcement learning model and the target neural network in the embodiments of the present application, where the target neural network serves as the adversarial agent of the reinforcement learning model) based on the training samples maintained in the database 530, to obtain the target model/rule 501.
  • the training samples maintained in the database 530 are not necessarily collected from the data collection device 560, and may also be received from other devices.
  • It should be noted that the training device 520 does not necessarily train the target model/rule 501 entirely based on the training samples maintained in the database 530; it may also obtain training samples from the cloud or elsewhere for model training. The above description should not be construed as a limitation on the embodiments of this application.
  • the target model/rules 501 trained according to the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in Figure 3.
  • The execution device 510 can be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR) device, or a virtual reality (VR) device.
  • the target model/rule 501 can be used to achieve target tasks, such as driving control in autonomous driving, attitude control on robots, etc.
  • the training device 520 can transfer the trained model to the execution device 510 .
  • the execution device 510 may be the above-mentioned target object.
  • the execution device 510 is configured with an input/output (I/O) interface 512 for data interaction with external devices.
  • The user can input data to the I/O interface 512 through the client device 540, or the execution device 510 can automatically collect input data.
  • the preprocessing module 513 and the preprocessing module 514 are used to perform preprocessing according to the input data received by the I/O interface 512. It should be understood that there may be no preprocessing module 513 and 514 or only one preprocessing module. When the preprocessing module 513 and the preprocessing module 514 do not exist, the computing module 511 can be directly used to process the input data.
  • When the execution device 510 preprocesses the input data, or when the computing module 511 of the execution device 510 performs calculation and other related processing, the execution device 510 can call data, code, and the like in the data storage system 550 for the corresponding processing, and the data, instructions, and the like obtained by the corresponding processing can also be stored in the data storage system 550.
  • the I/O interface 512 provides the processing results to the client device 540, thereby providing them to the user, or performing control operations based on the processing results.
  • the user can manually set the input data, and the "manually set input data" can be operated through the interface provided by the I/O interface 512 .
  • the client device 540 can automatically send input data to the I/O interface 512. If requiring the client device 540 to automatically send the input data requires the user's authorization, the user can set corresponding permissions in the client device 540.
  • the user can view the results output by the execution device 510 on the client device 540, and the specific presentation form may be display, sound, action, etc.
  • It is worth noting that the client device 540 can also serve as a data collection terminal, collecting the input data of the I/O interface 512 and the output results of the I/O interface 512 as new sample data, as shown in the figure, and storing them in the database 530.
  • Alternatively, the collection may not be performed by the client device 540; instead, the I/O interface 512 can directly use the input data of the I/O interface 512 and the output results of the I/O interface 512 as new sample data, as shown in the figure, and store them in the database 530.
  • Figure 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application.
  • The positional relationships between the devices, components, modules, and the like shown in the figure do not constitute any limitation.
  • For example, in Figure 3, the data storage system 550 is an external memory relative to the execution device 510; in other cases, the data storage system 550 can also be placed in the execution device 510. It should be understood that the execution device 510 may be deployed in the client device 540.
  • the above-mentioned training device 520 can obtain the code stored in the memory (not shown in Figure 3, which can be integrated with the training device 520 or deployed separately from the training device 520) to implement the model training in the embodiment of the present application. Related steps.
  • Specifically, the training device 520 may include a hardware circuit (such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller), or a combination of these hardware circuits.
  • For example, the training device 520 can be a hardware system with the function of executing instructions, such as a CPU or a DSP, or a hardware system without the function of executing instructions, such as an ASIC or an FPGA, or a combination of a hardware system without the function of executing instructions and a hardware system with the function of executing instructions.
  • Specifically, the training device 520 can be a combination of a hardware system that does not have the function of executing instructions and a hardware system that has the function of executing instructions; some of the steps related to model training provided by the embodiments of the present application can also be implemented by the hardware system in the training device 520 that does not have the function of executing instructions, which is not limited here.
  • the neural network can be composed of neural units.
  • The neural unit can refer to an operation unit that takes x_s (i.e., input data) and an intercept of 1 as inputs.
  • The output of the operation unit can be: h_{W,b}(x) = f(W^T x) = f(Σ_{s=1}^{n} W_s x_s + b), where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal.
  • The output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
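  • A minimal sketch of the neural unit described above (the numerical values are illustrative), using a sigmoid activation:

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def neural_unit(x, W, b):
            # output = f(sum_s W_s * x_s + b)
            return sigmoid(np.dot(W, x) + b)

        y = neural_unit(x=np.array([0.5, -1.0, 2.0]),
                        W=np.array([0.1, 0.4, -0.3]),
                        b=0.2)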
  • a neural network is a network formed by connecting multiple above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • A deep neural network (DNN) is also known as a multi-layer neural network.
  • The layers inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer.
  • Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers.
  • The layers are fully connected; that is to say, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer.
  • The coefficient from the k-th neuron in layer L-1 to the j-th neuron in layer L is defined as W_{jk}^{L}. It should be noted that the input layer has no W parameter.
  • more hidden layers make the network more capable of describing complex situations in the real world. Theoretically, a model with more parameters has higher complexity and greater "capacity", which means it can complete more complex learning tasks.
  • Training a deep neural network is the process of learning the weight matrix. The ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (a weight matrix formed by the vectors W of many layers).
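  • A minimal sketch of such a fully connected deep neural network (the layer sizes and the ReLU activation are illustrative choices, not taken from the application):

        import numpy as np

        def relu(z):
            return np.maximum(z, 0.0)

        rng = np.random.default_rng(0)
        layer_sizes = [4, 8, 8, 2]                      # input layer, two hidden layers, output layer
        weights = [rng.standard_normal((n_out, n_in))   # W[L][j][k]: neuron k in layer L-1 -> neuron j in layer L
                   for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
        biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

        def forward(x):
            for W, b in zip(weights, biases):
                x = relu(W @ x + b)                     # every layer is fully connected to the previous one
            return x

        output = forward(np.array([0.1, 0.2, 0.3, 0.4]))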
  • Reinforcement learning (RL), also known as evaluation learning, is one of the paradigms and methodologies of machine learning. It is used to describe and solve the problem of how an agent learns, during its interaction with the environment, strategies that maximize returns or achieve specific goals.
  • a common model for reinforcement learning is the standard Markov decision process (MDP).
  • reinforcement learning can be divided into model-based reinforcement learning (model-based RL) and model-free reinforcement learning (model-free RL), as well as active reinforcement learning (active RL) and passive reinforcement learning (passive RL).
  • Variants of reinforcement learning include inverse reinforcement learning, hierarchical reinforcement learning, and reinforcement learning for partially observable systems.
  • the algorithms used to solve reinforcement learning problems can be divided into two categories: policy search algorithms and value function algorithms. Deep learning models can be used in reinforcement learning to form deep reinforcement learning.
  • A convolutional neural network can use the error back propagation (BP) algorithm to modify the parameters in the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller.
  • Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges.
  • The back propagation algorithm is a back propagation movement dominated by the error loss, aiming to obtain the optimal parameters of the model, such as the weight matrices.
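  • A minimal sketch of the back propagation / gradient descent idea on a single linear model (the data and learning rate are illustrative, not related to the application):

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.standard_normal((100, 3))
        y = X @ np.array([1.0, -2.0, 0.5])      # targets produced by the "true" weights

        w = np.zeros(3)
        learning_rate = 0.1
        for _ in range(200):
            pred = X @ w                        # forward propagation
            error = pred - y
            loss = np.mean(error ** 2)          # error loss
            grad = 2.0 * X.T @ error / len(X)   # gradient of the loss propagated back to the parameters
            w -= learning_rate * grad           # parameter update, driving the error loss to converge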
  • A Nash equilibrium, also known as a non-cooperative game equilibrium, is an important term in game theory. In a game, if one party chooses a certain strategy regardless of the other party's strategy choice, that strategy is called a dominant strategy. If, when the strategies of all other players are fixed, every player's chosen strategy is optimal, then this combination of strategies is defined as a Nash equilibrium.
  • In a Nash equilibrium, each player's equilibrium strategy maximizes his or her expected return, given that all other players also follow their equilibrium strategies.
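  • As a small illustration of the definition above (the payoff matrices are a textbook prisoner's-dilemma example, not part of the application), a pure-strategy Nash equilibrium can be checked by verifying that no player gains from a unilateral deviation:

        import numpy as np

        def is_pure_nash(A, B, i, j):
            # (i, j) is a Nash equilibrium if neither the row player (payoffs A)
            # nor the column player (payoffs B) can improve by deviating alone.
            return A[i, j] >= A[:, j].max() and B[i, j] >= B[i, :].max()

        A = np.array([[-1, -3],
                      [ 0, -2]])   # row player's payoffs
        B = A.T                    # column player's payoffs (symmetric game)
        print(is_pure_nash(A, B, 1, 1))   # True: mutual defection is the Nash equilibrium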
  • An agent is a concept in the field of artificial intelligence: any entity that can think independently and interact with the environment can be abstracted as an agent. The basic characteristic of an intelligent agent is that it can react to changes in the environment and then automatically adjust its behavior and state; different intelligent agents can also interact with other intelligent agents according to their own intentions.
  • One method is to introduce imaginary interference in the virtual environment, train the reinforcement learning algorithm under the interference, improve its ability to deal with interference, and enhance the robustness and generalization of the algorithm.
  • The data output by the adversarial agent and the output data of the reinforcement learning model are used together to perform the task, with the data output by the adversarial agent serving as interference when executing the target task.
  • However, existing training methods mainly resist certain specific interferences.
  • When the changes in the real environment are inconsistent with those specific interferences, the effectiveness of the algorithm decreases.
  • a model training method provided by an embodiment of the present application includes:
  • The execution subject of step 401 may be a training device, and the training device may be a terminal device or a server.
  • The training device can obtain the model training object (the first reinforcement learning model) and the training sample (the first data).
  • In one possible implementation, the first data is status information related to a robot, the target task is attitude control of the robot, and the first processing result is attitude control information of the robot.
  • The robot-related status information may include, but is not limited to, the robot's position and speed and information related to the scene it is in (such as obstacle information).
  • The robot's position and speed may include information such as the status of each joint (position, angle, speed, acceleration, etc.).
  • The first reinforcement learning model can obtain the attitude control information of the robot based on the input data.
  • The attitude control information can include the control information of each joint of the robot, and the attitude control task of the robot can be performed based on the attitude control information.
  • In another possible implementation, the first data is vehicle-related status information, the target task is automatic driving of the vehicle, and the first processing result is driving control information of the vehicle.
  • The vehicle-related status information may include, but is not limited to, the vehicle's position and speed and information related to the scene in which it is located (such as road information, obstacle information, pedestrian information, and surrounding-vehicle information).
  • The first reinforcement learning model can obtain the driving control information of the vehicle based on the input data.
  • The driving control information can include information such as the vehicle's speed, direction, and driving trajectory.
  • the first reinforcement learning model may be an initialized model or the output of an iteration in the model training process.
  • the first data can be processed through the first reinforcement learning model to obtain the first processing result.
  • the first processing result is used as control information when performing a target task on the target object.
  • the target task is the attitude control of the robot, and the first processing result is the attitude control information of the robot; or,
  • the target task is automatic driving of the vehicle, and the first processing result is the driving control information of the vehicle.
  • the first processing result can be used as a hard constraint imposed on the target object when performing the target task.
  • reinforcement learning models in the embodiments of this application include but are not limited to deep neural networks, Bayesian neural networks, etc.
  • the first target neural network is selected from a plurality of first neural networks, and each first neural network is obtained by iteratively training the first initial neural network. The result of an iteration.
  • the training device can obtain an adversarial agent for the reinforcement learning model, and the adversarial agent can output interference information for the target task.
  • an adversarial agent for outputting interference information can be trained.
  • the interference information only performs one kind of interference for the target task.
  • multiple adversarial agents for outputting interference information can be trained.
  • the interference information output by different adversarial agents can interfere with different types of target tasks.
  • when training adversarial agents, not only the adversarial agent obtained in the latest iteration is used to output interference on the target task; the historical training results of historical adversarial agents (adversarial agents obtained during the historical iteration process) can also be used to output interference on the target task, so that more effective interference with the target task can be obtained in different scenarios, thereby improving the training effect and generalization of the model.
  • the first target neural network in the embodiment of this application includes but is not limited to deep neural network, Bayesian neural network, etc.
  • when determining the adversarial agent used to output interference information for the first reinforcement learning model, the first target neural network may be selected from the plurality of first neural networks, wherein each first neural network is an iteration result obtained by iteratively training the first initial neural network.
  • for example, suppose that iterative training of the first initial neural network yields neural network 1, neural network 2, neural network 3, neural network 4, neural network 5, neural network 6, neural network 7, neural network 8, and neural network 9; when determining the adversarial agent used to output interference information for the first reinforcement learning model, one neural network can be selected from the set [neural network 1, neural network 2, neural network 3, neural network 4, neural network 5, neural network 6, neural network 7, neural network 8, neural network 9].
  • selecting the first target neural network from a plurality of first neural networks includes: selecting the first target neural network from the plurality of first neural networks based on a first selection probability corresponding to each first neural network. That is to say, each first neural network can be configured with a corresponding probability (i.e., the first selection probability described above); the first target neural network can be selected by sampling from the probability distribution corresponding to the multiple first neural networks and choosing the network according to the sampling result.
  • the processing result obtained by each first neural network processing the data is used as interference when executing the target task, and the first selection probability is positively correlated with the degree to which the processing result output by the corresponding first neural network interferes with the target task.
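  • The selection step can be pictured as sampling from a categorical distribution over the saved adversary checkpoints. The sketch below is a minimal illustration under assumed names and probability values; it is not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pool of adversary checkpoints (the "first neural networks") saved
# at successive training iterations, and their first selection probabilities.
adversary_pool = ["adv_iter_1", "adv_iter_2", "adv_iter_3", "adv_iter_4"]
first_selection_prob = np.array([0.1, 0.2, 0.3, 0.4])  # e.g. a Nash equilibrium distribution
first_selection_prob = first_selection_prob / first_selection_prob.sum()  # keep it normalized

# Sample the first target neural network for this training round.
first_target = rng.choice(adversary_pool, p=first_selection_prob)
print("adversary selected for this round:", first_target)
```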
  • a reward value can be obtained.
  • the reward value can represent how well the data output by the reinforcement learning model performs the target task, and can also represent the degree of interference caused by the adversarial agent; the probability distribution corresponding to the first neural networks can be updated based on the reward value, so that the first selection probability is positively correlated with the degree to which the processing result output by the corresponding first neural network interferes with the target task.
  • the above probability distribution can be a Nash equilibrium distribution.
  • the probability distribution can be calculated through Nash equilibrium based on the reward values obtained when the target task is performed using the data and interference information produced during the feedforward of the reinforcement learning model; during the iteration process, the probability distribution can be updated.
  • the embodiment of the present application controls the behavior space of the adversarial agent and changes the interference intensity of the adversarial agent, making the reinforcement learning strategy robust to both strong and weak interference.
  • the reinforcement learning strategy is more robust to interference from different strategies.
  • the first data can be processed through the first target neural network to obtain a second processing result, and the second processing result is used as interference information when executing the target task.
  • the second processing result may be a force or moment applied to at least one joint on the robot.
  • the second processing result may be information imposed on the vehicle's driving environment, such as changed road conditions, obstacles, or other obstacle information that can affect the driving strategy.
  • multiple adversarial agents can be trained, and for each of the multiple adversarial agents, an adversarial agent used to interfere with the reinforcement learning model can be selected from among its iteration results.
  • for example, iterative training of one initial neural network may yield neural network A1, neural network A2, neural network A3, neural network A4, neural network A5, neural network A6, neural network A7, neural network A8, and neural network A9; when determining the adversarial agent used to output interference information for the first reinforcement learning model, one neural network can be selected from the set [neural network A1, neural network A2, neural network A3, neural network A4, neural network A5, neural network A6, neural network A7, neural network A8, neural network A9], which is the first target neural network in the above embodiment.
  • similarly, iterative training of another initial neural network may yield neural network B1, neural network B2, neural network B3, neural network B4, neural network B5, neural network B6, neural network B7, neural network B8, and neural network B9; when determining the adversarial agent used to output interference information for the first reinforcement learning model, one neural network can be selected from the set [neural network B1, neural network B2, neural network B3, neural network B4, neural network B5, neural network B6, neural network B7, neural network B8, neural network B9], that is, the second target neural network.
  • the data output by the first target neural network and the second target neural network can be used as interference information applied to the first reinforcement learning model.
  • the first data can be processed through the second target neural network to obtain a fourth processing result; wherein the fourth processing result is used as interference when executing the target task, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iteration result obtained during the iterative training of a second initial neural network; the first initial neural network and the second initial neural network are different.
  • the interference types of the second processing result and the fourth processing result are different.
  • the interference type may be a category of interference applied when performing the target task, such as applying force, applying torque, adding obstacles, changing road conditions, changing weather, etc.
  • the interference objects of the second processing result and the fourth processing result are different.
  • the robot may include multiple joints, and applying force to different joints or different joint groups may be considered to be different interference objects. That is, the second processing result and the fourth processing result are forces applied to different joints or different joint groups.
  • the first target neural network is used to determine the second processing result from a first numerical range according to the first data
  • the second target neural network is used to determine the fourth processing result from a second numerical range according to the first data
  • the second numerical range is different from the first numerical range.
  • the second processing result and the fourth processing result are both forces exerted on the robot joints.
  • the maximum value of the force determined by the first target neural network is A1
  • the maximum value of the force determined by the second target neural network is A2, and A1 and A2 are different.
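  • For instance, two adversaries with different interference strengths can be obtained simply by bounding their outputs to different force ranges. The sketch below assumes illustrative bounds A1 = 5 and A2 = 20 and is not the patent's implementation.

```python
import numpy as np

def bounded_disturbance(raw_output: np.ndarray, max_force: float) -> np.ndarray:
    """Squash a network's raw output into the allowed force range [-max_force, max_force]."""
    return np.clip(raw_output, -1.0, 1.0) * max_force

raw = np.array([0.7, -1.3, 0.2])  # hypothetical raw adversary outputs for three joints
second_processing_result = bounded_disturbance(raw, max_force=5.0)   # first numerical range, A1 = 5
fourth_processing_result = bounded_disturbance(raw, max_force=20.0)  # second numerical range, A2 = 20
```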
  • the first processing result can be used as a hard constraint when executing the target task, that is, the first processing result can be used as the control information that the target object needs to satisfy when executing the target task, and the second processing result can be applied to the target object as interference when executing the target task.
  • the third processing result can be the state of the target object when (or after) it performs the target task, and the third processing result can be used to determine the reward value.
  • the first processing result and the second processing result may be only part of the data used to determine the third processing result; other interference information in addition to the second processing result (such as the fourth processing result introduced in the above embodiment) may also be included, and the target task can be executed based on the first processing result, the second processing result, and the other processing results to obtain the third processing result.
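  • The step of executing the target task under both the control information and the interference can be pictured as a single simulator step. The following toy dynamics are purely illustrative assumptions; any real robot or driving simulator would replace them.

```python
import numpy as np

def execute_target_task(state, control, disturbances, dt=0.01):
    """Toy dynamics: the next state (third processing result) depends on the control
    information and on the sum of all interference terms applied to the target object."""
    total_disturbance = np.sum(disturbances, axis=0)
    return state + dt * (control + total_disturbance)  # placeholder transition model

state = np.zeros(3)
control = np.array([1.0, 0.5, -0.2])            # first processing result
interference = [np.array([0.1, -0.3, 0.0]),     # second processing result
                np.array([-0.2, 0.0, 0.1])]     # fourth processing result (optional)
third_processing_result = execute_target_task(state, control, interference)
```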
  • when updating the model, the first reinforcement learning model can be updated according to the third processing result to obtain an updated first reinforcement learning model; the update goal can be to maximize the cumulative reward obtained.
  • the update method can adopt a reinforcement learning algorithm for continuous action spaces; optionally, the trust region policy optimization (TRPO) algorithm can be used.
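  • TRPO itself solves a constrained optimization problem; the sketch below only shows a much simpler REINFORCE-style update for a continuous-action Gaussian policy, to illustrate the general shape of a continuous-action-space policy update. All dimensions and the training batch are placeholder assumptions.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 24, 8
policy_mean = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
log_std = nn.Parameter(torch.zeros(action_dim))
optimizer = torch.optim.Adam(list(policy_mean.parameters()) + [log_std], lr=3e-4)

# One illustrative batch of (state, action, return) collected under interference.
states = torch.randn(32, state_dim)
actions = torch.randn(32, action_dim)
returns = torch.randn(32)  # discounted cumulative rewards (placeholder values)

dist = torch.distributions.Normal(policy_mean(states), log_std.exp())
log_prob = dist.log_prob(actions).sum(dim=-1)
loss = -(log_prob * returns).mean()  # gradient ascent on expected cumulative reward

optimizer.zero_grad()
loss.backward()
optimizer.step()
```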
  • the first target neural network and the second target neural network can perform different adversarial tasks.
  • the adversarial task required for this round of training can be selected (for example, in order) from multiple adversarial tasks.
  • a historical strategy of the adversarial agent can be sampled from the adversarial agent's historical strategy set according to the Nash equilibrium distribution and used to train the reinforcement learning strategy adversarially; the selected adversarial agent strategy and the current reinforcement learning strategy are then deployed to perform sampling and obtain the required training samples.
  • the updated reinforcement learning strategy and the updated adversarial agent strategy can be added to the Nash equilibrium matrix, and the Nash equilibrium can be calculated to obtain the Nash equilibrium distributions of the reinforcement learning agent and the adversarial agents, that is, the above-mentioned first selection probability and the second selection probability introduced later.
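  • One common way to obtain such a distribution for a two-player zero-sum payoff matrix is fictitious play. The sketch below is an illustrative approximation with a placeholder payoff matrix; it is not asserted to be the solver used in the patent.

```python
import numpy as np

def fictitious_play(payoff, iters=2000):
    """Approximate the Nash equilibrium of a two-player zero-sum game whose payoff
    matrix holds the reward of the row player (the reinforcement learning agent)."""
    n_rows, n_cols = payoff.shape
    row_counts, col_counts = np.zeros(n_rows), np.zeros(n_cols)
    row_counts[0] = col_counts[0] = 1.0
    for _ in range(iters):
        # Each player best-responds to the opponent's empirical mixed strategy.
        row_br = np.argmax(payoff @ (col_counts / col_counts.sum()))
        col_br = np.argmin((row_counts / row_counts.sum()) @ payoff)
        row_counts[row_br] += 1.0
        col_counts[col_br] += 1.0
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

# Rows: historical reinforcement learning strategies; columns: historical adversary
# strategies. Entries are placeholder average rewards measured for each pairing.
payoff_matrix = np.array([[1.0, -0.5, 0.2],
                          [0.3,  0.4, -0.1]])
second_selection_prob, first_selection_prob = fictitious_play(payoff_matrix)
```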
  • updating the first reinforcement learning model according to the first processing result and the second processing result includes: obtaining the reward value corresponding to the target task according to the first processing result and the second processing result, and updating the first reinforcement learning model according to the reward value; further, the first selection probability corresponding to the first target neural network can be updated according to the reward value.
  • the reinforcement learning model participating in the current round of training can also be selected from the historical iteration results of the reinforcement learning model, for example based on probability sampling; for details, refer to the process of sampling the adversarial agent in the above embodiment.
  • the second data can be processed through a second reinforcement learning model to obtain a fifth processing result; wherein the second reinforcement learning model is derived from multiple reinforcement learning models that include the updated first reinforcement learning model, and each of the reinforcement learning models is an iteration result obtained during the iterative training of the initial reinforcement learning model; the second data indicates the state of the target object, and the fifth processing result is used as control information when performing the target task on the target object; the second data is processed through a third target neural network to obtain a sixth processing result; the third target neural network belongs to the plurality of first neural networks; the sixth processing result is used as interference information when executing the target task; the target task is executed according to the fifth processing result and the sixth processing result to obtain a seventh processing result; the third target neural network is updated according to the seventh processing result to obtain an updated third target neural network.
  • the second reinforcement learning model may be selected from the plurality of reinforcement learning models.
  • selecting the second reinforcement learning model from the plurality of reinforcement learning models includes: selecting the second reinforcement learning model from the plurality of reinforcement learning models based on the second selection probability corresponding to each reinforcement learning model.
  • the second selection probability is positively related to the positive execution effect of the processing result output by the corresponding reinforcement learning model when executing the target task.
  • a reward value can be obtained.
  • the reward value can represent how well the data output by the reinforcement learning model performs the target task, and the probability distribution corresponding to the reinforcement learning models can be updated based on the reward value, so that the second selection probability is positively correlated with the execution effect of the processing result output by the corresponding reinforcement learning model when executing the target task.
  • the historical strategy of the reinforcement learning agent can be sampled from the reinforcement learning agent's historical strategy set according to the Nash equilibrium distribution and used for the strategy update of the adversarial agent.
  • the selected reinforcement learning strategy and the current adversarial agent strategy are deployed to perform sampling and obtain the required training samples; the obtained training samples are then used to train the adversarial agent strategy.
  • Embodiments of the present application provide a model training method.
  • the method includes: processing first data through a first reinforcement learning model to obtain a first processing result; wherein the first data indicates the state of the target object, and the first processing result is used as control information when performing a target task on the target object; processing the first data through the first target neural network to obtain a second processing result; wherein the second processing result is used as interference information when executing the target task, the first target neural network is selected from a plurality of first neural networks, and each of the first neural networks is an iteration result obtained by iteratively training the first initial neural network.
  • Figure 5 shows a robot control system.
  • the robot control system may include: state awareness and processing module, robust decision-making module, and robot control module.
  • The function of this module (the state awareness and processing module) is to sense the robot's information (such as the information used to describe the state of the target object introduced in the above embodiment, e.g., the first data and the second data). Specifically, it combines the information transmitted by each sensor to determine the robot's own status, including the robot's basic information (position, speed), the status of each joint (position, angle, speed, acceleration), and other information, and transfers this information to the decision-making module.
  • The function of this module (the robust decision-making module) is to output upper-level behavioral decisions for a period of time in the future based on the current robot state and the task being performed (such as the control information used when performing the target task on the target object introduced in the above embodiment). Specifically, based on the current state of the robot output by the state awareness and processing module, this module can output behavioral decisions for a period of time in the future through the method corresponding to Figure 4 and pass them to the robot control module.
  • This module (the robot control module) controls the movement of the robot by controlling its joints and executing the behaviors output by the robust decision-making module.
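  • To make the division of responsibilities concrete, the sketch below wires the three modules together in Python; the class names, method names, and the trivial bodies are illustrative assumptions, not the patent's implementation.

```python
class StatePerceptionModule:
    """Fuses sensor readings into a robot state (position, joint angles, velocities...)."""
    def perceive(self, sensor_readings):
        return sensor_readings  # placeholder sensor fusion

class RobustDecisionModule:
    """Wraps the trained reinforcement learning policy and outputs behavior decisions."""
    def __init__(self, policy):
        self.policy = policy
    def decide(self, state):
        return self.policy(state)

class RobotControlModule:
    """Turns behavior decisions into joint commands."""
    def actuate(self, decision):
        print("sending joint commands:", decision)

def control_step(perception, decision, control, sensor_readings):
    """One perception -> decision -> actuation cycle."""
    state = perception.perceive(sensor_readings)
    action = decision.decide(state)
    control.actuate(action)
```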
  • FIG. 6 is a flowchart of applying the model training method in the embodiment of the present application to a robot control simulation scenario.
  • the robot adopts the model training method in the embodiment of this application, using the multi-task framework and game-theoretic optimization theory, and finally outputs behavioral decisions that can maximize the robot's forward speed and obtain more rewards.
  • the implementation method is introduced in detail below.
  • the reinforcement learning strategy and the adversarial agent strategy for the i-th adversarial task at iteration t control the robot to sample in the training environment and obtain M samples (state s, a_pro, a_adv, next state s′, reward r), where a_pro and a_adv are the action output of the reinforcement learning strategy and the action of the adversarial agent, respectively.
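  • The sampling step can be sketched as follows; the toy environment, the linear policies, and the reward are placeholder assumptions used only to show how the (s, a_pro, a_adv, s′, r) tuples are collected.

```python
import numpy as np

class ToyPerturbedEnv:
    """Minimal stand-in environment whose step takes both the control action and the disturbance."""
    def __init__(self, dim=3):
        self.dim = dim
    def reset(self):
        self.state = np.zeros(self.dim)
        return self.state
    def step(self, a_pro, a_adv):
        self.state = self.state + 0.01 * (a_pro + a_adv)
        reward = -float(np.linalg.norm(self.state))  # placeholder reward signal
        return self.state, reward, False             # next state, reward, done

def collect_samples(env, rl_policy, adv_policy, M):
    """Roll out the RL strategy and the sampled adversary strategy together,
    recording (s, a_pro, a_adv, s', r) tuples for training."""
    samples, s = [], env.reset()
    for _ in range(M):
        a_pro = rl_policy(s)   # behavior output of the reinforcement learning strategy
        a_adv = adv_policy(s)  # interference output of the adversarial agent
        s_next, r, done = env.step(a_pro, a_adv)
        samples.append((s.copy(), a_pro, a_adv, s_next.copy(), r))
        s = env.reset() if done else s_next
    return samples

env = ToyPerturbedEnv()
batch = collect_samples(env, rl_policy=lambda s: -0.5 * s,
                        adv_policy=lambda s: 0.1 * np.ones_like(s), M=16)
```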
  • the update method can adopt the reinforcement learning algorithm of the continuous action space, and optionally, the trust region policy optimization algorithm (TRPO) can be adopted.
  • the update method can adopt a reinforcement learning algorithm for continuous action spaces; the trust region policy optimization algorithm (TRPO) can be used.
  • the above-described embodiment uses a robust reinforcement learning control framework based on multi-task learning and game theory to construct multiple adversarial tasks by changing the action space of the adversarial agent to improve the robustness of the reinforcement learning algorithm.
  • an optimization framework based on game theory is introduced to select the most appropriate confrontation strategy based on historical strategy performance during the training process of each task, making the reinforcement learning strategy more robust.
  • the game-theoretic optimization framework in the embodiments of this application includes but is not limited to policy-space response oracles (PSRO), etc.; the training of the reinforcement learning models includes but is not limited to sampling-based reinforcement learning algorithms, such as the trust region policy optimization algorithm (TRPO) and the proximal policy optimization (PPO) algorithm.
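  • A highly simplified, runnable outline of such a PSRO-style alternation is sketched below; the evaluate/train functions are placeholders, and a uniform meta-distribution stands in for the Nash equilibrium distribution that the method would actually compute from the payoff matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(rl_strategy, adv_strategy):
    """Placeholder: average reward of deploying this (RL strategy, adversary) pair."""
    return float(rng.normal())

def train_rl_against(adv_strategy):
    """Placeholder for TRPO/PPO-style training of a new RL strategy vs. a fixed adversary."""
    return ("rl", int(rng.integers(1_000)))

def train_adv_against(rl_strategy):
    """Placeholder for training a new adversarial agent vs. a fixed RL strategy."""
    return ("adv", int(rng.integers(1_000)))

rl_pool, adv_pool = [("rl", 0)], [("adv", 0)]
for _ in range(3):  # a few PSRO-style rounds
    # Payoff matrix between all historical strategies; a Nash solver (see the
    # fictitious-play sketch above) would turn it into meta-distributions.
    payoff = np.array([[evaluate(p, a) for a in adv_pool] for p in rl_pool])
    rl_dist = np.full(len(rl_pool), 1.0 / len(rl_pool))     # uniform stand-in
    adv_dist = np.full(len(adv_pool), 1.0 / len(adv_pool))  # uniform stand-in
    # Each side trains a new best response against an opponent sampled from the distribution.
    sampled_adv = adv_pool[rng.choice(len(adv_pool), p=adv_dist)]
    sampled_rl = rl_pool[rng.choice(len(rl_pool), p=rl_dist)]
    rl_pool.append(train_rl_against(sampled_adv))
    adv_pool.append(train_adv_against(sampled_rl))
```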
  • the present application provides a model training method, which includes: processing first data through a first reinforcement learning model to obtain a first processing result; wherein the first data indicates the state of the target object, and the first processing result is used as control information when performing a target task on the target object; processing the first data through the first target neural network to obtain a second processing result; wherein the second processing result is used as interference information when performing the target task, the first target neural network is selected from a plurality of first neural networks, and each of the first neural networks is an iteration result obtained by iteratively training the first initial neural network.
  • an adversarial agent for outputting interference information can be trained.
  • the interference information only performs one kind of interference for the target task.
  • multiple adversarial agents for outputting interference information can be trained.
  • the interference information output by different adversarial agents can interfere with different types of target tasks.
  • the historical training results of historical adversarial agents can also be used to output interference on the target task, so that more effective interference with the target task can be obtained in different scenarios, thereby improving the training effect and generalization of the model.
  • Figure 7 is a schematic structural diagram of a model training device provided by an embodiment of the present application. As shown in Figure 7, the device 700 includes:
  • the data processing module 701 is used to process the first data through the first reinforcement learning model to obtain a first processing result; wherein the first data indicates the state of the target object, and the first processing result is used as the target object. Control information when performing target tasks on the target object;
  • the first data is processed through a first target neural network to obtain a second processing result; wherein the second processing result is used as interference information when executing the target task, and the first target neural network is Selected from a plurality of first neural networks, each first neural network is an iterative result obtained from the process of iterative training of the first initial neural network;
  • For the specific description of the data processing module 701, reference may be made to the descriptions of step 401, step 402, and step 403 in the above embodiment, which will not be repeated here.
  • the model update module 702 is configured to update the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model.
  • For a specific description of the model update module 702, reference may be made to the description of step 404 in the above embodiment, which will not be repeated here.
  • an adversarial agent for outputting interference information can be trained.
  • the interference information only performs one kind of interference for the target task.
  • multiple adversarial agents for outputting interference information can be trained.
  • the interference information output by different adversarial agents can interfere with different types of target tasks.
  • when training adversarial agents, not only the adversarial agent obtained in the latest iteration is used to output interference on the target task; the historical training results of historical adversarial agents (adversarial agents obtained during the historical iteration process) can also be used to output interference on the target task, so that more effective interference with the target task can be obtained in different scenarios, thereby improving the training effect and generalization of the model.
  • the target object is a robot; the target task is the attitude control of the robot, and the first processing result is the attitude control information of the robot; or,
  • the target object is a vehicle; the target task is automatic driving of the vehicle; and the first processing result is the driving control information of the vehicle.
  • the first target neural network is selected from a plurality of first neural networks based on a first selection probability corresponding to each of the plurality of first neural networks.
  • the model update module is specifically used to: obtain the reward value corresponding to the target task according to the first processing result and the second processing result, and update the first reinforcement learning model according to the reward value;
  • the model update module is also used to: update the first selection probability corresponding to the first target neural network according to the reward value.
  • the data processing module is also used to:
  • the first data is processed through a second target neural network to obtain a fourth processing result; wherein the fourth processing result is used as interference information when executing the target task, and the second target neural network is Selected from a plurality of second neural networks, each second neural network is an iterative result obtained from the iterative training process of a second initial neural network; the first initial neural network and the second initial neural network are Neural networks are different;
  • the data processing module is specifically used to: execute the target task according to the first processing result, the second processing result, and the fourth processing result to obtain the third processing result.
  • the interference types of the second processing result and the fourth processing result are different; or,
  • the interference objects of the second processing result and the fourth processing result are different; or,
  • the first target neural network is used to determine the second processing result from a first numerical range according to the first data, and the second target neural network is used to determine the fourth processing result from a second numerical range according to the first data, where the second numerical range is different from the first numerical range.
  • the data processing module is also used to: process second data through a second reinforcement learning model to obtain a fifth processing result; wherein the second reinforcement learning model is derived from multiple reinforcement learning models that include the updated first reinforcement learning model, and each of the reinforcement learning models is an iteration result obtained during the iterative training of the initial reinforcement learning model; the second data indicates the state of the target object, and the fifth processing result is used as control information when performing the target task on the target object;
  • the second data is processed through a third target neural network to obtain a sixth processing result;
  • the third target neural network belongs to the plurality of first neural networks;
  • the sixth processing result is used as interference information when executing the target task;
  • the model update module is also used to:
  • the third target neural network is updated according to the seventh processing result to obtain an updated third target neural network.
  • the second reinforcement learning model is selected from multiple reinforcement learning models based on the second selection probability corresponding to each reinforcement learning model in the multiple reinforcement learning models.
  • FIG. 8 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • the execution device 800 can be embodied as a mobile phone, a tablet, a notebook computer, a smart wearable device, etc., which is not limited here.
  • the execution device 800 includes: a receiver 801, a transmitter 802, a processor 803 and a memory 804 (the number of processors 803 in the execution device 800 can be one or more, one processor is taken as an example in Figure 8) , wherein the processor 803 may include an application processor 8031 and a communication processor 8032.
  • the receiver 801, the transmitter 802, the processor 803, and the memory 804 may be connected through a bus or other means.
  • Memory 804 may include read-only memory and random access memory and provides instructions and data to processor 803 .
  • a portion of memory 804 may also include non-volatile random access memory (NVRAM).
  • Memory 804 stores a processor and operating instructions, executable modules or data structures, or a subset or an extended set thereof, where the operating instructions may include various operation instructions for implementing various operations.
  • Processor 803 controls the operation of the execution device.
  • various components of the execution device are coupled together through a bus system.
  • the bus system may also include a power bus, a control bus, a status signal bus, etc.
  • however, for the sake of clarity, the various buses are all labeled as the bus system in the figure.
  • the methods disclosed in the above embodiments of the present application can be applied to the processor 803 or implemented by the processor 803.
  • the processor 803 may be an integrated circuit chip with signal processing capabilities. During the implementation process, each step of the above method can be completed by instructions in the form of hardware integrated logic circuits or software in the processor 803 .
  • the above-mentioned processor 803 can be a general processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and can further include an application specific integrated circuit (ASIC), a field programmable Gate array (field-programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
  • the processor 803 can implement or execute the disclosed methods, steps and logical block diagrams in the embodiments of this application.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory 804.
  • the processor 803 reads the information in the memory 804 and completes the steps of the above method in combination with its hardware.
  • the receiver 801 may be used to receive input numeric or character information and generate signal inputs related to performing relevant settings and functional controls of the device.
  • the transmitter 802 can be used to output numeric or character information; the transmitter 802 can also be used to send instructions to the disk group to modify data in the disk group.
  • the processor 803 is configured to run the model obtained through the model training method in the embodiment corresponding to FIG. 4.
  • FIG. 9 is a schematic structural diagram of the server provided by the embodiment of the present application.
  • the server 900 is implemented by one or more servers.
  • the server 900 may differ considerably depending on its configuration or performance, and may include one or more central processing units (CPU) 99 (for example, one or more processors), memory 932, and one or more storage media 930 (for example, one or more mass storage devices) storing application programs 942 or data 944.
  • the memory 932 and the storage medium 930 may be short-term storage or persistent storage.
  • the program stored in the storage medium 930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server.
  • the central processor 99 may be configured to communicate with the storage medium 930 and execute a series of instruction operations in the storage medium 930 on the server 900 .
  • the server 900 may also include one or more power supplies 99, one or more wired or wireless network interfaces 950, one or more input and output interfaces 958; or, one or more operating systems 941, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM and so on.
  • the central processor 99 is used to execute the steps of the model training method in the corresponding embodiment of FIG. 4 .
  • An embodiment of the present application also provides a computer program product including computer-readable instructions which, when run on a computer, cause the computer to perform the steps performed by the foregoing execution device, or cause the computer to perform the steps performed by the foregoing training device.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a program for performing signal processing; when the program is run on a computer, it causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
  • the execution device, training device or terminal device provided by the embodiment of the present application may specifically be a chip.
  • the chip includes: a processing unit and a communication unit.
  • the processing unit may be, for example, a processor.
  • the communication unit may be, for example, an input/output interface, a pin, or a circuit.
  • the processing unit can execute the computer execution instructions stored in the storage unit, so that the chip in the execution device executes the model training method described in the above embodiment, or so that the chip in the training device executes the steps related to model training in the above embodiment.
  • the storage unit is a storage unit within the chip, such as a register, cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM), etc.
  • Figure 10 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the chip may be embodied as a neural network processor NPU 1000; the NPU 1000 is mounted on the host CPU (Host CPU) as a co-processor, and the host CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 1003.
  • the arithmetic circuit 1003 is controlled by the controller 1004 to extract the matrix data in the memory and perform multiplication operations.
  • the computing circuit 1003 includes multiple processing units (Process Engine, PE).
  • arithmetic circuit 1003 is a two-dimensional systolic array.
  • the arithmetic circuit 1003 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 1003 is a general-purpose matrix processor.
  • the arithmetic circuit obtains the corresponding data of matrix B from the weight memory 1002 and caches it on each PE in the arithmetic circuit.
  • the operation circuit takes the data of matrix A from the input memory 1001, performs matrix operations with matrix B, and stores the partial or final result of the matrix in the accumulator 1008.
  • the unified memory 1006 is used to store input data and output data.
  • the weight data is transferred directly to the weight memory 1002 through the direct memory access controller (DMAC) 1005.
  • Input data is also transferred to unified memory 1006 via DMAC.
  • BIU is the Bus Interface Unit, that is, the bus interface unit 1010, which is used for the interaction between the AXI bus and the DMAC and the Instruction Fetch Buffer (IFB) 1009.
  • the bus interface unit 1010 (Bus Interface Unit, BIU for short) is used for the instruction fetch buffer 1009 to obtain instructions from the external memory, and is also used for the storage unit access controller 1005 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1006 or the weight data to the weight memory 1002 or the input data to the input memory 1001 .
  • the vector calculation unit 1007 includes multiple arithmetic processing units, and if necessary, further processes the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
  • vector calculation unit 1007 can store the processed output vectors to unified memory 1006 .
  • the vector calculation unit 1007 can apply a linear function or a nonlinear function to the output of the operation circuit 1003, for example, performing linear interpolation on the feature plane extracted by the convolution layer or on a vector of accumulated values, to generate activation values.
  • vector calculation unit 1007 generates normalized values, pixel-wise summed values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 1003, such as for use in a subsequent layer in a neural network.
  • the instruction fetch buffer 1009 connected to the controller 1004 is used to store instructions used by the controller 1004;
  • the unified memory 1006, the input memory 1001, the weight memory 1002 and the fetch memory 1009 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • the processor mentioned in any of the above places can be a general central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above programs.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units.
  • the physical unit can be located in one place, or it can be distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
  • the present application can be implemented by software plus necessary general-purpose hardware, and of course can also be implemented by dedicated hardware, including dedicated integrated circuits, dedicated CPUs, dedicated memories, dedicated components, etc. In general, all functions performed by computer programs can be easily implemented with corresponding hardware, and the specific hardware structures used to implement the same function can also be diverse, such as analog circuits, digital circuits, or dedicated circuits. However, for this application, a software program implementation is a better implementation in most cases. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product.
  • the computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk, or optical disk, and includes several instructions to cause a computer device (which can be a personal computer, a training device, or a network device, etc.) to execute the methods described in the various embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device, or data center through wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a training device or a data center integrated with one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, solid state disk (Solid State Disk, SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a model training method in the field of artificial intelligence. The method comprises: processing first data by means of a first reinforcement learning model to obtain a first processing result; processing the first data by means of a first target neural network selected from a plurality of first neural networks to obtain a second processing result, each first neural network being an iteration result obtained by performing iterative training on a first initial neural network; and updating the first reinforcement learning model according to the first processing result and the second processing result. According to the present invention, interference with a target task is output by using a historical training result of a historical adversarial agent (an adversarial agent obtained in a historical iteration process), so that more effective interference with the target task in different scenarios can be obtained, thereby improving the training effect and the generalization of a model.
PCT/CN2023/101527 2022-06-21 2023-06-20 Procédé d'entraînement de modèle et dispositif associé WO2023246819A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210705971.0A CN115293227A (zh) 2022-06-21 2022-06-21 一种模型训练方法及相关设备
CN202210705971.0 2022-06-21

Publications (1)

Publication Number Publication Date
WO2023246819A1 true WO2023246819A1 (fr) 2023-12-28

Family

ID=83821246

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101527 WO2023246819A1 (fr) 2022-06-21 2023-06-20 Procédé d'entraînement de modèle et dispositif associé

Country Status (2)

Country Link
CN (1) CN115293227A (fr)
WO (1) WO2023246819A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115293227A (zh) * 2022-06-21 2022-11-04 华为技术有限公司 一种模型训练方法及相关设备
CN116330290B (zh) * 2023-04-10 2023-08-18 大连理工大学 基于多智能体深度强化学习的五指灵巧机器手控制方法
CN116996403B (zh) * 2023-09-26 2023-12-15 深圳市乙辰科技股份有限公司 应用ai模型的网络流量诊断方法及系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6119112A (en) * 1997-11-19 2000-09-12 International Business Machines Corporation Optimum cessation of training in neural networks
CN113011575A (zh) * 2019-12-19 2021-06-22 华为技术有限公司 神经网络模型更新方法、图像处理方法及装置
CN113919482A (zh) * 2021-09-22 2022-01-11 上海浦东发展银行股份有限公司 智能体训练方法、装置、计算机设备和存储介质
CN113988196A (zh) * 2021-11-01 2022-01-28 乐聚(深圳)机器人技术有限公司 一种机器人移动方法、装置、设备及存储介质
CN114565092A (zh) * 2020-11-13 2022-05-31 华为技术有限公司 一种神经网络结构确定方法及其装置
CN115293227A (zh) * 2022-06-21 2022-11-04 华为技术有限公司 一种模型训练方法及相关设备

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6119112A (en) * 1997-11-19 2000-09-12 International Business Machines Corporation Optimum cessation of training in neural networks
CN113011575A (zh) * 2019-12-19 2021-06-22 华为技术有限公司 神经网络模型更新方法、图像处理方法及装置
CN114565092A (zh) * 2020-11-13 2022-05-31 华为技术有限公司 一种神经网络结构确定方法及其装置
CN113919482A (zh) * 2021-09-22 2022-01-11 上海浦东发展银行股份有限公司 智能体训练方法、装置、计算机设备和存储介质
CN113988196A (zh) * 2021-11-01 2022-01-28 乐聚(深圳)机器人技术有限公司 一种机器人移动方法、装置、设备及存储介质
CN115293227A (zh) * 2022-06-21 2022-11-04 华为技术有限公司 一种模型训练方法及相关设备

Also Published As

Publication number Publication date
CN115293227A (zh) 2022-11-04

Similar Documents

Publication Publication Date Title
WO2023246819A1 (fr) Procédé d'entraînement de modèle et dispositif associé
WO2022022274A1 (fr) Procédé et appareil d'instruction de modèles
WO2022042713A1 (fr) Procédé d'entraînement d'apprentissage profond et appareil à utiliser dans un dispositif informatique
WO2022068623A1 (fr) Procédé de formation de modèle et dispositif associé
CN112651511A (zh) 一种训练模型的方法、数据处理的方法以及装置
CN113065636A (zh) 一种卷积神经网络的剪枝处理方法、数据处理方法及设备
CN114997412A (zh) 一种推荐方法、训练方法以及装置
WO2023274052A1 (fr) Procédé de classification d'images et son dispositif associé
WO2023231961A1 (fr) Procédé d'apprentissage de renforcement multi-agent et dispositif associé
WO2023072175A1 (fr) Procédé de traitement de données de nuage en points, procédé d'apprentissage de réseau neuronal et dispositif associé
WO2023185925A1 (fr) Procédé de traitement de données et appareil associé
CN113065633A (zh) 一种模型训练方法及其相关联设备
CN115238909A (zh) 一种基于联邦学习的数据价值评估方法及其相关设备
CN111738403A (zh) 一种神经网络的优化方法及相关设备
CN114169393A (zh) 一种图像分类方法及其相关设备
WO2024017282A1 (fr) Procédé et dispositif de traitement de données
WO2023246735A1 (fr) Procédé de recommandation d'article et dispositif connexe associé
CN113627163A (zh) 一种注意力模型、特征提取方法及相关装置
WO2023197857A1 (fr) Procédé de partitionnement de modèle et dispositif associé
WO2023197910A1 (fr) Procédé de prédiction de comportement d'utilisateur et dispositif associé
WO2023045949A1 (fr) Procédé de formation de modèle et dispositif associé
CN115565104A (zh) 一种动作预测方法及其相关设备
CN116710974A (zh) 在合成数据系统和应用程序中使用域对抗学习的域适应
WO2023051236A1 (fr) Procédé de résolution d'équation différentielle partielle, et dispositif associé
WO2023143128A1 (fr) Procédé de traitement de données et dispositif associé

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23826465

Country of ref document: EP

Kind code of ref document: A1