CN116151385A - Robot autonomous learning method based on generative adversarial network - Google Patents

Robot autonomous learning method based on a generative adversarial network

Info

Publication number
CN116151385A
CN116151385A (application CN202111344484.8A)
Authority
CN
China
Prior art keywords
sample
function
robot
samples
autonomous learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111344484.8A
Other languages
Chinese (zh)
Inventor
库涛
俞宁
林乐新
刘金鑫
李进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Institute of Automation of CAS
Original Assignee
Shenyang Institute of Automation of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Institute of Automation of CAS filed Critical Shenyang Institute of Automation of CAS
Priority to CN202111344484.8A
Publication of CN116151385A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1602 Programme controls characterised by the control system, structure, architecture
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mechanical Engineering (AREA)
  • Robotics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides a robot autonomous learning method based on a generative adversarial network, applied to robot autonomous learning with few or zero samples in industrial scenarios. The method comprises the following steps: 1) establishing a chain model of the robot's behavior based on a Markov chain; 2) acquiring additional samples with a generative adversarial network from the existing samples or expert data; 3) obtaining a reward function and training an optimal decision through inverse reinforcement learning; 4) acquiring an optimal value function and an optimal policy function from the reward function and the optimal policy; 5) completing the construction of the robot's autonomous learning model. The method is mainly aimed at industrial scenarios that lack experience samples, and achieves autonomous robot learning by combining a generative adversarial network with inverse reinforcement learning, thereby improving the automation and intelligence level of the robot.

Description

Robot autonomous learning method based on a generative adversarial network
Technical Field
The invention belongs to the fields of intelligent robot control and robot autonomous learning, and particularly relates to a robot autonomous learning method based on a generative adversarial network.
Background
A robot autonomous learning method mainly refers to a machine learning method that enables a robot to accumulate experience data through its own interaction with the environment and thereby make action decisions autonomously. Robot autonomous learning is one of the important means of robot control and often plays an important role in functions such as environment perception, behavior control, dynamic decision-making, and automatic execution within an intelligent integrated control system. It not only requires the decision-making method learned by the robot to be highly optimized, but also places extremely high demands on indices such as learning speed and reaction speed. Therefore, continuously improving robot autonomous learning methods is an important subject of current robotics research.
Typically, such learning methods require extensive sample training and manual setting of key parameters to ensure learning efficiency and accuracy. As a result, the robot's learning outcome is often limited by the size of the data set and by human parameter settings. Moreover, if the data set contains contaminated data, the final degree of optimization is likely to be greatly reduced and may even fail to meet practical requirements. In addition, these methods require the designer to have substantial experience with the actual scenario in order to set the parameters accurately; if the designer cannot judge the actual requirements correctly, the learning direction is likely to deviate and the expected decision-making capability will not be achieved. These are the problems that robot autonomous learning currently needs to solve.
Disclosure of Invention
The invention combines generative adversarial network technology with an inverse reinforcement learning method and provides a robot autonomous learning method that aims to reduce the dependence of robot autonomous learning on expert samples, improve the robot's learning efficiency, and increase the degree of optimization of the robot's autonomous decisions.
The technical solution adopted by the invention to achieve this purpose is as follows:
A robot autonomous learning method based on a generative adversarial network, comprising the following steps:
constructing a Markov chain model, acquiring complete action trajectories and decision steps of the robot, sampling them to generate a real sample set representing the actions, and storing the real sample set in a real sample pool;
randomly generating signals and feeding them into a generator, generating samples with the generator, and storing the generated samples in a virtual sample pool;
feeding the generated samples into a discriminator, comparing the generated samples with the real samples in the discriminator, dynamically adjusting the generated samples according to the comparison result, and updating the virtual sample pool;
mixing the updated virtual sample pool with the real sample pool to form a mixed sample pool, and randomly extracting data from the mixed sample pool;
randomly generating a policy and executing it;
sampling the executed policy and comparing the sampling result with the data extracted from the mixed sample pool to obtain a reward function and an optimal policy;
training the Markov chain model according to the reward function, taking the robot's state as the model input and obtaining the corresponding action, thereby completing the robot's autonomous learning.
The Markov chain model is constructed as follows: a five-tuple (S, A, P, R, γ) is established according to the Markov chain model, where the set S represents the current state set, the set A represents the action set at the next moment, P gives the probabilities of the actions in A, R is the reward function, and γ ∈ (0, 1) is the discount coefficient.
The discriminator compares the generated samples with the real samples as follows: the generated samples and the real samples are mixed to form training samples, the training samples are fed into the discriminator for discrimination, and the probability D(x) that a training sample comes from the generated samples is output.
The generated samples are dynamically adjusted according to the comparison result as follows: the loss function of the discriminator and the loss function of the generator are calculated from the probability D(x), and when the two loss functions reach a Nash equilibrium, the adjustment of the generated samples is stopped.
The loss function $L_{discriminator}(D)$ of the discriminator is:

$L_{discriminator}(D) = E_{x\sim P}[-\log D(x)] + E_{x\sim G}[-\log(1-D(x))]$

where $E_{x\sim P}[-\log D(x)]$ represents the loss of classifying real samples as generated samples, and $E_{x\sim G}[-\log(1-D(x))]$ represents the loss of classifying generated samples as real samples.
The loss function $L_{generator}(G)$ of the generator is:

$L_{generator}(G) = E_{x\sim G}[-\log D(x)] + E_{x\sim G}[\log(1-D(x))]$

where $E_{x\sim G}[-\log D(x)]$ represents the loss of the discriminator classifying generated samples as generated samples, and $E_{x\sim G}[\log(1-D(x))]$ represents the loss of the discriminator classifying generated samples as real samples.
A value function is used to evaluate a policy; it includes the state-value function $V^{\pi}(s)$ and the action-value function $Q^{\pi}(s,a)$, where:

$V^{\pi}(s) = \sum_{a\in A}\pi(s,a)\Big[R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\,V^{\pi}(s')\Big]$

$Q^{\pi}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\sum_{a'\in A}\pi(s',a')\,Q^{\pi}(s',a')$

where $\pi(s,a)$ is the policy in state s for action a, R is the reward function, $P(s,a,s')$ is the probability of the transition $s\to s'$, and $a'$ is the action taken in the next state $s'$.
The executed policy is sampled and the sampling result is compared with the data extracted from the mixed sample pool to obtain the reward function and the optimal policy, specifically:

the optimal value functions $V^{*}(s)$ and $Q^{*}(s,a)$ are obtained by the following formulas,

$V^{*}(s) = \max_{a\in A}\Big[R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\,V^{*}(s')\Big]$

$Q^{*}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\max_{a'\in A}Q^{*}(s',a')$

and the condition for action $a_1$ to be the optimal policy $\pi^{*}(s)$ is:

$\sum_{s'}P_{sa_1}(s')\,V^{*}(s') \;\ge\; \sum_{s'}P_{sa}(s')\,V^{*}(s'),\qquad \forall a\in A$

where $P_{sa_1}(s')$ is the probability of the transition $s\to s'$ when action $a_1$ is executed in state s, $V^{*}(s')$ denotes the optimal state-value function of state $s'$, and $P_{sa}(s')$ is the probability of the transition $s\to s'$ when an arbitrary action $a\in A$ is executed in state s.
Combining this with $V^{*}(s)$ gives

$(P_{a_1}-P_{a})\,(I-\gamma P_{a_1})^{-1}R \;\succeq\; 0,\qquad \forall a\in A$

and finally the reward function and the optimal policy are obtained.
The Markov chain model is trained according to the reward function, specifically, the parameters of the reward function are obtained by calculating the maximum entropy, and the Markov chain model is thereby determined.
The maximum-entropy calculation is specifically:

$\max_{p}\; -\sum_{i} p(l_i)\log p(l_i)$

$\text{s.t.}\quad \sum_{i} p(l_i)\,f(l_i) = f_E = \frac{1}{|D|}\sum_{\tau_i\in D} f(\tau_i),\qquad \sum_{i} p(l_i) = 1$

where p is the probability, $l_i$ represents the i-th trajectory in the probabilistic model, f represents the feature expectation, $f_E$ represents the expert feature expectation, and $\tau_i$ is the i-th element of the expert sample set;

$R = \lambda_0 f_0 + \lambda_1 f_1 + \cdots + \lambda_n f_n = \lambda^{T}f$

where $\lambda_i\ (i=0\sim n)$ is the i-th parameter of the reward function R.
The invention has the following beneficial effects and advantages:
1. The invention provides a robot autonomous learning method based on a generative adversarial network, mainly aimed at the learning problem in robot application scenarios. By combining a generative adversarial network model with an inverse reinforcement learning method, it realizes robot learning under few-sample conditions, reduces dependence on the size of the sample data set, and effectively improves learning efficiency.
2. The learning method is carried out autonomously by the robot and requires essentially no human intervention, which reduces interference from human factors and improves the degree of optimization of the robot's decisions.
3. The learning method adopts inverse reinforcement learning, can obtain a suitable reward function from the environment, and finally trains an optimal policy function, which greatly improves the robot's generalization performance.
4. The learning method adopts a generative adversarial network model and can generate a large number of near-real samples, so that a large amount of data can be learned even when real samples are scarce and more highly optimized samples can be obtained. The final performance is therefore not limited by the degree of optimization of the samples, which effectively improves the robot's intelligence.
Drawings
FIG. 1 is a flow chart of the robot autonomous learning process;
FIG. 2 is a diagram of the relationship between the components.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
As shown in FIG. 1 and FIG. 2, step one: build the Markov chain model. A five-tuple (S, A, P, R, γ) can be established according to the Markov chain model, where the set S represents the current state set, the set A represents the action set at the next moment, P gives the probabilities of the actions in A, R is the reward function, and γ ∈ (0, 1) is the discount coefficient used to compute the accumulated reward value.
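As a non-limiting illustration, the five-tuple of step one can be held in a simple Python container such as the one below; the class name, dictionary layout, and type aliases are assumptions introduced here for clarity, not part of the patented method.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, List

State = str
Action = str

@dataclass
class MDP:
    """Markov chain (MDP) five-tuple (S, A, P, R, gamma)."""
    states: List[State]                          # S: current state set
    actions: List[Action]                        # A: action set at the next moment
    P: Dict[Tuple[State, Action, State], float]  # P(s, a, s'): transition probability
    R: Dict[Tuple[State, Action], float]         # R(s, a): reward function
    gamma: float                                 # discount coefficient, 0 < gamma < 1

    def __post_init__(self):
        assert 0.0 < self.gamma < 1.0, "discount coefficient must lie in (0, 1)"
```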
Step two: provide a small number of expert samples or examples and place them in the real sample pool. Expert samples or examples generally refer to complete motion trajectories or decision steps; after systematic sampling, a series of action sets D = {τ_1, τ_2, …, τ_n} is generated. The sampled action sets are stored in the real sample pool D_1 for subsequent comparison.
Step three: randomly generate signals and feed them into the generator to produce corresponding data. The random signal is typically noise and is used to characterize some random environmental element. The generator G generates corresponding samples x from the signal, written as x ~ G.
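For illustration only, a generator of this kind might look as follows in PyTorch; the MLP architecture, layer widths, and dimensions are assumptions, since the patent does not prescribe a particular network structure.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a random noise signal z to a generated sample x ~ G (illustrative MLP sketch)."""
    def __init__(self, noise_dim=16, sample_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, sample_dim),
        )

    def forward(self, z):
        return self.net(z)

# a random signal characterising an environmental element is fed to the generator
z = torch.randn(32, 16)
x_generated = Generator()(z)   # samples x ~ G, later stored in the virtual pool
```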
Step four: the generator passes the data to the discriminator, which compares them and feeds the result back to the generator. The task of the discriminator D is to classify an input sample either as output of the generator or as an actual sample from the underlying data distribution p(x). Similar samples are retrieved from the real sample pool, mixed with the samples x produced by the generator, and fed into the discriminator for discrimination, which outputs the probability D(x) that a training sample comes from the generated samples. The discriminator loss can then be calculated: it is the average log-probability assigned to the correct class, evaluated on the mixed set of actual samples and generator outputs,
$L_{discriminator}(D) = E_{x\sim P}[-\log D(x)] + E_{x\sim G}[-\log(1-D(x))]$

where $E_{x\sim P}[-\log D(x)]$ represents the loss of classifying real samples as generated samples, and $E_{x\sim G}[-\log(1-D(x))]$ represents the loss of classifying generated samples as real samples.
Step five: the generator adjusts its samples according to the discriminator's feedback. The task of the generator is to produce outputs that the discriminator classifies as coming from the underlying data distribution. A large discriminator loss indicates that the generator producing this set of samples is of high quality; otherwise, the discriminator is of high quality. The generator loss is the sum of the loss for classifying a generated sample as generated and the loss for classifying it as real,
$L_{generator}(G) = E_{x\sim G}[-\log D(x)] + E_{x\sim G}[\log(1-D(x))]$

where $E_{x\sim G}[-\log D(x)]$ represents the loss of classifying generated samples as generated samples, and $E_{x\sim G}[\log(1-D(x))]$ represents the loss of classifying generated samples as real samples.
Step six: mix the real sample pool and the virtual sample pool. By optimizing the two loss functions, the generator model G and the discriminator model D eventually reach a Nash equilibrium; the generated samples then have a high similarity to the real data and are stored in the sample pool D_2, called the virtual sample pool. At this point the real sample pool and the virtual sample pool are fully mixed into a mixed sample pool D_d, so that when samples are drawn, real samples or generated samples are obtained with a certain probability.
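An alternating update loop of the following kind, run until the two losses roughly balance, is one way to approximate the Nash-equilibrium stopping criterion and to fill the virtual pool D_2. It is a sketch that reuses the Generator, Discriminator, and loss functions sketched above; the optimizer choice, learning rate, and step count are assumptions.

```python
import torch

def train_gan(G, D, real_batch_fn, noise_dim=16, steps=1000, lr=2e-4):
    """Alternately update D and G, then populate the virtual sample pool D2."""
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    for _ in range(steps):
        real_x = real_batch_fn()                    # batch drawn from the real pool D1
        z = torch.randn(real_x.shape[0], noise_dim)
        fake_x = G(z)

        opt_d.zero_grad()
        d_loss = discriminator_loss(D, real_x, fake_x.detach())
        d_loss.backward()
        opt_d.step()

        opt_g.zero_grad()
        g_loss = generator_loss(D, G(z))
        g_loss.backward()
        opt_g.step()

    virtual_pool = []
    with torch.no_grad():                           # generated samples stored in D2
        virtual_pool.extend(G(torch.randn(256, noise_dim)))
    return virtual_pool
```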
Step seven: randomly extract data D'_d from the mixed sample pool. Either a generated sample or a real sample may be drawn, but because the generative adversarial network is continually updated, the generated samples have a quality similar to that of the real samples.
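A minimal way to realize the mixed pool D_d and the random draw D'_d is shown below; the mixing probability p_real is an assumption, since the patent only requires that real or generated samples be drawn with a certain probability.

```python
import random

def mix_and_draw(real_pool, virtual_pool, batch_size, p_real=0.5):
    """Draw a batch D'_d from the mixed pool D_d, choosing the real pool with
    probability p_real and the virtual pool otherwise."""
    batch = []
    for _ in range(batch_size):
        pool = real_pool if random.random() < p_real else virtual_pool
        batch.append(random.choice(pool))
    return batch
```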
Step eight: randomly generate a policy and execute it. A policy q_k can be generated at random according to the environment, executed, and sampled. The concept of a value function is typically introduced to evaluate a policy. In general, $V^{\pi}(s)$ denotes the state-value function and $Q^{\pi}(s,a)$ the action-value function. They are computed as

$V^{\pi}(s) = \sum_{a\in A}\pi(s,a)\Big[R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\,V^{\pi}(s')\Big]$

$Q^{\pi}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\sum_{a'\in A}\pi(s',a')\,Q^{\pi}(s',a')$

where $\pi(s,a)$ is the policy in state s for action a, R is the reward function, and $P(s,a,s')$ is the probability of the transition $s\to s'$.
Step nine: compare the executed policy with the data in the sample pool and update the reward value. The executed policy is sampled and compared with the high-quality samples in the mixed sample pool; using the current policy samples D'_s and the high-quality samples D'_d, the optimal reward function under the current conditions is found. The optimal value functions $V^{*}(s)$ and $Q^{*}(s,a)$ can be obtained by the following formulas,

$V^{*}(s) = \max_{a\in A}\Big[R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\,V^{*}(s')\Big]$

$Q^{*}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\max_{a'\in A}Q^{*}(s',a')$
further, the action a is known 1 Is the optimal strategy pi * The filling conditions of(s) are that
Figure BDA0003353483310000072
And can be written as
Figure BDA0003353483310000073
Bond V * (s) it can be seen that
Figure BDA0003353483310000074
Finally, the reward function and the optimal policy can be obtained.
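The optimal value functions and a greedy optimal policy can be approximated with standard value iteration, as in the sketch below; it again relies on the assumed tabular MDP container and is an illustration rather than the patent's prescribed procedure.

```python
def value_iteration(mdp, tol=1e-6):
    """Compute V*(s), Q*(s, a) and a greedy policy pi*(s) by Bellman optimality backups."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            v_new = max(
                mdp.R[(s, a)] + mdp.gamma * sum(
                    mdp.P.get((s, a, s2), 0.0) * V[s2] for s2 in mdp.states)
                for a in mdp.actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    Q = {(s, a): mdp.R[(s, a)] + mdp.gamma * sum(
            mdp.P.get((s, a, s2), 0.0) * V[s2] for s2 in mdp.states)
         for s in mdp.states for a in mdp.actions}
    pi_star = {s: max(mdp.actions, key=lambda a: Q[(s, a)]) for s in mdp.states}
    return V, Q, pi_star
```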
Step ten: optimize the policy function. Once the optimal reward function has been found, the current reward value can be determined, and the policy function is optimized according to the reward value and the high-quality samples, improving its performance.
Step eleven: obtain the reward function and the optimal decision. Through continuous optimization, the policy corresponding to the optimal reward function under any condition is finally obtained.
Step twelve: train the model according to the reward function, obtain the optimal value function and policy function, and complete the model construction. With the reward function, reinforcement-learning training can be performed, finally yielding the optimal value function and policy function. The reward function obtained by inverse reinforcement learning is typically ambiguous, so the maximum entropy must usually be found to avoid this ambiguity, i.e., the following problem is solved,
$\max_{p}\; -\sum_{i} p(l_i)\log p(l_i)$

$\text{s.t.}\quad \sum_{i} p(l_i)\,f(l_i) = f_E,\qquad \sum_{i} p(l_i) = 1$

where p is the probability, $l_i$ represents a trajectory in the probabilistic model, f represents the feature expectation, and $f_E$ represents the expert feature expectation;

$R = \lambda_0 f_0 + \lambda_1 f_1 + \cdots + \lambda_n f_n = \lambda^{T}f$

where $\lambda_i\ (i=0\sim n)$ is a parameter of the reward function.
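For the maximum-entropy step, a simplified fit of the linear reward weights λ over an enumerated set of candidate trajectories could proceed as below. The explicit trajectory enumeration, gradient-ascent update, learning rate, and iteration count are all assumptions; full maximum-entropy inverse reinforcement learning normally uses dynamic programming over the MDP rather than enumeration.

```python
import numpy as np

def maxent_irl_weights(feat_expert, feat_trajs, lr=0.1, iters=200):
    """Fit lambda in R = sum_i lambda_i * f_i by matching feature expectations under the
    maximum-entropy distribution p(l) ∝ exp(lambda^T f(l)) to the expert expectation f_E.
    feat_expert: f_E, shape [n]; feat_trajs: stacked f(l_i) for candidate trajectories, shape [m, n]."""
    lam = np.zeros(feat_expert.shape[0])
    for _ in range(iters):
        logits = feat_trajs @ lam
        p = np.exp(logits - logits.max())
        p /= p.sum()                          # max-entropy distribution over trajectories
        grad = feat_expert - p @ feat_trajs   # gradient: f_E - E_p[f]
        lam += lr * grad
    return lam
```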
As described above, while the present invention has been particularly shown and described, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A robot autonomous learning method based on a generative adversarial network, characterized by comprising the following steps:
constructing a Markov chain model, acquiring complete action trajectories and decision steps of the robot, sampling them to generate a real sample set representing the actions, and storing the real sample set in a real sample pool;
randomly generating signals and feeding them into a generator, generating samples with the generator, and storing the generated samples in a virtual sample pool;
feeding the generated samples into a discriminator, comparing the generated samples with the real samples in the discriminator, dynamically adjusting the generated samples according to the comparison result, and updating the virtual sample pool;
mixing the updated virtual sample pool with the real sample pool to form a mixed sample pool, and randomly extracting data from the mixed sample pool;
randomly generating a policy and executing it;
sampling the executed policy and comparing the sampling result with the data extracted from the mixed sample pool to obtain a reward function and an optimal policy;
training the Markov chain model according to the reward function, taking the robot's state as the model input and obtaining the corresponding action, thereby completing the robot's autonomous learning.
2. The robot autonomous learning method based on a generative adversarial network according to claim 1, wherein the Markov chain model is constructed as follows: a five-tuple (S, A, P, R, γ) is established according to the Markov chain model, where the set S represents the current state set, the set A represents the action set at the next moment, P gives the probabilities of the actions in A, R is the reward function, and γ ∈ (0, 1) is the discount coefficient.
3. The robot autonomous learning method based on a generative adversarial network according to claim 1, wherein the discriminator compares the generated samples with the real samples as follows: the generated samples and the real samples are mixed to form training samples, the training samples are fed into the discriminator for discrimination, and the probability D(x) that a training sample comes from the generated samples is output.
4. The robot autonomous learning method based on a generative adversarial network according to claim 1 or 3, wherein the generated samples are dynamically adjusted according to the comparison result, specifically, the loss function of the discriminator and the loss function of the generator are calculated from the probability D(x), and when the two loss functions reach a Nash equilibrium, the adjustment of the generated samples is stopped.
5. The robot autonomous learning method based on a generative adversarial network according to claim 4, wherein the loss function $L_{discriminator}(D)$ of the discriminator is:

$L_{discriminator}(D) = E_{x\sim P}[-\log D(x)] + E_{x\sim G}[-\log(1-D(x))]$

where $E_{x\sim P}[-\log D(x)]$ represents the loss of classifying real samples as generated samples, and $E_{x\sim G}[-\log(1-D(x))]$ represents the loss of classifying generated samples as real samples.
6. The robot autonomous learning method based on a generative adversarial network according to claim 4, wherein the loss function $L_{generator}(G)$ of the generator is:

$L_{generator}(G) = E_{x\sim G}[-\log D(x)] + E_{x\sim G}[\log(1-D(x))]$

where $E_{x\sim G}[-\log D(x)]$ represents the loss of the discriminator classifying generated samples as generated samples, and $E_{x\sim G}[\log(1-D(x))]$ represents the loss of the discriminator classifying generated samples as real samples.
7. The robot autonomous learning method based on a generative adversarial network according to claim 1, characterized in that a value function is used to evaluate a policy, the value function including the state-value function $V^{\pi}(s)$ and the action-value function $Q^{\pi}(s,a)$, where:

$V^{\pi}(s) = \sum_{a\in A}\pi(s,a)\Big[R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\,V^{\pi}(s')\Big]$

$Q^{\pi}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\sum_{a'\in A}\pi(s',a')\,Q^{\pi}(s',a')$

where $\pi(s,a)$ is the policy in state s for action a, R is the reward function, $P(s,a,s')$ is the probability of the transition $s\to s'$, and $a'$ is the action taken in the next state $s'$.
8. The robot autonomous learning method based on a generative adversarial network according to claim 1, wherein the executed policy is sampled and the sampling result is compared with the data extracted from the mixed sample pool to obtain the reward function and the optimal policy, specifically:

the optimal value functions $V^{*}(s)$ and $Q^{*}(s,a)$ are obtained by the following formulas,

$V^{*}(s) = \max_{a\in A}\Big[R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\,V^{*}(s')\Big]$

$Q^{*}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s,a,s')\max_{a'\in A}Q^{*}(s',a')$

and the condition for action $a_1$ to be the optimal policy $\pi^{*}(s)$ is:

$\sum_{s'}P_{sa_1}(s')\,V^{*}(s') \;\ge\; \sum_{s'}P_{sa}(s')\,V^{*}(s'),\qquad \forall a\in A$

where $P_{sa_1}(s')$ is the probability of the transition $s\to s'$ when action $a_1$ is executed in state s, $V^{*}(s')$ denotes the optimal state-value function of state $s'$, and $P_{sa}(s')$ is the probability of the transition $s\to s'$ when an arbitrary action $a\in A$ is executed in state s;
combining this with $V^{*}(s)$ gives

$(P_{a_1}-P_{a})\,(I-\gamma P_{a_1})^{-1}R \;\succeq\; 0,\qquad \forall a\in A$

and finally the reward function and the optimal policy are obtained.
9. The robot autonomous learning method based on a generative adversarial network according to claim 1, wherein the Markov chain model is trained according to the reward function, specifically, the parameters of the reward function are obtained by calculating the maximum entropy, and the Markov chain model is thereby determined.
10. The robot autonomous learning method based on a generative adversarial network according to claim 9, wherein the maximum-entropy calculation is specifically:

$\max_{p}\; -\sum_{i} p(l_i)\log p(l_i)$

$\text{s.t.}\quad \sum_{i} p(l_i)\,f(l_i) = f_E = \frac{1}{|D|}\sum_{\tau_i\in D} f(\tau_i),\qquad \sum_{i} p(l_i) = 1$

where p is the probability, $l_i$ represents the i-th trajectory in the probabilistic model, f represents the feature expectation, $f_E$ represents the expert feature expectation, and $\tau_i$ is the i-th element of the expert sample set;

$R = \lambda_0 f_0 + \lambda_1 f_1 + \cdots + \lambda_n f_n = \lambda^{T}f$

where $\lambda_i\ (i=0\sim n)$ is the i-th parameter of the reward function R.
CN202111344484.8A 2021-11-15 2021-11-15 Robot autonomous learning method based on generative adversarial network Pending CN116151385A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111344484.8A CN116151385A (en) Robot autonomous learning method based on generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111344484.8A CN116151385A (en) Robot autonomous learning method based on generative adversarial network

Publications (1)

Publication Number Publication Date
CN116151385A 2023-05-23

Family

ID=86354821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111344484.8A Pending CN116151385A (en) 2021-11-15 2021-11-15 Robot autonomous learning method based on generation of countermeasure network

Country Status (1)

Country Link
CN (1) CN116151385A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117250970A (en) * 2023-11-13 2023-12-19 青岛澎湃海洋探索技术有限公司 Method for realizing AUV fault detection based on model embedding generation countermeasure network
CN117250970B (en) * 2023-11-13 2024-02-02 青岛澎湃海洋探索技术有限公司 Method for realizing AUV fault detection based on model embedding generation countermeasure network

Similar Documents

Publication Publication Date Title
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
CN113511082B (en) Hybrid electric vehicle energy management method based on rule and double-depth Q network
Feng Controller synthesis of fuzzy dynamic systems based on piecewise Lyapunov functions
CN110488861A (en) Unmanned plane track optimizing method, device and unmanned plane based on deeply study
CN110481536B (en) Control method and device applied to hybrid electric vehicle
CN103679139B (en) Face identification method based on particle swarm optimization BP network
CN111191769B (en) Self-adaptive neural network training and reasoning device
CN110594317B (en) Starting control strategy based on double-clutch type automatic transmission
WO2022252457A1 (en) Autonomous driving control method, apparatus and device, and readable storage medium
CN116151385A (en) Robot autonomous learning method based on generative adversarial network
CN113722980A (en) Ocean wave height prediction method, system, computer equipment, storage medium and terminal
Schuman et al. Low size, weight, and power neuromorphic computing to improve combustion engine efficiency
CN112487933B (en) Radar waveform identification method and system based on automatic deep learning
CN117709712A (en) Situation prediction method and terminal for power distribution network based on hybrid neural network
CN108388115A (en) NCS method for compensating network delay based on generalized predictive control
Puccetti et al. Speed tracking control using model-based reinforcement learning in a real vehicle
Gladwin et al. A controlled migration genetic algorithm operator for hardware-in-the-loop experimentation
Zheng et al. Variance reduction based partial trajectory reuse to accelerate policy gradient optimization
Riid et al. Interpretability of fuzzy systems and its application to process control
Lee et al. A real-time intelligent speed optimization planner using reinforcement learning
CN116755046B (en) Multifunctional radar interference decision-making method based on imperfect expert strategy
CN113065693B (en) Traffic flow prediction method based on radial basis function neural network
CN116176606A (en) Method and device for reinforcement learning of intelligent agent for controlling vehicle driving
Guo et al. Knowledge Transfer in Multi-Agent Reinforcement Learning Using Decision Trees
CN118278495A (en) Method for generating power grid operation mode based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination