CN116858241A - Application method of mobile robot in reinforcement learning of matching network - Google Patents

Application method of mobile robot in reinforcement learning of matching network

Info

Publication number
CN116858241A
Authority
CN
China
Prior art keywords
robot
matching network
samples
ppo
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310795952.6A
Other languages
Chinese (zh)
Inventor
张祺琛
倪彬
滕伟
潘志刚
彭志颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Air Force Logistics University Of Pla
Original Assignee
Air Force Logistics University Of Pla
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Air Force Logistics University Of Pla filed Critical Air Force Logistics University Of Pla
Priority to CN202310795952.6A priority Critical patent/CN116858241A/en
Publication of CN116858241A publication Critical patent/CN116858241A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34 Route searching; Route guidance
    • G01C21/3446 Details of route searching algorithms, e.g. Dijkstra, A*, arc-flags, using precalculated routes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses an application method of a mobile robot in reinforcement learning of a matching network. The model mainly comprises a proximal policy optimization (PPO) algorithm and a matching network: the PPO is used to realize navigation of the robot in an unknown environment, and the matching network is used to provide a reward value for each action of the robot. Only a small number of successful samples in the current environment need to be provided before DRL training starts in order to obtain an accurate reward function; the model obtains the reward value by comparing the difference between the path samples of the current robot and the provided positive and negative samples.

Description

Application method of mobile robot in reinforcement learning of matching network
Technical Field
The invention discloses an application method of a mobile robot in reinforcement learning of a matching network, and belongs to the technical field of mobile robot navigation.
Background
A mobile robot can move to a target point through an autonomous navigation system; a traditional navigation pipeline comprises localization and mapping, path planning, and motion control. A high-accuracy map must be provided when using conventional navigation methods, so these methods may be limited when the robot is in an unknown environment.
Deep reinforcement learning (DRL) learns a mapping from states to actions through continuous interaction with the environment and is increasingly used in a variety of fields. In recent years, various DRL techniques have been applied to navigation tasks: a two-stream Q-network has been used in dynamic environments to realize navigation and obstacle avoidance and is suited to training with multiple auxiliary subtasks, and DDPG has been used in map-free environments to realize continuous action control and target-driven navigation. These works show that a robot can accomplish navigation tasks through DRL. However, because the reward functions are simply designed, the learning speed of DRL is slow; moreover, when the robot is in a complex environment, a traditional reward function makes it difficult for the robot to learn an appropriate policy.
To address the difficulty of designing a reward function, inverse reinforcement learning has been proposed. Inverse reinforcement learning uses expert demonstrations to let DRL learn policies that solve complex problems, but expert demonstrations often take a significant amount of time to acquire. Some research has been done on reducing the number of samples required for learning from demonstration, but the required sample size remains large. Moreover, relying on expert demonstrations deprives DRL of its autonomous trial-and-error process and undermines the core purpose of reinforcement learning. Another alternative to a hand-crafted reward function is to select an appropriate classifier and provide a certain amount of data to train it; the classifier outputs the success probability of the current state, which serves as the reinforcement learning reward. However, to ensure that the classifier provides an accurate success rate in any situation, a large amount of data must be collected so that it can be adequately trained. In addition, overfitting can occur when the classifier is used in a navigation task, so the reward it generates behaves much like a discrete reward and fails to accelerate reinforcement learning training.
In current research, meta-learning can learn the characteristics of a class from a small number of samples, and the class of the current sample is obtained by comparing it with a support set. Google proposed matching networks in 2016 to achieve few-shot learning. For this purpose, the matching network is applied to reward-function generation for the navigation task: before DRL training starts, it is trained on a navigation data set prepared in advance, and a small number of successfully completed samples under the current map are provided to it. The matching network judges the success rate of the robot completing the task in the current state by comparing the similarity between the current path and the successful paths.
How a mobile robot can navigate in an unknown environment has long been an urgent problem and challenge. Deep reinforcement learning enables a robot to learn navigation rules in an unknown environment. Reward-function design is an important link in reinforcement learning training, and the quality of the reward function directly influences the training speed and generalization ability of the reinforcement learning model. In this context we provide a general reward-function model that, in use, only requires a small number of positive and negative samples to produce accurate reward values; the reward-function model can still be used in a completely new map environment. We combine the reward-function model with an efficient reinforcement learning model. The model is evaluated in a simulation environment, and experimental results show that it can improve training speed by more than 50% and can realize end-to-end mobile robot navigation in complex unknown map environments.
Disclosure of Invention
The invention aims to provide an application method of a mobile robot in reinforcement learning of a matching network, so as to solve the problems in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solution: an application method of a mobile robot in reinforcement learning of a matching network, wherein the model mainly comprises a proximal policy optimization (PPO) algorithm and a matching network; the PPO is used to realize navigation of the robot in an unknown environment, and the matching network is used to provide a reward value for each action of the robot. The steps of the robot's autonomous navigation method are as follows:
Step one: by introducing a small number of examples during training, the matching network can quickly learn the differences among categories;
Step two: the robot obtains a state sequence (s_1, s_2, ..., s_t), where each state contains the radar observation data (L_1, L_2, ..., L_t) and the relative position coordinates (c_1, c_2, ..., c_t) between the robot and the target point;
Step three: the matching network randomly selects a group of samples from the existing database, the samples comprising equal numbers of successful and failed samples;
Step four: as the robot continuously interacts with the environment, the matching network compares the state sequence obtained by the current robot with the provided sample sequences, and the similarity between the two sequences is taken as the success rate of the current robot in completing the task;
Step five: the success rate of the current state is fed back as a reward value to the reinforcement learning proximal policy optimization algorithm;
Step six: the reward value generated by the matching network together with a common Euclidean-distance-based reward value is used as the reward obtained after the current robot performs an action, and the PPO updates its parameters according to the obtained reward and the current state of the robot.
As a preferred scheme of the invention, the PPO proposes a new objective function so that updates can be performed in small batches over multiple training steps. The parameters of the PPO are updated during training as
θ_{t+1} = θ_t + α ∇_θ L(θ_t)
where θ_t denotes the value of the PPO network parameters at the current time t and α denotes the learning rate, and L(θ) is the objective function of the PPO update, expressed as
L(θ) = E[min(r_i(θ)A_i, clip(r_i(θ), 1-ε, 1+ε)A_i)]
where r_i(θ) = π_θ(a_i|s_i) / π_{θ_old}(a_i|s_i) is the probability ratio between the policies before and after the update, π_θ(a_i|s_i) denotes the probability that policy π generates action a_i in state s_i under parameters θ, and A_i denotes the advantage function. The clip function is a truncation function that limits the value of r_i(θ) to between 1-ε and 1+ε, which avoids abrupt changes in the policy and keeps training stable.
As a preferred embodiment of the present invention, the matching network is able to learn the classification relationship among the k samples of a support set S = {(x_i, y_i)}_{i=1}^k, i.e. the mapping between samples and labels. Given a test sample x̂, the matching network defines a probability distribution over the test label given the support set, P(ŷ | x̂, S), where the mapping from S to this distribution is implemented by neural networks. The most simplified model of the matching network is
ŷ = Σ_{i=1}^{k} β(x̂, x_i) y_i
where x_i, y_i denote the samples and labels of the support set S and x̂, ŷ denote a sample and label of the test set; β denotes the attention mechanism, whose expression is
β(x̂, x_i) = exp(c(g(x̂), f(x_i))) / Σ_{j=1}^{k} exp(c(g(x̂), f(x_j)))
where c denotes the cosine distance between two vectors and f, g denote the encodings of the support set and the test set, respectively.
As a preferred embodiment of the present invention, the matching network takes a set of state sequences as input and predicts the success rate of the current state: the current sequence (s_1, s_2, ..., s_t) is encoded by the encoder g_θ into an input vector G, while the matching network takes the provided samples as positive samples, randomly extracts negative samples from the database, passes these samples through the encoder f_θ to obtain vectors F, and computes the similarity between G and F by cosine similarity.
As a preferable scheme of the invention, the radar produces a one-dimensional array of data at each step, so a one-dimensional convolution is used to compress it; the relative position between the robot and the target point is also a one-dimensional array and is likewise compressed with a one-dimensional convolution. The compressed data are concatenated and passed through a fully connected layer to obtain the encoded vector.
As a preferable scheme of the invention, the PPO takes the current state s_t of the robot as input and outputs the linear velocity and angular velocity of the robot; in the experiments the ranges of the robot's angular velocity and linear velocity are set, and the network is composed of fully connected layers.
Compared with the prior art, the invention has the following beneficial effects. The application method of the mobile robot in reinforcement learning of the matching network improves the training speed of DRL in the navigation task; only a small number of successful samples in the current environment need to be provided before DRL training starts in order to obtain an accurate reward function. The model obtains the reward value by comparing the difference between the path samples of the current robot and the provided positive and negative samples, and the resulting reward value, together with the environmental reward, guides the robot in learning navigation skills. Experiments show that the MNR+ER-based reward-function model can effectively improve the training speed of DRL and accurately learn navigation skills, and the model can provide accurate reward values to accelerate DRL training with few or no expert samples.
Drawings
FIG. 1 is a flow chart of a model of the present invention;
FIG. 2 is a matching network rewards flowchart of the invention;
FIG. 3 is a diagram of an encoder network architecture of the present invention;
FIG. 4 is a diagram of the PPO network structure of the present invention;
FIG. 5 is a graph of error and accuracy curves of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to Figs. 1-5, the invention provides an application method of a mobile robot in reinforcement learning of a matching network. The model mainly comprises a proximal policy optimization (PPO) algorithm and a matching network; the PPO is used to realize navigation of the robot in an unknown environment, and the matching network is used to provide a reward value for each action of the robot. The steps of the robot's autonomous navigation method are as follows:
Step one: by introducing a small number of examples during training, the matching network can quickly learn the differences among categories;
Step two: the robot obtains a state sequence (s_1, s_2, ..., s_t), where each state contains the radar observation data (L_1, L_2, ..., L_t) and the relative position coordinates (c_1, c_2, ..., c_t) between the robot and the target point;
Step three: the matching network randomly selects a group of samples from the existing database, the samples comprising equal numbers of successful and failed samples;
Step four: as the robot continuously interacts with the environment, the matching network compares the state sequence obtained by the current robot with the provided sample sequences, and the similarity between the two sequences is taken as the success rate of the current robot in completing the task;
Step five: the success rate of the current state is fed back as a reward value to the reinforcement learning proximal policy optimization algorithm;
Step six: the reward value generated by the matching network together with a common Euclidean-distance-based reward value is used as the reward obtained after the current robot performs an action, and the PPO updates its parameters according to the obtained reward and the current state of the robot, as sketched below.
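To make step six concrete, the following is a minimal sketch of how the matching-network reward and the Euclidean-distance reward could be combined before being fed back to PPO. The weighting coefficients, the exact distance-based shaping form, and the names matching_net, success_rate, w_mn and w_env are assumptions introduced here for illustration; they are not specified by the invention.

```python
import numpy as np

def euclidean_reward(prev_pos, pos, goal):
    """Assumed distance-based shaping term: positive when the robot moves closer to the goal."""
    return float(np.linalg.norm(prev_pos - goal) - np.linalg.norm(pos - goal))

def combined_reward(state_sequence, prev_pos, pos, goal, matching_net, w_mn=1.0, w_env=1.0):
    """Reward fed back to PPO after an action: the matching-network success rate for the
    current state sequence plus the common Euclidean-distance reward (weights are assumed)."""
    r_mn = matching_net.success_rate(state_sequence)   # assumed interface, value in [0, 1]
    r_env = euclidean_reward(np.asarray(prev_pos), np.asarray(pos), np.asarray(goal))
    return w_mn * r_mn + w_env * r_env
```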
The PPO proposes a new objective function so that updates can be performed in small batches over multiple training steps. The parameters of the PPO are updated during training as
θ_{t+1} = θ_t + α ∇_θ L(θ_t)
where θ_t denotes the value of the PPO network parameters at the current time t and α denotes the learning rate, and L(θ) is the objective function of the PPO update, expressed as
L(θ) = E[min(r_i(θ)A_i, clip(r_i(θ), 1-ε, 1+ε)A_i)]
where r_i(θ) = π_θ(a_i|s_i) / π_{θ_old}(a_i|s_i) is the probability ratio between the policies before and after the update, π_θ(a_i|s_i) denotes the probability that policy π generates action a_i in state s_i under parameters θ, and A_i denotes the advantage function. The clip function is a truncation function that limits the value of r_i(θ) to between 1-ε and 1+ε, which avoids abrupt changes in the policy and keeps training stable.
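For reference, a minimal PyTorch-style sketch of the clipped objective L(θ) is given below. It illustrates only the standard clipped surrogate; the log-probabilities and advantages are assumed to be computed elsewhere, and the default ε = 0.2 is an assumption, not a value taken from the patent.

```python
import torch

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """L(θ) = E[min(r_i(θ) A_i, clip(r_i(θ), 1-ε, 1+ε) A_i)],
    with r_i(θ) = π_θ(a_i|s_i) / π_θold(a_i|s_i)."""
    ratio = torch.exp(log_prob_new - log_prob_old)        # r_i(θ)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)    # clip(r_i(θ), 1-ε, 1+ε)
    return torch.min(ratio * advantages, clipped * advantages).mean()

# The parameters θ are then moved along the gradient of L(θ) with learning rate α,
# e.g. by minimizing -L(θ) with a standard optimizer.
```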
The matching network is able to learn the classification relationship among the k samples of a support set S = {(x_i, y_i)}_{i=1}^k, i.e. the mapping between samples and labels. Given a test sample x̂, the matching network defines a probability distribution over the test label given the support set, P(ŷ | x̂, S), where the mapping from S to this distribution is implemented by neural networks. The most simplified model of the matching network is
ŷ = Σ_{i=1}^{k} β(x̂, x_i) y_i
where x_i, y_i denote the samples and labels of the support set S and x̂, ŷ denote a sample and label of the test set; β denotes the attention mechanism, whose expression is
β(x̂, x_i) = exp(c(g(x̂), f(x_i))) / Σ_{j=1}^{k} exp(c(g(x̂), f(x_j)))
where c denotes the cosine distance between two vectors and f, g denote the encodings of the support set and the test set, respectively. The encodings are mainly realized by a convolutional neural network (CNN) and a long short-term memory (LSTM) network; in the experiments we set f = g. The training process of the matching network is described by the following pseudocode.
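The pseudocode referred to above appears as a figure in the original filing and is not reproduced here; in its place, the sketch below shows the forward computation it relies on, i.e. the attention-weighted prediction ŷ = Σ_i β(x̂, x_i) y_i with a cosine-similarity softmax, under the stated assumption f = g (a single encoder). The encoder interface and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def matching_net_predict(encoder, query_seq, support_seqs, support_labels):
    """Success probability of the current (query) state sequence given a support set of
    success/failure paths. `encoder` plays the role of both f and g (f = g)."""
    q = encoder(query_seq.unsqueeze(0))        # G: encoding of the current sequence, shape (1, d)
    s = encoder(support_seqs)                  # F: encodings of the support samples, shape (k, d)
    sims = F.cosine_similarity(q, s, dim=1)    # c(G, F_i) for each support sample
    beta = torch.softmax(sims, dim=0)          # attention weights β(x̂, x_i)
    y = F.one_hot(support_labels, num_classes=2).float()   # labels: 0 = failure, 1 = success
    probs = beta @ y                           # distribution over {failure, success}
    return probs[1]                            # success probability, used as the reward signal
```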
Further, the matching network takes a set of state sequences as input and predicts the success rate of the current state: the current sequence (s_1, s_2, ..., s_t) is encoded by the encoder g_θ into an input vector G, while the matching network takes the provided samples as positive samples, randomly extracts negative samples from the database, passes these samples through the encoder f_θ to obtain vectors F, and computes the similarity between G and F by cosine similarity.
The flow chart is shown in Fig. 2, and the structure of the encoders g_θ and f_θ is shown in Fig. 3. The radar produces a one-dimensional array of data at each step, so we use a one-dimensional convolution to compress it; the relative position between the robot and the target point is also a one-dimensional array and is likewise compressed with a one-dimensional convolution. The compressed data are concatenated and passed through a fully connected layer to obtain the encoded vector.
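A minimal sketch of an encoder with this structure is shown below. Since Fig. 3 is not reproduced, the channel counts, kernel sizes and array lengths (e.g. a 360-beam radar scan and a 2-dimensional relative position) are assumptions; only the overall layout (one-dimensional convolutions, concatenation, then a fully connected layer) follows the description.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Compresses the 1-D radar array and the 1-D relative-position array with 1-D convolutions,
    concatenates the results and maps them through a fully connected layer. Sizes are assumed."""
    def __init__(self, radar_dim=360, pos_dim=2, embed_dim=64):
        super().__init__()
        self.radar_conv = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.pos_conv = nn.Sequential(nn.Conv1d(1, 4, kernel_size=1), nn.ReLU(), nn.Flatten())
        with torch.no_grad():  # infer the flattened feature sizes for the fully connected layer
            r = self.radar_conv(torch.zeros(1, 1, radar_dim)).shape[1]
            p = self.pos_conv(torch.zeros(1, 1, pos_dim)).shape[1]
        self.fc = nn.Linear(r + p, embed_dim)

    def forward(self, radar, rel_pos):
        # radar: (batch, radar_dim), rel_pos: (batch, pos_dim)
        r = self.radar_conv(radar.unsqueeze(1))
        p = self.pos_conv(rel_pos.unsqueeze(1))
        return self.fc(torch.cat([r, p], dim=1))   # encoded state vector
```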
As shown in Fig. 4, the PPO takes the current state s_t of the robot as input and outputs the robot's linear and angular velocities; in the experiments we set the ranges of the angular velocity and linear velocity, and the network is composed of fully connected layers.
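A minimal sketch of such a fully connected policy network follows. The hidden width and the concrete velocity bounds (linear velocity in [0, v_max], angular velocity in [-w_max, w_max]) are assumptions; the patent states only that these ranges are set in the experiments and that the network consists of fully connected layers.

```python
import torch
import torch.nn as nn

class PPOActor(nn.Module):
    """Fully connected policy: input the current state s_t, output linear and angular velocity.
    Hidden width and velocity bounds below are illustrative assumptions."""
    def __init__(self, state_dim, hidden=128, v_max=0.5, w_max=1.0):
        super().__init__()
        self.v_max, self.w_max = v_max, w_max
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, s_t):
        v_raw, w_raw = self.net(s_t).unbind(dim=-1)
        v = torch.sigmoid(v_raw) * self.v_max   # linear velocity in [0, v_max]
        w = torch.tanh(w_raw) * self.w_max      # angular velocity in [-w_max, w_max]
        return v, w
```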
To collect appropriate data for training the matching-network reward, we collected state sequences under a variety of different maps; 5 maps were designed for data collection. Each map uses randomly generated start points and target points during data collection. To ensure that the distribution of positive and negative samples in the collected data is balanced, a PPO algorithm based on a traditional reward function is used to navigate the agent. The number of positive and negative samples collected under each map is shown in Table 1.
Table 1. Distribution of the data set under each map
Before reinforcement learning training begins, we need to pretrain the matching network. The matching network adopts a 2-way 20-shot training mode, i.e., the samples in each training episode are drawn from two categories (success and failure), with 20 samples selected from each category. Cosine similarity is used as the matching measure. The learning rate was set to 1e-4 and the batch size to 20. We use 60% of the paths in the sample set as the training set and the rest as the validation set. The error and accuracy during training of the matching network are shown in Fig. 5.
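For concreteness, one pretraining step under the 2-way 20-shot setting described above could look like the sketch below (cosine similarity via matching_net_predict from the earlier sketch, learning rate 1e-4). The sample_episode helper and the query-batch handling are assumptions, not the actual training code.

```python
import torch

def pretrain_step(encoder, optimizer, sample_episode):
    """One 2-way 20-shot episode: 20 successful and 20 failed support paths plus a batch of
    query paths drawn from the collected data set; `sample_episode()` is an assumed helper."""
    support_x, support_y, query_x, query_y = sample_episode()
    losses = []
    for x, y in zip(query_x, query_y):                  # e.g. a batch of 20 query paths
        p_success = matching_net_predict(encoder, x, support_x, support_y)
        p = torch.stack([1.0 - p_success, p_success]).clamp(1e-6, 1.0 - 1e-6)
        losses.append(-torch.log(p[y]))                 # episodic cross-entropy loss
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)   # learning rate from the text
```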
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. An application method of a mobile robot in reinforcement learning of a matching network, characterized in that the model mainly comprises a proximal policy optimization (PPO) algorithm and a matching network, wherein the PPO is used to realize navigation of the robot in an unknown environment and the matching network is used to provide a reward value for each action of the robot, the method of the robot's autonomous navigation system comprising the following steps:
Step one: by introducing a small number of examples during training, the matching network can quickly learn the differences among categories;
Step two: the robot obtains a state sequence (s_1, s_2, ..., s_t), where each state contains the radar observation data (L_1, L_2, ..., L_t) and the relative position coordinates (c_1, c_2, ..., c_t) between the robot and the target point;
Step three: the matching network randomly selects a group of samples from the existing database, the samples comprising equal numbers of successful and failed samples;
Step four: as the robot continuously interacts with the environment, the matching network compares the state sequence obtained by the current robot with the provided sample sequences, and the similarity between the two sequences is taken as the success rate of the current robot in completing the task;
Step five: the success rate of the current state is fed back as a reward value to the reinforcement learning proximal policy optimization algorithm;
Step six: the reward value generated by the matching network together with a common Euclidean-distance-based reward value is used as the reward obtained after the current robot performs an action, and the PPO updates its parameters according to the obtained reward and the current state of the robot.
2. The application method of a mobile robot in reinforcement learning of a matching network according to claim 1, characterized in that: the PPO proposes a new objective function so that updates can be performed in small batches over multiple training steps, and the parameters of the PPO are updated during training as
θ_{t+1} = θ_t + α ∇_θ L(θ_t)
where θ_t denotes the value of the PPO network parameters at the current time t and α denotes the learning rate; L(θ) is the objective function of the PPO update, expressed as
L(θ) = E[min(r_i(θ)A_i, clip(r_i(θ), 1-ε, 1+ε)A_i)]
where r_i(θ) = π_θ(a_i|s_i) / π_{θ_old}(a_i|s_i) is the probability ratio between the policies before and after the update, π_θ(a_i|s_i) denotes the probability that policy π generates action a_i in state s_i under parameters θ, and A_i denotes the advantage function; the clip function is a truncation function that limits the value of r_i(θ) to between 1-ε and 1+ε, which avoids abrupt changes in the policy and keeps training stable.
3. The application method of a mobile robot in reinforcement learning of a matching network according to claim 1, characterized in that: the matching network is able to learn the classification relationship among the k samples of a support set S = {(x_i, y_i)}_{i=1}^k, i.e. the mapping between samples and labels; given a test sample x̂, the matching network defines a probability distribution over the test label given the support set, P(ŷ | x̂, S), where the mapping from S to this distribution is implemented by neural networks, and the most simplified model of the matching network is
ŷ = Σ_{i=1}^{k} β(x̂, x_i) y_i
where x_i, y_i denote the samples and labels of the support set S and x̂, ŷ denote a sample and label of the test set; β denotes the attention mechanism, whose expression is
β(x̂, x_i) = exp(c(g(x̂), f(x_i))) / Σ_{j=1}^{k} exp(c(g(x̂), f(x_j)))
where c denotes the cosine distance between two vectors and f, g denote the encodings of the support set and the test set, respectively.
4. The application method of a mobile robot in reinforcement learning of a matching network according to claim 1, characterized in that: the matching network takes a set of state sequences as input and predicts the success rate of the current state; the current sequence (s_1, s_2, ..., s_t) is encoded by the encoder g_θ into an input vector G, while the matching network takes the provided samples as positive samples, randomly extracts negative samples from the database, passes these samples through the encoder f_θ to obtain vectors F, and computes the similarity between G and F by cosine similarity.
5. The application method of a mobile robot in reinforcement learning of a matching network according to claim 4, characterized in that: the radar produces a one-dimensional array of data at each step, so a one-dimensional convolution is used to compress it; the relative position between the robot and the target point is also a one-dimensional array and is likewise compressed with a one-dimensional convolution, and the compressed data are concatenated and passed through a fully connected layer to obtain the encoded vector.
6. The application method of a mobile robot in reinforcement learning of a matching network according to claim 1, characterized in that: the PPO takes the current state s_t of the robot as input and outputs the linear velocity and angular velocity of the robot; in the experiments the ranges of the robot's angular velocity and linear velocity are set, and the network is composed of fully connected layers.
CN202310795952.6A 2023-07-01 2023-07-01 Application method of mobile robot in reinforcement learning of matching network Pending CN116858241A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310795952.6A CN116858241A (en) 2023-07-01 2023-07-01 Application method of mobile robot in reinforcement learning of matching network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310795952.6A CN116858241A (en) 2023-07-01 2023-07-01 Application method of mobile robot in reinforcement learning of matching network

Publications (1)

Publication Number Publication Date
CN116858241A (en) 2023-10-10

Family

ID=88220949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310795952.6A Pending CN116858241A (en) 2023-07-01 2023-07-01 Application method of mobile robot in reinforcement learning of matching network

Country Status (1)

Country Link
CN (1) CN116858241A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117873089A (en) * 2024-01-10 2024-04-12 南京理工大学 Multi-mobile robot cooperation path planning method based on clustering PPO algorithm
CN117933346A (en) * 2024-03-25 2024-04-26 之江实验室 Instant rewarding learning method based on self-supervision reinforcement learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination