CN111950726A - Decision method based on multi-task learning, decision model training method and device - Google Patents
- Publication number
- CN111950726A (application number CN202010660005.2A)
- Authority
- CN
- China
- Prior art keywords
- task
- target
- target task
- action
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/16—Anti-collision systems
Abstract
The application discloses a decision method based on multi-task learning, a decision model training method and devices thereof in the field of artificial intelligence. The decision model training method includes the following steps: randomly acquiring a plurality of sample data from a first sample database, where the first sample database includes sample data of a plurality of candidate tasks, the sample data of a target task includes a task vector corresponding to the target task, the task vector corresponding to the target task is obtained based on the common subtasks and the characteristic subtasks of the plurality of candidate tasks, and the target task is any one of the plurality of candidate tasks; adjusting a decision model M_t according to the plurality of sample data to obtain a decision model M_t+1; judging whether the decision model M_t+1 converges; and when the decision model M_t+1 converges, determining the decision model M_t+1 as the target decision model. By adopting the embodiments of the application, the decision effect and the convergence capability of the decision model are improved, and mutual interference among multiple tasks is avoided.
Description
Technical Field
The application relates to the field of artificial intelligence, in particular to a decision-making method based on multi-task learning, a decision-making model training method and a device.
Background
Reinforcement learning is an important branch of the field of artificial intelligence and has surpassed the ability of ordinary humans on certain tasks. However, for a reinforcement learning algorithm, the model obtained after one round of training can only be used for one specific task; if the model is to be applied to a new task, it must be retrained to obtain a new model. This means that although the training algorithm is general, each trained model can only be applied to a specific task scenario.
With the increasing industrial application of reinforcement learning algorithms, many application scenarios are no longer satisfied with a reinforcement learning model that handles a single task; instead, the model is required to achieve good results in a multi-task scenario. A multi-task setting means that the reinforcement learning algorithm has to learn multiple Markov models whose state transition probabilities are not unique, so the algorithm converges poorly or may not converge at all. Moreover, because the reward mechanisms of different tasks differ, a simple task can quickly dominate what the model learns, while other, sparsely rewarded tasks are not explored enough, leading to unbalanced learning and a poor overall model. In view of the above, there is a need for a reinforcement learning algorithm that can learn multiple tasks simultaneously.
One existing solution balances the limited resources of a single learning algorithm across tasks to satisfy multi-task learning, but this balancing makes many such learning algorithms less effective. For example, in the learning process the reward values of some tasks are large, so the algorithm focuses on the tasks with prominent reward values at the cost of generality, and the other tasks cannot achieve good results. Other algorithms unify the reward values of the tasks by reward reduction, which may change the optimization goal: if the reward values are all large non-negative values, reduction turns the objective into optimizing the frequency of obtaining rewards rather than the accumulated expected reward. Moreover, the balance of the algorithm among tasks depends on the magnitude of the reward values and the reward density, so reward reduction still leaves the algorithm unbalanced across different tasks.
Another solution is distillation-based learning: a student network is constructed under the supervision of expert networks, each of which has learned a specific task. This learning algorithm yields a compromise multi-task policy, and each expert network must be obtained in advance through large-scale training. Although this avoids the problem of unbalanced reward values, the algorithm still trades off among the multiple tasks, its learning effect is not ideal, and its performance is limited by the expert networks and cannot be further improved.
Disclosure of Invention
In the solutions of the embodiments of this application, the tasks are jointly characterized to obtain task vectors built from characteristic subtasks and common subtasks. During model training, mutual interference among the tasks can be avoided: the common subtasks promote policy learning across the tasks, while the characteristic subtasks support task-specific learning, improving the multi-task policy effect and the convergence speed of the model. During decision making, the same model can be used to make decisions for multiple tasks, and mutual interference among the tasks is avoided.
In a first aspect, an embodiment of the present application provides a training method for a decision model based on multi-task learning, including:
s1: acquiring a target task from a plurality of candidate tasks; and acquiring the state information s of the target task according to the target tasktAnd acquiring a task vector corresponding to the target task according to the target task, wherein the task vector corresponding to the target task is based onThe common subtask and the characteristic subtask of the multiple candidate tasks are obtained; s2: according to the state information s of the target tasktTask vector corresponding to target task and decision model MtGenerating sample year data of the target task, and adding sample data of the target task to a primary sample database to obtain a first sample database; s3: randomly acquiring a plurality of sample data from a first sample database; the multiple sample data are sample data of part or all of the multiple candidate tasks; s4: adjusting a decision model M according to a plurality of sample data by using a reinforcement learning methodtTo obtain a decision model Mt+1(ii) a S5: decision model Mt+1Whether to converge; when decision model Mt+1Upon convergence, the decision model M is determinedt+1Is a target decision model.
A common subtask is a subtask shared by the subtasks of the plurality of candidate tasks, and a characteristic subtask is a subtask unique to one candidate task among the subtasks of the plurality of candidate tasks. For example, in an intersection scene, the plurality of candidate tasks may include going straight through the intersection, turning left at the intersection, and turning right at the intersection. Going straight through the intersection includes two subtasks: collision or arrival when going straight, and improving traffic efficiency. Turning left at the intersection includes two subtasks: collision or arrival when turning left, and improving traffic efficiency. Turning right at the intersection includes two subtasks: collision or arrival when turning right, and improving traffic efficiency. Collision or arrival when going straight, collision or arrival when turning left, and collision or arrival when turning right are characteristic subtasks, while improving traffic efficiency is a common subtask.
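The joint characterization in this example can be illustrated with a minimal sketch. The following Python snippet is a hypothetical illustration (the subtask names, vector layout and 0/1 encoding are assumptions, not taken from the patent) of how a task vector for the intersection scene could be assembled from one characteristic subtask per candidate task plus a shared common subtask:

```python
# Hypothetical sketch: joint characterization of the three intersection tasks.
# Vector layout (an assumption): one slot per characteristic subtask, followed
# by the common subtask(s) shared by all candidate tasks.
CHARACTERISTIC_SUBTASKS = ["straight_collision_or_arrival",
                           "left_turn_collision_or_arrival",
                           "right_turn_collision_or_arrival"]
COMMON_SUBTASKS = ["improve_traffic_efficiency"]
SUBTASKS = CHARACTERISTIC_SUBTASKS + COMMON_SUBTASKS

def task_vector(target_task: str) -> list[float]:
    """Return the task vector g for one candidate task.

    The slot of the target task's own characteristic subtask and the slots of
    the common subtasks are set to 1; the characteristic subtasks of the other
    candidate tasks are set to 0, so they do not influence this task.
    """
    active = {f"{target_task}_collision_or_arrival"} | set(COMMON_SUBTASKS)
    return [1.0 if s in active else 0.0 for s in SUBTASKS]

print(task_vector("left_turn"))   # [0.0, 1.0, 0.0, 1.0]
```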
Alternatively, the reinforcement learning method may be a reinforcement learning method based on a value function.
The task vector of any one of the plurality of candidate tasks is obtained based on the characteristic subtasks and the common subtasks of the candidate tasks, so that the strategies of the plurality of candidate tasks can be learned by one model. The common subtasks of the plurality of candidate tasks promote learning of the strategies of the candidate tasks and improve the convergence capability of the model, while the characteristic subtasks are used for targeted learning of the individual candidate tasks, avoiding mutual interference among the multiple tasks and avoiding compromise of the model among the multiple tasks, so that the same model can achieve a good effect when making decisions for multiple tasks.
In a possible embodiment, obtaining a task vector corresponding to a target task according to the target task includes:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task; extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
In one possible embodiment, generating the sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task and the decision model M_t includes:

inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain a candidate action of the target task; selecting the target action of the target task from the candidate action of the target task and an action randomly acquired from the action space according to a preset probability, where the probability that the candidate action of the target task is selected is the preset probability; acquiring state information s_t+1 of the target task after the target action is executed, and acquiring a reward value vector of the target task according to the state information s_t+1 of the target task, where the reward values in the reward value vector correspond one-to-one to the subtasks in the task vector corresponding to the target task; the sample data of the target task includes the task vector corresponding to the target task, the state information s_t of the target task, the target action of the target task, the state information s_t+1 of the target task and the reward value vector of the target task.
By constructing the reward value vector, the execution condition of each subtask in the target task is fed back to the decision model for learning, and the precision of the decision model is improved.
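As a concrete illustration of the reward value vector, the sketch below (a sketch only; the numeric reward values and helper signature are assumptions, since the patent does not specify them) builds one reward element per subtask slot of the intersection task vector introduced above:

```python
# Hypothetical sketch of building the reward value vector r after executing the
# target action. Each element corresponds one-to-one to a subtask slot of the
# task vector; all numeric values and inputs here are illustrative assumptions.
def reward_vector(collided: bool, arrived: bool, delta_progress: float,
                  target_task: str) -> list[float]:
    # Reward for the target task's own "collision or arrival" characteristic subtask.
    characteristic_reward = -1.0 if collided else (1.0 if arrived else 0.0)
    r = [0.0, 0.0, 0.0, 0.0]                        # same layout as the task vector
    idx = {"straight": 0, "left_turn": 1, "right_turn": 2}[target_task]
    r[idx] = characteristic_reward
    r[3] = delta_progress                           # common "traffic efficiency" subtask
    return r
```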
In one possible embodiment, inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain the candidate action of the target task includes:

the decision model M_t acquires an action value function vector of the target task according to the state information s_t of the target task and the task vector corresponding to the target task, where the action value functions in the action value function vector of the target task correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; the decision model M_t acquires a value function of the target task according to the action value function vector of the target task and the task vector corresponding to the target task; and the decision model M_t acquires the candidate action of the target task from the action space according to the value function of the target task, where the candidate action of the target task is the action in the action space that maximizes the value function of the target task.
The value function of the target task is obtained according to the task vector and the action value function vector corresponding to the target task, and then the target action is determined based on the value function, so that the influence of the action value function of the subtask irrelevant to the target task in the action value function vector on the selection of the target action is avoided when the target task is decided.
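The following sketch makes this step concrete (the array shapes and names are assumptions for illustration): the model outputs one action value function per subtask for each action, the value function of the target task is the inner product with the task vector g, and the candidate action is the argmax over the action space.

```python
import numpy as np

# Hypothetical sketch of selecting the candidate action. q_values has shape
# (num_actions, num_subtasks): one action value function per subtask, as output
# by the decision model for state s_t; g is the task vector of the target task.
def candidate_action(q_values: np.ndarray, g: np.ndarray) -> int:
    # Value function of the target task for each action: dot product with g, so
    # subtasks irrelevant to this task (zeros in g) do not affect the choice.
    task_values = q_values @ g            # shape (num_actions,)
    return int(np.argmax(task_values))    # action maximizing the task's value function
```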
In one possible embodiment, selecting the target action of the target task from the candidate actions of the target task and the actions randomly obtained from the action space according to the preset probability includes:
when the first parameter is greater than the preset probability, determining the candidate action of the target task as the target action of the target task, where the first parameter is a random number with a value range of [0, 1]; and when the first parameter is not greater than the preset probability, determining the action randomly acquired from the action space as the target action of the target task.
By determining the action randomly acquired from the action space as the target action of the target task, the method realizes the exploration of new action when the decision model is trained, thereby obtaining the action with better effect when the decision model is used.
It should be noted that the initial value of the preset probability is 1 or a relatively large value close to 1; the preset probability is gradually reduced as the number of training iterations increases.
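A minimal sketch of this selection rule follows (the exact decay schedule and constants are assumptions; only the comparison against the preset probability mirrors the description above):

```python
import random

def select_target_action(candidate: int, num_actions: int,
                         preset_probability: float) -> int:
    """Select the target action as described above.

    A first parameter is drawn uniformly from [0, 1]; when it is greater than
    the preset probability the candidate action is kept, otherwise an action is
    drawn at random from the action space for exploration.
    """
    first_parameter = random.random()
    if first_parameter > preset_probability:
        return candidate
    return random.randrange(num_actions)

# Assumed decay schedule: start at (or near) 1 and reduce as training proceeds.
def decay(preset_probability: float, factor: float = 0.995,
          minimum: float = 0.05) -> float:
    return max(minimum, preset_probability * factor)
```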
In one possible embodiment, adjusting the decision model M_t according to the plurality of sample data to obtain the decision model M_t+1 includes:

calculating a loss value according to a loss function and the state information s_t, the task vector, the target action, the state information s_t+1 and the reward value vector of each of the plurality of sample data; and adjusting the decision model M_t according to the loss value to obtain the decision model M_t+1.
Here, the state information s_t is obtained before the target action is executed, and the state information s_t+1 is obtained after the target action is executed.
The loss value can be expressed in terms of the following quantities: r is the reward value vector in the sample data, g is the task vector in the sample data, a_t is the target action in the sample data, and the discount coefficient γ is a constant.
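One plausible form of this loss, sketched from the quantities listed above and the task-weighted value function defined earlier, is a temporal-difference error; the exact expression, the use of a target network and the network interface are assumptions, not the patent's formula:

```python
import numpy as np

# Sketch of a plausible per-sample loss (an assumption based on the surrounding
# definitions): a temporal-difference error on the task-weighted value g·Q(s, a).
# q_net and target_q_net are hypothetical callables returning an array of shape
# (num_actions, num_subtasks) for a given state and task vector.
def td_loss(q_net, target_q_net, sample, gamma: float = 0.99) -> float:
    s_t, g, a_t, r, s_t1 = (sample["s_t"], sample["g"], sample["a_t"],
                            sample["r"], sample["s_t1"])
    q_t = q_net(s_t, g)                     # action value function vectors at s_t
    q_t1 = target_q_net(s_t1, g)            # bootstrapped estimate at s_t+1
    target = np.dot(r, g) + gamma * np.max(q_t1 @ g)
    prediction = np.dot(q_t[a_t], g)
    return float((target - prediction) ** 2)
```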
In one possible embodiment, judging whether the decision model M_t+1 converges includes:

judging whether the target task has ended according to the state information s_t+1 of the target task; when it is determined that the target task has not ended, setting t = t + 1 and repeatedly performing steps S2-S5 until the target task ends;

when it is determined that the target task has ended, judging whether the decision model M_t+1 converges; when it is determined that the decision model M_t+1 has not converged, setting t = t + 1 and repeatedly performing steps S1-S5 until the decision model M_t+1 converges.
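Putting steps S1-S5 together, a high-level training loop might look as follows. This is a sketch only: every collaborator passed in (environment, replay buffer, action selection, model update and convergence check) is a hypothetical placeholder, and only the control flow mirrors the steps above.

```python
# Hypothetical outer loop mirroring steps S1-S5; env, replay, select_action,
# update and has_converged are illustrative collaborators, not defined by the patent.
def train(env, model, replay, select_action, update, has_converged):
    t = 0
    while True:
        g = env.sample_task()                       # S1: pick a target task, get its task vector
        s_t, done = env.reset(), False
        while not done:                             # S2: generate and store sample data
            a_t = select_action(model, s_t, g, t)
            s_t1, r, done = env.step(a_t)           # r is the reward value vector
            replay.add((g, s_t, a_t, s_t1, r))
            batch = replay.sample()                 # S3: random sample data, possibly from several tasks
            model = update(model, batch)            # S4: adjust M_t to obtain M_t+1
            s_t, t = s_t1, t + 1
        if has_converged(model):                    # S5: stop once the model converges
            return model
```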
In a second aspect, an embodiment of the present application provides a decision method based on multitask learning, including:
acquiring a plurality of candidate tasks, and acquiring a target task from the plurality of candidate tasks; acquiring state information s_t of the target task according to the target task; performing task joint characterization on the target task according to the plurality of candidate tasks to obtain a task vector corresponding to the target task, where the task vector corresponding to each candidate task is obtained based on the characteristic subtasks and the common subtasks of the plurality of candidate tasks; and determining a target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
The plurality of candidate tasks may be tasks in the same scene or tasks in different scenes.
The target task is subjected to joint characterization according to the subtasks of the multiple candidate tasks, so that the multiple tasks can be decided by using the same model, and the mutual influence among the multiple tasks is avoided.
In a possible embodiment, performing task joint characterization on a target task according to a plurality of candidate tasks to obtain a task vector corresponding to the target task, includes:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task; extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
In one possible embodiment, determining the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task includes:

inputting the task vector corresponding to the target task and the state information s_t of the target task into a target decision model for processing to obtain the target action, where the target decision model is implemented based on a neural network.
Alternatively, the neural network may be a fully-connected neural network, a convolutional neural network, a recurrent neural network, or other neural network.
In one possible embodiment, inputting the task vector corresponding to the target task and the state information s_t of the target task into the target decision model for processing to obtain the target action includes:

acquiring an action value function vector of the target task according to the task vector corresponding to the target task and the state information s_t of the target task, where the action value functions in the action value function vector correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; acquiring a value function of the target task according to the task vector and the action value function vector corresponding to the target task; and acquiring the target action from the action space according to the value function of the target task, where the target action is the action in the action space that maximizes the value function of the target task.
The value function of the target task is obtained according to the task vector and the action value function vector corresponding to the target task, and then the target action is determined based on the value function, so that the influence of the action value function of a subtask irrelevant to the target task in the action value function vector on the selection of the target action is avoided when the target task is decided, and the decision effect of the target task is improved.
Alternatively, the value function may be a Q value function. The Q value function of the target task may be expressed as Q(s_t, a_k, g), and the target action may be expressed as a* = argmax_{a_k} Q(s_t, a_k, g),

where g is the task vector corresponding to the target task and a_k is an action in the action space.
In a third aspect, an embodiment of the present application provides a decision model training device based on multitask learning, including:
the acquisition unit is used for randomly acquiring a plurality of sample data from the first sample database; the first sample database comprises sample data of a plurality of candidate tasks, the sample data of the target task comprises a task vector corresponding to the target task, the task vector corresponding to the target task is obtained based on a common subtask and a characteristic subtask in the candidate tasks, and the target task is any one of the candidate tasks;
an adjusting unit for adjusting the decision model M according to a plurality of sample data by using a reinforcement learning methodtTo obtain a decision model Mt+1;
A determination unit for determining the model Mt+1Upon convergence, the decision model M is determinedt+1Is a target decision model.
In one possible embodiment, the acquisition unit is further configured to acquire a target task from the plurality of candidate tasks, acquire state information s_t of the target task according to the target task, and acquire the task vector corresponding to the target task according to the target task;

the training apparatus further includes:

an updating unit, configured to generate the sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task and the decision model M_t, and add the sample data of the target task to a preliminary sample database to obtain the first sample database;
in a possible embodiment, in terms of obtaining a task vector corresponding to a target task according to the target task, the obtaining unit is specifically configured to:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task; extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
In a possible embodiment, in terms of generating the sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task and the decision model M_t, the updating unit is specifically configured to:

input the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain a candidate action of the target task; select the target action of the target task from the candidate action of the target task and an action randomly acquired from the action space according to a preset probability, where the probability that the candidate action of the target task is selected is the preset probability; acquire state information s_t+1 of the target task after the target action is executed, and acquire a reward value vector of the target task according to the state information s_t+1 of the target task, where the reward values in the reward value vector correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; the sample data of the target task includes the task vector corresponding to the target task, the state information s_t of the target task, the target action of the target task, the state information s_t+1 of the target task and the reward value vector of the target task.

In one possible embodiment, in terms of inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain the candidate action of the target task, the updating unit is specifically configured to:

acquire an action value function vector of the target task according to the state information s_t of the target task and the task vector corresponding to the target task, where the action value functions in the action value function vector of the target task correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; acquire a value function of the target task according to the action value function vector of the target task and the task vector corresponding to the target task; and acquire the candidate action of the target task from the action space according to the value function of the target task, where the candidate action is the action in the action space that maximizes the value function of the target task.
In a possible embodiment, in terms of selecting the target action of the target task from the candidate actions of the target task and the actions randomly obtained from the action space according to the preset probability, the updating unit is specifically configured to:
when the first parameter is greater than the preset probability, determine the candidate action of the target task as the target action of the target task, where the first parameter is a random number with a value range of [0, 1]; and when the first parameter is not greater than the preset probability, determine the action randomly acquired from the action space as the target action of the target task.
In a possible embodiment, the adjusting unit is specifically configured to:
calculate a loss value according to a loss function and the state information s_t, the task vector, the target action, the state information s_t+1 and the reward value vector of each of the plurality of sample data; and adjust the decision model M_t according to the loss value to obtain the decision model M_t+1.
In a fourth aspect, an embodiment of the present application provides a decision device based on multitask learning, including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a plurality of candidate tasks and acquiring a target task from the candidate tasks; and acquiring the state information s of the target task according to the target taskt;
The joint characterization unit is used for performing task joint characterization on the target task according to the multiple candidate tasks to obtain a task vector corresponding to the target task, wherein the task vector corresponding to each candidate task is obtained based on the characteristic subtasks and the common subtasks of the multiple candidate tasks;
a determining unit, configured to determine a target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
In a possible embodiment, the joint characterization unit is specifically configured to:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task; extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
In a possible embodiment, the determining unit is specifically configured to:
input the task vector corresponding to the target task and the state information s_t of the target task into a target decision model for processing to obtain the target action of the target task, where the target decision model is implemented based on a neural network.
In one possible embodiment, in terms of inputting the task vector corresponding to the target task and the state information s_t of the target task into the target decision model for processing to obtain the target action of the target task, the determining unit is specifically configured to:

acquire an action value function vector of the target task according to the task vector corresponding to the target task and the state information s_t of the target task, where the action value functions in the action value function vector correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; acquire a value function of the target task according to the task vector and the action value function vector corresponding to the target task; and acquire the target action from the action space according to the value function of the target task, where the target action is the action in the action space that maximizes the value function of the target task.
In a fifth aspect, an embodiment of the present application provides another decision model training apparatus based on multitask learning, including:
a memory to store instructions; and
at least one processor coupled to the memory;
wherein the instructions, when executed by the at least one processor, cause the processor to perform some or all of the method of the first aspect.
In a sixth aspect, an embodiment of the present application provides another decision device based on multitask learning, including:
a memory to store instructions; and
at least one processor coupled to the memory;
wherein the instructions, when executed by the at least one processor, cause the processor to perform some or all of the method of the second aspect.
In a seventh aspect, an embodiment of the present application provides a chip system, where the chip system is applied to an electronic device; the chip system comprises one or more interface circuits, and one or more processors; the interface circuit and the processor are interconnected through a line; the interface circuit is to receive a signal from a memory of the electronic device and to send the signal to the processor, the signal comprising computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device performs part or all of the method according to the first aspect or the second aspect.
In an eighth aspect, embodiments of the present application provide a computer storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform some or all of the methods of the first or second aspects.
In a ninth aspect, embodiments of the present application provide a computer program product, which includes computer instructions, when the computer instructions are executed on an electronic device, cause the electronic device to perform part or all of the method according to the first aspect or the second aspect.
The computer program product can be executed on an intelligent carrier (such as a mobile vehicle, a robotic arm, or a recommendation or search engine) on which a computer system is installed. When the executable code for acquiring task/state information, processing the system state, selecting decisions and performing control runs on the storage components of the computer system, it requires the cooperation of the CPU/GPU and the storage system. The network communication components of the computer system are also used, and the decision model is stored on a storage component of the computer system.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1a is a functional block diagram of a vehicle according to an embodiment of the present disclosure;
FIG. 1b is a diagram illustrating an architecture of a computer system according to an embodiment of the present application;
FIG. 1c is a block diagram of an autopilot system according to an embodiment of the present application;
FIG. 1d is a block diagram of another embodiment of an autopilot system according to the present application;
fig. 2 is a schematic flowchart of a decision method based on multi-task learning according to an embodiment of the present disclosure;
FIG. 3 is a schematic view of an intersection scene;
fig. 4 is a schematic flowchart of a method for training a decision model based on multi-task learning according to an embodiment of the present disclosure;
FIG. 5 is a schematic flowchart of another method for training a decision model based on multi-task learning according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a decision device according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an in-vehicle device according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a decision device based on multitask learning according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a decision model training apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of another decision-making device according to an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of another decision model training apparatus according to an embodiment of the present disclosure;
fig. 12 is a partial schematic view of a computer program product according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings.
Fig. 1a is a functional block diagram of a vehicle 100 according to an embodiment of the present invention. In one embodiment, the vehicle 100 is configured in a fully or partially autonomous driving mode. For example, the vehicle 100 may control itself while in the autonomous driving mode, and may determine a current state of the vehicle and its surroundings by human operation, determine a possible behavior of at least one other vehicle in the surroundings, and determine a confidence level corresponding to a likelihood that the other vehicle performs the possible behavior, controlling the vehicle 100 based on the determined information. While the vehicle 100 is in the autonomous driving mode, the vehicle 100 may be placed into operation without human interaction.
The vehicle 100 may include various subsystems such as a travel system 102, a sensor system 104, a control system 106, one or more peripherals 108, as well as a power supply 110, a computer system 112, and a user interface 116. Alternatively, vehicle 100 may include more or fewer subsystems, and each subsystem may include multiple elements. In addition, each of the sub-systems and elements of the vehicle 100 may be interconnected by wire or wirelessly.
The travel system 102 may include components that provide powered motion to the vehicle 100. In one embodiment, the travel system 102 may include an engine 118, an energy source 119, a transmission 120, and wheels/tires 121. The engine 118 may be an internal combustion engine, an electric motor, an air compression engine, or other types of engine combinations, such as a hybrid engine of a gasoline engine and an electric motor, or a hybrid engine of an internal combustion engine and an air compression engine. The engine 118 converts the energy source 119 into mechanical energy.
Examples of energy sources 119 include gasoline, diesel, other petroleum-based fuels, propane, other compressed gas-based fuels, ethanol, solar panels, batteries, and other sources of electrical power. The energy source 119 may also provide energy to other systems of the vehicle 100.
The transmission 120 may transmit mechanical power from the engine 118 to the wheels 121. The transmission 120 may include a gearbox, a differential, and a drive shaft. In one embodiment, the transmission 120 may also include other devices, such as a clutch. Wherein the drive shaft may comprise one or more shafts that may be coupled to one or more wheels 121.
The sensor system 104 may include a number of sensors that sense information about the environment surrounding the vehicle 100. For example, the sensor system 104 may include a positioning system 122 (which may be a GPS system, a beidou system, or other positioning system), an Inertial Measurement Unit (IMU) 124, a radar 126, a laser range finder 128, and a camera 130. The sensor system 104 may also include sensors of internal systems of the monitored vehicle 100 (e.g., an in-vehicle air quality monitor, a fuel gauge, an oil temperature gauge, etc.). Sensor data from one or more of these sensors may be used to detect the object and its corresponding characteristics (position, shape, orientation, velocity, etc.). Such detection and identification is a critical function of the safe operation of the autonomous vehicle 100.
The positioning system 122 may be used to estimate the geographic location of the vehicle 100. The IMU 124 is used to sense position and orientation changes of the vehicle 100 based on inertial acceleration. In one embodiment, IMU 124 may be a combination of an accelerometer and a gyroscope.
The radar 126 may utilize radio signals to sense objects within the surrounding environment of the vehicle 100. In some embodiments, in addition to sensing objects, radar 126 may also be used to sense the speed and/or heading of an object.
The laser rangefinder 128 may utilize laser light to sense objects in the environment in which the vehicle 100 is located. In some embodiments, the laser rangefinder 128 may include one or more laser sources, laser scanners, and one or more detectors, among other system components.
The camera 130 may be used to capture multiple images of the surrounding environment of the vehicle 100. The camera 130 may be a still camera or a video camera.
The control system 106 is for controlling the operation of the vehicle 100 and its components. Control system 106 may include various elements including a steering system 132, a throttle 134, a braking unit 136, a sensor fusion system 138, a computer vision system 140, a route control system 142, and an obstacle avoidance system 144.
The steering system 132 is operable to adjust the heading of the vehicle 100. For example, in one embodiment, a steering wheel system.
The throttle 134 is used to control the operating speed of the engine 118 and thus the speed of the vehicle 100.
The brake unit 136 is used to control the deceleration of the vehicle 100. The brake unit 136 may use friction to slow the wheel 121. In other embodiments, the brake unit 136 may convert the kinetic energy of the wheel 121 into an electric current. The brake unit 136 may take other forms to slow the rotational speed of the wheels 121 to control the speed of the vehicle 100.
The computer vision system 140 may be operable to process and analyze images captured by the camera 130 to identify objects and/or features in the environment surrounding the vehicle 100. The objects and/or features may include traffic signals, road boundaries, and obstacles. The computer vision system 140 may use object recognition algorithms, Structure from Motion (SFM) algorithms, video tracking, and other computer vision techniques. In some embodiments, the computer vision system 140 may be used to map the environment, track objects, estimate the speed of objects, and so forth.
The route control system 142 is used to determine a travel route of the vehicle 100. In some embodiments, the route control system 142 may combine data from the sensors 138, the GPS 122, and one or more predetermined maps to determine a travel route for the vehicle 100.
The obstacle avoidance system 144 is used to identify, evaluate, and avoid or otherwise negotiate potential obstacles in the environment of the vehicle 100.
Of course, in one example, the control system 106 may additionally or alternatively include components other than those shown and described. Or may reduce some of the components shown above.
Vehicle 100 interacts with external sensors, other vehicles, other computer systems, or users through peripherals 108. The peripheral devices 108 may include a wireless communication system 146, an in-vehicle computer 148, a microphone 150, and/or speakers 152.
In some embodiments, the peripheral devices 108 provide a means for a user of the vehicle 100 to interact with the user interface 116. For example, the onboard computer 148 may provide information to a user of the vehicle 100. The user interface 116 may also operate the in-vehicle computer 148 to receive user input. The in-vehicle computer 148 may be operated via a touch screen. In other cases, the peripheral devices 108 may provide a means for the vehicle 100 to communicate with other devices located within the vehicle. For example, the microphone 150 may receive audio (e.g., voice commands or other audio input) from a user of the vehicle 100. Similarly, the speaker 152 may output audio to a user of the vehicle 100.
The wireless communication system 146 may communicate wirelessly with one or more devices, either directly or via a communication network. For example, the wireless communication system 146 may use 3G cellular communication, such as CDMA, EV-DO, or GSM/GPRS, 4G cellular communication, such as LTE, or 5G cellular communication. The wireless communication system 146 may communicate with a wireless local area network (WLAN) using WiFi. In some embodiments, the wireless communication system 146 may communicate directly with a device using an infrared link, Bluetooth, or ZigBee. Other wireless protocols, such as various vehicle communication systems, may also be used; for example, the wireless communication system 146 may include one or more dedicated short range communications (DSRC) devices, which may include public and/or private data communications between vehicles and/or roadside stations.
The power supply 110 may provide power to various components of the vehicle 100. In one embodiment, power source 110 may be a rechargeable lithium ion or lead acid battery. One or more battery packs of such batteries may be configured as a power source to provide power to various components of the vehicle 100. In some embodiments, the power source 110 and the energy source 119 may be implemented together, such as in some all-electric vehicles.
Some or all of the functionality of the vehicle 100 is controlled by the computer system 112. The computer system 112 may include at least one processor 113, the processor 113 executing instructions 115 stored in a non-transitory computer readable medium, such as a data storage device 114. The computer system 112 may also be a plurality of computing devices that control individual components or subsystems of the vehicle 100 in a distributed manner.
The processor 113 may be any conventional processor, such as a commercially available CPU. Alternatively, the processor may be a dedicated device such as an ASIC or other hardware-based processor. Although FIG. 1a functionally illustrates a processor, memory, and other elements of the computer 110 in the same block, those skilled in the art will appreciate that the processor, computer, or memory may in fact comprise multiple processors, computers, or memories that may or may not be stored within the same physical housing. For example, the memory may be a hard disk drive or other storage medium located in a different housing than the computer 110. Thus, references to a processor or computer are to be understood as including references to a collection of processors or computers or memories which may or may not operate in parallel. Rather than using a single processor to perform the steps described herein, some components, such as the steering component and the retarding component, may each have their own processor that performs only computations related to the component-specific functions.
The processor 113 obtains current state information of the vehicle through the sensor obtained by the sensing system 104, the processor 113 obtains a plurality of candidate tasks and determines a target task from the candidate tasks, the target task is subjected to joint representation according to the candidate tasks to obtain a task vector corresponding to the target task, the task vector corresponding to the target task and the current state information are input into the decision model to be processed according to the task vector corresponding to the target task, a target action of the target task is obtained, and the control system 106 executes the target action to control the vehicle 100 to run.
In some embodiments, the data storage device 114 may include instructions 115 (e.g., program logic), and the instructions 115 may be executed by the processor 113 to perform various functions of the vehicle 100, including those described above. The data storage 114 may also contain additional instructions, including instructions to send data to, receive data from, interact with, and/or control one or more of the travel system 102, the sensor system 104, the control system 106, and the peripheral devices 108.
In addition to instructions 115, data storage device 114 may also store data such as road maps, route information, the location, direction, speed of the vehicle, and other such vehicle data, among other information. Such information may be used by the vehicle 100 and the computer system 112 during operation of the vehicle 100 in autonomous, semi-autonomous, and/or manual modes.
A user interface 116 for providing information to and receiving information from a user of the vehicle 100. Optionally, the user interface 116 may include one or more input/output devices within the collection of peripheral devices 108, such as a wireless communication system 146, an in-vehicle computer 148, a microphone 150, and a speaker 152.
The computer system 112 may control the functions of the vehicle 100 based on inputs received from various subsystems (e.g., the travel system 102, the sensor system 104, and the control system 106) and from the user interface 116. For example, the computer system 112 may utilize input from the control system 106 in order to control the steering unit 132 to avoid obstacles detected by the sensor system 104 and the obstacle avoidance system 144. In some embodiments, the computer system 112 is operable to provide control over many aspects of the vehicle 100 and its subsystems.
Alternatively, one or more of the components described above may be mounted separately from or associated with the vehicle 100. For example, the data storage device 114 may exist partially or completely separate from the vehicle 100. The above components may be communicatively coupled together in a wired and/or wireless manner.
Optionally, the above components are only an example, in an actual application, components in the above modules may be added or deleted according to an actual need, and fig. 1a should not be construed as limiting the embodiment of the present invention.
An autonomous automobile traveling on a roadway, such as vehicle 100 above, may identify objects within its surrounding environment to determine an adjustment to the current speed. The object may be another vehicle, a traffic control device, or another type of object. In some examples, each identified object may be considered independently, and based on the respective characteristics of the object, such as its current speed, acceleration, separation from the vehicle, etc., may be used to determine the speed at which the autonomous vehicle is to be adjusted.
Alternatively, the autonomous vehicle 100 or a computing device associated with the autonomous vehicle 100 (e.g., the computer system 112, the computer vision system 140, or the data storage 114 of FIG. 1a) may predict the behavior of the identified objects based on the characteristics of the identified objects and the state of the surrounding environment (e.g., traffic, rain, ice on the road, etc.). Optionally, the identified objects depend on each other's behavior, so it is also possible to predict the behavior of a single identified object by taking all of the identified objects into account together. The vehicle 100 is able to adjust its speed based on the predicted behavior of the identified objects. In other words, the autonomous vehicle is able to determine what stable state the vehicle will need to adjust to (e.g., accelerate, decelerate, or stop) based on the predicted behavior of the objects. In this process, other factors may also be considered to determine the speed of the vehicle 100, such as the lateral position of the vehicle 100 in the road on which it is traveling, the curvature of the road, the proximity of static and dynamic objects, and so forth.
In addition to providing instructions to adjust the speed of the autonomous vehicle, the computing device may also provide instructions to modify the steering angle of the vehicle 100 to cause the autonomous vehicle to follow a given trajectory and/or to maintain a safe lateral and longitudinal distance from objects in the vicinity of the autonomous vehicle (e.g., cars in adjacent lanes on the road).
The vehicle 100 may be a car, a truck, a motorcycle, a bus, a boat, an airplane, a helicopter, a lawn mower, an amusement car, a playground vehicle, construction equipment, a trolley, a golf cart, a train, a trolley, etc., and the embodiment of the present invention is not particularly limited.
As shown in FIG. 1b, computer system 101 includes a processor 103 coupled to a system bus 105. The processor 103 may be one or more processors, each of which may include one or more processor cores. A display adapter (video adapter) 107 may drive a display 109, which is coupled to the system bus 105. The system bus 105 is coupled to an input/output (I/O) bus 113 through a bus bridge 111. An I/O interface 115 is coupled to the I/O bus. The I/O interface 115 communicates with various I/O devices, such as an input device 117 (e.g., a keyboard, a mouse, a touch screen, etc.), a media tray 121 (e.g., a CD-ROM, a multimedia interface, etc.), a transceiver 123 (which can send and/or receive radio communication signals), a camera 155 (which can capture static and dynamic digital video images), and an external USB interface 125. Optionally, the interface connected to the I/O interface 115 may be a USB interface.
Optionally, in various embodiments described herein, computer system 101 may be located remotely from the autonomous vehicle and may communicate wirelessly with the autonomous vehicle. In other aspects, some of the processes described herein are performed on a processor disposed within the autonomous vehicle, while others are performed by a remote processor, including taking the actions required to perform a single maneuver.
The hard drive interface is coupled to system bus 105 and is connected to the hard disk drive. System memory 135 is coupled to system bus 105. Data running in system memory 135 may include the operating system 137 and application programs 143 of computer 101.
The operating system includes a Shell 139 and a kernel 141. Shell 139 is an interface between the user and the kernel of the operating system. The shell is the outermost layer of the operating system. The shell manages the interaction between the user and the operating system: it waits for user input, interprets the user input to the operating system, and processes the output of the operating system.
The application programs 143 include programs related to controlling the automatic driving of a vehicle, such as programs for managing the interaction of an automatically driven vehicle with obstacles on the road, programs for controlling the route or speed of an automatically driven vehicle, and programs for controlling the interaction of an automatically driven vehicle with other automatically driven vehicles on the road. Application programs 143 may also reside on the system of the software deploying server 149. In one embodiment, computer system 101 may download application program 143 from software deploying server 149 when the autopilot-related program 147 needs to be executed.
When the processor 103 executes the application program 143, the processor performs the following steps: acquiring a target task from a plurality of candidate tasks according to the navigation information, such as intersection straight-going; performing joint characterization on the target task according to the plurality of candidate tasks to obtain a task vector corresponding to the target task; inputting the task vector corresponding to the target task and the state information s_t of the target task acquired by the sensor 153 into a decision model for processing, so as to obtain the target action of the target task, such as accelerating or stopping; and then executing the target action to control the vehicle to travel.
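For illustration only, the following Python sketch shows one possible way these steps could be organized; the object and method names (vehicle.get_state, decision_model.decide, etc.) are assumptions and are not part of the embodiment.

    # Illustrative control loop for the decision steps described above (a sketch, not the
    # claimed implementation).
    def drive_through_maneuver(target_task, g, decision_model, vehicle):
        # g: task vector obtained by jointly characterizing target_task against all candidate tasks
        while True:
            s_t = vehicle.get_state()               # state information acquired from the sensor
            a_t = decision_model.decide(s_t, g)     # target action, e.g. accelerate or stop
            vehicle.execute(a_t)                    # execute the target action to control the vehicle
            if vehicle.task_finished(target_task):  # e.g. the intersection has been passed
                break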
Acquiring a target task from a plurality of candidate tasks; acquiring, according to the target task, the state information s_t of the target task and the task vector corresponding to the target task, where the task vector corresponding to the target task is obtained based on the common subtasks and the characteristic subtasks of the plurality of candidate tasks; updating a sample database according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model to obtain an updated sample database; randomly acquiring a plurality of pieces of sample data from the updated sample database, the plurality of pieces of sample data being sample data of some or all of the candidate tasks; adjusting parameters in the decision model according to the plurality of pieces of sample data by using a reinforcement learning method to obtain an adjusted decision model; judging whether the adjusted decision model has converged; and when the adjusted decision model has converged, determining the adjusted decision model as the target decision model.
The decision model may be implemented based on a fully-connected neural network, a convolutional neural network, a recurrent neural network, or other neural networks.
In one example, computer 720 may include a server having multiple computers, such as a load-balancing server farm, which exchanges information with different nodes of a network for the purpose of receiving, processing, and transmitting data from computer system 112. The server may be configured similarly to computer system 110, with a processor 730, memory 740, instructions 750, and data 760.
Optionally, the data 760 includes the coordinates and speed of the host vehicle in the world coordinate system, as well as the coordinates, speed, and heading angle of the surrounding social vehicles in the world coordinate system, and the like.
When executing the instructions 750, the processor 730 specifically implements the following steps:
acquiring, according to the coordinates and speed of the host vehicle in the world coordinate system and the coordinates, speed, and heading angle of the surrounding social vehicles in the world coordinate system, the coordinates (x_it, y_it), velocity v_it, and heading angle θ_it of the surrounding social vehicles in the host-vehicle coordinate system.
Acquiring the speed of the host vehicle and the coordinates (x_it, y_it), velocity v_it, and heading angle θ_it of the surrounding social vehicles in the host-vehicle coordinate system, and sending the speed of the host vehicle and the coordinates (x_it, y_it), velocity v_it, and heading angle θ_it of the surrounding social vehicles in the host-vehicle coordinate system to the computer system 112 of the computer 101 of the vehicle.
It should be noted that the above instruction can be regarded as a conversion instruction.
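A minimal sketch of one possible form of such a conversion instruction is given below, assuming headings are measured from the world x-axis; the function name and conventions are assumptions for illustration only.

    # Transform a social vehicle's world-frame pose into the host-vehicle coordinate system
    # (illustrative sketch of the conversion instruction).
    import math

    def world_to_host(host_x, host_y, host_heading, veh_x, veh_y, veh_heading):
        dx, dy = veh_x - host_x, veh_y - host_y
        cos_h, sin_h = math.cos(host_heading), math.sin(host_heading)
        x_local = cos_h * dx + sin_h * dy         # longitudinal offset in the host frame
        y_local = -sin_h * dx + cos_h * dy        # lateral offset in the host frame
        theta_local = veh_heading - host_heading  # relative heading angle
        return x_local, y_local, theta_local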
Fig. 1d shows an example of an autonomous driving vehicle and a cloud service center according to an example embodiment. Cloud service center 520 may receive information (such as data collected by vehicle sensors or other information) from autonomous vehicles 510, 512, and 514 within its operating environment 500 via a network 502, such as a wireless communication network.
Cloud service center 520 obtains information about the speed, coordinates, heading angle, etc. of autonomous vehicles 510, 512, and 514 in the world coordinate system via network 502.
The cloud service center runs the stored programs related to controlling the automatic driving of the automobile according to the received data to control the automatic driving vehicles 510, 512 and 514. The programs related to controlling the automatic driving of the automobile can be programs for managing the interaction between the automatic driving automobile and obstacles on the road, programs for controlling the route or the speed of the automatic driving automobile and programs for controlling the interaction between the automatic driving automobile and other automatic driving automobiles on the road.
The cloud service center 520 acquires the state information of any vehicle A among the autonomous vehicles 510, 512, and 514, where the state information includes the speed of the vehicle A and the coordinates, speed, and heading angle of the surrounding vehicles in the coordinate system of the vehicle A; acquires a target task from a plurality of candidate tasks according to the navigation information of the vehicle A; performs joint characterization on the target task according to the plurality of candidate tasks to obtain a task vector corresponding to the target task; and inputs the state information and the task vector corresponding to the target task into a decision model for processing to obtain the target action of the target task.
After acquiring the target action, the cloud service center 520 sends the target action to the vehicle a so that the vehicle a travels according to the target action.
The network 502 sends portions of the map to the autonomous vehicles 510, 512, or 514. In other examples, operations may be divided between different locations or centers. For example, multiple cloud service centers may receive, validate, combine, and/or send information reports. Information reports and/or sensor data may also be sent between autonomous vehicles in some examples. Other configurations are also possible.
The cloud service centers can share information such as the speed, coordinates and course angle of the vehicle in the service area under the world coordinate system; in a plurality of cloud service centers, when the cloud service center 1 cannot provide driving service for the vehicle B within the service range of the cloud service center, the cloud service center 1 may send relevant information of the vehicle B (such as state information of the vehicle B, an action space, and a task vector corresponding to a task to be executed) to the cloud service center 2; the cloud service center 2 determines a target action of the vehicle B according to the state information of the vehicle B, the action space and the task vector corresponding to the task to be executed, then sends the target action to the cloud service center 1, and the cloud service center 1 sends the target action to the vehicle B.
In some examples, the center sends suggested solutions to the autonomous vehicle regarding possible driving conditions within the environment (e.g., informing of an obstacle ahead and how to bypass it). For example, the cloud service center may assist the vehicle in determining how to travel when facing a particular obstacle within the environment. The cloud service center sends a response to the autonomous vehicle indicating how the vehicle should travel in the given scenario. For example, the cloud service center may confirm the presence of a temporary stop sign in front of the road based on the collected sensor data, and may also determine that a lane is closed due to construction based on a "lane closed" sign and sensor data from construction vehicles in that lane. Accordingly, the cloud service center sends a suggested mode of operation for the autonomous vehicle to pass the obstacle (e.g., instructing the vehicle to change lanes onto another road). When the cloud service center observes the video stream within its operating environment and has confirmed that the autonomous vehicle can safely and successfully traverse the obstacle, the operational steps used for that autonomous vehicle may be added to the driving information map. Accordingly, this information may be sent to other vehicles in the area that may encounter the same obstacle, so as to assist the other vehicles not only in recognizing the closed lane but also in knowing how to pass.
Referring to fig. 2, fig. 2 is a schematic flowchart of a decision method based on multi-task learning according to an embodiment of the present application. As shown in fig. 2, the method includes:
S201, acquiring a plurality of candidate tasks, determining a target task from the plurality of candidate tasks, and acquiring state information s_t of the target task according to the target task.
Optionally, the plurality of candidate tasks may be tasks in the same scene; for example, in an intersection scene, the plurality of candidate tasks may include intersection straight-going, intersection left turn, and intersection right turn; for another example, in a tactical competitive game scene, the candidate tasks may include the largest number of kills, the longest survival time, and the fewest deaths.
Optionally, the plurality of candidate tasks may further include tasks in different scenes, such as tasks in an intersection scene and tasks in a tactical competitive game scene.
For example, in an intersection scene, since the host vehicle plays multiple rounds of interactive games with the social vehicles, the information of the host vehicle and the information of the surrounding social vehicles need to be obtained; therefore, the state information s_t of the target task includes the information of the host vehicle and the information of the surrounding social vehicles. Optionally, the information of the host vehicle includes the speed v_et of the host vehicle, and the information of a surrounding social vehicle includes the coordinates (x_it, y_it), velocity v_it, and heading angle θ_it of that social vehicle in the host-vehicle coordinate system. Assuming that the surrounding social vehicles include the 5 vehicles closest to the host vehicle, the state information s_t can be represented as s_t = [v_et, x_1t, y_1t, v_1t, θ_1t, …, x_5t, y_5t, v_5t, θ_5t].
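As an illustrative aid only, the state vector above could be assembled as follows; the function name and the assumption that the social-vehicle poses are already expressed in the host-vehicle frame are for illustration.

    # Assemble the state vector s_t for the intersection scene (illustrative sketch).
    import numpy as np

    def build_state(v_host, social_vehicles):
        # social_vehicles: list of 5 tuples (x_it, y_it, v_it, theta_it), nearest first
        s_t = [v_host]
        for x, y, v, theta in social_vehicles:
            s_t.extend([x, y, v, theta])
        return np.array(s_t)  # shape (21,): [v_et, x_1t, y_1t, v_1t, theta_1t, ..., theta_5t]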
For example, in a tactical competitive game scene, the state information s_t of the target task includes the coordinates (x_pt, y_pt) and forward speed v_pt of the main character in the map, and the coordinates (x_it, y_it), forward speed v_it, and forward angle θ_it of teammates in the coordinate system of the main character. Optionally, the state information s_t of the target task may also include the health value, firepower value, number of kills, and survival time of the main character and teammates, as well as the position information, health values, firepower values, and the like of enemy units.
In a specific example, in an intersection scene, the target task can be determined from the plurality of candidate tasks according to the navigation information of the user. For example, in the intersection scene, the plurality of candidate tasks include intersection left turn, intersection right turn, and intersection straight-going; if the navigation information indicates that a right turn is required at the intersection, the target task determined from the plurality of candidate tasks is the intersection right turn.
In another specific example, in a tactical competitive game scene, the target task may be determined from the plurality of candidate tasks based on the game settings. For example, in a tactical competitive game, the plurality of candidate tasks include the largest number of kills and the longest survival time; if the condition for winning the game is the longest survival time, the target task determined from the plurality of candidate tasks is the longest survival time.
S202, performing task joint characterization on the target task according to the candidate tasks to obtain a task vector corresponding to the target task.
And the task vector corresponding to each candidate task is obtained based on the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
Specifically, decomposing each candidate task in the multiple candidate tasks according to the prior knowledge to obtain a subtask corresponding to each candidate task; acquiring characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
For example, in an intersection scene, the plurality of candidate tasks may include intersection straight going, intersection left turning, and intersection right turning; according to the priori knowledge, the intersection straight-going is decomposed into two subtasks of intersection straight-going collision or arrival and traffic efficiency improvement; the left turn at the intersection is decomposed into two subtasks of collision or arrival at the left turn at the intersection and improvement of traffic efficiency; the right turn at the intersection is decomposed into two subtasks of collision or arrival of the right turn at the intersection and improvement of traffic efficiency. The intersection straight-going collision or arrival, the intersection left-turning collision or arrival and the intersection right-turning collision or arrival are characteristic subtasks, and the traffic efficiency is improved to be a common subtask.
After the characteristic subtasks and the common subtask are obtained, task joint characterization is performed on each of the plurality of candidate tasks according to the characteristic subtasks and the common subtask to obtain the task vector corresponding to each candidate task. For example, for the intersection scene, the task vector corresponding to each candidate task is obtained based on four subtasks: intersection left-turn collision or arrival, intersection straight-going collision or arrival, intersection right-turn collision or arrival, and improving traffic efficiency.
Optionally, the task vector corresponding to each of the plurality of candidate tasks is composed of a plurality of elements, and the elements correspond one-to-one to the characteristic subtasks and the common subtasks of the plurality of candidate tasks. The different values of each element in the task vector corresponding to a candidate task indicate whether the candidate task includes the subtask corresponding to that element.
For example, for the three candidate tasks of the intersection scene (intersection left turn, intersection straight-going, and intersection right turn), the task vector corresponding to each candidate task is composed of four elements, and the four elements respectively correspond to the four subtasks: intersection left-turn collision or arrival, intersection straight-going collision or arrival, intersection right-turn collision or arrival, and improving traffic efficiency. The different values of each element in the task vector corresponding to a candidate task indicate whether the candidate task includes the subtask corresponding to that element.
Optionally, the task vector corresponding to intersection left turn may be represented as [1,0,0,1], the task vector corresponding to intersection straight-going as [0,1,0,1], and the task vector corresponding to intersection right turn as [0,0,1,1]. In the task vector [1,0,0,1] corresponding to intersection left turn, the first "1" from left to right indicates that the candidate task "intersection left turn" includes the subtask "intersection left-turn collision or arrival", the first "0" indicates that it does not include the subtask "intersection straight-going collision or arrival", the second "0" indicates that it does not include the subtask "intersection right-turn collision or arrival", and the second "1" indicates that it includes the subtask "improve traffic efficiency". In the task vector [0,1,0,1] corresponding to intersection straight-going, the first "0" from left to right indicates that the candidate task "intersection straight-going" does not include the subtask "intersection left-turn collision or arrival", the first "1" indicates that it includes the subtask "intersection straight-going collision or arrival", the second "0" indicates that it does not include the subtask "intersection right-turn collision or arrival", and the second "1" indicates that it includes the subtask "improve traffic efficiency". In the task vector [0,0,1,1] corresponding to intersection right turn, the first "0" from left to right indicates that the candidate task "intersection right turn" does not include the subtask "intersection left-turn collision or arrival", the second "0" indicates that it does not include the subtask "intersection straight-going collision or arrival", the first "1" indicates that it includes the subtask "intersection right-turn collision or arrival", and the second "1" indicates that it includes the subtask "improve traffic efficiency".
Optionally, the task vector corresponding to intersection left turn may be represented as [80,0,0,20], the task vector corresponding to intersection straight-going as [0,80,0,20], and the task vector corresponding to intersection right turn as [0,0,80,20]. In the task vector [80,0,0,20] corresponding to intersection left turn, "80" indicates that the candidate task "intersection left turn" includes the subtask "intersection left-turn collision or arrival", the first "0" from left to right indicates that it does not include the subtask "intersection straight-going collision or arrival", the second "0" indicates that it does not include the subtask "intersection right-turn collision or arrival", and "20" indicates that it includes the subtask "improve traffic efficiency". In the task vector [0,80,0,20] corresponding to intersection straight-going, the first "0" from left to right indicates that the candidate task "intersection straight-going" does not include the subtask "intersection left-turn collision or arrival", "80" indicates that it includes the subtask "intersection straight-going collision or arrival", the second "0" indicates that it does not include the subtask "intersection right-turn collision or arrival", and "20" indicates that it includes the subtask "improve traffic efficiency". In the task vector corresponding to a candidate task, a larger element value means that the candidate task places more emphasis on the corresponding subtask and a smaller element value means less emphasis; equivalently, the candidate task includes the subtasks corresponding to non-zero element values and does not include the subtasks corresponding to zero element values. In other words, each element in the task vector corresponding to the candidate task is the weight of the subtask corresponding to that element: the more important a subtask is to the candidate task, the larger its weight; the less important, the smaller its weight.
It should be noted that the order of the subtasks in the task vectors corresponding to the plurality of candidate tasks includes, but is not limited to, the order described above ([intersection left-turn collision or arrival, intersection straight-going collision or arrival, intersection right-turn collision or arrival, improve traffic efficiency]), and may also be other orders.
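For concreteness, the binary and weighted task vectors described above could be stored as follows; this is a sketch, and the task names used as dictionary keys are illustrative assumptions.

    # Illustrative encoding of the task vectors for the intersection scene, using the subtask
    # order [left-turn collision/arrival, straight collision/arrival, right-turn
    # collision/arrival, improve traffic efficiency]; other orders are possible.
    TASK_VECTORS = {
        "intersection_left":     [1, 0, 0, 1],
        "intersection_straight": [0, 1, 0, 1],
        "intersection_right":    [0, 0, 1, 1],
    }
    # Weighted variant in which each element is the weight of its subtask:
    WEIGHTED_TASK_VECTORS = {
        "intersection_left":     [80, 0, 0, 20],
        "intersection_straight": [0, 80, 0, 20],
        "intersection_right":    [0, 0, 80, 20],
    }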
S203, determining the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
Specifically, the task vector corresponding to the target task and the state information s_t of the target task are input into the decision model M_t for processing to obtain the target action of the target task, where the decision model is implemented by a neural network.
Alternatively, the neural network may be a fully-connected neural network, a convolutional neural network, a recurrent neural network, or other type of neural network.
In one embodiment, inputting the task vector corresponding to the target task and the state information s_t of the target task into the decision model M_t for processing to obtain the target action of the target task specifically includes:
acquiring an action value function vector of the target task according to the task vector corresponding to the target task and the state information s_t of the target task, where the action value functions in the action value function vector correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task. The action value function vector may be represented as R(s_t, a_k, g), where a_k is an action in the action space and g is the task vector corresponding to the target task. To avoid the influence of the action value functions of subtasks irrelevant to the target task in the action value function vector R(s_t, a_k, g) of the target task, the action value functions of the subtasks irrelevant to the target task are removed from R(s_t, a_k, g) according to the task vector g corresponding to the target task to obtain the value function of the target task, and the target action is determined from the action space according to the value function of the target task, where the target action is the action in the action space that maximizes the value function of the target task.
Optionally, the value function is a Q value function. The Q value function of the target task may be expressed as Q(s_t, a_k, g) = g^T R(s_t, a_k, g), and the target action may be represented as the action a_k in the action space that maximizes Q(s_t, a_k, g), i.e., a_t = argmax_{a_k} Q(s_t, a_k, g).
It should be noted that the action space is composed of the actions that can be executed when the target task is executed. For example, consider speed planning control of the vehicle given the navigation information, i.e., given the waypoints of the unmanned vehicle. To enhance the interactivity of the vehicle, the unmanned vehicle is required to be capable of stopping and waiting, forcing its way through, and accelerating to pass. Therefore, the designed action space needs to cover a large speed range; this embodiment adopts the discrete action space [0, 3 m/s, 6 m/s, 9 m/s].
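A minimal sketch of selecting the target action over this discrete action space is given below; it assumes the decision model can be evaluated per action and returns one action-value entry per subtask, which is an illustrative assumption rather than the claimed implementation.

    # Mask the per-subtask action values with the task vector g and take the argmax
    # over the discrete action space (illustrative sketch).
    import numpy as np

    ACTION_SPACE = [0.0, 3.0, 6.0, 9.0]  # target speeds in m/s

    def target_action(decision_model, s_t, g):
        q = [np.dot(g, decision_model(s_t, a_k, g)) for a_k in ACTION_SPACE]  # Q = g^T R
        return ACTION_SPACE[int(np.argmax(q))]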
After the target action a_t is executed, the state information s_{t+1} of the target task is acquired, and whether the target task has ended is determined according to the state information s_{t+1} of the target task. For example, when the target task is intersection straight-going, whether the vehicle has passed through the intersection can be judged based on the state information s_{t+1}. If it is determined based on the state information s_{t+1} that the target task has not ended, steps S202-S203 are executed again; if it is determined based on the state information s_{t+1} that the target task has ended, the next task is executed according to steps S201-S203.
It is to be noted here that the state information s_t is the state information at time t and the state information s_{t+1} is the state information at time t+1; the two are the same type of information at different times, and the target action a_t is the action executed at time t. The execution subject of the target action a_t may or may not be the same as the execution subject of the decision model. For example, after the decision device obtains the target action, it sends the target action to the automobile, and the automobile's control device executes the target action to control the automobile.
It can be seen that in the scheme of the application, the target task is subjected to joint characterization according to the subtasks of the multiple candidate tasks, so that the multiple tasks can be decided by using the same model, and the mutual influence among the multiple tasks is avoided; the value function of the target task is obtained according to the task vector and the action value function vector corresponding to the target task, and then the target action is determined based on the value function, so that the influence of the action value function of a subtask irrelevant to the target task in the action value function vector on the selection of the target action is avoided when the target task is decided, and the decision effect of the target task is improved.
In a specific example, as shown in fig. 3, for an intersection interaction scene, the unmanned vehicle needs to be capable of three tasks, namely intersection left turn, intersection straight-going, and intersection right turn; in other words, the candidate tasks of the unmanned vehicle include intersection left turn, intersection straight-going, and intersection right turn. If it is determined from the navigation information that the vehicle needs to go straight through the intersection, the target task determined from the candidate tasks is intersection straight-going;
decomposing the three candidate tasks respectively according to prior knowledge to obtain the subtasks of each of the three candidate tasks: the subtasks of intersection left turn include intersection left-turn collision or arrival and improving traffic efficiency, the subtasks of intersection straight-going include intersection straight-going collision or arrival and improving traffic efficiency, and the subtasks of intersection right turn include intersection right-turn collision or arrival and improving traffic efficiency; extracting the characteristic subtasks and the common subtask from the subtasks of the three candidate tasks, where the characteristic subtasks include intersection left-turn collision or arrival, intersection straight-going collision or arrival, and intersection right-turn collision or arrival, and the common subtask is improving traffic efficiency; acquiring the task vector corresponding to each of the three candidate tasks according to the characteristic subtasks and the common subtask, where the task vector g corresponding to intersection straight-going is [0,1,0,1];
in the intersection scene, since the host vehicle needs to play multiple rounds of interactive games with the surrounding social vehicles, complete information about the social vehicles around the host vehicle needs to be obtained. The state information s_t is acquired, where the state information s_t includes the speed v_e of the host vehicle and the position coordinates (x_i, y_i), velocity v_i, and heading angle θ_i of the social vehicles in the host-vehicle coordinate system. Assuming that the surrounding social vehicles are the five vehicles closest to the host vehicle, the state information s_t can be expressed as: s_t = [v_e, x_1, y_1, v_1, θ_1, …, x_5, y_5, v_5, θ_5];
The task vector g corresponding to intersection straight-going and the state information s_t are input into the decision model for processing to obtain the target action corresponding to intersection straight-going. Specifically, the action value function vector R(s_t, a_k, g) is obtained according to the task vector g corresponding to intersection straight-going and the state information s_t, where the action value functions in R(s_t, a_k, g) correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to intersection straight-going. To eliminate the influence of the action value function corresponding to intersection right-turn collision or arrival and the action value function corresponding to intersection left-turn collision or arrival in the action value function vector R(s_t, a_k, g), the Q value function corresponding to intersection straight-going is obtained based on the task vector g corresponding to intersection straight-going and the action value function vector R(s_t, a_k, g); the Q value function can be expressed as: Q(s_t, a_k, g) = g^T R(s_t, a_k, g);
Finally, the action at which the Q value function Q(s_t, a_k, g) corresponding to intersection straight-going takes its maximum value is determined as the target action a_t, for example, accelerating straight ahead. After the target action a_t is executed, the state information s_{t+1} of the target task is acquired, and whether the target task has ended is judged by means of the state information s_{t+1}. For example, it may be determined from the coordinates of the host vehicle in the state information s_{t+1} that the vehicle has passed through the intersection, and therefore that the intersection straight-going task has ended. If it is determined from the state information that the vehicle has not passed through the intersection, the target action corresponding to intersection straight-going is obtained again according to the above method until the vehicle passes through the intersection.
It can be seen that this embodiment can be used in an intersection scene to learn the passing strategies for the three directions simultaneously in one decision model, successfully find the timing for entering the intersection, improve the traffic efficiency through multiple interactions, and avoid collisions.
Referring to fig. 4, fig. 4 is a schematic flowchart of a training method of a decision model based on multi-task learning according to an embodiment of the present application. As shown in fig. 4, the method includes:
S401, randomly acquiring a plurality of pieces of sample data from a first sample database, and adjusting a decision model M_t according to the plurality of pieces of sample data by using a reinforcement learning method to obtain a decision model M_{t+1}.
The first sample database comprises sample data of a plurality of candidate tasks, the sample data of the target task comprises a task vector corresponding to the target task, the task vector corresponding to the target task is obtained based on a common subtask and a characteristic subtask in the candidate tasks, and the target task is any one of the candidate tasks.
Specifically, a loss value is calculated according to the loss function and the state information s_t, the task vector, the target action, the state information s_{t+1}, and the reward value vector r of each piece of sample data in the plurality of pieces of sample data; the decision model M_t is adjusted according to the loss value to obtain the decision model M_{t+1}.
Wherein the loss value can be expressed as:
where r in the formula is the reward value vector in the sample data, g is the task vector in the sample data, the action a_t is the action in the sample data, and the discount coefficient γ is a constant.
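The exact loss formula is not reproduced above; the following is only a sketch of one plausible form, assuming a standard temporal-difference (DQN-style) objective consistent with the symbols r, g, a_t, and the discount coefficient γ, and it should not be read as the claimed loss.

    # Possible TD-style loss for one piece of sample data <s_t, a_t, r, s_{t+1}, g>
    # (illustrative assumption; R_model is assumed to return one action value per subtask).
    import numpy as np

    def td_loss(R_model, sample, gamma, action_space):
        s_t, a_t, r, s_next, g = sample
        q_sa = np.dot(g, R_model(s_t, a_t, g))      # Q(s_t, a_t, g) = g^T R(s_t, a_t, g)
        q_next = max(np.dot(g, R_model(s_next, a, g)) for a in action_space)
        target = np.dot(g, r) + gamma * q_next      # masked reward plus discounted bootstrap
        return (target - q_sa) ** 2                 # squared TD error as the loss value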
In a possible embodiment, before randomly acquiring a plurality of sample data from the first sample database, the method of this embodiment further includes:
acquiring a target task from the plurality of candidate tasks; and acquiring, according to the target task, the state information s_t of the target task and the task vector corresponding to the target task;
generating sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, and adding the sample data of the target task to a preliminary sample database to obtain the first sample database.
Optionally, the plurality of candidate tasks may be tasks in the same scene; for example, in an intersection scene, the plurality of candidate tasks may include intersection straight-going, intersection left turn, and intersection right turn; for another example, in a tactical competitive game scene, the candidate tasks may include the largest number of kills, the longest survival time, and the fewest deaths.
Optionally, the plurality of candidate tasks may further include tasks in different scenes, such as tasks in an intersection scene and tasks in a tactical competitive game scene.
For example, in an intersection scene, since the host vehicle plays multiple rounds of interactive games with the social vehicles, the information of the host vehicle and the information of the surrounding social vehicles need to be obtained; therefore, the state information s_t of the target task includes the information of the host vehicle and the information of the surrounding social vehicles. Optionally, the information of the host vehicle includes the speed v_et of the host vehicle, and the information of a surrounding social vehicle includes the coordinates (x_it, y_it), velocity v_it, and heading angle θ_it of that social vehicle in the host-vehicle coordinate system. Assuming that the surrounding social vehicles include the 5 vehicles closest to the host vehicle, the state information s_t can be represented as s_t = [v_et, x_1t, y_1t, v_1t, θ_1t, …, x_5t, y_5t, v_5t, θ_5t].
For example, in a tactical competitive game scene, the state information s_t of the target task includes the coordinates (x_pt, y_pt) and forward speed v_pt of the main character in the map, and the coordinates (x_it, y_it), forward speed v_it, and forward angle θ_it of teammates in the coordinate system of the main character. Optionally, the state information s_t of the target task may also include the health value, firepower value, number of kills, and survival time of the main character and teammates, as well as the position information, health values, firepower values, and the like of enemy units.
And the task vector corresponding to the target task is obtained according to the characteristic subtask and the common subtask of the multiple candidate tasks.
Decomposing each candidate task in the multiple candidate tasks according to the prior knowledge to obtain a subtask corresponding to each candidate task; acquiring characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
It should be noted that, for a specific description of obtaining a task vector corresponding to a target task according to the target task, reference may be made to the related description of step S202, and no description is given here.
In one possible embodiment, generating the sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t includes:
inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain a candidate action of the target task; selecting the target action of the target task from the candidate action of the target task and an action randomly acquired from the action space according to a preset probability, where the probability that the candidate action of the target task is selected is the preset probability; acquiring the state information s_{t+1} of the target task after the target action is executed, and acquiring a reward value vector of the target task according to the state information s_{t+1} of the target task, where the reward values in the reward value vector correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; the sample data of the target task includes the task vector corresponding to the target task, the state information s_t of the target task, the target action of the target task, the state information s_{t+1} of the target task, and the reward value vector of the target task.
In one example, inputting the state information s_t of the target task and the task vector g corresponding to the target task into the decision model M_t for processing to obtain the target action of the target task includes:
the decision model M_t obtains the action value function vector R(s_t, a_k, g) according to the state information s_t of the target task and the task vector g corresponding to the target task, where the action value functions in the action value function vector R(s_t, a_k, g) of the target task correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; the decision model M_t obtains the target action a_t of the target task from the action space according to the action value function vector R(s_t, a_k, g) of the target task and the task vector corresponding to the target task. Because the action value function vector R(s_t, a_k, g) of the target task includes the action value functions of subtasks irrelevant to the target task (for example, the subtasks irrelevant to intersection left turn include intersection right-turn arrival or collision and intersection straight-going collision or arrival), in order to avoid the influence of the action value functions of the subtasks irrelevant to the target task, the action value functions of the subtasks irrelevant to the target task are removed from the action value function vector R(s_t, a_k, g) of the target task according to the task vector g corresponding to the target task to obtain the value function of the target task; and a candidate action is determined from the action space according to the value function, where the candidate action is the action in the action space that maximizes the value function of the target task.
In one possible embodiment, selecting the target action a_t of the target task from the candidate action of the target task and an action randomly acquired from the action space according to the preset probability includes:
when the first parameter is greater than the preset probability, determining the candidate action of the target task as the target action a_t of the target task, where the first parameter is a random number with a value range of [0,1]; when the first parameter is not greater than the preset probability, determining the action randomly acquired from the action space as the target action a_t of the target task.
It should be noted that the initial value of the preset probability is 1 or a large value close to 1, and the preset probability gradually decreases as the number of training iterations increases. Setting the initial value of the preset probability to 1 or a large value close to 1 aims to explore new actions as much as possible in the initial stage of training the decision model and to avoid prematurely settling on the currently optimal action during training.
Optionally, the maximum value of the preset probability is 1, and the minimum value is 0.1.
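A minimal sketch of this action-selection rule is given below; the decay schedule shown is an assumption, since the embodiment only states that the preset probability decreases gradually from 1 (or close to 1) toward 0.1.

    # Illustrative selection between the model's candidate action and a random action,
    # following the rule above (first parameter compared against the preset probability).
    import random

    def select_action(candidate_action, action_space, preset_probability):
        first_parameter = random.random()       # random number in [0, 1]
        if first_parameter > preset_probability:
            return candidate_action             # use the action proposed by the decision model
        return random.choice(action_space)      # otherwise explore a randomly acquired action

    def decay(preset_probability, decay_rate=0.999, minimum=0.1):
        # Assumed multiplicative decay toward the stated minimum of 0.1.
        return max(minimum, preset_probability * decay_rate)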
Optionally, the above value function is a Q value function.
After the target action a_t is executed, the state information s_{t+1} of the target task is acquired, and the reward value vector of the target task is acquired according to the state information s_{t+1} of the target task; the reward values in the reward value vector correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task. For example, for the intersection left-turn task, the reward value vector r(s_{t+1}, a_t, g) can be represented as [r_l, r_s, r_r, r_c], where r_l is the reward value for the subtask intersection left-turn collision or arrival, r_s is the reward value for the subtask intersection straight-going collision or arrival, r_r is the reward value for the subtask intersection right-turn collision or arrival, and r_c is the reward value for the subtask improving traffic efficiency.
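For illustration only, such a reward vector could be constructed as follows; the numeric values and the efficiency term are assumptions and do not correspond to the reward function of the embodiment.

    # Illustrative reward vector [r_l, r_s, r_r, r_c] for the intersection left-turn task,
    # ordered as [left-turn collision/arrival, straight collision/arrival,
    # right-turn collision/arrival, improve traffic efficiency].
    def left_turn_reward(collided, arrived, speed, speed_limit):
        r_l = -1.0 if collided else (1.0 if arrived else 0.0)  # subtask relevant to left turn
        r_s = 0.0                                              # straight-going subtask (irrelevant here)
        r_r = 0.0                                              # right-turn subtask (irrelevant here)
        r_c = speed / speed_limit                              # assumed proxy for passing efficiency
        return [r_l, r_s, r_r, r_c]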
The sample data <s_t, a_t, r, s_{t+1}, g> of the target task is obtained according to the task vector g corresponding to the target task, the state information s_t of the target task, the target action a_t, the state information s_{t+1}, and the reward value vector r, and the sample data <s_t, a_t, r, s_{t+1}, g> of the target task is saved to the preliminary sample database to obtain the first sample database.
It should be noted that the execution subject of the target action may be the same as or different from the execution subject of the decision model. For example, after obtaining the target action, the execution subject of the decision model sends the target action to the execution subject of the target action to execute the target action.
S402, judging whether the decision model M_{t+1} has converged; when the decision model M_{t+1} has converged, determining the decision model M_{t+1} as the target decision model.
In one example, judging whether the decision model M_{t+1} has converged includes:
judging, according to the state information s_{t+1} of the target task, whether execution of the target task has been completed; and when it is determined that execution of the target task has been completed, judging whether the decision model M_{t+1} has converged.
For example, for an intersection scene, the state information s_{t+1} includes the speed v_{e(t+1)} of the host vehicle and the coordinates (x_{i(t+1)}, y_{i(t+1)}), velocity v_{i(t+1)}, and heading angle θ_{i(t+1)} of the surrounding vehicles in the host-vehicle coordinate system. Whether the host vehicle has passed through the intersection can be judged from the speed v_{e(t+1)} of the host vehicle and the coordinates (x_{i(t+1)}, y_{i(t+1)}), velocity v_{i(t+1)}, and heading angle θ_{i(t+1)} of the surrounding vehicles in the host-vehicle coordinate system; if the host vehicle has passed through the intersection, it is determined that the target task has ended; if the host vehicle has not passed through the intersection, it is determined that the target task has not ended.
Further, when it is determined that the target task has not ended, let t = t+1 and repeat the steps of "generating sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, and adding the sample data of the target task to the preliminary sample database to obtain the first sample database" in step S401 and step S402 until the target task ends.
After it is determined that the target task has ended, whether the decision model M_{t+1} has converged is judged.
Whether the decision model M_{t+1} has converged is judged; if it has converged, the decision model M_{t+1} is determined as the target decision model; if it has not converged, steps S401-S402 are repeatedly executed until the decision model M_{t+1} converges.
Alternatively, whether the decision model M_{t+1} has converged is determined by judging whether the accumulated reward value converges. Specifically, the task vector g corresponding to the target task and the state information s_{t+1} are input into the decision model M_{t+1} for processing to obtain a new reference action; more specifically, the action value function vector R(s_{t+1}, a_k, g) is obtained according to the task vector g corresponding to the target task and the state information s_{t+1}, the Q value function Q(s_{t+1}, a_k; g) of the target task is determined according to the action value function vector R(s_{t+1}, a_k, g) and the task vector g corresponding to the target task, and the action in the action space for which the Q value function Q(s_{t+1}, a_k; g) takes its maximum value is determined as the new reference action. The target action a_{t+1} of the target task is selected from the new candidate action and an action randomly acquired from the action space according to a new preset probability, where the probability that the new candidate action is selected is the new preset probability. After the target action a_{t+1} is executed, the state information s_{t+2} is acquired, and the reward value vector r(s_{t+2}, a_{t+1}, g) of the target task is determined according to the reward value function and the state information s_{t+2};
Whether the target task has ended is judged according to the current state information s_{t+2}; when it is determined that the target task has not ended, let t = t+1 and repeat the above steps until the target task ends. The reward values corresponding to the subtasks relevant to the target task in each of the plurality of reward value vectors are accumulated to obtain an accumulated reward value.
If the accumulated reward value converges, it is determined that the decision model M_{t+1} has converged; if the accumulated reward value does not converge, it is determined that the decision model M_{t+1} has not converged.
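One plausible way to realize such a convergence test is sketched below; the windowing and tolerance are assumptions, since the embodiment only states that convergence of the accumulated reward value is checked.

    # Illustrative convergence check based on accumulated (task-masked) reward values.
    import numpy as np

    def episode_return(reward_vectors, g):
        # Sum the rewards of the subtasks relevant to the target task over one episode.
        return sum(np.dot(g, r) for r in reward_vectors)

    def has_converged(episode_returns, window=20, tolerance=1e-2):
        if len(episode_returns) < 2 * window:
            return False
        recent = np.mean(episode_returns[-window:])
        previous = np.mean(episode_returns[-2 * window:-window])
        return abs(recent - previous) < tolerance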
It should be noted that, in the above process, a new target action a_{t+1}, state information s_{t+2}, and reward value vector r(s_{t+2}, a_{t+1}, g) are obtained; sample data is generated according to the task vector g corresponding to the target task, the state information s_{t+1}, the target action a_{t+1}, the state information s_{t+2}, and the reward value vector r(s_{t+2}, a_{t+1}, g), and the sample data is stored in the first sample database to obtain a new sample database.
After the target task ends, the target task is deleted from the plurality of candidate tasks to obtain the remaining candidate tasks; when step S401 needs to be executed again, acquiring a target task from the plurality of candidate tasks specifically means acquiring a new target task from the remaining candidate tasks.
It can be seen that, in the solution of the present application, a task vector composed of the characteristic subtasks and the common subtasks is obtained by jointly characterizing the target task, so that the strategies of multiple candidate tasks can be learned by one model. The common subtasks of the multiple candidate tasks promote the learning of the strategies of the multiple candidate tasks and improve the convergence of the model, while the characteristic subtasks enable targeted learning of the multiple candidate tasks, which avoids mutual interference among the tasks and prevents the model from compromising among them, so that the same model can achieve an excellent effect when making decisions for multiple tasks. By constructing a reward value vector, the execution condition of each subtask of the target task is fed back to the decision model for learning. The value function of the target task is obtained according to the task vector corresponding to the target task and the action value function vector, and the target action is then determined based on the value function, so that when a decision is made for the target task, the selection of the target action is not influenced by the action value functions of subtasks irrelevant to the target task in the action value function vector.
Referring to fig. 5, fig. 5 is a schematic flowchart of another decision model training method based on multi-task learning according to an embodiment of the present application. As shown in fig. 5, the method includes:
S501, determining a target task from a plurality of candidate tasks, and acquiring, according to the target task, a task vector g corresponding to the target task and state information s_t of the target task.
The plurality of tasks are tasks in the same scene; for example, in an intersection scene, the plurality of candidate tasks include intersection left turn, intersection straight-going, and intersection right turn; for another example, in a tactical competitive game scene, the candidate tasks include the longest survival time, the fewest deaths, and the largest number of kills. Of course, the plurality of candidate tasks may also be tasks in different scenes.
The target task is any candidate task among the plurality of candidate tasks that has not yet been used to train the decision model.
Wherein the task vector corresponding to the target task is obtained based on the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
Optionally, before the target task is determined from the multiple candidate tasks, performing task joint characterization on each candidate task in the multiple candidate tasks to obtain a task vector corresponding to each candidate task.
Specifically, task decomposition is performed on each of the plurality of candidate tasks according to prior knowledge to obtain the subtasks of each candidate task; the characteristic subtasks and the common subtasks are acquired according to the subtasks of each of the plurality of candidate tasks; and task joint characterization is performed on each of the plurality of candidate tasks according to the characteristic subtasks and the common subtasks to obtain the task vector corresponding to each candidate task. The elements in the task vector corresponding to each candidate task correspond one-to-one to the characteristic subtasks and the common subtasks, and the different values of each element in the task vector corresponding to a candidate task indicate whether the candidate task includes the subtask corresponding to that element.
For example, the plurality of candidate tasks include intersection left turn, intersection straight-going, and intersection right turn. According to prior knowledge, the intersection left turn can be decomposed into two subtasks, namely intersection left-turn arrival or collision and improving traffic efficiency; the intersection straight-going can be decomposed into two subtasks, namely intersection straight-going arrival or collision and improving traffic efficiency; and the intersection right turn can be decomposed into two subtasks, namely intersection right-turn arrival or collision and improving traffic efficiency. The characteristic subtasks are intersection left-turn arrival or collision, intersection straight-going collision or arrival, and intersection right-turn collision or arrival, and the common subtask is improving traffic efficiency.
In one example, the task vector corresponding to intersection left turn can be represented as [1,0,0,1], the task vector corresponding to intersection straight-going as [0,1,0,1], and the task vector corresponding to intersection right turn as [0,0,1,1], where an element 1 in a vector indicates that the subtasks of the candidate task include the subtask corresponding to that element, and an element 0 indicates that they do not.
In the intersection scene, the state information s_t of the target task includes the information of the host vehicle and the information of the surrounding social vehicles. Optionally, the information of the host vehicle includes the speed v_et of the host vehicle, and the information of a surrounding social vehicle includes the coordinates (x_it, y_it), velocity v_it, and heading angle θ_it of that social vehicle in the host-vehicle coordinate system. Assuming that the surrounding social vehicles include the 5 vehicles closest to the host vehicle, the state information s_t of the target task can be represented as s_t = [v_et, x_1t, y_1t, v_1t, θ_1t, …, x_5t, y_5t, v_5t, θ_5t].
S502, inputting the task vector g corresponding to the target task and the state information s_t of the target task into the decision model M_t for processing to obtain a candidate action, and selecting the target action a_t of the target task from the candidate action of the target task and an action randomly acquired from the action space according to the preset probability.
And the probability that the candidate action of the target task is selected is a preset probability.
The action value function vector R(s_t, a_k, g) of the target task is obtained according to the task vector g corresponding to the target task and the state information s_t of the target task. The action value functions in the action value function vector R(s_t, a_k, g) correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task. An action value function in the action value function vector is used to characterize how good the result is when the action a_k is performed for the corresponding subtask.
It should be noted here that the above decision model M_t is implemented based on a neural network, such as a fully-connected neural network, a convolutional neural network, a pooling neural network, or other forms of neural networks.
The Q value function Q(s_t, a_k; g) of the target task is determined according to the task vector g corresponding to the target task and the action value function vector R(s_t, a_k, g) of the target task.
Because the action value function vector R(s_t, a_k, g) of the target task includes the action value functions of subtasks irrelevant to the target task (for example, the subtasks irrelevant to intersection left turn include intersection right-turn arrival or collision and intersection straight-going collision or arrival), in order to avoid the influence of the action value functions of the subtasks irrelevant to the target task, the action value functions of the subtasks irrelevant to the target task are removed from the action value function vector R(s_t, a_k, g) of the target task according to the task vector g corresponding to the target task to obtain the Q value function of the target task.
The Q value function of the target task can be expressed as: Q(s_t, a_k; g) = g^T R(s_t, a_k, g).
A candidate action is determined from the action space according to the Q value function Q(s_t, a_k; g) of the target task, the candidate action being the action in the action space for which the Q value function Q(s_t, a_k; g) takes its maximum value.
In one possible embodiment, selecting the target action a_t of the target task from the candidate action of the target task and an action randomly acquired from the action space according to the preset probability includes:
when the first parameter is greater than the preset probability, determining the candidate action of the target task as the target action a_t of the target task, where the first parameter is a random number with a value range of [0,1]; when the first parameter is not greater than the preset probability, determining the action randomly acquired from the action space as the target action a_t of the target task.
It should be noted that the initial value of the preset probability is 1 or a value close to 1; the preset probability gradually decreases as the number of training iterations increases; optionally, the maximum value of the preset probability is 1 and the minimum value is 0.1.
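As a sketch (the decay schedule below is an assumption; the method only requires that the preset probability starts near 1 and decreases to about 0.1 as training progresses), the selection rule of S502 can be written as:

```python
import random

def select_target_action(candidate_action, action_space, preset_probability):
    """Select a_t: candidate action if the random first parameter exceeds the
    preset probability, otherwise a uniformly random action from the action space."""
    first_parameter = random.random()          # random number in [0, 1]
    if first_parameter > preset_probability:
        return candidate_action
    return random.choice(action_space)

def decayed_probability(step, p_max=1.0, p_min=0.1, decay=0.999):
    """Preset probability starts at p_max and decreases towards p_min with training."""
    return max(p_min, p_max * decay ** step)

# Example: early in training exploration dominates, later the candidate action dominates.
action_space = [0, 1, 2, 3, 4]
a_t = select_target_action(1, action_space, decayed_probability(step=10))
```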
S503, after the target action a_t is executed, the state information s_{t+1} of the target task is acquired, and the reward value vector r(s_{t+1}, a_t, g) of the target task is obtained according to the state information s_{t+1} and the reward value function.
The reward values in the reward value vector r(s_{t+1}, a_t, g) correspond one to one with the subtasks corresponding to the elements in the task vector of the target task. For example, for a left turn at the intersection, the reward value vector can be expressed as [r_l, r_s, r_r, r_c], where r_l is the reward value of the subtask "left turn at the intersection: collision or arrival", r_s is the reward value of the subtask "going straight at the intersection: collision or arrival", r_r is the reward value of the subtask "right turn at the intersection: collision or arrival", and r_c is the reward value of the subtask "improving passage efficiency".
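Purely as an illustration (the specific reward magnitudes below are assumptions of this sketch, not values prescribed by the method), a reward value vector for the intersection subtasks could be computed like this:

```python
def intersection_reward_vector(left_event, straight_event, right_event, step_time):
    """Return [r_l, r_s, r_r, r_c] for the subtasks
    left/straight/right 'collision or arrival' plus passage efficiency.

    Each *_event is one of 'arrival', 'collision', or None (episode still running).
    """
    def collision_or_arrival(event):
        if event == "arrival":
            return 1.0
        if event == "collision":
            return -1.0
        return 0.0

    r_l = collision_or_arrival(left_event)
    r_s = collision_or_arrival(straight_event)
    r_r = collision_or_arrival(right_event)
    r_c = -0.01 * step_time      # small per-step penalty rewards faster passage
    return [r_l, r_s, r_r, r_c]

# Example: the own vehicle completed its left turn on this step.
r = intersection_reward_vector("arrival", None, None, step_time=1.0)
```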
It should be noted here that the state information s_t and the state information s_{t+1} are the same type of information at different times.
S504, according to the task vector g corresponding to the target task, the state information s_t, the state information s_{t+1}, the target action a_t, and the reward value vector r(s_{t+1}, a_t, g), the sample data <s_t, a_t, r, s_{t+1}, g> of the target task is obtained, and the sample data <s_t, a_t, r, s_{t+1}, g> is stored in the sample database.
It should be noted that the sample database includes sample data of multiple candidate tasks, and for the same candidate task, sample data acquired at different times may be included.
S505, a plurality of pieces of sample data are randomly acquired from the sample database, and a loss value is calculated according to a loss function and the plurality of pieces of sample data; the decision model M_t is adjusted according to the loss value to obtain a decision model M_{t+1}.
The loss value can be expressed as:
L = (g^T r + γ · max_{a'} Q(s_{t+1}, a'; g) − Q(s_t, a_t; g))^2
where r in the formula is the reward value vector in the sample data, g is the task vector in the sample data, a_t is the target action in the sample data, and the discount coefficient γ is a constant.
It should be noted that the plurality of pieces of sample data may all belong to the same task, may all belong to different tasks, or may be a mixture in which some pieces belong to the same task and others belong to different tasks.
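A sketch of the loss computation over a sampled batch, assuming (as this sketch does, not the method text) that the decision model is a callable returning the action value matrix R(s, a, g) for all actions and that the standard DQN-style squared temporal-difference error is used:

```python
import numpy as np

def td_loss(model, batch, gamma=0.99):
    """Mean squared TD error over a batch of <s_t, a_t, r, s_{t+1}, g> samples.

    model(s, g) is assumed to return R(s, a, g) as an array of shape
    (num_actions, num_subtasks), so that Q(s, a; g) = g^T R(s, a, g).
    """
    losses = []
    for s_t, a_t, reward_vector, s_next, g in batch:
        q_now = np.asarray(model(s_t, g)) @ g      # Q(s_t, a; g) for every action a
        q_next = np.asarray(model(s_next, g)) @ g  # Q(s_{t+1}, a; g) for every action a
        target = g @ np.asarray(reward_vector) + gamma * np.max(q_next)
        losses.append((target - q_now[a_t]) ** 2)
    return float(np.mean(losses))
```

The loss value would then be minimized with respect to the parameters of M_t (for example by stochastic gradient descent) to obtain M_{t+1}.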
S506, whether the target task is finished is judged according to the state information s_{t+1}.
For example, for the intersection scenario, the state information s_{t+1} includes the speed v_{e(t+1)} of the own vehicle, and the coordinates (x_{i(t+1)}, y_{i(t+1)}), velocity v_{i(t+1)}, and heading angle θ_{i(t+1)} of the surrounding vehicles in the own-vehicle coordinate system. Whether the own vehicle has passed through the intersection can be judged from these quantities; if the own vehicle has passed through the intersection, it is determined that the target task is finished; if it has not passed through the intersection, it is determined that the target task is not finished.
When it is determined that the target task is finished, step S507 is executed; when it is determined that the target task is not finished, let t = t+1 and execute step S502.
S507, whether the decision model M_{t+1} converges is judged.
If it is determined that the decision model M_{t+1} converges, step S508 is executed; if it is determined that it does not converge, step S501 is executed.
Alternatively, whether the decision model M_{t+1} converges is determined by judging whether the accumulated reward value converges. Specifically, the task vector g corresponding to the target task and the state information s_{t+1} are input into the decision model M_{t+1} to obtain a new reference action: an action value function vector R(s_{t+1}, a_k, g) is obtained according to the task vector g and the state information s_{t+1}; a Q value function Q(s_{t+1}, a_k; g) of the target task is determined according to the action value function vector R(s_{t+1}, a_k, g) and the task vector g corresponding to the target task; and the action in the action space for which Q(s_{t+1}, a_k; g) takes the largest value is determined as the new reference action, i.e., the new candidate action. A new target action a_{t+1} is then selected from the new candidate action and an action randomly acquired from the action space according to a new preset probability, where the probability that the new candidate action is selected is the new preset probability. After the new target action a_{t+1} is executed, the state information s_{t+2} is acquired, and a reward value vector r(s_{t+2}, a_{t+1}, g) of the target task is determined according to the reward value function and the current state information s_{t+2}. Whether the target task is finished is then judged according to the state information s_{t+2}; when it is determined that the target task is not finished, let t = t+1 and the above steps are repeated until the target task ends. The reward values corresponding to the subtasks related to the target task in each of the resulting reward value vectors are accumulated to obtain a reward accumulation value.
If the reward accumulation value converges, it is determined that the decision model M_{t+1} converges; if the reward accumulation value does not converge, it is determined that the decision model M_{t+1} does not converge.
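One simple way to implement this convergence test on the reward accumulation values, assuming (as a design choice of this sketch, not something the method prescribes) that convergence means the accumulated rewards of recent episodes stay within a small tolerance:

```python
def reward_converged(episode_returns, window=20, tolerance=1e-2):
    """Treat the accumulated reward as converged when the last `window`
    episode returns stay within `tolerance` of each other."""
    if len(episode_returns) < window:
        return False
    recent = episode_returns[-window:]
    return max(recent) - min(recent) < tolerance
```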
It should be noted that, in the above process, after a new target action a_{t+1}, state information s_{t+2}, and a reward value vector r(s_{t+2}, a_{t+1}, g) are obtained, sample data is generated according to the task vector g corresponding to the target task, the state information s_{t+1}, the target action a_{t+1}, the state information s_{t+2}, and the reward value vector r(s_{t+2}, a_{t+1}, g), and the sample data is stored in the sample database.
After the target task is finished, the target task is deleted from the plurality of candidate tasks to obtain the remaining candidate tasks; when step S501 needs to be executed again, acquiring a target task from the plurality of candidate tasks specifically means acquiring a new target task from the remaining candidate tasks.
S508, the decision model M_{t+1} is determined as the target decision model, and the target decision model is stored.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a decision device according to an embodiment of the present disclosure. As shown in fig. 6, the decision device 600 includes:
the task and state information obtaining module 601 is configured to obtain a target task and a task vector corresponding to the target task from a plurality of candidate tasks, obtain state information of the target task in a process of training a decision model by the decision model training module 602 and a process of making a decision by the decision module 603 based on the target decision model, and send the target task, the task vector of the target task, and the state information of the target task to the decision model training module 602 and the decision module 603;
a decision model training module 602, configured to update a sample database according to a task vector corresponding to a target task and state information of the target task; randomly acquiring a plurality of sample data from the updated sample database, and training a decision model according to the plurality of sample data to obtain a target decision model; and sends the target decision model to the decision module 603;
the decision module 603 is configured to obtain a target action based on the trained decision model, the state information of the target task, and the task vector corresponding to the target task, and send the target action to the control module 604.
And a control module 604 for performing the target action to complete the target task.
It should be noted that the task and state information obtaining module 601 is specifically configured to execute the relevant contents of steps S201, S401, and S501; the decision model training module 602 is configured to execute the relevant contents of steps S402 and S502-S508; and the decision module 603 is configured to execute the relevant contents of steps S202-S203, which are not described in detail here again.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an on-board device according to an embodiment of the present application. As shown in fig. 7, the in-vehicle apparatus 700 includes:
the environment sensing module 701 is used for sensing surrounding environment information of the own vehicle, for example, the surrounding environment information is obtained by integrating various sensors, and information such as the position and the speed of the own vehicle, the position and the course angle of social vehicles around the own vehicle is obtained; and sends this information as state information to the decision block 703.
A navigation information module 702, configured to determine the navigation information of the vehicle at the intersection, where the navigation information includes information such as the distance to the intersection, the intersection traffic light state, and the driving direction of the vehicle, and the possible steering directions of the vehicle constitute the plurality of candidate tasks; the navigation information is sent to the decision module 703.
A decision module 703, configured to determine a target task, such as a left turn at an intersection, from the multiple candidate tasks according to the navigation information; performing task joint characterization on the target task according to the plurality of candidate tasks to obtain a task vector corresponding to the target task; the task vector corresponding to the target task and the state information obtained by the environment sensing module 701 are input into the decision model for processing, so as to obtain target actions of the target task, such as left-turn acceleration, left-turn deceleration and the like, and the target actions are sent to the vehicle control module 704.
And a vehicle control module 704 for controlling the vehicle to run according to the target action to complete the target task.
It should be noted that, for a specific process of the decision module 703 obtaining the target action based on the state information and the task vector corresponding to the target task, reference may be made to the related descriptions of steps S202 to S203, and no specific description is provided here.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a decision device for multitask learning. As shown in fig. 8, the decision device 800 includes:
an obtaining unit 801, configured to obtain a plurality of candidate tasks, obtain a target task from the plurality of candidate tasks, and acquire state information s_t of the target task according to the target task;
A joint characterization unit 802, configured to perform task joint characterization on a target task according to multiple candidate tasks to obtain a task vector corresponding to the target task, where the task vector corresponding to each candidate task is obtained based on a characteristic subtask and a common subtask of the multiple candidate tasks;
a determining unit 803, configured to determine the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
In a possible embodiment, the joint characterization unit 802 is specifically configured to:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task; extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
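A sketch of this task joint characterization, assuming (for this sketch only) that each subtask is identified by a string and that the task vector is a 0/1 indicator over the union of the characteristic and common subtasks:

```python
def joint_characterization(candidate_tasks):
    """candidate_tasks: dict mapping task name -> set of its subtasks.

    Returns (subtask_order, task_vectors) where each task vector marks with 1
    the subtasks relevant to that task (its characteristic subtasks plus the
    common subtasks shared by all candidate tasks).
    """
    all_subtasks = set.union(*candidate_tasks.values())
    common = set.intersection(*candidate_tasks.values())
    characteristic = all_subtasks - common
    subtask_order = sorted(characteristic) + sorted(common)
    task_vectors = {
        name: [1 if st in subtasks else 0 for st in subtask_order]
        for name, subtasks in candidate_tasks.items()
    }
    return subtask_order, task_vectors

# Example: three intersection tasks sharing a "passage efficiency" subtask.
tasks = {
    "left":     {"left: collision or arrival", "passage efficiency"},
    "straight": {"straight: collision or arrival", "passage efficiency"},
    "right":    {"right: collision or arrival", "passage efficiency"},
}
order, vectors = joint_characterization(tasks)   # e.g. vectors["left"] == [1, 0, 0, 1]
```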
In one possible embodiment, in terms of determining the target action according to the task vector corresponding to the target task and the state information s_t of the target task, the determining unit 803 is specifically configured to:
input the task vector corresponding to the target task and the state information s_t of the target task into a target decision model for processing to obtain the target action of the target task, where the target decision model is implemented based on a neural network.
In a possible embodiment, the determining unit 803 is specifically configured to:
obtain an action value function vector of the target task according to the task vector corresponding to the target task and the state information s_t of the target task, where the action value functions in the action value function vector correspond one to one with the subtasks corresponding to the elements in the task vector of the target task; obtain a value function of the target task according to the task vector corresponding to the target task and the action value function vector; and obtain the target action from the action space according to the value function of the target task, where the target action is the action that maximizes the value function of the target task in the action space.
It should be noted that the above units (the obtaining unit 801, the joint characterization unit 802, and the determination unit 803) are used for executing relevant contents of the methods shown in the above steps S201 to S203.
In the present embodiment, the decision device 800 is presented in the form of a unit. As used herein, a unit may refer to a specific application-specific integrated circuit (ASIC), a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or other devices that may provide the described functionality. Furthermore, the obtaining unit 801, the joint characterization unit 802, and the determination unit 803 in the above decision device 800 may be implemented by the processor 1000 of the decision device shown in fig. 10.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a decision model training apparatus according to an embodiment of the present application. As shown in fig. 9, the training apparatus 900 includes:
an obtaining unit 901 that obtains a plurality of sample data from a first sample database at random; the first sample database comprises sample data of a plurality of candidate tasks, the sample data of the target task comprises a task vector corresponding to the target task, the task vector corresponding to the target task is obtained based on a common subtask and a characteristic subtask in the candidate tasks, and the target task is any one of the candidate tasks;
an adjusting unit 902, configured to adjust the decision model M_t according to the plurality of sample data by using a reinforcement learning method, to obtain a decision model M_{t+1};
a determining unit 903, configured to, when the decision model M_{t+1} converges, determine the decision model M_{t+1} as the target decision model.
In one possible embodiment of the present invention,
the obtaining unit 901 is configured to obtain a target task from the plurality of candidate tasks, obtain state information s_t of the target task, and acquire a task vector corresponding to the target task,
the training apparatus 900 further comprises:
an updating unit 904, configured to generate sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, and to add the sample data of the target task to the preliminary sample database to obtain the first sample database.
In a possible embodiment, in terms of obtaining a task vector corresponding to a target task, the obtaining unit 901 is specifically configured to:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task; extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
In a possible embodiment, in terms of generating the sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, the updating unit 904 is specifically configured to:
input the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain a candidate action of the target task; select the target action of the target task from the candidate action of the target task and an action randomly acquired from the action space according to a preset probability, where the probability that the candidate action of the target task is selected is the preset probability; after the target action is executed, obtain state information s_{t+1} of the target task, and acquire a reward value vector of the target task according to the state information s_{t+1} of the target task, where the reward values in the reward value vector correspond one to one with the subtasks corresponding to the elements in the task vector corresponding to the target task;
the sample data of the target task includes the task vector corresponding to the target task, the state information s_t of the target task, the target action of the target task, the state information s_{t+1} of the target task, and the reward value vector of the target task.
In one possible embodiment, in terms of inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain the candidate action of the target task, the updating unit 904 is specifically configured to:
make the decision model M_t acquire the action value function vector of the target task according to the state information s_t of the target task and the task vector corresponding to the target task, where the action value functions in the action value function vector of the target task correspond one to one with the subtasks corresponding to the elements in the task vector corresponding to the target task; acquire the value function of the target task according to the action value function vector of the target task and the task vector corresponding to the target task; and acquire the candidate action of the target task from the action space according to the value function of the target task, where the candidate action is the action that maximizes the value function of the target task in the action space.
In a possible embodiment, in terms of selecting the target action of the target task from the candidate actions of the target task and the actions randomly obtained from the action space according to the preset probability, the updating unit 904 is specifically configured to:
when the first parameter is greater than the preset probability, determining the candidate action of the target task as the target action of the target task; the first parameter is a random number with a value range of [0,1 ]; and when the first parameter is not greater than the preset probability, determining the action randomly acquired from the action space as the target action of the target task.
In a possible embodiment, the adjusting unit 902 is specifically configured to:
calculate the loss value according to the loss function and the state information s_t, the task vector, the target action, the state information s_{t+1}, and the reward value vector of each piece of sample data in the plurality of sample data; and adjust the decision model M_t according to the loss value to obtain the decision model M_{t+1}.
The units (the acquiring unit 901, the adjusting unit 902, the determining unit 903, and the updating unit 904) are configured to execute relevant contents of the methods shown in the steps S401 and S402 and the steps S501 to S508.
In this embodiment, the training apparatus 900 is presented in the form of a unit. As used herein, a unit may refer to a specific application-specific integrated circuit (ASIC), a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or other devices that may provide the described functionality. Further, the acquiring unit 901, the adjusting unit 902, the determining unit 903, and the updating unit 904 in the above training apparatus 900 may be implemented by the processor 1100 of the training apparatus shown in fig. 11.
The decision-making means shown in fig. 10 may be implemented in the structure of fig. 10, and the decision-making means 1000 comprises at least one processor 1001, at least one memory 1002 and at least one communication interface 1003. The processor 1001, the memory 1002, and the communication interface 1003 are connected by a communication bus and perform communication with each other.
The memory 1002 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that may store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disk read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integral to the processor.
The memory 1002 is used for storing application program codes for executing the above schemes, and the execution is controlled by the processor 1001. The processor 1001 is used to execute the application code stored in the memory 1002.
The memory 1002 stores code that may perform one of the multi-task learning based decision methods provided above.
The processor 1001 may also use one or more integrated circuits for executing related programs, so as to implement the decision method based on multi-task learning according to the embodiments of the present application.
The processor 1001 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the decision method of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1001. In implementation, the steps of the training method for the state generation model and the selection strategy of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1001. The processor 1001 may also be a general purpose processor, a Digital Signal Processor (DSP), an ASIC, an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. The various methods, steps and block diagrams of modules disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 1002, and the processor 1001 reads information in the memory 1002 and completes the decision method according to the embodiment of the present application in combination with hardware thereof.
The communication interface 1003 enables communication between the decision-making device and other devices or communication networks using a transceiver device such as, but not limited to, a transceiver. For example, the status information may be acquired through the communication interface 1003, or the target action may be transmitted to an execution apparatus (such as a control device of a vehicle).
A bus may include a pathway to transfer information between various components of the device (e.g., memory 1002, processor 1001, communication interface 1003). In one possible embodiment, the processor 1001 specifically performs the following steps:
acquiring a plurality of candidate tasks, and acquiring a target task from the plurality of candidate tasks; acquiring the state information s_t of the target task according to the target task; performing task joint characterization on the target task according to the plurality of candidate tasks to obtain a task vector corresponding to the target task, where the task vector corresponding to each candidate task is obtained based on the characteristic subtasks and the common subtasks of the plurality of candidate tasks; and determining the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
The decision model training apparatus shown in fig. 11 can be implemented in the structure of fig. 11, and the training apparatus 1100 includes at least one processor 1101, at least one memory 1102 and at least one communication interface 1103. The processor 1101, the storage 1102 and the communication interface 1103 are connected by a communication bus and perform communication with each other.
The memory 1102 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that may store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disc, laser disc, optical disc, digital versatile disc, blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integral to the processor.
The memory 1102 is used for storing application program codes for executing the above schemes, and the execution of the application program codes is controlled by the processor 1101. The processor 1101 is configured to execute the application code stored in the memory 1102.
The memory 1102 stores code that performs one of the multi-task learning based decision model training methods provided above.
The processor 1101 may also use one or more integrated circuits for executing related programs, so as to implement the decision model training method based on multi-task learning according to the embodiments of the present application.
The processor 1101 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the decision model training method of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1101. In implementation, the steps of the training method of the state generation model and the selection strategy of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1101. The processor 1101 may also be a general purpose processor, a Digital Signal Processor (DSP), an ASIC, an FPGA (field programmable gate array) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The various methods, steps and block diagrams of modules disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 1102, and the processor 1101 reads the information in the memory 1102 and completes the decision model training method of the embodiment of the present application in combination with the hardware thereof.
The communication interface 1103 uses a transceiver device, such as, but not limited to, a transceiver, to enable communication between the decision model training device and other devices or communication networks. For example, sample data used in model training may be acquired through the communication interface 1103, or a trained decision model may be transmitted to the decision device.
A bus may include a pathway to transfer information between various components of the device (e.g., memory 1102, processor 1101, communication interface 1103). In one possible embodiment, the processor 1101 performs the following steps:
acquiring a target task from a plurality of candidate tasks; obtaining state information s_t of the target task, and acquiring a task vector corresponding to the target task, where the task vector corresponding to the target task is obtained based on the common subtasks and the characteristic subtasks of the plurality of candidate tasks; generating sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, and adding the sample data of the target task to the preliminary sample database to obtain a first sample database; randomly acquiring a plurality of sample data from the first sample database, where the plurality of sample data are sample data of some or all of the plurality of candidate tasks; adjusting the decision model M_t according to the plurality of sample data by using a reinforcement learning method to obtain a decision model M_{t+1}; judging whether the decision model M_{t+1} converges; and when the decision model M_{t+1} converges, determining the decision model M_{t+1} as the target decision model.
Embodiments of the present application provide a computer storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform part or all of the steps of any one of the above-described method embodiments of a multi-task learning based decision method or a decision model training method.
In some embodiments, the disclosed methods may be implemented as computer program instructions encoded on a computer-readable storage medium in a machine-readable format or encoded on other non-transitory media or articles of manufacture. Fig. 12 schematically illustrates a conceptual partial view of an example computer program product comprising a computer program for executing a computer process on a computing device, arranged in accordance with at least some embodiments presented herein. In one embodiment, the example computer program product 1200 is provided using a signal bearing medium 1201. The signal bearing medium 1201 may include one or more program instructions 1202 that, when executed by one or more processors, may provide the functions or portions of the functions described above with respect to fig. 2, 4, and 5. Thus, for example, referring to the embodiment shown in FIG. 2, one or more features of steps S201-203 may be undertaken by one or more instructions associated with the signal bearing medium 1201. Further, program instructions 1202 in FIG. 12 also describe example instructions.
In some examples, signal bearing medium 1201 may include a computer readable medium 1203, such as, but not limited to, a hard disk drive, a Compact Disc (CD), a Digital Video Disc (DVD), a digital tape, a Memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. In some implementations, the signal bearing medium 1201 may include a computer recordable medium 1204 such as, but not limited to, a memory, a read/write (R/W) CD, a R/W DVD, and so forth. In some implementations, the signal bearing medium 1201 can include a communication medium 1205 such as, but not limited to, a digital and/or analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, a wireless communication link, etc.). Thus, for example, signal bearing medium 1201 may be conveyed by a wireless form of communication medium 1205 (e.g., a wireless communication medium conforming to the IEEE 802.11 standard or other transmission protocol). The one or more program instructions 1202 may be, for example, computer-executable instructions or logic-implemented instructions. In some examples, a computing device such as described with respect to fig. 2, 4, and 5 may be configured to provide various operations, functions, or actions in response to program instructions 1202 conveyed to the computing device by one or more of computer readable medium 1203, computer recordable medium 1204, and/or communications medium 1205. It should be understood that the arrangements described herein are for illustrative purposes only. Thus, those skilled in the art will appreciate that other arrangements and other elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used instead, and that some elements may be omitted altogether depending upon the desired results. In addition, many of the described elements are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, in any suitable combination and location.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a memory, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned memory comprises: various media capable of storing program codes, such as a U disk, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in view of the above, the content of the present specification should not be construed as a limitation to the present application.
Claims (27)
1. A decision model training method based on multi-task learning is characterized by comprising the following steps:
randomly acquiring a plurality of sample data from a first sample database; the first sample database comprises sample data of a plurality of candidate tasks, the sample data of a target task comprises a task vector corresponding to the target task, the task vector corresponding to the target task is obtained based on a common subtask and a characteristic subtask in the candidate tasks, and the target task is any one of the candidate tasks;
adjusting the decision model M_t according to the plurality of sample data by using a reinforcement learning method, to obtain a decision model M_{t+1};
when the decision model M_{t+1} converges, determining the decision model M_{t+1} as a target decision model.
2. The method of claim 1, wherein prior to randomly obtaining a plurality of sample data from a first sample database, the method further comprises:
acquiring a target task from the plurality of candidate tasks;
acquiring state information s_t of the target task according to the target task, and acquiring a task vector corresponding to the target task according to the target task;
generating sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, and adding the sample data of the target task to a primary sample database to obtain the first sample database.
3. The method according to claim 2, wherein the obtaining a task vector corresponding to the target task according to the target task includes:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task;
extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task of the multiple candidate tasks;
and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the candidate tasks.
4. The method according to claim 2 or 3, wherein the generating sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t comprises:
inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain a candidate action of the target task;
selecting the target action of the target task from the candidate action of the target task and the action randomly obtained from the action space according to a preset probability, wherein the probability of selecting the candidate action is the preset probability;
after the target action is executed, acquiring state information s_{t+1} of the target task, and obtaining a reward value vector of the target task according to the state information s_{t+1} of the target task, wherein the reward values in the reward value vector correspond one to one with the subtasks corresponding to the elements in the task vector corresponding to the target task;
the sample data of the target task comprises the task vector corresponding to the target task, the state information s_t of the target task, the target action of the target task, the state information s_{t+1} of the target task, and the reward value vector of the target task.
5. The method of claim 4, wherein the inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain the candidate action of the target task comprises:
the decision model M_t acquires an action value function vector of the target task according to the state information s_t of the target task and the task vector corresponding to the target task;
the decision model M_t acquires a value function of the target task according to the action value function vector of the target task and the task vector corresponding to the target task;
and the decision model M_t acquires a candidate action of the target task from an action space according to the value function of the target task, wherein the candidate action is an action that maximizes the value of the value function of the target task in the action space.
6. The method according to claim 4 or 5, wherein the selecting the target action of the target task from the candidate actions of the target task and the actions randomly obtained from the action space according to a preset probability comprises:
when the first parameter is larger than the preset probability, determining the candidate action of the target task as the target action of the target task; wherein, the first parameter is a random number with a value range of [0,1 ];
and when the first parameter is not greater than the preset probability, determining the action randomly acquired from the action space as the target action of the target task.
7. The method according to any one of claims 4-6, wherein the adjusting the decision model M_t according to the plurality of sample data to obtain the decision model M_{t+1} comprises:
calculating the loss value according to a loss function and the state information s_t, the task vector, the target action, the state information s_{t+1}, and the reward value vector of each piece of sample data in the plurality of sample data;
adjusting the decision model M_t according to the loss value to obtain the decision model M_{t+1}.
8. A decision method based on multi-task learning is characterized by comprising the following steps:
acquiring a plurality of candidate tasks, and acquiring a target task from the plurality of candidate tasks; and acquiring state information s_t of the target task according to the target task;
Performing task joint characterization on the target task according to the candidate tasks to obtain a task vector corresponding to the target task, wherein the task vector corresponding to each candidate task is obtained based on the characteristic subtasks and the common subtasks of the candidate tasks;
determining the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
9. The method according to claim 8, wherein the performing task joint characterization on the target task according to the plurality of candidate tasks to obtain a task vector corresponding to the target task comprises:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task;
extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task of the multiple candidate tasks;
and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the candidate tasks.
10. The method according to claim 8 or 9, wherein the determining the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task comprises:
inputting the task vector corresponding to the target task and the state information s_t of the target task into a target decision model for processing to obtain the target action, wherein the target decision model is implemented based on a neural network.
11. The method according to claim 10, wherein the inputting the task vector corresponding to the target task and the state information s_t of the target task into the target decision model for processing to obtain the target action comprises:
the target decision model acquires an action value function vector of the target task according to the task vector corresponding to the target task and the state information s_t of the target task, wherein the action value functions in the action value function vector correspond one to one with the subtasks corresponding to the elements in the task vector corresponding to the target task;
the target decision model obtains a value function of the target task according to the task vector corresponding to the target task and the action value function vector;
and the target decision model acquires a target action from the action space according to a value function of a target task, wherein the target action is an action which enables the value of the value function of the target task to be maximum in the action space.
12. A decision model training device based on multi-task learning is characterized by comprising:
an acquisition unit, configured to acquire a plurality of sample data from a first sample database, wherein the first sample database comprises sample data of a plurality of candidate tasks, the sample data of a target task comprises a task vector corresponding to the target task, the task vector corresponding to the target task is obtained based on a common subtask and a characteristic subtask in the plurality of candidate tasks, and the target task is any one of the plurality of candidate tasks;
an adjusting unit, configured to adjust the decision model M_t according to the plurality of sample data by using a reinforcement learning method, to obtain a decision model M_{t+1};
a determining unit, configured to, when the decision model M_{t+1} converges, determine the decision model M_{t+1} as a target decision model.
13. The apparatus of claim 12,
the acquisition unit is further configured to acquire a target task from the plurality of candidate tasks, acquire state information s_t of the target task according to the target task, and acquire a task vector corresponding to the target task according to the target task, wherein the task vector corresponding to the target task is obtained based on the common subtasks and the characteristic subtasks of the plurality of candidate tasks,
the device further comprises:
an updating unit, configured to generate sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, and to add the sample data of the target task to a primary sample database to obtain the first sample database.
14. The apparatus according to claim 13, wherein, in the aspect of obtaining the task vector corresponding to the target task according to the target task, the obtaining unit is specifically configured to:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task;
extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task of the multiple candidate tasks;
and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the candidate tasks.
15. The apparatus according to claim 13 or 14, wherein, in terms of generating the sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, the updating unit is specifically configured to:
input the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain a candidate action of the target task;
selecting the target action of the target task from the candidate action of the target task and the action randomly obtained from the action space according to a preset probability, wherein the probability of selecting the candidate action is the preset probability;
after the target action is executed, acquire state information s_{t+1} of the target task, and acquire a reward value vector of the target task according to the state information s_{t+1} of the target task, wherein the reward values in the reward value vector correspond one to one with the subtasks corresponding to the elements in the task vector corresponding to the target task;
the sample data of the target task comprises the task vector corresponding to the target task, the state information s_t of the target task, the target action of the target task, the state information s_{t+1} of the target task, and the reward value vector of the target task.
16. The apparatus of claim 15, wherein, in terms of inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain the candidate action of the target task, the updating unit is specifically configured to:
make the decision model M_t acquire an action value function vector of the target task according to the state information s_t of the target task and the task vector corresponding to the target task; acquire a value function of the target task according to the action value function vector of the target task and the task vector corresponding to the target task; and acquire a candidate action of the target task from an action space according to the value function of the target task, wherein the candidate action is an action that maximizes the value of the value function of the target task in the action space.
17. The apparatus according to claim 15 or 16, wherein, in terms of selecting the target action of the target task from the candidate action of the target task and the action randomly obtained from the action space according to the preset probability, the updating unit is specifically configured to:
when the first parameter is larger than the preset probability, determining the candidate action of the target task as the target action of the target task; the first parameter is a random number with a value range of [0,1 ];
and when the first parameter is not greater than the preset probability, determining the action randomly acquired from the action space as the target action of the target task.
18. The apparatus according to any one of claims 15 to 17, wherein the adjusting unit is specifically configured to:
calculate a loss value according to a loss function and the state information s_t, the task vector, the target action, the state information s_{t+1}, and the reward value vector of each piece of sample data in the plurality of sample data;
and adjust the decision model M_t according to the loss value to obtain the decision model M_{t+1}.
19. A decision-making device based on multitask learning, comprising:
an acquisition unit, configured to acquire a plurality of candidate tasks, acquire a target task from the plurality of candidate tasks, and acquire state information s_t of the target task according to the target task;
The joint characterization unit is used for performing task joint characterization on the target task according to the candidate tasks to obtain a task vector corresponding to the target task, wherein the task vector corresponding to each candidate task is obtained based on the characteristic subtasks and the common subtasks of the candidate tasks;
a determining unit, configured to determine the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
20. The apparatus according to claim 19, wherein the joint characterization unit is specifically configured to:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task;
extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task of the multiple candidate tasks;
and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the candidate tasks.
21. The apparatus according to claim 19 or 20, wherein the determining unit is specifically configured to:
input the task vector corresponding to the target task and the state information s_t of the target task into a target decision model for processing to obtain the target action of the target task, wherein the target decision model is implemented based on a neural network.
22. The apparatus according to claim 21, wherein the determining unit is specifically configured to:
acquire an action value function vector of the target task according to the task vector corresponding to the target task and the state information s_t of the target task, wherein the action value functions in the action value function vector correspond one to one with the subtasks corresponding to the elements in the task vector corresponding to the target task;
acquiring a value function of the target task according to the task vector corresponding to the target task and the action value function vector;
and acquiring a target action from the action space according to the value function of the target task, wherein the target action is an action which enables the value function of the target task to be maximum in the action space.
23. A decision model training apparatus, comprising:
a memory to store instructions; and
a processor coupled with the memory;
wherein the processor, when executing the instructions, performs the method of any one of claims 1-7.
24. A decision-making device based on multitask learning, comprising:
a memory to store instructions; and
a processor coupled with the memory;
wherein the processor, when executing the instructions, performs the method of any one of claims 8-11.
25. A chip system, wherein the chip system is applied to an electronic device; the chip system comprises one or more interface circuits, and one or more processors; the interface circuit and the processor are interconnected through a line; the interface circuit is to receive a signal from a memory of the electronic device and to send the signal to the processor, the signal comprising computer instructions stored in the memory; the electronic device performs the method of any one of claims 1-11 when the processor executes the computer instructions.
26. A computer storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1-11.
27. A computer program product comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the method of any one of claims 1-11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010660005.2A CN111950726A (en) | 2020-07-09 | 2020-07-09 | Decision method based on multi-task learning, decision model training method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111950726A (en) | 2020-11-17 |
Family
ID=73340421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010660005.2A Pending CN111950726A (en) | 2020-07-09 | 2020-07-09 | Decision method based on multi-task learning, decision model training method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111950726A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200090048A1 (en) * | 2017-05-19 | 2020-03-19 | Deepmind Technologies Limited | Multi-task neural network systems with task-specific policies and a shared policy |
CN109447259A (en) * | 2018-09-21 | 2019-03-08 | 北京字节跳动网络技术有限公司 | Multitasking and multitasking model training method, device and hardware device |
CN109657696A (en) * | 2018-11-05 | 2019-04-19 | 阿里巴巴集团控股有限公司 | Multitask supervised learning model training, prediction technique and device |
CN109901572A (en) * | 2018-12-13 | 2019-06-18 | 华为技术有限公司 | Automatic Pilot method, training method and relevant apparatus |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112494949A (en) * | 2020-11-20 | 2021-03-16 | 超参数科技(深圳)有限公司 | Intelligent agent action strategy making method, server and storage medium |
CN112494949B (en) * | 2020-11-20 | 2023-10-31 | 超参数科技(深圳)有限公司 | Intelligent body action policy making method, server and storage medium |
CN113033805A (en) * | 2021-03-30 | 2021-06-25 | 北京字节跳动网络技术有限公司 | Control method, device and equipment for multi-compound task execution and storage medium |
WO2022221979A1 (en) * | 2021-04-19 | 2022-10-27 | 华为技术有限公司 | Automated driving scenario generation method, apparatus, and system |
CN113569464A (en) * | 2021-06-21 | 2021-10-29 | 国网山东省电力公司电力科学研究院 | Wind turbine generator oscillation mode prediction method and device based on deep learning network and multi-task learning strategy |
WO2024067115A1 (en) * | 2022-09-28 | 2024-04-04 | 华为技术有限公司 | Training method for gflownet, and related apparatus |
CN118332392A (en) * | 2024-06-14 | 2024-07-12 | 江西财经大学 | Multi-task psychological health identification method and system integrating priori knowledge and expert network |
CN118332392B (en) * | 2024-06-14 | 2024-08-13 | 江西财经大学 | Multi-task psychological health identification method and system integrating priori knowledge and expert network |
Similar Documents
Publication | Title |
---|---|
CN110379193B (en) | Behavior planning method and behavior planning device for automatic driving vehicle |
CN109901574B (en) | Automatic driving method and device |
CN109901572B (en) | Automatic driving method, training method and related device |
WO2021102955A1 (en) | Path planning method for vehicle and path planning apparatus for vehicle |
WO2022001773A1 (en) | Trajectory prediction method and apparatus |
US20220080972A1 (en) | Autonomous lane change method and apparatus, and storage medium |
JP6963158B2 (en) | Centralized shared autonomous vehicle operation management |
CN111950726A (en) | Decision method based on multi-task learning, decision model training method and device |
CN113835421B (en) | Method and device for training driving behavior decision model |
JP2023508114A (en) | AUTOMATED DRIVING METHOD, RELATED DEVICE, AND COMPUTER-READABLE STORAGE MEDIUM |
CN110371132B (en) | Driver takeover evaluation method and device |
WO2021212379A1 (en) | Lane line detection method and apparatus |
CN112534483B (en) | Method and device for predicting vehicle exit |
CN113167038B (en) | Method and device for vehicle to pass through barrier gate cross bar |
WO2022016901A1 (en) | Method for planning driving route of vehicle, and intelligent vehicle |
WO2022017307A1 (en) | Autonomous driving scenario generation method, apparatus and system |
CN114440908A (en) | Method and device for planning vehicle driving path, intelligent vehicle and storage medium |
CN113552867A (en) | Planning method of motion trail and wheel type mobile equipment |
WO2019088977A1 (en) | Continual planning and metareasoning for controlling an autonomous vehicle |
CN114261404A (en) | Automatic driving method and related device |
CN113859265A (en) | Reminding method and device in driving process |
CN115039095A (en) | Target tracking method and target tracking device |
CN113799794A (en) | Method and device for planning longitudinal motion parameters of vehicle |
CN113552869A (en) | Method for optimizing decision rules, method for controlling vehicle driving and related device |
CN114103950A (en) | Lane changing track planning method and device |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |