CN111950726A - Decision method based on multi-task learning, decision model training method and device - Google Patents
- Publication number
- CN111950726A (application number CN202010660005.2A)
- Authority
- CN
- China
- Prior art keywords
- task
- target
- target task
- action
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/16—Anti-collision systems
Abstract
The application discloses a decision method based on multi-task learning, a decision model training method and devices thereof in the field of artificial intelligence. The decision model training method includes the following steps: randomly acquiring a plurality of sample data from a first sample database, where the first sample database includes sample data of a plurality of candidate tasks, the sample data of a target task includes a task vector corresponding to the target task, the task vector corresponding to the target task is obtained based on the common subtasks and the characteristic subtasks of the plurality of candidate tasks, and the target task is any one of the plurality of candidate tasks; adjusting a decision model M_t according to the plurality of sample data to obtain a decision model M_t+1; judging whether the decision model M_t+1 converges; and when the decision model M_t+1 converges, determining the decision model M_t+1 as the target decision model. By adopting the embodiments of the application, the decision effect and the convergence capability of the decision model are improved, and mutual interference among multiple tasks is avoided.
Description
Technical Field
The application relates to the field of artificial intelligence, in particular to a decision-making method based on multi-task learning, a decision-making model training method and a device.
Background
Reinforcement learning is an important branch of the field of artificial intelligence and has surpassed the ability of ordinary humans on certain tasks. However, for a reinforcement learning algorithm, the model obtained after one round of training can only be used for one specific task; if the model is to be applied to a new task, it must be retrained to obtain a new model. This means that although the training algorithm is general, each trained model can only be applied to a specific task scenario.
With the increasing industrial application of reinforcement learning algorithms, many application scenarios are no longer satisfied with a reinforcement learning model that handles a single task; instead, the model is required to achieve good results in a multi-task scenario. A multi-task setting means that the reinforcement learning algorithm has to learn multiple Markov models whose state transition probabilities are not unique, so the algorithm converges poorly or may not converge at all. Moreover, because the reward mechanisms of different tasks differ, a simple task can quickly dominate what the model learns, while other, sparsely rewarded tasks are not explored enough, leading to unbalanced learning and a poor overall model. In view of the above, there is a need for a reinforcement learning algorithm that can learn multiple tasks simultaneously.
One existing solution balances the limited resources of a single learning algorithm across tasks to satisfy multi-task learning, but this balancing makes many such learning algorithms less effective. For example, in the learning process the reward values of some tasks are large, so the algorithm focuses on the tasks with prominent reward values at the cost of generality, and the other tasks cannot achieve good results. Other algorithms unify the reward values of the tasks by reward reduction, which may change the optimization goal: if the reward values are all large non-negative values, reduction turns the objective into optimizing the frequency of obtaining rewards rather than the accumulated expected reward. Moreover, the balance of the algorithm among tasks depends on the magnitude of the reward values and the reward density, so reward reduction still leaves the algorithm unbalanced across different tasks.
Another solution is distillation-based learning: a student network is constructed under the supervision of expert networks, each of which has learned a specific task. This learning algorithm yields a compromise multi-task policy, and each expert network must be obtained in advance through large-scale training. Although this avoids the problem of unbalanced reward values, the algorithm still trades off among the multiple tasks, its learning effect is not ideal, and its performance is limited by the expert networks and cannot be further improved.
Disclosure of Invention
In the solutions of the embodiments of this application, the tasks are jointly characterized to obtain task vectors built from characteristic subtasks and common subtasks. During model training, mutual interference among the tasks can be avoided: the common subtasks promote policy learning across the tasks, while the characteristic subtasks support task-specific learning, improving the multi-task policy effect and the convergence speed of the model. During decision making, the same model can be used to make decisions for multiple tasks, and mutual interference among the tasks is avoided.
In a first aspect, an embodiment of the present application provides a training method for a decision model based on multi-task learning, including:
s1: acquiring a target task from a plurality of candidate tasks; and acquiring the state information s of the target task according to the target tasktAnd acquiring a task vector corresponding to the target task according to the target task, wherein the task vector corresponding to the target task is based onThe common subtask and the characteristic subtask of the multiple candidate tasks are obtained; s2: according to the state information s of the target tasktTask vector corresponding to target task and decision model MtGenerating sample year data of the target task, and adding sample data of the target task to a primary sample database to obtain a first sample database; s3: randomly acquiring a plurality of sample data from a first sample database; the multiple sample data are sample data of part or all of the multiple candidate tasks; s4: adjusting a decision model M according to a plurality of sample data by using a reinforcement learning methodtTo obtain a decision model Mt+1(ii) a S5: decision model Mt+1Whether to converge; when decision model Mt+1Upon convergence, the decision model M is determinedt+1Is a target decision model.
A common subtask is a subtask shared by the subtasks of the plurality of candidate tasks, and a characteristic subtask is a subtask unique to one candidate task among the subtasks of the plurality of candidate tasks. For example, in an intersection scene, the plurality of candidate tasks may include going straight through the intersection, turning left at the intersection, and turning right at the intersection. Going straight through the intersection includes two subtasks: collision or arrival when going straight, and improving traffic efficiency. Turning left at the intersection includes two subtasks: collision or arrival when turning left, and improving traffic efficiency. Turning right at the intersection includes two subtasks: collision or arrival when turning right, and improving traffic efficiency. Collision or arrival when going straight, collision or arrival when turning left, and collision or arrival when turning right are characteristic subtasks, while improving traffic efficiency is a common subtask.
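The joint characterization in this example can be illustrated with a minimal sketch. The following Python snippet is a hypothetical illustration (the subtask names, vector layout and 0/1 encoding are assumptions, not taken from the patent) of how a task vector for the intersection scene could be assembled from one characteristic subtask per candidate task plus a shared common subtask:

```python
# Hypothetical sketch: joint characterization of the three intersection tasks.
# Vector layout (an assumption): one slot per characteristic subtask, followed
# by the common subtask(s) shared by all candidate tasks.
CHARACTERISTIC_SUBTASKS = ["straight_collision_or_arrival",
                           "left_turn_collision_or_arrival",
                           "right_turn_collision_or_arrival"]
COMMON_SUBTASKS = ["improve_traffic_efficiency"]
SUBTASKS = CHARACTERISTIC_SUBTASKS + COMMON_SUBTASKS

def task_vector(target_task: str) -> list[float]:
    """Return the task vector g for one candidate task.

    The slot of the target task's own characteristic subtask and the slots of
    the common subtasks are set to 1; the characteristic subtasks of the other
    candidate tasks are set to 0, so they do not influence this task.
    """
    active = {f"{target_task}_collision_or_arrival"} | set(COMMON_SUBTASKS)
    return [1.0 if s in active else 0.0 for s in SUBTASKS]

print(task_vector("left_turn"))   # [0.0, 1.0, 0.0, 1.0]
```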
Alternatively, the reinforcement learning method may be a reinforcement learning method based on a value function.
The task vector of any one of the plurality of candidate tasks is obtained based on the characteristic subtasks and the common subtasks of the candidate tasks, so that the strategies of the plurality of candidate tasks can be learned by one model. The common subtasks of the plurality of candidate tasks promote learning of the strategies of the candidate tasks and improve the convergence capability of the model, while the characteristic subtasks are used for targeted learning of the individual candidate tasks, avoiding mutual interference among the multiple tasks and avoiding compromise of the model among the multiple tasks, so that the same model can achieve a good effect when making decisions for multiple tasks.
In a possible embodiment, obtaining a task vector corresponding to a target task according to the target task includes:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task; extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
In one possible embodiment, generating the sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task and the decision model M_t includes:

inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain a candidate action of the target task; selecting the target action of the target task from the candidate action of the target task and an action randomly acquired from the action space according to a preset probability, where the probability that the candidate action of the target task is selected is the preset probability; acquiring state information s_t+1 of the target task after the target action is executed, and acquiring a reward value vector of the target task according to the state information s_t+1 of the target task, where the reward values in the reward value vector correspond one-to-one to the subtasks in the task vector corresponding to the target task; the sample data of the target task includes the task vector corresponding to the target task, the state information s_t of the target task, the target action of the target task, the state information s_t+1 of the target task and the reward value vector of the target task.
By constructing the reward value vector, the execution condition of each subtask in the target task is fed back to the decision model for learning, and the precision of the decision model is improved.
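As a concrete illustration of the reward value vector, the sketch below (a sketch only; the numeric reward values and helper signature are assumptions, since the patent does not specify them) builds one reward element per subtask slot of the intersection task vector introduced above:

```python
# Hypothetical sketch of building the reward value vector r after executing the
# target action. Each element corresponds one-to-one to a subtask slot of the
# task vector; all numeric values and inputs here are illustrative assumptions.
def reward_vector(collided: bool, arrived: bool, delta_progress: float,
                  target_task: str) -> list[float]:
    # Reward for the target task's own "collision or arrival" characteristic subtask.
    characteristic_reward = -1.0 if collided else (1.0 if arrived else 0.0)
    r = [0.0, 0.0, 0.0, 0.0]                        # same layout as the task vector
    idx = {"straight": 0, "left_turn": 1, "right_turn": 2}[target_task]
    r[idx] = characteristic_reward
    r[3] = delta_progress                           # common "traffic efficiency" subtask
    return r
```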
In one possible embodiment, inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain the candidate action of the target task includes:

the decision model M_t acquires an action value function vector of the target task according to the state information s_t of the target task and the task vector corresponding to the target task, where the action value functions in the action value function vector of the target task correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; the decision model M_t acquires a value function of the target task according to the action value function vector of the target task and the task vector corresponding to the target task; and the decision model M_t acquires the candidate action of the target task from the action space according to the value function of the target task, where the candidate action of the target task is the action in the action space that maximizes the value function of the target task.
The value function of the target task is obtained according to the task vector and the action value function vector corresponding to the target task, and then the target action is determined based on the value function, so that the influence of the action value function of the subtask irrelevant to the target task in the action value function vector on the selection of the target action is avoided when the target task is decided.
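The following sketch makes this step concrete (the array shapes and names are assumptions for illustration): the model outputs one action value function per subtask for each action, the value function of the target task is the inner product with the task vector g, and the candidate action is the argmax over the action space.

```python
import numpy as np

# Hypothetical sketch of selecting the candidate action. q_values has shape
# (num_actions, num_subtasks): one action value function per subtask, as output
# by the decision model for state s_t; g is the task vector of the target task.
def candidate_action(q_values: np.ndarray, g: np.ndarray) -> int:
    # Value function of the target task for each action: dot product with g, so
    # subtasks irrelevant to this task (zeros in g) do not affect the choice.
    task_values = q_values @ g            # shape (num_actions,)
    return int(np.argmax(task_values))    # action maximizing the task's value function
```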
In one possible embodiment, selecting the target action of the target task from the candidate actions of the target task and the actions randomly obtained from the action space according to the preset probability includes:
when the first parameter is greater than the preset probability, determining the candidate action of the target task as the target action of the target task, where the first parameter is a random number with a value range of [0, 1]; and when the first parameter is not greater than the preset probability, determining the action randomly acquired from the action space as the target action of the target task.
By determining the action randomly acquired from the action space as the target action of the target task, the method realizes the exploration of new action when the decision model is trained, thereby obtaining the action with better effect when the decision model is used.
It should be noted that the initial value of the preset probability is 1 or a relatively large value close to 1; the preset probability is gradually reduced as the number of training iterations increases.
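A minimal sketch of this selection rule follows (the exact decay schedule and constants are assumptions; only the comparison against the preset probability mirrors the description above):

```python
import random

def select_target_action(candidate: int, num_actions: int,
                         preset_probability: float) -> int:
    """Select the target action as described above.

    A first parameter is drawn uniformly from [0, 1]; when it is greater than
    the preset probability the candidate action is kept, otherwise an action is
    drawn at random from the action space for exploration.
    """
    first_parameter = random.random()
    if first_parameter > preset_probability:
        return candidate
    return random.randrange(num_actions)

# Assumed decay schedule: start at (or near) 1 and reduce as training proceeds.
def decay(preset_probability: float, factor: float = 0.995,
          minimum: float = 0.05) -> float:
    return max(minimum, preset_probability * factor)
```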
In one possible embodiment, adjusting the decision model M_t according to the plurality of sample data to obtain the decision model M_t+1 includes:

calculating a loss value according to a loss function and the state information s_t, the task vector, the target action, the state information s_t+1 and the reward value vector of each of the plurality of sample data; and adjusting the decision model M_t according to the loss value to obtain the decision model M_t+1.
Here, the state information s_t is obtained before the target action is executed, and the state information s_t+1 is obtained after the target action is executed.
The loss value can be expressed in terms of the following quantities: r is the reward value vector in the sample data, g is the task vector in the sample data, a_t is the target action in the sample data, and the discount coefficient γ is a constant.
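One plausible form of this loss, sketched from the quantities listed above and the task-weighted value function defined earlier, is a temporal-difference error; the exact expression, the use of a target network and the network interface are assumptions, not the patent's formula:

```python
import numpy as np

# Sketch of a plausible per-sample loss (an assumption based on the surrounding
# definitions): a temporal-difference error on the task-weighted value g·Q(s, a).
# q_net and target_q_net are hypothetical callables returning an array of shape
# (num_actions, num_subtasks) for a given state and task vector.
def td_loss(q_net, target_q_net, sample, gamma: float = 0.99) -> float:
    s_t, g, a_t, r, s_t1 = (sample["s_t"], sample["g"], sample["a_t"],
                            sample["r"], sample["s_t1"])
    q_t = q_net(s_t, g)                     # action value function vectors at s_t
    q_t1 = target_q_net(s_t1, g)            # bootstrapped estimate at s_t+1
    target = np.dot(r, g) + gamma * np.max(q_t1 @ g)
    prediction = np.dot(q_t[a_t], g)
    return float((target - prediction) ** 2)
```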
In one possible embodiment, judging whether the decision model M_t+1 converges includes:

judging whether the target task has ended according to the state information s_t+1 of the target task; when it is determined that the target task has not ended, setting t = t + 1 and repeatedly performing steps S2-S5 until the target task ends;

when it is determined that the target task has ended, judging whether the decision model M_t+1 converges; when it is determined that the decision model M_t+1 has not converged, setting t = t + 1 and repeatedly performing steps S1-S5 until the decision model M_t+1 converges.
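Putting steps S1-S5 together, a high-level training loop might look as follows. This is a sketch only: every collaborator passed in (environment, replay buffer, action selection, model update and convergence check) is a hypothetical placeholder, and only the control flow mirrors the steps above.

```python
# Hypothetical outer loop mirroring steps S1-S5; env, replay, select_action,
# update and has_converged are illustrative collaborators, not defined by the patent.
def train(env, model, replay, select_action, update, has_converged):
    t = 0
    while True:
        g = env.sample_task()                       # S1: pick a target task, get its task vector
        s_t, done = env.reset(), False
        while not done:                             # S2: generate and store sample data
            a_t = select_action(model, s_t, g, t)
            s_t1, r, done = env.step(a_t)           # r is the reward value vector
            replay.add((g, s_t, a_t, s_t1, r))
            batch = replay.sample()                 # S3: random sample data, possibly from several tasks
            model = update(model, batch)            # S4: adjust M_t to obtain M_t+1
            s_t, t = s_t1, t + 1
        if has_converged(model):                    # S5: stop once the model converges
            return model
```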
In a second aspect, an embodiment of the present application provides a decision method based on multitask learning, including:
acquiring a plurality of candidate tasks, and acquiring a target task from the plurality of candidate tasks; acquiring state information s_t of the target task according to the target task; performing task joint characterization on the target task according to the plurality of candidate tasks to obtain a task vector corresponding to the target task, where the task vector corresponding to each candidate task is obtained based on the characteristic subtasks and the common subtasks of the plurality of candidate tasks; and determining a target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
The plurality of candidate tasks may be tasks in the same scene or tasks in different scenes.
The target task is subjected to joint characterization according to the subtasks of the multiple candidate tasks, so that the multiple tasks can be decided by using the same model, and the mutual influence among the multiple tasks is avoided.
In a possible embodiment, performing task joint characterization on a target task according to a plurality of candidate tasks to obtain a task vector corresponding to the target task, includes:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task; extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
In one possible embodiment, determining the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task includes:

inputting the task vector corresponding to the target task and the state information s_t of the target task into a target decision model for processing to obtain the target action, where the target decision model is implemented based on a neural network.
Alternatively, the neural network may be a fully-connected neural network, a convolutional neural network, a recurrent neural network, or other neural network.
In one possible embodiment, inputting the task vector corresponding to the target task and the state information s_t of the target task into the target decision model for processing to obtain the target action includes:

acquiring an action value function vector of the target task according to the task vector corresponding to the target task and the state information s_t of the target task, where the action value functions in the action value function vector correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; acquiring a value function of the target task according to the task vector and the action value function vector corresponding to the target task; and acquiring the target action from the action space according to the value function of the target task, where the target action is the action in the action space that maximizes the value function of the target task.
The value function of the target task is obtained according to the task vector and the action value function vector corresponding to the target task, and then the target action is determined based on the value function, so that the influence of the action value function of a subtask irrelevant to the target task in the action value function vector on the selection of the target action is avoided when the target task is decided, and the decision effect of the target task is improved.
Alternatively, the value function may be a Q value function. The Q value function of the target task may be expressed as Q(s_t, a_k, g), and the target action may be expressed as a* = argmax_{a_k} Q(s_t, a_k, g),

where g is the task vector corresponding to the target task and a_k is an action in the action space.
In a third aspect, an embodiment of the present application provides a decision model training device based on multitask learning, including:
the acquisition unit is used for randomly acquiring a plurality of sample data from the first sample database; the first sample database comprises sample data of a plurality of candidate tasks, the sample data of the target task comprises a task vector corresponding to the target task, the task vector corresponding to the target task is obtained based on a common subtask and a characteristic subtask in the candidate tasks, and the target task is any one of the candidate tasks;
an adjusting unit for adjusting the decision model M according to a plurality of sample data by using a reinforcement learning methodtTo obtain a decision model Mt+1;
A determination unit for determining the model Mt+1Upon convergence, the decision model M is determinedt+1Is a target decision model.
In one possible embodiment, the acquisition unit is further configured to acquire a target task from the plurality of candidate tasks, acquire state information s_t of the target task according to the target task, and acquire the task vector corresponding to the target task according to the target task;

the training apparatus further includes:

an updating unit, configured to generate the sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task and the decision model M_t, and add the sample data of the target task to a preliminary sample database to obtain the first sample database;
in a possible embodiment, in terms of obtaining a task vector corresponding to a target task according to the target task, the obtaining unit is specifically configured to:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task; extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
In a possible embodiment, in terms of generating the sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task and the decision model M_t, the updating unit is specifically configured to:

input the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain a candidate action of the target task; select the target action of the target task from the candidate action of the target task and an action randomly acquired from the action space according to a preset probability, where the probability that the candidate action of the target task is selected is the preset probability; acquire state information s_t+1 of the target task after the target action is executed, and acquire a reward value vector of the target task according to the state information s_t+1 of the target task, where the reward values in the reward value vector correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; the sample data of the target task includes the task vector corresponding to the target task, the state information s_t of the target task, the target action of the target task, the state information s_t+1 of the target task and the reward value vector of the target task.

In one possible embodiment, in terms of inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain the candidate action of the target task, the updating unit is specifically configured to:

acquire an action value function vector of the target task according to the state information s_t of the target task and the task vector corresponding to the target task, where the action value functions in the action value function vector of the target task correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; acquire a value function of the target task according to the action value function vector of the target task and the task vector corresponding to the target task; and acquire the candidate action of the target task from the action space according to the value function of the target task, where the candidate action is the action in the action space that maximizes the value function of the target task.
In a possible embodiment, in terms of selecting the target action of the target task from the candidate actions of the target task and the actions randomly obtained from the action space according to the preset probability, the updating unit is specifically configured to:
when the first parameter is greater than the preset probability, determine the candidate action of the target task as the target action of the target task, where the first parameter is a random number with a value range of [0, 1]; and when the first parameter is not greater than the preset probability, determine the action randomly acquired from the action space as the target action of the target task.
In a possible embodiment, the adjusting unit is specifically configured to:
calculate a loss value according to a loss function and the state information s_t, the task vector, the target action, the state information s_t+1 and the reward value vector of each of the plurality of sample data; and adjust the decision model M_t according to the loss value to obtain the decision model M_t+1.
In a fourth aspect, an embodiment of the present application provides a decision device based on multitask learning, including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a plurality of candidate tasks and acquiring a target task from the candidate tasks; and acquiring the state information s of the target task according to the target taskt;
The joint characterization unit is used for performing task joint characterization on the target task according to the multiple candidate tasks to obtain a task vector corresponding to the target task, wherein the task vector corresponding to each candidate task is obtained based on the characteristic subtasks and the common subtasks of the multiple candidate tasks;
a determining unit, configured to determine a target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
In a possible embodiment, the joint characterization unit is specifically configured to:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task; extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
In a possible embodiment, the determining unit is specifically configured to:
input the task vector corresponding to the target task and the state information s_t of the target task into a target decision model for processing to obtain the target action of the target task, where the target decision model is implemented based on a neural network.
In one possible embodiment, in terms of inputting the task vector corresponding to the target task and the state information s_t of the target task into the target decision model for processing to obtain the target action of the target task, the determining unit is specifically configured to:

acquire an action value function vector of the target task according to the task vector corresponding to the target task and the state information s_t of the target task, where the action value functions in the action value function vector correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; acquire a value function of the target task according to the task vector and the action value function vector corresponding to the target task; and acquire the target action from the action space according to the value function of the target task, where the target action is the action in the action space that maximizes the value function of the target task.
In a fifth aspect, an embodiment of the present application provides another decision model training apparatus based on multitask learning, including:
a memory to store instructions; and
at least one processor coupled to the memory;
wherein the instructions, when executed by the at least one processor, cause the processor to perform some or all of the method of the first aspect.
In a sixth aspect, an embodiment of the present application provides another decision device based on multitask learning, including:
a memory to store instructions; and
at least one processor coupled to the memory;
wherein the instructions, when executed by the at least one processor, cause the processor to perform some or all of the method of the second aspect.
In a seventh aspect, an embodiment of the present application provides a chip system, where the chip system is applied to an electronic device; the chip system comprises one or more interface circuits, and one or more processors; the interface circuit and the processor are interconnected through a line; the interface circuit is to receive a signal from a memory of the electronic device and to send the signal to the processor, the signal comprising computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device performs part or all of the method according to the first aspect or the second aspect.
In an eighth aspect, embodiments of the present application provide a computer storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform some or all of the methods of the first or second aspects.
In a ninth aspect, embodiments of the present application provide a computer program product, which includes computer instructions, when the computer instructions are executed on an electronic device, cause the electronic device to perform part or all of the method according to the first aspect or the second aspect.
The computer program product can be executed on an intelligent carrier (such as a mobile vehicle, a robotic arm, or a recommendation or search engine) on which a computer system is installed. When the executable code for acquiring task/state information, processing the system state, selecting decisions and performing control runs on the storage components of the computer system, it requires the cooperation of the CPU/GPU and the storage system. The network communication components of the computer system are also used, and the decision model is stored on a storage component of the computer system.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1a is a functional block diagram of a vehicle according to an embodiment of the present disclosure;
FIG. 1b is a diagram illustrating an architecture of a computer system according to an embodiment of the present application;
FIG. 1c is a block diagram of an autopilot system according to an embodiment of the present application;
FIG. 1d is a block diagram of another embodiment of an autopilot system according to the present application;
fig. 2 is a schematic flowchart of a decision method based on multi-task learning according to an embodiment of the present disclosure;
FIG. 3 is a schematic view of an intersection scene;
fig. 4 is a schematic flowchart of a method for training a decision model based on multi-task learning according to an embodiment of the present disclosure;
FIG. 5 is a schematic flowchart of another method for training a decision model based on multi-task learning according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a decision device according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an in-vehicle device according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a decision device based on multitask learning according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a decision model training apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of another decision-making device according to an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of another decision model training apparatus according to an embodiment of the present disclosure;
fig. 12 is a partial schematic view of a computer program product according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings.
Fig. 1a is a functional block diagram of a vehicle 100 according to an embodiment of the present invention. In one embodiment, the vehicle 100 is configured in a fully or partially autonomous driving mode. For example, the vehicle 100 may control itself while in the autonomous driving mode, and may determine a current state of the vehicle and its surroundings by human operation, determine a possible behavior of at least one other vehicle in the surroundings, and determine a confidence level corresponding to a likelihood that the other vehicle performs the possible behavior, controlling the vehicle 100 based on the determined information. While the vehicle 100 is in the autonomous driving mode, the vehicle 100 may be placed into operation without human interaction.
The vehicle 100 may include various subsystems such as a travel system 102, a sensor system 104, a control system 106, one or more peripherals 108, as well as a power supply 110, a computer system 112, and a user interface 116. Alternatively, vehicle 100 may include more or fewer subsystems, and each subsystem may include multiple elements. In addition, each of the sub-systems and elements of the vehicle 100 may be interconnected by wire or wirelessly.
The travel system 102 may include components that provide powered motion to the vehicle 100. In one embodiment, the travel system 102 may include an engine 118, an energy source 119, a transmission 120, and wheels/tires 121. The engine 118 may be an internal combustion engine, an electric motor, an air compression engine, or other types of engine combinations, such as a hybrid engine of a gasoline engine and an electric motor, or a hybrid engine of an internal combustion engine and an air compression engine. The engine 118 converts the energy source 119 into mechanical energy.
Examples of energy sources 119 include gasoline, diesel, other petroleum-based fuels, propane, other compressed gas-based fuels, ethanol, solar panels, batteries, and other sources of electrical power. The energy source 119 may also provide energy to other systems of the vehicle 100.
The transmission 120 may transmit mechanical power from the engine 118 to the wheels 121. The transmission 120 may include a gearbox, a differential, and a drive shaft. In one embodiment, the transmission 120 may also include other devices, such as a clutch. Wherein the drive shaft may comprise one or more shafts that may be coupled to one or more wheels 121.
The sensor system 104 may include a number of sensors that sense information about the environment surrounding the vehicle 100. For example, the sensor system 104 may include a positioning system 122 (which may be a GPS system, a beidou system, or other positioning system), an Inertial Measurement Unit (IMU) 124, a radar 126, a laser range finder 128, and a camera 130. The sensor system 104 may also include sensors of internal systems of the monitored vehicle 100 (e.g., an in-vehicle air quality monitor, a fuel gauge, an oil temperature gauge, etc.). Sensor data from one or more of these sensors may be used to detect the object and its corresponding characteristics (position, shape, orientation, velocity, etc.). Such detection and identification is a critical function of the safe operation of the autonomous vehicle 100.
The positioning system 122 may be used to estimate the geographic location of the vehicle 100. The IMU 124 is used to sense position and orientation changes of the vehicle 100 based on inertial acceleration. In one embodiment, IMU 124 may be a combination of an accelerometer and a gyroscope.
The radar 126 may utilize radio signals to sense objects within the surrounding environment of the vehicle 100. In some embodiments, in addition to sensing objects, radar 126 may also be used to sense the speed and/or heading of an object.
The laser rangefinder 128 may utilize laser light to sense objects in the environment in which the vehicle 100 is located. In some embodiments, the laser rangefinder 128 may include one or more laser sources, laser scanners, and one or more detectors, among other system components.
The camera 130 may be used to capture multiple images of the surrounding environment of the vehicle 100. The camera 130 may be a still camera or a video camera.
The control system 106 is for controlling the operation of the vehicle 100 and its components. Control system 106 may include various elements including a steering system 132, a throttle 134, a braking unit 136, a sensor fusion system 138, a computer vision system 140, a route control system 142, and an obstacle avoidance system 144.
The steering system 132 is operable to adjust the heading of the vehicle 100. For example, in one embodiment, a steering wheel system.
The throttle 134 is used to control the operating speed of the engine 118 and thus the speed of the vehicle 100.
The brake unit 136 is used to control the deceleration of the vehicle 100. The brake unit 136 may use friction to slow the wheel 121. In other embodiments, the brake unit 136 may convert the kinetic energy of the wheel 121 into an electric current. The brake unit 136 may take other forms to slow the rotational speed of the wheels 121 to control the speed of the vehicle 100.
The computer vision system 140 may be operable to process and analyze images captured by the camera 130 to identify objects and/or features in the environment surrounding the vehicle 100. The objects and/or features may include traffic signals, road boundaries, and obstacles. The computer vision system 140 may use object recognition algorithms, Structure from Motion (SFM) algorithms, video tracking, and other computer vision techniques. In some embodiments, the computer vision system 140 may be used to map the environment, track objects, estimate the speed of objects, and so forth.
The route control system 142 is used to determine a travel route of the vehicle 100. In some embodiments, the route control system 142 may combine data from the sensors 138, the GPS 122, and one or more predetermined maps to determine a travel route for the vehicle 100.
The obstacle avoidance system 144 is used to identify, evaluate, and avoid or otherwise negotiate potential obstacles in the environment of the vehicle 100.
Of course, in one example, the control system 106 may additionally or alternatively include components other than those shown and described. Or may reduce some of the components shown above.
Vehicle 100 interacts with external sensors, other vehicles, other computer systems, or users through peripherals 108. The peripheral devices 108 may include a wireless communication system 146, an in-vehicle computer 148, a microphone 150, and/or speakers 152.
In some embodiments, the peripheral devices 108 provide a means for a user of the vehicle 100 to interact with the user interface 116. For example, the onboard computer 148 may provide information to a user of the vehicle 100. The user interface 116 may also operate the in-vehicle computer 148 to receive user input. The in-vehicle computer 148 may be operated via a touch screen. In other cases, the peripheral devices 108 may provide a means for the vehicle 100 to communicate with other devices located within the vehicle. For example, the microphone 150 may receive audio (e.g., voice commands or other audio input) from a user of the vehicle 100. Similarly, the speaker 152 may output audio to a user of the vehicle 100.
The wireless communication system 146 may communicate wirelessly with one or more devices, either directly or via a communication network. For example, the wireless communication system 146 may use 3G cellular communication, such as CDMA, EV-DO, or GSM/GPRS, 4G cellular communication, such as LTE, or 5G cellular communication. The wireless communication system 146 may communicate with a wireless local area network (WLAN) using WiFi. In some embodiments, the wireless communication system 146 may communicate directly with a device using an infrared link, Bluetooth, or ZigBee. Other wireless protocols, such as various vehicle communication systems, may also be used; for example, the wireless communication system 146 may include one or more dedicated short range communications (DSRC) devices, which may include public and/or private data communications between vehicles and/or roadside stations.
The power supply 110 may provide power to various components of the vehicle 100. In one embodiment, power source 110 may be a rechargeable lithium ion or lead acid battery. One or more battery packs of such batteries may be configured as a power source to provide power to various components of the vehicle 100. In some embodiments, the power source 110 and the energy source 119 may be implemented together, such as in some all-electric vehicles.
Some or all of the functionality of the vehicle 100 is controlled by the computer system 112. The computer system 112 may include at least one processor 113, the processor 113 executing instructions 115 stored in a non-transitory computer readable medium, such as a data storage device 114. The computer system 112 may also be a plurality of computing devices that control individual components or subsystems of the vehicle 100 in a distributed manner.
The processor 113 may be any conventional processor, such as a commercially available CPU. Alternatively, the processor may be a dedicated device such as an ASIC or other hardware-based processor. Although FIG. 1a functionally illustrates a processor, memory, and other elements of the computer 110 in the same block, those skilled in the art will appreciate that the processor, computer, or memory may in fact comprise multiple processors, computers, or memories that may or may not be stored within the same physical housing. For example, the memory may be a hard disk drive or other storage medium located in a different housing than the computer 110. Thus, references to a processor or computer are to be understood as including references to a collection of processors or computers or memories which may or may not operate in parallel. Rather than using a single processor to perform the steps described herein, some components, such as the steering component and the retarding component, may each have their own processor that performs only computations related to the component-specific functions.
The processor 113 obtains current state information of the vehicle through the sensor obtained by the sensing system 104, the processor 113 obtains a plurality of candidate tasks and determines a target task from the candidate tasks, the target task is subjected to joint representation according to the candidate tasks to obtain a task vector corresponding to the target task, the task vector corresponding to the target task and the current state information are input into the decision model to be processed according to the task vector corresponding to the target task, a target action of the target task is obtained, and the control system 106 executes the target action to control the vehicle 100 to run.
In some embodiments, the data storage device 114 may include instructions 115 (e.g., program logic), and the instructions 115 may be executed by the processor 113 to perform various functions of the vehicle 100, including those described above. The data storage 114 may also contain additional instructions, including instructions to send data to, receive data from, interact with, and/or control one or more of the travel system 102, the sensor system 104, the control system 106, and the peripheral devices 108.
In addition to instructions 115, data storage device 114 may also store data such as road maps, route information, the location, direction, speed of the vehicle, and other such vehicle data, among other information. Such information may be used by the vehicle 100 and the computer system 112 during operation of the vehicle 100 in autonomous, semi-autonomous, and/or manual modes.
A user interface 116 for providing information to and receiving information from a user of the vehicle 100. Optionally, the user interface 116 may include one or more input/output devices within the collection of peripheral devices 108, such as a wireless communication system 146, an in-vehicle computer 148, a microphone 150, and a speaker 152.
The computer system 112 may control the functions of the vehicle 100 based on inputs received from various subsystems (e.g., the travel system 102, the sensor system 104, and the control system 106) and from the user interface 116. For example, the computer system 112 may utilize input from the control system 106 in order to control the steering unit 132 to avoid obstacles detected by the sensor system 104 and the obstacle avoidance system 144. In some embodiments, the computer system 112 is operable to provide control over many aspects of the vehicle 100 and its subsystems.
Alternatively, one or more of the components described above may be mounted separately from or associated with the vehicle 100. For example, the data storage device 114 may exist partially or completely separate from the vehicle 100. The above components may be communicatively coupled together in a wired and/or wireless manner.
Optionally, the above components are only an example, in an actual application, components in the above modules may be added or deleted according to an actual need, and fig. 1a should not be construed as limiting the embodiment of the present invention.
An autonomous automobile traveling on a roadway, such as vehicle 100 above, may identify objects within its surrounding environment to determine an adjustment to the current speed. The object may be another vehicle, a traffic control device, or another type of object. In some examples, each identified object may be considered independently, and based on the respective characteristics of the object, such as its current speed, acceleration, separation from the vehicle, etc., may be used to determine the speed at which the autonomous vehicle is to be adjusted.
Alternatively, the autonomous vehicle 100 or a computing device associated with the autonomous vehicle 100 (e.g., the computer system 112, the computer vision system 140, or the data storage 114 of FIG. 1a) may predict the behavior of the identified objects based on the characteristics of the identified objects and the state of the surrounding environment (e.g., traffic, rain, ice on the road, etc.). Optionally, the identified objects depend on each other's behavior, so it is also possible to predict the behavior of a single identified object by taking all of the identified objects into account together. The vehicle 100 is able to adjust its speed based on the predicted behavior of the identified objects. In other words, the autonomous vehicle is able to determine what stable state the vehicle will need to adjust to (e.g., accelerate, decelerate, or stop) based on the predicted behavior of the objects. In this process, other factors may also be considered to determine the speed of the vehicle 100, such as the lateral position of the vehicle 100 in the road on which it is traveling, the curvature of the road, the proximity of static and dynamic objects, and so forth.
In addition to providing instructions to adjust the speed of the autonomous vehicle, the computing device may also provide instructions to modify the steering angle of the vehicle 100 to cause the autonomous vehicle to follow a given trajectory and/or to maintain a safe lateral and longitudinal distance from objects in the vicinity of the autonomous vehicle (e.g., cars in adjacent lanes on the road).
The vehicle 100 may be a car, a truck, a motorcycle, a bus, a boat, an airplane, a helicopter, a lawn mower, an amusement car, a playground vehicle, construction equipment, a trolley, a golf cart, a train, a trolley, etc., and the embodiment of the present invention is not particularly limited.
As shown in FIG. 1b, computer system 101 includes a processor 103 coupled to a system bus 105. The processor 103 may be one or more processors, each of which may include one or more processor cores. A display adapter (video adapter) 107 may drive a display 109, which is coupled to the system bus 105. The system bus 105 is coupled to an input/output (I/O) bus 113 through a bus bridge 111. An I/O interface 115 is coupled to the I/O bus. The I/O interface 115 communicates with various I/O devices, such as an input device 117 (e.g., a keyboard, a mouse, a touch screen, etc.), a media tray 121 (e.g., a CD-ROM, a multimedia interface, etc.), a transceiver 123 (which can send and/or receive radio communication signals), a camera 155 (which can capture static and dynamic digital video images), and an external USB interface 125. Optionally, the interface connected to the I/O interface 115 may be a USB interface.
Optionally, in various embodiments described herein, computer system 101 may be located remotely from the autonomous vehicle and may communicate wirelessly with the autonomous vehicle. In other aspects, some of the processes described herein are performed on a processor disposed within the autonomous vehicle, while others are performed by a remote processor, including taking the actions required to perform a single maneuver.
The hard drive interface is coupled to system bus 105 and is connected to the hard disk drive. System memory 135 is coupled to system bus 105. Data running in system memory 135 may include the operating system 137 and application programs 143 of computer 101.
The operating system includes a Shell 139 and a kernel 141. Shell 139 is an interface between the user and the kernel of the operating system. The shell is the outermost layer of the operating system. The shell manages the interaction between the user and the operating system: it waits for user input, interprets the user input to the operating system, and processes the output of the operating system.
The application programs 143 include programs related to controlling the automatic driving of a vehicle, such as programs for managing the interaction of an automatically driven vehicle with obstacles on the road, programs for controlling the route or speed of an automatically driven vehicle, and programs for controlling the interaction of an automatically driven vehicle with other automatically driven vehicles on the road. Application programs 143 may also reside on the system of the software deploying server 149. In one embodiment, computer system 101 may download application program 143 from software deploying server 149 when the autopilot-related program 147 needs to be executed.
When the processor 103 executes the application program 143, the processor performs the following steps: acquiring a target task from a plurality of candidate tasks according to the navigation information, such as intersection straight-going; performing joint characterization on the target task according to the plurality of candidate tasks to obtain a task vector corresponding to the target task; inputting the task vector corresponding to the target task and the state information s_t of the target task acquired by the sensor 153 into a decision model for processing, so as to obtain the target action of the target task, such as accelerating or stopping; and then executing the target action to control the vehicle to travel.
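For illustration only, the following Python sketch shows one possible way these steps could be organized; the object and method names (vehicle.get_state, decision_model.decide, etc.) are assumptions and are not part of the embodiment.

    # Illustrative control loop for the decision steps described above (a sketch, not the
    # claimed implementation).
    def drive_through_maneuver(target_task, g, decision_model, vehicle):
        # g: task vector obtained by jointly characterizing target_task against all candidate tasks
        while True:
            s_t = vehicle.get_state()               # state information acquired from the sensor
            a_t = decision_model.decide(s_t, g)     # target action, e.g. accelerate or stop
            vehicle.execute(a_t)                    # execute the target action to control the vehicle
            if vehicle.task_finished(target_task):  # e.g. the intersection has been passed
                break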
Acquiring a target task from a plurality of candidate tasks; acquiring, according to the target task, the state information s_t of the target task and the task vector corresponding to the target task, where the task vector corresponding to the target task is obtained based on the common subtasks and the characteristic subtasks of the plurality of candidate tasks; updating a sample database according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model to obtain an updated sample database; randomly acquiring a plurality of pieces of sample data from the updated sample database, the plurality of pieces of sample data being sample data of some or all of the candidate tasks; adjusting parameters in the decision model according to the plurality of pieces of sample data by using a reinforcement learning method to obtain an adjusted decision model; judging whether the adjusted decision model has converged; and when the adjusted decision model has converged, determining the adjusted decision model as the target decision model.
The decision model may be implemented based on a fully-connected neural network, a convolutional neural network, a recurrent neural network, or other neural networks.
In one example, computer 720 may include a server having multiple computers, such as a load-balancing server farm, which exchanges information with different nodes of a network for the purpose of receiving, processing, and transmitting data from computer system 112. The server may be configured similarly to computer system 110, with a processor 730, memory 740, instructions 750, and data 760.
Optionally, the data 760 includes the coordinates and speed of the host vehicle in the world coordinate system, as well as the coordinates, speed, and heading angle of the surrounding social vehicles in the world coordinate system, and the like.
When executing the instructions 750, the processor 730 specifically implements the following steps:
acquiring, according to the coordinates and speed of the host vehicle in the world coordinate system and the coordinates, speed, and heading angle of the surrounding social vehicles in the world coordinate system, the coordinates (x_it, y_it), velocity v_it, and heading angle θ_it of the surrounding social vehicles in the host-vehicle coordinate system.
Acquiring the speed of the host vehicle and the coordinates (x_it, y_it), velocity v_it, and heading angle θ_it of the surrounding social vehicles in the host-vehicle coordinate system, and sending the speed of the host vehicle and the coordinates (x_it, y_it), velocity v_it, and heading angle θ_it of the surrounding social vehicles in the host-vehicle coordinate system to the computer system 112 of the computer 101 of the vehicle.
It should be noted that the above instruction can be regarded as a conversion instruction.
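A minimal sketch of one possible form of such a conversion instruction is given below, assuming headings are measured from the world x-axis; the function name and conventions are assumptions for illustration only.

    # Transform a social vehicle's world-frame pose into the host-vehicle coordinate system
    # (illustrative sketch of the conversion instruction).
    import math

    def world_to_host(host_x, host_y, host_heading, veh_x, veh_y, veh_heading):
        dx, dy = veh_x - host_x, veh_y - host_y
        cos_h, sin_h = math.cos(host_heading), math.sin(host_heading)
        x_local = cos_h * dx + sin_h * dy         # longitudinal offset in the host frame
        y_local = -sin_h * dx + cos_h * dy        # lateral offset in the host frame
        theta_local = veh_heading - host_heading  # relative heading angle
        return x_local, y_local, theta_local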
Fig. 1d shows an example of an autonomous driving vehicle and a cloud service center according to an example embodiment. Cloud service center 520 may receive information (such as data collected by vehicle sensors or other information) from autonomous vehicles 510, 512, and 514 within its operating environment 500 via a network 502, such as a wireless communication network.
Cloud service center 520 obtains information about the speed, coordinates, heading angle, etc. of autonomous vehicles 510, 512, and 514 in the world coordinate system via network 502.
The cloud service center runs the stored programs related to controlling the automatic driving of the automobile according to the received data to control the automatic driving vehicles 510, 512 and 514. The programs related to controlling the automatic driving of the automobile can be programs for managing the interaction between the automatic driving automobile and obstacles on the road, programs for controlling the route or the speed of the automatic driving automobile and programs for controlling the interaction between the automatic driving automobile and other automatic driving automobiles on the road.
The cloud service center 520 acquires the state information of any vehicle A among the autonomous vehicles 510, 512, and 514, where the state information includes the speed of the vehicle A and the coordinates, speed, and heading angle of the surrounding vehicles in the coordinate system of the vehicle A; acquires a target task from a plurality of candidate tasks according to the navigation information of the vehicle A; performs joint characterization on the target task according to the plurality of candidate tasks to obtain a task vector corresponding to the target task; and inputs the state information and the task vector corresponding to the target task into a decision model for processing to obtain the target action of the target task.
After acquiring the target action, the cloud service center 520 sends the target action to the vehicle a so that the vehicle a travels according to the target action.
The network 502 sends portions of the map to the autonomous vehicles 510, 512, or 514. In other examples, operations may be divided between different locations or centers. For example, multiple cloud service centers may receive, validate, combine, and/or send information reports. Information reports and/or sensor data may also be sent between autonomous vehicles in some examples. Other configurations are also possible.
The cloud service centers can share information such as the speed, coordinates and course angle of the vehicle in the service area under the world coordinate system; in a plurality of cloud service centers, when the cloud service center 1 cannot provide driving service for the vehicle B within the service range of the cloud service center, the cloud service center 1 may send relevant information of the vehicle B (such as state information of the vehicle B, an action space, and a task vector corresponding to a task to be executed) to the cloud service center 2; the cloud service center 2 determines a target action of the vehicle B according to the state information of the vehicle B, the action space and the task vector corresponding to the task to be executed, then sends the target action to the cloud service center 1, and the cloud service center 1 sends the target action to the vehicle B.
In some examples, the center sends suggested solutions to the autonomous vehicle regarding possible driving conditions within the environment (e.g., informing of an obstacle ahead and how to bypass it). For example, the cloud service center may assist the vehicle in determining how to travel when facing a particular obstacle within the environment. The cloud service center sends a response to the autonomous vehicle indicating how the vehicle should travel in the given scenario. For example, the cloud service center may confirm the presence of a temporary stop sign in front of the road based on the collected sensor data, and may also determine that a lane is closed due to construction based on a "lane closed" sign and sensor data from construction vehicles in that lane. Accordingly, the cloud service center sends a suggested mode of operation for the autonomous vehicle to pass the obstacle (e.g., instructing the vehicle to change lanes onto another road). When the cloud service center observes the video stream within its operating environment and has confirmed that the autonomous vehicle can safely and successfully traverse the obstacle, the operational steps used for that autonomous vehicle may be added to the driving information map. Accordingly, this information may be sent to other vehicles in the area that may encounter the same obstacle, so as to assist the other vehicles not only in recognizing the closed lane but also in knowing how to pass.
Referring to fig. 2, fig. 2 is a schematic flowchart of a decision method based on multi-task learning according to an embodiment of the present application. As shown in fig. 2, the method includes:
S201, acquiring a plurality of candidate tasks, determining a target task from the plurality of candidate tasks, and acquiring state information s_t of the target task according to the target task.
Optionally, the plurality of candidate tasks may be tasks in the same scene; for example, in an intersection scene, the plurality of candidate tasks may include intersection straight-going, intersection left turn, and intersection right turn; for another example, in a tactical competitive game scene, the candidate tasks may include the largest number of kills, the longest survival time, and the fewest deaths.
Optionally, the plurality of candidate tasks may further include tasks in different scenes, such as tasks in an intersection scene and tasks in a tactical competitive game scene.
For example, in an intersection scene, since the host vehicle plays multiple rounds of interactive games with the social vehicles, the information of the host vehicle and the information of the surrounding social vehicles need to be obtained; therefore, the state information s_t of the target task includes the information of the host vehicle and the information of the surrounding social vehicles. Optionally, the information of the host vehicle includes the speed v_et of the host vehicle, and the information of a surrounding social vehicle includes the coordinates (x_it, y_it), velocity v_it, and heading angle θ_it of that social vehicle in the host-vehicle coordinate system. Assuming that the surrounding social vehicles include the 5 vehicles closest to the host vehicle, the state information s_t can be represented as s_t = [v_et, x_1t, y_1t, v_1t, θ_1t, …, x_5t, y_5t, v_5t, θ_5t].
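As an illustrative aid only, the state vector above could be assembled as follows; the function name and the assumption that the social-vehicle poses are already expressed in the host-vehicle frame are for illustration.

    # Assemble the state vector s_t for the intersection scene (illustrative sketch).
    import numpy as np

    def build_state(v_host, social_vehicles):
        # social_vehicles: list of 5 tuples (x_it, y_it, v_it, theta_it), nearest first
        s_t = [v_host]
        for x, y, v, theta in social_vehicles:
            s_t.extend([x, y, v, theta])
        return np.array(s_t)  # shape (21,): [v_et, x_1t, y_1t, v_1t, theta_1t, ..., theta_5t]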
For example, in a tactical competitive game scene, the state information s_t of the target task includes the coordinates (x_pt, y_pt) and forward speed v_pt of the main character in the map, and the coordinates (x_it, y_it), forward speed v_it, and forward angle θ_it of teammates in the coordinate system of the main character. Optionally, the state information s_t of the target task may also include the health value, firepower value, number of kills, and survival time of the main character and teammates, as well as the position information, health values, firepower values, and the like of enemy units.
In a specific example, in an intersection scene, the target task can be determined from the plurality of candidate tasks according to the navigation information of the user. For example, in the intersection scene, the plurality of candidate tasks include intersection left turn, intersection right turn, and intersection straight-going; if the navigation information indicates that a right turn is required at the intersection, the target task determined from the plurality of candidate tasks is the intersection right turn.
In another specific example, in a tactical competitive game scene, the target task may be determined from the plurality of candidate tasks based on the game settings. For example, in a tactical competitive game, the plurality of candidate tasks include the largest number of kills and the longest survival time; if the condition for winning the game is the longest survival time, the target task determined from the plurality of candidate tasks is the longest survival time.
S202, performing task joint characterization on the target task according to the candidate tasks to obtain a task vector corresponding to the target task.
And the task vector corresponding to each candidate task is obtained based on the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
Specifically, decomposing each candidate task in the multiple candidate tasks according to the prior knowledge to obtain a subtask corresponding to each candidate task; acquiring characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
For example, in an intersection scene, the plurality of candidate tasks may include intersection straight going, intersection left turning, and intersection right turning; according to the priori knowledge, the intersection straight-going is decomposed into two subtasks of intersection straight-going collision or arrival and traffic efficiency improvement; the left turn at the intersection is decomposed into two subtasks of collision or arrival at the left turn at the intersection and improvement of traffic efficiency; the right turn at the intersection is decomposed into two subtasks of collision or arrival of the right turn at the intersection and improvement of traffic efficiency. The intersection straight-going collision or arrival, the intersection left-turning collision or arrival and the intersection right-turning collision or arrival are characteristic subtasks, and the traffic efficiency is improved to be a common subtask.
After the characteristic subtasks and the common subtask are obtained, task joint characterization is performed on each of the plurality of candidate tasks according to the characteristic subtasks and the common subtask to obtain the task vector corresponding to each candidate task. For example, for the intersection scene, the task vector corresponding to each candidate task is obtained based on four subtasks: intersection left-turn collision or arrival, intersection straight-going collision or arrival, intersection right-turn collision or arrival, and improving traffic efficiency.
Optionally, the task vector corresponding to each of the plurality of candidate tasks is composed of a plurality of elements, and the elements correspond one-to-one to the characteristic subtasks and the common subtasks of the plurality of candidate tasks. The different values of each element in the task vector corresponding to a candidate task indicate whether the candidate task includes the subtask corresponding to that element.
For example, for the three candidate tasks of the intersection scene (intersection left turn, intersection straight-going, and intersection right turn), the task vector corresponding to each candidate task is composed of four elements, and the four elements respectively correspond to the four subtasks: intersection left-turn collision or arrival, intersection straight-going collision or arrival, intersection right-turn collision or arrival, and improving traffic efficiency. The different values of each element in the task vector corresponding to a candidate task indicate whether the candidate task includes the subtask corresponding to that element.
Optionally, the task vector corresponding to intersection left turn may be represented as [1,0,0,1], the task vector corresponding to intersection straight-going as [0,1,0,1], and the task vector corresponding to intersection right turn as [0,0,1,1]. In the task vector [1,0,0,1] corresponding to intersection left turn, the first "1" from left to right indicates that the candidate task "intersection left turn" includes the subtask "intersection left-turn collision or arrival", the first "0" indicates that it does not include the subtask "intersection straight-going collision or arrival", the second "0" indicates that it does not include the subtask "intersection right-turn collision or arrival", and the second "1" indicates that it includes the subtask "improve traffic efficiency". In the task vector [0,1,0,1] corresponding to intersection straight-going, the first "0" from left to right indicates that the candidate task "intersection straight-going" does not include the subtask "intersection left-turn collision or arrival", the first "1" indicates that it includes the subtask "intersection straight-going collision or arrival", the second "0" indicates that it does not include the subtask "intersection right-turn collision or arrival", and the second "1" indicates that it includes the subtask "improve traffic efficiency". In the task vector [0,0,1,1] corresponding to intersection right turn, the first "0" from left to right indicates that the candidate task "intersection right turn" does not include the subtask "intersection left-turn collision or arrival", the second "0" indicates that it does not include the subtask "intersection straight-going collision or arrival", the first "1" indicates that it includes the subtask "intersection right-turn collision or arrival", and the second "1" indicates that it includes the subtask "improve traffic efficiency".
Optionally, the task vector corresponding to intersection left turn may be represented as [80,0,0,20], the task vector corresponding to intersection straight-going as [0,80,0,20], and the task vector corresponding to intersection right turn as [0,0,80,20]. In the task vector [80,0,0,20] corresponding to intersection left turn, "80" indicates that the candidate task "intersection left turn" includes the subtask "intersection left-turn collision or arrival", the first "0" from left to right indicates that it does not include the subtask "intersection straight-going collision or arrival", the second "0" indicates that it does not include the subtask "intersection right-turn collision or arrival", and "20" indicates that it includes the subtask "improve traffic efficiency". In the task vector [0,80,0,20] corresponding to intersection straight-going, the first "0" from left to right indicates that the candidate task "intersection straight-going" does not include the subtask "intersection left-turn collision or arrival", "80" indicates that it includes the subtask "intersection straight-going collision or arrival", the second "0" indicates that it does not include the subtask "intersection right-turn collision or arrival", and "20" indicates that it includes the subtask "improve traffic efficiency". In the task vector corresponding to a candidate task, a larger element value means that the candidate task places more emphasis on the corresponding subtask and a smaller element value means less emphasis; equivalently, the candidate task includes the subtasks corresponding to non-zero element values and does not include the subtasks corresponding to zero element values. In other words, each element in the task vector corresponding to the candidate task is the weight of the subtask corresponding to that element: the more important a subtask is to the candidate task, the larger its weight; the less important, the smaller its weight.
It should be noted that the order of the subtasks in the task vectors corresponding to the plurality of candidate tasks includes, but is not limited to, the order described above ([intersection left-turn collision or arrival, intersection straight-going collision or arrival, intersection right-turn collision or arrival, improve traffic efficiency]), and may also be other orders.
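For concreteness, the binary and weighted task vectors described above could be stored as follows; this is a sketch, and the task names used as dictionary keys are illustrative assumptions.

    # Illustrative encoding of the task vectors for the intersection scene, using the subtask
    # order [left-turn collision/arrival, straight collision/arrival, right-turn
    # collision/arrival, improve traffic efficiency]; other orders are possible.
    TASK_VECTORS = {
        "intersection_left":     [1, 0, 0, 1],
        "intersection_straight": [0, 1, 0, 1],
        "intersection_right":    [0, 0, 1, 1],
    }
    # Weighted variant in which each element is the weight of its subtask:
    WEIGHTED_TASK_VECTORS = {
        "intersection_left":     [80, 0, 0, 20],
        "intersection_straight": [0, 80, 0, 20],
        "intersection_right":    [0, 0, 80, 20],
    }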
S203, determining the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
Specifically, the task vector corresponding to the target task and the state information s_t of the target task are input into the decision model M_t for processing to obtain the target action of the target task, where the decision model is implemented by a neural network.
Alternatively, the neural network may be a fully-connected neural network, a convolutional neural network, a recurrent neural network, or other type of neural network.
In one embodiment, inputting the task vector corresponding to the target task and the state information s_t of the target task into the decision model M_t for processing to obtain the target action of the target task specifically includes:
acquiring an action value function vector of the target task according to the task vector corresponding to the target task and the state information s_t of the target task, where the action value functions in the action value function vector correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task. The action value function vector may be represented as R(s_t, a_k, g), where a_k is an action in the action space and g is the task vector corresponding to the target task. To avoid the influence of the action value functions of subtasks irrelevant to the target task in the action value function vector R(s_t, a_k, g) of the target task, the action value functions of the subtasks irrelevant to the target task are removed from R(s_t, a_k, g) according to the task vector g corresponding to the target task to obtain the value function of the target task, and the target action is determined from the action space according to the value function of the target task, where the target action is the action in the action space that maximizes the value function of the target task.
Optionally, the value function is a Q value function. The Q value function of the target task may be expressed as Q(s_t, a_k, g) = g^T R(s_t, a_k, g), and the target action may be represented as the action a_k in the action space that maximizes Q(s_t, a_k, g), i.e., a_t = argmax_{a_k} Q(s_t, a_k, g).
It should be noted that the action space is composed of the actions that can be executed when the target task is executed. For example, consider speed planning control of the vehicle given the navigation information, i.e., given the waypoints of the unmanned vehicle. To enhance the interactivity of the vehicle, the unmanned vehicle is required to be capable of stopping and waiting, forcing its way through, and accelerating to pass. Therefore, the designed action space needs to cover a large speed range; this embodiment adopts the discrete action space [0, 3 m/s, 6 m/s, 9 m/s].
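A minimal sketch of selecting the target action over this discrete action space is given below; it assumes the decision model can be evaluated per action and returns one action-value entry per subtask, which is an illustrative assumption rather than the claimed implementation.

    # Mask the per-subtask action values with the task vector g and take the argmax
    # over the discrete action space (illustrative sketch).
    import numpy as np

    ACTION_SPACE = [0.0, 3.0, 6.0, 9.0]  # target speeds in m/s

    def target_action(decision_model, s_t, g):
        q = [np.dot(g, decision_model(s_t, a_k, g)) for a_k in ACTION_SPACE]  # Q = g^T R
        return ACTION_SPACE[int(np.argmax(q))]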
After the target action a_t is executed, the state information s_{t+1} of the target task is acquired, and whether the target task has ended is determined according to the state information s_{t+1} of the target task. For example, when the target task is intersection straight-going, whether the vehicle has passed through the intersection can be judged based on the state information s_{t+1}. If it is determined based on the state information s_{t+1} that the target task has not ended, steps S202-S203 are executed again; if it is determined based on the state information s_{t+1} that the target task has ended, the next task is executed according to steps S201-S203.
It is to be noted here that the state information s_t is the state information at time t and the state information s_{t+1} is the state information at time t+1; the two are the same type of information at different times, and the target action a_t is the action executed at time t. The execution subject of the target action a_t may or may not be the same as the execution subject of the decision model. For example, after the decision device obtains the target action, it sends the target action to the automobile, and the automobile's control device executes the target action to control the automobile.
It can be seen that in the scheme of the application, the target task is subjected to joint characterization according to the subtasks of the multiple candidate tasks, so that the multiple tasks can be decided by using the same model, and the mutual influence among the multiple tasks is avoided; the value function of the target task is obtained according to the task vector and the action value function vector corresponding to the target task, and then the target action is determined based on the value function, so that the influence of the action value function of a subtask irrelevant to the target task in the action value function vector on the selection of the target action is avoided when the target task is decided, and the decision effect of the target task is improved.
In a specific example, as shown in fig. 3, for an intersection interaction scene, the unmanned vehicle needs to be capable of three tasks, namely intersection left turn, intersection straight-going, and intersection right turn; in other words, the candidate tasks of the unmanned vehicle include intersection left turn, intersection straight-going, and intersection right turn. If it is determined from the navigation information that the vehicle needs to go straight through the intersection, the target task determined from the candidate tasks is intersection straight-going;
decomposing the three candidate tasks respectively according to prior knowledge to obtain the subtasks of each of the three candidate tasks: the subtasks of intersection left turn include intersection left-turn collision or arrival and improving traffic efficiency, the subtasks of intersection straight-going include intersection straight-going collision or arrival and improving traffic efficiency, and the subtasks of intersection right turn include intersection right-turn collision or arrival and improving traffic efficiency; extracting the characteristic subtasks and the common subtask from the subtasks of the three candidate tasks, where the characteristic subtasks include intersection left-turn collision or arrival, intersection straight-going collision or arrival, and intersection right-turn collision or arrival, and the common subtask is improving traffic efficiency; acquiring the task vector corresponding to each of the three candidate tasks according to the characteristic subtasks and the common subtask, where the task vector g corresponding to intersection straight-going is [0,1,0,1];
in the intersection scene, since the host vehicle needs to play multiple rounds of interactive games with the surrounding social vehicles, complete information about the social vehicles around the host vehicle needs to be obtained. The state information s_t is acquired, where the state information s_t includes the speed v_e of the host vehicle and the position coordinates (x_i, y_i), velocity v_i, and heading angle θ_i of the social vehicles in the host-vehicle coordinate system. Assuming that the surrounding social vehicles are the five vehicles closest to the host vehicle, the state information s_t can be expressed as: s_t = [v_e, x_1, y_1, v_1, θ_1, …, x_5, y_5, v_5, θ_5];
The task vector g corresponding to intersection straight-going and the state information s_t are input into the decision model for processing to obtain the target action corresponding to intersection straight-going. Specifically, the action value function vector R(s_t, a_k, g) is obtained according to the task vector g corresponding to intersection straight-going and the state information s_t, where the action value functions in R(s_t, a_k, g) correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to intersection straight-going. To eliminate the influence of the action value function corresponding to intersection right-turn collision or arrival and the action value function corresponding to intersection left-turn collision or arrival in the action value function vector R(s_t, a_k, g), the Q value function corresponding to intersection straight-going is obtained based on the task vector g corresponding to intersection straight-going and the action value function vector R(s_t, a_k, g); the Q value function can be expressed as: Q(s_t, a_k, g) = g^T R(s_t, a_k, g);
Finally, the action at which the Q value function Q(s_t, a_k, g) corresponding to intersection straight-going takes its maximum value is determined as the target action a_t, for example, accelerating straight ahead. After the target action a_t is executed, the state information s_{t+1} of the target task is acquired, and whether the target task has ended is judged by means of the state information s_{t+1}. For example, it may be determined from the coordinates of the host vehicle in the state information s_{t+1} that the vehicle has passed through the intersection, and therefore that the intersection straight-going task has ended. If it is determined from the state information that the vehicle has not passed through the intersection, the target action corresponding to intersection straight-going is obtained again according to the above method until the vehicle passes through the intersection.
It can be seen that this embodiment can be used in an intersection scene to learn the passing strategies for the three directions simultaneously in one decision model, successfully find the timing for entering the intersection, improve the traffic efficiency through multiple interactions, and avoid collisions.
Referring to fig. 4, fig. 4 is a schematic flowchart of a training method of a decision model based on multi-task learning according to an embodiment of the present application. As shown in fig. 4, the method includes:
S401, randomly acquiring a plurality of pieces of sample data from a first sample database, and adjusting a decision model M_t according to the plurality of pieces of sample data by using a reinforcement learning method to obtain a decision model M_{t+1}.
The first sample database comprises sample data of a plurality of candidate tasks, the sample data of the target task comprises a task vector corresponding to the target task, the task vector corresponding to the target task is obtained based on a common subtask and a characteristic subtask in the candidate tasks, and the target task is any one of the candidate tasks.
Specifically, a loss value is calculated according to the loss function and the state information s_t, the task vector, the target action, the state information s_{t+1}, and the reward value vector r of each piece of sample data in the plurality of pieces of sample data; the decision model M_t is adjusted according to the loss value to obtain the decision model M_{t+1}.
Wherein the loss value can be expressed as:
where r in the formula is the reward value vector in the sample data, g is the task vector in the sample data, the action a_t is the action in the sample data, and the discount coefficient γ is a constant.
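The exact loss formula is not reproduced above; the following is only a sketch of one plausible form, assuming a standard temporal-difference (DQN-style) objective consistent with the symbols r, g, a_t, and the discount coefficient γ, and it should not be read as the claimed loss.

    # Possible TD-style loss for one piece of sample data <s_t, a_t, r, s_{t+1}, g>
    # (illustrative assumption; R_model is assumed to return one action value per subtask).
    import numpy as np

    def td_loss(R_model, sample, gamma, action_space):
        s_t, a_t, r, s_next, g = sample
        q_sa = np.dot(g, R_model(s_t, a_t, g))      # Q(s_t, a_t, g) = g^T R(s_t, a_t, g)
        q_next = max(np.dot(g, R_model(s_next, a, g)) for a in action_space)
        target = np.dot(g, r) + gamma * q_next      # masked reward plus discounted bootstrap
        return (target - q_sa) ** 2                 # squared TD error as the loss value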
In a possible embodiment, before randomly acquiring a plurality of sample data from the first sample database, the method of this embodiment further includes:
acquiring a target task from the plurality of candidate tasks; and acquiring, according to the target task, the state information s_t of the target task and the task vector corresponding to the target task;
generating sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, and adding the sample data of the target task to a preliminary sample database to obtain the first sample database.
Optionally, the plurality of candidate tasks may be tasks in the same scene; for example, in an intersection scene, the plurality of candidate tasks may include intersection straight-going, intersection left turn, and intersection right turn; for another example, in a tactical competitive game scene, the candidate tasks may include the largest number of kills, the longest survival time, and the fewest deaths.
Optionally, the plurality of candidate tasks may further include tasks in different scenes, such as tasks in an intersection scene and tasks in a tactical competitive game scene.
For example, in an intersection scene, since the host vehicle plays multiple rounds of interactive games with the social vehicles, the information of the host vehicle and the information of the surrounding social vehicles need to be obtained; therefore, the state information s_t of the target task includes the information of the host vehicle and the information of the surrounding social vehicles. Optionally, the information of the host vehicle includes the speed v_et of the host vehicle, and the information of a surrounding social vehicle includes the coordinates (x_it, y_it), velocity v_it, and heading angle θ_it of that social vehicle in the host-vehicle coordinate system. Assuming that the surrounding social vehicles include the 5 vehicles closest to the host vehicle, the state information s_t can be represented as s_t = [v_et, x_1t, y_1t, v_1t, θ_1t, …, x_5t, y_5t, v_5t, θ_5t].
For example, in a tactical competitive game scene, the state information s_t of the target task includes the coordinates (x_pt, y_pt) and forward speed v_pt of the main character in the map, and the coordinates (x_it, y_it), forward speed v_it, and forward angle θ_it of teammates in the coordinate system of the main character. Optionally, the state information s_t of the target task may also include the health value, firepower value, number of kills, and survival time of the main character and teammates, as well as the position information, health values, firepower values, and the like of enemy units.
And the task vector corresponding to the target task is obtained according to the characteristic subtask and the common subtask of the multiple candidate tasks.
Decomposing each candidate task in the multiple candidate tasks according to the prior knowledge to obtain a subtask corresponding to each candidate task; acquiring characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
It should be noted that, for a specific description of obtaining a task vector corresponding to a target task according to the target task, reference may be made to the related description of step S202, and no description is given here.
In one possible embodiment, generating the sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t includes:
inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain a candidate action of the target task; selecting the target action of the target task from the candidate action of the target task and an action randomly acquired from the action space according to a preset probability, where the probability that the candidate action of the target task is selected is the preset probability; acquiring the state information s_{t+1} of the target task after the target action is executed, and acquiring a reward value vector of the target task according to the state information s_{t+1} of the target task, where the reward values in the reward value vector correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; the sample data of the target task includes the task vector corresponding to the target task, the state information s_t of the target task, the target action of the target task, the state information s_{t+1} of the target task, and the reward value vector of the target task.
In one example, inputting the state information s_t of the target task and the task vector g corresponding to the target task into the decision model M_t for processing to obtain the target action of the target task includes:
the decision model M_t obtains the action value function vector R(s_t, a_k, g) according to the state information s_t of the target task and the task vector g corresponding to the target task, where the action value functions in the action value function vector R(s_t, a_k, g) of the target task correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task; the decision model M_t obtains the target action a_t of the target task from the action space according to the action value function vector R(s_t, a_k, g) of the target task and the task vector corresponding to the target task. Because the action value function vector R(s_t, a_k, g) of the target task includes the action value functions of subtasks irrelevant to the target task (for example, the subtasks irrelevant to intersection left turn include intersection right-turn arrival or collision and intersection straight-going collision or arrival), in order to avoid the influence of the action value functions of the subtasks irrelevant to the target task, the action value functions of the subtasks irrelevant to the target task are removed from the action value function vector R(s_t, a_k, g) of the target task according to the task vector g corresponding to the target task to obtain the value function of the target task; and a candidate action is determined from the action space according to the value function, where the candidate action is the action in the action space that maximizes the value function of the target task.
In one possible embodiment, selecting the target action a_t of the target task from the candidate action of the target task and an action randomly acquired from the action space according to the preset probability includes:
when the first parameter is greater than the preset probability, determining the candidate action of the target task as the target action a_t of the target task, where the first parameter is a random number with a value range of [0,1]; when the first parameter is not greater than the preset probability, determining the action randomly acquired from the action space as the target action a_t of the target task.
It should be noted that the initial value of the preset probability is 1 or a large value close to 1, and the preset probability gradually decreases as the number of training iterations increases. Setting the initial value of the preset probability to 1 or a large value close to 1 aims to explore new actions as much as possible in the initial stage of training the decision model and to avoid prematurely settling on the currently optimal action during training.
Optionally, the maximum value of the preset probability is 1, and the minimum value is 0.1.
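A minimal sketch of this action-selection rule is given below; the decay schedule shown is an assumption, since the embodiment only states that the preset probability decreases gradually from 1 (or close to 1) toward 0.1.

    # Illustrative selection between the model's candidate action and a random action,
    # following the rule above (first parameter compared against the preset probability).
    import random

    def select_action(candidate_action, action_space, preset_probability):
        first_parameter = random.random()       # random number in [0, 1]
        if first_parameter > preset_probability:
            return candidate_action             # use the action proposed by the decision model
        return random.choice(action_space)      # otherwise explore a randomly acquired action

    def decay(preset_probability, decay_rate=0.999, minimum=0.1):
        # Assumed multiplicative decay toward the stated minimum of 0.1.
        return max(minimum, preset_probability * decay_rate)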
Optionally, the above value function is a Q value function.
After the target action a_t is executed, the state information s_{t+1} of the target task is acquired, and the reward value vector of the target task is acquired according to the state information s_{t+1} of the target task; the reward values in the reward value vector correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task. For example, for the intersection left-turn task, the reward value vector r(s_{t+1}, a_t, g) can be represented as [r_l, r_s, r_r, r_c], where r_l is the reward value for the subtask intersection left-turn collision or arrival, r_s is the reward value for the subtask intersection straight-going collision or arrival, r_r is the reward value for the subtask intersection right-turn collision or arrival, and r_c is the reward value for the subtask improving traffic efficiency.
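For illustration only, such a reward vector could be constructed as follows; the numeric values and the efficiency term are assumptions and do not correspond to the reward function of the embodiment.

    # Illustrative reward vector [r_l, r_s, r_r, r_c] for the intersection left-turn task,
    # ordered as [left-turn collision/arrival, straight collision/arrival,
    # right-turn collision/arrival, improve traffic efficiency].
    def left_turn_reward(collided, arrived, speed, speed_limit):
        r_l = -1.0 if collided else (1.0 if arrived else 0.0)  # subtask relevant to left turn
        r_s = 0.0                                              # straight-going subtask (irrelevant here)
        r_r = 0.0                                              # right-turn subtask (irrelevant here)
        r_c = speed / speed_limit                              # assumed proxy for passing efficiency
        return [r_l, r_s, r_r, r_c]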
The sample data <s_t, a_t, r, s_{t+1}, g> of the target task is obtained according to the task vector g corresponding to the target task, the state information s_t of the target task, the target action a_t, the state information s_{t+1}, and the reward value vector r, and the sample data <s_t, a_t, r, s_{t+1}, g> of the target task is saved to the preliminary sample database to obtain the first sample database.
It should be noted that the execution subject of the target action may be the same as or different from the execution subject of the decision model. For example, after obtaining the target action, the execution subject of the decision model sends the target action to the execution subject of the target action to execute the target action.
S402, judging whether the decision model M_{t+1} has converged; when the decision model M_{t+1} has converged, determining the decision model M_{t+1} as the target decision model.
In one example, judging whether the decision model M_{t+1} has converged includes:
judging, according to the state information s_{t+1} of the target task, whether execution of the target task has been completed; and when it is determined that execution of the target task has been completed, judging whether the decision model M_{t+1} has converged.
For example, for an intersection scene, the state information s_{t+1} includes the speed v_{e(t+1)} of the host vehicle and the coordinates (x_{i(t+1)}, y_{i(t+1)}), velocity v_{i(t+1)}, and heading angle θ_{i(t+1)} of the surrounding vehicles in the host-vehicle coordinate system. Whether the host vehicle has passed through the intersection can be judged from the speed v_{e(t+1)} of the host vehicle and the coordinates (x_{i(t+1)}, y_{i(t+1)}), velocity v_{i(t+1)}, and heading angle θ_{i(t+1)} of the surrounding vehicles in the host-vehicle coordinate system; if the host vehicle has passed through the intersection, it is determined that the target task has ended; if the host vehicle has not passed through the intersection, it is determined that the target task has not ended.
Further, when it is determined that the target task has not ended, let t = t+1 and repeat the steps of "generating sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, and adding the sample data of the target task to the preliminary sample database to obtain the first sample database" in step S401 and step S402 until the target task ends.
After it is determined that the target task has ended, whether the decision model M_{t+1} has converged is judged.
Whether the decision model M_{t+1} has converged is judged; if it has converged, the decision model M_{t+1} is determined as the target decision model; if it has not converged, steps S401-S402 are repeatedly executed until the decision model M_{t+1} converges.
Alternatively, whether the decision model M_{t+1} has converged is determined by judging whether the accumulated reward value converges. Specifically, the task vector g corresponding to the target task and the state information s_{t+1} are input into the decision model M_{t+1} for processing to obtain a new reference action; more specifically, the action value function vector R(s_{t+1}, a_k, g) is obtained according to the task vector g corresponding to the target task and the state information s_{t+1}, the Q value function Q(s_{t+1}, a_k; g) of the target task is determined according to the action value function vector R(s_{t+1}, a_k, g) and the task vector g corresponding to the target task, and the action in the action space for which the Q value function Q(s_{t+1}, a_k; g) takes its maximum value is determined as the new reference action. The target action a_{t+1} of the target task is selected from the new candidate action and an action randomly acquired from the action space according to a new preset probability, where the probability that the new candidate action is selected is the new preset probability. After the target action a_{t+1} is executed, the state information s_{t+2} is acquired, and the reward value vector r(s_{t+2}, a_{t+1}, g) of the target task is determined according to the reward value function and the state information s_{t+2};
Whether the target task has ended is judged according to the current state information s_{t+2}; when it is determined that the target task has not ended, let t = t+1 and repeat the above steps until the target task ends. The reward values corresponding to the subtasks relevant to the target task in each of the plurality of reward value vectors are accumulated to obtain an accumulated reward value.
If the accumulated reward value converges, it is determined that the decision model M_{t+1} has converged; if the accumulated reward value does not converge, it is determined that the decision model M_{t+1} has not converged.
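One plausible way to realize such a convergence test is sketched below; the windowing and tolerance are assumptions, since the embodiment only states that convergence of the accumulated reward value is checked.

    # Illustrative convergence check based on accumulated (task-masked) reward values.
    import numpy as np

    def episode_return(reward_vectors, g):
        # Sum the rewards of the subtasks relevant to the target task over one episode.
        return sum(np.dot(g, r) for r in reward_vectors)

    def has_converged(episode_returns, window=20, tolerance=1e-2):
        if len(episode_returns) < 2 * window:
            return False
        recent = np.mean(episode_returns[-window:])
        previous = np.mean(episode_returns[-2 * window:-window])
        return abs(recent - previous) < tolerance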
It should be noted that, in the above process, a new target action a_{t+1}, state information s_{t+2}, and reward value vector r(s_{t+2}, a_{t+1}, g) are obtained; sample data is generated according to the task vector g corresponding to the target task, the state information s_{t+1}, the target action a_{t+1}, the state information s_{t+2}, and the reward value vector r(s_{t+2}, a_{t+1}, g), and the sample data is stored in the first sample database to obtain a new sample database.
After the target task ends, the target task is deleted from the plurality of candidate tasks to obtain the remaining candidate tasks; when step S401 needs to be executed again, acquiring a target task from the plurality of candidate tasks specifically means acquiring a new target task from the remaining candidate tasks.
It can be seen that, in the solution of the present application, a task vector composed of the characteristic subtasks and the common subtasks is obtained by jointly characterizing the target task, so that the strategies of multiple candidate tasks can be learned by one model. The common subtasks of the multiple candidate tasks promote the learning of the strategies of the multiple candidate tasks and improve the convergence of the model, while the characteristic subtasks enable targeted learning of the multiple candidate tasks, which avoids mutual interference among the tasks and prevents the model from compromising among them, so that the same model can achieve an excellent effect when making decisions for multiple tasks. By constructing a reward value vector, the execution condition of each subtask of the target task is fed back to the decision model for learning. The value function of the target task is obtained according to the task vector corresponding to the target task and the action value function vector, and the target action is then determined based on the value function, so that when a decision is made for the target task, the selection of the target action is not influenced by the action value functions of subtasks irrelevant to the target task in the action value function vector.
Referring to fig. 5, fig. 5 is a schematic flowchart of another decision model training method based on multi-task learning according to an embodiment of the present application. As shown in fig. 5, the method includes:
S501, determining a target task from a plurality of candidate tasks, and acquiring, according to the target task, a task vector g corresponding to the target task and state information s_t of the target task.
The plurality of tasks are tasks in the same scene; for example, in an intersection scene, the plurality of candidate tasks include intersection left turn, intersection straight-going, and intersection right turn; for another example, in a tactical competitive game scene, the candidate tasks include the longest survival time, the fewest deaths, and the largest number of kills. Of course, the plurality of candidate tasks may also be tasks in different scenes.
The target task is any candidate task among the plurality of candidate tasks that has not yet been used to train the decision model.
Wherein the task vector corresponding to the target task is obtained based on the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
Optionally, before the target task is determined from the multiple candidate tasks, performing task joint characterization on each candidate task in the multiple candidate tasks to obtain a task vector corresponding to each candidate task.
Specifically, task decomposition is performed on each of the plurality of candidate tasks according to prior knowledge to obtain the subtasks of each candidate task; the characteristic subtasks and the common subtasks are acquired according to the subtasks of each of the plurality of candidate tasks; and task joint characterization is performed on each of the plurality of candidate tasks according to the characteristic subtasks and the common subtasks to obtain the task vector corresponding to each candidate task. The elements in the task vector corresponding to each candidate task correspond one-to-one to the characteristic subtasks and the common subtasks, and the different values of each element in the task vector corresponding to a candidate task indicate whether the candidate task includes the subtask corresponding to that element.
For example, the plurality of candidate tasks include intersection left turn, intersection straight-going, and intersection right turn. According to prior knowledge, the intersection left turn can be decomposed into two subtasks, namely intersection left-turn arrival or collision and improving traffic efficiency; the intersection straight-going can be decomposed into two subtasks, namely intersection straight-going arrival or collision and improving traffic efficiency; and the intersection right turn can be decomposed into two subtasks, namely intersection right-turn arrival or collision and improving traffic efficiency. The characteristic subtasks are intersection left-turn arrival or collision, intersection straight-going collision or arrival, and intersection right-turn collision or arrival, and the common subtask is improving traffic efficiency.
In one example, the task vector corresponding to intersection left turn can be represented as [1,0,0,1], the task vector corresponding to intersection straight-going as [0,1,0,1], and the task vector corresponding to intersection right turn as [0,0,1,1], where an element 1 in a vector indicates that the subtasks of the candidate task include the subtask corresponding to that element, and an element 0 indicates that they do not.
In the intersection scene, the state information s_t of the target task includes the information of the host vehicle and the information of the surrounding social vehicles. Optionally, the information of the host vehicle includes the speed v_et of the host vehicle, and the information of a surrounding social vehicle includes the coordinates (x_it, y_it), velocity v_it, and heading angle θ_it of that social vehicle in the host-vehicle coordinate system. Assuming that the surrounding social vehicles include the 5 vehicles closest to the host vehicle, the state information s_t of the target task can be represented as s_t = [v_et, x_1t, y_1t, v_1t, θ_1t, …, x_5t, y_5t, v_5t, θ_5t].
S502, inputting the task vector g corresponding to the target task and the state information s_t of the target task into the decision model M_t for processing to obtain a candidate action, and selecting the target action a_t of the target task from the candidate action of the target task and an action randomly acquired from the action space according to the preset probability.
And the probability that the candidate action of the target task is selected is a preset probability.
The action value function vector R(s_t, a_k, g) of the target task is obtained according to the task vector g corresponding to the target task and the state information s_t of the target task. The action value functions in the action value function vector R(s_t, a_k, g) correspond one-to-one to the subtasks corresponding to the elements in the task vector corresponding to the target task. An action value function in the action value function vector is used to characterize how good the result is when the action a_k is performed for the corresponding subtask.
It should be noted here that the above decision model M_t is implemented based on a neural network, such as a fully-connected neural network, a convolutional neural network, a pooling neural network, or other forms of neural networks.
The Q value function Q(s_t, a_k; g) of the target task is determined according to the task vector g corresponding to the target task and the action value function vector R(s_t, a_k, g) of the target task.
Because the action value function vector R(s_t, a_k, g) of the target task includes the action value functions of subtasks irrelevant to the target task (for example, the subtasks irrelevant to intersection left turn include intersection right-turn arrival or collision and intersection straight-going collision or arrival), in order to avoid the influence of the action value functions of the subtasks irrelevant to the target task, the action value functions of the subtasks irrelevant to the target task are removed from the action value function vector R(s_t, a_k, g) of the target task according to the task vector g corresponding to the target task to obtain the Q value function of the target task.
The Q value function of the target task can be expressed as: Q(s_t, a_k; g) = g^T R(s_t, a_k, g).
A candidate action is determined from the action space according to the Q value function Q(s_t, a_k; g) of the target task, the candidate action being the action in the action space for which the Q value function Q(s_t, a_k; g) takes its maximum value.
In one possible embodiment, selecting the target action a_t of the target task from the candidate action of the target task and an action randomly acquired from the action space according to the preset probability includes:
when the first parameter is greater than the preset probability, determining the candidate action of the target task as the target action a_t of the target task, where the first parameter is a random number with a value range of [0,1]; when the first parameter is not greater than the preset probability, determining the action randomly acquired from the action space as the target action a_t of the target task.
It should be noted that the initial value of the preset probability is 1 or a value close to 1; the preset probability gradually decreases as the number of training iterations increases; optionally, the maximum value of the preset probability is 1 and the minimum value is 0.1.
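As a sketch (the decay schedule below is an assumption; the method only requires that the preset probability starts near 1 and decreases to about 0.1 as training progresses), the selection rule of S502 can be written as:

```python
import random

def select_target_action(candidate_action, action_space, preset_probability):
    """Select a_t: candidate action if the random first parameter exceeds the
    preset probability, otherwise a uniformly random action from the action space."""
    first_parameter = random.random()          # random number in [0, 1]
    if first_parameter > preset_probability:
        return candidate_action
    return random.choice(action_space)

def decayed_probability(step, p_max=1.0, p_min=0.1, decay=0.999):
    """Preset probability starts at p_max and decreases towards p_min with training."""
    return max(p_min, p_max * decay ** step)

# Example: early in training exploration dominates, later the candidate action dominates.
action_space = [0, 1, 2, 3, 4]
a_t = select_target_action(1, action_space, decayed_probability(step=10))
```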
S503, after the target action a_t is executed, the state information s_{t+1} of the target task is acquired, and the reward value vector r(s_{t+1}, a_t, g) of the target task is obtained according to the state information s_{t+1} and the reward value function.
The reward values in the reward value vector r(s_{t+1}, a_t, g) correspond one to one with the subtasks corresponding to the elements in the task vector of the target task. For example, for a left turn at the intersection, the reward value vector can be expressed as [r_l, r_s, r_r, r_c], where r_l is the reward value of the subtask "left turn at the intersection: collision or arrival", r_s is the reward value of the subtask "going straight at the intersection: collision or arrival", r_r is the reward value of the subtask "right turn at the intersection: collision or arrival", and r_c is the reward value of the subtask "improving passage efficiency".
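Purely as an illustration (the specific reward magnitudes below are assumptions of this sketch, not values prescribed by the method), a reward value vector for the intersection subtasks could be computed like this:

```python
def intersection_reward_vector(left_event, straight_event, right_event, step_time):
    """Return [r_l, r_s, r_r, r_c] for the subtasks
    left/straight/right 'collision or arrival' plus passage efficiency.

    Each *_event is one of 'arrival', 'collision', or None (episode still running).
    """
    def collision_or_arrival(event):
        if event == "arrival":
            return 1.0
        if event == "collision":
            return -1.0
        return 0.0

    r_l = collision_or_arrival(left_event)
    r_s = collision_or_arrival(straight_event)
    r_r = collision_or_arrival(right_event)
    r_c = -0.01 * step_time      # small per-step penalty rewards faster passage
    return [r_l, r_s, r_r, r_c]

# Example: the own vehicle completed its left turn on this step.
r = intersection_reward_vector("arrival", None, None, step_time=1.0)
```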
It should be noted here that the state information s_t and the state information s_{t+1} are the same type of information at different times.
S504, according to the task vector g corresponding to the target task, the state information s_t, the state information s_{t+1}, the target action a_t, and the reward value vector r(s_{t+1}, a_t, g), the sample data <s_t, a_t, r, s_{t+1}, g> of the target task is obtained, and the sample data <s_t, a_t, r, s_{t+1}, g> is stored in the sample database.
It should be noted that the sample database includes sample data of multiple candidate tasks, and for the same candidate task, sample data acquired at different times may be included.
S505, a plurality of pieces of sample data are randomly acquired from the sample database, and a loss value is calculated according to a loss function and the plurality of pieces of sample data; the decision model M_t is adjusted according to the loss value to obtain a decision model M_{t+1}.
The loss value can be expressed as:
L = (g^T r + γ · max_{a'} Q(s_{t+1}, a'; g) − Q(s_t, a_t; g))^2
where r in the formula is the reward value vector in the sample data, g is the task vector in the sample data, a_t is the target action in the sample data, and the discount coefficient γ is a constant.
It should be noted that the plurality of pieces of sample data may all belong to the same task, may all belong to different tasks, or may be a mixture in which some pieces belong to the same task and others belong to different tasks.
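A sketch of the loss computation over a sampled batch, assuming (as this sketch does, not the method text) that the decision model is a callable returning the action value matrix R(s, a, g) for all actions and that the standard DQN-style squared temporal-difference error is used:

```python
import numpy as np

def td_loss(model, batch, gamma=0.99):
    """Mean squared TD error over a batch of <s_t, a_t, r, s_{t+1}, g> samples.

    model(s, g) is assumed to return R(s, a, g) as an array of shape
    (num_actions, num_subtasks), so that Q(s, a; g) = g^T R(s, a, g).
    """
    losses = []
    for s_t, a_t, reward_vector, s_next, g in batch:
        q_now = np.asarray(model(s_t, g)) @ g      # Q(s_t, a; g) for every action a
        q_next = np.asarray(model(s_next, g)) @ g  # Q(s_{t+1}, a; g) for every action a
        target = g @ np.asarray(reward_vector) + gamma * np.max(q_next)
        losses.append((target - q_now[a_t]) ** 2)
    return float(np.mean(losses))
```

The loss value would then be minimized with respect to the parameters of M_t (for example by stochastic gradient descent) to obtain M_{t+1}.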
S506, whether the target task is finished is judged according to the state information s_{t+1}.
For example, for the intersection scenario, the state information s_{t+1} includes the speed v_{e(t+1)} of the own vehicle, and the coordinates (x_{i(t+1)}, y_{i(t+1)}), velocity v_{i(t+1)}, and heading angle θ_{i(t+1)} of the surrounding vehicles in the own-vehicle coordinate system. Whether the own vehicle has passed through the intersection can be judged from these quantities; if the own vehicle has passed through the intersection, it is determined that the target task is finished; if it has not passed through the intersection, it is determined that the target task is not finished.
When it is determined that the target task is finished, step S507 is executed; when it is determined that the target task is not finished, let t = t+1 and execute step S502.
S507, whether the decision model M_{t+1} converges is judged.
If it is determined that the decision model M_{t+1} converges, step S508 is executed; if it is determined that it does not converge, step S501 is executed.
Alternatively, whether the decision model M_{t+1} converges is determined by judging whether the accumulated reward value converges. Specifically, the task vector g corresponding to the target task and the state information s_{t+1} are input into the decision model M_{t+1} to obtain a new reference action: an action value function vector R(s_{t+1}, a_k, g) is obtained according to the task vector g and the state information s_{t+1}; a Q value function Q(s_{t+1}, a_k; g) of the target task is determined according to the action value function vector R(s_{t+1}, a_k, g) and the task vector g corresponding to the target task; and the action in the action space for which Q(s_{t+1}, a_k; g) takes the largest value is determined as the new reference action, i.e., the new candidate action. A new target action a_{t+1} is then selected from the new candidate action and an action randomly acquired from the action space according to a new preset probability, where the probability that the new candidate action is selected is the new preset probability. After the new target action a_{t+1} is executed, the state information s_{t+2} is acquired, and a reward value vector r(s_{t+2}, a_{t+1}, g) of the target task is determined according to the reward value function and the current state information s_{t+2}. Whether the target task is finished is then judged according to the state information s_{t+2}; when it is determined that the target task is not finished, let t = t+1 and the above steps are repeated until the target task ends. The reward values corresponding to the subtasks related to the target task in each of the resulting reward value vectors are accumulated to obtain a reward accumulation value.
If the reward accumulation value converges, it is determined that the decision model M_{t+1} converges; if the reward accumulation value does not converge, it is determined that the decision model M_{t+1} does not converge.
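One simple way to implement this convergence test on the reward accumulation values, assuming (as a design choice of this sketch, not something the method prescribes) that convergence means the accumulated rewards of recent episodes stay within a small tolerance:

```python
def reward_converged(episode_returns, window=20, tolerance=1e-2):
    """Treat the accumulated reward as converged when the last `window`
    episode returns stay within `tolerance` of each other."""
    if len(episode_returns) < window:
        return False
    recent = episode_returns[-window:]
    return max(recent) - min(recent) < tolerance
```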
It should be noted that, in the above process, after a new target action a_{t+1}, state information s_{t+2}, and a reward value vector r(s_{t+2}, a_{t+1}, g) are obtained, sample data is generated according to the task vector g corresponding to the target task, the state information s_{t+1}, the target action a_{t+1}, the state information s_{t+2}, and the reward value vector r(s_{t+2}, a_{t+1}, g), and the sample data is stored in the sample database.
After the target task is finished, the target task is deleted from the plurality of candidate tasks to obtain the remaining candidate tasks; when step S501 needs to be executed again, acquiring a target task from the plurality of candidate tasks specifically means acquiring a new target task from the remaining candidate tasks.
S508, the decision model M_{t+1} is determined as the target decision model, and the target decision model is stored.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a decision device according to an embodiment of the present disclosure. As shown in fig. 6, the decision device 600 includes:
the task and state information obtaining module 601 is configured to obtain a target task and a task vector corresponding to the target task from a plurality of candidate tasks, obtain state information of the target task in a process of training a decision model by the decision model training module 602 and a process of making a decision by the decision module 603 based on the target decision model, and send the target task, the task vector of the target task, and the state information of the target task to the decision model training module 602 and the decision module 603;
a decision model training module 602, configured to update a sample database according to a task vector corresponding to a target task and state information of the target task; randomly acquiring a plurality of sample data from the updated sample database, and training a decision model according to the plurality of sample data to obtain a target decision model; and sends the target decision model to the decision module 603;
the decision module 603 is configured to obtain a target action based on the trained decision model, the state information of the target task, and the task vector corresponding to the target task, and send the target action to the control module 604.
And a control module 604 for performing the target action to complete the target task.
It should be noted that the task and state information obtaining module 601 is specifically configured to execute the relevant contents of steps S201, S401, and S501; the decision model training module 602 is configured to execute the relevant contents of steps S402 and S502-S508; and the decision module 603 is configured to execute the relevant contents of steps S202-S203, which are not described in detail here again.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an on-board device according to an embodiment of the present application. As shown in fig. 7, the in-vehicle apparatus 700 includes:
the environment sensing module 701 is used for sensing surrounding environment information of the own vehicle, for example, the surrounding environment information is obtained by integrating various sensors, and information such as the position and the speed of the own vehicle, the position and the course angle of social vehicles around the own vehicle is obtained; and sends this information as state information to the decision block 703.
A navigation information module 702, configured to determine the navigation information of the vehicle at the intersection, where the navigation information includes information such as the distance to the intersection, the intersection traffic light state, and the driving direction of the vehicle, and the possible steering directions of the vehicle constitute the plurality of candidate tasks; the navigation information is sent to the decision module 703.
A decision module 703, configured to determine a target task, such as a left turn at an intersection, from the multiple candidate tasks according to the navigation information; performing task joint characterization on the target task according to the plurality of candidate tasks to obtain a task vector corresponding to the target task; the task vector corresponding to the target task and the state information obtained by the environment sensing module 701 are input into the decision model for processing, so as to obtain target actions of the target task, such as left-turn acceleration, left-turn deceleration and the like, and the target actions are sent to the vehicle control module 704.
And a vehicle control module 704 for controlling the vehicle to run according to the target action to complete the target task.
It should be noted that, for a specific process of the decision module 703 obtaining the target action based on the state information and the task vector corresponding to the target task, reference may be made to the related descriptions of steps S202 to S203, and no specific description is provided here.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a decision device for multitask learning. As shown in fig. 8, the decision device 800 includes:
an obtaining unit 801, configured to obtain a plurality of candidate tasks, obtain a target task from the plurality of candidate tasks, and acquire state information s_t of the target task according to the target task;
A joint characterization unit 802, configured to perform task joint characterization on a target task according to multiple candidate tasks to obtain a task vector corresponding to the target task, where the task vector corresponding to each candidate task is obtained based on a characteristic subtask and a common subtask of the multiple candidate tasks;
a determining unit 803, configured to determine the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
In a possible embodiment, the joint characterization unit 802 is specifically configured to:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task; extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
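A sketch of this task joint characterization, assuming (for this sketch only) that each subtask is identified by a string and that the task vector is a 0/1 indicator over the union of the characteristic and common subtasks:

```python
def joint_characterization(candidate_tasks):
    """candidate_tasks: dict mapping task name -> set of its subtasks.

    Returns (subtask_order, task_vectors) where each task vector marks with 1
    the subtasks relevant to that task (its characteristic subtasks plus the
    common subtasks shared by all candidate tasks).
    """
    all_subtasks = set.union(*candidate_tasks.values())
    common = set.intersection(*candidate_tasks.values())
    characteristic = all_subtasks - common
    subtask_order = sorted(characteristic) + sorted(common)
    task_vectors = {
        name: [1 if st in subtasks else 0 for st in subtask_order]
        for name, subtasks in candidate_tasks.items()
    }
    return subtask_order, task_vectors

# Example: three intersection tasks sharing a "passage efficiency" subtask.
tasks = {
    "left":     {"left: collision or arrival", "passage efficiency"},
    "straight": {"straight: collision or arrival", "passage efficiency"},
    "right":    {"right: collision or arrival", "passage efficiency"},
}
order, vectors = joint_characterization(tasks)   # e.g. vectors["left"] == [1, 0, 0, 1]
```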
In one possible embodiment, in terms of determining the target action according to the task vector corresponding to the target task and the state information s_t of the target task, the determining unit 803 is specifically configured to:
input the task vector corresponding to the target task and the state information s_t of the target task into a target decision model for processing to obtain the target action of the target task, where the target decision model is implemented based on a neural network.
In a possible embodiment, the determining unit 803 is specifically configured to:
obtain an action value function vector of the target task according to the task vector corresponding to the target task and the state information s_t of the target task, where the action value functions in the action value function vector correspond one to one with the subtasks corresponding to the elements in the task vector of the target task; obtain a value function of the target task according to the task vector corresponding to the target task and the action value function vector; and obtain the target action from the action space according to the value function of the target task, where the target action is the action that maximizes the value function of the target task in the action space.
It should be noted that the above units (the obtaining unit 801, the joint characterization unit 802, and the determination unit 803) are used for executing relevant contents of the methods shown in the above steps S201 to S203.
In the present embodiment, the decision device 800 is presented in the form of a unit. As used herein, a unit may refer to a specific application-specific integrated circuit (ASIC), a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or other devices that may provide the described functionality. Furthermore, the obtaining unit 801, the joint characterization unit 802, and the determination unit 803 in the above decision device 800 may be implemented by the processor 1000 of the decision device shown in fig. 10.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a decision model training apparatus according to an embodiment of the present application. As shown in fig. 9, the training apparatus 900 includes:
an obtaining unit 901 that obtains a plurality of sample data from a first sample database at random; the first sample database comprises sample data of a plurality of candidate tasks, the sample data of the target task comprises a task vector corresponding to the target task, the task vector corresponding to the target task is obtained based on a common subtask and a characteristic subtask in the candidate tasks, and the target task is any one of the candidate tasks;
an adjusting unit 902, configured to adjust the decision model M_t according to the plurality of sample data by using a reinforcement learning method, to obtain a decision model M_{t+1};
a determining unit 903, configured to, when the decision model M_{t+1} converges, determine the decision model M_{t+1} as the target decision model.
In one possible embodiment of the present invention,
the obtaining unit 901 is configured to obtain a target task from the plurality of candidate tasks, obtain state information s_t of the target task, and acquire a task vector corresponding to the target task,
the training apparatus 900 further comprises:
an updating unit 904, configured to generate sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, and to add the sample data of the target task to the preliminary sample database to obtain the first sample database.
In a possible embodiment, in terms of obtaining a task vector corresponding to a target task, the obtaining unit 901 is specifically configured to:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task; extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task in the multiple candidate tasks; and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the plurality of candidate tasks.
In a possible embodiment, in terms of generating the sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, the updating unit 904 is specifically configured to:
input the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain a candidate action of the target task; select the target action of the target task from the candidate action of the target task and an action randomly acquired from the action space according to a preset probability, where the probability that the candidate action of the target task is selected is the preset probability; after the target action is executed, obtain state information s_{t+1} of the target task, and acquire a reward value vector of the target task according to the state information s_{t+1} of the target task, where the reward values in the reward value vector correspond one to one with the subtasks corresponding to the elements in the task vector corresponding to the target task;
the sample data of the target task includes the task vector corresponding to the target task, the state information s_t of the target task, the target action of the target task, the state information s_{t+1} of the target task, and the reward value vector of the target task.
In one possible embodiment, in terms of inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain the candidate action of the target task, the updating unit 904 is specifically configured to:
make the decision model M_t acquire the action value function vector of the target task according to the state information s_t of the target task and the task vector corresponding to the target task, where the action value functions in the action value function vector of the target task correspond one to one with the subtasks corresponding to the elements in the task vector corresponding to the target task; acquire the value function of the target task according to the action value function vector of the target task and the task vector corresponding to the target task; and acquire the candidate action of the target task from the action space according to the value function of the target task, where the candidate action is the action that maximizes the value function of the target task in the action space.
In a possible embodiment, in terms of selecting the target action of the target task from the candidate actions of the target task and the actions randomly obtained from the action space according to the preset probability, the updating unit 904 is specifically configured to:
when the first parameter is greater than the preset probability, determining the candidate action of the target task as the target action of the target task; the first parameter is a random number with a value range of [0,1 ]; and when the first parameter is not greater than the preset probability, determining the action randomly acquired from the action space as the target action of the target task.
In a possible embodiment, the adjusting unit 902 is specifically configured to:
calculate the loss value according to the loss function and the state information s_t, the task vector, the target action, the state information s_{t+1}, and the reward value vector of each piece of sample data in the plurality of sample data; and adjust the decision model M_t according to the loss value to obtain the decision model M_{t+1}.
The units (the acquiring unit 901, the adjusting unit 902, the determining unit 903, and the updating unit 904) are configured to execute relevant contents of the methods shown in the steps S401 and S402 and the steps S501 to S508.
In this embodiment, the training apparatus 900 is presented in the form of a unit. As used herein, a unit may refer to a specific application-specific integrated circuit (ASIC), a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or other devices that may provide the described functionality. Further, the acquiring unit 901, the adjusting unit 902, the determining unit 903, and the updating unit 904 in the above training apparatus 900 may be implemented by the processor 1100 of the training apparatus shown in fig. 11.
The decision-making means shown in fig. 10 may be implemented in the structure of fig. 10, and the decision-making means 1000 comprises at least one processor 1001, at least one memory 1002 and at least one communication interface 1003. The processor 1001, the memory 1002, and the communication interface 1003 are connected by a communication bus and perform communication with each other.
The memory 1002 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that may store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disk read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integral to the processor.
The memory 1002 is used for storing application program codes for executing the above schemes, and the execution is controlled by the processor 1001. The processor 1001 is used to execute the application code stored in the memory 1002.
The memory 1002 stores code that may perform one of the multi-task learning based decision methods provided above.
The processor 1001 may also use one or more integrated circuits for executing related programs, so as to implement the decision method based on multi-task learning according to the embodiments of the present application.
The processor 1001 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the decision method of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1001. In implementation, the steps of the training method for the state generation model and the selection strategy of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1001. The processor 1001 may also be a general purpose processor, a Digital Signal Processor (DSP), an ASIC, an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. The various methods, steps and block diagrams of modules disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 1002, and the processor 1001 reads information in the memory 1002 and completes the decision method according to the embodiment of the present application in combination with hardware thereof.
The communication interface 1003 enables communication between the decision-making device and other devices or communication networks using a transceiver device such as, but not limited to, a transceiver. For example, the status information may be acquired through the communication interface 1003, or the target action may be transmitted to an execution apparatus (such as a control device of a vehicle).
A bus may include a pathway to transfer information between various components of the device (e.g., memory 1002, processor 1001, communication interface 1003). In one possible embodiment, the processor 1001 specifically performs the following steps:
acquiring a plurality of candidate tasks, and acquiring a target task from the plurality of candidate tasks; acquiring the state information s_t of the target task according to the target task; performing task joint characterization on the target task according to the plurality of candidate tasks to obtain a task vector corresponding to the target task, where the task vector corresponding to each candidate task is obtained based on the characteristic subtasks and the common subtasks of the plurality of candidate tasks; and determining the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
The decision model training apparatus shown in fig. 11 can be implemented in the structure of fig. 11, and the training apparatus 1100 includes at least one processor 1101, at least one memory 1102 and at least one communication interface 1103. The processor 1101, the storage 1102 and the communication interface 1103 are connected by a communication bus and perform communication with each other.
The memory 1102 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that may store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disc, laser disc, optical disc, digital versatile disc, blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integral to the processor.
The memory 1102 is used for storing application program codes for executing the above schemes, and the execution of the application program codes is controlled by the processor 1101. The processor 1101 is configured to execute the application code stored in the memory 1102.
The memory 1102 stores code that performs one of the multi-task learning based decision model training methods provided above.
The processor 1101 may also use one or more integrated circuits for executing related programs, so as to implement the decision model training method based on multi-task learning according to the embodiments of the present application.
The processor 1101 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the decision model training method of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1101. In implementation, the steps of the training method of the state generation model and the selection strategy of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 1101. The processor 1101 may also be a general purpose processor, a Digital Signal Processor (DSP), an ASIC, an FPGA (field programmable gate array) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The various methods, steps and block diagrams of modules disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 1102, and the processor 1101 reads the information in the memory 1102 and completes the decision model training method of the embodiment of the present application in combination with the hardware thereof.
The communication interface 1103 uses a transceiver device, such as, but not limited to, a transceiver, to enable communication between the decision model training device and other devices or communication networks. For example, sample data used in model training may be acquired through the communication interface 1103, or a trained decision model may be transmitted to the decision device.
A bus may include a pathway to transfer information between various components of the device (e.g., memory 1102, processor 1101, communication interface 1103). In one possible embodiment, the processor 1101 performs the following steps:
acquiring a target task from a plurality of candidate tasks; obtaining state information s_t of the target task, and acquiring a task vector corresponding to the target task, where the task vector corresponding to the target task is obtained based on the common subtasks and the characteristic subtasks of the plurality of candidate tasks; generating sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, and adding the sample data of the target task to the preliminary sample database to obtain a first sample database; randomly acquiring a plurality of sample data from the first sample database, where the plurality of sample data are sample data of some or all of the plurality of candidate tasks; adjusting the decision model M_t according to the plurality of sample data by using a reinforcement learning method to obtain a decision model M_{t+1}; judging whether the decision model M_{t+1} converges; and when the decision model M_{t+1} converges, determining the decision model M_{t+1} as the target decision model.
Embodiments of the present application provide a computer storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform part or all of the steps of any one of the above-described method embodiments of a multi-task learning based decision method or a decision model training method.
In some embodiments, the disclosed methods may be implemented as computer program instructions encoded on a computer-readable storage medium in a machine-readable format or encoded on other non-transitory media or articles of manufacture. Fig. 12 schematically illustrates a conceptual partial view of an example computer program product comprising a computer program for executing a computer process on a computing device, arranged in accordance with at least some embodiments presented herein. In one embodiment, the example computer program product 1200 is provided using a signal bearing medium 1201. The signal bearing medium 1201 may include one or more program instructions 1202 that, when executed by one or more processors, may provide the functions or portions of the functions described above with respect to fig. 2, 4, and 5. Thus, for example, referring to the embodiment shown in FIG. 2, one or more features of steps S201-203 may be undertaken by one or more instructions associated with the signal bearing medium 1201. Further, program instructions 1202 in FIG. 12 also describe example instructions.
In some examples, signal bearing medium 1201 may include a computer readable medium 1203, such as, but not limited to, a hard disk drive, a Compact Disc (CD), a Digital Video Disc (DVD), a digital tape, a Memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. In some implementations, the signal bearing medium 1201 may include a computer recordable medium 1204 such as, but not limited to, a memory, a read/write (R/W) CD, a R/W DVD, and so forth. In some implementations, the signal bearing medium 1201 can include a communication medium 1205 such as, but not limited to, a digital and/or analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, a wireless communication link, etc.). Thus, for example, signal bearing medium 1201 may be conveyed by a wireless form of communication medium 1205 (e.g., a wireless communication medium conforming to the IEEE 802.11 standard or other transmission protocol). The one or more program instructions 1202 may be, for example, computer-executable instructions or logic-implemented instructions. In some examples, a computing device such as described with respect to fig. 2, 4, and 5 may be configured to provide various operations, functions, or actions in response to program instructions 1202 conveyed to the computing device by one or more of computer readable medium 1203, computer recordable medium 1204, and/or communications medium 1205. It should be understood that the arrangements described herein are for illustrative purposes only. Thus, those skilled in the art will appreciate that other arrangements and other elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used instead, and that some elements may be omitted altogether depending upon the desired results. In addition, many of the described elements are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, in any suitable combination and location.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a memory, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned memory comprises: various media capable of storing program codes, such as a U disk, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in view of the above, the content of the present specification should not be construed as a limitation to the present application.
Claims (27)
1. A decision model training method based on multi-task learning is characterized by comprising the following steps:
randomly acquiring a plurality of sample data from a first sample database; the first sample database comprises sample data of a plurality of candidate tasks, the sample data of a target task comprises a task vector corresponding to the target task, the task vector corresponding to the target task is obtained based on a common subtask and a characteristic subtask in the candidate tasks, and the target task is any one of the candidate tasks;
adjusting the decision model M_t according to the plurality of sample data by using a reinforcement learning method, to obtain a decision model M_{t+1};
when the decision model M_{t+1} converges, determining the decision model M_{t+1} as a target decision model.
2. The method of claim 1, wherein prior to randomly obtaining a plurality of sample data from a first sample database, the method further comprises:
acquiring a target task from the plurality of candidate tasks;
acquiring state information s_t of the target task according to the target task, and acquiring a task vector corresponding to the target task according to the target task;
generating sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, and adding the sample data of the target task to a primary sample database to obtain the first sample database.
3. The method according to claim 2, wherein the obtaining a task vector corresponding to the target task according to the target task includes:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task;
extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task of the multiple candidate tasks;
and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the candidate tasks.
4. The method according to claim 2 or 3, wherein the generating sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t comprises:
inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain a candidate action of the target task;
selecting the target action of the target task from the candidate action of the target task and the action randomly obtained from the action space according to a preset probability, wherein the probability of selecting the candidate action is the preset probability;
after the target action is executed, acquiring state information s_{t+1} of the target task, and obtaining a reward value vector of the target task according to the state information s_{t+1} of the target task, wherein the reward values in the reward value vector correspond one to one with the subtasks corresponding to the elements in the task vector corresponding to the target task;
the sample data of the target task comprises the task vector corresponding to the target task, the state information s_t of the target task, the target action of the target task, the state information s_{t+1} of the target task, and the reward value vector of the target task.
5. The method of claim 4, wherein the inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain the candidate action of the target task comprises:
the decision model M_t acquires an action value function vector of the target task according to the state information s_t of the target task and the task vector corresponding to the target task;
the decision model M_t acquires a value function of the target task according to the action value function vector of the target task and the task vector corresponding to the target task;
and the decision model M_t acquires a candidate action of the target task from an action space according to the value function of the target task, wherein the candidate action is an action that maximizes the value of the value function of the target task in the action space.
6. The method according to claim 4 or 5, wherein the selecting the target action of the target task from the candidate actions of the target task and the actions randomly obtained from the action space according to a preset probability comprises:
when the first parameter is larger than the preset probability, determining the candidate action of the target task as the target action of the target task; wherein, the first parameter is a random number with a value range of [0,1 ];
and when the first parameter is not greater than the preset probability, determining the action randomly acquired from the action space as the target action of the target task.
7. The method according to any one of claims 4-6, wherein the adjusting the decision model M_t according to the plurality of sample data to obtain the decision model M_{t+1} comprises:
calculating the loss value according to a loss function and the state information s_t, the task vector, the target action, the state information s_{t+1}, and the reward value vector of each piece of sample data in the plurality of sample data;
adjusting the decision model M_t according to the loss value to obtain the decision model M_{t+1}.
8. A decision method based on multi-task learning is characterized by comprising the following steps:
acquiring a plurality of candidate tasks, and acquiring a target task from the plurality of candidate tasks; and acquiring state information s_t of the target task according to the target task;
Performing task joint characterization on the target task according to the candidate tasks to obtain a task vector corresponding to the target task, wherein the task vector corresponding to each candidate task is obtained based on the characteristic subtasks and the common subtasks of the candidate tasks;
determining the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
9. The method according to claim 8, wherein the performing task joint characterization on the target task according to the plurality of candidate tasks to obtain a task vector corresponding to the target task comprises:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task;
extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task of the multiple candidate tasks;
and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the candidate tasks.
10. The method according to claim 8 or 9, wherein the determining the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task comprises:
inputting the task vector corresponding to the target task and the state information s_t of the target task into a target decision model for processing to obtain the target action, wherein the target decision model is implemented based on a neural network.
11. The method according to claim 10, wherein the inputting the task vector corresponding to the target task and the state information s_t of the target task into the target decision model for processing to obtain the target action comprises:
the target decision model acquires an action value function vector of the target task according to the task vector corresponding to the target task and the state information s_t of the target task, wherein the action value functions in the action value function vector correspond one to one with the subtasks corresponding to the elements in the task vector corresponding to the target task;
the target decision model obtains a value function of the target task according to the task vector corresponding to the target task and the action value function vector;
and the target decision model acquires a target action from the action space according to a value function of a target task, wherein the target action is an action which enables the value of the value function of the target task to be maximum in the action space.
12. A decision model training device based on multi-task learning is characterized by comprising:
an acquisition unit, configured to acquire a plurality of sample data from a first sample database, wherein the first sample database comprises sample data of a plurality of candidate tasks, the sample data of a target task comprises a task vector corresponding to the target task, the task vector corresponding to the target task is obtained based on a common subtask and a characteristic subtask in the plurality of candidate tasks, and the target task is any one of the plurality of candidate tasks;
an adjusting unit, configured to adjust the decision model M_t according to the plurality of sample data by using a reinforcement learning method, to obtain a decision model M_{t+1};
a determining unit, configured to, when the decision model M_{t+1} converges, determine the decision model M_{t+1} as a target decision model.
13. The apparatus of claim 12,
the acquisition unit is further configured to acquire a target task from the plurality of candidate tasks, acquire state information s_t of the target task according to the target task, and acquire a task vector corresponding to the target task according to the target task, wherein the task vector corresponding to the target task is obtained based on the common subtasks and the characteristic subtasks of the plurality of candidate tasks,
the device further comprises:
an updating unit, configured to generate sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, and to add the sample data of the target task to a primary sample database to obtain the first sample database.
14. The apparatus according to claim 13, wherein, in the aspect of obtaining the task vector corresponding to the target task according to the target task, the obtaining unit is specifically configured to:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task;
extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task of the multiple candidate tasks;
and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the candidate tasks.
15. The apparatus according to claim 13 or 14, wherein, in terms of generating the sample data of the target task according to the state information s_t of the target task, the task vector corresponding to the target task, and the decision model M_t, the updating unit is specifically configured to:
input the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain a candidate action of the target task;
selecting the target action of the target task from the candidate action of the target task and the action randomly obtained from the action space according to a preset probability, wherein the probability of selecting the candidate action is the preset probability;
after the target action is executed, acquire state information s_{t+1} of the target task, and acquire a reward value vector of the target task according to the state information s_{t+1} of the target task, wherein the reward values in the reward value vector correspond one to one with the subtasks corresponding to the elements in the task vector corresponding to the target task;
the sample data of the target task comprises the task vector corresponding to the target task, the state information s_t of the target task, the target action of the target task, the state information s_{t+1} of the target task, and the reward value vector of the target task.
16. The apparatus of claim 15, wherein, in terms of inputting the state information s_t of the target task and the task vector corresponding to the target task into the decision model M_t for processing to obtain the candidate action of the target task, the updating unit is specifically configured to:
make the decision model M_t acquire an action value function vector of the target task according to the state information s_t of the target task and the task vector corresponding to the target task; acquire a value function of the target task according to the action value function vector of the target task and the task vector corresponding to the target task; and acquire a candidate action of the target task from an action space according to the value function of the target task, wherein the candidate action is an action that maximizes the value of the value function of the target task in the action space.
17. The apparatus according to claim 15 or 16, wherein, in terms of selecting the target action of the target task from the candidate action of the target task and the action randomly obtained from the action space according to the preset probability, the updating unit is specifically configured to:
when the first parameter is larger than the preset probability, determining the candidate action of the target task as the target action of the target task; the first parameter is a random number with a value range of [0,1 ];
and when the first parameter is not greater than the preset probability, determining the action randomly acquired from the action space as the target action of the target task.
18. The apparatus according to any one of claims 15 to 17, wherein the adjusting unit is specifically configured to:
calculate a loss value according to a loss function and the state information s_t, the task vector, the target action, the state information s_{t+1}, and the reward value vector of each piece of sample data in the plurality of sample data;
and adjust the decision model M_t according to the loss value to obtain the decision model M_{t+1}.
19. A decision-making device based on multitask learning, comprising:
an acquisition unit, configured to acquire a plurality of candidate tasks, acquire a target task from the plurality of candidate tasks, and acquire state information s_t of the target task according to the target task;
The joint characterization unit is used for performing task joint characterization on the target task according to the candidate tasks to obtain a task vector corresponding to the target task, wherein the task vector corresponding to each candidate task is obtained based on the characteristic subtasks and the common subtasks of the candidate tasks;
a determining unit, configured to determine the target action from the action space according to the task vector corresponding to the target task and the state information s_t of the target task.
20. The apparatus according to claim 19, wherein the joint characterization unit is specifically configured to:
performing task decomposition on each candidate task in the plurality of candidate tasks to obtain a subtask corresponding to each candidate task;
extracting characteristic subtasks and common subtasks of the multiple candidate tasks according to the subtasks corresponding to each candidate task of the multiple candidate tasks;
and acquiring a task vector corresponding to the target task according to the characteristic subtasks and the common subtasks of the candidate tasks.
21. The apparatus according to claim 19 or 20, wherein the determining unit is specifically configured to:
input the task vector corresponding to the target task and the state information s_t of the target task into a target decision model for processing to obtain the target action of the target task, wherein the target decision model is implemented based on a neural network.
22. The apparatus according to claim 21, wherein the determining unit is specifically configured to:
acquire an action value function vector of the target task according to the task vector corresponding to the target task and the state information s_t of the target task, wherein the action value functions in the action value function vector correspond one to one with the subtasks corresponding to the elements in the task vector corresponding to the target task;
acquiring a value function of the target task according to the task vector corresponding to the target task and the action value function vector;
and acquiring a target action from the action space according to the value function of the target task, wherein the target action is an action which enables the value function of the target task to be maximum in the action space.
23. A decision model training apparatus, comprising:
a memory to store instructions; and
a processor coupled with the memory;
wherein the processor, when executing the instructions, performs the method of any one of claims 1-7.
24. A decision-making device based on multitask learning, comprising:
a memory to store instructions; and
a processor coupled with the memory;
wherein the processor, when executing the instructions, performs the method of any one of claims 8-11.
25. A chip system, wherein the chip system is applied to an electronic device; the chip system comprises one or more interface circuits, and one or more processors; the interface circuit and the processor are interconnected through a line; the interface circuit is to receive a signal from a memory of the electronic device and to send the signal to the processor, the signal comprising computer instructions stored in the memory; the electronic device performs the method of any one of claims 1-11 when the processor executes the computer instructions.
26. A computer storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1-11.
27. A computer program product comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the method of any one of claims 1-11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010660005.2A CN111950726A (en) | 2020-07-09 | 2020-07-09 | Decision method based on multi-task learning, decision model training method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111950726A (en) | 2020-11-17 |
Family
ID=73340421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010660005.2A Pending CN111950726A (en) | 2020-07-09 | 2020-07-09 | Decision method based on multi-task learning, decision model training method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111950726A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200090048A1 (en) * | 2017-05-19 | 2020-03-19 | Deepmind Technologies Limited | Multi-task neural network systems with task-specific policies and a shared policy |
CN109447259A (en) * | 2018-09-21 | 2019-03-08 | 北京字节跳动网络技术有限公司 | Multitasking and multitasking model training method, device and hardware device |
CN109657696A (en) * | 2018-11-05 | 2019-04-19 | 阿里巴巴集团控股有限公司 | Multitask supervised learning model training, prediction technique and device |
CN109901572A (en) * | 2018-12-13 | 2019-06-18 | 华为技术有限公司 | Automatic Pilot method, training method and relevant apparatus |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112494949A (en) * | 2020-11-20 | 2021-03-16 | 超参数科技(深圳)有限公司 | Intelligent agent action strategy making method, server and storage medium |
CN112494949B (en) * | 2020-11-20 | 2023-10-31 | 超参数科技(深圳)有限公司 | Intelligent body action policy making method, server and storage medium |
CN113033805A (en) * | 2021-03-30 | 2021-06-25 | 北京字节跳动网络技术有限公司 | Control method, device and equipment for multi-compound task execution and storage medium |
WO2022221979A1 (en) * | 2021-04-19 | 2022-10-27 | 华为技术有限公司 | Automated driving scenario generation method, apparatus, and system |
CN113569464A (en) * | 2021-06-21 | 2021-10-29 | 国网山东省电力公司电力科学研究院 | Wind turbine generator oscillation mode prediction method and device based on deep learning network and multi-task learning strategy |
WO2024067115A1 (en) * | 2022-09-28 | 2024-04-04 | 华为技术有限公司 | Training method for gflownet, and related apparatus |
CN118332392A (en) * | 2024-06-14 | 2024-07-12 | 江西财经大学 | Multi-task psychological health identification method and system integrating priori knowledge and expert network |
CN118332392B (en) * | 2024-06-14 | 2024-08-13 | 江西财经大学 | Multi-task psychological health identification method and system integrating priori knowledge and expert network |
Similar Documents
Publication | Title |
---|---|
CN110379193B (en) | Behavior planning method and behavior planning device for automatic driving vehicle |
CN109901574B (en) | Automatic driving method and device |
CN109901572B (en) | Automatic driving method, training method and related device |
WO2021102955A1 (en) | Path planning method for vehicle and path planning apparatus for vehicle |
WO2022001773A1 (en) | Trajectory prediction method and apparatus |
US20220080972A1 (en) | Autonomous lane change method and apparatus, and storage medium |
JP6963158B2 (en) | Centralized shared autonomous vehicle operation management |
CN111950726A (en) | Decision method based on multi-task learning, decision model training method and device |
CN113835421B (en) | Method and device for training driving behavior decision model |
JP2023508114A (en) | AUTOMATED DRIVING METHOD, RELATED DEVICE, AND COMPUTER-READABLE STORAGE MEDIUM |
CN110371132B (en) | Driver takeover evaluation method and device |
WO2021212379A1 (en) | Lane line detection method and apparatus |
CN112534483B (en) | Method and device for predicting vehicle exit |
CN113167038B (en) | Method and device for vehicle to pass through barrier gate cross bar |
WO2022016901A1 (en) | Method for planning driving route of vehicle, and intelligent vehicle |
WO2022017307A1 (en) | Autonomous driving scenario generation method, apparatus and system |
CN114440908A (en) | Method and device for planning vehicle driving path, intelligent vehicle and storage medium |
CN113552867A (en) | Planning method of motion trail and wheel type mobile equipment |
WO2019088977A1 (en) | Continual planning and metareasoning for controlling an autonomous vehicle |
CN114261404A (en) | Automatic driving method and related device |
CN113859265A (en) | Reminding method and device in driving process |
CN115039095A (en) | Target tracking method and target tracking device |
CN113799794A (en) | Method and device for planning longitudinal motion parameters of vehicle |
CN113552869A (en) | Method for optimizing decision rules, method for controlling vehicle driving and related device |
CN114103950A (en) | Lane changing track planning method and device |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |