CN111046981B

CN111046981B - Training method and device for unmanned vehicle control model

Info

Publication number: CN111046981B
Application number: CN202010184383.8A
Authority: CN
Inventors: 任冬淳; 夏华夏; 樊明宇; 丁曙光; 钱德恒
Original assignee: Beijing Sankuai Online Technology Co Ltd
Current assignee: Beijing Sankuai Online Technology Co Ltd
Priority date: 2020-03-17
Filing date: 2020-03-17
Publication date: 2020-07-03
Anticipated expiration: 2040-03-17
Also published as: CN111046981A

Abstract

The specification discloses a training method and a device of an unmanned vehicle control model, wherein a feature matrix formed by historical environment features used for calculating rewards at the last moment and current environment features determined according to current environment information can be determined for each moment when the model is trained, then the features used for calculating the rewards at the current moment are selected from the feature matrix based on the current environment features and the importance degree of the historical environment features to the feature matrix, then the rewards are determined according to the current environment features and the selected features to train the unmanned vehicle control model, and after the training is finished, the unmanned vehicle control is carried out according to the trained model. Because the characteristics of the calculated reward are determined based on the importance degree of the characteristics of each characteristic pair including the historical environmental characteristics to the whole, more effective rewards can be determined based on the change of the environmental information during training, the problem of sparse rewards is solved, and the cost and the time are saved.

Description

Training method and device for unmanned vehicle control model

Technical Field

The application relates to the technical field of unmanned driving, in particular to a training method and device for an unmanned vehicle control model.

Background

At present, the problem that how to avoid an obstacle by an unmanned vehicle is mainly solved by an unmanned vehicle control method in the technical field of unmanned driving, and the obstacle avoiding process of the unmanned vehicle is generally as follows: and inputting environmental information acquired by the unmanned vehicle in real time, the driving state of the unmanned vehicle and the like into a pre-trained model, and controlling the unmanned vehicle to avoid obstacles to drive according to the output of the model.

In the prior art, a model training method is usually used for model training, and the model is obtained through continuous 'trial and error' process training. Specifically, when the reinforcement learning model is trained, the unmanned vehicle determines a reward according to the influence of the action at the previous moment on the environmental information, and the reward and the environmental information at the current moment are input into the reinforcement learning model so as to control the unmanned vehicle according to the output of the model. And training the reinforcement learning model through continuous input and output in the driving process. Let the model "learn" what output should be for different situations.

However, in the existing training reinforcement learning model process, it is usually determined that the output of the model is correct control when the unmanned vehicle reaches the destination, positive feedback is given, wrong control is determined when a dangerous condition occurs in the driving process, and negative feedback is given, so that the reward is usually effective only when the unmanned vehicle reaches the destination or the dangerous condition feedback occurs, that is, the reward for converging the model parameters can be realized, most of the reward is difficult to converge the model in the driving process, so that the effective reward obtained in each training process is sparse, and the model training cost is high and the time consumption is long.

Disclosure of Invention

The embodiment of the specification provides a training method and a training device for an unmanned vehicle control model, which are used for partially solving the problems in the prior art.

The embodiment of the specification adopts the following technical scheme:

the training method of the unmanned vehicle control model provided by the specification comprises the following steps:

acquiring current environment information of a position where the unmanned vehicle is located in a driving process, and determining current environment characteristics according to the current environment information;

determining a feature matrix formed by historical environment features used for calculating rewards at the last moment and the current environment features, wherein the historical environment features are determined according to historical environment information acquired by the unmanned vehicle in the driving process;

according to the similarity between the features in the feature matrix, determining the importance degree of the current environment feature and the historical environment features to the feature matrix, and selecting the feature used for calculating the reward at the current moment from the feature matrix according to the importance degree;

determining rewards through a preset reward function according to the current environment characteristics and the selected characteristics;

and inputting the current environment information and the reward into an unmanned vehicle control model to be trained for model training, wherein the unmanned vehicle control model is used for unmanned vehicle control.

Optionally, the obtaining of the current environment information of the position of the unmanned vehicle in the driving process specifically includes:

acquiring obstacle information around the unmanned vehicle, positioning information of the unmanned vehicle at the current moment and lane information corresponding to the current driving process as the current environment information;

wherein the lane information includes: the current lane position is determined according to the positioning information of the unmanned vehicle at the current moment, and the subsequent lane position is determined according to the positioning information and the path planning corresponding to the driving process, wherein the subsequent lane position is the lane position where the unmanned vehicle can drive.

Optionally, determining the current environment characteristic according to the current environment information specifically includes:

inputting the obstacle information as input into a pre-trained feature extraction model to obtain an output feature vector;

and splicing the feature vector, the positioning information and the lane information to determine the current environmental features.

Optionally, the current environment features are in the form of a column vector;

according to the similarity between the features in the feature matrix, determining the importance degree of the current environment feature and the historical environment features to the feature matrix, and selecting the feature used for calculating the reward at the current moment from the feature matrix according to the importance degree, specifically comprising:

determining a similarity matrix corresponding to the feature matrix according to the similarity between the features in the feature matrix;

determining the historical environment characteristics with the minimum similarity to the current environment characteristics according to the similarity matrix, and taking the corresponding columns of the determined historical environment characteristics in the similarity matrix as the initial columns of the intermediate matrix;

determining a residual error matrix of the similarity matrix and the intermediate matrix;

determining the importance degree of each column in the similarity matrix to the residual matrix according to the residual matrix and the similarity matrix, and adding a first number of columns in the similarity matrix to the intermediate matrix according to the sequence of the importance degrees from large to small;

and determining the characteristics used for calculating the reward in the characteristic matrix at the current moment according to the intermediate matrix.

Optionally, adding a first number of columns in the similarity matrix to the intermediate matrix according to a descending order of importance, specifically including:

extracting a second number of columns from the similarity matrix according to the sequence of the importance degrees from large to small, and adding the columns into the intermediate matrix;

judging whether the number of columns of the intermediate matrix reaches the first number or not;

if yes, determining that all columns of the intermediate matrix are obtained;

if not, the residual error matrix is determined again, and the first number of columns are extracted from the similarity matrix and added to the intermediate matrix according to the importance degree of each column in the similarity matrix to the newly determined residual error matrix until the number of columns in the intermediate matrix reaches the first number.

Optionally, the method further comprises:

and when the number of the historical environmental characteristics is less than a first number, taking the historical environmental characteristics as the characteristics used for calculating the reward at the current moment.

Optionally, determining the reward through a preset reward function according to the current environment characteristic and the selected characteristic, specifically including

Determining rewards through a preset reward function according to the similarity between the current environment characteristics and the selected characteristics;

the smaller the similarity between the current environmental characteristics and each selected characteristic is, the higher the reward is, and the larger the similarity between the current environmental characteristics and each selected characteristic is, the smaller the reward is.

Optionally, the unmanned vehicle control is performed according to the unmanned vehicle control model obtained through training, and the method specifically includes:

acquiring current environment information of the position of the unmanned vehicle at the current moment;

inputting the current environment information into the unmanned vehicle control model, wherein the unmanned vehicle control model is obtained by training a plurality of driving processes, a feature matrix formed by each historical environment feature and the current environment feature at the moment is determined for each moment in each driving process, the importance degree of each feature in the feature matrix is determined according to the similarity between the features in the feature matrix at the moment, the feature for calculating the reward is determined according to the importance degree, and the unmanned vehicle control model is trained on the basis of the reward calculated by the feature for calculating the reward and the environment information at the moment;

and controlling the unmanned vehicle to move according to the direction and the speed output by the unmanned vehicle control model.

This specification provides a training device of unmanned vehicle control model, includes:

the acquisition module is used for acquiring current environment information of the position of the unmanned vehicle in the driving process and determining current environment characteristics according to the current environment information;

the determining module is used for determining a feature matrix formed by various historical environment features used for calculating rewards at the last moment and the current environment feature, wherein the various historical environment features are determined according to various historical environment information acquired by the unmanned vehicle in the driving process;

the selection module is used for determining the importance degree of the current environmental characteristics and the historical environmental characteristics to the characteristic matrix according to the similarity among the characteristics in the characteristic matrix, and selecting the characteristics used for calculating the reward at the current moment from the characteristic matrix according to the importance degree;

the calculation module is used for determining rewards according to the current environment characteristics and the selected characteristics through a preset reward function;

and the training module inputs the current environment information and the reward into an unmanned vehicle control model to be trained to perform model training, wherein the unmanned vehicle control model is used for unmanned vehicle control.

The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the methods described above.

The electronic device provided by the present specification includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the processor implements any one of the above-described methods for training the control model of the unmanned vehicle.

The embodiment of the specification adopts at least one technical scheme which can achieve the following beneficial effects:

the method comprises the steps of determining a feature matrix formed by historical environment features used for calculating rewards at the last moment and current environment features determined according to current environment information at each moment when a model is trained, then selecting the features used for calculating the rewards at the current moment from the feature matrix based on the current environment features and the importance degree of the historical environment features to the feature matrix, determining the rewards according to the current environment features and the selected features to train the unmanned vehicle control model, and performing unmanned vehicle control according to the trained model after the training is finished. Because the characteristics of the calculated reward are determined based on the importance degree of the characteristics of each characteristic pair including the historical environmental characteristics to the whole body, more effective rewards can be determined based on the change of the environmental information during training, the problem of sparse rewards in the training process is solved, the training cost is reduced, and the training time is saved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a schematic diagram illustrating a training process of an unmanned vehicle control model provided in an embodiment of the present disclosure;

FIG. 2 is a schematic lane view provided in an embodiment of the present disclosure;

FIG. 3a is a schematic diagram of a reinforcement learning model;

FIG. 3b is a schematic structural diagram of a control model of an unmanned vehicle according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a feature matrix provided in an embodiment of the present disclosure;

fig. 5 is a schematic view of a control process of an unmanned vehicle according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a training apparatus for an unmanned vehicle control model provided in an embodiment of the present disclosure;

fig. 7 is a schematic view of an electronic device for implementing a training method of an unmanned vehicle control model according to an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the technical solutions of the present disclosure will be clearly and completely described below with reference to the specific embodiments of the present disclosure and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step are within the scope of the present application.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of a training process of an unmanned vehicle control model provided in an embodiment of the present specification, including:

s100: the method comprises the steps of obtaining current environment information of a position where the unmanned vehicle is located in the driving process, and determining current environment characteristics according to the current environment information.

In one or more embodiments of the present specification, the training process of the model is a process of training a reinforcement learning model, which is a model for unmanned vehicle control, wherein the training process may be performed in a real environment or a simulation environment, which is not limited in the present specification. In addition, because vehicles on the road usually do not run dangerously in a real environment, and dangerous behaviors of other vehicles are not increased specially in a simulation environment, in order to increase training efficiency and save training time and training cost, the specification provides a training method for calculating rewards by using historical environment characteristics.

In this specification, the process of the unmanned vehicle traveling from the departure point to the destination according to the pre-planned travel path, whether in a real environment or in a simulation environment, may be referred to as a travel process or a training process, and the process is terminated when the unmanned vehicle reaches the destination or a dangerous situation occurs. The dangerous conditions comprise the conditions of scratch, collision and the like between the unmanned vehicle and obstacles in the environment, and the obstacles comprise other vehicles, road guardrails, indication boards, buildings and the like in the environment except the unmanned vehicle. The reinforcement learning model is specifically used for controlling the unmanned vehicle to travel on the pre-planned travel path, and the control strategy of the unmanned vehicle, such as the direction and the speed, can be output.

Specifically, since the training process is complicated, the server usually performs the training of the model, and similarly, in this specification, the server may specifically perform the process of training the unmanned vehicle control model.

For each time in the driving process, firstly, the server can acquire the current environment information of the position of the unmanned vehicle in the driving process, namely the environment information at the time. Wherein the environmental information may include: obstacle information around the unmanned vehicle, positioning information of the unmanned vehicle at the moment and lane information corresponding to the driving process. In this specification, the lane information includes: the current lane position can be determined by an electronic map according to the positioning information of the unmanned vehicle at the current moment, similarly, the subsequent lane position can be determined by the electronic map according to the positioning information and the path planning, and the subsequent lane position is the position of the lane which is not driven in the path and can be driven subsequently by the unmanned vehicle.

In this specification, the lane position may be represented by coordinates of several lane center lines, as shown in fig. 2. In fig. 2, thin lines indicate boundaries between lanes, thick lines indicate edges of roads, circular dots indicate coordinates of lane center lines, i.e., a 1-a 3, b 1-b 3 and c 1-c 3, and positions of a lane where the unmanned vehicle is currently located and positions of subsequent lanes may be respectively represented by a series of coordinates. It should be noted that, the number of coordinates and the coordinate interval of the vehicle-to-vehicle position may be set as required, that is, the number of dots of each lane and the dot pitch in fig. 2 may be set as required, which is not limited in this specification.

The server may take the obstacle information as an input after acquiring the current environment information, and input a pre-trained feature extraction model to obtain an output feature vector, where the feature extraction model may specifically be a convolutional neural network model, and the output may be a feature vector in the form of 1 × n, where n may be set as needed, and the feature extraction model is mainly used to convert the obstacle information into a vector with a set length, and of course, the feature extraction model may include an activation layer or a pooling layer, and the description is not limited.

The feature vector in the form of 1 × n is output based on the feature extraction model, and the current environment feature after the server stitching may be in the form of 1 × m, where m is greater than n, and the coordinate position in the lane information may be arranged in order of the planned path from the near to the far from the unmanned vehicle positioning information, taking the lane shown in fig. 2 as an example, the current environment feature may be (feature vector, x coordinate of positioning information, y coordinate of positioning information, x coordinate of a1, x coordinate of a … … a3 of a1, x coordinate of a … … c3 of a3, y coordinate of c 3).

S102: determining a feature matrix formed by historical environment features used for calculating rewards at the last moment and the current environment features, wherein the historical environment features are determined according to historical environment information acquired by the unmanned vehicle in the driving process.

In this specification, after determining the current environment characteristic, the server may construct a characteristic matrix according to each historical environment characteristic used for calculating the reward at the previous time and the current environment characteristic, so as to determine the characteristic used for calculating the reward at the current time according to the importance degree of each characteristic in the characteristic matrix to the characteristic matrix in the subsequent process. The historical environmental characteristics are characteristics determined by the server according to the historical environmental information acquired by the unmanned vehicle during the driving process, namely the characteristics determined in step S100 are executed at each historical time. Therefore, in this specification, the server also needs to store the features determined at different times.

Fig. 3a shows a structure of a conventional reinforcement learning model, where the model inputs environmental information and rewards, and outputs actions, and at each time, the model determines the influence of the actions output at the previous time on the external environment, determines rewards by a reward function, and determines the output at the time according to the environmental information and rewards at the time.

Fig. 3b is a structure of an unmanned vehicle control model provided in the present specification, in which an incentive input by the model is determined according to the current environmental characteristics determined in step S100 and the historical environmental characteristics stored in the cache and used for calculating the incentive at the previous time. And, the current environmental features are determined from the current environmental information through a pre-trained feature extraction model.

The form of the current environment features along the above example can be 1 × m, and similarly, the form of the historical environment features is also 1 × m, so that the constructed feature matrix can be shown in fig. 4. in fig. 4, assuming that the number of the historical environment features is k, the feature matrix is a matrix of m rows and k +1 columns, namely, a feature matrix

Of the matrix of (a). The columns in the dashed box in FIG. 4 are the current environmental characteristics, and the remaining columns are the calendarsAnd (4) history environment characteristics.

S104: and determining the importance degree of the current environment characteristics and the historical environment characteristics to the characteristic matrix according to the similarity between the characteristics in the characteristic matrix, and selecting the characteristics used for calculating the reward at the current moment from the characteristic matrix according to the importance degree.

In this specification, after determining the feature matrix, the server may specifically calculate the importance degree of each column to the feature matrix, that is, the importance degree of each feature in the feature matrix to the feature matrix, so as to select the feature used for calculating the reward at the current time. The importance degree of the features in the feature matrix characterizes the difference between the features and other features. For example, if there is a large difference between one feature and the other features in each feature of the feature matrix, which is denoted as feature a, and there is a small difference between the other feature and the other features, which is denoted as feature B, when the median of each feature value in the feature matrix is used to replace the values of feature a and feature B, respectively, the similarity between the feature matrix after replacing feature a and the original feature matrix is extracted and is smaller than the similarity between the feature matrix after replacing feature B and the original feature matrix. That is, feature a is more characteristic of the original feature matrix than feature B.

Specifically, in this specification, the server may first determine the similarity between the features in the feature matrix, for example, taking the feature matrix shown in fig. 4 as an example, for each feature, calculate the similarity between the feature and other features in the feature matrix to obtain a similarity vector corresponding to the feature, where the similarity vector is 1 × k

. The similarity between the features in the feature matrix may be calculated by using an existing method for calculating vector similarity, such as euclidean distance calculation, cosine similarity calculation, and the like. Or the server may also calculate the inner product of the two features as the similarity of the features.

Secondly, the server can determine the historical environment characteristics with the minimum similarity to the current environment characteristics according to the similarity matrix, and determine the column corresponding to the historical environment characteristics with the minimum similarity to the current environment characteristics from the similarity matrix as the initial column of the intermediate matrix. For example, assuming that the second column having the lowest similarity to the first column in the feature matrix is found from the similarity matrix, the content of the second column in the similarity matrix may be determined as the initial column of the intermediate matrix.

Then, the residual matrix of the similarity matrix and the intermediate matrix is determined, for example, assuming that the similarity matrix is represented by X and the intermediate matrix is represented by Y, it can be formulated according to the formula

A residual matrix R is determined, wherein,

is a pseudo-inverse matrix of Y.

And then, according to the residual error matrix and the similarity matrix, determining the importance degree of each column in the similarity matrix to the residual error matrix. The server can be based on a formula

And determining the importance degree. Wherein j represents the identifier of the feature, j is the feature in the feature matrix, j is the similarity of the feature and other features in the similarity matrix, and j is the residual error of the similarity corresponding to the feature in the residual error matrix.

The degree of importance of the j feature is indicated,

and representing the column corresponding to the j characteristic in the residual matrix, wherein R is the residual matrix. The closer the value of the formula is to 1, the more important the feature j is. Of course, this formula

It can be considered that a selection probability is calculated, and a number is calculatedThe larger the value, the greater the probability that a column in the feature matrix is selected as the feature for which the reward is calculated.

Finally, the server can add a first number of columns in the similarity matrix to the intermediate matrix according to the sequence of the importance degrees from large to small, and determine the features used for calculating the reward at the current moment from the feature matrix according to the features corresponding to the columns contained in the intermediate matrix.

S106: and determining the reward through a preset reward function according to the current environment characteristic and the selected characteristic.

In this specification, after selecting a feature from the feature matrix, the server may determine the reward through a preset reward function according to the similarity between the current environment feature and the selected feature. When the selected characteristics are different from the characteristics of the current environment, the reward is higher, namely the unmanned vehicle is encouraged to select the environment which is not tried to drive, so that the model can learn the correct output under different environments more quickly.

Specifically, the server may be formulated according to the formula in this specification

Calculating the reward, wherein C is the reward, b is a preset formula parameter, F is a similarity calculation function,

representing the environmental characteristics at time t, t being the current time,

it is the current environmental characteristics that are represented,

representing the selected feature at time t. The similarity calculation function may specifically be a maximum value, a minimum value, a median, or an average value, etc., of the similarity between the current environmental characteristic and each selected environmental characteristic, and which function and formula parameter b are specifically selected may be set as required.

S108: and inputting the current environment information and the reward into an unmanned vehicle control model to be trained for model training, wherein the unmanned vehicle control model is used for unmanned vehicle control.

Finally, after calculating the reward in the present specification, the server may input the current environmental information and the reward to the unmanned vehicle control model to be trained in the same manner as the existing process of training the reinforcement learning model, determine the output of the unmanned vehicle control model, and repeat the processes S100 to S108 at the next moment until the driving process is finished or a dangerous condition occurs. And determining that the model parameter controlled by the unmanned vehicle converges through multiple training processes, and determining that the model training is finished when the change is smaller than a preset value. Or after the training times reach the preset times, determining that the model training is finished. Of course, the present specification does not limit the conditions for ending the model training, and the conditions can be specifically set according to needs. The unmanned vehicle control model obtained through training can be used for unmanned vehicle control.

The training method based on the unmanned vehicle control model shown in fig. 1 is characterized in that a feature matrix formed by various historical environment features used for calculating rewards at the previous moment and the current environment features determined according to the current environment information can be determined for each moment when the model is trained, then the features used for calculating the rewards at the current moment are selected from the feature matrix based on the current environment features and the importance degree of the various historical environment features to the feature matrix, then the rewards are determined according to the current environment features and the selected features to train the unmanned vehicle control model, and after the training is finished, the unmanned vehicle control is carried out according to the trained model. Because the characteristics of the calculated reward are determined based on the importance degree of the characteristics of each characteristic pair including the historical environmental characteristics to the whole body, more effective rewards can be determined based on the change of the environmental information during training, the problem of sparse rewards in the training process is solved, the training cost is reduced, and the training time is saved.

In addition, the unmanned vehicle can be used for unmanned distribution, and the training method of the unmanned vehicle control model provided by the specification can be particularly applied to the field of distribution by using the unmanned vehicle, so that the unmanned control model for distribution is trained, and the time and the cost for training the model are saved. The unmanned vehicle using the model obtained by training through the training method provided by the specification can be used in various delivery scenes, such as delivery scenes for express delivery, takeaway and the like by using the unmanned vehicle.

In addition, in the present specification, when the running course is still short and the number of the historical environmental features determined at the historical time is small when the features for calculating the reward are determined in step S104, that is, when the number of the historical environmental features is smaller than the first number, the total number of the historical environmental features may be set as the features for calculating the reward at the current time without determining the importance of each feature.

That is, before step S104, the server may further perform a determining step of determining whether the number of the historical environmental features is smaller than the first number, if so, using the historical environmental features as features used for calculating the reward at the current time, and if not, performing step S104.

Further, in step S104 in this specification, the server may also select a second number of columns at a time to add to the intermediate matrix, and obtain the intermediate matrix with the number of columns reaching the first number through multiple selections. Wherein the second number is smaller than the first number and the first number is divisible by the second number, e.g. the second number is 1, each time the server selects a column with the highest significance and updates the residual matrix and determines an intermediate matrix through a number of rounds to determine the features for calculating the reward.

Specifically, in step S104, the server may perform the following processes:

s1040: the similarity between the features in the feature matrix is determined.

S1041: and determining the historical environment characteristics with the minimum similarity to the current environment characteristics according to the similarity matrix, and taking the corresponding columns of the determined historical environment characteristics in the similarity matrix as the initial columns of the intermediate matrix.

S1042: and determining a residual error matrix of the similarity matrix and the intermediate matrix.

S1043: and determining the importance degree of each column in the similarity matrix to the residual matrix according to the residual matrix and the similarity matrix.

S1044: and extracting a second number of columns from the similarity matrix according to the sequence of the importance degrees from large to small, and adding the columns into the intermediate matrix.

S1045: and judging whether the number of the columns of the intermediate matrix reaches the first number, if so, executing step S1046, otherwise, executing step S1047.

S1046: all columns of the intermediate matrix are determined.

S1047: and re-determining the residual error matrix, and extracting the first number of columns from the similarity matrix according to the importance degree of each column in the similarity matrix to the re-determined residual error matrix and adding the first number of columns to the intermediate matrix until the number of columns in the intermediate matrix reaches the first number.

It should be noted that, in step S100, the feature extraction model outputs the feature vector in the form of 1 × n for example, and the feature vector in the form of 1 × n facilitates the subsequent steps to construct a feature matrix and perform importance calculation, and of course, if the feature vector in the form of 1 × n is not output, the subsequent steps may also perform importance calculation by corresponding adjustment of the mathematical formula, as long as the environment feature at which time each element in the feature matrix belongs can be distinguished, and the calculation includes processes of calculating similarity and determining a residual error matrix, and the like.

Based on the training method of the unmanned vehicle control model shown in fig. 1, after the unmanned vehicle control model is obtained through training, the unmanned vehicle control model can be applied to control the unmanned vehicle to move in the driving process of the unmanned vehicle, and the description correspondingly provides a schematic diagram of the unmanned vehicle control process, as shown in fig. 5, the method specifically includes the following steps:

s200: and acquiring the current environment information of the position of the unmanned vehicle at the current moment.

In this specification, the control flow may be executed by a control apparatus on the unmanned vehicle, and for each time during travel, first the control apparatus may acquire current environmental information of a position at which the time (i.e., the current time) of the unmanned vehicle is located. As described in step S100, the environment information may include: the specific contents of the obstacle information around the unmanned vehicle, the positioning information of the unmanned vehicle at the moment and the lane information corresponding to the driving process are not repeated in this specification.

S202: and inputting the current environment information into a pre-trained unmanned vehicle control model.

In this specification, the unmanned vehicle control model is obtained by training a plurality of driving processes, that is, training a model training process shown in fig. 1. When the server trains the unmanned vehicle control model, the server can determine a feature matrix composed of various historical environment features and the current environment features at each time in each driving process, determine the importance degree of each feature in the feature matrix according to the similarity between the features in the feature matrix at the time, determine the feature for calculating the reward according to the importance degree, and train the unmanned vehicle control model based on the reward calculated by the feature for calculating the reward and the environment information at the time. When it is determined through the multiple driving processes that the unmanned vehicle control model satisfies the training end condition, it is determined that the training of the unmanned vehicle control model is ended, and the control process shown in fig. 5 can be executed in the unmanned vehicle. The training end condition can be set according to needs, and the description is not limited. For example, when the model parameters converge and the change is smaller than the preset value, the model training is determined to be finished.

S204: and controlling the unmanned vehicle to move according to the direction and the speed output by the unmanned vehicle control model.

In this specification, the unmanned vehicle control model may be obtained based on a model training process shown in fig. 1, and is specifically a reinforcement learning model. When the control device inputs the environmental information acquired in step S200 into the pre-trained unmanned vehicle control model, the unmanned vehicle control model may output the current direction and speed of the unmanned vehicle, and the control device may control the unmanned vehicle to move according to the direction and speed according to the output of the model.

In addition, the unmanned vehicle can be used for unmanned distribution in the present specification, and the above-mentioned unmanned vehicle control method provided in the present specification can be particularly applied to the field of distribution using unmanned vehicles, and can control the course of unmanned vehicle actions, such as a scene of distribution such as express delivery, takeaway, etc. using unmanned vehicles.

Based on the training process of the unmanned vehicle control model shown in fig. 1, the embodiment of the present specification further provides a schematic structural diagram of a training apparatus of the unmanned vehicle control model, as shown in fig. 6.

Fig. 6 is a schematic structural diagram of a training apparatus for an unmanned vehicle control model provided in an embodiment of the present specification, where the apparatus includes:

the acquiring module 300 is used for acquiring current environment information of the position of the unmanned vehicle in the driving process and determining current environment characteristics according to the current environment information;

a determining module 302, configured to determine a feature matrix formed by historical environment features used for calculating rewards at a previous time and the current environment feature, where the historical environment features are determined according to historical environment information obtained by the unmanned vehicle during the driving process;

a selecting module 304, configured to determine importance degrees of the current environmental features and the historical environmental features to the feature matrix according to similarities among the features in the feature matrix, and select a feature used for calculating a reward at a current time from the feature matrix according to the importance degrees;

the calculation module 306 determines the reward through a preset reward function according to the current environment characteristics and the selected characteristics;

and the training module 308 inputs the current environment information and the reward into an unmanned vehicle control model to be trained for model training, wherein the unmanned vehicle control model is used for unmanned vehicle control.

Optionally, the obtaining module 300 obtains obstacle information around the unmanned vehicle, positioning information of the unmanned vehicle at the current moment, and lane information corresponding to the driving process, as the current environment information, where the lane information includes: the current lane position is determined according to the positioning information of the unmanned vehicle at the current moment, and the subsequent lane position is determined according to the positioning information and the path planning corresponding to the driving process, wherein the subsequent lane position is the lane position where the unmanned vehicle can drive.

Optionally, the obtaining module 300 takes the obstacle information as an input, inputs a pre-trained feature extraction model to obtain an output feature vector, and splices the feature vector, the positioning information, and the lane information to determine the current environmental feature.

Optionally, the current environment features are in the form of column vectors, the selecting module 304 determines a similarity matrix corresponding to the feature matrix according to similarities between the features in the feature matrix, determining the historical environment characteristics with the minimum similarity to the current environment characteristics according to the similarity matrix, and using the corresponding column of the determined historical environmental characteristics in the similarity matrix as the initial column of an intermediate matrix, determining a residual error matrix of the similarity matrix and the intermediate matrix, determining the importance degree of each column in the similarity matrix to the residual matrix according to the residual matrix and the similarity matrix, and adding a first number of columns in the similarity matrix to the intermediate matrix according to the order of the importance degrees from large to small, and determining the characteristics used for calculating the reward in the characteristic matrix at the current moment according to the intermediate matrix.

Optionally, the selecting module 304 extracts a second number of columns from the similarity matrix according to the order of the importance degrees from large to small, adds the second number of columns to the intermediate matrix, determines whether the number of columns of the intermediate matrix reaches the first number, determines that all columns of the intermediate matrix are obtained if the number of columns of the intermediate matrix reaches the first number, and re-determines the residual error matrix if the number of columns of the intermediate matrix does not reach the first number, and extracts the first number of columns from the similarity matrix to add the first number of columns of the intermediate matrix to the intermediate matrix according to the importance degrees of the columns of the similarity matrix to the re-determined residual error matrix again until the number of columns of the intermediate matrix reaches the first number.

Optionally, the selecting module 304 is configured to use the historical environmental characteristics as the characteristics used for calculating the reward at the current moment when the number of the historical environmental characteristics is smaller than the first number.

Optionally, the calculating module 306 determines the reward through a preset reward function according to the similarity between the current environment feature and the selected feature, where the smaller the similarity between the current environment feature and each selected feature is, the higher the reward is, and the larger the similarity between the current environment feature and each selected feature is, the smaller the reward is.

Optionally, the apparatus further comprises: the control module 308 acquires current environment information of a position where the unmanned vehicle is located at the current time, inputs the current environment information into the unmanned vehicle control model, and the unmanned vehicle control model is obtained through multiple driving process training, wherein for each time in each driving process, a feature matrix composed of various historical environment features and the current environment features at the time is determined, the importance degree of each feature in the feature matrix is determined according to the similarity between each feature in the feature matrix at the time, the feature of the reward is determined and calculated according to the importance degree, the reward calculated based on the feature of the reward is calculated and the environment information at the time are used for training the unmanned vehicle control model, and the unmanned vehicle is controlled to move according to the direction and the speed output by the unmanned vehicle control model.

The present specification also provides a computer readable storage medium storing a computer program, which can be used to execute any one of the above-mentioned methods for training an unmanned vehicle control model.

Based on the training process of the unmanned vehicle control model provided in fig. 1, the embodiment of the present specification further provides a schematic structural diagram of the electronic device shown in fig. 7. As shown in fig. 7, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, but may also include hardware required for other services. The processor reads a corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to realize any one of the above-mentioned training methods of the unmanned vehicle control model.

Of course, besides the software implementation, the present specification does not exclude other implementations, such as logic devices or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may be hardware or logic devices.

In the 90 s of the 20 th century, improvements in a technology could clearly distinguish between improvements in hardware (e.g., improvements in circuit structures such as diodes, transistors, switches, etc.) and improvements in software (improvements in process flow). However, as technology advances, many of today's process flow improvements have been seen as direct improvements in hardware circuit architecture. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into the hardware circuit. Thus, it cannot be said that an improvement in the process flow cannot be realized by hardware physical modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose Logic functions are determined by programming the Device by a user. A digital system is "integrated" on a PLD by the designer's own programming without requiring the chip manufacturer to design and fabricate application-specific integrated circuit chips. Furthermore, nowadays, instead of manually making an integrated Circuit chip, such Programming is often implemented by "logic compiler" software, which is similar to a software compiler used in program development and writing, but the original code before compiling is also written by a specific Programming Language, which is called Hardware Description Language (HDL), and HDL is not only one but many, such as abel (advanced Boolean Expression Language), ahdl (alternate Language Description Language), traffic, pl (core unified Programming Language), HDCal, JHDL (Java Hardware Description Language), langue, Lola, HDL, laspam, hardsradware (Hardware Description Language), vhjhd (Hardware Description Language), and vhigh-Language, which are currently used in most common. It will also be apparent to those skilled in the art that hardware circuitry that implements the logical method flows can be readily obtained by merely slightly programming the method flows into an integrated circuit using the hardware description languages described above.

The controller may be implemented in any suitable manner, for example, the controller may take the form of, for example, a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, and an embedded microcontroller, examples of which include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicone Labs C8051F320, the memory controller may also be implemented as part of the control logic for the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims

1. A training method of an unmanned vehicle control model is characterized by comprising the following steps:

2. The method of claim 1, wherein obtaining current environmental information of a position where the unmanned vehicle is located during driving specifically comprises:

acquiring obstacle information around the unmanned vehicle, positioning information of the unmanned vehicle at the current moment and lane information corresponding to the driving process as the current environment information;

3. The method of claim 2, wherein determining the current environmental characteristics based on the current environmental information specifically comprises:

4. The method of claim 1, wherein the current environmental features are in the form of a column vector;

5. The method according to claim 4, wherein adding a first number of columns in the similarity matrix to the intermediate matrix according to an order of importance from large to small comprises:

if yes, determining that all columns of the intermediate matrix are obtained;

6. The method of claim 1, wherein the method further comprises:

7. The method of claim 1, wherein determining the reward based on the current environmental characteristics and the selected characteristics via a predetermined reward function includes

8. The method of claim 1, wherein performing the unmanned vehicle control according to the unmanned vehicle control model specifically comprises:

9. A training device of an unmanned vehicle control model is characterized by comprising:

10. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-8.

11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1-8 when executing the program.