CN117465480A - Target object movement control method, device, equipment and medium - Google Patents

Target object movement control method, device, equipment and medium

Info

Publication number
CN117465480A
CN117465480A
Authority
CN
China
Prior art keywords
target
target object
speed
network
scene
Prior art date
Legal status
Pending
Application number
CN202311483620.0A
Other languages
Chinese (zh)
Inventor
王渤谦
Current Assignee
Beijing Jd Yuansheng Technology Co ltd
Original Assignee
Beijing Jd Yuansheng Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jd Yuansheng Technology Co ltd filed Critical Beijing Jd Yuansheng Technology Co ltd
Priority to CN202311483620.0A
Publication of CN117465480A

Classifications

    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00: Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001: Planning or execution of driving tasks
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00: Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Traffic Control Systems (AREA)

Abstract

The disclosure provides a method, apparatus, device, medium, program product, and autonomous vehicle for controlling the movement of a target object, applicable to the technical fields of artificial intelligence and autonomous driving. A target agent corresponding to a target scene is determined according to the target scene in which the target object is located at time t, the target agent being trained based on sample attribute information of a sample object in the target scene. A target reference line carrying a plurality of waypoints is determined according to a first movement goal of the target object. The plurality of waypoints and attribute information of the target object in the target scene are input into a first target network, which outputs the target waypoint at which the target object is located at time t+1, the moment following time t. The attribute information and a first reference speed of the target object at the target waypoint are input into a second target network, which outputs a target speed. The movement of the target object is then controlled based on the target waypoint and the target speed.

Description

Target object movement control method, device, equipment and medium
Technical Field
The present disclosure relates to the fields of artificial intelligence and autonomous driving technology, and in particular to a method, apparatus, device, medium, program product, and autonomous vehicle for controlling the movement of a target object.
Background
With the development of technology and the improvement of living standards, demand for unmanned vehicles (autonomous vehicles) is increasing, and the decision-making and planning technology applied in unmanned vehicles is attracting attention.
In the course of implementing the present disclosure, it was found that in existing decision planning the behavior decision module outputs a specific driving behavior, the path planning module outputs a trajectory complying with safety rules, traffic regulations and the like according to that driving behavior and the surrounding environment information, and finally the control module executes the trajectory. The cooperation of the path planning module and the behavior decision module ensures the smoothness of the unmanned vehicle's trajectory, but it weakens the role of the behavior decision module: concrete problems such as obstacle handling and traffic-regulation compliance are all handled in the path planning module, so the computation time is long.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a movement control method, apparatus, device, medium, program product, and autonomous vehicle for a target object.
According to a first aspect of the present disclosure, there is provided a movement control method for a target object, including:
determining a target agent corresponding to a target scene according to the target scene in which the target object is located at time t, wherein the target agent is trained based on sample attribute information of a sample object in the target scene and includes a first target network and a second target network;
determining a target reference line according to a first movement goal of the target object, wherein the target reference line carries a plurality of waypoints;
inputting the plurality of waypoints and attribute information of the target object in the target scene into the first target network, and outputting the target waypoint at which the target object is located at time t+1, the moment following time t;
inputting the attribute information and a first reference speed of the target object at the target waypoint into the second target network, and outputting a target speed, wherein the target speed is the speed of the target object at the target waypoint; and
controlling the movement of the target object based on the target waypoint and the target speed.
According to an embodiment of the present disclosure, the movement control method of the target object further includes:
and stopping the movement control of the target object under the condition that the target object is determined to be faulty in the movement process.
According to an embodiment of the present disclosure, the movement control method of the target object further includes:
in the process of controlling the movement of the target object, acquiring first movement state information of the target object and second movement state information of other objects except the target object in a target scene;
determining the meeting time of the target object and other objects according to the target path point, the target speed, the first moving state information and the second moving state information;
And under the condition that the meeting time is less than the threshold value, determining that the target object fails in the moving process.
According to an embodiment of the disclosure, the first target network is obtained by training a first policy network using a first objective function and a value network; the second target network is obtained by training a second policy network using a second objective function and the value network;
the movement control method of the target object further comprises the following steps:
determining a sample reference line according to a second movement goal of the sample object, wherein the sample reference line carries a plurality of sample waypoints;
determining a sample waypoint set from the plurality of sample waypoints according to a preset condition;
determining the number of training rounds and the number of training steps per training round;
repeatedly performing the following for each training step of each training round:
inputting the sample attribute information in the target scene and the sample waypoints in the sample waypoint set into an initialized first policy network, and outputting an initial waypoint;
inputting the sample attribute information and a second reference speed of the sample object at the sample waypoint into an initialized second policy network, and outputting an initial speed;
inputting the initial waypoint and the initial speed into an initialized value network, and outputting a reward value obtained while controlling the movement of the sample object;
in a case where the reward value does not exceed a threshold, updating parameters of the first policy network according to the first objective function, and obtaining the first target network in a case where the number of updates reaches a first preset number of updates; and
in a case where the reward value does not exceed the threshold, updating parameters of the second policy network according to the second objective function, and obtaining the second target network in a case where the number of updates reaches the first preset number of updates.
According to an embodiment of the present disclosure, controlling movement of a target object based on a target waypoint and a target speed includes:
determining a moving instruction according to the target path point, the target speed and the tracking algorithm;
and controlling the movement of the target object according to the movement instruction.
According to an embodiment of the present disclosure, attribute information of a target object in a target scene includes: speed information of the target object, movement angle information of the target object, information associated with the lane, and relative position information of the target object and each path point;
the method for inputting the attribute information of the plurality of path points and the target object in the target scene into the first target network, outputting the target path point of the target object at the t+1th moment next to the t moment, and comprises the following steps:
Attribute information of a plurality of waypoints and target objects in a target scene is input to a first target network so as to perform the following operations:
according to the target scene, determining the number of lanes around the lane where the target object is located;
determining the distributed path points on each lane according to the number of lanes, the plurality of path points and the relative position information of the target object and each path point;
and outputting the target path point according to the speed information of the target object, the moving angle information of the target object, the information related to the lanes and the path points distributed on each lane.
According to an embodiment of the present disclosure, inputting attribute information and a first reference speed of a target object at a target waypoint into a second target network, outputting a target speed, includes:
inputting attribute information and a first reference speed of the target object at the target waypoint into the second target network so as to perform the following operations:
predicting the predicted speed of the target object at the target path point according to the speed information of the target object, the moving angle information of the target object, the information related to the lanes and the path point distributed on each lane;
and outputting the target speed according to the predicted speed and the first reference speed.
According to an embodiment of the present disclosure, in a case where the target scene is determined to be a road scene, the information associated with the lane includes a lane-change availability condition characterizing whether the lane in which the target object is located can be changed;
in a case where the target scene is determined to be an intersection scene, the information associated with the lane includes distance information representing the distance between the target object and the intersection at time t;
in a case where the target scene is determined to be a road merging and separation scene, the information associated with the lane includes information characterizing the distance between the target object and the merge point or exit point at time t.
A second aspect of the present disclosure provides a movement control device for a target object, including:
a first determining module, configured to determine a target agent corresponding to a target scene according to the target scene in which the target object is located at time t, wherein the target agent is trained based on sample attribute information of a sample object in the target scene and includes a first target network and a second target network;
a second determining module, configured to determine a target reference line according to a first movement goal of the target object, wherein the target reference line carries a plurality of waypoints;
a first processing module, configured to input the plurality of waypoints and attribute information of the target object in the target scene into the first target network and output the target waypoint at which the target object is located at time t+1, the moment following time t;
a second processing module, configured to input the attribute information and a first reference speed of the target object at the target waypoint into the second target network and output a target speed, wherein the target speed is the speed of the target object at the target waypoint; and
a first control module, configured to control the movement of the target object based on the target waypoint and the target speed.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the movement control method of the target object.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the method of controlling movement of a target object as described above.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described movement control method of a target object.
A sixth aspect of the present disclosure also provides an autonomous vehicle comprising the electronic device described above.
According to the embodiments of the present disclosure, the target agent can be obtained by matching agents trained in different scenes against the target scene in which the target object is located. Using the target agent to determine the target waypoint and the speed of the target object at that waypoint means that concrete problems such as obstacle handling and traffic-regulation compliance, which the path planning module would otherwise have to solve, can be resolved while the agent is trained, shifting a large amount of time consumed by online processing to offline training of the agent in advance. The target agent can therefore determine the target waypoint and the speed at that waypoint accurately and in a short time, and the determined target waypoint and target speed are close to optimal, solving the problem in existing decision planning that obstacle handling, traffic regulations, and other concrete problems are all handled in the path planning module, leading to long computation times.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a target object movement control method, apparatus, device, medium, program product, and autonomous vehicle according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a method of movement control of a target object according to an embodiment of the disclosure;
FIG. 3 (a) schematically illustrates a longitudinal sampling schematic of the Lattice algorithm at cruise according to an embodiment of the present disclosure;
FIG. 3 (b) schematically illustrates a schematic diagram of waypoint sampling according to an embodiment of the disclosure;
FIG. 4 schematically illustrates an unmanned vehicle motion planning schematic in accordance with an embodiment of the present disclosure;
FIG. 5 (a) schematically illustrates an H-PPO algorithm architecture diagram according to another embodiment of the present disclosure;
FIG. 5 (b) schematically illustrates two action network structure diagrams under the H-PPO algorithm according to another embodiment of the present disclosure;
FIG. 6 schematically illustrates a block diagram of a movement control device of a target object according to an embodiment of the present disclosure; and
fig. 7 schematically illustrates a block diagram of an electronic device adapted to implement a method of movement control of a target object according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B and C" are used, they should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
In the technical solution of the present disclosure, the related user information (including, but not limited to, user personal information, user image information, user equipment information, such as location information, etc.) and data (including, but not limited to, data for analysis, stored data, displayed data, etc.) are information and data authorized by the user or sufficiently authorized by each party, and the related data is collected, stored, used, processed, transmitted, provided, disclosed, applied, etc. and processed, all in compliance with the related laws and regulations and standards of the related country and region, necessary security measures are taken, no prejudice to the public order, and corresponding operation entries are provided for the user to select authorization or rejection.
In the technical scheme of the embodiment of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
In implementing the present disclosure, it was found that a decision planning system includes a path planning module, a behavior decision module, and a control execution module. The behavior decision module outputs a specific driving behavior; the path planning module outputs, according to that driving behavior and the surrounding environment information, a trajectory that complies with safety rules, traffic regulations and the like; and finally the control execution module executes the trajectory. Path planning has many kinds of methods, such as model-based optimization and sampling-based methods, but it can essentially be viewed as a search problem: searching a drivable area for a safe, efficient, executable trajectory. The accuracy of the final trajectory is tied to the algorithm's running time (higher accuracy requires longer running time), so engineers generally have to balance the two. The role of the behavior decision is to reduce the search space of path planning; for example, if the behavior decision outputs a left lane change, only the left lane needs to be searched. The decision period of the behavior decision module is generally about 2 s, which prevents the unmanned vehicle from changing behaviors so frequently that the trajectory becomes unsmooth. The decision frequency of the path planning module is typically around 20 Hz, to avoid collisions and accumulated errors. The cooperation of the behavior decision module and the path planning module ensures the smoothness of the unmanned vehicle's trajectory, but it weakens the role of the behavior decision module: concrete problems such as obstacle handling and traffic-regulation compliance are all handled in the path planning module, so the computation time is still long.
Embodiments of the present disclosure provide a movement control method, apparatus, device, medium, program product, and autonomous vehicle for a target object. The movement control method of the target object includes: determining a target agent corresponding to a target scene according to the target scene in which the target object is located at time t, wherein the target agent is trained based on sample attribute information of a sample object in the target scene and includes a first target network and a second target network; determining a target reference line according to a first movement goal of the target object, wherein the target reference line carries a plurality of waypoints; inputting the plurality of waypoints and attribute information of the target object in the target scene into the first target network, and outputting the target waypoint at which the target object is located at time t+1, the moment following time t; inputting the attribute information and a first reference speed of the target object at the target waypoint into the second target network, and outputting a target speed, wherein the target speed is the speed of the target object at the target waypoint; and controlling the movement of the target object based on the target waypoint and the target speed.
Fig. 1 schematically illustrates an application scenario diagram of a target object movement control method, apparatus, device, medium, program product and autonomous vehicle according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a first target object 101, a second target object 102, a third target object 103, a network 104, and a server 105. The network 104 is used as a medium to provide a communication link between the first target object 101, the second target object 102, the third target object 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 over the network 104 using at least one of the first target object 101, the second target object 102, the third target object 103, to receive or send messages, etc. The first, second and third target objects 101, 102, 103 may be provided with a controller for controlling movement of the target object according to the route point and the speed of the target object at the route point, a sensor for sensing the surrounding environment, and a high-precision map navigation and positioning for guiding the target object to travel.
The first target object 101, the second target object 102, and the third target object 103 may be autonomous or unmanned vehicles, wheeled mobile robots, and the like, equipped with high-precision map navigation and positioning, sensors, and controllers.
The server 105 may be a server providing various services, for example a background management server (by way of example only) that supports users navigating or driving autonomously with the first target object 101, the second target object 102, and the third target object 103. The background management server may perform processing such as analysis according to the target scene in which the target object is located at time t, and feed the processing result (for example, a generated movement instruction for controlling the target object) back to the target object.
It should be noted that, the method for controlling movement of the target object provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the movement control device of the target object provided in the embodiments of the present disclosure may be generally disposed in the server 105. The method for controlling movement of a target object provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first target object 101, the second target object 102, the third target object 103, and/or the server 105. Accordingly, the movement control device of the target object provided in the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the first target object 101, the second target object 102, the third target object 103, and/or the server 105.
It should be understood that the number of target objects, networks and servers in fig. 1 is merely illustrative. There may be any number of target objects, networks, and servers, as desired for implementation.
The movement control method of the target object of the disclosed embodiment will be described in detail below with reference to the scenario described in fig. 1 through fig. 2 to 5.
Fig. 2 schematically illustrates a flowchart of a method of movement control of a target object according to an embodiment of the present disclosure.
As shown in fig. 2, the movement control method 200 of the target object of this embodiment includes operations S210 to S250.
In operation S210, a target agent corresponding to the target scene is determined according to the target scene in which the target object is located at time t, where the target agent is trained based on sample attribute information of the sample object in the target scene and includes a first target network and a second target network.
According to embodiments of the present disclosure, the target object may be an autonomous or unmanned vehicle or a wheeled mobile robot, or the like, with high precision map navigation and positioning, sensors, controllers, and the like.
According to an embodiment of the present disclosure, the target scene may be determined according to position information obtained by positioning during actual movement of the target object.
According to embodiments of the present disclosure, the target agent is pre-trained. It may be obtained by training with an objective function and a value network based on the sample attribute information of the sample object in the target scene, with the agent's parameters updated by minimizing the objective function. The value network may be used to evaluate the waypoints and speeds produced during training. The sample attribute information may be information that is related to both the target scene and the sample object and helps the sample object determine waypoints.
The first target network is used to determine the target waypoint at which the target object is located at time t+1, the moment following time t. The second target network is used to determine the speed of the target object at the target waypoint.
For example, if the target scene determined from the position information obtained by positioning during the actual movement of the target object is a road scene, the target agent corresponding to the target scene is an agent trained in the road scene. If the determined target scene is an intersection scene, the corresponding target agent is an agent trained in the intersection scene. If the determined target scene is a road merging and separation scene, the corresponding target agent is an agent trained in the road merging and separation scene.
In operation S220, a target reference line is determined according to the first movement goal of the target object, wherein the target reference line carries a plurality of waypoints.
According to embodiments of the present disclosure, the first movement goal may be the final task point the target object must reach to complete its task. A corresponding target reference line can be planned by a simulation model according to the task the target object executes, guiding the target object to the final task point to be reached. The target reference line may be represented as a series of discrete waypoints.
For example, in a road scene the road may consist of one or more lanes, and the target reference line may be a reference line composed of a plurality of waypoints on those lanes.
In operation S230, the plurality of waypoints and the attribute information of the target object in the target scene are input into the first target network, which outputs the target waypoint at which the target object is located at time t+1, the moment following time t.
According to embodiments of the present disclosure, the attribute information can be information related to both the target scene and the target object that helps the target object determine waypoints. The first target network may be a network capable of deciding the target waypoint at which the target object is located at time t+1, the moment following time t.
For example, in a road scene, the attribute information may be information related to a road, such as lane change information, etc.; the attribute information may also be information related to the target object, such as motion information of the target object, and the like.
In operation S240, the attribute information and the first reference speed of the target object at the target waypoint are input to the second target network, and the target speed, which is the speed of the target object at the target waypoint, is output.
According to an embodiment of the present disclosure, the first reference speed is used to characterize a maximum movement speed at which the target object is allowed to move at the target waypoint. The first reference speed may be determined according to traffic rules. The second target network may be a network capable of predicting the speed at which the target object is located at the target path point at time t+1.
In operation S250, movement of the target object is controlled based on the target waypoint and the target speed.
According to the embodiment of the disclosure, a control instruction can be generated by the controller on the target object according to the target waypoint and the target speed, thereby controlling the movement of the target object.
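By way of illustration only, the following Python sketch shows how operations S210 to S250 could be composed into a single per-step control loop. All interfaces and names in the sketch (agents, scene, controller, waypoint_net, speed_net and their methods) are assumptions made for illustration and are not prescribed by the present disclosure.

def control_step(target_object, scene, agents, controller):
    # S210: pick the agent trained for the current scene type (road / intersection / merge).
    agent = agents[scene.type]                       # agent holds the first and second target networks

    # S220: build the reference line toward the movement goal and sample waypoints on it.
    waypoints = scene.sample_reference_line(target_object.goal)

    # S230: first target network -> target waypoint for time t+1.
    obs = scene.attribute_info(target_object, waypoints)
    target_waypoint = agent.waypoint_net.select(obs, waypoints)

    # S240: second target network -> target speed at that waypoint,
    # bounded by the first reference speed (e.g. the permitted maximum there).
    reference_speed = scene.reference_speed(target_waypoint)
    target_speed = agent.speed_net.predict(obs, target_waypoint, reference_speed)

    # S250: hand the (waypoint, speed) pair to a tracking controller.
    command = controller.track(target_object.state, target_waypoint, target_speed)
    target_object.apply(command)
    return target_waypoint, target_speed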
According to the embodiment of the disclosure, the target agent can be obtained by matching agents trained in different scenes against the target scene in which the target object is located. Using the target agent to determine the target waypoint and the speed of the target object at that waypoint means that concrete problems such as obstacle handling and traffic-regulation compliance, which the path planning module would otherwise have to solve, can be resolved while the agent is trained, shifting a large amount of time consumed by online processing to offline training of the agent in advance. The target agent can therefore determine the target waypoint and the speed at that waypoint accurately and in a short time, and the determined target waypoint and target speed are close to optimal, solving the problem in existing decision planning that obstacle handling, traffic regulations, and other concrete problems are all handled in the path planning module, leading to long computation times.
According to an embodiment of the present disclosure, the movement control method of the target object may further include:
and stopping the movement control of the target object under the condition that the target object is determined to be faulty in the movement process.
According to the embodiment of the disclosure, whether the target object fails during movement can be determined by determining whether the target object collides during movement: if the target object collides during movement, it is determined to have failed; if it does not collide, it is determined not to have failed. When the target object is determined to have failed during movement, a stop control instruction may be generated to stop the movement control of the target object so that it stops moving. After the stop control instruction is generated, determination of the target waypoint and the target speed may be resumed.
According to the embodiment of the disclosure, because the waypoint is decided and selected by the target agent, which is a black-box neural network model, a potential safety risk exists. Determining whether the target object fails during movement therefore helps fully guarantee the feasibility of the target agent's decision result and the safety of the target object while executing that decision result.
According to an embodiment of the present disclosure, the movement control method of the target object may further include:
in the process of controlling the movement of the target object, acquiring first movement state information of the target object and second movement state information of other objects except the target object in a target scene;
determining the meeting time of the target object and other objects according to the target path point, the target speed, the first moving state information and the second moving state information;
and under the condition that the meeting time is less than the threshold value, determining that the target object fails in the moving process.
According to the embodiment of the present disclosure, the first movement state information and the second movement state information may be acquired through the high-precision map navigation and positioning, sensors, and the like possessed by the target object. The encounter time may be the time at which a collision could occur. The time at which the target object may collide with other objects can be calculated by the tracking controller, and in a case where this time is smaller than a threshold, the target object is determined to have failed during movement and the movement control of the target object is stopped. The threshold may be determined based on the time for the target object to move to the target waypoint.
According to the embodiment of the disclosure, by determining the encounter time between the target object and other objects, a failure of the target object during movement can be accurately identified when the encounter time is smaller than the threshold, the movement control of the target object is stopped, and the safety of the target object during movement is guaranteed.
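By way of example only, the encounter-time check described above could be implemented as in the following sketch. The straight-line constant-velocity model over a short horizon and the safety radius are assumptions of the sketch, not requirements of the embodiment.

import numpy as np

def encounter_time(p_ego, v_ego, p_other, v_other, radius=2.0):
    """Earliest time at which the two objects come within `radius` meters of each
    other, assuming both keep their current velocities; np.inf if they never do."""
    dp = np.asarray(p_other, dtype=float) - np.asarray(p_ego, dtype=float)   # relative position
    dv = np.asarray(v_other, dtype=float) - np.asarray(v_ego, dtype=float)   # relative velocity
    a = dv @ dv
    b = 2.0 * (dp @ dv)
    c = dp @ dp - radius ** 2
    if c <= 0.0:                      # already closer than the safety radius
        return 0.0
    if a == 0.0:                      # identical velocities: the distance never changes
        return np.inf
    disc = b * b - 4.0 * a * c
    if disc < 0.0:                    # the paths never come within the radius
        return np.inf
    t = (-b - np.sqrt(disc)) / (2.0 * a)
    return t if t >= 0.0 else np.inf

def has_fault(p_ego, v_ego, others, threshold_s=3.0):
    """Fault if the encounter time with any other object falls below the threshold."""
    return any(encounter_time(p_ego, v_ego, p, v) < threshold_s for p, v in others)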
FIG. 3 (a) schematically illustrates a longitudinal sampling schematic of the Lattice algorithm at cruise according to an embodiment of the present disclosure; fig. 3 (b) schematically illustrates a schematic diagram of path point sampling according to an embodiment of the present disclosure.
The Lattice algorithm decouples the trajectory into lateral and longitudinal components, converting a three-dimensional problem into two-dimensional problems and thereby reducing the solving complexity. The key steps of the algorithm are: generating bundles of lateral and longitudinal trajectories; evaluating each trajectory with an evaluation function; analyzing the feasibility of the trajectories according to the evaluation results and outputting the optimal trajectory; and coupling the lateral and longitudinal trajectories. The core is to generate lateral and longitudinal trajectories and to design reasonable evaluation functions so that the quality of the trajectories can be evaluated effectively.
For trajectory generation, the Lattice algorithm describes the trajectory with polynomial curves: it samples the end state in the state space and, combining it with the initial state, solves for the polynomial coefficients, thereby obtaining a set of candidate trajectories.
For example, as shown in fig. 3 (a), the generation of a longitudinal trajectory in the cruising state is described. Since there is no obstacle, longitudinal planning can be performed considering only the target speed of the vehicle, and the longitudinal trajectory of the vehicle is described by the fourth-order polynomial shown in formula (1):
s(t) = c0 + c1·t + c2·t^2 + c3·t^3 + c4·t^4 (1)
For the sampling of the end state, the sampling horizon can be set to 8 s with a sampling interval of 1 s, so the sampling times are [1s, 2s, 3s, 4s, 5s, 6s, 7s, 8s]. At a given sampling time, the speed range at that time is first calculated from the vehicle's maximum acceleration and deceleration and the cruise speed limit, and 4 points are then inserted uniformly within that range, yielding 6 sampling points at that time; the sampling process is shown in fig. 3 (b). Knowing the initial state (position, speed and acceleration) and a sampled end state (speed and acceleration), the coefficients c0, c1, c2, c3, c4 in formula (1) can be solved. Sampling thus yields 48 end states, from which 48 trajectories can be solved.
For the evaluation function, the Lattice algorithm can design the evaluation function in terms of horizontal and vertical comfort, safety, task completion and the like.
From the above analysis, the key of the Lattice algorithm is to divide and sample the state space reasonably. The planning accuracy of the Lattice algorithm is proportional to the sampling density: the greater the sampling density, the higher the accuracy of the planned trajectory, but the greater the corresponding amount of computation. The present disclosure converts the sampling into action selection by the agent, thereby avoiding the computation the Lattice algorithm performs at decision time.
According to an embodiment of the disclosure, the first target network is obtained by training the first policy network using the first objective function and the value network; the second target network is obtained by training the second policy network using the second objective function and the value network.
The method for controlling movement of the target object may further include:
determining a sample reference line according to a second moving target of the sample object, wherein the sample reference line is provided with a plurality of sample path points;
determining a sample waypoint set from the plurality of sample waypoints according to a preset condition;
determining the number of training rounds and the number of training steps of each training round;
the following is repeatedly performed for each training step of each training round:
sample attribute information and sample path points in a sample path point set in a target scene are input into an initialized first strategy network, and initial path points are output;
inputting the sample attribute information and the second reference speed of the sample object at the sample path point into an initialized second strategy network, and outputting an initial speed;
inputting the initial path point and the initial speed into an initialized value network, and outputting a reward value in the process of controlling the movement of the sample object;
updating parameters of the first policy network according to the first objective function in a case where the reward value does not exceed a threshold, and obtaining the first target network in a case where the number of updates reaches a first preset number of updates;
and updating parameters of the second policy network according to the second objective function in a case where the reward value does not exceed the threshold, and obtaining the second target network in a case where the number of updates reaches the first preset number of updates.
According to embodiments of the present disclosure, the second moving target may be a final task point to which the sample object performs the task. Corresponding sample reference lines can be planned according to the sample object execution task through the simulation model, and the sample object is guided to reach a final task point to be reached by the execution task. The sample reference line may be represented as a series of discrete waypoints.
According to an embodiment of the present disclosure, the first objective function may be determined according to an initial policy of the first policy network at time t and a policy to be updated of the first policy network at time t.
For example, the first objective function J(θ_c) can be expressed as shown in formula (2):
J(θ_c) = E_t[ min( r_t(θ_c)·Â_t, clip(r_t(θ_c), 1-ε, 1+ε)·Â_t ) ] (2)
where r_t(θ_c) = π_θ_c(a_t|s_t) / π_θ_c_old(a_t|s_t) is the probability ratio of the first policy network between the policy to be updated and the initial policy, Â_t denotes the advantage function, and ε denotes a custom parameter.
According to an embodiment of the present disclosure, the second objective function may be determined according to an initial policy of the second policy network at time t and a policy to be updated of the second policy network at time t.
For example, the second objective function J(θ_d) can be expressed as shown in formula (3):
J(θ_d) = E_t[ min( r_t(θ_d)·Â_t, clip(r_t(θ_d), 1-ε, 1+ε)·Â_t ) ] (3)
where r_t(θ_d) = π_θ_d(a_t|s_t) / π_θ_d_old(a_t|s_t) is the probability ratio of the second policy network between the policy to be updated and the initial policy, Â_t denotes the advantage function, and ε denotes a custom parameter.
It should be noted that the first policy network and the second policy network are independent of each other and have their own objective functions; when the networks are updated, the two policy networks are updated independently according to their respective objective functions. When the ratio r_t(θ_c) or r_t(θ_d) is too large, the gradient of formula (2) or formula (3) becomes too large, training is unstable, and the optimal solution is easily missed; therefore the policy to be updated cannot differ too much from the initial policy, and the clip function constrains the ratio to the interval [1-ε, 1+ε].
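By way of example only, the clipped surrogate objective of formulas (2) and (3) could be computed as in the following PyTorch sketch. Because the two policy networks are updated independently, the same loss form is applied once to the first policy network (waypoint selection) and once to the second policy network (speed output); the function name and the default value of ε are assumptions of the sketch.

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate: minimizing this maximizes J(theta) in formula (2) or (3)."""
    ratio = torch.exp(logp_new - logp_old)                     # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

In use, ppo_clip_loss would be evaluated with the log-probabilities of the selected waypoints for the first policy network and with those of the output speeds for the second policy network, each network being stepped by its own optimizer.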
According to an embodiment of the present disclosure, the preset condition may be a preset sampling density. The sample path points can be sampled from a plurality of sample path points according to a preset sampling density, so that a preset number of sample path points are obtained, and the sample path points form a sample path point set.
For example, the sample waypoint set P_d can be expressed as P_d = {p_1, p_2, ..., p_n}, where each waypoint p corresponds to a continuous velocity interval v_p = [0, max_speed_p]. The complete action output by the agent is a tuple (p, v) with p ∈ P_d and v ∈ v_p, and the entire action space A can be represented by formula (4):
A = { (p, v) | p ∈ P_d, v ∈ [0, max_speed_p] } (4)
Because the agent needs to output actions in this action space, i.e. to select waypoints, when training the agent it must be able to help the sample object make decisions according to the observation space of the different scenes, and information related to P_d therefore needs to be added to the observation space.
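By way of example only, the hybrid action space of formula (4), a discrete waypoint paired with a continuous speed, could be represented as in the following sketch. The fixed-stride sampling used to build P_d is one possible reading of the "preset condition" and is an assumption of the sketch.

from dataclasses import dataclass
from typing import List, Tuple
import random

@dataclass
class WaypointCandidate:
    x: float
    y: float
    max_speed: float     # upper bound of the continuous speed interval v_p = [0, max_speed_p]

def sample_waypoint_set(reference_line: List[Tuple[float, float]],
                        stride: int, speed_limit: float) -> List[WaypointCandidate]:
    """Build P_d by taking every `stride`-th point of the reference line."""
    return [WaypointCandidate(x, y, speed_limit) for x, y in reference_line[::stride]]

def random_action(P_d: List[WaypointCandidate]) -> Tuple[int, float]:
    """A complete action is the tuple (p, v): a discrete waypoint index and a speed in [0, max_speed_p]."""
    i = random.randrange(len(P_d))
    return i, random.uniform(0.0, P_d[i].max_speed)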
The number of training rounds may be M and the maximum number of steps per round may be T; the first policy network θ_c, the second policy network θ_d and the value network φ may be randomly initialized. The following is then repeatedly performed for each training step of each training round: the sample attribute information in the target scene and the sample waypoints in the sample waypoint set are input into the initialized first policy network, which outputs an initial waypoint; the sample attribute information and the second reference speed of the sample object at the sample waypoint are input into the initialized second policy network, which outputs the initial speed at that waypoint. In other words, the two policy networks decide a desired waypoint p and a desired speed v_p on reaching that point; the controller controls the travel of the sample object based on p and v_p, the initialized value network φ is applied, the cumulative reward r_t obtained during the control process is calculated, and the local trajectory τ is stored in the form {s_t, a_t, r_t, s_t+1}, where s_t denotes the state of the sample object at time t, a_t denotes the complete action output by the agent at time t, i.e. (p, v_p), and s_t+1 denotes the state of the sample object at time t+1.
The parameters of the first policy network may be updated according to the above formula (2), and the first target network may be obtained until the number of updates is determined to reach the first preset number of updates.
The parameters of the second policy network may be updated according to the above formula (3), and the second target network may be obtained until it is determined that the number of updates reaches the second preset number of updates. The first preset number of updates may be the same as the second preset number of updates.
The value network φ may update its parameters by minimizing a third objective function. For example, the third objective function L(φ) can be represented by equation (5):
L(φ) = Σ_t ( V_φ(s_t) - Σ_{t'≥t} γ^(t'-t)·r_t' )^2 (5)
where V_φ(s_t) denotes the value network φ's estimate for the sample object whose state at time t is s_t, and γ^(t'-t) denotes the discount applied to the reward estimate at time t'.
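By way of example only, the training procedure described above could be organized as in the following skeleton. The rollout helper, the trajectory container and the hyper-parameters are assumptions of the sketch, and ppo_clip_loss refers to the sketch given after formula (3); only the overall structure (rollout, discounted returns for formula (5), and independent updates of θ_c, θ_d and φ) follows the text.

import torch
import torch.nn.functional as F

def discounted_returns(rewards, gamma=0.99):
    """Value targets G_t = sum over t' >= t of gamma^(t'-t) * r_t', as used in formula (5)."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return torch.tensor(list(reversed(out)), dtype=torch.float32)

def train(env, policy_c, policy_d, value_net, opt_c, opt_d, opt_v, M=1000, T=256):
    for episode in range(M):                                  # M training rounds
        traj = rollout(env, policy_c, policy_d, max_steps=T)  # assumed helper storing {s_t, a_t, r_t, s_t+1}
        returns = discounted_returns(traj.rewards)
        values = value_net(traj.states).squeeze(-1)
        advantages = (returns - values).detach()

        # Independent clipped-surrogate updates of the two policy networks (formulas (2) and (3)).
        opt_c.zero_grad()
        ppo_clip_loss(policy_c.log_prob(traj.states, traj.waypoints),
                      traj.logp_waypoints_old, advantages).backward()
        opt_c.step()

        opt_d.zero_grad()
        ppo_clip_loss(policy_d.log_prob(traj.states, traj.speeds),
                      traj.logp_speeds_old, advantages).backward()
        opt_d.step()

        # Value network update by minimizing formula (5).
        opt_v.zero_grad()
        F.mse_loss(values, returns).backward()
        opt_v.step()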
According to the embodiment of the disclosure, by drawing on the way the Lattice algorithm samples the end state, the hybrid action space of the agent is designed, and the large amount of time consumed by the online selection of the Lattice algorithm is shifted to the offline training of the agent; the agents trained for different scenes can output the optimal end state, that is, an accurate waypoint and the speed on reaching that waypoint.
According to an embodiment of the present disclosure, controlling movement of a target object based on a target waypoint and a target speed includes:
determining a moving instruction according to the target path point, the target speed and the tracking algorithm;
and controlling the movement of the target object according to the movement instruction.
According to the embodiment of the disclosure, the target path point and the target speed are input into the tracking algorithm, a movement instruction can be output, and the movement of the target object can be controlled according to the movement instruction.
According to the embodiment of the present disclosure, in order to execute the decision result, namely the target waypoint and the target speed rapidly and accurately determined by the target agent, a pure pursuit tracking algorithm is used to accurately control the target object.
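By way of example only, the pure pursuit tracking step that turns the target waypoint and the target speed into a movement instruction could look like the following sketch; the bicycle-model wheelbase and the proportional speed control are assumptions made for illustration.

import math

def pure_pursuit_command(x, y, yaw, v, target_waypoint, target_speed,
                         wheelbase=2.7, k_speed=0.5):
    """Return (steering_angle, acceleration) steering toward `target_waypoint`
    while regulating the speed toward `target_speed`."""
    dx, dy = target_waypoint[0] - x, target_waypoint[1] - y
    alpha = math.atan2(dy, dx) - yaw            # angle between heading and the target waypoint
    lookahead = math.hypot(dx, dy)              # distance to the target waypoint
    steering = math.atan2(2.0 * wheelbase * math.sin(alpha), lookahead)
    acceleration = k_speed * (target_speed - v) # simple proportional speed control
    return steering, acceleration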
According to an embodiment of the present disclosure, attribute information of a target object in a target scene includes: speed information of the target object, movement angle information of the target object, information associated with the lane, and relative position information of the target object and each of the waypoints.
Inputting the plurality of waypoints and the attribute information of the target object in the target scene into the first target network and outputting the target waypoint at which the target object is located at time t+1, the moment following time t, includes:
inputting the plurality of waypoints and the attribute information of the target object in the target scene into the first target network so as to perform the following operations:
According to the target scene, determining the number of lanes around the lane where the target object is located;
determining the distributed path points on each lane according to the number of lanes, the plurality of path points and the relative position information of the target object and each path point;
and outputting the target path point according to the speed information of the target object, the moving angle information of the target object, the information related to the lanes and the path points distributed on each lane.
According to an embodiment of the present disclosure, the speed information of the target object may include moving speed information of the target object, acceleration information of the target object moving, and the like. The movement angle information of the target object may include rotation angle information of the target object movement, heading angle information of the target object movement, and the like. Information associated with the lane may be determined from the target scene. The information associated with the lane may be distance information of the target object from the center line of the lane where it is currently located, or the like.
For example, as shown in fig. 3 (b), the target scene is a general road scene, the target object may be an unmanned vehicle, and the number of lanes around the lane in which the unmanned vehicle is located is three. The gray dots in fig. 3 (b) constitute a plurality of reference lines, and the black dots are a plurality of path points obtained by sampling. If the number of lanes around the unmanned vehicle is less than three, the same number of waypoints can be evenly distributed into one or two lanes around the unmanned vehicle in order to ensure that the input dimension remains unchanged.
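By way of example only, the fixed-dimension waypoint layout described above (waypoints redistributed evenly over the existing lanes when fewer than three surrounding lanes are present) could be prepared as in the following sketch; the per-lane resampling and padding scheme is an assumption of the sketch.

import numpy as np

def lane_waypoint_features(waypoints_by_lane, points_per_lane=16, max_lanes=3):
    """Stack relative (dx, dy) waypoint coordinates into a fixed array of
    max_lanes * points_per_lane rows. If fewer than max_lanes lanes exist, the
    available waypoints are spread evenly over the existing lanes so the
    network input dimension stays unchanged."""
    total = max_lanes * points_per_lane
    lanes = [np.asarray(w, dtype=float) for w in waypoints_by_lane if len(w) > 0]
    if not lanes:
        return np.zeros((total, 2))
    per_lane = total // len(lanes)
    rows = []
    for lane in lanes:
        idx = np.linspace(0, len(lane) - 1, per_lane).round().astype(int)
        rows.append(lane[idx])
    feats = np.concatenate(rows, axis=0)
    if len(feats) < total:                       # leftover rows from integer division
        feats = np.vstack([feats, np.repeat(feats[-1:], total - len(feats), axis=0)])
    return feats[:total]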
According to the embodiment of the disclosure, the first target network of the target agent determines the target waypoint from the speed information of the target object, the movement angle information of the target object, the information associated with the lane, and the waypoints distributed on each lane, so that the target agent can accurately determine the target waypoint in a short time, and the determined target waypoint is close to optimal.
According to an embodiment of the present disclosure, inputting attribute information and a first reference speed of a target object at a target waypoint into a second target network, outputting a target speed may include:
inputting attribute information and a first reference speed of the target object at the target waypoint into the second target network so as to perform the following operations:
predicting the predicted speed of the target object at the target path point according to the speed information of the target object, the moving angle information of the target object, the information related to the lanes and the path point distributed on each lane;
and outputting the target speed according to the predicted speed and the first reference speed.
According to an embodiment of the present disclosure, the first reference speed is used to characterize a maximum movement speed at which the target object is allowed to move at the target waypoint. The first reference speed may be determined according to traffic rules.
According to the embodiment of the disclosure, the maximum movement speed of allowing the target object to move at the target path point can be normalized, namely, the first reference speed is normalized, and the normalized reference speed is obtained. The normalized reference speed may be multiplied by the predicted speed to obtain the target speed.
According to the embodiment of the disclosure, the second target network of the target agent obtains the predicted speed from the speed information of the target object, the movement angle information of the target object, the information associated with the lane, and the waypoints distributed on each lane, so that the target agent can, based on the predicted speed, accurately determine the target speed at the target waypoint in a short time, and the determined target speed is close to optimal.
According to an embodiment of the present disclosure, in the case where the target scene is determined to be a road scene, the information associated with the lane includes a lane-change availability condition characterizing whether the lane in which the target object is located can be changed. The information associated with the lane may be information on whether there is a lane available for a lane change on the left or right.
In the case where the target scene is determined to be an intersection scene, the information associated with the lane includes information characterizing the distance between the target object and the intersection at time t. The information associated with the lane may be the distance of the target object from the intersection and information on whether the target object may use the intersection, that is, whether it has the right of way.
In the case where the target scene is determined to be a road merging and separation scene, the information associated with the lane includes information characterizing the distance between the target object and the merge point or exit point at time t. The information associated with the lane may be the distance of the target object from the merge point or the exit point.
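By way of example only, the scene-dependent lane information described above could be gathered into a single observation field as in the following sketch; the field names and the zero-encoding of fields that do not apply to the current scene are assumptions made for illustration.

from dataclasses import dataclass
from typing import Optional

@dataclass
class LaneInfo:
    scene: str                                     # "road", "intersection" or "merge"
    can_change_left: Optional[bool] = None         # road scene: lane-change availability
    can_change_right: Optional[bool] = None
    dist_to_intersection: Optional[float] = None   # intersection scene
    has_right_of_way: Optional[bool] = None
    dist_to_merge_or_exit: Optional[float] = None  # merging / separation scene

def lane_feature_vector(info: LaneInfo):
    """Flatten the scene-specific fields into a fixed-length numeric vector;
    fields that do not apply are encoded as 0.0 so the observation dimension is constant."""
    as_float = lambda x: float(x) if x is not None else 0.0
    return [as_float(info.can_change_left), as_float(info.can_change_right),
            as_float(info.dist_to_intersection), as_float(info.has_right_of_way),
            as_float(info.dist_to_merge_or_exit)]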
According to the embodiment of the disclosure, since the agent needs to output actions in the action space, namely, select the path points, the agent needs to help the sample object to make decisions according to the observation space of different scenes when training the agent, and information related to a plurality of path points and the sample object needs to be added into the observation space.
For example, in the case where the sample object is an unmanned vehicle, the observation properties of the observation space in different scenes may be specifically shown in tables 1 to 3 below.
Table 1. Observation properties of the unmanned vehicle in the road scene
Table 2. Observation properties of the unmanned vehicle in the intersection scene
Table 3. Observation properties of the unmanned vehicle in the lane merge and separation scene
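Since the bodies of tables 1 to 3 are not reproduced here, the following is only a hypothetical sketch of how a scene-specific observation vector could be assembled from the attributes named in the surrounding text (speed, movement angle, per-lane path points, and the scene-specific lane information); all container names and dimensions are assumptions.

```python
import numpy as np

def build_observation(scene: str, ego: dict, lane_waypoints: dict) -> np.ndarray:
    """Assemble a flat observation vector for one scene (hypothetical layout)."""
    # Features common to all scenes: speed, movement angle, relative waypoint positions.
    features = [ego["speed"], ego["heading"]]
    for lane_id, points in sorted(lane_waypoints.items()):
        for (dx, dy) in points:                    # waypoint positions relative to the ego
            features.extend([dx, dy])

    # Scene-specific lane information described in the text.
    if scene == "road":
        features.extend([ego["can_change_left"], ego["can_change_right"]])
    elif scene == "intersection":
        features.extend([ego["dist_to_intersection"], ego["has_right_of_way"]])
    elif scene == "merge_split":
        features.append(ego["dist_to_merge_or_exit_point"])

    return np.asarray(features, dtype=np.float32)
```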
Since the reward function needs to change correspondingly when the action space changes across different scenes, the reward function may specifically adopt the following formula (6):
r = r_c + r_ttc + r_o + r_goal + r_reach + r_steering + r_lane + r_speed + r_v + r_sub_goal    (6)
wherein r_c is a penalty term for a collision of the target object, taking a negative preset value if a collision occurs and zero otherwise; r_ttc is an incentive term that guides the target object to keep away from other objects and may be determined from the time to meeting with the other objects; r_o is a penalty term for the target object leaving the road, taking a negative preset value if it leaves the road and zero otherwise; r_goal is an incentive term related to the distance between the target object and the destination, with a smaller value for a larger distance; r_reach is an incentive term for the target object reaching the target path point, taking a preset value if the target path point is reached and zero otherwise; r_steering is an incentive term applied when the vehicle speed exceeds a preset speed and is inversely proportional to the steering wheel angle, with a smaller value for a larger angle; r_lane is an incentive term that guides the target object to travel along the lane center line as much as possible; r_speed is a penalty term applied when the speed of the target object exceeds the road speed limit, with a smaller value the more the speed limit is exceeded; r_v is an incentive term that guides the target object to match the target speed and is inversely proportional to the difference between the speed at which the target object reaches the path point and the target speed; and r_sub_goal is an incentive term that guides the target object to the target path point and is inversely proportional to the distance between the path point reached by the target object and the target path point.
Among these terms, r_c, r_ttc and r_o ensure the safety of the target object; r_goal and r_reach ensure that the target object can reach the target path point; r_steering ensures the smoothness of the motion of the target object; r_lane and r_speed make the target object comply with traffic regulations. r_v and r_sub_goal also reward or penalize, to some extent, whether the target path point and the target speed are easy for the target object to track: if they exceed the control capability of the target object, so that the target object can hardly reach the target path point or satisfy the speed requirement, a large penalty results, thereby guiding the agent to make decisions that are easy to execute.
For example, r_sub_goal = -0.1 * d_sub_goal, where d_sub_goal represents the distance between the path point reached by the target object and the target path point and is used to guide the target object to the target path point. When the distance between the path point where the unmanned vehicle is located and the target path point is smaller than 1.5 m, the target object is considered to have reached the target path point.
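To make the shape of formula (6) concrete, the following is a minimal sketch of a few of the terms in Python; only the coefficient -0.1 in r_sub_goal comes from the example above, and every other constant (collision penalty, reach bonus, speed coefficient) is an assumption.

```python
def collision_penalty(collided: bool, value: float = -10.0) -> float:
    """r_c: negative preset value on collision, zero otherwise (value is assumed)."""
    return value if collided else 0.0

def reach_reward(reached: bool, value: float = 1.0) -> float:
    """r_reach: preset value when the target path point is reached (value is assumed)."""
    return value if reached else 0.0

def speed_match_reward(reached_speed: float, target_speed: float, k: float = 0.1) -> float:
    """r_v: smaller the larger the gap to the target speed (coefficient is assumed)."""
    return -k * abs(reached_speed - target_speed)

def sub_goal_reward(dist_to_target_waypoint: float) -> float:
    """r_sub_goal = -0.1 * d_sub_goal, as in the example in the text."""
    return -0.1 * dist_to_target_waypoint

def total_reward(**terms: float) -> float:
    """Formula (6): the overall reward is the sum of the individual terms."""
    return sum(terms.values())
```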
According to the embodiment of the disclosure, by taking various road condition information into account in different scenes and training a separate agent for each scene, the corresponding target agent can be flexibly invoked in each scene during online control.
Fig. 4 schematically illustrates an unmanned vehicle motion planning schematic in accordance with an embodiment of the present disclosure.
As shown in fig. 4, the motion planning scheme of this embodiment draws on the design of a conventional motion planning system and improves the hierarchical behavior decision framework. In the hierarchical behavior decision, driving scenes can be switched appropriately through a finite state machine, an agent is selected for the current driving scene to make a decision, the agent outputs a specific path point and the speed at which that point should be reached, and a pure tracking (pure pursuit) algorithm is then used to control the vehicle to travel to the path point.
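For the tracking step, the pure tracking (pure pursuit) controller can be sketched as follows; this is the standard pure-pursuit geometry, and the wheelbase value as well as the use of the target path point directly as the look-ahead point are assumptions.

```python
import math

def pure_pursuit_steering(ego_x: float, ego_y: float, ego_yaw: float,
                          wp_x: float, wp_y: float, wheelbase: float = 2.7) -> float:
    """Front-wheel steering angle that drives the vehicle toward the target waypoint."""
    # Angle between the vehicle heading and the line to the waypoint.
    alpha = math.atan2(wp_y - ego_y, wp_x - ego_x) - ego_yaw
    # Distance to the look-ahead point (here: the target waypoint itself).
    ld = math.hypot(wp_x - ego_x, wp_y - ego_y)
    # Pure-pursuit curvature converted to a steering angle.
    return math.atan2(2.0 * wheelbase * math.sin(alpha), ld)
```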
With the improved hierarchical behavior decision framework, the agent can be trained using the vehicle state information, map data, global path plan, real-time environment perception information, information about other traffic participants and the like of the unmanned vehicle, so that the agent can make hierarchical behavior decisions.
It should be noted that, to ensure sufficient safety of the unmanned vehicle, a safety controller may be designed: while a decision result is being executed, if a collision is likely, a new decision is made to avoid the collision.
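A minimal sketch of such a safety check, assuming a one-dimensional constant-velocity estimate of the meeting time and an assumed threshold; the actual safety controller is not specified here.

```python
def time_to_collision(gap: float, gap_rate: float, eps: float = 1e-6) -> float:
    """Time until the gap to another object shrinks to zero (constant-velocity estimate).

    gap is the current distance to the other object, gap_rate its time derivative
    (negative when the objects are closing).
    """
    closing_speed = -gap_rate
    if closing_speed <= eps:
        return float("inf")          # not closing: no predicted collision
    return gap / closing_speed

def needs_replanning(gap: float, gap_rate: float, ttc_threshold: float = 3.0) -> bool:
    # Trigger a new decision when the estimated meeting time falls below the
    # threshold (the threshold value is an assumption).
    return time_to_collision(gap, gap_rate) < ttc_threshold
```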
FIG. 5 (a) schematically illustrates an H-PPO algorithm architecture diagram according to another embodiment of the present disclosure; fig. 5 (b) schematically illustrates a two-action network structure diagram under the H-PPO algorithm according to another embodiment of the present disclosure.
According to an embodiment of the present disclosure, an H-PPO algorithm may be used to process the action space of the agent of the present disclosure, where the architecture of the H-PPO algorithm is shown in fig. 5 (a), and after a state is input, the state may be encoded by a state encoding network and then pass through two parallel action networks. The discrete action network is used to learn a strategy to select one discrete waypoint and the continuous action network is used to learn a strategy to select a desired speed for each waypoint. The two action networks share a Critic network. In brief, the H-PPO algorithm can be considered to consist of two independent PPO agents, with the exception that they share a Critic network. Both action networks may employ a multi-layer perceptron.
For example, taking a road scene as an example, the structure of two action networks may be as shown in fig. 5 (b). The input in fig. 5 (b) is a 351-dimensional vector, and the input dimensions are different in other scenarios. After being encoded by the state encoding network, the two action networks are utilized for processing. Where fc_30 is a fully connected layer set for a number of waypoints of 30, and when the number of waypoints is modified, the number of neurons of the layer should also be modified. The expected path point can be obtained by sampling after processing through the discrete action network, and the expected speed of the target object at the expected path point can be obtained after processing through the continuous action network based on the expected path point. The first target network of the disclosed target agent may correspond to a discrete action network in an H-PPO algorithm and the second target network of the disclosed target agent may correspond to a continuous action network in an H-PPO algorithm. The final desired path point may be a target path point determined by the first target network and the desired speed may be a target speed determined by the second target network.
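The two parallel action networks described above could be sketched in PyTorch as follows; only the 351-dimensional input and the 30-way fc_30 layer come from the example, while the encoder sizes, the per-waypoint speed head, and the Gaussian parameterization of the speed are assumptions.

```python
import torch
import torch.nn as nn

class HPPOActor(nn.Module):
    """Two parallel action heads over a shared state-encoding network.

    The discrete head scores the 30 candidate waypoints (the fc_30 layer);
    the continuous head outputs one desired-speed mean per waypoint, and the
    speed for the sampled waypoint is read out by index. The shared critic
    is not shown.
    """

    def __init__(self, obs_dim: int = 351, num_waypoints: int = 30, hidden: int = 256):
        super().__init__()
        # Shared state-encoding network (multi-layer perceptron).
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Discrete action network: one logit per candidate waypoint.
        self.waypoint_head = nn.Linear(hidden, num_waypoints)
        # Continuous action network: desired-speed mean for each waypoint, in [0, 1].
        self.speed_head = nn.Linear(hidden, num_waypoints)
        self.log_std = nn.Parameter(torch.zeros(1))  # shared std of the speed Gaussian

    def forward(self, obs: torch.Tensor):
        h = self.encoder(obs)
        waypoint_logits = self.waypoint_head(h)           # categorical over waypoints
        speed_means = torch.sigmoid(self.speed_head(h))   # per-waypoint speed means
        return waypoint_logits, speed_means, self.log_std.exp()


# Hypothetical usage: sample a waypoint, then sample the speed for that waypoint.
actor = HPPOActor()
obs = torch.zeros(1, 351)
logits, speed_means, std = actor(obs)
waypoint = torch.distributions.Categorical(logits=logits).sample()
speed = torch.distributions.Normal(
    speed_means.gather(1, waypoint.unsqueeze(1)), std).sample()
```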
Based on the above method for controlling movement of the target object, the disclosure also provides a device for controlling movement of the target object. The device will be described in detail below in connection with fig. 6.
Fig. 6 schematically illustrates a block diagram of a movement control apparatus of a target object according to an embodiment of the present disclosure.
As shown in fig. 6, the movement control apparatus 600 of the target object of this embodiment includes a first determination module 610, a second determination module 620, a first processing module 630, a second processing module 640, and a first control module 650.
The first determining module 610 is configured to determine, according to a target scene where the target object at the t-th moment is located, a target agent corresponding to the target scene, where the target agent is obtained by training based on sample attribute information of the sample object in the target scene, and the target agent includes a first target network and a second target network. In an embodiment, the first determining module 610 may be configured to perform the operation S210 described above, which is not described herein.
The second determining module 620 is configured to determine a target reference line according to a first moving target that the target object moves, where the target reference line has a plurality of path points thereon. In an embodiment, the second determining module 620 may be configured to perform the operation S220 described above, which is not described herein.
The first processing module 630 is configured to input attribute information of a plurality of path points and a target object in a target scene into the first target network, and output a target path point where the target object is located at a t+1th moment next to the t moment. In an embodiment, the first processing module 630 may be configured to perform the operation S230 described above, which is not described herein.
The second processing module 640 is configured to input the attribute information and the first reference speed of the target object at the target waypoint into the second target network, and output a target speed, where the target speed is the speed of the target object at the target waypoint. In an embodiment, the second processing module 640 may be configured to perform the operation S240 described above, which is not described herein.
The first control module 650 is configured to control movement of the target object based on the target waypoint and the target speed. In an embodiment, the first control module 650 may be configured to perform the operation S250 described above, which is not described herein.
According to an embodiment of the present disclosure, the movement control device 600 of the target object further includes: and a second control module.
The second control module is used for stopping the movement control of the target object under the condition that the target object is determined to be faulty in the moving process.
According to an embodiment of the present disclosure, the movement control device 600 of the target object further includes: the device comprises an acquisition module, a third determination module and a fourth determination module.
The acquisition module is used for acquiring first movement state information of the target object and second movement state information of other objects except the target object in the target scene in the process of controlling movement of the target object.
The third determining module is used for determining the meeting time of the target object and other objects according to the target path point, the target speed, the first moving state information and the second moving state information.
And the fourth determining module is used for determining that the target object fails in the moving process under the condition that the meeting time is smaller than the threshold value.
According to an embodiment of the disclosure, the first target network is obtained after training a first policy network using a first objective function and a value network; the second target network is obtained after training a second policy network using a second objective function and the value network.
the movement control device 600 of the target object further includes: a fifth determination module, a sixth determination module, a seventh determination module, and a third processing module.
The fifth determining module is used for determining a sample reference line according to a second moving target of the sample object, wherein the sample reference line is provided with a plurality of sample path points.
The sixth determining module is used for determining a sample path point set from the plurality of sample path points on the sample reference line according to preset conditions.
The seventh determining module is used for determining the number of training rounds and the number of training steps of each training round.
The third processing module is configured to repeatedly perform the following operations for each training step of each training round (a sketch of the resulting loop is given after this list):
inputting the sample attribute information and the sample path points in the sample path point set in the target scene into the initialized first policy network, and outputting an initial path point;
inputting the sample attribute information and a second reference speed of the sample object at the sample path point into the initialized second policy network, and outputting an initial speed;
inputting the initial path point and the initial speed into the initialized value network, and outputting a reward value obtained in the process of controlling the movement of the sample object;
updating parameters of the first policy network according to the first objective function when the reward value is not greater than the threshold value, and obtaining the first target network when the number of updates reaches a first preset number of updates; and
updating parameters of the second policy network according to the second objective function when the reward value is not greater than the threshold value, and obtaining the second target network when the number of updates reaches the first preset number of updates.
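Putting the operations above together, a hypothetical training loop might look like the following; the environment interface, the update rule, and all method names are placeholders, since only the order of the operations and the two stopping conditions are specified.

```python
def train_agent(policy_wp, policy_speed, value_net, env,
                rounds: int, steps: int,
                reward_threshold: float, max_updates: int):
    """High-level sketch of the per-scene training loop (all interfaces assumed)."""
    updates = 0
    for _ in range(rounds):
        obs = env.reset()
        for _ in range(steps):
            waypoint = policy_wp.act(obs)                      # initial path point
            speed = policy_speed.act(obs, waypoint)            # initial speed
            reward = value_net.evaluate(obs, waypoint, speed)  # reward during control
            if reward <= reward_threshold:
                policy_wp.update()      # one step on the first objective function
                policy_speed.update()   # one step on the second objective function
                updates += 1
            if updates >= max_updates:
                # The trained networks become the first and second target networks.
                return policy_wp, policy_speed
            obs = env.step(waypoint, speed)
    return policy_wp, policy_speed
```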
According to an embodiment of the present disclosure, the first control module 650 may include: a first determination subunit and a control subunit.
The first determination subunit is configured to determine a movement instruction according to the target path point, the target speed, and the tracking algorithm.
The control subunit is used for controlling the movement of the target object according to the movement instruction.
According to an embodiment of the present disclosure, attribute information of a target object in a target scene includes: speed information of the target object, movement angle information of the target object, information associated with the lane, and relative position information of the target object and each path point;
wherein the first processing module 630 may include: the first input unit, the second determination subunit, the third determination subunit, and the first output unit.
The first input unit is used for inputting attribute information of a plurality of path points and target objects in a target scene into a first target network.
The second determining subunit is used for determining the number of lanes around the lane where the target object is located according to the target scene.
The third determination subunit is configured to determine a route point allocated on each lane according to the number of lanes, the plurality of route points, and relative position information of the target object and each route point.
The first output unit is used for outputting a target path point according to the speed information of the target object, the moving angle information of the target object, the information related to the lanes and the path points distributed on each lane.
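The allocation of path points to lanes performed by the second and third determination subunits could be sketched as follows; the nearest-centerline rule is an assumption, since only the use of the lane count and the relative positions is specified.

```python
import math

def assign_waypoints_to_lanes(waypoints, lane_centerlines):
    """Group candidate waypoints by the nearest lane centerline.

    waypoints: list of (x, y) points relative to the target object.
    lane_centerlines: dict mapping lane_id -> list of (x, y) centerline samples.
    """
    assignment = {lane_id: [] for lane_id in lane_centerlines}
    for wp in waypoints:
        best_lane, best_dist = None, float("inf")
        for lane_id, samples in lane_centerlines.items():
            d = min(math.hypot(wp[0] - sx, wp[1] - sy) for sx, sy in samples)
            if d < best_dist:
                best_lane, best_dist = lane_id, d
        assignment[best_lane].append(wp)
    return assignment
```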
According to an embodiment of the present disclosure, the second processing module 640 includes: the apparatus includes a second input unit, a prediction unit, and a second output unit.
The second input unit is used for inputting the attribute information and the first reference speed of the target object at the target path point into the second target network.
The prediction unit is used for predicting and obtaining the predicted speed of the target object at the target path point according to the speed information of the target object, the moving angle information of the target object, the information related to the lanes and the path point distributed on each lane.
The second output unit is used for outputting a target speed according to the predicted speed and the first reference speed.
According to an embodiment of the present disclosure, in the case where the target scene is determined to be a road scene, the information associated with the lane includes a lane-change availability condition characterizing the lane in which the target object is located;
in the case where the target scene is determined to be an intersection scene, the information associated with the lane includes distance information characterizing the distance between the target object and the intersection at the t-th time; and
in the case where the target scene is determined to be a road merging and separating scene, the information associated with the lane includes information characterizing the distance between the target object and the merge-in point or exit point at the t-th time.
Any of the first determination module 610, the second determination module 620, the first processing module 630, the second processing module 640, and the first control module 650 may be combined in one module to be implemented, or any of the modules may be split into a plurality of modules according to an embodiment of the present disclosure. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the first determination module 610, the second determination module 620, the first processing module 630, the second processing module 640, and the first control module 650 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or as hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or as any one of or a suitable combination of any of the three. Alternatively, at least one of the first determination module 610, the second determination module 620, the first processing module 630, the second processing module 640, and the first control module 650 may be at least partially implemented as a computer program module, which when executed, may perform the corresponding functions.
Fig. 7 schematically illustrates a block diagram of an electronic device adapted to implement a method of movement control of a target object according to an embodiment of the present disclosure.
As shown in fig. 7, an electronic device 700 according to an embodiment of the present disclosure includes a processor 701 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. The processor 701 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 701 may also include on-board memory for caching purposes. The processor 701 may comprise a single processing unit or a plurality of processing units for performing different actions of the method flows according to embodiments of the disclosure.
In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are stored. The processor 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. The processor 701 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 702 and/or the RAM 703. Note that the program may be stored in one or more memories other than the ROM 702 and the RAM 703. The processor 701 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in one or more memories.
According to an embodiment of the present disclosure, the electronic device 700 may further include an input/output (I/O) interface 705, the input/output (I/O) interface 705 also being connected to the bus 704. The electronic device 700 may also include one or more of the following components connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output portion 707 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 708 including a hard disk or the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. The drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read therefrom is mounted into the storage section 708 as necessary.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 702 and/or RAM 703 and/or one or more memories other than ROM 702 and RAM 703 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to perform the methods provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 701. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed over a network medium in the form of a signal, downloaded and installed via the communication section 709, and/or installed from the removable medium 711. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
According to embodiments of the present disclosure, the program code for carrying out the computer programs provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Such programming languages include, but are not limited to, Java, C++, Python, "C" and similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be provided in a variety of combinations and/or combinations, even if such combinations or combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or combined without departing from the spirit and teachings of the present disclosure. All such combinations and/or combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (13)

1. A movement control method of a target object, comprising:
determining a target intelligent agent corresponding to a target scene according to the target scene of a target object at a t moment, wherein the target intelligent agent is obtained by training based on sample attribute information of a sample object in the target scene, and comprises a first target network and a second target network;
Determining a target reference line according to a first moving target of the target object, wherein the target reference line is provided with a plurality of path points;
inputting attribute information of a plurality of path points and of the target object in the target scene into the first target network, and outputting a target path point where the target object is located at the (t+1)-th time next to the t-th time;
inputting the attribute information and a first reference speed of the target object at the target path point into the second target network, and outputting a target speed, wherein the target speed is the speed of the target object at the target path point; and
and controlling the movement of the target object based on the target path point and the target speed.
2. The method of claim 1, further comprising:
and stopping the movement control of the target object under the condition that the target object is determined to be faulty in the movement process.
3. The method of claim 2, further comprising:
acquiring first movement state information of the target object and second movement state information of other objects except the target object in the target scene in the process of controlling movement of the target object;
Determining the meeting time of the target object and the other objects according to the target path point, the target speed, the first moving state information and the second moving state information;
and under the condition that the meeting time is smaller than a threshold value, determining that the target object fails in the moving process.
4. A method according to any one of claims 1 to 3, wherein the first target network is obtained after training a first policy network using a first objective function and a value network; and the second target network is obtained after training a second policy network using a second objective function and the value network;
the method further comprises the steps of:
determining a sample reference line according to a second moving target of the sample object, wherein the sample reference line is provided with a plurality of sample path points;
determining a sample path point set from a plurality of sample path points according to preset conditions;
determining the number of training rounds and the number of training steps of each training round;
repeating the following operations for each of the training steps of each of the training rounds:
inputting the sample attribute information and the sample path points in the sample path point set in the target scene into the initialized first policy network, and outputting an initial path point;
inputting the sample attribute information and a second reference speed of the sample object at the sample path point into the initialized second policy network, and outputting an initial speed;
inputting the initial path point and the initial speed into the initialized value network, and outputting a reward value in the process of controlling the movement of the sample object;
updating parameters of the first policy network according to the first objective function under the condition that the reward value is not more than a threshold value, and obtaining the first target network under the condition that the number of updates reaches a first preset number of updates; and
updating parameters of the second policy network according to the second objective function under the condition that the reward value is not more than the threshold value, and obtaining the second target network under the condition that the number of updates reaches the first preset number of updates.
5. A method according to any one of claims 1 to 3, wherein said controlling movement of said target object based on said target waypoint and said target speed comprises:
determining a moving instruction according to the target path point, the target speed and a tracking algorithm;
And controlling the movement of the target object according to the movement instruction.
6. The method of claim 1, wherein the attribute information of the target object under the target scene comprises: speed information of the target object, movement angle information of the target object, information associated with the lane, and relative position information of the target object and each of the waypoints;
the inputting attribute information of the plurality of path points and the target object in the target scene into the first target network, and outputting the target path point where the target object is located at the t+1th moment next to the t moment includes:
inputting attribute information of a plurality of the waypoints and the target object in the target scene into the first target network so as to perform the following operations:
determining the number of lanes around the lane where the target object is located according to the target scene;
determining the distributed path points on each lane according to the number of lanes, the plurality of path points and the relative position information of the target object and each path point;
and outputting the target path point according to the speed information of the target object, the moving angle information of the target object, the information related to the lanes and the path points distributed on each lane.
7. The method of claim 6, wherein the inputting the attribute information and the first reference speed of the target object at the target waypoint into the second target network, outputting a target speed, comprises:
inputting the attribute information and a first reference speed of the target object at the target waypoint into the second target network so as to perform the following operations:
predicting a predicted speed of the target object at the target path point according to the speed information of the target object, the moving angle information of the target object, the information related to the lanes and the path point distributed on each lane;
and outputting the target speed according to the predicted speed and the first reference speed.
8. The method of claim 6, wherein,
in the case that the target scene is determined to be a road scene, the information associated with the lane includes a lane-change availability condition characterizing the lane in which the target object is located;
in the case that the target scene is determined to be an intersection scene, the information associated with the lane includes information characterizing a distance between the target object and an intersection at the t-th time; and
in the case that the target scene is determined to be a road merging and separating scene, the information associated with the lane includes information characterizing a distance between the target object and a merge-in point or an exit point at the t-th time.
9. A movement control apparatus of a target object, comprising:
the first determining module is used for determining a target intelligent agent corresponding to a target scene according to the target scene of the target object at the t moment, wherein the target intelligent agent is obtained by training based on sample attribute information of a sample object in the target scene, and the target intelligent agent comprises a first target network and a second target network;
the second determining module is used for determining a target reference line according to a first moving target of the target object, wherein the target reference line is provided with a plurality of path points;
the first processing module is used for inputting attribute information of a plurality of path points and the target object in the target scene into the first target network and outputting a target path point where the target object is located at a t+1th moment next to the t moment;
a second processing module, configured to input the attribute information and a first reference speed of the target object at the target path point into the second target network, and output a target speed, where the target speed is a speed of the target object at the target path point; and
And the first control module is used for controlling the movement of the target object based on the target path point and the target speed.
10. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1 to 8.
11. A computer readable storage medium having stored thereon executable instructions which when executed by a processor cause the processor to implement the method of any of claims 1 to 8.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-8.
13. An autonomous vehicle comprising the electronic device of claim 10.