CN116403190A - Track determination method and device for target object, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116403190A
CN116403190A (application number CN202310363276.5A)
Authority
CN
China
Prior art keywords
track
observation
network
target object
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310363276.5A
Other languages
Chinese (zh)
Inventor
张笑铭
潘铭星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Horizon Journey Technology Co ltd
Original Assignee
Chengdu Horizon Journey Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Horizon Journey Technology Co ltd
Priority to CN202310363276.5A
Publication of CN116403190A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 - Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 - Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training
    • G06V40/25 - Recognition of walking or running movements, e.g. gait recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Psychiatry (AREA)
  • Databases & Information Systems (AREA)
  • Social Psychology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiments of the present disclosure disclose a track determination method and device for a target object, an electronic device, and a storage medium. The method includes: obtaining an observed bounding box information sequence of a target object, where the observed bounding box information sequence includes the bounding box information of the target object in the observed images respectively corresponding to each of at least one time frame; determining, based on the observed bounding box information sequence, the observed track feature corresponding to the target object by using a track encoding network in a pre-trained track prediction model; determining, based on the observed track feature, the variational coding feature corresponding to the target object by using a conditional variational autoencoder in the track prediction model; and determining, based on the variational coding feature, the motion track of at least one future time frame corresponding to the target object by using a track prediction network in the track prediction model. Embodiments of the present disclosure can effectively improve track prediction efficiency, reduce track prediction latency, and improve the real-time performance of track prediction.

Description

Track determination method and device for target object, electronic equipment and storage medium
Technical Field
The present disclosure relates to computer vision technology, and in particular, to a method, an apparatus, an electronic device, and a storage medium for determining a trajectory of a target object.
Background
In scenarios such as autonomous driving and driver assistance, predicting the future tracks of dynamic objects such as pedestrians, cyclists, and trams is an important task that provides obstacle reference information for vehicle planning and control. In the related art, object track prediction is generally performed with a track prediction model based on a recurrent neural network operating on 2D image-domain information, but recurrent neural networks incur high latency, so the track prediction has poor real-time performance.
Disclosure of Invention
In order to solve technical problems such as the poor real-time performance of track prediction, embodiments of the present disclosure provide a track determination method and device for a target object, an electronic device, and a storage medium, which realize future track prediction based on a lightweight track prediction model, thereby effectively improving the real-time performance of track prediction.
In a first aspect of the present disclosure, there is provided a track determination method for a target object, including: obtaining an observed bounding box information sequence of the target object, where the observed bounding box information sequence includes the bounding box information of the target object in the observed images respectively corresponding to each of at least one time frame; determining, based on the observed bounding box information sequence, the observed track feature corresponding to the target object by using a track encoding network in a pre-trained track prediction model; determining, based on the observed track feature, the variational coding feature corresponding to the target object by using a conditional variational autoencoder in the track prediction model; and determining, based on the variational coding feature, the motion track of at least one future time frame corresponding to the target object by using a track prediction network in the track prediction model.
In a second aspect of the present disclosure, there is provided a track determination device for a target object, including: a first acquisition module, configured to obtain an observed bounding box information sequence of the target object, where the observed bounding box information sequence includes the bounding box information of the target object in the observed images respectively corresponding to each of at least one time frame; a first processing module, configured to determine, based on the observed bounding box information sequence, the observed track feature corresponding to the target object by using a track encoding network in a pre-trained track prediction model; a second processing module, configured to determine, based on the observed track feature, the variational coding feature corresponding to the target object by using a conditional variational autoencoder in the track prediction model; and a third processing module, configured to determine, based on the variational coding feature, the motion track of at least one future time frame corresponding to the target object by using a track prediction network in the track prediction model.
In a third aspect of the present disclosure, a computer-readable storage medium is provided, where a computer program is stored, where the computer program is configured to perform the method for determining a trajectory of a target object according to any one of the above embodiments of the present disclosure.
In a fourth aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing the processor-executable instructions; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the track determining method of the target object according to any one of the foregoing embodiments of the disclosure.
In a fifth aspect of the present disclosure, there is provided a computer program product, which when executed by a processor, performs the method for determining a trajectory of a target object provided by the embodiment of the first aspect (or the embodiment of the second aspect) of the present disclosure.
According to the track determination method and device for a target object, the electronic device, and the storage medium provided by the embodiments of the present disclosure, combining a feature encoding network and a track prediction network around a conditional variational autoencoder enables track prediction with a lightweight track prediction model, which effectively improves track prediction efficiency, reduces track prediction latency, and improves the real-time performance of track prediction. Accurate and effective obstacle reference information can thus be provided to the vehicle in time, so that the vehicle can learn of and avoid obstacles as early as possible, improving driving safety.
The technical scheme of the present disclosure is described in further detail below through the accompanying drawings and examples.
Drawings
FIG. 1 is an exemplary application scenario of a target object trajectory determination method provided by the present disclosure;
FIG. 2 is a flow chart of a method for determining a trajectory of a target object provided by an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a network architecture of a trajectory prediction model provided by an exemplary embodiment of the present disclosure;
FIG. 4 is a flow chart of a method of trajectory determination of a target object provided by another exemplary embodiment of the present disclosure;
FIG. 5 is a schematic diagram of bounding box information provided by an exemplary embodiment of the present disclosure;
FIG. 6 is a flow chart of step 201 provided by another exemplary embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a network architecture of a track-coded network provided in an exemplary embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a track encoding network provided by another exemplary embodiment of the present disclosure;
FIG. 9 is a flow chart of a method of trajectory determination of a target object provided by another exemplary embodiment of the present disclosure;
FIG. 10 is a schematic diagram of an observed bounding box information sequence provided by an exemplary embodiment of the present disclosure;
FIG. 11 is a schematic diagram of a network architecture of a conditional variable self-encoder provided by an exemplary embodiment of the present disclosure;
FIG. 12 is a network architecture diagram of a trajectory prediction network provided by an exemplary embodiment of the present disclosure;
FIG. 13 is a network architecture diagram of a second convolutional network provided in an exemplary embodiment of the present disclosure;
FIG. 14 is a schematic diagram of a training network architecture of a trajectory prediction model provided by an exemplary embodiment of the present disclosure;
fig. 15 is a schematic structural view of a trajectory determining device of a target object provided in an exemplary embodiment of the present disclosure;
fig. 16 is a schematic structural view of a trajectory determining device of a target object provided in another exemplary embodiment of the present disclosure;
fig. 17 is a block diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
For the purpose of illustrating the present disclosure, exemplary embodiments of the present disclosure will be described in detail below with reference to the drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present disclosure, and it should be understood that the present disclosure is not limited by the exemplary embodiments.
It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
Summary of the disclosure
In the process of realizing the present disclosure, the inventors found that in scenarios such as autonomous driving and driver assistance, predicting the future tracks of dynamic objects such as pedestrians, cyclists, and trams is an important task that can provide obstacle reference information for vehicle planning and control. In the related art, object track prediction is generally performed with a track prediction model based on a recurrent neural network operating on 2D image-domain information, but recurrent neural networks incur high latency, so the track prediction has poor real-time performance.
Exemplary overview
Fig. 1 is an exemplary application scenario of a track determining method of a target object provided in the present disclosure.
In scenarios such as autonomous driving and driver assistance, the track determination method for a target object (executed by a track determination device for a target object) can determine an observed bounding box information sequence of the target object based on observed images of at least one time frame containing the target object, where the observed bounding box information sequence includes the bounding box information of the target object in the observed image corresponding to each time frame. Based on the observed bounding box information sequence, the observed track feature corresponding to the target object is determined by a track encoding network in a pre-trained track prediction model; based on the observed track feature, the variational coding feature corresponding to the target object is determined by a conditional variational autoencoder in the track prediction model; and based on the variational coding feature, the motion track of at least one future time frame corresponding to the target object is determined by a track prediction network in the track prediction model. This realizes track prediction with a lightweight track prediction model built around a conditional variational autoencoder, which effectively improves track prediction efficiency, reduces track prediction latency, and improves the real-time performance of track prediction, so that accurate and effective obstacle reference information can be provided to the vehicle in time and the vehicle can learn of and avoid obstacles as early as possible, improving driving safety.
Exemplary method
Fig. 2 is a flowchart illustrating a track determination method for a target object according to an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device, for example a vehicle-mounted computing platform. As shown in fig. 2, the method includes the following steps:
step 201, an observation boundary box information sequence of a target object is obtained.
The observation boundary box information sequence comprises boundary box information in an observation image of the target object corresponding to each time frame in at least one time frame. The observation image refers to an image which is acquired by a camera and can observe a target object. The number of target objects may be one or more.
In some alternative embodiments, the at least one time frame may include a current time frame and at least one historical time frame.
In some alternative embodiments, the specific representation of the bounding box information may be set according to actual requirements, for example the four corner coordinates of the bounding box, or the coordinates of two diagonally opposite corners, or any other representation that can express the bounding box.
In some alternative embodiments, the bounding box information may be obtained by object detection, for example by performing object detection on the observed image with a pre-trained target detection model to obtain the bounding box information of the target object in the observed image. The target detection model may be any practicable model, such as one based on RCNN (Region Convolutional Neural Network) and its variants or one based on YOLO, without particular limitation.
Step 202, based on the observation boundary box information sequence, determining the observation track characteristics corresponding to the target object by utilizing a track coding network in a track prediction model obtained by pre-training.
The track encoding network can be a convolutional-neural-network-based encoding network, and its specific network structure can be set according to actual requirements. The observed bounding box information sequence is encoded by the track encoding network to extract the observed track feature of the target object.
Step 203, determining the variational coding feature corresponding to the target object by using the conditional variational autoencoder in the track prediction model based on the observed track feature.
The conditional variational autoencoder can be a convolutional-neural-network-based autoencoder, and its specific network structure can be set according to actual requirements. The conditional variational autoencoder extracts the distribution feature of the observed track feature and combines it with Gaussian-distribution sampling to generate the variational coding feature corresponding to the target object.
Step 204, determining the motion track of at least one future time frame corresponding to the target object by using the track prediction network in the track prediction model based on the variational coding feature.
The number of future time frames may be set according to actual requirements. The track prediction network can be a convolutional-neural-network-based decoding network, and its specific network structure can be set according to actual requirements. The variational coding feature is decoded by the track prediction network to predict the future motion track corresponding to the target object.
In some alternative embodiments, fig. 3 is a schematic diagram of the network structure of a track prediction model provided in an exemplary embodiment of the present disclosure. The track encoding network encodes the observed bounding box information sequence to obtain the observed track feature; the conditional variational autoencoder encodes the observed track feature to obtain the variational coding feature; and the track prediction network processes the variational coding feature to obtain the bounding box information corresponding to each future time frame of the target object, which serves as the motion track of the target object over the future time frames.
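To make this three-stage pipeline concrete, the following is a minimal PyTorch sketch of how the sub-networks compose; the class names, tensor layouts, and frame counts are illustrative assumptions, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

class TrajectoryPredictionModel(nn.Module):
    """Hypothetical three-stage model: track encoder -> CVAE -> track predictor."""
    def __init__(self, encoder: nn.Module, cvae: nn.Module, predictor: nn.Module):
        super().__init__()
        self.encoder = encoder      # track encoding network
        self.cvae = cvae            # conditional variational autoencoder
        self.predictor = predictor  # track prediction network

    def forward(self, obs_boxes: torch.Tensor) -> torch.Tensor:
        # obs_boxes: (B, n, 4) observed bounding boxes over n time frames
        track_feat = self.encoder(obs_boxes)  # observed track feature
        z = self.cvae(track_feat)             # variational coding feature
        return self.predictor(z)              # boxes for m future frames
```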
According to the track determination method for a target object provided by this embodiment, combining the feature encoding network and the track prediction network around a conditional variational autoencoder enables track prediction with a lightweight track prediction model, which effectively improves track prediction efficiency, reduces track prediction latency, and improves the real-time performance of track prediction, so that accurate and effective obstacle reference information can be provided to the vehicle in time, the vehicle can learn of and avoid obstacles as early as possible, and driving safety is improved.
Fig. 4 is a flowchart illustrating a track determining method of a target object according to another exemplary embodiment of the present disclosure.
In some alternative embodiments, the bounding box information includes coordinates of at least one set of diagonal corner points of the bounding box.
Step 201 of obtaining an observation boundary box information sequence of a target object includes:
in step 2011, the observation images corresponding to the time frames are acquired.
The observation image may be an image acquired by a camera on the vehicle.
In some optional embodiments, if the frame rate of the time frame is smaller than the frame rate of the image acquired by the camera, the image acquired by the camera may be sampled according to the frame interval of the time frame, so as to obtain the observation image corresponding to each time frame.
In step 2012, each observation image is processed by using the target detection model obtained by pre-training, so as to obtain detection frame information of the target object in each observation image.
The target detection model may be any practicable model, such as one based on RCNN (Region Convolutional Neural Network) and its variants or one based on YOLO, without particular limitation. The target detection model regresses the detection boxes of the objects observed in each observed image, together with the category corresponding to each detection box and the probability that the object belongs to that category; the detection box information of the target object can then be determined by combining the category and probability of each detection box. The detection box information may be expressed in any practicable form, for example the center-point coordinates and size of the detection box, the center-point coordinates and at least one corner-point coordinate, or the four corner-point coordinates of the detection box, and may be set according to actual requirements.
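As a rough illustration of turning per-frame detections into the corner-pair bounding box information described above, consider the sketch below; the detection dictionary layout, names, and score threshold are assumptions rather than any specific detector's API.

```python
import numpy as np

def bbox_info_for_frame(detections, target_class, score_thresh=0.5):
    """Pick the target object's detection in one observed image and keep one
    diagonal corner pair (x1, y1, x2, y2) as its bounding box information.
    The detection dict layout here is an assumption, not a real detector's API."""
    candidates = [d for d in detections
                  if d["label"] == target_class and d["score"] >= score_thresh]
    if not candidates:
        return None  # target not observed in this frame
    best = max(candidates, key=lambda d: d["score"])
    x1, y1, x2, y2 = best["box"]  # two corners on the same diagonal
    return np.array([x1, y1, x2, y2], dtype=np.float32)
```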
Step 2013, for any time frame, determining coordinates of at least one group of diagonal corner points based on the detection frame information corresponding to the time frame, and using the coordinates as boundary frame information corresponding to the time frame.
Wherein the coordinates of each set of diagonal corner points include coordinates of two corner points on the same diagonal among the corner points of the detection frame.
In some alternative embodiments, to reduce the amount of computation, the bounding box information may include a set of corner point coordinates.
Illustratively, fig. 5 is a schematic diagram of bounding box information provided by an exemplary embodiment of the present disclosure, where $(x_{b_i}^{1}, y_{b_i}^{1})$ represents the first corner coordinates of a set of corner coordinates and $(x_{b_i}^{2}, y_{b_i}^{2})$ represents the second corner coordinates. X represents the horizontal axis of the image coordinate system, Y represents the vertical axis, and $b_i$ denotes the i-th time frame.
Step 2014, determining an observation boundary box information sequence based on the boundary box information corresponding to each time frame.
For example, the observed bounding box information sequence may be expressed as $\{(x_{b_i}^{1}, y_{b_i}^{1}, x_{b_i}^{2}, y_{b_i}^{2})\}_{i=1}^{n}$, where n represents the total number of observed time frames.
In this embodiment, performing target detection on the observed images with the target detection model to obtain the bounding box information of the target object in each observed image makes it possible to effectively determine the observed bounding box information sequence of the target object, providing an accurate and effective observed track for predicting the future motion track of the target object.
Fig. 6 is a flow chart of step 201 provided by another exemplary embodiment of the present disclosure.
In some alternative embodiments, the obtaining of the observed bounding box information sequence of the target object of step 201 includes:
step 201a, obtaining an observation image corresponding to a current time frame.
Step 201b, processing the observation image corresponding to the current time frame by using the target detection model obtained by pre-training, and obtaining detection frame information of the target object in the observation image of the current time frame.
Step 201c, determining the bounding box information corresponding to the current time frame based on the detection box information.
Step 201d, determining an observed bounding box information sequence of the target object based on the bounding box information corresponding to the current time frame and the bounding box information corresponding to the previous time frame.
The boundary box information corresponding to the previous time frame comprises the boundary box information corresponding to each history time frame obtained before.
Specifically, after the target object is detected in each time frame, the time frame and the bounding box information of the target object are stored correspondingly, realizing real-time maintenance of the target object's bounding box information; in subsequent processing, the stored bounding box information of each time frame can be read directly and used to determine the observed bounding box information sequence of the target object. Maintaining the bounding box information of the target object in the observed images in real time can effectively improve processing efficiency.
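A minimal sketch of such real-time per-object maintenance, assuming a fixed observation window of n frames and NumPy box arrays; the class and method names are illustrative.

```python
from collections import defaultdict, deque
import numpy as np

class BBoxHistory:
    """Keeps the last n bounding boxes per tracked object id."""
    def __init__(self, n_frames: int):
        self.n = n_frames
        self.history = defaultdict(lambda: deque(maxlen=n_frames))

    def update(self, object_id: int, bbox: np.ndarray) -> None:
        # bbox: (4,) = (x1, y1, x2, y2) for the current time frame
        self.history[object_id].append(bbox)

    def observed_sequence(self, object_id: int):
        """Return the (n, 4) observed bounding box information sequence,
        or None until n frames have been accumulated."""
        h = self.history[object_id]
        if len(h) < self.n:
            return None
        return np.stack(h)  # (n, 4)
```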
The specific operations of steps 201a to 201d can be referred to the aforementioned steps 2011 to 2014, and will not be described herein.
In some optional embodiments, determining, based on the observation bounding box information sequence, the observation trajectory feature corresponding to the target object using the trajectory encoding network in the trajectory prediction model obtained by pre-training in step 202 includes:
and step 2021, processing the observed bounding box information sequence by using a second fully-connected network in the track coding network to obtain a first processing result.
The second fully-connected network may include at least one fully-connected layer, and the specific network structure may be set according to actual requirements.
In some alternative embodiments, the second fully-connected network may include at least one fully-connected layer and an activation function set.
Step 2022, processing the first processing result by using a third convolution network in the track coding network to obtain a second processing result.
The third convolution network may include at least one convolution layer, and may further include an activation function, which may be specifically set according to actual requirements.
Step 2023, determining the observed trajectory feature based on the first processing result and the second processing result.
In some alternative embodiments, the first processing result and the second processing result may be fused and the fused result used directly as the observed track feature; alternatively, further processing such as pooling may be performed on the fused result, with the further-processed result used as the observed track feature.
The embodiment realizes the track coding network based on the full connection layer, the convolution layer and the activation function, is beneficial to realizing a lightweight track prediction model and improves track prediction efficiency.
In some alternative embodiments, determining the observed trajectory feature based on the first processing result and the second processing result of step 2023 includes:
fusing the first processing result and the second processing result by using a second feature fusion network in the track coding network to obtain a second fusion feature; and carrying out pooling treatment on the second fusion characteristic by utilizing a pooling network in the track coding network to obtain a pooling result which is used as an observation track characteristic.
The second feature fusion network may include a corresponding-element addition (add) layer that adds the corresponding elements of the first processing result and the second processing result to obtain the second fusion feature. The pooling network transforms the dimension of the second fusion feature into a preset dimension. For example, a second fusion feature of dimension B×n×L1 is transformed by the pooling network into a feature of dimension B×L2, where B represents the number of target objects, n represents the number of observed time frames, and L1 represents the per-frame fusion feature dimension of each target object; L1 and L2 may be set according to actual requirements, for example L1 = 512 and L2 = 128, without particular limitation.
In some alternative embodiments, the pooling network may be an average pooling layer.
In this embodiment, fusing the first processing result and the second processing result realizes a residual connection from the first processing result to the second processing result, which helps prevent vanishing gradients in the model.
In an alternative example, fig. 7 is a schematic diagram of a network structure of a track coding network according to an exemplary embodiment of the present disclosure. The track coding network includes a second fully-connected network, a third convolutional network, a second feature fusion network, and a pooling network. The second full-connection network is used for processing the observed bounding box information sequence to obtain a first processing result, the third convolution network is used for processing the first processing result to obtain a second processing result, the output of the second full-connection network is used as the input of a second feature fusion network through residual connection, the second feature fusion network is used for fusing the first processing result and the second processing result to obtain a second fusion feature, and the pooling network is used for pooling the second fusion feature to obtain a pooling result as an observed track feature.
In an alternative example, fig. 8 is a schematic structural diagram of a track encoding network provided in another exemplary embodiment of the present disclosure. In this example, the second fully connected network includes one fully-connected-layer + activation-function set; in practical applications it may include multiple such sets (fully connected layer + activation function, ...), without particular limitation. The third convolutional network includes several convolutional-layer + activation-function sets followed by a final convolutional layer; the number of sets may be set according to actual requirements, for example 1, 2, or 3 sets, without particular limitation. The second feature fusion network is a corresponding-element addition (add, i.e. ⊕) layer. The activation function may be set according to actual requirements, for example a ReLU activation function.
Fig. 9 is a flowchart illustrating a track determining method of a target object according to another exemplary embodiment of the present disclosure.
In some alternative embodiments, the methods of the present disclosure further comprise:
and 310, normalizing the observation boundary box information sequence to obtain a normalized sequence.
Wherein the normalization process is used to convert coordinates in the observed bounding box information sequence to a normalized range, i.e., [ -1,1] range.
Illustratively, fig. 10 is a schematic diagram of an observed bounding box information sequence provided by an exemplary embodiment of the present disclosure, where XOY denotes the image coordinate system, the observed track (the observed bounding box information sequence) includes n frames, and the future track to be predicted includes m frames. H and W respectively represent the height and width of the image region in which the observed track and the predicted track of the target object lie in the image coordinate system. The observed bounding box information sequence can be normalized according to H and W, which may be expressed as:

$$x' = \frac{2x}{W} - 1, \qquad y' = \frac{2y}{H} - 1$$

where (x, y) represents the corner coordinates to be normalized and (x', y') represents the normalized corner coordinates. The normalized sequence is determined from the normalized coordinates corresponding to the coordinates in the observed bounding box information sequence.

When the observed bounding box information sequence is normalized, the motion track of future time frames predicted by the track prediction model is a track within the normalized range, so the motion track must be inverse-normalized, with the inverse normalization result taken as the final motion track of the future time frames. The inverse normalization may be expressed as:

$$x'' = \frac{(x' + 1)\,W}{2}, \qquad y'' = \frac{(y' + 1)\,H}{2}$$

where (x', y') denotes the predicted corner coordinates within the normalized range and (x'', y'') denotes the inverse-normalized corner coordinates.
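A small sketch of this normalization and its inverse, assuming the [-1, 1] mapping reconstructed above and the (x1, y1, x2, y2) corner layout.

```python
import numpy as np

def normalize_boxes(boxes: np.ndarray, W: float, H: float) -> np.ndarray:
    """Map corner coordinates (x1, y1, x2, y2) into the [-1, 1] range."""
    out = boxes.astype(np.float32).copy()
    out[..., 0::2] = 2.0 * out[..., 0::2] / W - 1.0  # x coordinates
    out[..., 1::2] = 2.0 * out[..., 1::2] / H - 1.0  # y coordinates
    return out

def denormalize_boxes(boxes: np.ndarray, W: float, H: float) -> np.ndarray:
    """Inverse of normalize_boxes: back to image coordinates."""
    out = boxes.astype(np.float32).copy()
    out[..., 0::2] = (out[..., 0::2] + 1.0) * W / 2.0
    out[..., 1::2] = (out[..., 1::2] + 1.0) * H / 2.0
    return out
```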
The processing of the observed bounding box information sequence by using the second fully-connected network in the track coding network in step 2021, to obtain a first processing result includes:
in step 20211, the normalized sequence is processed using the second fully connected network, to obtain a first processing result.
In this embodiment, normalizing the observed bounding box information sequence converts its coordinates into a uniform, smaller range, reducing the adverse effect on model training and inference caused by large differences between the coordinate ranges of different time frames, which effectively improves the accuracy of the prediction result.
In some alternative embodiments, determining the variational coding feature corresponding to the target object using the conditional variational autoencoder in the track prediction model based on the observed track feature of step 203 includes:
Step 2031, generating, based on the observed track feature, the observed distribution feature corresponding to the observed track feature by using the encoding network in the conditional variational autoencoder.
The encoding network can be a convolutional-neural-network-based encoder, and the observed distribution feature $Z_p$ may include a mean feature $\mu_p$ and a variance feature $\sigma_p$, i.e. $Z_p \sim N(\mu_p, \sigma_p)$.
Step 2032, generating a sampling tensor based on the preset gaussian distribution and the preset sampling rule.
The preset Gaussian distribution is a Gaussian distribution with variance 1 and mean 0, which may be expressed as $E \sim N(0, 1)$; the preset sampling rule may be set according to actual requirements, for example random sampling. The size of the sampling tensor may also be set according to actual requirements, for example B×L3, where B represents the number of target objects and L3 may be set according to actual requirements, for example 64 or another value, without particular limitation. The B×L3 sampling tensor E is obtained by sampling, for example randomly, from the Gaussian distribution $N(0, 1)$.
In step 2033, observed sample characteristics are generated based on the observed distribution characteristics and the sample tensor.
The observed sampling feature Z can be expressed as:

$$Z = \mu_p + \sigma_p \odot E$$

where ⊙ denotes corresponding-element multiplication and $\sigma_p$ has the same dimensions as E.
Step 2034, determining variant code features based on the observed sample features and the observed trace features.
The variational coding feature may be obtained by fusing the observed sampling feature and the observed track feature, for example by concatenation (Concat). Convolution may further be applied to the fused feature, with the convolution result used as the variational coding feature so that its dimension matches the number of future time frames; for example, the variational coding feature has dimension B×L4×m, where B represents the number of target objects, m represents the number of future time frames, and L4 is determined by the dimensions of the observed sampling feature and the observed track feature: with an observed sampling feature of dimension B×L3 and an observed track feature of dimension B×L2, L4 = L2 + L3.
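A minimal sketch of this sampling-and-fusion step: the reparameterization trick, concatenation, and a 1×1 convolution that expands channels over the m future frames. The L2 = 128, L3 = 64 sizes follow the example dimensions, and predicting log-sigma for numerical stability is an added assumption.

```python
import torch
import torch.nn as nn

class CVAEHead(nn.Module):
    def __init__(self, track_dim=128, latent_dim=64, m_future=8):
        super().__init__()
        self.enc = nn.Linear(track_dim, 2 * latent_dim)  # encoding network -> (mu, log sigma)
        self.fuse_dim = track_dim + latent_dim           # L4 = L2 + L3 = 192
        # first convolutional network: expand channels to cover m future frames
        self.expand = nn.Conv1d(self.fuse_dim, self.fuse_dim * m_future, kernel_size=1)
        self.m = m_future

    def forward(self, track_feat: torch.Tensor) -> torch.Tensor:
        # track_feat: (B, 128) observed track feature
        mu, log_sigma = self.enc(track_feat).chunk(2, dim=-1)  # each (B, 64)
        eps = torch.randn_like(mu)                  # sampling tensor E ~ N(0, 1)
        z = mu + log_sigma.exp() * eps              # observed sampling feature Z
        fused = torch.cat([z, track_feat], dim=-1)  # first fusion feature, (B, 192)
        out = self.expand(fused.unsqueeze(-1))      # (B, 192*m, 1) channel expansion
        return out.view(-1, self.fuse_dim, self.m)  # variational coding feature, (B, 192, m)
```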
This embodiment realizes the conditional variational autoencoder based on a convolutional neural network and Gaussian-distribution sampling, completing the determination of the target object's variational coding feature and further improving the processing efficiency of the model, thereby improving track prediction efficiency.
In some alternative embodiments, the at least one future time frame comprises a first number of future time frames.
Determining the variational coding feature based on the observed sampling feature and the observed track feature in step 2034 includes:
fusing the observed sampling feature and the observed track feature by using the first feature fusion network in the conditional variational autoencoder to obtain a first fusion feature; and converting the first fusion feature into the variational coding feature by using the first convolutional network in the conditional variational autoencoder.
The first feature fusion network may be a concatenation (Concat) layer that splices the observed sampling feature and the observed track feature to realize fusion, and the first convolutional network may include at least one convolutional layer and a corresponding activation function, used to expand the number of channels of the first fusion feature to match the number of future time frames to be predicted.
In this embodiment, feature fusion and the convolutional network realize the fusion of the observed sampling feature with the observed track feature and the expansion of the channel count, providing an effective variational coding feature for subsequent track prediction.
In an alternative example, fig. 11 is a schematic diagram of the network structure of a conditional variational autoencoder provided in an exemplary embodiment of the present disclosure. The conditional variational autoencoder includes an encoding network, an observed-sampling-feature generation unit, a first feature fusion network, and a first convolutional network. The encoding network encodes the observed track feature to generate the observed distribution feature; the observed-sampling-feature generation unit generates the sampling tensor based on the preset Gaussian distribution and the preset sampling rule, and generates the observed sampling feature based on the observed distribution feature and the sampling tensor; the first feature fusion network fuses the observed track feature and the observed sampling feature to obtain the first fusion feature; and the first convolutional network processes the first fusion feature to obtain the variational coding feature.
In some alternative embodiments, determining the motion trajectory of the at least one future time frame corresponding to the target object based on the variant-coding feature of step 204 using a trajectory prediction network in a trajectory prediction model, includes:
and 2041, processing the variation coding characteristic by using a second convolution network in the track prediction network to obtain a first intermediate result.
The second convolution network may include at least one convolution layer, and may further include an activation function and a residual connection, where the specific network structure may be set according to actual requirements.
Step 2042, processing the first intermediate result by using the first fully connected network in the track prediction network to obtain a second intermediate result.
The first full-connection network may include at least one full-connection layer, and may further include an activation function, where a specific network structure may be set according to actual requirements.
Step 2043, processing the second intermediate result by using a preset activation function in the track prediction network to obtain a third intermediate result.
The preset activation function is used for converting the second intermediate result into a normalization range. For example, the preset activation function may be a hyperbolic tangent function (tanh).
In some alternative embodiments, if the observed bounding box information sequence is not normalized, the trajectory prediction network may not include a preset activation function, that is, the motion trajectory of the future time frame corresponding to the target object may be determined directly based on the second intermediate result.
Step 2044, determining a motion trail corresponding to the target object based on the third intermediate result.
And obtaining a motion trail corresponding to the target object through inverse normalization of the third intermediate result.
The embodiment realizes the track prediction network based on the convolution layer, the full connection layer and the activation function, is beneficial to the realization of a lightweight track prediction model, and can further improve the track prediction efficiency.
In some alternative embodiments, the third intermediate result is a result within the normalized range.
Based on the third intermediate result, determining a motion trail corresponding to the target object in step 2044 includes:
Performing inverse normalization on the third intermediate result to obtain an inverse normalization result; determining, based on the inverse normalization result, the target bounding box information corresponding to each future time frame; and determining the motion track corresponding to the target object based on each piece of target bounding box information.
The specific operation of inverse normalization may be referred to the foregoing, and will not be described herein.
In the embodiment, the third intermediate result in the normalization range is converted into the coordinates of the image coordinate system through inverse normalization, and the accuracy of the track prediction result is further improved through normalization and inverse normalization.
In an alternative example, fig. 12 is a network architecture diagram of a track prediction network provided by an exemplary embodiment of the present disclosure. The track prediction network includes a second convolutional network, a first fully connected network, a preset activation function, and inverse normalization. The second convolutional network processes the variational coding feature to obtain the first intermediate result; the first fully connected network processes the first intermediate result to obtain the second intermediate result; the preset activation function processes the second intermediate result to obtain the third intermediate result within the normalized range; and the third intermediate result is inverse-normalized to obtain the inverse normalization result, which is used to determine the future motion track of the target object.
In an alternative example, fig. 13 is a network architecture diagram of a second convolutional network provided by an exemplary embodiment of the present disclosure. The second convolutional network includes at least one convolutional-layer + activation-function set, a final convolutional layer, a residual connection, and a corresponding-element addition (add, i.e. ⊕) layer; the second convolutional network fuses the variational coding feature with the output of the last convolutional layer to obtain the first intermediate result. The activation function may be a ReLU activation function.
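A minimal sketch of such a residual convolutional block followed by the fully connected layer and tanh described above; the kernel sizes and the (B, m, 4) output layout are assumptions.

```python
import torch
import torch.nn as nn

class TrackPredictor(nn.Module):
    def __init__(self, in_ch=192, m_future=8, out_per_frame=4):
        super().__init__()
        self.block = nn.Sequential(  # second convolutional network
            nn.Conv1d(in_ch, in_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(in_ch, in_ch, kernel_size=3, padding=1),
        )
        self.fc = nn.Linear(in_ch, out_per_frame)  # first fully connected network
        self.m = m_future

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, 192, m) variational coding feature
        assert z.size(-1) == self.m
        h = z + self.block(z)           # residual add -> first intermediate result
        h = self.fc(h.transpose(1, 2))  # (B, m, 4) second intermediate result
        return torch.tanh(h)            # third intermediate result in [-1, 1]
```

The tanh output stays in the normalized range; applying a denormalization step such as the denormalize_boxes helper sketched earlier would recover image coordinates.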
In an alternative example, the observed bounding box information sequence is expressed as $\{(x_{b_i}^{1}, y_{b_i}^{1}, x_{b_i}^{2}, y_{b_i}^{2})\}_{i=1}^{n}$ with dimension B×n×4. The second fully connected network in the track encoding network has input dimension 4 and output dimension 512, so the first processing result has dimension B×n×512, and the second processing result obtained through the third convolutional network also has dimension B×n×512; through the second feature fusion network and the pooling network, an observed track feature of dimension B×128 is obtained. Processing the observed track feature with the encoding network in the conditional variational autoencoder gives an observed distribution feature of dimension B×128; of its 128 dimensions, dimensions 0-63 are taken as the mean feature and dimensions 64-127 as the variance feature, so the mean feature $\mu_p$ and the variance feature $\sigma_p$ each have dimension B×64. The correspondingly generated sampling tensor has dimension B×64, and the observed sampling feature Z is likewise B×64. Splicing the observed sampling feature and the observed track feature with the first feature fusion network yields a first fusion feature of dimension B×192; processing the first fusion feature with the first convolutional network yields a variational coding feature of dimension B×192×m; and processing the variational coding feature with the track prediction network yields a motion track of dimension B×4×m, where m is the number of predicted future time frames.
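A quick shape check tying the sketches above together under the same assumed dimensions (these classes are the hypothetical sketches from earlier sections, not the patent's modules):

```python
import torch

B, n, m = 2, 8, 8
enc = TrackEncoder()             # from the track encoding sketch
head = CVAEHead(m_future=m)      # from the CVAE sketch
dec = TrackPredictor(m_future=m) # from the track predictor sketch

boxes = torch.randn(B, n, 4)     # normalized observed boxes
feat = enc(boxes)                # (B, 128) observed track feature
z = head(feat)                   # (B, 192, m) variational coding feature
pred = dec(z)                    # (B, m, 4) normalized future boxes
print(feat.shape, z.shape, pred.shape)
```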
In some alternative embodiments, fig. 14 is a schematic diagram of the training network structure of a track prediction model provided by an exemplary embodiment of the present disclosure. The training network includes a first track encoding network for the training observed bounding box information sequence, a second track encoding network for the bounding box information sequence label, a conditional variational autoencoder, and a track prediction network. The first and second track encoding networks have the same network structure but different initialization parameters. The training observed bounding box information sequence and the bounding box information sequence label can be determined from pre-obtained bounding box information of T time frames of at least one training object; for example, the bounding box information corresponding to the first n of the T time frames can serve as the training observed bounding box information sequence, and the bounding box information corresponding to the last m time frames as the bounding box information sequence label. The training process of the track prediction model includes: processing the training observed bounding box information sequence through the first track encoding network to obtain the training observed track feature, and processing the bounding box information sequence label through the second track encoding network to obtain the ground-truth track feature. The conditional variational autoencoder may include two encoding networks, namely a first encoding network and a second encoding network, together with an observed-sampling-feature generation unit (whose input is the output of the first encoding network), a first feature fusion network, and a first convolutional network (not shown). The training observed track feature is processed by the first encoding network to obtain the training observed distribution feature $Z_P \sim N(\mu_P, \sigma_P)$ (meaning that $Z_P$ obeys a Gaussian distribution), where $\mu_P$ represents the training mean feature and $\sigma_P$ the training variance feature, together with the training variational coding feature; the ground-truth track feature is processed by the second encoding network to obtain the ground-truth distribution feature $Z_Q \sim N(\mu_Q, \sigma_Q)$, where $\mu_Q$ represents the ground-truth mean feature and $\sigma_Q$ the ground-truth variance feature; and the training variational coding feature is processed by the track prediction network to obtain the training predicted track. A first loss (Loss 1) is determined based on the training observed distribution feature and the ground-truth distribution feature, and a second loss (Loss 2) is determined based on the training predicted track and the bounding box information sequence label. Based on the first loss and the second loss, the network parameters are updated by gradient descent to obtain an updated track prediction network; in response to the updated track prediction network meeting the training end condition, training ends, and the updated first track encoding network, conditional variational autoencoder (which may exclude the second encoding network), and track prediction network are taken as the trained track prediction model. In response to the updated track prediction network not meeting the training end condition, iterative training of the updated track prediction network continues according to the above flow until the training end condition is met, yielding the trained track prediction model. The training end conditions may include model convergence and the iteration count reaching a preset threshold, and may be set according to actual requirements. The first loss and the second loss may be determined based on a first loss function and a second loss function, respectively, which may be set according to actual requirements.
In some alternative embodiments, the first loss function is a KL (Kullback-Leibler) divergence loss function and the second loss function is an average absolute error loss function, which may be set according to actual requirements.
By way of example, the average absolute error loss function may be expressed as follows:

$$\mathrm{Loss}_2 = \frac{1}{B}\sum_{i=1}^{B} \left| D_i - GT_i \right|$$

where D represents the training predicted track, GT represents the ground-truth track (i.e., the bounding box information sequence label), and B represents the batch size, i.e., the number of training objects.
Illustratively, the KL divergence loss function may be expressed as follows:

$$\mathrm{KL}(P \,\|\, Q) = \log\frac{\sigma_Q}{\sigma_P} + \frac{\sigma_P^2 + (\mu_P - \mu_Q)^2}{2\sigma_Q^2} - \frac{1}{2}$$

where KL(P‖Q) represents the KL divergence of the training observed distribution feature relative to the ground-truth distribution feature, and the meanings of the symbols are as given above.
According to the track determination method for a target object provided by the embodiments of the present disclosure, a lightweight track prediction model is constructed and trained based on convolution and full connection, and on the basis of guaranteed track prediction accuracy, the conditional variational autoencoder further improves track prediction efficiency.
The embodiments of the present disclosure may be implemented alone or in any combination without collision, and may specifically be set according to actual needs, which is not limited by the present disclosure.
Any of the track determination methods for a target object provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capability, including but not limited to terminal devices, servers, and the like. Alternatively, any of these methods may be executed by a processor, for example by the processor calling corresponding instructions stored in a memory to execute any of the track determination methods for a target object mentioned in the embodiments of the present disclosure. This will not be repeated below.
Exemplary apparatus
Fig. 15 is a schematic structural view of a track determination device of a target object provided in an exemplary embodiment of the present disclosure. The device of this embodiment may be used to implement the corresponding method embodiments of the present disclosure. The device shown in Fig. 15 includes: a first acquisition module 501, a first processing module 502, a second processing module 503 and a third processing module 504.
The first acquisition module 501 is configured to obtain an observation bounding box information sequence of the target object, where the sequence includes the bounding box information in the observation images respectively corresponding to each of at least one time frame of the target object.
The first processing module 502 is configured to determine, based on the observation bounding box information sequence, an observation track feature corresponding to the target object using a track coding network in a track prediction model obtained by training in advance.
The second processing module 503 is configured to determine, based on the observation track feature, the variational coding feature corresponding to the target object using the conditional variational autoencoder in the track prediction model.
The third processing module 504 is configured to determine, based on the variational coding feature, a motion track of at least one future time frame corresponding to the target object using the track prediction network in the track prediction model.
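Read together, the four modules form a straight-line inference pipeline. The following minimal PyTorch sketch illustrates only that flow; the class and all identifiers are hypothetical, and the three sub-networks are assumed to be already trained:

import torch.nn as nn

class TrackPredictionModel(nn.Module):
    # Hypothetical composition of the three sub-networks named above.
    def __init__(self, track_encoder, cvae, track_decoder):
        super().__init__()
        self.track_encoder = track_encoder  # track coding network
        self.cvae = cvae                    # conditional variational autoencoder
        self.track_decoder = track_decoder  # track prediction network

    def forward(self, bbox_seq):
        # bbox_seq: (B, T_obs, 4) observation bounding box information sequence.
        obs_feat = self.track_encoder(bbox_seq)  # observation track feature
        var_feat = self.cvae(obs_feat)           # variational coding feature
        return self.track_decoder(var_feat)      # motion track over future frames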
Fig. 16 is a schematic structural view of a trajectory determining device of a target object provided in another exemplary embodiment of the present disclosure.
In some alternative embodiments, the bounding box information includes coordinates of at least one set of diagonal corner points of the bounding box.
The first acquisition module 501 includes:
a first acquiring unit 5011 is configured to acquire observation images corresponding to the respective time frames.
The first processing unit 5012 is configured to process each observation image using a target detection model obtained by pre-training, to obtain detection frame information of the target object in each observation image.
The first determining unit 5013 is configured to determine, for any time frame, the coordinates of at least one set of diagonal corner points based on the detection frame information corresponding to that time frame, as the bounding box information corresponding to that time frame.
The second determining unit 5014 is configured to determine the observation bounding box information sequence based on the bounding box information corresponding to each time frame.
In some alternative embodiments, the at least one time frame includes a current time frame and at least one historical time frame. The first acquisition module 501 is specifically configured to:
obtain the observation image corresponding to the current time frame; process it with a target detection model obtained by pre-training to obtain the detection frame information of the target object in the observation image of the current time frame; determine the bounding box information corresponding to the current time frame based on that detection frame information; and determine the observation bounding box information sequence based on the bounding box information corresponding to the current time frame and the bounding box information corresponding to the previous time frames, where the latter includes the bounding box information previously obtained for each historical time frame.
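A minimal sketch of this incremental sequence construction is given below, assuming a fixed-length observation window; the deque buffer, the window length of 8 frames and the (x1, y1, x2, y2) corner layout are illustrative assumptions only:

from collections import deque

class BBoxSequenceBuffer:
    # Keeps bounding box information for the current and historical time
    # frames, so only the current frame needs to be run through detection.
    def __init__(self, window=8):
        self.frames = deque(maxlen=window)

    def update(self, bbox):
        # bbox: (x1, y1, x2, y2) diagonal corner coordinates taken from the
        # detection frame information of the current time frame.
        self.frames.append(bbox)

    def sequence(self):
        # Observation bounding box information sequence: previously obtained
        # historical frames followed by the current frame.
        return list(self.frames)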
In some alternative embodiments, the first processing module 502 includes:
The second processing unit 5021 is configured to process the observation bounding box information sequence using a second fully connected network in the track coding network to obtain a first processing result.
The third processing unit 5022 is configured to process the first processing result using a third convolution network in the track coding network to obtain a second processing result.
The fourth processing unit 5023 is configured to determine the observation track feature based on the first processing result and the second processing result.
In some alternative embodiments, the fourth processing unit 5023 is specifically configured to:
fuse the first processing result and the second processing result using a second feature fusion network in the track coding network to obtain a second fusion feature; and pool the second fusion feature using a pooling network in the track coding network to obtain a pooling result as the observation track feature.
In some alternative embodiments, the second processing unit 5021 is further configured to normalize the observation bounding box information sequence to obtain a normalized sequence.
The second processing unit 5021 is specifically configured to process the normalized sequence using the second fully connected network to obtain the first processing result.
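A minimal PyTorch sketch of this encoding path (normalize, fully connected, convolution, fuse, pool) follows. The layer widths, the concatenation-based fusion and the max pooling are assumptions; the description above fixes only the order of operations:

import torch
import torch.nn as nn

class TrackEncoder(nn.Module):
    def __init__(self, in_dim=4, hidden=64):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden)  # second fully connected network
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)  # third convolution network

    def forward(self, norm_seq):
        # norm_seq: (B, T, 4) normalized observation bounding box sequence.
        first = self.fc(norm_seq)                                  # first processing result
        second = self.conv(first.transpose(1, 2)).transpose(1, 2)  # second processing result
        fused = torch.cat([first, second], dim=-1)                 # second fusion feature (assumed concatenation)
        return fused.max(dim=1).values                             # pooling result: observation track feature

Max pooling over the time dimension yields a fixed-size feature regardless of the number of observed frames, which is one plausible purpose of the pooling network; this rationale is an inference, not a statement from the disclosure.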
In some alternative embodiments, the second processing module 503 includes:
The fifth processing unit 5031 is configured to generate, based on the observation track feature, an observation distribution feature corresponding to the observation track feature using the encoding network in the conditional variational autoencoder.
The first generating unit 5032 is configured to generate a sampling tensor based on a preset Gaussian distribution and a preset sampling rule.
The second generating unit 5033 is configured to generate an observation sampling feature based on the observation distribution feature and the sampling tensor.
The sixth processing unit 5034 is configured to determine the variational coding feature based on the observation sampling feature and the observation track feature.
In some alternative embodiments, the at least one future time frame comprises a first number of future time frames.
The sixth processing unit 5034 is specifically configured to:
fuse the observation sampling feature and the observation track feature using a first feature fusion network in the conditional variational autoencoder to obtain a first fusion feature; and convert the first fusion feature into the variational coding feature using a first convolution network in the conditional variational autoencoder.
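The sampling path in these units corresponds to the usual conditional-VAE reparameterization, and the sketch below assumes that form; the latent width, the linear heads and the fusion by concatenation are illustrative assumptions:

import torch
import torch.nn as nn

class CVAESampler(nn.Module):
    def __init__(self, feat_dim=128, latent_dim=32):
        super().__init__()
        # Encoding network producing the observation distribution feature.
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)
        # Stand-in for the first feature fusion network plus the first
        # convolution network.
        self.fuse = nn.Linear(feat_dim + latent_dim, feat_dim)

    def forward(self, obs_feat):
        mu, logvar = self.to_mu(obs_feat), self.to_logvar(obs_feat)
        # Sampling tensor drawn from a preset (standard) Gaussian distribution.
        eps = torch.randn_like(mu)
        # Observation sampling feature via reparameterization.
        z = mu + torch.exp(0.5 * logvar) * eps
        # Fuse with the observation track feature and convert the result into
        # the variational coding feature.
        return self.fuse(torch.cat([obs_feat, z], dim=-1))

Sampling a fresh eps on each call yields a different plausible future track, which is how a conditional variational autoencoder naturally supports multi-modal prediction.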
In some alternative embodiments, the third processing module 504 includes:
The seventh processing unit 5041 is configured to process the variational coding feature using a second convolution network in the track prediction network to obtain a first intermediate result.
The eighth processing unit 5042 is configured to process the first intermediate result using a first fully connected network in the track prediction network to obtain a second intermediate result.
The ninth processing unit 5043 is configured to process the second intermediate result using a preset activation function in the track prediction network to obtain a third intermediate result.
The tenth processing unit 5044 is configured to determine the motion track corresponding to the target object based on the third intermediate result.
In some alternative embodiments, the third intermediate result is a result within the normalized range.
The tenth processing unit 5044 is specifically configured to:
perform inverse normalization processing on the third intermediate result to obtain an inverse normalization result; determine the target bounding box information respectively corresponding to each future time frame based on the inverse normalization result; and determine the motion track corresponding to the target object based on the pieces of target bounding box information.
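A minimal sketch of this decoding path (convolution, fully connected layer, bounded activation, inverse normalization back to image coordinates) follows. The sigmoid activation, the min-max denormalization, the layer shapes and the 12-frame horizon are all assumptions; the description states only that the third intermediate result lies in a normalized range:

import torch
import torch.nn as nn

class TrackDecoder(nn.Module):
    def __init__(self, feat_dim=128, future_frames=12):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=3, padding=1)  # second convolution network
        self.fc = nn.Linear(feat_dim, future_frames * 4)       # first fully connected network
        self.future_frames = future_frames

    def forward(self, var_feat, lo, hi):
        # var_feat: (B, feat_dim); lo/hi: per-coordinate normalization bounds.
        x = self.conv(var_feat.unsqueeze(1)).squeeze(1)  # first intermediate result
        x = self.fc(x)                                   # second intermediate result
        x = torch.sigmoid(x)                             # third intermediate result, in [0, 1]
        boxes = x.view(-1, self.future_frames, 4)
        # Inverse normalization yields the target bounding box information for
        # each future time frame, i.e. the motion track.
        return lo + boxes * (hi - lo)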
For the beneficial technical effects of the exemplary embodiments of the present device, reference may be made to the corresponding beneficial technical effects in the foregoing exemplary method section, which are not repeated here.
Exemplary electronic device
Fig. 17 is a block diagram of an electronic device provided in an embodiment of the present disclosure, including at least one processor 11 and a memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
The memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 11 to implement the methods of the various embodiments of the present disclosure described above and/or other desired functions.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
The input device 13 may include, for example, a keyboard, a mouse, and the like.
The output device 14 may output various information to the outside and may include, for example, a display, a speaker, a printer, as well as a communication network and remote output devices connected thereto.
Of course, for simplicity, only those components of the electronic device 10 relevant to the present disclosure are shown in Fig. 17; components such as buses and input/output interfaces are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also provide a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform steps in the methods of the various embodiments of the present disclosure described in the "exemplary methods" section above.
The computer program product may carry program code for performing the operations of embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform the steps in the methods of the various embodiments of the present disclosure described in the "exemplary methods" section above.
A computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, but the advantages, benefits, effects, etc. mentioned in this disclosure are merely examples and are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
Various modifications and alterations to this disclosure may be made by those skilled in the art without departing from the spirit and scope of this application. Thus, the present disclosure is intended to include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (13)

1. A track determination method of a target object, comprising:
obtaining an observation bounding box information sequence of the target object, wherein the observation bounding box information sequence comprises bounding box information in observation images respectively corresponding to each of at least one time frame of the target object;
based on the observation bounding box information sequence, determining an observation track feature corresponding to the target object using a track coding network in a track prediction model obtained by pre-training;
based on the observation track feature, determining a variational coding feature corresponding to the target object using a conditional variational autoencoder in the track prediction model;
and determining a motion track of at least one future time frame corresponding to the target object using a track prediction network in the track prediction model based on the variational coding feature.
2. The method of claim 1, wherein the determining, based on the observation track feature, a variational coding feature corresponding to the target object using a conditional variational autoencoder in the track prediction model comprises:
based on the observation track feature, generating an observation distribution feature corresponding to the observation track feature using the encoding network in the conditional variational autoencoder;
generating a sampling tensor based on a preset Gaussian distribution and a preset sampling rule;
generating an observation sampling feature based on the observation distribution feature and the sampling tensor;
determining the variational coding feature based on the observation sampling feature and the observation track feature.
3. The method of claim 2, wherein the at least one future time frame comprises a first number of future time frames;
the determining the variational coding feature based on the observation sampling feature and the observation track feature comprises:
fusing the observation sampling feature and the observation track feature using a first feature fusion network in the conditional variational autoencoder to obtain a first fusion feature;
and converting the first fusion feature into the variational coding feature using a first convolution network in the conditional variational autoencoder.
4. The method of claim 1, wherein the determining, based on the variational coding feature, a motion track of at least one future time frame corresponding to the target object using a track prediction network in the track prediction model comprises:
processing the variational coding feature using a second convolution network in the track prediction network to obtain a first intermediate result;
processing the first intermediate result using a first fully connected network in the track prediction network to obtain a second intermediate result;
processing the second intermediate result using a preset activation function in the track prediction network to obtain a third intermediate result;
and determining the motion track corresponding to the target object based on the third intermediate result.
5. The method of claim 4, wherein the third intermediate result is a result within a normalized range;
the determining the motion track corresponding to the target object based on the third intermediate result comprises:
performing inverse normalization processing on the third intermediate result to obtain an inverse normalization result;
determining the target bounding box information respectively corresponding to each future time frame based on the inverse normalization result;
and determining the motion track corresponding to the target object based on the target bounding box information.
6. The method according to claim 1, wherein the determining, based on the observation bounding box information sequence, an observation track feature corresponding to the target object using a track coding network in a track prediction model obtained by pre-training comprises:
processing the observation bounding box information sequence using a second fully connected network in the track coding network to obtain a first processing result;
processing the first processing result using a third convolution network in the track coding network to obtain a second processing result;
and determining the observation track feature based on the first processing result and the second processing result.
7. The method of claim 6, wherein the determining the observation track feature based on the first processing result and the second processing result comprises:
fusing the first processing result and the second processing result using a second feature fusion network in the track coding network to obtain a second fusion feature;
and pooling the second fusion feature using a pooling network in the track coding network to obtain a pooling result as the observation track feature.
8. The method of claim 6, further comprising:
normalizing the observation bounding box information sequence to obtain a normalized sequence;
the processing the observation bounding box information sequence using a second fully connected network in the track coding network to obtain a first processing result comprises:
processing the normalized sequence using the second fully connected network to obtain the first processing result.
9. The method of claim 1, wherein the bounding box information comprises coordinates of at least one set of diagonal corner points of a bounding box;
the obtaining the observation bounding box information sequence of the target object comprises:
acquiring the observation images respectively corresponding to the time frames;
processing each observation image using a target detection model obtained by pre-training to obtain detection frame information of the target object in each observation image;
and for any time frame, determining the coordinates of at least one set of diagonal corner points based on the detection frame information corresponding to that time frame, as the bounding box information corresponding to that time frame.
10. The method of claim 1, wherein the at least one time frame comprises a current time frame and at least one historical time frame;
the obtaining the observation bounding box information sequence of the target object comprises:
acquiring the observation image corresponding to the current time frame;
processing the observation image corresponding to the current time frame using a target detection model obtained by pre-training to obtain detection frame information of the target object in the observation image of the current time frame;
determining the bounding box information corresponding to the current time frame based on the detection frame information;
and determining the observation bounding box information sequence based on the bounding box information corresponding to the current time frame and the bounding box information corresponding to the previous time frames, wherein the latter comprises the bounding box information previously obtained for each historical time frame.
11. A track determination device of a target object, comprising:
the first acquisition module is used for obtaining an observation bounding box information sequence of the target object, wherein the observation bounding box information sequence comprises bounding box information in observation images respectively corresponding to each of at least one time frame of the target object;
the first processing module is used for determining, based on the observation bounding box information sequence, an observation track feature corresponding to the target object using a track coding network in a track prediction model obtained by pre-training;
the second processing module is used for determining, based on the observation track feature, a variational coding feature corresponding to the target object using a conditional variational autoencoder in the track prediction model;
and the third processing module is used for determining, based on the variational coding feature, a motion track of at least one future time frame corresponding to the target object using a track prediction network in the track prediction model.
12. A computer-readable storage medium storing a computer program for executing the track determination method of the target object according to any one of claims 1 to 10.
13. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the track determination method of the target object according to any one of claims 1 to 10.
CN202310363276.5A 2023-04-06 2023-04-06 Track determination method and device for target object, electronic equipment and storage medium Pending CN116403190A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310363276.5A CN116403190A (en) 2023-04-06 2023-04-06 Track determination method and device for target object, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310363276.5A CN116403190A (en) 2023-04-06 2023-04-06 Track determination method and device for target object, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116403190A true CN116403190A (en) 2023-07-07

Family

ID=87019483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310363276.5A Pending CN116403190A (en) 2023-04-06 2023-04-06 Track determination method and device for target object, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116403190A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117808846A (en) * 2024-02-01 2024-04-02 中国科学院空天信息创新研究院 Target motion trail prediction method and device based on lightweight remote sensing basic model


Similar Documents

Publication Publication Date Title
Bai et al. Stg2seq: Spatial-temporal graph to sequence model for multi-step passenger demand forecasting
EP4064130A1 (en) Neural network model update method, and image processing method and device
CN113264066B (en) Obstacle track prediction method and device, automatic driving vehicle and road side equipment
CN113673425B (en) Multi-view target detection method and system based on Transformer
CN112163596B (en) Complex scene text recognition method, system, computer equipment and storage medium
CN110738687A (en) Object tracking method, device, equipment and storage medium
CN112802063A (en) Satellite cloud picture prediction method and device and computer readable storage medium
CN114565812A (en) Training method and device of semantic segmentation model and semantic segmentation method of image
CN116403190A (en) Track determination method and device for target object, electronic equipment and storage medium
CN111553477A (en) Image processing method, device and storage medium
CN114418030A (en) Image classification method, and training method and device of image classification model
CN115147598A (en) Target detection segmentation method and device, intelligent terminal and storage medium
CN115810133B (en) Welding control method based on image processing and point cloud processing and related equipment
CN114283347B (en) Target detection method, system, intelligent terminal and computer readable storage medium
CN115018039A (en) Neural network distillation method, target detection method and device
CN114359554A (en) Image semantic segmentation method based on multi-receptive-field context semantic information
CN114565953A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN110889316B (en) Target object identification method and device and storage medium
CN108881899B (en) Image prediction method and device based on optical flow field pyramid and electronic equipment
CN110753239B (en) Video prediction method, video prediction device, electronic equipment and vehicle
CN114241360A (en) Video identification method and device based on self-adaptive reasoning
CN113591840A (en) Target detection method, device, equipment and storage medium
CN116258190A (en) Quantization method, quantization device and related equipment
CN110751672B (en) Method and apparatus for implementing multi-scale optical flow pixel transform using dilution convolution
CN112016571A (en) Feature extraction method and device based on attention mechanism and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination