CN117765226B - Track prediction method, track prediction device and storage medium - Google Patents

Track prediction method, track prediction device and storage medium

Info

Publication number
CN117765226B
Authority
CN
China
Prior art keywords
image
space
target
feature
time state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410196336.3A
Other languages
Chinese (zh)
Other versions
CN117765226A (en)
Inventor
赵文一
华炜
马也驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202410196336.3A
Publication of CN117765226A
Application granted
Publication of CN117765226B
Legal status: Active (current)

Abstract

The application relates to a track prediction method, a track prediction device and a storage medium, wherein the method comprises the following steps: acquiring a first image feature of a target image, the first image feature comprising scene information; acquiring a spatial index of a target space of the target image; generating a second image feature of the target image in the target space according to the first image feature and the spatial index of the target space; generating a current space-time state according to the second image feature, the space-time state comprising history information; obtaining a predicted space-time state according to the current space-time state; and generating a predicted segmented image according to the predicted space-time state to realize track prediction. Track prediction accuracy is improved through the present application.

Description

Track prediction method, track prediction device and storage medium
Technical Field
The present application relates to the field of computer vision, and in particular, to a track prediction method, apparatus, and storage medium.
Background
The development of autonomous driving technology is of great significance to the innovation of future traffic systems. For an autonomous vehicle, accurately predicting the trajectories of obstacles is one of the key factors in ensuring safe driving and efficiently planning a travel path. By accurately predicting obstacle trajectories, the autonomous driving system can respond in time, avoid collisions with obstacles, and interact better with other road traffic participants.
Most track prediction methods at the present stage depend on the results of target detection: they perform track prediction on pure trajectory data generated by target detection and tracking, relying only on the information of historical motion trajectories while ignoring the influence of other environmental factors. Such methods generally lack an understanding of the context of the traffic environment and cannot take into account the influence of factors such as other vehicles, pedestrians and road structures on the behavior of obstacles, so the track prediction is inaccurate.
Disclosure of Invention
In this embodiment, a track prediction method, apparatus and storage medium are provided to solve the problem of inaccurate track prediction in the prior art.
In a first aspect, in this embodiment, there is provided a track prediction method, including:
acquiring a first image feature of a target image; the first image feature comprises scene information;
acquiring a spatial index of a target space of the target image;
generating a second image feature of the target image in the target space according to the first image feature and the spatial index of the target space;
generating a current space-time state according to the second image feature; the space-time state comprises history information;
obtaining a predicted space-time state according to the current space-time state;
and generating a predicted segmented image according to the predicted space-time state to realize track prediction.
In some of these embodiments, the acquiring the first image feature of the target image includes:
acquiring a third image feature of the target image;
generating a height context feature from the third image feature and a height feature encoder;
and generating the first image feature according to the height context feature.
In some of these embodiments, the generating the first image feature from the height context feature includes:
generating a discrete height estimate from the height context feature;
and fusing the height context feature and the discrete height estimate to generate the first image feature.
In some of these embodiments, the acquiring the spatial index of the target space of the target image includes:
and calculating coordinates of pixel coordinates in the target image under a target space, and generating a spatial index of the target space of the target image.
In some of these embodiments, the generating a second image feature of the target image in the target space according to the first image feature and the spatial index of the target space includes:
and carrying out voxel pooling on the first image feature and the spatial index of the target space to generate a second image feature of the target image in the target space.
In some of these embodiments, said deriving a predicted spatiotemporal state from said current spatiotemporal state comprises:
modeling the current space-time state to obtain a current distribution;
acquiring hidden variables of the current distribution;
and obtaining a predicted space-time state according to the current space-time state and the hidden variables.
In some of these embodiments, said deriving a predicted spatiotemporal state from said current spatiotemporal state and said hidden variables comprises:
inputting the current space-time state and the hidden variables into a pre-trained state prediction model to obtain a predicted space-time state.
In some of these embodiments, generating a predicted segmented image from the predicted spatiotemporal state, implementing trajectory prediction, includes:
the predicted spatiotemporal state is input to a decoder to generate a predicted segmented image.
In a second aspect, in this embodiment, there is provided a trajectory prediction apparatus, including:
The first acquisition module is used for acquiring first image features of the target image; the first image feature comprises scene information;
The second acquisition module is used for acquiring the spatial index of the target space of the target image;
The first generation module is used for generating a second image feature of the target image in the target space according to the first image feature and the spatial index of the target space;
The second generation module is used for generating a current space-time state according to the second image characteristics; the space-time state comprises history information;
The determining module is used for obtaining a predicted space-time state according to the current space-time state;
and the prediction module is used for generating a predicted segmented image according to the predicted space-time state to realize track prediction.
In a third aspect, in this embodiment, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the trajectory prediction method of the first aspect.
Compared with the prior art, the track prediction method, track prediction device and storage medium provided by this embodiment generate a predicted segmented image according to the image features of the target image, thereby realizing track prediction. The image features include scene information, and the prediction method combined with image features better understands the environmental context by analyzing visible objects, road signs, traffic signs and the like in the scene, which improves track prediction accuracy and ensures the safety of the automatic driving system.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
Fig. 1 is a block diagram of a hardware structure of a terminal performing a trajectory prediction method according to an embodiment of the present application;
FIG. 2 is a flow chart of a trajectory prediction method according to an embodiment of the present application;
FIG. 3 is a flow chart of acquiring a first image feature of a target image according to an embodiment of the present application;
FIG. 4 is a flow chart of another trajectory prediction method according to an embodiment of the present application;
FIG. 5 is a schematic illustration of a segmented image according to an embodiment of the present application;
fig. 6 is a block diagram of a track prediction device according to an embodiment of the present application.
Detailed Description
The present application will be described and illustrated with reference to the accompanying drawings and examples for a clearer understanding of the objects, technical solutions and advantages of the present application.
Unless defined otherwise, technical or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," "these" and similar terms in this application are not intended to be limiting in number, but may be singular or plural. The terms "comprising," "including," "having," and any variations thereof, as used herein, are intended to encompass non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (units) is not limited to the listed steps or modules (units), but may include other steps or modules (units) not listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this disclosure are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "And/or" describes an association relationship between associated objects, meaning that there may be three relationships; e.g., "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. Typically, the character "/" indicates that the associated objects are in an "or" relationship. The terms "first," "second," "third," and the like, as referred to in this disclosure, merely distinguish similar objects and do not represent a particular ordering of objects.
The method embodiments provided in this embodiment may be executed in a terminal, a computer, or a similar computing device. Taking execution on a terminal as an example, fig. 1 is a block diagram of the hardware structure of a terminal that performs a trajectory prediction method according to an embodiment of the present application. As shown in fig. 1, the terminal may include one or more processors 102 (only one is shown in fig. 1) and a memory 104 for storing data, wherein the processors 102 may include, but are not limited to, a microprocessor (MCU), a programmable logic device (FPGA), or the like. The terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and is not intended to limit the structure of the terminal. For example, the terminal may include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a trajectory prediction method in the present embodiment, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above-described method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. The network includes a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
In this embodiment, a track prediction method is provided, and fig. 2 is a flowchart of a track prediction method according to an embodiment of the present application, as shown in fig. 2, where the flowchart includes the following steps:
step S210, acquiring first image features of a target image; the first image feature includes scene information.
Specifically, the target image may be an image acquired by a roadside device. The target image includes scene information, i.e., information in the scene captured by the roadside device; the scene information covers all information in the target image, such as vehicle information, pedestrian information, visible object information, road sign information, traffic sign information and the like, where the vehicle information includes the current state of the vehicle, external information of the vehicle, overhead state information of the vehicle and the like, and is not specifically limited herein. The target image is input to an image feature encoder to generate the image features of the target image, and the first image feature is generated from the image features of the target image; the first image feature is used to characterize the information in the target image. The image features of the target image here include the image features of history images, where a history image is an image at a historical time relative to the target image, and there may be a plurality of history images. For example, if the image acquisition time of the target image is t, then for h history images the corresponding image acquisition times may be t-1, t-2, …, t-h respectively; the h history images and the target image are input into the image feature encoder to generate the image features of the target image.
Step S220, obtaining a spatial index of a target space of the target image.
Specifically, the pixel coordinates in the target image are converted into coordinates in the target space, thereby generating a spatial index of the target space of the target image. The target space here may be BEV (bird's eye view) space, road side space, vehicle side space, or the like, and is not particularly limited herein.
Step S230, generating a second image feature of the target image in the target space according to the first image feature and the spatial index of the target space.
Specifically, voxel pooling is performed on the first image feature and a spatial index of the target space, and a second image feature of the target image under the target space is generated.
Step S240, generating a current space-time state according to the second image characteristics; the spatiotemporal state includes historical information.
Specifically, the second image features of the target image in the target space are input into a space-time fusion model, and the current space-time state is generated. The spatio-temporal state may be a multi-dimensional tensor, comprising image features of the target image and the history image.
Step S250, obtaining the predicted space-time state according to the current space-time state.
Specifically, a plurality of future space-time states, i.e., a plurality of predicted space-time states, are iteratively predicted in time steps based on the current space-time state. For example, if the time of the current space-time state is t, H predicted space-time states after the current time t are predicted, corresponding to the times t+1, t+2, …, t+H respectively.
Step S260, a predicted segmented image is generated according to the predicted space-time state, and track prediction is achieved.
Specifically, H predicted segmented images are generated according to the H predicted spatio-temporal states, and trajectory prediction is implemented. The segmented image may be a segmented image of a target object in the target image, such as a pedestrian, a vehicle, or the like, and the trajectory prediction of the target object is implemented according to the coordinate position of the target object in the segmented image.
In this embodiment, a predicted segmented image is generated according to the image features of the target image to implement track prediction. The image features include scene information, and the prediction method combined with image features better understands the environment context by analyzing visible objects, road signs, traffic signs and the like in the scene, so that the accuracy of track prediction is improved and the safety of the automatic driving system is ensured.
In some of these embodiments, as shown in fig. 3, step S210, acquiring a first image feature of a target image, includes the steps of:
In step S310, a third image feature of the target image is acquired.
Step S320, generating a height context feature according to the third image feature and the height feature encoder.
Step S330, generating a first image feature according to the height context feature.
In some of these embodiments, generating the first image feature from the height context feature comprises: generating a discrete height estimate based on the height context feature; and fusing the height context feature and the discrete height estimate to generate the first image feature.
In some of these embodiments, acquiring a spatial index of a target space of a target image includes: and calculating coordinates of pixel coordinates in the target image under the target space, and generating a spatial index of the target space of the target image.
In some of these embodiments, deriving the predicted spatiotemporal state from the current spatiotemporal state comprises: modeling the current space-time state to obtain a current distribution; acquiring hidden variables of the current distribution; and obtaining the predicted space-time state according to the current space-time state and the hidden variables.
In some of these embodiments, deriving the predicted spatiotemporal state from the current spatiotemporal state and the hidden variables comprises: inputting the current space-time state and the hidden variables into a pre-trained state prediction model to obtain the predicted space-time state.
In some of these embodiments, generating a predicted segmented image from the predicted spatio-temporal states, implementing trajectory prediction, includes: the predicted spatiotemporal state is input to a decoder to generate a predicted segmented image.
The present embodiment is described and illustrated below by way of preferred embodiments.
FIG. 4 is a flow chart of another trajectory prediction method according to an embodiment of the present application. As shown in fig. 4, the flow includes the steps of:
In step S410, the image obtained by the road-side camera passes through the image feature encoder to generate image features.
Specifically, the image acquired by the roadside camera passes through the image feature encoder F_feature to generate the image feature feature_img; the image acquired by the roadside camera corresponds to the target image or the history image in the foregoing embodiment, and the image feature feature_img is the image feature of the target image in the foregoing embodiment. The input of the image feature encoder F_feature is the image data collected by the roadside camera, with input size (b, h, n, c, ih, iw), where b represents the batch size; h represents the number of history steps, set to 3 in this embodiment (the current frame and the previous 2 frames are input); n represents the number of cameras, set to 1 in this embodiment (one roadside camera); c is the number of image channels, 3 in this embodiment since RGB images are used; ih represents the image pixel height; and iw represents the image pixel width. The image feature encoder in step S410 is formed by stacking a residual neural network ResNet, serving as the backbone, and a feature pyramid network FPN, serving as the neck. Each ResNet stage halves the feature map size and doubles the number of feature channels. In this embodiment, the feature map output by stage1 has size (b, 256, ih/4, iw/4), where the first dimension b is the batch size, i.e. the number of samples input at one time (generally b > 1 during training and b = 1 during prediction or inference), the second dimension is the feature dimension, a manually defined hyperparameter, and ih and iw are the height and width of the original input image resolution; the feature map output by stage2 has size (b, 512, ih/8, iw/8), that of stage3 is (b, 1024, ih/16, iw/16), and that of stage4 is (b, 2048, ih/32, iw/32). The feature pyramid network FPN takes the outputs of the 4 stages, aligns the four feature maps to size (b, 128, bh, bw) through convolution and deconvolution, and splices them into one tensor along the second dimension, outputting the image feature feature_img of size (b, 512, bh, bw), where bh = ih/16 and bw = iw/16 in this embodiment. That is, the four feature maps are all of size (b, 128, bh, bw), and splicing the four tensors along the second dimension (128 × 4 = 512) yields (b, 512, bh, bw).
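For illustration only, the following PyTorch-style sketch shows one way such an encoder could be assembled. The class name ImageFeatureEncoder, the use of torchvision's ResNet-50, and bilinear interpolation for the alignment (the embodiment uses convolution and deconvolution) are assumptions of the example, not the exact implementation of this embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ImageFeatureEncoder(nn.Module):
    """Sketch of F_feature: ResNet backbone plus an FPN-style neck that aligns the
    four stage outputs to 1/16 resolution with 128 channels each and splices them
    into a 512-channel image feature."""
    def __init__(self, out_per_stage=128):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.stages = nn.ModuleList([resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4])
        stage_channels = [256, 512, 1024, 2048]            # ResNet-50 stage output channels
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_per_stage, kernel_size=1) for c in stage_channels])

    def forward(self, x):                                  # x: (b*h*n, 3, ih, iw)
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        bh_bw = feats[2].shape[-2:]                        # stage3 resolution = (ih/16, iw/16)
        aligned = [F.interpolate(lat(f), size=bh_bw, mode="bilinear", align_corners=False)
                   for lat, f in zip(self.lateral, feats)]
        return torch.cat(aligned, dim=1)                   # (b*h*n, 512, bh, bw)

# Example: batch size 1, current frame plus 2 history frames, one roadside camera.
imgs = torch.randn(1 * 3 * 1, 3, 512, 896)
feature_img = ImageFeatureEncoder()(imgs)                  # -> torch.Size([3, 512, 32, 56])
```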
Step S420, generating image height fusion features according to the image features.
Specifically, the image feature is passed into the height feature encoder F_height to form the height context feature; softmax (normalized exponential function) is applied to the height context feature on each feature channel to obtain the discrete height estimate; and the height context feature is fused with the discrete height estimate to generate the image height fusion feature. The image height fusion feature is the first image feature of the image in the foregoing embodiment. The height feature encoder F_height comprises a convolution network F_conv, a context attention computing network F_mlp, a height attention computing network H_mlp, a context feature attention network F_se and a height feature attention network H_se; the main structure of the two feature attention networks is an SE layer (squeeze-and-excitation layer). The input of the height feature encoder comprises the image feature output by F_feature, the camera intrinsics and the image enhancement matrix. In this embodiment, the image feature feature_img first passes through the convolution network F_conv to re-extract features; the camera intrinsics and image enhancement matrix are input to the context attention computing network F_mlp to generate the context attention SE_context, and to the height attention computing network H_mlp to generate the height attention SE_height; feature_img and SE_context are input together to the context feature attention network F_se, and feature_img and SE_height are input together to the height feature attention network H_se; the outputs of the two branches are spliced to construct the height context feature, which is then fused with the discrete height estimate to generate the image height fusion feature. The networks in step S420 change neither the feature map size nor the number of channels.
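A minimal sketch of this structure is given below for illustration. The SE-gate formulation, the channel and height-bin counts, the separate 1x1 head for the height bins and the Lift-Splat-style outer-product fusion at the end are assumptions of the example; the embodiment does not disclose its layer definitions at this level of detail.

```python
import torch
import torch.nn as nn

class SEGate(nn.Module):
    """SE-style channel attention: the image feature is re-weighted per channel by a
    vector computed from the camera intrinsics / image-enhancement parameters."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, feat, attn_vec):                 # feat: (b, C, H, W), attn_vec: (b, C)
        return feat * self.gate(attn_vec)[..., None, None]

class HeightFeatureEncoder(nn.Module):
    """Sketch of F_height: F_conv re-extracts features, F_mlp / H_mlp turn camera
    parameters into context / height attention, F_se / H_se apply them, the two
    branches are spliced into the height context feature, and a softmax yields the
    discrete height estimate."""
    def __init__(self, channels=512, cam_dim=16, num_heights=64):
        super().__init__()
        self.f_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.f_mlp = nn.Sequential(nn.Linear(cam_dim, channels), nn.ReLU())   # -> SE_context
        self.h_mlp = nn.Sequential(nn.Linear(cam_dim, channels), nn.ReLU())   # -> SE_height
        self.f_se = SEGate(channels)
        self.h_se = SEGate(channels)
        self.height_head = nn.Conv2d(2 * channels, num_heights, 1)

    def forward(self, feature_img, cam_params):        # cam_params: flattened intrinsics + aug. matrix
        x = self.f_conv(feature_img)
        ctx = self.f_se(x, self.f_mlp(cam_params))      # context branch
        hgt = self.h_se(x, self.h_mlp(cam_params))      # height branch
        height_context = torch.cat([ctx, hgt], dim=1)   # height context feature
        height_dist = self.height_head(height_context).softmax(dim=1)  # discrete height estimate
        # One plausible reading of "fuse": Lift-Splat-style outer product of height bins and features.
        fused = height_dist.unsqueeze(2) * height_context.unsqueeze(1)  # (b, D, 2C, bh, bw)
        return fused
```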
In step S430, the coordinates in BEV space corresponding to the pixel coordinates in the image are calculated to generate a BEV spatial index.
Specifically, the BEV spatial index in step S430 is generated by dividing the image plane into a (gh, gw) grid and expanding D heights at each grid position, giving a view frustum of size (D, gh, gw, 3). The frustum is then coordinate-transformed so that the space in front of the camera field of view is divided into (D, gH, gW) blocks, where D is the number of height bins and gH, gW is the resolution of the BEV grid. The coordinate transformation establishes a virtual coordinate system at the camera: its origin coincides with the camera origin, its ground clearance is H, and its Y axis points vertically toward the ground. For a pixel (u, v) in the image, where u, v are the pixel coordinates and K is the camera intrinsic matrix, the corresponding ray in the camera coordinate system is K^{-1}·(u, v, 1)^T, and in the virtual coordinate system it is r = R_{cam→virt}·K^{-1}·(u, v, 1)^T. Assuming that the spatial point corresponding to the pixel (u, v) has ground clearance h_i, where i indexes the i-th of the D discrete height values, its Y coordinate in the virtual system is H − h_i, so the coordinates of the pixel point at each height in the virtual system are ((H − h_i)/r_y)·r, where r_y is the Y component of r. An ego (body) coordinate system is defined with its origin at the point on the ground directly below the camera, its z axis pointing vertically upward and its x axis pointing forward. The coordinates of the D discrete height points corresponding to the pixel (u, v) in the ego coordinate system are obtained by applying T_{virt→ego} to the virtual-system coordinates, where T_{A→B} denotes the transformation matrix from coordinate system A to coordinate system B. These ego-frame coordinates of the discrete height points constitute the BEV spatial index.
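The back-projection described above can be sketched as follows; the function name, the use of the full image resolution rather than the (gh, gw) feature grid, and the matrix conventions (R_cam2virt as a 3x3 rotation, T_virt2ego as a 4x4 homogeneous transform) are assumptions of the example.

```python
import numpy as np

def bev_spatial_index(ih, iw, K, R_cam2virt, T_virt2ego, heights, cam_height_H):
    """Sketch of step S430: back-project every pixel at D assumed ground clearances
    and return the ego-frame coordinates that serve as the BEV spatial index."""
    us, vs = np.meshgrid(np.arange(iw), np.arange(ih))                 # pixel grid (u = column, v = row)
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3).T   # (3, ih*iw) homogeneous pixels
    rays = R_cam2virt @ np.linalg.inv(K) @ pix                         # ray directions in the virtual frame
    points = []
    for h_i in heights:                                                # D discrete ground clearances
        scale = (cam_height_H - h_i) / rays[1]                         # virtual Y axis points toward the ground
        p_virt = rays * scale                                          # metric coordinates in the virtual frame
        p_virt_h = np.vstack([p_virt, np.ones((1, p_virt.shape[1]))])  # homogeneous coordinates
        p_ego = (T_virt2ego @ p_virt_h)[:3]                            # ego frame: origin on the ground below the camera
        points.append(p_ego.T.reshape(ih, iw, 3))
    return np.stack(points, axis=0)                                    # (D, ih, iw, 3)
```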
Step S440, the image height fusion features are voxel pooled with the BEV spatial index to generate image height fusion features in BEV space.
Specifically, voxel pooling in this step takes the image height fusion feature and the BEV spatial index as input, assigns the image height fusion feature to the BEV space according to its position index under the ego coordinate system, then sum-pools the features falling in the same BEV grid cell (adding them together), and finally obtains a feature map of size (n, h, C, gH, gW), where C is the number of feature channels. In this embodiment, the ego coordinate system is divided into BEV grid cells over a rectangular region in front of the camera 204.8 m long and 102.4 m wide, so gH = 512 and gW = 256. The image height fusion feature in BEV space is the second image feature in the foregoing embodiment.
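A minimal sketch of the sum-pooling step is shown below, assuming flattened per-point features and default ranges derived from the 204.8 m × 102.4 m region and the 512 × 256 grid; the tensor layout is simplified relative to the (n, h, C, gH, gW) output described above.

```python
import torch

def voxel_pooling(feats, ego_xyz, x_range=(0.0, 204.8), y_range=(-51.2, 51.2), grid=(512, 256)):
    """Sketch of step S440: scatter per-point image height fusion features into the
    BEV grid by their ego-frame (x, y) position and sum-pool features that fall
    into the same cell."""
    gH, gW = grid
    ix = ((ego_xyz[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * gH).long()
    iy = ((ego_xyz[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * gW).long()
    valid = (ix >= 0) & (ix < gH) & (iy >= 0) & (iy < gW)
    bev = feats.new_zeros(gH * gW, feats.shape[1])
    bev.index_add_(0, (ix * gW + iy)[valid], feats[valid])             # sum pooling within each BEV cell
    return bev.view(gH, gW, -1).permute(2, 0, 1)                       # (C, gH, gW)

# feats: (N, C) flattened features; ego_xyz: (N, 3) entries of the BEV spatial index.
```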
Step S450, inputting the image height fusion feature in BEV space into the space-time fusion model, and outputting a space-time state fused with the history information.
Specifically, the image height fusion feature in BEV space is input into the space-time fusion model F_temporal, which outputs the space-time state s_t fused with the history information; the space-time state s_t is a multidimensional tensor containing the image features. The space-time fusion model F_temporal comprises a 3D convolutional network and local average pooling.
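For illustration, a minimal sketch of a 3D-convolution-plus-local-average-pooling fusion is shown below; the kernel sizes, the pooling extent and the choice of returning the last time step as s_t are assumptions of the example.

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    """Sketch of F_temporal: fuse the h per-frame BEV features with a 3D convolution
    over (time, gH, gW) and local average pooling, producing the current
    space-time state s_t."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(c, c, kernel_size=(3, 3, 3), padding=(1, 1, 1)), nn.ReLU(),
            nn.AvgPool3d(kernel_size=(1, 3, 3), stride=1, padding=(0, 1, 1)))   # local average pooling

    def forward(self, bev_seq):                 # bev_seq: (b, C, h, gH, gW), history of BEV features
        fused = self.body(bev_seq)
        return fused[:, :, -1]                  # s_t: (b, C, gH, gW), taken at the latest time step
```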
Step S460, modeling the space-time state probability distribution.
Specifically, the space-time state probability distribution modeling is divided into a current distribution, whose input is the space-time state s_t, and a future distribution, whose input is the future label Y, in the form of BEV segmentation labels (i.e. segmentation maps) at the future times t+1, …, t+H, where H is the number of steps to predict. Both distributions are modeled as multidimensional Gaussian distributions, represented by their computed mean and variance. In this embodiment, they are computed by a probability encoder whose structure is a stack of 4 residual blocks; each residual block consists of a 1x1 convolution, a 3x3 convolution and a 1x1 convolution and halves the feature map size; the first block halves the number of feature channels, while the other blocks leave it unchanged. The output of the final residual block passes through a 1x1 convolution layer and outputs a combined mean-variance tensor of size (b, 2L), whose first half (L values) is the mean and second half (L values) is the variance, where L is the chosen hidden variable dimension of the probability distribution. The current space-time state is input into the probability encoder to output the mean and variance of the current distribution, and the future state, i.e. the label, is input into the probability encoder to output the mean and variance of the future distribution.
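The probability encoder described above can be sketched as follows; the stride-2 residual blocks follow the 1x1-3x3-1x1 description, while the global average pooling used to collapse the spatial dimensions before the (b, 2L) output is an assumption of the example.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """1x1 -> 3x3 (stride 2) -> 1x1 convolutions with a strided skip; halves the
    feature map size."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 1))
        self.skip = nn.Conv2d(c_in, c_out, 1, stride=2)

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class ProbabilisticEncoder(nn.Module):
    """Sketch of the probability encoder: 4 residual blocks (the first halves the
    channel count), a 1x1 projection to 2L channels, and global average pooling
    to give the (b, 2L) mean/variance tensor."""
    def __init__(self, c_state=64, latent_dim=32):
        super().__init__()
        c = c_state // 2
        self.blocks = nn.Sequential(ResBlock(c_state, c), ResBlock(c, c),
                                    ResBlock(c, c), ResBlock(c, c))
        self.head = nn.Conv2d(c, 2 * latent_dim, 1)

    def forward(self, s_t):                                # s_t: (b, C_state, gH, gW)
        z = self.head(self.blocks(s_t)).mean(dim=(2, 3))   # (b, 2L)
        mu, sigma = z.chunk(2, dim=1)                      # first half: mean, second half: variance
        return mu, sigma
```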
In step S470, hidden variables are sampled, and the current state and the hidden variables are used as inputs of the state prediction model.
Specifically, the hidden variable is sampled from the future distribution during training and from the current distribution during inference, and is denoted η; the current state s_t and the hidden variable η are used as the input of the state prediction model. The sampling is performed as η = μ + σ·ε, where μ represents the mean, σ represents the variance, and ε represents random noise. The future distribution is used during training, and the current distribution is used during prediction or inference, so the input of the state prediction model is sampled from the probability distribution. All the network models involved in this embodiment (including the encoders and the decoder) are arranged in series, and end-to-end means that all network models are trained simultaneously; during training the input is the image acquired by the roadside camera and the output is the segmentation map at future times. The hidden variable η sampled from the future distribution, after passing through the subsequent state prediction model and decoder, yields the real segmentation map and thus corresponds to an intermediate variable.
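A minimal sketch of the sampling step, directly following the μ + σ·ε form described above:

```python
import torch

def sample_latent(mu, sigma):
    """Reparameterized sampling of the hidden variable: mu + sigma * eps with
    eps ~ N(0, I). Training samples from the future distribution's (mu, sigma);
    prediction or inference samples from the current distribution's."""
    eps = torch.randn_like(mu)
    return mu + sigma * eps
```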
In step S480, the state prediction model iteratively predicts the future state according to the time steps.
Specifically, the state prediction model iteratively predicts future states in time steps. The structure of the state prediction model is a convolutional gated recurrent unit ConvGRU: the hidden variable η is taken as the input, the current state s_t is input as the hidden layer of the ConvGRU, and the output is the predicted future states. In this embodiment, 3 ConvGRU modules are used, and each ConvGRU is cycled through H time steps.
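For illustration, a minimal ConvGRU cell and rollout are sketched below; the gate formulation, kernel size and the broadcasting of the hidden variable to a spatial map are assumptions, and the sketch uses a single cell where the embodiment stacks 3 ConvGRU modules.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell: the latent drives the recurrence and the
    space-time state is carried as the hidden state."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.h_new = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.zr(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        n = torch.tanh(self.h_new(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * n

def predict_future_states(cell, s_t, eta, horizon):
    """Roll the state forward for `horizon` steps; eta (b, L) is the sampled hidden
    variable, broadcast to a map so it can be fed to the convolutional cell."""
    b, _, gh, gw = s_t.shape
    eta_map = eta[:, :, None, None].expand(b, eta.shape[1], gh, gw)
    h, states = s_t, []
    for _ in range(horizon):
        h = cell(eta_map, h)
        states.append(h)
    return states                                  # H predicted space-time states
```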
Step S490, the predicted future state is input to the decoder to generate a predicted segmented image.
Specifically, the predicted future state is input to the decoder to generate the predicted BEV segmented image. A schematic diagram of a segmented image predicted by the method in this embodiment is shown in fig. 5, where different colors represent different instances, such as pedestrians, vehicles, etc., and the light-colored portion of each object is its predicted track. The model structure of the decoder is a U-Net-like structure with three downsamplings, followed by two fully convolutional task heads that respectively output the predicted segmented image and the predicted instance centre-point image under BEV, thereby realizing instance segmentation prediction. In this embodiment, the input dimensions are (b×s, C_state, gH, gW), where b is the batch size, s is the number of time steps to predict plus 1 (i.e., the current time and the future times to predict), C_state is the feature dimension of the current state, and gH, gW is the resolution of the BEV grid. The first downsampling layer leaves the number of channels unchanged and halves the size; the second downsampling layer doubles the number of channels and halves the size; and the third downsampling layer doubles the number of channels and halves the size. The feature map is then upsampled three times, and after each upsampling it is added to the feature map of the corresponding size through a skip connection, forming the U-Net-like structure. The task head for the predicted segmented image is a stack of a 3x3 convolution and a 1x1 convolution, and outputs the segmentation result of obstacles on the BEV grid space, with the number of channels equal to the number of obstacle categories. The task head for the predicted instance centre point is a stack of a 3x3 convolution and a 1x1 convolution followed by a sigmoid classification, and outputs the probability that each point on the BEV grid space is an instance centre. Finally, the instance segmentation is output through the segmented image and the localization of the instance centres, so that obstacle perception and trajectory prediction are realized from the instance segmentation of the predicted future time steps.
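A minimal sketch of such a decoder is given below; the channel widths, normalization layers and bilinear upsampling are assumptions, while the three downsamplings with skip connections and the two 3x3 + 1x1 task heads follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVDecoder(nn.Module):
    """Sketch of the decoder: a small U-Net-like body (three downsamplings with
    skip connections) and two fully convolutional heads for BEV semantic
    segmentation and instance-centre probability."""
    def __init__(self, c_in, n_classes, c_head=64):
        super().__init__()
        def block(ci, co, stride):
            return nn.Sequential(nn.Conv2d(ci, co, 3, stride=stride, padding=1),
                                 nn.BatchNorm2d(co), nn.ReLU())
        self.d1 = block(c_in, c_in, 2)            # channels unchanged, size halved
        self.d2 = block(c_in, 2 * c_in, 2)        # channels x2, size halved
        self.d3 = block(2 * c_in, 4 * c_in, 2)    # channels x2, size halved
        self.u3 = block(4 * c_in, 2 * c_in, 1)
        self.u2 = block(2 * c_in, c_in, 1)
        self.u1 = block(c_in, c_in, 1)
        self.seg_head = nn.Sequential(nn.Conv2d(c_in, c_head, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(c_head, n_classes, 1))
        self.centre_head = nn.Sequential(nn.Conv2d(c_in, c_head, 3, padding=1), nn.ReLU(),
                                         nn.Conv2d(c_head, 1, 1), nn.Sigmoid())

    @staticmethod
    def _up(x, ref):
        return F.interpolate(x, size=ref.shape[-2:], mode="bilinear", align_corners=False)

    def forward(self, x):                          # x: (b*s, C_state, gH, gW)
        x1 = self.d1(x)
        x2 = self.d2(x1)
        x3 = self.d3(x2)
        y = self.u3(self._up(x3, x2)) + x2         # skip connections on the way up
        y = self.u2(self._up(y, x1)) + x1
        y = self.u1(self._up(y, x)) + x
        return self.seg_head(y), self.centre_head(y)   # BEV segmentation, instance-centre probability
```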
In this embodiment, the prediction method combined with image features can better understand the environmental context by analyzing the visible objects, road signs, traffic signs and the like in the scene, improving prediction accuracy and thereby ensuring the safety of the automatic driving system. Meanwhile, roadside perception can provide a wider field of view and global environment perception and reduce blind areas, so that more traffic information is captured and more accurate scene understanding is provided for vehicles. BEV instance segmentation is output end to end from the image data, which provides a more accurate and robust trajectory prediction result and is particularly suitable for roadside perception scenarios. In contrast, existing trajectory prediction methods train the three models of detection, tracking and prediction separately, and their trajectory prediction accuracy is limited.
It should be noted that the steps illustrated in the above-described flow or flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.
The present embodiment also provides a track prediction apparatus, which is used to implement the foregoing embodiments and preferred embodiments, and will not be described in detail. The terms "module," "unit," "sub-unit," and the like as used below may refer to a combination of software and/or hardware that performs a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementations in hardware, or a combination of software and hardware, are also possible and contemplated.
Fig. 6 is a block diagram of a track prediction device according to an embodiment of the present application, and as shown in fig. 6, the device includes:
a first acquiring module 710, configured to acquire a first image feature of a target image; the first image feature comprises scene information;
a second obtaining module 720, configured to obtain a spatial index of a target space of the target image;
A first generating module 730, configured to generate a second image feature of the target image in the target space according to the first image feature and the spatial index of the target space;
A second generating module 740, configured to generate a current space-time state according to the second image feature; the space-time state comprises history information;
a determining module 750, configured to obtain a predicted spatiotemporal state according to the current spatiotemporal state;
and a prediction module 760, configured to generate a predicted segmented image according to the predicted space-time state, so as to implement track prediction.
The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the various modules described above may be located in the same processor; or the above modules may be located in different processors in any combination.
There is also provided in this embodiment an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, acquiring a first image feature of a target image; the first image feature comprises scene information;
s2, acquiring a spatial index of a target space of the target image;
S3, generating a second image feature of the target image in the target space according to the first image feature and the spatial index of the target space;
S4, generating a current space-time state according to the second image characteristics; the space-time state comprises history information;
S5, obtaining a predicted space-time state according to the current space-time state;
S6, generating a predicted segmented image according to the predicted space-time state, and realizing track prediction.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and are not described in detail in this embodiment.
In addition, in combination with the track prediction method provided in the above embodiment, a storage medium may be further provided in this embodiment to implement. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements any of the trajectory prediction methods of the above embodiments.
It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to be limiting. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure in accordance with the embodiments provided herein.
It is to be understood that the drawings are merely illustrative of some embodiments of the present application and that it is possible for those skilled in the art to adapt the present application to other similar situations without the need for inventive work. In addition, it should be appreciated that while the development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and thus should not be construed as a departure from the disclosure.
The term "embodiment" in this disclosure means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive. It will be clear or implicitly understood by those of ordinary skill in the art that the embodiments described in the present application can be combined with other embodiments without conflict.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the patent claims. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (5)

1. A method of trajectory prediction, the method comprising:
Acquiring a third image feature of the target image; generating a height context feature from the third image feature and a height feature encoder; generating a first image feature according to the height context feature; the first image feature comprises scene information;
Calculating coordinates of pixel coordinates in the target image under a target space, and generating a spatial index of the target space of the target image;
Voxel pooling is carried out on the first image feature and the spatial index of the target space, and a second image feature of the target image in the target space is generated;
Inputting the second image features of the target image in the target space into a space-time fusion model to generate a current space-time state; the space-time state comprises history information;
Modeling space-time state probability distribution; the space-time state probability distribution modeling is divided into current distribution and future distribution, wherein the input of the current distribution is space-time state, the input of the future distribution is future label, and the form of the future label is a bird's eye view angle segmentation label at the future moment; inputting the current space-time state into a probability encoder, outputting the mean value and variance of the current distribution, inputting the future state into the probability encoder, and outputting the mean value and variance of the future distribution;
acquiring hidden variables of current distribution; taking the current state and hidden variables as inputs of a state prediction model, and obtaining a predicted space-time state by the state prediction model according to time step iteration;
The predicted space-time state is input to a decoder, and a predicted segmented image is generated to realize track prediction.
2. The trajectory prediction method of claim 1, wherein the generating a first image feature from the height context feature comprises:
generating a discrete height estimate from the height context feature;
and fusing the height context feature and the discrete height estimate to generate the first image feature.
3. A trajectory prediction device for performing the method of claim 1, said device comprising:
the first acquisition module is used for acquiring first image features of the target image; the first image feature comprises scene information;
The second acquisition module is used for acquiring the spatial index of the target space of the target image;
The first generation module is used for generating second image features of the target image in the target space according to the first image features and the spatial index of the target space;
the second generation module is used for generating a current space-time state according to the second image characteristics; the space-time state comprises history information;
the determining module is used for obtaining a predicted space-time state according to the current space-time state;
And the prediction module is used for generating a predicted segmented image according to the predicted space-time state to realize track prediction.
4. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the trajectory prediction method of any one of claims 1 to 2.
5. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the trajectory prediction method of any one of claims 1 to 2.
CN202410196336.3A 2024-02-22 Track prediction method, track prediction device and storage medium Active CN117765226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410196336.3A CN117765226B (en) 2024-02-22 Track prediction method, track prediction device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410196336.3A CN117765226B (en) 2024-02-22 Track prediction method, track prediction device and storage medium

Publications (2)

Publication Number Publication Date
CN117765226A CN117765226A (en) 2024-03-26
CN117765226B true CN117765226B (en) 2024-06-04



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708723A (en) * 2020-12-16 2022-07-05 华为技术有限公司 Trajectory prediction method and apparatus
CN113705636A (en) * 2021-08-12 2021-11-26 重庆邮电大学 Method and device for predicting trajectory of automatic driving vehicle and electronic equipment
CN114723955A (en) * 2022-03-30 2022-07-08 上海人工智能创新中心 Image processing method, device, equipment and computer readable storage medium
CN114898315A (en) * 2022-05-05 2022-08-12 北京鉴智科技有限公司 Driving scene information determination method, object information prediction model training method and device
CN115761702A (en) * 2022-12-01 2023-03-07 广汽埃安新能源汽车股份有限公司 Vehicle track generation method and device, electronic equipment and computer readable medium
CN116012609A (en) * 2022-12-28 2023-04-25 纵目科技(上海)股份有限公司 Multi-target tracking method, device, electronic equipment and medium for looking around fish eyes
CN116469079A (en) * 2023-04-21 2023-07-21 西安深信科创信息技术有限公司 Automatic driving BEV task learning method and related device
CN117132952A (en) * 2023-07-18 2023-11-28 北京机械设备研究所 Bird's eye view angle vehicle perception system based on many cameras
CN116740668A (en) * 2023-08-16 2023-09-12 之江实验室 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Path-based Multimodal Trajectories Prediction; Ziqi Zhao et al.; 2022 IEEE 96th Vehicular Technology Conference (VTC2022-Fall); 2022-09-29; full text *
Research on three-dimensional multi-object detection methods with multi-view fusion; 王俊印; CNKI Outstanding Master's Theses Full-text Database; 2023-01-15; full text *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant