CN111031351A - Method and device for predicting target object track - Google Patents


Info

Publication number
CN111031351A
Authority
CN
China
Prior art keywords
video
sub
prediction
target object
videos
Prior art date
Legal status
Pending
Application number
CN202010163877.8A
Other languages
Chinese (zh)
Inventor
任冬淳
夏华夏
樊明宇
钱德恒
丁曙光
Current Assignee
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202010163877.8A
Publication of CN111031351A
Legal status: Pending


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234: Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/2343: Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/4402: Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83: Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845: Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456: Structuring of content by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present specification discloses a method and an apparatus for predicting the trajectory of a target object. A prediction video describing the future of the target object is generated directly, without abstracting the target object or the environment in the captured actual video. Because the prediction video is generated from all of the information in the captured actual video, it reflects the future state of the target object more accurately. In addition, the target object is not cut out of its environment when the prediction video is generated, so the environmental information in the actual video is fully exploited during prediction, which improves both the accuracy and the comprehensiveness of the prediction video. When the future motion trajectory of the target object is determined, the environmental information in the prediction video can also be taken into account, making the resulting trajectory more accurate.

Description

Method and device for predicting target object track
Technical Field
The present disclosure relates to the field of unmanned driving technologies, and in particular, to a method and an apparatus for predicting a target trajectory.
Background
Smart cities have become a direction for the future planning and design of many cities around the world, which strive to improve the work and life of urban residents through the application of new technologies. Intelligent vehicles (e.g., vehicles with driver-assistance functions, unmanned vehicles) and intelligent transportation facilities have become indispensable components of such cities.
Many functions of intelligent vehicles and intelligent transportation facilities depend on predicting the motion trajectory of a target object in the environment. In the conventional method, the target object is abstracted into a movable point and the environment it is in is likewise abstracted (for example, a road in the environment is abstracted into a road topology graph, and/or natural elements such as the sky are abstracted into color blocks); the position of the point at the next moment is then predicted from its historical positions, and the motion trajectory of the target object at the next moment is finally drawn from that predicted position.
The abstraction in the existing method therefore discards at least part of the characteristics of the target object as well as the characteristics of the environment in which it is located, which reduces the information available to the prediction process and, in turn, the prediction accuracy.
Disclosure of Invention
The embodiments of the present disclosure provide a method and an apparatus for predicting a target trajectory, so as to partially solve the above problems in the prior art.
The embodiment of the specification adopts the following technical scheme:
the present specification provides a method for predicting a target trajectory, the method comprising:
acquiring an actual video of an environment where the unmanned equipment is located;
splitting the actual video;
determining, from the segments obtained by splitting, the segments whose time span from the current moment satisfies a preset rule, as sub-videos;
generating a sub-prediction video by adopting a pre-trained video generation model according to the determined sub-video;
re-determining the sub-prediction video as a sub-video;
according to the re-determined sub-videos, re-adopting the video generation model to generate sub-prediction videos until the sum of the time lengths of the generated sub-prediction videos is not less than the preset time length;
synthesizing the generated sub prediction videos into prediction videos;
identifying a target in the predicted video;
and tracking the identified target object in the prediction video to obtain the motion track of the target object in the prediction video, wherein the motion track is used as the predicted motion track of the target object in the future.
Optionally, training the video generation model specifically includes:
acquiring a historically acquired actual video as a sample video;
splitting the sample video to obtain a plurality of sub-sample videos;
and aiming at any two adjacent sub-sample videos in the sample videos, taking the former video in the two adjacent sub-sample videos as the input of the video generation model to be trained, and taking the latter video in the two adjacent sub-sample videos as the label to train the video generation model to be trained.
Optionally, identifying the target object in the prediction video specifically includes:
and identifying a preset type of dynamic object in the prediction video as a target object.
Optionally, in the prediction video, tracking the identified target object to obtain a motion trajectory of the target object in the prediction video, where the motion trajectory is used as a predicted motion trajectory of the target object in the future, and specifically includes:
splicing the actual video and the predicted video to obtain a comprehensive video;
tracking the target object in the integrated video;
obtaining the motion track of the target object in the comprehensive video;
and taking the part of the motion trajectory corresponding to the prediction video as the future motion trajectory of the target object.
Optionally, in the prediction video, the identified target object is tracked to obtain a motion trajectory of the target object in the prediction video, and the motion trajectory is used as a predicted motion trajectory of the target object in the future, and the method further includes:
and performing coordinate conversion on the motion track according to the pose when the actual video is collected and the parameters of the collection equipment for collecting the actual video to obtain the future motion track of the target object in a world coordinate system.
Optionally, the video generation model comprises an encoding end and a decoding end;
generating a prediction video by adopting a pre-trained video generation model according to at least part of sub-videos in the actual video, specifically comprising:
inputting at least part of sub-videos in the actual videos into the encoding end to obtain video characteristics output by the encoding end;
and inputting the video characteristics into the decoding end to obtain the sub-prediction video output by the decoding end.
Optionally, the encoding end includes a convolutional neural network CNN, and the decoding end includes a long-short term memory LSTM network.
The device for predicting the track of the target object provided by the specification comprises:
the acquisition module is used for acquiring an actual video of the environment where the unmanned equipment is located;
the splitting module is used for splitting the actual video;
the sub-video determining module is used for determining, from the segments obtained by splitting, the segments whose time span from the current moment satisfies the preset rule, as sub-videos;
the sub-prediction video generation module is used for generating a sub-prediction video by adopting a pre-trained video generation model according to the determined sub-video;
a sub-video re-determination module for re-determining the sub-prediction video as a sub-video;
the sub-prediction video regeneration module is used for regenerating the sub-prediction video by adopting the video generation model according to the re-determined sub-video until the sum of the time lengths of the generated sub-prediction videos is not less than the preset time length;
the synthesis module is used for synthesizing the generated sub prediction videos into prediction videos;
an identification module for identifying a target object in the predicted video;
and the motion track determining module is used for tracking the identified target object in the prediction video to obtain the motion track of the target object in the prediction video, and the motion track is used as the predicted motion track of the target object in the future.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements a method of predicting a trajectory of an object as described above.
The present specification provides an unmanned device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the above-mentioned method for predicting a target object trajectory when executing the program.
The embodiment of the specification adopts at least one technical scheme which can achieve the following beneficial effects:
in the method and the apparatus for predicting a target object trajectory provided by the embodiments of the present specification, the target object and the environment in the captured actual video do not need to be abstracted; a prediction video that corresponds to the actual video and depicts the future of the target object is predicted directly. Because the prediction video is generated from all of the information in the captured actual video, it reflects the future state of the target object more accurately. In addition, the target object is not cut out of its environment when the prediction video is generated, so the environmental information in the actual video is fully exploited during prediction, which further improves the accuracy of the prediction video and the comprehensiveness of the information it contains. When the future motion trajectory of the target object is subsequently determined from the prediction video, the environmental information in the prediction video can be taken into account, so that the resulting trajectory is more accurate. Moreover, the iterative video prediction process in this specification produces a prediction video of longer duration, which satisfies the need to predict a longer motion trajectory of the target object.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and are incorporated in and constitute a part of this specification, illustrate embodiments of the specification and together with the description serve to explain the specification; they are not intended to limit the specification. In the drawings:
FIG. 1 is a process for predicting a target trajectory provided by embodiments of the present disclosure;
fig. 2a is a schematic diagram of splitting an actual video according to an embodiment of the present disclosure;
FIG. 2b is a schematic diagram of an exemplary network architecture of a process for predicting a target trajectory according to an embodiment of the present disclosure;
FIG. 3 is a process for training a video generation model according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a split sample video provided in an embodiment of the present specification;
fig. 5 is a schematic structural diagram of an apparatus for predicting a target trajectory according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a portion of an unmanned aerial vehicle corresponding to fig. 1 provided in an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more clear, the technical solutions of the present disclosure will be clearly and completely described below with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the specification without making any creative effort belong to the protection scope of the specification.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Fig. 1 is a process for predicting a target trajectory according to an embodiment of the present disclosure, which may specifically include the following steps:
s100: and acquiring an actual video of the environment where the unmanned equipment is located.
The process of predicting the trajectory of a target object in this specification is applicable to both fixed and changing environments. When the scene is a fixed environment, the acquisition module required for capturing the actual video may include acquisition devices on traffic facilities (including but not limited to traffic lights) and on buildings in the environment; when the scene is a changing environment, it may include acquisition devices on vehicles (including but not limited to vehicles with a driver-assistance function and unmanned vehicles), unmanned aerial vehicles, mobile phones, and tablet computers.
Furthermore, the capturing device may be either a device with a continuous image capturing function, such as a video camera, or a device with a non-continuous image capturing function, such as a still camera. When the capturing device is a still camera, an image processing device in the acquisition module can optionally process the non-continuous images captured by the camera into continuous images (e.g., video).
Optionally, in the actual video capturing process, the capturing device may perform pose adjustment according to the actual environment. The number of capturing devices required to capture the actual video, and the locations where the capturing devices are located on the transportation facility and/or vehicle, are not limited by this description.
S102: and splitting the actual video.
When forward-looking prediction is performed from actual video captured in real time through the process in this specification, for example when the process is applied to real-time target object trajectory prediction while an unmanned vehicle is driving, the actual video captured in the previous step can be processed and split into a plurality of segments, at least part of which can be used for predicting the trajectory of the target object. The segments may be determined in various ways. Optionally, a segment may have the same image presentation form as the actual video.
Specifically, the splitting process may be: according to a preset actual-video splitting rule, splitting the actual video into at least two temporally continuous (i.e. adjacent) segments. In the embodiment shown in fig. 2a, the captured actual video may be split according to a preset time interval, yielding a plurality of segments R1 to Rn identified by time (which may be identified by a timestamp). If the preset time interval is 10 seconds, the duration of each segment is 10 seconds. Optionally, after the segments are obtained by splitting, they may be sorted in the order of their identifiers, so that the sorted segments still preserve the order of the frames in the actual video.
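For illustration only (this sketch is not part of the original disclosure), splitting by a preset time interval could be implemented along the following lines in Python; the `Segment` container, the 10-second default, and the assumption that frames arrive with monotonically increasing timestamps are all hypothetical choices:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start_time: float   # timestamp identifying the segment (R1 ... Rn)
    frames: list        # frames belonging to this segment

def split_video(frames: list, timestamps: List[float],
                interval: float = 10.0) -> List[Segment]:
    """Split a video into adjacent segments of `interval` seconds each,
    identified by the timestamp of their first frame."""
    if not frames:
        return []
    segments = []
    current = Segment(start_time=timestamps[0], frames=[])
    for frame, t in zip(frames, timestamps):
        if t - current.start_time >= interval:
            segments.append(current)
            current = Segment(start_time=t, frames=[])
        current.frames.append(frame)
    segments.append(current)
    # sorting by the identifier keeps the original frame order
    segments.sort(key=lambda s: s.start_time)
    return segments
```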
S104: and determining the segments with the time length from the current moment to meet the preset rule from the segments obtained by splitting as the sub-videos.
Among the segments obtained by splitting, the segment closest to the current time can reflect the change trend of the motion state of the target object at the next time most, and is more critical to the prediction in the specification. The process in this specification mainly takes a segment closer to the current time as a sub-video.
Still taking splitting by a preset time interval as an example, when determining the sub-video, the segments required for prediction can be selected from the split segments according to the identifier of each segment through a preset sub-video determination rule. Specifically, among the segments obtained by splitting, the identifiers nearest to the current moment are determined from the identifiers of the segments, and the segments corresponding to those identifiers are taken as sub-videos. For example, in fig. 2a, segment R1 is nearest to the current time, so segment R1 can be taken as a sub-video.
In addition, the actual video can be split by extraction according to the preset rule, so as to obtain the sub-video directly. Specifically, a video segment covering a preset duration back from the current moment is extracted as the split segment. For example, if the preset duration is 20 seconds, the video of the 20 seconds before the current moment is taken as the segment of the actual video. The split segment is then taken as the sub-video.
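As a companion sketch (again illustrative only, with an assumed 20-second window and hypothetical helper names), the extraction variant simply keeps the frames captured within a preset duration before the current moment:

```python
def extract_sub_video(frames, timestamps, current_time, window=20.0):
    """Take the video of the last `window` seconds before `current_time`
    as the split segment, and use it directly as the sub-video."""
    return [f for f, t in zip(frames, timestamps)
            if current_time - window <= t <= current_time]
```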
S106: and generating the sub prediction video by adopting a pre-trained video generation model according to the determined sub video.
In this specification, the sub-prediction video generated by the pre-trained video generation model contains both the information of the target object in the environment and the information of the environment itself; that is, the sub-prediction video presents each object of the real environment in image form, is a predictive continuation of the actual video in time, and conveys information in the same way as the actual video input to the model. Moreover, because the information in the sub-prediction video is represented in the same form as in the actual video, each piece of information is the complete information corresponding to the target object and the environment, so the sub-prediction video can simultaneously represent the target object, the environment, and the interaction between them.
Optionally, the duration of the sub-prediction video may be set according to the attribute of the capturing device.
S108: and re-determining the sub-prediction video as the sub-video.
In order to distinguish the sub-prediction video from the sub-video generating the sub-prediction video, a timestamp may be set for the sub-prediction video, and the timestamp of the sub-prediction video may be used to characterize a future time corresponding to the sub-prediction video. No confusion occurs when the sub-predictive video is re-determined to be a sub-video.
S110: and according to the re-determined sub-videos, re-adopting the video generation model to generate the sub-prediction videos until the sum of the time lengths of the generated sub-prediction videos is not less than the preset time length.
The process of regenerating the sub-prediction video in this step may be the same as or similar to the process in step S106, and is not repeated here.
In addition, the preset time length or the number of times of executing the step S108 and the present step in a loop may be determined according to actual requirements. The sub-prediction videos have temporal precedence.
S112: and synthesizing the generated sub prediction videos into a prediction video.
Because each sub-prediction video is predicted from the sub-video that precedes it, the sub-prediction videos can be synthesized into one continuous prediction video. The chronologically ordered sub-prediction videos obtained in the previous steps can be spliced in time order to obtain the synthesized prediction video.
It can be appreciated that even if better training samples are used and the model structure is improved, the trained model is still limited in how long a prediction video it can produce in a single prediction. If the duration of a single predicted video is extended without limit, its accuracy at times far from the current time decreases.
For example, if the actual video input to the model is 30 seconds long but the prediction video output by the model is 60 seconds long, the prediction is clearly too long, and the accuracy of the future conditions depicted in the latter half of the prediction video inevitably suffers.
Through the iterative prediction and the synthesis of multiple prediction results described in this specification, a prediction video of longer duration can be obtained, which effectively resolves this limitation on prediction duration.
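One possible reading of steps S106 to S112 is the loop below; it is a minimal sketch, assuming that `video_generation_model` maps a sub-video (a list of frames) to a sub-prediction video of a fixed duration `sub_duration`:

```python
def predict_video(sub_video, video_generation_model,
                  sub_duration: float, preset_duration: float):
    """Iteratively generate sub-prediction videos until their total
    duration is not less than `preset_duration`, then splice them in
    chronological order into one prediction video."""
    sub_predictions, total = [], 0.0
    current_input = sub_video
    while total < preset_duration:
        sub_pred = video_generation_model(current_input)  # S106 / S110
        sub_predictions.append(sub_pred)
        total += sub_duration
        current_input = sub_pred      # S108: re-determine as the sub-video
    # S112: synthesize the sub-prediction videos in chronological order
    prediction_video = []
    for sub_pred in sub_predictions:
        prediction_video.extend(sub_pred)
    return prediction_video
```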
S114: identifying a target object in the predicted video.
The target object can be a dynamic object such as a motor vehicle, a pedestrian and the like in the prediction video. The process in this specification does not limit the number of identified objects, and motion trajectory prediction may be performed only for one object, or may be performed for a plurality of objects in a prediction video.
Optionally, the target object may be determined from various dynamic objects in the prediction video according to a preset target object type rule. Specifically, each dynamic object in the prediction video may be first identified, and for each dynamic object, the type of the dynamic object may be determined. Then, the matching degree of the type of the dynamic object and the type of the target object (which can be a motor vehicle and/or a pedestrian) defined in the target object type rule is determined, and whether the dynamic object is the target object is determined according to the determined matching degree. If the type of the dynamic object is matched with the target object type defined in the target object type rule, the dynamic object is the target object; conversely, the dynamic object is not the target object.
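A minimal sketch of this type-matching rule, assuming the detector returns dynamic objects that expose a `type` attribute; the set of target types shown here is an illustrative assumption, not fixed by the disclosure:

```python
TARGET_TYPES = {"motor_vehicle", "pedestrian"}   # preset target object type rule (assumed)

def identify_targets(dynamic_objects):
    """Keep only the dynamic objects whose type matches the preset rule."""
    return [obj for obj in dynamic_objects if obj.type in TARGET_TYPES]
```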
S116: and tracking the identified target object in the prediction video to obtain the motion track of the target object in the prediction video, wherein the motion track is used as the predicted motion track of the target object in the future.
Optionally, after the target object in the prediction video is identified, the target object may be tracked according to a video tracking algorithm to determine the position of the target object in each frame of the prediction video at each future time, and the position of the target object in any frame of the prediction video may be the position of the target object in the image coordinates of the frame. Then, the determined positions of the target object at various time points in the future can be fitted to obtain the motion track of the target object in the prediction video.
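A schematic of step S116, assuming a per-frame `tracker` callable that returns the target object's pixel position in each frame; the second-order polynomial fit is only one possible way to fit the per-frame positions into a trajectory:

```python
import numpy as np

def predict_trajectory(prediction_video, tracker, frame_times):
    """Track the target in every frame of the prediction video and fit
    its future motion trajectory in image coordinates."""
    positions = np.array([tracker(frame) for frame in prediction_video])  # (N, 2)
    t = np.asarray(frame_times)
    fx = np.polyfit(t, positions[:, 0], deg=2)   # fit the x coordinate over time
    fy = np.polyfit(t, positions[:, 1], deg=2)   # fit the y coordinate over time
    return np.stack([np.polyval(fx, t), np.polyval(fy, t)], axis=1)
```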
As can be seen from the foregoing, the process in this specification, when capturing the actual video and generating the prediction video, retains and predicts all of the information about the target object and the environment, and does not abstract the target object or the environment at any point during capture and prediction. Abstraction inevitably causes loss of information, whereas the process in this specification obtains complete information in the prediction video while retaining and using all of the information in the actual video.
Further, when the prediction video is generated in the process of the description, the target object is not cut out from the environment, so that the prediction video obtained through the process of the description not only combines all information displayed by the target object in the actual video, but also fully utilizes the environment information in the actual video, so that the interaction between the target object and the environment can be embodied in the prediction video, and the accuracy and the comprehensiveness of the prediction video are obviously improved. Furthermore, the motion track of the target object in the future obtained through the prediction video is more practical and more accurate.
In addition, since the predicted video obtained through the process in the description can contain the predicted future environment information, the predicted video can present a complete environment, that is, the predicted video has a good visual effect, and can intuitively convey the predicted future information to the user in the form of video and images.
The process of predicting the trajectory of the target object described in the present specification will be described in detail below.
The process of predicting the trajectory of the target object in the present specification may be realized by a device for predicting the trajectory of the target object. The description relates to a network architecture as shown in fig. 2 b. In fig. 2b, the actual video is input into the device for predicting the track of the target object, and the device for predicting the track of the target object can generate and output a corresponding predicted video according to the actual video prediction.
The process of predicting the trajectory of the target object in this specification requires a model that can comprehensively process the various information in the actual video and, based on the integrated information, predict and generate a video corresponding to the future. To obtain such a video generation model with comprehensive processing capability and comprehensive prediction capability, the present specification provides a process for training a video generation model, which trains the model in a supervised manner, as shown in fig. 3, and includes the following steps:
s300: actual videos collected historically are obtained as sample videos.
The model training process in this specification places no special requirements on the sample video. Preferably, an actual video with a higher frame rate is used as the sample video: the interval between frames is shorter and the continuity between any two adjacent frames is better, so training on such video accelerates the convergence of the model.
Alternatively, the sample video may be selected according to a scene in which the process of predicting the trajectory is actually applied. If the application scene of the process of predicting the target object track is the fixed environment, selecting the actual video acquired aiming at the fixed environment as the sample video, and improving the training efficiency.
S302: and splitting the sample video to obtain a plurality of sub-sample videos.
Because the process of training the model is supervised training, inputs and labels with certain corresponding relations are needed as samples needed by training. In this step, the sample video obtained in step S300 needs to be processed to obtain input and label that can be directly used for training.
The way to process the sample video may be: and splitting the sample video according to a preset sample splitting rule to obtain at least two sub-sample videos which are continuous in time. In the model training process of the present specification, any two sub-sample videos that are continuous in time can be used as a set of input and labels required by the video generation model training.
In an alternative embodiment of the present specification, as shown in fig. 4, the sample video is split into five sub-sample videos, V1, V2, V3, V4 and V5. The sub-sample videos V1, V2 and V3 are continuous in time, and when the sub-sample video V1 is used as input of training, the sub-sample video V2 can be used as an annotation of the training; when the sub-sample video V2 is used as an input of training, the sub-sample video V3 can be used as a label of training. The sub-sample videos V4 and V5 are consecutive in time, and the sub-sample video V5 can be used as an annotation of training when the sub-sample video V4 is used as an input of training. The sub-sample video V2 and the sub-sample video V4 have no correspondence in training, and there may be an overlap in time between the sub-sample video V2 and the sub-sample video V4.
The duration of the split sub-sample video can be set according to the actual use scene. In addition, in addition to splitting, the specification determines that the process of the sub-sample video does not need to perform other processing on the sample video so as to retain all information in the sample video as much as possible, so that the model can learn the capability of integrating and predicting various information without distinguishing a target object from an environment according to all information in the sample video.
S304: and aiming at any two adjacent sub-sample videos in the sample videos, taking the former video in the two adjacent sub-sample videos as the input of the video generation model to be trained, and taking the latter video in the two adjacent sub-sample videos as the label to train the video generation model to be trained.
Optionally, the video generation model in this specification includes an encoding end and a decoding end. Two temporally adjacent sub-sample videos obtained by splitting are taken as a group, and the earlier sub-sample video in the group is input to the encoding end as the training input. The encoding end extracts video features from the input and outputs them to the decoding end. The decoding end predicts and outputs a check video from the video features. The check video can then be compared with the later sub-sample video of the group (i.e., the label); the difference obtained by the comparison is taken as the error, and the model is adjusted by error back-propagation until the error is smaller than a preset threshold. Optionally, the video features may be features determined for each pixel of each frame in the video.
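To make the pairing of adjacent sub-sample videos and the error back-propagation concrete, here is a hedged sketch; the MSE loss, the optimizer, and the assumption that each sub-sample video is already a tensor are illustrative choices, not details fixed by the disclosure:

```python
import torch

def build_training_pairs(sub_sample_videos):
    """Any two temporally adjacent sub-sample videos form one
    (input, label) pair: the earlier one is the input, the later
    one is the annotation."""
    return [(sub_sample_videos[i], sub_sample_videos[i + 1])
            for i in range(len(sub_sample_videos) - 1)]

def train_step(model, optimizer, earlier, later):
    """One supervised update: predict a check video from the earlier
    sub-sample video and compare it with the later one (the label)."""
    optimizer.zero_grad()
    check_video = model(earlier)
    loss = torch.nn.functional.mse_loss(check_video, later)  # illustrative loss
    loss.backward()                                           # error back-propagation
    optimizer.step()
    return loss.item()
```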
In addition, the input mode of the video is not particularly limited in this specification. In an alternative implementation scenario, the video may be decomposed into images frame by frame, and each of the decomposed images may be used as an input of the model.
Further, the present specification does not limit the specific type or structure of the video generation model. The model may be a CNN (Convolutional Neural Network) model or an LSTM (Long Short-Term Memory) model. Specifically and optionally, the encoding end includes a convolutional neural network (CNN), and the decoding end includes a long short-term memory (LSTM) network.
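For orientation, a minimal PyTorch sketch of an encoder-decoder of the kind just described, with a CNN encoding end and an LSTM decoding end; the layer sizes and the toy 64x64 output resolution are assumptions, since the disclosure does not specify the architecture at this level of detail:

```python
import torch
import torch.nn as nn

class VideoGenerationModel(nn.Module):
    def __init__(self, feat_size=256, hidden_size=256):
        super().__init__()
        # encoding end: a CNN that extracts per-frame video features
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_size),
        )
        # decoding end: an LSTM that turns the feature sequence into future frames
        self.decoder = nn.LSTM(feat_size, hidden_size, batch_first=True)
        self.to_frame = nn.Linear(hidden_size, 3 * 64 * 64)  # toy 64x64 RGB frames

    def forward(self, frames):
        # frames: (batch, time, 3, H, W) -> predicted frames: (batch, time, 3, 64, 64)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        hidden, _ = self.decoder(feats)
        return self.to_frame(hidden).view(b, t, 3, 64, 64)
```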
According to the model training process, the amount of information carried by the sample video adopted by the training model is large, so that various information presented by the target object and various information presented by the environment in the actual video can be comprehensively processed in the actual using process of the trained model, and the comprehensive and comprehensive prediction video can be predicted.
For example, when a vehicle is taken as a target object, since various motion state information (speed, motion direction, etc.) and signal state information (a turn signal, a brake signal, etc. of the vehicle) presented by the vehicle are retained in the sample video, the trained model can predict and generate a prediction video presenting an accurate future state of the vehicle according to various information of the vehicle target object in the actual video in the actual application. If a vehicle turns on a right turn light in the actual video, the pose of the vehicle is adjusted to be right turn in the prediction video generated by model prediction. If the traffic light in the driving direction of a vehicle is switched to the red light in the actual video, the driving speed of the vehicle is reduced in the prediction video generated by the model prediction.
Therefore, the video generation model obtained through the training of the model training process in this specification can be combined with various information of the target object and the environment, and the foregoing steps S100 to S112 are adopted to integrate the multi-aspect information in the actual video, so as to obtain the prediction video with higher accuracy.
Further, if the duration of the sub-prediction video obtained in step S106 is short, at least a portion of the sub-video used for generating the sub-prediction video may be spliced with the sub-prediction video, and the spliced video is used as a re-determined sub-video for the next prediction, so as to obtain a more accurate sub-prediction video according to the re-determined sub-video with a long duration, and further obtain a prediction video with a higher accuracy.
The process in this specification can therefore also effectively mitigate poor actual-video quality caused by shortcomings of the acquisition device, for example a low frame rate that leaves a long interval between two adjacent frames. The iterative video prediction process yields a prediction video of longer duration, and that duration can be adjusted as needed, so the prediction video can effectively fill the gap between two adjacent frames of the actual video.
Further, the process of determining the prediction video in this specification can also cope efficiently and responsively with sudden problems during video display. For example, if the performance of the capturing device is unstable during capture, or frames of the actual video are lost because of poor transmission quality, the lost frames can be predicted through the process in this specification, and the predicted video can be inserted at the positions where frames were lost, thereby compensating for the missing frames. When the frame-compensated actual video is displayed to the user in the form of video and images, the user experience is noticeably improved.
After the prediction video is obtained through the foregoing process, a target object in the prediction video may be identified according to the prediction video, and a motion trajectory of the target object in the future may be determined for the target object.
Specifically, the process of determining the future motion trajectory of the target object may be: and splicing the actual video and the predicted video according to the time sequence to obtain the comprehensive video. Because the input and the label adopted in the model training process have time continuity, the prediction video obtained by the trained model can be well connected with the collected actual video into a whole.
Then, in the resulting comprehensive video, the position of the target object in each frame is marked with a bounding box so as to track the target object, thereby obtaining the motion trajectory of the target object in the comprehensive video. The part of that trajectory corresponding to the prediction video can then be taken as the future motion trajectory of the target object.
The comprehensive video combines various information in the actual video and the prediction video, compared with a single prediction video, the comprehensive video can more comprehensively reflect the conditions of the target object and the environment, richer information is provided for tracking the target object, and the accuracy of the motion track of the target object in the future obtained according to the comprehensive video is higher.
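A short sketch of this variant (all helper names are assumptions): the comprehensive video is the concatenation of the actual and prediction videos, and only the trajectory points that fall inside the prediction part are kept as the future trajectory:

```python
def future_trajectory(actual_video, prediction_video, tracker):
    """Track the target in the spliced comprehensive video and keep the
    part of the trajectory that corresponds to the prediction video."""
    comprehensive = list(actual_video) + list(prediction_video)
    trajectory = [tracker(frame) for frame in comprehensive]
    return trajectory[len(actual_video):]   # the portion inside the prediction video
```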
In an optional use scenario in this specification, a step of stitching an actual video and a predicted video may be omitted, after the predicted video is obtained, the target object identified according to the predicted video is directly tracked in the predicted video, and a track of the target object in the predicted video is taken as a future motion track of the target object. Since the present embodiment can at least reduce the amount of data that needs to be processed by the tracking process, the process in the present embodiment can improve the efficiency of trajectory prediction.
The future track of the target object obtained through the process of the specification can accurately reflect the future movement state of the target object. In actual use, the determined future trajectory of the target object may be adopted by a plurality of users (e.g., vehicles, transportation facilities), and the demands of different users on the reference system corresponding to the predicted trajectory are different.
In order to widen the range of applications of the predicted trajectory, the motion trajectory can be coordinate-converted according to the pose of the acquisition device when the actual video was captured and the parameters of the acquisition device, so as to obtain the future motion trajectory of the target object in the world coordinate system. After a user obtains the predicted future trajectory of the target object in the world coordinate system, it can be converted from the world coordinate system into a coordinate system suitable for that user.
The pose (position and orientation) of the acquisition device can be recorded when the actual video is captured. The parameters of the acquisition device (e.g., intrinsic and extrinsic parameters) can be obtained in various ways: for example, they may be the parameters determined before the device leaves the factory, or they may be obtained by Zhang Zhengyou's calibration method.
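The coordinate conversion can be sketched with a standard pinhole-camera back-projection, assuming the intrinsic matrix K, the camera pose (R, t) mapping camera coordinates to world coordinates, and a known depth for each trajectory point; none of these specifics are fixed by the disclosure:

```python
import numpy as np

def image_to_world(trajectory_px, depths, K, R, t):
    """Convert an image-coordinate trajectory to the world coordinate
    system, given the acquisition device's intrinsics K and pose (R, t)."""
    K_inv = np.linalg.inv(K)
    world_points = []
    for (u, v), d in zip(trajectory_px, depths):
        p_cam = d * (K_inv @ np.array([u, v, 1.0]))  # back-project to camera coordinates
        world_points.append(R @ p_cam + t)           # transform into world coordinates
    return np.array(world_points)
```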
Thus, in the process of predicting the motion trajectory of the target object in this specification, the target object and the environment in the captured actual video do not need to be abstracted; a prediction video that corresponds to the actual video and depicts the future of the target object is predicted directly. Because the prediction video is generated from all of the information in the captured actual video, it reflects the future state of the target object more accurately. In addition, the target object is not cut out of its environment when the prediction video is generated, so the environmental information in the actual video is fully exploited during prediction, which further improves the accuracy of the prediction video and the comprehensiveness of the information it contains.
The method for predicting the motion state of a target object provided by this specification is particularly applicable to delivery with unmanned vehicles, for example in express-delivery and takeaway scenarios. Specifically, in such scenarios, delivery may be performed by an autonomous fleet configured with a plurality of unmanned vehicles.
Based on the same idea, the embodiment of the present specification further provides an apparatus for predicting a target trajectory corresponding to the process shown in fig. 1, and the apparatus for predicting a target trajectory is shown in fig. 5.
Fig. 5 is a schematic structural diagram of an apparatus for predicting a target trajectory according to an embodiment of the present disclosure, where the apparatus for predicting a target trajectory may include:
the acquisition module 500 is used for acquiring an actual video of the environment where the unmanned equipment is located;
a splitting module 502, configured to split the actual video;
a sub-video determining module 504, configured to determine, as a sub-video, a segment whose duration from a current time meets a preset rule among the segments obtained by splitting;
a sub-prediction video generation module 506, which generates a sub-prediction video according to the determined sub-video by adopting a pre-trained video generation model;
a sub-video re-determination module 508 for re-determining the sub-prediction video as a sub-video;
a sub-prediction video regeneration module 510, configured to regenerate sub-prediction videos by using the video generation model according to the newly determined sub-videos until the sum of the time lengths of the generated sub-prediction videos is not less than a preset time length;
a synthesizing module 512, configured to synthesize the generated sub prediction videos into prediction videos;
an identification module 514 for identifying a target object in the predicted video;
a motion trajectory determining module 516, configured to track the identified target object in the prediction video, to obtain a motion trajectory of the target object in the prediction video, where the motion trajectory is used as a predicted motion trajectory of the target object in the future.
The acquisition module 500, the splitting module 502, the sub-video determining module 504, the sub-prediction video generating module 506, the sub-video re-determining module 508, the sub-prediction video re-generating module 510, the synthesizing module 512, the identifying module 514, and the motion trajectory determining module 516 are electrically connected in sequence. The sub-predictive video generation module 506 is also electrically connected to the composition module 512.
Optionally, the apparatus for predicting a target trajectory further includes: and a training module. The training module is electrically connected with the prediction module. The training module is used for training the video generation model.
Optionally, the training module comprises a sample video determining sub-module, a sub-sample video determining sub-module, and a training sub-module, which are electrically connected in sequence.
And the sample video determining submodule is used for acquiring the historically acquired actual video as the sample video.
And the sub-sample video determining submodule is used for splitting the sample video to obtain a plurality of sub-sample videos.
And the training submodule is used for taking the previous video in the two adjacent sub-sample videos as the input of a video generation model to be trained and taking the next video in the two adjacent sub-sample videos as a label to train the video generation model to be trained aiming at any two adjacent sub-sample videos in the sample videos.
Optionally, the identifying module 514 is specifically configured to: and identifying a preset type of dynamic object in the prediction video as a target object.
Optionally, the motion trajectory determining module 516 is specifically configured to: splice the actual video and the prediction video to obtain a comprehensive video; track the target object in the comprehensive video; obtain the motion trajectory of the target object in the comprehensive video; and take the part of the motion trajectory corresponding to the prediction video as the future motion trajectory of the target object.
Optionally, the apparatus further includes a coordinate conversion module, where the coordinate conversion module is specifically configured to: and performing coordinate conversion on the motion track according to the pose when the actual video is collected and the parameters of the collection equipment for collecting the actual video to obtain the future motion track of the target object in a world coordinate system.
Optionally, the video generation model in this specification includes an encoding end and a decoding end. The encoding end is used for receiving at least part of the sub-videos in the actual video and outputting video features; the decoding end is used for receiving the video features output by the encoding end and outputting the sub-prediction video.
Optionally, the encoding end includes a convolutional neural network CNN, and the decoding end includes a long-short term memory LSTM network.
Embodiments of the present description also provide a computer-readable storage medium, which stores a computer program, where the computer program can be used to execute the process of predicting the target trajectory provided in fig. 1.
The embodiments of the specification also provide a schematic diagram of part of the structure of the unmanned device, shown in fig. 6. As shown in fig. 6, at the hardware level the unmanned device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, and may of course also include hardware required for other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and runs it to implement the process of predicting the target object trajectory described above with reference to fig. 1. Of course, in addition to a software implementation, this specification does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or logic devices.
In the 1990s, an improvement of a technology could clearly be distinguished as an improvement in hardware (for example, an improvement in circuit structures such as diodes, transistors, and switches) or an improvement in software (an improvement in a method flow). However, as technology has developed, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be implemented with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by a user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it, without needing a chip manufacturer to design and fabricate a dedicated integrated-circuit chip. Moreover, instead of manually making integrated-circuit chips, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), of which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained merely by slightly logically programming the method flow into an integrated circuit using the above hardware description languages.
The controller may be implemented in any suitable manner; for example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; the memory controller may also be implemented as part of the control logic for the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer-readable program code, the same functionality can be implemented by logically programming the method steps such that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered as structures within the hardware component. Or even the means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), in a computer-readable medium. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variations thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiment is described relatively briefly because it is substantially similar to the method embodiment; for relevant points, reference may be made to the corresponding parts of the description of the method embodiment.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (10)

1. A method of predicting a trajectory of a target object, the method comprising:
acquiring an actual video of an environment where an unmanned device is located;
splitting the actual video;
determining, from the segments obtained by the splitting, segments whose time spans from the current moment satisfy a preset rule, as sub-videos;
generating a sub-prediction video by adopting a pre-trained video generation model according to the determined sub-video;
re-determining the sub-prediction video as a sub-video;
generating a further sub-prediction video by re-adopting the video generation model according to the re-determined sub-video, until the sum of the durations of the generated sub-prediction videos is not less than a preset duration;
synthesizing the generated sub-prediction videos into a prediction video;
identifying a target object in the prediction video;
and tracking the identified target object in the prediction video to obtain a motion trajectory of the target object in the prediction video, wherein the motion trajectory is used as the predicted future motion trajectory of the target object.
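For illustration of the generation loop just recited in claim 1, the procedure can be read as autoregressive: the segment of the actual video closest to the current moment seeds a pre-trained model, each generated sub-prediction video is fed back as the next input, and generation stops once the accumulated predicted duration reaches a preset horizon. The Python sketch below is a minimal, hypothetical rendering of that loop; the names `predict_video`, `split_len`, `horizon`, and the dummy model are illustrative assumptions, not taken from the patent.

```python
from typing import Callable, List

import numpy as np

def predict_video(actual_video: np.ndarray,
                  split_len: int,
                  horizon: int,
                  generate: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
    """Autoregressively roll a video generation model forward.

    actual_video : (T, H, W, C) frames of the observed scene.
    split_len    : frames per sub-video (the splitting rule).
    horizon      : minimum number of predicted frames required.
    generate     : pre-trained model mapping one sub-video to the next one.
    """
    # Split the actual video and keep the segment closest to the current
    # moment as the seed sub-video (the "preset rule" of the claim).
    segments = [actual_video[i:i + split_len]
                for i in range(0, len(actual_video), split_len)]
    sub_video = segments[-1]

    sub_predictions: List[np.ndarray] = []
    predicted_frames = 0
    while predicted_frames < horizon:
        # Generate the next sub-prediction video from the current sub-video.
        sub_pred = generate(sub_video)
        sub_predictions.append(sub_pred)
        predicted_frames += len(sub_pred)
        # Re-determine the sub-prediction video as the next input.
        sub_video = sub_pred

    # Synthesize the generated sub-prediction videos into one prediction video.
    return np.concatenate(sub_predictions, axis=0)

if __name__ == "__main__":
    # Toy usage: a "model" that just repeats the last observed frame.
    video = np.random.rand(30, 64, 64, 3)
    dummy_model = lambda clip: np.repeat(clip[-1:], len(clip), axis=0)
    prediction = predict_video(video, split_len=10, horizon=25, generate=dummy_model)
    print(prediction.shape)  # (30, 64, 64, 3): three 10-frame sub-predictions
```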
2. The method of claim 1, wherein training the video generation model comprises:
acquiring a historically acquired actual video as a sample video;
splitting the sample video to obtain a plurality of sub-sample videos;
and for any two adjacent sub-sample videos in the sample video, taking the former of the two adjacent sub-sample videos as the input of the video generation model to be trained and the latter as the label, so as to train the video generation model to be trained.
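A minimal sketch of the training scheme in claim 2, assuming a PyTorch setting: the sample video is split into equal-length sub-sample videos, and each sub-sample video is paired with its immediate successor as (input, label). The helper `make_training_pairs`, the Conv3d stand-in model, and the MSE loss are hypothetical; the patent does not specify the model internals, loss, or optimizer.

```python
import torch
import torch.nn as nn

def make_training_pairs(sample_video: torch.Tensor, split_len: int):
    """Split a historical video into sub-sample videos and pair each one with
    its immediate successor: the former is the model input, the latter the label."""
    segments = list(sample_video.split(split_len, dim=0))
    # Drop a trailing partial segment so inputs and labels have equal length.
    segments = [s for s in segments if len(s) == split_len]
    return list(zip(segments[:-1], segments[1:]))

if __name__ == "__main__":
    # Hypothetical stand-in for the video generation model: a single 3D conv
    # that maps a (C, T, H, W) clip to a clip of the same shape.
    model = nn.Conv3d(3, 3, kernel_size=3, padding=1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    sample_video = torch.rand(40, 3, 32, 32)           # (T, C, H, W) frames
    for prev_clip, next_clip in make_training_pairs(sample_video, split_len=8):
        # Conv3d expects (N, C, T, H, W); add batch dim and move channels first.
        x = prev_clip.permute(1, 0, 2, 3).unsqueeze(0)
        y = next_clip.permute(1, 0, 2, 3).unsqueeze(0)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)                     # predict the next sub-video
        loss.backward()
        optimizer.step()
```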
3. The method of claim 1, wherein identifying the target object in the prediction video comprises:
and identifying a dynamic object of a preset type in the prediction video as the target object.
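Claim 3 restricts the target object to dynamic objects of a preset type. A trivial way to express that filtering, assuming an upstream detector that labels each detection with a class name, is sketched below; the class names and the detection dictionary layout are illustrative assumptions, not from the patent.

```python
from typing import Dict, List, Sequence

# Hypothetical set of dynamic object types of interest; the patent only says
# "a dynamic object of a preset type" and does not name the types.
PRESET_DYNAMIC_TYPES = {"pedestrian", "vehicle", "cyclist"}

def select_target_objects(detections: Sequence[Dict]) -> List[Dict]:
    """Keep only detections whose class belongs to the preset dynamic types.

    Each detection is assumed to look like
    {"cls": "pedestrian", "box": (x1, y1, x2, y2), "score": 0.9}.
    """
    return [d for d in detections if d["cls"] in PRESET_DYNAMIC_TYPES]

if __name__ == "__main__":
    frame_detections = [
        {"cls": "pedestrian", "box": (10, 20, 40, 80), "score": 0.91},
        {"cls": "traffic_light", "box": (100, 5, 110, 30), "score": 0.88},
    ]
    print(select_target_objects(frame_detections))  # only the pedestrian remains
```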
4. The method according to claim 1, wherein tracking the identified target object in the prediction video to obtain the motion trajectory of the target object in the prediction video as the predicted future motion trajectory of the target object specifically comprises:
splicing the actual video and the predicted video to obtain a comprehensive video;
tracking the target object in the integrated video;
obtaining the motion track of the target object in the comprehensive video;
and taking the part of the motion trajectory that corresponds to the prediction video as the future motion trajectory of the target object.
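One hedged reading of claim 4: the actual and prediction videos are spliced into a single integrated video, the target object is tracked through it, and only the tail of the resulting trajectory, the part that lies in the prediction video, is kept as the future trajectory. In the sketch below, `detect` is a hypothetical per-frame detector and the nearest-neighbour association is an illustrative stand-in for whatever tracker is actually used.

```python
from typing import Callable, List, Optional, Sequence, Tuple

import numpy as np

Point = Tuple[float, float]

def predicted_trajectory(actual_video: np.ndarray,
                         prediction_video: np.ndarray,
                         detect: Callable[[np.ndarray], Sequence[Point]]
                         ) -> List[Optional[Point]]:
    """Splice the actual and prediction videos, track the target object through
    the combined video, and keep only the part of the trajectory that falls
    inside the prediction video."""
    combined = np.concatenate([actual_video, prediction_video], axis=0)

    trajectory: List[Optional[Point]] = []
    prev: Optional[Point] = None
    for frame in combined:
        candidates = list(detect(frame))          # candidate target positions
        if not candidates:
            trajectory.append(prev)               # no detection: hold last position
            continue
        if prev is None:
            best = candidates[0]
        else:
            # Greedy association: pick the detection closest to the last position.
            best = min(candidates,
                       key=lambda p: (p[0] - prev[0]) ** 2 + (p[1] - prev[1]) ** 2)
        trajectory.append(best)
        prev = best

    # The tail of the combined trajectory corresponds to the prediction video,
    # i.e. the predicted future motion of the target object.
    return trajectory[len(actual_video):]
```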
5. The method of claim 1, wherein, after the identified target object is tracked in the prediction video to obtain the motion trajectory of the target object in the prediction video as the predicted future motion trajectory of the target object, the method further comprises:
and performing coordinate conversion on the motion trajectory according to the pose at which the actual video was collected and the parameters of the collection device that collected the actual video, to obtain the future motion trajectory of the target object in a world coordinate system.
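Claim 5 leaves the conversion details open. One common realization, assuming a pinhole camera model and a target object that moves on a flat ground plane, is to back-project each pixel of the trajectory through the camera intrinsics and the collection pose. The sketch below makes exactly those assumptions; the matrix names and the ground-plane constraint are illustrative, not taken from the patent.

```python
import numpy as np

def pixel_track_to_world(track_px: np.ndarray,
                         K: np.ndarray,
                         R_cam_to_world: np.ndarray,
                         cam_pos_world: np.ndarray) -> np.ndarray:
    """Convert a pixel-space motion trajectory into world coordinates.

    track_px       : (N, 2) pixel positions (u, v) of the target object.
    K              : (3, 3) intrinsic matrix of the collection device.
    R_cam_to_world : (3, 3) rotation of the camera pose at collection time.
    cam_pos_world  : (3,)   camera position in the world frame.

    Assumes the target object moves on the ground plane z = 0; each pixel is
    back-projected along its viewing ray until it hits that plane.
    """
    K_inv = np.linalg.inv(K)
    points_world = []
    for u, v in track_px:
        ray_cam = K_inv @ np.array([u, v, 1.0])    # viewing ray, camera frame
        ray_world = R_cam_to_world @ ray_cam       # viewing ray, world frame
        s = -cam_pos_world[2] / ray_world[2]       # scale to reach z = 0
        points_world.append(cam_pos_world + s * ray_world)
    return np.asarray(points_world)

if __name__ == "__main__":
    K = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
    R = np.array([[1.0, 0.0, 0.0],        # camera looking straight down
                  [0.0, -1.0, 0.0],
                  [0.0, 0.0, -1.0]])
    cam_pos = np.array([0.0, 0.0, 10.0])  # 10 m above the ground
    track = np.array([[320.0, 240.0], [340.0, 240.0]])
    print(pixel_track_to_world(track, K, R, cam_pos))
```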
6. The method of any of claims 1-5, wherein the video generation model comprises an encoding end and a decoding end;
generating a sub-prediction video by adopting a pre-trained video generation model according to at least some of the sub-videos in the actual video specifically comprises:
inputting at least some of the sub-videos in the actual video into the encoding end to obtain video features output by the encoding end;
and inputting the video features into the decoding end to obtain the sub-prediction video output by the decoding end.
7. The method of claim 6, wherein the encoding end comprises a convolutional neural network (CNN), and the decoding end comprises a long short-term memory (LSTM) network.
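A hypothetical PyTorch sketch of the encoder/decoder arrangement in claims 6 and 7: a per-frame CNN encoding end produces video features, and an LSTM decoding end rolls those features forward to emit the frames of a sub-prediction video. All layer sizes, the 64x64 resolution, and the deconvolutional frame head are illustrative assumptions, not details from the patent.

```python
import torch
import torch.nn as nn

class VideoGenerationModel(nn.Module):
    """CNN encoding end + LSTM decoding end, as a minimal sketch."""

    def __init__(self, hidden: int = 256, out_frames: int = 8):
        super().__init__()
        self.out_frames = out_frames
        # Encoding end: per-frame CNN, 64x64x3 -> hidden-dim feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, hidden),
        )
        # Decoding end: LSTM over the feature sequence.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Small head mapping each decoded feature back to a 64x64x3 frame.
        self.to_frame = nn.Sequential(
            nn.Linear(hidden, 64 * 8 * 8),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, sub_video: torch.Tensor) -> torch.Tensor:
        # sub_video: (B, T, 3, 64, 64)
        b, t, c, h, w = sub_video.shape
        feats = self.encoder(sub_video.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, state = self.lstm(feats)                   # summarize the input clip
        # Autoregressively roll the LSTM forward to get future feature vectors.
        step = feats[:, -1:, :]
        frames = []
        for _ in range(self.out_frames):
            step, state = self.lstm(step, state)
            frames.append(self.to_frame(step.squeeze(1)))
        return torch.stack(frames, dim=1)             # (B, out_frames, 3, 64, 64)

if __name__ == "__main__":
    model = VideoGenerationModel()
    clip = torch.rand(2, 8, 3, 64, 64)
    print(model(clip).shape)                          # torch.Size([2, 8, 3, 64, 64])
```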
8. An apparatus for predicting a trajectory of a target object, the apparatus comprising:
the acquisition module is used for acquiring an actual video of an environment where an unmanned device is located;
the splitting module is used for splitting the actual video;
the sub-video determining module is used for determining, from the segments obtained by the splitting, segments whose time spans from the current moment satisfy a preset rule, as sub-videos;
the sub-prediction video generation module is used for generating a sub-prediction video by adopting a pre-trained video generation model according to the determined sub-video;
a sub-video re-determination module for re-determining the sub-prediction video as a sub-video;
the sub-prediction video regeneration module is used for re-generating a sub-prediction video by adopting the video generation model according to the re-determined sub-video, until the sum of the durations of the generated sub-prediction videos is not less than a preset duration;
the synthesis module is used for synthesizing the generated sub-prediction videos into a prediction video;
the identification module is used for identifying a target object in the prediction video;
and the motion trajectory determining module is used for tracking the identified target object in the prediction video to obtain a motion trajectory of the target object in the prediction video, wherein the motion trajectory is used as the predicted future motion trajectory of the target object.
9. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-7.
10. An unmanned device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any of claims 1-7.
CN202010163877.8A 2020-03-11 2020-03-11 Method and device for predicting target object track Pending CN111031351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010163877.8A CN111031351A (en) 2020-03-11 2020-03-11 Method and device for predicting target object track

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010163877.8A CN111031351A (en) 2020-03-11 2020-03-11 Method and device for predicting target object track

Publications (1)

Publication Number Publication Date
CN111031351A true CN111031351A (en) 2020-04-17

Family

ID=70199366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010163877.8A Pending CN111031351A (en) 2020-03-11 2020-03-11 Method and device for predicting target object track

Country Status (1)

Country Link
CN (1) CN111031351A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016156236A1 (en) * 2015-03-31 2016-10-06 Sony Corporation Method and electronic device
CN108875465A (en) * 2017-05-26 2018-11-23 北京旷视科技有限公司 Multi-object tracking method, multiple target tracking device and non-volatile memory medium
CN107492113A (en) * 2017-06-01 2017-12-19 南京行者易智能交通科技有限公司 A kind of moving object in video sequences position prediction model training method, position predicting method and trajectory predictions method
CN108881952A (en) * 2018-07-02 2018-11-23 上海商汤智能科技有限公司 Video generation method and device, electronic equipment and storage medium
CN110719487A (en) * 2018-07-13 2020-01-21 深圳地平线机器人科技有限公司 Video prediction method and device, electronic equipment and vehicle
CN110533013A (en) * 2019-10-30 2019-12-03 图谱未来(南京)人工智能研究院有限公司 A kind of track-detecting method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626155A (en) * 2020-05-14 2020-09-04 新华智云科技有限公司 Basketball position point generation method and equipment
CN111626155B (en) * 2020-05-14 2023-08-01 新华智云科技有限公司 Basketball position point generation method and equipment
CN111988622A (en) * 2020-08-20 2020-11-24 深圳市商汤科技有限公司 Video prediction method and device, electronic equipment and storage medium
CN114500736A (en) * 2020-10-23 2022-05-13 广州汽车集团股份有限公司 Intelligent terminal motion trajectory decision method and system and storage medium
CN114500736B (en) * 2020-10-23 2023-12-05 广州汽车集团股份有限公司 Intelligent terminal motion trail decision method and system and storage medium thereof
CN114863688A (en) * 2022-07-06 2022-08-05 深圳联和智慧科技有限公司 Intelligent positioning method and system for muck vehicle based on unmanned aerial vehicle

Similar Documents

Publication Publication Date Title
CN111031351A (en) Method and device for predicting target object track
CN111079721B (en) Method and device for predicting track of obstacle
CN111190427B (en) Method and device for planning track
CN111010590A (en) Video clipping method and device
US9349066B2 (en) Object tracking and processing
CN112929744B (en) Method, apparatus, device, medium and program product for segmenting video clips
CN102231820B (en) Monitoring image processing method, device and system
US11184558B1 (en) System for automatic video reframing
CN111238523A (en) Method and device for predicting motion trail
US11367199B2 (en) System for automatic object mask and hotspot tracking
CN112465029A (en) Instance tracking method and device
CN108288214A (en) A kind of vehicle service recommendation method, device and equipment
CN110809101A (en) Image zooming processing method and device, electronic equipment and storage medium
CN108663060A (en) It is a kind of navigation processing method and device, a kind of for the device handled that navigate
US11954880B2 (en) Video processing
CN110660103A (en) Unmanned vehicle positioning method and device
CN114170556A (en) Target track tracking method and device, storage medium and electronic equipment
CN111426299B (en) Method and device for ranging based on depth of field of target object
CN111027195B (en) Simulation scene generation method, device and equipment
CN112257638A (en) Image comparison method, system, equipment and computer readable storage medium
CN111192303A (en) Point cloud data processing method and device
CN111914682A (en) Teaching video segmentation method, device and equipment containing presentation file
CN111414818A (en) Positioning method and device based on environment image
CN116129369A (en) Vehicle attribute identification method and related equipment based on video stream tracking fusion
CN112488069B (en) Target searching method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200417