WO2021249114A1 - Target tracking method and target tracking device - Google Patents

Target tracking method and target tracking device

Info

Publication number
WO2021249114A1
Authority
WO
WIPO (PCT)
Prior art keywords
motion
pipeline
video
target object
movement
Prior art date
Application number
PCT/CN2021/093852
Other languages
French (fr)
Chinese (zh)
Inventor
庞博
卢策吾
袁伟
胡翔宇
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021249114A1


Classifications

    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Neural networks; Learning methods
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • H04N7/00 Television systems
    • G06T2207/10016 Video; Image sequence
    • G06T2207/10024 Color image
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30241 Trajectory

Definitions

  • This application relates to the field of image processing technology, and in particular to a target tracking method and target tracking device.
  • Target tracking is one of the most important and fundamental tasks in computer vision. Given a video containing a target object, its goal is to output the position of the target object in every video frame of the video. Typically, a video and the category of the target object to be tracked are input to the computer, and the computer outputs the identifier (ID) of the target object and the position information of the target object in each frame of the video in the form of a detection box.
  • The existing multi-target tracking method consists of detection and tracking: a detection module detects the multiple target objects appearing in each video frame, and the detected target objects are then matched across frames. During matching, a feature is extracted for each target object in a single video frame, targets are matched through feature-similarity comparison, and a tracking trajectory is obtained for each target object.
  • In this approach, the tracking quality depends on the single-frame detection algorithm: if the target object is occluded, the detection fails and the tracking fails with it. The method therefore performs poorly in scenes where targets are dense or occluded.
  • The embodiments of the present application provide a target tracking method for tracking targets in a video, which can reduce tracking errors caused by target occlusion.
  • A first aspect of the embodiments of the present application provides a target tracking method, including: acquiring a first video, where the first video includes a target object; inputting the first video into a pre-trained neural network model to acquire the position information of the target object in at least two video frames and the time information of the at least two video frames; and acquiring a tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, where the tracking trajectory includes the position information of the target object in at least two video frames of the first video.
  • This method obtains the position information of the target object in at least two video frames and the time information of the at least two video frames through a pre-trained neural network model. Target tracking therefore does not depend on the target detection result of a single video frame, which reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance.
  • Obtaining the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: obtaining a motion pipeline of the target object, which is used to indicate the time information and position information of the target object in at least two video frames of the first video, where the first video includes a first video frame and a second video frame. The motion pipeline corresponds to a quadrangular frustum in the space-time dimension, where the space-time dimension includes a time dimension and a two-dimensional space dimension; the position of the first bottom surface of the quadrangular frustum in the time dimension is used to indicate the first time information of the first video frame, and the position of the second bottom surface of the quadrangular frustum in the time dimension is used to indicate the second time information of the second video frame; the position of the first bottom surface of the quadrangular frustum in the two-dimensional space is used to indicate the first position information of the target object in the first video frame, and the position of the second bottom surface of the quadrangular frustum in the two-dimensional space is used to indicate the second position information of the target object in the second video frame.
  • This method obtains the motion pipeline of each video frame through a pre-trained neural network model. Since the motion pipeline includes the position information of the target object in at least two video frames, the position of the target in a video frame can be determined in the space-time dimension by a time in the time dimension and a position in the two-dimensional space: the time determines the video frame, and the position in the two-dimensional space indicates the position of the target within that frame.
  • This method maps the motion pipeline to a quadrangular frustum in the space-time dimension, and the position information of the target in at least two video frames is visually displayed through the quadrangular frustum in the space-time dimension.
  • The target tracking method does not depend on the target detection result of a single video frame, which reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance.
  • Obtaining the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: obtaining a motion pipeline of the target object, where the motion pipeline is used to indicate the position information of the target object in at least three video frames and the time information of the at least three video frames, and the first video includes a first video frame, a second video frame, and a third video frame.
  • The motion pipeline corresponds to a double quadrangular frustum in the space-time dimension; the double quadrangular frustum includes a first quadrangular frustum and a second quadrangular frustum, the first quadrangular frustum includes a first bottom surface and a second bottom surface, the second quadrangular frustum includes the first bottom surface and a third bottom surface, and the first bottom surface is the common bottom surface of the first quadrangular frustum and the second quadrangular frustum.
  • The position of the first bottom surface in the time dimension is used to indicate the first time information of the first video frame, the position of the second bottom surface in the time dimension is used to indicate the second time information of the second video frame, and the position of the third bottom surface in the time dimension is used to indicate the third time information of the third video frame; in the time sequence of the first video, the first video frame is located between the second video frame and the third video frame.
  • The position of the first bottom surface in the two-dimensional space is used to indicate the first position information of the target object in the first video frame, the position of the second bottom surface in the two-dimensional space indicates the second position information of the target object in the second video frame, and the position of the third bottom surface in the two-dimensional space indicates the third position information of the target object in the third video frame; the double quadrangular frustum is used to indicate the position information of the target object in the video frames of the first video between the second video frame and the third video frame.
  • In this case, the motion pipeline includes the position information of the target object in at least three video frames.
  • Relative to the first video frame to which the motion pipeline corresponds, the at least three video frames include the earlier second video frame and the later third video frame in the time sequence of the video, which expands the receptive field in the time dimension and can further improve target tracking performance.
  • The motion pipeline corresponds to a double quadrangular frustum in the space-time dimension, and the position information of the target in at least three video frames is visually displayed through the double quadrangular frustum; specifically, the motion pipeline also includes the position information of the target in all the video frames between its two non-common bottom surfaces.
  • The real tracking trajectory of a target object is usually nonlinear. A motion pipeline with the double-quadrangular-frustum structure can express two movement directions of the target, so it can better fit the real tracking trajectory in scenes where the movement direction changes.
  • In the first aspect, acquiring the tracking trajectory of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: acquiring the tracking trajectory of the target object in the first video according to the motion pipeline.
  • Obtaining the tracking trajectory of the target object in the first video according to the motion pipeline reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance.
  • The tracking trajectory specifically includes: a tracking trajectory of the target object formed by connecting at least two of the motion pipelines, each of which corresponds to a quadrangular frustum in the space-time dimension.
  • Obtaining the tracking trajectory of the target object by connecting motion pipelines does not rely on the target detection result of a single video frame, which reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance.
  • The length of the motion pipeline of a video frame is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames; optionally, the length of the motion pipeline is 4, 6, or 8.
  • The length of the motion pipeline can be a preset value, that is, each motion pipeline corresponds to the same number of video frames and indicates the position change of the target object within a time period of the same length. Compared with not presetting the length of the motion pipeline, this method reduces the amount of calculation of the neural network model and the time consumed by target tracking.
  • The method further includes: obtaining category information of the target object through the pre-trained neural network model; and acquiring the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
  • This method can determine the category information of the target object corresponding to the motion pipeline through the pre-trained neural network model, and obtain the tracking trajectory of the target object based on the category information, position information, and time information.
  • Acquiring, through the pre-trained neural network model, the category information of the target object corresponding to the motion pipeline specifically includes: obtaining, through the pre-trained neural network model, the confidence of the motion pipeline, where the confidence of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
  • With the confidence, this method can distinguish whether a motion pipeline is a real motion pipeline indicating a target position, and can distinguish the category of the target object corresponding to the motion pipeline, as in the sketch below.
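  • As an illustration only (the application does not specify this logic), confidence-based category selection might look like the following sketch; the category list, the threshold value, and the function name are assumptions:

```python
from typing import List, Optional

def category_from_confidence(
    confidences: List[float],   # per-category confidence of one motion pipeline
    categories: List[str],      # preset target object categories (assumed)
    min_conf: float = 0.5,      # illustrative threshold, not from the application
) -> Optional[str]:
    """Return the category of the pipeline's target object, or None when the
    pipeline is judged not to be a real motion pipeline."""
    best = max(range(len(confidences)), key=confidences.__getitem__)
    return categories[best] if confidences[best] >= min_conf else None

# Example: this pipeline is most likely a "person" with confidence 0.83.
print(category_from_confidence([0.83, 0.10], ["person", "vehicle"]))  # person
```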
  • Before acquiring the tracking trajectory of the target object according to the motion pipelines, the method further includes: pruning the motion pipelines to obtain the pruned motion pipelines, where the pruned motion pipelines are used to obtain the tracking trajectory of the target object.
  • This method prunes the motion pipelines of the video frames, removing duplicate motion pipelines and motion pipelines with low confidence, which reduces the amount of calculation in the motion pipeline connection step.
  • Pruning the motion pipelines to obtain the pruned motion pipelines specifically includes: the motion pipelines include a first motion pipeline and a second motion pipeline; if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, the one of the first and second motion pipelines with the lower confidence is deleted. The repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union (IoU) between them; the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object; and the confidence indicates the probability that the category of the target object corresponding to the motion pipeline is the preset category.
  • This describes a specific method of pruning motion pipelines. Motion pipelines whose repetition rate is greater than or equal to the first threshold can be regarded as duplicated data; the one with the lower confidence is deleted, and the one with the higher confidence is retained for pipeline connection, which reduces the amount of calculation in the motion pipeline connection step.
  • Pruning the motion pipelines to obtain the pruned motion pipelines specifically includes: pruning the motion pipelines according to a non-maximum suppression (NMS) algorithm to obtain the pruned motion pipelines.
  • Pruning according to the non-maximum suppression algorithm removes duplicate motion pipelines while retaining, for each target, the motion pipeline with higher confidence, which reduces the amount of calculation in the pipeline connection step and improves target tracking efficiency.
  • The confidence of any one of the pruned motion pipelines is greater than or equal to a second threshold; a sketch of this pruning step follows.
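  • A minimal sketch of the pruning step, assuming a tube_iou helper that returns the space-time IoU of two motion pipelines (one concrete version is sketched after the IoU formula later in this document); the greedy keep-highest-confidence strategy mirrors standard non-maximum suppression, and both threshold values are illustrative:

```python
from typing import Callable, List, Sequence, TypeVar

Tube = TypeVar("Tube")

def suppress_tubes(
    tubes: Sequence[Tube],
    confidences: Sequence[float],
    tube_iou: Callable[[Tube, Tube], float],
    repetition_thr: float = 0.5,   # the "first threshold"; value is illustrative
    min_conf: float = 0.1,         # the "second threshold"; value is illustrative
) -> List[int]:
    """Greedy NMS over motion pipelines: repeatedly keep the remaining
    pipeline with the highest confidence and delete every pipeline whose
    repetition rate (space-time IoU) with it reaches repetition_thr;
    pipelines below min_conf are dropped outright. Returns kept indices."""
    order = [i for i in sorted(range(len(tubes)),
                               key=lambda i: confidences[i], reverse=True)
             if confidences[i] >= min_conf]
    keep: List[int] = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if tube_iou(tubes[best], tubes[i]) < repetition_thr]
    return keep
```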
  • Acquiring the tracking trajectory of the target object according to the motion pipelines specifically includes: connecting a third motion pipeline and a fourth motion pipeline that meet a preset condition among the motion pipelines to obtain the tracking trajectory of the target object. The preset condition includes one or more of the following: the intersection-over-union between the sections of the third and fourth motion pipelines where they overlap in the time dimension is greater than or equal to a third threshold; the cosine of the angle between the movement direction of the third motion pipeline and the movement direction of the fourth motion pipeline is greater than or equal to a fourth threshold, where the movement direction is a vector that indicates, according to a preset rule in the space-time dimension, the position change of the target object in the motion pipeline; and the distance between the neural-network feature vectors of the motion pipelines is less than or equal to a fifth threshold, where the distance includes the Euclidean distance.
  • This provides a specific method for connecting motion pipelines: according to their positions in the space-time dimension, motion pipelines with high overlap and similar movement directions are connected, as sketched below.
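  • The preset condition can be checked with a predicate like the following sketch; all threshold values are placeholders, and requiring all three sub-conditions (rather than "one or more", as the text allows) is a simplification:

```python
import math
from typing import Sequence

def may_connect(
    overlap_iou: float,           # IoU of the two pipelines over their time-overlapping sections
    direction_a: Sequence[float], # movement-direction vector of the third pipeline
    direction_b: Sequence[float], # movement-direction vector of the fourth pipeline
    feature_distance: float,      # Euclidean distance between the pipelines' feature vectors
    iou_thr: float = 0.5,         # "third threshold" (illustrative)
    cos_thr: float = 0.8,         # "fourth threshold" (illustrative)
    dist_thr: float = 1.0,        # "fifth threshold" (illustrative)
) -> bool:
    dot = sum(a * b for a, b in zip(direction_a, direction_b))
    norm = math.hypot(*direction_a) * math.hypot(*direction_b)  # multi-arg hypot: Python 3.8+
    cosine = dot / norm if norm else 0.0
    return (overlap_iou >= iou_thr
            and cosine >= cos_thr
            and feature_distance <= dist_thr)
```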
  • Acquiring the tracking trajectory of the target object according to the motion pipelines specifically includes: grouping the motion pipelines to acquire t groups of motion pipelines, where t is the total number of video frames in the first video, the i-th motion pipeline group in the t groups includes all motion pipelines starting from the i-th video frame of the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, using the motion pipelines in the i-th motion pipeline group as initial tracking trajectories to obtain a tracking trajectory set; and, in the order of the group numbers, connecting the motion pipelines in the i-th motion pipeline group with the tracking trajectories in the tracking trajectory set to obtain at least one tracking trajectory.
  • This provides a specific method for connecting motion pipelines. A motion pipeline corresponds to the position information of the target object in the video frames within a period of time; grouping the motion pipelines according to their initial video frame and connecting each group in turn can improve the efficiency of target tracking, as shown in the sketch below.
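  • A sketch of the grouped connection procedure under stated assumptions: start_frame extracts a pipeline's initial video frame, can_connect is a compatibility predicate such as the one above, and starting a new trajectory for an unmatched pipeline is our own choice, not something the text specifies:

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, TypeVar

Tube = TypeVar("Tube")

def link_grouped_tubes(
    tubes: Sequence[Tube],
    t: int,                                   # total number of video frames
    start_frame: Callable[[Tube], int],       # 1-based initial frame of a pipeline
    can_connect: Callable[[Tube, Tube], bool],
) -> List[List[Tube]]:
    # Group i holds all motion pipelines starting from video frame i.
    groups: Dict[int, List[Tube]] = defaultdict(list)
    for tube in tubes:
        groups[start_frame(tube)].append(tube)
    # Group 1 seeds the tracking-trajectory set.
    trajectories: List[List[Tube]] = [[tube] for tube in groups.get(1, [])]
    # Sweep the remaining groups in frame order, attaching each pipeline to
    # the first compatible trajectory (greedy; an assumption of this sketch).
    for i in range(2, t + 1):
        for tube in groups.get(i, []):
            for trajectory in trajectories:
                if can_connect(trajectory[-1], tube):
                    trajectory.append(tube)
                    break
            else:
                trajectories.append([tube])
    return trajectories
```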
  • The pre-trained neural network model is obtained by training an initial network model, and the method further includes: inputting a first video sample into the initial network model for training and acquiring a target object loss; and updating the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
  • That is, the initial network model can be trained to obtain the neural network model that outputs motion pipelines in the target tracking method.
  • The target object loss specifically includes: an intersection-over-union between a motion pipeline truth value and a motion pipeline predicted value, where the motion pipeline truth value is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into the initial network model.
  • Here, the target loss in the model training process is the intersection-over-union between the motion pipeline truth value and the motion pipeline predicted value; the neural network model obtained by this training indicates the position information of the target object with high accuracy.
  • The target object loss specifically includes: the intersection-over-union between the motion pipeline truth value and the motion pipeline predicted value, and the cross entropy between the confidence of the motion pipeline truth value and the confidence of the motion pipeline predicted value. The motion pipeline truth value is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into the initial network model; the confidence of the motion pipeline truth value is the probability that the target object category corresponding to the truth value belongs to the preset target object category, and the confidence of the motion pipeline predicted value is the probability that the target object category corresponding to the predicted value belongs to the preset target object category.
  • Here, the target loss in the model training process is the intersection-over-union between the motion pipeline truth value and the motion pipeline predicted value together with the cross entropy between their confidences; the neural network model obtained by this training indicates the position information of the target object with high accuracy and can also accurately indicate the category of the target object. A minimal loss sketch follows.
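  • A hedged sketch of such a loss, assuming PyTorch. The application names the two ingredients (the space-time IoU between truth and prediction, and the cross entropy between their confidences); combining them as (1 - IoU) + BCE with equal weights is our assumption:

```python
import torch
import torch.nn.functional as F

def tube_training_loss(
    tube_iou: torch.Tensor,         # space-time IoU between predicted and truth pipelines, in [0, 1]
    pred_confidence: torch.Tensor,  # confidence of the predicted motion pipeline, in [0, 1]
    true_confidence: torch.Tensor,  # confidence of the truth motion pipeline (e.g. 1.0)
) -> torch.Tensor:
    iou_loss = 1.0 - tube_iou       # push predicted pipelines to overlap the truth pipelines
    conf_loss = F.binary_cross_entropy(pred_confidence, true_confidence)
    return iou_loss + conf_loss     # equal weighting is our assumption

# Example: a predicted pipeline overlapping its truth pipeline with IoU 0.7.
loss = tube_training_loss(torch.tensor(0.7), torch.tensor(0.9), torch.tensor(1.0))
```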
  • the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network, and the three-dimensional convolutional neural network includes a three-dimensional residual neural network or a three-dimensional feature pyramid network.
  • the initial network model is obtained by combining a three-dimensional residual neural network and a three-dimensional feature pyramid network.
  • the initial network model in this method can be a three-dimensional convolutional neural network, a recurrent neural network, or a combination of the two.
  • The diversity of neural network model types provides multiple possible ways to realize the scheme; a toy illustration follows.
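  • For illustration, a toy 3D convolutional stack in PyTorch showing how such a backbone convolves over time as well as space; the model described above (a 3D residual network combined with a 3D feature pyramid network) is far larger, and all layer sizes here are arbitrary:

```python
import torch
import torch.nn as nn

# Input layout is (batch, channels, frames, height, width).
backbone = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1),                      # mixes time and space
    nn.BatchNorm3d(16),
    nn.ReLU(inplace=True),
    nn.Conv3d(16, 32, kernel_size=3, stride=(1, 2, 2), padding=1),   # downsamples space only
)

clip = torch.randn(1, 3, 8, 128, 128)   # one 8-frame RGB clip
features = backbone(clip)               # shape: (1, 32, 8, 64, 64)
```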
  • Inputting the first video into a pre-trained neural network model to obtain the motion pipeline of the target object specifically includes: dividing the first video into multiple video clips, and inputting the multiple video clips into the pre-trained neural network model to obtain the motion pipelines. That is, the video can be segmented first and the video clips input into the model; the number of video frames in a video clip is a preset value, for example, 8 frames. A clip-splitting sketch follows.
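  • A minimal sketch of the clip-splitting step; emitting a shorter tail clip when the video length is not a multiple of the clip length is our choice:

```python
from typing import Iterator, Tuple

def split_into_clips(num_frames: int, clip_len: int = 8) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) frame-index ranges of consecutive clips."""
    for start in range(0, num_frames, clip_len):
        yield start, min(start + clip_len, num_frames)

print(list(split_into_clips(20)))   # [(0, 8), (8, 16), (16, 20)]
```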
  • A second aspect of the embodiments of the present application provides a target tracking device, including: an acquisition unit configured to acquire a first video, where the first video includes a target object. The acquisition unit is further configured to input the first video into a pre-trained neural network model to obtain the position information of the target object in at least two video frames and the time information of the at least two video frames; and the acquisition unit is further configured to acquire the tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, where the tracking trajectory includes the position information of the target object in at least two video frames of the first video.
  • The acquisition unit is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate the time information and position information of the target object in at least two video frames of the first video, and the first video includes a first video frame and a second video frame. The motion pipeline corresponds to a quadrangular frustum in the space-time dimension, where the space-time dimension includes a time dimension and a two-dimensional space dimension; the position of the first bottom surface of the quadrangular frustum in the time dimension is used to indicate the first time information of the first video frame, and the position of the second bottom surface of the quadrangular frustum in the time dimension is used to indicate the second time information of the second video frame; the position of the first bottom surface of the quadrangular frustum in the two-dimensional space is used to indicate the first position information of the target object in the first video frame, and the position of the second bottom surface of the quadrangular frustum in the two-dimensional space is used to indicate the second position information of the target object in the second video frame.
  • The acquiring unit is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate the position information of the target object in at least three video frames and the time information of the at least three video frames, and the first video includes a first video frame, a second video frame, and a third video frame. The motion pipeline corresponds to a double quadrangular frustum in the space-time dimension; the double quadrangular frustum includes a first quadrangular frustum and a second quadrangular frustum, the first quadrangular frustum includes a first bottom surface and a second bottom surface, the second quadrangular frustum includes the first bottom surface and a third bottom surface, and the first bottom surface is the common bottom surface of the first quadrangular frustum and the second quadrangular frustum. The position of the first bottom surface in the time dimension is used to indicate the first time information of the first video frame, the position of the second bottom surface in the time dimension is used to indicate the second time information of the second video frame, and the position of the third bottom surface in the time dimension is used to indicate the third time information of the third video frame; in the time sequence of the first video, the first video frame is located between the second video frame and the third video frame. The position of the first bottom surface in the two-dimensional space dimension is used to indicate the first position information of the target object in the first video frame, the position of the second bottom surface in the two-dimensional space dimension indicates the second position information of the target object in the second video frame, and the position of the third bottom surface in the two-dimensional space dimension indicates the third position information of the target object in the third video frame; the double quadrangular frustum is used to indicate the position information of the target object in the video frames of the first video between the second video frame and the third video frame.
  • the acquiring unit is specifically configured to acquire the tracking trajectory of the target object in the first video according to the motion pipeline.
  • The tracking trajectory specifically includes: a tracking trajectory of the target object formed by connecting at least two of the motion pipelines, each of which corresponds to a quadrangular frustum in the space-time dimension.
  • The length of the motion pipeline is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
  • The acquiring unit is further configured to: acquire category information of the target object through the pre-trained neural network model; and acquire the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in at least two video frames, and the time information of the at least two video frames.
  • The acquiring unit is specifically configured to: acquire the confidence of the motion pipeline through the pre-trained neural network model, where the confidence of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
  • The device further includes: a processing unit configured to prune the motion pipelines to obtain the pruned motion pipelines, where the pruned motion pipelines are used to obtain the tracking trajectory of the target object.
  • The motion pipelines include a first motion pipeline and a second motion pipeline; the processing unit is specifically configured to: if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to the first threshold, delete the one of the first and second motion pipelines with the lower confidence. The repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union between them; the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object; and the confidence indicates the probability that the category of the target object corresponding to the motion pipeline is the preset category.
  • The processing unit is specifically configured to prune the motion pipelines according to a non-maximum suppression algorithm to obtain the pruned motion pipelines.
  • The confidence of any one of the pruned motion pipelines is greater than or equal to the second threshold.
  • The acquiring unit is specifically configured to: connect a third motion pipeline and a fourth motion pipeline that meet a preset condition among the motion pipelines to obtain the tracking trajectory of the target object. The preset condition includes one or more of the following: the intersection-over-union between the sections of the third and fourth motion pipelines where they overlap in the time dimension is greater than or equal to a third threshold; the cosine of the angle between the movement direction of the third motion pipeline and the movement direction of the fourth motion pipeline is greater than or equal to a fourth threshold, where the movement direction is a vector indicating, according to a preset rule in the space-time dimension, the position change of the target object in the motion pipeline; and the distance between the neural-network feature vectors of the motion pipelines is less than or equal to a fifth threshold, where the distance includes the Euclidean distance.
  • The obtaining unit is specifically configured to: group the motion pipelines to obtain t groups of motion pipelines, where t is the total number of video frames in the first video, the i-th motion pipeline group in the t groups includes all motion pipelines starting from the i-th video frame of the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, use the motion pipelines in the i-th motion pipeline group as initial tracking trajectories to obtain a tracking trajectory set; and, in the order of the group numbers, connect the motion pipelines in the i-th motion pipeline group with the tracking trajectories in the tracking trajectory set to obtain at least one tracking trajectory.
  • The acquiring unit is specifically configured to: input the first video sample into the initial network model for training and acquire the target object loss; and update the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
  • The target object loss specifically includes: an intersection-over-union between the motion pipeline truth value and the motion pipeline predicted value, where the motion pipeline truth value is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into the initial network model.
  • The target object loss specifically includes: the intersection-over-union between the motion pipeline truth value and the motion pipeline predicted value, and the cross entropy between the confidence of the motion pipeline truth value and the confidence of the motion pipeline predicted value. The motion pipeline truth value is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into the initial network model; the confidence of the motion pipeline truth value is the probability that the target object category corresponding to the truth value belongs to the preset target object category, and the confidence of the motion pipeline predicted value is the probability that the target object category corresponding to the predicted value belongs to the preset target object category.
  • the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
  • The processing unit is further configured to divide the first video into multiple video clips; the acquiring unit is specifically configured to input the multiple video clips into the pre-trained neural network model to obtain the motion pipelines.
  • A third aspect of the embodiments of the present application provides an electronic device, comprising a processor and a memory connected to each other, where the memory is used to store a computer program, the computer program includes program instructions, and the processor is used to call the program instructions to execute the method described in any one of the foregoing first aspect and its possible implementations.
  • A fourth aspect of the embodiments of the present application provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the method described in any one of the foregoing first aspect and its possible implementations.
  • A fifth aspect of the embodiments of the present application provides a computer-readable storage medium, including instructions that, when run on a computer, cause the computer to execute the method described in any one of the foregoing first aspect and its possible implementations.
  • a sixth aspect of the embodiments of the present application provides a chip including a processor.
  • the processor is used to read and execute the computer program stored in the memory to execute the method in any possible implementation manner of any of the foregoing aspects.
  • Optionally, the chip further includes a memory, and the processor is connected to the memory through a circuit or a wire.
  • the chip further includes a communication interface, and the processor is connected to the communication interface.
  • the communication interface is used to receive data and/or information that needs to be processed, and the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface.
  • the communication interface can be an input and output interface.
  • In the embodiments of the present application, the position information of the target object in at least two video frames and the time information of the at least two video frames are obtained through a pre-trained neural network model, and the tracking trajectory of the target object in the first video is determined according to this information. Since the time information of at least two video frames is output by the neural network model, target tracking does not depend on the target detection result of a single video frame, which reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance.
  • The motion pipeline of the target object is obtained through a pre-trained neural network model, and the tracking trajectory of the target object is obtained by connecting motion pipelines. Since the motion pipeline includes the position information of the target object in at least two video frames, target tracking does not depend on the target detection result of a single video frame, which reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance.
  • In the prior art, the detection algorithm relies on a single frame, so the accuracy of the overall algorithm is limited by the detector; training the detection model and the tracking model step by step makes development costly; and dividing the algorithm into two stages also increases the complexity of the machine learning system.
  • The target tracking method provided in the embodiments of the present application enables end-to-end training and completes the detection and tracking of multiple target objects with one neural network model, which reduces model complexity.
  • The features extracted from a single video frame in the prior art are relatively limited. The target tracking method provided in the embodiments of this application takes video as the raw input, so the model can perform the tracking task using various features such as appearance features, motion trajectory features, or gait features, which improves target tracking performance.
  • The target tracking method provided by the embodiments of the present application uses video as the raw input of the model, and the receptive field in the time dimension is increased, which better captures the motion information of the person.
  • FIG. 1 is a schematic diagram of the main framework of artificial intelligence provided by an embodiment of this application.
  • FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of this application.
  • FIG. 3 is a schematic diagram of a convolutional neural network structure provided by an embodiment of this application.
  • FIG. 4 is a schematic diagram of another convolutional neural network structure provided by an embodiment of this application.
  • FIG. 5 is a schematic diagram of an embodiment of a motion pipeline in an embodiment of this application.
  • FIG. 6 is a schematic diagram of splitting a tracking trajectory into motion pipelines in an embodiment of this application.
  • FIG. 7 is a schematic diagram of an embodiment of a motion pipeline in an embodiment of this application.
  • FIG. 8 is a schematic diagram of another embodiment of a motion pipeline in an embodiment of this application.
  • FIG. 9 is a schematic diagram of the intersection and union of motion pipelines in an embodiment of this application.
  • FIG. 10 is a schematic diagram of an embodiment of a target detection method in an embodiment of this application.
  • FIG. 11 is a schematic diagram of an embodiment of matching between motion pipelines in an embodiment of this application.
  • FIG. 12 is a schematic diagram of an embodiment of a neural network model training method in an embodiment of this application.
  • FIG. 13 is a schematic diagram of a tracking trajectory and motion pipelines in an embodiment of this application.
  • FIG. 14 is a schematic diagram of a motion pipeline output by a neural network model in an embodiment of this application.
  • FIG. 15 is a schematic diagram of another embodiment of a target tracking method in an embodiment of this application.
  • FIG. 16 is a schematic diagram of an embodiment of a target tracking device in an embodiment of this application.
  • FIG. 17 is a schematic diagram of another embodiment of a target tracking device in an embodiment of this application.
  • FIG. 18 is a schematic diagram of another embodiment of a target tracking device in an embodiment of this application.
  • FIG. 19 is a schematic diagram of another embodiment of an electronic device in an embodiment of this application.
  • FIG. 20 is a hardware structure diagram of a chip provided by an embodiment of this application.
  • The embodiments of the present application provide a target tracking method for target tracking in a video, which can reduce tracking errors in scenes with dense targets or heavy occlusion.
  • A moving target in a video refers to a target that, taking the world coordinate system of the actual three-dimensional space as the reference, moves relative to the video capture device during shooting. The target itself may or may not be moving; this is not specifically limited here.
  • The image information of the target object may be directly recorded in a video frame, or part of it may be occluded by other objects.
  • the data displayed in this form is defined as data in a space-time dimension in the embodiment of the present application.
  • The position of the target in a video frame can be determined, in the space-time dimension, by a position in the time dimension and a position in the two-dimensional space: the position in the time dimension determines the video frame, and the position in the two-dimensional space indicates the location of the target within that frame.
  • FIG. 5 is a schematic diagram of an embodiment of the motion pipe in the embodiment of the application.
  • Target tracking needs to determine the position information of the target to be tracked (or target for short) in all video frames containing the target object. The target position in each video frame can be identified by a detection box (bounding box). The detection boxes of the same target object in successive video frames are connected to form the trajectory of the target in the space-time region, that is, the tracking trajectory (also called the motion trajectory). The tracking trajectory not only gives the positions of the target object but also connects those positions across different times; it can therefore indicate the temporal and spatial information of the target object at the same time.
  • FIG. 5 illustrates the position information of the target object in only three video frames; for all the video frames of the video, the tracking trajectory can be obtained in the manner described above.
  • the tracking trajectory also includes the identification (ID) of the target object indicated by the tracking trajectory, and the ID of the target object can be used to distinguish trajectories corresponding to different targets.
  • The motion pipeline is used to indicate the position information of the target in at least two video frames and corresponds to a quadrangular frustum in the space-time dimension. The position of the first bottom surface of the quadrangular frustum in the time dimension is used to indicate the first time information of the first video frame, and the position of the second bottom surface in the time dimension is used to indicate the second time information of the second video frame; the position of the first bottom surface in the two-dimensional space is used to indicate the first position information of the target object in the first video frame, and the position of the second bottom surface in the two-dimensional space is used to indicate the second position information of the target object in the second video frame.
  • The motion pipeline can also be used to indicate the position information of the target in at least three different video frames. The following takes a motion pipeline that includes the position information of the target in three different video frames as an example.
  • Such a motion pipeline can be regarded as a double quadrangular frustum composed of two quadrangular frustums sharing a common bottom surface. The three bottom surfaces of the double-frustum structure are parallel to each other; the direction perpendicular to the bottom surfaces is the time dimension, and the directions in which the bottom surfaces extend are the spatial dimensions. Each bottom surface represents the position of the target in the video frame at the moment corresponding to that bottom surface. A motion pipeline with a double-quadrangular-frustum structure includes: a first bottom surface 601, a second bottom surface 602, and a third bottom surface 603.
  • For the first bottom surface 601, namely rectangle abcd, the position in the two-dimensional space where it lies represents the position information of the target object in the first video frame, and the position of rectangle abcd mapped onto the time dimension represents the time information of the first video frame. Similarly, for the second bottom surface 602, namely rectangle ijkm, the position in the two-dimensional space where it lies represents the position information of the target object in the second video frame, and the position of rectangle ijkm mapped onto the time dimension represents the time information of the second video frame. For the third bottom surface 603, namely rectangle efgh, the position in the two-dimensional space where it lies represents the position information of the target object in the third video frame, and the position of rectangle efgh mapped onto the time dimension represents the time information of the third video frame.
  • If rectangle abcd, rectangle efgh, and rectangle ijkm are mapped onto the two-dimensional space of the same bottom surface, their corresponding locations may differ.
  • The positions of the first bottom surface 601, the second bottom surface 602, and the third bottom surface 603 in the time dimension, that is, the positions a', i', and e' of points a, i, and e mapped onto the time dimension, respectively indicate the time information of the first video frame, the second video frame, and the third video frame.
  • The length of the motion pipeline is the interval between the position of the second bottom surface mapped onto the time dimension and the position of the third bottom surface mapped onto the time dimension; it indicates the number of all video frames between the second bottom surface and the third bottom surface in the time sequence of the video.
  • the motion pipeline corresponding to the first video frame includes at least the position information of the target in the first video frame.
  • A tracking trajectory can be split into multiple motion pipelines, as shown in FIG. 6. The tracking trajectory is first split into position boxes of single video frames. Each position box serves as the common bottom surface of a double-quadrangular-frustum structure, like the first bottom surface 601 in FIG. 6, and extends forward and backward along the tracking trajectory to determine the other two bottom surfaces of the double frustum, namely the second bottom surface 602 and the third bottom surface 603, thus obtaining a double quadrangular frustum with a common bottom surface, that is, the motion pipeline corresponding to that single video frame. For the starting video frame of the trajectory, the forward extension is 0; for the last video frame, the backward extension is 0; the motion pipelines corresponding to the starting and last video frames therefore degenerate into single quadrangular frustums. The length of a motion pipeline is defined as the number of video frames corresponding to the motion pipeline; as shown in FIG. 6, the total number of video frames between the frame corresponding to the second bottom surface 602 and the frame corresponding to the third bottom surface 603 is the length of the motion pipeline.
  • The motion pipeline in the embodiment of this application is represented by a specific data format. Refer to FIG. 7 and FIG. 8, which are two schematic diagrams of the data format of the motion pipeline in the embodiment of this application.
  • The first data format includes 3 values in the time dimension, (t_s, t_m, t_e), and 12 values in the space dimension, for a total of 15 values. The location of the target in space is determined by 4 values per video frame; for example, the target location area B_s is determined by 4 such values.
  • The motion pipeline output by the neural network model can be represented in a second data format, defined for the motion pipeline of video frame m. B_m is the detection box corresponding to the target in the common bottom surface, that is, a partial image region of video frame m at the corresponding time, and P is a pixel in B_m whose location is identified by two values. In the time dimension, two values, d_s and d_e, determine how far the motion pipeline extends backward to its start and forward to its end, respectively.
  • The four values l_m, b_m, t_m, and r_m indicate the offsets of the boundary of region B_m relative to point P, with P as the reference point (regress values for B_m).
  • The four values l_s, b_s, t_s, and r_s indicate the offsets of the boundary of region B_s relative to the boundary of region B_m (regress values for B_s); similarly, the four values l_e, b_e, t_e, and r_e indicate the offsets of the boundary of region B_e relative to the boundary of region B_m (regress values for B_e).
  • Both data formats represent a single motion pipeline with 15 values, and the two data formats can be converted into each other, as sketched below.
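  • The first data format maps directly onto a small container; the decoding of the second (network-output) format below follows the offsets described above, but the exact sign and axis conventions are assumptions of this sketch:

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]      # (x_min, y_min, x_max, y_max)
Offsets = Tuple[float, float, float, float]  # (l, b, t, r)

@dataclass
class MotionTube:
    """First data format: 3 time values plus 3 boxes x 4 values = 15 values."""
    t_s: int   # start frame
    t_m: int   # middle frame (common bottom surface)
    t_e: int   # end frame
    b_s: Box   # target box at t_s
    b_m: Box   # target box at t_m
    b_e: Box   # target box at t_e

def decode_second_format(m: int, p: Tuple[float, float], d_s: int, d_e: int,
                         reg_m: Offsets, reg_s: Offsets, reg_e: Offsets) -> MotionTube:
    """Decode the network-output format into the first format. The image
    y-axis is taken to grow downward and (l, b, t, r) to mean distances to
    the left/bottom/top/right boundaries; these conventions are assumptions."""
    px, py = p
    l, b, t, r = reg_m
    b_m: Box = (px - l, py - t, px + r, py + b)

    def shifted(box: Box, off: Offsets) -> Box:
        # Offsets of the B_s / B_e boundaries relative to the B_m boundaries.
        dl, db, dt, dr = off
        x0, y0, x1, y1 = box
        return (x0 + dl, y0 + dt, x1 + dr, y1 + db)

    return MotionTube(m - d_s, m, m + d_e,
                      shifted(b_m, reg_s), b_m, shifted(b_m, reg_e))
```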
  • Intersection over union (IoU) is usually used in object detection to measure the degree of overlap between two locations. In the embodiment of this application, IoU is extended to the three-dimensional space of the space-time dimension to measure the degree of overlap of two motion pipelines in the space-time dimension:
  • IoU(T^(1), T^(2)) = ∩(T^(1), T^(2)) / ∪(T^(1), T^(2))
  • where T^(1) denotes motion pipeline 1, T^(2) denotes motion pipeline 2, ∩(T^(1), T^(2)) denotes the intersection of the two motion pipelines, and ∪(T^(1), T^(2)) denotes the union of the two motion pipelines.
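  • One straightforward way to evaluate this ratio, under the assumption that a motion pipeline is discretised into a per-frame box for every video frame it covers (the intersection and union "volumes" are then sums of box areas over frames):

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
Tube = Dict[int, Box]                     # frame index -> target box in that frame

def tube_iou(t1: Tube, t2: Tube) -> float:
    """IoU(T1, T2) = intersection volume / union volume in space-time,
    accumulated frame by frame."""
    inter = union = 0.0
    for frame in set(t1) | set(t2):
        a, b = t1.get(frame), t2.get(frame)
        if a is not None and b is not None:
            iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
            ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
            i = iw * ih
            area_a = (a[2] - a[0]) * (a[3] - a[1])
            area_b = (b[2] - b[0]) * (b[3] - b[1])
            inter += i
            union += area_a + area_b - i
        else:
            box = a if a is not None else b
            union += (box[2] - box[0]) * (box[3] - box[1])
    return inter / union if union > 0.0 else 0.0

# Two pipelines sharing frame 1 with half-overlapping unit boxes.
t_a: Tube = {0: (0, 0, 1, 1), 1: (0, 0, 1, 1)}
t_b: Tube = {1: (0.5, 0, 1.5, 1), 2: (1, 0, 2, 1)}
print(tube_iou(t_a, t_b))   # 0.5 / 3.5 = 0.1428...
```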
  • FIG. 1 shows a schematic diagram of the main framework of artificial intelligence, which describes the overall workflow of an artificial intelligence system and is applicable to general requirements in the artificial intelligence field.
  • The "intelligent information chain" reflects a series of processes from data acquisition to processing: for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes the condensation process of "data - information - knowledge - wisdom".
  • the "IT value chain” is the industrial ecological process from the underlying infrastructure of human intelligence and information (providing and processing technology realization) to the system, reflecting the value that artificial intelligence brings to the information technology industry.
  • The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through a basic platform. The basic platform includes distributed computing frameworks, networks, and related platform guarantees and support, which can include cloud storage and computing, interconnection networks, and so on. Sensors communicate with the outside to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • The data at the layer above the infrastructure represents the data sources in the field of artificial intelligence. The data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, training, etc.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to perform machine thinking and solve problems according to reasoning control strategies; the typical functions are search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, and usually provides functions such as classification, ranking, and prediction.
  • Based on the results of data processing, some general capabilities can be formed, such as algorithms or general systems, for example translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productizing intelligent information decision-making and realizing practical applications. The main application fields include intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, autonomous driving, safe city, and smart terminals.
  • the motion pipeline of the target object is obtained through a deep neural network.
  • The embodiment of this application provides a system architecture 200.
  • the data collection device 260 is used to collect the video data of the moving target and store it in the database 230.
  • the training device 220 generates a target model/rule 201 based on the video samples containing the moving target maintained in the database 230.
  • the following will describe in more detail how the training device 220 obtains the target model/rule 201 based on the video samples of the moving target.
  • the target model/rule 201 can be used in application scenarios such as single target tracking, multiple target tracking, and virtual reality.
  • training may be performed based on video samples of the moving target.
  • various video samples containing the moving target may be collected by the data collection device 260 and stored in the database 230.
  • video data can be obtained directly from commonly used databases.
  • the target model/rule 201 may be obtained based on a deep neural network, and the deep neural network will be introduced below.
  • The work of each layer in the deep neural network can be described by the mathematical expression y = a(W·x + b). At the physical level, the work of each layer can be understood as completing the transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors). These five operations are: 1. raising/lowering the dimension; 2. enlarging/reducing; 3. rotating; 4. translating; 5. "bending". Operations 1, 2, and 3 are completed by W·x, operation 4 is completed by +b, and operation 5 is realized by a(). The word "space" is used here because the object to be classified is not a single thing but a class of things, and space refers to the collection of all individuals of this class of things.
  • W is a weight vector, and each value in the vector represents the weight value of a neuron in the layer of neural network.
  • This vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed.
  • The purpose of training a deep neural network is ultimately to obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
  • During training, the predicted value of the network is compared with the truly desired target value, and the weight vectors of each layer of the network are updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the neural network can predict the truly desired target value. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value".
  • This leads to the loss function (loss function) or objective function (objective function), which are important equations used to measure the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
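  • As a minimal sketch of the above (NumPy is used purely for illustration; the tanh activation and the mean-squared-error loss are illustrative choices, not the ones prescribed by this application):

```python
import numpy as np

def layer(x, W, b):
    """One layer y = a(W*x + b): W*x performs the dimension change, scaling and
    rotation; +b performs the translation; tanh() performs the "bending"."""
    return np.tanh(W @ x + b)

def loss(pred, target):
    """The higher the output, the greater the predicted-vs-target difference."""
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # input vector
W = rng.normal(size=(4, 3))      # weight matrix: 3-dim input -> 4-dim output
b = rng.normal(size=4)           # bias vector
y = layer(x, W, b)
print(loss(y, np.zeros(4)))      # training would adjust W and b to reduce this
```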
  • the target model/rule obtained by the training device 220 can be applied to different systems or devices.
  • the execution device 210 is configured with an I/O interface 212 to perform data interaction with external devices.
  • the "user" can input data to the I/O interface 212 through the client device 240.
  • the execution device 210 can call data, codes, etc. in the data storage system 250, and can also store data, instructions, etc. in the data storage system 250.
  • the calculation module 211 uses the target model/rule 201 to process the input data. Taking target tracking as an example, the calculation module 211 can analyze the input video to obtain features indicating target location information in the video frame.
  • The correlation function modules 213 and 214 may preprocess the image data in the calculation module 211, for example, perform video preprocessing, including video segmentation.
  • the I/O interface 212 returns the processing result to the client device 240 and provides it to the user.
  • the training device 220 can generate corresponding target models/rules 201 based on different data for different targets, so as to provide users with better results.
  • The user can manually specify the data input into the execution device 210, for example, by operating in the interface provided by the I/O interface 212.
  • the client device 240 can automatically input data to the I/O interface 212 and obtain the result. If the client device 240 automatically inputs data and needs the user's authorization, the user can set the corresponding authority in the client device 240.
  • the user can view the result output by the execution device 210 on the client device 240, and the specific presentation form may be a specific manner such as display, sound, and action.
  • the client device 240 can also serve as a data collection terminal to store the collected training data in the database 230.
  • Fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 250 is an external memory relative to the execution device 210. In other cases, the data storage system 250 may also be placed in the execution device 210.
  • the deep neural network used to extract the motion pipeline from the video in the embodiment of the application may be a convolutional neural network (convolutional neural network, CNN), for example.
  • CNN is a deep neural network with a convolutional structure. It is a deep learning architecture.
  • the deep learning architecture refers to the use of machine learning algorithms to perform multiple levels of learning at different levels of abstraction.
  • CNN is a feed-forward artificial neural network. Taking image processing as an example, each neuron in the feed-forward artificial neural network responds to overlapping areas in the image input into it.
  • The deep neural network may also be of another type; this application does not limit the type of the deep neural network.
  • a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120, where the pooling layer is optional, and a neural network layer 130.
  • The convolutional layer/pooling layer 120 may include layers 121-126 as shown in the example.
  • In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, and layer 124 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a convolutional layer.
  • That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolutional layer 121 can include many convolution operators.
  • the convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator can be a weight matrix, which is usually predefined. In the process of performing convolution on an image, the weight matrix usually moves along the horizontal direction of the input image one pixel after another (or two pixels after two pixels, depending on the value of the stride) to complete the work of extracting specific features from the image.
  • the convolution kernel also has multiple formats.
  • Commonly used convolution kernels include two-dimensional convolution kernels and three-dimensional convolution kernels. Two-dimensional convolution kernels are mainly used to process two-dimensional image data, while three-dimensional convolution kernels can be applied to video processing, stereoscopic image processing, etc. due to the increased depth or time dimension.
  • To extract the information in the time dimension and the space dimension of a video through the neural network model, a three-dimensional convolution kernel is used to perform the convolution operation in the time dimension and the space dimension simultaneously.
  • Therefore, the three-dimensional convolutional neural network can not only obtain the features of each video frame, but also express the association and change of video frames over time.
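  • The following sketch illustrates this (PyTorch is an assumed framework here, not one named by this application): a three-dimensional convolution slides over time and space at once, so the per-frame output features already mix information from neighbouring frames.

```python
import torch
import torch.nn as nn

video = torch.randn(1, 3, 8, 64, 64)   # (batch, RGB channels, t=8 frames, h, w)
conv3d = nn.Conv3d(in_channels=3, out_channels=16,
                   kernel_size=(3, 3, 3), padding=1)  # 3 frames x 3x3 pixels
features = conv3d(video)
print(features.shape)  # torch.Size([1, 16, 8, 64, 64]): per-frame features
                       # that already encode change across adjacent frames
```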
  • The initial convolutional layers (such as 121) often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network increases, the features extracted by the later convolutional layers (for example, 126) become more and more complex, such as high-level semantic features, and features with higher semantics are more suitable for the problem to be solved.
  • multiple convolutional layers can be referred to as a block.
  • Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. That is, in the layers 121-126 illustrated by 120 in Figure 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • The sole purpose of the pooling layer is to reduce the spatial size of the image.
  • the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 3) and an output layer 140.
  • The parameters contained in the multiple hidden layers can be obtained through pre-training based on relevant training data of specific task types.
  • the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the neural network layer 130 comes the output layer 140; that is, the final layer of the entire convolutional neural network 100 is the output layer 140.
  • the convolutional neural network 100 shown in FIG. 3 is only used as an example of a convolutional neural network.
  • The convolutional neural network may also exist in the form of other network models, for example, the network shown in FIG. 4, in which multiple convolutional layers/pooling layers are in parallel and the respectively extracted features are all input to the neural network layer 130 for processing.
  • the deep neural network used to extract the motion pipeline from the video in the embodiment of the present application is a combination of a residual neural network and a feature pyramid network.
  • the residual neural network makes the deeper network easier to train by letting the deep network learn the residual representation.
  • Residual learning solves the problems of gradient disappearance and gradient explosion in deep networks.
  • the feature pyramid network detects targets of corresponding scales on feature maps of different resolutions. The output of each layer is obtained by fusing the feature maps of the current layer and higher layers, so each layer of feature maps output has sufficient feature expression ability.
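  • An illustrative residual block (a sketch in PyTorch under the same assumption as above; channel counts are arbitrary): the identity shortcut lets the block learn a residual F(x) and output F(x) + x, which is what makes deeper networks easier to train.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Residual block: output = ReLU(x + F(x)); gradients can flow through the
    identity shortcut, mitigating gradient disappearance in deep networks."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # the learned residual F(x)
        return self.relu(x + f)                   # identity shortcut

out = ResidualBlock3D(16)(torch.randn(1, 16, 8, 32, 32))  # shape preserved
```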
  • The target tracking method provided by the embodiments of the application can be applied in a wide range of target tracking scenarios, such as auto-focus during video shooting.
  • The target tracking algorithm can help the photographer select the focus more conveniently and accurately, or flexibly switch the focus to track the target, which is especially important in shooting sports events and wild animals.
  • In the field of security, the multi-target tracking algorithm can automatically complete the position tracking of selected target objects to facilitate the search for established targets, which is of great significance.
  • In the field of automatic driving, the multi-target tracking algorithm can grasp the trajectories and movement trends of surrounding pedestrians and vehicles, and provide initial information for functions such as automatic driving path planning and automatic obstacle avoidance.
  • somatosensory games, gesture recognition, and finger tracking can also be achieved through multi-target tracking technology.
  • the usual target tracking method includes detection and tracking.
  • The detection module detects the targets appearing in each video frame, and the targets appearing in different video frames are then matched: the features of each target in a single video frame are extracted, the targets are matched through similarity comparison of the features, and the tracking trajectory of each target object is obtained. Because this type of target tracking method uses the technical means of detecting first and tracking second, the target tracking effect depends on the single-frame detection algorithm; if the target is occluded, detection errors occur, which in turn lead to tracking errors. Therefore, the performance is insufficient in scenes where targets are dense or occlusions are frequent.
  • The embodiment of the application adopts a target tracking method that inputs a video into a pre-trained neural network model, outputs multiple motion pipelines, and restores the tracking trajectories corresponding to one or more target objects by matching the multiple motion pipelines.
  • In this way, target tracking does not depend on the target detection result of a single video frame, which can reduce detection failures in scenes with dense targets or frequent occlusions and improve target tracking performance.
  • Conventional target tracking methods rely on single-frame detection algorithms: the accuracy of the overall algorithm is limited by the detector, the development cost of training detection models and tracking models step by step is high, and dividing the algorithm into two stages also increases the complexity of the machine learning system.
  • the target tracking method provided in the embodiments of the present application can realize end-to-end training, and complete the detection and tracking tasks of multi-target objects through a neural network model, which can reduce the complexity of the model.
  • In the prior art, the features extracted based on a single video frame are relatively limited.
  • The target tracking method provided in the embodiments of the present application uses video as the original input, and the model can realize the tracking task through various features, such as appearance features, motion trajectory features, or gait features, which can improve target tracking performance.
  • In addition, because the target tracking method provided by the embodiment of the present application uses video as the original input of the model, the receptive field in the time dimension is increased, which can better capture the movement information of the person.
  • FIG. 10 is a schematic diagram of an embodiment of the target tracking method in the embodiment of the present application.
  • the target tracking device can preprocess the acquired video.
  • the preprocessing includes one or more of the following: dividing the video into segments of preset length, adjusting the video resolution, and adjusting and normalizing the color space .
  • When the video is long, considering the data processing capability of the target tracking device, the video may be divided into 8 small segments.
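  • A minimal preprocessing sketch (OpenCV/NumPy are assumed tools; the segment length and target resolution are illustrative values, not ones fixed by this application):

```python
import cv2
import numpy as np

def preprocess(frames, seg_len=8, size=(256, 256)):
    """frames: list of HxWx3 uint8 RGB arrays -> list of normalized segments."""
    resized = [cv2.resize(f, size) for f in frames]        # adjust resolution
    arr = np.stack(resized).astype(np.float32) / 255.0     # normalize colors
    n = (len(arr) // seg_len) * seg_len                    # drop incomplete tail
    return np.split(arr[:n], n // seg_len) if n else []    # fixed-length segments
```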
  • step 1001 is an optional step and may or may not be executed.
  • the video is input to the pre-trained neural network model, and the position information of the target object in the at least two video frames and the time information of the at least two video frames are obtained.
  • the video is input to a pre-trained neural network model to obtain the motion pipeline of each target object.
  • the motion pipeline is used to indicate the time information and position information of the target object in at least two video frames of the first video.
  • The data format of the output motion pipeline is the type shown in Figure 8: the input video is a tensor in R^(t×h×w×3), where h×w represents the video resolution and 3 represents the RGB color channels, and the output is the motion pipeline tensor O, O ∈ R^(t×h'×w'×15), where R represents the real number domain, t represents the number of frames of the video, and h'×w' represents the resolution of the feature map output by the neural network. That is, t×h'×w' motion pipelines are output, and each video frame corresponds to h'×w' motion pipelines.
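  • The shapes involved can be illustrated as follows (a NumPy sketch; only the tensor shapes come from the description above, while the meaning of the 15 coordinates follows the double-frustum encoding of Figure 8):

```python
import numpy as np

t, h_p, w_p = 8, 16, 16                  # illustrative t, h', w'
O = np.random.rand(t, h_p, w_p, 15)      # stand-in for the network output

tubes = O.reshape(-1, 15)                # t*h'*w' candidate motion pipelines
per_frame = O[3].reshape(-1, 15)         # the h'*w' pipelines of frame 3
print(tubes.shape, per_frame.shape)      # (2048, 15) (256, 15)
```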
  • the pre-trained neural network model is used to obtain the category information of the target object; specifically, the pre-trained neural network model is used to obtain the confidence level of the motion pipeline, which can be used to determine The category information of the target object corresponding to the motion pipeline.
  • each motion pipeline corresponds to a target to be tracked
  • the confidence of the motion pipeline refers to the possibility that the target corresponding to each motion pipeline belongs to the preset category.
  • The category of the target object to be tracked in the video is, for example, a person, a vehicle, or a dog.
  • The confidence of the output motion pipeline represents the probability that the target corresponding to the motion pipeline belongs to a preset category, and the confidence is a value between 0 and 1: the smaller the confidence, the less likely the target belongs to the preset category, and the larger the confidence, the more likely it belongs to the preset category.
  • The number of confidence levels of each motion pipeline is equal to the number of preset target object categories, and each confidence level corresponds to the possibility that the motion pipeline belongs to that category.
  • the confidence of the motion pipeline output by the neural network model constitutes the confidence table.
  • Example 1: The preset category of the target object is "person". Since there is only one preset category, there are two possibilities for the target object category: "person" or "background".
  • The background refers to the image area that does not contain the target object to be tracked.
  • Suppose the confidence levels of the first motion pipeline for the categories "person" and "background" are 0.1 and 0.9, respectively, and those of the second motion pipeline are 0.7 and 0.3.
  • The confidence threshold can be set to 0.5. The confidence that the target object corresponding to the first motion pipeline belongs to "person" is 0.1, which is less than or equal to 0.5, meaning this target has a low probability of being a person, while its "background" confidence of 0.9 is greater than 0.5, that is, it more likely belongs to the background. The confidence that the target object corresponding to the second motion pipeline belongs to "person" is 0.7, which is greater than 0.5, meaning this target has a high probability of being a person, while its "background" confidence of 0.3 is less than 0.5, so it is less likely to belong to the background.
  • Example 2: The preset categories of the target object are "person", "vehicle", and "background". The confidence levels of the first motion pipeline are 0.4, 0.1, and 0.2, and those of the second motion pipeline are 0.2, 0.8, and 0.1. There are three possibilities for the category of the target object ("person", "vehicle", or "background"), so 1/3 ≈ 0.33 can be used as the confidence threshold. Since 0.4 is greater than 0.33, the category with the highest confidence for the first motion pipeline is "person", that is, the corresponding target object has a high probability of being a person. Similarly, the category with the highest confidence for the second motion pipeline is "vehicle", that is, the corresponding target object has a high probability of being a vehicle.
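  • The category decision in Examples 1 and 2 can be sketched as follows (plain Python; the function and label names are illustrative, not part of this application):

```python
def classify(confidences, classes, threshold):
    """Return the class with the highest confidence if it clears the threshold."""
    best = max(range(len(classes)), key=lambda i: confidences[i])
    return classes[best] if confidences[best] >= threshold else None

labels = ["person", "vehicle", "background"]
print(classify([0.4, 0.1, 0.2], labels, 1 / 3))  # -> "person"  (Example 2)
print(classify([0.2, 0.8, 0.1], labels, 1 / 3))  # -> "vehicle" (Example 2)
```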
  • Before the tracking trajectory of the target object is acquired according to the motion pipelines, the motion pipelines can also be deleted to obtain the deleted motion pipelines, and the deleted motion pipelines are used to obtain the tracking trajectory of the target object.
  • the multiple motion pipelines output by the neural network model can be deleted according to preset conditions.
  • Each pixel in each video frame corresponds to a motion pipeline, and a target appearing in a video frame usually occupies multiple pixel positions, so multiple motion pipelines may indicate the same target object.
  • The category to which the target corresponding to each motion pipeline belongs can be determined according to the confidence level, and the motion pipelines of each category are deleted separately.
  • Obtaining the deleted motion pipelines specifically includes: if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, deleting the motion pipeline with the lower confidence among the first motion pipeline and the second motion pipeline.
  • the repetition rate of the motion pipeline may be the IoU between the two motion pipelines.
  • the first threshold value ranges from 0.3 to 0.7.
  • For example, the first threshold is 0.5: if the IoU between the first motion pipeline and the second motion pipeline is greater than or equal to 50%, the motion pipeline with the lower confidence is deleted.
  • The motion pipelines can be deleted according to a non-maximum suppression (NMS) algorithm to obtain the deleted motion pipelines. For example, the IoU threshold of the motion pipelines is set to 0.5, the NMS algorithm is used to delete motion pipelines, and only one corresponding motion pipeline is reserved for each target in each video frame.
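  • A minimal NMS sketch over motion pipelines (plain Python; tube_iou is an assumed helper computing the spatio-temporal IoU between two pipelines):

```python
def nms_tubes(tubes, scores, tube_iou, iou_thresh=0.5):
    """Keep the highest-confidence pipeline, drop those overlapping it by more
    than iou_thresh, and repeat; returns indices of the retained pipelines."""
    order = sorted(range(len(tubes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if tube_iou(tubes[best], tubes[i]) < iou_thresh]
    return keep
```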
  • Each pixel in each video frame corresponds to a motion pipeline, and the pixel positions in background areas that do not correspond to a target object also correspond to some motion pipelines.
  • These motion pipelines can be understood as fake motion pipelines, and their confidence is usually low.
  • Therefore, motion pipelines with low confidence can be deleted.
  • The confidence of any one of the motion pipelines remaining after deletion is greater than or equal to a second threshold; that is, the preset condition is that motion pipelines whose confidence is less than the second threshold are deleted, and the second threshold is related to the number of preset categories of the target object.
  • For example, if the number of categories of the target object is 1, the second threshold is usually between 0.3 and 0.7, for example, 0.5; if the number of categories of the target object is 10, the second threshold is usually between 0.07 and 0.13, for example, 0.1.
  • step 1003 is an optional step, which may or may not be performed.
  • Acquire the tracking trajectory of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames.
  • A motion pipeline is used to indicate the position information of the target object in the at least two video frames and the time information of the at least two video frames; therefore, the tracking trajectory of the target object in the first video can be obtained according to the motion pipelines, that is, based on the position information and time information of the at least two video frames indicated by the motion pipelines.
  • Each tracking trajectory of the target object is formed by connecting, in the space-time dimension, the quadrangular frustums corresponding to the motion pipelines.
  • Obtaining the tracking trajectory of the target object according to the motion pipeline specifically includes: connecting a third motion pipeline and a fourth motion pipeline that meet a preset condition in the motion pipeline to obtain the tracking trajectory of the target object.
  • the specific content of the preset condition includes multiple types.
  • The preset condition includes one or more of the following: the intersection-over-union between the sections of the third motion pipeline and the fourth motion pipeline that overlap in the time dimension is greater than or equal to a third threshold; the cosine of the angle between the movement direction of the third motion pipeline and the movement direction of the fourth motion pipeline is greater than or equal to a fourth threshold, where the movement direction is a vector indicating, according to preset rules, the position change of the target object in the motion pipeline in the space-time dimension; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to a fifth threshold, where the distance includes the Euclidean distance.
  • the intersection ratio between the two motion pipelines corresponding to the overlapping parts of the time dimension is greater than or equal to the third threshold
  • the cosine of the angle between the motion directions of the motion pipelines is greater than or equal to the fourth threshold
  • the distance index between the neural network feature vectors of the motion pipeline is less than or equal to the fifth threshold
  • the distance index may be, for example, Euclidean distance.
  • the neural network feature vector of the motion pipeline can be the output feature vector of any layer in the neural network model.
  • For example, the neural network feature vector of the motion pipeline is the output feature vector of the last layer of the three-dimensional (3D) convolutional neural network in the neural network model.
  • The movement direction of the motion pipeline is a vector indicating the position change of the target object between the two bottom surfaces of the motion pipeline in the space-time dimension, representing the moving speed and direction of the target object. It can be understood that the position change of the target object in a video is usually continuous and does not change suddenly; therefore, the movement directions of adjacent motion pipeline sections in a tracking trajectory are relatively close, and during the connection of motion pipelines, the connection can also be made according to the similarity of the movement directions. It should be noted that the movement direction of the motion pipeline can be determined according to preset rules.
  • For example, in the space-time dimension, the vector of the position change of the target object between the two bottom surfaces of the motion pipeline that are farthest apart in the time dimension (for example, Bs and Be of the motion pipeline shown in Figure 8) can be set as the movement direction of the motion pipeline; or the vector of the position change of the target object between two adjacent bottom surfaces of the motion pipeline (for example, Bm and Be of the motion pipeline shown in Figure 8) can be set as the movement direction; or the direction of the position change of the target object between a preset number of video frames can be set as the movement direction of the motion pipeline, where the preset number is, for example, 5 frames.
  • The direction of the tracking trajectory can be defined as follows: at the end of the trajectory, the direction of the position change of the target object between a preset number of video frames is taken as the movement direction, or the movement direction of the last motion pipeline at the end of the trajectory is used. It is understandable that the movement direction of a motion pipeline is generally defined as pointing from an earlier moment to a later moment in the time dimension.
  • the value of the third threshold is not limited, usually 70% to 95%, such as 75%, 80%, 85% or 90%, etc.
  • The value of the fourth threshold is not limited, and is usually between cos(π/6) and cos(π/36), for example, cos(π/9), cos(π/12), or cos(π/18).
  • the value of the fifth threshold can be determined according to the size of the feature vector, and the specific value is not limited.
  • The following takes as an example the preset conditions that the intersection-over-union between the sections of the two motion pipelines overlapping in the time dimension is greater than or equal to the third threshold and that the cosine of the angle between the movement directions of the motion pipelines is greater than or equal to the fourth threshold.
  • FIG. 11 Please refer to FIG. 11 for a schematic diagram of an embodiment of the matching between the motion pipes in the embodiment of the application.
  • Example 1: As shown in part a of Fig. 11, if the intersection-over-union between the motion pipeline sections that overlap in the time dimension is greater than or equal to the third threshold, and the cosine of the angle between the movement directions of the two motion pipelines is greater than or equal to the fourth threshold, that is, both the degree of coincidence and the movement direction match, then the two motion pipelines are matched successfully.
  • the degree of coincidence between two motion pipes refers to the IoU between the motion pipe sections of the overlapping portion of the two motion pipes in the time dimension.
  • Example 2: As shown in part b of Fig. 11, if the cosine of the angle between the movement directions of the two motion pipelines is less than the fourth threshold, that is, the movement directions do not match, the matching of the two motion pipelines is unsuccessful.
  • Example 3: As shown in part c of Fig. 11, if the intersection-over-union between the motion pipeline sections overlapping in the time dimension is less than the third threshold, that is, the degree of coincidence does not match, the matching of the two motion pipelines is unsuccessful.
  • When the two motion pipelines being matched overlap in the time dimension, there are two pieces of position information for the same target object in the video frames corresponding to the overlapping part. The position of the target object in these video frames can be determined by averaging the two, or a certain motion pipeline specified according to a preset rule shall prevail, for example, taking the time dimension coordinates of the video frame corresponding to the common bottom surface as the standard.
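  • The two matching tests illustrated in Figure 11 can be sketched as follows (NumPy assumed; the pipeline representation and the overlap_iou helper are illustrative assumptions, not structures defined by this application):

```python
import numpy as np

def directions_match(d1, d2, cos_thresh=np.cos(np.pi / 12)):
    """Cosine of the angle between two movement-direction vectors."""
    cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
    return cos >= cos_thresh

def tubes_match(tube_a, tube_b, overlap_iou, iou_thresh=0.8):
    """overlap_iou is an assumed helper returning the IoU of the sections where
    the two pipelines overlap in the time dimension (the third-threshold test)."""
    return (overlap_iou(tube_a, tube_b) >= iou_thresh            # coincidence test
            and directions_match(tube_a["dir"], tube_b["dir"]))  # direction test
```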
  • In the matching process of connecting all the motion pipelines of the video, the greedy algorithm can be used to connect through a series of locally optimal choices; the Hungarian algorithm can also be used for globally optimal matching.
  • Connecting motion pipelines according to the greedy algorithm specifically includes: calculating the affinity between the two sets of motion pipelines to be matched (the affinity is defined as IoU × cos(θ), where θ is the angle between the movement directions) to form the affinity matrix.
  • Based on the affinity matrix, the matching motion pipeline pairs (Btube pairs) are selected cyclically starting from the maximum affinity until the matching is completed.
  • Connecting motion pipelines according to the Hungarian algorithm specifically includes: likewise, after obtaining the affinity matrix, using the Hungarian algorithm to select the pairs of motion pipelines.
  • The motion pipelines starting from the i-th frame are sequentially connected with the tracking trajectory set, where i is a positive integer greater than 2 and less than t, and t is the total number of frames of the video. If the preset conditions are met, the matching succeeds and the tracking trajectory is updated according to the motion pipeline; if the matching is unsuccessful, the motion pipeline is newly added to the tracking trajectory set as an initial tracking trajectory.
  • This embodiment adopts the greedy algorithm to sequentially connect pipelines and trajectories starting from the maximum affinity.
  • The motion pipelines starting from the first frame are the first group;
  • the motion pipelines starting from the second frame are the second group;
  • the motion pipelines starting from the i-th frame are the i-th group.
  • For example, the first group includes 10 motion pipelines,
  • the second group includes 8 motion pipelines,
  • and the third group includes 13 motion pipelines.
  • The motion pipelines in the second group are connected with the initial tracking trajectories; if the connection conditions are met, the tracking trajectories are updated, and if the connection conditions are not met, the original initial tracking trajectories are retained.
  • The tracking trajectory set then includes 8 updated tracking trajectories, while the other two tracking trajectories remain unchanged.
  • If, for example, three motion pipelines in the next group are not used to update any tracking trajectory, these three motion pipelines can be used as new initial tracking trajectories, that is, three new tracking trajectories are added to the tracking trajectory set.
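  • A greedy connection sketch under these definitions (NumPy assumed; affinity(trajectory, pipeline) = IoU × cos(θ) is supplied by the caller, and the list-based trajectory/pipeline representation is an illustrative assumption):

```python
import numpy as np

def greedy_connect(trajectories, tubes, affinity):
    """Connect pipelines to trajectories, always taking the best pair first."""
    A = np.array([[affinity(tr, tb) for tb in tubes] for tr in trajectories])
    unmatched = set(range(len(tubes)))
    while A.size and A.max() > 0:
        r, c = np.unravel_index(np.argmax(A), A.shape)
        trajectories[r].extend(tubes[c])   # update trajectory with this pipeline
        unmatched.discard(c)
        A[r, :] = -1                       # one pipeline per trajectory this round
        A[:, c] = -1
    for c in unmatched:                    # unmatched pipelines start new tracks
        trajectories.append(list(tubes[c]))
    return trajectories
```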
  • When there are multiple target categories, the target category to which the target corresponding to each motion pipeline belongs is determined according to the confidence table of the motion pipelines, and the motion pipelines of different target categories are connected separately to obtain the tracking trajectory of the target object of each target category.
  • If the target object is occluded in some video frames, the spatial position of the occluded part can be obtained by interpolating the motion pipelines.
  • The tracking trajectory is processed into bounding boxes superimposed on the original video and output to the display, completing the real-time tracking deployment and achieving target tracking.
  • the target tracking method provided in the embodiment of the present application designs a pre-trained neural network model, and the training method of the neural network model is introduced below.
  • FIG. 12 is a schematic diagram of an embodiment of a neural network model training method in an embodiment of the application.
  • Training preparations include building a training hardware environment, building a network model, and setting training parameters.
  • the video samples in the data set can also be processed to increase the diversity of data distribution and obtain better model generalization capabilities.
  • The processing of the video includes resolution scaling, whitening of the color space, random HSL color jitter (HSL is a color space, or color representation method, composed of hue (H), saturation (S), and lightness (L)), random horizontal flipping of video frames, etc.
  • Set the training parameters, including the batch size, learning rate, optimizer model, etc. For example, the batch size is 32, and the learning rate starts from 10^(-3) and, when the loss is stable, is reduced by a factor of 5 for better convergence. After 25K training iterations, the network basically converges. To increase the generalization ability of the model, a second-order regularization loss of 10^(-5) is used, and its momentum coefficient is 0.9.
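  • A training-setup sketch mirroring these parameters (PyTorch assumed; the model and dataset below are stand-ins, not the network of this application):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(10, 1)                    # stand-in for the real network
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,      # lr starts at 1e-3
                            momentum=0.9, weight_decay=1e-5)  # L2 term, momentum
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.2)        # reduce lr 5x when loss is stable
loader = DataLoader(dataset, batch_size=32, shuffle=True)     # batch size 32
```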
  • Split the tracking trajectory according to the preset pipeline length, that is, set the interval between the three bottom surfaces in the double quadrangular frustum structure.
  • For example, the interval between the common bottom surface and each of the other two bottom surfaces is 4, and the length of the motion pipeline is 8.
  • During splitting, the length of the motion pipeline in the time dimension is extended as much as possible, and the structure that is longest in the time dimension serves as the final expanded structure.
  • As shown in Figure 13, since the structure of the motion pipeline (Btube) is linear while the ground truth is non-linear, a long motion pipeline often cannot fit the motion trajectory well; that is, as the length increases, the IoU with the ground truth decreases below the threshold, while motion pipelines whose IoU stays above the threshold are usually shorter.
  • Therefore, the longest motion pipeline that meets the minimum IoU threshold is used as the split motion pipeline, which can better fit the original trajectory while expanding the time receptive field.
  • the overlapping part of the motion pipes can be used for connection matching between the motion pipes.
  • the tracking trajectories of all target objects in the video sample are split to obtain the true values of multiple motion pipelines.
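  • The splitting can be sketched as follows (plain Python; a pipeline length of 8 with the common bottom surface at interval 4 gives consecutive pipelines a half-length overlap, which is what later makes connection matching possible):

```python
def split_trajectory(boxes, length=8, stride=4):
    """boxes: the per-frame bounding boxes of one target, in frame order.
    Returns overlapping fixed-length segments as ground-truth pipelines."""
    return [boxes[s:s + length]
            for s in range(0, len(boxes) - length + 1, stride)]

tubes = split_trajectory(list(range(20)))  # toy "boxes"; 4 overlapping pipelines
```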
  • the video samples are input into the initial network model for training, and the predicted value of the motion pipeline is output.
  • the initial network model is a three-dimensional (3D) convolutional neural network or a recurrent neural network, etc., where the 3D convolutional neural network includes: a 3D residual neural network or a 3D feature pyramid network, etc.
  • the neural network model is a combination of a 3D residual neural network and a 3D feature pyramid network.
  • the video samples are input to the initial network model, and the motion pipelines of all target objects are output.
  • The data format of the output motion pipeline is the type shown in Figure 8: the input video is a tensor in R^(t×h×w×3), where h×w represents the video resolution and 3 represents the RGB color channels, and the output is the motion pipeline tensor O, O ∈ R^(t×h'×w'×15), where R represents the real number domain, t represents the number of frames of the video, and h'×w' represents the resolution of the feature map output by the neural network. That is, t×h'×w' motion pipelines are output, and each video frame corresponds to h'×w' motion pipelines.
  • the confidence level of the motion pipeline is also output, and the confidence level is used to indicate the category of the target object corresponding to the motion pipeline.
  • The order of step 1202 and step 1203 is not limited.
  • Since step 1202 splits according to the manually labeled trajectory information, the data format of the obtained true values of the motion pipelines, R^(t×h'×w'×15) (where t×h'×w' is the number of motion pipelines), is the first data format of the motion pipeline;
  • the data format of the motion pipelines output by the initial network model in step 1203, R^(n×15) (where n is the number of motion pipelines), is the second data format of the motion pipeline.
  • the true value of the motion pipeline is converted into the second data format.
  • The t×h'×w' motion pipelines output by the neural network model include t×h'×w' P points (only P1 and P2 are used as examples in Figure 14 for illustration), and these t×h'×w' P points form a three-dimensional lattice distributed over the time and space dimensions.
  • the true value is accompanied by a 0/1 truth table to characterize whether it is a compensation pipeline.
  • the truth table A' can be used as the confidence level corresponding to the truth value of the motion pipeline.
  • the loss between the true value (T) and the predicted value (O) can be calculated.
  • The loss function L combines an intersection-over-union term and a cross-entropy term, for example L = (1 − IoU(T, O)) + CrossEntropy(A, A′), where:
  • IoU (T, O) represents the intersection ratio between the true value of the motion pipeline (T) and the predicted value (O) of the motion pipeline
  • A is the confidence level of the predicted value (O) of the motion pipeline
  • A′ is the confidence corresponding to the true value of the motion pipeline (the 0/1 truth table).
  • CrossEntropy is the cross entropy.
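  • A sketch of a loss of this form (NumPy; the exact way the application combines the two terms may differ from this illustrative sum):

```python
import numpy as np

def loss(iou_T_O, A, A_prime, eps=1e-7):
    """iou_T_O: IoU(T, O); A: predicted confidence; A_prime: 0/1 true label."""
    ce = -(A_prime * np.log(A + eps) + (1 - A_prime) * np.log(1 - A + eps))
    return (1.0 - iou_T_O) + ce   # lower loss: better overlap, better confidence

print(loss(iou_T_O=0.8, A=0.9, A_prime=1.0))  # small loss for a good prediction
```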
  • the parameters are updated by the optimizer to optimize the neural network model, and finally a neural network model that can be used to implement the target tracking method in the embodiment of the present application is obtained.
  • There are many types of optimizers. Optionally, the optimizer can be the BGD (batch gradient descent) algorithm, the SGD (stochastic gradient descent) algorithm, or the MBGD (mini-batch gradient descent) algorithm.
  • FIG. 15 is a schematic diagram of another embodiment of a target tracking method in an embodiment of this application.
  • the target tracking device can track the moving target in the video in real time.
  • First, the system initialization of the target tracking device is performed to complete the preparation for device startup.
  • The video can be captured by the target tracking device in real time, or obtained through a communication network.
  • The video obtained in step 1502 is input into the pre-trained neural network model to obtain the motion pipeline set of the input video, including the motion pipeline of the target object corresponding to each video frame.
  • The basic idea of the greedy algorithm is to proceed step by step from an initial solution of the problem and, according to a certain optimization measure, ensure that a locally optimal solution is obtained at each step. It is understandable that the algorithm for connecting the motion pipelines can be replaced with other algorithms, which is not limited here.
  • For single-target tracking, the tracking trajectory of one target object is output.
  • For multi-target tracking, the tracking trajectory of each target object can be output. Specifically, the tracking trajectory can be processed into bounding boxes in each video frame, superimposed on the original video, and displayed by the display module.
  • the target tracking device will continue to obtain the newly captured video content, and repeat steps 1502 to 1505 until the target tracking task ends, which will not be repeated here.
  • FIG. 16 is a schematic diagram of an embodiment of the target tracking device in the embodiment of this application.
  • the software or firmware includes but is not limited to computer program instructions or codes, and can be executed by a hardware processor.
  • the hardware includes, but is not limited to, various integrated circuits, such as a central processing unit (CPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or an application specific integrated circuit (ASIC).
  • the target tracking device includes:
  • the acquiring unit 1601 is configured to acquire a first video, where the first video includes a target object;
  • the acquiring unit 1601 is further configured to input the first video into a pre-trained neural network model to acquire the position information of the target object in at least two video frames and the time information of the at least two video frames;
  • the acquiring unit 1601 is further configured to acquire the tracking of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames A trajectory, the tracking trajectory includes position information of the target object in at least two video frames in the first video.
  • The acquiring unit 1601 is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate the time information and position information of the target object in at least two video frames of the first video, and the first video includes a first video frame and a second video frame; the motion pipeline corresponds to a quadrangular frustum in the space-time dimension, and the space-time dimension includes a time dimension and a two-dimensional space dimension.
  • The position of the first bottom surface of the quadrangular frustum in the time dimension is used to indicate the first time information of the first video frame;
  • the position of the second bottom surface of the quadrangular frustum in the time dimension is used to indicate the second time information of the second video frame;
  • the position of the first bottom surface of the quadrangular frustum in the two-dimensional space dimension is used to indicate the first position information of the target object in the first video frame;
  • the position of the second bottom surface of the quadrangular frustum in the two-dimensional space dimension is used to indicate the second position information of the target object in the second video frame;
  • and the quadrangular frustum is used to indicate the position information of the target object in all video frames between the first video frame and the second video frame of the first video.
  • The acquiring unit 1601 is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate the position information of the target object in at least three video frames and the time information of the at least three video frames, and the first video includes a first video frame, a second video frame, and a third video frame; the motion pipeline corresponds to a double quadrangular frustum in the space-time dimension, and the double quadrangular frustum includes a first quadrangular frustum and a second quadrangular frustum; the first quadrangular frustum includes a first bottom surface and a second bottom surface, the second quadrangular frustum includes the first bottom surface and a third bottom surface, and the first bottom surface is the common bottom surface of the first quadrangular frustum and the second quadrangular frustum; the position of the first bottom surface in the time dimension is used to indicate the first time information of the first video frame, the position of the second bottom surface in the time dimension is used to indicate the second time information of the second video frame, and the position of the third bottom surface in the time dimension is used to indicate the third time information of the third video frame.
  • the acquiring unit 1601 is specifically configured to acquire the tracking trajectory of the target object in the first video according to the motion pipeline.
  • The tracking trajectory specifically includes: a tracking trajectory of the target object formed by connecting, in the space-time dimension, the quadrangular frustums corresponding to at least two of the motion pipelines.
  • the length of the motion pipeline is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
  • the obtaining unit 1601 is further configured to: obtain category information of the target object through the pre-trained neural network model; according to the category information of the target object, the target object is The position information in the two video frames and the time information of the at least two video frames obtain the tracking trajectory of the target object in the first video.
  • the acquiring unit 1601 is specifically configured to: acquire the confidence level of the motion pipeline through the pre-trained neural network model, and the confidence level of the motion pipeline is used to determine the target object corresponding to the motion pipeline Of the category information.
  • The device further includes a processing unit 1602, configured to delete the motion pipelines to obtain the deleted motion pipelines, where the deleted motion pipelines are used to acquire the tracking trajectory of the target object.
  • The motion pipelines include a first motion pipeline and a second motion pipeline. The processing unit 1602 is specifically configured to: if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, delete the motion pipeline with the lower confidence among the first motion pipeline and the second motion pipeline, where the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union between the first motion pipeline and the second motion pipeline, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to the motion pipeline is a preset category.
  • the processing unit 1602 is specifically configured to: delete the motion pipeline according to a non-maximum value suppression algorithm, and obtain the deleted motion pipeline.
  • The confidence of any one of the motion pipelines remaining after deletion is greater than or equal to a second threshold.
  • the acquiring unit 1601 is specifically configured to: connect a third motion pipeline and a fourth motion pipeline that meet a preset condition in the motion pipeline to acquire the tracking trajectory of the target object; the preset condition It includes one or more of the following: the intersection ratio between the sections of the overlapping portion of the third movement pipeline and the fourth movement pipeline in the time dimension is greater than or equal to a third threshold; the movement direction of the third movement pipeline The cosine value of the included angle with the movement direction of the fourth motion pipe is greater than or equal to the fourth threshold, and the movement direction is a vector indicating the position change of the target object in the movement pipe in the space-time dimension according to a preset rule; and , The distance between the neural network feature vectors of the motion pipeline is less than or equal to the fifth threshold, and the distance includes the Euclidean distance.
  • The obtaining unit 1601 is specifically configured to: group the motion pipelines to obtain t groups of motion pipelines, where t is the total number of video frames in the first video, the i-th group in the t groups includes all motion pipelines starting from the i-th video frame in the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, the motion pipelines in the i-th group are used as initial tracking trajectories to obtain a tracking trajectory set; and, in the numbered order of the motion pipeline groups, the motion pipelines in the i-th group are connected with the tracking trajectories in the tracking trajectory set to obtain at least one tracking trajectory.
  • the obtaining unit 1601 is specifically configured to: input the first video sample into the initial network model for training, and obtain the target object loss; update the weight parameter in the initial network model according to the target object loss to obtain The pre-trained neural network model.
  • The target object loss specifically includes: the intersection-over-union between the true value of the motion pipeline and the predicted value of the motion pipeline, where the true value of the motion pipeline is the motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the predicted value of the motion pipeline is the motion pipeline obtained by inputting the first video sample into the initial network model.
  • Alternatively, the target object loss specifically includes: the intersection-over-union between the true value of the motion pipeline and the predicted value of the motion pipeline, and the cross entropy between the confidence of the true value of the motion pipeline and the confidence of the predicted value of the motion pipeline, where:
  • the true value of the motion pipeline is the motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample;
  • the predicted value of the motion pipeline is the motion pipeline obtained by inputting the first video sample into the initial network model;
  • the confidence of the true value of the motion pipeline is the probability that the target object category corresponding to the true value of the motion pipeline belongs to the preset target object category;
  • and the confidence of the predicted value of the motion pipeline is the probability that the target object category corresponding to the predicted value of the motion pipeline belongs to the preset target object category.
  • the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
  • processing unit 1602 is further configured to: divide the first video into multiple video segments;
  • the acquiring unit 1601 is specifically configured to input the multiple video clips into the pre-trained neural network model to acquire the motion pipeline.
  • the target tracking device provided by the embodiment of the present application has multiple implementation forms.
  • the target tracking device includes a video acquisition module, a target tracking module, and an output module.
  • the video acquisition module is used to obtain a video including the moving target object
  • the target tracking module is used to input the video
  • the tracking trajectory of the target object is output by the target tracking method provided in this embodiment of the application
  • the output module is used to superimpose the tracking trajectory on the video and show it to users.
  • FIG. 17 is a schematic diagram of another embodiment of the target tracking device in the embodiment of this application.
  • the target tracking device includes a video acquisition module and a target tracking module, which can be understood as front-end equipment.
  • The front-end equipment needs to work together with the back-end equipment to complete the processing.
  • the video acquisition module 1701 which can be a video acquisition module in a surveillance camera, a video camera, a mobile phone or a vehicle image sensor, is responsible for capturing video data as the input of the tracking algorithm;
  • The target tracking module 1702, which can be a processing unit in a camera processor, a mobile phone processor, a vehicle processing unit, etc., is used to receive the video input and the control information sent by the back-end device, such as the tracking target category, the number of targets to track, accuracy control, model hyperparameters, etc.
  • the target tracking method of the embodiment of the present application is mainly deployed in this module.
  • FIG. 18 for the introduction of the target tracking module 1702.
  • the back-end equipment includes an output module and a control module.
  • The output module 1703 may be a display unit of a background monitor, a printer, a hard disk, etc., used to display or save the tracking results;
  • the control module 1704 is used to analyze the output result, receive the user's instruction, and send the instruction to the target tracking module of the front end.
  • FIG. 18 is a schematic diagram of another embodiment of the target tracking device in the embodiment of the application.
  • the target tracking device includes: a video preprocessing module 1801, a prediction module 1802, and a motion pipeline connection module 1803.
  • the video preprocessing module 1801 is used to divide the input video into appropriate segments, and adjust and normalize the video resolution, color space, etc.
  • the prediction module 1802 is used to extract spatiotemporal features from the input video clips and make predictions, and output the target motion pipeline and the category information of the motion pipeline. In addition, it can also predict the future position of the target motion pipeline.
  • the prediction module 1802 includes two sub-modules:
  • The target category prediction module 18021 predicts the category to which the target belongs based on the features output by the 3D convolutional neural network, for example, through the confidence values.
  • The motion pipeline prediction module 18022 predicts the position of the target's current motion pipeline through the features output by the 3D convolutional neural network, that is, the coordinates of the motion pipeline in the space and time dimensions.
  • The motion pipeline connection module 1803 analyzes the motion pipelines output by the prediction module; if a target appears for the first time, its motion pipeline is initialized as a new tracking trajectory. The connection features required for connecting motion pipelines are obtained according to the spatiotemporal feature similarity and the spatial location proximity between the motion pipelines. Then, according to the motion pipelines and their connection features, the motion pipelines are connected into complete tracking trajectories by analyzing the spatial overlap characteristics of the motion pipelines and the similarity of their spatiotemporal features.
  • FIG. 19 is a schematic diagram of an embodiment of an electronic device in an embodiment of the application.
  • the electronic device 1900 may differ considerably in configuration or performance, and may include one or more processors 1901 and a memory 1902, where the memory 1902 stores programs or data.
  • the memory 1902 may be volatile storage or non-volatile storage.
  • the processor 1901 may be one or more central processing units (central processing units, CPU).
  • the CPUs may be single-core CPUs or multi-core CPUs.
  • the processor 1901 may communicate with the memory 1902 and execute a series of instructions in the memory 1902 on the electronic device 1900.
  • the electronic device 1900 also includes one or more wired or wireless network interfaces 1903, such as an Ethernet interface.
  • the electronic device 1900 may also include one or more power supplies and one or more input/output interfaces, which can be used to connect a display, a mouse, a keyboard, a touch screen device, a sensor device, etc.
  • the input and output interfaces are optional components, which may or may not exist, and are not limited here.
  • FIG. 20 is a hardware structure diagram of a chip provided by an embodiment of this application.
  • the embodiment of the present application provides a chip system that can be used to implement the target tracking method.
  • the algorithm based on the convolutional neural network shown in FIG. 3 and FIG. 4 can be implemented in the NPU chip shown in FIG. 20.
  • the neural network processor NPU 50 is mounted on the main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 503.
  • the arithmetic circuit 503 is controlled by the controller 504 to extract matrix data from the memory and perform multiplication operations.
  • the arithmetic circuit 503 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches the data of matrix A from the input memory 501, performs matrix operations on it with matrix B, and stores the partial or final result of the matrix in the accumulator 508.
  • the unified memory 506 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 502 through the storage unit access controller 505 (direct memory access controller, DMAC).
  • the input data is also transferred to the unified memory 506 through the DMAC.
  • the BIU is the bus interface unit, that is, the bus interface unit 510, which is used for the interaction between the AXI bus, the DMAC, and the instruction fetch buffer 509.
  • the bus interface unit 510 is used for the instruction fetch buffer 509 to obtain instructions from the external memory, and is also used for the storage unit access controller 505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 506 or to transfer the weight data to the weight memory 502 or to transfer the input data to the input memory 501.
  • the vector calculation unit 507 may include multiple arithmetic processing units, and if necessary, further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 507 can store the processed output vector to the unified buffer 506.
  • the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 507 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 503, for example for use in a subsequent layer in a neural network.
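The dataflow described above can be summarized with a small numerical sketch (the tile width and the ReLU nonlinearity are illustrative assumptions; the real circuit operates on matrix tiles held in the weight and input memories):

```python
import numpy as np

def npu_dataflow_sketch(A, B, bias):
    """Model of the described flow: the arithmetic circuit multiplies
    tiles of A and B, partial sums accumulate (accumulator 508), and
    the vector calculation unit applies element-wise post-processing
    (here bias addition and a ReLU activation)."""
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.float32)
    tile = 16                                       # assumed tile width
    for k in range(0, A.shape[1], tile):
        acc += A[:, k:k + tile] @ B[k:k + tile, :]  # partial result -> accumulator
    return np.maximum(acc + bias, 0.0)              # vector unit: nonlinearity
```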
  • the instruction fetch buffer 509 connected to the controller 504 is used to store instructions used by the controller 504;
  • the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch buffer 509 are all on-chip memories.
  • the external memory is private to the NPU hardware architecture.
  • each layer in the convolutional neural network shown in FIG. 3 and FIG. 4 may be executed by the matrix calculation unit 212 or the vector calculation unit 507.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

Provided is a target tracking method applicable to tracking a target in a video and capable of reducing tracking errors caused when the target is occluded. The method comprises: inputting a captured first video of a target object into a pre-trained neural network model; acquiring motion pipelines of the target object; connecting the motion pipelines; and obtaining a tracking trajectory of the target object, the tracking trajectory comprising position information of the target object in each video frame of the first video.

Description

Target tracking method and target tracking device
This application claims priority to a Chinese patent application filed with the State Intellectual Property Office of China on June 9, 2020, with application number 202010519876.2 and invention title "Target tracking method and target tracking device", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of image processing technology, and in particular to a target tracking method and a target tracking device.
Background
Target tracking is one of the most important and fundamental tasks in the field of computer vision. Its purpose is to output, from a video containing a target object, the position of the target object in each video frame of the video. Usually a piece of video and the category of the target object to be tracked are input to a computer, and the computer outputs the identification (ID) of the target object and the position information of the target object in each frame of the video in the form of detection boxes.
The existing multi-target tracking method includes two parts, detection and tracking: a detection module detects the multiple target objects appearing in each video frame, and the target objects appearing in the different video frames are then matched. In the matching process, the features of each target object in a single video frame are extracted, target matching is achieved by comparing feature similarity, and the tracking trajectory of each target object is obtained.
Since the existing target tracking algorithm adopts a detect-then-track approach, the target tracking effect depends on the single-frame detection algorithm. If the target object is occluded during target detection, detection errors occur, which in turn cause tracking errors; performance is therefore insufficient in scenes where target objects are dense or heavily occluded.
Summary of the invention
The embodiments of the present application provide a target tracking method for target tracking in a video, which can reduce tracking errors caused by target occlusion.
A first aspect of the embodiments of the present application provides a target tracking method, including: acquiring a first video, where the first video includes a target object; inputting the first video into a pre-trained neural network model to acquire position information of the target object in at least two video frames and time information of the at least two video frames; and acquiring a tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, where the tracking trajectory includes the position information of the target object in the at least two video frames of the first video.
This method obtains the position information of the target object in at least two video frames and the time information of the at least two video frames through a pre-trained neural network model. Target tracking therefore does not depend on the target detection result of a single video frame, which can reduce detection failures in scenes with dense or heavily occluded targets and improve target tracking performance.
In a possible implementation of the first aspect, acquiring the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: acquiring a motion pipeline of the target object, where the motion pipeline is used to indicate time information and position information of the target object in at least two video frames of the first video, and the first video includes a first video frame and a second video frame. The motion pipeline corresponds to a quadrangular frustum in the space-time dimensions, where the space-time dimensions include a time dimension and two-dimensional space dimensions. The position of the first base of the frustum in the time dimension indicates the first time information of the first video frame, and the position of the second base of the frustum in the time dimension indicates the second time information of the second video frame; the position of the first base in the two-dimensional space dimensions indicates the first position information of the target object in the first video frame, and the position of the second base in the two-dimensional space dimensions indicates the second position information of the target object in the second video frame. The frustum indicates the position information of the target object in all video frames between the first video frame and the second video frame of the first video.
This method obtains motion pipelines through a pre-trained neural network model. Since a motion pipeline includes the position information of the target object in at least two video frames, the position of the target in a video frame can be determined, in the space-time dimensions, by a moment in the time dimension and a position in the two-dimensional space dimensions: the moment identifies the video frame, and the position in the two-dimensional space dimensions indicates the position information of the target in that video frame. The motion pipeline can be mapped to a quadrangular frustum in the space-time dimensions, which visually presents the position information of the target in at least two video frames. This target tracking method does not depend on the target detection result of a single video frame, which can reduce detection failures in scenes with dense or heavily occluded targets and improve target tracking performance.
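As a sketch of how a single-frustum motion pipeline yields a box in every covered frame (assuming the box at an intermediate frame is the frustum's cross-section, i.e. a linear interpolation between the two end-face boxes; the tuple layout is an illustrative assumption):

```python
def box_at_frame(tube, t):
    """Box of a single-frustum motion pipeline at frame t.
    tube = (t0, box0, t1, box1), boxes given as (cx, cy, w, h)."""
    t0, box0, t1, box1 = tube
    a = (t - t0) / float(t1 - t0)   # position along the time axis
    return tuple((1 - a) * v0 + a * v1 for v0, v1 in zip(box0, box1))
```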
In a possible implementation of the first aspect, acquiring the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: acquiring a motion pipeline of the target object, where the motion pipeline is used to indicate the position information of the target object in at least three video frames and the time information of the at least three video frames, and the first video includes a first video frame, a second video frame, and a third video frame. The motion pipeline corresponds to a double quadrangular frustum in the space-time dimensions; the double frustum includes a first frustum and a second frustum, the first frustum includes a first base and a second base, the second frustum includes the first base and a third base, and the first base is the common base of the first frustum and the second frustum. The position of the first base in the time dimension indicates the first time information of the first video frame, the position of the second base in the time dimension indicates the second time information of the second video frame, and the position of the third base in the time dimension indicates the third time information of the third video frame; in the time order of the first video, the first video frame is located between the second video frame and the third video frame. The position of the first base in the two-dimensional space dimensions indicates the first position information of the target object in the first video frame, the position of the second base in the two-dimensional space dimensions indicates the second position information of the target object in the second video frame, and the position of the third base in the two-dimensional space dimensions indicates the third position information of the target object in the third video frame. The double frustum indicates the position information of the target object in all video frames between the second video frame and the third video frame of the first video.
In this method, the motion pipeline includes the position information of the target object in at least three video frames. Specifically, the at least three video frames include a second video frame that is earlier in the time order of the video, and a third video frame that is later, which expands the receptive field in the time dimension and can further improve target tracking performance. Mapping the motion pipeline to a double quadrangular frustum in the space-time dimensions visually presents the position information of the target in at least three video frames; in particular, it also covers the position information of the target in all video frames between the two non-common bases of the motion pipeline. Considering the continuity of target motion, the real tracking trajectory of a target object in the space-time dimensions is usually nonlinear; a motion pipeline with a double-frustum structure can express two directions of target motion and can better fit the real tracking trajectory in scenes where the motion direction of the target object changes.
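A double-frustum motion pipeline can be sketched analogously to the single-frustum sketch above: it is anchored at an earlier frame, a middle frame (the common base), and a later frame, so the two segments can follow different motion directions (the per-segment linear interpolation matches the frustum geometry and is an assumption for illustration):

```python
def box_at_frame_double(tube, t):
    """Box of a double-frustum motion pipeline at frame t.
    tube = (t_prev, box_prev, t_mid, box_mid, t_next, box_next)."""
    t_prev, box_prev, t_mid, box_mid, t_next, box_next = tube
    if t <= t_mid:                                  # first segment
        a = (t - t_prev) / float(t_mid - t_prev)
        lo, hi = box_prev, box_mid
    else:                                           # second segment, possibly a new direction
        a = (t - t_mid) / float(t_next - t_mid)
        lo, hi = box_mid, box_next
    return tuple((1 - a) * v0 + a * v1 for v0, v1 in zip(lo, hi))
```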
In a possible implementation of the first aspect, acquiring the tracking trajectory of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: acquiring the tracking trajectory of the target object in the first video according to the motion pipeline.
Acquiring the tracking trajectory of the target object in the first video according to the motion pipeline can reduce detection failures in scenes with dense or heavily occluded targets and improve target tracking performance.
In a possible implementation of the first aspect, the tracking trajectory specifically includes: a tracking trajectory of the target object formed by connecting the quadrangular frustums in the space-time dimensions corresponding to at least two of the motion pipelines.
Obtaining the tracking trajectory of the target object by connecting motion pipelines does not rely on the target detection result of a single video frame, which can reduce detection failures in scenes with dense or heavily occluded targets and improve target tracking performance.
In a possible implementation of the first aspect, the length of the motion pipeline is a preset value, where the length of the motion pipeline indicates the number of video frames included in the at least two video frames; optionally, the length of the motion pipeline is 4, 6, or 8.
In this method, the length of the motion pipeline can be a preset value; that is, each motion pipeline corresponds to the same number of video frames and indicates the position change of the target object within a time period of the same duration. Compared with not fixing the motion pipeline length, this method can reduce the computation of the neural network model and the time required for target tracking.
In a possible implementation of the first aspect, the method further includes: obtaining category information of the target object through the pre-trained neural network model. Acquiring the tracking trajectory of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames then includes: acquiring the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
For multi-target tracking scenes in which the targets to be tracked belong to multiple categories, this method can determine the category information of the target object corresponding to each motion pipeline through the pre-trained neural network model, and obtain the tracking trajectory of the target object based on the category information, position information, and time information.
In a possible implementation of the first aspect, obtaining the category information of the target object corresponding to the motion pipeline through the pre-trained neural network model specifically includes: obtaining a confidence value of the motion pipeline through the pre-trained neural network model, where the confidence value of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
For single-target tracking scenes, this method can use the confidence value to distinguish whether a motion pipeline is a real motion pipeline indicating the target position. In addition, for multi-target tracking scenes in which the targets to be tracked belong to multiple categories, this method can use the confidence value of a motion pipeline to distinguish the category of the target object corresponding to that motion pipeline.
In a possible implementation of the first aspect, before acquiring the tracking trajectory of the target object according to the motion pipelines, the method further includes: pruning the motion pipelines to obtain pruned motion pipelines, where the pruned motion pipelines are used to obtain the tracking trajectory of the target object.
This method can prune the motion pipelines of the video frames; removing duplicate motion pipelines or motion pipelines with low confidence can reduce the computation in the motion pipeline connection step.
In a possible implementation of the first aspect, pruning the motion pipelines to obtain the pruned motion pipelines specifically includes: the motion pipelines include a first motion pipeline and a second motion pipeline; if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, the motion pipeline with the lower confidence among the first motion pipeline and the second motion pipeline is removed, where the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union ratio between the first motion pipeline and the second motion pipeline, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to a motion pipeline is a preset category.
This method introduces a specific way of pruning motion pipelines: motion pipelines whose mutual repetition rate is greater than or equal to the first threshold can be regarded as duplicate data; the one with the lower confidence is removed and the one with the higher confidence is retained for pipeline connection, which can reduce the computation in the motion pipeline connection step.
In a possible implementation of the first aspect, pruning the motion pipelines to obtain the pruned motion pipelines specifically includes: pruning the motion pipelines according to a non-maximum suppression algorithm to obtain the pruned motion pipelines.
This method can also prune according to a non-maximum suppression algorithm, which removes duplicate motion pipelines while retaining, for each target, the motion pipelines with higher confidence, reducing the computation of the pipeline connection step and improving target tracking efficiency.
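A minimal sketch of such a pruning step, assuming per-frame axis-aligned (x1, y1, x2, y2) boxes and illustrative threshold values (a motion pipeline is represented here as a frame-indexed sequence of boxes):

```python
def tube_iou(tube_a, tube_b, frames):
    """Repetition rate of two motion pipelines: the spatiotemporal
    intersection-over-union, computed frame by frame."""
    inter = union = 0.0
    for t in frames:
        a, b = tube_a[t], tube_b[t]
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        i = iw * ih
        u = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - i
        inter, union = inter + i, union + u
    return inter / union if union else 0.0

def prune_tubes(tubes, scores, frames, iou_thr=0.7, conf_thr=0.3):
    """Non-maximum-suppression-style pruning: discard low-confidence
    tubes (second threshold), then suppress duplicates whose repetition
    rate exceeds the first threshold, keeping the higher confidence."""
    order = sorted(range(len(tubes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if scores[i] < conf_thr:
            continue
        if all(tube_iou(tubes[i], tubes[j], frames) < iou_thr for j in keep):
            keep.append(i)
    return [tubes[i] for i in keep]
```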
In a possible implementation of the first aspect, the confidence of any one of the pruned motion pipelines is greater than or equal to a second threshold.
When pruning motion pipelines, this method can discard all motion pipelines with low confidence; a motion pipeline whose confidence is below the second threshold can be understood as a non-real motion pipeline, for example a motion pipeline corresponding to the background.
In a possible implementation of the first aspect, acquiring the tracking trajectory of the target object according to the motion pipelines specifically includes: connecting a third motion pipeline and a fourth motion pipeline that satisfy preset conditions among the motion pipelines to obtain the tracking trajectory of the target object, where the preset conditions include one or more of the following: the intersection-over-union ratio between the temporally overlapping sections of the third motion pipeline and the fourth motion pipeline is greater than or equal to a third threshold; the cosine of the angle between the motion direction of the third motion pipeline and the motion direction of the fourth motion pipeline is greater than or equal to a fourth threshold, where the motion direction is a vector that indicates, according to a preset rule, the position change of the target object in the motion pipeline in the space-time dimensions; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to a fifth threshold, where the distance includes the Euclidean distance.
This method provides a specific way of connecting motion pipelines: based on the positions of the motion pipelines in the space-time dimensions, motion pipelines with high overlap and similar motion directions are connected.
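The three conditions can be sketched as a compatibility test on two temporally overlapping motion pipelines (reusing the tube_iou helper from the pruning sketch above; the thresholds and the dictionary layout are illustrative assumptions):

```python
import numpy as np

def can_connect(tube_a, tube_b, overlap_frames,
                iou_thr=0.5, cos_thr=0.8, feat_thr=1.0):
    """tube_*: dict with 'boxes' (frame -> (x1, y1, x2, y2)),
    'direction' (motion vector in the space-time dimensions), and
    'feat' (neural network feature vector)."""
    # 1) IoU of the temporally overlapping sections >= third threshold
    if tube_iou(tube_a['boxes'], tube_b['boxes'], overlap_frames) < iou_thr:
        return False
    # 2) cosine of the angle between motion directions >= fourth threshold
    da, db = np.asarray(tube_a['direction']), np.asarray(tube_b['direction'])
    if da @ db / (np.linalg.norm(da) * np.linalg.norm(db)) < cos_thr:
        return False
    # 3) Euclidean feature distance <= fifth threshold
    return np.linalg.norm(np.asarray(tube_a['feat']) -
                          np.asarray(tube_b['feat'])) <= feat_thr
```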
In a possible implementation of the first aspect, acquiring the tracking trajectory of the target object according to the motion pipelines specifically includes: grouping the motion pipelines to obtain t groups of motion pipelines, where t is the total number of video frames in the first video, the i-th motion pipeline group among the t groups includes all motion pipelines starting at the i-th video frame of the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, taking the motion pipelines in the i-th group as initial tracking trajectories to obtain a set of tracking trajectories; and, in the numbered order of the groups, successively connecting the motion pipelines in the i-th group with the tracking trajectories in the set to obtain at least one tracking trajectory. This provides a specific way of connecting motion pipelines: a motion pipeline corresponds to the position information of the target object in the video frames within a period of time; grouping the motion pipelines by their starting video frame and connecting each group in turn can improve the efficiency of target tracking.
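A sketch of this grouping-and-connection procedure (the two-argument compatibility test is passed in, for example a closure over the can_connect sketch above; the data layout is an illustrative assumption):

```python
def link_tubes(tubes_by_start, num_frames, compatible):
    """Group i contains the tubes starting at frame i. Group 1 seeds
    the trajectory set; each later tube is appended to a compatible
    trajectory, or starts a new trajectory on first appearance."""
    tracks = [[t] for t in tubes_by_start.get(1, [])]
    for i in range(2, num_frames + 1):
        for tube in tubes_by_start.get(i, []):
            for track in tracks:
                if compatible(track[-1], tube):
                    track.append(tube)
                    break
            else:
                tracks.append([tube])   # target appears for the first time
    return tracks
```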
In a possible implementation of the first aspect, the pre-trained neural network model is obtained by training an initial network model, and the method further includes: inputting a first video sample into the initial network model for training to obtain a target object loss; and updating the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
In this method, an initial network model can be trained to obtain the neural network model that outputs the motion pipelines in this target tracking method.
In a possible implementation of the first aspect, the target object loss specifically includes: the intersection-over-union ratio between a ground-truth motion pipeline and a predicted motion pipeline, where the ground-truth motion pipeline is obtained by splitting the tracking trajectory of the target object in the first video sample, and the predicted motion pipeline is obtained by inputting the first video sample into the initial network model.
In this method, the target loss in the model training process is the intersection-over-union ratio between the ground-truth motion pipeline and the predicted motion pipeline; with a neural network model trained in this way, the position information of the target object indicated by the motion pipelines is highly accurate.
In a possible implementation of the first aspect, the target object loss specifically includes: the intersection-over-union ratio between the ground-truth motion pipeline and the predicted motion pipeline, and the cross entropy between the confidence of the ground-truth motion pipeline and the confidence of the predicted motion pipeline, where the ground-truth motion pipeline is obtained by splitting the tracking trajectory of the target object in the first video sample, the predicted motion pipeline is obtained by inputting the first video sample into the initial network model, the confidence of the ground-truth motion pipeline is the probability that the category of the target object corresponding to the ground-truth motion pipeline belongs to a preset target object category, and the confidence of the predicted motion pipeline is the probability that the category of the target object corresponding to the predicted motion pipeline belongs to the preset target object category.
In this method, the target loss in the model training process combines the intersection-over-union ratio between the ground-truth and predicted motion pipelines with the cross entropy between their confidences; with a neural network model trained in this way, the position information of the target object indicated by the motion pipelines is highly accurate, and the type of the target object can be indicated accurately.
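A PyTorch-style sketch of such a target loss (axis-aligned (x1, y1, x2, y2) boxes per frame, an IoU term written as 1 - IoU, and equal weighting of the two terms are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def tube_training_loss(pred_boxes, gt_boxes, pred_logits, gt_labels):
    """pred_boxes/gt_boxes: (N, T, 4) predicted and ground-truth tube
    boxes; pred_logits: (N, C) class scores; gt_labels: (N,) classes."""
    # IoU term between predicted and ground-truth tubes, per frame
    lt = torch.max(pred_boxes[..., :2], gt_boxes[..., :2])
    rb = torch.min(pred_boxes[..., 2:], gt_boxes[..., 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=-1)
    area_p = (pred_boxes[..., 2:] - pred_boxes[..., :2]).clamp(min=0).prod(dim=-1)
    area_g = (gt_boxes[..., 2:] - gt_boxes[..., :2]).clamp(min=0).prod(dim=-1)
    iou = inter / (area_p + area_g - inter + 1e-6)
    loss_iou = (1.0 - iou).mean()
    # cross entropy between predicted and ground-truth confidences
    loss_cls = F.cross_entropy(pred_logits, gt_labels)
    return loss_iou + loss_cls
```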
In a possible implementation of the first aspect, the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network, where the three-dimensional convolutional neural network includes a three-dimensional residual neural network or a three-dimensional feature pyramid network. Optionally, the initial network model is obtained by combining a three-dimensional residual neural network and a three-dimensional feature pyramid network.
In this method, the initial network model can be a three-dimensional convolutional neural network, a recurrent neural network, or a combination of the two; the diversity of neural network model types offers multiple ways of implementing the solution.
In a possible implementation of the first aspect, inputting the first video into the pre-trained neural network model to acquire the motion pipelines of the target object specifically includes: dividing the first video into multiple video clips, and inputting the multiple video clips into the pre-trained neural network model respectively to acquire the motion pipelines.
Considering the limit on the number of video frames the neural network model can process, the video can be segmented first and the video clips input into the model; optionally, the number of video frames in a clip is a preset value, for example 8 frames.
A second aspect of the embodiments of the present application provides a target tracking device, including: an acquisition unit configured to acquire a first video, where the first video includes a target object; the acquisition unit is further configured to input the first video into a pre-trained neural network model to acquire position information of the target object in at least two video frames and time information of the at least two video frames; and the acquisition unit is further configured to acquire a tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, where the tracking trajectory includes the position information of the target object in the at least two video frames of the first video.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate time information and position information of the target object in at least two video frames of the first video, and the first video includes a first video frame and a second video frame; the motion pipeline corresponds to a quadrangular frustum in the space-time dimensions, where the space-time dimensions include a time dimension and two-dimensional space dimensions; the position of the first base of the frustum in the time dimension indicates the first time information of the first video frame, the position of the second base in the time dimension indicates the second time information of the second video frame, the position of the first base in the two-dimensional space dimensions indicates the first position information of the target object in the first video frame, and the position of the second base in the two-dimensional space dimensions indicates the second position information of the target object in the second video frame; the frustum indicates the position information of the target object in all video frames between the first video frame and the second video frame of the first video.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate the position information of the target object in at least three video frames and the time information of the at least three video frames, and the first video includes a first video frame, a second video frame, and a third video frame; the motion pipeline corresponds to a double quadrangular frustum in the space-time dimensions, where the double frustum includes a first frustum and a second frustum, the first frustum includes a first base and a second base, the second frustum includes the first base and a third base, and the first base is the common base of the first frustum and the second frustum; the position of the first base in the time dimension indicates the first time information of the first video frame, the position of the second base in the time dimension indicates the second time information of the second video frame, and the position of the third base in the time dimension indicates the third time information of the third video frame, with the first video frame located, in the time order of the first video, between the second video frame and the third video frame; the position of the first base in the two-dimensional space dimensions indicates the first position information of the target object in the first video frame, the position of the second base in the two-dimensional space dimensions indicates the second position information of the target object in the second video frame, and the position of the third base in the two-dimensional space dimensions indicates the third position information of the target object in the third video frame; the double frustum indicates the position information of the target object in all video frames between the second video frame and the third video frame of the first video.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to acquire the tracking trajectory of the target object in the first video according to the motion pipeline.
In a possible implementation of the second aspect, the tracking trajectory specifically includes: a tracking trajectory of the target object formed by connecting the quadrangular frustums in the space-time dimensions corresponding to at least two of the motion pipelines.
In a possible implementation of the second aspect, the length of the motion pipeline is a preset value, where the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
In a possible implementation of the second aspect, the acquisition unit is further configured to: obtain category information of the target object through the pre-trained neural network model; and acquire the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to obtain a confidence value of the motion pipeline through the pre-trained neural network model, where the confidence value of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
In a possible implementation of the second aspect, the device further includes: a processing unit configured to prune the motion pipelines to obtain pruned motion pipelines, where the pruned motion pipelines are used to obtain the tracking trajectory of the target object.
In a possible implementation of the second aspect, the motion pipelines include a first motion pipeline and a second motion pipeline, and the processing unit is specifically configured to: if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, remove the motion pipeline with the lower confidence among the first motion pipeline and the second motion pipeline, where the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union ratio between them, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to a motion pipeline is a preset category.
In a possible implementation of the second aspect, the processing unit is specifically configured to prune the motion pipelines according to a non-maximum suppression algorithm to obtain the pruned motion pipelines.
In a possible implementation of the second aspect, the confidence of any one of the pruned motion pipelines is greater than or equal to a second threshold.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to: connect a third motion pipeline and a fourth motion pipeline that satisfy preset conditions among the motion pipelines to obtain the tracking trajectory of the target object, where the preset conditions include one or more of the following: the intersection-over-union ratio between the temporally overlapping sections of the third motion pipeline and the fourth motion pipeline is greater than or equal to a third threshold; the cosine of the angle between the motion direction of the third motion pipeline and the motion direction of the fourth motion pipeline is greater than or equal to a fourth threshold, where the motion direction is a vector that indicates, according to a preset rule, the position change of the target object in the motion pipeline in the space-time dimensions; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to a fifth threshold, where the distance includes the Euclidean distance.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to: group the motion pipelines to obtain t groups of motion pipelines, where t is the total number of video frames in the first video, the i-th motion pipeline group among the t groups includes all motion pipelines starting at the i-th video frame of the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, take the motion pipelines in the i-th group as initial tracking trajectories to obtain a set of tracking trajectories; and, in the numbered order of the groups, successively connect the motion pipelines in the i-th group with the tracking trajectories in the set to obtain at least one tracking trajectory.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to: input a first video sample into the initial network model for training to obtain a target object loss; and update the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
In a possible implementation of the second aspect, the target object loss specifically includes: the intersection-over-union ratio between a ground-truth motion pipeline and a predicted motion pipeline, where the ground-truth motion pipeline is obtained by splitting the tracking trajectory of the target object in the first video sample, and the predicted motion pipeline is obtained by inputting the first video sample into the initial network model.
In a possible implementation of the second aspect, the target object loss specifically includes: the intersection-over-union ratio between the ground-truth motion pipeline and the predicted motion pipeline, and the cross entropy between the confidence of the ground-truth motion pipeline and the confidence of the predicted motion pipeline, where the ground-truth motion pipeline is obtained by splitting the tracking trajectory of the target object in the first video sample, the predicted motion pipeline is obtained by inputting the first video sample into the initial network model, the confidence of the ground-truth motion pipeline is the probability that the category of the target object corresponding to it belongs to a preset target object category, and the confidence of the predicted motion pipeline is the probability that the category of the target object corresponding to it belongs to the preset target object category.
In a possible implementation of the second aspect, the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
In a possible implementation of the second aspect, the processing unit is further configured to divide the first video into multiple video clips, and the acquisition unit is specifically configured to input the multiple video clips into the pre-trained neural network model respectively to acquire the motion pipelines.
A third aspect of the embodiments of the present application provides an electronic device, including a processor and a memory connected to each other, where the memory is configured to store a computer program including program instructions, and the processor is configured to call the program instructions to execute the method described in any one of the first aspect and its various possible implementations.
A fourth aspect of the embodiments of the present application provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the method described in any one of the first aspect and its various possible implementations.
A fifth aspect of the embodiments of the present application provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to execute the method described in any one of the first aspect and its various possible implementations.
A sixth aspect of the embodiments of the present application provides a chip including a processor. The processor is configured to read and execute a computer program stored in a memory to execute the method in any possible implementation of any of the foregoing aspects. Optionally, the chip includes the memory, and the memory is connected to the processor through a circuit or a wire. Further optionally, the chip also includes a communication interface connected to the processor. The communication interface is used to receive data and/or information to be processed; the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface. The communication interface can be an input/output interface.
For the technical effects brought by any implementation of the second, third, fourth, fifth, or sixth aspect, refer to the technical effects brought by the corresponding implementation of the first aspect; they are not repeated here.
It can be seen from the above technical solutions that the embodiments of the present application have the following advantages:
In the target tracking method provided by the embodiments of the present application, the position information of the target object in at least two video frames and the time information of the at least two video frames are obtained through a pre-trained neural network model, and the tracking trajectory of the target object in the first video is determined from this information. Since the neural network model outputs time information for at least two video frames, target tracking does not depend on the target detection result of a single video frame, which can reduce detection failures in scenes with dense or heavily occluded targets and improve target tracking performance.
本申请实施例提供的目标跟踪方法,通过预训练的神经网络模型获取目标物体的运动管道,通过连接运动管道获取目标物体的跟踪轨迹。由于运动管道包括至少两个视频帧中的目标物体的位置信息,目标跟踪不依赖于单个视频帧的目标检测结果,可以减少在目标密集或者遮挡较多的场景下的检测失败的问题,提升目标跟踪性能。In the target tracking method provided by the embodiments of the present application, the motion pipeline of the target object is obtained through a pre-trained neural network model, and the tracking trajectory of the target object is obtained by connecting the motion pipeline. Since the motion pipeline includes the position information of the target object in at least two video frames, target tracking does not depend on the target detection result of a single video frame, which can reduce the problem of detection failure in scenes with dense targets or more occlusions, and improve the target Track performance.
此外,现有技术中依赖于单帧的检测算法,整体算法的精度受到检测器的影响,分步训练检测模型和跟踪模型的开发成本高,同时算法分为两个阶段也增大了机器学习过程的计算成本和部署难度。而本申请实施例提供的目标跟踪方法,可以实现端到端的训练,通过一个神经网络模型完成多目标物体的检测和跟踪任务,可以减低模型的复杂度。In addition, in the prior art, the detection algorithm relies on a single frame, and the accuracy of the overall algorithm is affected by the detector. The development cost of step-by-step training of the detection model and tracking model is high. At the same time, the algorithm is divided into two phases, which also increases machine learning. The computational cost and deployment difficulty of the process. However, the target tracking method provided in the embodiments of the present application can realize end-to-end training, and complete the detection and tracking tasks of multi-target objects through a neural network model, which can reduce the complexity of the model.
此外,现有技术基于单个视频帧提取的特征较为单一,本申请实施例提供的目标跟踪方法,采用视频作为原始输入,模型可以通过外貌特征、运动轨迹特征或步态特征等多种特征实现跟踪任务,可以提升目标跟踪性能。In addition, the prior art has relatively single features extracted based on a single video frame. The target tracking method provided in the embodiments of this application uses video as the original input, and the model can be tracked through various features such as appearance features, motion trajectory features, or gait features. Tasks can improve target tracking performance.
此外,本申请实施例提供的目标跟踪方法,采用视频作为模型原始输入,时间维度感受野增加,可以更好的捕捉人物的运动信息。In addition, the target tracking method provided by the embodiment of the present application uses video as the original input of the model, and the time dimension receptive field is increased, which can better capture the movement information of the character.
Description of the drawings
FIG. 1 is a schematic diagram of an artificial intelligence framework provided by an embodiment of this application;
FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of this application;
FIG. 3 is a schematic diagram of a convolutional neural network structure provided by an embodiment of this application;
FIG. 4 is a schematic diagram of another convolutional neural network structure provided by an embodiment of this application;
FIG. 5 is a schematic diagram of an embodiment of a motion pipeline in an embodiment of this application;
FIG. 6 is a schematic diagram of splitting a tracking trajectory into motion pipelines in an embodiment of this application;
FIG. 7 is a schematic diagram of an embodiment of a motion pipeline in an embodiment of this application;
FIG. 8 is a schematic diagram of another embodiment of a motion pipeline in an embodiment of this application;
FIG. 9 is a schematic diagram of the intersection and union of motion pipelines in an embodiment of this application;
FIG. 10 is a schematic diagram of an embodiment of a target detection method in an embodiment of this application;
FIG. 11 is a schematic diagram of an embodiment of matching between motion pipelines in an embodiment of this application;
FIG. 12 is a schematic diagram of an embodiment of a training method of a neural network model in an embodiment of this application;
FIG. 13 is a schematic diagram of a tracking trajectory and motion pipelines in an embodiment of this application;
FIG. 14 is a schematic diagram of motion pipelines output by a neural network model in an embodiment of this application;
FIG. 15 is a schematic diagram of another embodiment of a target tracking method in an embodiment of this application;
FIG. 16 is a schematic diagram of an embodiment of a target tracking device in an embodiment of this application;
FIG. 17 is a schematic diagram of another embodiment of a target tracking device in an embodiment of this application;
FIG. 18 is a schematic diagram of another embodiment of a target tracking device in an embodiment of this application;
FIG. 19 is a schematic diagram of another embodiment of an electronic device in an embodiment of this application;
FIG. 20 is a hardware structure diagram of a chip provided by an embodiment of this application.
Detailed description
The embodiments of the present application provide a target tracking method for tracking targets in video, which can reduce tracking errors in scenes with dense targets or heavy occlusion.
The embodiments of the present application are described below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. A person of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
The terms "first", "second", and the like in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than the one illustrated or described herein. Moreover, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to those steps or modules clearly listed, but may include other steps or modules that are not clearly listed or that are inherent to such processes, methods, products, or devices. The naming or numbering of steps in this application does not mean that the steps in a method flow must be executed in the temporal or logical order indicated by the naming or numbering; named or numbered steps may be executed in a different order according to the technical purpose to be achieved, as long as the same or a similar technical effect is achieved.
For ease of understanding, some technical terms involved in the embodiments of this application are briefly introduced below:
1. Motion pipelines and tracking trajectories.
The multiple video frames of a video are obtained by continuous shooting, and the video frame rate is usually known. A moving target in a video refers to a target that moves relative to the video capture device during shooting; taking the world coordinate system of the actual three-dimensional space as a reference, the target itself may be moving or stationary, which is not limited here.
While a target is being filmed, the image information of the target object may be recorded directly in a video frame, or the target may be occluded by other objects in some frames.
The multiple video frames of a video can be laid out along the time dimension. Since the shooting interval between video frames is known, different video frames correspond to different moments in the time dimension. Because each video frame is a two-dimensional image, the image information of a video frame corresponds to data in the two spatial dimensions; in the embodiments of this application, data presented in this form is defined as data in the space-time dimensions. The position of a target in a video frame can therefore be determined by its position in the time dimension and its position in the two spatial dimensions: the position in the time dimension identifies the video frame, and the position in the two spatial dimensions indicates the position information of the target within that video frame.
Please refer to FIG. 5, which is a schematic diagram of an embodiment of a motion pipeline in an embodiment of this application.
Target tracking needs to determine the position information of the target to be tracked (or "target" for short) in all video frames that contain the target object. Typically, the target position in each video frame can be identified by a bounding box. In the space-time dimensions, connecting the bounding boxes of the same target object across the video frames forms the trajectory of the target in the space-time region, i.e., the tracking trajectory, also called the motion trajectory. A tracking trajectory both gives the position of the target object and connects the positions of the target object at different moments; it can therefore indicate the temporal and spatial information of the target object at the same time. FIG. 5 only illustrates the position information of the target object in three video frames; it should be understood that the tracking trajectory can be obtained from all video frames of the video in the same way. It should be noted that the same video frame may contain one or more targets, and a tracking trajectory also includes the identifier (ID) of the target object it indicates; the ID of the target object can be used to distinguish the trajectories of different targets.
The motion pipeline and the tracking trajectory are introduced below.
A motion pipeline is used to indicate the position information of a target in at least two video frames and corresponds to a quadrangular frustum in the space-time dimensions. The position of the first base of the frustum in the time dimension indicates the first time information of a first video frame, and the position of the second base in the time dimension indicates the second time information of a second video frame; the position of the first base in the two spatial dimensions indicates the first position information of the target object in the first video frame, and the position of the second base in the two spatial dimensions indicates the second position information of the target object in the second video frame.
Optionally, a motion pipeline indicates the position information of the target in at least three different video frames. In this embodiment and the following embodiments, a motion pipeline that includes the position information of the target in three different video frames is used as an example.
In the space-time dimensions, a motion pipeline can be regarded as a double-frustum structure composed of two quadrangular frustums sharing a common base. The three bases of this double-frustum structure are parallel to each other; the direction perpendicular to the bases is the time dimension, and the directions in which the bases extend are the spatial dimensions. Each base represents the position of the target in the video frame at the moment corresponding to that base. FIG. 6 shows a motion pipeline with a double-frustum structure, including a first base 601, a second base 602, and a third base 603. The first base 601, i.e., rectangle abcd, gives through its position in the two-dimensional space where it lies the position information of the target object in the first video frame, and the position onto which rectangle abcd maps in the time dimension represents the time information of the first video frame. Similarly, the second base 602, i.e., rectangle ijkm, gives through its position in the two-dimensional space where it lies the position information of the target object in the second video frame, and the position onto which rectangle ijkm maps in the time dimension represents the time information of the second video frame. The third base 603, i.e., rectangle efgh, gives through its position in the two-dimensional space where it lies the position information of the target object in the third video frame, and the position onto which rectangle efgh maps in the time dimension represents the time information of the third video frame. It should be understood that, because there may be relative motion between the target object and the video capture device while the first video is shot, rectangles abcd, efgh, and ijkm may map to different positions when projected onto the two-dimensional space of a common base. The positions of the first base 601, the second base 602, and the third base 603 in the time dimension, i.e., the positions onto which points a, i, and e map in the time dimension, are a', i', and e' respectively, indicating the time information of the first, second, and third video frames. The length of the motion pipeline is the interval between the position onto which the second base maps in the time dimension and the position onto which the third base maps in the time dimension; it indicates the number of video frames, in the time order of the video, comprising the frame of the second base, the frame of the third base, and all frames between them.
It should be noted that the motion pipeline corresponding to a first video frame includes at least the position information of the target in that first video frame.
A tracking trajectory can be split into multiple motion pipelines, as shown in FIG. 6. Optionally, in the embodiments of this application, the tracking trajectory can be split into the position boxes of individual video frames, and each position box is taken as the common base of a double-frustum structure, such as the first base 601 in FIG. 6. The structure then extends forward and backward along the tracking trajectory to determine the other two bases of the double-frustum structure, namely the second base 602 and the third base 603. This yields a double-frustum structure with a common base, i.e., the motion pipeline corresponding to that single video frame.
For the first video frame of the video, the forward extension can be regarded as 0; similarly, the backward extension of the last video frame is 0, so the motion pipelines corresponding to the first and last video frames degenerate into single-frustum structures. It should be noted that the length of a motion pipeline is defined as the number of video frames it corresponds to; as shown in FIG. 6, the total number of video frames between (and including) the frame corresponding to the second base 602 and the frame corresponding to the third base 603 is the length of the motion pipeline.
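For illustration only, the following Python sketch shows one way such a split could be implemented; the list-of-boxes trajectory representation and the fixed forward/backward reach are assumptions, not details taken from the embodiments.

```python
# Hypothetical sketch: splitting a tracking trajectory into per-frame motion
# pipelines (double-frustum structures). A trajectory is modeled as a list of
# per-frame boxes (x1, y1, x2, y2); "extent" is the forward/backward reach.

def split_trajectory(boxes, extent=2):
    """For each frame t_m, build a pipeline with bases at t_s, t_m, t_e."""
    pipelines = []
    last = len(boxes) - 1
    for t_m in range(len(boxes)):
        t_s = max(0, t_m - extent)          # clamped to 0 at the first frame
        t_e = min(last, t_m + extent)       # clamped to 0 at the last frame
        pipelines.append({
            "t_s": t_s, "B_s": boxes[t_s],  # first base
            "t_m": t_m, "B_m": boxes[t_m],  # common base
            "t_e": t_e, "B_e": boxes[t_e],  # second base
            "length": t_e - t_s + 1,        # frames covered, inclusive
        })
    return pipelines
```

At the first and last frames the clamping reproduces the degenerate single-frustum case described above.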
In the embodiments of this application, a motion pipeline is represented in a specific data format. Please refer to FIG. 7 and FIG. 8, which are two schematic diagrams of motion pipeline data formats in the embodiments of this application.
As shown in FIG. 7, the first data format includes 3 values in the time dimension, (t_s, t_m, t_e), and 12 values in the spatial dimensions, 15 values in total. At the moment corresponding to each time value, the position information of the target in space is determined by 4 values. For example, in the video frame at time t_s, the target position region is B_s, and this region can be determined by four values, e.g., the coordinates of two diagonal corners of the box, (x_s1, y_s1) and (x_s2, y_s2).
As shown in FIG. 8, the motion pipeline output by the neural network model can also be represented in another data format. For the motion pipeline of video frame m, B_m is the detection box corresponding to the target in the common base, i.e., a partial image region of the corresponding video frame, and P is any pixel in region B_m. One value identifies the moment at which that pixel is located. In the time dimension, two values, d_s and d_e, determine the lengths by which the motion pipeline extends forward and backward, respectively. The four values l_m, b_m, t_m, and r_m indicate, with point P as the reference point, the offsets of the boundary of region B_m relative to P (regress values for B_m). The four values l_s, b_s, t_s, and r_s indicate the offsets of the boundary of region B_s relative to the boundary of region B_m (regress values for B_s); similarly, the four values l_e, b_e, t_e, and r_e indicate the offsets of the boundary of region B_e relative to the boundary of region B_m (regress values for B_e).
It can be seen that both data formats represent a single motion pipeline with 15 values, and the two data formats can be converted into each other.
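As a rough illustration of this convertibility, the following Python sketch decodes the second format (FIG. 8) into the first format (FIG. 7). The conventions here are assumptions made for illustration: offsets are taken in (left, top, right, bottom) order, and the B_s/B_e regress values are added to the corresponding edges of B_m.

```python
# Hypothetical decoding of the FIG. 8 format into the FIG. 7 format.
# Assumed conventions: P = (px, py) lies inside B_m; edge offsets are given
# in (left, top, right, bottom) order; the B_s / B_e regress values are
# added to the corresponding edges of B_m. 15 values in, 15 values out.

def decode_pipeline(px, py, t_pixel, d_s, d_e, off_m, off_s, off_e):
    left, top, right, bottom = off_m
    B_m = (px - left, py - top, px + right, py + bottom)   # common base box
    B_s = tuple(edge + delta for edge, delta in zip(B_m, off_s))
    B_e = tuple(edge + delta for edge, delta in zip(B_m, off_e))
    t_s, t_m, t_e = t_pixel - d_s, t_pixel, t_pixel + d_e  # three base times
    return (t_s, t_m, t_e), (B_s, B_m, B_e)                # FIG. 7 layout
```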
2. Intersection-over-union (IoU).
IoU is usually used to measure the degree of overlap between two position regions. In object detection, the intersection-over-union (IoU) is the ratio of the intersection to the union of two rectangular detection boxes, and its value lies in [0, 1]. Obviously, when IoU = 0, the two position regions do not overlap at all; when IoU = 1, the two position regions coincide.
In the embodiments of this application, the concept of IoU is extended to the three-dimensional space of the space-time dimensions to measure the degree to which two motion pipelines overlap in the space-time dimensions. Please refer to FIG. 9, which is a schematic diagram of the intersection and union of motion pipelines in an embodiment of this application.
IoU(T^(1), T^(2)) = ∩(T^(1), T^(2)) / ∪(T^(1), T^(2))
where T^(1) denotes motion pipeline 1, T^(2) denotes motion pipeline 2, ∩(T^(1), T^(2)) denotes the intersection of the two motion pipelines, and ∪(T^(1), T^(2)) denotes the union of the two motion pipelines.
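A minimal Python sketch of one way this space-time IoU could be computed discretely is given below, assuming each pipeline has already been sampled as one axis-aligned box per frame (e.g., by interpolating between its bases); the frame-wise sampling is an assumption for illustration, not a prescription from the embodiments.

```python
# Hypothetical discrete space-time IoU: each motion pipeline is represented
# as {frame_index: (x1, y1, x2, y2)}, one box per frame it covers.

def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def box_inter(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return box_area((x1, y1, x2, y2))

def tube_iou(t1, t2):
    """Sum of per-frame intersection areas over the union volume."""
    inter = sum(box_inter(t1[f], t2[f]) for f in t1.keys() & t2.keys())
    vol1 = sum(box_area(b) for b in t1.values())
    vol2 = sum(box_area(b) for b in t2.values())
    union = vol1 + vol2 - inter
    return inter / union if union > 0 else 0.0
```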
The target tracking method proposed in the embodiments of this application relates to the field of artificial intelligence technology; the artificial intelligence system is briefly introduced below. FIG. 1 shows a schematic diagram of an artificial intelligence framework, which describes the overall workflow of an artificial intelligence system and is applicable to general requirements of the artificial intelligence field.
The above artificial intelligence framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "intelligent information chain" reflects the series of processes from data acquisition to processing. For example, it may be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process of "data - information - knowledge - wisdom".
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technical realization) to the industrial ecology of the system.
(1) Infrastructure:
The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and provides support through a base platform. Communication with the outside is performed through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the base platform includes platform guarantees and support such as distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to obtain data, and the data is provided to intelligent chips in a distributed computing system provided by the base platform for computation.
(2) Data
The data at the layer above the infrastructure indicates the data sources in the field of artificial intelligence. The data involves graphics, images, speech, and text, as well as Internet-of-Things data of traditional devices, including business data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, searching, reasoning, decision-making, and the like.
Among them, machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, and so on, on data.
Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formal information to perform machine thinking and problem solving according to reasoning control strategies; typical functions are searching and matching.
Decision-making refers to the process of making decisions after intelligent information has been reasoned about, and usually provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the data has undergone the data processing mentioned above, some general capabilities can be further formed based on the results of the data processing, for example, an algorithm or a general system, such as translation, text analysis, computer-vision processing, speech recognition, image recognition, and so on.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The application fields mainly include intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, safe cities, intelligent terminals, and the like.
In the target tracking method proposed in the embodiments of this application, the motion pipeline of the target object is obtained through a deep neural network. The system architecture for data processing based on a deep neural network is briefly introduced below. Referring to FIG. 2, an embodiment of this application provides a system architecture 200. A data collection device 260 is used to collect video data of moving targets and store it in a database 230, and a training device 220 generates a target model/rule 201 based on the video samples containing moving targets maintained in the database 230. How the training device 220 obtains the target model/rule 201 based on video samples of moving targets is described in more detail below; the target model/rule 201 can be used in application scenarios such as single-target tracking, multi-target tracking, and virtual reality.
In the embodiments of this application, training may be performed based on video samples of moving targets. Specifically, various video samples containing moving targets may be collected by the data collection device 260 and stored in the database 230. In addition, video data may also be obtained directly from commonly used databases.
The target model/rule 201 may be obtained based on a deep neural network, which is introduced below.
The work of each layer in a deep neural network can be described by the mathematical expression y = a(W·x + b). At the physical level, the work of each layer in a deep neural network can be understood as completing a transformation from the input space (the set of input vectors) to the output space (i.e., from the row space to the column space of the matrix) through five operations on the input space: 1. raising/lowering the dimension; 2. enlarging/shrinking; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are performed by W·x, operation 4 is performed by +b, and operation 5 is realized by a(). The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of that class. W is a weight vector, and each value in the vector represents the weight value of one neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above; that is, the weight W of each layer controls how the space is transformed. The purpose of training a deep neural network is ultimately to obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
Because it is desirable for the output of the deep neural network to be as close as possible to the value that is actually to be predicted, the weight vector of each layer of the neural network can be updated by comparing the current predicted value of the network with the actually desired target value and adjusting according to the difference between the two (of course, there is usually an initialization process before the first update, i.e., parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to make the prediction lower, and adjustments continue until the neural network can predict the actually desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the loss function or objective function, an important equation for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
The target model/rule obtained by the training device 220 can be applied in different systems or devices. In FIG. 2, an execution device 210 is configured with an I/O interface 212 for data interaction with external devices, and a "user" can input data to the I/O interface 212 through a client device 240.
The execution device 210 can call data, code, and the like in a data storage system 250, and can also store data, instructions, and the like in the data storage system 250.
A calculation module 211 processes the input data using the target model/rule 201. Taking target tracking as an example, the calculation module 211 can parse the input video to obtain features indicating target position information in the video frames.
An associated function module 213 can preprocess the image data in the calculation module 211, for example, perform video preprocessing, including video segmentation and the like.
An associated function module 214 can preprocess the image data in the calculation module 211, for example, perform video preprocessing, including video segmentation and the like.
Finally, the I/O interface 212 returns the processing result to the client device 240 and provides it to the user.
At a deeper level, the training device 220 can generate corresponding target models/rules 201 based on different data for different targets, so as to provide users with better results.
In the case shown in FIG. 2, the user can manually specify the data input into the execution device 210, for example, by operating in an interface provided by the I/O interface 212. In another case, the client device 240 can automatically input data to the I/O interface 212 and obtain results; if automatic data input by the client device 240 requires the user's authorization, the user can set corresponding permissions in the client device 240. The user can view the results output by the execution device 210 on the client device 240, and the specific presentation form can be display, sound, action, or another specific manner. The client device 240 can also serve as a data collection terminal and store the collected training data in the database 230.
It is worth noting that FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationships among the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, in FIG. 2, the data storage system 250 is an external memory relative to the execution device 210; in other cases, the data storage system 250 may also be placed in the execution device 210.
The deep neural network used to extract motion pipelines from video in the embodiments of this application may be, for example, a convolutional neural network (CNN). CNNs are introduced in detail below.
A CNN is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network; taking image processing as an example, each neuron in the feed-forward artificial neural network responds to an overlapping region in the image input to it. Of course, other types are also possible; this application does not limit the type of the deep neural network.
As shown in FIG. 3, a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130.
Convolutional layer/pooling layer 120:
Convolutional layer:
As shown in FIG. 3, the convolutional layer/pooling layer 120 may include, for example, layers 121 to 126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer. In another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
Taking the convolutional layer 121 as an example, the convolutional layer 121 may include many convolution operators. A convolution operator, also called a kernel, acts in image processing like a filter that extracts specific information from the input image matrix. A convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), thereby completing the work of extracting a specific feature from the image.
Convolution kernels also come in multiple formats, depending on the dimensionality of the data to be processed. Commonly used convolution kernels include two-dimensional and three-dimensional kernels. Two-dimensional convolution kernels are mainly used to process two-dimensional image data, while three-dimensional convolution kernels, having an additional depth or time dimension, can be applied to video processing, stereoscopic image processing, and the like. In the embodiments of this application, in order to extract information in both the time dimension and the spatial dimensions of a video through the neural network model, three-dimensional convolution kernels are used to perform convolution operations in the time dimension and the spatial dimensions simultaneously. Thus, a three-dimensional convolutional neural network composed of three-dimensional convolution kernels can obtain the features of each video frame while also expressing the correlations and changes of the video frames over time.
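To make this concrete, a minimal PyTorch sketch is given below; the framework, tensor shapes, and channel counts are chosen only for illustration and are not specified by the embodiments. It applies a 3-D convolution to a video tensor so that the kernel slides over the time and spatial dimensions simultaneously.

```python
# Illustrative only: a 3-D convolution over a video clip, convolving jointly
# over time (T) and space (H, W). All shapes are assumptions.
import torch
import torch.nn as nn

video = torch.randn(1, 3, 8, 224, 224)    # (batch, RGB channels, T, H, W)

conv3d = nn.Conv3d(
    in_channels=3, out_channels=64,
    kernel_size=(3, 3, 3),                # 3 frames x 3 pixels x 3 pixels
    stride=1, padding=1,                  # keep T, H, W unchanged
)

features = conv3d(video)                  # -> (1, 64, 8, 224, 224)
print(features.shape)
```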
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layers (e.g., 121) often extract more general features, which may also be called low-level features. As the depth of the convolutional neural network 100 increases, the features extracted by later convolutional layers (e.g., 126) become more and more complex, such as high-level semantic features; features with higher-level semantics are more suitable for the problem to be solved. For convenience of describing the network structure, multiple convolutional layers may be referred to as a block.
Pooling layer:
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. That is, in layers 121 to 126 as illustrated by 120 in FIG. 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image.
Neural network layer 130:
After processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet sufficient to output the required output information. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 3) and an output layer 140. The parameters contained in the multiple hidden layers may be obtained by pre-training on relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the multiple hidden layers in the neural network layer 130, i.e., as the final layer of the entire convolutional neural network 100, comes the output layer 140.
It should be noted that the convolutional neural network 100 shown in FIG. 3 is only an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models; for example, as shown in FIG. 4, multiple convolutional layers/pooling layers are arranged in parallel, and the separately extracted features are all input to the neural network layer 130 for processing.
Optionally, the deep neural network used to extract motion pipelines from video in the embodiments of this application is a combination of a residual neural network and a feature pyramid network. The residual neural network makes deeper networks easier to train by letting the deep network learn residual representations; residual learning alleviates the vanishing-gradient and exploding-gradient problems in deep networks. The feature pyramid network detects targets of corresponding scales on feature maps of different resolutions; each of its output layers is obtained by fusing the feature maps of the current layer and higher layers, so every output feature map has sufficient feature expressiveness.
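The following PyTorch sketch illustrates the top-down fusion idea of a feature pyramid network on a set of backbone feature maps; the channel counts and the use of nearest-neighbor upsampling are assumptions for illustration, not details from the embodiments.

```python
# Illustrative FPN-style top-down fusion: each output level combines the
# current backbone level (via a 1x1 lateral conv) with the upsampled level
# above it, so every output map mixes fine resolution with high-level semantics.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                 # feats ordered fine -> coarse
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, laterals)]

fpn = TopDownFPN()
feats = [torch.randn(1, c, s, s) for c, s in [(256, 64), (512, 32), (1024, 16)]]
outs = fpn(feats)                             # three maps, each with 256 channels
```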
The target tracking technology involved in the target detection method provided by the embodiments of this application is widely applicable. For example, for autofocus during video shooting, a target tracking algorithm can help the photographer select the focus accurately and conveniently, or flexibly switch the focus to track a target, which is particularly important in sports events and wildlife shooting. In surveillance scenarios, a multi-target tracking algorithm can automatically track the positions of selected target objects, making it convenient to find a given target, which is of great significance in the security field. In autonomous driving scenarios, a multi-target tracking algorithm can grasp the motion trajectories and trends of surrounding pedestrians and vehicles, providing initial information for functions such as path planning and automatic obstacle avoidance. In virtual reality scenarios, somatosensory games, gesture recognition, finger tracking, and the like can also be realized through multi-target tracking technology.
A typical target tracking method consists of two parts, detection and tracking: a detection module detects the targets appearing in each video frame, and then the targets appearing in the individual video frames are matched. During matching, the features of each target object in a single video frame are extracted, target matching is achieved by comparing feature similarity, and the tracking trajectory of each target object is obtained. Because this kind of target tracking method adopts the detect-then-track approach, the tracking result depends on the single-frame detection algorithm; if a target is occluded during detection, detection errors occur and in turn cause tracking errors. Performance is therefore insufficient in scenes with dense targets or heavy occlusion.
The embodiments of this application provide a target detection method: a video is input into a pre-trained neural network model, multiple motion pipelines are output, and the tracking trajectories corresponding to one or more targets are recovered by matching the multiple motion pipelines. First, because a motion pipeline includes the position information of the target object in at least two video frames, target tracking does not depend on the detection result of any single video frame, which reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance. Second, conventional target detection methods rely on single-frame detection algorithms, so the accuracy of the overall algorithm is limited by the detector; training a detection model and a tracking model separately incurs high development cost, and splitting the algorithm into two stages also increases the computational cost and deployment difficulty of the machine learning process. The target tracking method provided by the embodiments of this application enables end-to-end training and completes the detection and tracking of multiple target objects with a single neural network model, which reduces model complexity. Furthermore, the features that the prior art extracts from a single video frame are relatively limited; the target tracking method provided by the embodiments of this application takes video as the raw input, so the model can perform the tracking task using multiple kinds of features, such as appearance features, motion trajectory features, or gait features, which improves target tracking performance. Finally, the target tracking method provided by the embodiments of this application uses video as the raw input of the model, which enlarges the receptive field in the time dimension and better captures the motion information of persons.
The target detection method provided by the embodiments of this application is described in detail below. Please refer to FIG. 10, which is a schematic diagram of an embodiment of the target detection method in an embodiment of this application.
1001. Preprocess the video.
The target tracking device can preprocess the acquired video. Optionally, the preprocessing includes one or more of the following: dividing the video into segments of a preset length, adjusting the video resolution, and adjusting and normalizing the color space.
For example, when the video is long, considering the data processing capability of the target tracking device, the video may be divided into small segments of 8 frames, as sketched below.
It should be noted that step 1001 is optional and may or may not be performed.
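A minimal Python sketch of such preprocessing follows; the segment length, target resolution, and normalization scheme are assumptions for illustration only.

```python
# Illustrative preprocessing: split a video into fixed-length clips, resize
# frames, and normalize the color values. All constants are assumptions.
import numpy as np
import cv2  # OpenCV, used here only for resizing

def preprocess(frames, clip_len=8, size=(224, 224)):
    """frames: list of HxWx3 uint8 RGB arrays -> list of clip arrays."""
    resized = [cv2.resize(f, size) for f in frames]
    norm = [f.astype(np.float32) / 255.0 for f in resized]   # scale to [0, 1]
    clips = []
    for i in range(0, len(norm) - clip_len + 1, clip_len):
        clips.append(np.stack(norm[i:i + clip_len]))         # (clip_len, H, W, 3)
    return clips
```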
1002. Input the video into the neural network model to obtain motion pipelines and their confidences.
The video is input into the pre-trained neural network model to obtain the position information of the target object in at least two video frames and the time information of the at least two video frames. Optionally, the video is input into the pre-trained neural network model to obtain the motion pipeline of each target object. A motion pipeline is used to indicate the time information and position information of the target object in at least two video frames of the first video; for the specific way in which a motion pipeline indicates time information and position information, refer to the foregoing introduction, which is not repeated here. The training process of the neural network model is described in detail in subsequent embodiments.
Optionally, the data format of the output motion pipelines is the type shown in FIG. 8. Specifically, the input is a video I, I ∈ R^(t×h×w×3), where R denotes the real number field, t denotes the number of video frames, h×w denotes the video resolution, and 3 denotes the RGB color channels. The output is the motion pipelines O, O ∈ R^(t×h'×w'×15), where R denotes the real number field, t denotes the number of video frames, and h'×w' denotes the resolution of the feature map output by the neural network. That is, t×h'×w' motion pipelines are output, and each video frame corresponds to h'×w' motion pipelines.
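These shapes can be illustrated with a short sketch that flattens the network output into a list of per-location pipeline candidates; the tensor layout follows the text above, while the random array stands in for a real model output and is an assumption.

```python
# Illustrative only: turn a network output O of shape (t, h', w', 15) into a
# flat list of candidate motion pipelines, one per output location.
import numpy as np

t, hp, wp = 8, 56, 56
O = np.random.rand(t, hp, wp, 15)          # stand-in for real model output

pipelines = []
for frame in range(t):
    for y in range(hp):
        for x in range(wp):
            values = O[frame, y, x]        # 15 values: time extents + offsets
            pipelines.append((frame, y, x, values))

print(len(pipelines))                      # t * h' * w' candidate pipelines
```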
Optionally, the category information of the target object is obtained through the pre-trained neural network model. Specifically, the confidence of each motion pipeline is obtained through the pre-trained neural network model, and the confidence of a motion pipeline can be used to determine the category information of the target object corresponding to that motion pipeline.
Since a motion pipeline indicates the position information of a target in video frames, each motion pipeline corresponds to one target to be tracked, and the confidence of a motion pipeline refers to the possibility that the target corresponding to that motion pipeline belongs to a preset category. Usually, the categories of the target objects to be tracked in the video need to be preset, for example, person, vehicle, or dog. The output confidences of a motion pipeline represent the probabilities that the target corresponding to that motion pipeline belongs to the preset categories. A confidence is a value between 0 and 1: a smaller value indicates a lower possibility of belonging to the preset category, and a larger value indicates a higher possibility.
Optionally, the number of confidences of each motion pipeline is equal to the number of preset target object categories, and each confidence indicates the possibility that the motion pipeline belongs to the corresponding category. The confidences of the motion pipelines output by the neural network model form a confidence table.
Example 1: the preset categories of the target object are "person" and "background", where the background refers to image regions that do not contain a target object to be tracked. The confidence values of the first motion pipeline for these categories are 0.1 and 0.9, respectively, and those of the second motion pipeline are 0.7 and 0.3. Since there is only one non-background category, the category of a target object has two possibilities, "person" or "background", so the confidence threshold can be set to 0.5. For the first motion pipeline, the confidence of belonging to "person" is 0.1, which is less than or equal to 0.5, meaning the corresponding target is unlikely to be a person, while the confidence of belonging to "background" is 0.9, which is greater than 0.5, meaning it is likely to be background. For the second motion pipeline, the confidence of belonging to "person" is 0.7, which is greater than 0.5, meaning the corresponding target is likely to be a person, while the confidence of belonging to "background" is 0.3, which is less than 0.5, meaning it is unlikely to be background.
Example 2: the preset categories of the target object are "person", "vehicle", and "background". The confidence values of the first motion pipeline are 0.4, 0.1, and 0.2, and those of the second motion pipeline are 0.2, 0.8, and 0.1. The category of a target object has three possibilities: "person", "vehicle", or "background", so 1/3 ≈ 0.33 can be used as the confidence threshold. Since 0.4 is greater than 0.33, the category with the highest confidence for the first motion pipeline is "person", that is, the corresponding target object is most likely a person. Similarly, the category with the highest confidence for the second motion pipeline is "vehicle", that is, the corresponding target object is most likely a vehicle.
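The selection logic of the two examples can be sketched as follows; the category names and confidence values come from Example 2 above, while everything else (variable names, printing) is illustrative:

```python
# Confidence table: one row per motion pipeline, one column per preset category.
categories = ["person", "vehicle", "background"]
confidences = [
    [0.4, 0.1, 0.2],  # first motion pipeline  -> highest score: "person"
    [0.2, 0.8, 0.1],  # second motion pipeline -> highest score: "vehicle"
]

threshold = 1.0 / len(categories)  # 1/3 ~= 0.33, as in Example 2

for row in confidences:
    best = max(range(len(categories)), key=lambda k: row[k])
    if row[best] > threshold:
        print(f"pipeline assigned to category '{categories[best]}' ({row[best]:.2f})")
    else:
        print("no category exceeds the threshold")
```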
1003. Prune some of the motion pipelines.
Before the tracking trajectory of the target object is obtained from the motion pipelines, the motion pipelines may also be pruned to obtain a pruned set of motion pipelines, and the pruned motion pipelines are then used to obtain the tracking trajectory of the target object. The multiple motion pipelines output by the neural network model can be pruned according to preset conditions.
Among the multiple motion pipelines predicted by the neural network model, every pixel position in every video frame corresponds to a motion pipeline, and a target appearing in a video frame usually occupies multiple pixel positions, so there are multiple motion pipelines indicating the position information of the same target object. In this step, the multiple motion pipelines corresponding to the same target object can be pruned, reducing the amount of computation in the subsequent motion-pipeline connection step.
Optionally, if the confidence of each motion pipeline has been obtained, the category to which the target corresponding to each motion pipeline belongs can be determined according to the confidence, and the motion pipelines of each category are pruned separately.
Optionally, obtaining the pruned motion pipelines specifically includes: if the repetition rate between a first motion pipeline and a second motion pipeline is greater than or equal to a first threshold, deleting whichever of the first and second motion pipelines has the lower confidence. Optionally, the repetition rate of the motion pipelines may be the IoU between the two motion pipelines; the first threshold ranges from 0.3 to 0.7, for example 0.5, so that if the IoU between the first motion pipeline and the second motion pipeline is greater than or equal to 50%, the motion pipeline with the lower confidence is deleted. Optionally, the motion pipelines are pruned according to a non-maximum suppression (NMS) algorithm to obtain the pruned motion pipelines, with the motion-pipeline IoU threshold set to 0.5; with the NMS algorithm, only one corresponding motion pipeline is retained for each target in each video frame. Pruning detection results with the NMS algorithm is prior art, and the specific process is not repeated here.
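A minimal sketch of the NMS-style pruning described above follows. The tube-IoU routine is deliberately left abstract, since its exact geometry is defined by the double-frustum representation; the 0.5 threshold follows the text:

```python
def nms_tubes(tubes, scores, iou_fn, iou_thresh=0.5):
    """Greedy NMS: keep the highest-confidence tube, then drop any tube whose
    overlap (IoU) with an already-kept tube is >= iou_thresh."""
    order = sorted(range(len(tubes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou_fn(tubes[i], tubes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return kept  # indices of the surviving motion pipelines
```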
Among the multiple motion pipelines predicted by the neural network model, every pixel position in every video frame corresponds to a motion pipeline, so pixel positions covered by background regions that do not correspond to any target object also correspond to some motion pipelines. These motion pipelines can be understood as false motion pipelines, and their confidence is usually low. To reduce the computational complexity of the subsequent motion-pipeline connection step, pruning can also be performed according to the confidence of the motion pipelines.
Optionally, the confidence of every motion pipeline that remains after pruning is greater than or equal to a second threshold; that is, the pruning condition is that the confidence is below the second threshold. The second threshold is related to the number of preset target object categories. For example, if the number of preset categories is 2 ("person" or "background"), the second threshold is usually between 0.3 and 0.7, for example 0.5; if the number of target object categories is 10, the second threshold is usually between 0.07 and 0.13, for example 0.1.
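The confidence-based pruning then reduces to a simple filter, sketched below; the 0.5 default corresponds to the two-category case above:

```python
def prune_by_confidence(tubes, scores, conf_thresh=0.5):
    # Keep only tubes whose confidence reaches the second threshold.
    return [tube for tube, s in zip(tubes, scores) if s >= conf_thresh]
```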
It should be noted that step 1003 is optional and may or may not be performed.
1004. Connect the motion pipelines to obtain tracking trajectories.
The tracking trajectory of the target object in the first video is obtained according to the position information of the target object in at least two video frames and the time information of the at least two video frames. Optionally, since in this embodiment motion pipelines indicate the position information of the target object in at least two video frames and the time information of those frames, the tracking trajectory of the target object in the first video can be obtained from the motion pipelines. Specifically, the tracking trajectory is the tracking trajectory of the target object formed by connecting at least two motion pipelines, each corresponding to a quadrangular frustum in the space-time dimensions.
Specifically, multiple motion pipelines indicating the position information of the same target object are connected to obtain the tracking trajectory corresponding to each target; the connection between motion pipelines, also called the matching between motion pipelines, must satisfy preset conditions. Obtaining the tracking trajectory of the target object according to the motion pipelines specifically includes: connecting a third motion pipeline and a fourth motion pipeline that satisfy a preset condition, to obtain the tracking trajectory of the target object.
The preset condition can take several specific forms. Optionally, the preset condition includes one or more of the following: the intersection-over-union between the segments of the third motion pipeline and the fourth motion pipeline that overlap in the time dimension is greater than or equal to a third threshold; the cosine of the angle between the motion direction of the third motion pipeline and the motion direction of the fourth motion pipeline is greater than or equal to a fourth threshold, where the motion direction is a vector that indicates, according to a preset rule in the space-time dimensions, the position change of the target object in the motion pipeline; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to a fifth threshold, where the distance includes the Euclidean distance.
Specifically: the intersection-over-union between the motion-pipeline segments corresponding to the overlap of the two motion pipelines in the time dimension is greater than or equal to the third threshold; the cosine of the angle between the motion directions of the motion pipelines is greater than or equal to the fourth threshold; and the distance measure between the neural network feature vectors of the motion pipelines is less than or equal to the fifth threshold, where the distance measure may be, for example, the Euclidean distance. The neural network feature vector of a motion pipeline can be the feature vector output by any layer of the neural network model; optionally, it is the feature vector output by the last layer of the three-dimensional (3D) convolutional neural network in the neural network model.
The motion direction of a motion pipeline is a vector in the space-time dimensions indicating the position change of the target object between two base faces of the motion pipeline, and it indicates the moving speed and direction of the target object. It can be understood that the position of a target object in a video usually changes continuously, without abrupt jumps; therefore, the motion directions of adjacent motion-pipeline segments in a tracking trajectory are close to each other, and during the connection of motion pipelines, pipelines can also be connected according to the similarity of their motion directions. It should be noted that the motion direction of a motion pipeline can be determined according to preset rules. For example, in the space-time dimensions, the vector of the position change of the target object between the two base faces of the motion pipeline that are farthest apart in the time dimension (for example, Bs and Be of the motion pipeline shown in Figure 8) can be taken as the motion direction; or the vector of the position change of the target object between two adjacent base faces of the motion pipeline (for example, Bm and Be of the motion pipeline shown in Figure 8) can be taken as the motion direction; or the direction of the position change of the target object across a preset number of video frames, for example 5 frames, can be taken as the motion direction. Similarly, the direction of a tracking trajectory can be defined, at the end of the trajectory, as the direction of the position change of the target object across a preset number of video frames, or as the motion direction of the last motion pipeline at the end of the trajectory. It can be understood that the motion direction of a motion pipeline is generally defined, in the time dimension, as the direction from a given moment toward a later moment.
The value of the third threshold is not limited and is usually 70% to 95%, for example 75%, 80%, 85%, or 90%. The value of the fourth threshold is not limited and is usually between cos(π/6) and cos(π/36), for example cos(π/9), cos(π/12), or cos(π/18). The value of the fifth threshold can be determined according to the size of the feature vectors, and its specific value is not limited.
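Under the stated conditions, a matching predicate might look like the following sketch. The segment-IoU routine and the tube attributes (direction, feature) are placeholders, and the threshold values are representative picks from the ranges above, not prescribed by this application:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def can_connect(tube_a, tube_b, segment_iou_fn,
                iou_thresh=0.8,                     # third threshold (70%..95%)
                cos_thresh=math.cos(math.pi / 12),  # fourth threshold
                feat_thresh=1.0):                   # fifth threshold (data-dependent)
    """All three conditions from the text; an implementation may use any subset."""
    return (segment_iou_fn(tube_a, tube_b) >= iou_thresh
            and cosine(tube_a.direction, tube_b.direction) >= cos_thresh
            and euclidean(tube_a.feature, tube_b.feature) <= feat_thresh)
```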
Optionally, the following description takes as an example the preset condition that the intersection-over-union between the pipeline segments corresponding to the overlap of the two motion pipelines in the time dimension is greater than or equal to the third threshold, and that the cosine of the angle between the motion directions of the motion pipelines is greater than or equal to the fourth threshold.
Refer to Figure 11, a schematic diagram of an embodiment of matching between motion pipelines in an embodiment of this application.
Example 1: as shown in part a of Figure 11, if the intersection-over-union between the motion-pipeline segments corresponding to the overlap of the two motion pipelines in the time dimension is greater than or equal to the third threshold, and the cosine of the angle between the motion directions of the two motion pipelines is greater than or equal to the fourth threshold, that is, both the degree of overlap and the motion direction match, the two motion pipelines are matched successfully. It should be noted that the degree of overlap between two motion pipelines refers to the IoU between the motion-pipeline segments of the portion where the two motion pipelines overlap in the time dimension.
Example 2: as shown in part b of Figure 11, if the cosine of the angle between the motion directions of the two motion pipelines is less than the fourth threshold, that is, the motion directions do not match, the matching of the two motion pipelines is unsuccessful.
Example 3: as shown in part c of Figure 11, if the intersection-over-union between the motion-pipeline segments corresponding to the overlap of the two motion pipelines in the time dimension is less than the third threshold, that is, the degree of overlap does not match, the matching of the two motion pipelines is unsuccessful.
It should be noted that, since the two motion pipelines being matched overlap in the time dimension, there are two pieces of position information for the same target object in the video frames corresponding to the overlapping portion. The position of the target object in those frames can be determined by averaging, or, according to a preset rule, one motion pipeline can be designated as authoritative, for example the one whose common base face corresponds to the earlier time-dimension coordinate.
Optionally, in the matching process that connects all the motion pipelines of the video, a greedy algorithm can be used, performing the connection through a series of locally optimal choices; alternatively, the Hungarian algorithm can be used to perform globally optimal matching.
Connecting motion pipelines with the greedy algorithm specifically includes: computing the pairwise affinity between the two groups of motion pipelines to be matched (the affinity is defined as IoU·cos(θ), where θ is the angle between the motion directions) to form an affinity matrix, and then repeatedly selecting matched motion-pipeline pairs (Btube pairs) from the affinity matrix, starting from the maximum affinity, until the matching is complete.
Connecting motion pipelines with the Hungarian algorithm specifically includes: likewise, after the affinity matrix is obtained, using the Hungarian algorithm to select the motion-pipeline pairs.
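The two matching strategies can be sketched as follows, with the affinity defined as IoU·cos(θ) as above. The segment-IoU and direction-cosine routines are placeholders; for the Hungarian variant, SciPy's linear_sum_assignment is one readily available solver, used here only as an illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def affinity_matrix(tubes_a, tubes_b, seg_iou, cos_dir):
    A = np.zeros((len(tubes_a), len(tubes_b)))
    for i, ta in enumerate(tubes_a):
        for j, tb in enumerate(tubes_b):
            A[i, j] = seg_iou(ta, tb) * cos_dir(ta, tb)  # IoU * cos(theta)
    return A

def greedy_match(A, min_affinity=0.0):
    """Repeatedly pick the largest remaining affinity (a local optimum)."""
    A = A.copy()
    pairs = []
    while A.size and A.max() > min_affinity:
        i, j = np.unravel_index(A.argmax(), A.shape)
        pairs.append((i, j))
        A[i, :] = -np.inf   # each tube is matched at most once
        A[:, j] = -np.inf
    return pairs

def hungarian_match(A):
    """Globally optimal assignment; maximize total affinity."""
    rows, cols = linear_sum_assignment(-A)  # scipy minimizes, so negate
    return list(zip(rows, cols))
```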
Optionally, a specific process of connecting multiple motion pipelines in this embodiment is described below (a code sketch follows the example):
1) Take all motion pipelines that start at the first frame as initial tracking trajectories, obtaining a set of tracking trajectories.
2) Connect the motion pipelines that start at the second frame, in turn, with the tracking trajectories in the set; if the preset condition is satisfied, the matching succeeds and the original tracking trajectory is updated according to that motion pipeline. If the matching is unsuccessful, the motion pipeline is added to the set as a new initial tracking trajectory.
3) Similarly, connect the motion pipelines that start at the i-th frame with the tracking trajectory set in turn, where i is a positive integer greater than 2 and less than t, and t is the total number of frames of the video. If the preset condition is satisfied, the matching succeeds and the tracking trajectory is updated according to that motion pipeline; if the matching is unsuccessful, the motion pipeline is added to the set as a new initial tracking trajectory.
Optionally, this embodiment uses a greedy algorithm, connecting pipelines to trajectories in order starting from the maximum affinity.
For example, let the motion pipelines starting at the first frame form the first group, those starting at the second frame form the second group, and similarly those starting at the i-th frame form the i-th group, where the first group includes 10 motion pipelines, the second group includes 8 motion pipelines, and the third group includes 13 motion pipelines. First, the 10 motion pipelines in the first group are taken as 10 initial tracking trajectories, and the second group is connected with these initial trajectories: if the connection condition is satisfied, the tracking trajectory is updated; if not, the original initial trajectory is kept. Assuming the 8 motion pipelines in the second group all satisfy the connection condition and are successfully connected to 8 of the 10 initial tracking trajectories, the tracking trajectory set then includes 8 updated tracking trajectories, while the other 2 remain unchanged. Next, the 13 motion pipelines in the third group are connected with the trajectories in the set; since the set includes 10 tracking trajectories, even if all of them are successfully connected to motion pipelines of the third group, 3 motion pipelines are still not used to update any trajectory, and these 3 motion pipelines can serve as newly added initial tracking trajectories, that is, 3 new tracking trajectories are added to the set.
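The loop of steps 1) to 3) might be sketched like this. The matching predicate and the merge routine are placeholders, and the greedy ordering by affinity is omitted for brevity:

```python
def build_tracks(tube_groups, matches, merge):
    """tube_groups[i]: tubes whose first frame is frame i (0-based here).
    matches(track, tube) -> bool; merge(track, tube) -> extended track."""
    tracks = [[tube] for tube in tube_groups[0]]  # frame-1 tubes seed the set
    for group in tube_groups[1:]:
        for tube in group:
            for k, track in enumerate(tracks):
                if matches(track, tube):
                    tracks[k] = merge(track, tube)  # update existing trajectory
                    break
            else:
                tracks.append([tube])  # unmatched tube starts a new trajectory
    return tracks
```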
Optionally, the target category to which each motion pipeline's target belongs is determined according to the confidence table of the motion pipelines, and the motion pipelines of different target categories are connected separately, to obtain the tracking trajectory of the target objects of each category.
Optionally, the spatial position of an occluded part can be obtained by interpolating and completing the motion pipelines.
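One simple way to realize this completion is linear interpolation of the box coordinates across the occluded frames, sketched below; the (x1, y1, x2, y2) box format is an assumption for illustration:

```python
def interpolate_boxes(box_a, box_b, num_missing):
    """Fill num_missing frames between two known boxes by linear interpolation."""
    filled = []
    for k in range(1, num_missing + 1):
        w = k / (num_missing + 1)
        filled.append(tuple((1 - w) * a + w * b for a, b in zip(box_a, box_b)))
    return filled
```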
1005. Output the tracking trajectories.
The connected tracking trajectories are output in a specific format, for example as a video stream or a trajectory log.
After the tracking trajectories are obtained, they are rendered as bounding boxes superimposed on the original video and output to a display, completing the real-time tracking deployment and realizing target tracking.
The target tracking method provided in the embodiments of this application relies on a pre-trained neural network model; the training method of this neural network model is described below.
Refer to Figure 12, a schematic diagram of an embodiment of the training method of the neural network model in an embodiment of this application.
1201. Training preparation.
Training preparation includes building the training hardware environment, constructing the network model, and setting the training parameters.
Prepare the hardware environment required for training. For example, training uses 32 V100-32G GPUs in a distributed cluster of 4 nodes, while the inference process uses a single V100-16G GPU and is completed on a single machine.
Obtain a video dataset; a public dataset such as the MOT dataset can be used. Optionally, the video samples in the dataset can also be processed to increase the diversity of the data distribution and obtain better model generalization. Optionally, the processing of the videos includes resolution scaling, whitening of the color space, random HSL jitter of the video colors (HSL is a color space, or color representation, where H is hue, S is saturation, and L is lightness), random horizontal flipping of video frames, and so on.
Set the training parameters, including batch size, learning rate, optimizer model, and so on. For example, the batch size is 32, and the learning rate starts at 10^(-3) and is reduced by a factor of 5 when the loss plateaus, for better convergence. The network essentially converges after 25K training iterations. To increase the generalization ability of the model, a second-order regularization loss of 10^(-5) is applied, with a momentum coefficient of 0.9.
1202. Split the manually annotated trajectory information to obtain ground-truth motion pipelines.
Obtain the manually annotated trajectory information of the video samples in the public dataset, including the target ID and the bounding box of the target object in each video frame.
Split the manually annotated trajectory information to obtain, for each frame, a motion pipeline whose common base face is the bounding box of the target object in that frame. Based on the first data format of motion pipelines, each motion pipeline is represented by 15 values.
The specific method for obtaining the motion pipelines is as follows:
Split the tracking trajectory into the bounding boxes of single video frames; take each bounding box as the common base face of a double quadrangular frustum structure, and extend forward and backward along the tracking trajectory to determine the other two base faces of the double-frustum structure, thereby obtaining a double quadrangular frustum structure with a common base face, that is, the motion pipeline corresponding to that single video frame.
There are several ways to split a tracking trajectory into motion pipelines:
Optionally, split according to a preset pipeline length, that is, set the intervals between the three base faces of the double quadrangular frustum structure; for example, if the interval between the common base face and each of the other two base faces is 4, the length of the motion pipeline is 8.
Optionally, during the splitting process, under the condition that the IoU between the double quadrangular frustum structure and the corresponding section of the original tracking trajectory remains greater than or equal to 85%, the length of the motion pipeline in the time dimension is extended as far as possible, and the structure with the longest extent in the time dimension is taken as the final extended structure, as shown in Figure 13. Since the structure of a motion pipeline (Btube) is linear while the structure of the ground-truth trajectory is nonlinear, a long motion pipeline often cannot fit the trajectory well; that is, as the length increases, the IoU decreases (IoU < η), while motion pipelines with larger IoU (IoU > η) are usually shorter. In the embodiments of this application, the longest motion pipeline that still satisfies the minimum IoU threshold is taken as the split result, which fits the original trajectory well while enlarging the temporal receptive field. As shown in Figure 13, the overlapping part (Overlap Part) of motion pipelines can be used for connection matching between them.
Similarly, the tracking trajectories of all target objects in the video samples are split to obtain multiple ground-truth motion pipelines.
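The "extend as long as the IoU stays above η" rule can be sketched as a simple search; the tube-fitting and tube-trajectory IoU routines are placeholders whose exact behavior depends on the double-frustum geometry, and η = 0.85 follows the text:

```python
def longest_tube(track_boxes, center, fit_tube, tube_track_iou, eta=0.85):
    """Grow the tube symmetrically around `center` while the linear tube
    still overlaps the (nonlinear) ground-truth section with IoU >= eta."""
    best = fit_tube(track_boxes, center, 1)
    extent = 1
    while True:
        lo, hi = center - (extent + 1), center + (extent + 1)
        if lo < 0 or hi >= len(track_boxes):
            break
        candidate = fit_tube(track_boxes, center, extent + 1)
        if tube_track_iou(candidate, track_boxes, lo, hi) < eta:
            break
        best, extent = candidate, extent + 1
    return best
```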
1203. Input video samples into the initial network model for training and obtain predicted motion pipelines.
The video samples are input into the initial network model for training, and the predicted motion pipelines are output.
Optionally, the initial network model is a three-dimensional (3D) convolutional neural network or a recurrent neural network, among others, where the 3D convolutional neural network includes a 3D residual neural network or a 3D feature pyramid network, among others. Optionally, the neural network model is a combination of a 3D residual neural network and a 3D feature pyramid network.
The video samples are input into the initial network model, which outputs the motion pipelines of all target objects.
The data format of the output motion pipelines is of the type shown in Figure 8. Specifically, the input is a video I, I ∈ R^(t×h×w×3), where R denotes the real number field, t denotes the number of video frames, h×w denotes the video resolution, and 3 denotes the RGB color channels; the output is the motion pipeline tensor O, O ∈ R^(t×h'×w'×15), where R denotes the real number field, t denotes the number of video frames, and h'×w' denotes the resolution of the feature map output by the neural network. That is, t×h'×w' motion pipelines are output, with each video frame corresponding to h'×w' motion pipelines.
Optionally, the confidence of each motion pipeline is also output; the confidence is used to indicate the category of the target object corresponding to the motion pipeline.
It should be noted that the execution order of step 1202 and step 1203 is not limited.
1204. Compute the training loss.
Since step 1202 splits the manually annotated trajectory information, the obtained ground-truth motion pipelines are in the first data format of motion pipelines, R^(n×15), where n is the number of motion pipelines;
while the motion pipelines output by the initial network model in step 1203 are in the second data format of motion pipelines, R^(t×h'×w'×15), where t×h'×w' is the number of motion pipelines.
To compute the training loss from the ground truth and the prediction, the ground-truth motion pipelines obtained in step 1202 and the motion pipelines output by the neural network model must be unified into a single data format.
Optionally, in this embodiment of the application, the ground-truth motion pipelines are converted into the second data format. Referring to Figure 14, the t×h'×w' motion pipelines output by the neural network model include t×h'×w' points P (only P1 and P2 are shown as examples in Figure 14); these points form a three-dimensional lattice distributed over the time dimension and the two spatial dimensions. To achieve the data conversion, the n ground-truth motion pipelines must be assigned into a similar three-dimensional lattice according to the following rule: if a point of the lattice lies inside the common base face of the double quadrangular frustum structure corresponding to a ground-truth motion pipeline, that ground-truth value is assigned to the motion-pipeline position corresponding to that point. If a lattice point lies inside the common base faces of several ground-truth motion pipelines (that is, targets overlap), the motion pipeline with the smaller volume is assigned preferentially. After the assignment is completed, a ground-truth motion-pipeline tensor T with the same format R^(t×h'×w'×15) is obtained. It should be noted that some points are not assigned any ground truth; these positions can be padded with 0, and the ground truth is accompanied by a 0/1 truth table indicating whether each position is a padded pipeline. This truth table A′ can serve as the confidence corresponding to the ground-truth motion pipelines.
After the ground truth has been converted into the second data format, the loss between the ground truth (T) and the prediction (O) can be computed.
Optionally, the loss function L is:
L = L1 + L2
L1 = -ln(IoU(T, O))
L2 = CrossEntropy(A, A′)
where IoU(T, O) denotes the intersection-over-union between the ground-truth motion pipelines (T) and the predicted motion pipelines (O), A is the confidence of the predicted motion pipelines (O), A′ is the confidence of the ground-truth motion pipelines (T), and CrossEntropy denotes the cross-entropy.
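In a framework such as PyTorch, this loss might be sketched as follows. The tube-IoU computation is left abstract, and binary cross-entropy is used here since A′ is a 0/1 table; both choices are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def training_loss(tube_iou, pred_conf, gt_conf, eps=1e-6):
    """tube_iou: IoU(T, O) per lattice position, values in (0, 1];
    pred_conf (A, in [0, 1]) and gt_conf (A', 0/1) share the lattice shape."""
    l1 = -torch.log(tube_iou.clamp(min=eps)).mean()   # L1 = -ln(IoU(T, O))
    l2 = F.binary_cross_entropy(pred_conf, gt_conf)   # L2 = CrossEntropy(A, A')
    return l1 + l2                                    # L = L1 + L2
```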
1205. Optimize the neural network model with an optimizer according to the training loss.
According to the training loss L obtained in step 1204, the parameters are updated by the optimizer to optimize the neural network model, finally yielding a neural network model that can be used to implement the target tracking method of the embodiments of this application.
There are many types of optimizers; optionally, the optimizer may use the BGD (batch gradient descent) algorithm, the SGD (stochastic gradient descent) algorithm, or the MBGD (mini-batch gradient descent) algorithm, among others.
Refer to Figure 15, a schematic diagram of another embodiment of a target tracking method in an embodiment of this application.
In this solution, the target tracking device can track moving targets in a video in real time.
Specifically:
1501. System initialization.
At the start of the method, the system of the target tracking device is initialized first, completing the preparations for starting the device.
1502. Obtain video content.
The video may be captured by the target tracking device in real time, or obtained through a communication network.
1503. Obtain the set of motion pipelines through neural network model computation.
The video obtained in step 1502 is input into the pre-trained neural network model, yielding the set of motion pipelines of the input video, including the motion pipelines of the target objects corresponding to each video frame.
1504. Connect the motion pipelines into tracking trajectories in sequence, based on a greedy algorithm.
The basic idea of the greedy algorithm is to proceed step by step from some initial solution of the problem, ensuring at each step, according to some optimization measure, that a locally optimal solution is obtained. It can be understood that the algorithm for connecting the motion pipelines can be replaced by other algorithms, which is not limited here.
1505. Output the tracking trajectories.
It should be noted that, for single-target tracking, the output is the tracking trajectory of one target object; for multi-target tracking, the tracking trajectory of each target object can be output. Specifically, the tracking trajectories can be rendered as bounding boxes in each video frame, superimposed on the original video, and displayed by the display module.
Considering that the video is captured in real time, the target tracking device continues to obtain newly captured video content and repeats steps 1502 to 1505 until the target tracking task ends, which is not described further here.
The target tracking method provided by this application has been introduced above; the target tracking device that implements it is introduced below. Refer to Figure 16, a schematic diagram of an embodiment of the target tracking device in an embodiment of this application.
One or more of the modules in Figure 16 can be implemented in software, hardware, firmware, or a combination thereof. The software or firmware includes, but is not limited to, computer program instructions or code, and can be executed by a hardware processor. The hardware includes, but is not limited to, various integrated circuits, such as a central processing unit (CPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), or an application-specific integrated circuit (ASIC).
The target tracking device includes:
an acquisition unit 1601, configured to acquire a first video, the first video including a target object;
the acquisition unit 1601 is further configured to input the first video into a pre-trained neural network model to acquire position information of the target object in at least two video frames and time information of the at least two video frames;
the acquisition unit 1601 is further configured to acquire, according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, a tracking trajectory of the target object in the first video, the tracking trajectory including the position information of the target object in the at least two video frames of the first video.
Optionally, the acquisition unit 1601 is specifically configured to acquire a motion pipeline of the target object, the motion pipeline indicating the time information and position information of the target object in at least two video frames of the first video, where the first video includes a first video frame and a second video frame. The motion pipeline corresponds to a quadrangular frustum in the space-time dimensions, the space-time dimensions including a time dimension and two spatial dimensions. The position of the first base face of the quadrangular frustum in the time dimension indicates the first time information of the first video frame, and the position of the second base face of the quadrangular frustum in the time dimension indicates the second time information of the second video frame; the position of the first base face in the two spatial dimensions indicates the first position information of the target object in the first video frame, and the position of the second base face in the two spatial dimensions indicates the second position information of the target object in the second video frame. The quadrangular frustum indicates the position information of the target object in all video frames between the first video frame and the second video frame of the first video.
Optionally, the acquisition unit 1601 is specifically configured to acquire a motion pipeline of the target object, the motion pipeline indicating the position information of the target object in at least three video frames and the time information of the at least three video frames, where the first video includes a first video frame, a second video frame, and a third video frame. The motion pipeline corresponds to a double quadrangular frustum in the space-time dimensions; the double quadrangular frustum includes a first quadrangular frustum and a second quadrangular frustum, the first quadrangular frustum includes a first base face and a second base face, the second quadrangular frustum includes the first base face and a third base face, and the first base face is the common base face of the first and second quadrangular frustums. The position of the first base face in the time dimension indicates the first time information of the first video frame, the position of the second base face in the time dimension indicates the second time information of the second video frame, and the position of the third base face in the time dimension indicates the third time information of the third video frame, the first video frame being located, in the temporal order of the first video, between the second video frame and the third video frame. The position of the first base face in the two spatial dimensions indicates the first position information of the target object in the first video frame, the position of the second base face in the two spatial dimensions indicates the second position information of the target object in the second video frame, and the position of the third base face in the two spatial dimensions indicates the third position information of the target object in the third video frame. The double quadrangular frustum indicates the position information of the target object in all video frames between the second video frame and the third video frame of the first video.
Optionally, the acquisition unit 1601 is specifically configured to acquire the tracking trajectory of the target object in the first video according to the motion pipeline.
Optionally, the tracking trajectory specifically includes a tracking trajectory of the target object formed by connecting at least two motion pipelines corresponding to quadrangular frustums in the space-time dimensions.
Optionally, the length of the motion pipeline is a preset value, the length of the motion pipeline indicating the number of video frames included in the at least two video frames.
Optionally, the acquisition unit 1601 is further configured to: acquire category information of the target object through the pre-trained neural network model; and acquire the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
Optionally, the acquisition unit 1601 is specifically configured to acquire the confidence of the motion pipeline through the pre-trained neural network model, the confidence of the motion pipeline being used to determine the category information of the target object corresponding to the motion pipeline.
Optionally, the device further includes a processing unit 1602, configured to prune the motion pipelines to obtain pruned motion pipelines, the pruned motion pipelines being used to acquire the tracking trajectory of the target object.
Optionally, the motion pipelines include a first motion pipeline and a second motion pipeline, and the processing unit 1602 is specifically configured to: if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, delete whichever of the first and second motion pipelines has the lower confidence, where the repetition rate between the first and second motion pipelines is the intersection-over-union between them, the first and second motion pipelines belong to the motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to a motion pipeline is a preset category.
Optionally, the processing unit 1602 is specifically configured to prune the motion pipelines according to a non-maximum suppression algorithm to obtain the pruned motion pipelines.
Optionally, the confidence of every motion pipeline among the pruned motion pipelines is greater than or equal to a second threshold.
Optionally, the acquisition unit 1601 is specifically configured to connect a third motion pipeline and a fourth motion pipeline that satisfy a preset condition, to acquire the tracking trajectory of the target object, where the preset condition includes one or more of the following: the intersection-over-union between the segments of the third and fourth motion pipelines that overlap in the time dimension is greater than or equal to a third threshold; the cosine of the angle between the motion direction of the third motion pipeline and the motion direction of the fourth motion pipeline is greater than or equal to a fourth threshold, the motion direction being a vector that indicates, according to a preset rule in the space-time dimensions, the position change of the target object in the motion pipeline; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to a fifth threshold, the distance including the Euclidean distance.
Optionally, the acquisition unit 1601 is specifically configured to: group the motion pipelines to obtain t groups of motion pipelines, where t is the total number of video frames in the first video, and the i-th group among the t groups includes all motion pipelines starting at the i-th video frame of the first video, i being greater than or equal to 1 and less than or equal to t; when i is 1, take the motion pipelines in the i-th group as initial tracking trajectories to obtain a tracking trajectory set; and, in the numbering order of the groups, connect the motion pipelines in the i-th group in turn with the tracking trajectories in the set, to acquire at least one tracking trajectory.
Optionally, the acquisition unit 1601 is specifically configured to: input a first video sample into the initial network model for training and acquire a target object loss; and update the weight parameters in the initial network model according to the target object loss, to acquire the pre-trained neural network model.
Optionally, the target object loss specifically includes the intersection-over-union between the ground-truth motion pipelines and the predicted motion pipelines, where the ground-truth motion pipelines are obtained by splitting the tracking trajectory of the target object in the first video sample, and the predicted motion pipelines are obtained by inputting the first video sample into the initial network model.
Optionally, the target object loss specifically includes the intersection-over-union between the ground-truth motion pipelines and the predicted motion pipelines, and the cross-entropy between the confidence of the ground-truth motion pipelines and the confidence of the predicted motion pipelines, where the ground-truth motion pipelines are obtained by splitting the tracking trajectory of the target object in the first video sample, the predicted motion pipelines are obtained by inputting the first video sample into the initial network model, the confidence of the ground-truth motion pipelines is the probability that the category of the target object corresponding to the ground-truth motion pipelines belongs to a preset target object category, and the confidence of the predicted motion pipelines is the probability that the category of the target object corresponding to the predicted motion pipelines belongs to a preset target object category.
Optionally, the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
Optionally, the processing unit 1602 is further configured to divide the first video into multiple video segments;
the acquisition unit 1601 is specifically configured to input the multiple video segments separately into the pre-trained neural network model to acquire the motion pipelines.
The target tracking device provided in the embodiments of this application has multiple implementation forms. Optionally, the target tracking device includes a video acquisition module, a target tracking module, and an output module, where the video acquisition module is configured to obtain a video including a moving target object, the target tracking module is configured to take the video as input and output the tracking trajectory of the target object through the target tracking method provided in the embodiments of this application, and the output module is configured to superimpose the tracking trajectory on the video and display it to the user.
In another possible implementation, refer to Figure 17, a schematic diagram of another embodiment of the target tracking device in an embodiment of this application. Here the target tracking device includes a video acquisition module and a target tracking module and can be understood as a front-end device; to implement the target tracking method, the front-end device and a back-end device process cooperatively.
As shown in Figure 17, the video acquisition module 1701 may be the video acquisition module of a surveillance camera, a video camera, a mobile phone, or a vehicle-mounted image sensor, among others, and is responsible for capturing video data as the input of the tracking algorithm.
The target tracking module 1702 may be a processing unit in a device such as a camera processor, a mobile phone processor, or a vehicle-mounted processing unit, and is configured to receive the video input as well as control information sent by the back-end device, the control information including, for example, the tracked target categories, the number of targets to track, precision control, and model hyperparameters. The target tracking method of the embodiments of this application is mainly deployed in this module; for details, refer to the introduction of the target tracking module 1702 in Figure 18.
The back-end device includes an output module and a control module.
As shown in Figure 17, the output module 1703 may be, for example, the display unit of a device such as a back-end monitor, a printer, or a hard disk, and is configured to display or store the tracking results.
The control module 1704 is configured to analyze the output results, receive user instructions, and send those instructions to the front-end target tracking module.
请参阅图18,为本申请实施例中目标跟踪装置的另一个实施例示意图。Please refer to FIG. 18, which is a schematic diagram of another embodiment of the target tracking device in the embodiment of the application.
该目标跟踪装置包括:视频预处理模块1801,预测模块1802和运动管道连接模块1803。The target tracking device includes: a video preprocessing module 1801, a prediction module 1802, and a motion pipeline connection module 1803.
其中,视频预处理模块1801,用于将输入的视频切分为合适的片段,并进行视频分辨率,色彩空间的调整与归一化等。Among them, the video preprocessing module 1801 is used to divide the input video into appropriate segments, and adjust and normalize the video resolution, color space, etc.
The prediction module 1802 extracts spatio-temporal features from the input video clips and performs prediction, outputting the target motion pipelines and the category information to which each motion pipeline belongs; in addition, it can predict the future position of a target motion pipeline. The prediction module 1802 includes two sub-modules:
Target category prediction module 18021: predicts the category to which the target belongs from the features output by a 3D convolutional neural network, for example as confidence values.
Motion pipeline prediction module 18022: predicts, from the features output by the 3D convolutional neural network, the position of the target's current motion pipeline, that is, the coordinates of the motion pipeline in the spatio-temporal dimensions.
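These two sub-modules can be sketched as two heads on a shared 3D-CNN trunk. The sketch below assumes a PyTorch-style model and a simple 10-number tube parameterization (two frame indices plus two boxes of four coordinates); the layer sizes are placeholders, not the actual network of this application.

```python
import torch
import torch.nn as nn

class TubePredictor(nn.Module):
    """Toy stand-in for modules 18021/18022: a shared 3D-CNN trunk with
    one classification head and one motion-pipeline regression head."""
    def __init__(self, num_classes: int = 2, tube_params: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.cls_head = nn.Linear(16, num_classes)   # 18021: per-class confidence
        self.tube_head = nn.Linear(16, tube_params)  # 18022: tube coordinates

    def forward(self, clip: torch.Tensor):
        feat = self.backbone(clip)  # clip: (N, 3, T, H, W)
        return self.cls_head(feat).softmax(-1), self.tube_head(feat)
```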
The motion pipeline connection module 1803 analyzes the motion pipelines output by the prediction module. If a target appears for the first time, it is initialized as a new tracking trajectory. The spatio-temporal feature similarity between motion pipelines and their spatial proximity serve as the connection features required for linking motion pipelines. Based on the motion pipelines and these connection features, the module links the motion pipelines into complete tracking trajectories by analyzing their spatial overlap and the similarity of their spatio-temporal features.
The connected tracking trajectories are output in a specified format, for example a video stream or a trajectory log.
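A greedy variant of this linking step might look like the following sketch. The tube representation (per-frame boxes plus start and end frames) and the overlap threshold are assumptions made for illustration, not the exact connection rule of module 1803.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_tubes(tracks, new_tubes, iou_thresh=0.5):
    """Append each new tube to the track whose tail box overlaps it most;
    otherwise start a new track (first appearance of a target)."""
    for tube in new_tubes:  # tube: {"start": int, "end": int, "boxes": [...]}
        best, best_iou = None, iou_thresh
        for track in tracks:
            iou = box_iou(track[-1]["boxes"][-1], tube["boxes"][0])
            if iou > best_iou:
                best, best_iou = track, iou
        if best is not None:
            best.append(tube)
        else:
            tracks.append([tube])  # initialize a new tracking trajectory
    return tracks
```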
Refer to FIG. 19, which is a schematic diagram of an embodiment of an electronic device in the embodiments of this application.
The electronic device 1900 may vary considerably in configuration and performance. It may include one or more processors 1901 and a memory 1902, where the memory 1902 stores programs or data.
The memory 1902 may be volatile or non-volatile storage. Optionally, the processor 1901 is one or more central processing units (CPUs); a CPU may be single-core or multi-core. The processor 1901 may communicate with the memory 1902 and execute a series of instructions from the memory 1902 on the electronic device 1900.
The electronic device 1900 further includes one or more wired or wireless network interfaces 1903, for example an Ethernet interface.
Optionally, although not shown in FIG. 19, the electronic device 1900 may further include one or more power supplies and one or more input/output interfaces. The input/output interfaces may be used to connect a display, a mouse, a keyboard, a touch-screen device, a sensor device, and the like. The input/output interfaces are optional components that may or may not be present; this is not limited here.
For the procedure executed by the processor 1901 of the electronic device 1900 in this embodiment, refer to the method procedures described in the foregoing method embodiments; details are not repeated here.
Refer to FIG. 20, which is a diagram of a chip hardware structure provided by an embodiment of this application.
An embodiment of this application provides a chip system that can be used to implement the target tracking method. Specifically, the convolutional-neural-network-based algorithms shown in FIG. 3 and FIG. 4 can be implemented in the NPU chip shown in FIG. 20.
The neural-network processing unit (NPU) 50 is mounted on the host CPU as a coprocessor, and the host CPU assigns tasks to it. The core of the NPU is the arithmetic circuit 503; the controller 504 controls the arithmetic circuit 503 to fetch matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 503 internally contains multiple processing engines (PEs). In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or other electronic circuitry capable of mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and caches it on each PE in the arithmetic circuit. It then fetches the matrix A data from the input memory 501 and performs the matrix operation with matrix B; partial or final results of the resulting matrix are stored in the accumulator 508.
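The dataflow described above, weight tiles cached on the PEs and partial sums collected in the accumulator, can be mimicked in a few lines of NumPy. The tile size is arbitrary, and the code models only the accumulation pattern, not the actual systolic-array hardware.

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 4) -> np.ndarray:
    """C = A @ B computed tile by tile, accumulating partial results
    the way accumulator 508 collects partial sums from the PE array."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=np.float64)  # plays the role of the accumulator
    for k0 in range(0, k, tile):
        B_tile = B[k0:k0 + tile]             # weight tile cached "on the PEs"
        C += A[:, k0:k0 + tile] @ B_tile     # partial result accumulated
    return C
```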
The unified memory 506 stores input data and output data. Weight data is moved into the weight memory 502 through the direct memory access controller (DMAC) 505; input data is likewise moved into the unified memory 506 through the DMAC.
BIU stands for bus interface unit, that is, the bus interface unit 510, which handles the interaction of the AXI bus with the DMAC and with the instruction fetch buffer 509.
The bus interface unit (BIU) 510 is used by the instruction fetch buffer 509 to fetch instructions from the external memory, and by the storage-unit access controller 505 to fetch the source data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to move input data from the external memory (DDR) into the unified memory 506, to move weight data into the weight memory 502, or to move input data into the input memory 501.
The vector calculation unit 507 may include multiple arithmetic processing units and, where needed, further processes the output of the arithmetic circuit: vector multiplication, vector addition, exponential and logarithmic operations, magnitude comparison, and so on. It is mainly used for the non-convolutional/fully-connected layer computations of a neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector calculation unit 507 can store processed output vectors to the unified memory 506. For example, the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 507 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 503, for example for use in subsequent layers of the neural network.
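Continuing the NumPy sketch above, the vector unit's role amounts to an elementwise pass over the accumulator output; the choice of ReLU here is an illustrative assumption.

```python
import numpy as np

def vector_postprocess(acc: np.ndarray) -> np.ndarray:
    """Mimic vector calculation unit 507: apply a nonlinearity to the
    accumulated values so they can feed the next layer as activations."""
    return np.maximum(acc, 0.0)  # assumed ReLU; unit 507 also supports pooling etc.

# Example usage: activations = vector_postprocess(tiled_matmul(A, B))
```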
The instruction fetch buffer 509 connected to the controller 504 stores instructions used by the controller 504.
The unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch buffer 509 are all on-chip memories. The external memory is private to this NPU hardware architecture.
The operations of the layers of the convolutional neural networks shown in FIG. 3 and FIG. 4 may be performed by the matrix calculation unit 212 or the vector calculation unit 507.
In the embodiments of this application, various examples are given for ease of understanding. However, these examples are merely examples and are not meant to be the best way of implementing this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function; in actual implementation there may be other ways of dividing, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing embodiments are merely intended to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some technical features thereof; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (41)

  1. A target tracking method, comprising:
    acquiring a first video, wherein the first video includes a target object;
    inputting the first video into a pre-trained neural network model to acquire position information of the target object in at least two video frames and time information of the at least two video frames; and
    acquiring a tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, wherein the tracking trajectory includes the position information of the target object in the at least two video frames of the first video.
  2. The method according to claim 1, wherein acquiring the position information of the target object in at least two video frames and the time information of the at least two video frames specifically comprises:
    acquiring a motion pipeline of the target object, wherein the motion pipeline is used to indicate time information and position information of the target object in at least two video frames of the first video, wherein
    the first video includes a first video frame and a second video frame;
    the motion pipeline corresponds to a quadrangular frustum in the spatio-temporal dimensions, the spatio-temporal dimensions comprising a time dimension and a two-dimensional spatial dimension; the position of a first base of the quadrangular frustum in the time dimension indicates first time information of the first video frame; the position of a second base of the quadrangular frustum in the time dimension indicates second time information of the second video frame; the position of the first base of the quadrangular frustum in the two-dimensional spatial dimension indicates first position information of the target object in the first video frame; and the position of the second base of the quadrangular frustum in the two-dimensional spatial dimension indicates second position information of the target object in the second video frame; and
    the quadrangular frustum is used to indicate position information of the target object in all video frames of the first video between the first video frame and the second video frame.
  3. The method according to claim 1, wherein acquiring the position information of the target object in at least two video frames and the time information of the at least two video frames specifically comprises:
    acquiring a motion pipeline of the target object, wherein the motion pipeline is used to indicate position information of the target object in at least three video frames and time information of the at least three video frames, wherein
    the first video includes a first video frame, a second video frame, and a third video frame;
    the motion pipeline corresponds to a double quadrangular frustum in the spatio-temporal dimensions; the double quadrangular frustum includes a first quadrangular frustum and a second quadrangular frustum; the first quadrangular frustum includes a first base and a second base; the second quadrangular frustum includes the first base and a third base; the first base is the common base of the first quadrangular frustum and the second quadrangular frustum; the position of the first base in the time dimension indicates first time information of the first video frame; the position of the second base in the time dimension indicates second time information of the second video frame; the position of the third base in the time dimension indicates third time information of the third video frame; in the temporal order of the first video, the first video frame lies between the second video frame and the third video frame; the position of the first base in the two-dimensional spatial dimension indicates first position information of the target object in the first video frame; the position of the second base in the two-dimensional spatial dimension indicates second position information of the target object in the second video frame; and the position of the third base in the two-dimensional spatial dimension indicates third position information of the target object in the third video frame; and
    the double quadrangular frustum is used to indicate position information of the target object in all video frames of the first video between the second video frame and the third video frame.
  4. The method according to claim 2 or 3, wherein acquiring the tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames specifically comprises:
    acquiring the tracking trajectory of the target object in the first video according to the motion pipeline.
  5. The method according to any one of claims 2 to 4, wherein the tracking trajectory specifically comprises:
    a tracking trajectory of the target object formed by connecting the quadrangular frustums corresponding to at least two of the motion pipelines in the spatio-temporal dimensions.
  6. The method according to any one of claims 2 to 5, wherein
    the length of the motion pipeline is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
  7. The method according to any one of claims 1 to 6, wherein
    the method further comprises:
    acquiring category information of the target object through the pre-trained neural network model; and
    acquiring the tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames comprises:
    acquiring the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
  8. The method according to claim 7, wherein acquiring, through the pre-trained neural network model, the category information of the target object corresponding to the motion pipeline specifically comprises:
    acquiring a confidence of the motion pipeline through the pre-trained neural network model, wherein the confidence of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
  9. The method according to any one of claims 1 to 8, wherein,
    before acquiring the tracking trajectory of the target object according to the motion pipeline, the method further comprises:
    pruning the motion pipelines to acquire pruned motion pipelines, wherein the pruned motion pipelines are used to acquire the tracking trajectory of the target object.
  10. The method according to claim 9, wherein
    pruning the motion pipelines to acquire the pruned motion pipelines specifically comprises:
    the motion pipelines include a first motion pipeline and a second motion pipeline;
    if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, deleting whichever of the first motion pipeline and the second motion pipeline has the lower confidence, wherein the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union between the first motion pipeline and the second motion pipeline, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to a motion pipeline is a preset category.
  11. The method according to claim 9, wherein
    pruning the motion pipelines to acquire the pruned motion pipelines specifically comprises:
    pruning the motion pipelines according to a non-maximum suppression algorithm to acquire the pruned motion pipelines.
  12. The method according to claim 9, wherein
    the confidence of any one of the pruned motion pipelines is greater than or equal to a second threshold.
  13. The method according to any one of claims 2 to 12, wherein
    acquiring the tracking trajectory of the target object according to the motion pipeline specifically comprises:
    connecting a third motion pipeline and a fourth motion pipeline that satisfy a preset condition among the motion pipelines, to acquire the tracking trajectory of the target object;
    wherein the preset condition includes one or more of the following:
    an intersection-over-union between the temporally overlapping sections of the third motion pipeline and the fourth motion pipeline is greater than or equal to a third threshold;
    the cosine of the angle between a movement direction of the third motion pipeline and a movement direction of the fourth motion pipeline is greater than or equal to a fourth threshold, wherein a movement direction is a vector that indicates, according to a preset rule, the position change of the target object in a motion pipeline in the spatio-temporal dimensions; and a distance between neural-network feature vectors of the motion pipelines is less than or equal to a fifth threshold, wherein the distance includes a Euclidean distance.
  14. The method according to any one of claims 2 to 12, wherein
    acquiring the tracking trajectory of the target object according to the motion pipeline specifically comprises:
    grouping the motion pipelines to acquire t motion pipeline groups, wherein t is the total number of video frames in the first video, the i-th motion pipeline group among the t motion pipeline groups includes all motion pipelines starting at the i-th video frame of the first video, and i is greater than or equal to 1 and less than or equal to t;
    when i is 1, taking the motion pipelines in the i-th motion pipeline group as initial tracking trajectories to obtain a tracking trajectory set; and
    connecting, in the numbered order of the motion pipeline groups, the motion pipelines in the i-th motion pipeline group with the tracking trajectories in the tracking trajectory set, to acquire at least one tracking trajectory.
  15. The method according to any one of claims 1 to 14, wherein
    the pre-trained neural network model is obtained by training an initial network model, and the method further comprises:
    inputting a first video sample into the initial network model for training, to acquire a target object loss; and
    updating weight parameters in the initial network model according to the target object loss, to acquire the pre-trained neural network model.
  16. The method according to claim 15, wherein the target object loss specifically comprises:
    an intersection-over-union between a ground-truth motion pipeline and a predicted motion pipeline, wherein the ground-truth motion pipeline is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the predicted motion pipeline is a motion pipeline obtained by inputting the first video sample into the initial network model.
  17. The method according to claim 15, wherein the target object loss specifically comprises:
    an intersection-over-union between a ground-truth motion pipeline and a predicted motion pipeline, and a cross-entropy between the confidence of the ground-truth motion pipeline and the confidence of the predicted motion pipeline, wherein the ground-truth motion pipeline is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, the predicted motion pipeline is a motion pipeline obtained by inputting the first video sample into the initial network model, the confidence of the ground-truth motion pipeline is the probability that the category of the target object corresponding to the ground-truth motion pipeline belongs to a preset target object category, and the confidence of the predicted motion pipeline is the probability that the category of the target object corresponding to the predicted motion pipeline belongs to the preset target object category.
  18. The method according to any one of claims 15 to 17, wherein
    the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
  19. The method according to any one of claims 1 to 18, wherein
    inputting the first video into the pre-trained neural network model to acquire the motion pipeline of the target object specifically comprises:
    dividing the first video into multiple video clips; and
    inputting the multiple video clips respectively into the pre-trained neural network model to acquire the motion pipelines.
  20. A target tracking device, comprising:
    an acquiring unit, configured to acquire a first video, wherein the first video includes a target object;
    the acquiring unit being further configured to input the first video into a pre-trained neural network model to acquire position information of the target object in at least two video frames and time information of the at least two video frames; and
    the acquiring unit being further configured to acquire a tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, wherein the tracking trajectory includes the position information of the target object in the at least two video frames of the first video.
  21. The device according to claim 20, wherein the acquiring unit is specifically configured to:
    acquire a motion pipeline of the target object, wherein the motion pipeline is used to indicate time information and position information of the target object in at least two video frames of the first video, wherein
    the first video includes a first video frame and a second video frame;
    the motion pipeline corresponds to a quadrangular frustum in the spatio-temporal dimensions, the spatio-temporal dimensions comprising a time dimension and a two-dimensional spatial dimension; the position of a first base of the quadrangular frustum in the time dimension indicates first time information of the first video frame; the position of a second base of the quadrangular frustum in the time dimension indicates second time information of the second video frame; the position of the first base of the quadrangular frustum in the two-dimensional spatial dimension indicates first position information of the target object in the first video frame; and the position of the second base of the quadrangular frustum in the two-dimensional spatial dimension indicates second position information of the target object in the second video frame; and
    the quadrangular frustum is used to indicate position information of the target object in all video frames of the first video between the first video frame and the second video frame.
  22. The device according to claim 20, wherein the acquiring unit is specifically configured to:
    acquire a motion pipeline of the target object, wherein the motion pipeline is used to indicate position information of the target object in at least three video frames and time information of the at least three video frames, wherein
    the first video includes a first video frame, a second video frame, and a third video frame;
    the motion pipeline corresponds to a double quadrangular frustum in the spatio-temporal dimensions; the double quadrangular frustum includes a first quadrangular frustum and a second quadrangular frustum; the first quadrangular frustum includes a first base and a second base; the second quadrangular frustum includes the first base and a third base; the first base is the common base of the first quadrangular frustum and the second quadrangular frustum; the position of the first base in the time dimension indicates first time information of the first video frame; the position of the second base in the time dimension indicates second time information of the second video frame; the position of the third base in the time dimension indicates third time information of the third video frame; in the temporal order of the first video, the first video frame lies between the second video frame and the third video frame; the position of the first base in the two-dimensional spatial dimension indicates first position information of the target object in the first video frame; the position of the second base in the two-dimensional spatial dimension indicates second position information of the target object in the second video frame; and the position of the third base in the two-dimensional spatial dimension indicates third position information of the target object in the third video frame; and
    the double quadrangular frustum is used to indicate position information of the target object in all video frames of the first video between the second video frame and the third video frame.
  23. The device according to claim 21 or 22, wherein the acquiring unit is specifically configured to:
    acquire the tracking trajectory of the target object in the first video according to the motion pipeline.
  24. The device according to any one of claims 21 to 23, wherein the tracking trajectory specifically comprises:
    a tracking trajectory of the target object formed by connecting the quadrangular frustums corresponding to at least two of the motion pipelines in the spatio-temporal dimensions.
  25. The device according to any one of claims 21 to 24, wherein
    the length of the motion pipeline is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
  26. The device according to any one of claims 20 to 25, wherein the acquiring unit is further configured to:
    acquire category information of the target object through the pre-trained neural network model; and
    acquire the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
  27. The device according to claim 26, wherein the acquiring unit is specifically configured to:
    acquire a confidence of the motion pipeline through the pre-trained neural network model, wherein the confidence of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
  28. The device according to any one of claims 20 to 27, wherein the device further comprises:
    a processing unit, configured to prune the motion pipelines to acquire pruned motion pipelines, wherein the pruned motion pipelines are used to acquire the tracking trajectory of the target object.
  29. The device according to claim 28, wherein the motion pipelines include a first motion pipeline and a second motion pipeline;
    and the processing unit is specifically configured to:
    if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, delete whichever of the first motion pipeline and the second motion pipeline has the lower confidence, wherein the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union between the first motion pipeline and the second motion pipeline, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to a motion pipeline is a preset category.
  30. The device according to claim 28, wherein the processing unit is specifically configured to:
    prune the motion pipelines according to a non-maximum suppression algorithm to acquire the pruned motion pipelines.
  31. The device according to claim 28, wherein
    the confidence of any one of the pruned motion pipelines is greater than or equal to a second threshold.
  32. The device according to any one of claims 21 to 31, wherein the acquiring unit is specifically configured to:
    connect a third motion pipeline and a fourth motion pipeline that satisfy a preset condition among the motion pipelines, to acquire the tracking trajectory of the target object;
    wherein the preset condition includes one or more of the following:
    an intersection-over-union between the temporally overlapping sections of the third motion pipeline and the fourth motion pipeline is greater than or equal to a third threshold;
    the cosine of the angle between a movement direction of the third motion pipeline and a movement direction of the fourth motion pipeline is greater than or equal to a fourth threshold, wherein a movement direction is a vector that indicates, according to a preset rule, the position change of the target object in a motion pipeline in the spatio-temporal dimensions; and a distance between neural-network feature vectors of the motion pipelines is less than or equal to a fifth threshold, wherein the distance includes a Euclidean distance.
  33. The device according to any one of claims 21 to 31, wherein the acquiring unit is specifically configured to:
    group the motion pipelines to acquire t motion pipeline groups, wherein t is the total number of video frames in the first video, the i-th motion pipeline group among the t motion pipeline groups includes all motion pipelines starting at the i-th video frame of the first video, and i is greater than or equal to 1 and less than or equal to t;
    when i is 1, take the motion pipelines in the i-th motion pipeline group as initial tracking trajectories to obtain a tracking trajectory set; and
    connect, in the numbered order of the motion pipeline groups, the motion pipelines in the i-th motion pipeline group with the tracking trajectories in the tracking trajectory set, to acquire at least one tracking trajectory.
  34. The device according to any one of claims 20 to 33, wherein the acquiring unit is specifically configured to:
    input a first video sample into the initial network model for training, to acquire a target object loss; and
    update weight parameters in the initial network model according to the target object loss, to acquire the pre-trained neural network model.
  35. The device according to claim 34, wherein the target object loss specifically comprises:
    an intersection-over-union between a ground-truth motion pipeline and a predicted motion pipeline, wherein the ground-truth motion pipeline is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the predicted motion pipeline is a motion pipeline obtained by inputting the first video sample into the initial network model.
  36. The device according to claim 34, wherein the target object loss specifically comprises:
    an intersection-over-union between a ground-truth motion pipeline and a predicted motion pipeline, and a cross-entropy between the confidence of the ground-truth motion pipeline and the confidence of the predicted motion pipeline, wherein the ground-truth motion pipeline is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, the predicted motion pipeline is a motion pipeline obtained by inputting the first video sample into the initial network model, the confidence of the ground-truth motion pipeline is the probability that the category of the target object corresponding to the ground-truth motion pipeline belongs to a preset target object category, and the confidence of the predicted motion pipeline is the probability that the category of the target object corresponding to the predicted motion pipeline belongs to the preset target object category.
  37. The device according to any one of claims 34 to 36, wherein
    the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
  38. The device according to any one of claims 20 to 37, wherein the processing unit is further configured to:
    divide the first video into multiple video clips;
    and the acquiring unit is specifically configured to:
    input the multiple video clips respectively into the pre-trained neural network model to acquire the motion pipelines.
  39. An electronic device, comprising a processor and a memory connected to each other, wherein the memory is configured to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 1 to 19.
  40. A computer program product containing instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 19.
  41. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 19.
PCT/CN2021/093852 2020-06-09 2021-05-14 Target tracking method and target tracking device WO2021249114A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010519876.2A CN113781519A (en) 2020-06-09 2020-06-09 Target tracking method and target tracking device
CN202010519876.2 2020-06-09

Publications (1)

Publication Number Publication Date
WO2021249114A1 true WO2021249114A1 (en) 2021-12-16

Family

ID=78834470

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/093852 WO2021249114A1 (en) 2020-06-09 2021-05-14 Target tracking method and target tracking device

Country Status (2)

Country Link
CN (1) CN113781519A (en)
WO (1) WO2021249114A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116504068A (en) * 2023-06-26 2023-07-28 创辉达设计股份有限公司江苏分公司 Statistical method, device, computer equipment and storage medium for lane-level traffic flow

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11625909B1 (en) * 2022-05-04 2023-04-11 Motional Ad Llc Track segment cleaning of tracked objects
CN114972814B (en) * 2022-07-11 2022-10-28 浙江大华技术股份有限公司 Target matching method, device and storage medium
CN115451962B (en) * 2022-08-09 2024-04-30 中国人民解放军63629部队 Target tracking strategy planning method based on five-variable Carnot diagram

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060262184A1 (en) * 2004-11-05 2006-11-23 Yissum Research Development Company Of The Hebrew University Of Jerusalem Method and system for spatio-temporal video warping
US20090219300A1 (en) * 2005-11-15 2009-09-03 Yissum Research Deveopment Company Of The Hebrew University Of Jerusalem Method and system for producing a video synopsis
CN101702233A (en) * 2009-10-16 2010-05-05 电子科技大学 Three-dimension locating method based on three-point collineation marker in video frame
US20160148392A1 (en) * 2014-11-21 2016-05-26 Thomson Licensing Method and apparatus for tracking the motion of image content in a video frames sequence using sub-pixel resolution motion estimation
CN106169187A (en) * 2015-05-18 2016-11-30 汤姆逊许可公司 For the method and apparatus that the object in video is set boundary
CN108182696A (en) * 2018-01-23 2018-06-19 四川精工伟达智能技术股份有限公司 Image processing method, device and Multi-target position tracking system
CN108509830A (en) * 2017-02-28 2018-09-07 华为技术有限公司 A kind of video data handling procedure and equipment
CN110188719A (en) * 2019-06-04 2019-08-30 北京字节跳动网络技术有限公司 Method for tracking target and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8345984B2 (en) * 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
CN106897714B (en) * 2017-03-23 2020-01-14 北京大学深圳研究生院 Video motion detection method based on convolutional neural network
CN107492113B (en) * 2017-06-01 2019-11-05 南京行者易智能交通科技有限公司 A kind of moving object in video sequences position prediction model training method, position predicting method and trajectory predictions method
CN110032926B (en) * 2019-02-22 2021-05-11 哈尔滨工业大学(深圳) Video classification method and device based on deep learning
CN110188637A (en) * 2019-05-17 2019-08-30 西安电子科技大学 A kind of Activity recognition technical method based on deep learning


Also Published As

Publication number Publication date
CN113781519A (en) 2021-12-10

Similar Documents

Publication Publication Date Title
WO2021249114A1 (en) Target tracking method and target tracking device
WO2021017606A1 (en) Video processing method and apparatus, and electronic device and storage medium
CN109559320B (en) Method and system for realizing visual SLAM semantic mapping function based on hole convolution deep neural network
WO2020192736A1 (en) Object recognition method and device
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
WO2021227726A1 (en) Methods and apparatuses for training face detection and image detection neural networks, and device
CN112990211B (en) Training method, image processing method and device for neural network
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN110222717B (en) Image processing method and device
WO2021147325A1 (en) Object detection method and apparatus, and storage medium
WO2023082882A1 (en) Pose estimation-based pedestrian fall action recognition method and device
CN112419368A (en) Method, device and equipment for tracking track of moving target and storage medium
WO2022179581A1 (en) Image processing method and related device
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN110222718B (en) Image processing method and device
JP7439153B2 (en) Lifted semantic graph embedding for omnidirectional location recognition
WO2021103731A1 (en) Semantic segmentation method, and model training method and apparatus
WO2021218238A1 (en) Image processing method and image processing apparatus
CN113011562A (en) Model training method and device
KR102143034B1 (en) Method and system for tracking object in video through prediction of future motion of object
WO2022052782A1 (en) Image processing method and related device
Chakravarthy et al. Dronesegnet: robust aerial semantic segmentation for UAV-based IoT applications
Shi et al. An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds
CN116194951A (en) Method and apparatus for stereoscopic based 3D object detection and segmentation
WO2023093086A1 (en) Target tracking method and apparatus, training method and apparatus for model related thereto, and device, medium and computer program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21821941

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21821941

Country of ref document: EP

Kind code of ref document: A1