CN113781519A - Target tracking method and target tracking device - Google Patents

Target tracking method and target tracking device

Info

Publication number: CN113781519A
Application number: CN202010519876.2A
Authority: CN (China)
Prior art keywords: motion, video, target object, pipeline, motion pipeline
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 庞博, 卢策吾, 袁伟, 胡翔宇
Current assignee: Huawei Technologies Co Ltd
Original assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority application: CN202010519876.2A
Related application: PCT/CN2021/093852 (published as WO2021249114A1)

Classifications

    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06N 3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • H04N 7/00: Television systems
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/10024: Color image
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/30241: Trajectory

Abstract

The embodiment of the application relates to computer image processing technology in the field of artificial intelligence, and discloses a target tracking method which is applied to target tracking in a video and can reduce tracking errors caused by target occlusion. The method comprises the following steps: inputting a video in which a target object is captured into a pre-trained neural network model, acquiring motion pipelines of the target object, and connecting the motion pipelines to acquire a tracking track of the target object, wherein the tracking track comprises position information of the target object in each video frame of the first video.

Description

Target tracking method and target tracking device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a target tracking method and a target tracking apparatus.
Background
Object tracking is one of the most important and fundamental tasks in the field of computer vision. Its purpose is to output, from a video containing a target object, the position of the target object in each video frame of the video. Generally, a video and the category of the target object to be tracked are input into a computer, and the computer outputs the identification (ID) of the target object and the position information of the target object in each frame of the video in the form of a detection box.
An existing multi-target tracking method comprises two parts, detection and tracking: a plurality of target objects appearing in each video frame are detected by a detection module, and the target objects appearing in different video frames are then matched. In the matching process, the features of each target object are extracted from a single video frame, target matching is realized through similarity comparison of the features, and the tracking track of each target object is obtained.
Because the existing target tracking algorithm adopts a detect-then-track approach, the target tracking effect depends on the single-frame detection algorithm. If the target object is occluded during target detection, a detection error is generated, which in turn causes a tracking error, so that performance is insufficient in scenes with dense target objects or many occlusions.
Disclosure of Invention
The embodiment of the application provides a target tracking method, which is used for tracking a target in a video and can reduce tracking errors caused by target occlusion.
A first aspect of an embodiment of the present application provides a target tracking method, including: acquiring a first video, wherein the first video comprises a target object; inputting the first video into a pre-trained neural network model, and acquiring position information of the target object in at least two video frames and time information of the at least two video frames; and acquiring a tracking track of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames, wherein the tracking track comprises the position information of the target object in the at least two video frames in the first video.
According to the method, the position information of the target object in at least two video frames and the time information of the at least two video frames are obtained through the pre-trained neural network model, target tracking does not depend on a target detection result of a single video frame, the problem of detection failure in a scene with dense targets or more shelters can be reduced, and target tracking performance is improved.
In a possible implementation manner of the first aspect, the acquiring the position information of the target object in the at least two video frames and the time information of the at least two video frames specifically includes: acquiring a motion pipeline of the target object, wherein the motion pipeline is used for indicating time information and position information of the target object in at least two video frames of the first video, and the first video comprises a first video frame and a second video frame; the motion pipe corresponds to a quadrangular frustum in a spatio-temporal dimension, the spatio-temporal dimension comprising a temporal dimension and a two-dimensional spatial dimension, a position of a first bottom surface of the quadrangular frustum in the temporal dimension being used for indicating first time information of the first video frame, a position of a second bottom surface of the quadrangular frustum in the temporal dimension being used for indicating second time information of the second video frame, a position of the first bottom surface of the quadrangular frustum in the two-dimensional spatial dimension being used for indicating first position information of the target object in the first video frame, and a position of the second bottom surface of the quadrangular frustum in the two-dimensional spatial dimension being used for indicating second position information of the target object in the second video frame; the quadrangular frustum is used to indicate position information of the target object in all video frames between the first video frame and the second video frame of the first video.
The method comprises the steps that a motion pipeline of each video frame is obtained through a pre-trained neural network model, the motion pipeline comprises position information of a target object in at least two video frames, and the position of a target in the video frames can be determined through time on a time dimension and the position on a two-dimensional space dimension in a space-time dimension, wherein the time is used for determining the video frames, and the position on the two-dimensional space dimension is used for indicating the position information of the target in the video frames. The method can enable the motion pipeline to correspond to the quadrangular frustum in the space-time dimension, and the position information of the target in at least two video frames can be visually displayed through the quadrangular frustum in the space-time dimension. The target tracking method does not depend on the target detection result of a single video frame, can reduce the problem of detection failure in the scene with dense targets or more shelters, and improves the target tracking performance.
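For illustration, the following is a minimal sketch (in Python) of how such a motion pipeline could be represented programmatically: two bottom surfaces anchored at two frame indices, with the box at any intermediate frame obtained by linear interpolation along the time axis. The box format (x1, y1, x2, y2) and the interpolation rule are assumptions of the sketch, not details fixed by this application; a double-quadrangular-frustum pipeline can be built from two such segments sharing a common bottom surface.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed box format

@dataclass
class MotionTube:
    t_start: int       # frame index of the first bottom surface
    t_end: int         # frame index of the second bottom surface
    box_start: Box     # target position in frame t_start
    box_end: Box       # target position in frame t_end
    confidence: float = 1.0

    def box_at(self, t: int) -> Box:
        """Box for any frame between the two bottom surfaces (linear interpolation)."""
        if not (self.t_start <= t <= self.t_end):
            raise ValueError("frame index outside the motion pipeline")
        if self.t_end == self.t_start:
            return self.box_start
        a = (t - self.t_start) / (self.t_end - self.t_start)
        return tuple((1 - a) * s + a * e for s, e in zip(self.box_start, self.box_end))
```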
In a possible implementation manner of the first aspect, the acquiring the position information of the target object in the at least two video frames and the time information of the at least two video frames specifically includes: acquiring a motion pipeline of the target object, wherein the motion pipeline is used for indicating position information of the target object in at least three video frames and time information of the at least three video frames, and the first video comprises a first video frame, a second video frame and a third video frame; the motion pipeline corresponds to a double quadrangular frustum in a spatio-temporal dimension, the double quadrangular frustum comprising a first quadrangular frustum and a second quadrangular frustum, the first quadrangular frustum comprising a first bottom surface and a second bottom surface, the second quadrangular frustum comprising the first bottom surface and a third bottom surface, the first bottom surface being a common bottom surface of the first quadrangular frustum and the second quadrangular frustum, a position of the first bottom surface in the temporal dimension being used for indicating first time information of the first video frame, a position of the second bottom surface in the temporal dimension being used for indicating second time information of the second video frame, a position of the third bottom surface in the temporal dimension being used for indicating third time information of the third video frame, the first video frame being located, in the temporal order of the first video, between the second video frame and the third video frame, a position of the first bottom surface in the two-dimensional spatial dimension being used for indicating first position information of the target object in the first video frame, a position of the second bottom surface in the two-dimensional spatial dimension being used for indicating second position information of the target object in the second video frame, and a position of the third bottom surface in the two-dimensional spatial dimension being used for indicating third position information of the target object in the third video frame; the double quadrangular frustum is used to indicate position information of the target object in all video frames between the second video frame and the third video frame of the first video.
In this method, the motion pipeline comprises position information of the target object in at least three video frames. Specifically, the at least three video frames comprise a second video frame which is earlier, in the time sequence of the video, than the video frame corresponding to the motion pipeline and a third video frame which is later in the time sequence of the video, so that the receptive field in the time dimension is expanded, and the target tracking performance can be further improved. The motion pipeline corresponds to a double quadrangular frustum in the space-time dimension, and the position information of the target in at least three video frames can be visually displayed through the double quadrangular frustum in the space-time dimension. In particular, the position information of the target in all video frames between the two non-common bottom surfaces of the motion pipeline is also included. Considering the continuity of target motion, in the space-time dimension the structure of the real tracking track of the target object is generally nonlinear; the motion pipeline with the double-quadrangular-frustum structure can express two directions of the target motion, and can better fit the real tracking track in scenes where the motion direction of the target object changes.
In a possible implementation manner of the first aspect, the obtaining a tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames specifically includes: and acquiring the tracking track of the target object in the first video according to the motion pipeline.
The tracking track of the target object in the first video is obtained according to the motion pipeline, so that the problem of detection failure in a scene with dense targets or more shelters can be solved, and the target tracking performance is improved.
In a possible implementation manner of the first aspect, the tracking trajectory specifically includes: and connecting the at least two motion pipelines corresponding to the quadrangular frustum in the space-time dimension to form the tracking track of the target object.
The tracking track of the target object is obtained by connecting the motion pipeline, the target detection result of a single video frame can be independent, the problem of detection failure in a scene with dense targets or more shelters can be reduced, and the target tracking performance is improved.
In a possible implementation manner of the first aspect, the length of the motion pipeline is a preset value, where the length of the motion pipeline indicates the number of video frames included in the at least two video frames; optionally, the length of the motion pipeline is, for example, 4, 6, or 8.
In the method, the length of the moving pipeline can be a preset value. The method can reduce the calculated amount of the neural network model and reduce the time consumed by target tracking compared with a method of not setting the length of the motion pipeline.
In a possible implementation manner of the first aspect, the method further includes: acquiring the category information of the target object through the pre-trained neural network model; the acquiring the tracking track of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames comprises: and acquiring the tracking track of the target object in the first video according to the category information of the target object, the position information of the target object in at least two video frames and the time information of the at least two video frames.
For a multi-target tracking scene, if a target to be tracked comprises a plurality of categories, the method can determine the category information of the target object corresponding to the motion pipeline through the pre-trained neural network model, and acquire the tracking track of the target object according to the category information, the position information and the time information.
In a possible implementation manner of the first aspect, the obtaining, by the pre-trained neural network model, the category information of the target object corresponding to the motion pipe specifically includes: and acquiring the confidence coefficient of the motion pipeline through the pre-trained neural network model, wherein the confidence coefficient of the motion pipeline is used for determining the category information of the target object corresponding to the motion pipeline.
For a scene tracked by a single target, the method can distinguish whether the motion pipeline is a real motion pipeline indicating the position of the target through the confidence coefficient, and in addition, for the scene tracked by multiple targets, if the target to be tracked comprises multiple categories, the method can distinguish the category of the target object corresponding to the motion pipeline through the confidence coefficient of the motion pipeline.
In a possible implementation manner of the first aspect, before the acquiring, according to the motion pipeline, the tracking trajectory of the target object, the method further includes: and deleting the motion pipeline to obtain the deleted motion pipeline, wherein the deleted motion pipeline is used for obtaining the tracking track of the target object.
The method can delete the motion pipeline of the video frame, delete the repeated motion pipeline or the motion pipeline with lower confidence coefficient, and reduce the calculation amount in the motion pipeline connection step.
In a possible implementation manner of the first aspect, the deleting the motion pipeline and the obtaining the deleted motion pipeline specifically include: the motion pipeline comprises a first motion pipeline and a second motion pipeline; if the repetition rate between the first motion pipeline and the second motion pipeline is larger than or equal to a first threshold value, deleting the motion pipeline with the lower confidence coefficient from the first motion pipeline and the second motion pipeline, wherein the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union between the first motion pipeline and the second motion pipeline, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence coefficient indicates the probability that the class of the target object corresponding to the motion pipeline is a preset class.
The method introduces a specific method for deleting the motion pipeline, the motion pipeline with the repetition rate larger than or equal to a first threshold value can be regarded as repeated data, deletion is carried out on the motion pipeline with lower confidence coefficient, the motion pipeline with higher confidence coefficient is reserved for pipeline connection, and the calculation amount in the motion pipeline connection step can be reduced.
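As a hedged illustration of this pruning rule, the sketch below reuses the MotionTube sketch above and treats the repetition rate between two pipelines as the mean per-frame box IoU over their temporally overlapping frames; this particular tube-IoU definition, the box format and the threshold value are assumptions, not the application's fixed choices.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def tube_iou(t1, t2):
    """Repetition rate between two motion pipelines: mean box IoU over overlapping frames."""
    lo, hi = max(t1.t_start, t2.t_start), min(t1.t_end, t2.t_end)
    if lo > hi:
        return 0.0
    frames = range(lo, hi + 1)
    return sum(box_iou(t1.box_at(f), t2.box_at(f)) for f in frames) / len(frames)

def keep_or_drop(t1, t2, first_threshold=0.7):
    """If two pipelines are near-duplicates, keep only the one with higher confidence."""
    if tube_iou(t1, t2) >= first_threshold:
        return [t1 if t1.confidence >= t2.confidence else t2]
    return [t1, t2]
```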
In a possible implementation manner of the first aspect, the deleting the motion pipeline, and the obtaining the deleted motion pipeline specifically includes: and deleting the motion pipeline according to a non-maximum suppression algorithm to obtain the deleted motion pipeline.
The method can also delete repeated motion pipelines according to a non-maximum suppression algorithm and retain, for each target, a motion pipeline with higher confidence coefficient, thereby reducing the calculation amount of the pipeline connection step and improving the target tracking efficiency.
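A greedy non-maximum suppression pass over motion pipelines could then look like the sketch below, reusing tube_iou() from the previous sketch; the thresholds are placeholders, and discarding pipelines whose confidence is below the second threshold anticipates the next paragraph.

```python
def nms_tubes(tubes, first_threshold=0.7, second_threshold=0.3):
    """Keep, per target region, the highest-confidence pipeline; drop near-duplicates."""
    candidates = [t for t in tubes if t.confidence >= second_threshold]
    candidates.sort(key=lambda t: t.confidence, reverse=True)
    kept = []
    for tube in candidates:
        if all(tube_iou(tube, k) < first_threshold for k in kept):
            kept.append(tube)
    return kept
```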
In a possible implementation manner of the first aspect, the confidence of any one of the pruned motion pipes is greater than or equal to a second threshold.
When the motion pipeline is deleted, the motion pipeline with lower confidence coefficient can be discarded, and the motion pipeline with the confidence coefficient lower than the second threshold value can be understood as a non-real motion pipeline, for example, a motion pipeline corresponding to a background.
In a possible implementation manner of the first aspect, the acquiring a tracking trajectory of the target object according to the motion pipeline specifically includes: connecting a third motion pipeline and a fourth motion pipeline which meet preset conditions in the motion pipelines to obtain the tracking track of the target object; the preset conditions include one or more of: the intersection-over-union between the cross-sections of the third motion pipeline and the fourth motion pipeline in the part where they overlap in the time dimension is greater than or equal to a third threshold; the cosine value of the included angle between the movement direction of the third motion pipeline and the movement direction of the fourth motion pipeline is greater than or equal to a fourth threshold value, where the movement direction is a vector which indicates the position change of the target object in the motion pipeline according to a preset rule in the space-time dimension; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to a fifth threshold value, where the distance comprises a Euclidean distance.
The method provides a concrete method for connecting the motion pipelines, and the motion pipelines with high overlapping degree and similar motion directions are connected according to the positions of the motion pipelines in the space-time dimension.
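The linking test could be sketched as below, reusing tube_iou() for the cross-section overlap. The movement direction is taken here as the displacement of the box centre across the pipeline in the space-time dimension, which is only one possible instance of the "preset rule" mentioned above; the threshold values and the optional feature-distance callback are likewise placeholders.

```python
import math

def centre(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def movement_direction(tube):
    """Displacement of the box centre over the pipeline, with the temporal extent as third component."""
    (xs, ys), (xe, ye) = centre(tube.box_start), centre(tube.box_end)
    return (xe - xs, ye - ys, float(tube.t_end - tube.t_start))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def can_link(t1, t2, third_threshold=0.5, fourth_threshold=0.8,
             fifth_threshold=None, feature_distance=None):
    # condition 1: cross-section IoU over the temporally overlapping part
    if tube_iou(t1, t2) < third_threshold:
        return False
    # condition 2: cosine of the angle between the two movement directions
    if cosine(movement_direction(t1), movement_direction(t2)) < fourth_threshold:
        return False
    # condition 3 (optional): distance between neural network feature vectors, e.g. Euclidean
    if fifth_threshold is not None and feature_distance is not None:
        if feature_distance(t1, t2) > fifth_threshold:
            return False
    return True
```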
In a possible implementation manner of the first aspect, the acquiring a tracking trajectory of the target object according to the motion pipe specifically includes: grouping the motion pipelines to obtain t groups of motion pipelines, wherein t is the total number of video frames in the first video, the ith motion pipeline group in the t groups of motion pipelines comprises all motion pipelines starting from the ith video frame in the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, taking the motion pipelines in the ith motion pipeline group as initial tracking tracks to obtain a tracking track set; and connecting the motion pipelines in the ith motion pipeline group with the tracking tracks in the tracking track set in sequence according to the numbering sequence of the motion pipeline groups to obtain at least one tracking track. The method provides a concrete method for connecting the motion pipelines, the motion pipelines correspond to the position information of the target object in the video frame within a period of time, the motion pipelines are grouped according to the initial video frame, and each group of motion pipelines are connected in sequence, so that the target tracking efficiency can be improved.
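A sketch of the grouping-and-linking loop described in this implementation manner is given below: pipelines are grouped by their starting video frame, the first group seeds the tracking-track set, and later groups are matched against the current tracks in frame order using a linking test such as can_link() above. Starting a new track for an unmatched pipeline is an assumption of the sketch.

```python
def link_tubes_into_tracks(tubes, can_link_fn):
    # group i contains all motion pipelines starting from video frame i
    groups = {}
    for tube in tubes:
        groups.setdefault(tube.t_start, []).append(tube)
    if not groups:
        return []

    first = min(groups)
    tracks = [[t] for t in groups[first]]              # the first group seeds the track set
    for i in sorted(k for k in groups if k != first):  # remaining groups, in frame order
        for tube in groups[i]:
            # try to extend an existing track whose last pipeline links to this one
            track = next((tr for tr in tracks if can_link_fn(tr[-1], tube)), None)
            if track is not None:
                track.append(tube)
            else:
                tracks.append([tube])                  # assumption: unmatched pipelines open new tracks
    return tracks
```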
In a possible implementation manner of the first aspect, the pre-trained neural network model is obtained after an initial network model is trained, and the method further includes: inputting a first video sample into the initial network model for training to obtain the loss of a target object; and updating the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
In the method, the initial network model can be trained to obtain the neural network model of the output motion pipeline in the target tracking method.
In a possible implementation manner of the first aspect, the target object loss specifically includes: and comparing a motion pipeline truth value with a motion pipeline predicted value, wherein the motion pipeline truth value is a motion pipeline obtained by splitting a tracking track of a target object in the first video sample, and the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into the initial network model.
The method provides the intersection ratio between the motion pipeline true value and the motion pipeline predicted value as the target loss in the model training process, so that the accuracy of the position information of the target object indicated by the motion pipeline is high in the neural network model obtained by training.
In a possible implementation manner of the first aspect, the target object loss specifically includes: the method comprises the steps of comparing a motion pipeline truth value with a motion pipeline predicted value, and comparing a confidence coefficient of the motion pipeline truth value with a cross entropy of the confidence coefficient of the motion pipeline predicted value, wherein the motion pipeline truth value is a motion pipeline obtained by splitting a tracking track of a target object in a first video sample, the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into an initial network model, the confidence coefficient of the motion pipeline truth value is the probability that a category of the target object corresponding to the motion pipeline truth value belongs to a preset category of the target object, and the confidence coefficient of the motion pipeline predicted value is the probability that the category of the target object corresponding to the motion pipeline predicted value belongs to the preset category of the target object.
The method provides the intersection ratio between the motion pipeline true value and the motion pipeline predicted value as the target loss in the model training process, so that the accuracy of the position information of the target object indicated by the motion pipeline is high and the type of the target object can be accurately indicated.
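A hedged sketch of such a loss term is shown below: an IoU term between the motion pipeline truth value and predicted value, plus a cross-entropy term between their confidences, with tube_iou() as sketched earlier. The loss weighting and the exact tube-IoU definition are assumptions for illustration.

```python
import math

def tube_loss(pred_tube, true_tube, pred_conf, true_conf,
              w_iou=1.0, w_cls=1.0, eps=1e-7):
    # localisation term: higher overlap with the truth value -> lower loss
    iou_term = 1.0 - tube_iou(pred_tube, true_tube)
    # classification term: binary cross-entropy between the two confidences
    p = min(max(pred_conf, eps), 1.0 - eps)
    ce_term = -(true_conf * math.log(p) + (1.0 - true_conf) * math.log(1.0 - p))
    return w_iou * iou_term + w_cls * ce_term
```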
In one possible implementation form of the first aspect, the initial network model comprises a three-dimensional convolutional neural network or a recurrent neural network, the three-dimensional convolutional neural network comprising a three-dimensional residual neural network or a three-dimensional feature pyramid network. Optionally, the initial network model is obtained by combining a three-dimensional residual neural network and a three-dimensional feature pyramid network.
The initial network model in the method can be a three-dimensional convolution neural network, a recurrent neural network or a combination of the two, and the diversity of the types of the neural network models provides various possibilities for realizing the scheme.
In a possible implementation manner of the first aspect, the inputting the first video into a pre-trained neural network model, and acquiring the motion pipe of the target object specifically includes: dividing the first video into a plurality of video segments; and respectively inputting the video segments into the pre-trained neural network model to obtain the motion pipeline.
Considering the limitation of the neural network model to process the number of video frames, the video may be segmented first, and the video segment is input into the model, optionally, the number of video frames of the video segment is a preset value, for example, 8 frames.
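A sketch of this clip-wise inference is given below, assuming motion pipelines are returned as MotionTube-like objects with clip-local frame indices; `model`, the clip length of 8 frames and the index shift are illustrative placeholders.

```python
def tubes_from_video(frames, model, clip_len=8):
    """Cut the first video into fixed-length segments and run the network on each segment."""
    tubes = []
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        clip_tubes = model(clip)            # network predicts motion pipelines for this segment
        for tube in clip_tubes:             # shift clip-local frame indices to video time
            tube.t_start += start
            tube.t_end += start
        tubes.extend(clip_tubes)
    return tubes
```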
A second aspect of an embodiment of the present application provides a target tracking apparatus, including: an acquisition unit configured to acquire a first video, the first video including a target object; the acquisition unit is further configured to input the first video into a pre-trained neural network model, and acquire position information of the target object in at least two video frames and time information of the at least two video frames; the obtaining unit is further configured to obtain a tracking track of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames, where the tracking track includes the position information of the target object in the at least two video frames in the first video.
In a possible implementation manner of the second aspect, the obtaining unit is specifically configured to: acquiring a motion pipeline of the target object, wherein the motion pipeline is used for indicating time information and position information of the target object in at least two video frames of the first video, and the first video comprises a first video frame and a second video frame; the motion pipe corresponds to a quadrangular frustum in a spatio-temporal dimension, the spatio-temporal dimension comprising a temporal dimension and a two-dimensional spatial dimension, a position of a first bottom surface of the quadrangular frustum in the temporal dimension being used for indicating first time information of the first video frame, a position of a second bottom surface of the quadrangular frustum in the temporal dimension being used for indicating second time information of the second video frame, a position of the first bottom surface of the quadrangular frustum in the two-dimensional spatial dimension being used for indicating first position information of the target object in the first video frame, and a position of the second bottom surface of the quadrangular frustum in the two-dimensional spatial dimension being used for indicating second position information of the target object in the second video frame; the quadrangular frustum is used to indicate position information of the target object in all video frames between the first video frame and the second video frame of the first video.
In a possible implementation manner of the second aspect, the obtaining unit is specifically configured to: acquire a motion pipeline of the target object, wherein the motion pipeline is used for indicating position information of the target object in at least three video frames and time information of the at least three video frames, and the first video comprises a first video frame, a second video frame and a third video frame; the motion pipeline corresponds to a double quadrangular frustum in a spatio-temporal dimension, the double quadrangular frustum comprising a first quadrangular frustum and a second quadrangular frustum, the first quadrangular frustum comprising a first bottom surface and a second bottom surface, the second quadrangular frustum comprising the first bottom surface and a third bottom surface, the first bottom surface being a common bottom surface of the first quadrangular frustum and the second quadrangular frustum, a position of the first bottom surface in the temporal dimension being used for indicating first time information of the first video frame, a position of the second bottom surface in the temporal dimension being used for indicating second time information of the second video frame, a position of the third bottom surface in the temporal dimension being used for indicating third time information of the third video frame, the first video frame being located, in the temporal order of the first video, between the second video frame and the third video frame, a position of the first bottom surface in the two-dimensional spatial dimension being used for indicating first position information of the target object in the first video frame, a position of the second bottom surface in the two-dimensional spatial dimension being used for indicating second position information of the target object in the second video frame, and a position of the third bottom surface in the two-dimensional spatial dimension being used for indicating third position information of the target object in the third video frame; the double quadrangular frustum is used to indicate position information of the target object in all video frames between the second video frame and the third video frame of the first video.
In a possible implementation manner of the second aspect, the obtaining unit is specifically configured to: and acquiring the tracking track of the target object in the first video according to the motion pipeline.
In a possible implementation manner of the second aspect, the tracking trajectory specifically includes: and connecting the at least two motion pipelines corresponding to the quadrangular frustum in the space-time dimension to form the tracking track of the target object.
In a possible implementation manner of the second aspect, a length of the motion pipeline is a preset value, and the length of the motion pipeline indicates a number of video frames included in the at least two video frames.
In a possible implementation manner of the second aspect, the obtaining unit is further configured to: acquiring the category information of the target object through the pre-trained neural network model; and acquiring the tracking track of the target object in the first video according to the category information of the target object, the position information of the target object in at least two video frames and the time information of the at least two video frames.
In a possible implementation manner of the second aspect, the obtaining unit is specifically configured to: and acquiring the confidence coefficient of the motion pipeline through the pre-trained neural network model, wherein the confidence coefficient of the motion pipeline is used for determining the category information of the target object corresponding to the motion pipeline.
In a possible implementation manner of the second aspect, the apparatus further includes: and the processing unit is used for deleting the motion pipeline to obtain the deleted motion pipeline, and the deleted motion pipeline is used for obtaining the tracking track of the target object.
In one possible implementation of the second aspect, the motion pipeline comprises a first motion pipeline and a second motion pipeline; the processing unit is specifically configured to: if the repetition rate between the first motion pipeline and the second motion pipeline is larger than or equal to a first threshold value, delete the motion pipeline with the lower confidence coefficient from the first motion pipeline and the second motion pipeline, wherein the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union between the first motion pipeline and the second motion pipeline, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence coefficient indicates the probability that the class of the target object corresponding to the motion pipeline is a preset class.
In a possible implementation manner of the second aspect, the processing unit is specifically configured to: and deleting the motion pipeline according to a non-maximum suppression algorithm to obtain the deleted motion pipeline.
In a possible implementation manner of the second aspect, the confidence of any one of the pruned motion pipes is greater than or equal to a second threshold.
In a possible implementation manner of the second aspect, the obtaining unit is specifically configured to: connect a third motion pipeline and a fourth motion pipeline which meet preset conditions in the motion pipelines to obtain the tracking track of the target object; the preset conditions include one or more of: the intersection-over-union between the cross-sections of the third motion pipeline and the fourth motion pipeline in the part where they overlap in the time dimension is greater than or equal to a third threshold; the cosine value of the included angle between the movement direction of the third motion pipeline and the movement direction of the fourth motion pipeline is greater than or equal to a fourth threshold value, where the movement direction is a vector which indicates the position change of the target object in the motion pipeline according to a preset rule in the space-time dimension; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to a fifth threshold value, where the distance comprises a Euclidean distance.
In a possible implementation manner of the second aspect, the obtaining unit is specifically configured to: grouping the motion pipelines to obtain t groups of motion pipelines, wherein t is the total number of video frames in the first video, the ith motion pipeline group in the t groups of motion pipelines comprises all motion pipelines starting from the ith video frame in the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, taking the motion pipelines in the ith motion pipeline group as initial tracking tracks to obtain a tracking track set; and connecting the motion pipelines in the ith motion pipeline group with the tracking tracks in the tracking track set in sequence according to the numbering sequence of the motion pipeline groups to obtain at least one tracking track.
In a possible implementation manner of the second aspect, the obtaining unit is specifically configured to: inputting a first video sample into the initial network model for training to obtain the loss of a target object; and updating the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
In a possible implementation manner of the second aspect, the target object loss specifically includes: and comparing a motion pipeline truth value with a motion pipeline predicted value, wherein the motion pipeline truth value is a motion pipeline obtained by splitting a tracking track of a target object in the first video sample, and the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into the initial network model.
In a possible implementation manner of the second aspect, the target object loss specifically includes: the method comprises the steps of comparing a motion pipeline truth value with a motion pipeline predicted value, and comparing a confidence coefficient of the motion pipeline truth value with a cross entropy of the confidence coefficient of the motion pipeline predicted value, wherein the motion pipeline truth value is a motion pipeline obtained by splitting a tracking track of a target object in a first video sample, the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into an initial network model, the confidence coefficient of the motion pipeline truth value is the probability that a category of the target object corresponding to the motion pipeline truth value belongs to a preset category of the target object, and the confidence coefficient of the motion pipeline predicted value is the probability that the category of the target object corresponding to the motion pipeline predicted value belongs to the preset category of the target object.
In one possible implementation of the second aspect, the initial network model comprises a three-dimensional convolutional neural network or a recurrent neural network.
In a possible implementation manner of the second aspect, the processing unit is further configured to: dividing the first video into a plurality of video segments; the obtaining unit is specifically configured to: and respectively inputting the video segments into the pre-trained neural network model to obtain the motion pipeline.
A third aspect of embodiments of the present application provides an electronic device, which includes a processor and a memory, where the processor and the memory are connected to each other, where the memory is configured to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method according to any one of the foregoing first aspect and various possible implementation manners.
A fourth aspect of embodiments of the present application provides a computer program product containing instructions, which when run on a computer, causes the computer to perform the method according to the first aspect and any one of the various possible implementations.
A fifth aspect of embodiments of the present application provides a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method according to the first aspect and any one of the various possible implementations.
A sixth aspect of embodiments of the present application provides a chip, including a processor. The processor is used for reading and executing the computer program stored in the memory so as to execute the method in any possible implementation mode of any one aspect. Optionally, the chip may include a memory, and the memory and the processor may be connected to the memory through a circuit or a wire. Further optionally, the chip further comprises a communication interface, and the processor is connected to the communication interface. The communication interface is used for receiving data and/or information needing to be processed, the processor acquires the data and/or information from the communication interface, processes the data and/or information, and outputs a processing result through the communication interface. The communication interface may be an input output interface.
For technical effects brought by any one implementation manner of the second aspect, the third aspect, the fourth aspect, the fifth aspect, or the sixth aspect, reference may be made to technical effects brought by a corresponding implementation manner in the first aspect, and details are not repeated here.
According to the technical scheme, the embodiment of the application has the following advantages:
according to the target tracking method provided by the embodiment of the application, the position information of a target object in at least two video frames and the time information of the at least two video frames are obtained through a pre-trained neural network model, and the tracking track of the target object in the first video is determined according to the information. Because the time information of at least two video frames is output through the neural network model, the target tracking does not depend on the target detection result of a single video frame, the problem of detection failure in the scene with dense targets or more shelters can be reduced, and the target tracking performance is improved.
According to the target tracking method provided by the embodiment of the application, the motion pipeline of the target object is obtained through the pre-trained neural network model, and the tracking track of the target object is obtained through connecting the motion pipeline. Because the motion pipeline comprises the position information of the target object in at least two video frames, the target tracking does not depend on the target detection result of a single video frame, the problem of detection failure in a scene with dense targets or more shelters can be reduced, and the target tracking performance is improved.
In addition, in the prior art, a single-frame detection algorithm is relied on, so the accuracy of the whole algorithm is limited by the detector, the development cost of training a detection model and a tracking model step by step is high, and because the algorithm is divided into two stages, the calculation cost and the deployment difficulty in the machine learning process are increased. The target tracking method provided by the embodiment of the application can realize end-to-end training, complete the detection and tracking tasks of multiple target objects through the neural network model, and reduce the complexity of the model.
In addition, in the prior art, the features extracted from a single video frame are limited. The target tracking method provided by the embodiment of the application adopts the video as the original input, and the model can realize the tracking task through various features such as appearance features, motion track features or gait features, so that the target tracking performance can be improved.
In addition, the target tracking method provided by the embodiment of the application adopts the video as the original input of the model, the time dimension receptive field is increased, and the motion information of people can be better captured.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence agent framework provided by an embodiment of the present application;
FIG. 2 is a system architecture diagram according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of another convolutional neural network structure provided in the embodiments of the present application;
FIG. 5 is a schematic diagram of one embodiment of a motion pipeline in an embodiment of the present application;
FIG. 6 is a schematic diagram of splitting a tracking track into motion pipelines in an embodiment of the present application;
FIG. 7 is a schematic diagram of one embodiment of a motion pipeline in an embodiment of the present application;
FIG. 8 is a schematic diagram of another embodiment of a motion pipeline in an embodiment of the present application;
FIG. 9 is a schematic diagram of the intersection and union of motion pipelines in an embodiment of the present application;
FIG. 10 is a schematic diagram of an embodiment of a target detection method in the embodiment of the present application;
FIG. 11 is a schematic diagram of an embodiment of matching between motion pipelines in an embodiment of the present application;
FIG. 12 is a schematic diagram of an embodiment of a training method of a neural network model in the embodiment of the present application;
FIG. 13 is a schematic diagram of a tracking track and a motion pipeline in an embodiment of the present application;
FIG. 14 is a schematic diagram of a motion pipeline output by a neural network model in an embodiment of the present application;
FIG. 15 is a schematic diagram of another embodiment of a target tracking method in the embodiment of the present application;
FIG. 16 is a schematic diagram of an embodiment of a target tracking device in an embodiment of the present application;
FIG. 17 is a schematic diagram of another embodiment of a target tracking device in an embodiment of the present application;
FIG. 18 is a schematic diagram of another embodiment of a target tracking device in the embodiment of the present application;
FIG. 19 is a schematic diagram of another embodiment of an electronic device in an embodiment of the application;
fig. 20 is a diagram of a chip hardware structure according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a target tracking method, which is used for tracking a target in a video and can reduce tracking errors in a scene with dense targets or more shelters.
Embodiments of the present application will now be described with reference to the accompanying drawings, and it is to be understood that the described embodiments are merely illustrative of some, but not all, embodiments of the present application. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Moreover, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus. The naming or numbering of the steps appearing in the present application does not mean that the steps in the method flow have to be executed in the chronological/logical order indicated by the naming or numbering, and the named or numbered process steps may be executed in a modified order depending on the technical purpose to be achieved, as long as the same or similar technical effects are achieved.
For the sake of understanding, some technical terms related to the embodiments of the present application are briefly described as follows:
firstly, moving a pipeline and tracking a track.
A plurality of video frames of a video are obtained by consecutive shooting, and the video frame rate is generally known. The moving object in the video refers to a relative movement of an object with respect to the video capture device during the shooting process, and the object may be moving or not moving with reference to a world coordinate system of a real three-dimensional space, which is not limited herein.
During shooting of the target, the image information of the target object may be directly recorded in the video frame, or may be occluded by other objects in a part of the image frame.
A plurality of video frames of a video are expanded in the time dimension. Because the time intervals at which the video frames are shot are known, different video frames correspond to different moments in the time dimension; because the video frames are two-dimensional images, the image information of the video frames corresponds to data in the two-dimensional spatial dimension. Data presented in this form are defined in the embodiments of the application as data in the space-time dimension. In the space-time dimension, the position of the target in a video frame can be determined by the position in the time dimension, which determines the video frame, and the position in the two-dimensional spatial dimension, which indicates the position information of the target in the video frame.
Please refer to fig. 5, which is a schematic diagram of an embodiment of a motion pipeline in an embodiment of the present application.
Target tracking requires determining the position information of the target to be tracked (or simply, the target) in all video frames containing the target object. Generally, the target position in each video frame can be identified by a detection box (bounding box), and the detection boxes of the same target object in the individual video frames are connected in the space-time dimension, so that a track of the target in a space-time region, namely a tracking track (also called a motion track), is formed. The tracking track not only provides the position of the target object, but also connects the positions of the corresponding target object at different times; thus, the tracking track indicates both temporal and spatial information of the target object. Fig. 5 illustrates only the position information of the target object in 3 video frames; it can be understood that the tracking track can be obtained for all video frames of the video according to the above method. It should be noted that one or more targets may be included in the same video frame; the tracking track further includes an identification (ID) of the target object indicated by the tracking track, and the ID may be used to distinguish tracks corresponding to different targets.
The following describes the motion pipe and the tracking trajectory.
The motion pipeline is used for indicating position information of a target in at least two video frames, and corresponds to a quadrangular frustum in a space-time dimension, the position of a first bottom surface of the quadrangular frustum in a time dimension is used for indicating first time information of the first video frame, the position of a second bottom surface of the quadrangular frustum in the time dimension is used for indicating second time information of the second video frame, the position of the first bottom surface of the quadrangular frustum in a two-dimensional space dimension is used for indicating first position information of the target object in the first video frame, and the position of the second bottom surface of the quadrangular frustum in the two-dimensional space dimension is used for indicating second position information of the target object in the second video frame.
Optionally, the motion pipe is used to indicate position information of the object in at least three different video frames. In this embodiment and the following embodiments, the motion pipeline includes position information of the target in three different video frames as an example.
In the space-time dimension, the motion pipeline can be regarded as a double-quadrangular-frustum structure consisting of two quadrangular frustums with a common bottom surface; the three bottom surfaces of the double-quadrangular-frustum structure are parallel to each other, the direction perpendicular to the bottom surfaces is the time dimension, the extending direction of the bottom surfaces is the spatial dimension, and each bottom surface represents the position of the target in the video frame at the moment corresponding to that bottom surface. As shown in fig. 6, a motion pipeline of a double-quadrangular-frustum structure is shown, comprising: a first bottom surface 601, a second bottom surface 602, and a third bottom surface 603. The position of the first bottom surface 601, that is, the rectangle abcd, in the two-dimensional space where it is located represents the position information of the target object in the first video frame, and the position of the rectangle abcd mapped onto the time dimension represents the time information of the first video frame; similarly, the second bottom surface 602, that is, the rectangle ijkm, represents in the two-dimensional space where it is located the position information of the target object in the second video frame, and the position of the rectangle ijkm mapped onto the time dimension represents the time information of the second video frame; the third bottom surface 603, that is, the rectangle efgh, represents in the two-dimensional space where it is located the position information of the target object in the third video frame, and the position of the rectangle efgh mapped onto the time dimension represents the time information of the third video frame. It can be understood that, because there may be relative motion between the target object and the video capture device during the process of capturing the target object in the first video, when the rectangle abcd, the rectangle efgh, and the rectangle ijkm are mapped onto the same two-dimensional spatial plane, the corresponding positions may be different. The positions of the first bottom surface 601, the second bottom surface 602, and the third bottom surface 603 in the time dimension, that is, the positions of point a, point i, and point e mapped onto the time dimension, are a', i', and e', respectively, indicating the time information of the first video frame, the second video frame, and the third video frame, respectively. The length of the motion pipeline, that is, the interval between the position of the second bottom surface mapped onto the time dimension and the position of the third bottom surface mapped onto the time dimension, is used to indicate the number of video frames comprising the second video frame, the third video frame, and all video frames between them in the time sequence of the video.
It should be noted that the motion pipeline corresponding to the first video frame at least includes the position information of the object in the first video frame.
The tracking track may be split into a plurality of motion pipelines, as shown in fig. 6. Optionally, in this embodiment of the present application, the tracking track may be split into position frames of single video frames; each position frame serves as the common bottom surface of a double-quadrangular-frustum structure (such as the first bottom surface 601 in fig. 6) and is extended forwards and backwards along the tracking track to determine the other two bottom surfaces of the structure, namely the second bottom surface 602 and the third bottom surface 603, thereby obtaining the double-quadrangular-frustum structure with a common bottom surface, that is, the motion pipeline corresponding to that single video frame.
For the first video frame of the video, the forward extension can be regarded as 0; similarly, the backward extension of the last video frame is 0, so the motion pipelines corresponding to the first and last video frames degenerate into single quadrangular frustum structures. It should be noted that the length of a motion pipeline is defined as the number of video frames corresponding to the motion pipeline; as shown in fig. 6, the total number of video frames between the video frame corresponding to the second bottom surface 602 and the video frame corresponding to the third bottom surface 603 is the length of the motion pipeline.
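For concreteness, the double-quadrangular-frustum structure can be captured in a small data structure. The following Python sketch is illustrative only and is not part of the patent: it assumes axis-aligned position frames given by two corner points, which matches the 15-value description below (3 time data plus 4 spatial data per bottom surface).

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned position frame: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class MotionTube:
    """Double-quadrangular-frustum motion pipeline with three bottom surfaces.

    t_s, t_m, t_e are the frame indices of the start, common (middle) and end
    bottom surfaces; b_s, b_m, b_e are the corresponding position frames.
    """
    t_s: int
    t_m: int
    t_e: int
    b_s: Box
    b_m: Box
    b_e: Box

    def length(self) -> int:
        # Length of the pipeline = number of video frames it spans.
        return self.t_e - self.t_s + 1
```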
In the embodiment of the present application, the motion pipeline is represented by a specific data format, please refer to fig. 7 and fig. 8, which are two schematic diagrams of the data format of the motion pipeline in the embodiment of the present application.
As shown in FIG. 7, the first data format includes 3 data in the time dimension (t_s, t_m, t_e) and 12 data in the spatial dimension, for a total of 15 data. At the time corresponding to each time-dimension datum, the position information of the target object in space is determined by 4 data; for example, in the video frame at time t_s, the target position area B_s is determined by its 4 spatial data (for example, the coordinates of two diagonal corners of the position frame).
As shown in FIG. 8, the motion pipeline output by the neural network model may be represented in another data format, namely the motion pipeline of video frame m. B_m is the detection box of the target object in the common bottom surface, i.e. a partial image area in the corresponding video frame, and P is any pixel point in the B_m area. One datum marks the time of the pixel point P; in the time dimension, two further data, d_s and d_e, determine the lengths by which the motion pipeline extends forward and backward, respectively. Four data l_m, b_m, t_m, r_m indicate, with reference to point P, the offsets of the boundary of the B_m area from point P (regression values for B_m). Four data l_s, b_s, t_s, r_s respectively indicate the offsets of the boundary of the B_s area relative to the boundary of the B_m area (regression values for B_s); similarly, four data l_e, b_e, t_e, r_e respectively indicate the offsets of the boundary of the B_e area relative to the boundary of the B_m area (regression values for B_e).
It can be seen that both data formats represent a single motion pipeline by 15 data, and the two data formats can be converted into each other.
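As an illustration of that convertibility, the following sketch decodes the per-pixel regression format of fig. 8 into the absolute-box format of fig. 7, reusing the Box and MotionTube helpers sketched above. The dictionary keys and the sign conventions of the offsets are assumptions made for the example, not the patent's exact encoding.

```python
def decode_tube(frame_idx: int, px: float, py: float, d: dict) -> MotionTube:
    """Convert the per-pixel regression values `d` predicted at pixel (px, py)
    of video frame `frame_idx` (second data format) into absolute boxes
    (first data format)."""
    # The format carries one datum marking the time of point P; here it is
    # simply the index of the video frame the pixel belongs to.
    t_m = frame_idx
    # Temporal extent: how far the pipeline stretches backward / forward.
    t_s = t_m - d["ds"]
    t_e = t_m + d["de"]

    # Common bottom surface B_m: boundary offsets measured from point P.
    b_m = Box(px - d["lm"], py - d["tm"], px + d["rm"], py + d["bm"])

    # B_s and B_e: regressed as offsets relative to the boundaries of B_m.
    b_s = Box(b_m.x1 + d["ls"], b_m.y1 + d["ts"], b_m.x2 + d["rs"], b_m.y2 + d["bs"])
    b_e = Box(b_m.x1 + d["le"], b_m.y1 + d["te"], b_m.x2 + d["re"], b_m.y2 + d["be"])
    return MotionTube(t_s, t_m, t_e, b_s, b_m, b_e)
```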
Second, intersection-over-union (IoU).
IoU is commonly used to measure the degree to which two location areas overlap. In target detection, the intersection-over-union ratio (IoU) is the ratio of the intersection to the union of two rectangular detection boxes, and its value lies in [0, 1]. Clearly, when IoU is 0, the two location areas do not overlap; when IoU is 1, the two location areas coincide.
In the embodiment of the present application, the concept IoU is extended to a three-dimensional space in a space-time dimension for measuring the degree of overlapping of two motion pipes in the space-time dimension, please refer to fig. 9, which is a schematic diagram of the intersection and union of the motion pipes in the embodiment of the present application.
IoU(T^(1), T^(2)) = ∩(T^(1), T^(2)) / ∪(T^(1), T^(2))
where T^(1) represents motion pipeline 1, T^(2) represents motion pipeline 2, ∩(T^(1), T^(2)) represents the intersection of the two motion pipelines, and ∪(T^(1), T^(2)) represents the union of the two motion pipelines.
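A concrete way to evaluate this space-time IoU, under the assumption that a pipeline's position frame at an intermediate moment is obtained by linear interpolation between its bottom surfaces, is sketched below (reusing the Box/MotionTube helpers from above); the frame-by-frame accumulation is one possible implementation, not necessarily the patent's.

```python
def box_at(tube: MotionTube, t: int) -> Box:
    """Position frame of a pipeline at frame t, linearly interpolated between
    the two nearest bottom surfaces (a modelling assumption)."""
    if t <= tube.t_m:
        t0, t1, b0, b1 = tube.t_s, tube.t_m, tube.b_s, tube.b_m
    else:
        t0, t1, b0, b1 = tube.t_m, tube.t_e, tube.b_m, tube.b_e
    a = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
    lerp = lambda u, v: u + a * (v - u)
    return Box(lerp(b0.x1, b1.x1), lerp(b0.y1, b1.y1),
               lerp(b0.x2, b1.x2), lerp(b0.y2, b1.y2))

def box_area(b: Box) -> float:
    return max(0.0, b.x2 - b.x1) * max(0.0, b.y2 - b.y1)

def box_inter(a: Box, b: Box) -> float:
    return box_area(Box(max(a.x1, b.x1), max(a.y1, b.y1),
                        min(a.x2, b.x2), min(a.y2, b.y2)))

def tube_iou(t1: MotionTube, t2: MotionTube) -> float:
    """Space-time IoU: per-frame areas accumulated over every frame covered by
    either pipeline; frames covered by only one pipeline add union but no
    intersection."""
    lo, hi = min(t1.t_s, t2.t_s), max(t1.t_e, t2.t_e)
    inter = union = 0.0
    for t in range(lo, hi + 1):
        a1 = box_area(box_at(t1, t)) if t1.t_s <= t <= t1.t_e else 0.0
        a2 = box_area(box_at(t2, t)) if t2.t_s <= t <= t2.t_e else 0.0
        i = box_inter(box_at(t1, t), box_at(t2, t)) if a1 > 0 and a2 > 0 else 0.0
        inter += i
        union += a1 + a2 - i
    return inter / union if union > 0 else 0.0
```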
Embodiments of the present application will now be described with reference to the accompanying drawings, and it is to be understood that the described embodiments are merely illustrative of some, but not all, embodiments of the present application. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Moreover, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus. The naming or numbering of the steps appearing in the present application does not mean that the steps in the method flow have to be executed in the chronological/logical order indicated by the naming or numbering, and the named or numbered process steps may be executed in a modified order depending on the technical purpose to be achieved, as long as the same or similar technical effects are achieved.
The target tracking method provided by the embodiment of the application relates to the technical field of artificial intelligence, and the following briefly introduces an artificial intelligence system. FIG. 1 shows a schematic diagram of an artificial intelligence body framework that describes the overall workflow of an artificial intelligence system, applicable to the general artificial intelligence field requirements.
The artificial intelligence topic framework described above is set forth below in terms of two dimensions, the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "smart information chain" reflects a list of processes processed from the acquisition of data. For example, the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making and intelligent execution and output can be realized. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process.
The 'IT value chain' reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology) up to the industrial ecology of the system.
(1) Infrastructure:
the infrastructure provides computing power support for the artificial intelligent system, realizes communication with the outside world, and realizes support through a foundation platform. Communicating with the outside through a sensor; the computing power is provided by intelligent chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA and the like); the basic platform comprises distributed computing framework, network and other related platform guarantees and supports, and can comprise cloud storage and computing, interconnection and intercommunication networks and the like. For example, sensors and external communications acquire data that is provided to intelligent chips in a distributed computing system provided by the base platform for computation.
(2) Data
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.
The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
The intelligent product and industry application refers to the product and application of an artificial intelligence system in various fields, and is the encapsulation of an artificial intelligence integral solution, the intelligent information decision is commercialized, and the landing application is realized, and the application field mainly comprises: intelligent manufacturing, intelligent transportation, intelligent home, intelligent medical treatment, intelligent security, automatic driving, safe city, intelligent terminal and the like.
In the target tracking method provided in the embodiment of the present application, the motion pipeline of the target object is obtained through a deep neural network, and a system architecture for performing data processing based on the deep neural network is briefly described below, referring to fig. 2, which provides a system architecture 200 in the embodiment of the present application. The data collection device 260 is configured to collect video data of the moving object and store the video data in the database 230, and the training device 220 generates the object model/rule 201 based on the video sample containing the moving object maintained in the database 230. How the training device 220 derives the target model/rule 201 based on the video sample of the moving target will be described in more detail below, and the target model/rule 201 can be used in application scenarios such as single target tracking, multi-target tracking, and virtual reality.
In the embodiment of the present application, training may be performed based on a video sample of a moving object, and specifically, various video samples containing the moving object may be collected by the data collection device 260 and stored in the database 230. In addition, video data can be directly obtained from a commonly used database.
The target model/rule 201 may be derived based on a deep neural network, which is described below.
The operation of each layer in the deep neural network can be described by the mathematical expression y = a(W·x + b). From the work of each layer in the deep neural network at the physical level, the transformation from the input space to the output space (i.e. from the row space to the column space of the matrix) is accomplished by five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. zooming in/out; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by W·x, operation 4 is completed by + b, and operation 5 is realized by a(). The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of such things. W is a weight vector, each value in the vector representing the weight value of one neuron in this layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, i.e. the weight W of each layer controls how the space is transformed. The purpose of training the deep neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrix.
Because it is desirable that the output of the deep neural network is as close as possible to the value actually desired to be predicted, the weight vector of each layer of the neural network can be updated by comparing the predicted value of the current network with the value actually desired to be predicted, and then updating the weight vector according to the difference between the predicted value and the value actually desired (of course, there is usually an initialization process before the first update, that is, parameters are configured in advance for each layer in the deep neural network). Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value", which are loss functions (loss functions) or objective functions (objective functions), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, if the higher the output value (loss) of the loss function indicates the larger the difference, the training of the deep neural network becomes the process of reducing the loss as much as possible.
The target models/rules obtained by the training device 220 may be applied in different systems or devices. In FIG. 2, the execution device 210 is configured with an I/O interface 212 to interact with data from an external device, and a "user" may input data to the I/O interface 212 via a client device 240.
The execution device 210 may call data, code, etc. from the data storage system 250 and may store data, instructions, etc. in the data storage system 250.
The calculation module 211 processes the input data using the target model/rule 201, and for example, the calculation module 211 may parse the input video to obtain a feature indicating the target position information in the video frame.
The correlation function 213 may perform pre-processing on the image data in the calculation module 211, such as video pre-processing, including video slicing, etc.
The correlation function 214 may perform pre-processing on the image data in the calculation module 211, such as video pre-processing, including video slicing, etc.
Finally, the I/O interface 212 returns the results of the processing to the client device 240 for presentation to the user.
Further, the training device 220 may generate corresponding target models/rules 201 based on different data for different targets to provide better results to the user.
In the case shown in FIG. 2, the user may manually specify the data to be input into the execution device 210, for example by operating in an interface provided by the I/O interface 212. Alternatively, the client device 240 may automatically input data into the I/O interface 212 and obtain the results; if the client device 240 needs the user's authorization to input data automatically, the user may set the corresponding permissions in the client device 240. The user can view the result output by the execution device 210 at the client device 240, and the specific presentation form can be display, sound, action, and the like. The client device 240 may also act as a data collection end and store the collected training data in the database 230.
It should be noted that fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 2, the data storage system 250 is an external memory with respect to the execution device 210, and in other cases, the data storage system 250 may also be disposed in the execution device 210.
The deep neural network used in the embodiment of the present application to extract a motion pipe from a video may be, for example, a Convolutional Neural Network (CNN). CNN is described in detail below.
CNN is a deep neural network with a convolution structure and is a deep learning architecture; deep learning refers to multiple levels of learning at different abstraction levels by a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which, taking image processing as an example, individual neurons respond to overlapping regions in the input image. Of course, other types are possible, and the application does not limit the type of deep neural network.
As shown in fig. 3, Convolutional Neural Network (CNN)100 may include an input layer 110, a convolutional/pooling layer 120, where the pooling layer is optional, and a neural network layer 130.
Convolutional layer/pooling layer 120:
Convolutional layer:
as shown in FIG. 3, convolutional layer/pooling layer 120 may include, for example, 121-126 layers, in one implementation, 121 layers are convolutional layers, 122 layers are pooling layers, 123 layers are convolutional layers, 124 layers are pooling layers, 125 layers are convolutional layers, and 126 layers are pooling layers; in another implementation, 121, 122 are convolutional layers, 123 are pooling layers, 124, 125 are convolutional layers, and 126 are pooling layers. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, convolutional layer 121 may include a plurality of convolution operators, also called kernels, whose role in image processing is to act as filters that extract specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined; during the convolution operation on an image, the weight matrix is usually moved over the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride) in the horizontal direction, so as to complete the task of extracting a specific feature from the image.
Convolution kernels also come in a variety of formats depending on the dimensionality of the data that needs to be processed. Commonly used convolution kernels include two-dimensional convolution kernels and three-dimensional convolution kernels. The two-dimensional convolution kernel is mainly applied to processing two-dimensional image data, and the three-dimensional convolution kernel can be applied to video processing, stereo image processing and the like due to the fact that the dimension of the depth or the time direction is increased. In the embodiment of the application, in order to extract information in a time dimension and a space dimension in a video through a neural network model, a convolution operation is simultaneously performed in the time dimension and the space dimension through a three-dimensional convolution kernel, so that the three-dimensional convolution neural network formed by the three-dimensional convolution kernel can obtain the characteristics of each video frame and can express the association and the change of the video frame along with the time.
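As an illustration of the three-dimensional convolution described above, the following PyTorch snippet (an assumption for illustration; the patent does not prescribe a framework or these layer sizes) shows a kernel that slides over the time and space dimensions at once, so the resulting features couple appearance within a frame with changes across frames.

```python
import torch
import torch.nn as nn

# A 3D convolution convolves over (time, height, width) simultaneously.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 7, 7),    # (time, height, width)
                   stride=(1, 2, 2), padding=(1, 3, 3))

clip = torch.randn(1, 3, 8, 256, 448)        # (batch, RGB, frames, H, W)
features = conv3d(clip)                      # -> (1, 64, 8, 128, 224)
print(features.shape)
```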
When convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (e.g., 121) tends to extract more general features, which may also be referred to as low-level features; as the depth of the convolutional neural network 100 increases, the later convolutional layers (e.g., 126) extract more complex features, such as features with high-level semantics, and features with higher-level semantics are more suitable for the problem to be solved. To facilitate description of the network structure, a plurality of convolutional layers may be referred to as a block.
A pooling layer:
since it is often necessary to reduce the number of training parameters, it is often necessary to periodically introduce pooling layers after the convolutional layer, i.e. the layers 121-126 as illustrated by 120 in fig. 3, may be one convolutional layer followed by one pooling layer, or may be multiple convolutional layers followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image.
The neural network layer 130:
after processing by convolutional layer/pooling layer 120, convolutional neural network 100 is not sufficient to output the required output information. Accordingly, a plurality of hidden layers (such as 131, 132, to 13n shown in fig. 3) and an output layer 140 may be included in the neural network layer 130, and parameters included in the plurality of hidden layers may be pre-trained according to related training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
The output layer 140 follows the plurality of hidden layers in the neural network layer 130 and is the last layer of the entire convolutional neural network 100.
It should be noted that the convolutional neural network 100 shown in fig. 3 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 4, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the overall neural network layer 130 for processing.
Optionally, the deep neural network used for extracting the motion pipelines from the video in the embodiment of the present application is a combination of a residual neural network and a feature pyramid network. The residual neural network makes deeper networks easier to train by letting the deep layers learn residuals; residual learning alleviates the vanishing-gradient and exploding-gradient problems of deep networks. The feature pyramid network detects targets of the corresponding scale on feature maps of different resolutions; each layer's output is obtained by fusing the feature maps of the current layer and a higher layer, so that each output feature map has sufficient feature expression capability.
The target tracking technology related to the target tracking method provided by the embodiment of the application is widely applied. For example, in automatic focusing during video shooting, a target tracking algorithm can help a photographer conveniently and accurately select a focus or flexibly switch the focus to track a target, which is particularly important in sports events and wildlife shooting. In a monitoring scene, a multi-target tracking algorithm can automatically complete the position tracking of selected target objects, so that a set target can be conveniently found, which is of great significance in the security field. In an automatic driving scene, a multi-target tracking algorithm can keep track of the movement trajectories and trends of surrounding pedestrians and vehicles, providing initial information for functions such as automatic driving and automatic obstacle avoidance. In a virtual reality scene, motion sensing games, gesture recognition, finger tracking and the like can also be realized through multi-target tracking technology.
A common target tracking method comprises two parts, detection and tracking: the targets appearing in each video frame are detected by a detection module, and then the targets appearing in each video frame are matched; in the matching process, the features of each target object in a single video frame are extracted, and target matching is realized by comparing the similarity of the features, so as to obtain the tracking track of each target object. Because this target tracking method detects first and then tracks, the target tracking effect depends on the single-frame detection algorithm; if a target is occluded, the detection fails and further causes a tracking error, so the performance is insufficient in scenes with dense targets or heavy occlusion.
The embodiment of the application provides a target tracking method, wherein a video is input into a pre-trained neural network model, a plurality of motion pipelines are output, and the plurality of motion pipelines are matched and connected to obtain the tracking tracks corresponding to one or more targets. Firstly, a motion pipeline comprises position information of a target object in at least two video frames, and target tracking does not depend on the target detection result of a single video frame, so that the problem of detection failure in scenes with dense targets or heavy occlusion can be reduced, and the target tracking performance is improved. Secondly, the conventional target tracking method depends on a single-frame detection algorithm, the accuracy of the whole algorithm is limited by the detector, the development cost of training the detection model and the tracking model step by step is high, and dividing the algorithm into two stages increases the calculation cost and deployment difficulty of the machine learning process. The target tracking method provided by the embodiment of the application can realize end-to-end training, complete the detection and tracking of multiple target objects through one neural network model, and reduce the complexity of the model. In addition, in the prior art the features extracted from a single video frame are limited; the target tracking method provided by the embodiment of the application takes the video as the original input, and the model can realize the tracking task through various features such as appearance features, motion track features or gait features, so that the target tracking performance can be improved. Finally, the target tracking method provided by the embodiment of the application takes the video as the original input of the model, which increases the receptive field in the time dimension and can better capture the motion information of a person.
Referring to fig. 10, a schematic diagram of an embodiment of the target tracking method in the embodiment of the present application is shown;
1001. preprocessing a video;
the target tracking device may pre-process the acquired video, optionally including one or more of the following: segmenting the video into segments of preset length, adjusting the video resolution, and adjusting and normalizing the color space.
Illustratively, when the video length is long, the video may be divided into 8-frame small segments in consideration of the data processing capability of the target tracking device.
Step 1001 is an optional step, and may or may not be executed.
1002. And inputting the video into the neural network model to obtain the motion pipeline and the confidence coefficient of the motion pipeline.
And inputting the video into a pre-trained neural network model, and acquiring the position information of the target object in at least two video frames and the time information of the at least two video frames. Optionally, the video is input into a pre-trained neural network model to obtain a motion pipe for each target object. The motion pipeline is configured to indicate time information and position information of the target object in at least two video frames of the first video, and specific ways of indicating the time information and the position information by the motion pipeline may refer to the foregoing description, which are not described herein again. The training process of the neural network model is described in detail in the following embodiments.
Optionally, the data format of the output motion pipelines is of the type shown in FIG. 8. Specifically, the input is a video I, I ∈ R^(t×h×w×3), where R represents the real number domain, t represents the number of frames of the video, h×w represents the video resolution, and 3 represents the RGB color channels; the output is the motion pipelines O, O ∈ R^(t×h'×w'×15), where h'×w' represents the resolution of the feature map output by the neural network. That is, t×h'×w' motion pipelines are output, where each video frame corresponds to h'×w' motion pipelines.
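The shape contract can be read as follows; the snippet below is a toy illustration with an assumed clip size and feature-map stride, and the dummy network merely stands in for the trained 3D CNN.

```python
import torch

t, h, w = 8, 256, 448              # clip length and resolution (example values)
hp, wp = h // 8, w // 8            # assumed feature-map resolution h' x w'

video = torch.randn(1, 3, t, h, w) # I in R^(t x h x w x 3), arranged batch-first

# Stand-in for the trained network, used only to show the output shape
# O in R^(t x h' x w' x 15): every cell of the t x h' x w' lattice owns one
# 15-value motion pipeline (class confidences are predicted separately).
dummy_net = lambda x: torch.zeros(x.shape[0], t, hp, wp, 15)
tubes = dummy_net(video)
print(tubes.shape)                 # torch.Size([1, 8, 32, 56, 15])
```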
Optionally, obtaining category information of the target object through the pre-trained neural network model; specifically, the confidence of the motion pipeline is obtained through a pre-trained neural network model, and the confidence of the motion pipeline can be used for determining the category information of the target object corresponding to the motion pipeline.
The motion pipelines are used for indicating the position information of targets in the video frames, and each motion pipeline corresponds to one target to be tracked. The confidence of a motion pipeline refers to the possibility that the target corresponding to the motion pipeline belongs to a preset category. Generally, the categories of the target objects to be tracked in the video need to be preset, for example: a person, a vehicle, or a dog. The confidences output for the motion pipelines respectively represent the probability that the target corresponding to each motion pipeline belongs to a preset category; the confidence is a value between 0 and 1, and the smaller the confidence, the lower the probability, while the larger the confidence, the higher the probability.
Optionally, the number of confidences of each motion pipe is equal to the number of preset categories of the target object, and each confidence correspondence indicates a likelihood that the motion pipe belongs to the category. The confidence degrees of the motion pipelines output by the neural network model constitute a confidence table.
Example 1: the preset categories of the target object are "person" and "background", where the background refers to an image region that does not contain a target object to be tracked. The confidences of the first motion pipeline for these categories are 0.1 and 0.9, respectively, and the confidences of the second motion pipeline are 0.7 and 0.3. Since there are only two possible categories, "person" or "background", the confidence threshold may be set to 0.5. Because the confidence 0.1 that the target object corresponding to the first motion pipeline belongs to "person" is less than or equal to 0.5, the probability that this target is a person is low, and because the confidence 0.9 that it belongs to "background" is greater than 0.5, the probability that it is background is high. The confidence 0.7 that the target object corresponding to the second motion pipeline belongs to "person" is greater than 0.5, which means the probability that this target is a person is high, and the confidence 0.3 that it belongs to "background" is less than 0.5, i.e. the probability that it is background is low.
Example 2: the preset categories of the target object are "person", "vehicle", and "background". The confidences of the first motion pipeline are 0.4, 0.1 and 0.2, and the confidences of the second motion pipeline are 0.2, 0.8 and 0.1. Since there are three possible categories, "person", "vehicle" or "background", 1/3 ≈ 0.33 may be used as the confidence threshold. Since 0.4 is greater than 0.33, the category with the highest confidence for the first motion pipeline is "person", i.e. the probability that the corresponding target object is a person is relatively high; similarly, the category with the highest confidence for the second motion pipeline is "vehicle", i.e. the probability that the corresponding target object is a vehicle is relatively high.
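A minimal way to turn such a confidence list into a category decision, under the assumption of a uniform threshold of 1/(number of categories), is sketched below; the class names are simply the ones used in the examples.

```python
def tube_category(confidences, classes=("person", "vehicle", "background")):
    """Pick the most likely preset category for a motion pipeline from its
    confidence list; return None when no confidence clears the threshold."""
    threshold = 1.0 / len(classes)
    best = max(range(len(classes)), key=lambda i: confidences[i])
    return classes[best] if confidences[best] > threshold else None

print(tube_category([0.4, 0.1, 0.2]))  # -> "person"  (example 2, first pipeline)
print(tube_category([0.2, 0.8, 0.1]))  # -> "vehicle" (example 2, second pipeline)
```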
1003. Deleting part of the motion pipeline;
before the tracking track of the target object is obtained according to the motion pipeline, the motion pipeline can be deleted to obtain the deleted motion pipeline, and the deleted motion pipeline is used for obtaining the tracking track of the target object. For a plurality of motion pipelines output by the neural network model, deletion can be carried out according to preset conditions.
In a plurality of motion pipelines obtained by neural network model prediction, each pixel point in each video frame corresponds to a motion pipeline, and targets appearing in the video frames usually occupy a plurality of pixel point positions, so that a plurality of motion pipelines used for indicating the position information of the same target object are provided.
Optionally, if the confidence of each motion pipeline is obtained, the category to which the target corresponding to each motion pipeline belongs may be determined according to the confidence, and the motion pipelines of each category are deleted respectively.
Optionally, obtaining the deleted motion pipelines specifically includes: if the repetition rate between a first motion pipeline and a second motion pipeline is greater than or equal to a first threshold, deleting the one of the first motion pipeline and the second motion pipeline with the lower confidence. Optionally, the repetition rate of the motion pipelines may be the IoU between the two motion pipelines, and the first threshold ranges from 0.3 to 0.7; for example, if the first threshold is 0.5 and the IoU between the first motion pipeline and the second motion pipeline is greater than or equal to 50%, the motion pipeline with the lower confidence is deleted. Optionally, the motion pipelines are pruned according to a non-maximum suppression (NMS) algorithm to obtain the deleted motion pipelines: with the IoU threshold of the motion pipelines set to 0.5, the NMS algorithm keeps only one motion pipeline for each target in each video frame. Pruning target detection results according to the NMS algorithm is an existing technique, and the specific process is not described here.
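A sketch of such a greedy NMS over motion pipelines, using the space-time tube_iou helper from above (the 0.5 threshold follows the example; the rest is an illustrative implementation choice):

```python
def tube_nms(tubes, scores, iou_threshold=0.5):
    """Keep the highest-confidence motion pipeline and suppress any remaining
    pipeline whose space-time IoU with a kept one reaches the threshold."""
    order = sorted(range(len(tubes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(tube_iou(tubes[i], tubes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return [tubes[i] for i in keep]
```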
In the plurality of motion pipelines predicted by the neural network model, each pixel point in each video frame corresponds to a motion pipeline, so pixel positions that lie in the background area of a video frame and are not covered by any target object also correspond to some motion pipelines. These can be understood as false motion pipelines and usually have a low confidence; to reduce the computational complexity of the subsequent motion pipeline connection step, they can be pruned according to the confidence of the motion pipelines.
Optionally, the confidence of any one of the deleted motion pipelines is greater than or equal to a second threshold, that is, the preset condition is that the confidence is less than or equal to the second threshold, where the second threshold is related to the number of preset categories of the target object, for example, if the number of preset categories of the target object is 2: "human" or "background", the second threshold is typically between 0.3 and 0.7, for example 0.5. If the number of categories of target objects is 10, the second threshold value is usually between 0.07 and 0.13, for example 0.1.
Step 1003 is an optional step, and may or may not be executed.
1004. Connecting a moving pipeline to obtain a tracking track;
optionally, since in this embodiment the position information of the target object in the at least two video frames and the time information of the at least two video frames are indicated by motion pipelines, the tracking track of the target object in the first video may be obtained according to the motion pipelines; specifically, the tracking track of the target object is formed by connecting, in the space-time dimension, at least two motion pipelines corresponding to quadrangular frustums.
Specifically, a plurality of motion pipelines indicating the position information of the same target object are connected to obtain a tracking track corresponding to each target, and the connection between the motion pipelines, or referred to as matching between the motion pipelines, needs to meet a preset condition. The acquiring of the tracking trajectory of the target object according to the moving pipe specifically includes: and connecting a third motion pipeline and a fourth motion pipeline which meet preset conditions in the motion pipelines to obtain the tracking track of the target object.
The specific content of the preset condition includes multiple kinds, and optionally, the preset condition includes one or more of the following: the intersection ratio between sections of the third motion pipe and the fourth motion pipe at the time dimension overlapping part is greater than or equal to a third threshold value; the cosine value of an included angle between the movement direction of the third movement pipeline and the movement direction of the fourth movement pipeline is greater than or equal to a fourth threshold value, and the movement direction is a vector which indicates the position change of a target object in the movement pipeline according to a preset rule in a space-time dimension; and the distance between the neural network feature vectors of the motion pipeline is smaller than or equal to a fifth threshold value, and the distance comprises a Euclidean distance.
Specifically, the intersection-over-union ratio between the motion pipeline sections corresponding to the time-dimension overlapping portions of the two motion pipelines is greater than or equal to the third threshold, the cosine of the angle between the motion directions of the motion pipelines is greater than or equal to the fourth threshold, and the distance between the neural network feature vectors of the motion pipelines is less than or equal to the fifth threshold; the distance may be, for example, a Euclidean distance. The neural network feature vector of a motion pipeline may be a feature vector output by any layer of the neural network model; optionally, the neural network feature vector of the motion pipeline is the feature vector output by the last layer of the three-dimensional (3D) convolutional neural network in the neural network model.
The motion direction of a motion pipeline is a vector indicating the position change of the target object between two bottom surfaces of the motion pipeline in the space-time dimension, and indicates the moving speed and moving direction of the target object. It should be noted that the motion direction of a motion pipeline may be determined according to a preset rule. For example, the vector of the position change of the target object between the two bottom surfaces of the motion pipeline farthest from each other in the time dimension (for example, Bs and Be of the motion pipeline shown in fig. 8) may be taken as the motion direction of the motion pipeline; or the vector of the position change of the target object between two adjacent bottom surfaces of the motion pipeline (for example, Bm and Be of the motion pipeline shown in fig. 8) may be taken as the motion direction; or the direction of the position change of the target object over a preset number of video frames, for example 5 frames, may be taken as the motion direction. Similarly, the direction of a tracking track may be defined as the direction of the position change of the target object over a preset number of video frames at the end of the track, or as the motion direction of the last motion pipeline at the end of the track. It is understood that the motion direction of a motion pipeline is generally defined as pointing from a certain moment to a later moment in the time dimension.
The value of the third threshold is not limited, and is usually 70% to 95%, such as 75%, 80%, 85%, or 90%, etc., and the value of the fourth threshold is not limited, and is usually cos (pi/6) to cos (pi/36), such as cos (pi/9), cos (pi/12), or cos (pi/18), etc. The value of the fifth threshold may be determined according to the size of the feature vector, and the specific numerical value is not limited.
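The following sketch checks the first two conditions (section overlap and motion-direction agreement) for a pair of pipelines, reusing the helpers defined earlier. The threshold values are picked from the ranges quoted above, and taking the farthest bottom surfaces for the motion direction is one of the preset rules mentioned; both choices are illustrative assumptions.

```python
import math

def motion_direction(tube: MotionTube):
    """Space-time displacement of the box centre between the two bottom
    surfaces farthest apart in time (one of the preset rules above)."""
    cx = lambda b: (b.x1 + b.x2) / 2
    cy = lambda b: (b.y1 + b.y2) / 2
    return (cx(tube.b_e) - cx(tube.b_s),
            cy(tube.b_e) - cy(tube.b_s),
            float(tube.t_e - tube.t_s))

def section_iou(t1: MotionTube, t2: MotionTube) -> float:
    """IoU of the two pipelines restricted to the frames where they overlap
    in the time dimension."""
    lo, hi = max(t1.t_s, t2.t_s), min(t1.t_e, t2.t_e)
    if lo > hi:
        return 0.0
    inter = union = 0.0
    for t in range(lo, hi + 1):
        b1, b2 = box_at(t1, t), box_at(t2, t)
        i = box_inter(b1, b2)
        inter += i
        union += box_area(b1) + box_area(b2) - i
    return inter / union if union > 0 else 0.0

def tubes_match(t1, t2, iou_th=0.8, cos_th=math.cos(math.pi / 12)) -> bool:
    """Both conditions must hold: the overlapping sections coincide well
    enough, and the motion directions point roughly the same way."""
    d1, d2 = motion_direction(t1), motion_direction(t2)
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.hypot(*d1) * math.hypot(*d2)
    direction_ok = norm > 0 and dot / norm >= cos_th
    return section_iou(t1, t2) >= iou_th and direction_ok
```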
Optionally, the following description takes as an example the preset conditions that the intersection-over-union ratio between the pipeline sections corresponding to the time-dimension overlapping part of the two motion pipelines is greater than or equal to the third threshold, and that the cosine of the angle between the motion directions of the motion pipelines is greater than or equal to the fourth threshold.
Please refer to fig. 11, which is a schematic diagram of an embodiment of matching between motion pipes in an embodiment of the present application.
Example 1: as shown in part a of fig. 11, if the intersection-over-union ratio between the motion pipeline sections corresponding to the time-dimension overlapping portions of the two motion pipelines is greater than or equal to the third threshold, and the cosine of the angle between the motion directions of the two motion pipelines is greater than or equal to the fourth threshold, that is, both the degree of overlap and the motion direction match, the two motion pipelines are successfully matched. It should be noted that the degree of overlap between two motion pipelines refers to the IoU between the motion pipeline sections where the two pipelines overlap in the time dimension.
Example 2, as shown in part b of fig. 11, if the cosine value of the included angle between the moving directions of the two moving pipes is smaller than the fourth threshold value, i.e. the moving directions are not matched, the matching of the two moving pipes is not successful.
Example 3, as shown in part c of fig. 11, if the intersection ratio between the motion pipe sections corresponding to the two motion pipe time dimension overlapping portions is less than the third threshold, i.e., the degree of overlap does not match, then the two motion pipe matching is not successful.
It should be noted that, because the two motion pipelines being matched overlap in the time dimension, there are two pieces of position information for the same target object in the video frames corresponding to the overlapping portion. The position of the target object in these video frames can be determined by averaging, or, according to a preset rule, one of the motion pipelines can be designated, for example the one whose common bottom surface corresponds to the earlier coordinate in the time dimension.
Optionally, a greedy algorithm may be used in the matching process of connecting all motion pipelines of the video, and the connection may be performed through a series of local optimal selections; the Hungarian algorithm can also be used to perform global optimal matching.
Connecting motion pipelines according to the greedy algorithm specifically comprises: calculating the affinity (defined as IoU × cos(θ), where θ is the angle between the motion directions) between every two motion pipelines to be matched, forming an affinity matrix. Matching motion pipeline pairs (Btube pairs) are then selected from the affinity matrix in a loop, starting from the maximum affinity, until the matching is complete.
Connecting motion pipelines according to the Hungarian algorithm specifically comprises: after the affinity matrix is obtained in the same way, the Hungarian algorithm is used to select the motion pipeline pairs.
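The sketch below builds the affinity described above and performs the greedy pairing between two groups of pipelines, reusing the earlier helpers; all names are illustrative. The Hungarian alternative could be obtained by feeding the negated affinity matrix to an assignment solver such as scipy.optimize.linear_sum_assignment.

```python
def affinity(t1: MotionTube, t2: MotionTube) -> float:
    """Affinity = section IoU x cos(angle between motion directions)."""
    d1, d2 = motion_direction(t1), motion_direction(t2)
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.hypot(*d1) * math.hypot(*d2)
    cos_theta = dot / norm if norm > 0 else 0.0
    return section_iou(t1, t2) * cos_theta

def greedy_pairs(tubes_a, tubes_b):
    """Select pipeline pairs in descending order of affinity until no
    positive-affinity pair of two unused pipelines remains."""
    scored = sorted(((affinity(a, b), i, j)
                     for i, a in enumerate(tubes_a)
                     for j, b in enumerate(tubes_b)), reverse=True)
    used_a, used_b, pairs = set(), set(), []
    for score, i, j in scored:
        if score <= 0:
            break
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return pairs
```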
Optionally, a specific process of connecting a plurality of moving pipes in the present embodiment is described below:
1) taking all motion pipelines starting from a first frame as initial tracking tracks to obtain a set of tracking tracks;
2) the motion pipelines starting from the second frame are connected with the tracking tracks in the tracking track set in sequence; if the preset conditions are met, the matching succeeds and the original tracking track is updated according to the motion pipeline. If the matching is unsuccessful, the motion pipeline is added to the set of tracking tracks as a new initial tracking track;
3) similarly, a motion pipeline starting from the ith frame is sequentially connected with a tracking track set, wherein i is a positive integer larger than 2 and smaller than t, t is the total frame number of the video, if a preset condition is met, the tracking track is updated according to the motion pipeline, and if the matching is not successful, the motion pipeline is used as an initial tracking track and is newly added into the tracking track set.
Optionally, this embodiment employs a greedy algorithm to connect the pipelines and the tracks in sequence, starting from the maximum affinity.
For example, let the motion pipelines starting from the first frame be the first group, the motion pipelines starting from the second frame be the second group, and similarly, the motion pipelines starting from the ith frame be the ith group. Suppose the first group includes 10 motion pipelines, the second group includes 8 motion pipelines, and the third group includes 13 motion pipelines. First, the 10 motion pipelines in the first group are taken as 10 initial tracking tracks, and the second group is connected with these initial tracking tracks: if a connection condition is met, the tracking track is updated; if not, the original initial tracking track is kept. If the 8 motion pipelines in the second group all meet the connection conditions and are successfully connected with 8 of the 10 initial tracking tracks, the set then contains 8 updated tracking tracks while the other 2 remain unchanged. Next, the 13 motion pipelines in the third group are connected with the tracks in the tracking track set. Since the set contains only 10 tracking tracks, even if every track is successfully extended by a motion pipeline of the third group, 3 motion pipelines are still not used to update any tracking track; these 3 motion pipelines serve as newly added initial tracking tracks, i.e. 3 new tracking tracks are added to the set.
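The frame-by-frame procedure of steps 1) to 3) can be sketched as follows, reusing tubes_match and section_iou from above; choosing the extension by highest section IoU is an illustrative stand-in for the greedy affinity ordering.

```python
def build_tracks(tubes_per_frame):
    """tubes_per_frame[i] holds the motion pipelines starting from frame i+1.
    Pipelines of the first frame seed the track set; every later pipeline
    either extends the best-matching track or starts a new one."""
    tracks = [[t] for t in tubes_per_frame[0]]
    for frame_tubes in tubes_per_frame[1:]:
        for tube in frame_tubes:
            candidates = [tr for tr in tracks if tubes_match(tr[-1], tube)]
            if candidates:
                best = max(candidates, key=lambda tr: section_iou(tr[-1], tube))
                best.append(tube)          # update the existing tracking track
            else:
                tracks.append([tube])      # unmatched pipeline starts a new track
    return tracks
```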
Optionally, the target category to which the target corresponding to the motion pipeline belongs is determined according to the confidence table of the motion pipeline, and the motion pipelines of different target categories are respectively connected to obtain the tracking track of the target object of each target category.
Alternatively, the spatial position of an occluded part may be obtained by interpolation over the motion pipelines, i.e. by filling in the missing positions from the surrounding pipeline values.
1005. Outputting a tracking track;
and outputting the connected tracking tracks according to a specific format, such as a video stream, a track log and the like.
After the tracking tracks are obtained, they may be processed into bounding boxes, which are superimposed on the original video and output to a display, completing real-time tracking deployment and realizing target tracking.
The target tracking method provided in the embodiment of the present application designs a pre-trained neural network model, and a training method of the neural network model is described below.
Please refer to fig. 12, which is a diagram illustrating an embodiment of a method for training a neural network model according to an embodiment of the present application.
1201. Training preparation work;
the training preparation work includes constructing a training hardware environment, building a network model, setting training parameters and the like.
Prepare the hardware environment required for training; illustratively, 32 V100-32G graphics cards are used in a distributed cluster of 4 nodes, and the inference process is completed on a single machine using 1 V100-16G graphics card.
The video data set is acquired, and a public data set, such as an MOT data set, can be selected. Optionally, the video samples in the data set may be processed to improve the diversity of data distribution and obtain better model generalization capability. Optionally, the processing of the video includes resolution scaling, whitening of the color space, random HSL (a color space, or color representation. H: hue, S: saturation, L: lightness) dithering of the video colors, random horizontal flipping of the video frames, etc.
Training parameters are set, including batch size, learning rate, optimizer type, and so on. Illustratively, the batch size is 32, and the learning rate starts at 10^(-3) and is reduced by a factor of 5 when the loss plateaus, for better convergence. The network substantially converges after 25K training iterations. To increase the generalization capability of the model, a second-order (L2) regularization of 10^(-5) is adopted, and the momentum coefficient is 0.9.
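Read as a concrete configuration, these hyperparameters could be wired up as below; the choice of SGD, the placeholder network, and the plateau scheduler are assumptions made for illustration.

```python
import torch

model = torch.nn.Conv3d(3, 15, kernel_size=3, padding=1)   # placeholder network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,    # initial lr 10^(-3)
                            momentum=0.9,                   # momentum coefficient
                            weight_decay=1e-5)              # L2 regularization
# Reduce the learning rate to 1/5 when the training loss stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.2, patience=5)
```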
1202. Splitting according to the manually marked track information to obtain a true value of the motion pipeline;
track information of manual annotation of video samples in the public dataset is obtained, including a target ID and a location frame of the target object in each video frame.
And splitting according to the manually marked track information, and acquiring a motion pipeline with the position frame of the target object as a public bottom surface in each frame. Based on the first data format of the motion pipe, the motion pipe is represented by 15 data.
The specific method for acquiring the moving pipeline comprises the following steps:
and splitting the tracking track into position frames of single video frames, taking each position frame as a common bottom surface in the double-quadrangular-frustum structure, extending forwards and backwards in the tracking track, and determining the other two bottom surfaces of the double-quadrangular-frustum structure, thereby obtaining the double-quadrangular-frustum structure with the common bottom surface, namely the motion pipeline corresponding to the single video frame.
There are various ways to split the tracking trajectory into motion pipes:
optionally, the splitting is performed according to a preset length of the pipeline, that is, the intervals between three bottom surfaces in the double-quadrangular frustum pyramid structure are set, for example, the intervals between the common bottom surface and the other two bottom surfaces are both 4, and the length of the moving pipeline is 8.
Optionally, in the splitting process, under the condition that the IoU between the double-quadrangular-frustum structure and the corresponding section of the original tracking track is kept greater than or equal to 85%, the length of the motion pipeline in the time dimension is extended as much as possible, and the structure with the longest time span is taken as the final expanded structure, as shown in fig. 13. Since the structure of a motion pipeline (Btube) is linear and the real tracking track (ground-truth track) is nonlinear, a long motion pipeline cannot fit the motion track well, i.e. the IoU decreases (IoU < η) as the length is extended, while motion pipelines with a larger IoU (IoU > η) are generally shorter. In the embodiment of the application, the motion pipeline that meets the minimum IoU threshold and has the longest length is taken as the split motion pipeline, which enlarges the temporal receptive field while still fitting the original track well. As shown in fig. 13, the overlapping portions of the motion pipelines (Overlap Part) may be used for connection matching between the motion pipelines.
Similarly, splitting the tracking tracks of all target objects in the video sample to obtain true values of a plurality of motion pipelines.
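A sketch of this ground-truth splitting rule, reusing the helpers above: for the position frame at frame t_m, the tube is grown outward as long as its linear interpolation still fits the annotated boxes with IoU ≥ η. The symmetric growth and the dict-of-boxes track representation (one annotated box per consecutive frame) are simplifying assumptions.

```python
def fit_iou(track: dict, t_s: int, t_m: int, t_e: int) -> float:
    """IoU between the linear tube through (t_s, t_m, t_e) and the annotated
    boxes of the track over the same frames; `track` maps frame -> Box."""
    tube = MotionTube(t_s, t_m, t_e, track[t_s], track[t_m], track[t_e])
    inter = union = 0.0
    for t in range(t_s, t_e + 1):
        b1, b2 = box_at(tube, t), track[t]
        i = box_inter(b1, b2)
        inter += i
        union += box_area(b1) + box_area(b2) - i
    return inter / union if union > 0 else 0.0

def longest_tube(track: dict, t_m: int, eta: float = 0.85) -> MotionTube:
    """Ground-truth motion pipeline for the position frame at t_m: the longest
    extension whose fit to the annotated track stays at or above eta."""
    frames = sorted(track)
    best = MotionTube(t_m, t_m, t_m, track[t_m], track[t_m], track[t_m])
    for r in range(1, len(frames)):
        t_s = max(frames[0], t_m - r)
        t_e = min(frames[-1], t_m + r)
        if fit_iou(track, t_s, t_m, t_e) < eta:
            break
        best = MotionTube(t_s, t_m, t_e, track[t_s], track[t_m], track[t_e])
    return best
```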
1203. Inputting a video sample into an initial network model for training to obtain a predicted value of a motion pipeline;
and inputting the video sample into an initial network model for training, and outputting a predicted value of the motion pipeline.
Optionally, the initial network model is a three-dimensional (3D) convolutional neural network or a recurrent neural network, and the like, where the 3D convolutional neural network includes: a 3D residual neural network or a 3D feature pyramid network, etc. Optionally, the neural network model is a combination of a 3D residual neural network and a 3D feature pyramid network.
And inputting the video sample into the initial network model, and outputting the motion pipelines of all the target objects.
The data format of the output motion pipelines is of the type shown in FIG. 8. Specifically, the input is a video I, I ∈ R^(t×h×w×3), where R represents the real number domain, t represents the number of frames of the video, h×w represents the video resolution, and 3 represents the RGB color channels; the output is the motion pipelines O, O ∈ R^(t×h'×w'×15), where h'×w' represents the resolution of the feature map output by the neural network. That is, t×h'×w' motion pipelines are output, where each video frame corresponds to h'×w' motion pipelines.
Optionally, a confidence level of the motion pipeline is also output, and the confidence level is used for indicating the category of the target object corresponding to the motion pipeline.
The execution order of step 1202 and step 1203 is not limited.
1204. Calculating the training loss;
since the splitting in step 1202 is performed according to the manually labeled track information, the obtained true values of the motion pipelines are in the first data format of the motion pipeline, with shape R^(n×15), where n is the number of motion pipelines;
and the motion pipelines output by the initial network model in step 1203 are in the second data format of the motion pipeline, with shape R^(t×h'×w'×15), where t×h'×w' is the number of motion pipelines.
In order to calculate the training loss value according to the true value and the predicted value, the real value of the motion pipeline obtained in step 1202 and the motion pipeline output by the neural network model need to be unified into one data format.
Optionally, in this embodiment of the present application, the true values of the motion pipelines are converted into the second data format. Referring to fig. 14, the t×h'×w' motion pipelines output by the neural network model include t×h'×w' P points (only P1 and P2 are illustrated in fig. 14), and these P points form a three-dimensional lattice distributed over the time and space dimensions. To implement the data conversion, the n motion pipeline truth values need to be mapped onto a similar three-dimensional lattice, using the following rule: if a point in the three-dimensional lattice lies inside the common bottom surface of the double-quadrangular-frustum structure corresponding to a motion pipeline truth value, that truth value is assigned to the motion pipeline position corresponding to the point. If a point in the lattice lies inside the common bottom surfaces corresponding to several motion pipeline truth values (i.e. a target-overlap scene), the motion pipeline with the smaller volume is assigned preferentially. After the assignment is completed, a motion pipeline truth value T with the format R^(t×h'×w'×15) is obtained; the points that are not assigned a truth value can be padded with 0, and a 0/1 truth table is attached to indicate whether each pipeline is a padded pipeline. This truth table A' can be used as the confidence corresponding to the motion pipeline truth values.
After the truth values are converted into the second data format, the loss between the truth value (T) and the predicted value (O) can be calculated.
Optionally, the loss function L is:
L = L1 + L2
L1 = -ln(IoU(T, O))
L2 = CrossEntropy(A, A′)
where IoU(T, O) represents the intersection-over-union between the motion pipeline truth value (T) and the motion pipeline predicted value (O), A is the confidence of the motion pipeline predicted value (O), A′ is the confidence of the motion pipeline truth value (T), and CrossEntropy(A, A′) is the cross entropy between A and A′.
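The loss above can be sketched as follows, assuming the per-pipeline IoU values have already been computed and writing CrossEntropy as a binary cross entropy over the 0/1 truth table; the masking, averaging, and clipping constant are illustrative choices only.

```python
import numpy as np

def training_loss(iou, a_pred, a_true, eps=1e-7):
    """Sketch of L = -ln(IoU(T, O)) + CrossEntropy(A, A') (shapes are assumptions).

    iou:    per-pipeline IoU between truth T and prediction O, in (0, 1].
    a_pred: predicted confidence A of each motion pipeline, in (0, 1).
    a_true: 0/1 truth table A' marking non-padded pipelines.
    """
    # L1 = -ln(IoU), averaged here over the pipelines that carry a truth value.
    l1 = -np.log(np.clip(iou, eps, 1.0))
    l1 = (l1 * a_true).sum() / max(a_true.sum(), 1.0)

    # L2 = CrossEntropy(A, A'), written as a binary cross entropy.
    p = np.clip(a_pred, eps, 1.0 - eps)
    l2 = -(a_true * np.log(p) + (1.0 - a_true) * np.log(1.0 - p)).mean()

    return l1 + l2
```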
1205. The neural network model is optimized using an optimizer based on training losses.
According to the training loss L obtained in the step 1204, parameters are updated through the optimizer, and the neural network model is optimized, so that the neural network model which can be used for realizing the target tracking method in the embodiment of the application is finally obtained.
The optimizer may be of various types, for example, a batch gradient descent (BGD) algorithm, a stochastic gradient descent (SGD) algorithm, or a mini-batch gradient descent (MBGD) algorithm.
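As a plain illustration of such a parameter update, not tied to any particular framework, a mini-batch gradient descent step could look like the sketch below, where params and grads are assumed to be lists of NumPy arrays and the learning rate is a placeholder value.

```python
def mbgd_step(params, grads, lr=0.01):
    """Sketch of one mini-batch gradient descent update.

    params, grads: lists of NumPy arrays with matching shapes, where grads are
    the gradients of the training loss L computed on one mini-batch of samples.
    """
    for p, g in zip(params, grads):
        p -= lr * g  # in-place parameter update
    return params
```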
Please refer to fig. 15, which is a schematic diagram of another embodiment of a target tracking method in an embodiment of the present application;
in the scheme, the target tracking device can track the moving target in the video in real time.
Specifically, the method includes the following steps.
1501. initializing a system;
the method starts, firstly, the system initialization of the target tracking device is carried out, and the device starting preparation work is completed;
1502. acquiring video content;
the video may be captured by the target tracking device in real time or may be captured via a communication network.
1503. Calculating through a neural network model to obtain a motion pipeline set;
the video obtained by 1502 is input into a pre-trained neural network model, and a motion pipeline set of the input video is obtained, including a motion pipeline of a target object corresponding to each video frame.
1504. Sequentially connecting the motion pipelines into a tracking track based on a greedy algorithm;
the basic idea of the greedy algorithm is to proceed step by step from a certain initial solution of the problem, and according to a certain optimization measure, each step is required to ensure that a local optimal solution can be obtained. It is understood that the algorithm for connecting the moving pipes may be replaced by other algorithms, and is not limited herein.
1505. Outputting a tracking track;
it should be noted that, for tracking of a single target object, a tracking trajectory of one target object is output, and for multi-target tracking, a tracking trajectory of each target object may be output, and specifically, the tracking trajectory may be processed such that a bounding box in each video frame is superimposed on an original video and displayed by a display module.
Considering that the video is a video shot in real time, the target tracking apparatus will continue to acquire the newly shot video content, and repeat steps 1502 to 1505 until the target tracking task is finished, which is not described herein.
With reference to fig. 16, a schematic diagram of an embodiment of a target tracking device in an embodiment of the present application is shown.
One or more of the modules in FIG. 16 may be implemented in software, hardware, firmware, or a combination thereof. The software or firmware includes, but is not limited to, computer program instructions or code, and may be executed by a hardware processor. The hardware includes, but is not limited to, various integrated circuits such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or an Application Specific Integrated Circuit (ASIC).
The target tracking device includes:
an acquisition unit 1601 configured to acquire a first video, where the first video includes a target object;
the obtaining unit 1601 is further configured to input the first video into a pre-trained neural network model, and obtain position information of the target object in at least two video frames and time information of the at least two video frames;
the obtaining unit 1601 is further configured to obtain a tracking track of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames, where the tracking track includes the position information of the target object in at least two video frames in the first video.
Optionally, the obtaining unit 1601 is specifically configured to: acquiring a motion pipeline of the target object, wherein the motion pipeline is used for indicating time information and position information of the target object in at least two video frames of the first video, and the first video comprises a first video frame and a second video frame; the motion pipe corresponds to a quadrangular frustum in a spatio-temporal dimension, the spatio-temporal dimension comprising a temporal dimension and a two-dimensional spatial dimension, a position of a first bottom surface of the quadrangular frustum in the temporal dimension being used for indicating first time information of the first video frame, a position of a second bottom surface of the quadrangular frustum in the temporal dimension being used for indicating second time information of the second video frame, a position of the first bottom surface of the quadrangular frustum in the two-dimensional spatial dimension being used for indicating first position information of the target object in the first video frame, and a position of the second bottom surface of the quadrangular frustum in the two-dimensional spatial dimension being used for indicating second position information of the target object in the second video frame; the quadrangular frustum is used to indicate position information of the target object in all video frames between the first video frame and the second video frame of the first video.
Optionally, the obtaining unit 1601 is specifically configured to: acquiring a motion pipeline of the target object, wherein the motion pipeline is used for indicating position information of the target object in at least three video frames and time information of the at least three video frames, and the first video comprises a first video frame, a second video frame and a third video frame; the motion pipeline corresponds to a double quadrangular frustum in a spatio-temporal dimension, the double quadrangular frustum comprising a first quadrangular frustum and a second quadrangular frustum, the first quadrangular frustum comprising a first bottom surface and a second bottom surface, the second quadrangular frustum comprising the first bottom surface and a third bottom surface, the first bottom surface being a common bottom surface of the first quadrangular frustum and the second quadrangular frustum, a position of the first bottom surface in the temporal dimension being used for indicating first time information of the first video frame, a position of the second bottom surface in the temporal dimension being used for indicating second time information of the second video frame, a position of the third bottom surface in the temporal dimension being used for indicating third time information of the third video frame, a temporal order of the first video frame in the first video being located between the second video frame and the third video frame, a position of the first bottom surface in the two-dimensional spatial dimension being used for indicating first position information of the target object in the first video frame, a position of the second bottom surface in the two-dimensional spatial dimension being used for indicating second position information of the target object in the second video frame, and a position of the third bottom surface in the two-dimensional spatial dimension being used for indicating third position information of the target object in the third video frame; the double quadrangular frustum is used for indicating position information of the target object in all video frames between the second video frame and the third video frame of the first video.
Optionally, the obtaining unit 1601 is specifically configured to: and acquiring the tracking track of the target object in the first video according to the motion pipeline.
Optionally, the tracking trajectory specifically includes: and connecting the at least two motion pipelines corresponding to the quadrangular frustum in the space-time dimension to form the tracking track of the target object.
Optionally, the length of the motion pipeline is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
Optionally, the obtaining unit 1601 is further configured to: acquiring the category information of the target object through the pre-trained neural network model; and acquiring the tracking track of the target object in the first video according to the category information of the target object, the position information of the target object in at least two video frames and the time information of the at least two video frames.
Optionally, the obtaining unit 1601 is specifically configured to: and acquiring the confidence coefficient of the motion pipeline through the pre-trained neural network model, wherein the confidence coefficient of the motion pipeline is used for determining the category information of the target object corresponding to the motion pipeline.
Optionally, the apparatus further comprises: a processing unit 1602, configured to prune the motion pipeline, and obtain the pruned motion pipeline, where the pruned motion pipeline is used to obtain the tracking trajectory of the target object.
Optionally, the motion pipeline comprises a first motion pipeline and a second motion pipeline; the processing unit 1602 is specifically configured to: if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, deleting the motion pipeline with the lower confidence coefficient from the first motion pipeline and the second motion pipeline, wherein the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union ratio between the first motion pipeline and the second motion pipeline, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence coefficient indicates the probability that the class of the target object corresponding to the motion pipeline is a preset class.
Optionally, the processing unit 1602 is specifically configured to: and deleting the motion pipeline according to a non-maximum suppression algorithm to obtain the deleted motion pipeline.
Optionally, the confidence of any one of the pruned motion pipes is greater than or equal to a second threshold.
Optionally, the obtaining unit 1601 is specifically configured to: connecting a third motion pipeline and a fourth motion pipeline which meet preset conditions in the motion pipelines to obtain a tracking track of the target object; the preset conditions include one or more of: the intersection ratio between sections of the third motion pipe and the fourth motion pipe at the time dimension overlapping part is greater than or equal to a third threshold; the cosine value of an included angle between the movement direction of the third movement pipeline and the movement direction of the fourth movement pipeline is greater than or equal to a fourth threshold value, and the movement direction is a vector which indicates the position change of a target object in the movement pipeline according to a preset rule in a space-time dimension; and the distance between the neural network feature vectors of the motion pipe is less than or equal to a fifth threshold value, wherein the distance comprises a Euclidean distance.
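For illustration, the preset conditions described above could be checked as in the sketch below; overlap_iou, direction, and feature are assumed callables supplied by the caller, and the thresholds are placeholder values rather than the actual third, fourth, and fifth thresholds.

```python
import numpy as np

def can_connect(tube_a, tube_b, overlap_iou, direction, feature,
                thr_iou=0.5, thr_cos=0.8, thr_dist=1.0):
    """Sketch of the preset connection conditions between two motion pipelines."""
    # Condition 1: IoU of the sections where the pipelines overlap in time.
    iou_ok = overlap_iou(tube_a, tube_b) >= thr_iou

    # Condition 2: cosine of the angle between the two movement directions.
    da, db = direction(tube_a), direction(tube_b)
    cos = float(np.dot(da, db) / (np.linalg.norm(da) * np.linalg.norm(db) + 1e-7))
    dir_ok = cos >= thr_cos

    # Condition 3: Euclidean distance between neural-network feature vectors.
    dist_ok = float(np.linalg.norm(feature(tube_a) - feature(tube_b))) <= thr_dist

    return iou_ok and dir_ok and dist_ok
```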
Optionally, the obtaining unit 1601 is specifically configured to: grouping the motion pipelines to obtain t groups of motion pipelines, wherein t is the total number of video frames in the first video, the ith motion pipeline group in the t groups of motion pipelines comprises all motion pipelines starting from the ith video frame in the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, taking the motion pipelines in the ith motion pipeline group as initial tracking tracks to obtain a tracking track set; and connecting the motion pipelines in the ith motion pipeline group with the tracking tracks in the tracking track set in sequence according to the numbering sequence of the motion pipeline groups to obtain at least one tracking track.
Optionally, the obtaining unit 1601 is specifically configured to: inputting a first video sample into the initial network model for training to obtain the loss of a target object; and updating the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
Optionally, the target object loss specifically includes: and comparing a motion pipeline truth value with a motion pipeline predicted value, wherein the motion pipeline truth value is a motion pipeline obtained by splitting a tracking track of a target object in the first video sample, and the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into the initial network model.
Optionally, the target object loss specifically includes: the method comprises the steps of comparing a motion pipeline truth value with a motion pipeline predicted value, and comparing a confidence coefficient of the motion pipeline truth value with a cross entropy of the confidence coefficient of the motion pipeline predicted value, wherein the motion pipeline truth value is a motion pipeline obtained by splitting a tracking track of a target object in a first video sample, the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into an initial network model, the confidence coefficient of the motion pipeline truth value is the probability that a category of the target object corresponding to the motion pipeline truth value belongs to a preset category of the target object, and the confidence coefficient of the motion pipeline predicted value is the probability that the category of the target object corresponding to the motion pipeline predicted value belongs to the preset category of the target object.
Optionally, the initial network model comprises a three-dimensional convolutional neural network or a recurrent neural network.
Optionally, the processing unit 1602 is further configured to: dividing the first video into a plurality of video segments;
the obtaining unit 1601 is specifically configured to: and respectively inputting the video segments into the pre-trained neural network model to obtain the motion pipeline.
The target tracking device provided by the embodiment of the application has multiple implementation forms, and optionally, the target tracking device comprises a video acquisition module, a target tracking module and an output module. The video acquisition module is used for acquiring a video including a moving target object, the target tracking module is used for inputting the video, a tracking track of the target object is output through the target tracking method provided by the embodiment of the application, and the output module is used for superposing the tracking track in the video and displaying the tracking track to a user.
In another possible implementation manner, please refer to fig. 17, which is a schematic diagram of another embodiment of the target tracking apparatus in the embodiment of the present application. The target tracking device comprises a video acquisition module and a target tracking module, and can be understood as front-end equipment. In order to realize the target tracking method, the front-end equipment and the back-end equipment need to be cooperatively processed.
As shown in fig. 17, the video capture module, which may be a video capture module in a surveillance camera, a video camera, a mobile phone or a vehicle-mounted image sensor, is responsible for capturing video data as an input of a tracking algorithm;
the target tracking module may be a processing unit in a camera processor, a mobile phone processor, a vehicle-mounted processing unit, and the like, and is configured to receive video input and control information sent by a backend device, where the control information includes a tracking target category, a tracking number, accuracy control, a model hyper-parameter, and the like. The target tracking method of the embodiment of the application is mainly deployed in the module. Please refer to fig. 18 for a description of the target tracking module.
The back-end equipment comprises an output module and a control module.
As shown in fig. 17, the output module may be, for example, a display unit of a background monitor, a printer, or a hard disk, and is used for displaying or otherwise outputting the tracking window;
and the control module is used for analyzing the output result, receiving the instruction of the user and sending the instruction to the target tracking module at the front end.
Please refer to fig. 18, which is a schematic diagram of another embodiment of a target tracking apparatus in an embodiment of the present application.
The target tracking apparatus includes: a video pre-processing module 1801, a prediction module 1802, and a motion pipeline connection module 1803.
The video preprocessing module 1801 is configured to segment an input video into suitable segments, and perform adjustment and normalization of video resolution and color space.
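As a rough sketch of this pre-processing (the segment length, target resolution, and the simple nearest-neighbour resize below are assumptions chosen only for illustration):

```python
import numpy as np

def preprocess_video(frames, segment_len=8, size=(256, 320)):
    """Sketch of splitting a video into segments and normalizing resolution/color.

    frames: list of H x W x 3 uint8 RGB frames.
    Returns a list of segments shaped (segment_len, size[0], size[1], 3),
    normalized to [0, 1].
    """
    def resize(frame):
        h, w = frame.shape[:2]
        ys = np.linspace(0, h - 1, size[0]).astype(int)
        xs = np.linspace(0, w - 1, size[1]).astype(int)
        return frame[ys][:, xs]

    norm = [resize(f).astype(np.float32) / 255.0 for f in frames]

    # Split the video into fixed-length segments; the incomplete tail is
    # dropped here purely for simplicity.
    segments = [np.stack(norm[i:i + segment_len])
                for i in range(0, len(norm) - segment_len + 1, segment_len)]
    return segments
```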
The prediction module 1802 is configured to extract spatio-temporal features from an input video segment, perform prediction, output a target motion pipeline and class information to which the motion pipeline belongs, and predict a future position of the target motion pipeline. The prediction module 1802 includes two sub-modules:
target class prediction module 18021: the class to which the target belongs is predicted based on features, such as confidence values, of the 3D convolutional neural network output.
Motion pipeline prediction module 18022: predicts the position of the current motion pipeline of the target based on the features output by the 3D convolutional neural network, namely, the coordinates of the motion pipeline in the spatio-temporal dimension.
The motion pipeline connection module 1803 analyzes the motion pipelines output by the prediction module. If a target appears for the first time, its motion pipeline is initialized as a new tracking trajectory. The spatio-temporal feature similarity and the spatial position proximity between motion pipelines are used as the connection features required for connecting them. Based on the motion pipelines and these connection features, the module connects the motion pipelines into a complete tracking trajectory by analyzing their spatial position coincidence and spatio-temporal feature similarity.
The connected tracking trajectories are then output in a specific format, such as a video stream or a track log.
Please refer to fig. 19, which is a diagram illustrating an embodiment of an electronic device according to an embodiment of the present application.
The electronic device 1900 may vary greatly depending on its configuration or performance, and may include one or more processors 1901 and a memory 1902, where the memory 1902 stores programs or data.
The memory 1902 may be, for example, a volatile memory or a nonvolatile memory. Optionally, the processor 1901 is one or more Central Processing Units (CPUs), which may be single-core or multi-core CPUs, and the processor 1901 may communicate with the memory 1902 to execute a series of instructions in the memory 1902 on the electronic device 1900.
The electronic device 1900 also includes one or more wired or wireless network interfaces 1903, such as an ethernet interface.
Optionally, although not shown in FIG. 19, electronic device 1900 may also include one or more power supplies; the input/output interface may be used to connect a display, a mouse, a keyboard, a touch screen device, a sensing device, or the like, and the input/output interface is an optional component, and may or may not be present, and is not limited herein.
The process executed by the processor 1901 in the electronic device 1900 in this embodiment may refer to the method process described in the foregoing method embodiment, which is not described herein again.
Please refer to fig. 20, which is a diagram of a chip hardware structure according to an embodiment of the present disclosure.
The embodiment of the present application provides a chip system, which may be used to implement the target tracking method, and specifically, the algorithm based on the convolutional neural network shown in fig. 3 and fig. 4 may be implemented in the NPU chip shown in fig. 20.
The neural network processor NPU 50 is mounted as a coprocessor on a main CPU (Host CPU), and tasks are allocated by the Host CPU. The core portion of the NPU is an arithmetic circuit 503, and the controller 504 controls the arithmetic circuit 503 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 503 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit takes the data of matrix A from the input memory 501, performs a matrix operation with matrix B, and stores the partial or final results of the obtained matrix in the accumulator 508.
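For intuition only, and not as a description of the actual hardware, the multiply-and-accumulate behaviour of the arithmetic circuit and accumulator can be mimicked in a few lines; the matrix sizes are arbitrary.

```python
import numpy as np

# Purely illustrative sizes; the actual tile sizes of the arithmetic circuit
# are not specified here.
A = np.random.rand(4, 3).astype(np.float32)   # input matrix A
B = np.random.rand(3, 5).astype(np.float32)   # weight matrix B

C = np.zeros((4, 5), dtype=np.float32)        # accumulator for partial results
for k in range(B.shape[0]):
    # Each step multiplies one column of A by one row of B and accumulates,
    # mirroring how partial results build up in the accumulator.
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)
```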
The unified memory 506 is used to store input data as well as output data. The weight data is directly transferred to the weight memory 502 by a direct memory access controller (DMAC) 505. The input data is also transferred into the unified memory 506 through the DMAC.
The bus interface unit (BIU) 510 is used for the interaction of the AXI bus with the DMAC and the instruction fetch buffer 509.
The bus interface unit 510 is used by the instruction fetch buffer 509 to fetch instructions from the external memory, and is also used by the memory access controller 505 to fetch the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 506 or to transfer weight data into the weight memory 502 or to transfer input data into the input memory 501.
The vector calculation unit 507 may include a plurality of operation processing units and, if necessary, further processes the output of the arithmetic circuit, for example, by vector multiplication, vector addition, exponential operation, logarithmic operation, or magnitude comparison. It is mainly used for non-convolution/fully-connected (FC) layer computation in the neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector calculation unit 507 can store the processed output vector to the unified memory 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 503, for example, for use in a subsequent layer of the neural network.
An instruction fetch buffer 509 connected to the controller 504 for storing instructions used by the controller 504;
the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are all On-Chip memories. The external memory is private to the NPU hardware architecture.
Among them, the operations of the layers in the convolutional neural networks shown in fig. 3 and 4 may be performed by the matrix calculation unit 212 or the vector calculation unit 507.
In the embodiments of the present application, various examples are provided to facilitate understanding of the solutions. However, these examples are merely examples and are not intended to represent the best mode of implementing the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (41)

1. A target tracking method, comprising:
acquiring a first video, wherein the first video comprises a target object;
inputting the first video into a pre-trained neural network model, and acquiring position information of the target object in at least two video frames and time information of the at least two video frames;
and acquiring a tracking track of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames, wherein the tracking track comprises the position information of the target object in the at least two video frames in the first video.
2. The method according to claim 1, wherein the acquiring the position information of the target object in the at least two video frames and the time information of the at least two video frames specifically comprises:
obtaining a motion pipe of the target object, the motion pipe being used to indicate time information and position information of the target object in at least two video frames of the first video, wherein,
the first video comprises a first video frame and a second video frame;
the motion pipe corresponds to a quadrangular frustum in a spatio-temporal dimension, the spatio-temporal dimension comprising a temporal dimension and a two-dimensional spatial dimension, a position of a first bottom surface of the quadrangular frustum in the temporal dimension being used for indicating first time information of the first video frame, a position of a second bottom surface of the quadrangular frustum in the temporal dimension being used for indicating second time information of the second video frame, a position of the first bottom surface of the quadrangular frustum in the two-dimensional spatial dimension being used for indicating first position information of the target object in the first video frame, and a position of the second bottom surface of the quadrangular frustum in the two-dimensional spatial dimension being used for indicating second position information of the target object in the second video frame;
the quadrangular frustum is used to indicate position information of the target object in all video frames between the first video frame and the second video frame of the first video.
3. The method according to claim 1, wherein the acquiring the position information of the target object in the at least two video frames and the time information of the at least two video frames specifically comprises:
obtaining a motion pipeline of the target object, the motion pipeline being used for indicating position information of the target object in at least three video frames and time information of the at least three video frames, wherein,
the first video comprises a first video frame, a second video frame and a third video frame;
the motion pipeline corresponds to a double quadrangular frustum in a spatio-temporal dimension, the double quadrangular frustum comprising a first quadrangular frustum and a second quadrangular frustum, the first quadrangular frustum comprising a first bottom surface and a second bottom surface, the second quadrangular frustum comprising the first bottom surface and a third bottom surface, the first bottom surface being a common bottom surface of the first quadrangular frustum and the second quadrangular frustum, a position of the first bottom surface in the temporal dimension being used for indicating first time information of the first video frame, a position of the second bottom surface in the temporal dimension being used for indicating second time information of the second video frame, a position of the third bottom surface in the temporal dimension being used for indicating third time information of the third video frame, a temporal order of the first video frame in the first video being located between the second video frame and the third video frame, a position of the first bottom surface in the two-dimensional spatial dimension being used for indicating first position information of the target object in the first video frame, a position of the second bottom surface in the two-dimensional spatial dimension being used for indicating second position information of the target object in the second video frame, and a position of the third bottom surface in the two-dimensional spatial dimension being used for indicating third position information of the target object in the third video frame;
the double quadrangular frustum is used to indicate position information of the target object in all video frames between the second video frame and the third video frame of the first video.
4. The method according to claim 2 or 3, wherein the obtaining of the tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames specifically comprises:
and acquiring the tracking track of the target object in the first video according to the motion pipeline.
5. The method according to any one of claims 2 to 4, wherein the tracking trajectory specifically comprises:
and connecting the at least two motion pipelines corresponding to the quadrangular frustum in the space-time dimension to form the tracking track of the target object.
6. The method according to any one of claims 2 to 5,
the length of the motion pipeline is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
7. The method according to any one of claims 1 to 6,
the method further comprises the following steps:
acquiring the category information of the target object through the pre-trained neural network model;
the acquiring the tracking track of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames comprises:
and acquiring the tracking track of the target object in the first video according to the category information of the target object, the position information of the target object in at least two video frames and the time information of the at least two video frames.
8. The method according to claim 7, wherein the obtaining of the class information of the target object corresponding to the motion pipe through the pre-trained neural network model specifically includes:
and acquiring the confidence coefficient of the motion pipeline through the pre-trained neural network model, wherein the confidence coefficient of the motion pipeline is used for determining the category information of the target object corresponding to the motion pipeline.
9. The method according to any one of claims 1 to 8,
before the acquiring the tracking track of the target object according to the motion pipeline, the method further comprises:
and deleting the motion pipeline to obtain the deleted motion pipeline, wherein the deleted motion pipeline is used for obtaining the tracking track of the target object.
10. The method of claim 9,
the deleting the motion pipeline and the acquiring the deleted motion pipeline specifically include:
the motion pipeline comprises a first motion pipeline and a second motion pipeline;
if the repetition rate between a first motion pipeline and a second motion pipeline is greater than or equal to a first threshold, deleting the motion pipeline with the lower confidence coefficient from the first motion pipeline and the second motion pipeline, wherein the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union ratio between the first motion pipeline and the second motion pipeline, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence coefficient indicates the probability that the class of the target object corresponding to the motion pipeline is a preset class.
11. The method of claim 9,
the deleting the motion pipeline and the acquiring the deleted motion pipeline specifically include:
and deleting the motion pipeline according to a non-maximum suppression algorithm to obtain the deleted motion pipeline.
12. The method of claim 9,
the confidence of any one of the deleted motion pipelines is greater than or equal to a second threshold.
13. The method according to any one of claims 2 to 12,
the acquiring of the tracking trajectory of the target object according to the motion pipeline specifically includes:
connecting a third motion pipeline and a fourth motion pipeline which meet preset conditions in the motion pipelines to obtain a tracking track of the target object;
the preset conditions include one or more of:
the intersection ratio between sections of the third motion pipe and the fourth motion pipe at the time dimension overlapping part is greater than or equal to a third threshold;
the cosine value of an included angle between the movement direction of the third movement pipeline and the movement direction of the fourth movement pipeline is greater than or equal to a fourth threshold value, and the movement direction is a vector which indicates the position change of a target object in the movement pipeline according to a preset rule in a space-time dimension; and the distance between the neural network feature vectors of the motion pipe is less than or equal to a fifth threshold value, wherein the distance comprises a Euclidean distance.
14. The method according to any one of claims 2 to 12,
the acquiring of the tracking trajectory of the target object according to the motion pipeline specifically includes:
grouping the motion pipelines to obtain t groups of motion pipelines, wherein t is the total number of video frames in the first video, the ith motion pipeline group in the t groups of motion pipelines comprises all motion pipelines starting from the ith video frame in the first video, and i is greater than or equal to 1 and less than or equal to t;
when i is 1, taking the motion pipelines in the ith motion pipeline group as initial tracking tracks to obtain a tracking track set;
and connecting the motion pipelines in the ith motion pipeline group with the tracking tracks in the tracking track set in sequence according to the numbering sequence of the motion pipeline groups to obtain at least one tracking track.
15. The method according to any one of claims 1 to 14,
the pre-trained neural network model is obtained after the initial network model is trained, and the method further comprises the following steps:
inputting a first video sample into the initial network model for training to obtain the loss of a target object;
and updating the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
16. The method according to claim 15, wherein the target object loss comprises in particular:
and comparing a motion pipeline truth value with a motion pipeline predicted value, wherein the motion pipeline truth value is a motion pipeline obtained by splitting a tracking track of a target object in the first video sample, and the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into the initial network model.
17. The method according to claim 15, wherein the target object loss comprises in particular:
the method comprises the steps of comparing a motion pipeline truth value with a motion pipeline predicted value, and comparing a confidence coefficient of the motion pipeline truth value with a cross entropy of the confidence coefficient of the motion pipeline predicted value, wherein the motion pipeline truth value is a motion pipeline obtained by splitting a tracking track of a target object in a first video sample, the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into an initial network model, the confidence coefficient of the motion pipeline truth value is the probability that a category of the target object corresponding to the motion pipeline truth value belongs to a preset category of the target object, and the confidence coefficient of the motion pipeline predicted value is the probability that the category of the target object corresponding to the motion pipeline predicted value belongs to the preset category of the target object.
18. The method according to any one of claims 15 to 17,
the initial network model comprises a three-dimensional convolutional neural network or a recurrent neural network.
19. The method according to any one of claims 1 to 18,
the inputting the first video into a pre-trained neural network model, and the obtaining of the motion pipeline of the target object specifically includes:
dividing the first video into a plurality of video segments;
and respectively inputting the video segments into the pre-trained neural network model to obtain the motion pipeline.
20. An object tracking device, comprising:
an acquisition unit configured to acquire a first video, the first video including a target object;
the acquisition unit is further configured to input the first video into a pre-trained neural network model, and acquire position information of the target object in at least two video frames and time information of the at least two video frames;
the obtaining unit is further configured to obtain a tracking track of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames, where the tracking track includes the position information of the target object in the at least two video frames in the first video.
21. The apparatus according to claim 20, wherein the obtaining unit is specifically configured to:
obtaining a motion pipe of the target object, the motion pipe being used to indicate time information and position information of the target object in at least two video frames of the first video, wherein,
the first video comprises a first video frame and a second video frame;
the motion pipe corresponds to a quadrangular frustum in a spatio-temporal dimension, the spatio-temporal dimension comprising a temporal dimension and a two-dimensional spatial dimension, a position of a first bottom surface of the quadrangular frustum in the temporal dimension being used for indicating first time information of the first video frame, a position of a second bottom surface of the quadrangular frustum in the temporal dimension being used for indicating second time information of the second video frame, a position of the first bottom surface of the quadrangular frustum in the two-dimensional spatial dimension being used for indicating first position information of the target object in the first video frame, and a position of the second bottom surface of the quadrangular frustum in the two-dimensional spatial dimension being used for indicating second position information of the target object in the second video frame;
the quadrangular frustum is used to indicate position information of the target object in all video frames between the first video frame and the second video frame of the first video.
22. The apparatus according to claim 20, wherein the obtaining unit is specifically configured to:
obtaining a motion pipeline of the target object, the motion pipeline being used for indicating position information of the target object in at least three video frames and time information of the at least three video frames, wherein,
the first video comprises a first video frame, a second video frame and a third video frame;
the motion pipeline corresponds to a double quadrangular frustum in a spatio-temporal dimension, the double quadrangular frustum comprising a first quadrangular frustum and a second quadrangular frustum, the first quadrangular frustum comprising a first bottom surface and a second bottom surface, the second quadrangular frustum comprising the first bottom surface and a third bottom surface, the first bottom surface being a common bottom surface of the first quadrangular frustum and the second quadrangular frustum, a position of the first bottom surface in the temporal dimension being used for indicating first time information of the first video frame, a position of the second bottom surface in the temporal dimension being used for indicating second time information of the second video frame, a position of the third bottom surface in the temporal dimension being used for indicating third time information of the third video frame, a temporal order of the first video frame in the first video being located between the second video frame and the third video frame, a position of the first bottom surface in the two-dimensional spatial dimension being used for indicating first position information of the target object in the first video frame, a position of the second bottom surface in the two-dimensional spatial dimension being used for indicating second position information of the target object in the second video frame, and a position of the third bottom surface in the two-dimensional spatial dimension being used for indicating third position information of the target object in the third video frame;
the double quadrangular frustum is used to indicate position information of the target object in all video frames between the second video frame and the third video frame of the first video.
23. The apparatus according to claim 21 or 22, wherein the obtaining unit is specifically configured to:
and acquiring the tracking track of the target object in the first video according to the motion pipeline.
24. The apparatus according to any of the claims 21 to 23, wherein the tracking trajectory comprises in particular:
and connecting the at least two motion pipelines corresponding to the quadrangular frustum in the space-time dimension to form the tracking track of the target object.
25. The apparatus of any one of claims 21 to 24,
the length of the motion pipeline is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
26. The apparatus according to any one of claims 20 to 25, wherein the obtaining unit is further configured to:
acquiring the category information of the target object through the pre-trained neural network model;
and acquiring the tracking track of the target object in the first video according to the category information of the target object, the position information of the target object in at least two video frames and the time information of the at least two video frames.
27. The apparatus according to claim 26, wherein the obtaining unit is specifically configured to:
and acquiring the confidence coefficient of the motion pipeline through the pre-trained neural network model, wherein the confidence coefficient of the motion pipeline is used for determining the category information of the target object corresponding to the motion pipeline.
28. The apparatus of any one of claims 20 to 27, further comprising:
and the processing unit is used for deleting the motion pipeline to obtain the deleted motion pipeline, and the deleted motion pipeline is used for obtaining the tracking track of the target object.
29. The apparatus of claim 28, wherein the motion pipeline comprises a first motion pipeline and a second motion pipeline;
the processing unit is specifically configured to:
if the repetition rate between a first motion pipeline and a second motion pipeline is greater than or equal to a first threshold, deleting the motion pipeline with the lower confidence coefficient from the first motion pipeline and the second motion pipeline, wherein the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union ratio between the first motion pipeline and the second motion pipeline, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence coefficient indicates the probability that the class of the target object corresponding to the motion pipeline is a preset class.
30. The apparatus according to claim 28, wherein the processing unit is specifically configured to:
and deleting the motion pipeline according to a non-maximum suppression algorithm to obtain the deleted motion pipeline.
31. The apparatus of claim 28,
the confidence of any one of the deleted motion pipelines is greater than or equal to a second threshold.
32. The apparatus according to any one of claims 21 to 31, wherein the obtaining unit is specifically configured to:
connecting a third motion pipeline and a fourth motion pipeline which meet preset conditions in the motion pipelines to obtain a tracking track of the target object;
the preset conditions include one or more of:
the intersection ratio between sections of the third motion pipe and the fourth motion pipe at the time dimension overlapping part is greater than or equal to a third threshold;
the cosine value of an included angle between the movement direction of the third movement pipeline and the movement direction of the fourth movement pipeline is greater than or equal to a fourth threshold value, and the movement direction is a vector which indicates the position change of a target object in the movement pipeline according to a preset rule in a space-time dimension; and the distance between the neural network feature vectors of the motion pipe is less than or equal to a fifth threshold value, wherein the distance comprises a Euclidean distance.
33. The apparatus according to any one of claims 21 to 31, wherein the obtaining unit is specifically configured to:
grouping the motion pipelines to obtain t groups of motion pipelines, wherein t is the total number of video frames in the first video, the ith motion pipeline group in the t groups of motion pipelines comprises all motion pipelines starting from the ith video frame in the first video, and i is greater than or equal to 1 and less than or equal to t;
when i is 1, taking the motion pipelines in the ith motion pipeline group as initial tracking tracks to obtain a tracking track set;
and connecting the motion pipelines in the ith motion pipeline group with the tracking tracks in the tracking track set in sequence according to the numbering sequence of the motion pipeline groups to obtain at least one tracking track.
34. The apparatus according to any one of claims 20 to 33, wherein the obtaining unit is specifically configured to:
inputting a first video sample into the initial network model for training to obtain the loss of a target object;
and updating the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
35. The apparatus according to claim 34, wherein the target object loss comprises in particular:
and comparing a motion pipeline truth value with a motion pipeline predicted value, wherein the motion pipeline truth value is a motion pipeline obtained by splitting a tracking track of a target object in the first video sample, and the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into the initial network model.
36. The apparatus according to claim 34, wherein the target object loss comprises in particular:
the method comprises the steps of comparing a motion pipeline truth value with a motion pipeline predicted value, and comparing a confidence coefficient of the motion pipeline truth value with a cross entropy of the confidence coefficient of the motion pipeline predicted value, wherein the motion pipeline truth value is a motion pipeline obtained by splitting a tracking track of a target object in a first video sample, the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into an initial network model, the confidence coefficient of the motion pipeline truth value is the probability that a category of the target object corresponding to the motion pipeline truth value belongs to a preset category of the target object, and the confidence coefficient of the motion pipeline predicted value is the probability that the category of the target object corresponding to the motion pipeline predicted value belongs to the preset category of the target object.
37. The apparatus of any one of claims 34 to 36,
the initial network model comprises a three-dimensional convolutional neural network or a recurrent neural network.
38. The apparatus according to any one of claims 20 to 37, wherein the processing unit is further configured to:
dividing the first video into a plurality of video segments;
the obtaining unit is specifically configured to:
and respectively inputting the video segments into the pre-trained neural network model to obtain the motion pipeline.
39. An electronic device comprising a processor and a memory, the processor and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, and wherein the processor is configured to invoke the program instructions to perform the method of any of claims 1 to 19.
40. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 19.
41. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1 to 19.
CN202010519876.2A 2020-06-09 2020-06-09 Target tracking method and target tracking device Pending CN113781519A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010519876.2A CN113781519A (en) 2020-06-09 2020-06-09 Target tracking method and target tracking device
PCT/CN2021/093852 WO2021249114A1 (en) 2020-06-09 2021-05-14 Target tracking method and target tracking device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010519876.2A CN113781519A (en) 2020-06-09 2020-06-09 Target tracking method and target tracking device

Publications (1)

Publication Number Publication Date
CN113781519A true CN113781519A (en) 2021-12-10

Family

ID=78834470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010519876.2A Pending CN113781519A (en) 2020-06-09 2020-06-09 Target tracking method and target tracking device

Country Status (2)

Country Link
CN (1) CN113781519A (en)
WO (1) WO2021249114A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116504068A (en) * 2023-06-26 2023-07-28 创辉达设计股份有限公司江苏分公司 Statistical method, device, computer equipment and storage medium for lane-level traffic flow

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006048875A2 (en) * 2004-11-05 2006-05-11 Yissum Research Development Company Of The Hebrew University Of Jerusalem Method and system for spatio-temporal video warping
EP1955205B1 (en) * 2005-11-15 2012-08-29 Yissum Research Development Company Of The Hebrew University Of Jerusalem Method and system for producing a video synopsis
CN101702233B (en) * 2009-10-16 2011-10-05 电子科技大学 Three-dimension locating method based on three-point collineation marker in video frame
EP3023938A1 (en) * 2014-11-21 2016-05-25 Thomson Licensing Method and apparatus for tracking the motion of image content in a video frames sequence using sub-pixel resolution motion estimation
EP3096291B1 (en) * 2015-05-18 2017-12-20 Thomson Licensing Method and device for bounding an object in a video
CN108509830B (en) * 2017-02-28 2020-12-01 华为技术有限公司 Video data processing method and device
CN108182696A (en) * 2018-01-23 2018-06-19 四川精工伟达智能技术股份有限公司 Image processing method, device and Multi-target position tracking system
CN110188719B (en) * 2019-06-04 2022-03-29 北京字节跳动网络技术有限公司 Target tracking method and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117011816A (en) * 2022-05-04 2023-11-07 动态Ad有限责任公司 Trace segment cleaning of trace objects
CN114972814A (en) * 2022-07-11 2022-08-30 浙江大华技术股份有限公司 Target matching method, device and storage medium
CN114972814B (en) * 2022-07-11 2022-10-28 浙江大华技术股份有限公司 Target matching method, device and storage medium
CN115451962A (en) * 2022-08-09 2022-12-09 中国人民解放军63629部队 Target tracking strategy planning method based on five-variable Carnot graph
CN115451962B (en) * 2022-08-09 2024-04-30 中国人民解放军63629部队 Target tracking strategy planning method based on five-variable Carnot diagram

Also Published As

Publication number Publication date
WO2021249114A1 (en) 2021-12-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination