WO2021249114A1 - Target tracking device and target tracking method - Google Patents

Target tracking device and target tracking method

Info

Publication number
WO2021249114A1
Authority
WO
WIPO (PCT)
Prior art keywords
motion
pipeline
video
target object
movement
Prior art date
Application number
PCT/CN2021/093852
Other languages
English (en)
Chinese (zh)
Inventor
庞博
卢策吾
袁伟
胡翔宇
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021249114A1



Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory

Definitions

  • This application relates to the field of image processing technology, and in particular to a target tracking method and target tracking device.
  • Target tracking is one of the most important and fundamental tasks in computer vision. Given a video containing a target object, its purpose is to output the position of the target object in every video frame of the video. Typically, a video and the category of the target object to be tracked are input to a computer, and the computer outputs the identifier (ID) of the target object together with its position information in each frame of the video, in the form of detection boxes.
  • Existing multi-target tracking methods include a detection stage and a tracking stage. The target objects appearing in each video frame are first detected by a detection module, and then the target objects appearing in different video frames are matched. During matching, the feature of each target object is extracted from a single video frame, targets are matched by comparing feature similarity, and the tracking trajectory of each target object is obtained.
  • In such methods, the tracking performance depends on the single-frame detection algorithm. If the target object is occluded, detection errors occur, which in turn cause tracking errors; performance is therefore insufficient in scenes where target objects are dense or occluded.
  • An embodiment of the present application provides a target tracking method for tracking targets in a video, which can reduce tracking errors caused by target occlusion.
  • A first aspect of the embodiments of the present application provides a target tracking method, including: acquiring a first video, where the first video includes a target object; inputting the first video into a pre-trained neural network model to acquire position information of the target object in at least two video frames and time information of the at least two video frames; and acquiring, according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, a tracking trajectory of the target object in the first video, where the tracking trajectory includes the position information of the target object in the at least two video frames of the first video.
  • This method obtains the position information of the target object in at least two video frames and the time information of the at least two video frames through a pre-trained neural network model.
  • Target tracking therefore does not depend on the detection result of a single video frame, which reduces detection failures in scenes with dense or heavily occluded targets and improves target tracking performance, as illustrated by the sketch below.
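  • A minimal sketch of the flow described in this aspect, under assumed names: the model interface (returning per-frame boxes and frame times for the target object) and the tensor layout are illustrative assumptions, not the patent's concrete API.

```python
import torch

def track_target(first_video: torch.Tensor, model: torch.nn.Module):
    """first_video: tensor of shape (T, C, H, W) containing the video frames."""
    with torch.no_grad():
        # The pre-trained network is assumed to return the target object's position
        # in at least two video frames together with the time (frame index) of each.
        positions, times = model(first_video.unsqueeze(0))
    # The tracking trajectory pairs each frame time with the corresponding position.
    trajectory = list(zip(times.tolist(), positions.tolist()))
    return trajectory
```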
  • Obtaining the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: obtaining a motion pipeline of the target object, where the motion pipeline is used to indicate time information and position information of the target object in at least two video frames of the first video, and the first video includes a first video frame and a second video frame. The motion pipeline corresponds to a quadrangular prism in the space-time dimension, where the space-time dimension includes a time dimension and a two-dimensional space dimension; the position of the first bottom surface of the quadrangular prism in the time dimension is used to indicate the first time information of the first video frame, the position of the second bottom surface of the quadrangular prism in the time dimension is used to indicate the second time information of the second video frame, the position of the first bottom surface of the quadrangular prism in the two-dimensional space is used to indicate the first position information of the target object in the first video frame, and the position of the second bottom surface of the quadrangular prism in the two-dimensional space is used to indicate the second position information of the target object in the second video frame.
  • This method obtains the motion pipeline of each video frame through a pre-trained neural network model. Since the motion pipeline includes the position information of the target object in at least two video frames, the position of the target in a video frame can be determined in the space-time dimension by a time in the time dimension and a position in the two-dimensional space: the time is used to determine the video frame, and the position in the two-dimensional space is used to indicate the position information of the target in that video frame.
  • This method maps the motion pipeline to a quadrangular prism in the space-time dimension, and visually displays the position information of the target in at least two video frames through that quadrangular prism.
  • The target tracking method therefore does not depend on the target detection result of a single video frame, which reduces detection failures in scenes with dense or heavily occluded targets and improves target tracking performance.
  • Obtaining the position information of the target object in at least two video frames and the time information of the at least two video frames may also specifically include: obtaining a motion pipeline of the target object, where the motion pipeline is used to indicate the position information of the target object in at least three video frames and the time information of the at least three video frames, and the first video includes a first video frame, a second video frame, and a third video frame. The motion pipeline corresponds to a double quadrangular prism in the space-time dimension. The double quadrangular prism includes a first quadrangular prism and a second quadrangular prism; the first quadrangular prism includes a first bottom surface and a second bottom surface, the second quadrangular prism includes the first bottom surface and a third bottom surface, and the first bottom surface is the common bottom surface of the first quadrangular prism and the second quadrangular prism. The position of the first bottom surface in the time dimension is used to indicate the first time information of the first video frame, the position of the second bottom surface in the time dimension is used to indicate the second time information of the second video frame, and the position of the third bottom surface in the time dimension is used to indicate the third time information of the third video frame; in the time sequence of the first video, the first video frame is located between the second video frame and the third video frame. The position of the first bottom surface in the two-dimensional space is used to indicate the first position information of the target object in the first video frame, the position of the second bottom surface in the two-dimensional space indicates the second position information of the target object in the second video frame, and the position of the third bottom surface in the two-dimensional space indicates the third position information of the target object in the third video frame. The double quadrangular prism is used to indicate the position information of the target object in the video frames of the first video between the second video frame and the third video frame.
  • In this implementation, the motion pipeline includes the position information of the target object in at least three video frames. Relative to the video frame corresponding to the motion pipeline (the first video frame), the at least three video frames include the earlier second video frame and the later third video frame, which expands the receptive field in the time dimension and can further improve target tracking performance.
  • The motion pipeline corresponds to a double quadrangular prism in the space-time dimension, through which the position information of the target in at least three video frames is visually displayed; specifically, it also includes the position information of the target in all the video frames between the two non-common bottom surfaces of the motion pipeline.
  • The structure of the real tracking trajectory of a target object is usually nonlinear. A motion pipeline with a double quadrangular prism structure can express two directions of the target's movement, so the real tracking trajectory can be better fitted in scenes where the movement direction changes.
  • Acquiring the tracking trajectory of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: acquiring the tracking trajectory of the target object in the first video according to the motion pipeline.
  • Obtaining the tracking trajectory of the target object in the first video according to the motion pipeline reduces detection failures in scenes with dense or heavily occluded targets and improves target tracking performance.
  • The tracking trajectory specifically includes a tracking trajectory of the target object formed by connecting at least two motion pipelines, each corresponding to a quadrangular prism in the space-time dimension.
  • Obtaining the tracking trajectory of the target object by connecting motion pipelines does not rely on the target detection result of a single video frame, which reduces detection failures in scenes with dense or heavily occluded targets and improves target tracking performance.
  • The length of the motion pipeline of a video frame is a preset value, where the length of the motion pipeline indicates the number of video frames included in the at least two video frames; optionally, the length of the motion pipeline is 4, 6, or 8.
  • When the length of the motion pipeline is a preset value, the number of video frames corresponding to each motion pipeline is the same, indicating the position change of the target object within the same time span. Compared with not presetting the motion pipeline length, this method reduces the computation of the neural network model and the time spent on target tracking.
  • The method further includes: obtaining category information of the target object through the pre-trained neural network model. Obtaining the tracking trajectory of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames then includes: acquiring the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
  • That is, this method can determine the category information of the target object corresponding to the motion pipeline through the pre-trained neural network model, and obtain the tracking trajectory of the target object based on the category information, position information, and time information.
  • Acquiring, through the pre-trained neural network model, the category information of the target object corresponding to the motion pipeline specifically includes: obtaining, through the pre-trained neural network model, the confidence of the motion pipeline, where the confidence of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
  • By means of the confidence, this method can distinguish whether a motion pipeline is a real motion pipeline indicating the target position, and can also use the confidence of the motion pipeline to distinguish the category of the target object corresponding to the motion pipeline.
  • Before acquiring the tracking trajectory of the target object according to the motion pipelines, the method further includes: pruning the motion pipelines to obtain pruned motion pipelines, where the pruned motion pipelines are used to obtain the tracking trajectory of the target object.
  • This method prunes the motion pipelines of the video frames, removing duplicate motion pipelines or motion pipelines with low confidence, which reduces the amount of computation in the motion pipeline connection step.
  • Pruning the motion pipelines to obtain the pruned motion pipelines specifically includes: the motion pipelines include a first motion pipeline and a second motion pipeline; if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to the first threshold, the motion pipeline with the lower confidence among the first motion pipeline and the second motion pipeline is removed, where the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union between them, the first motion pipeline and the second motion pipeline belong to motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to the motion pipeline is the preset category.
  • This describes a specific way of pruning the motion pipelines: motion pipelines with a repetition rate greater than or equal to the first threshold can be regarded as duplicate data, so the one with the lower confidence is removed and the one with the higher confidence is retained for pipeline connection, which reduces the amount of computation in the motion pipeline connection step.
  • Pruning the motion pipelines to obtain the pruned motion pipelines may also specifically include: pruning the motion pipelines according to a non-maximum suppression (NMS) algorithm to obtain the pruned motion pipelines.
  • Pruning according to the non-maximum suppression algorithm removes duplicate motion pipelines while retaining, for each target, the motion pipeline with the higher confidence, which reduces the computation of the pipeline connection step and improves target tracking efficiency; a sketch of such pruning follows.
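  • A minimal sketch of non-maximum suppression over motion pipelines, assuming a space-time IoU function is supplied by the caller (one possible implementation appears later in this document next to the IoU formula); the threshold value is illustrative, not fixed by the patent.

```python
def nms_motion_pipelines(pipelines, confidences, tube_iou, iou_threshold=0.5):
    """pipelines: list of motion pipelines; confidences: one score per pipeline;
    tube_iou: callable computing the space-time IoU of two motion pipelines."""
    order = sorted(range(len(pipelines)), key=lambda i: confidences[i], reverse=True)
    kept = []
    for i in order:
        # Keep a pipeline only if it does not heavily overlap an already-kept one.
        if all(tube_iou(pipelines[i], pipelines[j]) < iou_threshold for j in kept):
            kept.append(i)
    return [pipelines[i] for i in kept]
```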
  • The confidence of any motion pipeline remaining after pruning is greater than or equal to the second threshold.
  • Acquiring the tracking trajectory of the target object according to the motion pipelines specifically includes: connecting a third motion pipeline and a fourth motion pipeline, among the motion pipelines, that meet a preset condition, to obtain the tracking trajectory of the target object. The preset condition includes one or more of the following: the intersection-over-union between the sections of the third motion pipeline and the fourth motion pipeline that overlap in the time dimension is greater than or equal to the third threshold; the cosine of the angle between the movement direction of the third motion pipeline and the movement direction of the fourth motion pipeline is greater than or equal to the fourth threshold, where the movement direction is a vector that indicates, in the space-time dimension and according to a preset rule, the position change of the target object in the motion pipeline; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to the fifth threshold, where the distance includes a Euclidean distance.
  • This provides a specific way of connecting motion pipelines: based on the positions of the motion pipelines in the space-time dimension, motion pipelines with high overlap and similar movement directions are connected, as in the sketch below.
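  • A minimal sketch of the pairwise connection test described above. For simplicity the three quantities (overlap IoU, movement-direction vectors, feature vectors) are passed in directly, and all three conditions are required here, although the text allows any subset; the threshold values are illustrative assumptions.

```python
import numpy as np

def can_connect(overlap_iou, dir_a, dir_b, feat_a, feat_b,
                iou_thr=0.5, cos_thr=0.7, dist_thr=1.0):
    """overlap_iou: IoU of the sections of the two pipelines that overlap in time;
    dir_a, dir_b: movement-direction vectors; feat_a, feat_b: feature vectors."""
    # Condition 1: the IoU of the time-overlapping sections is large enough.
    if overlap_iou < iou_thr:
        return False
    # Condition 2: the movement directions are similar (cosine of the angle).
    cos = np.dot(dir_a, dir_b) / (np.linalg.norm(dir_a) * np.linalg.norm(dir_b) + 1e-8)
    if cos < cos_thr:
        return False
    # Condition 3: the Euclidean distance between feature vectors is small enough.
    return np.linalg.norm(np.asarray(feat_a) - np.asarray(feat_b)) <= dist_thr
```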
  • Acquiring the tracking trajectory of the target object according to the motion pipelines specifically includes: grouping the motion pipelines to obtain t groups of motion pipelines, where t is the total number of video frames in the first video and the i-th motion pipeline group among the t groups includes all motion pipelines starting from the i-th video frame in the first video, with i greater than or equal to 1 and less than or equal to t; when i is 1, the motion pipelines in the i-th group are used as initial tracking trajectories to obtain a tracking trajectory set; then, in order of group number, the motion pipelines in the i-th group are connected with the tracking trajectories in the tracking trajectory set to obtain at least one tracking trajectory.
  • This provides another specific way of connecting motion pipelines. Each motion pipeline corresponds to the position information of the target object in the video frames within a period of time; grouping the motion pipelines by their starting video frame and connecting each group in turn improves the efficiency of target tracking, as in the sketch below.
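  • A minimal sketch of the grouping-and-connecting procedure, assuming each motion pipeline records its starting frame as tube.start and that the caller supplies a connection predicate (for example, one built from the previous sketch); this is an illustrative reading of the procedure, not the patent's reference implementation.

```python
def link_motion_pipelines(tubes, num_frames, can_connect):
    """tubes: motion pipelines with a .start frame index (1-based);
    can_connect(tube_a, tube_b): the preset connection condition."""
    # Group motion pipelines by the video frame they start from.
    groups = {i: [] for i in range(1, num_frames + 1)}
    for tube in tubes:
        groups[tube.start].append(tube)

    # Pipelines starting at frame 1 initialize the tracking trajectory set.
    trajectories = [[tube] for tube in groups[1]]

    # Process the remaining groups in frame order, extending existing trajectories.
    for i in range(2, num_frames + 1):
        for tube in groups[i]:
            for traj in trajectories:
                if can_connect(traj[-1], tube):
                    traj.append(tube)
                    break
            else:
                trajectories.append([tube])  # no match: start a new trajectory
    return trajectories
```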
  • The pre-trained neural network model is obtained by training an initial network model, and the method further includes: inputting a first video sample into the initial network model for training and obtaining a target object loss; and updating the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
  • In this way, the initial network model can be trained into the neural network model that outputs the motion pipelines used in the target tracking method.
  • The target object loss specifically includes an intersection-over-union between a motion pipeline ground truth and a motion pipeline prediction, where the motion pipeline ground truth is the motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the motion pipeline prediction is the motion pipeline obtained by inputting the first video sample into the initial network model.
  • With the training loss based on the intersection-over-union between the ground-truth and predicted motion pipelines, the trained neural network model indicates the position information of the target object with high accuracy.
  • The target object loss may also specifically include: the intersection-over-union between the motion pipeline ground truth and the motion pipeline prediction, and the cross entropy between the confidence of the motion pipeline ground truth and the confidence of the motion pipeline prediction. The motion pipeline ground truth is the motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the motion pipeline prediction is the motion pipeline obtained by inputting the first video sample into the initial network model; the confidence of the motion pipeline ground truth is the probability that the target object category corresponding to the ground truth belongs to the preset target object category, and the confidence of the motion pipeline prediction is the probability that the target object category corresponding to the prediction belongs to the preset target object category.
  • With this training loss, the neural network model indicates the position information of the target object with high accuracy and can also accurately indicate the category of the target object. A possible form of this loss is sketched below.
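  • One plausible formalization of the combined loss, assuming an IoU-based localization term and a weighting factor λ; neither the "1 − IoU" form nor the weighting is fixed by this summary.

```latex
\mathcal{L} \;=\; \bigl(1 - \mathrm{IoU}\!\left(T^{\mathrm{gt}},\, T^{\mathrm{pred}}\right)\bigr)
\;+\; \lambda \,\mathrm{CE}\!\left(c^{\mathrm{gt}},\, c^{\mathrm{pred}}\right)
```

  • Here T^gt and T^pred are the ground-truth and predicted motion pipelines, and c^gt and c^pred are their confidences.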
  • the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network, and the three-dimensional convolutional neural network includes a three-dimensional residual neural network or a three-dimensional feature pyramid network.
  • the initial network model is obtained by combining a three-dimensional residual neural network and a three-dimensional feature pyramid network.
  • the initial network model in this method can be a three-dimensional convolutional neural network, a recurrent neural network, or a combination of the two.
  • the diversity of neural network model types provides multiple possibilities for the realization of the scheme.
  • Inputting the first video into the pre-trained neural network model to obtain the motion pipeline of the target object specifically includes: dividing the first video into multiple video clips, and inputting the multiple video clips into the pre-trained neural network model to obtain the motion pipelines.
  • That is, the video can be segmented first and the video segments are input to the model, where the number of video frames in each video segment is a preset value, for example, 8 frames, as in the sketch below.
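  • A minimal sketch of splitting a video into fixed-length clips before feeding them to the model. The clip length of 8 follows the example above; padding of the final clip is an assumption, since the summary does not specify how a shorter remainder is handled.

```python
def split_into_clips(frames, clip_len=8):
    """frames: list of video frames in temporal order."""
    clips = []
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        if len(clip) < clip_len:                       # pad the last clip by repeating
            clip = clip + [clip[-1]] * (clip_len - len(clip))
        clips.append(clip)
    return clips
```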
  • A second aspect of the embodiments of the present application provides a target tracking device, including an acquisition unit configured to acquire a first video, where the first video includes a target object. The acquisition unit is further configured to input the first video into a pre-trained neural network model to obtain position information of the target object in at least two video frames and time information of the at least two video frames; and the acquisition unit is further configured to acquire, according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, the tracking trajectory of the target object in the first video, where the tracking trajectory includes the position information of the target object in the at least two video frames of the first video.
  • The acquisition unit is specifically configured to acquire a motion pipeline of the target object, where the motion pipeline is used to indicate time information and position information of the target object in at least two video frames of the first video, and the first video includes a first video frame and a second video frame. The motion pipeline corresponds to a quadrangular prism in the space-time dimension, where the space-time dimension includes a time dimension and a two-dimensional space dimension; the position of the first bottom surface of the quadrangular prism in the time dimension is used to indicate the first time information of the first video frame, the position of the second bottom surface of the quadrangular prism in the time dimension is used to indicate the second time information of the second video frame, the position of the first bottom surface in the two-dimensional space is used to indicate the first position information of the target object in the first video frame, and the position of the second bottom surface in the two-dimensional space is used to indicate the second position information of the target object in the second video frame.
  • The acquisition unit is specifically configured to acquire a motion pipeline of the target object, where the motion pipeline is used to indicate position information of the target object in at least three video frames and time information of the at least three video frames, and the first video includes a first video frame, a second video frame, and a third video frame. The motion pipeline corresponds to a double quadrangular prism in the space-time dimension. The double quadrangular prism includes a first quadrangular prism and a second quadrangular prism; the first quadrangular prism includes a first bottom surface and a second bottom surface, the second quadrangular prism includes the first bottom surface and a third bottom surface, and the first bottom surface is the common bottom surface of the first quadrangular prism and the second quadrangular prism. The position of the first bottom surface in the time dimension is used to indicate the first time information of the first video frame, the position of the second bottom surface in the time dimension is used to indicate the second time information of the second video frame, and the position of the third bottom surface in the time dimension is used to indicate the third time information of the third video frame; in the time sequence of the first video, the first video frame is located between the second video frame and the third video frame. The position of the first bottom surface in the two-dimensional space dimension is used to indicate the first position information of the target object in the first video frame, the position of the second bottom surface in the two-dimensional space dimension indicates the second position information of the target object in the second video frame, and the position of the third bottom surface in the two-dimensional space dimension indicates the third position information of the target object in the third video frame. The double quadrangular prism is used to indicate the position information of the target object in the video frames of the first video between the second video frame and the third video frame.
  • the acquiring unit is specifically configured to acquire the tracking trajectory of the target object in the first video according to the motion pipeline.
  • The tracking trajectory specifically includes a tracking trajectory of the target object formed by connecting at least two motion pipelines, each corresponding to a quadrangular prism in the space-time dimension.
  • The length of the motion pipeline is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
  • The acquisition unit is further configured to: acquire category information of the target object through the pre-trained neural network model; and obtain the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
  • The acquisition unit is specifically configured to acquire the confidence of the motion pipeline through the pre-trained neural network model, where the confidence of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
  • The device further includes a processing unit configured to prune the motion pipelines to obtain pruned motion pipelines, where the pruned motion pipelines are used to obtain the tracking trajectory of the target object.
  • The motion pipelines include a first motion pipeline and a second motion pipeline, and the processing unit is specifically configured to: if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to the first threshold, remove the motion pipeline with the lower confidence among the first motion pipeline and the second motion pipeline, where the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union between them, the first motion pipeline and the second motion pipeline belong to motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to the motion pipeline is the preset category.
  • The processing unit is specifically configured to prune the motion pipelines according to a non-maximum suppression algorithm to obtain the pruned motion pipelines.
  • The confidence of any motion pipeline remaining after pruning is greater than or equal to the second threshold.
  • The acquisition unit is specifically configured to connect, among the motion pipelines, a third motion pipeline and a fourth motion pipeline that meet a preset condition, to obtain the tracking trajectory of the target object. The preset condition includes one or more of the following: the intersection-over-union between the sections of the third motion pipeline and the fourth motion pipeline that overlap in the time dimension is greater than or equal to a third threshold; the cosine of the angle between the movement direction of the third motion pipeline and the movement direction of the fourth motion pipeline is greater than or equal to a fourth threshold, where the movement direction is a vector that indicates, in the space-time dimension and according to a preset rule, the position change of the target object in the motion pipeline; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to a fifth threshold, where the distance includes a Euclidean distance.
  • The acquisition unit is specifically configured to: group the motion pipelines to obtain t groups of motion pipelines, where t is the total number of video frames in the first video and the i-th motion pipeline group includes all motion pipelines starting from the i-th video frame in the first video, with i greater than or equal to 1 and less than or equal to t; when i is 1, use the motion pipelines in the i-th group as initial tracking trajectories to obtain a tracking trajectory set; and, in order of group number, connect the motion pipelines in the i-th group with the tracking trajectories in the tracking trajectory set to obtain at least one tracking trajectory.
  • The acquisition unit is specifically configured to: input the first video sample into the initial network model for training and acquire the target object loss; and update the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
  • The target object loss specifically includes an intersection-over-union between a motion pipeline ground truth and a motion pipeline prediction, where the motion pipeline ground truth is the motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the motion pipeline prediction is the motion pipeline obtained by inputting the first video sample into the initial network model.
  • The target object loss specifically includes: the intersection-over-union between the motion pipeline ground truth and the motion pipeline prediction, and the cross entropy between the confidence of the motion pipeline ground truth and the confidence of the motion pipeline prediction. The motion pipeline ground truth is the motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the motion pipeline prediction is the motion pipeline obtained by inputting the first video sample into the initial network model; the confidence of the motion pipeline ground truth is the probability that the target object category corresponding to the ground truth belongs to the preset target object category, and the confidence of the motion pipeline prediction is the probability that the target object category corresponding to the prediction belongs to the preset target object category.
  • the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
  • The processing unit is further configured to divide the first video into multiple video clips, and the acquisition unit is specifically configured to input the multiple video clips into the pre-trained neural network model to obtain the motion pipelines.
  • A third aspect of the embodiments of the present application provides an electronic device, including a processor and a memory connected to each other, where the memory is configured to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method described in any one of the foregoing first aspect and its possible implementations.
  • A fourth aspect of the embodiments of the present application provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the method described in any one of the foregoing first aspect and its possible implementations.
  • A fifth aspect of the embodiments of the present application provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to execute the method described in any one of the foregoing first aspect and its possible implementations.
  • a sixth aspect of the embodiments of the present application provides a chip including a processor.
  • the processor is used to read and execute the computer program stored in the memory to execute the method in any possible implementation manner of any of the foregoing aspects.
  • Optionally, the chip further includes a memory, and the processor is connected to the memory through a circuit or a wire.
  • the chip further includes a communication interface, and the processor is connected to the communication interface.
  • the communication interface is used to receive data and/or information that needs to be processed, and the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface.
  • the communication interface can be an input and output interface.
  • In the embodiments of the present application, the position information of the target object in at least two video frames and the time information of the at least two video frames are obtained through a pre-trained neural network model, and the tracking trajectory of the target object in the first video is determined from this information. Since the position and time information of at least two video frames is output by the neural network model, target tracking does not depend on the detection result of a single video frame, which reduces detection failures in scenes with dense or heavily occluded targets and improves target tracking performance.
  • The motion pipeline of the target object is obtained through a pre-trained neural network model, and the tracking trajectory of the target object is obtained by connecting motion pipelines. Since a motion pipeline includes the position information of the target object in at least two video frames, target tracking does not depend on the detection result of a single video frame, which reduces detection failures in scenes with dense or heavily occluded targets and improves target tracking performance.
  • In the prior art, the detection algorithm relies on a single frame, so the accuracy of the overall algorithm is limited by the detector; training the detection model and the tracking model step by step incurs a high development cost; and dividing the algorithm into two stages also adds machine learning overhead.
  • In contrast, the target tracking method provided in the embodiments of the present application supports end-to-end training and completes the detection and tracking of multiple target objects through one neural network model, which reduces model complexity.
  • The prior art extracts relatively limited features from a single video frame. The target tracking method provided in the embodiments of this application uses the video itself as the raw input, so the model can perform the tracking task using a variety of features, such as appearance features, motion trajectory features, or gait features, which improves target tracking performance.
  • Because the video is the raw input of the model, the receptive field in the time dimension is enlarged, which better captures the movement information of the target.
  • FIG. 1 is a schematic diagram of an artificial intelligence main body framework provided by an embodiment of this application.
  • FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of a convolutional neural network structure provided by an embodiment of this application.
  • FIG. 4 is a schematic diagram of another convolutional neural network structure provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of an embodiment of a motion pipeline in an embodiment of the application.
  • FIG. 6 is a schematic diagram of splitting a tracking trajectory into motion pipelines in an embodiment of the application.
  • FIG. 7 is a schematic diagram of an embodiment of a motion pipeline in an embodiment of the application.
  • FIG. 8 is a schematic diagram of another embodiment of a motion pipeline in an embodiment of the application.
  • FIG. 9 is a schematic diagram of the intersection and union of motion pipelines in an embodiment of the application.
  • FIG. 10 is a schematic diagram of an embodiment of a target detection method in an embodiment of the application.
  • FIG. 11 is a schematic diagram of an embodiment of matching between motion pipelines in an embodiment of the application.
  • FIG. 12 is a schematic diagram of an embodiment of a neural network model training method in an embodiment of the application.
  • FIG. 13 is a schematic diagram of a tracking trajectory and motion pipelines in an embodiment of this application.
  • FIG. 14 is a schematic diagram of a motion pipeline output by a neural network model in an embodiment of the application.
  • FIG. 15 is a schematic diagram of another embodiment of a target tracking method in an embodiment of the application.
  • FIG. 16 is a schematic diagram of an embodiment of a target tracking device in an embodiment of the application.
  • FIG. 17 is a schematic diagram of another embodiment of the target tracking device in the embodiment of this application.
  • FIG. 18 is a schematic diagram of another embodiment of the target tracking device in the embodiment of this application.
  • FIG. 19 is a schematic diagram of another embodiment of an electronic device in an embodiment of the application.
  • FIG. 20 is a hardware structure diagram of a chip provided by an embodiment of the application.
  • An embodiment of the present application provides a target tracking method for tracking targets in a video, which can reduce tracking errors in scenes with dense or heavily occluded targets.
  • A moving target in a video refers to a target that moves relative to the video capture device during shooting, taking the world coordinate system of the actual three-dimensional space as a reference; the target itself may or may not be moving, which is not specifically limited here.
  • the image information of the target object may be directly recorded in the video frame, or part of the image frame may be blocked by other objects.
  • Data presented in this form is defined as data in the space-time dimension in the embodiments of the present application. In the space-time dimension, the position of the target in a video frame can be determined by a position in the time dimension and a position in the two-dimensional space dimension: the position in the time dimension determines the video frame, and the position in the two-dimensional space dimension indicates the location information of the target within that video frame.
  • FIG. 5 is a schematic diagram of an embodiment of the motion pipe in the embodiment of the application.
  • Target tracking needs to determine the position information of the target to be tracked (or target for short) in all video frames containing the target object.
  • the target position in each video frame can be identified by a detection box (Bounding-Box).
  • Connecting the detection boxes of the same target object across the video frames forms the trajectory of the target in the space-time region, that is, the tracking trajectory (also called the motion trajectory). The tracking trajectory not only gives the position of the target object but also connects the positions of the target object at different times; therefore, the tracking trajectory can indicate the temporal and spatial information of the target object at the same time.
  • FIG. 5 only illustrates the position information of the target object in three video frames; the tracking trajectory over all video frames of the video can be obtained according to the above-mentioned method.
  • the tracking trajectory also includes the identification (ID) of the target object indicated by the tracking trajectory, and the ID of the target object can be used to distinguish trajectories corresponding to different targets.
  • The motion pipeline is used to indicate the position information of the target in at least two video frames and corresponds to a quadrangular prism in the space-time dimension: the position of the first bottom surface of the quadrangular prism in the time dimension indicates the first time information of the first video frame, the position of the second bottom surface of the quadrangular prism in the time dimension indicates the second time information of the second video frame, the position of the first bottom surface in the two-dimensional space indicates the first position information of the target object in the first video frame, and the position of the second bottom surface in the two-dimensional space indicates the second position information of the target object in the second video frame.
  • The motion pipeline can also be used to indicate the position information of the target in at least three different video frames. The following takes a motion pipeline that includes the position information of the target in three different video frames as an example.
  • Such a motion pipeline can be regarded as a double quadrangular prism structure composed of two quadrangular prisms with a common bottom surface. The three bottom surfaces of the double quadrangular prism structure are parallel to each other; the direction perpendicular to the bottom surfaces is the time dimension, the extension directions of the bottom surfaces are the spatial dimensions, and each bottom surface represents the position of the target in the video frame at the moment corresponding to that bottom surface.
  • A motion pipeline with a double quadrangular prism structure includes a first bottom surface 601, a second bottom surface 602, and a third bottom surface 603. The first bottom surface 601, namely rectangle abcd: its position in the two-dimensional space represents the position information of the target object in the first video frame, and the position of rectangle abcd mapped to the time dimension represents the time information of the first video frame. Similarly, the second bottom surface 602 is rectangle ijkm: its position in the two-dimensional space represents the position information of the target object in the second video frame, and the position of rectangle ijkm mapped to the time dimension represents the time information of the second video frame. The third bottom surface 603 is rectangle efgh: its position in the two-dimensional space represents the position information of the target object in the third video frame, and the position of rectangle efgh mapped to the time dimension represents the time information of the third video frame.
  • When rectangles abcd, efgh, and ijkm are mapped into the same two-dimensional plane, the corresponding locations may be different.
  • The positions of the first bottom surface 601, the second bottom surface 602, and the third bottom surface 603 in the time dimension, that is, the positions a', i', and e' to which points a, i, and e map in the time dimension, respectively indicate the time information of the first video frame, the second video frame, and the third video frame.
  • The length of the motion pipeline is the interval between the position of the second bottom surface mapped to the time dimension and the position of the third bottom surface mapped to the time dimension; it indicates the number of video frames, in the time sequence of the video, lying between the second bottom surface and the third bottom surface.
  • the motion pipeline corresponding to the first video frame includes at least the position information of the target in the first video frame.
  • A tracking trajectory can be split into multiple motion pipelines, as shown in FIG. 6. The tracking trajectory is first split into the position boxes of the individual video frames, and each position box serves as the common bottom surface of a double quadrangular prism structure, like the first bottom surface 601 in FIG. 6; extending forward and backward along the tracking trajectory determines the other two bottom surfaces of the double quadrangular prism structure, namely the second bottom surface 602 and the third bottom surface 603. This yields a double quadrangular prism structure with a common bottom surface, that is, the motion pipeline corresponding to that single video frame.
  • For the first video frame of the trajectory the forward extension is 0, and for the last video frame the backward extension is 0, so the motion pipelines corresponding to the first and last video frames degenerate into a single quadrangular prism structure.
  • The length of a motion pipeline is defined as the number of video frames corresponding to the motion pipeline; as shown in FIG. 6, the total number of video frames between the video frame corresponding to the second bottom surface 602 and the video frame corresponding to the third bottom surface 603 is the length of the motion pipeline. A sketch of this splitting procedure follows.
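  • A minimal sketch of splitting a tracking trajectory into motion pipelines. A trajectory is taken here as a list of (frame_index, box) pairs, and the half-extension d (how far the pipeline extends toward earlier and later frames) is an assumed parameter; the summary only gives example pipeline lengths such as 4, 6, or 8.

```python
def split_trajectory(trajectory, d=3):
    """trajectory: list of (frame_index, box) pairs in temporal order."""
    tubes = []
    n = len(trajectory)
    for m in range(n):
        t_m, box_m = trajectory[m]                 # common bottom surface
        s = max(0, m - d)                          # extension toward earlier frames (0 at the start)
        e = min(n - 1, m + d)                      # extension toward later frames (0 at the end)
        t_s, box_s = trajectory[s]
        t_e, box_e = trajectory[e]
        tubes.append({"t": (t_s, t_m, t_e), "boxes": (box_s, box_m, box_e)})
    return tubes
```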
  • The motion pipeline in the embodiments of this application is represented by a specific data format. Please refer to FIG. 7 and FIG. 8, which are two schematic diagrams of the data format of the motion pipeline in the embodiments of this application.
  • The first data format includes 3 values in the time dimension (t_s, t_m, t_e) and 12 values in the space dimension, 15 values in total. The location of the target in space in each video frame is determined by 4 values; for example, the target location area B_s is determined by 4 values, and likewise for B_m and B_e.
  • The motion pipeline output by the neural network model can also be represented by another data format. For the motion pipeline of video frame m, B_m is the detection box of the target in the common bottom surface, that is, the partial image region at the time of that video frame, and P is any pixel in the region B_m. In the time dimension, two values, d_s and d_e, determine the lengths by which the motion pipeline extends forward and backward, respectively. The four values l_m, b_m, t_m, and r_m indicate the offsets of the boundaries of region B_m relative to point P (regression values for B_m), with P as the reference point. The four values l_s, b_s, t_s, and r_s indicate the offsets of the boundaries of region B_s relative to the boundaries of region B_m (regression values for B_s); similarly, the four values l_e, b_e, t_e, and r_e indicate the offsets of the boundaries of region B_e relative to the boundaries of region B_m (regression values for B_e).
  • Both data formats can represent a single motion pipeline with 15 values, and the two data formats can be converted into each other.
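  • A minimal sketch of the two motion pipeline data formats described above and a conversion between them. The (x1, y1, x2, y2) box layout and the sign conventions of the offsets are assumptions for illustration; the patent figures (FIG. 7 and FIG. 8) define the exact layout.

```python
from dataclasses import dataclass

@dataclass
class TubeFormat1:
    t_s: int            # time of the earlier bottom surface
    t_m: int            # time of the common bottom surface
    t_e: int            # time of the later bottom surface
    B_s: tuple          # (x1, y1, x2, y2) box at t_s
    B_m: tuple          # (x1, y1, x2, y2) box at t_m
    B_e: tuple          # (x1, y1, x2, y2) box at t_e

@dataclass
class TubeFormat2:
    m: int              # index of the video frame of the common bottom surface
    p: tuple            # reference pixel P inside B_m, as (px, py)
    d_s: int            # forward extension length
    d_e: int            # backward extension length
    reg_m: tuple        # (l_m, b_m, t_m, r_m): B_m boundaries relative to P
    reg_s: tuple        # (l_s, b_s, t_s, r_s): B_s boundaries relative to B_m
    reg_e: tuple        # (l_e, b_e, t_e, r_e): B_e boundaries relative to B_m

def format2_to_format1(f2: TubeFormat2) -> TubeFormat1:
    px, py = f2.p
    l, b, t, r = f2.reg_m
    B_m = (px - l, py - t, px + r, py + b)   # box spanned around P (assumed signs)

    def shifted(reg):                         # offset B_m's boundaries (assumed signs)
        dl, db, dt, dr = reg
        x1, y1, x2, y2 = B_m
        return (x1 + dl, y1 + dt, x2 + dr, y2 + db)

    return TubeFormat1(t_s=f2.m - f2.d_s, t_m=f2.m, t_e=f2.m + f2.d_e,
                       B_s=shifted(f2.reg_s), B_m=B_m, B_e=shifted(f2.reg_e))
```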
  • In the field of object detection, the intersection-over-union (IoU) is usually used to measure the degree of overlap between two locations. In the embodiments of this application, IoU is extended to the three-dimensional space of the space-time dimension to measure the degree of overlap of two motion pipelines in the space-time dimension, as shown in the schematic diagram of FIG. 9.
  • IoU(T^(1), T^(2)) = ∩(T^(1), T^(2)) / ∪(T^(1), T^(2))
  • where T^(1) denotes motion pipeline 1, T^(2) denotes motion pipeline 2, ∩(T^(1), T^(2)) denotes the intersection of the two motion pipelines, and ∪(T^(1), T^(2)) denotes the union of the two motion pipelines.
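  • A minimal sketch of the space-time IoU between two motion pipelines, approximating each pipeline by its per-frame boxes; treating the pipeline volume as the sum of per-frame box areas is an interpretation for illustration, while the patent figures define the quantity geometrically.

```python
def box_inter_area(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def box_area(a):
    return max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])

def tube_iou_3d(tube1, tube2):
    """tube1, tube2: dicts mapping frame index to a box (x1, y1, x2, y2)."""
    common = set(tube1) & set(tube2)
    inter = sum(box_inter_area(tube1[f], tube2[f]) for f in common)
    union = sum(box_area(b) for b in tube1.values()) + \
            sum(box_area(b) for b in tube2.values()) - inter
    return inter / union if union > 0 else 0.0
```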
  • FIG. 1 shows a schematic diagram of an artificial intelligence main frame, which describes the overall workflow of an artificial intelligence system and is suitable for general artificial intelligence field requirements.
  • The "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output; in this process, the data undergo a condensing process of "data - information - knowledge - wisdom".
  • The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technical realizations) to the industrial ecology of the system.
  • The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through a basic platform. The basic platform includes distributed computing frameworks and network-related platform guarantees and support, which may include cloud storage and computing, interconnection networks, and the like. Sensors communicate with the outside to obtain data, and these data are provided to intelligent chips in the distributed computing system provided by the basic platform for computation.
  • Data at the layer above the infrastructure represent the data sources in the field of artificial intelligence. The data involve graphics, images, speech, and text, as well as IoT data from traditional devices, including business data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, training, etc.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formal information to conduct machine thinking and solving problems based on reasoning control strategies.
  • the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, and usually provides functions such as classification, ranking, and prediction.
  • After the data processing described above, some general capabilities can be formed based on the results, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productizing intelligent information decision-making and realizing deployed applications. Application fields mainly include intelligent manufacturing, intelligent transportation, smart home, smart healthcare, smart security, autonomous driving, safe city, smart terminals, and so on.
  • the motion pipeline of the target object is obtained through a deep neural network.
  • The embodiment of this application provides a system architecture 200.
  • the data collection device 260 is used to collect the video data of the moving target and store it in the database 230.
  • the training device 220 generates a target model/rule 201 based on the video samples containing the moving target maintained in the database 230.
  • the following will describe in more detail how the training device 220 obtains the target model/rule 201 based on the video samples of the moving target.
  • the target model/rule 201 can be used in application scenarios such as single target tracking, multiple target tracking, and virtual reality.
  • training may be performed based on video samples of the moving target.
  • various video samples containing the moving target may be collected by the data collection device 260 and stored in the database 230.
  • video data can be obtained directly from commonly used databases.
  • the target model/rule 201 may be obtained based on a deep neural network, and the deep neural network will be introduced below.
  • The work of each layer in the deep neural network can be described by the mathematical expression y = a(Wx + b). From the physical level, the work of each layer in the deep neural network can be understood as a transformation from the input space to the output space (that is, from the row space of the matrix to the column space) through five operations: 1. dimension raising/lowering; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by Wx, operation 4 is completed by +b, and operation 5 is realized by a(). The word "space" is used here because the object to be classified is not a single thing but a class of things, and space refers to the collection of all individuals of this class of things.
  • W is a weight vector, and each value in the vector represents the weight value of a neuron in the layer of neural network.
  • This vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed.
  • the purpose of training a deep neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by the vector W of many layers). Therefore, the training process of the neural network is essentially the way of learning the control space transformation, and more specifically the learning weight matrix.
  • During training, the weight vector of the network is continuously adjusted (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to make the prediction lower, and the adjustment continues until the neural network can predict the target value that is really wanted. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value".
  • This is the loss function (loss function) or objective function (objective function), an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference; the training of the deep neural network then becomes a process of reducing this loss as much as possible.
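  • As a toy illustration of the above (not part of the original application): a single layer computing y = a(Wx + b) and a loss measuring the gap between the prediction and the target value can be sketched in Python as follows; the choice of tanh as a() and of a squared-error loss are arbitrary assumptions made only for illustration.

    import numpy as np

    def layer(x, W, b):
        # Wx performs the scaling/rotation/dimension changes, +b performs the translation,
        # and a() -- here tanh -- performs the "bending"
        return np.tanh(W @ x + b)

    def loss(pred, target):
        # one possible way to measure "the difference between the predicted value and the target value"
        return np.mean((pred - target) ** 2)

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)            # input vector
    W = rng.normal(size=(3, 4))       # weight matrix of this layer
    b = np.zeros(3)                   # bias vector
    y = layer(x, W, b)
    print(loss(y, np.zeros(3)))       # training adjusts W and b until this value is small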
  • the target model/rule obtained by the training device 220 can be applied to different systems or devices.
  • the execution device 210 is configured with an I/O interface 212 to perform data interaction with external devices.
  • the "user" can input data to the I/O interface 212 through the client device 240.
  • the execution device 210 can call data, codes, etc. in the data storage system 250, and can also store data, instructions, etc. in the data storage system 250.
  • the calculation module 211 uses the target model/rule 201 to process the input data. Taking target tracking as an example, the calculation module 211 can analyze the input video to obtain features indicating target location information in the video frame.
  • the correlation function module 213 may preprocess the image data in the calculation module 211, for example, perform video preprocessing, including video segmentation.
  • the correlation function module 214 may preprocess the image data in the calculation module 211, for example, perform video preprocessing, including video segmentation.
  • the I/O interface 212 returns the processing result to the client device 240 and provides it to the user.
  • the training device 220 can generate corresponding target models/rules 201 based on different data for different targets, so as to provide users with better results.
  • The user can manually specify the data input to the execution device 210, for example, by operating in the interface provided by the I/O interface 212.
  • the client device 240 can automatically input data to the I/O interface 212 and obtain the result. If the client device 240 automatically inputs data and needs the user's authorization, the user can set the corresponding authority in the client device 240.
  • the user can view the result output by the execution device 210 on the client device 240, and the specific presentation form may be a specific manner such as display, sound, and action.
  • the client device 240 can also serve as a data collection terminal to store the collected training data in the database 230.
  • Fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 250 is an external memory relative to the execution device 210. In other cases, the data storage system 250 may also be placed in the execution device 210.
  • the deep neural network used to extract the motion pipeline from the video in the embodiment of the application may be a convolutional neural network (convolutional neural network, CNN), for example.
  • CNN convolutional neural network
  • CNN is a deep neural network with a convolutional structure. It is a deep learning architecture.
  • the deep learning architecture refers to the use of machine learning algorithms to perform multiple levels of learning at different levels of abstraction.
  • CNN is a feed-forward artificial neural network. Taking image processing as an example, each neuron in the feed-forward artificial neural network responds to overlapping areas in the image input to it.
  • it can also be of other types, and this application does not limit the type of deep neural network.
  • a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120, where the pooling layer is optional, and a neural network layer 130.
  • the convolutional layer/pooling layer 120 may include layers 121-126 as shown in the example.
  • In one example, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, and layer 124 is a pooling layer.
  • In another example, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a convolutional layer.
  • The output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolutional layer 121 can include many convolution operators.
  • the convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator can be a weight matrix, and this weight matrix is usually predefined. In the process of performing convolution on the image, the weight matrix usually moves over the input image one pixel at a time in the horizontal direction (or two pixels at a time, depending on the value of the stride), so as to complete the work of extracting specific features from the image.
  • the convolution kernel also has multiple formats.
  • Commonly used convolution kernels include two-dimensional convolution kernels and three-dimensional convolution kernels. Two-dimensional convolution kernels are mainly used to process two-dimensional image data, while three-dimensional convolution kernels can be applied to video processing, stereoscopic image processing, etc. due to the increased depth or time dimension.
  • In order to extract information in both the time dimension and the space dimension of a video through the neural network model, the three-dimensional convolution kernel is used to perform the convolution operation in the time dimension and the space dimension at the same time; a three-dimensional convolutional neural network is thus composed of three-dimensional convolution kernels.
  • the three-dimensional convolutional neural network can not only obtain the characteristics of each video frame, but also express the association and change of the video frame over time.
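  • As an illustration of three-dimensional convolution over a video (a hedged sketch, not taken from the application; the tensor sizes and layer widths are arbitrary assumptions), a PyTorch 3D convolution processes the time and space dimensions at the same time:

    import torch
    import torch.nn as nn

    # A video batch: (batch, channels, frames, height, width); sizes are illustrative.
    video = torch.randn(1, 3, 8, 112, 112)

    # A 3D kernel convolves over the time dimension and the two space dimensions at once.
    conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
    features = conv3d(video)
    print(features.shape)  # torch.Size([1, 16, 8, 112, 112]): per-frame features that also mix adjacent frames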
  • The initial convolutional layers (such as 121) often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network increases, the features extracted by the subsequent convolutional layers (for example, 126) become more and more complex, such as high-level semantic features, and features with higher-level semantics are more suitable for the problem to be solved.
  • multiple convolutional layers can be referred to as a block.
  • A pooling layer often follows a convolutional layer; that is, in the layers 121-126 illustrated by 120 in Figure 3, one convolutional layer can be followed by one pooling layer, or multiple convolutional layers can be followed by one or more pooling layers.
  • the sole purpose of the pooling layer is to reduce the spatial size of the image.
  • the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 3) and an output layer 140.
  • The parameters contained in the multiple hidden layers can be obtained through pre-training based on relevant training data of specific task types.
  • the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the neural network layer 130 comes the output layer 140; that is, the final layer of the entire convolutional neural network 100 is the output layer 140.
  • the convolutional neural network 100 shown in FIG. 3 is only used as an example of a convolutional neural network.
  • The convolutional neural network may also exist in the form of other network models; for example, the multiple convolutional layers/pooling layers shown in FIG. 4 are arranged in parallel, and the features extracted by each of them are all input to the neural network layer 130 for processing.
  • the deep neural network used to extract the motion pipeline from the video in the embodiment of the present application is a combination of a residual neural network and a feature pyramid network.
  • the residual neural network makes the deeper network easier to train by letting the deep network learn the residual representation.
  • Residual learning solves the problems of gradient disappearance and gradient explosion in deep networks.
  • The feature pyramid network detects targets of corresponding scales on feature maps of different resolutions. The output of each layer is obtained by fusing the feature maps of the current layer and higher layers, so the feature map output by each layer has sufficient feature expression ability.
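  • The following is a minimal sketch of a residual block with an identity shortcut, illustrating the residual learning idea mentioned above; it is a toy example under assumed layer sizes, not the network actually used in the embodiments.

    import torch
    import torch.nn as nn

    class ResidualBlock3D(nn.Module):
        # Toy residual block: the identity shortcut lets the stacked layers learn a residual,
        # which is what makes deeper networks easier to train.
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            out = self.relu(self.conv1(x))
            out = self.conv2(out)
            return self.relu(out + x)  # output = F(x) + x

    block = ResidualBlock3D(16)
    print(block(torch.randn(1, 16, 4, 32, 32)).shape)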
  • The target tracking method provided by the embodiments of the application can be applied to a wide range of target tracking technologies, such as auto-focus during video shooting.
  • The target tracking algorithm can help the photographer select the focus more conveniently and accurately, or flexibly switch the focus to track the target, which is especially important in sports events and wildlife shooting.
  • the multi-target tracking algorithm can automatically complete the position tracking of the selected target object to facilitate the search for the established target, which is of great significance in the field of security.
  • The multi-target tracking algorithm can keep track of the trajectories and trends of surrounding pedestrians and vehicles, and provide initial information for functions such as automatic driving path planning and automatic obstacle avoidance.
  • somatosensory games, gesture recognition, and finger tracking can also be achieved through multi-target tracking technology.
  • the usual target tracking method includes detection and tracking.
  • The detection module detects the targets appearing in each video frame, and then the targets appearing in the video frames are matched: the features of each target object in a single video frame are extracted, the target objects are matched by comparing the similarity of these features, and the tracking trajectory of each target object is obtained. Because this type of target tracking method uses the technical means of first detection and then tracking, the target tracking effect depends on the detection algorithm for a single frame. If the target is occluded, detection errors will occur, which will lead to tracking errors. Therefore, the performance is insufficient in scenes where the targets are dense or heavily occluded.
  • The embodiment of the application adopts a target tracking method, which inputs a video into a pre-trained neural network model, outputs multiple motion pipelines, and restores the tracking trajectories corresponding to one or more target objects by matching the multiple motion pipelines.
  • Target tracking therefore does not depend on the target detection result of a single video frame, which can reduce detection failures in scenes with dense targets or heavy occlusion and improve target tracking performance.
  • Conventional target tracking methods rely on single-frame detection algorithms, so the accuracy of the overall algorithm is limited by the detector, and the development cost of training the detection model and the tracking model step by step is high. At the same time, dividing the algorithm into two stages also increases the complexity of the machine learning system.
  • the target tracking method provided in the embodiments of the present application can realize end-to-end training, and complete the detection and tracking tasks of multi-target objects through a neural network model, which can reduce the complexity of the model.
  • In the prior art, the features extracted based on a single video frame are relatively limited.
  • The target tracking method provided in the embodiments of the present application uses video as the original input, and the model can complete the tracking task by using various features such as appearance features, motion trajectory features, or gait features, which can improve target tracking performance.
  • the target tracking method provided by the embodiment of the present application uses video as the original input of the model, and the time dimension receptive field is increased, which can better capture the movement information of the character.
  • FIG. 10 is a schematic diagram of an embodiment of the target detection method in the embodiment of the present application.
  • the target tracking device can preprocess the acquired video.
  • the preprocessing includes one or more of the following: dividing the video into segments of preset length, adjusting the video resolution, and adjusting and normalizing the color space .
  • When the length of the video is long, considering the data processing capability of the target tracking device, the video may be divided into 8 small segments.
  • step 1001 is an optional step and may or may not be executed.
  • the video is input to the pre-trained neural network model, and the position information of the target object in the at least two video frames and the time information of the at least two video frames are obtained.
  • the video is input to a pre-trained neural network model to obtain the motion pipeline of each target object.
  • the motion pipeline is used to indicate the time information and position information of the target object in at least two video frames of the first video.
  • the data format of the output motion pipeline is the type shown in Figure 8.
  • For the input video in the format R^(t×h×w×3), 3 represents the RGB color channels.
  • The output is the motion pipeline tensor O, O ∈ R^(t×h'×w'×15), where R represents the real number domain and t represents the number of frames of the video.
  • h'×w' represents the resolution of the feature map output by the neural network; that is, t×h'×w' motion pipelines are output, and each video frame corresponds to h'×w' motion pipelines.
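  • The bookkeeping implied by this output format can be sketched as follows (illustrative only; the internal layout of the 15 values per motion pipeline is not spelled out here, so they are treated as an opaque descriptor, and the feature-map size is an assumed example).

    import numpy as np

    t, h_, w_ = 8, 14, 14                  # assumed feature-map size
    O = np.random.rand(t, h_, w_, 15)      # stand-in for the network output O

    pipelines = O.reshape(-1, 15)          # t*h'*w' motion pipelines, one 15-dimensional descriptor each
    print(pipelines.shape)                 # (1568, 15)
    print(O[0].reshape(-1, 15).shape)      # each video frame corresponds to h'*w' pipelines: (196, 15)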
  • the pre-trained neural network model is used to obtain the category information of the target object; specifically, the pre-trained neural network model is used to obtain the confidence level of the motion pipeline, which can be used to determine The category information of the target object corresponding to the motion pipeline.
  • each motion pipeline corresponds to a target to be tracked
  • the confidence of the motion pipeline refers to the possibility that the target corresponding to each motion pipeline belongs to the preset category.
  • The category of the target object to be tracked in the video is, for example, a person, a vehicle, or a dog.
  • The confidence of the output motion pipeline represents the probability that the target corresponding to the motion pipeline belongs to the preset category, and the confidence is a value between 0 and 1. The smaller the confidence, the less likely the target is to belong to the preset category; the larger the confidence, the more likely it is to belong to the preset category.
  • the number of confidence levels of each motion channel is equal to the number of preset target object categories, and each confidence level corresponds to the possibility that the motion channel belongs to the category.
  • the confidence of the motion pipeline output by the neural network model constitutes the confidence table.
  • Example 1: the preset category of the target object is "person", so the category of a target can be "person" or "background".
  • The background refers to the image area that does not contain the target object to be tracked.
  • The confidence levels of the target object categories corresponding to the first motion pipeline are 0.1 ("person") and 0.9 ("background"), respectively, and the confidence levels of the second motion pipeline are 0.7 and 0.3. Since there is only one preset category, there are two possibilities for the target object category: "person" or "background".
  • The confidence threshold can be set to 0.5. For the first motion pipeline, the "person" confidence of 0.1 is less than or equal to 0.5, which means that the target object corresponding to this motion pipeline has a low probability of being a person, while the "background" confidence of 0.9 is greater than 0.5, that is, the possibility of belonging to the background is higher. For the second motion pipeline, the confidence that the target object belongs to the category "person" is 0.7, which is greater than 0.5, meaning that the target corresponding to this motion pipeline has a higher probability of being a person, while the "background" confidence of 0.3 is less than 0.5, so it is less likely to belong to the background.
  • Example 2: the preset categories of the target object are "person", "vehicle" and "background". The confidence levels of the first motion pipeline are 0.4, 0.1 and 0.2, and the confidence levels of the second motion pipeline are 0.2, 0.8 and 0.1. There are three possibilities for the category of the target object: "person", "vehicle" or "background". 1/3 ≈ 0.33 can be used as the confidence threshold. Since 0.4 is greater than 0.33, the category with the highest confidence for the first motion pipeline is "person", that is, the category of the corresponding target object has a higher probability of being a person. Similarly, the category with the highest confidence for the second motion pipeline is "vehicle", that is, the category of the corresponding target object has a higher probability of being a vehicle.
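  • A small sketch of reading a category off such a confidence table with a threshold (illustrative only; the numbers are taken from Example 2 above):

    def pick_category(confidences, categories, threshold):
        # Return the category with the highest confidence if it exceeds the threshold,
        # otherwise treat the motion pipeline as background.
        best = max(range(len(confidences)), key=lambda i: confidences[i])
        return categories[best] if confidences[best] > threshold else "background"

    categories = ["person", "vehicle", "background"]
    print(pick_category([0.4, 0.1, 0.2], categories, 1 / 3))  # person
    print(pick_category([0.2, 0.8, 0.1], categories, 1 / 3))  # vehicle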
  • Before acquiring the tracking trajectory of the target object according to the motion pipelines, the motion pipelines can also be deleted to obtain the deleted motion pipelines, and the deleted motion pipelines are used to obtain the tracking trajectory of the target object.
  • the multiple motion pipelines output by the neural network model can be deleted according to preset conditions.
  • Each pixel position in each video frame corresponds to a motion pipeline, and a target appearing in a video frame usually occupies multiple pixel positions, so multiple motion pipelines may be used to indicate the same target object.
  • the category to which the target corresponding to each motion channel belongs can be determined according to the confidence level, and the motion channels of each category are respectively deleted.
  • Obtaining the deleted motion pipelines specifically includes: if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, deleting the one of the first motion pipeline and the second motion pipeline with the lower confidence level.
  • the repetition rate of the motion pipeline may be the IoU between the two motion pipelines.
  • the first threshold value ranges from 0.3 to 0.7.
  • For example, the first threshold value is 0.5. If the IoU between the first movement pipeline and the second movement pipeline is greater than or equal to 50%, the movement pipeline with the lower confidence level is deleted.
  • the motion pipeline is deleted according to a non-maximum suppression (NMS) algorithm, the deleted motion pipeline is obtained, and the IoU threshold of the motion pipeline is set to 0.5, and the NMS algorithm can be used to The motion pipeline is deleted, and only one corresponding motion pipeline is reserved for each target in each video frame.
  • NMS non-maximum suppression
  • Each pixel position in each video frame corresponds to a motion pipeline, and the pixel positions in the background areas of a video frame that do not correspond to any target object also correspond to some motion pipelines.
  • This part of the motion pipelines can be understood as fake motion pipelines, and their confidence is usually low.
  • Therefore, motion pipelines can also be deleted according to their confidence.
  • The confidence of any one of the motion pipelines retained after the deletion is greater than or equal to the second threshold; that is, the preset condition for deletion is that the confidence is less than or equal to the second threshold, and the second threshold is related to the preset categories of the target object.
  • When there is a single preset category of the target object, the second threshold is usually between 0.3 and 0.7, for example, 0.5; if the number of categories of the target object is 10, the second threshold is usually between 0.07 and 0.13, for example, 0.1.
  • step 1003 is an optional step, which may or may not be performed.
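  • The pruning described in this optional step can be sketched roughly as follows (assumptions: each motion pipeline is represented as a dict with a "confidence" field, and pipeline_iou() is a user-supplied function computing the IoU between two pipelines over their space-time volumes; this is a simplified stand-in for the NMS procedure, not the exact implementation).

    def prune_pipelines(pipelines, pipeline_iou, conf_threshold=0.5, iou_threshold=0.5):
        # 1. drop pipelines whose confidence is below the (second) threshold
        candidates = [p for p in pipelines if p["confidence"] >= conf_threshold]
        # 2. NMS-style greedy suppression: keep the most confident pipeline and drop
        #    lower-confidence pipelines that overlap it by more than the (first) threshold
        candidates.sort(key=lambda p: p["confidence"], reverse=True)
        kept = []
        for p in candidates:
            if all(pipeline_iou(p, q) < iou_threshold for q in kept):
                kept.append(p)
        return kept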
  • Acquire the tracking trajectory of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames.
  • A motion pipeline is used to indicate the position information of the target object in the at least two video frames and the time information of the at least two video frames. Therefore, the tracking trajectory of the target object in the first video can be obtained according to the motion pipelines; that is, the tracking trajectory is obtained based on the position information of the target object in the at least two video frames and the time information of the at least two video frames.
  • The tracking trajectory of the target object is formed by connecting, in the space-time dimension, the quadrangular prisms corresponding to the motion pipelines.
  • Obtaining the tracking trajectory of the target object according to the motion pipeline specifically includes: connecting a third motion pipeline and a fourth motion pipeline that meet a preset condition in the motion pipeline to obtain the tracking trajectory of the target object.
  • the specific content of the preset condition includes multiple types.
  • The preset condition includes one or more of the following: the intersection-over-union between the third motion pipeline and the fourth motion pipeline over the sections where they overlap in the time dimension is greater than or equal to the third threshold; the cosine of the angle between the movement direction of the third motion pipeline and the movement direction of the fourth motion pipeline is greater than or equal to the fourth threshold, where the movement direction is a vector that indicates, according to preset rules, the position change of the target object of the motion pipeline in the space-time dimension; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to the fifth threshold, where the distance includes the Euclidean distance.
  • the intersection ratio between the two motion pipelines corresponding to the overlapping parts of the time dimension is greater than or equal to the third threshold
  • the cosine of the angle between the motion directions of the motion pipelines is greater than or equal to the fourth threshold
  • the distance index between the neural network feature vectors of the motion pipeline is less than or equal to the fifth threshold
  • the distance index may be, for example, Euclidean distance.
  • the neural network feature vector of the motion pipeline can be the output feature vector of any layer in the neural network model.
  • The neural network feature vector of the motion pipeline is the output feature vector of the last layer of the three-dimensional (3D) convolutional neural network in the neural network model.
  • The movement direction of a motion pipeline is a vector indicating the position change of the target object between two bottom surfaces of the motion pipeline in the space-time dimension, which indicates the moving speed and direction of the target object. It can be understood that the position change of the target object in the video is usually continuous and without sudden change; therefore, the movement directions of adjacent motion pipeline sections in a tracking trajectory are relatively close. When connecting the motion pipelines, the connection can also be made according to the similarity of the movement directions of the motion pipelines. It should be noted that the movement direction of a motion pipeline can be determined according to preset rules.
  • For example, the movement direction of the motion pipeline can be set as the vector of the position change of the target object between the two bottom surfaces of the motion pipeline that are farthest apart in the time dimension (for example, Bs and Be of the motion pipeline shown in Figure 8); or as the vector of the position change of the target object between two adjacent bottom surfaces of the motion pipeline (for example, Bm and Be of the motion pipeline shown in Figure 8); or the position change direction of the target object between a preset number of video frames can be set as the movement direction of the motion pipeline, where the preset number is, for example, 5 frames.
  • The direction of the tracking trajectory can be defined as, at the end of the trajectory, the direction of the position change of the target object between a preset number of video frames, or the movement direction of the last motion pipeline at the end of the trajectory. It is understandable that the movement direction of a motion pipeline is generally defined as the direction from a certain moment to a later moment in the time dimension.
  • the value of the third threshold is not limited, usually 70% to 95%, such as 75%, 80%, 85% or 90%, etc.
  • The value of the fourth threshold is not limited, and is usually cos(π/6) to cos(π/36), for example, cos(π/9), cos(π/12), or cos(π/18).
  • the value of the fifth threshold can be determined according to the size of the feature vector, and the specific value is not limited.
  • The following takes as an example the preset conditions that the intersection-over-union between the sections of the two motion pipelines that overlap in the time dimension is greater than or equal to the third threshold, and that the cosine of the angle between the movement directions of the motion pipelines is greater than or equal to the fourth threshold.
  • Please refer to FIG. 11 for a schematic diagram of an embodiment of the matching between motion pipelines in the embodiment of the application.
  • Example 1: as shown in part a of Fig. 11, if the intersection-over-union between the motion pipeline sections corresponding to the overlapping part of the two motion pipelines in the time dimension is greater than or equal to the third threshold, and the cosine of the angle between the movement directions of the two motion pipelines is greater than or equal to the fourth threshold, that is, both the degree of coincidence and the movement direction match, then the two motion pipelines are matched successfully.
  • the degree of coincidence between two motion pipes refers to the IoU between the motion pipe sections of the overlapping portion of the two motion pipes in the time dimension.
  • Example 2 as shown in part b of Fig. 11, if the cosine value of the angle between the motion directions of the two motion pipes is less than the fourth threshold, that is, the motion directions do not match, the matching of the two motion pipes is unsuccessful.
  • Example 3: as shown in part c of Figure 11, if the intersection-over-union between the motion pipeline sections corresponding to the overlapping part of the two motion pipelines in the time dimension is less than the third threshold, that is, the degree of coincidence does not match, then the matching of the two motion pipelines is unsuccessful.
  • If the two motion pipelines to be matched have an overlapping part in the time dimension, there are two pieces of position information for the same target object in the video frames corresponding to the overlapping part. The position of the target object in those video frames can be determined by averaging, or the position given by a certain motion pipeline specified according to a preset rule shall prevail, for example, the motion pipeline whose common bottom surface corresponds to the time dimension coordinate of that video frame.
  • the greedy algorithm can be used in the matching process of connecting all the motion pipes of the video to connect through a series of local optimal choices; the Hungarian algorithm can also be used for global optimal matching.
  • Connecting motion pipelines according to the greedy algorithm specifically includes: calculating the affinity between the two sets of motion pipelines to be matched (the affinity is defined as IoU*cos(θ), where θ is the angle between the movement directions) to form the affinity matrix.
  • Based on the affinity matrix, matching motion pipeline pairs (Btube pairs) are cyclically selected starting from the maximum affinity until the matching is completed.
  • Connecting motion pipelines according to the Hungarian algorithm specifically includes: after obtaining the affinity matrix in the same way, the Hungarian algorithm is used to select the matching motion pipeline pairs.
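  • A rough sketch of the greedy connection using the affinity IoU*cos(θ) (tube_iou() and direction_cosine() are assumed helper functions; this is an illustration of the idea, not the application's exact procedure):

    import numpy as np

    def greedy_connect(tracks, tubes, tube_iou, direction_cosine, min_affinity=0.0):
        # affinity(track, tube) = IoU of the overlapping sections * cos(angle between movement directions)
        affinity = np.array([[tube_iou(tr, tb) * direction_cosine(tr, tb) for tb in tubes]
                             for tr in tracks], dtype=float)
        pairs = []
        while affinity.size and affinity.max() > min_affinity:
            i, j = np.unravel_index(np.argmax(affinity), affinity.shape)
            pairs.append((i, j))          # track i is extended by motion pipeline j
            affinity[i, :] = -np.inf      # each track and each pipeline is used at most once per round
            affinity[:, j] = -np.inf
        return pairs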
  • The motion pipelines starting from the i-th frame are sequentially connected with the tracking trajectory set, where i is a positive integer greater than 2 and less than t, and t is the total number of frames of the video. If the preset conditions are met, the matching succeeds and the tracking trajectory is updated according to the motion pipeline; if the matching is unsuccessful, the motion pipeline is newly added to the set of tracking trajectories as an initial tracking trajectory.
  • this embodiment adopts a greedy algorithm to sequentially connect the pipeline and the trajectory starting from the maximum affinity.
  • The motion pipelines starting from the first frame are the first group
  • the motion pipes starting from the second frame are the second group.
  • the motion pipes starting from the i-th frame are the i-th group.
  • the first group includes 10 motion pipes
  • the second group includes 8 motion pipes
  • the third group includes 13 motion pipes.
  • the second group is connected to the initial tracking trajectories. If the connection conditions are met, the tracking trajectories are updated. If the connection conditions are not met, the original initial tracking trajectories are retained.
  • The tracking trajectory set then includes 8 updated tracking trajectories, and the other two tracking trajectories remain unchanged.
  • If three motion pipelines are not used to update any tracking trajectory, these three motion pipelines can be used as new initial tracking trajectories; that is, three new tracking trajectories are added to the tracking trajectory set.
  • the target category to which the target corresponding to the motion pipeline belongs is determined according to the confidence table of the motion pipeline, and the motion pipelines of different target categories are respectively connected to obtain the tracking trajectory of the target object of each target category.
  • The spatial position of an occluded part of the target object can be obtained by interpolating the motion pipeline.
  • the tracking trajectory is processed as a bounding box superimposed on the original video and output to the display to complete the real-time tracking deployment and achieve target tracking.
  • the target tracking method provided in the embodiment of the present application designs a pre-trained neural network model, and the training method of the neural network model is introduced below.
  • FIG. 12 is a schematic diagram of an embodiment of a neural network model training method in an embodiment of the application.
  • Training preparations include building a training hardware environment, building a network model, and setting training parameters.
  • the video samples in the data set can also be processed to increase the diversity of data distribution and obtain better model generalization capabilities.
  • The processing of the video includes resolution scaling, whitening of the color space, random HSL (a color space or color representation method, where H is hue, S is saturation, and L is lightness) jitter of the video color, random horizontal flipping of video frames, and so on.
  • Set the training parameters, including batch size, learning rate, optimizer model, etc. For example, the batch size is 32, the learning rate starts from 10^(-3) and, when the loss is stable, is reduced by a factor of 5 for better convergence. After 25K training iterations, the network basically converges. In order to increase the generalization ability of the model, a second-order (L2) regularization loss of 10^(-5) is used, and the momentum coefficient is 0.9.
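  • As a concrete illustration of these hyperparameters (a hedged PyTorch sketch; the placeholder model below stands in for the real network, which is not defined here):

    import torch

    model = torch.nn.Conv3d(3, 16, kernel_size=3, padding=1)      # placeholder for the real network
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=1e-3,            # learning rate starts from 10^(-3)
                                momentum=0.9,       # momentum coefficient 0.9
                                weight_decay=1e-5)  # second-order (L2) regularization of 10^(-5)
    # when the loss is stable, reduce the learning rate by a factor of 5
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.2)
    # a batch size of 32 would be set in the DataLoader; training runs for roughly 25K iterations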
  • The tracking trajectory is split according to the preset pipeline length, that is, the interval between the three bottom surfaces in the double quadrangular pyramid structure is set.
  • For example, the interval between the common bottom surface and each of the other two bottom surfaces is 4, and the length of the motion pipeline is 8.
  • Optionally, the length of the motion pipeline in the time dimension is extended as much as possible, and the structure that is longest in the time dimension serves as the final expanded structure.
  • As shown in Figure 13, since the structure of the motion pipeline (Btube) is linear and the structure of the ground truth trajectory is non-linear, a long motion pipeline often cannot fit the motion trajectory well; that is, as the length increases, the IoU with the ground truth becomes lower (IoU below a threshold). The length of a motion pipeline with a larger IoU (above the threshold) is usually shorter.
  • the longest motion pipeline that meets the lowest IoU threshold is used as the split motion pipeline, which can better fit the original trajectory while expanding the time receptive field.
  • the overlapping part of the motion pipes can be used for connection matching between the motion pipes.
  • the tracking trajectories of all target objects in the video sample are split to obtain the true values of multiple motion pipelines.
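  • A simplified sketch of this splitting rule: starting from each split point, extend a linear pipeline as long as its interpolated boxes still overlap the annotated boxes by at least a minimum IoU (the box format, the box_iou() helper, and the thresholds are assumptions; adjacent pipelines share an end frame so that they overlap for later matching).

    def split_trajectory(boxes, box_iou, min_iou=0.8, max_len=8):
        # boxes: one ground-truth box per frame; a motion pipeline is linear, so boxes inside
        # a pipeline are obtained by interpolating between its two end frames.
        def fits(start, end):
            for k in range(start, end + 1):
                a = (k - start) / max(end - start, 1)
                interp = [(1 - a) * s + a * e for s, e in zip(boxes[start], boxes[end])]
                if box_iou(interp, boxes[k]) < min_iou:
                    return False
            return True

        tubes, start = [], 0
        while start < len(boxes) - 1:
            end = start + 1
            while end + 1 < len(boxes) and (end + 1 - start) <= max_len and fits(start, end + 1):
                end += 1                   # keep the longest pipeline that still meets the IoU threshold
            tubes.append((start, end))     # pipeline spans frames [start, end]
            start = end                    # adjacent pipelines share a frame, giving the overlap used for matching
        return tubes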
  • the video samples are input into the initial network model for training, and the predicted value of the motion pipeline is output.
  • the initial network model is a three-dimensional (3D) convolutional neural network or a recurrent neural network, etc., where the 3D convolutional neural network includes: a 3D residual neural network or a 3D feature pyramid network, etc.
  • the neural network model is a combination of a 3D residual neural network and a 3D feature pyramid network.
  • the video samples are input to the initial network model, and the motion pipelines of all target objects are output.
  • the data format of the output motion pipeline is the type shown in Figure 8.
  • The input video has the format R^(t×h×w×3), where h×w represents the video resolution and 3 represents the RGB color channels.
  • The output is the motion pipeline tensor O, O ∈ R^(t×h'×w'×15), where R represents the real number domain and t represents the number of frames of the video.
  • h'×w' represents the resolution of the feature map output by the neural network; that is, t×h'×w' motion pipelines are output, and each video frame corresponds to h'×w' motion pipelines.
  • the confidence level of the motion pipeline is also output, and the confidence level is used to indicate the category of the target object corresponding to the motion pipeline.
  • The execution order of step 1202 and step 1203 is not limited.
  • Since step 1202 splits according to the manually labeled trajectory information, the data format of the obtained true value of the motion pipeline (R^(t×h'×w'×15), where t×h'×w' is the number of motion pipelines) is the first data format of the motion pipeline;
  • the data format of the motion pipeline output by the initial network model in step 1203 (R^(n×15), where n is the number of motion pipelines) is the second data format of the motion pipeline.
  • the true value of the motion pipeline is converted into the second data format.
  • The t×h'×w' motion pipelines output by the neural network model include t×h'×w' P points; only P1 and P2 are used as examples in Figure 14 for illustration. The t×h'×w' P points form a three-dimensional lattice distributed in the three dimensions of time and space.
  • the true value is accompanied by a 0/1 truth table to characterize whether it is a compensation pipeline.
  • the truth table A' can be used as the confidence level corresponding to the truth value of the motion pipeline.
  • the loss between the true value (T) and the predicted value (O) can be calculated.
  • The loss function L combines the intersection-over-union between the true value and the predicted value of the motion pipeline with the cross entropy between their confidence levels, where:
  • IoU (T, O) represents the intersection ratio between the true value of the motion pipeline (T) and the predicted value (O) of the motion pipeline
  • A is the confidence level of the predicted value (O) of the motion pipeline
  • A' is the true value of the motion pipeline.
  • CrossEntropy is the cross entropy.
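  • A hedged sketch of a loss of this general form (the exact combination used in the application is not reproduced here; the code simply adds an IoU term and a cross-entropy term between the confidences A and the truth table A'):

    import torch
    import torch.nn.functional as F

    def tracking_loss(iou_pred_true, conf_logits, truth_table):
        # iou_pred_true: IoU(T, O) per motion pipeline; conf_logits: predicted confidences A (as logits);
        # truth_table: the 0/1 table A'. One plausible combination, not the application's exact formula.
        iou_term = (1.0 - iou_pred_true).mean()                                 # push predicted tubes onto the truth
        ce_term = F.binary_cross_entropy_with_logits(conf_logits, truth_table)  # CrossEntropy(A, A')
        return iou_term + ce_term

    iou = torch.rand(10)
    logits = torch.randn(10)
    truth = (torch.rand(10) > 0.5).float()
    print(tracking_loss(iou, logits, truth))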
  • the parameters are updated by the optimizer to optimize the neural network model, and finally a neural network model that can be used to implement the target tracking method in the embodiment of the present application is obtained.
  • There are many types of optimizers. Optionally, the optimizer can be the BGD (batch gradient descent) algorithm, the SGD (stochastic gradient descent) algorithm, or the MBGD (mini-batch gradient descent) algorithm.
  • BGD batch gradient descent
  • SGD stochastic gradient descent
  • MBGD mini-batch gradient descent
  • FIG. 15 is a schematic diagram of another embodiment of a target tracking method in an embodiment of this application.
  • the target tracking device can track the moving target in the video in real time.
  • the system initialization of the target tracking device is performed first, and the preparation for device startup is completed;
  • The video to be processed is then acquired; it can be a video captured by the target tracking device in real time, or a video obtained through a communication network.
  • the video obtained in 1502 is input into the pre-trained neural network model, and the motion pipeline set of the input video will be obtained, including the motion pipeline of the target object corresponding to each video frame.
  • the basic idea of the greedy algorithm is to proceed step by step from a certain initial solution of the problem. According to a certain optimization measure, each step must ensure that a local optimal solution can be obtained. It is understandable that the algorithm for connecting the motion pipeline can be replaced with other algorithms, which is not limited here.
  • the output is the tracking trajectory of one target object.
  • The tracking trajectory of each target object can be output. Specifically, the tracking trajectory can be processed into a bounding box in each video frame, superimposed on the original video, and displayed by the display module.
  • the target tracking device will continue to obtain the newly captured video content, and repeat steps 1502 to 1505 until the target tracking task ends, which will not be repeated here.
  • FIG. 16 is a schematic diagram of an embodiment of the target tracking device in the embodiment of this application.
  • the software or firmware includes but is not limited to computer program instructions or codes, and can be executed by a hardware processor.
  • the hardware includes, but is not limited to, various integrated circuits, such as a central processing unit (CPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or an application specific integrated circuit (ASIC).
  • CPU central processing unit
  • DSP digital signal processor
  • FPGA field programmable gate array
  • ASIC application specific integrated circuit
  • the target tracking device includes:
  • the acquiring unit 1601 is configured to acquire a first video, where the first video includes a target object;
  • the acquiring unit 1601 is further configured to input the first video into a pre-trained neural network model to acquire the position information of the target object in at least two video frames and the time information of the at least two video frames;
  • the acquiring unit 1601 is further configured to acquire the tracking of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames A trajectory, the tracking trajectory includes position information of the target object in at least two video frames in the first video.
  • the acquiring unit 1601 is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate the time information of the target object in at least two video frames of the first video and Location information, wherein the first video includes a first video frame and a second video frame; the motion pipeline corresponds to a quadrangular pyramid in a space-time dimension, and the space-time dimension includes a time dimension and a two-dimensional space dimension.
  • the position of the first bottom surface of the quadrangular pyramid in the time dimension is used to indicate the first time information of the first video frame
  • the position of the second bottom surface of the quadrangular pyramid in the time dimension is used to indicate the The second time information of the second video frame
  • the position of the first bottom surface of the quadrangular prism in the two-dimensional space is used to indicate the first position information of the target object in the first video frame
  • the position of the second bottom surface of the quadrangular prism in the two-dimensional space is used to indicate the second position information of the target object in the second video frame
  • the quadrangular prism is used to indicate the target object Position information in all video frames between the first video frame and the second video frame of the first video.
  • The acquiring unit 1601 is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate the position information of the target object in at least three video frames and the time information of the at least three video frames, wherein the first video includes a first video frame, a second video frame, and a third video frame; the motion pipeline corresponds to a double quadrangular prism in the space-time dimension, and the double quadrangular prism includes a first quadrangular prism and a second quadrangular prism, the first quadrangular prism includes a first bottom surface and a second bottom surface, the second quadrangular prism includes the first bottom surface and a third bottom surface, and the first bottom surface is the common bottom surface of the first quadrangular prism and the second quadrangular prism; the position of the first bottom surface in the time dimension is used to indicate the first time information of the first video frame, the position of the second bottom surface in the time dimension is used to indicate the second time information of the second video frame, and the position of the third bottom surface in the time dimension is used to indicate the third time information of the third video frame.
  • the acquiring unit 1601 is specifically configured to acquire the tracking trajectory of the target object in the first video according to the motion pipeline.
  • the tracking trajectory specifically includes: a tracking trajectory of the target object formed by connecting at least two of the motion pipes corresponding to the quadrangular prisms in the space-time dimension.
  • the length of the motion pipeline is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
  • the obtaining unit 1601 is further configured to: obtain category information of the target object through the pre-trained neural network model; according to the category information of the target object, the target object is The position information in the two video frames and the time information of the at least two video frames obtain the tracking trajectory of the target object in the first video.
  • the acquiring unit 1601 is specifically configured to: acquire the confidence level of the motion pipeline through the pre-trained neural network model, and the confidence level of the motion pipeline is used to determine the target object corresponding to the motion pipeline Of the category information.
  • the device further includes: a processing unit 1602, configured to delete the motion pipeline to obtain a deleted motion pipeline, and the deleted motion pipeline is used to acquire the tracking of the target object Trajectory.
  • a processing unit 1602 configured to delete the motion pipeline to obtain a deleted motion pipeline, and the deleted motion pipeline is used to acquire the tracking of the target object Trajectory.
  • The movement pipelines include a first movement pipeline and a second movement pipeline; the processing unit 1602 is specifically configured to: if the repetition rate between the first movement pipeline and the second movement pipeline is greater than or equal to a first threshold, delete the movement pipeline with the lower confidence level of the first movement pipeline and the second movement pipeline, where the repetition rate between the first movement pipeline and the second movement pipeline is the intersection-over-union between the first movement pipeline and the second movement pipeline, the first movement pipeline and the second movement pipeline belong to the motion pipelines of the target object, and the confidence level indicates the probability that the category of the target object corresponding to the movement pipeline is a preset category.
  • the processing unit 1602 is specifically configured to: delete the motion pipeline according to a non-maximum value suppression algorithm, and obtain the deleted motion pipeline.
  • The confidence level of any one of the motion pipelines after the deletion is greater than or equal to a second threshold.
  • the acquiring unit 1601 is specifically configured to: connect a third motion pipeline and a fourth motion pipeline that meet a preset condition in the motion pipeline to acquire the tracking trajectory of the target object; the preset condition It includes one or more of the following: the intersection ratio between the sections of the overlapping portion of the third movement pipeline and the fourth movement pipeline in the time dimension is greater than or equal to a third threshold; the movement direction of the third movement pipeline The cosine value of the included angle with the movement direction of the fourth motion pipe is greater than or equal to the fourth threshold, and the movement direction is a vector indicating the position change of the target object in the movement pipe in the space-time dimension according to a preset rule; and , The distance between the neural network feature vectors of the motion pipeline is less than or equal to the fifth threshold, and the distance includes the Euclidean distance.
  • The obtaining unit 1601 is specifically configured to: group the motion pipelines to obtain t groups of motion pipelines, where t is the total number of video frames in the first video, the i-th motion pipeline group in the t groups of motion pipelines includes all motion pipelines starting from the i-th video frame in the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, the motion pipelines in the i-th motion pipeline group are used as the initial tracking trajectories to obtain a tracking trajectory set; and, in accordance with the numbering order of the motion pipeline groups, the motion pipelines in the i-th motion pipeline group are connected with the tracking trajectories in the tracking trajectory set to obtain at least one tracking trajectory.
  • the obtaining unit 1601 is specifically configured to: input the first video sample into the initial network model for training, and obtain the target object loss; update the weight parameter in the initial network model according to the target object loss to obtain The pre-trained neural network model.
  • the target object loss specifically includes: an intersection ratio between the true value of the motion pipe and the predicted value of the motion pipe, and the true value of the motion pipe is obtained by splitting the tracking trajectory of the target object in the first video sample
  • the predicted value of the motion pipeline is a motion pipeline obtained by inputting the first video sample into the initial network model.
  • the target object loss specifically includes: the intersection ratio between the true value of the motion pipe and the predicted value of the motion pipe, and the cross entropy between the confidence of the true value of the motion pipe and the confidence of the predicted value of the motion pipe,
  • the true value of the motion pipeline is the motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample
  • the predicted value of the motion pipeline is the motion pipeline obtained by inputting the first video sample into the initial network model
  • the confidence level of the true value of the motion pipe is the probability that the target object category corresponding to the true value of the motion pipe belongs to the preset target object category
  • the confidence level of the predicted value of the motion pipe corresponds to the predicted value of the motion pipe The probability that the category of the target object belongs to the preset target object category.
  • the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
  • processing unit 1602 is further configured to: divide the first video into multiple video segments;
  • the acquiring unit 1601 is specifically configured to input the multiple video clips into the pre-trained neural network model to acquire the motion pipeline.
  • the target tracking device provided by the embodiment of the present application has multiple implementation forms.
  • the target tracking device includes a video acquisition module, a target tracking module, and an output module.
  • the video acquisition module is used to obtain a video including the moving target object
  • the target tracking module is used to input the video
  • the tracking trajectory of the target object is output by the target tracking method provided in this embodiment of the application
  • The output module is used to superimpose the tracking trajectory on the video and show it to users.
  • FIG. 17 is a schematic diagram of another embodiment of the target tracking device in the embodiment of this application.
  • the target tracking device includes a video acquisition module and a target tracking module, which can be understood as front-end equipment.
  • The front-end equipment and the back-end equipment need to work together to complete the processing.
  • the video acquisition module 1701 which can be a video acquisition module in a surveillance camera, a video camera, a mobile phone or a vehicle image sensor, is responsible for capturing video data as the input of the tracking algorithm;
  • The target tracking module 1702, which can be a processing unit in a camera processor, a mobile phone processor, a vehicle processing unit, etc., is used to receive the video input and the control information sent by the back-end device, such as the tracking target category, the number of targets to track, accuracy control, model hyperparameters, etc.
  • the target tracking method of the embodiment of the present application is mainly deployed in this module.
  • Please refer to FIG. 18 for the introduction of the target tracking module 1702.
  • the back-end equipment includes an output module and a control module.
  • The output module 1703 may be a display unit of a background monitor, a printer, or a hard disk, and is used for displaying or outputting the tracking results;
  • the control module 1704 is used to analyze the output result, receive the user's instruction, and send the instruction to the target tracking module of the front end.
  • FIG. 18 is a schematic diagram of another embodiment of the target tracking device in the embodiment of the application.
  • the target tracking device includes: a video preprocessing module 1801, a prediction module 1802, and a motion pipeline connection module 1803.
  • the video preprocessing module 1801 is used to divide the input video into appropriate segments, and adjust and normalize the video resolution, color space, etc.
  • the prediction module 1802 is used to extract spatiotemporal features from the input video clips and make predictions, and output the target motion pipeline and the category information of the motion pipeline. In addition, it can also predict the future position of the target motion pipeline.
  • the prediction module 1802 includes two sub-modules:
  • Target category prediction module 18021: based on the features output by the 3D convolutional neural network, for example the confidence values, it predicts the category to which the target belongs.
  • Motion pipeline prediction module 18022: predicts the location of the target's current motion pipeline from the features output by the 3D convolutional neural network, that is, the coordinates of the motion pipeline in the space and time dimensions.
  • The motion pipeline connection module 1803 analyzes the motion pipelines output by the prediction module. If a target appears for the first time, it is initialized as a new tracking trajectory. The connection features required for connecting the motion pipelines are obtained according to the temporal and spatial feature similarity between the motion pipelines and the proximity of their spatial locations. According to the motion pipelines and their connection features, the motion pipelines are connected into complete tracking trajectories by analyzing the spatial overlap characteristics of the motion pipelines and the similarity of their temporal and spatial features.
  • FIG. 19 is a schematic diagram of an embodiment of an electronic device in an embodiment of the application.
  • the electronic device 1900 may vary considerably depending on its configuration or performance, and may include one or more processors 1901 and a memory 1902, where the memory 1902 stores programs or data.
  • the memory 1902 may be volatile storage or non-volatile storage.
  • the processor 1901 is one or more central processing units (CPUs).
  • the CPUs may be single-core CPUs or multi-core CPUs.
  • the processor 1901 may communicate with the memory 1902 and execute a series of instructions in the memory 1902 on the electronic device 1900.
  • the electronic device 1900 also includes one or more wired or wireless network interfaces 1903, such as an Ethernet interface.
  • the electronic device 1900 may also include one or more power supplies and one or more input and output interfaces, which can be used to connect a display, a mouse, a keyboard, a touch screen device, a sensor device, etc.
  • the input and output interfaces are optional components, which may or may not exist, and are not limited here.
  • FIG. 20 is a hardware structure diagram of a chip provided by an embodiment of this application.
  • the embodiment of the present application provides a chip system that can be used to implement the target tracking method.
  • the algorithm based on the convolutional neural network shown in FIG. 3 and FIG. 4 can be implemented in the NPU chip shown in FIG. 20.
  • the neural network processor NPU 50 is mounted on the main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 503.
  • the arithmetic circuit 503 is controlled by the controller 504 to extract matrix data from the memory and perform multiplication operations.
  • the arithmetic circuit 503 includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches the matrix A data from the input memory 501 and performs matrix operations with matrix B; the partial or final result of the obtained matrix is stored in the accumulator 508 (a sketch of this data flow follows).
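  • The following NumPy sketch mirrors this data flow in software: matrix B is held stationary (as if cached on the PEs), tiles of matrix A are streamed in, and partial products are accumulated, playing the role of accumulator 508. The tile size is an illustrative assumption.

```python
import numpy as np

def tiled_matmul(a, b, tile=4):
    """Compute a @ b by streaming tiles along the shared dimension and accumulating partial results."""
    m, k = a.shape
    _, n = b.shape
    acc = np.zeros((m, n), dtype=np.result_type(a, b))  # plays the role of the accumulator
    for start in range(0, k, tile):
        a_tile = a[:, start:start + tile]                # stream a tile of matrix A
        b_tile = b[start:start + tile, :]                # the cached slice of matrix B
        acc += a_tile @ b_tile                           # partial result accumulated
    return acc
```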
  • the unified memory 506 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 502 through the storage unit access controller 505 (direct memory access controller, DMAC).
  • the input data is also transferred to the unified memory 506 through the DMAC.
  • the bus interface unit (BIU) 510 is used for the interaction among the AXI bus, the DMAC, and the instruction fetch buffer 509.
  • the bus interface unit 510 is also used for the instruction fetch memory 509 to obtain instructions from the external memory, and for the storage unit access controller 505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 506 or to transfer the weight data to the weight memory 502 or to transfer the input data to the input memory 501.
  • the vector calculation unit 507 may include multiple arithmetic processing units, and if necessary, further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 507 can store the processed output vector to the unified buffer 506.
  • the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 507 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 503, for example for use in a subsequent layer in a neural network.
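  • As a minimal illustration of this post-processing, the sketch below applies a nonlinear function (ReLU) followed by a normalization to the accumulated output, producing a vector that can serve as the activation input of a subsequent layer; the specific choice of ReLU and of the normalization is an assumption for illustration only.

```python
import numpy as np

def vector_postprocess(acc):
    """Post-process accumulated outputs: nonlinear activation followed by normalization."""
    activated = np.maximum(acc, 0.0)                      # nonlinear function (ReLU) -> activation value
    mean = activated.mean(axis=-1, keepdims=True)
    std = activated.std(axis=-1, keepdims=True) + 1e-6
    normalized = (activated - mean) / std                 # normalized value
    return normalized                                     # usable as activation input for the next layer
```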
  • the instruction fetch buffer 509 connected to the controller 504 is used to store instructions used by the controller 504;
  • the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are all on-chip memories.
  • the external memory is a memory external to the NPU hardware architecture.
  • each layer in the convolutional neural network shown in FIG. 3 and FIG. 4 may be executed by the matrix calculation unit 212 or the vector calculation unit 507.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a target tracking method applicable to tracking a target in a video and capable of reducing the tracking errors caused when a target is occluded. The method comprises the steps of: inputting a captured video of a target object into a pre-trained neural network model; acquiring motion pipelines of the target object; connecting the motion pipelines; and obtaining a tracking trajectory of the target object, the tracking trajectory comprising location information of the target in each video frame of the first video.
PCT/CN2021/093852 2020-06-09 2021-05-14 Dispositif de suivi de cible et procédé de suivi de cible WO2021249114A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010519876.2 2020-06-09
CN202010519876.2A CN113781519A (zh) 2020-06-09 2020-06-09 目标跟踪方法和目标跟踪装置

Publications (1)

Publication Number Publication Date
WO2021249114A1 true WO2021249114A1 (fr) 2021-12-16

Family

ID=78834470

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/093852 WO2021249114A1 (fr) 2020-06-09 2021-05-14 Dispositif de suivi de cible et procédé de suivi de cible

Country Status (2)

Country Link
CN (1) CN113781519A (fr)
WO (1) WO2021249114A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116504068A (zh) * 2023-06-26 2023-07-28 创辉达设计股份有限公司江苏分公司 车道级车流量的统计方法、装置、计算机设备及存储介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11625909B1 (en) * 2022-05-04 2023-04-11 Motional Ad Llc Track segment cleaning of tracked objects
CN114972814B (zh) * 2022-07-11 2022-10-28 浙江大华技术股份有限公司 一种目标匹配的方法、装置及存储介质
CN115451962B (zh) * 2022-08-09 2024-04-30 中国人民解放军63629部队 一种基于五变量卡诺图的目标跟踪策略规划方法

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060262184A1 (en) * 2004-11-05 2006-11-23 Yissum Research Development Company Of The Hebrew University Of Jerusalem Method and system for spatio-temporal video warping
US20090219300A1 (en) * 2005-11-15 2009-09-03 Yissum Research Deveopment Company Of The Hebrew University Of Jerusalem Method and system for producing a video synopsis
CN101702233A (zh) * 2009-10-16 2010-05-05 电子科技大学 视频帧中基于三点共线标记点的三维定位方法
US20160148392A1 (en) * 2014-11-21 2016-05-26 Thomson Licensing Method and apparatus for tracking the motion of image content in a video frames sequence using sub-pixel resolution motion estimation
CN106169187A (zh) * 2015-05-18 2016-11-30 汤姆逊许可公司 用于对视频中的物体设界的方法和设备
CN108182696A (zh) * 2018-01-23 2018-06-19 四川精工伟达智能技术股份有限公司 图像处理方法、装置及多目标定位跟踪系统
CN108509830A (zh) * 2017-02-28 2018-09-07 华为技术有限公司 一种视频数据处理方法及设备
CN110188719A (zh) * 2019-06-04 2019-08-30 北京字节跳动网络技术有限公司 目标跟踪方法和装置

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8345984B2 (en) * 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
CN106897714B (zh) * 2017-03-23 2020-01-14 北京大学深圳研究生院 一种基于卷积神经网络的视频动作检测方法
CN107492113B (zh) * 2017-06-01 2019-11-05 南京行者易智能交通科技有限公司 一种视频图像中运动目标位置预测模型训练方法、位置预测方法及轨迹预测方法
CN110032926B (zh) * 2019-02-22 2021-05-11 哈尔滨工业大学(深圳) 一种基于深度学习的视频分类方法以及设备
CN110188637A (zh) * 2019-05-17 2019-08-30 西安电子科技大学 一种基于深度学习的行为识别技术方法

Also Published As

Publication number Publication date
CN113781519A (zh) 2021-12-10

Similar Documents

Publication Publication Date Title
Ming et al. Deep learning for monocular depth estimation: A review
Walch et al. Image-based localization using lstms for structured feature correlation
WO2021249114A1 (fr) Dispositif de suivi de cible et procédé de suivi de cible
CN109559320B (zh) 基于空洞卷积深度神经网络实现视觉slam语义建图功能的方法及系统
WO2021175050A1 (fr) Procédé et dispositif de reconstruction tridimensionnelle
WO2021043168A1 (fr) Procédé d'entraînement de réseau de ré-identification de personnes et procédé et appareil de ré-identification de personnes
WO2020192736A1 (fr) Procédé et dispositif de reconnaissance d'objet
WO2021227726A1 (fr) Procédés et appareils d'apprentissage de détection de visage et réseaux neuronaux de détection d'image, et dispositif
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN112990211B (zh) 一种神经网络的训练方法、图像处理方法以及装置
WO2021147325A1 (fr) Procédé et appareil de détection d'objets, et support de stockage
CN110222717B (zh) 图像处理方法和装置
CN112419368A (zh) 运动目标的轨迹跟踪方法、装置、设备及存储介质
WO2021218786A1 (fr) Système de traitement de données, procédé de détection d'objet et appareil associé
WO2023082882A1 (fr) Procédé et dispositif de reconnaissance d'action de chute de piéton basés sur une estimation de pose
CN111062263B (zh) 手部姿态估计的方法、设备、计算机设备和存储介质
CN113674416B (zh) 三维地图的构建方法、装置、电子设备及存储介质
WO2022179581A1 (fr) Procédé de traitement d'images et dispositif associé
CN110222718B (zh) 图像处理的方法及装置
WO2021218238A1 (fr) Procédé et appareil de traitement d'image
WO2021103731A1 (fr) Procédé de segmentation sémantique et procédé et appareil d'apprentissage de modèle
CN113011562A (zh) 一种模型训练方法及装置
CN112529904A (zh) 图像语义分割方法、装置、计算机可读存储介质和芯片
KR102143034B1 (ko) 객체의 미래 움직임 예측을 통한 동영상에서의 객체 추적을 위한 방법 및 시스템
WO2022052782A1 (fr) Procédé de traitement d'image et dispositif associé

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21821941

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21821941

Country of ref document: EP

Kind code of ref document: A1