WO2021249114A1 - Target tracking method and target tracking device - Google Patents

Target tracking method and target tracking device

Info

Publication number
WO2021249114A1
Authority
WO
WIPO (PCT)
Prior art keywords
motion
pipeline
video
target object
movement
Prior art date
Application number
PCT/CN2021/093852
Other languages
French (fr)
Chinese (zh)
Inventor
庞博
卢策吾
袁伟
胡翔宇
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021249114A1


Classifications

    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Neural networks; Learning methods
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • H04N7/00 Television systems
    • G06T2207/10016 Video; Image sequence
    • G06T2207/10024 Color image
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30241 Trajectory

Definitions

  • This application relates to the field of image processing technology, and in particular to a target tracking method and target tracking device.
  • Target tracking is one of the most important and fundamental tasks in computer vision. Given a video containing a target object, its goal is to output the position of the target object in every video frame of the video. Typically, a video and the category of the target object to be tracked are input to the computer, and the computer outputs the identifier (ID) of the target object and the position information of the target object in each frame of the video in the form of a detection box.
  • The existing multi-target tracking method consists of detection and tracking: a detection module detects the multiple target objects appearing in each video frame, and the detected target objects are then matched across frames. During matching, a feature is extracted for each target object in a single video frame, targets are matched through feature-similarity comparison, and a tracking trajectory is obtained for each target object.
  • In this approach, the tracking quality depends on the single-frame detection algorithm: if the target object is occluded, the detection fails and the tracking fails with it. The method therefore performs poorly in scenes where targets are dense or occluded.
  • The embodiments of the present application provide a target tracking method for tracking targets in a video, which can reduce tracking errors caused by target occlusion.
  • A first aspect of the embodiments of the present application provides a target tracking method, including: acquiring a first video, where the first video includes a target object; inputting the first video into a pre-trained neural network model to acquire the position information of the target object in at least two video frames and the time information of the at least two video frames; and acquiring a tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, where the tracking trajectory includes the position information of the target object in at least two video frames of the first video.
  • This method obtains the position information of the target object in at least two video frames and the time information of the at least two video frames through a pre-trained neural network model. Target tracking therefore does not depend on the target detection result of a single video frame, which reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance.
  • Obtaining the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: obtaining a motion pipeline of the target object, which is used to indicate the time information and position information of the target object in at least two video frames of the first video, where the first video includes a first video frame and a second video frame. The motion pipeline corresponds to a quadrangular frustum in the space-time dimension, where the space-time dimension includes a time dimension and a two-dimensional space dimension; the position of the first bottom surface of the quadrangular frustum in the time dimension is used to indicate the first time information of the first video frame, and the position of the second bottom surface of the quadrangular frustum in the time dimension is used to indicate the second time information of the second video frame; the position of the first bottom surface of the quadrangular frustum in the two-dimensional space is used to indicate the first position information of the target object in the first video frame, and the position of the second bottom surface of the quadrangular frustum in the two-dimensional space is used to indicate the second position information of the target object in the second video frame.
  • This method obtains the motion pipeline of each video frame through a pre-trained neural network model. Since the motion pipeline includes the position information of the target object in at least two video frames, the position of the target in a video frame can be determined in the space-time dimension by a time in the time dimension and a position in the two-dimensional space: the time determines the video frame, and the position in the two-dimensional space indicates the position of the target within that frame.
  • This method maps the motion pipeline to a quadrangular frustum in the space-time dimension, and the position information of the target in at least two video frames is visually displayed through the quadrangular frustum in the space-time dimension.
  • The target tracking method does not depend on the target detection result of a single video frame, which reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance.
  • Obtaining the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: obtaining a motion pipeline of the target object, where the motion pipeline is used to indicate the position information of the target object in at least three video frames and the time information of the at least three video frames, and the first video includes a first video frame, a second video frame, and a third video frame.
  • The motion pipeline corresponds to a double quadrangular frustum in the space-time dimension; the double quadrangular frustum includes a first quadrangular frustum and a second quadrangular frustum, the first quadrangular frustum includes a first bottom surface and a second bottom surface, the second quadrangular frustum includes the first bottom surface and a third bottom surface, and the first bottom surface is the common bottom surface of the first quadrangular frustum and the second quadrangular frustum.
  • The position of the first bottom surface in the time dimension is used to indicate the first time information of the first video frame, the position of the second bottom surface in the time dimension is used to indicate the second time information of the second video frame, and the position of the third bottom surface in the time dimension is used to indicate the third time information of the third video frame; in the time sequence of the first video, the first video frame is located between the second video frame and the third video frame.
  • The position of the first bottom surface in the two-dimensional space is used to indicate the first position information of the target object in the first video frame, the position of the second bottom surface in the two-dimensional space indicates the second position information of the target object in the second video frame, and the position of the third bottom surface in the two-dimensional space indicates the third position information of the target object in the third video frame; the double quadrangular frustum is used to indicate the position information of the target object in the video frames of the first video between the second video frame and the third video frame.
  • In this case, the motion pipeline includes the position information of the target object in at least three video frames.
  • Relative to the first video frame to which the motion pipeline corresponds, the at least three video frames include the earlier second video frame and the later third video frame in the time sequence of the video, which expands the receptive field in the time dimension and can further improve target tracking performance.
  • The motion pipeline corresponds to a double quadrangular frustum in the space-time dimension, and the position information of the target in at least three video frames is visually displayed through the double quadrangular frustum; specifically, the motion pipeline also includes the position information of the target in all the video frames between its two non-common bottom surfaces.
  • The real tracking trajectory of a target object is usually nonlinear. A motion pipeline with the double-quadrangular-frustum structure can express two movement directions of the target, so it can better fit the real tracking trajectory in scenes where the movement direction changes.
  • In the first aspect, acquiring the tracking trajectory of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: acquiring the tracking trajectory of the target object in the first video according to the motion pipeline.
  • Obtaining the tracking trajectory of the target object in the first video according to the motion pipeline reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance.
  • The tracking trajectory specifically includes: a tracking trajectory of the target object formed by connecting at least two of the motion pipelines, each of which corresponds to a quadrangular frustum in the space-time dimension.
  • Obtaining the tracking trajectory of the target object by connecting motion pipelines does not rely on the target detection result of a single video frame, which reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance.
  • The length of the motion pipeline of a video frame is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames; optionally, the length of the motion pipeline is 4, 6, or 8.
  • The length of the motion pipeline can be a preset value, that is, each motion pipeline corresponds to the same number of video frames and indicates the position change of the target object within a time period of the same length. Compared with not presetting the length of the motion pipeline, this method reduces the amount of calculation of the neural network model and the time consumed by target tracking.
  • The method further includes: obtaining category information of the target object through the pre-trained neural network model; and acquiring the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
  • This method can determine the category information of the target object corresponding to the motion pipeline through the pre-trained neural network model, and obtain the tracking trajectory of the target object based on the category information, position information, and time information.
  • Acquiring, through the pre-trained neural network model, the category information of the target object corresponding to the motion pipeline specifically includes: obtaining, through the pre-trained neural network model, the confidence of the motion pipeline, where the confidence of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
  • With the confidence, this method can distinguish whether a motion pipeline is a real motion pipeline indicating a target position, and can distinguish the category of the target object corresponding to the motion pipeline, as in the sketch below.
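  • As an illustration only (the application does not specify this logic), confidence-based category selection might look like the following sketch; the category list, the threshold value, and the function name are assumptions:

```python
from typing import List, Optional

def category_from_confidence(
    confidences: List[float],   # per-category confidence of one motion pipeline
    categories: List[str],      # preset target object categories (assumed)
    min_conf: float = 0.5,      # illustrative threshold, not from the application
) -> Optional[str]:
    """Return the category of the pipeline's target object, or None when the
    pipeline is judged not to be a real motion pipeline."""
    best = max(range(len(confidences)), key=confidences.__getitem__)
    return categories[best] if confidences[best] >= min_conf else None

# Example: this pipeline is most likely a "person" with confidence 0.83.
print(category_from_confidence([0.83, 0.10], ["person", "vehicle"]))  # person
```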
  • Before acquiring the tracking trajectory of the target object according to the motion pipelines, the method further includes: pruning the motion pipelines to obtain the pruned motion pipelines, where the pruned motion pipelines are used to obtain the tracking trajectory of the target object.
  • This method prunes the motion pipelines of the video frames, removing duplicate motion pipelines and motion pipelines with low confidence, which reduces the amount of calculation in the motion pipeline connection step.
  • Pruning the motion pipelines to obtain the pruned motion pipelines specifically includes: the motion pipelines include a first motion pipeline and a second motion pipeline; if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, the one of the first and second motion pipelines with the lower confidence is deleted. The repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union (IoU) between them; the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object; and the confidence indicates the probability that the category of the target object corresponding to the motion pipeline is the preset category.
  • This describes a specific method of pruning motion pipelines. Motion pipelines whose repetition rate is greater than or equal to the first threshold can be regarded as duplicated data; the one with the lower confidence is deleted, and the one with the higher confidence is retained for pipeline connection, which reduces the amount of calculation in the motion pipeline connection step.
  • Pruning the motion pipelines to obtain the pruned motion pipelines specifically includes: pruning the motion pipelines according to a non-maximum suppression (NMS) algorithm to obtain the pruned motion pipelines.
  • Pruning according to the non-maximum suppression algorithm removes duplicate motion pipelines while retaining, for each target, the motion pipeline with higher confidence, which reduces the amount of calculation in the pipeline connection step and improves target tracking efficiency.
  • The confidence of any one of the pruned motion pipelines is greater than or equal to a second threshold; a sketch of this pruning step follows.
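  • A minimal sketch of the pruning step, assuming a tube_iou helper that returns the space-time IoU of two motion pipelines (one concrete version is sketched after the IoU formula later in this document); the greedy keep-highest-confidence strategy mirrors standard non-maximum suppression, and both threshold values are illustrative:

```python
from typing import Callable, List, Sequence, TypeVar

Tube = TypeVar("Tube")

def suppress_tubes(
    tubes: Sequence[Tube],
    confidences: Sequence[float],
    tube_iou: Callable[[Tube, Tube], float],
    repetition_thr: float = 0.5,   # the "first threshold"; value is illustrative
    min_conf: float = 0.1,         # the "second threshold"; value is illustrative
) -> List[int]:
    """Greedy NMS over motion pipelines: repeatedly keep the remaining
    pipeline with the highest confidence and delete every pipeline whose
    repetition rate (space-time IoU) with it reaches repetition_thr;
    pipelines below min_conf are dropped outright. Returns kept indices."""
    order = [i for i in sorted(range(len(tubes)),
                               key=lambda i: confidences[i], reverse=True)
             if confidences[i] >= min_conf]
    keep: List[int] = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if tube_iou(tubes[best], tubes[i]) < repetition_thr]
    return keep
```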
  • Acquiring the tracking trajectory of the target object according to the motion pipelines specifically includes: connecting a third motion pipeline and a fourth motion pipeline that meet a preset condition among the motion pipelines to obtain the tracking trajectory of the target object. The preset condition includes one or more of the following: the intersection-over-union between the sections of the third and fourth motion pipelines where they overlap in the time dimension is greater than or equal to a third threshold; the cosine of the angle between the movement direction of the third motion pipeline and the movement direction of the fourth motion pipeline is greater than or equal to a fourth threshold, where the movement direction is a vector that indicates, according to a preset rule in the space-time dimension, the position change of the target object in the motion pipeline; and the distance between the neural-network feature vectors of the motion pipelines is less than or equal to a fifth threshold, where the distance includes the Euclidean distance.
  • This provides a specific method for connecting motion pipelines: according to their positions in the space-time dimension, motion pipelines with high overlap and similar movement directions are connected, as sketched below.
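  • The preset condition can be checked with a predicate like the following sketch; all threshold values are placeholders, and requiring all three sub-conditions (rather than "one or more", as the text allows) is a simplification:

```python
import math
from typing import Sequence

def may_connect(
    overlap_iou: float,           # IoU of the two pipelines over their time-overlapping sections
    direction_a: Sequence[float], # movement-direction vector of the third pipeline
    direction_b: Sequence[float], # movement-direction vector of the fourth pipeline
    feature_distance: float,      # Euclidean distance between the pipelines' feature vectors
    iou_thr: float = 0.5,         # "third threshold" (illustrative)
    cos_thr: float = 0.8,         # "fourth threshold" (illustrative)
    dist_thr: float = 1.0,        # "fifth threshold" (illustrative)
) -> bool:
    dot = sum(a * b for a, b in zip(direction_a, direction_b))
    norm = math.hypot(*direction_a) * math.hypot(*direction_b)  # multi-arg hypot: Python 3.8+
    cosine = dot / norm if norm else 0.0
    return (overlap_iou >= iou_thr
            and cosine >= cos_thr
            and feature_distance <= dist_thr)
```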
  • Acquiring the tracking trajectory of the target object according to the motion pipelines specifically includes: grouping the motion pipelines to acquire t groups of motion pipelines, where t is the total number of video frames in the first video, the i-th motion pipeline group in the t groups includes all motion pipelines starting from the i-th video frame of the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, using the motion pipelines in the i-th motion pipeline group as initial tracking trajectories to obtain a tracking trajectory set; and, in the order of the group numbers, connecting the motion pipelines in the i-th motion pipeline group with the tracking trajectories in the tracking trajectory set to obtain at least one tracking trajectory.
  • This provides a specific method for connecting motion pipelines. A motion pipeline corresponds to the position information of the target object in the video frames within a period of time; grouping the motion pipelines according to their initial video frame and connecting each group in turn can improve the efficiency of target tracking, as shown in the sketch below.
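  • A sketch of the grouped connection procedure under stated assumptions: start_frame extracts a pipeline's initial video frame, can_connect is a compatibility predicate such as the one above, and starting a new trajectory for an unmatched pipeline is our own choice, not something the text specifies:

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, TypeVar

Tube = TypeVar("Tube")

def link_grouped_tubes(
    tubes: Sequence[Tube],
    t: int,                                   # total number of video frames
    start_frame: Callable[[Tube], int],       # 1-based initial frame of a pipeline
    can_connect: Callable[[Tube, Tube], bool],
) -> List[List[Tube]]:
    # Group i holds all motion pipelines starting from video frame i.
    groups: Dict[int, List[Tube]] = defaultdict(list)
    for tube in tubes:
        groups[start_frame(tube)].append(tube)
    # Group 1 seeds the tracking-trajectory set.
    trajectories: List[List[Tube]] = [[tube] for tube in groups.get(1, [])]
    # Sweep the remaining groups in frame order, attaching each pipeline to
    # the first compatible trajectory (greedy; an assumption of this sketch).
    for i in range(2, t + 1):
        for tube in groups.get(i, []):
            for trajectory in trajectories:
                if can_connect(trajectory[-1], tube):
                    trajectory.append(tube)
                    break
            else:
                trajectories.append([tube])
    return trajectories
```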
  • The pre-trained neural network model is obtained by training an initial network model, and the method further includes: inputting a first video sample into the initial network model for training and acquiring a target object loss; and updating the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
  • That is, the initial network model can be trained to obtain the neural network model that outputs motion pipelines in the target tracking method.
  • The target object loss specifically includes: an intersection-over-union between a motion pipeline truth value and a motion pipeline predicted value, where the motion pipeline truth value is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into the initial network model.
  • Here, the target loss in the model training process is the intersection-over-union between the motion pipeline truth value and the motion pipeline predicted value; the neural network model obtained by this training indicates the position information of the target object with high accuracy.
  • The target object loss specifically includes: the intersection-over-union between the motion pipeline truth value and the motion pipeline predicted value, and the cross entropy between the confidence of the motion pipeline truth value and the confidence of the motion pipeline predicted value. The motion pipeline truth value is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into the initial network model; the confidence of the motion pipeline truth value is the probability that the target object category corresponding to the truth value belongs to the preset target object category, and the confidence of the motion pipeline predicted value is the probability that the target object category corresponding to the predicted value belongs to the preset target object category.
  • Here, the target loss in the model training process is the intersection-over-union between the motion pipeline truth value and the motion pipeline predicted value together with the cross entropy between their confidences; the neural network model obtained by this training indicates the position information of the target object with high accuracy and can also accurately indicate the category of the target object. A minimal loss sketch follows.
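  • A hedged sketch of such a loss, assuming PyTorch. The application names the two ingredients (the space-time IoU between truth and prediction, and the cross entropy between their confidences); combining them as (1 - IoU) + BCE with equal weights is our assumption:

```python
import torch
import torch.nn.functional as F

def tube_training_loss(
    tube_iou: torch.Tensor,         # space-time IoU between predicted and truth pipelines, in [0, 1]
    pred_confidence: torch.Tensor,  # confidence of the predicted motion pipeline, in [0, 1]
    true_confidence: torch.Tensor,  # confidence of the truth motion pipeline (e.g. 1.0)
) -> torch.Tensor:
    iou_loss = 1.0 - tube_iou       # push predicted pipelines to overlap the truth pipelines
    conf_loss = F.binary_cross_entropy(pred_confidence, true_confidence)
    return iou_loss + conf_loss     # equal weighting is our assumption

# Example: a predicted pipeline overlapping its truth pipeline with IoU 0.7.
loss = tube_training_loss(torch.tensor(0.7), torch.tensor(0.9), torch.tensor(1.0))
```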
  • the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network, and the three-dimensional convolutional neural network includes a three-dimensional residual neural network or a three-dimensional feature pyramid network.
  • the initial network model is obtained by combining a three-dimensional residual neural network and a three-dimensional feature pyramid network.
  • the initial network model in this method can be a three-dimensional convolutional neural network, a recurrent neural network, or a combination of the two.
  • The diversity of neural network model types provides multiple possible ways to realize the scheme; a toy illustration follows.
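  • For illustration, a toy 3D convolutional stack in PyTorch showing how such a backbone convolves over time as well as space; the model described above (a 3D residual network combined with a 3D feature pyramid network) is far larger, and all layer sizes here are arbitrary:

```python
import torch
import torch.nn as nn

# Input layout is (batch, channels, frames, height, width).
backbone = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1),                      # mixes time and space
    nn.BatchNorm3d(16),
    nn.ReLU(inplace=True),
    nn.Conv3d(16, 32, kernel_size=3, stride=(1, 2, 2), padding=1),   # downsamples space only
)

clip = torch.randn(1, 3, 8, 128, 128)   # one 8-frame RGB clip
features = backbone(clip)               # shape: (1, 32, 8, 64, 64)
```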
  • Inputting the first video into a pre-trained neural network model to obtain the motion pipeline of the target object specifically includes: dividing the first video into multiple video clips, and inputting the multiple video clips into the pre-trained neural network model to obtain the motion pipelines. That is, the video can be segmented first and the video clips input into the model; the number of video frames in a video clip is a preset value, for example, 8 frames. A clip-splitting sketch follows.
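  • A minimal sketch of the clip-splitting step; emitting a shorter tail clip when the video length is not a multiple of the clip length is our choice:

```python
from typing import Iterator, Tuple

def split_into_clips(num_frames: int, clip_len: int = 8) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) frame-index ranges of consecutive clips."""
    for start in range(0, num_frames, clip_len):
        yield start, min(start + clip_len, num_frames)

print(list(split_into_clips(20)))   # [(0, 8), (8, 16), (16, 20)]
```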
  • A second aspect of the embodiments of the present application provides a target tracking device, including: an acquisition unit configured to acquire a first video, where the first video includes a target object. The acquisition unit is further configured to input the first video into a pre-trained neural network model to obtain the position information of the target object in at least two video frames and the time information of the at least two video frames; and the acquisition unit is further configured to acquire the tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, where the tracking trajectory includes the position information of the target object in at least two video frames of the first video.
  • The acquisition unit is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate the time information and position information of the target object in at least two video frames of the first video, and the first video includes a first video frame and a second video frame. The motion pipeline corresponds to a quadrangular frustum in the space-time dimension, where the space-time dimension includes a time dimension and a two-dimensional space dimension; the position of the first bottom surface of the quadrangular frustum in the time dimension is used to indicate the first time information of the first video frame, and the position of the second bottom surface of the quadrangular frustum in the time dimension is used to indicate the second time information of the second video frame; the position of the first bottom surface of the quadrangular frustum in the two-dimensional space is used to indicate the first position information of the target object in the first video frame, and the position of the second bottom surface of the quadrangular frustum in the two-dimensional space is used to indicate the second position information of the target object in the second video frame.
  • The acquiring unit is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate the position information of the target object in at least three video frames and the time information of the at least three video frames, and the first video includes a first video frame, a second video frame, and a third video frame. The motion pipeline corresponds to a double quadrangular frustum in the space-time dimension; the double quadrangular frustum includes a first quadrangular frustum and a second quadrangular frustum, the first quadrangular frustum includes a first bottom surface and a second bottom surface, the second quadrangular frustum includes the first bottom surface and a third bottom surface, and the first bottom surface is the common bottom surface of the first quadrangular frustum and the second quadrangular frustum. The position of the first bottom surface in the time dimension is used to indicate the first time information of the first video frame, the position of the second bottom surface in the time dimension is used to indicate the second time information of the second video frame, and the position of the third bottom surface in the time dimension is used to indicate the third time information of the third video frame; in the time sequence of the first video, the first video frame is located between the second video frame and the third video frame. The position of the first bottom surface in the two-dimensional space dimension is used to indicate the first position information of the target object in the first video frame, the position of the second bottom surface in the two-dimensional space dimension indicates the second position information of the target object in the second video frame, and the position of the third bottom surface in the two-dimensional space dimension indicates the third position information of the target object in the third video frame; the double quadrangular frustum is used to indicate the position information of the target object in the video frames of the first video between the second video frame and the third video frame.
  • the acquiring unit is specifically configured to acquire the tracking trajectory of the target object in the first video according to the motion pipeline.
  • The tracking trajectory specifically includes: a tracking trajectory of the target object formed by connecting at least two of the motion pipelines, each of which corresponds to a quadrangular frustum in the space-time dimension.
  • The length of the motion pipeline is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
  • The acquiring unit is further configured to: acquire category information of the target object through the pre-trained neural network model; and acquire the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in at least two video frames, and the time information of the at least two video frames.
  • The acquiring unit is specifically configured to: acquire the confidence of the motion pipeline through the pre-trained neural network model, where the confidence of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
  • The device further includes: a processing unit configured to prune the motion pipelines to obtain the pruned motion pipelines, where the pruned motion pipelines are used to obtain the tracking trajectory of the target object.
  • The motion pipelines include a first motion pipeline and a second motion pipeline; the processing unit is specifically configured to: if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to the first threshold, delete the one of the first and second motion pipelines with the lower confidence. The repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union between them; the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object; and the confidence indicates the probability that the category of the target object corresponding to the motion pipeline is the preset category.
  • The processing unit is specifically configured to prune the motion pipelines according to a non-maximum suppression algorithm to obtain the pruned motion pipelines.
  • The confidence of any one of the pruned motion pipelines is greater than or equal to the second threshold.
  • The acquiring unit is specifically configured to: connect a third motion pipeline and a fourth motion pipeline that meet a preset condition among the motion pipelines to obtain the tracking trajectory of the target object. The preset condition includes one or more of the following: the intersection-over-union between the sections of the third and fourth motion pipelines where they overlap in the time dimension is greater than or equal to a third threshold; the cosine of the angle between the movement direction of the third motion pipeline and the movement direction of the fourth motion pipeline is greater than or equal to a fourth threshold, where the movement direction is a vector indicating, according to a preset rule in the space-time dimension, the position change of the target object in the motion pipeline; and the distance between the neural-network feature vectors of the motion pipelines is less than or equal to a fifth threshold, where the distance includes the Euclidean distance.
  • The obtaining unit is specifically configured to: group the motion pipelines to obtain t groups of motion pipelines, where t is the total number of video frames in the first video, the i-th motion pipeline group in the t groups includes all motion pipelines starting from the i-th video frame of the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, use the motion pipelines in the i-th motion pipeline group as initial tracking trajectories to obtain a tracking trajectory set; and, in the order of the group numbers, connect the motion pipelines in the i-th motion pipeline group with the tracking trajectories in the tracking trajectory set to obtain at least one tracking trajectory.
  • The acquiring unit is specifically configured to: input the first video sample into the initial network model for training and acquire the target object loss; and update the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
  • The target object loss specifically includes: an intersection-over-union between the motion pipeline truth value and the motion pipeline predicted value, where the motion pipeline truth value is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into the initial network model.
  • The target object loss specifically includes: the intersection-over-union between the motion pipeline truth value and the motion pipeline predicted value, and the cross entropy between the confidence of the motion pipeline truth value and the confidence of the motion pipeline predicted value. The motion pipeline truth value is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the motion pipeline predicted value is a motion pipeline obtained by inputting the first video sample into the initial network model; the confidence of the motion pipeline truth value is the probability that the target object category corresponding to the truth value belongs to the preset target object category, and the confidence of the motion pipeline predicted value is the probability that the target object category corresponding to the predicted value belongs to the preset target object category.
  • the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
  • The processing unit is further configured to divide the first video into multiple video clips; the acquiring unit is specifically configured to input the multiple video clips into the pre-trained neural network model to obtain the motion pipelines.
  • A third aspect of the embodiments of the present application provides an electronic device, comprising a processor and a memory connected to each other, where the memory is used to store a computer program, the computer program includes program instructions, and the processor is used to call the program instructions to execute the method described in any one of the foregoing first aspect and its possible implementations.
  • A fourth aspect of the embodiments of the present application provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the method described in any one of the foregoing first aspect and its possible implementations.
  • A fifth aspect of the embodiments of the present application provides a computer-readable storage medium, including instructions that, when run on a computer, cause the computer to execute the method described in any one of the foregoing first aspect and its possible implementations.
  • a sixth aspect of the embodiments of the present application provides a chip including a processor.
  • the processor is used to read and execute the computer program stored in the memory to execute the method in any possible implementation manner of any of the foregoing aspects.
  • Optionally, the chip further includes a memory, and the processor is connected to the memory through a circuit or a wire.
  • the chip further includes a communication interface, and the processor is connected to the communication interface.
  • the communication interface is used to receive data and/or information that needs to be processed, and the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface.
  • the communication interface can be an input and output interface.
  • In the embodiments of the present application, the position information of the target object in at least two video frames and the time information of the at least two video frames are obtained through a pre-trained neural network model, and the tracking trajectory of the target object in the first video is determined according to this information. Since the time information of at least two video frames is output by the neural network model, target tracking does not depend on the target detection result of a single video frame, which reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance.
  • The motion pipeline of the target object is obtained through a pre-trained neural network model, and the tracking trajectory of the target object is obtained by connecting motion pipelines. Since the motion pipeline includes the position information of the target object in at least two video frames, target tracking does not depend on the target detection result of a single video frame, which reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance.
  • In the prior art, the detection algorithm relies on a single frame, so the accuracy of the overall algorithm is limited by the detector; training the detection model and the tracking model step by step makes development costly; and dividing the algorithm into two stages also increases the complexity of the machine learning system.
  • The target tracking method provided in the embodiments of the present application enables end-to-end training and completes the detection and tracking of multiple target objects with one neural network model, which reduces model complexity.
  • The features extracted from a single video frame in the prior art are relatively limited. The target tracking method provided in the embodiments of this application takes video as the raw input, so the model can perform the tracking task using various features such as appearance features, motion trajectory features, or gait features, which improves target tracking performance.
  • The target tracking method provided by the embodiments of the present application uses video as the raw input of the model, and the receptive field in the time dimension is increased, which better captures the motion information of the person.
  • FIG. 1 is a schematic diagram of the main framework of artificial intelligence provided by an embodiment of this application.
  • FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of this application.
  • FIG. 3 is a schematic diagram of a convolutional neural network structure provided by an embodiment of this application.
  • FIG. 4 is a schematic diagram of another convolutional neural network structure provided by an embodiment of this application.
  • FIG. 5 is a schematic diagram of an embodiment of a motion pipeline in an embodiment of this application.
  • FIG. 6 is a schematic diagram of splitting a tracking trajectory into motion pipelines in an embodiment of this application.
  • FIG. 7 is a schematic diagram of an embodiment of a motion pipeline in an embodiment of this application.
  • FIG. 8 is a schematic diagram of another embodiment of a motion pipeline in an embodiment of this application.
  • FIG. 9 is a schematic diagram of the intersection and union of motion pipelines in an embodiment of this application.
  • FIG. 10 is a schematic diagram of an embodiment of a target detection method in an embodiment of this application.
  • FIG. 11 is a schematic diagram of an embodiment of matching between motion pipelines in an embodiment of this application.
  • FIG. 12 is a schematic diagram of an embodiment of a neural network model training method in an embodiment of this application.
  • FIG. 13 is a schematic diagram of a tracking trajectory and motion pipelines in an embodiment of this application.
  • FIG. 14 is a schematic diagram of a motion pipeline output by a neural network model in an embodiment of this application.
  • FIG. 15 is a schematic diagram of another embodiment of a target tracking method in an embodiment of this application.
  • FIG. 16 is a schematic diagram of an embodiment of a target tracking device in an embodiment of this application.
  • FIG. 17 is a schematic diagram of another embodiment of a target tracking device in an embodiment of this application.
  • FIG. 18 is a schematic diagram of another embodiment of a target tracking device in an embodiment of this application.
  • FIG. 19 is a schematic diagram of another embodiment of an electronic device in an embodiment of this application.
  • FIG. 20 is a hardware structure diagram of a chip provided by an embodiment of this application.
  • The embodiments of the present application provide a target tracking method for target tracking in a video, which can reduce tracking errors in scenes with dense targets or heavy occlusion.
  • A moving target in a video refers to a target that, taking the world coordinate system of the actual three-dimensional space as the reference, moves relative to the video capture device during shooting. The target itself may or may not be moving; this is not specifically limited here.
  • The image information of the target object may be directly recorded in a video frame, or part of it may be occluded by other objects.
  • the data displayed in this form is defined as data in a space-time dimension in the embodiment of the present application.
  • The position of the target in a video frame can be determined, in the space-time dimension, by a position in the time dimension and a position in the two-dimensional space: the position in the time dimension determines the video frame, and the position in the two-dimensional space indicates the location of the target within that frame.
  • FIG. 5 is a schematic diagram of an embodiment of the motion pipe in the embodiment of the application.
  • Target tracking needs to determine the position information of the target to be tracked (or target for short) in all video frames containing the target object. The target position in each video frame can be identified by a detection box (bounding box). The detection boxes of the same target object in successive video frames are connected to form the trajectory of the target in the space-time region, that is, the tracking trajectory (also called the motion trajectory). The tracking trajectory not only gives the positions of the target object but also connects those positions across different times; it can therefore indicate the temporal and spatial information of the target object at the same time.
  • FIG. 5 illustrates the position information of the target object in only three video frames; for all the video frames of the video, the tracking trajectory can be obtained in the manner described above.
  • the tracking trajectory also includes the identification (ID) of the target object indicated by the tracking trajectory, and the ID of the target object can be used to distinguish trajectories corresponding to different targets.
  • The motion pipeline is used to indicate the position information of the target in at least two video frames and corresponds to a quadrangular frustum in the space-time dimension. The position of the first bottom surface of the quadrangular frustum in the time dimension is used to indicate the first time information of the first video frame, and the position of the second bottom surface in the time dimension is used to indicate the second time information of the second video frame; the position of the first bottom surface in the two-dimensional space is used to indicate the first position information of the target object in the first video frame, and the position of the second bottom surface in the two-dimensional space is used to indicate the second position information of the target object in the second video frame.
  • The motion pipeline can also be used to indicate the position information of the target in at least three different video frames. The following takes a motion pipeline that includes the position information of the target in three different video frames as an example.
  • Such a motion pipeline can be regarded as a double quadrangular frustum composed of two quadrangular frustums sharing a common bottom surface. The three bottom surfaces of the double-frustum structure are parallel to each other; the direction perpendicular to the bottom surfaces is the time dimension, and the directions in which the bottom surfaces extend are the spatial dimensions. Each bottom surface represents the position of the target in the video frame at the moment corresponding to that bottom surface. A motion pipeline with a double-quadrangular-frustum structure includes: a first bottom surface 601, a second bottom surface 602, and a third bottom surface 603.
  • For the first bottom surface 601, namely rectangle abcd, the position in the two-dimensional space where it lies represents the position information of the target object in the first video frame, and the position of rectangle abcd mapped onto the time dimension represents the time information of the first video frame. Similarly, for the second bottom surface 602, namely rectangle ijkm, the position in the two-dimensional space where it lies represents the position information of the target object in the second video frame, and the position of rectangle ijkm mapped onto the time dimension represents the time information of the second video frame. For the third bottom surface 603, namely rectangle efgh, the position in the two-dimensional space where it lies represents the position information of the target object in the third video frame, and the position of rectangle efgh mapped onto the time dimension represents the time information of the third video frame.
  • If rectangle abcd, rectangle efgh, and rectangle ijkm are mapped onto the two-dimensional space of the same bottom surface, their corresponding locations may differ.
  • The positions of the first bottom surface 601, the second bottom surface 602, and the third bottom surface 603 in the time dimension, that is, the positions a', i', and e' of points a, i, and e mapped onto the time dimension, respectively indicate the time information of the first video frame, the second video frame, and the third video frame.
  • The length of the motion pipeline is the interval between the position of the second bottom surface mapped onto the time dimension and the position of the third bottom surface mapped onto the time dimension; it indicates the number of all video frames between the second bottom surface and the third bottom surface in the time sequence of the video.
  • the motion pipeline corresponding to the first video frame includes at least the position information of the target in the first video frame.
  • A tracking trajectory can be split into multiple motion pipelines, as shown in FIG. 6. The tracking trajectory is first split into position boxes of single video frames. Each position box serves as the common bottom surface of a double-quadrangular-frustum structure, like the first bottom surface 601 in FIG. 6, and extends forward and backward along the tracking trajectory to determine the other two bottom surfaces of the double frustum, namely the second bottom surface 602 and the third bottom surface 603, thus obtaining a double quadrangular frustum with a common bottom surface, that is, the motion pipeline corresponding to that single video frame. For the starting video frame of the trajectory, the forward extension is 0; for the last video frame, the backward extension is 0; the motion pipelines corresponding to the starting and last video frames therefore degenerate into single quadrangular frustums. The length of a motion pipeline is defined as the number of video frames corresponding to the motion pipeline; as shown in FIG. 6, the total number of video frames between the frame corresponding to the second bottom surface 602 and the frame corresponding to the third bottom surface 603 is the length of the motion pipeline.
  • The motion pipeline in the embodiment of this application is represented by a specific data format. Refer to FIG. 7 and FIG. 8, which are two schematic diagrams of the data format of the motion pipeline in the embodiment of this application.
  • The first data format includes 3 values in the time dimension, (t_s, t_m, t_e), and 12 values in the space dimension, for a total of 15 values. The location of the target in space is determined by 4 values per video frame; for example, the target location area B_s is determined by 4 such values.
  • The motion pipeline output by the neural network model can be represented in a second data format, defined for the motion pipeline of video frame m. B_m is the detection box corresponding to the target in the common bottom surface, that is, a partial image region of video frame m at the corresponding time, and P is a pixel in B_m whose location is identified by two values. In the time dimension, two values, d_s and d_e, determine how far the motion pipeline extends backward to its start and forward to its end, respectively.
  • The four values l_m, b_m, t_m, and r_m indicate the offsets of the boundary of region B_m relative to point P, with P as the reference point (regress values for B_m).
  • The four values l_s, b_s, t_s, and r_s indicate the offsets of the boundary of region B_s relative to the boundary of region B_m (regress values for B_s); similarly, the four values l_e, b_e, t_e, and r_e indicate the offsets of the boundary of region B_e relative to the boundary of region B_m (regress values for B_e).
  • Both data formats represent a single motion pipeline with 15 values, and the two data formats can be converted into each other, as sketched below.
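  • The first data format maps directly onto a small container; the decoding of the second (network-output) format below follows the offsets described above, but the exact sign and axis conventions are assumptions of this sketch:

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]      # (x_min, y_min, x_max, y_max)
Offsets = Tuple[float, float, float, float]  # (l, b, t, r)

@dataclass
class MotionTube:
    """First data format: 3 time values plus 3 boxes x 4 values = 15 values."""
    t_s: int   # start frame
    t_m: int   # middle frame (common bottom surface)
    t_e: int   # end frame
    b_s: Box   # target box at t_s
    b_m: Box   # target box at t_m
    b_e: Box   # target box at t_e

def decode_second_format(m: int, p: Tuple[float, float], d_s: int, d_e: int,
                         reg_m: Offsets, reg_s: Offsets, reg_e: Offsets) -> MotionTube:
    """Decode the network-output format into the first format. The image
    y-axis is taken to grow downward and (l, b, t, r) to mean distances to
    the left/bottom/top/right boundaries; these conventions are assumptions."""
    px, py = p
    l, b, t, r = reg_m
    b_m: Box = (px - l, py - t, px + r, py + b)

    def shifted(box: Box, off: Offsets) -> Box:
        # Offsets of the B_s / B_e boundaries relative to the B_m boundaries.
        dl, db, dt, dr = off
        x0, y0, x1, y1 = box
        return (x0 + dl, y0 + dt, x1 + dr, y1 + db)

    return MotionTube(m - d_s, m, m + d_e,
                      shifted(b_m, reg_s), b_m, shifted(b_m, reg_e))
```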
  • Intersection over union (IoU) is usually used in object detection to measure the degree of overlap between two locations. In the embodiment of this application, IoU is extended to the three-dimensional space of the space-time dimension to measure the degree of overlap of two motion pipelines in the space-time dimension:
  • IoU(T^(1), T^(2)) = ∩(T^(1), T^(2)) / ∪(T^(1), T^(2))
  • where T^(1) denotes motion pipeline 1, T^(2) denotes motion pipeline 2, ∩(T^(1), T^(2)) denotes the intersection of the two motion pipelines, and ∪(T^(1), T^(2)) denotes the union of the two motion pipelines.
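  • One straightforward way to evaluate this ratio, under the assumption that a motion pipeline is discretised into a per-frame box for every video frame it covers (the intersection and union "volumes" are then sums of box areas over frames):

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
Tube = Dict[int, Box]                     # frame index -> target box in that frame

def tube_iou(t1: Tube, t2: Tube) -> float:
    """IoU(T1, T2) = intersection volume / union volume in space-time,
    accumulated frame by frame."""
    inter = union = 0.0
    for frame in set(t1) | set(t2):
        a, b = t1.get(frame), t2.get(frame)
        if a is not None and b is not None:
            iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
            ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
            i = iw * ih
            area_a = (a[2] - a[0]) * (a[3] - a[1])
            area_b = (b[2] - b[0]) * (b[3] - b[1])
            inter += i
            union += area_a + area_b - i
        else:
            box = a if a is not None else b
            union += (box[2] - box[0]) * (box[3] - box[1])
    return inter / union if union > 0.0 else 0.0

# Two pipelines sharing frame 1 with half-overlapping unit boxes.
t_a: Tube = {0: (0, 0, 1, 1), 1: (0, 0, 1, 1)}
t_b: Tube = {1: (0.5, 0, 1.5, 1), 2: (1, 0, 2, 1)}
print(tube_iou(t_a, t_b))   # 0.5 / 3.5 = 0.1428...
```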
  • FIG. 1 shows a schematic diagram of the main framework of artificial intelligence, which describes the overall workflow of an artificial intelligence system and is applicable to general requirements in the artificial intelligence field.
  • The "intelligent information chain" reflects a series of processes from data acquisition to processing: for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes the condensation process of "data - information - knowledge - wisdom".
  • the "IT value chain” is the industrial ecological process from the underlying infrastructure of human intelligence and information (providing and processing technology realization) to the system, reflecting the value that artificial intelligence brings to the information technology industry.
  • The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through a basic platform. The basic platform includes distributed computing frameworks, networks, and related platform guarantees and support, which can include cloud storage and computing, interconnection networks, and so on. Sensors communicate with the outside to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • The data at the layer above the infrastructure represents the data sources in the field of artificial intelligence. The data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, training, etc.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to perform machine thinking and solve problems according to reasoning control strategies; the typical functions are search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, and usually provides functions such as classification, ranking, and prediction.
  • Based on the results of data processing, some general capabilities can be formed, such as algorithms or general systems, for example translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productizing intelligent information decision-making and realizing practical applications. The main application fields include intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, autonomous driving, safe city, and smart terminals.
  • the motion pipeline of the target object is obtained through a deep neural network.
  • The embodiment of this application provides a system architecture 200.
  • the data collection device 260 is used to collect the video data of the moving target and store it in the database 230.
  • the training device 220 generates a target model/rule 201 based on the video samples containing the moving target maintained in the database 230.
  • the following will describe in more detail how the training device 220 obtains the target model/rule 201 based on the video samples of the moving target.
  • the target model/rule 201 can be used in application scenarios such as single target tracking, multiple target tracking, and virtual reality.
  • training may be performed based on video samples of the moving target.
  • various video samples containing the moving target may be collected by the data collection device 260 and stored in the database 230.
  • video data can be obtained directly from commonly used databases.
  • the target model/rule 201 may be obtained based on a deep neural network, and the deep neural network will be introduced below.
  • The work of each layer in the deep neural network can be described by the mathematical expression y = a(W·x + b). At the physical level, the work of each layer can be understood as completing the transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors). These five operations are: 1. raising/lowering the dimension; 2. enlarging/reducing; 3. rotating; 4. translating; 5. "bending". Operations 1, 2, and 3 are completed by W·x, operation 4 is completed by +b, and operation 5 is realized by a(). The word "space" is used here because the object to be classified is not a single thing but a class of things, and space refers to the collection of all individuals of this class of things.
  • W is a weight vector, and each value in the vector represents the weight value of a neuron in the layer of neural network.
  • This vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed.
  • The purpose of training a deep neural network is ultimately to obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
  • During training, the predicted value of the network is compared with the truly desired target value, and the weight vectors of each layer of the network are updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the neural network can predict the truly desired target value. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value".
  • This leads to the loss function (loss function) or objective function (objective function), which are important equations used to measure the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
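  • As a minimal sketch of the above (NumPy is used purely for illustration; the tanh activation and the mean-squared-error loss are illustrative choices, not the ones prescribed by this application):

```python
import numpy as np

def layer(x, W, b):
    """One layer y = a(W*x + b): W*x performs the dimension change, scaling and
    rotation; +b performs the translation; tanh() performs the "bending"."""
    return np.tanh(W @ x + b)

def loss(pred, target):
    """The higher the output, the greater the predicted-vs-target difference."""
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # input vector
W = rng.normal(size=(4, 3))      # weight matrix: 3-dim input -> 4-dim output
b = rng.normal(size=4)           # bias vector
y = layer(x, W, b)
print(loss(y, np.zeros(4)))      # training would adjust W and b to reduce this
```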
  • the target model/rule obtained by the training device 220 can be applied to different systems or devices.
  • the execution device 210 is configured with an I/O interface 212 to perform data interaction with external devices.
  • the "user" can input data to the I/O interface 212 through the client device 240.
  • the execution device 210 can call data, codes, etc. in the data storage system 250, and can also store data, instructions, etc. in the data storage system 250.
  • the calculation module 211 uses the target model/rule 201 to process the input data. Taking target tracking as an example, the calculation module 211 can analyze the input video to obtain features indicating target location information in the video frame.
  • The correlation function modules 213 and 214 may preprocess the image data in the calculation module 211, for example, perform video preprocessing, including video segmentation.
  • the I/O interface 212 returns the processing result to the client device 240 and provides it to the user.
  • the training device 220 can generate corresponding target models/rules 201 based on different data for different targets, so as to provide users with better results.
  • The user can manually specify the data input into the execution device 210, for example, by operating in the interface provided by the I/O interface 212.
  • the client device 240 can automatically input data to the I/O interface 212 and obtain the result. If the client device 240 automatically inputs data and needs the user's authorization, the user can set the corresponding authority in the client device 240.
  • the user can view the result output by the execution device 210 on the client device 240, and the specific presentation form may be a specific manner such as display, sound, and action.
  • the client device 240 can also serve as a data collection terminal to store the collected training data in the database 230.
  • Fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 250 is an external memory relative to the execution device 210. In other cases, the data storage system 250 may also be placed in the execution device 210.
  • the deep neural network used to extract the motion pipeline from the video in the embodiment of the application may be a convolutional neural network (convolutional neural network, CNN), for example.
  • CNN is a deep neural network with a convolutional structure. It is a deep learning architecture.
  • the deep learning architecture refers to the use of machine learning algorithms to perform multiple levels of learning at different levels of abstraction.
  • CNN is a feed-forward artificial neural network. Taking image processing as an example, each neuron in the feed-forward artificial neural network responds to overlapping areas in the image input into it.
  • The deep neural network may also be of another type; this application does not limit the type of the deep neural network.
  • a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120, where the pooling layer is optional, and a neural network layer 130.
  • The convolutional layer/pooling layer 120 may include layers 121-126 as shown in the example.
  • In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, and layer 124 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a convolutional layer.
  • That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolutional layer 121 can include many convolution operators.
  • the convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator can be a weight matrix, which is usually predefined. In the process of performing convolution on an image, the weight matrix usually moves along the horizontal direction of the input image one pixel after another (or two pixels after two pixels, depending on the value of the stride) to complete the work of extracting specific features from the image.
  • the convolution kernel also has multiple formats.
  • Commonly used convolution kernels include two-dimensional convolution kernels and three-dimensional convolution kernels. Two-dimensional convolution kernels are mainly used to process two-dimensional image data, while three-dimensional convolution kernels can be applied to video processing, stereoscopic image processing, etc. due to the increased depth or time dimension.
  • To extract the information in the time dimension and the space dimension of a video through the neural network model, a three-dimensional convolution kernel is used to perform the convolution operation in the time dimension and the space dimension simultaneously.
  • Therefore, the three-dimensional convolutional neural network can not only obtain the features of each video frame, but also express the association and change of video frames over time.
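  • The following sketch illustrates this (PyTorch is an assumed framework here, not one named by this application): a three-dimensional convolution slides over time and space at once, so the per-frame output features already mix information from neighbouring frames.

```python
import torch
import torch.nn as nn

video = torch.randn(1, 3, 8, 64, 64)   # (batch, RGB channels, t=8 frames, h, w)
conv3d = nn.Conv3d(in_channels=3, out_channels=16,
                   kernel_size=(3, 3, 3), padding=1)  # 3 frames x 3x3 pixels
features = conv3d(video)
print(features.shape)  # torch.Size([1, 16, 8, 64, 64]): per-frame features
                       # that already encode change across adjacent frames
```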
  • The initial convolutional layers (such as 121) often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network increases, the features extracted by the later convolutional layers (for example, 126) become more and more complex, such as high-level semantic features, and features with higher semantics are more suitable for the problem to be solved.
  • multiple convolutional layers can be referred to as a block.
  • Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. That is, in the layers 121-126 illustrated by 120 in Figure 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • The sole purpose of the pooling layer is to reduce the spatial size of the image.
  • the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 3) and an output layer 140.
  • The parameters contained in the multiple hidden layers can be obtained through pre-training based on relevant training data of specific task types.
  • the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the neural network layer 130 comes the output layer 140; that is, the final layer of the entire convolutional neural network 100 is the output layer 140.
  • the convolutional neural network 100 shown in FIG. 3 is only used as an example of a convolutional neural network.
  • The convolutional neural network may also exist in the form of other network models, for example, the network shown in FIG. 4, in which multiple convolutional layers/pooling layers are in parallel and the respectively extracted features are all input to the neural network layer 130 for processing.
  • the deep neural network used to extract the motion pipeline from the video in the embodiment of the present application is a combination of a residual neural network and a feature pyramid network.
  • the residual neural network makes the deeper network easier to train by letting the deep network learn the residual representation.
  • Residual learning solves the problems of gradient disappearance and gradient explosion in deep networks.
  • the feature pyramid network detects targets of corresponding scales on feature maps of different resolutions. The output of each layer is obtained by fusing the feature maps of the current layer and higher layers, so each layer of feature maps output has sufficient feature expression ability.
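  • An illustrative residual block (a sketch in PyTorch under the same assumption as above; channel counts are arbitrary): the identity shortcut lets the block learn a residual F(x) and output F(x) + x, which is what makes deeper networks easier to train.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Residual block: output = ReLU(x + F(x)); gradients can flow through the
    identity shortcut, mitigating gradient disappearance in deep networks."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # the learned residual F(x)
        return self.relu(x + f)                   # identity shortcut

out = ResidualBlock3D(16)(torch.randn(1, 16, 8, 32, 32))  # shape preserved
```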
  • The target tracking method provided by the embodiments of the application can be applied in a wide range of target tracking scenarios, such as auto-focus during video shooting.
  • The target tracking algorithm can help the photographer select the focus more conveniently and accurately, or flexibly switch the focus to track the target, which is especially important in shooting sports events and wild animals.
  • In the field of security, the multi-target tracking algorithm can automatically complete the position tracking of selected target objects to facilitate the search for established targets, which is of great significance.
  • In the field of automatic driving, the multi-target tracking algorithm can grasp the trajectories and movement trends of surrounding pedestrians and vehicles, and provide initial information for functions such as automatic driving path planning and automatic obstacle avoidance.
  • somatosensory games, gesture recognition, and finger tracking can also be achieved through multi-target tracking technology.
  • the usual target tracking method includes detection and tracking.
  • The detection module detects the targets appearing in each video frame, and the targets appearing in different video frames are then matched: the features of each target in a single video frame are extracted, the targets are matched through similarity comparison of the features, and the tracking trajectory of each target object is obtained. Because this type of target tracking method uses the technical means of detecting first and tracking second, the target tracking effect depends on the single-frame detection algorithm; if the target is occluded, detection errors occur, which in turn lead to tracking errors. Therefore, the performance is insufficient in scenes where targets are dense or occlusions are frequent.
  • The embodiment of the application adopts a target tracking method that inputs a video into a pre-trained neural network model, outputs multiple motion pipelines, and restores the tracking trajectories corresponding to one or more target objects by matching the multiple motion pipelines.
  • In this way, target tracking does not depend on the target detection result of a single video frame, which can reduce detection failures in scenes with dense targets or frequent occlusions and improve target tracking performance.
  • Conventional target tracking methods rely on single-frame detection algorithms: the accuracy of the overall algorithm is limited by the detector, the development cost of training detection models and tracking models step by step is high, and dividing the algorithm into two stages also increases the complexity of the machine learning system.
  • the target tracking method provided in the embodiments of the present application can realize end-to-end training, and complete the detection and tracking tasks of multi-target objects through a neural network model, which can reduce the complexity of the model.
  • In the prior art, the features extracted based on a single video frame are relatively limited.
  • The target tracking method provided in the embodiments of the present application uses video as the original input, and the model can realize the tracking task through various features, such as appearance features, motion trajectory features, or gait features, which can improve target tracking performance.
  • In addition, because the target tracking method provided by the embodiment of the present application uses video as the original input of the model, the receptive field in the time dimension is increased, which can better capture the movement information of the person.
  • FIG. 10 is a schematic diagram of an embodiment of the target tracking method in the embodiment of the present application.
  • the target tracking device can preprocess the acquired video.
  • the preprocessing includes one or more of the following: dividing the video into segments of preset length, adjusting the video resolution, and adjusting and normalizing the color space .
  • When the video is long, considering the data processing capability of the target tracking device, the video may be divided into 8 small segments.
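  • A minimal preprocessing sketch (OpenCV/NumPy are assumed tools; the segment length and target resolution are illustrative values, not ones fixed by this application):

```python
import cv2
import numpy as np

def preprocess(frames, seg_len=8, size=(256, 256)):
    """frames: list of HxWx3 uint8 RGB arrays -> list of normalized segments."""
    resized = [cv2.resize(f, size) for f in frames]        # adjust resolution
    arr = np.stack(resized).astype(np.float32) / 255.0     # normalize colors
    n = (len(arr) // seg_len) * seg_len                    # drop incomplete tail
    return np.split(arr[:n], n // seg_len) if n else []    # fixed-length segments
```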
  • step 1001 is an optional step and may or may not be executed.
  • the video is input to the pre-trained neural network model, and the position information of the target object in the at least two video frames and the time information of the at least two video frames are obtained.
  • the video is input to a pre-trained neural network model to obtain the motion pipeline of each target object.
  • the motion pipeline is used to indicate the time information and position information of the target object in at least two video frames of the first video.
  • The data format of the output motion pipeline is the type shown in Figure 8: the input video is a tensor in R^(t×h×w×3), where h×w represents the video resolution and 3 represents the RGB color channels, and the output is the motion pipeline tensor O, O ∈ R^(t×h'×w'×15), where R represents the real number domain, t represents the number of frames of the video, and h'×w' represents the resolution of the feature map output by the neural network. That is, t×h'×w' motion pipelines are output, and each video frame corresponds to h'×w' motion pipelines.
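  • The shapes involved can be illustrated as follows (a NumPy sketch; only the tensor shapes come from the description above, while the meaning of the 15 coordinates follows the double-frustum encoding of Figure 8):

```python
import numpy as np

t, h_p, w_p = 8, 16, 16                  # illustrative t, h', w'
O = np.random.rand(t, h_p, w_p, 15)      # stand-in for the network output

tubes = O.reshape(-1, 15)                # t*h'*w' candidate motion pipelines
per_frame = O[3].reshape(-1, 15)         # the h'*w' pipelines of frame 3
print(tubes.shape, per_frame.shape)      # (2048, 15) (256, 15)
```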
  • the pre-trained neural network model is used to obtain the category information of the target object; specifically, the pre-trained neural network model is used to obtain the confidence level of the motion pipeline, which can be used to determine The category information of the target object corresponding to the motion pipeline.
  • each motion pipeline corresponds to a target to be tracked
  • the confidence of the motion pipeline refers to the possibility that the target corresponding to each motion pipeline belongs to the preset category.
  • The category of the target object to be tracked in the video is, for example, a person, a vehicle, or a dog.
  • The confidence of the output motion pipeline represents the probability that the target corresponding to the motion pipeline belongs to a preset category, and the confidence is a value between 0 and 1: the smaller the confidence, the less likely the target belongs to the preset category, and the larger the confidence, the more likely it belongs to the preset category.
  • The number of confidence levels of each motion pipeline is equal to the number of preset target object categories, and each confidence level corresponds to the possibility that the motion pipeline belongs to that category.
  • the confidence of the motion pipeline output by the neural network model constitutes the confidence table.
  • Example 1: The preset category of the target object is "person". Since there is only one preset category, there are two possibilities for the target object category: "person" or "background".
  • The background refers to the image area that does not contain the target object to be tracked.
  • Suppose the confidence levels of the first motion pipeline for the categories "person" and "background" are 0.1 and 0.9, respectively, and those of the second motion pipeline are 0.7 and 0.3.
  • The confidence threshold can be set to 0.5. The confidence that the target object corresponding to the first motion pipeline belongs to "person" is 0.1, which is less than or equal to 0.5, meaning this target has a low probability of being a person, while its "background" confidence of 0.9 is greater than 0.5, that is, it more likely belongs to the background. The confidence that the target object corresponding to the second motion pipeline belongs to "person" is 0.7, which is greater than 0.5, meaning this target has a high probability of being a person, while its "background" confidence of 0.3 is less than 0.5, so it is less likely to belong to the background.
  • Example 2: The preset categories of the target object are "person", "vehicle", and "background". The confidence levels of the first motion pipeline are 0.4, 0.1, and 0.2, and those of the second motion pipeline are 0.2, 0.8, and 0.1. There are three possibilities for the category of the target object ("person", "vehicle", or "background"), so 1/3 ≈ 0.33 can be used as the confidence threshold. Since 0.4 is greater than 0.33, the category with the highest confidence for the first motion pipeline is "person", that is, the corresponding target object has a high probability of being a person. Similarly, the category with the highest confidence for the second motion pipeline is "vehicle", that is, the corresponding target object has a high probability of being a vehicle.
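  • The category decision in Examples 1 and 2 can be sketched as follows (plain Python; the function and label names are illustrative, not part of this application):

```python
def classify(confidences, classes, threshold):
    """Return the class with the highest confidence if it clears the threshold."""
    best = max(range(len(classes)), key=lambda i: confidences[i])
    return classes[best] if confidences[best] >= threshold else None

labels = ["person", "vehicle", "background"]
print(classify([0.4, 0.1, 0.2], labels, 1 / 3))  # -> "person"  (Example 2)
print(classify([0.2, 0.8, 0.1], labels, 1 / 3))  # -> "vehicle" (Example 2)
```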
  • Before the tracking trajectory of the target object is acquired according to the motion pipelines, the motion pipelines can also be deleted to obtain the deleted motion pipelines, and the deleted motion pipelines are used to obtain the tracking trajectory of the target object.
  • the multiple motion pipelines output by the neural network model can be deleted according to preset conditions.
  • Each pixel in each video frame corresponds to a motion pipeline, and a target appearing in a video frame usually occupies multiple pixel positions, so multiple motion pipelines may indicate the same target object.
  • The category to which the target corresponding to each motion pipeline belongs can be determined according to the confidence level, and the motion pipelines of each category are deleted separately.
  • Obtaining the deleted motion pipelines specifically includes: if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, deleting the motion pipeline with the lower confidence among the first motion pipeline and the second motion pipeline.
  • the repetition rate of the motion pipeline may be the IoU between the two motion pipelines.
  • the first threshold value ranges from 0.3 to 0.7.
  • For example, the first threshold is 0.5: if the IoU between the first motion pipeline and the second motion pipeline is greater than or equal to 50%, the motion pipeline with the lower confidence is deleted.
  • The motion pipelines can be deleted according to a non-maximum suppression (NMS) algorithm to obtain the deleted motion pipelines. For example, the IoU threshold of the motion pipelines is set to 0.5, the NMS algorithm is used to delete motion pipelines, and only one corresponding motion pipeline is reserved for each target in each video frame.
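  • A minimal NMS sketch over motion pipelines (plain Python; tube_iou is an assumed helper computing the spatio-temporal IoU between two pipelines):

```python
def nms_tubes(tubes, scores, tube_iou, iou_thresh=0.5):
    """Keep the highest-confidence pipeline, drop those overlapping it by more
    than iou_thresh, and repeat; returns indices of the retained pipelines."""
    order = sorted(range(len(tubes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if tube_iou(tubes[best], tubes[i]) < iou_thresh]
    return keep
```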
  • Each pixel in each video frame corresponds to a motion pipeline, and the pixel positions in background areas that do not correspond to a target object also correspond to some motion pipelines.
  • These motion pipelines can be understood as fake motion pipelines, and their confidence is usually low.
  • Therefore, motion pipelines with low confidence can be deleted.
  • The confidence of any one of the motion pipelines remaining after deletion is greater than or equal to a second threshold; that is, the preset condition is that motion pipelines whose confidence is less than the second threshold are deleted, and the second threshold is related to the number of preset categories of the target object.
  • For example, if the number of categories of the target object is 1, the second threshold is usually between 0.3 and 0.7, for example, 0.5; if the number of categories of the target object is 10, the second threshold is usually between 0.07 and 0.13, for example, 0.1.
  • step 1003 is an optional step, which may or may not be performed.
  • Acquire the tracking trajectory of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames.
  • A motion pipeline is used to indicate the position information of the target object in the at least two video frames and the time information of the at least two video frames; therefore, the tracking trajectory of the target object in the first video can be obtained according to the motion pipelines, that is, based on the position information and time information of the at least two video frames indicated by the motion pipelines.
  • Each tracking trajectory of the target object is formed by connecting, in the space-time dimension, the quadrangular frustums corresponding to the motion pipelines.
  • Obtaining the tracking trajectory of the target object according to the motion pipeline specifically includes: connecting a third motion pipeline and a fourth motion pipeline that meet a preset condition in the motion pipeline to obtain the tracking trajectory of the target object.
  • the specific content of the preset condition includes multiple types.
  • The preset condition includes one or more of the following: the intersection-over-union between the sections of the third motion pipeline and the fourth motion pipeline that overlap in the time dimension is greater than or equal to a third threshold; the cosine of the angle between the movement direction of the third motion pipeline and the movement direction of the fourth motion pipeline is greater than or equal to a fourth threshold, where the movement direction is a vector indicating, according to preset rules, the position change of the target object in the motion pipeline in the space-time dimension; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to a fifth threshold, where the distance includes the Euclidean distance.
  • the intersection ratio between the two motion pipelines corresponding to the overlapping parts of the time dimension is greater than or equal to the third threshold
  • the cosine of the angle between the motion directions of the motion pipelines is greater than or equal to the fourth threshold
  • the distance index between the neural network feature vectors of the motion pipeline is less than or equal to the fifth threshold
  • the distance index may be, for example, Euclidean distance.
  • the neural network feature vector of the motion pipeline can be the output feature vector of any layer in the neural network model.
  • For example, the neural network feature vector of the motion pipeline is the output feature vector of the last layer of the three-dimensional (3D) convolutional neural network in the neural network model.
  • The movement direction of the motion pipeline is a vector indicating the position change of the target object between the two bottom surfaces of the motion pipeline in the space-time dimension, representing the moving speed and direction of the target object. It can be understood that the position change of the target object in a video is usually continuous and does not change suddenly; therefore, the movement directions of adjacent motion pipeline sections in a tracking trajectory are relatively close, and during the connection of motion pipelines, the connection can also be made according to the similarity of the movement directions. It should be noted that the movement direction of the motion pipeline can be determined according to preset rules.
  • For example, in the space-time dimension, the vector of the position change of the target object between the two bottom surfaces of the motion pipeline that are farthest apart in the time dimension (for example, Bs and Be of the motion pipeline shown in Figure 8) can be set as the movement direction of the motion pipeline; or the vector of the position change of the target object between two adjacent bottom surfaces of the motion pipeline (for example, Bm and Be of the motion pipeline shown in Figure 8) can be set as the movement direction; or the direction of the position change of the target object between a preset number of video frames can be set as the movement direction of the motion pipeline, where the preset number is, for example, 5 frames.
  • The direction of the tracking trajectory can be defined as follows: at the end of the trajectory, the direction of the position change of the target object between a preset number of video frames is taken as the movement direction, or the movement direction of the last motion pipeline at the end of the trajectory is used. It is understandable that the movement direction of a motion pipeline is generally defined as pointing from an earlier moment to a later moment in the time dimension.
  • the value of the third threshold is not limited, usually 70% to 95%, such as 75%, 80%, 85% or 90%, etc.
  • The value of the fourth threshold is not limited, and is usually between cos(π/6) and cos(π/36), for example, cos(π/9), cos(π/12), or cos(π/18).
  • the value of the fifth threshold can be determined according to the size of the feature vector, and the specific value is not limited.
  • The following takes as an example the preset conditions that the intersection-over-union between the sections of the two motion pipelines overlapping in the time dimension is greater than or equal to the third threshold and that the cosine of the angle between the movement directions of the motion pipelines is greater than or equal to the fourth threshold.
  • FIG. 11 Please refer to FIG. 11 for a schematic diagram of an embodiment of the matching between the motion pipes in the embodiment of the application.
  • Example 1: As shown in part a of Fig. 11, if the intersection-over-union between the motion pipeline sections that overlap in the time dimension is greater than or equal to the third threshold, and the cosine of the angle between the movement directions of the two motion pipelines is greater than or equal to the fourth threshold, that is, both the degree of coincidence and the movement direction match, then the two motion pipelines are matched successfully.
  • the degree of coincidence between two motion pipes refers to the IoU between the motion pipe sections of the overlapping portion of the two motion pipes in the time dimension.
  • Example 2: As shown in part b of Fig. 11, if the cosine of the angle between the movement directions of the two motion pipelines is less than the fourth threshold, that is, the movement directions do not match, the matching of the two motion pipelines is unsuccessful.
  • Example 3: As shown in part c of Fig. 11, if the intersection-over-union between the motion pipeline sections overlapping in the time dimension is less than the third threshold, that is, the degree of coincidence does not match, the matching of the two motion pipelines is unsuccessful.
  • When the two motion pipelines being matched overlap in the time dimension, there are two pieces of position information for the same target object in the video frames corresponding to the overlapping part. The position of the target object in these video frames can be determined by averaging the two, or a certain motion pipeline specified according to a preset rule shall prevail, for example, taking the time dimension coordinates of the video frame corresponding to the common bottom surface as the standard.
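  • The two matching tests illustrated in Figure 11 can be sketched as follows (NumPy assumed; the pipeline representation and the overlap_iou helper are illustrative assumptions, not structures defined by this application):

```python
import numpy as np

def directions_match(d1, d2, cos_thresh=np.cos(np.pi / 12)):
    """Cosine of the angle between two movement-direction vectors."""
    cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
    return cos >= cos_thresh

def tubes_match(tube_a, tube_b, overlap_iou, iou_thresh=0.8):
    """overlap_iou is an assumed helper returning the IoU of the sections where
    the two pipelines overlap in the time dimension (the third-threshold test)."""
    return (overlap_iou(tube_a, tube_b) >= iou_thresh            # coincidence test
            and directions_match(tube_a["dir"], tube_b["dir"]))  # direction test
```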
  • In the matching process of connecting all the motion pipelines of the video, the greedy algorithm can be used to connect through a series of locally optimal choices; the Hungarian algorithm can also be used for globally optimal matching.
  • Connecting motion pipelines according to the greedy algorithm specifically includes: calculating the affinity between the two sets of motion pipelines to be matched (the affinity is defined as IoU × cos(θ), where θ is the angle between the movement directions) to form the affinity matrix.
  • Based on the affinity matrix, the matching motion pipeline pairs (Btube pairs) are selected cyclically starting from the maximum affinity until the matching is completed.
  • Connecting motion pipelines according to the Hungarian algorithm specifically includes: likewise, after obtaining the affinity matrix, using the Hungarian algorithm to select the pairs of motion pipelines.
  • The motion pipelines starting from the i-th frame are sequentially connected with the tracking trajectory set, where i is a positive integer greater than 2 and less than t, and t is the total number of frames of the video. If the preset conditions are met, the matching succeeds and the tracking trajectory is updated according to the motion pipeline; if the matching is unsuccessful, the motion pipeline is newly added to the tracking trajectory set as an initial tracking trajectory.
  • This embodiment adopts the greedy algorithm to sequentially connect pipelines and trajectories starting from the maximum affinity.
  • The motion pipelines starting from the first frame are the first group;
  • the motion pipelines starting from the second frame are the second group;
  • the motion pipelines starting from the i-th frame are the i-th group.
  • For example, the first group includes 10 motion pipelines,
  • the second group includes 8 motion pipelines,
  • and the third group includes 13 motion pipelines.
  • The motion pipelines in the second group are connected with the initial tracking trajectories; if the connection conditions are met, the tracking trajectories are updated, and if the connection conditions are not met, the original initial tracking trajectories are retained.
  • The tracking trajectory set then includes 8 updated tracking trajectories, while the other two tracking trajectories remain unchanged.
  • If, for example, three motion pipelines in the next group are not used to update any tracking trajectory, these three motion pipelines can be used as new initial tracking trajectories, that is, three new tracking trajectories are added to the tracking trajectory set.
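  • A greedy connection sketch under these definitions (NumPy assumed; affinity(trajectory, pipeline) = IoU × cos(θ) is supplied by the caller, and the list-based trajectory/pipeline representation is an illustrative assumption):

```python
import numpy as np

def greedy_connect(trajectories, tubes, affinity):
    """Connect pipelines to trajectories, always taking the best pair first."""
    A = np.array([[affinity(tr, tb) for tb in tubes] for tr in trajectories])
    unmatched = set(range(len(tubes)))
    while A.size and A.max() > 0:
        r, c = np.unravel_index(np.argmax(A), A.shape)
        trajectories[r].extend(tubes[c])   # update trajectory with this pipeline
        unmatched.discard(c)
        A[r, :] = -1                       # one pipeline per trajectory this round
        A[:, c] = -1
    for c in unmatched:                    # unmatched pipelines start new tracks
        trajectories.append(list(tubes[c]))
    return trajectories
```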
  • When there are multiple target categories, the target category to which the target corresponding to each motion pipeline belongs is determined according to the confidence table of the motion pipelines, and the motion pipelines of different target categories are connected separately to obtain the tracking trajectory of the target object of each target category.
  • If the target object is occluded in some video frames, the spatial position of the occluded part can be obtained by interpolating the motion pipelines.
  • The tracking trajectory is processed into bounding boxes superimposed on the original video and output to the display, completing the real-time tracking deployment and achieving target tracking.
  • the target tracking method provided in the embodiment of the present application designs a pre-trained neural network model, and the training method of the neural network model is introduced below.
  • FIG. 12 is a schematic diagram of an embodiment of a neural network model training method in an embodiment of the application.
  • Training preparations include building a training hardware environment, building a network model, and setting training parameters.
  • the video samples in the data set can also be processed to increase the diversity of data distribution and obtain better model generalization capabilities.
  • The processing of the video includes resolution scaling, whitening of the color space, random HSL color jitter (HSL is a color space, or color representation method, composed of hue (H), saturation (S), and lightness (L)), random horizontal flipping of video frames, etc.
  • Set the training parameters, including the batch size, learning rate, optimizer model, etc. For example, the batch size is 32, and the learning rate starts from 10^(-3) and, when the loss is stable, is reduced by a factor of 5 for better convergence. After 25K training iterations, the network basically converges. To increase the generalization ability of the model, a second-order regularization loss of 10^(-5) is used, and its momentum coefficient is 0.9.
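  • A training-setup sketch mirroring these parameters (PyTorch assumed; the model and dataset below are stand-ins, not the network of this application):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(10, 1)                    # stand-in for the real network
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,      # lr starts at 1e-3
                            momentum=0.9, weight_decay=1e-5)  # L2 term, momentum
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.2)        # reduce lr 5x when loss is stable
loader = DataLoader(dataset, batch_size=32, shuffle=True)     # batch size 32
```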
  • Split the tracking trajectory according to the preset pipeline length, that is, set the interval between the three bottom surfaces in the double quadrangular frustum structure.
  • For example, the interval between the common bottom surface and each of the other two bottom surfaces is 4, and the length of the motion pipeline is 8.
  • During splitting, the length of the motion pipeline in the time dimension is extended as much as possible, and the structure that is longest in the time dimension serves as the final expanded structure.
  • As shown in Figure 13, since the structure of the motion pipeline (Btube) is linear while the ground truth is non-linear, a long motion pipeline often cannot fit the motion trajectory well; that is, as the length increases, the IoU with the ground truth decreases below the threshold, while motion pipelines whose IoU stays above the threshold are usually shorter.
  • Therefore, the longest motion pipeline that meets the minimum IoU threshold is used as the split motion pipeline, which can better fit the original trajectory while expanding the time receptive field.
  • the overlapping part of the motion pipes can be used for connection matching between the motion pipes.
  • the tracking trajectories of all target objects in the video sample are split to obtain the true values of multiple motion pipelines.
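  • The splitting can be sketched as follows (plain Python; a pipeline length of 8 with the common bottom surface at interval 4 gives consecutive pipelines a half-length overlap, which is what later makes connection matching possible):

```python
def split_trajectory(boxes, length=8, stride=4):
    """boxes: the per-frame bounding boxes of one target, in frame order.
    Returns overlapping fixed-length segments as ground-truth pipelines."""
    return [boxes[s:s + length]
            for s in range(0, len(boxes) - length + 1, stride)]

tubes = split_trajectory(list(range(20)))  # toy "boxes"; 4 overlapping pipelines
```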
  • the video samples are input into the initial network model for training, and the predicted value of the motion pipeline is output.
  • the initial network model is a three-dimensional (3D) convolutional neural network or a recurrent neural network, etc., where the 3D convolutional neural network includes: a 3D residual neural network or a 3D feature pyramid network, etc.
  • the neural network model is a combination of a 3D residual neural network and a 3D feature pyramid network.
  • the video samples are input to the initial network model, and the motion pipelines of all target objects are output.
  • The data format of the output motion pipeline is the type shown in Figure 8: the input video is a tensor in R^(t×h×w×3), where h×w represents the video resolution and 3 represents the RGB color channels, and the output is the motion pipeline tensor O, O ∈ R^(t×h'×w'×15), where R represents the real number domain, t represents the number of frames of the video, and h'×w' represents the resolution of the feature map output by the neural network. That is, t×h'×w' motion pipelines are output, and each video frame corresponds to h'×w' motion pipelines.
  • the confidence level of the motion pipeline is also output, and the confidence level is used to indicate the category of the target object corresponding to the motion pipeline.
  • The order of step 1202 and step 1203 is not limited.
  • Since step 1202 splits according to the manually labeled trajectory information, the data format of the obtained true values of the motion pipelines, R^(t×h'×w'×15) (where t×h'×w' is the number of motion pipelines), is the first data format of the motion pipeline;
  • the data format of the motion pipelines output by the initial network model in step 1203, R^(n×15) (where n is the number of motion pipelines), is the second data format of the motion pipeline.
  • the true value of the motion pipeline is converted into the second data format.
  • The t×h'×w' motion pipelines output by the neural network model include t×h'×w' P points (only P1 and P2 are used as examples in Figure 14 for illustration), and these t×h'×w' P points form a three-dimensional lattice distributed over the time and space dimensions.
  • the true value is accompanied by a 0/1 truth table to characterize whether it is a compensation pipeline.
  • the truth table A' can be used as the confidence level corresponding to the truth value of the motion pipeline.
  • the loss between the true value (T) and the predicted value (O) can be calculated.
  • The loss function L combines an intersection-over-union term and a cross-entropy term, for example L = (1 − IoU(T, O)) + CrossEntropy(A, A′), where:
  • IoU (T, O) represents the intersection ratio between the true value of the motion pipeline (T) and the predicted value (O) of the motion pipeline
  • A is the confidence level of the predicted value (O) of the motion pipeline
  • A′ is the confidence corresponding to the true value of the motion pipeline (the 0/1 truth table).
  • CrossEntropy is the cross entropy.
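  • A sketch of a loss of this form (NumPy; the exact way the application combines the two terms may differ from this illustrative sum):

```python
import numpy as np

def loss(iou_T_O, A, A_prime, eps=1e-7):
    """iou_T_O: IoU(T, O); A: predicted confidence; A_prime: 0/1 true label."""
    ce = -(A_prime * np.log(A + eps) + (1 - A_prime) * np.log(1 - A + eps))
    return (1.0 - iou_T_O) + ce   # lower loss: better overlap, better confidence

print(loss(iou_T_O=0.8, A=0.9, A_prime=1.0))  # small loss for a good prediction
```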
  • the parameters are updated by the optimizer to optimize the neural network model, and finally a neural network model that can be used to implement the target tracking method in the embodiment of the present application is obtained.
  • There are many types of optimizers. Optionally, the optimizer can be the BGD (batch gradient descent) algorithm, the SGD (stochastic gradient descent) algorithm, or the MBGD (mini-batch gradient descent) algorithm.
  • FIG. 15 is a schematic diagram of another embodiment of a target tracking method in an embodiment of this application.
  • the target tracking device can track the moving target in the video in real time.
  • First, the system initialization of the target tracking device is performed to complete the preparation for device startup.
  • The video can be captured by the target tracking device in real time, or obtained through a communication network.
  • The video obtained in step 1502 is input into the pre-trained neural network model to obtain the motion pipeline set of the input video, including the motion pipeline of the target object corresponding to each video frame.
  • The basic idea of the greedy algorithm is to proceed step by step from an initial solution of the problem and, according to a certain optimization measure, ensure that a locally optimal solution is obtained at each step. It is understandable that the algorithm for connecting the motion pipelines can be replaced with other algorithms, which is not limited here.
  • For single-target tracking, the tracking trajectory of one target object is output.
  • For multi-target tracking, the tracking trajectory of each target object can be output. Specifically, the tracking trajectory can be processed into bounding boxes in each video frame, superimposed on the original video, and displayed by the display module.
  • the target tracking device will continue to obtain the newly captured video content, and repeat steps 1502 to 1505 until the target tracking task ends, which will not be repeated here.
  • FIG. 16 is a schematic diagram of an embodiment of the target tracking device in the embodiment of this application.
  • the software or firmware includes but is not limited to computer program instructions or codes, and can be executed by a hardware processor.
  • the hardware includes, but is not limited to, various integrated circuits, such as a central processing unit (CPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or an application specific integrated circuit (ASIC).
  • the target tracking device includes:
  • the acquiring unit 1601 is configured to acquire a first video, where the first video includes a target object;
  • the acquiring unit 1601 is further configured to input the first video into a pre-trained neural network model to acquire the position information of the target object in at least two video frames and the time information of the at least two video frames;
  • the acquiring unit 1601 is further configured to acquire the tracking of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames A trajectory, the tracking trajectory includes position information of the target object in at least two video frames in the first video.
  • The acquiring unit 1601 is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate the time information and position information of the target object in at least two video frames of the first video, and the first video includes a first video frame and a second video frame; the motion pipeline corresponds to a quadrangular frustum in the space-time dimension, and the space-time dimension includes a time dimension and a two-dimensional space dimension.
  • The position of the first bottom surface of the quadrangular frustum in the time dimension is used to indicate the first time information of the first video frame;
  • the position of the second bottom surface of the quadrangular frustum in the time dimension is used to indicate the second time information of the second video frame;
  • the position of the first bottom surface of the quadrangular frustum in the two-dimensional space dimension is used to indicate the first position information of the target object in the first video frame;
  • the position of the second bottom surface of the quadrangular frustum in the two-dimensional space dimension is used to indicate the second position information of the target object in the second video frame;
  • and the quadrangular frustum is used to indicate the position information of the target object in all video frames between the first video frame and the second video frame of the first video.
  • The acquiring unit 1601 is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate the position information of the target object in at least three video frames and the time information of the at least three video frames, and the first video includes a first video frame, a second video frame, and a third video frame; the motion pipeline corresponds to a double quadrangular frustum in the space-time dimension, and the double quadrangular frustum includes a first quadrangular frustum and a second quadrangular frustum; the first quadrangular frustum includes a first bottom surface and a second bottom surface, the second quadrangular frustum includes the first bottom surface and a third bottom surface, and the first bottom surface is the common bottom surface of the first quadrangular frustum and the second quadrangular frustum; the position of the first bottom surface in the time dimension is used to indicate the first time information of the first video frame, the position of the second bottom surface in the time dimension is used to indicate the second time information of the second video frame, and the position of the third bottom surface in the time dimension is used to indicate the third time information of the third video frame.
  • the acquiring unit 1601 is specifically configured to acquire the tracking trajectory of the target object in the first video according to the motion pipeline.
  • The tracking trajectory specifically includes: a tracking trajectory of the target object formed by connecting, in the space-time dimension, the quadrangular frustums corresponding to at least two of the motion pipelines.
  • the length of the motion pipeline is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
  • the obtaining unit 1601 is further configured to: obtain category information of the target object through the pre-trained neural network model; according to the category information of the target object, the target object is The position information in the two video frames and the time information of the at least two video frames obtain the tracking trajectory of the target object in the first video.
  • the acquiring unit 1601 is specifically configured to: acquire the confidence level of the motion pipeline through the pre-trained neural network model, and the confidence level of the motion pipeline is used to determine the target object corresponding to the motion pipeline Of the category information.
  • The device further includes a processing unit 1602, configured to delete the motion pipelines to obtain the deleted motion pipelines, where the deleted motion pipelines are used to acquire the tracking trajectory of the target object.
  • The motion pipelines include a first motion pipeline and a second motion pipeline. The processing unit 1602 is specifically configured to: if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, delete the motion pipeline with the lower confidence among the first motion pipeline and the second motion pipeline, where the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union between the first motion pipeline and the second motion pipeline, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to the motion pipeline is a preset category.
  • the processing unit 1602 is specifically configured to: delete the motion pipeline according to a non-maximum value suppression algorithm, and obtain the deleted motion pipeline.
  • The confidence of any one of the motion pipelines remaining after deletion is greater than or equal to a second threshold.
  • the acquiring unit 1601 is specifically configured to: connect a third motion pipeline and a fourth motion pipeline that meet a preset condition in the motion pipeline to acquire the tracking trajectory of the target object; the preset condition It includes one or more of the following: the intersection ratio between the sections of the overlapping portion of the third movement pipeline and the fourth movement pipeline in the time dimension is greater than or equal to a third threshold; the movement direction of the third movement pipeline The cosine value of the included angle with the movement direction of the fourth motion pipe is greater than or equal to the fourth threshold, and the movement direction is a vector indicating the position change of the target object in the movement pipe in the space-time dimension according to a preset rule; and , The distance between the neural network feature vectors of the motion pipeline is less than or equal to the fifth threshold, and the distance includes the Euclidean distance.
  • The obtaining unit 1601 is specifically configured to: group the motion pipelines to obtain t groups of motion pipelines, where t is the total number of video frames in the first video, the i-th group in the t groups includes all motion pipelines starting from the i-th video frame in the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, the motion pipelines in the i-th group are used as initial tracking trajectories to obtain a tracking trajectory set; and, in the numbered order of the motion pipeline groups, the motion pipelines in the i-th group are connected with the tracking trajectories in the tracking trajectory set to obtain at least one tracking trajectory.
  • the obtaining unit 1601 is specifically configured to: input the first video sample into the initial network model for training, and obtain the target object loss; update the weight parameter in the initial network model according to the target object loss to obtain The pre-trained neural network model.
  • The target object loss specifically includes: the intersection-over-union between the true value of the motion pipeline and the predicted value of the motion pipeline, where the true value of the motion pipeline is the motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the predicted value of the motion pipeline is the motion pipeline obtained by inputting the first video sample into the initial network model.
  • Alternatively, the target object loss specifically includes: the intersection-over-union between the true value of the motion pipeline and the predicted value of the motion pipeline, and the cross entropy between the confidence of the true value of the motion pipeline and the confidence of the predicted value of the motion pipeline, where:
  • the true value of the motion pipeline is the motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample;
  • the predicted value of the motion pipeline is the motion pipeline obtained by inputting the first video sample into the initial network model;
  • the confidence of the true value of the motion pipeline is the probability that the target object category corresponding to the true value of the motion pipeline belongs to the preset target object category;
  • and the confidence of the predicted value of the motion pipeline is the probability that the target object category corresponding to the predicted value of the motion pipeline belongs to the preset target object category.
  • the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
  • processing unit 1602 is further configured to: divide the first video into multiple video segments;
  • the acquiring unit 1601 is specifically configured to input the multiple video clips into the pre-trained neural network model to acquire the motion pipeline.
  • the target tracking device provided by the embodiment of the present application has multiple implementation forms.
  • the target tracking device includes a video acquisition module, a target tracking module, and an output module.
  • the video acquisition module is used to obtain a video including the moving target object
  • the target tracking module is used to input the video
  • the tracking trajectory of the target object is output by the target tracking method provided in this embodiment of the application
  • the output module is used to superimpose the tracking trajectory on the video and show it to users.
  • FIG. 17 is a schematic diagram of another embodiment of the target tracking device in the embodiment of this application.
  • the target tracking device includes a video acquisition module and a target tracking module, which can be understood as front-end equipment.
  • The front-end equipment needs to work together with the back-end equipment to complete the processing.
  • the video acquisition module 1701 which can be a video acquisition module in a surveillance camera, a video camera, a mobile phone or a vehicle image sensor, is responsible for capturing video data as the input of the tracking algorithm;
  • The target tracking module 1702, which can be a processing unit in a camera processor, a mobile phone processor, a vehicle processing unit, etc., is used to receive the video input and the control information sent by the back-end device, such as the tracking target category, the number of targets to track, accuracy control, model hyperparameters, etc.
  • the target tracking method of the embodiment of the present application is mainly deployed in this module.
  • FIG. 18 for the introduction of the target tracking module 1702.
  • the back-end equipment includes an output module and a control module.
  • The output module 1703 may be a display unit of a background monitor, a printer, a hard disk, etc., used to display or save the tracking results;
  • the control module 1704 is used to analyze the output result, receive the user's instruction, and send the instruction to the target tracking module of the front end.
  • FIG. 18 is a schematic diagram of another embodiment of the target tracking device in the embodiment of the application.
  • the target tracking device includes: a video preprocessing module 1801, a prediction module 1802, and a motion pipeline connection module 1803.
  • the video preprocessing module 1801 is used to divide the input video into appropriate segments, and adjust and normalize the video resolution, color space, etc.
  • the prediction module 1802 is used to extract spatiotemporal features from the input video clips and make predictions, and output the target motion pipeline and the category information of the motion pipeline. In addition, it can also predict the future position of the target motion pipeline.
  • the prediction module 1802 includes two sub-modules:
  • The target category prediction module 18021 predicts the category to which the target belongs based on the features output by the 3D convolutional neural network, for example, through the confidence values.
  • The motion pipeline prediction module 18022 predicts the position of the target's current motion pipeline through the features output by the 3D convolutional neural network, that is, the coordinates of the motion pipeline in the space and time dimensions.
  • The motion pipeline connection module 1803 analyzes the motion pipelines output by the prediction module; if a target appears for the first time, its motion pipeline is initialized as a new tracking trajectory. The connection features required for connecting motion pipelines are obtained according to the spatiotemporal feature similarity and the spatial location proximity between the motion pipelines. Then, according to the motion pipelines and their connection features, the motion pipelines are connected into complete tracking trajectories by analyzing the spatial overlap characteristics of the motion pipelines and the similarity of their spatiotemporal features.
  • FIG. 19 is a schematic diagram of an embodiment of an electronic device in an embodiment of the application.
  • the electronic device 1900 may differ considerably in configuration or performance, and may include one or more processors 1901 and a memory 1902, where the memory 1902 stores programs or data.
  • the memory 1902 may be volatile storage or non-volatile storage.
  • the processor 1901 may be one or more central processing units (central processing units, CPU).
  • the CPUs may be single-core CPUs or multi-core CPUs.
  • the processor 1901 may communicate with the memory 1902 and execute a series of instructions in the memory 1902 on the electronic device 1900.
  • the electronic device 1900 also includes one or more wired or wireless network interfaces 1903, such as an Ethernet interface.
  • the electronic device 1900 may also include one or more power supplies and one or more input/output interfaces, which can be used to connect a display, a mouse, a keyboard, a touch screen device, a sensor device, etc.
  • the input and output interfaces are optional components, which may or may not exist, and are not limited here.
  • FIG. 20 is a hardware structure diagram of a chip provided by an embodiment of this application.
  • the embodiment of the present application provides a chip system that can be used to implement the target tracking method.
  • the algorithm based on the convolutional neural network shown in FIG. 3 and FIG. 4 can be implemented in the NPU chip shown in FIG. 20.
  • the neural network processor NPU 50 is mounted on the main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 503.
  • the arithmetic circuit 503 is controlled by the controller 504 to extract matrix data from the memory and perform multiplication operations.
  • the arithmetic circuit 503 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches the data of matrix A from the input memory 501, performs matrix operations on it with matrix B, and stores the partial or final result of the matrix in the accumulator 508.
  • the unified memory 506 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 502 through the storage unit access controller 505 (direct memory access controller, DMAC).
  • the input data is also transferred to the unified memory 506 through the DMAC.
  • the BIU is the bus interface unit, that is, the bus interface unit 510, which is used for the interaction between the AXI bus, the DMAC, and the instruction fetch buffer 509.
  • the bus interface unit 510 is used for the instruction fetch buffer 509 to obtain instructions from the external memory, and is also used for the storage unit access controller 505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 506 or to transfer the weight data to the weight memory 502 or to transfer the input data to the input memory 501.
  • the vector calculation unit 507 may include multiple arithmetic processing units, and if necessary, further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 507 can store the processed output vector to the unified buffer 506.
  • the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 507 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 503, for example for use in a subsequent layer in a neural network.
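The dataflow described above can be summarized with a small numerical sketch (the tile width and the ReLU nonlinearity are illustrative assumptions; the real circuit operates on matrix tiles held in the weight and input memories):

```python
import numpy as np

def npu_dataflow_sketch(A, B, bias):
    """Model of the described flow: the arithmetic circuit multiplies
    tiles of A and B, partial sums accumulate (accumulator 508), and
    the vector calculation unit applies element-wise post-processing
    (here bias addition and a ReLU activation)."""
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.float32)
    tile = 16                                       # assumed tile width
    for k in range(0, A.shape[1], tile):
        acc += A[:, k:k + tile] @ B[k:k + tile, :]  # partial result -> accumulator
    return np.maximum(acc + bias, 0.0)              # vector unit: nonlinearity
```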
  • the instruction fetch buffer 509 connected to the controller 504 is used to store instructions used by the controller 504;
  • the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch buffer 509 are all on-chip memories.
  • the external memory is private to the NPU hardware architecture.
  • each layer in the convolutional neural network shown in FIG. 3 and FIG. 4 may be executed by the matrix calculation unit 212 or the vector calculation unit 507.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

Provided is a target tracking method applicable to tracking a target in a video and capable of reducing tracking errors caused when the target is occluded. The method comprises: inputting a captured first video of a target object into a pre-trained neural network model; acquiring motion pipelines of the target object; connecting the motion pipelines; and obtaining a tracking trajectory of the target object, the tracking trajectory comprising position information of the target object in each video frame of the first video.

Description

Target tracking method and target tracking device
This application claims priority to a Chinese patent application filed with the State Intellectual Property Office of China on June 9, 2020, with application number 202010519876.2 and invention title "Target tracking method and target tracking device", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of image processing technology, and in particular to a target tracking method and a target tracking device.
Background
Target tracking is one of the most important and fundamental tasks in the field of computer vision. Its purpose is to output, from a video containing a target object, the position of the target object in each video frame of the video. Usually a piece of video and the category of the target object to be tracked are input to a computer, and the computer outputs the identification (ID) of the target object and the position information of the target object in each frame of the video in the form of detection boxes.
The existing multi-target tracking method includes two parts, detection and tracking: a detection module detects the multiple target objects appearing in each video frame, and the target objects appearing in the different video frames are then matched. In the matching process, the features of each target object in a single video frame are extracted, target matching is achieved by comparing feature similarity, and the tracking trajectory of each target object is obtained.
Since the existing target tracking algorithm adopts a detect-then-track approach, the target tracking effect depends on the single-frame detection algorithm. If the target object is occluded during target detection, detection errors occur, which in turn cause tracking errors; performance is therefore insufficient in scenes where target objects are dense or heavily occluded.
Summary of the invention
The embodiments of the present application provide a target tracking method for target tracking in a video, which can reduce tracking errors caused by target occlusion.
A first aspect of the embodiments of the present application provides a target tracking method, including: acquiring a first video, where the first video includes a target object; inputting the first video into a pre-trained neural network model to acquire position information of the target object in at least two video frames and time information of the at least two video frames; and acquiring a tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, where the tracking trajectory includes the position information of the target object in the at least two video frames of the first video.
This method obtains the position information of the target object in at least two video frames and the time information of the at least two video frames through a pre-trained neural network model. Target tracking therefore does not depend on the target detection result of a single video frame, which can reduce detection failures in scenes with dense or heavily occluded targets and improve target tracking performance.
In a possible implementation of the first aspect, acquiring the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: acquiring a motion pipeline of the target object, where the motion pipeline is used to indicate time information and position information of the target object in at least two video frames of the first video, and the first video includes a first video frame and a second video frame. The motion pipeline corresponds to a quadrangular frustum in the space-time dimensions, where the space-time dimensions include a time dimension and two-dimensional space dimensions. The position of the first base of the frustum in the time dimension indicates the first time information of the first video frame, and the position of the second base of the frustum in the time dimension indicates the second time information of the second video frame; the position of the first base in the two-dimensional space dimensions indicates the first position information of the target object in the first video frame, and the position of the second base in the two-dimensional space dimensions indicates the second position information of the target object in the second video frame. The frustum indicates the position information of the target object in all video frames between the first video frame and the second video frame of the first video.
This method obtains motion pipelines through a pre-trained neural network model. Since a motion pipeline includes the position information of the target object in at least two video frames, the position of the target in a video frame can be determined, in the space-time dimensions, by a moment in the time dimension and a position in the two-dimensional space dimensions: the moment identifies the video frame, and the position in the two-dimensional space dimensions indicates the position information of the target in that video frame. The motion pipeline can be mapped to a quadrangular frustum in the space-time dimensions, which visually presents the position information of the target in at least two video frames. This target tracking method does not depend on the target detection result of a single video frame, which can reduce detection failures in scenes with dense or heavily occluded targets and improve target tracking performance.
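As a sketch of how a single-frustum motion pipeline yields a box in every covered frame (assuming the box at an intermediate frame is the frustum's cross-section, i.e. a linear interpolation between the two end-face boxes; the tuple layout is an illustrative assumption):

```python
def box_at_frame(tube, t):
    """Box of a single-frustum motion pipeline at frame t.
    tube = (t0, box0, t1, box1), boxes given as (cx, cy, w, h)."""
    t0, box0, t1, box1 = tube
    a = (t - t0) / float(t1 - t0)   # position along the time axis
    return tuple((1 - a) * v0 + a * v1 for v0, v1 in zip(box0, box1))
```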
In a possible implementation of the first aspect, acquiring the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: acquiring a motion pipeline of the target object, where the motion pipeline is used to indicate the position information of the target object in at least three video frames and the time information of the at least three video frames, and the first video includes a first video frame, a second video frame, and a third video frame. The motion pipeline corresponds to a double quadrangular frustum in the space-time dimensions; the double frustum includes a first frustum and a second frustum, the first frustum includes a first base and a second base, the second frustum includes the first base and a third base, and the first base is the common base of the first frustum and the second frustum. The position of the first base in the time dimension indicates the first time information of the first video frame, the position of the second base in the time dimension indicates the second time information of the second video frame, and the position of the third base in the time dimension indicates the third time information of the third video frame; in the time order of the first video, the first video frame is located between the second video frame and the third video frame. The position of the first base in the two-dimensional space dimensions indicates the first position information of the target object in the first video frame, the position of the second base in the two-dimensional space dimensions indicates the second position information of the target object in the second video frame, and the position of the third base in the two-dimensional space dimensions indicates the third position information of the target object in the third video frame. The double frustum indicates the position information of the target object in all video frames between the second video frame and the third video frame of the first video.
In this method, the motion pipeline includes the position information of the target object in at least three video frames. Specifically, the at least three video frames include a second video frame that is earlier in the time order of the video, and a third video frame that is later, which expands the receptive field in the time dimension and can further improve target tracking performance. Mapping the motion pipeline to a double quadrangular frustum in the space-time dimensions visually presents the position information of the target in at least three video frames; in particular, it also covers the position information of the target in all video frames between the two non-common bases of the motion pipeline. Considering the continuity of target motion, the real tracking trajectory of a target object in the space-time dimensions is usually nonlinear; a motion pipeline with a double-frustum structure can express two directions of target motion and can better fit the real tracking trajectory in scenes where the motion direction of the target object changes.
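A double-frustum motion pipeline can be sketched analogously to the single-frustum sketch above: it is anchored at an earlier frame, a middle frame (the common base), and a later frame, so the two segments can follow different motion directions (the per-segment linear interpolation matches the frustum geometry and is an assumption for illustration):

```python
def box_at_frame_double(tube, t):
    """Box of a double-frustum motion pipeline at frame t.
    tube = (t_prev, box_prev, t_mid, box_mid, t_next, box_next)."""
    t_prev, box_prev, t_mid, box_mid, t_next, box_next = tube
    if t <= t_mid:                                  # first segment
        a = (t - t_prev) / float(t_mid - t_prev)
        lo, hi = box_prev, box_mid
    else:                                           # second segment, possibly a new direction
        a = (t - t_mid) / float(t_next - t_mid)
        lo, hi = box_mid, box_next
    return tuple((1 - a) * v0 + a * v1 for v0, v1 in zip(lo, hi))
```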
In a possible implementation of the first aspect, acquiring the tracking trajectory of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames specifically includes: acquiring the tracking trajectory of the target object in the first video according to the motion pipeline.
Acquiring the tracking trajectory of the target object in the first video according to the motion pipeline can reduce detection failures in scenes with dense or heavily occluded targets and improve target tracking performance.
In a possible implementation of the first aspect, the tracking trajectory specifically includes: a tracking trajectory of the target object formed by connecting the quadrangular frustums in the space-time dimensions corresponding to at least two of the motion pipelines.
Obtaining the tracking trajectory of the target object by connecting motion pipelines does not rely on the target detection result of a single video frame, which can reduce detection failures in scenes with dense or heavily occluded targets and improve target tracking performance.
In a possible implementation of the first aspect, the length of the motion pipeline is a preset value, where the length of the motion pipeline indicates the number of video frames included in the at least two video frames; optionally, the length of the motion pipeline is 4, 6, or 8.
In this method, the length of the motion pipeline can be a preset value; that is, each motion pipeline corresponds to the same number of video frames and indicates the position change of the target object within a time period of the same duration. Compared with not fixing the motion pipeline length, this method can reduce the computation of the neural network model and the time required for target tracking.
In a possible implementation of the first aspect, the method further includes: obtaining category information of the target object through the pre-trained neural network model. Acquiring the tracking trajectory of the target object in the first video according to the position information of the target object in at least two video frames and the time information of the at least two video frames then includes: acquiring the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
For multi-target tracking scenes in which the targets to be tracked belong to multiple categories, this method can determine the category information of the target object corresponding to each motion pipeline through the pre-trained neural network model, and obtain the tracking trajectory of the target object based on the category information, position information, and time information.
In a possible implementation of the first aspect, obtaining the category information of the target object corresponding to the motion pipeline through the pre-trained neural network model specifically includes: obtaining a confidence value of the motion pipeline through the pre-trained neural network model, where the confidence value of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
For single-target tracking scenes, this method can use the confidence value to distinguish whether a motion pipeline is a real motion pipeline indicating the target position. In addition, for multi-target tracking scenes in which the targets to be tracked belong to multiple categories, this method can use the confidence value of a motion pipeline to distinguish the category of the target object corresponding to that motion pipeline.
In a possible implementation of the first aspect, before acquiring the tracking trajectory of the target object according to the motion pipelines, the method further includes: pruning the motion pipelines to obtain pruned motion pipelines, where the pruned motion pipelines are used to obtain the tracking trajectory of the target object.
This method can prune the motion pipelines of the video frames; removing duplicate motion pipelines or motion pipelines with low confidence can reduce the computation in the motion pipeline connection step.
In a possible implementation of the first aspect, pruning the motion pipelines to obtain the pruned motion pipelines specifically includes: the motion pipelines include a first motion pipeline and a second motion pipeline; if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, the motion pipeline with the lower confidence among the first motion pipeline and the second motion pipeline is removed, where the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union ratio between the first motion pipeline and the second motion pipeline, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to a motion pipeline is a preset category.
This method introduces a specific way of pruning motion pipelines: motion pipelines whose mutual repetition rate is greater than or equal to the first threshold can be regarded as duplicate data; the one with the lower confidence is removed and the one with the higher confidence is retained for pipeline connection, which can reduce the computation in the motion pipeline connection step.
In a possible implementation of the first aspect, pruning the motion pipelines to obtain the pruned motion pipelines specifically includes: pruning the motion pipelines according to a non-maximum suppression algorithm to obtain the pruned motion pipelines.
This method can also prune according to a non-maximum suppression algorithm, which removes duplicate motion pipelines while retaining, for each target, the motion pipelines with higher confidence, reducing the computation of the pipeline connection step and improving target tracking efficiency.
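A minimal sketch of such a pruning step, assuming per-frame axis-aligned (x1, y1, x2, y2) boxes and illustrative threshold values (a motion pipeline is represented here as a frame-indexed sequence of boxes):

```python
def tube_iou(tube_a, tube_b, frames):
    """Repetition rate of two motion pipelines: the spatiotemporal
    intersection-over-union, computed frame by frame."""
    inter = union = 0.0
    for t in frames:
        a, b = tube_a[t], tube_b[t]
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        i = iw * ih
        u = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - i
        inter, union = inter + i, union + u
    return inter / union if union else 0.0

def prune_tubes(tubes, scores, frames, iou_thr=0.7, conf_thr=0.3):
    """Non-maximum-suppression-style pruning: discard low-confidence
    tubes (second threshold), then suppress duplicates whose repetition
    rate exceeds the first threshold, keeping the higher confidence."""
    order = sorted(range(len(tubes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if scores[i] < conf_thr:
            continue
        if all(tube_iou(tubes[i], tubes[j], frames) < iou_thr for j in keep):
            keep.append(i)
    return [tubes[i] for i in keep]
```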
In a possible implementation of the first aspect, the confidence of any one of the pruned motion pipelines is greater than or equal to a second threshold.
When pruning motion pipelines, this method can discard all motion pipelines with low confidence; a motion pipeline whose confidence is below the second threshold can be understood as a non-real motion pipeline, for example a motion pipeline corresponding to the background.
In a possible implementation of the first aspect, acquiring the tracking trajectory of the target object according to the motion pipelines specifically includes: connecting a third motion pipeline and a fourth motion pipeline that satisfy preset conditions among the motion pipelines to obtain the tracking trajectory of the target object, where the preset conditions include one or more of the following: the intersection-over-union ratio between the temporally overlapping sections of the third motion pipeline and the fourth motion pipeline is greater than or equal to a third threshold; the cosine of the angle between the motion direction of the third motion pipeline and the motion direction of the fourth motion pipeline is greater than or equal to a fourth threshold, where the motion direction is a vector that indicates, according to a preset rule, the position change of the target object in the motion pipeline in the space-time dimensions; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to a fifth threshold, where the distance includes the Euclidean distance.
This method provides a specific way of connecting motion pipelines: based on the positions of the motion pipelines in the space-time dimensions, motion pipelines with high overlap and similar motion directions are connected.
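The three conditions can be sketched as a compatibility test on two temporally overlapping motion pipelines (reusing the tube_iou helper from the pruning sketch above; the thresholds and the dictionary layout are illustrative assumptions):

```python
import numpy as np

def can_connect(tube_a, tube_b, overlap_frames,
                iou_thr=0.5, cos_thr=0.8, feat_thr=1.0):
    """tube_*: dict with 'boxes' (frame -> (x1, y1, x2, y2)),
    'direction' (motion vector in the space-time dimensions), and
    'feat' (neural network feature vector)."""
    # 1) IoU of the temporally overlapping sections >= third threshold
    if tube_iou(tube_a['boxes'], tube_b['boxes'], overlap_frames) < iou_thr:
        return False
    # 2) cosine of the angle between motion directions >= fourth threshold
    da, db = np.asarray(tube_a['direction']), np.asarray(tube_b['direction'])
    if da @ db / (np.linalg.norm(da) * np.linalg.norm(db)) < cos_thr:
        return False
    # 3) Euclidean feature distance <= fifth threshold
    return np.linalg.norm(np.asarray(tube_a['feat']) -
                          np.asarray(tube_b['feat'])) <= feat_thr
```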
In a possible implementation of the first aspect, acquiring the tracking trajectory of the target object according to the motion pipelines specifically includes: grouping the motion pipelines to obtain t groups of motion pipelines, where t is the total number of video frames in the first video, the i-th motion pipeline group among the t groups includes all motion pipelines starting at the i-th video frame of the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, taking the motion pipelines in the i-th group as initial tracking trajectories to obtain a set of tracking trajectories; and, in the numbered order of the groups, successively connecting the motion pipelines in the i-th group with the tracking trajectories in the set to obtain at least one tracking trajectory. This provides a specific way of connecting motion pipelines: a motion pipeline corresponds to the position information of the target object in the video frames within a period of time; grouping the motion pipelines by their starting video frame and connecting each group in turn can improve the efficiency of target tracking.
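A sketch of this grouping-and-connection procedure (the two-argument compatibility test is passed in, for example a closure over the can_connect sketch above; the data layout is an illustrative assumption):

```python
def link_tubes(tubes_by_start, num_frames, compatible):
    """Group i contains the tubes starting at frame i. Group 1 seeds
    the trajectory set; each later tube is appended to a compatible
    trajectory, or starts a new trajectory on first appearance."""
    tracks = [[t] for t in tubes_by_start.get(1, [])]
    for i in range(2, num_frames + 1):
        for tube in tubes_by_start.get(i, []):
            for track in tracks:
                if compatible(track[-1], tube):
                    track.append(tube)
                    break
            else:
                tracks.append([tube])   # target appears for the first time
    return tracks
```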
In a possible implementation of the first aspect, the pre-trained neural network model is obtained by training an initial network model, and the method further includes: inputting a first video sample into the initial network model for training to obtain a target object loss; and updating the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
In this method, an initial network model can be trained to obtain the neural network model that outputs the motion pipelines in this target tracking method.
In a possible implementation of the first aspect, the target object loss specifically includes: the intersection-over-union ratio between a ground-truth motion pipeline and a predicted motion pipeline, where the ground-truth motion pipeline is obtained by splitting the tracking trajectory of the target object in the first video sample, and the predicted motion pipeline is obtained by inputting the first video sample into the initial network model.
In this method, the target loss in the model training process is the intersection-over-union ratio between the ground-truth motion pipeline and the predicted motion pipeline; with a neural network model trained in this way, the position information of the target object indicated by the motion pipelines is highly accurate.
In a possible implementation of the first aspect, the target object loss specifically includes: the intersection-over-union ratio between the ground-truth motion pipeline and the predicted motion pipeline, and the cross entropy between the confidence of the ground-truth motion pipeline and the confidence of the predicted motion pipeline, where the ground-truth motion pipeline is obtained by splitting the tracking trajectory of the target object in the first video sample, the predicted motion pipeline is obtained by inputting the first video sample into the initial network model, the confidence of the ground-truth motion pipeline is the probability that the category of the target object corresponding to the ground-truth motion pipeline belongs to a preset target object category, and the confidence of the predicted motion pipeline is the probability that the category of the target object corresponding to the predicted motion pipeline belongs to the preset target object category.
In this method, the target loss in the model training process combines the intersection-over-union ratio between the ground-truth and predicted motion pipelines with the cross entropy between their confidences; with a neural network model trained in this way, the position information of the target object indicated by the motion pipelines is highly accurate, and the type of the target object can be indicated accurately.
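A PyTorch-style sketch of such a target loss (axis-aligned (x1, y1, x2, y2) boxes per frame, an IoU term written as 1 - IoU, and equal weighting of the two terms are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def tube_training_loss(pred_boxes, gt_boxes, pred_logits, gt_labels):
    """pred_boxes/gt_boxes: (N, T, 4) predicted and ground-truth tube
    boxes; pred_logits: (N, C) class scores; gt_labels: (N,) classes."""
    # IoU term between predicted and ground-truth tubes, per frame
    lt = torch.max(pred_boxes[..., :2], gt_boxes[..., :2])
    rb = torch.min(pred_boxes[..., 2:], gt_boxes[..., 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=-1)
    area_p = (pred_boxes[..., 2:] - pred_boxes[..., :2]).clamp(min=0).prod(dim=-1)
    area_g = (gt_boxes[..., 2:] - gt_boxes[..., :2]).clamp(min=0).prod(dim=-1)
    iou = inter / (area_p + area_g - inter + 1e-6)
    loss_iou = (1.0 - iou).mean()
    # cross entropy between predicted and ground-truth confidences
    loss_cls = F.cross_entropy(pred_logits, gt_labels)
    return loss_iou + loss_cls
```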
In a possible implementation of the first aspect, the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network, where the three-dimensional convolutional neural network includes a three-dimensional residual neural network or a three-dimensional feature pyramid network. Optionally, the initial network model is obtained by combining a three-dimensional residual neural network and a three-dimensional feature pyramid network.
In this method, the initial network model can be a three-dimensional convolutional neural network, a recurrent neural network, or a combination of the two; the diversity of neural network model types offers multiple ways of implementing the solution.
In a possible implementation of the first aspect, inputting the first video into the pre-trained neural network model to acquire the motion pipelines of the target object specifically includes: dividing the first video into multiple video clips, and inputting the multiple video clips into the pre-trained neural network model respectively to acquire the motion pipelines.
Considering the limit on the number of video frames the neural network model can process, the video can be segmented first and the video clips input into the model; optionally, the number of video frames in a clip is a preset value, for example 8 frames.
A second aspect of the embodiments of the present application provides a target tracking device, including: an acquisition unit configured to acquire a first video, where the first video includes a target object; the acquisition unit is further configured to input the first video into a pre-trained neural network model to acquire position information of the target object in at least two video frames and time information of the at least two video frames; and the acquisition unit is further configured to acquire a tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, where the tracking trajectory includes the position information of the target object in the at least two video frames of the first video.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate time information and position information of the target object in at least two video frames of the first video, and the first video includes a first video frame and a second video frame; the motion pipeline corresponds to a quadrangular frustum in the space-time dimensions, where the space-time dimensions include a time dimension and two-dimensional space dimensions; the position of the first base of the frustum in the time dimension indicates the first time information of the first video frame, the position of the second base in the time dimension indicates the second time information of the second video frame, the position of the first base in the two-dimensional space dimensions indicates the first position information of the target object in the first video frame, and the position of the second base in the two-dimensional space dimensions indicates the second position information of the target object in the second video frame; the frustum indicates the position information of the target object in all video frames between the first video frame and the second video frame of the first video.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to: acquire a motion pipeline of the target object, where the motion pipeline is used to indicate the position information of the target object in at least three video frames and the time information of the at least three video frames, and the first video includes a first video frame, a second video frame, and a third video frame; the motion pipeline corresponds to a double quadrangular frustum in the space-time dimensions, where the double frustum includes a first frustum and a second frustum, the first frustum includes a first base and a second base, the second frustum includes the first base and a third base, and the first base is the common base of the first frustum and the second frustum; the position of the first base in the time dimension indicates the first time information of the first video frame, the position of the second base in the time dimension indicates the second time information of the second video frame, and the position of the third base in the time dimension indicates the third time information of the third video frame, with the first video frame located, in the time order of the first video, between the second video frame and the third video frame; the position of the first base in the two-dimensional space dimensions indicates the first position information of the target object in the first video frame, the position of the second base in the two-dimensional space dimensions indicates the second position information of the target object in the second video frame, and the position of the third base in the two-dimensional space dimensions indicates the third position information of the target object in the third video frame; the double frustum indicates the position information of the target object in all video frames between the second video frame and the third video frame of the first video.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to acquire the tracking trajectory of the target object in the first video according to the motion pipeline.
In a possible implementation of the second aspect, the tracking trajectory specifically includes: a tracking trajectory of the target object formed by connecting the quadrangular frustums in the space-time dimensions corresponding to at least two of the motion pipelines.
In a possible implementation of the second aspect, the length of the motion pipeline is a preset value, where the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
In a possible implementation of the second aspect, the acquisition unit is further configured to: obtain category information of the target object through the pre-trained neural network model; and acquire the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to obtain a confidence value of the motion pipeline through the pre-trained neural network model, where the confidence value of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
In a possible implementation of the second aspect, the device further includes: a processing unit configured to prune the motion pipelines to obtain pruned motion pipelines, where the pruned motion pipelines are used to obtain the tracking trajectory of the target object.
In a possible implementation of the second aspect, the motion pipelines include a first motion pipeline and a second motion pipeline, and the processing unit is specifically configured to: if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, remove the motion pipeline with the lower confidence among the first motion pipeline and the second motion pipeline, where the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union ratio between them, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to a motion pipeline is a preset category.
In a possible implementation of the second aspect, the processing unit is specifically configured to prune the motion pipelines according to a non-maximum suppression algorithm to obtain the pruned motion pipelines.
In a possible implementation of the second aspect, the confidence of any one of the pruned motion pipelines is greater than or equal to a second threshold.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to: connect a third motion pipeline and a fourth motion pipeline that satisfy preset conditions among the motion pipelines to obtain the tracking trajectory of the target object, where the preset conditions include one or more of the following: the intersection-over-union ratio between the temporally overlapping sections of the third motion pipeline and the fourth motion pipeline is greater than or equal to a third threshold; the cosine of the angle between the motion direction of the third motion pipeline and the motion direction of the fourth motion pipeline is greater than or equal to a fourth threshold, where the motion direction is a vector that indicates, according to a preset rule, the position change of the target object in the motion pipeline in the space-time dimensions; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to a fifth threshold, where the distance includes the Euclidean distance.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to: group the motion pipelines to obtain t groups of motion pipelines, where t is the total number of video frames in the first video, the i-th motion pipeline group among the t groups includes all motion pipelines starting at the i-th video frame of the first video, and i is greater than or equal to 1 and less than or equal to t; when i is 1, take the motion pipelines in the i-th group as initial tracking trajectories to obtain a set of tracking trajectories; and, in the numbered order of the groups, successively connect the motion pipelines in the i-th group with the tracking trajectories in the set to obtain at least one tracking trajectory.
In a possible implementation of the second aspect, the acquisition unit is specifically configured to: input a first video sample into the initial network model for training to obtain a target object loss; and update the weight parameters in the initial network model according to the target object loss to obtain the pre-trained neural network model.
In a possible implementation of the second aspect, the target object loss specifically includes: the intersection-over-union ratio between a ground-truth motion pipeline and a predicted motion pipeline, where the ground-truth motion pipeline is obtained by splitting the tracking trajectory of the target object in the first video sample, and the predicted motion pipeline is obtained by inputting the first video sample into the initial network model.
In a possible implementation of the second aspect, the target object loss specifically includes: the intersection-over-union ratio between the ground-truth motion pipeline and the predicted motion pipeline, and the cross entropy between the confidence of the ground-truth motion pipeline and the confidence of the predicted motion pipeline, where the ground-truth motion pipeline is obtained by splitting the tracking trajectory of the target object in the first video sample, the predicted motion pipeline is obtained by inputting the first video sample into the initial network model, the confidence of the ground-truth motion pipeline is the probability that the category of the target object corresponding to it belongs to a preset target object category, and the confidence of the predicted motion pipeline is the probability that the category of the target object corresponding to it belongs to the preset target object category.
In a possible implementation of the second aspect, the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
In a possible implementation of the second aspect, the processing unit is further configured to divide the first video into multiple video clips, and the acquisition unit is specifically configured to input the multiple video clips into the pre-trained neural network model respectively to acquire the motion pipelines.
A third aspect of the embodiments of the present application provides an electronic device, including a processor and a memory connected to each other, where the memory is configured to store a computer program including program instructions, and the processor is configured to call the program instructions to execute the method described in any one of the first aspect and its various possible implementations.
A fourth aspect of the embodiments of the present application provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the method described in any one of the first aspect and its various possible implementations.
A fifth aspect of the embodiments of the present application provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to execute the method described in any one of the first aspect and its various possible implementations.
A sixth aspect of the embodiments of the present application provides a chip including a processor. The processor is configured to read and execute a computer program stored in a memory to execute the method in any possible implementation of any of the foregoing aspects. Optionally, the chip includes the memory, and the memory is connected to the processor through a circuit or a wire. Further optionally, the chip also includes a communication interface connected to the processor. The communication interface is used to receive data and/or information to be processed; the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface. The communication interface can be an input/output interface.
For the technical effects brought by any implementation of the second, third, fourth, fifth, or sixth aspect, refer to the technical effects brought by the corresponding implementation of the first aspect; they are not repeated here.
It can be seen from the above technical solutions that the embodiments of the present application have the following advantages:
In the target tracking method provided by the embodiments of the present application, the position information of the target object in at least two video frames and the time information of the at least two video frames are obtained through a pre-trained neural network model, and the tracking trajectory of the target object in the first video is determined from this information. Since the neural network model outputs time information for at least two video frames, target tracking does not depend on the target detection result of a single video frame, which can reduce detection failures in scenes with dense or heavily occluded targets and improve target tracking performance.
本申请实施例提供的目标跟踪方法,通过预训练的神经网络模型获取目标物体的运动管道,通过连接运动管道获取目标物体的跟踪轨迹。由于运动管道包括至少两个视频帧中的目标物体的位置信息,目标跟踪不依赖于单个视频帧的目标检测结果,可以减少在目标密集或者遮挡较多的场景下的检测失败的问题,提升目标跟踪性能。In the target tracking method provided by the embodiments of the present application, the motion pipeline of the target object is obtained through a pre-trained neural network model, and the tracking trajectory of the target object is obtained by connecting the motion pipeline. Since the motion pipeline includes the position information of the target object in at least two video frames, target tracking does not depend on the target detection result of a single video frame, which can reduce the problem of detection failure in scenes with dense targets or more occlusions, and improve the target Track performance.
此外,现有技术中依赖于单帧的检测算法,整体算法的精度受到检测器的影响,分步训练检测模型和跟踪模型的开发成本高,同时算法分为两个阶段也增大了机器学习过程的计算成本和部署难度。而本申请实施例提供的目标跟踪方法,可以实现端到端的训练,通过一个神经网络模型完成多目标物体的检测和跟踪任务,可以减低模型的复杂度。In addition, in the prior art, the detection algorithm relies on a single frame, and the accuracy of the overall algorithm is affected by the detector. The development cost of step-by-step training of the detection model and tracking model is high. At the same time, the algorithm is divided into two phases, which also increases machine learning. The computational cost and deployment difficulty of the process. However, the target tracking method provided in the embodiments of the present application can realize end-to-end training, and complete the detection and tracking tasks of multi-target objects through a neural network model, which can reduce the complexity of the model.
此外,现有技术基于单个视频帧提取的特征较为单一,本申请实施例提供的目标跟踪方法,采用视频作为原始输入,模型可以通过外貌特征、运动轨迹特征或步态特征等多种特征实现跟踪任务,可以提升目标跟踪性能。In addition, the prior art has relatively single features extracted based on a single video frame. The target tracking method provided in the embodiments of this application uses video as the original input, and the model can be tracked through various features such as appearance features, motion trajectory features, or gait features. Tasks can improve target tracking performance.
此外,本申请实施例提供的目标跟踪方法,采用视频作为模型原始输入,时间维度感受野增加,可以更好的捕捉人物的运动信息。In addition, the target tracking method provided by the embodiment of the present application uses video as the original input of the model, and the time dimension receptive field is increased, which can better capture the movement information of the character.
Description of the drawings
FIG. 1 is a schematic diagram of an artificial intelligence framework provided by an embodiment of this application;
FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of this application;
FIG. 3 is a schematic diagram of a convolutional neural network structure provided by an embodiment of this application;
FIG. 4 is a schematic diagram of another convolutional neural network structure provided by an embodiment of this application;
FIG. 5 is a schematic diagram of an embodiment of a motion pipeline in an embodiment of this application;
FIG. 6 is a schematic diagram of splitting a tracking trajectory into motion pipelines in an embodiment of this application;
FIG. 7 is a schematic diagram of an embodiment of a motion pipeline in an embodiment of this application;
FIG. 8 is a schematic diagram of another embodiment of a motion pipeline in an embodiment of this application;
FIG. 9 is a schematic diagram of the intersection and union of motion pipelines in an embodiment of this application;
FIG. 10 is a schematic diagram of an embodiment of a target detection method in an embodiment of this application;
FIG. 11 is a schematic diagram of an embodiment of matching between motion pipelines in an embodiment of this application;
FIG. 12 is a schematic diagram of an embodiment of a training method of a neural network model in an embodiment of this application;
FIG. 13 is a schematic diagram of a tracking trajectory and motion pipelines in an embodiment of this application;
FIG. 14 is a schematic diagram of motion pipelines output by a neural network model in an embodiment of this application;
FIG. 15 is a schematic diagram of another embodiment of a target tracking method in an embodiment of this application;
FIG. 16 is a schematic diagram of an embodiment of a target tracking device in an embodiment of this application;
FIG. 17 is a schematic diagram of another embodiment of a target tracking device in an embodiment of this application;
FIG. 18 is a schematic diagram of another embodiment of a target tracking device in an embodiment of this application;
FIG. 19 is a schematic diagram of another embodiment of an electronic device in an embodiment of this application;
FIG. 20 is a hardware structure diagram of a chip provided by an embodiment of this application.
Detailed description
The embodiments of the present application provide a target tracking method for tracking targets in video, which can reduce tracking errors in scenes with dense targets or heavy occlusion.
The embodiments of the present application are described below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. A person of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
The terms "first", "second", and the like in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than the one illustrated or described herein. Moreover, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to those steps or modules clearly listed, but may include other steps or modules that are not clearly listed or that are inherent to such processes, methods, products, or devices. The naming or numbering of steps in this application does not mean that the steps in a method flow must be executed in the temporal or logical order indicated by the naming or numbering; named or numbered steps may be executed in a different order according to the technical purpose to be achieved, as long as the same or a similar technical effect is achieved.
For ease of understanding, some technical terms involved in the embodiments of this application are briefly introduced below:
1. Motion pipelines and tracking trajectories.
The multiple video frames of a video are obtained by continuous shooting, and the video frame rate is usually known. A moving target in a video refers to a target that moves relative to the video capture device during shooting; taking the world coordinate system of the actual three-dimensional space as a reference, the target itself may be moving or stationary, which is not limited here.
While a target is being filmed, the image information of the target object may be recorded directly in a video frame, or the target may be occluded by other objects in some frames.
The multiple video frames of a video can be laid out along the time dimension. Since the shooting interval between video frames is known, different video frames correspond to different moments in the time dimension. Because each video frame is a two-dimensional image, the image information of a video frame corresponds to data in the two spatial dimensions; in the embodiments of this application, data presented in this form is defined as data in the space-time dimensions. The position of a target in a video frame can therefore be determined by its position in the time dimension and its position in the two spatial dimensions: the position in the time dimension identifies the video frame, and the position in the two spatial dimensions indicates the position information of the target within that video frame.
Please refer to FIG. 5, which is a schematic diagram of an embodiment of a motion pipeline in an embodiment of this application.
Target tracking needs to determine the position information of the target to be tracked (or "target" for short) in all video frames that contain the target object. Typically, the target position in each video frame can be identified by a bounding box. In the space-time dimensions, connecting the bounding boxes of the same target object across the video frames forms the trajectory of the target in the space-time region, i.e., the tracking trajectory, also called the motion trajectory. A tracking trajectory both gives the position of the target object and connects the positions of the target object at different moments; it can therefore indicate the temporal and spatial information of the target object at the same time. FIG. 5 only illustrates the position information of the target object in three video frames; it should be understood that the tracking trajectory can be obtained from all video frames of the video in the same way. It should be noted that the same video frame may contain one or more targets, and a tracking trajectory also includes the identifier (ID) of the target object it indicates; the ID of the target object can be used to distinguish the trajectories of different targets.
The motion pipeline and the tracking trajectory are introduced below.
A motion pipeline is used to indicate the position information of a target in at least two video frames and corresponds to a quadrangular frustum in the space-time dimensions. The position of the first base of the frustum in the time dimension indicates the first time information of a first video frame, and the position of the second base in the time dimension indicates the second time information of a second video frame; the position of the first base in the two spatial dimensions indicates the first position information of the target object in the first video frame, and the position of the second base in the two spatial dimensions indicates the second position information of the target object in the second video frame.
Optionally, a motion pipeline indicates the position information of the target in at least three different video frames. In this embodiment and the following embodiments, a motion pipeline that includes the position information of the target in three different video frames is used as an example.
In the space-time dimensions, a motion pipeline can be regarded as a double-frustum structure composed of two quadrangular frustums sharing a common base. The three bases of this double-frustum structure are parallel to each other; the direction perpendicular to the bases is the time dimension, and the directions in which the bases extend are the spatial dimensions. Each base represents the position of the target in the video frame at the moment corresponding to that base. FIG. 6 shows a motion pipeline with a double-frustum structure, including a first base 601, a second base 602, and a third base 603. The first base 601, i.e., rectangle abcd, gives through its position in the two-dimensional space where it lies the position information of the target object in the first video frame, and the position onto which rectangle abcd maps in the time dimension represents the time information of the first video frame. Similarly, the second base 602, i.e., rectangle ijkm, gives through its position in the two-dimensional space where it lies the position information of the target object in the second video frame, and the position onto which rectangle ijkm maps in the time dimension represents the time information of the second video frame. The third base 603, i.e., rectangle efgh, gives through its position in the two-dimensional space where it lies the position information of the target object in the third video frame, and the position onto which rectangle efgh maps in the time dimension represents the time information of the third video frame. It should be understood that, because there may be relative motion between the target object and the video capture device while the first video is shot, rectangles abcd, efgh, and ijkm may map to different positions when projected onto the two-dimensional space of a common base. The positions of the first base 601, the second base 602, and the third base 603 in the time dimension, i.e., the positions onto which points a, i, and e map in the time dimension, are a', i', and e' respectively, indicating the time information of the first, second, and third video frames. The length of the motion pipeline is the interval between the position onto which the second base maps in the time dimension and the position onto which the third base maps in the time dimension; it indicates the number of video frames, in the time order of the video, comprising the frame of the second base, the frame of the third base, and all frames between them.
It should be noted that the motion pipeline corresponding to a first video frame includes at least the position information of the target in that first video frame.
A tracking trajectory can be split into multiple motion pipelines, as shown in FIG. 6. Optionally, in the embodiments of this application, the tracking trajectory can be split into the position boxes of individual video frames, and each position box is taken as the common base of a double-frustum structure, such as the first base 601 in FIG. 6. The structure then extends forward and backward along the tracking trajectory to determine the other two bases of the double-frustum structure, namely the second base 602 and the third base 603. This yields a double-frustum structure with a common base, i.e., the motion pipeline corresponding to that single video frame.
For the first video frame of the video, the forward extension can be regarded as 0; similarly, the backward extension of the last video frame is 0, so the motion pipelines corresponding to the first and last video frames degenerate into single-frustum structures. It should be noted that the length of a motion pipeline is defined as the number of video frames it corresponds to; as shown in FIG. 6, the total number of video frames between (and including) the frame corresponding to the second base 602 and the frame corresponding to the third base 603 is the length of the motion pipeline.
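For illustration only, the following Python sketch shows one way such a split could be implemented; the list-of-boxes trajectory representation and the fixed forward/backward reach are assumptions, not details taken from the embodiments.

```python
# Hypothetical sketch: splitting a tracking trajectory into per-frame motion
# pipelines (double-frustum structures). A trajectory is modeled as a list of
# per-frame boxes (x1, y1, x2, y2); "extent" is the forward/backward reach.

def split_trajectory(boxes, extent=2):
    """For each frame t_m, build a pipeline with bases at t_s, t_m, t_e."""
    pipelines = []
    last = len(boxes) - 1
    for t_m in range(len(boxes)):
        t_s = max(0, t_m - extent)          # clamped to 0 at the first frame
        t_e = min(last, t_m + extent)       # clamped to 0 at the last frame
        pipelines.append({
            "t_s": t_s, "B_s": boxes[t_s],  # first base
            "t_m": t_m, "B_m": boxes[t_m],  # common base
            "t_e": t_e, "B_e": boxes[t_e],  # second base
            "length": t_e - t_s + 1,        # frames covered, inclusive
        })
    return pipelines
```

At the first and last frames the clamping reproduces the degenerate single-frustum case described above.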
In the embodiments of this application, a motion pipeline is represented in a specific data format. Please refer to FIG. 7 and FIG. 8, which are two schematic diagrams of motion pipeline data formats in the embodiments of this application.
As shown in FIG. 7, the first data format includes 3 values in the time dimension, (t_s, t_m, t_e), and 12 values in the spatial dimensions, 15 values in total. At the moment corresponding to each time value, the position information of the target in space is determined by 4 values. For example, in the video frame at time t_s, the target position region is B_s, and this region can be determined by four values, e.g., the coordinates of two diagonal corners of the box, (x_s1, y_s1) and (x_s2, y_s2).
As shown in FIG. 8, the motion pipeline output by the neural network model can also be represented in another data format. For the motion pipeline of video frame m, B_m is the detection box corresponding to the target in the common base, i.e., a partial image region of the corresponding video frame, and P is any pixel in region B_m. One value identifies the moment at which that pixel is located. In the time dimension, two values, d_s and d_e, determine the lengths by which the motion pipeline extends forward and backward, respectively. The four values l_m, b_m, t_m, and r_m indicate, with point P as the reference point, the offsets of the boundary of region B_m relative to P (regress values for B_m). The four values l_s, b_s, t_s, and r_s indicate the offsets of the boundary of region B_s relative to the boundary of region B_m (regress values for B_s); similarly, the four values l_e, b_e, t_e, and r_e indicate the offsets of the boundary of region B_e relative to the boundary of region B_m (regress values for B_e).
It can be seen that both data formats represent a single motion pipeline with 15 values, and the two data formats can be converted into each other.
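As a rough illustration of this convertibility, the following Python sketch decodes the second format (FIG. 8) into the first format (FIG. 7). The conventions here are assumptions made for illustration: offsets are taken in (left, top, right, bottom) order, and the B_s/B_e regress values are added to the corresponding edges of B_m.

```python
# Hypothetical decoding of the FIG. 8 format into the FIG. 7 format.
# Assumed conventions: P = (px, py) lies inside B_m; edge offsets are given
# in (left, top, right, bottom) order; the B_s / B_e regress values are
# added to the corresponding edges of B_m. 15 values in, 15 values out.

def decode_pipeline(px, py, t_pixel, d_s, d_e, off_m, off_s, off_e):
    left, top, right, bottom = off_m
    B_m = (px - left, py - top, px + right, py + bottom)   # common base box
    B_s = tuple(edge + delta for edge, delta in zip(B_m, off_s))
    B_e = tuple(edge + delta for edge, delta in zip(B_m, off_e))
    t_s, t_m, t_e = t_pixel - d_s, t_pixel, t_pixel + d_e  # three base times
    return (t_s, t_m, t_e), (B_s, B_m, B_e)                # FIG. 7 layout
```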
2. Intersection-over-union (IoU).
IoU is usually used to measure the degree of overlap between two position regions. In object detection, the intersection-over-union (IoU) is the ratio of the intersection to the union of two rectangular detection boxes, and its value lies in [0, 1]. Obviously, when IoU = 0, the two position regions do not overlap at all; when IoU = 1, the two position regions coincide.
In the embodiments of this application, the concept of IoU is extended to the three-dimensional space of the space-time dimensions to measure the degree to which two motion pipelines overlap in the space-time dimensions. Please refer to FIG. 9, which is a schematic diagram of the intersection and union of motion pipelines in an embodiment of this application.
IoU(T^(1), T^(2)) = ∩(T^(1), T^(2)) / ∪(T^(1), T^(2))
where T^(1) denotes motion pipeline 1, T^(2) denotes motion pipeline 2, ∩(T^(1), T^(2)) denotes the intersection of the two motion pipelines, and ∪(T^(1), T^(2)) denotes the union of the two motion pipelines.
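A minimal Python sketch of one way this space-time IoU could be computed discretely is given below, assuming each pipeline has already been sampled as one axis-aligned box per frame (e.g., by interpolating between its bases); the frame-wise sampling is an assumption for illustration, not a prescription from the embodiments.

```python
# Hypothetical discrete space-time IoU: each motion pipeline is represented
# as {frame_index: (x1, y1, x2, y2)}, one box per frame it covers.

def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def box_inter(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return box_area((x1, y1, x2, y2))

def tube_iou(t1, t2):
    """Sum of per-frame intersection areas over the union volume."""
    inter = sum(box_inter(t1[f], t2[f]) for f in t1.keys() & t2.keys())
    vol1 = sum(box_area(b) for b in t1.values())
    vol2 = sum(box_area(b) for b in t2.values())
    union = vol1 + vol2 - inter
    return inter / union if union > 0 else 0.0
```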
The target tracking method proposed in the embodiments of this application relates to the field of artificial intelligence technology; the artificial intelligence system is briefly introduced below. FIG. 1 shows a schematic diagram of an artificial intelligence framework, which describes the overall workflow of an artificial intelligence system and is applicable to general requirements of the artificial intelligence field.
The above artificial intelligence framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "intelligent information chain" reflects the series of processes from data acquisition to processing. For example, it may be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process of "data - information - knowledge - wisdom".
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technical realization) to the industrial ecology of the system.
(1) Infrastructure:
The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and provides support through a base platform. Communication with the outside is performed through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the base platform includes platform guarantees and support such as distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to obtain data, and the data is provided to intelligent chips in a distributed computing system provided by the base platform for computation.
(2) Data
The data at the layer above the infrastructure indicates the data sources in the field of artificial intelligence. The data involves graphics, images, speech, and text, as well as Internet-of-Things data of traditional devices, including business data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, searching, reasoning, decision-making, and the like.
Among them, machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, and so on, on data.
Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formal information to perform machine thinking and problem solving according to reasoning control strategies; typical functions are searching and matching.
Decision-making refers to the process of making decisions after intelligent information has been reasoned about, and usually provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the data has undergone the data processing mentioned above, some general capabilities can be further formed based on the results of the data processing, for example, an algorithm or a general system, such as translation, text analysis, computer-vision processing, speech recognition, image recognition, and so on.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The application fields mainly include intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, safe cities, intelligent terminals, and the like.
In the target tracking method proposed in the embodiments of this application, the motion pipeline of the target object is obtained through a deep neural network. The system architecture for data processing based on a deep neural network is briefly introduced below. Referring to FIG. 2, an embodiment of this application provides a system architecture 200. A data collection device 260 is used to collect video data of moving targets and store it in a database 230, and a training device 220 generates a target model/rule 201 based on the video samples containing moving targets maintained in the database 230. How the training device 220 obtains the target model/rule 201 based on video samples of moving targets is described in more detail below; the target model/rule 201 can be used in application scenarios such as single-target tracking, multi-target tracking, and virtual reality.
In the embodiments of this application, training may be performed based on video samples of moving targets. Specifically, various video samples containing moving targets may be collected by the data collection device 260 and stored in the database 230. In addition, video data may also be obtained directly from commonly used databases.
The target model/rule 201 may be obtained based on a deep neural network, which is introduced below.
The work of each layer in a deep neural network can be described by the mathematical expression y = a(W·x + b). At the physical level, the work of each layer in a deep neural network can be understood as completing a transformation from the input space (the set of input vectors) to the output space (i.e., from the row space to the column space of the matrix) through five operations on the input space: 1. raising/lowering the dimension; 2. enlarging/shrinking; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are performed by W·x, operation 4 is performed by +b, and operation 5 is realized by a(). The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of that class. W is a weight vector, and each value in the vector represents the weight value of one neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above; that is, the weight W of each layer controls how the space is transformed. The purpose of training a deep neural network is ultimately to obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
Because it is desirable for the output of the deep neural network to be as close as possible to the value that is actually to be predicted, the weight vector of each layer of the neural network can be updated by comparing the current predicted value of the network with the actually desired target value and adjusting according to the difference between the two (of course, there is usually an initialization process before the first update, i.e., parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to make the prediction lower, and adjustments continue until the neural network can predict the actually desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the loss function or objective function, an important equation for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
The target model/rule obtained by the training device 220 can be applied in different systems or devices. In FIG. 2, an execution device 210 is configured with an I/O interface 212 for data interaction with external devices, and a "user" can input data to the I/O interface 212 through a client device 240.
The execution device 210 can call data, code, and the like in a data storage system 250, and can also store data, instructions, and the like in the data storage system 250.
A calculation module 211 processes the input data using the target model/rule 201. Taking target tracking as an example, the calculation module 211 can parse the input video to obtain features indicating target position information in the video frames.
An associated function module 213 can preprocess the image data in the calculation module 211, for example, perform video preprocessing, including video segmentation and the like.
An associated function module 214 can preprocess the image data in the calculation module 211, for example, perform video preprocessing, including video segmentation and the like.
Finally, the I/O interface 212 returns the processing result to the client device 240 and provides it to the user.
At a deeper level, the training device 220 can generate corresponding target models/rules 201 based on different data for different targets, so as to provide users with better results.
In the case shown in FIG. 2, the user can manually specify the data input into the execution device 210, for example, by operating in an interface provided by the I/O interface 212. In another case, the client device 240 can automatically input data to the I/O interface 212 and obtain results; if automatic data input by the client device 240 requires the user's authorization, the user can set corresponding permissions in the client device 240. The user can view the results output by the execution device 210 on the client device 240, and the specific presentation form can be display, sound, action, or another specific manner. The client device 240 can also serve as a data collection terminal and store the collected training data in the database 230.
It is worth noting that FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationships among the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, in FIG. 2, the data storage system 250 is an external memory relative to the execution device 210; in other cases, the data storage system 250 may also be placed in the execution device 210.
The deep neural network used to extract motion pipelines from video in the embodiments of this application may be, for example, a convolutional neural network (CNN). CNNs are introduced in detail below.
A CNN is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network; taking image processing as an example, each neuron in the feed-forward artificial neural network responds to an overlapping region in the image input to it. Of course, other types are also possible; this application does not limit the type of the deep neural network.
As shown in FIG. 3, a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130.
Convolutional layer/pooling layer 120:
Convolutional layer:
As shown in FIG. 3, the convolutional layer/pooling layer 120 may include, for example, layers 121 to 126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer. In another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
Taking the convolutional layer 121 as an example, the convolutional layer 121 may include many convolution operators. A convolution operator, also called a kernel, acts in image processing like a filter that extracts specific information from the input image matrix. A convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), thereby completing the work of extracting a specific feature from the image.
Convolution kernels also come in multiple formats, depending on the dimensionality of the data to be processed. Commonly used convolution kernels include two-dimensional and three-dimensional kernels. Two-dimensional convolution kernels are mainly used to process two-dimensional image data, while three-dimensional convolution kernels, having an additional depth or time dimension, can be applied to video processing, stereoscopic image processing, and the like. In the embodiments of this application, in order to extract information in both the time dimension and the spatial dimensions of a video through the neural network model, three-dimensional convolution kernels are used to perform convolution operations in the time dimension and the spatial dimensions simultaneously. Thus, a three-dimensional convolutional neural network composed of three-dimensional convolution kernels can obtain the features of each video frame while also expressing the correlations and changes of the video frames over time.
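To make this concrete, a minimal PyTorch sketch is given below; the framework, tensor shapes, and channel counts are chosen only for illustration and are not specified by the embodiments. It applies a 3-D convolution to a video tensor so that the kernel slides over the time and spatial dimensions simultaneously.

```python
# Illustrative only: a 3-D convolution over a video clip, convolving jointly
# over time (T) and space (H, W). All shapes are assumptions.
import torch
import torch.nn as nn

video = torch.randn(1, 3, 8, 224, 224)    # (batch, RGB channels, T, H, W)

conv3d = nn.Conv3d(
    in_channels=3, out_channels=64,
    kernel_size=(3, 3, 3),                # 3 frames x 3 pixels x 3 pixels
    stride=1, padding=1,                  # keep T, H, W unchanged
)

features = conv3d(video)                  # -> (1, 64, 8, 224, 224)
print(features.shape)
```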
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layers (e.g., 121) often extract more general features, which may also be called low-level features. As the depth of the convolutional neural network 100 increases, the features extracted by later convolutional layers (e.g., 126) become more and more complex, such as high-level semantic features; features with higher-level semantics are more suitable for the problem to be solved. For convenience of describing the network structure, multiple convolutional layers may be referred to as a block.
Pooling layer:
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. That is, in layers 121 to 126 as illustrated by 120 in FIG. 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image.
Neural network layer 130:
After processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet sufficient to output the required output information. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 3) and an output layer 140. The parameters contained in the multiple hidden layers may be obtained by pre-training on relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the multiple hidden layers in the neural network layer 130, i.e., as the final layer of the entire convolutional neural network 100, comes the output layer 140.
It should be noted that the convolutional neural network 100 shown in FIG. 3 is only an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models; for example, as shown in FIG. 4, multiple convolutional layers/pooling layers are arranged in parallel, and the separately extracted features are all input to the neural network layer 130 for processing.
Optionally, the deep neural network used to extract motion pipelines from video in the embodiments of this application is a combination of a residual neural network and a feature pyramid network. The residual neural network makes deeper networks easier to train by letting the deep network learn residual representations; residual learning alleviates the vanishing-gradient and exploding-gradient problems in deep networks. The feature pyramid network detects targets of corresponding scales on feature maps of different resolutions; each of its output layers is obtained by fusing the feature maps of the current layer and higher layers, so every output feature map has sufficient feature expressiveness.
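The following PyTorch sketch illustrates the top-down fusion idea of a feature pyramid network on a set of backbone feature maps; the channel counts and the use of nearest-neighbor upsampling are assumptions for illustration, not details from the embodiments.

```python
# Illustrative FPN-style top-down fusion: each output level combines the
# current backbone level (via a 1x1 lateral conv) with the upsampled level
# above it, so every output map mixes fine resolution with high-level semantics.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                 # feats ordered fine -> coarse
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, laterals)]

fpn = TopDownFPN()
feats = [torch.randn(1, c, s, s) for c, s in [(256, 64), (512, 32), (1024, 16)]]
outs = fpn(feats)                             # three maps, each with 256 channels
```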
The target tracking technology involved in the target detection method provided by the embodiments of this application is widely applicable. For example, for autofocus during video shooting, a target tracking algorithm can help the photographer select the focus accurately and conveniently, or flexibly switch the focus to track a target, which is particularly important in sports events and wildlife shooting. In surveillance scenarios, a multi-target tracking algorithm can automatically track the positions of selected target objects, making it convenient to find a given target, which is of great significance in the security field. In autonomous driving scenarios, a multi-target tracking algorithm can grasp the motion trajectories and trends of surrounding pedestrians and vehicles, providing initial information for functions such as path planning and automatic obstacle avoidance. In virtual reality scenarios, somatosensory games, gesture recognition, finger tracking, and the like can also be realized through multi-target tracking technology.
A typical target tracking method consists of two parts, detection and tracking: a detection module detects the targets appearing in each video frame, and then the targets appearing in the individual video frames are matched. During matching, the features of each target object in a single video frame are extracted, target matching is achieved by comparing feature similarity, and the tracking trajectory of each target object is obtained. Because this kind of target tracking method adopts the detect-then-track approach, the tracking result depends on the single-frame detection algorithm; if a target is occluded during detection, detection errors occur and in turn cause tracking errors. Performance is therefore insufficient in scenes with dense targets or heavy occlusion.
The embodiments of this application provide a target detection method: a video is input into a pre-trained neural network model, multiple motion pipelines are output, and the tracking trajectories corresponding to one or more targets are recovered by matching the multiple motion pipelines. First, because a motion pipeline includes the position information of the target object in at least two video frames, target tracking does not depend on the detection result of any single video frame, which reduces detection failures in scenes with dense targets or heavy occlusion and improves target tracking performance. Second, conventional target detection methods rely on single-frame detection algorithms, so the accuracy of the overall algorithm is limited by the detector; training a detection model and a tracking model separately incurs high development cost, and splitting the algorithm into two stages also increases the computational cost and deployment difficulty of the machine learning process. The target tracking method provided by the embodiments of this application enables end-to-end training and completes the detection and tracking of multiple target objects with a single neural network model, which reduces model complexity. Furthermore, the features that the prior art extracts from a single video frame are relatively limited; the target tracking method provided by the embodiments of this application takes video as the raw input, so the model can perform the tracking task using multiple kinds of features, such as appearance features, motion trajectory features, or gait features, which improves target tracking performance. Finally, the target tracking method provided by the embodiments of this application uses video as the raw input of the model, which enlarges the receptive field in the time dimension and better captures the motion information of persons.
The target detection method provided by the embodiments of this application is described in detail below. Please refer to FIG. 10, which is a schematic diagram of an embodiment of the target detection method in an embodiment of this application.
1001. Preprocess the video.
The target tracking device can preprocess the acquired video. Optionally, the preprocessing includes one or more of the following: dividing the video into segments of a preset length, adjusting the video resolution, and adjusting and normalizing the color space.
For example, when the video is long, considering the data processing capability of the target tracking device, the video may be divided into small segments of 8 frames, as sketched below.
It should be noted that step 1001 is optional and may or may not be performed.
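A minimal Python sketch of such preprocessing follows; the segment length, target resolution, and normalization scheme are assumptions for illustration only.

```python
# Illustrative preprocessing: split a video into fixed-length clips, resize
# frames, and normalize the color values. All constants are assumptions.
import numpy as np
import cv2  # OpenCV, used here only for resizing

def preprocess(frames, clip_len=8, size=(224, 224)):
    """frames: list of HxWx3 uint8 RGB arrays -> list of clip arrays."""
    resized = [cv2.resize(f, size) for f in frames]
    norm = [f.astype(np.float32) / 255.0 for f in resized]   # scale to [0, 1]
    clips = []
    for i in range(0, len(norm) - clip_len + 1, clip_len):
        clips.append(np.stack(norm[i:i + clip_len]))         # (clip_len, H, W, 3)
    return clips
```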
1002. Input the video into the neural network model to obtain motion pipelines and their confidences.
The video is input into the pre-trained neural network model to obtain the position information of the target object in at least two video frames and the time information of the at least two video frames. Optionally, the video is input into the pre-trained neural network model to obtain the motion pipeline of each target object. A motion pipeline is used to indicate the time information and position information of the target object in at least two video frames of the first video; for the specific way in which a motion pipeline indicates time information and position information, refer to the foregoing introduction, which is not repeated here. The training process of the neural network model is described in detail in subsequent embodiments.
Optionally, the data format of the output motion pipelines is the type shown in FIG. 8. Specifically, the input is a video I, I ∈ R^(t×h×w×3), where R denotes the real number field, t denotes the number of video frames, h×w denotes the video resolution, and 3 denotes the RGB color channels. The output is the motion pipelines O, O ∈ R^(t×h'×w'×15), where R denotes the real number field, t denotes the number of video frames, and h'×w' denotes the resolution of the feature map output by the neural network. That is, t×h'×w' motion pipelines are output, and each video frame corresponds to h'×w' motion pipelines.
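These shapes can be illustrated with a short sketch that flattens the network output into a list of per-location pipeline candidates; the tensor layout follows the text above, while the random array stands in for a real model output and is an assumption.

```python
# Illustrative only: turn a network output O of shape (t, h', w', 15) into a
# flat list of candidate motion pipelines, one per output location.
import numpy as np

t, hp, wp = 8, 56, 56
O = np.random.rand(t, hp, wp, 15)          # stand-in for real model output

pipelines = []
for frame in range(t):
    for y in range(hp):
        for x in range(wp):
            values = O[frame, y, x]        # 15 values: time extents + offsets
            pipelines.append((frame, y, x, values))

print(len(pipelines))                      # t * h' * w' candidate pipelines
```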
Optionally, the category information of the target object is obtained through the pre-trained neural network model. Specifically, the confidence of each motion pipeline is obtained through the pre-trained neural network model, and the confidence of a motion pipeline can be used to determine the category information of the target object corresponding to that motion pipeline.
Since a motion pipeline indicates the position information of a target in video frames, each motion pipeline corresponds to one target to be tracked, and the confidence of a motion pipeline refers to the possibility that the target corresponding to that motion pipeline belongs to a preset category. Usually, the categories of the target objects to be tracked in the video need to be preset, for example, person, vehicle, or dog. The output confidences of a motion pipeline represent the probabilities that the target corresponding to that motion pipeline belongs to the preset categories. A confidence is a value between 0 and 1: a smaller value indicates a lower possibility of belonging to the preset category, and a larger value indicates a higher possibility.
Optionally, the number of confidences of each motion pipeline is equal to the number of preset target object categories, and each confidence indicates the possibility that the motion pipeline belongs to the corresponding category. The confidences of the motion pipelines output by the neural network model form a confidence table.
Example 1: the preset categories of the target object are "person" and "background", where the background refers to image regions that do not contain a target object to be tracked. The confidence values of the first motion pipeline for these categories are 0.1 and 0.9, respectively, and those of the second motion pipeline are 0.7 and 0.3. Since there is only one non-background category, the category of a target object has two possibilities, "person" or "background", so the confidence threshold can be set to 0.5. For the first motion pipeline, the confidence of belonging to "person" is 0.1, which is less than or equal to 0.5, meaning the corresponding target is unlikely to be a person, while the confidence of belonging to "background" is 0.9, which is greater than 0.5, meaning it is likely to be background. For the second motion pipeline, the confidence of belonging to "person" is 0.7, which is greater than 0.5, meaning the corresponding target is likely to be a person, while the confidence of belonging to "background" is 0.3, which is less than 0.5, meaning it is unlikely to be background.
Example 2: the preset categories of the target object are "person", "vehicle", and "background". The confidence values of the first motion pipeline are 0.4, 0.1, and 0.2, and those of the second motion pipeline are 0.2, 0.8, and 0.1. The category of a target object has three possibilities: "person", "vehicle", or "background", so 1/3 ≈ 0.33 can be used as the confidence threshold. Since 0.4 is greater than 0.33, the category with the highest confidence for the first motion pipeline is "person", that is, the corresponding target object is most likely a person. Similarly, the category with the highest confidence for the second motion pipeline is "vehicle", that is, the corresponding target object is most likely a vehicle.
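The selection logic of the two examples can be sketched as follows; the category names and confidence values come from Example 2 above, while everything else (variable names, printing) is illustrative:

```python
# Confidence table: one row per motion pipeline, one column per preset category.
categories = ["person", "vehicle", "background"]
confidences = [
    [0.4, 0.1, 0.2],  # first motion pipeline  -> highest score: "person"
    [0.2, 0.8, 0.1],  # second motion pipeline -> highest score: "vehicle"
]

threshold = 1.0 / len(categories)  # 1/3 ~= 0.33, as in Example 2

for row in confidences:
    best = max(range(len(categories)), key=lambda k: row[k])
    if row[best] > threshold:
        print(f"pipeline assigned to category '{categories[best]}' ({row[best]:.2f})")
    else:
        print("no category exceeds the threshold")
```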
1003. Prune some of the motion pipelines.
Before the tracking trajectory of the target object is obtained from the motion pipelines, the motion pipelines may also be pruned to obtain a pruned set of motion pipelines, and the pruned motion pipelines are then used to obtain the tracking trajectory of the target object. The multiple motion pipelines output by the neural network model can be pruned according to preset conditions.
Among the multiple motion pipelines predicted by the neural network model, every pixel position in every video frame corresponds to a motion pipeline, and a target appearing in a video frame usually occupies multiple pixel positions, so there are multiple motion pipelines indicating the position information of the same target object. In this step, the multiple motion pipelines corresponding to the same target object can be pruned, reducing the amount of computation in the subsequent motion-pipeline connection step.
Optionally, if the confidence of each motion pipeline has been obtained, the category to which the target corresponding to each motion pipeline belongs can be determined according to the confidence, and the motion pipelines of each category are pruned separately.
Optionally, obtaining the pruned motion pipelines specifically includes: if the repetition rate between a first motion pipeline and a second motion pipeline is greater than or equal to a first threshold, deleting whichever of the first and second motion pipelines has the lower confidence. Optionally, the repetition rate of the motion pipelines may be the IoU between the two motion pipelines; the first threshold ranges from 0.3 to 0.7, for example 0.5, so that if the IoU between the first motion pipeline and the second motion pipeline is greater than or equal to 50%, the motion pipeline with the lower confidence is deleted. Optionally, the motion pipelines are pruned according to a non-maximum suppression (NMS) algorithm to obtain the pruned motion pipelines, with the motion-pipeline IoU threshold set to 0.5; with the NMS algorithm, only one corresponding motion pipeline is retained for each target in each video frame. Pruning detection results with the NMS algorithm is prior art, and the specific process is not repeated here.
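A minimal sketch of the NMS-style pruning described above follows. The tube-IoU routine is deliberately left abstract, since its exact geometry is defined by the double-frustum representation; the 0.5 threshold follows the text:

```python
def nms_tubes(tubes, scores, iou_fn, iou_thresh=0.5):
    """Greedy NMS: keep the highest-confidence tube, then drop any tube whose
    overlap (IoU) with an already-kept tube is >= iou_thresh."""
    order = sorted(range(len(tubes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou_fn(tubes[i], tubes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return kept  # indices of the surviving motion pipelines
```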
Among the multiple motion pipelines predicted by the neural network model, every pixel position in every video frame corresponds to a motion pipeline, so pixel positions covered by background regions that do not correspond to any target object also correspond to some motion pipelines. These motion pipelines can be understood as false motion pipelines, and their confidence is usually low. To reduce the computational complexity of the subsequent motion-pipeline connection step, pruning can also be performed according to the confidence of the motion pipelines.
Optionally, the confidence of every motion pipeline that remains after pruning is greater than or equal to a second threshold; that is, the pruning condition is that the confidence is below the second threshold. The second threshold is related to the number of preset target object categories. For example, if the number of preset categories is 2 ("person" or "background"), the second threshold is usually between 0.3 and 0.7, for example 0.5; if the number of target object categories is 10, the second threshold is usually between 0.07 and 0.13, for example 0.1.
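The confidence-based pruning then reduces to a simple filter, sketched below; the 0.5 default corresponds to the two-category case above:

```python
def prune_by_confidence(tubes, scores, conf_thresh=0.5):
    # Keep only tubes whose confidence reaches the second threshold.
    return [tube for tube, s in zip(tubes, scores) if s >= conf_thresh]
```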
It should be noted that step 1003 is optional and may or may not be performed.
1004. Connect the motion pipelines to obtain tracking trajectories.
The tracking trajectory of the target object in the first video is obtained according to the position information of the target object in at least two video frames and the time information of the at least two video frames. Optionally, since in this embodiment motion pipelines indicate the position information of the target object in at least two video frames and the time information of those frames, the tracking trajectory of the target object in the first video can be obtained from the motion pipelines. Specifically, the tracking trajectory is the tracking trajectory of the target object formed by connecting at least two motion pipelines, each corresponding to a quadrangular frustum in the space-time dimensions.
Specifically, multiple motion pipelines indicating the position information of the same target object are connected to obtain the tracking trajectory corresponding to each target; the connection between motion pipelines, also called the matching between motion pipelines, must satisfy preset conditions. Obtaining the tracking trajectory of the target object according to the motion pipelines specifically includes: connecting a third motion pipeline and a fourth motion pipeline that satisfy a preset condition, to obtain the tracking trajectory of the target object.
The preset condition can take several specific forms. Optionally, the preset condition includes one or more of the following: the intersection-over-union between the segments of the third motion pipeline and the fourth motion pipeline that overlap in the time dimension is greater than or equal to a third threshold; the cosine of the angle between the motion direction of the third motion pipeline and the motion direction of the fourth motion pipeline is greater than or equal to a fourth threshold, where the motion direction is a vector that indicates, according to a preset rule in the space-time dimensions, the position change of the target object in the motion pipeline; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to a fifth threshold, where the distance includes the Euclidean distance.
Specifically: the intersection-over-union between the motion-pipeline segments corresponding to the overlap of the two motion pipelines in the time dimension is greater than or equal to the third threshold; the cosine of the angle between the motion directions of the motion pipelines is greater than or equal to the fourth threshold; and the distance measure between the neural network feature vectors of the motion pipelines is less than or equal to the fifth threshold, where the distance measure may be, for example, the Euclidean distance. The neural network feature vector of a motion pipeline can be the feature vector output by any layer of the neural network model; optionally, it is the feature vector output by the last layer of the three-dimensional (3D) convolutional neural network in the neural network model.
The motion direction of a motion pipeline is a vector in the space-time dimensions indicating the position change of the target object between two base faces of the motion pipeline, and it indicates the moving speed and direction of the target object. It can be understood that the position of a target object in a video usually changes continuously, without abrupt jumps; therefore, the motion directions of adjacent motion-pipeline segments in a tracking trajectory are close to each other, and during the connection of motion pipelines, pipelines can also be connected according to the similarity of their motion directions. It should be noted that the motion direction of a motion pipeline can be determined according to preset rules. For example, in the space-time dimensions, the vector of the position change of the target object between the two base faces of the motion pipeline that are farthest apart in the time dimension (for example, Bs and Be of the motion pipeline shown in Figure 8) can be taken as the motion direction; or the vector of the position change of the target object between two adjacent base faces of the motion pipeline (for example, Bm and Be of the motion pipeline shown in Figure 8) can be taken as the motion direction; or the direction of the position change of the target object across a preset number of video frames, for example 5 frames, can be taken as the motion direction. Similarly, the direction of a tracking trajectory can be defined, at the end of the trajectory, as the direction of the position change of the target object across a preset number of video frames, or as the motion direction of the last motion pipeline at the end of the trajectory. It can be understood that the motion direction of a motion pipeline is generally defined, in the time dimension, as the direction from a given moment toward a later moment.
The value of the third threshold is not limited and is usually 70% to 95%, for example 75%, 80%, 85%, or 90%. The value of the fourth threshold is not limited and is usually between cos(π/6) and cos(π/36), for example cos(π/9), cos(π/12), or cos(π/18). The value of the fifth threshold can be determined according to the size of the feature vectors, and its specific value is not limited.
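Under the stated conditions, a matching predicate might look like the following sketch. The segment-IoU routine and the tube attributes (direction, feature) are placeholders, and the threshold values are representative picks from the ranges above, not prescribed by this application:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def can_connect(tube_a, tube_b, segment_iou_fn,
                iou_thresh=0.8,                     # third threshold (70%..95%)
                cos_thresh=math.cos(math.pi / 12),  # fourth threshold
                feat_thresh=1.0):                   # fifth threshold (data-dependent)
    """All three conditions from the text; an implementation may use any subset."""
    return (segment_iou_fn(tube_a, tube_b) >= iou_thresh
            and cosine(tube_a.direction, tube_b.direction) >= cos_thresh
            and euclidean(tube_a.feature, tube_b.feature) <= feat_thresh)
```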
Optionally, the following description takes as an example the preset condition that the intersection-over-union between the pipeline segments corresponding to the overlap of the two motion pipelines in the time dimension is greater than or equal to the third threshold, and that the cosine of the angle between the motion directions of the motion pipelines is greater than or equal to the fourth threshold.
Refer to Figure 11, a schematic diagram of an embodiment of matching between motion pipelines in an embodiment of this application.
Example 1: as shown in part a of Figure 11, if the intersection-over-union between the motion-pipeline segments corresponding to the overlap of the two motion pipelines in the time dimension is greater than or equal to the third threshold, and the cosine of the angle between the motion directions of the two motion pipelines is greater than or equal to the fourth threshold, that is, both the degree of overlap and the motion direction match, the two motion pipelines are matched successfully. It should be noted that the degree of overlap between two motion pipelines refers to the IoU between the motion-pipeline segments of the portion where the two motion pipelines overlap in the time dimension.
Example 2: as shown in part b of Figure 11, if the cosine of the angle between the motion directions of the two motion pipelines is less than the fourth threshold, that is, the motion directions do not match, the matching of the two motion pipelines is unsuccessful.
Example 3: as shown in part c of Figure 11, if the intersection-over-union between the motion-pipeline segments corresponding to the overlap of the two motion pipelines in the time dimension is less than the third threshold, that is, the degree of overlap does not match, the matching of the two motion pipelines is unsuccessful.
It should be noted that, since the two motion pipelines being matched overlap in the time dimension, there are two pieces of position information for the same target object in the video frames corresponding to the overlapping portion. The position of the target object in those frames can be determined by averaging, or, according to a preset rule, one motion pipeline can be designated as authoritative, for example the one whose common base face corresponds to the earlier time-dimension coordinate.
Optionally, in the matching process that connects all the motion pipelines of the video, a greedy algorithm can be used, performing the connection through a series of locally optimal choices; alternatively, the Hungarian algorithm can be used to perform globally optimal matching.
Connecting motion pipelines with the greedy algorithm specifically includes: computing the pairwise affinity between the two groups of motion pipelines to be matched (the affinity is defined as IoU·cos(θ), where θ is the angle between the motion directions) to form an affinity matrix, and then repeatedly selecting matched motion-pipeline pairs (Btube pairs) from the affinity matrix, starting from the maximum affinity, until the matching is complete.
Connecting motion pipelines with the Hungarian algorithm specifically includes: likewise, after the affinity matrix is obtained, using the Hungarian algorithm to select the motion-pipeline pairs.
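The two matching strategies can be sketched as follows, with the affinity defined as IoU·cos(θ) as above. The segment-IoU and direction-cosine routines are placeholders; for the Hungarian variant, SciPy's linear_sum_assignment is one readily available solver, used here only as an illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def affinity_matrix(tubes_a, tubes_b, seg_iou, cos_dir):
    A = np.zeros((len(tubes_a), len(tubes_b)))
    for i, ta in enumerate(tubes_a):
        for j, tb in enumerate(tubes_b):
            A[i, j] = seg_iou(ta, tb) * cos_dir(ta, tb)  # IoU * cos(theta)
    return A

def greedy_match(A, min_affinity=0.0):
    """Repeatedly pick the largest remaining affinity (a local optimum)."""
    A = A.copy()
    pairs = []
    while A.size and A.max() > min_affinity:
        i, j = np.unravel_index(A.argmax(), A.shape)
        pairs.append((i, j))
        A[i, :] = -np.inf   # each tube is matched at most once
        A[:, j] = -np.inf
    return pairs

def hungarian_match(A):
    """Globally optimal assignment; maximize total affinity."""
    rows, cols = linear_sum_assignment(-A)  # scipy minimizes, so negate
    return list(zip(rows, cols))
```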
Optionally, a specific process of connecting multiple motion pipelines in this embodiment is described below (a code sketch follows the example):
1) Take all motion pipelines that start at the first frame as initial tracking trajectories, obtaining a set of tracking trajectories.
2) Connect the motion pipelines that start at the second frame, in turn, with the tracking trajectories in the set; if the preset condition is satisfied, the matching succeeds and the original tracking trajectory is updated according to that motion pipeline. If the matching is unsuccessful, the motion pipeline is added to the set as a new initial tracking trajectory.
3) Similarly, connect the motion pipelines that start at the i-th frame with the tracking trajectory set in turn, where i is a positive integer greater than 2 and less than t, and t is the total number of frames of the video. If the preset condition is satisfied, the matching succeeds and the tracking trajectory is updated according to that motion pipeline; if the matching is unsuccessful, the motion pipeline is added to the set as a new initial tracking trajectory.
Optionally, this embodiment uses a greedy algorithm, connecting pipelines to trajectories in order starting from the maximum affinity.
For example, let the motion pipelines starting at the first frame form the first group, those starting at the second frame form the second group, and similarly those starting at the i-th frame form the i-th group, where the first group includes 10 motion pipelines, the second group includes 8 motion pipelines, and the third group includes 13 motion pipelines. First, the 10 motion pipelines in the first group are taken as 10 initial tracking trajectories, and the second group is connected with these initial trajectories: if the connection condition is satisfied, the tracking trajectory is updated; if not, the original initial trajectory is kept. Assuming the 8 motion pipelines in the second group all satisfy the connection condition and are successfully connected to 8 of the 10 initial tracking trajectories, the tracking trajectory set then includes 8 updated tracking trajectories, while the other 2 remain unchanged. Next, the 13 motion pipelines in the third group are connected with the trajectories in the set; since the set includes 10 tracking trajectories, even if all of them are successfully connected to motion pipelines of the third group, 3 motion pipelines are still not used to update any trajectory, and these 3 motion pipelines can serve as newly added initial tracking trajectories, that is, 3 new tracking trajectories are added to the set.
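The loop of steps 1) to 3) might be sketched like this. The matching predicate and the merge routine are placeholders, and the greedy ordering by affinity is omitted for brevity:

```python
def build_tracks(tube_groups, matches, merge):
    """tube_groups[i]: tubes whose first frame is frame i (0-based here).
    matches(track, tube) -> bool; merge(track, tube) -> extended track."""
    tracks = [[tube] for tube in tube_groups[0]]  # frame-1 tubes seed the set
    for group in tube_groups[1:]:
        for tube in group:
            for k, track in enumerate(tracks):
                if matches(track, tube):
                    tracks[k] = merge(track, tube)  # update existing trajectory
                    break
            else:
                tracks.append([tube])  # unmatched tube starts a new trajectory
    return tracks
```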
Optionally, the target category to which each motion pipeline's target belongs is determined according to the confidence table of the motion pipelines, and the motion pipelines of different target categories are connected separately, to obtain the tracking trajectory of the target objects of each category.
Optionally, the spatial position of an occluded part can be obtained by interpolating and completing the motion pipelines.
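One simple way to realize this completion is linear interpolation of the box coordinates across the occluded frames, sketched below; the (x1, y1, x2, y2) box format is an assumption for illustration:

```python
def interpolate_boxes(box_a, box_b, num_missing):
    """Fill num_missing frames between two known boxes by linear interpolation."""
    filled = []
    for k in range(1, num_missing + 1):
        w = k / (num_missing + 1)
        filled.append(tuple((1 - w) * a + w * b for a, b in zip(box_a, box_b)))
    return filled
```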
1005. Output the tracking trajectories.
The connected tracking trajectories are output in a specific format, for example as a video stream or a trajectory log.
After the tracking trajectories are obtained, they are rendered as bounding boxes superimposed on the original video and output to a display, completing the real-time tracking deployment and realizing target tracking.
The target tracking method provided in the embodiments of this application relies on a pre-trained neural network model; the training method of this neural network model is described below.
Refer to Figure 12, a schematic diagram of an embodiment of the training method of the neural network model in an embodiment of this application.
1201. Training preparation.
Training preparation includes building the training hardware environment, constructing the network model, and setting the training parameters.
Prepare the hardware environment required for training. For example, training uses 32 V100-32G GPUs in a distributed cluster of 4 nodes, while the inference process uses a single V100-16G GPU and is completed on a single machine.
Obtain a video dataset; a public dataset such as the MOT dataset can be used. Optionally, the video samples in the dataset can also be processed to increase the diversity of the data distribution and obtain better model generalization. Optionally, the processing of the videos includes resolution scaling, whitening of the color space, random HSL jitter of the video colors (HSL is a color space, or color representation, where H is hue, S is saturation, and L is lightness), random horizontal flipping of video frames, and so on.
Set the training parameters, including batch size, learning rate, optimizer model, and so on. For example, the batch size is 32, and the learning rate starts at 10^(-3) and is reduced by a factor of 5 when the loss plateaus, for better convergence. The network essentially converges after 25K training iterations. To increase the generalization ability of the model, a second-order regularization loss of 10^(-5) is applied, with a momentum coefficient of 0.9.
1202. Split the manually annotated trajectory information to obtain ground-truth motion pipelines.
Obtain the manually annotated trajectory information of the video samples in the public dataset, including the target ID and the bounding box of the target object in each video frame.
Split the manually annotated trajectory information to obtain, for each frame, a motion pipeline whose common base face is the bounding box of the target object in that frame. Based on the first data format of motion pipelines, each motion pipeline is represented by 15 values.
The specific method for obtaining the motion pipelines is as follows:
Split the tracking trajectory into the bounding boxes of single video frames; take each bounding box as the common base face of a double quadrangular frustum structure, and extend forward and backward along the tracking trajectory to determine the other two base faces of the double-frustum structure, thereby obtaining a double quadrangular frustum structure with a common base face, that is, the motion pipeline corresponding to that single video frame.
There are several ways to split a tracking trajectory into motion pipelines:
Optionally, split according to a preset pipeline length, that is, set the intervals between the three base faces of the double quadrangular frustum structure; for example, if the interval between the common base face and each of the other two base faces is 4, the length of the motion pipeline is 8.
Optionally, during the splitting process, under the condition that the IoU between the double quadrangular frustum structure and the corresponding section of the original tracking trajectory remains greater than or equal to 85%, the length of the motion pipeline in the time dimension is extended as far as possible, and the structure with the longest extent in the time dimension is taken as the final extended structure, as shown in Figure 13. Since the structure of a motion pipeline (Btube) is linear while the structure of the ground-truth trajectory is nonlinear, a long motion pipeline often cannot fit the trajectory well; that is, as the length increases, the IoU decreases (IoU < η), while motion pipelines with larger IoU (IoU > η) are usually shorter. In the embodiments of this application, the longest motion pipeline that still satisfies the minimum IoU threshold is taken as the split result, which fits the original trajectory well while enlarging the temporal receptive field. As shown in Figure 13, the overlapping part (Overlap Part) of motion pipelines can be used for connection matching between them.
Similarly, the tracking trajectories of all target objects in the video samples are split to obtain multiple ground-truth motion pipelines.
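The "extend as long as the IoU stays above η" rule can be sketched as a simple search; the tube-fitting and tube-trajectory IoU routines are placeholders whose exact behavior depends on the double-frustum geometry, and η = 0.85 follows the text:

```python
def longest_tube(track_boxes, center, fit_tube, tube_track_iou, eta=0.85):
    """Grow the tube symmetrically around `center` while the linear tube
    still overlaps the (nonlinear) ground-truth section with IoU >= eta."""
    best = fit_tube(track_boxes, center, 1)
    extent = 1
    while True:
        lo, hi = center - (extent + 1), center + (extent + 1)
        if lo < 0 or hi >= len(track_boxes):
            break
        candidate = fit_tube(track_boxes, center, extent + 1)
        if tube_track_iou(candidate, track_boxes, lo, hi) < eta:
            break
        best, extent = candidate, extent + 1
    return best
```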
1203. Input video samples into the initial network model for training and obtain predicted motion pipelines.
The video samples are input into the initial network model for training, and the predicted motion pipelines are output.
Optionally, the initial network model is a three-dimensional (3D) convolutional neural network or a recurrent neural network, among others, where the 3D convolutional neural network includes a 3D residual neural network or a 3D feature pyramid network, among others. Optionally, the neural network model is a combination of a 3D residual neural network and a 3D feature pyramid network.
The video samples are input into the initial network model, which outputs the motion pipelines of all target objects.
The data format of the output motion pipelines is of the type shown in Figure 8. Specifically, the input is a video I, I ∈ R^(t×h×w×3), where R denotes the real number field, t denotes the number of video frames, h×w denotes the video resolution, and 3 denotes the RGB color channels; the output is the motion pipeline tensor O, O ∈ R^(t×h'×w'×15), where R denotes the real number field, t denotes the number of video frames, and h'×w' denotes the resolution of the feature map output by the neural network. That is, t×h'×w' motion pipelines are output, with each video frame corresponding to h'×w' motion pipelines.
Optionally, the confidence of each motion pipeline is also output; the confidence is used to indicate the category of the target object corresponding to the motion pipeline.
It should be noted that the execution order of step 1202 and step 1203 is not limited.
1204. Compute the training loss.
Since step 1202 splits the manually annotated trajectory information, the obtained ground-truth motion pipelines are in the first data format of motion pipelines, R^(n×15), where n is the number of motion pipelines;
while the motion pipelines output by the initial network model in step 1203 are in the second data format of motion pipelines, R^(t×h'×w'×15), where t×h'×w' is the number of motion pipelines.
To compute the training loss from the ground truth and the prediction, the ground-truth motion pipelines obtained in step 1202 and the motion pipelines output by the neural network model must be unified into a single data format.
Optionally, in this embodiment of the application, the ground-truth motion pipelines are converted into the second data format. Referring to Figure 14, the t×h'×w' motion pipelines output by the neural network model include t×h'×w' points P (only P1 and P2 are shown as examples in Figure 14); these points form a three-dimensional lattice distributed over the time dimension and the two spatial dimensions. To achieve the data conversion, the n ground-truth motion pipelines must be assigned into a similar three-dimensional lattice according to the following rule: if a point of the lattice lies inside the common base face of the double quadrangular frustum structure corresponding to a ground-truth motion pipeline, that ground-truth value is assigned to the motion-pipeline position corresponding to that point. If a lattice point lies inside the common base faces of several ground-truth motion pipelines (that is, targets overlap), the motion pipeline with the smaller volume is assigned preferentially. After the assignment is completed, a ground-truth motion-pipeline tensor T with the same format R^(t×h'×w'×15) is obtained. It should be noted that some points are not assigned any ground truth; these positions can be padded with 0, and the ground truth is accompanied by a 0/1 truth table indicating whether each position is a padded pipeline. This truth table A′ can serve as the confidence corresponding to the ground-truth motion pipelines.
After the ground truth has been converted into the second data format, the loss between the ground truth (T) and the prediction (O) can be computed.
Optionally, the loss function L is:
L = L1 + L2
L1 = -ln(IoU(T, O))
L2 = CrossEntropy(A, A′)
where IoU(T, O) denotes the intersection-over-union between the ground-truth motion pipelines (T) and the predicted motion pipelines (O), A is the confidence of the predicted motion pipelines (O), A′ is the confidence of the ground-truth motion pipelines (T), and CrossEntropy denotes the cross-entropy.
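In a framework such as PyTorch, this loss might be sketched as follows. The tube-IoU computation is left abstract, and binary cross-entropy is used here since A′ is a 0/1 table; both choices are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def training_loss(tube_iou, pred_conf, gt_conf, eps=1e-6):
    """tube_iou: IoU(T, O) per lattice position, values in (0, 1];
    pred_conf (A, in [0, 1]) and gt_conf (A', 0/1) share the lattice shape."""
    l1 = -torch.log(tube_iou.clamp(min=eps)).mean()   # L1 = -ln(IoU(T, O))
    l2 = F.binary_cross_entropy(pred_conf, gt_conf)   # L2 = CrossEntropy(A, A')
    return l1 + l2                                    # L = L1 + L2
```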
1205. Optimize the neural network model with an optimizer according to the training loss.
According to the training loss L obtained in step 1204, the parameters are updated by the optimizer to optimize the neural network model, finally yielding a neural network model that can be used to implement the target tracking method of the embodiments of this application.
There are many types of optimizers; optionally, the optimizer may use the BGD (batch gradient descent) algorithm, the SGD (stochastic gradient descent) algorithm, or the MBGD (mini-batch gradient descent) algorithm, among others.
Refer to Figure 15, a schematic diagram of another embodiment of a target tracking method in an embodiment of this application.
In this solution, the target tracking device can track moving targets in a video in real time.
Specifically:
1501. System initialization.
At the start of the method, the system of the target tracking device is initialized first, completing the preparations for starting the device.
1502. Obtain video content.
The video may be captured by the target tracking device in real time, or obtained through a communication network.
1503. Obtain the set of motion pipelines through neural network model computation.
The video obtained in step 1502 is input into the pre-trained neural network model, yielding the set of motion pipelines of the input video, including the motion pipelines of the target objects corresponding to each video frame.
1504. Connect the motion pipelines into tracking trajectories in sequence, based on a greedy algorithm.
The basic idea of the greedy algorithm is to proceed step by step from some initial solution of the problem, ensuring at each step, according to some optimization measure, that a locally optimal solution is obtained. It can be understood that the algorithm for connecting the motion pipelines can be replaced by other algorithms, which is not limited here.
1505. Output the tracking trajectories.
It should be noted that, for single-target tracking, the output is the tracking trajectory of one target object; for multi-target tracking, the tracking trajectory of each target object can be output. Specifically, the tracking trajectories can be rendered as bounding boxes in each video frame, superimposed on the original video, and displayed by the display module.
Considering that the video is captured in real time, the target tracking device continues to obtain newly captured video content and repeats steps 1502 to 1505 until the target tracking task ends, which is not described further here.
The target tracking method provided by this application has been introduced above; the target tracking device that implements it is introduced below. Refer to Figure 16, a schematic diagram of an embodiment of the target tracking device in an embodiment of this application.
One or more of the modules in Figure 16 can be implemented in software, hardware, firmware, or a combination thereof. The software or firmware includes, but is not limited to, computer program instructions or code, and can be executed by a hardware processor. The hardware includes, but is not limited to, various integrated circuits, such as a central processing unit (CPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), or an application-specific integrated circuit (ASIC).
The target tracking device includes:
an acquisition unit 1601, configured to acquire a first video, the first video including a target object;
the acquisition unit 1601 is further configured to input the first video into a pre-trained neural network model to acquire position information of the target object in at least two video frames and time information of the at least two video frames;
the acquisition unit 1601 is further configured to acquire, according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, a tracking trajectory of the target object in the first video, the tracking trajectory including the position information of the target object in the at least two video frames of the first video.
Optionally, the acquisition unit 1601 is specifically configured to acquire a motion pipeline of the target object, the motion pipeline indicating the time information and position information of the target object in at least two video frames of the first video, where the first video includes a first video frame and a second video frame. The motion pipeline corresponds to a quadrangular frustum in the space-time dimensions, the space-time dimensions including a time dimension and two spatial dimensions. The position of the first base face of the quadrangular frustum in the time dimension indicates the first time information of the first video frame, and the position of the second base face of the quadrangular frustum in the time dimension indicates the second time information of the second video frame; the position of the first base face in the two spatial dimensions indicates the first position information of the target object in the first video frame, and the position of the second base face in the two spatial dimensions indicates the second position information of the target object in the second video frame. The quadrangular frustum indicates the position information of the target object in all video frames between the first video frame and the second video frame of the first video.
Optionally, the acquisition unit 1601 is specifically configured to acquire a motion pipeline of the target object, the motion pipeline indicating the position information of the target object in at least three video frames and the time information of the at least three video frames, where the first video includes a first video frame, a second video frame, and a third video frame. The motion pipeline corresponds to a double quadrangular frustum in the space-time dimensions; the double quadrangular frustum includes a first quadrangular frustum and a second quadrangular frustum, the first quadrangular frustum includes a first base face and a second base face, the second quadrangular frustum includes the first base face and a third base face, and the first base face is the common base face of the first and second quadrangular frustums. The position of the first base face in the time dimension indicates the first time information of the first video frame, the position of the second base face in the time dimension indicates the second time information of the second video frame, and the position of the third base face in the time dimension indicates the third time information of the third video frame, the first video frame being located, in the temporal order of the first video, between the second video frame and the third video frame. The position of the first base face in the two spatial dimensions indicates the first position information of the target object in the first video frame, the position of the second base face in the two spatial dimensions indicates the second position information of the target object in the second video frame, and the position of the third base face in the two spatial dimensions indicates the third position information of the target object in the third video frame. The double quadrangular frustum indicates the position information of the target object in all video frames between the second video frame and the third video frame of the first video.
Optionally, the acquisition unit 1601 is specifically configured to acquire the tracking trajectory of the target object in the first video according to the motion pipeline.
Optionally, the tracking trajectory specifically includes a tracking trajectory of the target object formed by connecting at least two motion pipelines corresponding to quadrangular frustums in the space-time dimensions.
Optionally, the length of the motion pipeline is a preset value, the length of the motion pipeline indicating the number of video frames included in the at least two video frames.
Optionally, the acquisition unit 1601 is further configured to: acquire category information of the target object through the pre-trained neural network model; and acquire the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
Optionally, the acquisition unit 1601 is specifically configured to acquire the confidence of the motion pipeline through the pre-trained neural network model, the confidence of the motion pipeline being used to determine the category information of the target object corresponding to the motion pipeline.
Optionally, the device further includes a processing unit 1602, configured to prune the motion pipelines to obtain pruned motion pipelines, the pruned motion pipelines being used to acquire the tracking trajectory of the target object.
Optionally, the motion pipelines include a first motion pipeline and a second motion pipeline, and the processing unit 1602 is specifically configured to: if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, delete whichever of the first and second motion pipelines has the lower confidence, where the repetition rate between the first and second motion pipelines is the intersection-over-union between them, the first and second motion pipelines belong to the motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to a motion pipeline is a preset category.
Optionally, the processing unit 1602 is specifically configured to prune the motion pipelines according to a non-maximum suppression algorithm to obtain the pruned motion pipelines.
Optionally, the confidence of every motion pipeline among the pruned motion pipelines is greater than or equal to a second threshold.
Optionally, the acquisition unit 1601 is specifically configured to connect a third motion pipeline and a fourth motion pipeline that satisfy a preset condition, to acquire the tracking trajectory of the target object, where the preset condition includes one or more of the following: the intersection-over-union between the segments of the third and fourth motion pipelines that overlap in the time dimension is greater than or equal to a third threshold; the cosine of the angle between the motion direction of the third motion pipeline and the motion direction of the fourth motion pipeline is greater than or equal to a fourth threshold, the motion direction being a vector that indicates, according to a preset rule in the space-time dimensions, the position change of the target object in the motion pipeline; and the distance between the neural network feature vectors of the motion pipelines is less than or equal to a fifth threshold, the distance including the Euclidean distance.
Optionally, the acquisition unit 1601 is specifically configured to: group the motion pipelines to obtain t groups of motion pipelines, where t is the total number of video frames in the first video, and the i-th group among the t groups includes all motion pipelines starting at the i-th video frame of the first video, i being greater than or equal to 1 and less than or equal to t; when i is 1, take the motion pipelines in the i-th group as initial tracking trajectories to obtain a tracking trajectory set; and, in the numbering order of the groups, connect the motion pipelines in the i-th group in turn with the tracking trajectories in the set, to acquire at least one tracking trajectory.
Optionally, the acquisition unit 1601 is specifically configured to: input a first video sample into the initial network model for training and acquire a target object loss; and update the weight parameters in the initial network model according to the target object loss, to acquire the pre-trained neural network model.
Optionally, the target object loss specifically includes the intersection-over-union between the ground-truth motion pipelines and the predicted motion pipelines, where the ground-truth motion pipelines are obtained by splitting the tracking trajectory of the target object in the first video sample, and the predicted motion pipelines are obtained by inputting the first video sample into the initial network model.
Optionally, the target object loss specifically includes the intersection-over-union between the ground-truth motion pipelines and the predicted motion pipelines, and the cross-entropy between the confidence of the ground-truth motion pipelines and the confidence of the predicted motion pipelines, where the ground-truth motion pipelines are obtained by splitting the tracking trajectory of the target object in the first video sample, the predicted motion pipelines are obtained by inputting the first video sample into the initial network model, the confidence of the ground-truth motion pipelines is the probability that the category of the target object corresponding to the ground-truth motion pipelines belongs to a preset target object category, and the confidence of the predicted motion pipelines is the probability that the category of the target object corresponding to the predicted motion pipelines belongs to a preset target object category.
Optionally, the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
Optionally, the processing unit 1602 is further configured to divide the first video into multiple video segments;
the acquisition unit 1601 is specifically configured to input the multiple video segments separately into the pre-trained neural network model to acquire the motion pipelines.
The target tracking device provided in the embodiments of this application has multiple implementation forms. Optionally, the target tracking device includes a video acquisition module, a target tracking module, and an output module, where the video acquisition module is configured to obtain a video including a moving target object, the target tracking module is configured to take the video as input and output the tracking trajectory of the target object through the target tracking method provided in the embodiments of this application, and the output module is configured to superimpose the tracking trajectory on the video and display it to the user.
In another possible implementation, refer to Figure 17, a schematic diagram of another embodiment of the target tracking device in an embodiment of this application. Here the target tracking device includes a video acquisition module and a target tracking module and can be understood as a front-end device; to implement the target tracking method, the front-end device and a back-end device process cooperatively.
As shown in Figure 17, the video acquisition module 1701 may be the video acquisition module of a surveillance camera, a video camera, a mobile phone, or a vehicle-mounted image sensor, among others, and is responsible for capturing video data as the input of the tracking algorithm.
The target tracking module 1702 may be a processing unit in a device such as a camera processor, a mobile phone processor, or a vehicle-mounted processing unit, and is configured to receive the video input as well as control information sent by the back-end device, the control information including, for example, the tracked target categories, the number of targets to track, precision control, and model hyperparameters. The target tracking method of the embodiments of this application is mainly deployed in this module; for details, refer to the introduction of the target tracking module 1702 in Figure 18.
The back-end device includes an output module and a control module.
As shown in Figure 17, the output module 1703 may be, for example, the display unit of a device such as a back-end monitor, a printer, or a hard disk, and is configured to display or store the tracking results.
The control module 1704 is configured to analyze the output results, receive user instructions, and send those instructions to the front-end target tracking module.
请参阅图18,为本申请实施例中目标跟踪装置的另一个实施例示意图。Please refer to FIG. 18, which is a schematic diagram of another embodiment of the target tracking device in the embodiment of the application.
该目标跟踪装置包括:视频预处理模块1801,预测模块1802和运动管道连接模块1803。The target tracking device includes: a video preprocessing module 1801, a prediction module 1802, and a motion pipeline connection module 1803.
其中,视频预处理模块1801,用于将输入的视频切分为合适的片段,并进行视频分辨率,色彩空间的调整与归一化等。Among them, the video preprocessing module 1801 is used to divide the input video into appropriate segments, and adjust and normalize the video resolution, color space, etc.
The prediction module 1802 extracts spatio-temporal features from the input video clips and performs prediction, outputting the target motion pipelines and the category information to which each motion pipeline belongs; in addition, it can predict the future position of a target motion pipeline. The prediction module 1802 includes two sub-modules:
Target category prediction module 18021: predicts the category to which the target belongs from the features output by a 3D convolutional neural network, for example as confidence values.
Motion pipeline prediction module 18022: predicts, from the features output by the 3D convolutional neural network, the position of the target's current motion pipeline, that is, the coordinates of the motion pipeline in the spatio-temporal dimensions.
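These two sub-modules can be sketched as two heads on a shared 3D-CNN trunk. The sketch below assumes a PyTorch-style model and a simple 10-number tube parameterization (two frame indices plus two boxes of four coordinates); the layer sizes are placeholders, not the actual network of this application.

```python
import torch
import torch.nn as nn

class TubePredictor(nn.Module):
    """Toy stand-in for modules 18021/18022: a shared 3D-CNN trunk with
    one classification head and one motion-pipeline regression head."""
    def __init__(self, num_classes: int = 2, tube_params: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.cls_head = nn.Linear(16, num_classes)   # 18021: per-class confidence
        self.tube_head = nn.Linear(16, tube_params)  # 18022: tube coordinates

    def forward(self, clip: torch.Tensor):
        feat = self.backbone(clip)  # clip: (N, 3, T, H, W)
        return self.cls_head(feat).softmax(-1), self.tube_head(feat)
```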
The motion pipeline connection module 1803 analyzes the motion pipelines output by the prediction module. If a target appears for the first time, it is initialized as a new tracking trajectory. The spatio-temporal feature similarity between motion pipelines and their spatial proximity serve as the connection features required for linking motion pipelines. Based on the motion pipelines and these connection features, the module links the motion pipelines into complete tracking trajectories by analyzing their spatial overlap and the similarity of their spatio-temporal features.
The connected tracking trajectories are output in a specified format, for example a video stream or a trajectory log.
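A greedy variant of this linking step might look like the following sketch. The tube representation (per-frame boxes plus start and end frames) and the overlap threshold are assumptions made for illustration, not the exact connection rule of module 1803.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_tubes(tracks, new_tubes, iou_thresh=0.5):
    """Append each new tube to the track whose tail box overlaps it most;
    otherwise start a new track (first appearance of a target)."""
    for tube in new_tubes:  # tube: {"start": int, "end": int, "boxes": [...]}
        best, best_iou = None, iou_thresh
        for track in tracks:
            iou = box_iou(track[-1]["boxes"][-1], tube["boxes"][0])
            if iou > best_iou:
                best, best_iou = track, iou
        if best is not None:
            best.append(tube)
        else:
            tracks.append([tube])  # initialize a new tracking trajectory
    return tracks
```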
Refer to FIG. 19, which is a schematic diagram of an embodiment of an electronic device in the embodiments of this application.
The electronic device 1900 may vary considerably in configuration and performance. It may include one or more processors 1901 and a memory 1902, where the memory 1902 stores programs or data.
The memory 1902 may be volatile or non-volatile storage. Optionally, the processor 1901 is one or more central processing units (CPUs); a CPU may be single-core or multi-core. The processor 1901 may communicate with the memory 1902 and execute a series of instructions from the memory 1902 on the electronic device 1900.
The electronic device 1900 further includes one or more wired or wireless network interfaces 1903, for example an Ethernet interface.
Optionally, although not shown in FIG. 19, the electronic device 1900 may further include one or more power supplies and one or more input/output interfaces. The input/output interfaces may be used to connect a display, a mouse, a keyboard, a touch-screen device, a sensor device, and the like. The input/output interfaces are optional components that may or may not be present; this is not limited here.
For the procedure executed by the processor 1901 of the electronic device 1900 in this embodiment, refer to the method procedures described in the foregoing method embodiments; details are not repeated here.
Refer to FIG. 20, which is a diagram of a chip hardware structure provided by an embodiment of this application.
An embodiment of this application provides a chip system that can be used to implement the target tracking method. Specifically, the convolutional-neural-network-based algorithms shown in FIG. 3 and FIG. 4 can be implemented in the NPU chip shown in FIG. 20.
The neural-network processing unit (NPU) 50 is mounted on the host CPU as a coprocessor, and the host CPU assigns tasks to it. The core of the NPU is the arithmetic circuit 503; the controller 504 controls the arithmetic circuit 503 to fetch matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 503 internally contains multiple processing engines (PEs). In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or other electronic circuitry capable of mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and caches it on each PE in the arithmetic circuit. It then fetches the matrix A data from the input memory 501 and performs the matrix operation with matrix B; partial or final results of the resulting matrix are stored in the accumulator 508.
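The dataflow described above, weight tiles cached on the PEs and partial sums collected in the accumulator, can be mimicked in a few lines of NumPy. The tile size is arbitrary, and the code models only the accumulation pattern, not the actual systolic-array hardware.

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 4) -> np.ndarray:
    """C = A @ B computed tile by tile, accumulating partial results
    the way accumulator 508 collects partial sums from the PE array."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=np.float64)  # plays the role of the accumulator
    for k0 in range(0, k, tile):
        B_tile = B[k0:k0 + tile]             # weight tile cached "on the PEs"
        C += A[:, k0:k0 + tile] @ B_tile     # partial result accumulated
    return C
```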
The unified memory 506 stores input data and output data. Weight data is moved into the weight memory 502 through the direct memory access controller (DMAC) 505; input data is likewise moved into the unified memory 506 through the DMAC.
BIU stands for bus interface unit, that is, the bus interface unit 510, which handles the interaction of the AXI bus with the DMAC and with the instruction fetch buffer 509.
The bus interface unit (BIU) 510 is used by the instruction fetch buffer 509 to fetch instructions from the external memory, and by the storage-unit access controller 505 to fetch the source data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to move input data from the external memory (DDR) into the unified memory 506, to move weight data into the weight memory 502, or to move input data into the input memory 501.
The vector calculation unit 507 may include multiple arithmetic processing units and, where needed, further processes the output of the arithmetic circuit: vector multiplication, vector addition, exponential and logarithmic operations, magnitude comparison, and so on. It is mainly used for the non-convolutional/fully-connected layer computations of a neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector calculation unit 507 can store processed output vectors to the unified memory 506. For example, the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 507 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 503, for example for use in subsequent layers of the neural network.
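Continuing the NumPy sketch above, the vector unit's role amounts to an elementwise pass over the accumulator output; the choice of ReLU here is an illustrative assumption.

```python
import numpy as np

def vector_postprocess(acc: np.ndarray) -> np.ndarray:
    """Mimic vector calculation unit 507: apply a nonlinearity to the
    accumulated values so they can feed the next layer as activations."""
    return np.maximum(acc, 0.0)  # assumed ReLU; unit 507 also supports pooling etc.

# Example usage: activations = vector_postprocess(tiled_matmul(A, B))
```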
The instruction fetch buffer 509 connected to the controller 504 stores instructions used by the controller 504.
The unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch buffer 509 are all on-chip memories. The external memory is private to this NPU hardware architecture.
The operations of the layers of the convolutional neural networks shown in FIG. 3 and FIG. 4 may be performed by the matrix calculation unit 212 or the vector calculation unit 507.
In the embodiments of this application, various examples are given for ease of understanding. However, these examples are merely examples and are not meant to be the best way of implementing this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function; in actual implementation there may be other ways of dividing, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing embodiments are merely intended to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some technical features thereof; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (41)

  1. A target tracking method, comprising:
    acquiring a first video, wherein the first video includes a target object;
    inputting the first video into a pre-trained neural network model to acquire position information of the target object in at least two video frames and time information of the at least two video frames; and
    acquiring a tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, wherein the tracking trajectory includes the position information of the target object in the at least two video frames of the first video.
  2. The method according to claim 1, wherein acquiring the position information of the target object in at least two video frames and the time information of the at least two video frames specifically comprises:
    acquiring a motion pipeline of the target object, wherein the motion pipeline is used to indicate time information and position information of the target object in at least two video frames of the first video, wherein
    the first video includes a first video frame and a second video frame;
    the motion pipeline corresponds to a quadrangular frustum in the spatio-temporal dimensions, the spatio-temporal dimensions comprising a time dimension and a two-dimensional spatial dimension; the position of a first base of the quadrangular frustum in the time dimension indicates first time information of the first video frame; the position of a second base of the quadrangular frustum in the time dimension indicates second time information of the second video frame; the position of the first base of the quadrangular frustum in the two-dimensional spatial dimension indicates first position information of the target object in the first video frame; and the position of the second base of the quadrangular frustum in the two-dimensional spatial dimension indicates second position information of the target object in the second video frame; and
    the quadrangular frustum is used to indicate position information of the target object in all video frames of the first video between the first video frame and the second video frame.
  3. The method according to claim 1, wherein acquiring the position information of the target object in at least two video frames and the time information of the at least two video frames specifically comprises:
    acquiring a motion pipeline of the target object, wherein the motion pipeline is used to indicate position information of the target object in at least three video frames and time information of the at least three video frames, wherein
    the first video includes a first video frame, a second video frame, and a third video frame;
    the motion pipeline corresponds to a double quadrangular frustum in the spatio-temporal dimensions; the double quadrangular frustum includes a first quadrangular frustum and a second quadrangular frustum; the first quadrangular frustum includes a first base and a second base; the second quadrangular frustum includes the first base and a third base; the first base is the common base of the first quadrangular frustum and the second quadrangular frustum; the position of the first base in the time dimension indicates first time information of the first video frame; the position of the second base in the time dimension indicates second time information of the second video frame; the position of the third base in the time dimension indicates third time information of the third video frame; in the temporal order of the first video, the first video frame lies between the second video frame and the third video frame; the position of the first base in the two-dimensional spatial dimension indicates first position information of the target object in the first video frame; the position of the second base in the two-dimensional spatial dimension indicates second position information of the target object in the second video frame; and the position of the third base in the two-dimensional spatial dimension indicates third position information of the target object in the third video frame; and
    the double quadrangular frustum is used to indicate position information of the target object in all video frames of the first video between the second video frame and the third video frame.
  4. The method according to claim 2 or 3, wherein acquiring the tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames specifically comprises:
    acquiring the tracking trajectory of the target object in the first video according to the motion pipeline.
  5. The method according to any one of claims 2 to 4, wherein the tracking trajectory specifically comprises:
    a tracking trajectory of the target object formed by connecting the quadrangular frustums corresponding to at least two of the motion pipelines in the spatio-temporal dimensions.
  6. The method according to any one of claims 2 to 5, wherein
    the length of the motion pipeline is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
  7. The method according to any one of claims 1 to 6, wherein
    the method further comprises:
    acquiring category information of the target object through the pre-trained neural network model; and
    acquiring the tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames comprises:
    acquiring the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
  8. The method according to claim 7, wherein acquiring, through the pre-trained neural network model, the category information of the target object corresponding to the motion pipeline specifically comprises:
    acquiring a confidence of the motion pipeline through the pre-trained neural network model, wherein the confidence of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
  9. The method according to any one of claims 1 to 8, wherein,
    before acquiring the tracking trajectory of the target object according to the motion pipeline, the method further comprises:
    pruning the motion pipelines to acquire pruned motion pipelines, wherein the pruned motion pipelines are used to acquire the tracking trajectory of the target object.
  10. The method according to claim 9, wherein
    pruning the motion pipelines to acquire the pruned motion pipelines specifically comprises:
    the motion pipelines include a first motion pipeline and a second motion pipeline;
    if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, deleting whichever of the first motion pipeline and the second motion pipeline has the lower confidence, wherein the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union between the first motion pipeline and the second motion pipeline, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to a motion pipeline is a preset category.
  11. The method according to claim 9, wherein
    pruning the motion pipelines to acquire the pruned motion pipelines specifically comprises:
    pruning the motion pipelines according to a non-maximum suppression algorithm to acquire the pruned motion pipelines.
  12. The method according to claim 9, wherein
    the confidence of any one of the pruned motion pipelines is greater than or equal to a second threshold.
  13. The method according to any one of claims 2 to 12, wherein
    acquiring the tracking trajectory of the target object according to the motion pipeline specifically comprises:
    connecting a third motion pipeline and a fourth motion pipeline that satisfy a preset condition among the motion pipelines, to acquire the tracking trajectory of the target object;
    wherein the preset condition includes one or more of the following:
    an intersection-over-union between the temporally overlapping sections of the third motion pipeline and the fourth motion pipeline is greater than or equal to a third threshold;
    the cosine of the angle between a movement direction of the third motion pipeline and a movement direction of the fourth motion pipeline is greater than or equal to a fourth threshold, wherein a movement direction is a vector that indicates, according to a preset rule, the position change of the target object in a motion pipeline in the spatio-temporal dimensions; and a distance between neural-network feature vectors of the motion pipelines is less than or equal to a fifth threshold, wherein the distance includes a Euclidean distance.
  14. The method according to any one of claims 2 to 12, wherein
    acquiring the tracking trajectory of the target object according to the motion pipeline specifically comprises:
    grouping the motion pipelines to acquire t motion pipeline groups, wherein t is the total number of video frames in the first video, the i-th motion pipeline group among the t motion pipeline groups includes all motion pipelines starting at the i-th video frame of the first video, and i is greater than or equal to 1 and less than or equal to t;
    when i is 1, taking the motion pipelines in the i-th motion pipeline group as initial tracking trajectories to obtain a tracking trajectory set; and
    connecting, in the numbered order of the motion pipeline groups, the motion pipelines in the i-th motion pipeline group with the tracking trajectories in the tracking trajectory set, to acquire at least one tracking trajectory.
  15. The method according to any one of claims 1 to 14, wherein
    the pre-trained neural network model is obtained by training an initial network model, and the method further comprises:
    inputting a first video sample into the initial network model for training, to acquire a target object loss; and
    updating weight parameters in the initial network model according to the target object loss, to acquire the pre-trained neural network model.
  16. The method according to claim 15, wherein the target object loss specifically comprises:
    an intersection-over-union between a ground-truth motion pipeline and a predicted motion pipeline, wherein the ground-truth motion pipeline is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the predicted motion pipeline is a motion pipeline obtained by inputting the first video sample into the initial network model.
  17. The method according to claim 15, wherein the target object loss specifically comprises:
    an intersection-over-union between a ground-truth motion pipeline and a predicted motion pipeline, and a cross-entropy between the confidence of the ground-truth motion pipeline and the confidence of the predicted motion pipeline, wherein the ground-truth motion pipeline is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, the predicted motion pipeline is a motion pipeline obtained by inputting the first video sample into the initial network model, the confidence of the ground-truth motion pipeline is the probability that the category of the target object corresponding to the ground-truth motion pipeline belongs to a preset target object category, and the confidence of the predicted motion pipeline is the probability that the category of the target object corresponding to the predicted motion pipeline belongs to the preset target object category.
  18. The method according to any one of claims 15 to 17, wherein
    the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
  19. The method according to any one of claims 1 to 18, wherein
    inputting the first video into the pre-trained neural network model to acquire the motion pipeline of the target object specifically comprises:
    dividing the first video into multiple video clips; and
    inputting the multiple video clips respectively into the pre-trained neural network model to acquire the motion pipelines.
  20. A target tracking device, comprising:
    an acquiring unit, configured to acquire a first video, wherein the first video includes a target object;
    the acquiring unit being further configured to input the first video into a pre-trained neural network model to acquire position information of the target object in at least two video frames and time information of the at least two video frames; and
    the acquiring unit being further configured to acquire a tracking trajectory of the target object in the first video according to the position information of the target object in the at least two video frames and the time information of the at least two video frames, wherein the tracking trajectory includes the position information of the target object in the at least two video frames of the first video.
  21. The device according to claim 20, wherein the acquiring unit is specifically configured to:
    acquire a motion pipeline of the target object, wherein the motion pipeline is used to indicate time information and position information of the target object in at least two video frames of the first video, wherein
    the first video includes a first video frame and a second video frame;
    the motion pipeline corresponds to a quadrangular frustum in the spatio-temporal dimensions, the spatio-temporal dimensions comprising a time dimension and a two-dimensional spatial dimension; the position of a first base of the quadrangular frustum in the time dimension indicates first time information of the first video frame; the position of a second base of the quadrangular frustum in the time dimension indicates second time information of the second video frame; the position of the first base of the quadrangular frustum in the two-dimensional spatial dimension indicates first position information of the target object in the first video frame; and the position of the second base of the quadrangular frustum in the two-dimensional spatial dimension indicates second position information of the target object in the second video frame; and
    the quadrangular frustum is used to indicate position information of the target object in all video frames of the first video between the first video frame and the second video frame.
  22. The device according to claim 20, wherein the acquiring unit is specifically configured to:
    acquire a motion pipeline of the target object, wherein the motion pipeline is used to indicate position information of the target object in at least three video frames and time information of the at least three video frames, wherein
    the first video includes a first video frame, a second video frame, and a third video frame;
    the motion pipeline corresponds to a double quadrangular frustum in the spatio-temporal dimensions; the double quadrangular frustum includes a first quadrangular frustum and a second quadrangular frustum; the first quadrangular frustum includes a first base and a second base; the second quadrangular frustum includes the first base and a third base; the first base is the common base of the first quadrangular frustum and the second quadrangular frustum; the position of the first base in the time dimension indicates first time information of the first video frame; the position of the second base in the time dimension indicates second time information of the second video frame; the position of the third base in the time dimension indicates third time information of the third video frame; in the temporal order of the first video, the first video frame lies between the second video frame and the third video frame; the position of the first base in the two-dimensional spatial dimension indicates first position information of the target object in the first video frame; the position of the second base in the two-dimensional spatial dimension indicates second position information of the target object in the second video frame; and the position of the third base in the two-dimensional spatial dimension indicates third position information of the target object in the third video frame; and
    the double quadrangular frustum is used to indicate position information of the target object in all video frames of the first video between the second video frame and the third video frame.
  23. The device according to claim 21 or 22, wherein the acquiring unit is specifically configured to:
    acquire the tracking trajectory of the target object in the first video according to the motion pipeline.
  24. The device according to any one of claims 21 to 23, wherein the tracking trajectory specifically comprises:
    a tracking trajectory of the target object formed by connecting the quadrangular frustums corresponding to at least two of the motion pipelines in the spatio-temporal dimensions.
  25. The device according to any one of claims 21 to 24, wherein
    the length of the motion pipeline is a preset value, and the length of the motion pipeline indicates the number of video frames included in the at least two video frames.
  26. The device according to any one of claims 20 to 25, wherein the acquiring unit is further configured to:
    acquire category information of the target object through the pre-trained neural network model; and
    acquire the tracking trajectory of the target object in the first video according to the category information of the target object, the position information of the target object in the at least two video frames, and the time information of the at least two video frames.
  27. The device according to claim 26, wherein the acquiring unit is specifically configured to:
    acquire a confidence of the motion pipeline through the pre-trained neural network model, wherein the confidence of the motion pipeline is used to determine the category information of the target object corresponding to the motion pipeline.
  28. The device according to any one of claims 20 to 27, wherein the device further comprises:
    a processing unit, configured to prune the motion pipelines to acquire pruned motion pipelines, wherein the pruned motion pipelines are used to acquire the tracking trajectory of the target object.
  29. The device according to claim 28, wherein the motion pipelines include a first motion pipeline and a second motion pipeline;
    and the processing unit is specifically configured to:
    if the repetition rate between the first motion pipeline and the second motion pipeline is greater than or equal to a first threshold, delete whichever of the first motion pipeline and the second motion pipeline has the lower confidence, wherein the repetition rate between the first motion pipeline and the second motion pipeline is the intersection-over-union between the first motion pipeline and the second motion pipeline, the first motion pipeline and the second motion pipeline belong to the motion pipelines of the target object, and the confidence indicates the probability that the category of the target object corresponding to a motion pipeline is a preset category.
  30. The device according to claim 28, wherein the processing unit is specifically configured to:
    prune the motion pipelines according to a non-maximum suppression algorithm to acquire the pruned motion pipelines.
  31. The device according to claim 28, wherein
    the confidence of any one of the pruned motion pipelines is greater than or equal to a second threshold.
  32. The device according to any one of claims 21 to 31, wherein the acquiring unit is specifically configured to:
    connect a third motion pipeline and a fourth motion pipeline that satisfy a preset condition among the motion pipelines, to acquire the tracking trajectory of the target object;
    wherein the preset condition includes one or more of the following:
    an intersection-over-union between the temporally overlapping sections of the third motion pipeline and the fourth motion pipeline is greater than or equal to a third threshold;
    the cosine of the angle between a movement direction of the third motion pipeline and a movement direction of the fourth motion pipeline is greater than or equal to a fourth threshold, wherein a movement direction is a vector that indicates, according to a preset rule, the position change of the target object in a motion pipeline in the spatio-temporal dimensions; and a distance between neural-network feature vectors of the motion pipelines is less than or equal to a fifth threshold, wherein the distance includes a Euclidean distance.
  33. The device according to any one of claims 21 to 31, wherein the acquiring unit is specifically configured to:
    group the motion pipelines to acquire t motion pipeline groups, wherein t is the total number of video frames in the first video, the i-th motion pipeline group among the t motion pipeline groups includes all motion pipelines starting at the i-th video frame of the first video, and i is greater than or equal to 1 and less than or equal to t;
    when i is 1, take the motion pipelines in the i-th motion pipeline group as initial tracking trajectories to obtain a tracking trajectory set; and
    connect, in the numbered order of the motion pipeline groups, the motion pipelines in the i-th motion pipeline group with the tracking trajectories in the tracking trajectory set, to acquire at least one tracking trajectory.
  34. The device according to any one of claims 20 to 33, wherein the acquiring unit is specifically configured to:
    input a first video sample into the initial network model for training, to acquire a target object loss; and
    update weight parameters in the initial network model according to the target object loss, to acquire the pre-trained neural network model.
  35. The device according to claim 34, wherein the target object loss specifically comprises:
    an intersection-over-union between a ground-truth motion pipeline and a predicted motion pipeline, wherein the ground-truth motion pipeline is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, and the predicted motion pipeline is a motion pipeline obtained by inputting the first video sample into the initial network model.
  36. The device according to claim 34, wherein the target object loss specifically comprises:
    an intersection-over-union between a ground-truth motion pipeline and a predicted motion pipeline, and a cross-entropy between the confidence of the ground-truth motion pipeline and the confidence of the predicted motion pipeline, wherein the ground-truth motion pipeline is a motion pipeline obtained by splitting the tracking trajectory of the target object in the first video sample, the predicted motion pipeline is a motion pipeline obtained by inputting the first video sample into the initial network model, the confidence of the ground-truth motion pipeline is the probability that the category of the target object corresponding to the ground-truth motion pipeline belongs to a preset target object category, and the confidence of the predicted motion pipeline is the probability that the category of the target object corresponding to the predicted motion pipeline belongs to the preset target object category.
  37. The device according to any one of claims 34 to 36, wherein
    the initial network model includes a three-dimensional convolutional neural network or a recurrent neural network.
  38. The device according to any one of claims 20 to 37, wherein the processing unit is further configured to:
    divide the first video into multiple video clips;
    and the acquiring unit is specifically configured to:
    input the multiple video clips respectively into the pre-trained neural network model to acquire the motion pipelines.
  39. An electronic device, comprising a processor and a memory connected to each other, wherein the memory is configured to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 1 to 19.
  40. A computer program product containing instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 19.
  41. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 19.
PCT/CN2021/093852 2020-06-09 2021-05-14 Target tracking method and target tracking device WO2021249114A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010519876.2A CN113781519A (en) 2020-06-09 2020-06-09 Target tracking method and target tracking device
CN202010519876.2 2020-06-09

Publications (1)

Publication Number Publication Date
WO2021249114A1 true WO2021249114A1 (en) 2021-12-16

Family

ID=78834470

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/093852 WO2021249114A1 (en) 2020-06-09 2021-05-14 Target tracking method and target tracking device

Country Status (2)

Country Link
CN (1) CN113781519A (en)
WO (1) WO2021249114A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116504068A (en) * 2023-06-26 2023-07-28 创辉达设计股份有限公司江苏分公司 Statistical method, device, computer equipment and storage medium for lane-level traffic flow

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11625909B1 (en) * 2022-05-04 2023-04-11 Motional Ad Llc Track segment cleaning of tracked objects
CN114972814B (en) * 2022-07-11 2022-10-28 浙江大华技术股份有限公司 Target matching method, device and storage medium
CN115451962B (en) * 2022-08-09 2024-04-30 中国人民解放军63629部队 Target tracking strategy planning method based on five-variable Carnot diagram

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060262184A1 (en) * 2004-11-05 2006-11-23 Yissum Research Development Company Of The Hebrew University Of Jerusalem Method and system for spatio-temporal video warping
US20090219300A1 (en) * 2005-11-15 2009-09-03 Yissum Research Deveopment Company Of The Hebrew University Of Jerusalem Method and system for producing a video synopsis
CN101702233A (en) * 2009-10-16 2010-05-05 电子科技大学 Three-dimension locating method based on three-point collineation marker in video frame
US20160148392A1 (en) * 2014-11-21 2016-05-26 Thomson Licensing Method and apparatus for tracking the motion of image content in a video frames sequence using sub-pixel resolution motion estimation
CN106169187A (en) * 2015-05-18 2016-11-30 汤姆逊许可公司 For the method and apparatus that the object in video is set boundary
CN108182696A (en) * 2018-01-23 2018-06-19 四川精工伟达智能技术股份有限公司 Image processing method, device and Multi-target position tracking system
CN108509830A (en) * 2017-02-28 2018-09-07 华为技术有限公司 A kind of video data handling procedure and equipment
CN110188719A (en) * 2019-06-04 2019-08-30 北京字节跳动网络技术有限公司 Method for tracking target and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8345984B2 (en) * 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
CN106897714B (en) * 2017-03-23 2020-01-14 北京大学深圳研究生院 Video motion detection method based on convolutional neural network
CN107492113B (en) * 2017-06-01 2019-11-05 南京行者易智能交通科技有限公司 A kind of moving object in video sequences position prediction model training method, position predicting method and trajectory predictions method
CN110032926B (en) * 2019-02-22 2021-05-11 哈尔滨工业大学(深圳) Video classification method and device based on deep learning
CN110188637A (en) * 2019-05-17 2019-08-30 西安电子科技大学 A kind of Activity recognition technical method based on deep learning


Also Published As

Publication number Publication date
CN113781519A (en) 2021-12-10

Similar Documents

Publication Publication Date Title
WO2021249114A1 (en) Target tracking method and target tracking device
WO2021017606A1 (en) Video processing method and apparatus, and electronic device and storage medium
CN109559320B (en) Method and system for realizing visual SLAM semantic mapping function based on hole convolution deep neural network
WO2020192736A1 (en) Object recognition method and device
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
WO2021227726A1 (en) Methods and apparatuses for training face detection and image detection neural networks, and device
CN112990211B (en) Training method, image processing method and device for neural network
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN110222717B (en) Image processing method and device
WO2021147325A1 (en) Object detection method and apparatus, and storage medium
WO2023082882A1 (en) Pose estimation-based pedestrian fall action recognition method and device
CN112419368A (en) Method, device and equipment for tracking track of moving target and storage medium
WO2022179581A1 (en) Image processing method and related device
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN110222718B (en) Image processing method and device
JP7439153B2 (en) Lifted semantic graph embedding for omnidirectional location recognition
WO2021103731A1 (en) Semantic segmentation method, and model training method and apparatus
WO2021218238A1 (en) Image processing method and image processing apparatus
CN113011562A (en) Model training method and device
KR102143034B1 (en) Method and system for tracking object in video through prediction of future motion of object
WO2022052782A1 (en) Image processing method and related device
Chakravarthy et al. Dronesegnet: robust aerial semantic segmentation for UAV-based IoT applications
Shi et al. An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds
CN116194951A (en) Method and apparatus for stereoscopic based 3D object detection and segmentation
WO2023093086A1 (en) Target tracking method and apparatus, training method and apparatus for model related thereto, and device, medium and computer program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21821941

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21821941

Country of ref document: EP

Kind code of ref document: A1