CN108986143B - Target detection tracking method in video - Google Patents

Target detection tracking method in video

Info

Publication number
CN108986143B
CN108986143B · CN108986143A · Application CN201810940035.1A
Authority
CN
China
Prior art keywords
video
target
tracking
video image
image frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810940035.1A
Other languages
Chinese (zh)
Other versions
CN108986143A (en)
Inventor
尚凌辉
张兆生
王弘玥
郑永宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Jiehuixin Digital Technology Co.,Ltd.
Original Assignee
Zhejiang Icare Vision Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Icare Vision Technology Co ltd filed Critical Zhejiang Icare Vision Technology Co ltd
Priority to CN201810940035.1A priority Critical patent/CN108986143B/en
Publication of CN108986143A publication Critical patent/CN108986143A/en
Application granted granted Critical
Publication of CN108986143B publication Critical patent/CN108986143B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Abstract

The invention discloses a target detection and tracking method in a video. The method first performs segmented sampling on the video to obtain several video image frame sequences. A neural network model M1 then carries out target detection and feature extraction on each video image frame sequence. Next, the correlation matrix of the target feature vectors corresponding to all detection results output for a video sequence is calculated, from which the tracking results of all detected targets within that video sequence are obtained. Finally, the segmented-sampled video image frame sequences are ordered along the time axis, and the target detection tracking tracks and feature matrices of the video image frame sequences are input into a neural network model M2 to obtain the tracking feature of each target in each video image frame sequence; this feature is used to calculate the correlation of all targets between two adjacent video image frame sequences, thereby completing the tracking of the targets across the whole video segment. The method can effectively reduce the amount of computation required to complete the target detection and tracking task in a video.

Description

Target detection tracking method in video
Technical Field
The invention belongs to the technical field of computer vision, and relates to a method for detecting and tracking a target in a video.
Background
Monitoring equipment such as checkpoint (bayonet) cameras, public-security cameras and various network cameras is installed and used in large numbers, and the video data it acquires play a great role in handling traffic violations, public-security management and similar tasks. However, as the number of installed devices keeps growing, the volume of produced data increases day by day, and the storage and utilization of these data face huge challenges; video structuring has therefore become a research hotspot in both scientific research and industry.
A fundamental problem that cannot be avoided in any video structuring scheme is the accurate and efficient detection and tracking of key targets in the video. Patents such as "Target tracking optimization method based on tracking-learning-detection" (CN107967692A), "Real-time unmanned aerial vehicle video target detection and tracking method" (CN108108697A) and "Multi-target pedestrian detection and tracking method based on deep learning" (CN107563313A) complete target detection on single-frame images, compute features of the regions associated with the detection results, and rely on these features to match and track targets between adjacent frames. In these methods, target detection depends only on the information of a single image frame and cannot exploit the correlated information between nearby image frames in the time sequence, so the accuracy of the detection results is limited. Likewise, the features used for matching and tracking are extracted from a single frame; although such features can distinguish many different target individuals, similar targets of the same class travelling together are very easily mismatched, causing tracking failure. Finally, to guarantee detection and tracking accuracy, the frame-skipping sampling interval must be kept small, which leads to a large amount of computation and low efficiency.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method for detecting and tracking a target in a video.
The technical scheme adopted by the invention for solving the technical problem is as follows:
step 1, performing segmented sampling on a video to obtain a plurality of segments of video image frame sequences.
Step 2, a neural network model M1 is adopted to perform target detection and feature extraction on each video image frame sequence, and the output information comprises: the number of the image in which the target is located within the sequence, the rectangular box of the target in the image, and the feature vector of the target.
Step 3, the correlation matrix of the target feature vectors corresponding to all detection results output for a video sequence is calculated, from which the tracking results of all detected targets within that video sequence are obtained.
Step 4, the target detection tracking tracks and feature matrices of the video image frame sequences are input, in time-axis order, into a neural network model M2 to obtain the tracking feature of each target in each video image frame sequence; this feature is used to calculate the correlation of all targets between two adjacent video image frame sequences, thereby completing the tracking of the targets across the whole video segment.
The invention has the beneficial effects that:
1. The accuracy of the detector is improved by using the inter-frame information of the time-series images.
2. The space-time information of the time sequence images is fully utilized to improve the tracking effect of the target.
3. The calculation amount of detection tracking can be effectively reduced, and the operation efficiency is improved.
4. The detection and the tracking are effectively integrated, and the overall detection and tracking effect is improved.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the present invention comprises the steps of:
step 1, performing segmented sampling on a video to obtain a plurality of segments of video image frame sequences.
Step 2, performing target detection and feature extraction on each video image frame sequence, wherein the output information comprises: the number of the image where the target is located in the sequence, the rectangular frame of the target in the image and the feature vector of the target.
And 3, calculating correlation matrixes of target feature vectors corresponding to all detection results output in the video sequence, and further obtaining tracking results of all detected targets in the video sequence.
Step 4, according to the time axis, the targets in preceding and following adjacent video sequences are matched and tracked by using the target detection tracking tracks within the video image frame sequences (the number of the image in which each target is located within the sequence and the rectangular box of the target in the image) and the feature matrices (the row-wise concatenation of the feature vectors).
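For orientation only, the data flow of steps 1 to 4 can be sketched in Python as follows. This is a minimal illustration under assumed interfaces rather than the patented implementation: the callables m1, m2, link_within and link_across and the segment length seg_len are hypothetical placeholders standing for the operations described above.

```python
from typing import Callable, List, Sequence

def detect_and_track(
    frames: Sequence,        # decoded video frames, in time order
    m1: Callable,            # step 2: per-sequence target detection + feature extraction
    m2: Callable,            # step 4: per-sequence tracking-feature extraction
    link_within: Callable,   # step 3: association inside a single frame sequence
    link_across: Callable,   # step 4: association between two adjacent sequences
    seg_len: int = 16,       # assumed segment length; not specified in the patent
) -> List:
    # Step 1: segmented sampling of the video into several frame sequences.
    sequences = [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]

    # Step 2: for each sequence, M1 yields per target the image number in the
    # sequence, the bounding box in the image and a feature vector.
    detections = [m1(seq) for seq in sequences]

    # Step 3: within-sequence tracks from the correlation matrix of the feature vectors.
    tracks = [link_within(det) for det in detections]
    if not tracks:
        return []

    # Step 4: sequences are already in time-axis order; M2 produces a tracking feature
    # per target per sequence, used to associate targets across adjacent sequences.
    track_feats = [m2(trk) for trk in tracks]
    full_tracks = tracks[0]
    for k in range(1, len(tracks)):
        full_tracks = link_across(full_tracks, tracks[k], track_feats[k - 1], track_feats[k])
    return full_tracks
```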
The target detection and feature extraction for each video image frame sequence are computed as follows: the trained neural network model M1 is executed, and its inference directly yields the number of the image in which each target is located within the sequence, the rectangular box of the target in the image, and the feature vector of the target.
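For concreteness, the information returned for each detected target by one inference pass of M1 can be held in a record such as the following; the field names and the (x1, y1, x2, y2) box convention are illustrative assumptions rather than details fixed by the patent.

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class Detection:
    frame_index: int                          # number of the image in which the target is located within the sequence
    box: Tuple[float, float, float, float]    # rectangular box of the target in the image, assumed as (x1, y1, x2, y2)
    feature: np.ndarray                       # feature vector of the target, used for matching and tracking
```

The feature matrix of a track is then simply the row-wise concatenation of the feature vectors of its detections, e.g. np.stack([d.feature for d in track]).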
Wherein the neural network model M1 is trained by the following steps:
collecting annotated video data;
cutting the sampled video segments to obtain the video image frame sequences together with, for each target, the number of the annotated image in which it appears within the sequence, its rectangular box in the image, and its identity number;
training and optimizing the network model through target detection and classification on the video image sequences.
The following is an implementation scheme of the target detection and tracking method in a video; the specific steps are as follows:
Training the neural network model M1 for target detection and matching-feature calculation within a video image frame sequence comprises the following specific steps:
1. A number of video segments V are collected; the target locations in the video image sequences and the ID of each target from its appearance to its disappearance are manually annotated, giving the original annotated sample set A = {V1, V2, …, VL}.
2. Using deep learning theory and methods, each video segment Vi in the original annotated sample set A is segmented and sampled to generate several video image frame sequences Pi, Pi+1, …, Pi+k ∈ Vi, yielding the training/test sample set B = {P1, P2, …, Pi, Pi+1, …, Pi+k, …, Pn-1, Pn}.
3. Using deep learning theory and methods in combination with the training/test sample set B, a neural network model M1 that can detect targets and compute target features is obtained through multi-task training.
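A minimal sketch of how the training/test sample set B could be derived from the annotated set A by segmented sampling is given below; the segment length and stride are assumed parameters, and in practice the per-frame annotations (target IDs and boxes) would be sliced alongside the frames.

```python
from typing import List, Sequence

def segment_video(frames: Sequence, seg_len: int = 16, stride: int = 16) -> List[Sequence]:
    """Cut one annotated video segment V_i into frame sequences P_i, P_i+1, ..., P_i+k."""
    last_start = max(len(frames) - seg_len, 0)
    return [frames[s:s + seg_len] for s in range(0, last_start + 1, stride)]

def build_sample_set_b(sample_set_a: List[Sequence], seg_len: int = 16, stride: int = 16) -> List[Sequence]:
    """Pool the frame sequences of all annotated video segments into B = {P_1, ..., P_n}."""
    b: List[Sequence] = []
    for video in sample_set_a:
        b.extend(segment_video(video, seg_len, stride))
    return b
```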
Training the neural network model M2 for computing target matching and tracking features between video image frame sequences comprises the following specific steps:
1. The neural network model M1 is used to obtain, for each video image sequence Pi in the training/test sample set B, the tracking trajectory of each target (the number of the image in which the target is located within the sequence and the rectangular box of the target in the image) and its feature matrix (the row-wise concatenation of the feature vectors).
2. Using the annotated target information of each video segment Vi together with the tracking trajectories and feature matrices obtained by passing each video image frame sequence Pi+j through the neural network model M1, a feature sample set of each target across the different video image frame sequences of video segment Vi is obtained: O = {q1, q2, …, qk}, where qi is produced by M1 on Pi. This yields the training data set C = {O1, O2, …, Os} of target matching and tracking features between video image sequences.
3. Using deep learning theory and methods in combination with the training data set C, the neural network model M2 for computing target matching and tracking features between video image sequences is obtained by training.
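The grouping of M1 outputs into the training data set C can be sketched as follows. It assumes each detection is already associated with its annotated target ID (possible because the videos in A are labeled); this bookkeeping detail is an assumption, not something the patent prescribes.

```python
from collections import defaultdict
from typing import Dict, List
import numpy as np

def build_target_feature_sets(per_sequence_detections: List[List[dict]]) -> Dict[int, List[np.ndarray]]:
    """For one video segment V_i, collect per annotated target ID its feature matrix
    in every frame sequence P_i+j, i.e. the set O = {q_1, ..., q_k} for that target."""
    per_target: Dict[int, List[np.ndarray]] = defaultdict(list)
    for seq_dets in per_sequence_detections:            # one entry per frame sequence
        by_id: Dict[int, List[np.ndarray]] = defaultdict(list)
        for det in seq_dets:                             # det: {"target_id": int, "feature": np.ndarray, ...}
            by_id[det["target_id"]].append(det["feature"])
        for tid, feats in by_id.items():
            per_target[tid].append(np.stack(feats))      # feature matrix of this target in this sequence
    return dict(per_target)
```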
Detecting and tracking targets in a video with the neural network models M1 and M2 comprises the following specific steps:
1. The video segment to be analyzed is sampled in segments to generate several video image frame sequences.
2. For each video image frame sequence, the inference process of the neural network model M1 is executed to obtain the number of the image in which each target is located within the sequence, the rectangular box of the target in the image, and the feature vector of the target.
3. The correlation matrix of the target feature vectors corresponding to all detection results output for the video image frame sequence is calculated (the correlation may be computed with the Euclidean distance, the Mahalanobis distance, etc.; see the sketch after step 4), from which the tracking results of all detected targets within the video image frame sequence are obtained.
4. The segmented-sampled video image frame sequences are ordered according to the time-axis information, and the neural network model M2 is executed on the tracking trajectories and feature matrices to obtain the tracking feature of each target in each video image frame sequence. This feature is used to calculate the correlation of all targets between two adjacent video image frame sequences (again with the Euclidean distance, the Mahalanobis distance, etc.), thereby completing the tracking of the targets across the whole video segment.
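As one possible realization of the distance-based matching used in steps 3 and 4, the sketch below builds a Euclidean distance matrix between two sets of feature vectors (the correlation matrix of the method) and solves the assignment with the Hungarian algorithm from SciPy; the Mahalanobis distance mentioned above could be substituted, and the gating threshold max_dist is an assumed parameter.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_by_features(feats_a: np.ndarray, feats_b: np.ndarray, max_dist: float = 0.7):
    """Associate two groups of targets from their feature vectors.

    feats_a has shape (N, D) and feats_b has shape (M, D); returns a list of (i, j)
    index pairs.  The same routine serves both for linking detections inside one video
    image frame sequence (step 3) and for linking the tracks of adjacent sequences via
    the tracking features produced by M2 (step 4)."""
    if len(feats_a) == 0 or len(feats_b) == 0:
        return []
    # Pairwise Euclidean distances; a Mahalanobis distance could be used instead.
    dist = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
    # Optimal one-to-one assignment, then discard pairs whose distance is too large.
    rows, cols = linear_sum_assignment(dist)
    return [(i, j) for i, j in zip(rows, cols) if dist[i, j] <= max_dist]
```

Within a sequence the matrix would typically be built frame by frame over the per-detection feature vectors from M1; between adjacent sequences the rows and columns are the per-track tracking features from M2.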
In conclusion, the method for detecting and tracking targets in a video operates on video image frame sequence data, combining the information of single image frames with the inter-frame correlation within the video image frame sequences. Compared with target detection methods based on single-frame images, the present method exploits the correlated information across the image sequence, which improves detection performance. When features are computed from a single frame image by conventional machine-learning methods and used for tracking and matching, they must distinguish similar targets of the same class; either the computation needed to obtain such features is very large, or their discriminative power is poor, so matching errors occur easily and tracking fails. The tracking and matching of the invention is therefore divided into two stages: matching and tracking of targets within a video image frame sequence, and matching and tracking of targets between different frame sequences over a short time span. The matching features used inside a video image frame sequence rely on the correlation and the multi-frame image information within that sequence, and their discriminative power only needs to cover the targets inside the sequence; the matching between video image frame sequences mainly uses the within-sequence matching and tracking results together with the features of the targets in each sequence, which effectively improves tracking accuracy. Compared with other methods, the present method can effectively reduce the amount of computation required to complete the target detection and tracking task in a video.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope; the invention is not limited to the embodiments described herein, which are provided to assist those skilled in the art in practicing the invention.

Claims (3)

1. A method for detecting and tracking a target in a video is characterized by comprising the following steps:
step 1, performing segmented sampling on a video to obtain a plurality of segments of video image frame sequences;
step 2, adopting a neural network model M1 to carry out target detection and feature extraction on each video image frame sequence, wherein the output information comprises: the number of the image in which the target is located within the sequence, the rectangular box of the target in the image, and the feature vector of the target;
step 3, calculating the correlation matrix of the target feature vectors corresponding to all detection results output in the video sequence, and further obtaining the tracking results of all detected targets within the video sequence;
step 4, ordering the segmented-sampled video image frame sequences according to the time axis, inputting the target detection tracking tracks and the feature matrices of the video image frame sequences into a neural network model M2 to obtain the tracking feature of each target in each video image frame sequence, and calculating the correlation of all targets between two adjacent video image frame sequences by using the tracking feature, thereby completing the tracking of the targets in the whole video segment.
2. The method according to claim 1, characterized in that the neural network model M1 is established in the following way:
collecting a large number of video segments, manually marking the target positions in the video image sequence and the ID information of each target from appearance to disappearance to obtain an original marked sample set;
by utilizing a deep learning method, for each video segment in the original annotated sample set, performing segmented sampling to generate a plurality of video image frame sequences, obtaining a training/test sample set;
obtaining the neural network model M1 by utilizing a deep learning method, in combination with the training/test sample set, through multi-task training.
3. The method according to claim 2, characterized in that the neural network model M2 is established in the following way:
using the neural network model M1 to obtain the tracking trajectory and the feature matrix of each target in each video image sequence of the training/test sample set;
using the target information annotated in each video and the tracking trajectories and feature matrices obtained by passing each video image frame sequence through the neural network model M1, obtaining a feature sample set of each target in the different video image frame sequences of each video segment, thereby generating a training data set of target matching and tracking features among the video image sequences;
training with a deep learning method in combination with the training data set to obtain the neural network model M2.
CN201810940035.1A 2018-08-17 2018-08-17 Target detection tracking method in video Active CN108986143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810940035.1A CN108986143B (en) 2018-08-17 2018-08-17 Target detection tracking method in video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810940035.1A CN108986143B (en) 2018-08-17 2018-08-17 Target detection tracking method in video

Publications (2)

Publication Number Publication Date
CN108986143A CN108986143A (en) 2018-12-11
CN108986143B true CN108986143B (en) 2022-05-03

Family

ID=64553984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810940035.1A Active CN108986143B (en) 2018-08-17 2018-08-17 Target detection tracking method in video

Country Status (1)

Country Link
CN (1) CN108986143B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711332B (en) * 2018-12-26 2021-03-26 浙江捷尚视觉科技股份有限公司 Regression algorithm-based face tracking method and application
CN109934096B (en) * 2019-01-22 2020-12-11 浙江零跑科技有限公司 Automatic driving visual perception optimization method based on characteristic time sequence correlation
CN111862145B (en) * 2019-04-24 2022-05-17 四川大学 Target tracking method based on multi-scale pedestrian detection
CN110503663B (en) * 2019-07-22 2022-10-14 电子科技大学 Random multi-target automatic detection tracking method based on frame extraction detection
CN113033582B (en) * 2019-12-09 2023-09-26 杭州海康威视数字技术股份有限公司 Model training method, feature extraction method and device


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10270642B2 (en) * 2012-12-05 2019-04-23 Origin Wireless, Inc. Method, apparatus, and system for object tracking and navigation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
BRPI0806019A2 (en) * 2008-07-15 2010-08-31 Invisys Sist S De Visao Comput Ltda counting and tracking of people on the move based on computer vision
CN102004920A (en) * 2010-11-12 2011-04-06 浙江工商大学 Method for splitting and indexing surveillance videos
CN102750527A (en) * 2012-06-26 2012-10-24 浙江捷尚视觉科技有限公司 Long-time stable human face detection and tracking method in bank scene and long-time stable human face detection and tracking device in bank scene
CN104094279A (en) * 2014-04-30 2014-10-08 中国科学院自动化研究所 Large-range-first cross-camera visual target re-identification method
CN104954743A (en) * 2015-06-12 2015-09-30 西安理工大学 Multi-camera semantic association target tracking method
CN105574505A (en) * 2015-12-16 2016-05-11 深圳大学 Human body target re-identification method and system among multiple cameras
CN106920248A (en) * 2017-01-19 2017-07-04 博康智能信息技术有限公司上海分公司 A kind of method for tracking target and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Matching tracking sequences across widely separated cameras; Yinghao Cai et al.; 2008 15th IEEE International Conference on Image Processing; 2008-12-12; 765-768 *
Infrared point target tracking algorithm with dual matching between sequence frames; Wang Ledong et al.; Journal of Optoelectronics·Laser; 2010-03-31; Vol. 21, No. 3; 465-469 *
Research on multi-target detection and tracking in surveillance video; Wu Erjie; China Master's Theses Full-text Database (Information Science and Technology); 2016-05-15 (No. 05); I136-590 *
Research on motion tracking and simulation of small targets in correlated image sequences; Yang Qiuying; Journal of System Simulation; 2008-03-31; Vol. 20, No. 6; 1645-1647, 1653 *

Also Published As

Publication number Publication date
CN108986143A (en) 2018-12-11

Similar Documents

Publication Publication Date Title
CN108986143B (en) Target detection tracking method in video
CN109657575B (en) Intelligent video tracking algorithm for outdoor constructors
CN111161315B (en) Multi-target tracking method and system based on graph neural network
Li et al. Robust people counting in video surveillance: Dataset and system
Lee et al. Learning discriminative appearance models for online multi-object tracking with appearance discriminability measures
CN110210335B (en) Training method, system and device for pedestrian re-recognition learning model
CN105654139A (en) Real-time online multi-target tracking method adopting temporal dynamic appearance model
Zhang et al. V-LPDR: Towards a unified framework for license plate detection, tracking, and recognition in real-world traffic videos
CN102254394A (en) Antitheft monitoring method for poles and towers in power transmission line based on video difference analysis
CN109376736A (en) A kind of small video target detection method based on depth convolutional neural networks
CN112861673A (en) False alarm removal early warning method and system for multi-target detection of surveillance video
CN112131929A (en) Cross-camera pedestrian tracking system and method based on block chain
Shirsat et al. Proposed system for criminal detection and recognition on CCTV data using cloud and machine learning
CN106572387A (en) Video sequence alignment method and video sequence alignment system
Yu et al. The multi-level classification and regression network for visual tracking via residual channel attention
CN102314591A (en) Method and equipment for detecting static foreground object
Mao et al. Aic2018 report: Traffic surveillance research
Yang et al. A method of pedestrians counting based on deep learning
CN104268902A (en) Multi-target video tracking method for industrial site
CN103996207A (en) Object tracking method
Vora et al. Bringing generalization to deep multi-view pedestrian detection
CN112348011B (en) Vehicle damage assessment method and device and storage medium
CN113724293A (en) Vision-based intelligent internet public transport scene target tracking method and system
Wang et al. Thermal infrared object tracking based on adaptive feature fusion
Sun et al. Deep learning-based vehicle tracking and traffic event detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231030

Address after: Room 319-2, 3rd Floor, Building 2, No. 262 Wantang Road, Xihu District, Hangzhou City, Zhejiang Province, 310012

Patentee after: Zhejiang Jiehuixin Digital Technology Co.,Ltd.

Address before: 311121 East Building, building 7, No. 998, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee before: ZHEJIANG ICARE VISION TECHNOLOGY Co.,Ltd.
