CN114140494A - Single-target tracking system and method in complex scene, electronic device and storage medium - Google Patents
- Publication number
- CN114140494A (application number CN202110742736.6A)
- Authority
- CN
- China
- Prior art keywords
- tracking
- target
- frame
- candidate
- template
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
Abstract
The invention discloses a single-target tracking system and method in a complex scene, an electronic device, and a storage medium. The tracking system comprises: a preprocessing module configured to: process an initial target frame and subsequent video frames according to the incoming template frame of the violating target to obtain a template region and a search region, and pass the target template region into a re-identification network to obtain the initial features of the target; a tracking pre-screening module configured to: obtain the template region and search region from the preprocessing module, pass them into a deep-learning single-target tracking algorithm, screen out the top 10 candidate tracking frames with the highest confidence via the non-maximum suppression (NMS) algorithm, and pass them to a feature comparison module; a feature comparison module configured to: compute the cosine similarity between each of the top 10 candidate tracking frames and the initial features of the target, and select the best target tracking frame according to the similarity; and a threshold linkage module configured to: adjust the size of the candidate template's search region in the preprocessing module according to the confidence of the output candidate frame.
Description
Technical Field
The invention relates to a single-target tracking system and method in a complex scene, an electronic device, and a storage medium, and belongs to the technical field of video surveillance and security.
Background
Target tracking is an important component of computer vision research and is in great demand in fields such as surveillance and security, autonomous driving, and precision guidance. Its application scenes can be divided into the civil field and the military field, each with its own characteristics. In the civil field, since the time of appearance and the duration of a target are uncertain, a video surveillance system must run for long periods with high stability; in the military field, the flight speed of a highly maneuverable target can exceed Mach 5, and the tracking system must guarantee real-time performance and accuracy in a complex battlefield environment. Under these circumstances, manually identifying and marking the target to be tracked cannot meet the requirements that practical applications place on a tracking system, so research into target tracking algorithms that replace manual methods is of great significance.
In recent years, single-target tracking algorithms based on the deep-learning twin (Siamese) network family have improved greatly, but in practical scenes the interference encountered by a target is far more complicated, which greatly degrades tracking performance.
Patent 1: CN201910882990.9, "A robust single-target tracking method based on deep learning". This method determines whether template updating is started by setting a threshold, updates the template using the confidence, and updates features in time as the target changes, so as to avoid erroneous tracking caused by template updates. Patent 2: CN110807794A, "A single-target tracking method based on multiple features". This design uses a correlation-filter tracking method to perform correlation operations on convolution features and difference-image features respectively; after the resulting response maps are fused, the fusion result is used as the basis for dynamically correcting the target coordinates during tracking.
Disclosure of Invention
The prior art has the following disadvantages. Patent 1, "A robust single-target tracking method based on deep learning": its main defect is that it sets a threshold on the tracking score to update the target template, but once several targets with the same attributes are nearby, another target may be tracked while still yielding high confidence, so the target is completely lost when the next template update goes wrong. Patent 2, "A single-target tracking method based on multiple features": its main design defect is that it still uses a traditional target tracking algorithm; when the target deforms greatly, the illumination changes, or the target is occluded, the target can still be lost, leading to subsequent false alarms.
The invention aims to overcome the technical defects of the prior art, solve the above technical problems, and provide a single-target tracking system and method in a complex scene, an electronic device, and a storage medium.
The invention specifically adopts the following technical scheme: a single-target tracking system in a complex scene comprises:
a preprocessing module configured to: process an initial target frame and subsequent video frames according to the incoming template frame of the violating target to obtain a template region and a search region; pass the target template region into a re-identification network to obtain the initial features of the target;
a tracking pre-screening module configured to: obtain the template region and search region from the preprocessing module, pass them into a deep-learning single-target tracking algorithm, screen out the top 10 candidate tracking frames with the highest confidence via the non-maximum suppression (NMS) algorithm, and pass them to the feature comparison module;
a feature comparison module configured to: compute the cosine similarity between each of the top 10 candidate tracking frames and the initial features of the target, and select the best target tracking frame according to the similarity;
a threshold linkage module configured to: adjust the sizes of the candidate template region and the search region in the preprocessing module according to the confidence of the output candidate frame.
The invention also provides a single-target tracking method in a complex scene, comprising:
a preprocessing step, specifically comprising: processing an initial target frame and subsequent video frames according to the incoming template frame of the violating target to obtain a template region and a search region; passing the target template region into a re-identification network to obtain the initial features of the target;
a tracking pre-screening step, specifically comprising: obtaining the template region and search region from the preprocessing step, passing them into a deep-learning single-target tracking algorithm, screening out the top 10 candidate tracking frames with the highest confidence via the NMS algorithm, and passing them to the feature comparison step;
a feature comparison step, specifically comprising: computing the cosine similarity between each of the top 10 candidate tracking frames and the initial features of the target, and selecting the best target tracking frame according to the similarity;
a threshold linkage step, specifically comprising: adjusting the sizes of the candidate template region and the search region in the preprocessing step according to the confidence of the output candidate frame.
As a preferred embodiment, the preprocessing step specifically includes:
Step SS11: detect the video stream with a target detection algorithm, or manually select, in the current frame Frame_init of the video stream, the frame B_init of the tracked target, and crop the template area to obtain the object O_crop;
Step SS12: obtain the means (R_mean, G_mean, B_mean) of the three RGB channels of the images in the video stream; the RGB three-channel mean is RGB_mean = (R_mean + G_mean + B_mean)/3;
Step SS13: compute the sizes of the template area and the search area according to formula (1) and formula (2), respectively. Taking the center of frame B_init as the center point, cut a square of side z_sz from the original image and resize it to (127, 127) with an interpolation algorithm to obtain the template image z_crop; similarly, cut a square of side x_sz from the current frame, centered on the frame passed in from the previous frame's tracking frame, and resize it to (271, 271) to obtain the search area x_crop. If the crop exceeds the image boundary, fill the missing pixels with the mean RGB_mean from step SS12 so that the cropped area lies within the image;
x_sz = z_sz * 271/127 (2)
wherein x is the horizontal coordinate of the center point of the initial frame, y is the vertical coordinate of the center of the initial frame, w is the width of the initial frame, and h is the height of the initial frame.
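The cropping of steps SS11–SS13 can be sketched as follows. Since formula (1) is not reproduced in the text, the standard SiamFC/SiamCAR context padding z_sz = sqrt((w + p)(h + p)) with p = (w + h)/2 is assumed here; formula (2) is taken verbatim, and out-of-bounds pixels are filled with the RGB mean of step SS12. All function names are illustrative.

```python
import numpy as np

def crop_sizes(w, h):
    """Template (z_sz) and search (x_sz) crop side lengths.

    Formula (1) is assumed to be the standard SiamFC/SiamCAR context
    padding; formula (2) is x_sz = z_sz * 271 / 127 as in the text.
    """
    p = (w + h) / 2.0
    z_sz = np.sqrt((w + p) * (h + p))   # assumed formula (1)
    x_sz = z_sz * 271.0 / 127.0         # formula (2)
    return z_sz, x_sz

def crop_and_pad(img, cx, cy, side, out_size):
    """Cut a square of length `side` centred at (cx, cy); pixels that
    fall outside the image are filled with the per-channel RGB mean,
    then the patch is resized to (out_size, out_size)."""
    mean = img.reshape(-1, 3).mean(axis=0)
    half = side / 2.0
    x0, y0 = int(round(cx - half)), int(round(cy - half))
    x1, y1 = x0 + int(round(side)), y0 + int(round(side))
    h_img, w_img = img.shape[:2]
    patch = np.empty((y1 - y0, x1 - x0, 3), dtype=img.dtype)
    patch[:] = mean                     # mean-fill, per step SS13
    sx0, sy0 = max(x0, 0), max(y0, 0)
    sx1, sy1 = min(x1, w_img), min(y1, h_img)
    patch[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = img[sy0:sy1, sx0:sx1]
    # nearest-neighbour resize keeps the sketch dependency-free
    ys = (np.arange(out_size) * patch.shape[0] / out_size).astype(int)
    xs = (np.arange(out_size) * patch.shape[1] / out_size).astype(int)
    return patch[ys][:, xs]
```

In practice a bilinear or bicubic interpolation (e.g. OpenCV) would replace the nearest-neighbour resize; the mean-fill and the 271/127 ratio are the load-bearing parts.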
As a preferred embodiment, the tracking prescreening step specifically includes:
step SS 21: collect the data sets required for single-target tracking, comprising nine data sets (COCO, GOT10K, VOT2020, LASOT, TrackingNet, VID, DET, YOUTUBEBB, and UAV123) used to train the neural network;
step SS 22: training to obtain a single target tracking model based on deep learning;
Step SS23: pass the template region and the search region into the single-target tracking model of step SS22 to obtain a series of candidate frames and their corresponding confidences; then sort the candidate frames by confidence from large to small via the NMS algorithm and select the first 10 to obtain the candidate set of tracking frames {(B_1, Score_tracking1), (B_2, Score_tracking2), ..., (B_10, Score_tracking10)}, where B denotes the coordinates of a frame and Score_tracking denotes the corresponding confidence score;
Step SS24: crop the 10 frames of step SS23 from the original image of the video frame to obtain the candidate crop areas.
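The candidate selection of step SS23 (sort by confidence, keep the top 10) can be sketched minimally as below; the function name and the (box, score) pairing are assumptions, and the IoU suppression inside NMS is omitted here.

```python
def top_k_candidates(boxes, scores, k=10):
    """Sort candidate frames by confidence (largest first) and keep
    the top k, as in step SS23.  `boxes` holds frame coordinates."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(boxes[i], scores[i]) for i in order[:k]]
```

The returned pairs correspond to the candidate set {(B_1, Score_tracking1), ..., (B_10, Score_tracking10)} and feed directly into step SS24's cropping.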
As a preferred embodiment, the threshold of the NMS algorithm in said step SS23 is 0.4.
As a preferred embodiment, the feature comparison step specifically includes:
step SS 31: collect a pedestrian re-identification data set and train on it to obtain a deep metric-learning network; the network learns to measure the cosine distance between objects and then assigns each object to the cluster at the nearest distance;
step SS 32: adjusting the size of each of the O _ crop obtained in the step SS11 in the preprocessing step and the 10 candidate regions obtained in the step SS24 in the tracking preliminary screening step to 64 (width) × 128 (height);
Step SS33: pass the regions cropped in step SS32 into the deep metric-learning network of step SS31 to obtain their respective 128-dimensional feature vectors Feature_init and Feature_candidate = {Feature_candidate1, Feature_candidate2, ..., Feature_candidate10}, and then compute the respective cosine similarity scores Score_reid = {Score_reid1, Score_reid2, ..., Score_reid10};
Step SS34: fuse the tracking score with the score of the candidate frame according to formula (3) to obtain the final tracking score Score_final;
Score_final = w * Score_tracking + (1 - w) * Score_reid (3)
Step SS35: sort Score_final from large to small and output the frame corresponding to the highest confidence.
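Formula (3) and the selection of steps SS33–SS35 can be sketched as follows; the function names are illustrative, and the short vectors below stand in for the 128-dimensional outputs of the re-identification network.

```python
import numpy as np

def cosine_score(f1, f2):
    """Cosine similarity between two feature vectors (step SS33)."""
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def fuse_and_select(track_scores, reid_scores, boxes, w=0.4):
    """Formula (3): Score_final = w * Score_tracking + (1 - w) * Score_reid;
    return the frame with the highest fused score (steps SS34-SS35)."""
    final = [w * t + (1 - w) * r for t, r in zip(track_scores, reid_scores)]
    best = int(np.argmax(final))
    return boxes[best], final[best]
```

With w = 0.4 as in the preferred embodiment, a candidate with a mediocre tracking score but near-perfect re-ID similarity can outrank a high-confidence distractor, which is the mechanism the patent relies on against same-class interference.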
In a preferred embodiment, w in step SS34 is 0.4.
As a preferred embodiment, the threshold linking step specifically includes:
Step SS41: judge whether Score_final is greater than the threshold Score_threshold; if so, do not enter this module and exit; if it is less than the threshold, go to step SS42;
Step SS42: readjust the size of the search area according to formula (4), and use it for cropping when processing the next frame;
the enlarged search area is:
x_sz = 1.5 * z_sz * 271/127 (4).
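The threshold linkage of steps SS41–SS42, combining formulas (2) and (4), can be sketched as below; the concrete value of Score_threshold is not given in the text, so the 0.5 default is only an assumption.

```python
def adjust_search_size(score_final, z_sz, score_threshold=0.5):
    """Threshold linkage (steps SS41-SS42): when the fused score falls
    to or below the threshold, enlarge the search region by 1.5x per
    formula (4); otherwise keep the normal size of formula (2).
    score_threshold = 0.5 is an assumption -- the text gives no value."""
    if score_final > score_threshold:
        return z_sz * 271.0 / 127.0        # formula (2)
    return 1.5 * z_sz * 271.0 / 127.0      # formula (4)
```

The enlarged window widens the next frame's crop, which is what lets a temporarily disappeared target be found again once it re-enters the picture.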
The invention also proposes an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the steps of the method are implemented.
The invention also proposes a storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the method are implemented.
The invention achieves the following beneficial effects: 1. high-precision, stable, real-time tracking can be achieved for any framed target larger than 10 pixels; a richer training data set yields a better-performing tracking model, and tracking is completed even when the target undergoes scale deformation and illumination changes; 2. when the target passes through complex interference from objects of the same class, the system still tracks stably and does not lose the target; 3. the fusion of feature re-comparison and the deep-learning tracking algorithm greatly reduces the probability of tracking failure, and re-comparing features of the tracking candidate frames further improves tracking reliability; 4. when the target disappears from the picture for a period of time and reappears, the method can still accurately re-identify and re-track it.
Drawings
FIG. 1 is a schematic diagram of a principle topology of a deep learning-based single-target tracking system in a complex scenario of the present invention;
FIG. 2 is a flowchart of a single-target tracking method based on deep learning in a complex scene.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Example 1: the invention provides a single-target tracking system under a complex scene, which comprises:
a preprocessing module configured to: process an initial target frame and subsequent video frames according to the incoming template frame of the violating target to obtain a template region and a search region; pass the target template region into a re-identification network to obtain the initial features of the target;
a tracking pre-screening module configured to: obtain the template region and search region from the preprocessing module; the tracking module adopts SiamCAR, a recent deep-learning-based single-target tracking algorithm, screens out the top 10 candidate tracking frames with the highest confidence via the non-maximum suppression (NMS) algorithm, and passes them to the feature comparison module;
a feature comparison module configured to: compute the cosine similarity between each of the top 10 candidate tracking frames and the initial features of the target, and select the best target tracking frame according to the similarity;
a threshold linkage module configured to: adjust the size of the candidate template's search region in the preprocessing module according to the confidence of the output candidate frame, so that a target that disappears and later reappears can still be recognized and tracked.
The single-target tracking system in a complex scene can greatly improve tracking precision and robustness under various complex conditions, consumes little time, and fully achieves real-time performance.
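Assembled from the four modules above, a per-frame loop might look like the following skeleton; the tracker and re-identification network are placeholder callables, since the text names SiamCAR and a pedestrian re-ID network but specifies no programming interface.

```python
class SingleTargetTracker:
    """Skeleton of the four-module loop described above.  The module
    callables are stand-ins: the patent names SiamCAR and a re-ID
    network but gives no API, so their signatures are assumptions."""

    def __init__(self, tracker, reid_net, w=0.4, threshold=0.5):
        self.tracker = tracker        # deep-learning tracking model
        self.reid_net = reid_net      # metric-learning re-ID network
        self.w = w                    # fusion weight of formula (3)
        self.threshold = threshold    # Score_threshold of step SS41
        self.template_feat = None
        self.enlarge = 1.0            # threshold-linkage factor

    def init(self, template_crop):
        # preprocessing module: initial features from the re-ID network
        self.template_feat = self.reid_net(template_crop)

    def update(self, search_crop):
        # tracking pre-screening: top-10 (frame, score, crop) candidates
        candidates = self.tracker(search_crop)[:10]
        # feature comparison: fuse tracking score with re-ID similarity
        best_box, best_score = None, -1.0
        for box, score, crop in candidates:
            sim = self._cosine(self.reid_net(crop), self.template_feat)
            fused = self.w * score + (1 - self.w) * sim
            if fused > best_score:
                best_box, best_score = box, fused
        # threshold linkage: widen the search area if confidence is low
        self.enlarge = 1.0 if best_score > self.threshold else 1.5
        return best_box, best_score

    @staticmethod
    def _cosine(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den
```

The `enlarge` factor would be consumed by the preprocessing module when cropping the next frame's search region, closing the linkage loop of Example 1.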
Example 2: the invention also provides a single-target tracking method in a complex scene, which comprises the following steps:
the preprocessing step, specifically comprising: processing an initial target frame and subsequent video frames according to the incoming template frame of the violating target to obtain a template region and a search region; passing the target template region into a re-identification network to obtain the initial features of the target;
the tracking pre-screening step, specifically comprising: obtaining the template region and search region from the preprocessing step, screening out the top 10 candidate tracking frames with the highest confidence via the NMS algorithm, and passing them to the feature comparison step;
the feature comparison step, specifically comprising: computing the cosine similarity between each of the top 10 candidate tracking frames and the initial features of the target, and selecting the best target tracking frame according to the similarity;
the threshold linkage step, specifically comprising: adjusting the sizes of the candidate template region and the search region in the preprocessing step according to the confidence of the output candidate frame.
Preferably, the pretreatment step specifically comprises:
Step SS11: detect the video stream with a target detection algorithm (either a traditional detection algorithm or a deep-learning algorithm), or manually select, in the current frame Frame_init of the video stream, the frame B_init of the tracked target, and crop the template area to obtain the object O_crop;
Step SS12: obtain the means (R_mean, G_mean, B_mean) of the three RGB channels of the images in the video stream; the RGB three-channel mean is RGB_mean = (R_mean + G_mean + B_mean)/3;
Step SS13: compute the sizes of the template area and the search area according to formula (1) and formula (2), respectively. Taking the center of frame B_init as the center point, cut a square of side z_sz from the original image and resize it to (127, 127) with an interpolation algorithm to obtain the template image z_crop; similarly, cut a square of side x_sz from the current frame, centered on the frame passed in from the previous frame's tracking frame, and resize it to (271, 271) to obtain the search area x_crop. If the crop exceeds the image boundary, fill the missing pixels with the mean RGB_mean from step SS12 so that the cropped area lies within the image;
x_sz = z_sz * 271/127 (2)
wherein x is the horizontal coordinate of the center point of the initial frame, y is the vertical coordinate of the center of the initial frame, w is the width of the initial frame, and h is the height of the initial frame.
Preferably, the tracking prescreening step specifically includes:
step SS 21: collect the data sets required for single-target tracking, comprising nine data sets (COCO, GOT10K, VOT2020, LASOT, TrackingNet, VID, DET, YOUTUBEBB, and UAV123) used to train the neural network; the richer data in these sets further improves robustness to target variation;
step SS 22: train a deep-learning-based single-target tracking model according to the single-target tracking algorithm SiamCAR;
Step SS23: pass the template region and the search region into the single-target tracking model of step SS22 to obtain a series of candidate frames and their corresponding confidences; then sort the candidate frames by confidence from large to small via the NMS algorithm and select the first 10 to obtain the candidate set of tracking frames {(B_1, Score_tracking1), (B_2, Score_tracking2), ..., (B_10, Score_tracking10)}, where B denotes the coordinates of a frame and Score_tracking denotes the corresponding confidence score;
Step SS24: crop the 10 frames of step SS23 from the video frame to obtain the candidate crop areas.
Preferably, the threshold of the NMS algorithm in step SS23 is 0.4; experiments show that this value works well for classes such as pedestrians.
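The NMS step with the 0.4 threshold can be illustrated with a minimal greedy implementation; the (x1, y1, x2, y2) frame format and the interpretation that frames overlapping a kept frame by IoU > 0.4 are suppressed are assumptions, as the text does not spell out the NMS variant.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) frames."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.4):
    """Greedy non-maximum suppression with the 0.4 threshold of step
    SS23: keep the highest-scoring frame, drop any frame whose IoU
    with a kept frame exceeds the threshold, repeat; returns kept
    indices in descending-score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

A lower threshold suppresses more aggressively; 0.4 keeps pedestrian candidates that overlap only moderately, which matches the stated experimental finding.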
Preferably, the feature comparison step specifically includes:
step SS 31: according to Table 1, collect a pedestrian re-identification data set and train on it to obtain a deep metric-learning network; the network learns to measure the cosine distance between objects and then assigns each object to the cluster at the nearest distance;
step SS 32: adjusting the size of each of the O _ crop obtained in the step SS11 in the preprocessing step and the 10 candidate regions obtained in the step SS24 in the tracking preliminary screening step to 64 (width) × 128 (height);
Step SS33: pass the regions cropped in step SS32 into the deep metric-learning network of step SS31 to obtain their respective 128-dimensional feature vectors Feature_init and Feature_candidate = {Feature_candidate1, Feature_candidate2, ..., Feature_candidate10}, and then compute the respective cosine similarity scores Score_reid = {Score_reid1, Score_reid2, ..., Score_reid10};
Step SS34: fuse the tracking score with the score of the candidate frame according to formula (3) to obtain the final tracking score Score_final;
Score_final = w * Score_tracking + (1 - w) * Score_reid (3)
Step SS35: sort Score_final from large to small and output the frame corresponding to the highest confidence.
Table 1: architecture of the deep cosine metric-learning network based on deep learning.
Preferably, w in step SS34 is 0.4; experiments show that the tracking effect is best when w = 0.4.
Preferably, the threshold linkage step specifically comprises: setting a threshold on the output confidence; when the score is smaller than the threshold, a linkage mechanism is started that acts back on the preprocessing module to enlarge the search area, so as to find the temporarily disappeared target;
Step SS41: judge whether Score_final is greater than the threshold Score_threshold; if so, do not enter this module and exit; if it is less than the threshold, go to step SS42;
Step SS42: readjust the size of the search area according to formula (4), and use it for cropping when processing the next frame;
the enlarged search area is:
x_sz = 1.5 * z_sz * 271/127 (4).
The invention also proposes an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the steps of the method are implemented.
The invention also proposes a storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the method are implemented.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the invention, and these modifications and variations should also be regarded as falling within the protection scope of the invention.
Claims (10)
1. A single-target tracking system in a complex scene, characterized in that it comprises:
a preprocessing module configured to: process an initial target frame and subsequent video frames according to the incoming template frame of the violating target to obtain a template region and a search region; pass the target template region into a re-identification network to obtain the initial features of the target;
a tracking pre-screening module configured to: obtain the template region and search region from the preprocessing module, pass them into a deep-learning single-target tracking algorithm, screen out the top 10 candidate tracking frames with the highest confidence via the non-maximum suppression (NMS) algorithm, and pass them to the feature comparison module;
a feature comparison module configured to: compute the cosine similarity between each of the top 10 candidate tracking frames and the initial features of the target, and select the best target tracking frame according to the similarity;
a threshold linkage module configured to: adjust the size of the candidate template's search region in the preprocessing module according to the confidence of the output candidate frame.
2. A single-target tracking method in a complex scene, characterized in that it comprises:
a preprocessing step, specifically comprising: processing an initial target frame and subsequent video frames according to the incoming template frame of the violating target to obtain a template region and a search region; passing the target template region into a re-identification network to obtain the initial features of the target;
a tracking pre-screening step, specifically comprising: obtaining the template region and search region from the preprocessing step, passing them into a deep-learning single-target tracking algorithm, screening out the top 10 candidate tracking frames with the highest confidence via the NMS algorithm, and passing them to the feature comparison step;
a feature comparison step, specifically comprising: computing the cosine similarity between each of the top 10 candidate tracking frames and the initial features of the target, and selecting the best target tracking frame according to the similarity;
a threshold linkage step, specifically comprising: adjusting the size of the candidate template's search region in the preprocessing step according to the confidence of the output candidate frame.
3. The method for tracking the single target under the complex scene according to claim 2, wherein the preprocessing step specifically comprises:
step SS 11: detecting the video stream by using a target detection algorithm, or manually selecting a current Frame of the video streaminitTo obtain the frame B of the tracked targetinitThe template area is cut to obtain an object O _ crop;
step SS 12: RGB three channels (R) for obtaining images in video streammean,Gmean,Bmean) The RGB three-channel mean value is: RGB (Red, Green, blue) color filtermean=(Rmean+Gmean+Bmean)/3;
Step SS 13: respectively calculating the sizes of the template area and the search area according to a formula (1) and a formula (2); with a frame BinitThe center of the template picture is taken as a central point, a square with the length and the width both being Z _ sz is cut out from an original picture, then the size of the square is adjusted to (127 ) through an interpolation algorithm to obtain a template picture Z _ crop, similarly, the square with the length and the width both being X _ sz is cut out from a current frame according to the center of a frame transmitted by a previous frame tracking frame, and then the square is adjusted to (271 ) through the interpolation algorithm to obtain a search area X _ crop; once clipping exceeds the image boundary, the mean RGB in step SS12 is usedmeanFilling pixels to ensure that the area obtained by cutting is positioned in the image;
x_sz=z_sz*271/127 (2)
wherein, x is the horizontal coordinate of the center point of the initial frame, y is the vertical coordinate of the center of the initial frame, w is the width of the initial frame, and h is the height of the initial frame.
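As an illustrative sketch (not part of the claims), the mean-padded cropping of steps SS12-SS13 and formula (2) can be expressed in Python; the interpolation here is plain nearest-neighbour resizing, and z_sz is taken as an input because formula (1) is not reproduced above:

```python
import numpy as np

def crop_with_mean_padding(img, cx, cy, size, out_size):
    """Crop a size x size square centered at (cx, cy) from img (H, W, 3).
    Pixels falling outside the image are filled with the per-image RGB mean
    (the channel means of step SS12), then the square is resized to
    (out_size, out_size) by nearest-neighbour interpolation."""
    mean = img.reshape(-1, img.shape[-1]).mean(axis=0)   # (R_mean, G_mean, B_mean)
    half = size // 2
    h, w = img.shape[:2]
    canvas = np.tile(mean, (size, size, 1)).astype(img.dtype)  # mean-filled square
    x0, y0 = cx - half, cy - half                        # top-left of the crop
    sx0, sy0 = max(0, x0), max(0, y0)                    # valid region in the image
    sx1, sy1 = min(w, x0 + size), min(h, y0 + size)
    canvas[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = img[sy0:sy1, sx0:sx1]
    idx = np.arange(out_size) * size // out_size         # nearest-neighbour indices
    return canvas[idx][:, idx]

def search_size(z_sz):
    """Formula (2): x_sz = z_sz * 271 / 127."""
    return z_sz * 271 / 127
```

Under these assumptions, Z_crop would be `crop_with_mean_padding(frame, cx, cy, z_sz, 127)` and X_crop the analogous (271, 271) crop of size `search_size(z_sz)`.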
4. The single-target tracking method under the complex scene according to claim 2, wherein the tracking preliminary screening step specifically comprises:
step SS 21: collecting the data sets required for single-target tracking, namely the nine data sets COCO, GOT10K, VOT2020, LASOT, TrackingNet, VID, DET, YOUTUBEBB and UAV123, for training the neural network;
step SS 22: training a deep-learning single-target tracking model based on the single-target tracking algorithm SiamCAR;
step SS 23: respectively feeding the template region and the search region into the single-target tracking model of step SS22 to obtain a series of candidate frames and corresponding confidences, then sorting the candidate frames from large to small by confidence through the NMS algorithm and selecting the first 10 frames to obtain the candidate set of tracking frames {(B_1, Score_tracking1), (B_2, Score_tracking2), ..., (B_10, Score_tracking10)}; wherein B represents the coordinates of a frame and Score_tracking represents the corresponding confidence score;
step SS 24: cropping the 10 corresponding frames of step SS23 from the original image of the video frame to obtain the candidate crop areas.
5. The single-target tracking method under the complex scene according to claim 4, wherein the threshold of the NMS algorithm in the step SS23 is 0.4.
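A minimal sketch of the NMS screening of step SS23 with the claimed threshold of 0.4, assuming greedy suppression over axis-aligned (x1, y1, x2, y2) boxes (the claims do not fix the box format):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms_top_k(boxes, scores, iou_thresh=0.4, k=10):
    """Greedy NMS as in step SS23: walk boxes in descending confidence,
    keep a box only if its IoU with every already-kept box is below
    iou_thresh (0.4 per claim 5), and stop once k boxes are kept."""
    order = np.argsort(scores)[::-1]
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
            if len(kept) == k:
                break
    return [(boxes[i], scores[i]) for i in kept]
```

The kept (box, score) pairs correspond to the candidate set {(B_i, Score_tracking_i)} passed on to step SS24.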
6. The method for tracking a single target under a complex scene according to claim 2, wherein the feature comparison step specifically comprises:
step SS 31: collecting a pedestrian re-identification data set and training a deep-learning metric learning network, wherein the metric learning network learns the cosine distance between objects and assigns each object to the cluster at the nearest distance;
step SS 32: adjusting the size of the O_crop obtained in step SS11 of the preprocessing step and of each of the 10 candidate regions obtained in step SS24 of the tracking preliminary screening step to 64 (width) × 128 (height);
step SS 33: feeding the regions cropped in step SS32 into the deep-learning metric learning network of step SS31 to obtain the respective 128-dimensional feature vectors Feature_init and Feature_candidate = {Feature_candidate1, Feature_candidate2, ..., Feature_candidate10}, and then respectively calculating the cosine similarity scores Score_reid = {Score_reid1, Score_reid2, ..., Score_reid10};
step SS 34: fusing the tracking score with the score of each candidate frame according to formula (3) to obtain the final tracking score Score_final;
Score_final = w * Score_tracking + (1 - w) * Score_reid (3)
step SS 35: sorting Score_final from large to small and outputting the frame corresponding to the highest confidence.
7. The method for tracking the single target under the complex scene as recited in claim 6, wherein w in the step SS34 is selected to be 0.4.
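The cosine comparison of step SS33 and the score fusion of formula (3) with w = 0.4 can be sketched as follows; the candidate features and tracking confidences are taken as assumed inputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (128-d in step SS33)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fuse(score_tracking, score_reid, w=0.4):
    """Formula (3): Score_final = w * Score_tracking + (1 - w) * Score_reid,
    with w = 0.4 per claim 7."""
    return w * score_tracking + (1 - w) * score_reid

def best_candidate(tracking_scores, candidate_feats, feat_init, w=0.4):
    """Steps SS33-SS35: score each candidate against the initial target
    feature, fuse with the tracking confidence, and return the index of
    the best candidate together with all fused scores."""
    finals = [fuse(s, cosine_similarity(f, feat_init), w)
              for s, f in zip(tracking_scores, candidate_feats)]
    return int(np.argmax(finals)), finals
```

With w = 0.4, the re-identification similarity carries more weight (0.6) than the raw tracking confidence, which matches the claim's emphasis on appearance comparison in complex scenes.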
8. The single-target tracking method under the complex scene according to claim 2, wherein the threshold value linkage step specifically comprises:
step SS 41: judging whether Score_final is greater than the threshold Score_threshold; if so, this step is not entered and the process exits; if it is less than the threshold, proceeding to step SS 42;
step SS 42: readjusting the size of the search area according to formula (4), and then performing the crop processing on the next frame;
the enlarged search area is: x_sz = 1.5 * z_sz * 271/127 (4).
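Steps SS41-SS42 amount to a simple threshold check on the fused score; Score_threshold = 0.5 below is an assumed illustrative value, as the claims do not fix it:

```python
def adjust_search_size(score_final, z_sz, score_threshold=0.5):
    """Steps SS41-SS42: if the fused score exceeds the threshold, keep the
    normal search area of formula (2); otherwise enlarge it for the next
    frame per formula (4): x_sz = 1.5 * z_sz * 271 / 127.
    score_threshold = 0.5 is an assumed value, not fixed by the claims."""
    if score_final > score_threshold:
        return z_sz * 271 / 127        # formula (2): normal search area
    return 1.5 * z_sz * 271 / 127      # formula (4): enlarged search area
```

Enlarging the search window after a low-confidence frame gives the tracker a chance to re-acquire a target that moved quickly or was occluded, which is the stated purpose of the threshold linkage step.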
9. Electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 2 to 8 are implemented when the processor executes the program.
10. Storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 2 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110742736.6A CN114140494A (en) | 2021-06-30 | 2021-06-30 | Single-target tracking system and method in complex scene, electronic device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114140494A true CN114140494A (en) | 2022-03-04 |
Family
ID=80394204
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110742736.6A Pending CN114140494A (en) | 2021-06-30 | 2021-06-30 | Single-target tracking system and method in complex scene, electronic device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114140494A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110084829A (en) * | 2019-03-12 | 2019-08-02 | 上海阅面网络科技有限公司 | Method for tracking target, device, electronic equipment and computer readable storage medium |
CN110555870A (en) * | 2019-09-09 | 2019-12-10 | 北京理工大学 | DCF tracking confidence evaluation and classifier updating method based on neural network |
CN110647836A (en) * | 2019-09-18 | 2020-01-03 | 中国科学院光电技术研究所 | Robust single-target tracking method based on deep learning |
CN110853076A (en) * | 2019-11-08 | 2020-02-28 | 重庆市亿飞智联科技有限公司 | Target tracking method, device, equipment and storage medium |
TW202026940A (en) * | 2019-01-09 | 2020-07-16 | 圓展科技股份有限公司 | Target tracking method |
CN112001946A (en) * | 2020-07-14 | 2020-11-27 | 浙江大华技术股份有限公司 | Target object tracking method, computer equipment and device |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114694184A (en) * | 2022-05-27 | 2022-07-01 | 电子科技大学 | Pedestrian re-identification method and system based on multi-template feature updating |
CN114694184B (en) * | 2022-05-27 | 2022-10-14 | 电子科技大学 | Pedestrian re-identification method and system based on multi-template feature updating |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111462200B (en) | Cross-video pedestrian positioning and tracking method, system and equipment | |
CN104598883B (en) | Target knows method for distinguishing again in a kind of multiple-camera monitoring network | |
CN112215155A (en) | Face tracking method and system based on multi-feature fusion | |
CN110189375B (en) | Image target identification method based on monocular vision measurement | |
CN108197604A (en) | Fast face positioning and tracing method based on embedded device | |
CN113223045B (en) | Vision and IMU sensor fusion positioning system based on dynamic object semantic segmentation | |
CN108564598B (en) | Improved online Boosting target tracking method | |
CN105160649A (en) | Multi-target tracking method and system based on kernel function unsupervised clustering | |
CN105718882A (en) | Resolution adaptive feature extracting and fusing for pedestrian re-identification method | |
CN110570453A (en) | Visual odometer method based on binocular vision and closed-loop tracking characteristics | |
CN110443279B (en) | Unmanned aerial vehicle image vehicle detection method based on lightweight neural network | |
CN111160212A (en) | Improved tracking learning detection system and method based on YOLOv3-Tiny | |
CN106599918B (en) | vehicle tracking method and system | |
CN115841649A (en) | Multi-scale people counting method for urban complex scene | |
CN114708300A (en) | Anti-blocking self-adaptive target tracking method and system | |
CN114926859A (en) | Pedestrian multi-target tracking method in dense scene combined with head tracking | |
CN116109950A (en) | Low-airspace anti-unmanned aerial vehicle visual detection, identification and tracking method | |
CN114140494A (en) | Single-target tracking system and method in complex scene, electronic device and storage medium | |
Yu et al. | A unified transformer based tracker for anti-uav tracking | |
CN113781523A (en) | Football detection tracking method and device, electronic equipment and storage medium | |
CN113096016A (en) | Low-altitude aerial image splicing method and system | |
CN115731287B (en) | Moving target retrieval method based on aggregation and topological space | |
CN107730535B (en) | Visible light infrared cascade video tracking method | |
CN115880643A (en) | Social distance monitoring method and device based on target detection algorithm | |
CN114283199B (en) | Dynamic scene-oriented dotted line fusion semantic SLAM method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||