CN110929560B - Video semi-automatic target labeling method integrating target detection and tracking - Google Patents

Video semi-automatic target labeling method integrating target detection and tracking

Info

Publication number: CN110929560B
Application number: CN201910963482.3A
Authority: CN (China)
Prior art keywords: target, frame, value, tracking, detection
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110929560A
Inventors: 徐英, 谷雨, 刘俊, 彭冬亮, 陈庆林
Current assignee: Hangzhou Dianzi University
Original assignee: Hangzhou Dianzi University
Application filed by Hangzhou Dianzi University

Classifications

    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/253 Fusion techniques of extracted features
    • G06V10/50 Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; projection analysis
    • G06V10/56 Extraction of image or video features relating to colour
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V2201/07 Target detection

Abstract

The invention discloses a video semi-automatic target labeling method integrating target detection and tracking. A frame of the video is selected as the initial frame, in which the target position is labeled manually and its category label is assigned. In subsequent frames, an image-based target detection algorithm is fused with an image-sequence-based video target tracking algorithm to estimate the target position in each image, and the tracking algorithm is used to judge whether target labeling has finished. If labeling has finished, video key frames are extracted according to the saliency value of the target in each frame to obtain the labeling result; otherwise the target position continues to be estimated in the following video images. Because key frames are extracted according to target saliency, they reflect the diversity of target changes. Experiments on a multi-shot, multi-ship video verify the effectiveness of the proposed method.

Description

Video semi-automatic target labeling method integrating target detection and tracking
Technical Field
The invention belongs to the field of video data annotation, and relates to a video target labeling method that fuses target detection and target tracking and extracts video key frames according to target saliency.
Background
In recent years, deep learning has developed rapidly, driving continuous breakthroughs in target detection and target tracking. Because deep learning requires the support of big data, obtaining a large amount of accurately labeled training data with sample diversity is the key to its excellent performance.
At present, training data are acquired mainly by two methods: manual labeling and automatic labeling. Manual labeling marks the target position and label in single images by hand; since a video contains a large number of consecutive frames, manual labeling is inefficient, while the spatio-temporal continuity of targets in video makes automatic labeling possible. In the prior art, using only a correlation-filter-based target tracking algorithm for video target labeling yields results whose accuracy cannot meet the requirements of training data. Using only a target detection algorithm, the detector labels every target in subsequent frames that matches the category of the initial-frame target and cannot judge whether it is the same target as in the initial frame, or it misses detections because of target jitter, blur and similar factors, leading to inconsistent video target labels. The invention fuses the detection algorithm and the tracking algorithm and combines the advantages of both: it improves the accuracy of automatic labeling, uses the spatio-temporal continuity of the tracking algorithm to identify the same target, compensates for missed detections of the detector, automatically judges that the target has disappeared, and improves labeling efficiency.
The invention provides a video semi-automatic labeling method in which the target position is labeled manually in the initial frame, labeled automatically in subsequent frames, and a number of key frames are finally extracted automatically to obtain the labeling result. The main problems to be solved are: (1) how to improve the accuracy and consistency of video target labeling; (2) how to automatically judge target disappearance and the end of labeling, in order to reduce manual participation and improve labeling efficiency; (3) how to make the extracted key frames reflect the diversity of changes in target scale, angle, illumination and so on.
Since neither a stand-alone target detection algorithm nor a stand-alone target tracking algorithm can meet the requirements of automatic video target labeling, the invention fuses target detection and target tracking through reasonable rules, greatly improving the efficiency and accuracy of video target labeling; in addition, a method for extracting video key frames based on target saliency is provided, so that the extracted key frames accurately reflect the diversity of target changes.
Disclosure of Invention
To address the technical problems that existing automatic labeling means have low precision and poor continuity while manual labeling is slow, the invention provides a video semi-automatic target labeling method integrating target detection and tracking.
First, a frame of the video is selected as the initial frame, the initial target position is labeled manually, and the category label of the target is determined. In subsequent frames, an image-based target detection algorithm is fused with an image-sequence-based video target tracking algorithm to estimate the target position in each image, and the tracking algorithm is used to judge whether target labeling has finished. If labeling has finished, video key frames are extracted according to the saliency value of the target in each frame to obtain the labeling result; otherwise the target position continues to be estimated in the video images. The disclosed method thus fuses a target detection algorithm and a target tracking algorithm to label the video target accurately, judges the end of target labeling automatically, and extracts video key frames according to target saliency to obtain the labeling result.
The technical scheme adopted by the invention comprises the following steps:
1. A video semi-automatic target labeling method integrating target detection and tracking, characterized by comprising the following steps:
Step (1): selecting a frame in a shot of the video as the initial frame, manually labeling the initial position and size of the target, and determining the category label of the target;
Step (2): labeling the subsequent frames after the initial frame automatically, specifically by fusing an image-based target detection algorithm and an image-sequence-based video target tracking algorithm to estimate the position of the target in each image; the method comprises the following steps:
2.1 detecting targets in each frame of image with YOLO V3 and marking detection boxes;
for YOLO V3, the labeled target images are resized to a fixed scale and used as training samples to train the network; the number of YOLO layers is increased to 4, and four receptive-field feature maps of different scales, 13×13, 26×26, 52×52 and 104×104, are obtained through multi-scale feature fusion; the 13×13 feature map is predicted with the three prior boxes (116×90), (156×198) and (373×326) to detect large objects; the 26×26 feature map is predicted with the three prior boxes (30×61), (62×45) and (59×119) to detect medium-sized objects; the 52×52 feature map is predicted with the three prior boxes (10×13), (16×30) and (33×23) to detect small objects; the 104×104 feature map is predicted with the three newly added prior boxes (5×6), (8×15) and (16×10) to detect even smaller targets;
2.2 acquiring the tracking box of the target with the KCF correlation filtering tracking algorithm;
first, HOG features are extracted at the target position and size of the previous frame, transformed to the frequency domain by the Fourier transform, and mapped to a high-dimensional space with a Gaussian kernel function; the filter template α is obtained from equation (1):

$$\hat{\alpha} = \frac{\hat{g}}{\hat{k}^{xx} + \lambda} \qquad (1)$$

where x denotes the HOG features of the sample, ^ denotes the Fourier transform, g is a two-dimensional Gaussian function peaked at the center, and λ is a regularization parameter controlling overfitting during training; k^xx denotes the kernel autocorrelation of x in the high-dimensional space, computed as in equation (2):

$$k^{xx} = \exp\left(-\frac{1}{\sigma^{2}}\left(2\|x\|^{2} - 2\,\mathcal{F}^{-1}\Big(\sum_{c}\hat{x}_{c}^{*}\odot\hat{x}_{c}\Big)\right)\right) \qquad (2)$$

where σ is the width parameter of the Gaussian kernel function, controlling its radial extent, * denotes the complex conjugate, ⊙ denotes element-wise multiplication, F^-1 denotes the inverse Fourier transform, and c is the number of channels of the HOG feature x;
to adapt to changes in the target appearance, the filter is updated online; when tracking on the t-th frame image, the correlation filter α is updated as:

$$\hat{\alpha}_{t} = (1-\eta)\,\hat{\alpha}_{t-1} + \eta\,\hat{\alpha} \qquad (3)$$

where η is the update parameter;
to adapt to changes in the target scale, the filter α_t of the current frame is scaled before predicting the target size in the next frame, with scaling ratios [1.1, 1.05, 1, 0.95, 0.9];
candidate-sample HOG features z are extracted on the (t+1)-th frame image at the target position of the t-th frame; combined with each of the scaled filters above, each corresponding filter output response map f is given by equation (4):

$$f_{m} = \mathcal{F}^{-1}\left(\hat{k}^{xz}\odot\hat{\alpha}_{t,m}\right) \qquad (4)$$

where m = 1, 2, 3, 4, 5 corresponds to the scaling ratios [1.1, 1.05, 1, 0.95, 0.9], α_{t,m} is the filter scaled by the m-th ratio, and x denotes the HOG features of the t-th frame target;
the maximum value f_max is selected from the maxima max(f) of the 5 response maps f; the position of f_max is the target center, the scaling ratio corresponding to f_max gives the target size, and the tracking box of the (t+1)-th frame is obtained;
2.3 fusing the results of target detection and target tracking to determine the labeled target box;
first, judge whether the current frame image contains a detection box; if not, the target box is the tracking box; if there is exactly one detection box, compute the IOU of the tracking box and the detection box: if the IOU is greater than a threshold, the target box is the detection box and the KCF tracking algorithm is initialized with it, otherwise the target box is the tracking box; if there are multiple detection boxes, compute the IOU of the tracking box with every detection box and screen out the maximum IOU: if the maximum IOU is greater than the threshold, the target box is the detection box corresponding to the maximum IOU and the KCF tracking algorithm is initialized with it, otherwise the target box is the tracking box;
the IOU value evaluates the degree of overlap between the tracking box and each detection box in the current frame:

$$IOU = \frac{S_{I}}{S_{U}} \qquad (5)$$

where S_I is the overlapping area of the tracking box and a detection box in the same frame, and S_U is the area of their union, i.e. the sum of the areas of the tracking box and the detection box minus the overlapping area;
Step (3): judging whether target labeling has finished according to the target tracking algorithm;
according to the response map f of the KCF correlation filtering tracker, judge whether max(f) is smaller than a set threshold θ and the peak-to-sidelobe ratio PSR is smaller than a set threshold θ_PSR, i.e.:

$$\max(f) < \theta \;\;\text{and}\;\; PSR < \theta_{PSR} \qquad (7)$$

if so, target labeling is judged to have finished and the method turns to step (4) to select key frames; otherwise it returns to step (2) and continues to estimate the target position in the next frame image;
the PSR is computed as:

$$PSR = \frac{\max(f) - \mu_{\Phi}(f)}{\sigma_{\Phi}(f)} \qquad (6)$$

where max(f) is the peak of the correlation-filter response map f, Φ = 0.5, and μ_Φ(f) and σ_Φ(f) are the mean and standard deviation of the 50% response region centered on the peak of f;
Step (4): computing a saliency value for the target in every frame of the current shot, and extracting a set number of video key frames according to these saliency values to obtain the target labeling result; the method comprises the following steps:
4.1 the local binary pattern LBP extracts the texture features of the image; the basic idea is to work in a 3×3 pixel neighborhood, taking the center pixel as a threshold: the gray values of the 8 neighboring pixels are compared with it, and if the gray value of a neighboring pixel is greater than the center pixel value, that position is marked 1, otherwise 0; comparing the 8 points of the 3×3 neighborhood produces an 8-bit binary number, which is converted to a decimal number to obtain the LBP value of the center pixel, and this value reflects the LBP texture information of the region; the calculation formula is given in (8):

$$LBP(x_{0}, y_{0}) = \sum_{p=0}^{7} 2^{p}\, s(j_{p} - j_{0}) \qquad (8)$$

where (x_0, y_0) are the coordinates of the center pixel, p indexes the p-th pixel of the neighborhood, j_p is the gray value of that neighborhood pixel, and j_0 is the gray value of the center pixel; s(x) is the sign function:

$$s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} \qquad (9)$$
4.2 the color saliency feature map is computed as:

$$C(x, y) = \sum_{i} \left| patch_{i}(x, y) - patch^{gaussian}_{i}(x, y) \right| \qquad (10)$$

where patch is the original image of the target-box region, patch_gaussian is the image obtained by filtering patch with a 5×5 Gaussian kernel of standard deviation 0, |·| denotes the absolute value, i is the channel index, and (x, y) are the pixel coordinates;
4.3 obtaining the edge saliency feature map from the pixels of the target edge region inside each frame's target box;
in the target edge region inside the target box the pixel values jump; taking derivatives of these pixel values, the first derivative has an extremum at the edge position, i.e. the extremum marks the edge, which is the principle used by the Sobel operator; if the second derivative is computed for the pixel values, its value at the edge is 0; the Laplacian is implemented by first computing the second-order x and y derivatives with the Sobel operator and then summing them to obtain the edge saliency feature map:

$$E(x, y) = \frac{\partial^{2} I(x, y)}{\partial x^{2}} + \frac{\partial^{2} I(x, y)}{\partial y^{2}} \qquad (11)$$

where I denotes the image inside the target box and (x, y) are the pixel coordinates of the target edge region inside the target box;
4.4 averaging the LBP texture feature, the color saliency feature and the edge saliency feature by weighted fusion to obtain the fusion value mean; the fusion formula is:

$$mean_{t} = \frac{1}{3N}\sum_{(x,y)}\Big( LBP_{t}(x, y) + C_{t}(x, y) + E_{t}(x, y) \Big) \qquad (12)$$

where LBP_t(x, y), C_t(x, y) and E_t(x, y) are the values of pixel (x, y) in the LBP texture feature map, the color saliency feature map and the edge saliency feature map of the t-th frame, and N is the number of pixels in the target box;
4.5 the color-histogram change value Dist is obtained by computing the Bhattacharyya distance between the color histogram of the target region selected in the initial frame and that of the target region in the t-th frame:

$$Dist(H_{0}, H_{t}) = \sqrt{1 - \frac{1}{\sqrt{\bar{H}_{0}\,\bar{H}_{t}\,n^{2}}}\sum_{i=1}^{n}\sqrt{H_{0}(i)\,H_{t}(i)}} \qquad (13)$$

where H_0 is the color histogram of the target box labeled manually in the initial frame, H_t is the color histogram of the target box labeled automatically in the t-th frame, the barred quantities are obtained from H_0 and H_t by equation (14), and n is the total number of color histogram bins; equation (14) is:

$$\bar{H}_{k} = \frac{1}{n}\sum_{i=1}^{n} H_{k}(i) \qquad (14)$$

where k = 0 or t;
4.6 the scale change value is obtained from the change in width and height between the target box of the initial frame and that of the t-th frame:

$$Scale_{t} = \frac{\left| w_{t}h_{t} - w_{0}h_{0} \right|}{w_{0}h_{0}} \qquad (15)$$

where w_0 and h_0 are the width and height of the target box in the initial frame, and w_t and h_t are the width and height of the target box in the t-th frame;
4.7 based on the fusion value, the color-histogram change value and the scale change value of the image target-box region, the target saliency value of the t-th frame is computed as:

$$S_{t} = \frac{mean_{t}}{\sum_{t=1}^{T} mean_{t}} + \frac{Dist_{t}}{\sum_{t=1}^{T} Dist_{t}} + \frac{Scale_{t}}{\sum_{t=1}^{T} Scale_{t}} \qquad (16)$$

where T is the total number of frames of the video;
4.8 a saliency line graph is constructed from the saliency value S_t of the target in each frame of the video, and all peaks and their corresponding frames are found;
assume the video has T frames, the number of key frames to be extracted is set to a, and b saliency peaks are found; if a < b, the peaks are sorted in descending order and the frames corresponding to the first a peaks are extracted as key frames; if b < a < T, the frames corresponding to all peaks are extracted and the remaining a - b key frames are drawn randomly without repetition; if a > T, all video frames are used as key frames;
Step (5): return to step (1) to label the target of the next video shot.
Compared with the prior art, the invention has the following remarkable advantages: (1) the target detection algorithm and the target tracking algorithm are fused, improving the accuracy of target localization and the continuity of target state estimation in the video images; (2) only the initial target position is labeled manually in the initial frame, and the end of labeling is judged automatically during the labeling process, reducing the amount of manual participation; (3) the LBP texture feature, color saliency feature and edge saliency feature of the target region are fused, and the target saliency is computed together with the color-histogram change and the scale change, so that the extracted key frames reflect the diversity of target changes.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a flow chart of fused target detection and target tracking;
FIG. 3 is a flow chart of target saliency calculation;
FIG. 4 is the detection result for the 2nd frame image of the example video;
FIG. 5 is the tracking result for the 2nd frame image of the example video;
FIG. 6 is the fused detection and tracking result for the 2nd frame image of the example video;
FIG. 7 is the KCF response-map peak change curve for the 2nd shot of the example video;
FIG. 8 is the KCF response-map PSR change curve for the 2nd shot of the example video;
FIG. 9 is the 243rd frame image of the 2nd shot of the example video;
FIG. 10 is the 1st frame image of the 3rd shot of the example video;
FIG. 11 is the target saliency curve for the 6th shot of the example video;
FIG. 12 shows the key frames extracted for the 6th shot of the example video.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the method comprises the following steps:
Step (1): a frame of the video is selected as the initial frame, the initial target position is labeled manually, and the category label of the target is determined.
Step (2): in subsequent frames, an image-based target detection algorithm is fused with an image-sequence-based video target tracking algorithm to estimate the target position in each image. The YOLO V3 detection algorithm and the KCF correlation filtering tracking algorithm are adopted; the fusion scheme is shown in fig. 2, and the specific steps are as follows:
2.1 The detector of the invention adopts YOLO V3, a faster algorithm among current mainstream detection networks that meets the real-time and accuracy requirements of video annotation. It consists of the feature-extraction network Darknet-53 and a prediction network; Darknet-53 uses ResNet shortcut connections to avoid vanishing gradients. In the prediction stage, the algorithm extracts regions of interest with anchors, as in the RPN, and uses feature maps of 3 scales as in an FPN (feature pyramid network): small feature maps provide semantic information, large feature maps carry finer-grained information, and the small feature maps are fused with the larger scales through upsampling, achieving a better detection effect.
The invention makes the following improvements and optimizations to the original model:
First, the training parameters of the feature-extraction part are initialized with the darknet53.conv.74 pre-trained model. The number of YOLO layers of the original model is then increased to 4, and four receptive-field feature maps of different scales, 13×13, 26×26, 52×52 and 104×104, are obtained through multi-scale feature fusion. The 13×13 feature map is predicted with the three prior boxes (116×90), (156×198) and (373×326) to detect large objects; the 26×26 feature map with (30×61), (62×45) and (59×119) to detect medium-sized objects; the 52×52 feature map with (10×13), (16×30) and (33×23) to detect small objects; and the 104×104 feature map with the newly added prior boxes (5×6), (8×15) and (16×10) to detect even smaller targets. Compared with the original model, the improved detection network fuses lower-level features, which improves the detection rate for small targets.
In each detection operation, the (t+1)-th frame image is input and first resized to a fixed scale; after passing through the feature-extraction network and the prediction network, detection boxes containing the object category and a score value are obtained as the detection result of the (t+1)-th frame.
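As an illustration of this per-frame detection operation, the following is a minimal Python sketch using OpenCV's Darknet loader. The configuration and weight file names, the 416×416 input size and the confidence/NMS thresholds are illustrative assumptions, not values fixed by the patent.

```python
import cv2
import numpy as np

# Hypothetical file names for the trained ship detector assumed here.
net = cv2.dnn.readNetFromDarknet("yolov3-ship.cfg", "yolov3-ship.weights")
out_names = net.getUnconnectedOutLayersNames()

def detect(frame, conf_thresh=0.5, nms_thresh=0.4):
    """Run YOLO V3 on one frame; return detection boxes as (x, y, w, h)."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes, scores = [], []
    for output in net.forward(out_names):
        for det in output:                 # [cx, cy, bw, bh, objectness, class scores...]
            conf = float(det[4] * det[5:].max())
            if conf > conf_thresh:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                scores.append(conf)
    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thresh, nms_thresh)
    return [boxes[i] for i in np.array(keep).flatten()]
```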
2.2 The KCF correlation filtering tracking algorithm first extracts HOG features at the target position and size of the t-th frame, transforms them to the frequency domain with the Fourier transform, maps the resulting frequency-domain features to a high-dimensional space with a Gaussian kernel function, and obtains the filter template α from equation (1):

$$\hat{\alpha} = \frac{\hat{g}}{\hat{k}^{xx} + \lambda} \qquad (1)$$

where x denotes the HOG features of the sample, ^ denotes the Fourier transform, g is a two-dimensional Gaussian function peaked at the center, and λ is a regularization parameter controlling overfitting during training. k^xx denotes the kernel autocorrelation of x in the high-dimensional space, computed as in equation (2):

$$k^{xx} = \exp\left(-\frac{1}{\sigma^{2}}\left(2\|x\|^{2} - 2\,\mathcal{F}^{-1}\Big(\sum_{c}\hat{x}_{c}^{*}\odot\hat{x}_{c}\Big)\right)\right) \qquad (2)$$

where σ is the width parameter of the Gaussian kernel function, controlling its radial extent, * denotes the complex conjugate, ⊙ denotes element-wise multiplication, F^-1 denotes the inverse Fourier transform, and c is the number of channels of the HOG feature x.
To adapt to changes in the target appearance, the filter is updated online. When tracking on the t-th frame image, the correlation filter α is updated as in equation (3):

$$\hat{\alpha}_{t} = (1-\eta)\,\hat{\alpha}_{t-1} + \eta\,\hat{\alpha} \qquad (3)$$

where η is the update parameter. To adapt to changes in the target scale, the filter α_t of the current frame is scaled with the ratios [1.1, 1.05, 1, 0.95, 0.9] before predicting the target size in the next frame. Candidate-sample HOG features z are extracted on the (t+1)-th frame image at the target position of the t-th frame, and each scaled filter yields a response map f as in equation (4), where m = 1, 2, 3, 4, 5 corresponds to the scaling ratios [1.1, 1.05, 1, 0.95, 0.9] and x denotes the HOG features of the t-th frame target.
The maximum value f_max is selected from the maxima max(f) of the 5 response maps f; the position of f_max is the target center, the scaling ratio corresponding to f_max gives the target size, and the tracking box of the (t+1)-th frame is obtained.
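A minimal numpy sketch of this KCF core is given below. It assumes the HOG features are supplied as an H×W×C array and covers the filter training of equation (1), the Gaussian kernel correlation of equation (2) and the response map of equation (4); the online update of equation (3) and the five-scale search are left to the caller.

```python
import numpy as np

def gaussian_correlation(af, bf, sigma):
    """Gaussian kernel correlation of two feature maps given as per-channel 2-D FFTs
    of shape (H, W, C); returns the FFT of the kernel map (equation (2) when af == bf)."""
    spatial = af.shape[0] * af.shape[1]
    aa = np.real(np.vdot(af, af)) / spatial          # ||a||^2 via Parseval
    bb = np.real(np.vdot(bf, bf)) / spatial          # ||b||^2
    cross = np.real(np.fft.ifft2(np.sum(af * np.conj(bf), axis=2)))
    k = np.exp(-np.clip(aa + bb - 2.0 * cross, 0, None) / (sigma ** 2 * af.size))
    return np.fft.fft2(k)

def train_filter(x, g, sigma=0.5, lam=1e-4):
    """Equation (1): alpha_hat = g_hat / (k_hat^xx + lambda); g is the 2-D Gaussian label."""
    xf = np.fft.fft2(x, axes=(0, 1))
    kf = gaussian_correlation(xf, xf, sigma)
    return np.fft.fft2(g) / (kf + lam), xf

def response(alpha_f, xf, z, sigma=0.5):
    """Equation (4): f = F^-1(k_hat^xz * alpha_hat) for one candidate feature map z."""
    zf = np.fft.fft2(z, axes=(0, 1))
    kf = gaussian_correlation(zf, xf, sigma)
    return np.real(np.fft.ifft2(alpha_f * kf))
```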
2.3 Fusing the results of target detection and target tracking to determine the labeled target box.
First, judge whether the current frame image contains a detection box. If not, the target box is the tracking box. If there is exactly one detection box, compute the IOU of the tracking box and the detection box: if the IOU is greater than a threshold, the target box is the detection box and the KCF tracking algorithm is initialized with it; otherwise the target box is the tracking box. If there are multiple detection boxes, compute the IOU of the tracking box with every detection box and screen out the maximum IOU: if the maximum IOU is greater than the threshold, the target box is the detection box corresponding to the maximum IOU and the KCF tracking algorithm is initialized with it; otherwise the target box is the tracking box.
The IOU value evaluates the degree of overlap between the tracking box and each detection box in the current frame:

$$IOU = \frac{S_{I}}{S_{U}} \qquad (5)$$

where S_I is the overlapping area of the tracking box and a detection box in the same frame, and S_U is the area of their union, i.e. the total area of the tracking box and the detection box minus the overlapping area.
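The fusion rule can be written compactly as below; the 0.5 IOU threshold is the value used in this embodiment, while the (x, y, w, h) box format is an assumption of the sketch.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes (equation (5))."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def fuse(track_box, det_boxes, iou_thresh=0.5):
    """Step 2.3: prefer the best-overlapping detection box, otherwise keep the tracking box.
    Returns (target_box, reinit_tracker)."""
    if not det_boxes:
        return track_box, False
    best = max(det_boxes, key=lambda b: iou(track_box, b))
    if iou(track_box, best) > iou_thresh:
        return best, True       # re-initialize the KCF tracker with this detection box
    return track_box, False
```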
Step (3): The peak of the response map f of the KCF correlation filtering tracker represents the confidence that the corresponding position is the target; the higher the peak, the higher the probability that the position is the target. The PSR measures the peak strength of the correlation-filter output; the higher the PSR value, the more reliable the tracking result. If both the peak and the PSR fall below their set thresholds, the target has probably disappeared, and the video target labeling is therefore judged to have finished. The PSR is computed as:

$$PSR = \frac{\max(f) - \mu_{\Phi}(f)}{\sigma_{\Phi}(f)} \qquad (6)$$

where max(f) is the peak of the correlation-filter response map f, Φ = 0.5, and μ_Φ(f) and σ_Φ(f) are the mean and standard deviation of the 50% response region centered on the peak of f. If max(f) is smaller than the set threshold θ and the PSR is smaller than the set threshold θ_PSR, i.e.

$$\max(f) < \theta \;\;\text{and}\;\; PSR < \theta_{PSR} \qquad (7)$$

target labeling is judged to have finished and the method turns to step (4) to select key frames; otherwise it returns to step (2) and continues to estimate the target position in the next frame image.
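A sketch of this stopping test follows; the thresholds θ = 0.3 and θ_PSR = 3.5 are the values given in this embodiment, while the exact definition of the 50% window around the peak is an assumption.

```python
import numpy as np

def psr(response, frac=0.5):
    """Peak-to-sidelobe ratio of a response map (equation (6)), using a window
    that covers `frac` of the map centered on the peak."""
    h, w = response.shape
    py, px = np.unravel_index(np.argmax(response), response.shape)
    rh, rw = max(1, int(h * frac / 2)), max(1, int(w * frac / 2))
    win = response[max(0, py - rh):py + rh + 1, max(0, px - rw):px + rw + 1]
    return (response.max() - win.mean()) / (win.std() + 1e-12)

def labeling_finished(response, theta=0.3, theta_psr=3.5):
    """Equation (7): labeling of the current shot ends when both tests hold."""
    return response.max() < theta and psr(response) < theta_psr
```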
Step (4): computing a saliency value for the target in each frame. As shown in fig. 3, during labeling the target region is obtained from the target box of step (2); the LBP texture feature, color saliency feature and edge saliency feature of the target region are then fused, and the target saliency value is computed together with the color-histogram change and the scale change. The specific steps are as follows:
4.1 LBP extracts the texture features of the target region. The basic idea is to work in a 3×3 pixel neighborhood, taking the center pixel as a threshold: the gray values of the 8 neighboring pixels are compared with it, and if the gray value of a neighboring pixel is greater than the center pixel value, that position is marked 1, otherwise 0. Comparing the 8 points of the 3×3 neighborhood produces an 8-bit binary number, which is converted to a decimal number to obtain the LBP value of the center pixel; this value reflects the LBP texture information of the region. The calculation formula is given in (8):

$$LBP(x_{0}, y_{0}) = \sum_{p=0}^{7} 2^{p}\, s(j_{p} - j_{0}) \qquad (8)$$

where (x_0, y_0) are the coordinates of the center pixel, p indexes the p-th pixel of the neighborhood, j_p is the gray value of that neighborhood pixel, and j_0 is the gray value of the center pixel; s(x) is the sign function:

$$s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} \qquad (9)$$
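A sketch of the LBP map of equation (8) is shown below, assuming the gray-scale target patch is given as a 2-D array; the ordering of the eight neighbors is an arbitrary choice here.

```python
import numpy as np

def lbp_map(gray):
    """8-neighbor LBP of equation (8); border pixels are left at 0."""
    g = gray.astype(np.float32)
    out = np.zeros_like(g)
    center = g[1:-1, 1:-1]
    # neighbor offsets, ordered p = 0..7 clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for p, (dy, dx) in enumerate(offsets):
        neigh = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        out[1:-1, 1:-1] += (neigh >= center) * (2 ** p)
    return out
```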
4.2 The color saliency feature map is computed as:

$$C(x, y) = \sum_{i} \left| patch_{i}(x, y) - patch^{gaussian}_{i}(x, y) \right| \qquad (10)$$

where patch is the target-region image, patch_gaussian is the image obtained by filtering patch with a 5×5 Gaussian kernel of standard deviation 0, |·| denotes the absolute value, i is the index of the image channel, and (x, y) are the horizontal and vertical pixel coordinates.
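A sketch of the color saliency map of equation (10); summing the per-channel absolute differences is one reading of the formula and is an assumption here.

```python
import cv2
import numpy as np

def color_saliency(patch):
    """Color saliency: absolute difference between the patch and its 5x5
    Gaussian-blurred version, summed over the color channels (equation (10))."""
    blurred = cv2.GaussianBlur(patch, (5, 5), 0)
    diff = cv2.absdiff(patch, blurred).astype(np.float32)
    return diff.sum(axis=2) if diff.ndim == 3 else diff
```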
4.3 In the edge region of the target-region image the pixel values "jump". Taking derivatives of these pixel values, the first derivative has an extremum at the edge position; this is the principle used by the Sobel operator, for which the extremum marks the edge. If the second derivative is computed for the pixel values, its value at the edge is 0. The Laplacian is implemented by first computing the second-order x and y derivatives with the Sobel operator and then summing them to obtain the edge saliency feature map:

$$E(x, y) = \frac{\partial^{2} I(x, y)}{\partial x^{2}} + \frac{\partial^{2} I(x, y)}{\partial y^{2}} \qquad (11)$$

where I denotes the image and (x, y) are the pixel coordinates of the target edge region inside the target box.
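A sketch of the edge saliency map of equation (11), built from second-order Sobel derivatives; taking the absolute value so the map is non-negative is an added assumption.

```python
import cv2
import numpy as np

def edge_saliency(patch):
    """Edge saliency: sum of the second-order Sobel derivatives in x and y (equation (11))."""
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY) if patch.ndim == 3 else patch
    dxx = cv2.Sobel(gray, cv2.CV_32F, 2, 0, ksize=3)   # d2I/dx2
    dyy = cv2.Sobel(gray, cv2.CV_32F, 0, 2, ksize=3)   # d2I/dy2
    return np.abs(dxx + dyy)
```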
4.4 The LBP texture feature, the color saliency feature and the edge saliency feature are averaged by weighted fusion to obtain the fusion value mean; the fusion formula is:

$$mean_{t} = \frac{1}{3N}\sum_{(x,y)}\Big( LBP_{t}(x, y) + C_{t}(x, y) + E_{t}(x, y) \Big) \qquad (12)$$

where LBP_t(x, y), C_t(x, y) and E_t(x, y) are the values of pixel (x, y) in the LBP texture feature map, the color saliency feature map and the edge saliency feature map of the t-th frame, and N is the number of pixels in the target box.
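The fusion value of step 4.4 then reduces to an average over the three maps; equal weights are assumed here, as suggested by "average weighted fusion".

```python
import numpy as np

def fusion_value(lbp, color, edge):
    """Scalar fusion value `mean` of equation (12): equal-weight average of the
    three per-pixel feature maps over the target box."""
    return float(np.stack([lbp, color, edge]).astype(np.float32).mean())
```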
4.5 The color histogram of the target-region image represents the distribution of color components in the image, showing the different colors and the number of pixels of each color. The color-histogram change value Dist is obtained by computing the Bhattacharyya distance between the color histogram of the target region selected in the initial frame and that of the target region in the t-th frame; the larger the Dist value, the lower the similarity and the more obvious the target change. The calculation formula is:

$$Dist(H_{0}, H_{t}) = \sqrt{1 - \frac{1}{\sqrt{\bar{H}_{0}\,\bar{H}_{t}\,n^{2}}}\sum_{i=1}^{n}\sqrt{H_{0}(i)\,H_{t}(i)}} \qquad (13)$$

where H_0 is the color histogram of the target region selected in the initial frame, H_t is the color histogram of the target region of the t-th frame, the barred quantities are obtained from H_0 and H_t by equation (14), and n is the total number of color histogram bins; equation (14) is:

$$\bar{H}_{k} = \frac{1}{n}\sum_{i=1}^{n} H_{k}(i) \qquad (14)$$

where k = 0 or t.
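The Bhattacharyya distance of equations (13) and (14) matches OpenCV's HISTCMP_BHATTACHARYYA comparison, so a sketch can lean on it; the choice of a 16-bin-per-channel BGR histogram is an assumption, not a value from the patent.

```python
import cv2

def hist_distance(patch0, patch_t, bins=16):
    """Color-histogram change value Dist between the initial-frame and t-th-frame
    target patches (equations (13)-(14))."""
    def hist(p):
        h = cv2.calcHist([p], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
        cv2.normalize(h, h)
        return h
    return cv2.compareHist(hist(patch0), hist(patch_t), cv2.HISTCMP_BHATTACHARYYA)
```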
4.6 The scale change value is obtained from the change in width and height between the target box of the initial frame and that of the t-th frame:

$$Scale_{t} = \frac{\left| w_{t}h_{t} - w_{0}h_{0} \right|}{w_{0}h_{0}} \qquad (15)$$

where w_0 and h_0 are the width and height of the target box in the initial frame, and w_t and h_t are the width and height of the target box in the t-th frame.
4.7 From the above quantities, the target saliency value of the t-th frame is computed as:

$$S_{t} = \frac{mean_{t}}{\sum_{t=1}^{T} mean_{t}} + \frac{Dist_{t}}{\sum_{t=1}^{T} Dist_{t}} + \frac{Scale_{t}}{\sum_{t=1}^{T} Scale_{t}} \qquad (16)$$

where T is the total number of video frames of the shot.
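The per-frame saliency of equation (16) can be sketched as below; normalizing each of the three terms by its sum over the shot is an assumption about the exact form of the formula.

```python
import numpy as np

def saliency_values(means, dists, scales):
    """Target saliency S_t for every frame of a shot from the fusion values,
    histogram-change values and scale-change values (step 4.7)."""
    def norm(v):
        v = np.asarray(v, dtype=np.float32)
        s = v.sum()
        return v / s if s > 0 else v
    return norm(means) + norm(dists) + norm(scales)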
4.8 A saliency line graph is drawn from the saliency value of the target in each frame of the shot, and all peaks and their corresponding frames are obtained. Assume the shot has T video frames, the number of key frames to be extracted is a, and the number of peaks is b. If a < b, the peaks are sorted in descending order and the frames corresponding to the first a peaks are extracted as key frames; if b < a < T, the frames corresponding to all peaks are extracted and the remaining a - b key frames are drawn randomly without repetition; if a > T, all video frames are used as key frames.
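Finally, the key-frame selection of step 4.8 can be sketched as follows; treating peaks as simple interior local maxima of the saliency curve and drawing the random top-up from non-peak frames are assumptions about details the text leaves open.

```python
import numpy as np

def select_key_frames(saliency, a, rng=None):
    """Select `a` key frames of one shot from its per-frame saliency curve (step 4.8)."""
    s = np.asarray(saliency, dtype=np.float32)
    T = len(s)
    if a > T:
        return list(range(T))
    peaks = [t for t in range(1, T - 1) if s[t] > s[t - 1] and s[t] > s[t + 1]]
    peaks.sort(key=lambda t: s[t], reverse=True)     # descending peak height
    if a <= len(peaks):
        return sorted(peaks[:a])
    rng = rng or np.random.default_rng()
    rest = [t for t in range(T) if t not in peaks]
    extra = rng.choice(rest, size=min(a - len(peaks), len(rest)), replace=False)
    return sorted(peaks + [int(t) for t in extra])
```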
Step (5): return to step (1) to label the target of the next shot.
To verify the effectiveness of the proposed method, a multi-shot, multi-ship video was used for experimental testing. The video contains 9 shots with multiple ships; the number of frames in each shot is shown in table 1. To speed up computation, the experiment labels one frame in every 5.
TABLE 1 video shot and frame number
In the target detection stage, the single-stage detection algorithm YOLO V3 is first trained on a large number of labeled samples carrying ship labels and position information to obtain a detection model, which is then used as the detector. Since the original algorithm has limited ability to detect small targets, small-scale anchors are added to improve its detection precision, raising the detection capability for targets of various scales while maintaining the detection speed and achieving accurate real-time detection. In the target tracking stage, the KCF tracking algorithm parameters are set to λ = 1×10^-4, σ = 0.5 and η = 0.02. Since the original algorithm cannot adapt to changes in target scale, scale estimation is added to the KCF tracking algorithm, and the improved KCF tracker is used as the tracker.
In the stage of fusing the detection and tracking results, the IOU threshold is set to 0.5. If the IOU between the tracking box and every detection box is less than 0.5, the detector has not detected the target to be labeled, and the target box is the tracking box. If the IOU between the tracking box and one or more detection boxes is greater than 0.5, the detector has detected the target to be labeled, and the target box is the detection box corresponding to the maximum IOU. For example, after the target is labeled manually in the 1st frame of shot 1, the detection and tracking results of the 2nd frame are shown in figs. 4 and 5. As can be seen, the detector returns multiple targets, whereas the tracker returns only one. Computing the IOU between the tracking box and each detection box, only one detection box has an IOU with the tracking box greater than the 0.5 threshold, and the fused output, shown in fig. 6, is that detection box.
When judging whether target labeling has finished, the peak threshold θ of the KCF tracker is set to 0.3 and θ_PSR to 3.5; when both the peak and the PSR fall below their thresholds, labeling ends. For example, when the target disappears while labeling the target of the 2nd shot, the peak and the PSR of the KCF response map become small, as shown in figs. 7 and 8. For labeled frames 0-47 of this shot (one frame labeled in every 5), the peak and PSR of the KCF response map are large, while at the 48th labeled frame they become small, indicating that the target has disappeared in that frame; this corresponds to the 243rd frame of the shot, where the video switches to the next shot. The 243rd frame image of shot 2 and the 1st frame image of shot 3 are shown in figs. 9 and 10. It can be seen that the target disappears when the video switches from shot 2 to shot 3, which shows that the method judges the end of labeling accurately.
When the tracker judges that labeling of the shot's target has finished, the target saliency curve of the shot is obtained from the per-frame saliency values, and key frames are extracted at the local maxima of the curve; in the experiment, 10 frames are extracted from each shot as key frames. For example, the target saliency curve of shot 6 is shown in fig. 11. The local maxima are first sorted from large to small, and the frames corresponding to the first 10 local maxima are taken as key frames; the extracted key frames are shown in fig. 12 (a-j). As can be seen, the extracted key frames are highly representative and accurately reflect the diversity of changes in target size, angle and so on.
The results of this experiment are shown in table 2.
TABLE 2 Key frames of each shot
Shot Key frames
1 5,10,25,30,40,50,55,65,75,80
2 90,110,125,135,145,160,180,195,205,215
3 325,340,365,380,400,420,430,445,460,480
4 1099,1109,1119,1139,1149,1159,1169,1179,1329,1369
5 1424,1519,1559,1594,1604,1624,1634,1674,1754,1764
6 1779,1854,1869,1994,2054,2064,2089,2114,2144,2154
7 2194,2199,2214,2229,2249,2269,2279,2289,2294,2314
8 2349,2359,2379,2399,2414,2424,2444,2459,2474,2539
9 2974,3094,3164,3179,3189,3199,3214,3229,3259,3274
It can be seen from the table that the extracted key frames all lie within their corresponding shots, which further shows that the method can distinguish different shots and automatically judge the end of target labeling. Because local maxima of the target saliency value are used as the basis for key-frame extraction, the extracted key frames are representative. The experimental results show that the proposed video target labeling method, which fuses a target detection algorithm and a target tracking algorithm, achieves high accuracy.

Claims (1)

1. A video semi-automatic target labeling method integrating target detection and tracking, characterized by comprising the following steps:
Step (1): selecting a frame in a shot of the video as the initial frame, manually labeling the initial position and size of the target, and determining the category label of the target;
Step (2): labeling the subsequent frames after the initial frame automatically, specifically by fusing an image-based target detection algorithm and an image-sequence-based video target tracking algorithm to estimate the position of the target in each image; the method comprises the following steps:
2.1 detecting targets in each frame of image with YOLO V3 and marking detection boxes;
for YOLO V3, the labeled target images are resized to a fixed scale and used as training samples to train the network; the number of YOLO layers is increased to 4, and four receptive-field feature maps of different scales, 13×13, 26×26, 52×52 and 104×104, are obtained through multi-scale feature fusion; the 13×13 feature map is predicted with the three prior boxes (116×90), (156×198) and (373×326) to detect large objects; the 26×26 feature map is predicted with the three prior boxes (30×61), (62×45) and (59×119) to detect medium-sized objects; the 52×52 feature map is predicted with the three prior boxes (10×13), (16×30) and (33×23) to detect small objects; the 104×104 feature map is predicted with the three newly added prior boxes (5×6), (8×15) and (16×10) to detect even smaller targets;
2.2 acquiring the tracking box of the target with the KCF correlation filtering tracking algorithm;
first, HOG features are extracted at the target position and size of the previous frame, transformed to the frequency domain by the Fourier transform, and mapped to a high-dimensional space with a Gaussian kernel function; the filter template α is obtained from equation (1):

$$\hat{\alpha} = \frac{\hat{g}}{\hat{k}^{xx} + \lambda} \qquad (1)$$

where x denotes the HOG features of the sample, ^ denotes the Fourier transform, g is a two-dimensional Gaussian function peaked at the center, and λ is a regularization parameter controlling overfitting during training; k^xx denotes the kernel autocorrelation of x in the high-dimensional space, computed as in equation (2):

$$k^{xx} = \exp\left(-\frac{1}{\sigma^{2}}\left(2\|x\|^{2} - 2\,\mathcal{F}^{-1}\Big(\sum_{c}\hat{x}_{c}^{*}\odot\hat{x}_{c}\Big)\right)\right) \qquad (2)$$

where σ is the width parameter of the Gaussian kernel function, controlling its radial extent, * denotes the complex conjugate, ⊙ denotes element-wise multiplication, F^-1 denotes the inverse Fourier transform, and c is the number of channels of the HOG feature x;
to adapt to changes in the target appearance, the filter is updated online; when tracking on the t-th frame image, the correlation filter α is updated as:

$$\hat{\alpha}_{t} = (1-\eta)\,\hat{\alpha}_{t-1} + \eta\,\hat{\alpha} \qquad (3)$$

where η is the update parameter;
to adapt to changes in the target scale, the filter α_t of the current frame is scaled before predicting the target size in the next frame, with scaling ratios [1.1, 1.05, 1, 0.95, 0.9];
candidate-sample HOG features z are extracted on the (t+1)-th frame image at the target position of the t-th frame; combined with each of the scaled filters above, each corresponding correlation-filter output response map f is given by equation (4):

$$f_{m} = \mathcal{F}^{-1}\left(\hat{k}^{xz}\odot\hat{\alpha}_{t,m}\right) \qquad (4)$$

where m = 1, 2, 3, 4, 5 corresponds to the scaling ratios [1.1, 1.05, 1, 0.95, 0.9], α_{t,m} is the filter scaled by the m-th ratio, and x denotes the HOG features of the t-th frame target;
the maximum value f_max is selected from the maxima max(f) of the 5 response maps f; the position of f_max is the target center, the scaling ratio corresponding to f_max gives the target size, and the tracking box of the (t+1)-th frame is obtained;
2.3 fusing the results of target detection and target tracking to determine the labeled target box;
first, judge whether the current frame image contains a detection box; if not, the target box is the tracking box; if there is exactly one detection box, compute the IOU of the tracking box and the detection box: if the IOU is greater than a threshold, the target box is the detection box and the KCF tracking algorithm is initialized with it, otherwise the target box is the tracking box; if there are multiple detection boxes, compute the IOU of the tracking box with every detection box and screen out the maximum IOU: if the maximum IOU is greater than the threshold, the target box is the detection box corresponding to the maximum IOU and the KCF tracking algorithm is initialized with it, otherwise the target box is the tracking box;
the IOU value evaluates the degree of overlap between the tracking box and each detection box in the current frame:

$$IOU = \frac{S_{I}}{S_{U}} \qquad (5)$$

where S_I is the overlapping area of the tracking box and a detection box in the same frame, and S_U is the area of their union, i.e. the sum of the areas of the tracking box and the detection box minus the overlapping area;
Step (3): judging whether target labeling has finished according to the target tracking algorithm;
according to the response map f of the KCF correlation filtering tracker, judge whether max(f) is smaller than a set threshold θ and the peak-to-sidelobe ratio PSR is smaller than a set threshold θ_PSR, i.e.:

$$\max(f) < \theta \;\;\text{and}\;\; PSR < \theta_{PSR} \qquad (7)$$

if so, target labeling is judged to have finished and the method turns to step (4) to select key frames; otherwise it returns to step (2) and continues to estimate the target position in the next frame image;
the PSR is computed as:

$$PSR = \frac{\max(f) - \mu_{\Phi}(f)}{\sigma_{\Phi}(f)} \qquad (6)$$

where max(f) is the peak of the correlation-filter response map f, Φ = 0.5, and μ_Φ(f) and σ_Φ(f) are the mean and standard deviation of the 50% response region centered on the peak of f;
Step (4): computing a saliency value for the target in every frame of the current shot, and extracting a set number of video key frames according to these saliency values to obtain the target labeling result; the method comprises the following steps:
4.1 the local binary pattern LBP extracts the texture features of the image; the basic idea is to work in a 3×3 pixel neighborhood, taking the center pixel as a threshold: the gray values of the 8 neighboring pixels are compared with it, and if the gray value of a neighboring pixel is greater than the center pixel value, that position is marked 1, otherwise 0; comparing the 8 points of the 3×3 neighborhood produces an 8-bit binary number, which is converted to a decimal number to obtain the LBP value of the center pixel, and this value reflects the LBP texture information of the region; the calculation formula is given in (8):

$$LBP(x_{0}, y_{0}) = \sum_{p=0}^{7} 2^{p}\, s(j_{p} - j_{0}) \qquad (8)$$

where (x_0, y_0) are the coordinates of the center pixel, p indexes the p-th pixel of the neighborhood, j_p is the gray value of that neighborhood pixel, and j_0 is the gray value of the center pixel; s(x) is the sign function:

$$s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} \qquad (9)$$

4.2 the color saliency feature map is computed as:

$$C(x, y) = \sum_{i} \left| patch_{i}(x, y) - patch^{gaussian}_{i}(x, y) \right| \qquad (10)$$

where patch is the original image of the target-box region, patch_gaussian is the image obtained by filtering patch with a 5×5 Gaussian kernel of standard deviation 0, |·| denotes the absolute value, i is the channel index, and (x, y) are the pixel coordinates;
4.3 obtaining the edge saliency feature map from the pixels of the target edge region inside each frame's target box;
in the target edge region inside the target box the pixel values jump; taking derivatives of these pixel values, the first derivative has an extremum at the edge position, i.e. the extremum marks the edge, which is the principle used by the Sobel operator; if the second derivative is computed for the pixel values, its value at the edge is 0; the Laplacian is implemented by first computing the second-order x and y derivatives with the Sobel operator and then summing them to obtain the edge saliency feature map:

$$E(x, y) = \frac{\partial^{2} I(x, y)}{\partial x^{2}} + \frac{\partial^{2} I(x, y)}{\partial y^{2}} \qquad (11)$$

where I denotes the image inside the target box and (x, y) are the pixel coordinates of the target edge region inside the target box;
4.4 averaging the LBP texture feature, the color saliency feature and the edge saliency feature by weighted fusion to obtain the fusion value mean; the fusion formula is:

$$mean_{t} = \frac{1}{3N}\sum_{(x,y)}\Big( LBP_{t}(x, y) + C_{t}(x, y) + E_{t}(x, y) \Big) \qquad (12)$$

where LBP_t(x, y), C_t(x, y) and E_t(x, y) are the values of pixel (x, y) in the LBP texture feature map, the color saliency feature map and the edge saliency feature map of the t-th frame, and N is the number of pixels in the target box;
4.5 the color-histogram change value Dist is obtained by computing the Bhattacharyya distance between the color histogram of the target region selected in the initial frame and that of the target region in the t-th frame:

$$Dist(H_{0}, H_{t}) = \sqrt{1 - \frac{1}{\sqrt{\bar{H}_{0}\,\bar{H}_{t}\,n^{2}}}\sum_{i=1}^{n}\sqrt{H_{0}(i)\,H_{t}(i)}} \qquad (13)$$

where H_0 is the color histogram of the target box labeled manually in the initial frame, H_t is the color histogram of the target box labeled automatically in the t-th frame, the barred quantities are obtained from H_0 and H_t by equation (14), and n is the total number of color histogram bins; equation (14) is:

$$\bar{H}_{k} = \frac{1}{n}\sum_{i=1}^{n} H_{k}(i) \qquad (14)$$

where k = 0 or t;
4.6 the scale change value is obtained from the change in width and height between the target box of the initial frame and that of the t-th frame:

$$Scale_{t} = \frac{\left| w_{t}h_{t} - w_{0}h_{0} \right|}{w_{0}h_{0}} \qquad (15)$$

where w_0 and h_0 are the width and height of the target box in the initial frame, and w_t and h_t are the width and height of the target box in the t-th frame;
4.7 based on the fusion value, the color-histogram change value and the scale change value of the image target-box region, the target saliency value of the t-th frame is computed as:

$$S_{t} = \frac{mean_{t}}{\sum_{t=1}^{T} mean_{t}} + \frac{Dist_{t}}{\sum_{t=1}^{T} Dist_{t}} + \frac{Scale_{t}}{\sum_{t=1}^{T} Scale_{t}} \qquad (16)$$

where T is the total number of frames of the video;
4.8 a saliency line graph is constructed from the saliency value S_t of the target in each frame of the video, and all peaks and their corresponding frames are found;
assume the video has T frames, the number of key frames to be extracted is set to a, and b saliency peaks are found; if a < b, the peaks are sorted in descending order and the frames corresponding to the first a peaks are extracted as key frames; if b < a < T, the frames corresponding to all peaks are extracted and the remaining a - b key frames are drawn randomly without repetition; if a > T, all video frames are used as key frames;
Step (5): return to step (1) to label the target of the next video shot.
CN201910963482.3A (filed 2019-10-11, priority 2019-10-11): Video semi-automatic target labeling method integrating target detection and tracking. Granted as CN110929560B. Status: Active.

Priority Applications (1)

CN201910963482.3A, priority date 2019-10-11, filing date 2019-10-11: Video semi-automatic target labeling method integrating target detection and tracking

Publications (2)

CN110929560A, published 2020-03-27
CN110929560B, published 2022-10-14

Family

ID=69848801

Country status: CN (CN110929560B)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107403175A (en) * 2017-09-21 2017-11-28 昆明理工大学 Visual tracking method and Visual Tracking System under a kind of movement background
CN107767405A (en) * 2017-09-29 2018-03-06 华中科技大学 A kind of nuclear phase for merging convolutional neural networks closes filtered target tracking

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A comparative study of texture measures with classification based on feature distributions; Timo Ojala et al.; Pattern Recognition; 1996; vol. 29, no. 1; pp. 1-9 *

Also Published As

Publication number Publication date
CN110929560A (en) 2020-03-27

Similar Documents

Publication Publication Date Title
CN110929560B (en) Video semi-automatic target labeling method integrating target detection and tracking
CN111223088B (en) Casting surface defect identification method based on deep convolutional neural network
CN113160192B (en) Visual sense-based snow pressing vehicle appearance defect detection method and device under complex background
CN111062973B (en) Vehicle tracking method based on target feature sensitivity and deep learning
US11922615B2 (en) Information processing device, information processing method, and storage medium
CN107610114B (en) optical satellite remote sensing image cloud and snow fog detection method based on support vector machine
CN102426649B (en) Simple steel seal digital automatic identification method with high accuracy rate
CN111640157B (en) Checkerboard corner detection method based on neural network and application thereof
CN111310558A (en) Pavement disease intelligent extraction method based on deep learning and image processing method
CN113139521B (en) Pedestrian boundary crossing monitoring method for electric power monitoring
CN114677554A (en) Statistical filtering infrared small target detection tracking method based on YOLOv5 and Deepsort
CN112734761B (en) Industrial product image boundary contour extraction method
CN107944354B (en) Vehicle detection method based on deep learning
CN113052170B (en) Small target license plate recognition method under unconstrained scene
CN108319961B (en) Image ROI rapid detection method based on local feature points
CN114155527A (en) Scene text recognition method and device
Zhang et al. Automatic detection of road traffic signs from natural scene images based on pixel vector and central projected shape feature
CN110689003A (en) Low-illumination imaging license plate recognition method and system, computer equipment and storage medium
Wang et al. Unstructured road detection using hybrid features
Gou et al. Pavement crack detection based on the improved faster-rcnn
CN110458019B (en) Water surface target detection method for eliminating reflection interference under scarce cognitive sample condition
CN110097524B (en) SAR image target detection method based on fusion convolutional neural network
CN111754525A (en) Industrial character detection process based on non-precise segmentation
CN111369570A (en) Multi-target detection tracking method for video image
CN113111878B (en) Infrared weak and small target detection method under complex background

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant