CN110322475B - Video sparse detection method - Google Patents

Video sparse detection method

Info

Publication number
CN110322475B
CN110322475B (application CN201910435217.8A; publication of application CN110322475A)
Authority
CN
China
Prior art keywords
tracker
detector
detection
target
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910435217.8A
Other languages
Chinese (zh)
Other versions
CN110322475A (en)
Inventor
盛健
张美玲
刘畅
韩娟
石晶林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sylincom Technology Co ltd
Original Assignee
Beijing Sylincom Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sylincom Technology Co ltd filed Critical Beijing Sylincom Technology Co ltd
Priority to CN201910435217.8A
Publication of CN110322475A
Application granted
Publication of CN110322475B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 — Image analysis
    • G06T 7/20 — Analysis of motion
    • G06T 7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 — Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 — Image acquisition modality
    • G06T 2207/10016 — Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Studio Devices (AREA)
  • Position Input By Displaying (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video sparse detection method. A correspondence between target features and the working parameters that determine the working time of a tracker is established in advance; the target features indicate that, when a target exhibits those features, the tracker can replace the detector for target detection during the corresponding period. During video detection, the detector and the tracker are started alternately. In each alternation cycle, the detector is started first, the target features are determined from its detection result, and the tracker working parameters corresponding to those features are matched using the correspondence; the tracker is then initialized with the detector's result and performs target detection according to the working parameters. On the basis of the existing detector working alone, the invention lets the tracker take over the detection task of part of the video frames, so that the computational overhead of video detection is greatly reduced and the processing speed is increased with no loss, or only a small loss, of video detection performance.

Description

Video sparse detection method
Technical Field
The invention relates to the technical field of target identification and target extraction, and in particular to a video sparse detection method.
Background
Among the various types of big data, image and video data is the "big data with the largest volume". According to Cisco statistics, video content accounts for about 90% of total Internet traffic; in rapidly evolving mobile networks, the proportion of video traffic is also as high as 64% and is growing at a compound annual growth rate of more than 130%. Image and video data therefore dominates big data, and processing it is the key to big data applications. Moreover, compared with data such as text and voice, image and video data has a larger volume and higher dimensionality, and poses greater technical challenges for representation, processing, transmission and utilization. Integrating computer vision technology into video data processing systems, for example for movies, television and video surveillance, is thus an inevitable trend: an intelligent video data processing system performs image processing, target analysis and related work on the video stream, judges target actions, automatically detects and tracks targets, keeps relevant records and provides intelligence.
Video detection is a key and fundamental technology in video data processing systems. The processing speed and computational overhead of the video detection task have long been among the important challenges facing computer vision. In particular, an intelligent video surveillance system continuously searches and scans regions of interest (ROI) within and between video frames through detection algorithms for moving targets (background-modeling class) or specific targets (object-modeling class), so as to realize intelligent applications such as localization, recognition and behavior analysis of targets in natural scenes. Meanwhile, in a video surveillance system, detection and region segmentation of moving targets must be performed on acquisition-side equipment (such as smart cameras) to extract the key information in the video (the moving-target region), both to reduce transmission traffic and to hand the target over for subsequent intelligent analysis and applications. However, constrained by size, power consumption and cost, the computing resources available on video surveillance acquisition devices are very limited, and real-time computation on video is under great pressure.
Current target detection algorithms can be divided into detection methods based on target modeling and detection methods based on background modeling. Among them, as deep convolutional neural networks (CNN) and target detection frameworks have matured, detection methods based on target modeling have developed rapidly in recent years. For example, leading-edge detection architectures such as R-CNN and its successors can extract deep convolutional features from image regions and perform localization and classification of target regions. Today, object detection is gradually moving from still images toward video. However, current detection architectures are designed for still-image detection; using an image-oriented target detection algorithm directly for video detection leads to problems such as abrupt changes of the target detection region, and the accuracy is not ideal.
In a video sequence, a moving target shows strong correlation in the spatio-temporal dimension, so the target region between frames should be continuous and stable; exploiting this continuity can generally improve the accuracy of video detection. For video detection, the conventional processing architecture is frame-by-frame detection, which may be called dense video detection. Since targets in a video sequence show strong spatio-temporal continuity between consecutive frames, some similar frames can be skipped and only key frames detected, i.e., sparse video detection. Based on this idea, Stanford University proposed the NoScope system, which trains a lightweight CNN for each scene, analyzes the similarity between adjacent video frames, and achieves a high processing speed through frame-skipping detection. However, NoScope handles skipped frames in an overly extreme way: a skipped frame directly reuses the target detection box of the previous frame to label the target in the current frame, which may cause the loss of key information.
The optical flow method involves a large amount of computation and can hardly run in real time; factors such as noise, multiple light sources, shadows and occlusion seriously affect the computed optical flow field, and a sparse optical flow field can hardly extract the shape of a moving target accurately, so good detection results and throughput are difficult to achieve under limited computing resources. Deep convolutional neural networks (CNN) can achieve good results in video target detection, but they are mostly designed for still-image detection; using an image-oriented target detection algorithm directly for video detection causes problems such as abrupt changes of the target detection region, and, being frame-by-frame methods, they cannot fully exploit the spatio-temporal continuity of targets between consecutive frames, greatly increasing unnecessary resource consumption. Meanwhile, CNN-based methods perform supervised target detection: if a target object that never appeared in the CNN training stage appears in the video, the CNN cannot correctly detect and identify it. This drawback greatly limits the application of CNNs in video detection, where scenes and targets are diverse.
In video detection, a tracker also typically assists the detector, and the tracker operates with lower computational complexity than the detector. However, although the tracker consumes few computing resources, it computes target features relatively coarsely and often loses the target, leading to errors such as partial or complete missed detection.
Therefore, the prior art cannot achieve both detection accuracy and detection speed.
Disclosure of Invention
In view of this, the invention provides a video sparse detection method that can greatly increase the video detection speed while guaranteeing the accuracy of moving-target region segmentation.
In order to solve the technical problem, the invention is realized as follows:
a method of sparse detection of a video, comprising:
predetermining a corresponding relation between the target characteristics and working parameters for determining the working time of the tracker, wherein the corresponding relation indicates that when the characteristics exist in the target, the tracker can replace the detector to detect the target in a corresponding period;
when video detection is carried out, the detector and the tracker are started to work alternately; in each alternate period, the detector is started first, the target characteristic is determined according to the detection result, and the working parameters of the tracker corresponding to the target characteristic are matched by utilizing the corresponding relation; and initializing the tracker into a detector, and detecting the target by the tracker according to the matched working parameters of the tracker.
Preferably, the target features are: one or a combination of the area, brightness, or number of corner points of the target region.
Preferably, the working parameters that determine the working time of the tracker are: one or more of a tracker working length in frame units, a tracker working duration in time units, and a condition for stopping tracker operation.
Preferably, the correspondence between the target features and the working parameters is obtained by online learning in an initialization stage, specifically:
before video detection, starting the detector and the tracker simultaneously to perform target detection on the video to be detected, taking "the tracker can replace the detector for target detection" as the goal, and determining, from the detection results, a number of tracker continuous working segments that meet this goal;
for each tracker continuous working segment, acquiring the target feature of the intersection region of the detection frame and the tracking frame; the target feature and the working parameter that determines the duration of the tracker continuous working segment together form the correspondence.
Preferably, the tracker continuous working segments are determined according to the Intersection over Union (IoU) of the detection-frame region and the tracking-frame region: for each frame, the IoU is compared with a validity scale to determine whether the IoU data is valid, and the segments corresponding to continuously valid data are determined as tracker continuous working segments.
Preferably, the EMA reference value θ_k obtained by the exponentially weighted moving average (EWMA) method is used as the validity scale to analyze the IoU data and filter out invalid IoU values.
Preferably, in the process of filtering invalid IoU values with EWMA, an update operation of the EMA reference value θ_k after data is judged invalid is further added; the filtering process of invalid IoU values includes:
for the k-th video frame, the detector-tracker IoU metric of the current frame, iou_k, and the EMA reference value θ_{k-1} of the (k-1)-th frame are known;
when iou_k ≥ θ_{k-1}, the current IoU is recorded as valid data, and the EMA reference value θ_k of the current frame is updated with a weighted combination of iou_k and θ_{k-1};
when iou_k < θ_{k-1}, the current IoU is recorded as invalid data, and three EMA temporary reference values θ_{1,k}, θ_{2,k} and θ_{3,k} are calculated as follows:
θ_{1,k} = θ_init
θ_{2,k} = iou_k
θ_{3,k} = β·θ_{k-1}
where β is an attenuation coefficient and θ_init is the initial value of the EMA reference value;
the temporary reference values θ_{1,k}, θ_{2,k} and θ_{3,k} are sorted by size, and the number of times N that the IoU metric has been consecutively smaller than the EMA reference value of the previous frame is recorded; when N = 1, the maximum of the three temporary reference values is selected to update θ_k; when N = 2, the middle value (second largest) of the three temporary reference values is selected to update θ_k; when N = 3, the minimum of the three temporary reference values is selected to update θ_k; when N > 3, θ_k is always updated with the minimum of the three temporary reference values.
Preferably, for each tracker continuous working segment, the target feature of the intersection region of the detection frame and the tracking frame is acquired as follows: the average, maximum or minimum of the per-frame target features in the intersection region is selected.
Preferably, matching the tracker working parameters corresponding to the target feature using the correspondence is done as follows:
training a classifier or a clustering model with the correspondence; inputting the target feature into the classifier or clustering model, which outputs the predicted tracker working parameters.
Preferably, the method further comprises a real-time monitoring mechanism:
while the tracker is working, monitoring in real time whether the detection effect meets the established standard; if not, switching to the next alternation cycle and starting the detector;
whenever a switch from the tracker to the detector occurs, judging whether the current detection result of the detector can be adopted; if so, obtaining the tracker working parameters through the detector's operation; otherwise, the tracker continues to track the target for this frame, and the switch to the next alternation cycle is made only after the detector meets the condition.
Preferably, judging whether the detection effect of the tracker meets the established standard is realized by computing structural similarity data of two adjacent frames with a structural similarity (SSIM) algorithm;
whether the current detection result of the detector can be adopted is judged as follows:
if the detector has a confidence parameter, judging whether the confidence returned by the detector is greater than a preset confidence threshold; if so, the current detection result of the detector can be adopted; otherwise, it is not adopted;
if the detector has no confidence parameter, judging whether the detector returns at least one target detection result; if so, the current detection result of the detector can be adopted; otherwise, it is not adopted.
Beneficial effects:
(1) On the basis of the existing detector working alone, the invention lets the tracker take over the detection task of part of the video frames, so that the computational overhead of video detection is greatly reduced and the processing speed is increased with no loss, or only a small loss, of video detection performance.
(2) The method determines the correspondence between the target features and the tracker working parameters online, without relying on historical data, and therefore reflects the characteristics of the current target more accurately; if moving targets are detected in a background-modeling manner, target objects that never appeared in the training stage can still be correctly identified, overcoming the supervised-learning drawback of CNNs.
(3) In a preferred embodiment, the tracker continuous working segments are determined based on the IoU of the detection-frame region and the tracking-frame region, a quantity that is related to both the detector and the tracker and can be used to measure the effectiveness of the substitution.
(4) The invention uses the EMA reference value θ_k as the scale and filters invalid IoU values with the EWMA method; the result is more aggregated than using the IoU data directly, ensures that more valid data can be trained within a limited initialization length, and avoids frequent fluctuation of the tracking frame count in the subsequent stage caused by wrong prediction.
(5) The invention designs an update operation of the EMA reference value θ_k after IoU data is judged invalid, which can adapt quickly and iteratively when IoU data is invalid so as to obtain more valid data.
(6) When the correspondence is used for matching, a classifier or clustering model is selected, which better models the correspondence between the target features and the working parameters that determine the working time of the tracker.
(7) The invention further adds a real-time monitoring mechanism to monitor the working state of the detector or tracker and switch in time, thereby ensuring the accuracy of video detection.
Drawings
FIG. 1 is the detection architecture of the video sparse detection method of the present invention;
FIG. 2 is the processing flow of the IA-VID initialization stage;
FIG. 3 is a schematic diagram of the detector and tracker IoU analysis method.
Detailed Description
Target detection and region segmentation algorithms have high computational overhead and low processing speed, and can hardly meet the requirements of low-cost, low-power processing equipment and real-time video processing. In a video sequence, the motion of a target between frames is strongly correlated, so a target tracking algorithm (tracker), whose computational overhead is lower and processing speed higher, can replace the target detection task (detector) for part of the video frames, thereby achieving a higher overall video detection speed. However, because the tracker may drift or lose the target, it must be switched back to the detector after working for a period of time, so that the tracker working parameters can be adjusted and the tracker's results monitored and corrected.
Therefore, the invention provides a video sparse detection method whose basic idea is as follows: a correspondence between target features and the working parameters that determine the tracker's working time is established in advance, indicating that, when a target exhibits the features, the tracker can replace the detector for target detection during the corresponding period. During video detection, the detector and the tracker are started alternately; in each alternation cycle, the detector is started first, the target features are determined from its detection result, and the tracker working parameters corresponding to those features are matched using the correspondence; the tracker is then initialized with the detector's result and performs target detection according to the matched working parameters.
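For readability, the alternating detector/tracker loop described above can be summarized in the following minimal Python sketch. All names (detector, tracker, lookup_tracker_params, the box and area fields) are illustrative assumptions, not interfaces prescribed by this method:

```python
# Minimal sketch of the alternating detection loop, under assumed interfaces.
def sparse_video_detection(frames, detector, tracker, lookup_tracker_params):
    results, i = [], 0
    while i < len(frames):
        # Detector phase: detect on the current frame and derive the target feature.
        detection = detector.detect(frames[i])   # assumed: returns box + region_area
        results.append(detection)
        feature = detection.region_area           # e.g. area of the target region
        i += 1

        # Match the tracker working length L_t from the learned correspondence.
        L_t = lookup_tracker_params(feature)

        # Tracker phase: initialize the tracker with the detector result and let it
        # replace the detector for the next L_t frames.
        tracker.init(frames[i - 1], detection.box)
        for _ in range(L_t):
            if i >= len(frames):
                break
            results.append(tracker.track(frames[i]))
            i += 1
    return results
```

The real-time monitoring mechanism of step three (described later) would additionally break out of the tracker phase early when the tracking effect degrades.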
The target feature may be the area, brightness or number of corner points of the target region (ROI), or a combination thereof; the working parameters that determine the tracker working time may be the tracker working length (in frames), the tracker working duration (in time units), a condition for stopping tracker operation (for example, a condition on the area difference between successive tracking results), or a combination thereof.
It can be seen that the main objective of the invention is not to modify or improve the detection performance of the current detector, such as its accuracy or completeness; the upper limit of those performances is determined by the detector itself. The main objective is, on the basis of the current detector working alone, to let the tracker take over the detection task of part of the video frames, thereby greatly reducing the computational overhead of video detection and increasing the processing speed with no loss, or only a small loss, of detection performance.
The correspondence is obtained from prior data and gives, on the premise of ensuring detection accuracy, the length of the sequence for which the tracker can take over target detection from the detector. In practice it can be compiled from historical data; preferably, the invention provides a way of determining the correspondence online, which does not depend on historical data and reflects the characteristics of the current target more accurately. Meanwhile, if moving targets are detected in a background-modeling manner, target objects that never appeared in the training stage can still be correctly identified, overcoming the supervised-learning drawback of CNNs.
As to how an effective correspondence is obtained, the invention provides a determination method based on the Intersection over Union (IoU) analysis of the detector and the tracker. This adaptive sparse video detection method based on the detector and tracker IoU is abbreviated IA-VID herein.
The following detailed description of preferred embodiments of the invention refers to the accompanying drawings and examples.
This embodiment selects the most common scenario in video processing, video surveillance. In an actual surveillance video, the monitored scene can be regarded as stable within a certain period. In the IA-VID method, the initialization stage is started every time the video scene changes, and training and learning are performed for the current surveillance scene. Within each stable period the background can be considered relatively static, so for moving targets against the same background, it is usually the characteristics of the target itself that affect the detection result.
The principle of a tracker is to learn the input target features and analyze those features in subsequent video frames to realize tracking. The input target features are therefore critical to the tracker's performance. For example, suppose the area of the region corresponding to the current target is S. When the area of the region input to the tracker is larger than S, more image features enter the tracker and overfitting occurs; these extra image features may interfere with the tracker's subsequent performance. Conversely, when the area of the target region input to the tracker is smaller than S, some features of the target are missed, i.e., underfitting occurs, which also affects the tracker's subsequent processing. For a tracker, the effective continuous tracking length is the most intuitive expression of tracking performance. From the above analysis the following assumption can be made: the performance of the tracker remains stable when the number of image features input to it is the same. That is, when the image area input to the tracker is the same, the continuous effective tracking lengths of the tracker are the same or similar.
Therefore, in this embodiment, the area of the target region is selected as the target feature, and the tracker working length is selected as the tracker working parameter in the correspondence.
IA-VID is mainly divided into an initialization stage and a mixed video detection stage, and comprises the following steps:
Step one: the initialization stage. The processing results of the detector are collected and the working state of the corresponding tracker is recorded, so as to obtain the effective working length of the tracker under different input areas, recorded as training pairs of [target-region area, effective tracker working length].
Referring to fig. 2, the present step includes the following substeps:
substep 101: the detector and the tracker work simultaneously to detect the target of the same video and generate an initialization detection sequence and an initialization tracking sequence with the same length as the initialization video sequence.
Substep 102: the initialization detection sequence and tracking sequence are analyzed jointly to evaluate the performance of the detector and the tracker in this scene; taking "the tracker can replace the detector for target detection" as the goal, a number of tracker continuous working segments meeting this goal are determined from the detection results. The length of the j-th tracker continuous working segment is recorded as L_tc,j, in frames. The specific implementation is as follows:
Define the target-region information obtained by the detector in a video frame as roi_d = (x_d, y_d, w_d, h_d), where (x_d, y_d) are the center coordinates of the detection frame and (w_d, h_d) are its width and height. Similarly, define the target region obtained by the tracker as roi_t = (x_t, y_t, w_t, h_t), where (x_t, y_t) are the center coordinates of the tracking frame and (w_t, h_t) are its width and height.
For the k-th frame of the initialization stage, the IoU metric iou_k can be expressed as:
iou_k = S(roi_{d,k} ∩ roi_{t,k}) / S(roi_{d,k} ∪ roi_{t,k})    (1)
where roi_{d,k} is the detection-frame region of the k-th frame, roi_{t,k} is the tracking-frame region of the k-th frame, and S(·) denotes the area of the region in parentheses; the numerator is thus the area of the intersection of the detection frame and the tracking frame, and the denominator is the area of their union.
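As an illustration of equation (1), the IoU of a detection box and a tracking box given in (center, width, height) form can be computed as in the sketch below (a self-contained example; only the box format follows the roi definitions above):

```python
def iou(roi_d, roi_t):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    def to_corners(roi):
        x, y, w, h = roi
        return x - w / 2, y - h / 2, x + w / 2, y + h / 2

    dx1, dy1, dx2, dy2 = to_corners(roi_d)
    tx1, ty1, tx2, ty2 = to_corners(roi_t)

    # Intersection rectangle, clamped to zero when the boxes do not overlap.
    iw = max(0.0, min(dx2, tx2) - max(dx1, tx1))
    ih = max(0.0, min(dy2, ty2) - max(dy1, ty1))
    inter = iw * ih

    union = roi_d[2] * roi_d[3] + roi_t[2] * roi_t[3] - inter
    return inter / union if union > 0 else 0.0

# Example: IoU between one frame's detection and tracking boxes.
iou_k = iou((50.0, 50.0, 40.0, 30.0), (54.0, 52.0, 40.0, 30.0))
```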
Meanwhile, it is observed that during initial training a number of IoU values deviating greatly from the trend (erroneous values) often appear and strongly influence the processing of the subsequent stage; these errors therefore need to be filtered out so that correct values with more reference significance are obtained. As shown in FIG. 3, the IoU of the (k+1)-th frame changes sharply; if iou_{k+1} were incorporated into the prediction of L_t, it would cause a large error. Therefore, when the IoU analysis is performed in the initialization stage, IoU metrics without reference significance must be filtered out. This filtering removes the segments in which the tracker cannot replace the detector, and the remaining segments form the tracker continuous working segments.
Then, for each frame k, the IoU data is compared with the validity scale to determine whether it is valid, and the segments corresponding to continuously valid data are determined as tracker continuous working segments. In this embodiment an adaptively adjusted variable scale is adopted: the EMA reference value θ_k of the exponentially weighted moving average (EWMA) method is used as the validity scale to analyze the IoU sequence of the initialization stage, eliminate erroneous values, and improve the accuracy of the subsequent prediction of the tracking length L_t.
The exponentially weighted averaging procedure can refer to the RFC 793 algorithm in the TCP/IP protocol suite. In RFC 793, the measured Round-Trip Time (RTT) is converted by the EWMA method into a smoother reference indicator, the Smoothed RTT (SRTT):
SRTT_k = α·SRTT_{k-1} + (1 - α)·RTT_k    (2)
where RTT_k is the round-trip delay obtained from the current sample, SRTT_{k-1} is the exponential moving average (EMA) of the previous sampling point, SRTT_k is the EMA estimate at the current time, and α is the smoothing factor, typically in the range (0.8, 0.9). Through this smoothing, the weight of older reference values decays exponentially with time: old data is gradually "forgotten" while recent data receives higher weight ("memory"). The forgetting speed can be changed by adjusting α, so α is also called the "memory coefficient" in this patent. From a probabilistic viewpoint, EWMA sets the current estimate to be determined by the previous estimate and the current sample, an ideal maximum-likelihood-style estimation, and the relative weights of estimate and sample can be adjusted through the memory coefficient α. From a signal-processing viewpoint, EWMA can be regarded as a low-pass filter that removes short-term fluctuations while preserving the long-term trend, thereby smoothing the signal. EWMA consumes extremely few computing resources, with a small amount of computation and very low memory usage. Therefore, the IA-VID framework also adopts the EWMA method to analyze the IoU sequence, filter out invalid IoU values, and strengthen the influence of recent IoU metrics. In addition, in the process of filtering invalid IoU values with EWMA, an update operation of the EMA reference value θ_k after data is judged invalid is further added.
After the IoU metric iou_k is obtained, the process of filtering invalid IoU values to obtain the tracker continuous working segments is as follows:
For the k-th video frame, the known parameters are the detector-tracker IoU metric of the current frame, iou_k, and the EMA reference value θ_{k-1} of the (k-1)-th frame. First, the relationship between iou_k and θ_{k-1} is determined, and the next operation is carried out accordingly.
When iou_k ≥ θ_{k-1}, the current IoU metric is considered reliable, and the EMA reference value θ_k of the current frame is calculated from iou_k and θ_{k-1}, i.e.
θ_k = α·θ_{k-1} + (1 - α)·iou_k    (3)
where the initial value θ_init of the EMA reference value is typically 0.5 and α is typically 0.9. Thus, when the IoU metric iou_k of the k-th frame is greater than or equal to the EMA reference value θ_{k-1} of the (k-1)-th frame, the EMA reference value moves smoothly toward the current IoU value iou_k. If the IoU metrics stay greater than or equal to the EMA reference value over a long period, the EMA reference value gradually approaches the current actual IoU metric and stabilizes. Meanwhile, the initialization data obtained through the EMA reference value is more aggregated than initialization data taken directly from the IoU, so more valid data can be collected for training within a limited initialization length, avoiding frequent fluctuation of the tracking frame count in the subsequent stage caused by wrong prediction.
When iou_k < θ_{k-1}, the IoU metric has not changed according to the usual trend, so the value has a high probability of being invalid. In this case θ_k is updated as follows:
First, three EMA temporary reference values θ_{1,k}, θ_{2,k} and θ_{3,k} are calculated, where
θ_{1,k} = θ_init    (4-a)
θ_{2,k} = iou_k    (4-b)
θ_{3,k} = β·θ_{k-1}    (4-c)
where θ_init is the initial value of the EMA reference value, and β is an attenuation coefficient with β < 1. Referring to the Reno algorithm in the TCP/IP protocol suite, β can be set to 0.5.
Second, the temporary reference values θ_{1,k}, θ_{2,k} and θ_{3,k} are sorted by size. Meanwhile, the number of times N that the current IoU has been consecutively smaller than the EMA reference value of the previous frame is recorded. When N = 1, the maximum of θ_{1,k}, θ_{2,k} and θ_{3,k} replaces θ_k; when N = 2, the middle value (second largest) replaces θ_k; when N = 3, the minimum replaces θ_k; when N > 3, θ_k is always replaced with the minimum of the three temporary reference values, until a later frame satisfies iou_k ≥ θ_{k-1}, at which point the EMA reference value θ_k of the current frame is again calculated by equation (3).
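A minimal sketch of this validity filtering, assuming θ_init = 0.5, α = 0.9 and β = 0.5 as suggested above (function and variable names are illustrative):

```python
def filter_iou_sequence(ious, theta_init=0.5, alpha=0.9, beta=0.5):
    """Mark each IoU value as valid/invalid against the EMA reference value,
    updating it per equations (3) and (4-a)..(4-c)."""
    theta = theta_init
    n_invalid = 0          # N: consecutive frames with iou_k < theta_{k-1}
    valid_flags = []
    for iou_k in ious:
        if iou_k >= theta:
            # Valid data: smooth theta toward the current IoU (equation (3)).
            theta = alpha * theta + (1 - alpha) * iou_k
            n_invalid = 0
            valid_flags.append(True)
        else:
            # Invalid data: pick one of three temporary reference values.
            n_invalid += 1
            candidates = sorted([theta_init, iou_k, beta * theta])  # ascending
            if n_invalid == 1:
                theta = candidates[2]   # N = 1: maximum
            elif n_invalid == 2:
                theta = candidates[1]   # N = 2: middle value
            else:
                theta = candidates[0]   # N >= 3: minimum
            valid_flags.append(False)
    return valid_flags
```

Runs of consecutive True flags then correspond to the tracker continuous working segments.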
Through the above comparison, for the IoU sequence of the initialization stage, an IoU satisfying iou_k ≥ θ_k is recorded as valid data together with the corresponding roi_{d,k} and roi_{t,k}; an IoU satisfying iou_k < θ_k is recorded as invalid data. The valid IoU values and the corresponding detector and tracker data roi_{d,k} and roi_{t,k} are kept in a three-dimensional initialization sequence R_init, i.e.
R_init = {r_1, r_2, ..., r_K},  r_k = (iou_k, roi_{d,k}, roi_{t,k})    (5)
where r_k denotes the k-th recorded element and K denotes the number of data items recorded in the initialization stage. Based on the initialization sequence R_init, the continuous working lengths of the tracker can be calculated; the j-th continuous working length is recorded as L_tc,j.
Substep 103: for each tracker continuous working segment j, the detection-frame and tracking-frame data of all frames in the segment are retrieved from the initialization sequence R_init, and the area of the intersection region of the detection frame and the tracking frame is calculated, denoted roi_∩. Averaging roi_∩ over the frames of the working segment gives the average tracking-frame area of the j-th tracker continuous working segment, denoted Avg_roi_j, as shown in equation (6):
Avg_roi_j = (1 / L_tc,j) · Σ_m S(roi_{d,m} ∩ roi_{t,m})    (6)
where the sum runs over the frames m of the j-th working segment, and roi_{d,m} ∩ roi_{t,m} is the intersection region of the detection frame and the tracking frame of frame m, i.e. roi_∩. In practice, if another target feature is used, such as brightness, this quantity is replaced by the brightness of the intersection region of the detection frame and the tracking frame.
This embodiment uses the average tracking-frame area over the tracker's continuous operation as the target feature; in practice, the maximum or minimum of roi_∩ over the frames may also be selected, or other strategies may be applied to data computed from roi_∩.
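Combining the validity flags above with the recorded detection and tracking boxes, the training pairs [Avg_roi_j, L_tc,j] of substeps 102 and 103 can be assembled roughly as in the sketch below (it reuses the box format of the earlier iou example; the intersection-area computation mirrors equation (6)):

```python
def intersection_area(roi_d, roi_t):
    """Area S(roi_d ∩ roi_t) for boxes in (x_center, y_center, w, h) form."""
    dx1, dy1 = roi_d[0] - roi_d[2] / 2, roi_d[1] - roi_d[3] / 2
    dx2, dy2 = roi_d[0] + roi_d[2] / 2, roi_d[1] + roi_d[3] / 2
    tx1, ty1 = roi_t[0] - roi_t[2] / 2, roi_t[1] - roi_t[3] / 2
    tx2, ty2 = roi_t[0] + roi_t[2] / 2, roi_t[1] + roi_t[3] / 2
    iw = max(0.0, min(dx2, tx2) - max(dx1, tx1))
    ih = max(0.0, min(dy2, ty2) - max(dy1, ty1))
    return iw * ih

def build_training_pairs(det_boxes, trk_boxes, valid_flags):
    """Split consecutive valid frames into tracker continuous working segments
    and return one (Avg_roi_j, L_tc_j) pair per segment, per equation (6)."""
    pairs, seg = [], []
    for roi_d, roi_t, ok in zip(det_boxes, trk_boxes, valid_flags):
        if ok:
            seg.append(intersection_area(roi_d, roi_t))
        else:
            if seg:
                pairs.append((sum(seg) / len(seg), len(seg)))
            seg = []
    if seg:
        pairs.append((sum(seg) / len(seg), len(seg)))
    return pairs
```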
Substep 104: the pairs [Avg_roi_j, L_tc,j] are stored as training pairs for analyzing the video detection groups in the mixed detection stage.
Step two: the mixed video detection stage. The detector and the tracker are started alternately. In each alternation cycle, the detection result of the detector is matched against the correspondence [Avg_roi_j, L_tc,j] to obtain the corresponding tracker working length L_tc,j, and the tracker is initialized with the detector's result using this length. Each video frame is processed by exactly one of the detector and the tracker. The purpose of the detector's operation is to rectify the tracker's results; the purpose of the tracker's operation is to reduce the overall computational overhead and increase the processing speed.
In this step, as shown in FIG. 1, IA-VID calls a run of L_d consecutive detector working frames followed by L_t tracker working frames a Group of Hybrid Detection (GoHD), where the value of L_t can be specified empirically. For the i-th mixed detection group GoHD_i, the detector first processes L_{d,i} video frames to obtain L_{d,i} detection results; the average of these results (or the maximum/minimum, or a value chosen by a set strategy) is compared with the two-dimensional array collected in the initialization stage to find the same or a similar target area and match the corresponding effective tracker working length L_{t,i}; the tracker is then initialized with the detector's result and this parameter, after which it performs target detection for L_{t,i} frames.
When matching the effective tracker working length L_{t,i}, a table-lookup-plus-interpolation method may be used, or a clustering or classification method. In the preferred embodiment, the areas of the L_{d,i} detection frames of the detector are averaged to obtain Avg_roi_d, and a classification or clustering method compares it with the training pairs [Avg_roi_j, L_tc,j] generated in the initialization stage, realizing the matching of target features and thereby determining the tracking length L_{t,i} of the detection group.
When matching the tracking length L_{t,i} against the training pairs, both the experimental scenario and the matching method have a large influence on the result. Experiments show that a KNN classifier achieves a notable effect in this embodiment, so the preferred embodiment uses a KNN classifier to predict the tracker working length of each mixed detection group. In the mixed video detection stage of IA-VID, while the detector is active, the area of the current detection target region is calculated and its Euclidean distance to the area value of each training pair [Avg_roi_j, L_tc,j] is computed; the average of the L_tc values of the k training pairs with the smallest Euclidean distances is used as the tracker working length of the current mixed detection group. In general, k is taken as 3.
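The KNN-style matching described above can be sketched as follows (1-D Euclidean distance on the area feature, k = 3 as suggested; the function name and the example numbers are illustrative):

```python
def predict_tracker_length(avg_roi_d, training_pairs, k=3):
    """Average the working lengths of the k training pairs whose average
    intersection area is closest to the current detection area."""
    if not training_pairs:
        return 0
    nearest = sorted(training_pairs, key=lambda p: abs(p[0] - avg_roi_d))[:k]
    lengths = [length for _, length in nearest]
    return round(sum(lengths) / len(lengths))

# Example: average area from the detector phase of the current GoHD,
# matched against (Avg_roi_j, L_tc_j) pairs from the initialization stage.
detector_areas = [1180.0, 1225.0, 1201.0]
avg_roi_d = sum(detector_areas) / len(detector_areas)
L_t_i = predict_tracker_length(avg_roi_d, [(900.0, 8), (1200.0, 12), (1500.0, 6)])
```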
Step three: a long-acting monitoring mechanism is added while step two is being executed.
Considering the complexity of real video, neither the tracker nor the detector can completely avoid abnormal detection. As analyzed above, the target shows extremely strong correlation between video frames, which means that if the target state does not change abruptly, the features in the picture remain relatively stable; conversely, if the features of the frame change significantly, it can be assumed that the processor has produced a serious false detection.
Therefore, IA-VID designs a long-acting monitoring mechanism that monitors the working state of the detector or tracker and takes corresponding measures when extreme situations occur, thereby ensuring the accuracy of video detection.
For the tracker, the main problem affecting detection accuracy is that the tracker's detection effect may no longer be good enough to replace the detector. In that case IA-VID terminates the current mixed detection group GoHD_i, starts the next mixed detection group GoHD_{i+1}, and directly starts the detector into the detection state.
For the detector, whether the cycle alternates normally or the tracker is switched to the detector early because of poor effect, the detector's detection result must be judged to decide whether it can be adopted. If it can, the detector works normally and the tracker working parameters are obtained from its operation; otherwise, the tracker continues to track the target in this frame, the switch to the next alternation cycle is deferred until the detector meets the condition, and the detector's operation is then used to acquire the tracker working parameters.
For the tracker, whether the detection effect meets the predetermined standard is judged by comparing the features of two adjacent frames and determining whether their difference exceeds a set threshold. In this comparative analysis, the resource overhead and processing speed of the arbiter must be fully considered: if its overhead is large and its processing slow, the whole IA-VID pipeline is strongly affected, which runs counter to the original design goal of IA-VID. For this reason, IA-VID analyzes the images with the Structural Similarity Index (SSIM). SSIM is a measure of the similarity between images; compared with indices such as the peak signal-to-noise ratio (PSNR) used in traditional digital image processing, SSIM is more expressive of perceived quality when measuring video quality.
In the SSIM method, given two signals x and y, their structural similarity is defined as
SSIM(x, y) = [l(x, y)]^α · [c(x, y)]^β · [s(x, y)]^γ    (7)
where
l(x, y) = (2·μ_x·μ_y + C_1) / (μ_x² + μ_y² + C_1)    (8)
c(x, y) = (2·σ_x·σ_y + C_2) / (σ_x² + σ_y² + C_2)    (9)
s(x, y) = (σ_xy + C_3) / (σ_x·σ_y + C_3)    (10)
In these formulas, l(x, y) compares the luminance difference between x and y, c(x, y) compares their contrast difference, and s(x, y) compares their structural difference. α, β and γ are parameters that adjust the relative importance of l(x, y), c(x, y) and s(x, y), with α > 0, β > 0 and γ > 0. μ_x and μ_y are the means of x and y, σ_x and σ_y their standard deviations, and σ_xy is the covariance of x and y. C_1, C_2 and C_3 are constants that keep l(x, y), c(x, y) and s(x, y) numerically stable.
Through a large number of tests and comparative experiments in various scenes, it is found that the SSIM algorithm judges extreme situations with high accuracy, i.e., situations in which the tracker can no longer track the target information. Therefore, in the IA-VID method the SSIM algorithm is selected to monitor the tracker's working state: the signals x and y are the tracker's detection results of two consecutive frames, and when the SSIM value falls below a set threshold the current detection effect is considered not to meet the established standard.
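A simplified, single-window SSIM check for this monitoring step could look like the sketch below. It follows equations (7) to (10) with α = β = γ = 1 and the common choice C_3 = C_2 / 2, computed over whole grayscale frames rather than local windows, so it is only an approximation of the full SSIM algorithm; the threshold value is illustrative, not taken from the patent:

```python
import numpy as np

def global_ssim(x, y, L=255.0, K1=0.01, K2=0.03):
    """Single-window SSIM of two grayscale images, per equations (7)-(10)
    with alpha = beta = gamma = 1 and C3 = C2 / 2."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x, sigma_y = x.std(), y.std()
    sigma_xy = ((x - mu_x) * (y - mu_y)).mean()
    l = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
    c = (2 * sigma_x * sigma_y + C2) / (sigma_x ** 2 + sigma_y ** 2 + C2)
    s = (sigma_xy + C2 / 2) / (sigma_x * sigma_y + C2 / 2)
    return l * c * s

def tracker_effect_ok(prev_frame, cur_frame, ssim_threshold=0.8):
    """Monitoring rule: trust the tracker only while the structural similarity
    of consecutive frames stays above the threshold."""
    return global_ssim(prev_frame, cur_frame) >= ssim_threshold
```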
For the detector, on the condition that the detector needs to be restarted, IA-VID decides whether to adopt the target detection result of the current detector according to the confidence it returns. If the confidence returned by the current detector is greater than a preset confidence threshold (the detection threshold), a new mixed detection group GoHD_i is entered and the tracker is initialized with the current detector result; otherwise, the tracker continues to track the target in this frame, the switch to the next alternation cycle is deferred until the detector meets the condition, and the tracker is then initialized with the detector's working result. The preset confidence threshold needs to be fine-tuned for different scenes; in IA-VID it is generally set to 0.5.
For a detector without a confidence parameter, once the detector returns at least one target detection result its confidence is considered to be 100%; if the detector returns no detection result, its confidence is considered to be 0%. Accordingly, it can be judged whether the detection result of the detector can be adopted.
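The detector-side decision reduces to a small rule, sketched below; the 0.5 default follows the text above, while the shape of the detection output is an assumption:

```python
def adopt_detector_result(detections, confidence=None, conf_threshold=0.5):
    """Decide whether the current detector output can be adopted.
    `detections` is the list of target boxes returned by the detector;
    `confidence` is the detector's own reliability value, or None if the
    detector exposes no such parameter."""
    if confidence is not None:
        return confidence > conf_threshold   # detector with a confidence output
    return len(detections) > 0               # otherwise, any returned result counts as 100%
```

If this returns False, the tracker keeps tracking the current frame and the switch to the next alternation cycle is deferred.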
The scheme is a general processing architecture, and the detector and the tracker can be replaced flexibly. For example, moving-target detection algorithms based on background modeling and detection algorithms based on target modeling, such as ViBe and YOLO, can both be used as the detector and benefit from the acceleration provided by the present invention.
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A video sparse detection method, characterized by comprising an initialization stage and a mixed video detection stage, wherein the mixed video detection stage replaces the detection task of part of the video frames with a tracker;
step one, the initialization stage: determining a correspondence between target features and working parameters that determine the working time of the tracker;
substep 101: the detector and the tracker work simultaneously, performing target detection on the same video and generating an initialization detection sequence and an initialization tracking sequence of the same length as the initialization video sequence;
substep 102: jointly analyzing the initialization detection sequence and tracking sequence, evaluating the performance of the detector and the tracker in the current scene, taking "the tracker can replace the detector for target detection" as the goal, and determining, from the detection results, a number of tracker continuous working segments meeting this goal; the length of the j-th tracker continuous working segment is recorded as L_tc,j;
the tracker continuous working segments are determined according to the Intersection over Union (IoU) of the detection-frame region and the tracking-frame region: for each frame, the IoU of the detection-frame region and the tracking-frame region is compared with a validity scale to determine whether the IoU data is valid, and the segments corresponding to continuously valid data are determined as tracker continuous working segments;
wherein the EMA reference value θ_k obtained by the exponentially weighted moving average (EWMA) method is used as the validity scale to analyze the IoU data and filter out invalid IoU values; in the process of filtering invalid IoU values with EWMA, an update operation of the EMA reference value θ_k after data is judged invalid is further added; the filtering process of invalid IoU values includes:
for the k-th video frame, the detector-tracker IoU metric of the current frame, iou_k, and the EMA reference value θ_{k-1} of the (k-1)-th frame are known;
when iou_k ≥ θ_{k-1}, the current IoU is recorded as valid data, and the EMA reference value θ_k of the current frame is updated with a weighted combination of iou_k and θ_{k-1};
when iou_k < θ_{k-1}, the current IoU is recorded as invalid data, and three EMA temporary reference values θ_{1,k}, θ_{2,k} and θ_{3,k} are calculated as follows:
θ_{1,k} = θ_init
θ_{2,k} = iou_k
θ_{3,k} = β·θ_{k-1}
where β is an attenuation coefficient and θ_init is the initial value of the EMA reference value;
the temporary reference values θ_{1,k}, θ_{2,k} and θ_{3,k} are sorted by size, and the number of times N that the IoU metric has been consecutively smaller than the EMA reference value of the previous frame is recorded; when N = 1, the maximum of the three temporary reference values is selected to update θ_k; when N = 2, the middle value (second largest) of the three temporary reference values is selected to update θ_k; when N = 3, the minimum of the three temporary reference values is selected to update θ_k; when N > 3, θ_k is always updated with the minimum of the three temporary reference values;
substep 103: for each tracker continuous working segment j, acquiring the target feature X_roi_j of the intersection region of the detection frame and the tracking frame; the target feature and the working parameter determining the duration of the tracker continuous working segment form the correspondence [X_roi_j, L_tc,j];
step two: the mixed video detection stage:
when video detection is carried out, the detector and the tracker are started to work alternately; in each alternation cycle, the detector is started first, the target feature is determined from its detection result, and the correspondence [X_roi_j, L_tc,j] is used to match the tracker continuous working segment length L_tc,j corresponding to the target feature; the tracker is initialized with the detector's result and performs target detection according to the matched tracker continuous working segment length L_tc,j;
a long-acting monitoring mechanism is added while step two is being executed:
while the tracker is working, monitoring in real time whether the detection effect meets the set standard; if not, switching to the next alternation cycle early and starting the detector;
whether the cycle alternates normally or the tracker is switched to the detector early because of poor effect, judging the detection result of the detector to determine whether the current detection result of the detector can be adopted; if so, obtaining the tracker working parameters through the detector's operation; otherwise, the tracker continues to track the target for this frame, and the switch to the next alternation cycle is made only after the detector meets the condition.
2. The method of claim 1, wherein the target feature is: one or more combinations of area, brightness, or number of corner points of the target region;
the working parameters for determining the working time of the tracker are as follows: one or more of a length of the tracker operation in units of frames, a length of the tracker operation in units of time, and a condition for stopping the tracker operation.
3. The method as claimed in claim 1, wherein the step of obtaining the target feature of the intersection region of the detection frame and the tracking frame for each tracker continuous working segment is: and selecting the average value, the maximum value or the minimum value of the target characteristics of each frame in the intersection area.
4. The method of claim 1, wherein matching the tracker working parameters corresponding to the target feature using the correspondence comprises:
training a classifier or a clustering model with the correspondence; inputting the target feature into the classifier or clustering model, which outputs the predicted tracker working parameters.
5. The method of claim 1, wherein whether the detection effect of the tracker meets the predetermined standard is judged by computing structural similarity data of two adjacent frames with a structural similarity (SSIM) algorithm;
whether the current detection result of the detector can be adopted is judged as follows:
if the detector has a confidence parameter, judging whether the confidence returned by the detector is greater than a preset confidence threshold; if so, the current detection result of the detector can be adopted; otherwise, it is not adopted;
if the detector has no confidence parameter, judging whether the detector returns at least one target detection result; if so, the current detection result of the detector can be adopted; otherwise, it is not adopted.
CN201910435217.8A 2019-05-23 2019-05-23 Video sparse detection method Active CN110322475B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910435217.8A CN110322475B (en) 2019-05-23 2019-05-23 Video sparse detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910435217.8A CN110322475B (en) 2019-05-23 2019-05-23 Video sparse detection method

Publications (2)

Publication Number Publication Date
CN110322475A CN110322475A (en) 2019-10-11
CN110322475B true CN110322475B (en) 2022-11-11

Family

ID=68119302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910435217.8A Active CN110322475B (en) 2019-05-23 2019-05-23 Video sparse detection method

Country Status (1)

Country Link
CN (1) CN110322475B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9544550B1 (en) * 2013-03-14 2017-01-10 Hrl Laboratories, Llc Low power surveillance camera system for intruder detection
WO2018058595A1 (en) * 2016-09-30 2018-04-05 富士通株式会社 Target detection method and device, and computer system
CN109086648A (en) * 2018-05-24 2018-12-25 同济大学 A kind of method for tracking target merging target detection and characteristic matching

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020983B (en) * 2012-09-12 2017-04-05 深圳先进技术研究院 A kind of human-computer interaction device and method for target following
JP6573361B2 (en) * 2015-03-16 2019-09-11 キヤノン株式会社 Image processing apparatus, image processing system, image processing method, and computer program
US9582895B2 (en) * 2015-05-22 2017-02-28 International Business Machines Corporation Real-time object analysis with occlusion handling
CN105678806B (en) * 2016-01-07 2019-01-08 中国农业大学 A kind of live pig action trail automatic tracking method differentiated based on Fisher
CN106203450A (en) * 2016-07-11 2016-12-07 国家新闻出版广电总局广播科学研究院 Based on degree of depth learning framework, image is carried out the object detection method of feature extraction
CN106384360A (en) * 2016-09-22 2017-02-08 北京舜裔科技有限公司 Interactive video creation method
US10902243B2 (en) * 2016-10-25 2021-01-26 Deep North, Inc. Vision based target tracking that distinguishes facial feature targets
CN106875425A (en) * 2017-01-22 2017-06-20 北京飞搜科技有限公司 A kind of multi-target tracking system and implementation method based on deep learning
CN106845430A (en) * 2017-02-06 2017-06-13 东华大学 Pedestrian detection and tracking based on acceleration region convolutional neural networks
CN107730905A (en) * 2017-06-13 2018-02-23 银江股份有限公司 Multitask fake license plate vehicle vision detection system and method based on depth convolutional neural networks
US20190130583A1 (en) * 2017-10-30 2019-05-02 Qualcomm Incorporated Still and slow object tracking in a hybrid video analytics system
CN108346159B (en) * 2018-01-28 2021-10-15 北京工业大学 Tracking-learning-detection-based visual target tracking method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9544550B1 (en) * 2013-03-14 2017-01-10 Hrl Laboratories, Llc Low power surveillance camera system for intruder detection
WO2018058595A1 (en) * 2016-09-30 2018-04-05 富士通株式会社 Target detection method and device, and computer system
CN109086648A (en) * 2018-05-24 2018-12-25 同济大学 A kind of method for tracking target merging target detection and characteristic matching

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Robust object tracking via adaptive sparse representation; Khorsandi, R. et al.; 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP), IEEE; 2016-02-25; pp. 1160-1164 *
Research on detection and retrieval of leading characters' clothing in videos based on deep learning; Zhu Li; China Master's Theses Full-text Database, Information Science and Technology; 2019-01-15 (No. 1); pp. I138-3183 *

Also Published As

Publication number Publication date
CN110322475A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
AU2016352215B2 (en) Method and device for tracking location of human face, and electronic equipment
CN107886048B (en) Target tracking method and system, storage medium and electronic terminal
JP2023500969A (en) Target Tracking Method, Apparatus, Electronics, Computer Readable Storage Medium and Computer Program Product
CN111709975B (en) Multi-target tracking method, device, electronic equipment and storage medium
CN110796010A (en) Video image stabilization method combining optical flow method and Kalman filtering
CN112215155A (en) Face tracking method and system based on multi-feature fusion
CN113723190A (en) Multi-target tracking method for synchronous moving target
WO2022205936A1 (en) Multi-target tracking method and apparatus, and electronic device and readable storage medium
CN111414868B (en) Method for determining time sequence action segment, method and device for detecting action
Lian et al. Voting-based motion estimation for real-time video transmission in networked mobile camera systems
CN111476160A (en) Loss function optimization method, model training method, target detection method, and medium
CN112528927A (en) Confidence determination method based on trajectory analysis, roadside equipment and cloud control platform
CN115272405A (en) Robust online learning ship tracking method based on twin network
CN114973399A (en) Human body continuous attitude estimation method based on key point motion estimation
CN110322475B (en) Video sparse detection method
CN109472813A (en) It is a kind of based on background weighting Mean Shift algorithm and Kalman prediction fusion block tracking
CN113643206A (en) Cow breathing condition detection method
CN112686326A (en) Target tracking method and system for intelligent sorting candidate frame
Wang et al. Multi-object tracking with adaptive cost matrix
CN110827324A (en) Video target tracking method
CN113689467B (en) Feature point optimization method and system suitable for plane tracking
CN115272401A (en) Target tracking method based on distinguishing enhanced memory and interrupted space-time constraint
CN114677651B (en) Passenger flow statistical method based on low-image-quality low-frame-rate video and related device
CN116330658B (en) Target tracking method, device and system based on depth image and image pickup equipment
CN117576167B (en) Multi-target tracking method, multi-target tracking device, and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant