Disclosure of Invention
The invention provides a semi-supervised noisy-label learning method for surveillance video vehicle detection, aiming to solve the above-mentioned problems.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
A semi-supervised noisy-label learning method for surveillance video vehicle detection, comprising the steps of:
S1: training c annotators simultaneously on a public target detection data set by a c-angle supervision method, wherein c is a natural number greater than 1;
S2: running inference on the unlabeled business data with the c annotators to obtain c groups of detection results, and calculating a confidence-weighted average of each vehicle bounding box as the overall output of the c annotators, the overall output serving as the integrated label of the business data;
S3: mixing the public target detection data set and the business data carrying the integrated label to obtain a mixed data set, wherein the mixed data set comprises pictures and labels drawn respectively from the public target detection data set and the business data, with the correspondence between pictures and labels preserved; training a corrector on the mixed data set in a fully supervised mode, the corrector outputting a corrected label;
S4: replacing the labels of the mixed data set with the corrected labels, and training an inferer in a fully supervised mode;
S5: detecting vehicles by using the inferer to process the surveillance video stream in real time.
Preferably, before step S1, the method further includes processing the public target detection data set, the processing including at least one of: rejecting bounding boxes in the public target detection data set that are both less than 10 pixels wide and less than 10 pixels high, or whose area is smaller than 5×5 pixels; removing pictures without vehicle bounding boxes from the public target detection data set; removing grayscale pictures from the public target detection data set; augmenting the public target detection data set with data enhancement methods.
Preferably, the c annotators use different types of models, the types being two-stage target detection models, single-stage anchor-based (prior-frame-based) target detection models, or single-stage keypoint-based target detection models.
Preferably, obtaining the integrated label of the business data includes the following steps: the c annotators run inference on the business data; a soft non-maximum suppression (Soft-NMS) post-processing method filters out vehicle bounding boxes whose confidence is lower than a first preset threshold, and a preset number of the highest-confidence vehicle bounding boxes are retained, yielding c groups of detection results; the confidence-weighted average of each vehicle bounding box is then calculated as the overall output of the c annotators, giving the integrated label.
Preferably, the confidence-weighted average is calculated as follows: the c annotators form (c-1)·c/2 pairwise combinations, the annotators being denoted A1, A2, …, Ac. For a pair of annotators, the annotator As with the smaller index is selected; for each vehicle bounding box Bs it outputs, the intersection-over-union (IoU) between Bs and every vehicle bounding box Banother output by the other annotator is calculated to obtain a list Li, and the boxes Banother are sorted by Li. If the maximum value in the list Li is greater than a second preset threshold, the corresponding box in Banother and the box Bs represent the same object, which is recorded as Btotal in the integrated label. If the maximum value of Li obtained in every calculation for annotator As is smaller than the second preset threshold, the box Bs is a false detection and is discarded. The confidence Conf_i of each non-integrated vehicle bounding box corresponding to Btotal in the integrated label is the maximum among the confidence values of that box over all categories; denoting the number of non-integrated boxes as k, the Conf_i are normalized as:

W_i = Conf_i / (Conf_1 + Conf_2 + … + Conf_k)

wherein Conf_i is the confidence of the i-th non-integrated vehicle bounding box corresponding to Btotal, Conf_t is the confidence of the t-th non-integrated vehicle bounding box corresponding to Btotal, and W_i is the weight of the i-th non-integrated vehicle bounding box.

The abscissa X_total of the center point of Btotal is calculated as:

X_total = W_1·X_1 + W_2·X_2 + … + W_k·X_k

wherein X_i is the abscissa of the i-th non-integrated vehicle bounding box.
Preferably, obtaining the corrected label comprises the following steps: S31: training a two-stage target detection model M on the mixed data set in a fully supervised mode; S32: fixing the parameters of the two-stage target detection model M and correcting the labels of the mixed data set; S33: continuing to train the two-stage target detection model M with the corrected labels until the loss on the validation set converges, yielding the corrected label.
Preferably, correcting the labels of the mixed data set comprises the following steps: the noisy label and the picture of the mixed data set are input together into the two-stage target detection model M; a backbone network extracts features and passes them to a category-agnostic bounding-box corrector, which outputs the coordinate vector b of the vehicle bounding box after the first-stage correction; the features of the region where the vehicle bounding box lies are pooled by precise region-of-interest pooling and passed to at least 3 detection heads sharing the same structure but using different initialization methods, each detection head comprising a classification head and a localization head, the localization head outputting the coordinates of the vehicle bounding box and the classification head classifying the content of the vehicle bounding box; the at least 3 detection heads each predict the offsets of the 4 coordinates of one vehicle bounding box, giving an offset vector; the mean of the at least 3 offset vectors is multiplied by a hyperparameter smaller than 1, added to the first-stage-corrected bounding-box coordinate vector, and the final coordinate vector of the corrected label is output.
Preferably, the at least 3 detection heads each predict the probability that the bounding box belongs to each category, giving at least 3 probability vectors; after element-wise averaging of these probability vectors with the probability vector contained in the noisy label, a corrected probability vector v is obtained. The g-th element v_g of the probability vector v represents the probability that the bounding box belongs to the g-th category. A sharpening operation is applied to v as shown below, the sharpened probability vector being y:

y_g = v_g^(1/T) / ( v_1^(1/T) + v_2^(1/T) + … + v_C^(1/T) )

wherein T is a hyperparameter smaller than 1, and C is the total number of categories in the data set.
Preferably, obtaining the coordinate vector b of the bounding box after the first-stage correction includes the following steps: the coordinate vector b in the noisy label and the picture of the mixed data set are input into the two-stage target detection model M; after the backbone network extracts features, a precise region-of-interest pooling layer obtains the features corresponding to the bounding box and passes them to at least 3 classification heads, each of which predicts a probability vector. If, in all of the at least 3 probability vectors, the confidence that the bounding box belongs to the background is greater than a threshold Tb, the bounding box is considered background and is removed. Otherwise, the 2-norms between every pair of the at least 3 probability vectors are calculated, and their sum L_disagree is taken as the classification disagreement; the sum of the values representing the background confidence in the at least 3 probability vectors is denoted L_bg; the total loss is denoted L(b) = L_disagree + L_bg, and b is optimized according to L(b); denoting the learning rate by alpha, the corrected coordinate vector b is:

b ← b − alpha · ∂L(b)/∂b
Preferably, the inferer is the open-source model Scaled-YOLOv4.
The invention has the beneficial effects that: the semi-supervised learning method and the noisy-label learning method are combined; a c-angle supervised learning method filters erroneous labels in the public data set, integrated labels are generated on the business data set, a corrector corrects the integrated labels to obtain high-quality corrected labels, and the corrected labels are used to train a real-time inferer, which can obtain high-precision detection results from the business video in real time. The invention explicitly handles the noise in the labels and improves the effect of semi-supervised learning.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing function or a circuit connection function.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
As shown in fig. 1, a semi-supervised noisy learning method for surveillance video vehicle detection includes the following steps:
s1: c annotators are trained on the disclosed target detection data set by adopting a c-angle supervision method, wherein c is a natural number greater than 1;
in one embodiment of the invention, c has a value of 2 to 10.
S2: deducing the business data without the labels by using the c annotators to obtain c groups of detection results, respectively calculating a confidence weighted average value of each vehicle surrounding frame as an integral output result of the c annotators, wherein the integral output result is used as an integrated label of the business data;
S3: mixing the public target detection data set and the business data carrying the integrated label to obtain a mixed data set, wherein the mixed data set comprises pictures and labels drawn respectively from the public target detection data set and the business data, with the correspondence between pictures and labels preserved; training a corrector on the mixed data set in a fully supervised mode, the corrector outputting a corrected label;
it is understood that the number of the correctors may be plural, preferably 1, because a large number of calculations are added when plural correctors are used, but the effect is not significantly improved.
S4: replacing the labels of the mixed data set with the corrected labels, and training an inferer in a fully supervised mode;
S5: detecting vehicles by using the inferer to process the surveillance video stream in real time.
According to the invention, a semi-supervised learning method and a noisy-label learning method are combined; a c-angle supervised learning method filters erroneous labels in the public data set, an integrated label is generated on the business data set, a corrector corrects the integrated label to obtain a high-quality corrected label, and the corrected label is used to train a real-time inferer, which can obtain high-precision detection results from the business video in real time. The invention explicitly handles the noise in the labels and improves the effect of semi-supervised learning.
It can be understood that the invention generates integrated labels on unlabeled data, thereby improving the detection precision of the model; further, the corrector corrects the integrated labels, reducing label noise and improving detection accuracy.
In an embodiment of the present invention, before step S1, the method further includes processing the disclosed target detection data set, where the processing includes at least one of:
Rejecting bounding boxes in the public target detection data set that are both less than 10 pixels wide and less than 10 pixels high, or whose area is smaller than 5×5 pixels; it will be appreciated that such boxes are too small for the naked eye to discern their content, and these extremely small, ambiguous objects are eliminated to avoid misleading the model.
Removing pictures without vehicle surrounding frames in the disclosed target detection data set;
removing the gray level picture in the disclosed target detection data set;
Augmenting the public target detection data set with data enhancement methods. In a specific embodiment, data enhancement methods such as random cropping, random flipping, and random occlusion can be used to expand the data set, making the trained model more robust.
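The filtering steps above can be sketched as follows. This is a minimal illustration in pure Python, assuming a hypothetical record format in which each image is a dict with an "rgb" flag and a list of [x1, y1, x2, y2] vehicle boxes; real annotation formats (such as COCO JSON) differ.

```python
def filter_boxes(boxes, min_side=10, min_area=25):
    """Drop boxes that are both narrower and shorter than min_side,
    or whose area is below min_area (5x5 pixels)."""
    kept = []
    for x1, y1, x2, y2 in boxes:
        w, h = x2 - x1, y2 - y1
        if (w < min_side and h < min_side) or w * h < min_area:
            continue  # too small to discern; would mislead the model
        kept.append([x1, y1, x2, y2])
    return kept

def filter_dataset(images):
    """Keep only RGB images that still contain at least one vehicle box
    after the per-box filter (hypothetical dict layout)."""
    out = []
    for img in images:
        boxes = filter_boxes(img["boxes"])
        if img.get("rgb", True) and boxes:  # drop grayscale and box-free pictures
            out.append({**img, "boxes": boxes})
    return out
```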
The c-angle supervision method trains c models on the same data set; after each conventional iteration, each model outputs a certain number of detection results, and the results whose loss value is below the threshold L are selected as ground truth for the other (c-1) models, completing one c-angle supervision iteration. This filters the noise in the training set. Target detection models divide into anchor-based models, keypoint-based models, and so on; these kinds differ in principle, and their results show a certain diversity. In order to fully exploit the characteristics of different models and maximize the advantage of c-angle supervision, preferably, models of different types are chosen for the c models, for example the anchor-based Cascade R-CNN and the keypoint-based CenterNet2.
Fig. 2 is a schematic diagram of sample transfer between the anchor-based Cascade R-CNN and the keypoint-based CenterNet2 models.
Fig. 3 shows model A transferring samples to the other models when c equals 4. The models are reciprocal: they pass samples to each other and guide each other's learning.
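One c-angle supervision iteration can be sketched as below. This is an illustrative simplification: real detectors are replaced by per-sample loss values, and the function only shows how each model's low-loss samples are handed to the other c-1 models as training truths.

```python
def c_angle_exchange(per_model_losses, loss_threshold):
    """per_model_losses[m][i] is model m's loss on its i-th detection.
    Each model selects detections with loss below the threshold and
    passes them to every other model; returns, per receiving model,
    the list of (source_model, sample_index) pairs it will train on."""
    c = len(per_model_losses)
    passed = {m: [] for m in range(c)}
    for src in range(c):
        clean = [i for i, loss in enumerate(per_model_losses[src])
                 if loss < loss_threshold]  # low-loss = likely clean
        for dst in range(c):
            if dst != src:
                passed[dst].extend((src, i) for i in clean)
    return passed
```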
In one embodiment of the invention, the c annotators use different types of models, the types being two-stage target detection models, single-stage anchor-based target detection models, or single-stage keypoint-based target detection models.
In an embodiment of the present invention, obtaining the integrated tag of the service data includes the following steps:
and the c annotators deduce on the service data, a soft non-maximum inhibition post-processing method is adopted to filter vehicle surrounding frames with confidence coefficients lower than a first preset threshold, a preset number of vehicle surrounding frames with highest confidence coefficients are reserved to obtain detection results of c groups, and the weighted average of the confidence coefficients of each vehicle surrounding frame is respectively calculated to serve as the overall output result of the c annotators to obtain the integrated tag.
In a specific embodiment, the detection result is in the form of [ number of bounding boxes, abscissa of upper left corner of bounding box, ordinate of upper left corner of bounding box, abscissa of lower right corner of bounding box, ordinate of lower right corner of bounding box, confidence that the object belongs to each category ].
It is to be understood that the form of the detection result is not limited to the above, so long as the position of the rectangle can be represented; for example, it may be the horizontal and vertical coordinates of the center point together with the width and height of the box.
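The two box representations mentioned above are interchangeable; a minimal sketch of the conversion:

```python
def corners_to_center(box):
    """[x1, y1, x2, y2] -> [cx, cy, w, h]."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1]

def center_to_corners(box):
    """[cx, cy, w, h] -> [x1, y1, x2, y2]."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]
```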
In one embodiment of the present invention, the confidence weighted average is calculated as follows:
the c markers have a total of (c-1) c/2 pair combinations, and each marker is respectively marked as A1, A2, … …, Ac;
for a pair of the annotators, the annotators As with smaller serial numbers are selected to output each vehicle surrounding frame Bs, intersection and comparison between the vehicle surrounding frame Bs and all vehicle surrounding frames Banother output by the other annotators are calculated to obtain a list Li, and the vehicle surrounding frames Banother are sorted according to the list Li;
if the maximum value in the list Li is greater than a second preset threshold value, a vehicle enclosure box in the vehicle enclosure box Banother and the vehicle enclosure box Bs represent the same object, and the object is marked as Btotal in an integrated label;
if the maximum value in the list Li obtained by each calculation of the annotator As is smaller than the second preset threshold value, the vehicle surrounding frame Bs is subjected to false detection and is discarded;
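The IoU-based matching rule above can be sketched as follows (illustrative pure-Python helper; boxes are [x1, y1, x2, y2] corner lists):

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def match_box(bs, others, thresh=0.5):
    """Return the index in `others` of the box with the highest IoU against
    `bs`, or None if no IoU exceeds the threshold (bs is then treated
    as a false detection and discarded)."""
    li = [iou(bs, b) for b in others]
    best = max(range(len(li)), key=li.__getitem__) if li else None
    return best if best is not None and li[best] > thresh else None
```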
The confidence Conf_i of each non-integrated vehicle bounding box corresponding to Btotal in the integrated label is the maximum among the confidence values of that box over all categories; denoting the number of non-integrated boxes as k, the Conf_i are normalized as:

W_i = Conf_i / (Conf_1 + Conf_2 + … + Conf_k)

wherein Conf_i is the confidence of the i-th non-integrated vehicle bounding box corresponding to Btotal, Conf_t is the confidence of the t-th non-integrated vehicle bounding box corresponding to Btotal, and W_i is the weight of the i-th non-integrated vehicle bounding box;

the abscissa X_total of the center point of Btotal is calculated as:

X_total = W_1·X_1 + W_2·X_2 + … + W_k·X_k

wherein X_i is the abscissa of the i-th non-integrated vehicle bounding box.
Similarly, the ordinate, width, and height can be calculated by replacing x in the formula with y, w, and h respectively.
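The normalization and weighted average described above can be sketched as one small function (boxes given in the [cx, cy, w, h] form so that all four quantities are averaged the same way):

```python
def integrate_boxes(boxes, confidences):
    """Confidence-weighted average of k matched boxes.
    boxes: k lists [cx, cy, w, h]; confidences: k per-box max-class
    confidences Conf_i.  Weights are W_i = Conf_i / sum_t Conf_t."""
    total = sum(confidences)
    weights = [c / total for c in confidences]
    # Weighted average applied per coordinate (cx, cy, w, h).
    return [sum(w * box[d] for w, box in zip(weights, boxes)) for d in range(4)]
```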
The public target detection data set and the business data carrying the integrated label are mixed to obtain a mixed data set. Processing the integrated label reduces label noise, but because the integrated label is generated by models rather than annotated manually, some noise still remains; moreover, the labels of the public data set are themselves not entirely reliable. Therefore, 1 corrector is trained on the mixed data set in a fully supervised mode; on the basis of the integrated label, the corrector outputs a revised label, called the corrected label.
As shown in fig. 4, obtaining the correction label includes the following steps:
s31: training a two-stage target detection model M on the mixed data set in a full-supervision mode;
s32: fixing the parameters of the two-stage target detection model M to correct the label of the mixed data set;
S33: continuing to train the two-stage target detection model M with the corrected labels until the loss on the validation set converges, yielding the corrected label.
Fig. 5 is a schematic diagram illustrating the operation principle of the corrector according to the present invention.
In one embodiment of the present invention, correcting the labels of the mixed data set comprises the following steps:
the noisy label and the picture of the mixed data set are input together into the two-stage target detection model M; a backbone network extracts features and passes them to a category-agnostic bounding-box corrector, which outputs the coordinate vector b of the vehicle bounding box after the first-stage correction; the features of the region where the vehicle bounding box lies are pooled by precise region-of-interest pooling and passed to at least 3 detection heads sharing the same structure but using different initialization methods, each detection head comprising a classification head and a localization head, the localization head outputting the coordinates of the vehicle bounding box and the classification head classifying the content of the vehicle bounding box;
the at least 3 detection heads each predict the offsets of the 4 coordinates of one vehicle bounding box, giving an offset vector;
the mean of the at least 3 offset vectors is multiplied by a hyperparameter smaller than 1 and added to the first-stage-corrected bounding-box coordinate vector, and the final coordinate vector of the corrected label is output.
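The offset-averaging step can be sketched as follows (illustrative only; the per-head offsets would come from the localization heads of model M):

```python
def apply_head_offsets(b, offsets, scale=0.6):
    """b: first-stage-corrected box [x1, y1, x2, y2].
    offsets: one 4-d offset vector per detection head.
    The mean offset is damped by `scale` (a hyperparameter < 1, e.g. 0.6)
    before being added to b, giving the final corrected coordinates."""
    n = len(offsets)
    mean = [sum(o[d] for o in offsets) / n for d in range(4)]
    return [b[d] + scale * mean[d] for d in range(4)]
```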
In an embodiment of the present invention, the at least 3 detection heads each predict the probability that the bounding box belongs to each category, giving at least 3 probability vectors; after element-wise averaging of these probability vectors with the probability vector contained in the noisy label, a corrected probability vector v is obtained.
The g-th element v_g of the probability vector v represents the probability that the bounding box belongs to the g-th category. A sharpening operation is applied to v as shown below, the sharpened probability vector being y:

y_g = v_g^(1/T) / ( v_1^(1/T) + v_2^(1/T) + … + v_C^(1/T) )

wherein T is a hyperparameter smaller than 1, and C represents the total number of categories in the data set.
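The sharpening formula above corresponds to temperature sharpening; a minimal sketch:

```python
def sharpen(v, T=0.8):
    """Temperature-sharpen a probability vector:
    y_g = v_g**(1/T) / sum_j v_j**(1/T).  With T < 1 the exponent 1/T > 1,
    so the largest probabilities grow and the distribution becomes peakier."""
    powered = [p ** (1.0 / T) for p in v]
    s = sum(powered)
    return [p / s for p in powered]
```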
In an embodiment of the present invention, obtaining the coordinate vector b of the bounding box after the first stage correction includes the following steps:
the coordinate vector b in the noisy label and the picture of the mixed data set are input into the two-stage target detection model M; after the backbone network extracts features, a precise region-of-interest pooling layer obtains the features corresponding to the bounding box and passes them to at least 3 classification heads, each of which predicts a probability vector;
if, in all of the at least 3 probability vectors, the confidence that the bounding box belongs to the background is greater than a threshold Tb, the bounding box is considered background and is removed; otherwise, the 2-norms between every pair of the at least 3 probability vectors are calculated, and their sum L_disagree is taken as the classification disagreement;
the sum of the values representing the background confidence in the at least 3 probability vectors is denoted L_bg; the total loss is denoted L(b) = L_disagree + L_bg, and b is optimized according to L(b); with learning rate alpha, the corrected coordinate vector b is:

b ← b − alpha · ∂L(b)/∂b
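The gradient step above can be sketched generically; in practice the gradient of L(b) would come from the network's autograd, but for illustration it is estimated here by central finite differences against any supplied loss function:

```python
def refine_box(b, loss_fn, alpha=0.1, eps=1e-4, steps=1):
    """Gradient-descent refinement b <- b - alpha * dL/db.
    loss_fn maps a coordinate vector to a scalar loss (standing in for
    L(b) = L_disagree + L_bg); the gradient is approximated numerically."""
    b = list(b)
    for _ in range(steps):
        grad = []
        for d in range(len(b)):
            hi, lo = list(b), list(b)
            hi[d] += eps
            lo[d] -= eps
            grad.append((loss_fn(hi) - loss_fn(lo)) / (2 * eps))
        b = [bd - alpha * g for bd, g in zip(b, grad)]
    return b
```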
and replacing the label of the mixed data set with a correction label, and training 1 real-time detection model in a full supervision mode to serve as a pushing device. In one embodiment of the invention, the extrapolator selects the open source model Scaled-YOLOv 4.
The deducer is used as a final on-line model, processes the monitoring video stream in real time and detects the vehicles in the monitoring video stream.
In a specific embodiment of the present invention, c-angle supervision is applied on the public target detection data set MSCOCO to train c annotators simultaneously; in this example, c is 3. The 3 annotators run inference on the collected unlabeled business data RoadNet to obtain 3 groups of detection results; the confidence-weighted average of each vehicle bounding box is calculated as the overall output of the 3 annotators, and this overall output is taken as the integrated label of the business data.
The public target detection data set MSCOCO and the business data RoadNet carrying the integrated label are mixed to obtain a mixed data set MSCOCO_Road, which comprises pictures and labels drawn respectively from MSCOCO and RoadNet, with the correspondence between pictures and labels preserved; a corrector is trained on MSCOCO_Road in a fully supervised mode and outputs corrected labels. The labels of MSCOCO_Road are replaced with the corrected labels, and the inferer is trained in a fully supervised mode. The inferer then processes the surveillance video stream in real time to detect vehicles.
Experiments show that the detection precision of the final inferer can be improved by processing the public target detection data set MSCOCO as follows: rejecting bounding boxes that are both less than 10 pixels wide and less than 10 pixels high, or whose area is smaller than 5×5 pixels; rejecting pictures without vehicle bounding boxes; removing grayscale pictures; and augmenting the data set with data enhancement methods.
Specifically, the 3 annotators respectively adopt the open-source models Mask R-CNN, YOLOv5, and CenterNet.
The method for obtaining the integrated label of the business data comprises the following steps: the 3 annotators run inference on the business data; a soft non-maximum suppression post-processing method filters out vehicle bounding boxes whose confidence is lower than the first preset threshold of 0.5, and the preset number (40) of highest-confidence vehicle bounding boxes are retained, giving 3 groups of detection results; the confidence-weighted average of each vehicle bounding box is calculated as the overall output of the 3 annotators, giving the integrated label.
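The Soft-NMS filtering described above can be sketched as below. This is an illustrative Gaussian-decay variant with an assumed decay parameter sigma; it decays the scores of boxes overlapping the current best detection instead of deleting them outright, then keeps the top boxes whose final score clears the confidence threshold.

```python
import math

def soft_nms(boxes, scores, sigma=0.5, conf_thresh=0.5, top_k=40):
    """Gaussian Soft-NMS sketch.  boxes: [x1, y1, x2, y2] lists; scores:
    matching confidences.  Returns up to top_k (box, score) pairs whose
    decayed score is at least conf_thresh, highest score first."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        ua = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / ua if ua else 0.0

    pool = list(zip([list(b) for b in boxes], list(scores)))
    kept = []
    while pool:
        i = max(range(len(pool)), key=lambda j: pool[j][1])
        box, score = pool.pop(i)
        if score < conf_thresh:
            break  # every remaining score is below threshold
        kept.append((box, score))
        # Decay remaining scores by their overlap with the kept box.
        pool = [(b, s * math.exp(-iou(box, b) ** 2 / sigma)) for b, s in pool]
    kept.sort(key=lambda t: -t[1])
    return kept[:top_k]
```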
The confidence-weighted average is calculated as follows: the 3 annotators form (3-1)×3/2 = 3 pairwise combinations, the annotators being denoted A1, A2, and A3. For a pair of annotators, the annotator As with the smaller index is selected; for each vehicle bounding box Bs it outputs, the IoU between Bs and every vehicle bounding box Banother output by the other annotator is calculated to obtain a list Li, and the boxes Banother are sorted by Li. If the maximum value in the list Li is greater than the second preset threshold of 0.5, the corresponding box in Banother and the box Bs represent the same object, recorded as Btotal in the integrated label; if the maximum value of Li obtained in every calculation for annotator As is smaller than the second preset threshold of 0.5, the box Bs is a false detection and is discarded. The confidence Conf_i of each non-integrated vehicle bounding box corresponding to Btotal is the maximum among the confidence values of that box over all categories; denoting the number of non-integrated boxes as k (k differs from picture to picture), the Conf_i are normalized as:

W_i = Conf_i / (Conf_1 + Conf_2 + … + Conf_k)

wherein Conf_i is the confidence of the i-th non-integrated vehicle bounding box corresponding to Btotal, Conf_t is the confidence of the t-th non-integrated vehicle bounding box corresponding to Btotal, and W_i is the weight of the i-th non-integrated vehicle bounding box;

the abscissa X_total of the center point of Btotal is calculated as:

X_total = W_1·X_1 + W_2·X_2 + … + W_k·X_k

wherein X_i is the abscissa of the i-th non-integrated vehicle bounding box.
The method for obtaining the corrected label comprises the following steps: S31: training a two-stage target detection model (specifically the open-source model Mask R-CNN), abbreviated M, on the mixed data set in a fully supervised mode; S32: fixing the parameters of the two-stage target detection model M and correcting the labels of the mixed data set; S33: continuing to train the two-stage target detection model M with the corrected labels until the loss on the validation set converges, yielding the corrected label.
Preferably, correcting the labels of the mixed data set comprises the following steps: the noisy label and the picture of the mixed data set are input together into the two-stage target detection model M; the backbone network extracts features and passes them to a category-agnostic bounding-box corrector, which outputs the coordinate vector b of the vehicle bounding box after the first-stage correction; the features of the region where the vehicle bounding box lies are pooled by precise region-of-interest pooling and passed to 3 detection heads sharing the same structure but using different initialization random seeds, each detection head comprising a classification head and a localization head, the localization head outputting the coordinates of the vehicle bounding box and the classification head classifying its content; the 3 detection heads each predict the offsets of the 4 coordinates of the vehicle bounding box, giving an offset vector; the mean of the 3 offset vectors is multiplied by 0.6, added to the first-stage-corrected bounding-box coordinate vector, and the final coordinate vector of the corrected label is output.
The 3 detection heads each predict the probability that the bounding box belongs to each category, giving 3 probability vectors; after element-wise averaging of these with the probability vector contained in the noisy label, a corrected probability vector v is obtained.
The g-th element v_g of v represents the probability that the bounding box belongs to the g-th category. A sharpening operation is applied to v as shown below, the sharpened probability vector being y:

y_g = v_g^(1/T) / ( v_1^(1/T) + v_2^(1/T) + … + v_C^(1/T) )

wherein T is a hyperparameter smaller than 1, here 0.8; C is the total number of categories in the data set, here 4 (the 4 categories being car, bus, truck, and others).
Preferably, obtaining the coordinate vector b of the bounding box after the first-stage correction includes the following steps: the coordinate vector b from the noisy label and the picture of the mixed data set are input into the two-stage target detection model M; the backbone network extracts features, a precise region-of-interest pooling layer obtains the features corresponding to the bounding box, and these features are passed to the 3 classification heads, each of which predicts a probability vector. If, in all 3 probability vectors, the confidence that the bounding box belongs to the background is greater than a threshold Tb (here 0.6), the bounding box is considered to belong to the background and is removed. Otherwise, the 2-norm between every pair of the 3 probability vectors is computed, and their sum L_disagree is taken as the classification disagreement; the sum over the 3 probability vectors of the values representing the background confidence is recorded as L_bg; the total loss is recorded as L(b) = L_disagree + L_bg, and b is optimized according to L(b) by gradient descent with learning rate α, the corrected coordinate vector b being:

b ← b − α · ∂L(b)/∂b
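The background test and the disagreement term above can be sketched as follows; gradient descent on b itself requires backpropagating through the model M, so only the loss components and the rejection rule are shown. Function names (`pairwise_l2_sum`, `keep_box`) and the background-index convention are ours, assumed for illustration:

```python
import math

def pairwise_l2_sum(vectors):
    """L_disagree: sum of 2-norms between every pair of probability vectors."""
    total = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += math.sqrt(sum((a - b) ** 2
                                   for a, b in zip(vectors[i], vectors[j])))
    return total

def keep_box(prob_vectors, bg_index=-1, Tb=0.6):
    """Reject the box only if ALL heads put more than Tb confidence
    on the background class; otherwise keep it for optimization."""
    return not all(p[bg_index] > Tb for p in prob_vectors)
```

With identical head outputs, L_disagree is zero, so only L_bg (the summed background confidences) drives the box update; disagreement among the heads adds pressure to move the box toward a region the heads classify consistently.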
the extrapolator selects the open source model Scaled-YOLOv 4.
The detection results of the extrapolator on RoadNet are shown in Fig. 6 and Fig. 7: the text at the upper left of each bounding box is the object category, and the number is the confidence. It can be seen that the method detects most vehicles in the picture; only extremely small distant vehicles that are difficult to identify even with the naked eye are missed, and at present no model in the art can detect such vehicles completely and accurately, the extrapolator being no exception. On an NVIDIA V100 graphics card, the detection frame rate of the extrapolator reaches 63 FPS, meeting the requirement of real-time detection.
An embodiment of the present application further provides a control apparatus, including a processor and a storage medium for storing a computer program; wherein a processor is adapted to perform at least the method as described above when executing the computer program.
Embodiments of the present application also provide a storage medium for storing a computer program, which when executed performs at least the method described above.
Embodiments of the present application further provide a processor, where the processor executes a computer program to perform at least the method described above.
The storage medium may be implemented by any type of volatile or non-volatile storage device, or a combination thereof. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The storage media described in connection with the embodiments of the invention are intended to comprise, without being limited to, these and any other suitable types of memory.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to arrive at new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all such substitutions or modifications shall be considered to fall within the protection scope of the invention.