CN111753682A - Hoisting area dynamic monitoring method based on target detection algorithm - Google Patents

Hoisting area dynamic monitoring method based on target detection algorithm Download PDF

Info

Publication number
CN111753682A
CN111753682A (application CN202010528652.8A; granted publication CN111753682B)
Authority
CN
China
Prior art keywords
feature map
box
target
coordinate
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010528652.8A
Other languages
Chinese (zh)
Other versions
CN111753682B (en)
Inventor
马士伟
杨超
赵焕
王建
乐文
段钢
黄希
李炳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Underground Space Co Ltd
Original Assignee
China Construction Underground Space Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Underground Space Co Ltd filed Critical China Construction Underground Space Co Ltd
Priority to CN202010528652.8A priority Critical patent/CN111753682B/en
Publication of CN111753682A publication Critical patent/CN111753682A/en
Application granted granted Critical
Publication of CN111753682B publication Critical patent/CN111753682B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects (G06V 20/00 Scenes; scene-specific elements; G06V 20/50 Context or environment of the image)
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (G06F 18/00 Pattern recognition; G06F 18/20 Analysing; G06F 18/24 Classification techniques)
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045: Combinations of networks (G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/08: Learning methods
    • G06V 2201/07: Target detection (G06V 2201/00 Indexing scheme relating to image or video recognition or understanding)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a hoisting area dynamic monitoring method based on a target detection algorithm, which comprises the following steps: S1, performing data enhancement on the input images of the hoisting area; S2, extracting features from the images obtained in step S1 with an SSD target detection network; S3, extracting feature maps, constructing six prior frames of different scales at each point on each feature map, and then regressing categories and positions; and S4, screening the results obtained in S3 with non-maximum suppression to obtain the output result. Because feature maps of different scales express different characteristics, the SSD target detection algorithm adopts multi-scale target feature extraction and detects on feature maps of several scales, which improves the robustness of detection on hoisting-area pictures of different scales and improves the accuracy of detecting whether the lifting hook is in the working state under the current conditions and whether a person is underneath the load while it is working.

Description

Hoisting area dynamic monitoring method based on target detection algorithm
Technical Field
The invention belongs to the field of computer vision and image processing, and particularly relates to a dynamic monitoring method for a hoisting area based on a target detection algorithm.
Background
In recent years, target detection has become an important research direction and research hotspot in the fields of computer vision and image processing, with applications in unmanned driving, robot navigation, intelligent video monitoring, industrial inspection, aerospace and other fields. Target detection is also a core part of intelligent monitoring systems and plays an important role in subsequent tasks such as face recognition, gait recognition, crowd counting and instance segmentation. Before deep learning appeared, target detection was mainly carried out by building mathematical models from prior knowledge. With the wide application of deep learning in recent years, however, target detection algorithms have developed rapidly, and the accuracy and robustness of target detection have improved. Target detection models based on deep learning benefit from the fact that deep neural networks learn features of different levels autonomously; compared with traditional hand-crafted features, the learned features are richer and their expressive power is stronger. By design concept, these methods fall into two classes: target detection algorithms based on region proposals and target detection algorithms based on end-to-end learning. Region-proposal methods first propose candidate regions for the possible positions of target objects in the image; typical representatives include R-CNN (Region-CNN), Fast R-CNN, etc. Typical end-to-end methods are YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector); their main idea is to sample densely and uniformly at different positions of the picture, possibly with different scales and aspect ratios, extract features with a CNN, and then perform classification and regression directly, so the whole process needs only one step and is therefore fast. Among these algorithms, R-CNN is inefficient and occupies a large amount of disk space; although Fast R-CNN and Faster R-CNN improve on R-CNN, they still need to extract candidate regions from the detection area first in preparation for subsequent feature extraction and classification. YOLO has a high detection speed, a lower background false detection rate than R-CNN, and supports detection of non-natural images, but its object localization error is large and it can detect only one of two objects that fall in the same grid cell. In comparison, the SSD has relatively better detection performance, combining real-time speed with high accuracy.
The SSD is a single-shot detection deep neural network that combines the regression idea of YOLO with the anchors mechanism of Faster R-CNN. Adopting the regression idea simplifies the computational complexity of the neural network and improves the real-time performance of the algorithm; adopting the anchors mechanism to extract local features of different aspect ratios is more reasonable and effective for recognition than YOLO's extraction of global features at a given position. In other words, multi-scale regional features at all positions of the whole image are used for regression, which keeps YOLO's speed while making the window predictions as accurate as those of Faster R-CNN. In addition, because features of different scales express different characteristics, the SSD adopts multi-scale target feature extraction, which helps improve the robustness of detecting targets of different scales.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a hoisting area dynamic monitoring method based on the SSD (Single Shot MultiBox Detector) target detection algorithm. Because feature maps of different scales express different characteristics, the method adopts multi-scale target feature extraction and detects on feature maps of several scales, which improves the robustness of detection on hoisting-area pictures of different scales and improves the accuracy of detecting whether the hook is in the working state under the current conditions.
The purpose of the invention is realized by the following technical scheme: a hoisting area dynamic monitoring method based on a target detection algorithm comprises the following steps:
s1, performing data enhancement on the input image of the hoisting area;
s2, extracting the features of the image obtained in the step S1 by adopting an SSD target detection network;
s3, extracting feature maps, constructing six prior frames of different scales at each point on each feature map, and then regressing categories and positions;
and S4, screening the results obtained in S3 with non-maximum suppression to obtain the output result.
Further, each hoisting-area image input in step S1 is randomly sampled by one of the following three methods:
(1) using the whole image, namely the collected original image of the hoisting area;
(2) randomly cropping on the original image;
(3) random cropping constrained by the Jaccard overlap, which is calculated as:
J(A, B) = |A ∩ B| / |A ∪ B|
where A and B are the real (ground-truth) frames in the original image and the cropped patch, respectively; the ratio of the crop size to the original image is within [0.1, 0.9], and the aspect ratio is within [1/2, 2];
the input image is resized to a uniform size and horizontally flipped with a probability of 0.5.
Further, the SSD target detection network is based on the VGG-16 network and retains, for training, the convolutional layers Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv3_3, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2 and Conv5_3 (512 channels);
FC6 and FC7 are changed from the original fully connected layers to a 3 × 3 × 1024 convolution and a 1 × 1 × 1024 convolution, and the additional layers comprise Conv6_1, Conv6_2, Conv7_1, Conv7_2, Conv8_1, Conv8_2, Conv9_1 and Conv9_2;
meanwhile, the pooling layer Pool5 is changed from the original 2 × 2 with stride 2 to 3 × 3 with stride 1;
based on the Atrous algorithm, Conv6 employs a dilated (atrous) convolution, which enlarges the receptive field of the convolution exponentially without increasing the number of parameters or the model complexity, with the dilation rate parameter representing the size of the dilation;
the Conv4_3 layer is used as the first feature map for detection; the Conv4_3 feature map size is 38 × 38, but the feature norm of this layer differs from that of the later detection layers, so an L2 normalization layer is added after the Conv4_3 layer to ensure that the difference from the other detection layers is not too large.
Further, the specific implementation method of step S3 is as follows: the 19 × 19 Conv7, 10 × 10 Conv8_2, 5 × 5 Conv9_2, 3 × 3 Conv10_2 and 1 × 1 Conv11_2 layers are extracted from the convolutional layers as feature maps for detection; together with the Conv4_3 layer, 6 feature maps are extracted in total; six prior frames of different sizes are constructed at each point on these 6 feature maps, and categories and positions are then regressed respectively;
the method comprises the following steps: and obtaining a plurality of six feature maps with different sizes by adopting a multi-scale method, wherein if the m-layer feature map is adopted during system detection, the prior frame proportion calculation formula of the kth feature map is as follows:
Figure BDA0002534425870000031
wherein m denotes the number of characteristic diagrams, skRepresenting the ratio of the prior frame size to the picture, smin and smaxMinimum and maximum values representing ratios; for the first feature map, the scale ratio of the prior frame is set to be 0.1, and the scale is 30; for the later characteristic diagram, the prior frame scale is increased linearly according to the formula above, but the scale is firstly enlarged by 100 times, and the increasing step is17, s of each feature mapk20, 37, 54, 71 and 88, dividing the ratios by 100, and multiplying by the picture size to obtain the dimension of each feature map; for the aspect ratio, choose
Figure BDA0002534425870000032
For a particular aspect ratio, the width and height of the prior frame are calculated as follows:
w_k^a = s_k · √(a_r)
h_k^a = s_k / √(a_r)
In the module fusing features of different scales, besides the prior frame with a_r = 1 and scale s_k, each feature map is also given a prior frame with a_r = 1 and scale s'_k = √(s_k · s_(k+1)); each feature map therefore has two square prior frames with aspect ratio 1 but different sizes. Furthermore, the center point of the prior frame of each cell is placed at the center of that cell, i.e. ((a + 0.5) / |f_k|, (b + 0.5) / |f_k|), with a, b ∈ [0, |f_k|), where |f_k| is the size of the k-th feature map, and the prior frame coordinates are clipped to lie within [0, 1]; the mapping between the coordinates of the prior frame on the feature map and the coordinates in the original image is as follows:
x_min = (c_x − w_b / 2) / w_feature · w_img
y_min = (c_y − h_b / 2) / h_feature · h_img
x_max = (c_x + w_b / 2) / w_feature · w_img
y_max = (c_y + h_b / 2) / h_feature · h_img
where (c_x, c_y) are the coordinates of the prior frame center on the feature layer; w_b, h_b are the width and height of the prior frame; w_feature, h_feature are the width and height of the feature layer; and w_img, h_img are the width and height of the original image. The resulting (x_min, y_min, x_max, y_max) are the object frame coordinates in the original image, obtained by mapping the prior frame with center ((a + 0.5) / |f_k|, (b + 0.5) / |f_k|) and size w_k, h_k on the k-th feature map;
For each output feature map, the position and the target category are regressed simultaneously; the target loss function is the weighted sum of the confidence (classification) loss and the position loss:
L(x, c, l, g) = (1 / N) · ( L_conf(x, c) + α · L_loc(x, l, g) )
where N is the total number of matched positive samples (if N = 0, L is set to 0); x and c are the classification indicator and confidence; l and g are the prediction box and the real box; α is the weight of the position loss; L_conf(x, c) is the confidence loss function; and L_loc(x, l, g) is the position loss function;
The position loss is the Smooth L1 loss between the prediction box l and the real box g:
L_loc(x, l, g) = Σ_{i ∈ Pos}^{N} Σ_{m ∈ {cx, cy, w, h}} x_ij^k · smooth_L1( l_i^m − ĝ_j^m )
ĝ_j^cx = (g_j^cx − d_i^cx) / d_i^w
ĝ_j^cy = (g_j^cy − d_i^cy) / d_i^h
ĝ_j^w = log( g_j^w / d_i^w )
ĝ_j^h = log( g_j^h / d_i^h )
where Pos denotes the positive samples; x_ij^p is an indicator that equals 1 when the i-th prediction box is matched to the j-th real box of class p, and 0 otherwise; cx, cy, w and h are the center point x coordinate, the center point y coordinate, the width and the height of a box; d is the prior frame; l_i^m are the predicted coordinate offsets of the prediction box, namely the predicted offsets of the center point x coordinate l_i^cx, the center point y coordinate l_i^cy, the width l_i^w and the height l_i^h; and ĝ_j^cx, ĝ_j^cy, ĝ_j^w, ĝ_j^h are, respectively, the offset of the real box center coordinate cx, the offset of the center coordinate cy, the scaling of the width w and the scaling of the height h;
The classification loss is the softmax loss between the class confidences:
L_conf(x, c) = − Σ_{i ∈ Pos}^{N} x_ij^p · log( ĉ_i^p ) − Σ_{i ∈ Neg} log( ĉ_i^0 )
ĉ_i^p = exp( c_i^p ) / Σ_p exp( c_i^p )
where ĉ_i^p is the softmax probability that prediction box i belongs to class p, ĉ_i^0 is the probability that it is background, and x_ij^p = 1 means that the i-th prediction box is matched to the j-th real box of class p, otherwise the prediction box has no matching real box; the classification loss formula includes both the positive samples Pos and the negative samples Neg;
In order to predict the detection results, a set of independent detection values is output for each prior frame of each cell, corresponding to one bounding box; these values are mainly divided into two parts: the first part is the confidence or score of each category; with c category confidences there are only c − 1 real detection categories, because the first confidence denotes the background; during prediction, the category with the highest confidence is the category of the bounding box, and in particular, when the first confidence value is the highest, the bounding box contains no target; the second part is the position of the bounding box, comprising 4 values (cx, cy, w, h) that represent the center coordinates, the width and the height of the bounding box; for a feature map of size m × n there are m·n cells in total; if the number of prior frames set for each cell is denoted k, then (c + 4)·k predicted values are required for each cell and (c + 4)·k·m·n predicted values for all cells; since the SSD used by the system performs detection with convolution, (c + 4)·k convolution kernels are needed to complete the detection process of this feature map;
In order to keep the positive and negative samples as balanced as possible, the negative samples are subsampled: during sampling they are sorted in descending order of confidence error (the smaller the predicted background confidence, the larger the error), the top-k samples with the largest errors are selected as training negative samples, and the ratio of positive to negative samples is kept close to 1:3.
Further, the specific implementation method of step S4 is as follows: adopting a non-maximum suppression method, comprising the following sub-steps:
s41, regarding the detection result obtained in the step S3 as a candidate set, sorting the candidate set according to the confidence degrees aiming at each type of target, selecting the target with the highest confidence degree, deleting the target from the candidate set, and adding the target into the detection result set;
s42, calculating the Jaccard overlapping rate between the elements in the candidate set and the target obtained in S41, and deleting the elements corresponding to the candidate set with the Jaccard overlapping rate larger than a given threshold value;
and S43, repeating the steps S41 and S42 until the candidate set is empty, and outputting the result set as a final result.
The invention has the beneficial effects that: (1) the SSD target detection algorithm adopted in the invention utilizes the idea of YOLO regression, simplifies the computational complexity of a neural network, and improves the real-time performance of the algorithm;
(2) by using the anchors mechanism of Faster R-CNN, the SSD target detection algorithm adopted in the invention can extract hook features of different aspect ratios and sizes, and this way of extracting local features is more reasonable and effective for recognition;
(3) because features of different scales express different characteristics, the SSD target detection algorithm adopted in the invention uses multi-scale target feature extraction and detects on feature maps of different scales, where a large-scale feature map (closer to the front of the network) can be used to detect small objects and a small-scale feature map (closer to the back) is used to detect large objects, which improves the robustness of detecting hoisting-area pictures of different scales;
(4) the invention improves the accuracy of detecting whether the lifting hook is in the working state under the current conditions and the accuracy of detecting whether a person is underneath the load while the hook is in the working state.
Drawings
FIG. 1 is a flow chart of a dynamic monitoring method of a hoisting area based on a target detection algorithm of the present invention;
FIG. 2 is a diagram of a conventional VGG-16 network architecture;
FIG. 3 is a diagram of the SSD target detection network architecture of the present invention;
FIG. 4 is a schematic diagram of a characteristic pyramid of the present invention;
FIG. 5 is a graph showing the results of the detection of the present invention.
Detailed Description
The technical scheme of the invention is further explained below with reference to the accompanying drawings.
As shown in fig. 1, the hoisting area dynamic monitoring method based on the target detection algorithm of the present invention includes the following steps:
s1, performing data enhancement on the input image of the hoisting area;
the input image of each hoisting area is randomly sampled by one of the following three methods:
(1) using the whole image, namely the collected original image of the hoisting area;
(2) randomly cropping on the original image;
(3) random cropping constrained by the Jaccard overlap, which is calculated as:
J(A, B) = |A ∩ B| / |A ∪ B|
where A and B are the real (ground-truth) frames in the original image and the cropped patch, respectively; the ratio of the crop size to the original image is within [0.1, 0.9], and the aspect ratio is within [1/2, 2];
the input image is resized to a uniform size and horizontally flipped with a probability of 0.5.
Data enhancement increases the number of training samples and at the same time constructs more targets of different shapes and sizes as input to the network, so that the network can learn more robust features and the performance of the subsequent algorithm improves; in the end the system becomes less sensitive to target translation and more robust to targets of different sizes and aspect ratios.
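As an illustration of this sampling strategy, the following minimal Python sketch implements the three sampling modes and the Jaccard overlap; it is not the patent's code, and names such as random_sample, the min_jaccard threshold of 0.5 and the fallback behaviour are assumptions made for illustration.

```python
import random

def jaccard(box_a, box_b):
    """Jaccard (IoU) overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-10)

def random_sample(img_w, img_h, gt_boxes, min_jaccard=0.5, max_tries=50):
    """Pick one of: whole image, random crop, Jaccard-constrained crop."""
    mode = random.choice(["whole", "random_crop", "jaccard_crop"])
    if mode == "whole":
        return (0, 0, img_w, img_h)
    for _ in range(max_tries):
        # crop size in [0.1, 0.9] of the original, aspect ratio in [1/2, 2]
        scale = random.uniform(0.1, 0.9)
        ratio = random.uniform(0.5, 2.0)
        w = min(img_w, int(img_w * scale * ratio ** 0.5))
        h = min(img_h, int(img_h * scale / ratio ** 0.5))
        if w < 1 or h < 1:
            continue
        x = random.randint(0, img_w - w)
        y = random.randint(0, img_h - h)
        crop = (x, y, x + w, y + h)
        if mode == "random_crop":
            return crop
        # one reading of the constraint: at least one real frame overlaps the crop enough
        if any(jaccard(crop, gt) >= min_jaccard for gt in gt_boxes):
            return crop
    return (0, 0, img_w, img_h)  # fall back to the whole image
```

The sampled patch would then be resized to the uniform network input size and flipped horizontally with probability 0.5, as stated above.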
S2, extracting the features of the image obtained in the step S1 by adopting an SSD target detection network;
The SSD target detection network is based on the VGG-16 network; the conventional VGG-16 network structure is shown in FIG. 2 and the SSD network of the invention is shown in FIG. 3. During training, the convolutional layers Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv3_3, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2 and Conv5_3 (512 channels) are retained;
FC6 and FC7 are changed from the original fully connected layers to a 3 × 3 × 1024 convolution and a 1 × 1 × 1024 convolution, and the additional layers comprise Conv6_1, Conv6_2, Conv7_1, Conv7_2, Conv8_1, Conv8_2, Conv9_1 and Conv9_2;
meanwhile, the pooling layer Pool5 is changed from the original 2 × 2 with stride 2 to 3 × 3 with stride 1;
based on the Atrous algorithm, Conv6 employs a dilated (atrous) convolution, which enlarges the receptive field of the convolution exponentially without increasing the number of parameters or the model complexity, with the dilation rate parameter representing the size of the dilation;
the Conv4_3 layer is used as the first feature map for detection; the Conv4_3 feature map size is 38 × 38, but the feature norm of this layer differs from that of the later detection layers, so an L2 normalization layer is added after the Conv4_3 layer to ensure that the difference from the other detection layers is not too large.
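To make the layer changes concrete, here is a minimal PyTorch sketch of the modified backbone pieces (Pool5, the FC6/FC7 replacements and the additional layers). It is a reconstruction of the standard SSD modification under stated assumptions (dilation rate 6, the channel widths of the additional layers), not the patent's exact implementation; activations between layers are omitted for brevity.

```python
import torch.nn as nn

# Pool5: 2x2 / stride 2 in plain VGG-16 becomes 3x3 / stride 1 here
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

# FC6 -> 3x3x1024 dilated (atrous) convolution: the dilation enlarges the
# receptive field without adding parameters (rate 6 is the usual SSD choice)
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)

# FC7 -> 1x1x1024 convolution
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)

# Additional layers (names as in the text; output sizes assume a 300x300 input)
extras = nn.Sequential(
    nn.Conv2d(1024, 256, kernel_size=1),                      # Conv6_1
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),  # Conv6_2 -> 10x10
    nn.Conv2d(512, 128, kernel_size=1),                       # Conv7_1
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),  # Conv7_2 -> 5x5
    nn.Conv2d(256, 128, kernel_size=1),                       # Conv8_1
    nn.Conv2d(128, 256, kernel_size=3),                       # Conv8_2 -> 3x3
    nn.Conv2d(256, 128, kernel_size=1),                       # Conv9_1
    nn.Conv2d(128, 256, kernel_size=3),                       # Conv9_2 -> 1x1
)
```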
S3, extracting feature maps, constructing six prior frames of different scales at each point on each feature map, and then regressing categories and positions;
The specific implementation method is as follows: the 19 × 19 Conv7, 10 × 10 Conv8_2, 5 × 5 Conv9_2, 3 × 3 Conv10_2 and 1 × 1 Conv11_2 layers are extracted from the convolutional layers as feature maps for detection; together with the Conv4_3 layer, 6 feature maps are extracted in total, forming the pyramid-shaped feature structure shown in FIG. 4; six prior frames of different sizes are constructed at each point on these 6 feature maps, and categories and positions are then regressed respectively;
FIG. 5 shows the final detection result, which is obtained as follows: the SSD borrows the concept of anchors from Faster R-CNN; each cell is given prior frames of different scales and aspect ratios, and the predicted bounding boxes are based on these prior frames, which reduces the training difficulty to a certain extent. In general, each cell is given several prior frames whose scales and aspect ratios differ. The number of prior frames differs between feature maps, while every cell on the same feature map has the same prior frames. The setting of the prior frames covers two aspects: scale (or size) and aspect ratio. The scale of the prior frames obeys a linear increase rule: as the feature map size decreases, the prior frame scale increases linearly.
The method is as follows: six feature maps of different sizes are obtained by the multi-scale method. If m feature maps are used for detection, the prior frame scale of the k-th feature map is calculated as:
s_k = s_min + (s_max − s_min) / (m − 1) · (k − 1),  k ∈ [1, m]
where m denotes the number of feature maps, set to 5 in the present embodiment because the first layer (the Conv4_3 layer) is set separately; s_k denotes the ratio of the prior frame size to the picture; and s_min and s_max, the minimum and maximum values of this ratio, are 0.2 and 0.9 respectively. For the first feature map, the scale ratio of the prior frame is set to 0.1, i.e. a scale of 30; for the later feature maps, the prior frame scale increases linearly according to the formula above, but the ratio is first multiplied by 100 and the increase step is 17, so that s_k of each feature map is 20, 37, 54, 71 and 88; dividing these values by 100 and multiplying by the picture size gives the prior frame scale on each feature map. For the aspect ratio, a_r ∈ {1, 2, 3, 1/2, 1/3} is chosen.
For a particular aspect ratio, the width and height of the prior frame are calculated as follows:
w_k^a = s_k · √(a_r)
h_k^a = s_k / √(a_r)
In the module fusing features of different scales, besides the prior frame with a_r = 1 and scale s_k, each feature map is also given a prior frame with a_r = 1 and scale s'_k = √(s_k · s_(k+1)); each feature map therefore has two square prior frames with aspect ratio 1 but different sizes. Furthermore, the center point of the prior frame of each cell is placed at the center of that cell, i.e. ((a + 0.5) / |f_k|, (b + 0.5) / |f_k|), with a, b ∈ [0, |f_k|), where |f_k| is the size of the k-th feature map, and the prior frame coordinates are clipped to lie within [0, 1]; the mapping between the coordinates of the prior frame on the feature map and the coordinates in the original image is as follows:
x_min = (c_x − w_b / 2) / w_feature · w_img
y_min = (c_y − h_b / 2) / h_feature · h_img
x_max = (c_x + w_b / 2) / w_feature · w_img
y_max = (c_y + h_b / 2) / h_feature · h_img
where (c_x, c_y) are the coordinates of the prior frame center on the feature layer; w_b, h_b are the width and height of the prior frame; w_feature, h_feature are the width and height of the feature layer; and w_img, h_img are the width and height of the original image. The resulting (x_min, y_min, x_max, y_max) are the object frame coordinates in the original image, obtained by mapping the prior frame with center ((a + 0.5) / |f_k|, (b + 0.5) / |f_k|) and size w_k, h_k on the k-th feature map;
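A small numpy sketch of the prior frame generation described above (the scale 0.1 for Conv4_3 and 20/37/54/71/88 divided by 100 for the later maps, aspect ratios {1, 2, 3, 1/2, 1/3}, the extra square prior with scale √(s_k·s_(k+1)), and centers at the middle of each cell). The function name generate_priors and the final scale 1.05 used for the last extra prior are assumptions made for illustration.

```python
import itertools
import math
import numpy as np

def generate_priors(feature_sizes=(38, 19, 10, 5, 3, 1),
                    scales=(0.10, 0.20, 0.37, 0.54, 0.71, 0.88, 1.05),
                    aspect_ratios=(2.0, 3.0, 0.5, 1.0 / 3.0)):
    """Return prior frames as (cx, cy, w, h), normalized to [0, 1]."""
    priors = []
    for k, f in enumerate(feature_sizes):
        s_k = scales[k]
        s_k_extra = math.sqrt(scales[k] * scales[k + 1])  # second square prior
        for a, b in itertools.product(range(f), repeat=2):
            cx, cy = (a + 0.5) / f, (b + 0.5) / f          # center of the cell
            boxes = [(s_k, s_k), (s_k_extra, s_k_extra)]   # two square priors (a_r = 1)
            for ar in aspect_ratios:                       # a_r in {2, 3, 1/2, 1/3}
                boxes.append((s_k * math.sqrt(ar), s_k / math.sqrt(ar)))
            for w, h in boxes:
                priors.append((cx, cy, w, h))
    return np.clip(np.array(priors), 0.0, 1.0)  # clip prior coordinates into [0, 1]

priors = generate_priors()
# 38^2 + 19^2 + 10^2 + 5^2 + 3^2 + 1^2 = 1940 locations, 6 priors each -> 11640 priors
print(priors.shape)  # (11640, 4)
```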
For each output feature map, the position and the target category are regressed simultaneously; the target loss function is the weighted sum of the confidence (classification) loss and the position loss:
L(x, c, l, g) = (1 / N) · ( L_conf(x, c) + α · L_loc(x, l, g) )
where N is the total number of matched positive samples (if N = 0, L is set to 0); x and c are the classification indicator and confidence; l and g are the prediction box and the real box; α is the weight of the position loss; L_conf(x, c) is the confidence loss function; and L_loc(x, l, g) is the position loss function;
The position loss is the Smooth L1 loss between the prediction box l and the real box g:
L_loc(x, l, g) = Σ_{i ∈ Pos}^{N} Σ_{m ∈ {cx, cy, w, h}} x_ij^k · smooth_L1( l_i^m − ĝ_j^m )
ĝ_j^cx = (g_j^cx − d_i^cx) / d_i^w
ĝ_j^cy = (g_j^cy − d_i^cy) / d_i^h
ĝ_j^w = log( g_j^w / d_i^w )
ĝ_j^h = log( g_j^h / d_i^h )
where Pos denotes the positive samples; x_ij^p is an indicator that equals 1 when the i-th prediction box is matched to the j-th real box of class p, and 0 otherwise; cx, cy, w and h are the center point x coordinate, the center point y coordinate, the width and the height of a box; d is the prior frame (the prior frame preset by the network itself), l is the prediction box (the box output by the network with the predicted offsets applied), and g is the GT box (the real box annotated in the data set); l_i^m are the predicted coordinate offsets of the prediction box, namely the predicted offsets of the center point x coordinate l_i^cx, the center point y coordinate l_i^cy, the width l_i^w and the height l_i^h; and ĝ_j^cx, ĝ_j^cy, ĝ_j^w, ĝ_j^h are, respectively, the offset of the real box center coordinate cx, the offset of the center coordinate cy, the scaling of the width w and the scaling of the height h;
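The offset encoding and Smooth L1 position loss above can be sketched as follows in PyTorch; this is a simplified sketch that assumes the positive prediction boxes have already been matched one-to-one with real boxes, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def encode_offsets(gt_boxes, priors):
    """Encode matched real boxes against their prior frames.

    gt_boxes, priors: tensors of shape (N, 4) in (cx, cy, w, h) form,
    already matched one-to-one (positive samples only).
    """
    g_cxcy = (gt_boxes[:, :2] - priors[:, :2]) / priors[:, 2:]  # (g_cx - d_cx)/d_w, (g_cy - d_cy)/d_h
    g_wh = torch.log(gt_boxes[:, 2:] / priors[:, 2:])           # log(g_w/d_w), log(g_h/d_h)
    return torch.cat([g_cxcy, g_wh], dim=1)

def location_loss(pred_offsets, gt_boxes, priors):
    """Smooth L1 loss between predicted offsets l and the encoded targets."""
    targets = encode_offsets(gt_boxes, priors)
    return F.smooth_l1_loss(pred_offsets, targets, reduction="sum")
```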
The classification loss is the softmax loss between the class confidences:
L_conf(x, c) = − Σ_{i ∈ Pos}^{N} x_ij^p · log( ĉ_i^p ) − Σ_{i ∈ Neg} log( ĉ_i^0 )
ĉ_i^p = exp( c_i^p ) / Σ_p exp( c_i^p )
where ĉ_i^p is the softmax probability that prediction box i belongs to class p, ĉ_i^0 is the probability that it is background, and x_ij^p = 1 means that the i-th prediction box is matched to the j-th real box of class p, otherwise the prediction box has no matching real box; the classification loss formula includes both the positive samples Pos and the negative samples Neg;
In order to predict the detection results, a set of independent detection values is output for each prior frame of each cell, corresponding to one bounding box; these values are mainly divided into two parts: the first part is the confidence or score of each category; with c category confidences there are only c − 1 real detection categories, because the first confidence denotes the background; during prediction, the category with the highest confidence is the category of the bounding box, and in particular, when the first confidence value is the highest, the bounding box contains no target; the second part is the position of the bounding box, comprising 4 values (cx, cy, w, h) that represent the center coordinates, the width and the height of the bounding box; for a feature map of size m × n there are m·n cells in total; if the number of prior frames set for each cell is denoted k, then (c + 4)·k predicted values are required for each cell and (c + 4)·k·m·n predicted values for all cells; since the SSD used by the system performs detection with convolution, (c + 4)·k convolution kernels are needed to complete the detection process of this feature map;
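The "(c + 4)·k convolution kernels per feature map" can be illustrated with a short PyTorch sketch of one detection head; the values c = 3 (background, hook, person) and k = 6 priors per cell are assumptions for the example only.

```python
import torch
import torch.nn as nn

c, k = 3, 6  # assumed: 3 class confidences (including background), 6 priors per cell

conf_head = nn.Conv2d(512, c * k, kernel_size=3, padding=1)  # c*k classification kernels
loc_head = nn.Conv2d(512, 4 * k, kernel_size=3, padding=1)   # 4*k localization kernels

feature_map = torch.randn(1, 512, 38, 38)  # e.g. the Conv4_3 feature map
conf = conf_head(feature_map)              # (1, c*k, 38, 38)
loc = loc_head(feature_map)                # (1, 4*k, 38, 38)
# (c + 4)*k predictions per cell and (c + 4)*k*38*38 predictions for this feature map
```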
In order to keep the positive and negative samples as balanced as possible, the negative samples are subsampled: during sampling they are sorted in descending order of confidence error (the smaller the predicted background confidence, the larger the error), the top-k samples with the largest errors are selected as training negative samples, and the ratio of positive to negative samples is kept close to 1:3.
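A minimal sketch of this hard negative mining step, assuming the per-prior confidence losses and the positive-sample mask have already been computed; the helper name and tensor layout are illustrative.

```python
import torch

def hard_negative_mining(conf_loss, positive_mask, neg_pos_ratio=3):
    """Keep the negatives with the largest confidence loss, at most 3 per positive.

    conf_loss:     (num_priors,) per-prior confidence loss
    positive_mask: (num_priors,) boolean mask of matched (positive) priors
    """
    num_pos = int(positive_mask.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~positive_mask).sum()))
    neg_loss = conf_loss.clone()
    neg_loss[positive_mask] = float("-inf")          # consider negatives only
    _, order = neg_loss.sort(descending=True)        # descending confidence error
    negative_mask = torch.zeros_like(positive_mask)
    negative_mask[order[:num_neg]] = True            # top-k hardest negatives
    return negative_mask
```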
S4, screening the results obtained in step S3 with non-maximum suppression to obtain the output result;
For each prediction box, its category (the one with the highest confidence) and its confidence value are first determined according to the class confidences, and prediction boxes belonging to the background are filtered out. Prediction boxes whose confidence falls below a confidence threshold (e.g. 0.5) are then filtered out. The remaining prediction boxes are decoded, and their real position parameters are obtained from the prior frames. After decoding, the boxes are sorted in descending order of confidence and only the top-k (e.g. 400) prediction boxes are kept. Finally, prediction boxes with a large Jaccard overlap are filtered out with the non-maximum suppression algorithm; the remaining prediction boxes are the detection result.
The specific implementation method comprises the following steps: adopting a non-maximum suppression method, comprising the following sub-steps:
s41, regarding the detection result obtained in the step S3 as a candidate set, sorting the candidate set according to the confidence degrees aiming at each type of target, selecting the target with the highest confidence degree, deleting the target from the candidate set, and adding the target into the detection result set;
s42, calculating the Jaccard overlapping rate between the elements in the candidate set and the target obtained in S41, and deleting the elements corresponding to the candidate set with the Jaccard overlapping rate larger than a given threshold value;
and S43, repeating the steps S41 and S42 until the candidate set is empty, and outputting the result set as a final result.
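Sub-steps S41 to S43 correspond to classical greedy non-maximum suppression; the following Python sketch shows one per-class pass, reusing the jaccard helper from the data-enhancement sketch above. The overlap threshold of 0.5 is an assumption; the text only speaks of "a given threshold".

```python
def non_maximum_suppression(boxes, scores, overlap_threshold=0.5):
    """Greedy NMS for the detections of a single class.

    boxes:  list of (xmin, ymin, xmax, ymax)
    scores: list of confidences, same length as boxes
    Returns the indices of the kept boxes.
    """
    # S41: sort the candidate set by confidence, highest first
    candidates = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    results = []
    while candidates:                       # S43: repeat until the candidate set is empty
        best = candidates.pop(0)            # highest-confidence target
        results.append(best)                # move it to the detection result set
        # S42: drop candidates whose Jaccard overlap with the kept box exceeds the threshold
        candidates = [i for i in candidates
                      if jaccard(boxes[best], boxes[i]) <= overlap_threshold]
    return results
```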
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the scope of protection is not limited to the specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and these changes and combinations remain within the scope of the invention.

Claims (5)

1. A hoisting area dynamic monitoring method based on a target detection algorithm is characterized by comprising the following steps:
s1, performing data enhancement on the input image of the hoisting area;
s2, extracting the features of the image obtained in the step S1 by adopting an SSD target detection network;
s3, extracting feature maps, constructing six prior frames of different scales at each point on each feature map, and then regressing categories and positions;
and S4, screening the results obtained in S3 with non-maximum suppression to obtain the output result.
2. The method for dynamically monitoring the hoisting area based on the object detection algorithm as claimed in claim 1, wherein the image of each hoisting area input in the step S1 is randomly sampled by one of the following three methods:
(1) using the whole image, namely the collected original image of the hoisting area;
(2) randomly cropping on the original image;
(3) random cropping constrained by the Jaccard overlap, which is calculated as:
J(A, B) = |A ∩ B| / |A ∪ B|
where A and B are the real (ground-truth) frames in the original image and the cropped patch, respectively; the ratio of the crop size to the original image is within [0.1, 0.9], and the aspect ratio is within [1/2, 2];
the input image is resized to a uniform size and horizontally flipped with a probability of 0.5.
3. The method for dynamically monitoring the hoisting area based on the target detection algorithm as claimed in claim 1, wherein the SSD target detection network is based on the VGG-16 network and retains, for training, the convolutional layers Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv3_3, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2 and Conv5_3 (512 channels);
FC6 and FC7 are changed from the original fully connected layers to a 3 × 3 × 1024 convolution and a 1 × 1 × 1024 convolution, and the additional layers comprise Conv6_1, Conv6_2, Conv7_1, Conv7_2, Conv8_1, Conv8_2, Conv9_1 and Conv9_2;
meanwhile, the pooling layer Pool5 is changed from the original 2 × 2 with stride 2 to 3 × 3 with stride 1;
based on the Atrous algorithm, Conv6 employs a dilated (atrous) convolution, which enlarges the receptive field of the convolution exponentially without increasing the number of parameters or the model complexity, with the dilation rate parameter representing the size of the dilation;
wherein the Conv4_3 layer is used as the first feature map for detection; the Conv4_3 feature map size is 38 × 38, with an L2 normalization layer added after the Conv4_3 layer.
4. The method for dynamically monitoring the hoisting area based on the target detection algorithm as claimed in claim 3, wherein step S3 is implemented as follows: the 19 × 19 Conv7, 10 × 10 Conv8_2, 5 × 5 Conv9_2, 3 × 3 Conv10_2 and 1 × 1 Conv11_2 layers are extracted from the convolutional layers as feature maps for detection; together with the Conv4_3 layer, 6 feature maps are extracted in total; six prior frames of different sizes are constructed at each point on these 6 feature maps, and categories and positions are then regressed respectively;
the method comprises the following steps: six feature maps of different sizes are obtained by the multi-scale method; if m feature maps are used for detection, the prior frame scale of the k-th feature map is calculated as:
s_k = s_min + (s_max − s_min) / (m − 1) · (k − 1),  k ∈ [1, m]
where m denotes the number of feature maps, s_k denotes the ratio of the prior frame size to the picture, and s_min and s_max denote the minimum and maximum values of this ratio; for the first feature map, the scale ratio of the prior frame is set to 0.1, i.e. a scale of 30; for the later feature maps, the prior frame scale increases linearly according to the formula above, but the ratio is first multiplied by 100 and the increase step is 17, so that s_k of each feature map is 20, 37, 54, 71 and 88; dividing these values by 100 and multiplying by the picture size gives the prior frame scale on each feature map; for the aspect ratio, a_r ∈ {1, 2, 3, 1/2, 1/3} is chosen;
For a particular aspect ratio, the width and height of the prior frame are calculated as follows:
w_k^a = s_k · √(a_r)
h_k^a = s_k / √(a_r)
In the module fusing features of different scales, besides the prior frame with a_r = 1 and scale s_k, each feature map is also given a prior frame with a_r = 1 and scale s'_k = √(s_k · s_(k+1)); each feature map therefore has two square prior frames with aspect ratio 1 but different sizes. Furthermore, the center point of the prior frame of each cell is placed at the center of that cell, i.e. ((a + 0.5) / |f_k|, (b + 0.5) / |f_k|), with a, b ∈ [0, |f_k|), where |f_k| is the size of the k-th feature map, and the prior frame coordinates are clipped to lie within [0, 1]; the mapping between the coordinates of the prior frame on the feature map and the coordinates in the original image is as follows:
x_min = (c_x − w_b / 2) / w_feature · w_img
y_min = (c_y − h_b / 2) / h_feature · h_img
x_max = (c_x + w_b / 2) / w_feature · w_img
y_max = (c_y + h_b / 2) / h_feature · h_img
where (c_x, c_y) are the coordinates of the prior frame center on the feature layer; w_b, h_b are the width and height of the prior frame; w_feature, h_feature are the width and height of the feature layer; and w_img, h_img are the width and height of the original image. The resulting (x_min, y_min, x_max, y_max) are the object frame coordinates in the original image, obtained by mapping the prior frame with center ((a + 0.5) / |f_k|, (b + 0.5) / |f_k|) and size w_k, h_k on the k-th feature map;
For each output feature map, the position and the target category are regressed simultaneously; the target loss function is the weighted sum of the confidence loss and the position loss:
L(x, c, l, g) = (1 / N) · ( L_conf(x, c) + α · L_loc(x, l, g) )
where N is the total number of matched positive samples (if N = 0, L is set to 0); x and c are the classification indicator and confidence; l and g are the prediction box and the real box; α is the weight of the position loss; L_conf(x, c) is the confidence loss function; and L_loc(x, l, g) is the position loss function;
The position loss is the Smooth L1 loss between the prediction box l and the real box g:
L_loc(x, l, g) = Σ_{i ∈ Pos}^{N} Σ_{m ∈ {cx, cy, w, h}} x_ij^k · smooth_L1( l_i^m − ĝ_j^m )
ĝ_j^cx = (g_j^cx − d_i^cx) / d_i^w
ĝ_j^cy = (g_j^cy − d_i^cy) / d_i^h
ĝ_j^w = log( g_j^w / d_i^w )
ĝ_j^h = log( g_j^h / d_i^h )
where Pos denotes the positive samples; x_ij^p is an indicator that equals 1 when the i-th prediction box is matched to the j-th real box of class p, and 0 otherwise; cx, cy, w and h are the center point x coordinate, the center point y coordinate, the width and the height of a box; d is the prior frame; l_i^m are the predicted coordinate offsets of the prediction box, namely the predicted offsets of the center point x coordinate l_i^cx, the center point y coordinate l_i^cy, the width l_i^w and the height l_i^h; and ĝ_j^cx, ĝ_j^cy, ĝ_j^w, ĝ_j^h are, respectively, the offset of the real box center coordinate cx, the offset of the center coordinate cy, the scaling of the width w and the scaling of the height h;
The classification loss is the softmax loss between the class confidences:
L_conf(x, c) = − Σ_{i ∈ Pos}^{N} x_ij^p · log( ĉ_i^p ) − Σ_{i ∈ Neg} log( ĉ_i^0 )
ĉ_i^p = exp( c_i^p ) / Σ_p exp( c_i^p )
where ĉ_i^p is the softmax probability that prediction box i belongs to class p, ĉ_i^0 is the probability that it is background, and x_ij^p = 1 means that the i-th prediction box is matched to the j-th real box of class p, otherwise the prediction box has no matching real box; the classification loss formula includes both the positive samples Pos and the negative samples Neg;
In order to predict the detection results, a set of independent detection values is output for each prior frame of each cell, corresponding to one bounding box; these values are mainly divided into two parts: the first part is the confidence or score of each category; with c category confidences there are only c − 1 real detection categories, because the first confidence denotes the background; during prediction, the category with the highest confidence is the category of the bounding box, and in particular, when the first confidence value is the highest, the bounding box contains no target; the second part is the position of the bounding box, comprising 4 values (cx, cy, w, h) that represent the center coordinates, the width and the height of the bounding box; for a feature map of size m × n there are m·n cells in total; if the number of prior frames set for each cell is denoted k, then (c + 4)·k predicted values are required for each cell and (c + 4)·k·m·n predicted values for all cells; since the SSD used by the system performs detection with convolution, (c + 4)·k convolution kernels are needed to complete the detection process of this feature map;
In order to keep the positive and negative samples as balanced as possible, the negative samples are subsampled: during sampling they are sorted in descending order of confidence error, the top-k samples with the largest errors are selected as training negative samples, and the ratio of positive to negative samples is kept close to 1:3.
5. The method for dynamically monitoring the hoisting area based on the target detection algorithm as recited in claim 4, wherein the step S4 is implemented by: adopting a non-maximum suppression method, comprising the following sub-steps:
s41, regarding the detection result obtained in the step S3 as a candidate set, sorting the candidate set according to the confidence degrees aiming at each type of target, selecting the target with the highest confidence degree, deleting the target from the candidate set, and adding the target into the detection result set;
s42, calculating the Jaccard overlapping rate between the elements in the candidate set and the target obtained in S41, and deleting the elements corresponding to the candidate set with the Jaccard overlapping rate larger than a given threshold value;
and S43, repeating the steps S41 and S42 until the candidate set is empty, and outputting the result set as a final result.
CN202010528652.8A 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm Active CN111753682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010528652.8A CN111753682B (en) 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010528652.8A CN111753682B (en) 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm

Publications (2)

Publication Number Publication Date
CN111753682A true CN111753682A (en) 2020-10-09
CN111753682B CN111753682B (en) 2023-05-23

Family

ID=72675082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010528652.8A Active CN111753682B (en) 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm

Country Status (1)

Country Link
CN (1) CN111753682B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423760A (en) * 2017-07-21 2017-12-01 西安电子科技大学 Based on pre-segmentation and the deep learning object detection method returned
US20190377949A1 (en) * 2018-06-08 2019-12-12 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Image Processing Method, Electronic Device and Computer Readable Storage Medium
CN109886359A (en) * 2019-03-25 2019-06-14 西安电子科技大学 Small target detecting method and detection model based on convolutional neural networks
CN111027547A (en) * 2019-12-06 2020-04-17 南京大学 Automatic detection method for multi-scale polymorphic target in two-dimensional image

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215308A (en) * 2020-12-13 2021-01-12 之江实验室 Single-order detection method and device for hoisted object, electronic equipment and storage medium
CN112215308B (en) * 2020-12-13 2021-03-30 之江实验室 Single-order detection method and device for hoisted object, electronic equipment and storage medium
CN112614121A (en) * 2020-12-29 2021-04-06 国网青海省电力公司海南供电公司 Multi-scale small-target equipment defect identification and monitoring method
CN112733671A (en) * 2020-12-31 2021-04-30 新大陆数字技术股份有限公司 Pedestrian detection method, device and readable storage medium
CN113158752A (en) * 2021-02-05 2021-07-23 国网河南省电力公司鹤壁供电公司 Intelligent safety management and control system for electric power staff approach operation
CN113688663A (en) * 2021-02-23 2021-11-23 北京澎思科技有限公司 Face detection method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN111753682B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN110084292B (en) Target detection method based on DenseNet and multi-scale feature fusion
CN108564097B (en) Multi-scale target detection method based on deep convolutional neural network
CN111753682B (en) Hoisting area dynamic monitoring method based on target detection algorithm
CN110991311B (en) Target detection method based on dense connection deep network
CN111310861A (en) License plate recognition and positioning method based on deep neural network
CN110796048B (en) Ship target real-time detection method based on deep neural network
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN109190752A (en) The image, semantic dividing method of global characteristics and local feature based on deep learning
CN110991444B (en) License plate recognition method and device for complex scene
CN109145836B (en) Ship target video detection method based on deep learning network and Kalman filtering
CN111079739B (en) Multi-scale attention feature detection method
CN111898432B (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN108734200B (en) Human target visual detection method and device based on BING (building information network) features
CN112364931A (en) Low-sample target detection method based on meta-feature and weight adjustment and network model
CN109902576B (en) Training method and application of head and shoulder image classifier
CN114022408A (en) Remote sensing image cloud detection method based on multi-scale convolution neural network
CN113159215A (en) Small target detection and identification method based on fast Rcnn
CN114155474A (en) Damage identification technology based on video semantic segmentation algorithm
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN115187530A (en) Method, device, terminal and medium for identifying ultrasonic automatic breast full-volume image
CN116912796A (en) Novel dynamic cascade YOLOv 8-based automatic driving target identification method and device
Li et al. Incremental learning of infrared vehicle detection method based on SSD
CN115861956A (en) Yolov3 road garbage detection method based on decoupling head
CN114494441B (en) Grape and picking point synchronous identification and positioning method and device based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant