CN111753682B - Hoisting area dynamic monitoring method based on target detection algorithm - Google Patents

Hoisting area dynamic monitoring method based on target detection algorithm

Info

Publication number
CN111753682B
CN111753682B (application number CN202010528652.8A)
Authority
CN
China
Prior art keywords
frame
feature
target
feature map
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010528652.8A
Other languages
Chinese (zh)
Other versions
CN111753682A (en)
Inventor
马士伟
杨超
赵焕
王建
乐文
段钢
黄希
李炳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Underground Space Co Ltd
Original Assignee
China Construction Underground Space Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Underground Space Co Ltd
Priority to CN202010528652.8A
Publication of CN111753682A
Application granted
Publication of CN111753682B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07: Target detection

Abstract

The invention discloses a hoisting area dynamic monitoring method based on a target detection algorithm, which comprises the following steps: S1, performing data enhancement on an input image of the hoisting area; S2, extracting features from the image obtained in step S1 with an SSD target detection network; S3, extracting feature maps, constructing six prior boxes of different scales at each point of each feature map, and then regressing the categories and positions respectively; S4, screening the result obtained in S3 with non-maximum suppression to obtain the output result. The SSD target detection algorithm exploits the fact that features at different scales express different information: multi-scale target features are extracted and feature maps of different scales are used for detection, which improves the robustness of detecting hoisting area pictures of different scales and improves the accuracy of detecting whether the lifting hook is in a working state and whether personnel are present beneath the hoisted load.

Description

Hoisting area dynamic monitoring method based on target detection algorithm
Technical Field
The invention belongs to the field of computer vision and image processing, and particularly relates to a hoisting area dynamic monitoring method based on a target detection algorithm.
Background
In recent years, target detection has become an important research direction and hotspot in the fields of computer vision and image processing, with applications in unmanned aerial vehicles, robot navigation, intelligent video surveillance, industrial inspection, aerospace and other fields. Target detection is also a core part of intelligent surveillance systems and plays a vital role in subsequent tasks such as face recognition, gait recognition, crowd counting and instance segmentation. Before the emergence of deep learning, target detection was mainly carried out by building mathematical models from prior knowledge. With the wide application of deep learning in recent years, however, target detection algorithms have developed rapidly, and both the accuracy and the robustness of detection have improved. The advantage of deep-learning-based detection models is that a deep neural network can autonomously learn features at different levels, so that, compared with traditional hand-crafted features, the learned features are richer and have stronger representational power. By design concept, these methods fall into two main classes: target detection algorithms based on region proposals and target detection algorithms based on end-to-end learning. Region-proposal-based methods first propose candidate regions for the possible positions of target objects in the image; representative methods include R-CNN (Region-CNN) and Fast R-CNN. End-to-end methods do not need to extract candidate regions in advance; representative methods are YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector). Their main idea is to sample uniformly and densely at different positions of the picture, possibly with different scales and aspect ratios, extract features with a CNN, and then perform classification and regression directly, so the whole process needs only one stage and is therefore fast. Among these algorithms, R-CNN is inefficient and occupies a large amount of disk space; although Fast R-CNN and Faster R-CNN improve on R-CNN, they still need to extract candidate regions before the subsequent feature extraction and classification. YOLO is fast, has a lower background false-detection rate than R-CNN-style methods and supports detection of unnatural images, but its object localization error is large, and when two objects fall into the same grid cell only one of them can be detected. In contrast, SSD offers relatively better detection performance, combining real-time speed with high accuracy.
SSD is a single-shot detection deep neural network that combines the regression idea of YOLO with the anchor mechanism of Faster R-CNN. Adopting the regression idea simplifies the computational complexity of the neural network and improves the real-time performance of the algorithm; the anchor mechanism allows features of different aspect ratios and sizes to be extracted, and this way of extracting local features is more reasonable and effective for recognition than YOLO's use of global features at a given position. In other words, regression is performed on multi-scale regional features at all positions of the full image, so the speed of YOLO is retained while the window predictions are as accurate as those of Faster R-CNN. In addition, SSD exploits the fact that features at different scales express different information by extracting multi-scale target features, which helps improve the robustness of detecting targets of different scales.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a hoisting area dynamic monitoring method based on a target detection algorithm. The method adopts the SSD target detection algorithm, which exploits the fact that features at different scales express different information: multi-scale target features are extracted and feature maps of different scales are used for detection, which improves the robustness of detecting hoisting area pictures of different scales and improves the accuracy of detecting whether the lifting hook is currently in a working state.
The aim of the invention is realized by the following technical scheme: a hoisting area dynamic monitoring method based on a target detection algorithm comprises the following steps:
S1, performing data enhancement on an input image of the hoisting area;
S2, extracting features from the image obtained in step S1 with an SSD target detection network;
S3, extracting feature maps, constructing six prior boxes of different scales at each point of each feature map, and then regressing the categories and positions respectively;
S4, screening the result obtained in S3 with non-maximum suppression to obtain the output result.
Further, the image of each hoisting area input in step S1 is randomly sampled by one of the following three methods:
(1) Using the whole image, namely an acquired original image of the hoisting area;
(2) Randomly cropping on the original image;
(3) Random cropping with a Jaccard overlap constraint; the Jaccard overlap is calculated as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where A and B denote the sets of all ground-truth boxes in the original image and in the cropped image, respectively; the size of the crop is between 0.1 and 0.9 of the original picture, and its aspect ratio is between 1/2 and 2;
the input image is resized to a uniform size and flipped horizontally with a probability of 0.5.
Further, the SSD target detection network is based on the VGG-16 network; Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv3_3, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2 and Conv5_3 (512) are used for training;
FC6 and FC7 are changed from the original fully connected layers into a 3×3×1024 convolution and a 1×1×1024 convolution, and the additional layers comprise Conv6_1, Conv6_2, Conv7_1, Conv7_2, Conv8_1, Conv8_2, Conv9_1 and Conv9_2;
at the same time, the pooling layer Pool5 is changed from the original 2×2 with stride=2 to 3×3 with stride=1;
based on the Atrous algorithm, Conv6 adopts dilated (atrous) convolution, which expands the receptive field of the convolution exponentially without increasing the parameters or the model complexity, and uses a dilation rate parameter to indicate the amount of expansion;
the Conv4_3 layer is used as the first feature map for detection; the Conv4_3 feature map size is 38×38, but this layer is relatively shallow and its feature norm is relatively large, so an L2 normalization layer is added after Conv4_3 to ensure that it does not differ too much from the later detection layers.
Further, the specific implementation method of step S3 is as follows: the 19×19 Conv7, 10×10 Conv8_2, 5×5 Conv9_2, 3×3 Conv10_2 and 1×1 Conv11_2 layers are extracted from the convolutional layers as feature maps for detection; together with the Conv4_3 layer, 6 feature maps are extracted in total; six prior boxes of different scales are constructed at each point of the 6 feature maps, and then the categories and positions are regressed respectively;

the specific method is as follows: six feature maps of different sizes are obtained with the multi-scale method; if the system uses m layers of feature maps for detection, the prior box scale of the k-th feature map is calculated as:

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \qquad k \in [1, m]$$

where m is the number of feature maps, $s_k$ is the ratio of the prior box size to the picture size, and $s_{min}$ and $s_{max}$ are the minimum and maximum values of this ratio. For the first feature map, the scale ratio of the prior box is set to 0.1, i.e. a scale of 30; for the following feature maps, the prior box scale grows linearly according to the above formula, but the ratio is first enlarged by a factor of 100 and the increment step is 17, so that $s_k$ for each feature map is 20, 37, 54, 71 and 88; dividing these values by 100 and multiplying by the picture size gives the scale of each feature map. For the aspect ratio, the following values are selected:

$$a_r \in \{1,\; 2,\; 3,\; \tfrac{1}{2},\; \tfrac{1}{3}\}$$

For a given aspect ratio, the width and height of the prior box are calculated as:

$$w_k^a = s_k \sqrt{a_r}, \qquad h_k^a = \frac{s_k}{\sqrt{a_r}}$$

In addition to the prior box with $a_r = 1$ and scale $s_k$, each feature map in the fused multi-scale feature module is also given a prior box with $a_r = 1$ and scale

$$s_k' = \sqrt{s_k\, s_{k+1}}$$

so that each feature map has two square prior boxes with aspect ratio 1 but different sizes. Furthermore, the center point of the prior boxes of each cell is placed at the center of that cell, i.e.

$$\left(\frac{a + 0.5}{|f_k|},\; \frac{b + 0.5}{|f_k|}\right), \qquad a, b \in [0, |f_k|]$$

where $|f_k|$ is the size of the k-th feature map, and the prior box coordinates are clipped to lie within [0, 1]. The mapping between the prior box coordinates on the feature map and the coordinates in the original image is:

$$x_{min} = \frac{c_x - w_b/2}{w_{feature}} \cdot w_{img}, \qquad y_{min} = \frac{c_y - h_b/2}{h_{feature}} \cdot h_{img}$$

$$x_{max} = \frac{c_x + w_b/2}{w_{feature}} \cdot w_{img}, \qquad y_{max} = \frac{c_y + h_b/2}{h_{feature}} \cdot h_{img}$$

where $(c_x, c_y)$ are the coordinates of the prior box center on the feature layer; $w_b, h_b$ are the width and height of the prior box; $w_{feature}, h_{feature}$ are the width and height of the feature layer; and $w_{img}, h_{img}$ are the width and height of the original image. The resulting $(x_{min}, y_{min}, x_{max}, y_{max})$ are the object box coordinates obtained by mapping the prior box with center $\left(\frac{a+0.5}{|f_k|}, \frac{b+0.5}{|f_k|}\right)$ and size $w_k, h_k$ on the k-th feature map back to the original image;
regression is performed on the position and the target category simultaneously on each output feature map; the target loss function is the weighted sum of the confidence (classification) loss and the location loss:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N is the total number of matched positive samples (if N = 0, L is set to 0); x and c are the classification indicator and confidence, respectively; l and g are the predicted box and the ground-truth box, respectively; α is the weight of the location loss; $L_{conf}(x, c)$ is the confidence loss function; and $L_{loc}(x, l, g)$ is the location loss function;

the location loss is the Smooth L1 loss between the predicted box l and the ground-truth box g:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right)$$

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \qquad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}$$

$$\hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \qquad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}$$

where Pos denotes the positive samples; $x_{ij}^{p}$ is an indicator that equals 1 when the i-th predicted box is matched to the j-th ground-truth box of class p, and 0 otherwise; cx, cy, w and h denote the center-point x coordinate, center-point y coordinate, width and height of a box; d is the prior box; $l_i^{m}$ ($m \in \{cx, cy, w, h\}$) are the offsets predicted for the center-point x coordinate, center-point y coordinate, width and height of the predicted box; $g_j^{cx}, g_j^{cy}, g_j^{w}, g_j^{h}$ are the center-point x coordinate, center-point y coordinate, width and height of the ground-truth box; and $\hat{g}_j^{cx}, \hat{g}_j^{cy}, \hat{g}_j^{w}, \hat{g}_j^{h}$ are, respectively, the offset of the center coordinate cx, the offset of the center coordinate cy, the scaling of the width w and the scaling of the height h of the ground-truth box relative to the prior box;
the classification loss is the softmax loss over the class confidences:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right)$$

$$\hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}$$

where $\hat{c}_i^{p}$ is the softmax probability that predicted box i belongs to class p, $\hat{c}_i^{0}$ is the probability that it is background, and $x_{ij}^{p}$ equals 1 when the i-th predicted box is matched to the j-th ground-truth box of class p and 0 otherwise (i.e., when the i-th predicted box has no matching ground-truth box); the classification loss covers both the positive samples Pos and the negative samples Neg;
in order to predict the detection result, a set of independent detection values is output for each prior box of each cell, corresponding to one bounding box, and divided into two main parts: the first part is the confidence or score of each category; when there are c category confidences, the number of real detection categories is only c−1, since the first confidence denotes the background. During prediction, the category with the highest confidence is the category of the bounding box; in particular, when the first confidence value is the highest, the bounding box contains no target. The second part is the location of the bounding box, consisting of 4 values (cx, cy, w, h) that denote the center coordinates, width and height of the bounding box. For a feature map of size m×n there are m·n cells in total; if the number of prior boxes per cell is k, each cell needs (c+4)k predicted values and all cells need (c+4)kmn predicted values; since the SSD used by the system performs detection with convolutions, detection on this feature map is completed by (c+4)k convolution kernels;
in order to keep the positive and negative samples as balanced as possible, the negative samples are subsampled: they are sorted in descending order of confidence error (the smaller the predicted background confidence, the larger the error), and the top-k samples with the largest errors are selected as the negative samples for training, so that the ratio of positive to negative samples is close to 1:3.
Further, the specific implementation method of step S4 is as follows: non-maximum suppression is adopted, comprising the following substeps:
s41, regarding the detection result obtained in the step S3 as a candidate set, sequencing the candidate set according to the confidence level for each category of targets, selecting the target with the highest confidence level, deleting the target from the candidate set, and adding the target into the detection result set;
s42, calculating the Jaccard overlap ratio between the elements in the candidate set and the target obtained in the S41, and deleting the elements corresponding to the candidate set with the Jaccard overlap ratio larger than a given threshold;
s43, repeating the steps S41 and S42 until the candidate set is empty, and outputting the result set as a final result.
The beneficial effects of the invention are as follows: (1) the SSD target detection algorithm adopted in the invention uses the regression idea of YOLO, which simplifies the computational complexity of the neural network and improves the real-time performance of the algorithm;
(2) the SSD target detection algorithm adopted in the invention uses the anchor mechanism of Faster R-CNN to extract hook features of different aspect ratios and sizes, and this local feature extraction method is more reasonable and effective for recognition;
(3) the SSD target detection algorithm adopted in the invention exploits the fact that features at different scales express different information: multi-scale target features are extracted and feature maps of different scales are used for detection, with large feature maps (the earlier layers) detecting small objects and small feature maps (the later layers) detecting large objects, which improves the robustness of detecting hoisting area pictures of different scales;
(4) the invention improves the accuracy of detecting whether the lifting hook is currently in a working state and whether personnel are present beneath the load while it is being hoisted.
Drawings
FIG. 1 is a flow chart of a method for dynamically monitoring a hoisting area based on a target detection algorithm;
FIG. 2 is a diagram of a conventional VGG-16 network architecture;
FIG. 3 is a diagram of the SSD target detection network of the present invention;
FIG. 4 is a schematic view of a feature pyramid of the present invention;
FIG. 5 is a graph showing the detection results of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the method for dynamically monitoring the hoisting area based on the target detection algorithm comprises the following steps:
S1, performing data enhancement on an input image of the hoisting area;
the input image of each lifting area is randomly sampled by one of the following three methods:
(1) Using the whole image, namely an acquired original image of the hoisting area;
(2) Randomly cropping on the original image;
(3) Random cropping with a Jaccard overlap constraint; the Jaccard overlap is calculated as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where A and B denote the sets of all ground-truth boxes in the original image and in the cropped image, respectively; the size of the crop is between 0.1 and 0.9 of the original picture, and its aspect ratio is between 1/2 and 2;
the input image is resized to a uniform size and flipped horizontally with a probability of 0.5.
This approach increases the number of training samples and constructs more targets of different shapes and sizes as network input, so that the network can learn more robust features and the performance of the subsequent algorithm improves; the resulting system is more robust to target translation and to targets of different sizes and aspect ratios.
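For illustration, a minimal Python sketch of this random-sampling step is given below; the function names, the 50-trial limit and the 0.5 Jaccard threshold are assumptions for illustration, since the patent only fixes the 0.1-0.9 size range and the 1/2-2 aspect-ratio range.

```python
import random

def jaccard(box_a, box_b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_patch(img_w, img_h, gt_boxes, min_jaccard=0.5, max_trials=50):
    """Sample a crop whose size is 0.1-0.9 of the original image, whose aspect
    ratio lies in [1/2, 2], and whose Jaccard overlap with at least one
    ground-truth box reaches min_jaccard; fall back to the whole image."""
    for _ in range(max_trials):
        scale = random.uniform(0.1, 0.9)       # crop size relative to the original
        ratio = random.uniform(0.5, 2.0)       # aspect ratio of the crop
        w = img_w * scale * ratio ** 0.5
        h = img_h * scale / ratio ** 0.5
        if w > img_w or h > img_h:
            continue
        x = random.uniform(0, img_w - w)
        y = random.uniform(0, img_h - h)
        patch = (x, y, x + w, y + h)
        if any(jaccard(patch, gt) >= min_jaccard for gt in gt_boxes):
            return patch
    return (0.0, 0.0, float(img_w), float(img_h))
```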
S2, extracting features from the image obtained in step S1 with an SSD target detection network;
The SSD target detection network is based on the VGG-16 network; the conventional VGG-16 network structure is shown in FIG. 2, and the SSD network of the present invention is shown in FIG. 3. Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv3_3, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2 and Conv5_3 (512) are likewise used for training;
FC6 and FC7 are changed from the original fully connected layers into a 3×3×1024 convolution and a 1×1×1024 convolution, and the additional layers comprise Conv6_1, Conv6_2, Conv7_1, Conv7_2, Conv8_1, Conv8_2, Conv9_1 and Conv9_2;
at the same time, the pooling layer Pool5 is changed from the original 2×2 with stride=2 to 3×3 with stride=1;
based on the Atrous algorithm, Conv6 adopts dilated (atrous) convolution, which expands the receptive field of the convolution exponentially without increasing the parameters or the model complexity, and uses a dilation rate parameter to indicate the amount of expansion;
the Conv4_3 layer is used as the first feature map for detection; the Conv4_3 feature map size is 38×38, but this layer is relatively shallow and its feature norm is relatively large, so an L2 normalization layer is added after Conv4_3 to ensure that it does not differ too much from the later detection layers.
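As an illustration, these layer modifications can be sketched in PyTorch as follows; the dilation rate of 6 and the initial L2Norm scale of 20 follow the public SSD300 reference implementation and are assumptions, since the patent does not give these values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2Norm(nn.Module):
    """L2-normalizes the Conv4_3 feature map along the channel axis and
    rescales it with a learnable per-channel weight."""
    def __init__(self, channels=512, scale=20.0):
        super().__init__()
        self.weight = nn.Parameter(torch.full((channels,), scale))

    def forward(self, x):
        x = F.normalize(x, p=2, dim=1)           # per-position L2 normalization
        return x * self.weight.view(1, -1, 1, 1)

# Pool5: the original 2x2 / stride 2 pooling becomes 3x3 / stride 1,
# so the spatial resolution is no longer halved at this point.
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

# FC6 / FC7 recast as convolutions; Conv6 is a dilated (atrous) convolution,
# which enlarges the receptive field without adding parameters.
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
```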
S3, extracting feature maps, constructing six prior boxes of different scales at each point of each feature map, and then regressing the categories and positions respectively;
The specific implementation method is as follows: the 19×19 Conv7, 10×10 Conv8_2, 5×5 Conv9_2, 3×3 Conv10_2 and 1×1 Conv11_2 layers are extracted from the convolutional layers as feature maps for detection; together with the Conv4_3 layer, 6 feature maps are extracted in total, forming a pyramid feature structure as shown in FIG. 4; six prior boxes of different scales are constructed at each point of the 6 feature maps, and then the categories and positions are regressed respectively;
the final detection result is shown in fig. 5, and the specific procedure is as follows: SSD uses the idea of anchor in the fast R-CNN to utilize each unit to set up the priori frame that the scale or length-width ratio are different, and the boundary frame (the prediction boxes) is based on these priori frames, reduces the training degree of difficulty to a certain extent. Typically, each cell will be provided with a number of a priori boxes, which vary in scale and aspect ratio. The number of a priori boxes set up by different feature maps is different (the a priori boxes set up by each cell on the same feature map are the same). The setting of the prior box includes two aspects of scale (or size) and aspect ratio. For the scale of the a priori block, it obeys a linear increasing rule: as the feature map size decreases, the a priori frame dimensions increase linearly.
The specific method is as follows: six feature maps of different sizes are obtained with the multi-scale method; if the system uses m layers of feature maps for detection, the prior box scale of the k-th feature map is calculated as:

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \qquad k \in [1, m]$$

where m is the number of feature maps, set to 5 in this embodiment because the first layer (the Conv4_3 layer) is set separately; $s_k$ is the ratio of the prior box size to the picture size; and $s_{min}$ and $s_{max}$ are the minimum and maximum values of this ratio, 0.2 and 0.9 respectively. For the first feature map, the scale ratio of the prior box is set to 0.1, i.e. a scale of 30; for the following feature maps, the prior box scale grows linearly according to the above formula, but the ratio is first enlarged by a factor of 100 and the increment step is 17, so that $s_k$ for each feature map is 20, 37, 54, 71 and 88; dividing these values by 100 and multiplying by the picture size gives the scale of each feature map. For the aspect ratio, the following values are selected:

$$a_r \in \{1,\; 2,\; 3,\; \tfrac{1}{2},\; \tfrac{1}{3}\}$$

For a given aspect ratio, the width and height of the prior box are calculated as:

$$w_k^a = s_k \sqrt{a_r}, \qquad h_k^a = \frac{s_k}{\sqrt{a_r}}$$

In addition to the prior box with $a_r = 1$ and scale $s_k$, each feature map in the fused multi-scale feature module is also given a prior box with $a_r = 1$ and scale

$$s_k' = \sqrt{s_k\, s_{k+1}}$$

so that each feature map has two square prior boxes with aspect ratio 1 but different sizes. Furthermore, the center point of the prior boxes of each cell is placed at the center of that cell, i.e.

$$\left(\frac{a + 0.5}{|f_k|},\; \frac{b + 0.5}{|f_k|}\right), \qquad a, b \in [0, |f_k|]$$

where $|f_k|$ is the size of the k-th feature map, and the prior box coordinates are clipped to lie within [0, 1]. The mapping between the prior box coordinates on the feature map and the coordinates in the original image is:

$$x_{min} = \frac{c_x - w_b/2}{w_{feature}} \cdot w_{img}, \qquad y_{min} = \frac{c_y - h_b/2}{h_{feature}} \cdot h_{img}$$

$$x_{max} = \frac{c_x + w_b/2}{w_{feature}} \cdot w_{img}, \qquad y_{max} = \frac{c_y + h_b/2}{h_{feature}} \cdot h_{img}$$

where $(c_x, c_y)$ are the coordinates of the prior box center on the feature layer; $w_b, h_b$ are the width and height of the prior box; $w_{feature}, h_{feature}$ are the width and height of the feature layer; and $w_{img}, h_{img}$ are the width and height of the original image. The resulting $(x_{min}, y_{min}, x_{max}, y_{max})$ are the object box coordinates obtained by mapping the prior box with center $\left(\frac{a+0.5}{|f_k|}, \frac{b+0.5}{|f_k|}\right)$ and size $w_k, h_k$ on the k-th feature map back to the original image.
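The prior-box construction described above can be sketched as follows (a minimal Python example; the layer sizes, the extra scale used for the last layer and the normalization to [0, 1] are assumptions consistent with the values in the text):

```python
import itertools
import math

def ssd_prior_boxes(feat_sizes=(38, 19, 10, 5, 3, 1),
                    scales=(0.10, 0.20, 0.37, 0.54, 0.71, 0.88),
                    extra_scale=1.05,
                    aspect_ratios=(2.0, 3.0)):
    """Generate prior boxes as (cx, cy, w, h), normalized to [0, 1], for the
    six detection feature maps; multiply by the input size (e.g. 300) to get
    pixel coordinates. Each cell receives six boxes: two squares (scales s_k
    and sqrt(s_k * s_{k+1})) and four boxes with aspect ratios 2, 3, 1/2, 1/3."""
    priors = []
    for k, f in enumerate(feat_sizes):
        s_k = scales[k]
        s_k1 = scales[k + 1] if k + 1 < len(scales) else extra_scale
        for i, j in itertools.product(range(f), repeat=2):
            cx, cy = (j + 0.5) / f, (i + 0.5) / f        # cell center
            priors.append((cx, cy, s_k, s_k))            # a_r = 1, scale s_k
            s_prime = math.sqrt(s_k * s_k1)              # second square box
            priors.append((cx, cy, s_prime, s_prime))
            for ar in aspect_ratios:                     # a_r = 2, 3 and 1/2, 1/3
                priors.append((cx, cy, s_k * math.sqrt(ar), s_k / math.sqrt(ar)))
                priors.append((cx, cy, s_k / math.sqrt(ar), s_k * math.sqrt(ar)))
    # clip all values to [0, 1], a simplification of the clipping described above
    return [tuple(min(max(v, 0.0), 1.0) for v in p) for p in priors]
```

With the assumed layer sizes 38, 19, 10, 5, 3 and 1, this yields (38² + 19² + 10² + 5² + 3² + 1²) × 6 = 11640 prior boxes in total.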
Regression is performed on the position and the target category simultaneously on each output feature map; the target loss function is the weighted sum of the confidence (classification) loss and the location loss:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N is the total number of matched positive samples (if N = 0, L is set to 0); x and c are the classification indicator and confidence, respectively; l and g are the predicted box and the ground-truth box, respectively; α is the weight of the location loss; $L_{conf}(x, c)$ is the confidence loss function; and $L_{loc}(x, l, g)$ is the location loss function;

the location loss is the Smooth L1 loss between the predicted box l and the ground-truth box g:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right)$$

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \qquad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}$$

$$\hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \qquad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}$$

where Pos denotes the positive samples; $x_{ij}^{p}$ is an indicator that equals 1 when the i-th predicted box is matched to the j-th ground-truth box of class p, and 0 otherwise; cx, cy, w and h denote the center-point x coordinate, center-point y coordinate, width and height of a box; d is the prior box (preset by the network itself), l is the predicted box (the network output, i.e. the prior box plus the predicted offsets), and g is the GT box (the ground-truth box from the dataset annotation); $l_i^{m}$ ($m \in \{cx, cy, w, h\}$) are the offsets predicted for the center-point x coordinate, center-point y coordinate, width and height of the predicted box; $g_j^{cx}, g_j^{cy}, g_j^{w}, g_j^{h}$ are the center-point x coordinate, center-point y coordinate, width and height of the ground-truth box; and $\hat{g}_j^{cx}, \hat{g}_j^{cy}, \hat{g}_j^{w}, \hat{g}_j^{h}$ are, respectively, the offset of the center coordinate cx, the offset of the center coordinate cy, the scaling of the width w and the scaling of the height h of the ground-truth box relative to the prior box;
the classification loss is the softmax loss over the class confidences:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right)$$

$$\hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}$$

where $\hat{c}_i^{p}$ is the softmax probability that predicted box i belongs to class p, $\hat{c}_i^{0}$ is the probability that it is background, and $x_{ij}^{p}$ equals 1 when the i-th predicted box is matched to the j-th ground-truth box of class p and 0 otherwise (i.e., when the i-th predicted box has no matching ground-truth box); the classification loss covers both the positive samples Pos and the negative samples Neg;
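A minimal PyTorch sketch of this combined loss is given below; hard negative mining is left to the separate sketch further on, and the tensor shapes and the default α = 1 are assumptions.

```python
import torch
import torch.nn.functional as F

def multibox_loss(loc_pred, conf_pred, loc_target, labels, alpha=1.0):
    """SSD target loss: Smooth L1 location loss over matched (positive) priors
    plus softmax confidence loss, divided by the number N of matched priors.
    loc_pred / loc_target: (batch, num_priors, 4) encoded offsets;
    conf_pred: (batch, num_priors, num_classes); labels: (batch, num_priors),
    where label 0 is the background class."""
    pos = labels > 0                                   # matched (positive) priors
    num_pos = pos.sum().clamp(min=1).float()           # N, guarded against N = 0

    # location loss: Smooth L1 between predicted and encoded ground-truth offsets
    loc_loss = F.smooth_l1_loss(loc_pred[pos], loc_target[pos], reduction='sum')

    # confidence loss: softmax cross-entropy (here over all priors; in training
    # the negatives would be restricted by hard negative mining)
    conf_loss = F.cross_entropy(conf_pred.reshape(-1, conf_pred.size(-1)),
                                labels.reshape(-1), reduction='sum')

    return (conf_loss + alpha * loc_loss) / num_pos
```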
in order to predict the detection result, a set of independent detection values is output for each prior box of each cell, corresponding to one bounding box, and divided into two main parts: the first part is the confidence or score of each category; when there are c category confidences, the number of real detection categories is only c−1, since the first confidence denotes the background. During prediction, the category with the highest confidence is the category of the bounding box; in particular, when the first confidence value is the highest, the bounding box contains no target. The second part is the location of the bounding box, consisting of 4 values (cx, cy, w, h) that denote the center coordinates, width and height of the bounding box. For a feature map of size m×n there are m·n cells in total; if the number of prior boxes per cell is k, each cell needs (c+4)k predicted values and all cells need (c+4)kmn predicted values; since the SSD used by the system performs detection with convolutions, detection on this feature map is completed by (c+4)k convolution kernels;
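A sketch of the (c+4)k convolutional prediction heads for one feature map follows; the channel counts and the class count of 3 (e.g. background, hook, person) are assumptions for illustration.

```python
import torch.nn as nn

def detection_heads(in_channels, num_priors=6, num_classes=3):
    """Per-feature-map prediction heads: a 3x3 convolution producing 4*k
    location values and another producing c*k confidences, i.e. (c+4)*k
    values per cell, where k = num_priors and c = num_classes."""
    loc = nn.Conv2d(in_channels, num_priors * 4, kernel_size=3, padding=1)
    conf = nn.Conv2d(in_channels, num_priors * num_classes, kernel_size=3, padding=1)
    return loc, conf
```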
in order to keep the positive and negative samples as balanced as possible, the negative samples are subsampled: they are sorted in descending order of confidence error (the smaller the predicted background confidence, the larger the error), and the top-k samples with the largest errors are selected as the negative samples for training, so that the ratio of positive to negative samples is close to 1:3.
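A hedged sketch of this hard negative mining (ranking negatives by their background loss, as in common SSD implementations of the confidence-error criterion described above):

```python
import torch
import torch.nn.functional as F

def hard_negative_mining(conf_pred, labels, neg_pos_ratio=3):
    """Keep all positive priors and the hardest negatives (largest background
    loss, i.e. smallest predicted background confidence), so that negatives are
    at most neg_pos_ratio times the positives. Returns a boolean keep-mask
    of shape (batch, num_priors)."""
    with torch.no_grad():
        loss = -F.log_softmax(conf_pred, dim=-1)[..., 0]   # per-prior background loss
        pos = labels > 0
        loss[pos] = 0.0                                     # positives are always kept
        _, idx = loss.sort(dim=-1, descending=True)         # rank priors by loss
        _, rank = idx.sort(dim=-1)
        num_neg = neg_pos_ratio * pos.sum(dim=-1, keepdim=True)
        neg = rank < num_neg
    return pos | neg
```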
S4, screening the result obtained in S3 with non-maximum suppression to obtain the output result;
For each prediction box, its category (the category with the highest confidence) and confidence value are first determined from the class confidences, and prediction boxes belonging to the background are filtered out. Prediction boxes below a confidence threshold (e.g., 0.5) are then filtered out. The remaining prediction boxes are decoded, i.e. their real position parameters are obtained from the prior boxes. After decoding, the boxes are sorted in descending order of confidence and only the top-k (e.g., 400) prediction boxes are kept. Finally, prediction boxes with large Jaccard overlap are filtered out by non-maximum suppression, and the remaining prediction boxes are the detection result.
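The decoding step mentioned above (recovering real box positions from the prior boxes) can be sketched as follows; the variance constants 0.1 and 0.2 are the usual SSD defaults and are an assumption, since the text does not give them.

```python
import torch

def decode(loc, priors, variances=(0.1, 0.2)):
    """Decode predicted offsets into (xmin, ymin, xmax, ymax) boxes in [0, 1].
    loc: (num_priors, 4) predicted offsets; priors: (num_priors, 4) as (cx, cy, w, h)."""
    cxcy = priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:]   # shift the center
    wh = priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])          # rescale w and h
    boxes = torch.cat([cxcy - wh / 2, cxcy + wh / 2], dim=1)           # corner form
    return boxes.clamp(0.0, 1.0)
```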
The specific implementation method is as follows: non-maximum suppression is adopted, comprising the following substeps:
s41, regarding the detection result obtained in the step S3 as a candidate set, sequencing the candidate set according to the confidence level for each category of targets, selecting the target with the highest confidence level, deleting the target from the candidate set, and adding the target into the detection result set;
s42, calculating the Jaccard overlap ratio between the elements in the candidate set and the target obtained in the S41, and deleting the elements corresponding to the candidate set with the Jaccard overlap ratio larger than a given threshold;
s43, repeating the steps S41 and S42 until the candidate set is empty, and outputting the result set as a final result.
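Steps S41-S43 correspond to the following greedy non-maximum suppression sketch; it reuses the jaccard() helper from the data-enhancement sketch, and the 0.5 threshold is an assumed default.

```python
def nms(detections, iou_threshold=0.5):
    """Greedy NMS over one class. `detections` is a list of (box, score) pairs
    with boxes as (xmin, ymin, xmax, ymax); returns the kept detections."""
    candidates = sorted(detections, key=lambda d: d[1], reverse=True)  # S41: sort by confidence
    keep = []
    while candidates:                                                  # S43: repeat until empty
        best = candidates.pop(0)                                       # S41: take the best
        keep.append(best)
        candidates = [d for d in candidates                            # S42: drop large overlaps
                      if jaccard(best[0], d[0]) <= iou_threshold]
    return keep
```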
Those of ordinary skill in the art will recognize that the embodiments described herein are for the purpose of aiding the reader in understanding the principles of the present invention and should be understood that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.

Claims (3)

1. A hoisting area dynamic monitoring method based on a target detection algorithm, characterized by comprising the following steps:
S1, performing data enhancement on an input image of the hoisting area;
S2, extracting features from the image obtained in step S1 with an SSD target detection network; the SSD target detection network is based on the VGG-16 network, and Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv3_3, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2 and Conv5_3 (512) are used for training;
FC6 and FC7 are changed from the original fully connected layers into a 3×3×1024 convolution and a 1×1×1024 convolution, and the additional layers comprise Conv6_1, Conv6_2, Conv7_1, Conv7_2, Conv8_1, Conv8_2, Conv9_1 and Conv9_2;
at the same time, the pooling layer Pool5 is changed from the original 2×2 with stride=2 to 3×3 with stride=1;
based on the Atrous algorithm, Conv6 adopts dilated (atrous) convolution, which expands the receptive field of the convolution exponentially without increasing the parameters or the model complexity, and uses a dilation rate parameter to indicate the amount of expansion;
the Conv4_3 layer is used as the first feature map for detection; the Conv4_3 feature map size is 38×38, and an L2 normalization layer is added after the Conv4_3 layer;
S3, extracting feature maps, constructing six prior boxes of different scales at each point of each feature map, and then regressing the categories and positions respectively; the specific implementation method is as follows: the 19×19 Conv7, 10×10 Conv8_2, 5×5 Conv9_2, 3×3 Conv10_2 and 1×1 Conv11_2 layers are extracted from the convolutional layers as feature maps for detection; together with the Conv4_3 layer, 6 feature maps are extracted in total; six prior boxes of different scales are constructed at each point of the 6 feature maps, and then the categories and positions are regressed respectively;

the specific method is as follows: six feature maps of different sizes are obtained with the multi-scale method; if the system uses m layers of feature maps for detection, the prior box scale of the k-th feature map is calculated as:

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \qquad k \in [1, m]$$

where m is the number of feature maps, $s_k$ is the ratio of the prior box size to the picture size, and $s_{min}$ and $s_{max}$ are the minimum and maximum values of this ratio; for the first feature map, the scale ratio of the prior box is set to 0.1, i.e. a scale of 30; for the following feature maps, the prior box scale grows linearly according to the above formula, but the ratio is first enlarged by a factor of 100 and the increment step is 17, so that $s_k$ for each feature map is 20, 37, 54, 71 and 88; dividing these values by 100 and multiplying by the picture size gives the scale of each feature map; for the aspect ratio, the following values are selected:

$$a_r \in \{1,\; 2,\; 3,\; \tfrac{1}{2},\; \tfrac{1}{3}\}$$

for a given aspect ratio, the width and height of the prior box are calculated as:

$$w_k^a = s_k \sqrt{a_r}, \qquad h_k^a = \frac{s_k}{\sqrt{a_r}}$$

in addition to the prior box with $a_r = 1$ and scale $s_k$, each feature map in the fused multi-scale feature module is also given a prior box with $a_r = 1$ and scale

$$s_k' = \sqrt{s_k\, s_{k+1}}$$

so that each feature map has two square prior boxes with aspect ratio 1 but different sizes; furthermore, the center point of the prior boxes of each cell is placed at the center of that cell, i.e.

$$\left(\frac{a + 0.5}{|f_k|},\; \frac{b + 0.5}{|f_k|}\right), \qquad a, b \in [0, |f_k|]$$

where $|f_k|$ is the size of the k-th feature map, and the prior box coordinates are clipped to lie within [0, 1]; the mapping between the prior box coordinates on the feature map and the coordinates in the original image is:

$$x_{min} = \frac{c_x - w_b/2}{w_{feature}} \cdot w_{img}, \qquad y_{min} = \frac{c_y - h_b/2}{h_{feature}} \cdot h_{img}$$

$$x_{max} = \frac{c_x + w_b/2}{w_{feature}} \cdot w_{img}, \qquad y_{max} = \frac{c_y + h_b/2}{h_{feature}} \cdot h_{img}$$

where $(c_x, c_y)$ are the coordinates of the prior box center on the feature layer; $w_b, h_b$ are the width and height of the prior box; $w_{feature}, h_{feature}$ are the width and height of the feature layer; and $w_{img}, h_{img}$ are the width and height of the original image; the resulting $(x_{min}, y_{min}, x_{max}, y_{max})$ are the object box coordinates obtained by mapping the prior box with center $\left(\frac{a+0.5}{|f_k|}, \frac{b+0.5}{|f_k|}\right)$ and size $w_k, h_k$ on the k-th feature map back to the original image;
regression is performed on the position and the target category simultaneously on each output feature map; the target loss function is the weighted sum of the confidence loss and the location loss:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N is the total number of matched positive samples (if N = 0, L is set to 0); x and c are the classification indicator and confidence, respectively; l and g are the predicted box and the ground-truth box, respectively; α is the weight of the location loss; $L_{conf}(x, c)$ is the confidence loss function; and $L_{loc}(x, l, g)$ is the location loss function;

the location loss is the Smooth L1 loss between the predicted box l and the ground-truth box g:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right)$$

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \qquad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}$$

$$\hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \qquad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}$$

where Pos denotes the positive samples; $x_{ij}^{p}$ is an indicator that equals 1 when the i-th predicted box is matched to the j-th ground-truth box of class p, and 0 otherwise; cx, cy, w and h denote the center-point x coordinate, center-point y coordinate, width and height of a box; d is the prior box; $l_i^{m}$ ($m \in \{cx, cy, w, h\}$) are the offsets predicted for the center-point x coordinate, center-point y coordinate, width and height of the predicted box; $g_j^{cx}, g_j^{cy}, g_j^{w}, g_j^{h}$ are the center-point x coordinate, center-point y coordinate, width and height of the ground-truth box; and $\hat{g}_j^{cx}, \hat{g}_j^{cy}, \hat{g}_j^{w}, \hat{g}_j^{h}$ are, respectively, the offset of the center coordinate cx, the offset of the center coordinate cy, the scaling of the width w and the scaling of the height h of the ground-truth box relative to the prior box;
the classification loss is the softmax loss over the class confidences:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right)$$

$$\hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}$$

where $\hat{c}_i^{p}$ is the softmax probability that predicted box i belongs to class p, $\hat{c}_i^{0}$ is the probability that it is background, and $x_{ij}^{p}$ equals 1 when the i-th predicted box is matched to the j-th ground-truth box of class p and 0 otherwise (i.e., when the i-th predicted box has no matching ground-truth box); the classification loss covers both the positive samples Pos and the negative samples Neg;
in order to predict the detection result, a set of independent detection values is output for each prior box of each cell, corresponding to one bounding box, and divided into two main parts: the first part is the confidence or score of each category; when there are c category confidences, the number of real detection categories is only c−1, since the first confidence denotes the background; during prediction, the category with the highest confidence is the category of the bounding box; in particular, when the first confidence value is the highest, the bounding box contains no target; the second part is the location of the bounding box, consisting of 4 values (cx, cy, w, h) that denote the center coordinates, width and height of the bounding box; for a feature map of size m×n there are m·n cells in total; if the number of prior boxes per cell is k, each cell needs (c+4)k predicted values and all cells need (c+4)kmn predicted values; since the SSD used by the system performs detection with convolutions, detection on this feature map is completed by (c+4)k convolution kernels;
in order to keep the positive and negative samples as balanced as possible, the negative samples are subsampled: they are sorted in descending order of confidence error, and the top-k samples with the largest errors are selected as the negative samples for training, so that the ratio of positive to negative samples is close to 1:3;
S4, screening the result obtained in S3 with non-maximum suppression to obtain the output result.
2. The method for dynamically monitoring the hoisting area based on the target detection algorithm according to claim 1, wherein the image of each hoisting area input in the step S1 is randomly sampled by one of the following three methods:
(1) Using the whole image, namely an acquired original image of the hoisting area;
(2) Randomly cropping on the original image;
(3) Random cropping with a Jaccard overlap constraint; the Jaccard overlap is calculated as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where A and B denote the sets of all ground-truth boxes in the original image and in the cropped image, respectively; the size of the crop is between 0.1 and 0.9 of the original picture, and its aspect ratio is between 1/2 and 2;
the input image is resized to a uniform size and flipped horizontally with a probability of 0.5.
3. The method for dynamically monitoring the hoisting area based on the target detection algorithm according to claim 1, wherein the specific implementation method of step S4 is as follows: non-maximum suppression is adopted, comprising the following substeps:
s41, regarding the detection result obtained in the step S3 as a candidate set, sequencing the candidate set according to the confidence level for each category of targets, selecting the target with the highest confidence level, deleting the target from the candidate set, and adding the target into the detection result set;
s42, calculating the Jaccard overlap ratio between the elements in the candidate set and the target obtained in the S41, and deleting the elements corresponding to the candidate set with the Jaccard overlap ratio larger than a given threshold;
s43, repeating the steps S41 and S42 until the candidate set is empty, and outputting the result set as a final result.
CN202010528652.8A 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm Active CN111753682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010528652.8A CN111753682B (en) 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010528652.8A CN111753682B (en) 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm

Publications (2)

Publication Number Publication Date
CN111753682A CN111753682A (en) 2020-10-09
CN111753682B true CN111753682B (en) 2023-05-23

Family

ID=72675082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010528652.8A Active CN111753682B (en) 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm

Country Status (1)

Country Link
CN (1) CN111753682B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215308B (en) * 2020-12-13 2021-03-30 之江实验室 Single-order detection method and device for hoisted object, electronic equipment and storage medium
CN112614121A (en) * 2020-12-29 2021-04-06 国网青海省电力公司海南供电公司 Multi-scale small-target equipment defect identification and monitoring method
CN112733671A (en) * 2020-12-31 2021-04-30 新大陆数字技术股份有限公司 Pedestrian detection method, device and readable storage medium
CN113158752A (en) * 2021-02-05 2021-07-23 国网河南省电力公司鹤壁供电公司 Intelligent safety management and control system for electric power staff approach operation
CN112560825B (en) * 2021-02-23 2021-05-18 北京澎思科技有限公司 Face detection method and device, electronic equipment and readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027547A (en) * 2019-12-06 2020-04-17 南京大学 Automatic detection method for multi-scale polymorphic target in two-dimensional image

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423760A (en) * 2017-07-21 2017-12-01 西安电子科技大学 Based on pre-segmentation and the deep learning object detection method returned
CN110580487A (en) * 2018-06-08 2019-12-17 Oppo广东移动通信有限公司 Neural network training method, neural network construction method, image processing method and device
CN109886359B (en) * 2019-03-25 2021-03-16 西安电子科技大学 Small target detection method and detection system based on convolutional neural network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027547A (en) * 2019-12-06 2020-04-17 南京大学 Automatic detection method for multi-scale polymorphic target in two-dimensional image

Also Published As

Publication number Publication date
CN111753682A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
CN111753682B (en) Hoisting area dynamic monitoring method based on target detection algorithm
CN111310861B (en) License plate recognition and positioning method based on deep neural network
CN110084292B (en) Target detection method based on DenseNet and multi-scale feature fusion
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN110991311B (en) Target detection method based on dense connection deep network
CN110796048B (en) Ship target real-time detection method based on deep neural network
CN111626128A (en) Improved YOLOv 3-based pedestrian detection method in orchard environment
CN111079739B (en) Multi-scale attention feature detection method
CN110991444B (en) License plate recognition method and device for complex scene
CN113850242B (en) Storage abnormal target detection method and system based on deep learning algorithm
CN114627052A (en) Infrared image air leakage and liquid leakage detection method and system based on deep learning
CN111462140B (en) Real-time image instance segmentation method based on block stitching
CN111079604A (en) Method for quickly detecting tiny target facing large-scale remote sensing image
CN105528575A (en) Sky detection algorithm based on context inference
CN111898419B (en) Partitioned landslide detection system and method based on cascaded deep convolutional neural network
CN110334656A (en) Multi-source Remote Sensing Images Clean water withdraw method and device based on information source probability weight
CN113888461A (en) Method, system and equipment for detecting defects of hardware parts based on deep learning
CN111008994A (en) Moving target real-time detection and tracking system and method based on MPSoC
CN115424017B (en) Building inner and outer contour segmentation method, device and storage medium
CN114022408A (en) Remote sensing image cloud detection method based on multi-scale convolution neural network
Zheng et al. Building recognition of UAV remote sensing images by deep learning
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN115578615A (en) Night traffic sign image detection model establishing method based on deep learning
Zhao et al. Boundary regularized building footprint extraction from satellite images using deep neural network
Hang et al. CNN based detection of building roofs from high resolution satellite images

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant