CN111753682B - Hoisting area dynamic monitoring method based on target detection algorithm - Google Patents

Hoisting area dynamic monitoring method based on target detection algorithm

Info

Publication number
CN111753682B
CN111753682B (application number CN202010528652.8A)
Authority
CN
China
Prior art keywords
frame
feature
target
feature map
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010528652.8A
Other languages
Chinese (zh)
Other versions
CN111753682A (en)
Inventor
马士伟
杨超
赵焕
王建
乐文
段钢
黄希
李炳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Underground Space Co Ltd
Original Assignee
China Construction Underground Space Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Underground Space Co Ltd
Priority to CN202010528652.8A
Publication of CN111753682A
Application granted
Publication of CN111753682B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07: Target detection

Abstract

The invention discloses a hoisting area dynamic monitoring method based on a target detection algorithm, which comprises the following steps: S1, performing data enhancement on an input image of the hoisting area; S2, extracting features from the image obtained in step S1 with an SSD target detection network; S3, extracting feature maps, constructing six prior boxes of different scales at each point of each feature map, and then regressing the categories and positions respectively; S4, screening the result obtained in S3 with non-maximum suppression to obtain the output result. The SSD target detection algorithm exploits the fact that features at different scales express different information: multi-scale target features are extracted and feature maps of different scales are used for detection, which improves the robustness of detecting hoisting area pictures of different scales and improves the accuracy of detecting whether the lifting hook is in a working state and whether personnel are present beneath the hoisted load.

Description

Hoisting area dynamic monitoring method based on target detection algorithm
Technical Field
The invention belongs to the field of computer vision and image processing, and particularly relates to a hoisting area dynamic monitoring method based on a target detection algorithm.
Background
In recent years, target detection has become an important research direction and hotspot in the fields of computer vision and image processing, with applications in unmanned aerial vehicles, robot navigation, intelligent video surveillance, industrial inspection, aerospace and other fields. Target detection is also a core part of intelligent surveillance systems and plays a vital role in subsequent tasks such as face recognition, gait recognition, crowd counting and instance segmentation. Before the emergence of deep learning, target detection was mainly carried out by building mathematical models from prior knowledge. With the wide application of deep learning in recent years, however, target detection algorithms have developed rapidly, and both the accuracy and the robustness of detection have improved. The advantage of deep-learning-based detection models is that a deep neural network can autonomously learn features at different levels, so that, compared with traditional hand-crafted features, the learned features are richer and have stronger representational power. By design concept, these methods fall into two main classes: target detection algorithms based on region proposals and target detection algorithms based on end-to-end learning. Region-proposal-based methods first propose candidate regions for the possible positions of target objects in the image; representative methods include R-CNN (Region-CNN) and Fast R-CNN. End-to-end methods do not need to extract candidate regions in advance; representative methods are YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector). Their main idea is to sample uniformly and densely at different positions of the picture, possibly with different scales and aspect ratios, extract features with a CNN, and then perform classification and regression directly, so the whole process needs only one stage and is therefore fast. Among these algorithms, R-CNN is inefficient and occupies a large amount of disk space; although Fast R-CNN and Faster R-CNN improve on R-CNN, they still need to extract candidate regions before the subsequent feature extraction and classification. YOLO is fast, has a lower background false-detection rate than R-CNN-style methods and supports detection of unnatural images, but its object localization error is large, and when two objects fall into the same grid cell only one of them can be detected. In contrast, SSD offers relatively better detection performance, combining real-time speed with high accuracy.
SSD is a single-shot detection deep neural network that combines the regression idea of YOLO with the anchor mechanism of Faster R-CNN. Adopting the regression idea simplifies the computational complexity of the neural network and improves the real-time performance of the algorithm; the anchor mechanism allows features of different aspect ratios and sizes to be extracted, and this way of extracting local features is more reasonable and effective for recognition than YOLO's use of global features at a given position. In other words, regression is performed on multi-scale regional features at all positions of the full image, so the speed of YOLO is retained while the window predictions are as accurate as those of Faster R-CNN. In addition, SSD exploits the fact that features at different scales express different information by extracting multi-scale target features, which helps improve the robustness of detecting targets of different scales.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a hoisting area dynamic monitoring method based on a target detection algorithm. The method adopts the SSD target detection algorithm, which exploits the fact that features at different scales express different information: multi-scale target features are extracted and feature maps of different scales are used for detection, which improves the robustness of detecting hoisting area pictures of different scales and improves the accuracy of detecting whether the lifting hook is currently in a working state.
The aim of the invention is realized by the following technical scheme: a hoisting area dynamic monitoring method based on a target detection algorithm comprises the following steps:
S1, performing data enhancement on an input image of the hoisting area;
S2, extracting features from the image obtained in step S1 with an SSD target detection network;
S3, extracting feature maps, constructing six prior boxes of different scales at each point of each feature map, and then regressing the categories and positions respectively;
S4, screening the result obtained in S3 with non-maximum suppression to obtain the output result.
Further, the image of each hoisting area input in step S1 is randomly sampled by one of the following three methods:
(1) Using the whole image, namely an acquired original image of the hoisting area;
(2) Randomly cropping on the original image;
(3) Random cropping with a Jaccard overlap constraint; the Jaccard overlap is calculated as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where A and B denote the sets of all ground-truth boxes in the original image and in the cropped image, respectively; the size of the crop is between 0.1 and 0.9 of the original picture, and its aspect ratio is between 1/2 and 2;
the input image is resized to a uniform size and flipped horizontally with a probability of 0.5.
Further, the SSD target detection network is based on the VGG-16 network; Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv3_3, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2 and Conv5_3 (512) are used for training;
FC6 and FC7 are changed from the original fully connected layers into a 3×3×1024 convolution and a 1×1×1024 convolution, and the additional layers comprise Conv6_1, Conv6_2, Conv7_1, Conv7_2, Conv8_1, Conv8_2, Conv9_1 and Conv9_2;
at the same time, the pooling layer Pool5 is changed from the original 2×2 with stride=2 to 3×3 with stride=1;
based on the Atrous algorithm, Conv6 adopts dilated (atrous) convolution, which expands the receptive field of the convolution exponentially without increasing the parameters or the model complexity, and uses a dilation rate parameter to indicate the amount of expansion;
the Conv4_3 layer is used as the first feature map for detection; the Conv4_3 feature map size is 38×38, but this layer is relatively shallow and its feature norm is relatively large, so an L2 normalization layer is added after Conv4_3 to ensure that it does not differ too much from the later detection layers.
Further, the specific implementation method of step S3 is as follows: the 19×19 Conv7, 10×10 Conv8_2, 5×5 Conv9_2, 3×3 Conv10_2 and 1×1 Conv11_2 layers are extracted from the convolutional layers as feature maps for detection; together with the Conv4_3 layer, 6 feature maps are extracted in total; six prior boxes of different scales are constructed at each point of the 6 feature maps, and then the categories and positions are regressed respectively;

the specific method is as follows: six feature maps of different sizes are obtained with the multi-scale method; if the system uses m layers of feature maps for detection, the prior box scale of the k-th feature map is calculated as:

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \qquad k \in [1, m]$$

where m is the number of feature maps, $s_k$ is the ratio of the prior box size to the picture size, and $s_{min}$ and $s_{max}$ are the minimum and maximum values of this ratio. For the first feature map, the scale ratio of the prior box is set to 0.1, i.e. a scale of 30; for the following feature maps, the prior box scale grows linearly according to the above formula, but the ratio is first enlarged by a factor of 100 and the increment step is 17, so that $s_k$ for each feature map is 20, 37, 54, 71 and 88; dividing these values by 100 and multiplying by the picture size gives the scale of each feature map. For the aspect ratio, the following values are selected:

$$a_r \in \{1,\; 2,\; 3,\; \tfrac{1}{2},\; \tfrac{1}{3}\}$$

For a given aspect ratio, the width and height of the prior box are calculated as:

$$w_k^a = s_k \sqrt{a_r}, \qquad h_k^a = \frac{s_k}{\sqrt{a_r}}$$

In addition to the prior box with $a_r = 1$ and scale $s_k$, each feature map in the fused multi-scale feature module is also given a prior box with $a_r = 1$ and scale

$$s_k' = \sqrt{s_k\, s_{k+1}}$$

so that each feature map has two square prior boxes with aspect ratio 1 but different sizes. Furthermore, the center point of the prior boxes of each cell is placed at the center of that cell, i.e.

$$\left(\frac{a + 0.5}{|f_k|},\; \frac{b + 0.5}{|f_k|}\right), \qquad a, b \in [0, |f_k|]$$

where $|f_k|$ is the size of the k-th feature map, and the prior box coordinates are clipped to lie within [0, 1]. The mapping between the prior box coordinates on the feature map and the coordinates in the original image is:

$$x_{min} = \frac{c_x - w_b/2}{w_{feature}} \cdot w_{img}, \qquad y_{min} = \frac{c_y - h_b/2}{h_{feature}} \cdot h_{img}$$

$$x_{max} = \frac{c_x + w_b/2}{w_{feature}} \cdot w_{img}, \qquad y_{max} = \frac{c_y + h_b/2}{h_{feature}} \cdot h_{img}$$

where $(c_x, c_y)$ are the coordinates of the prior box center on the feature layer; $w_b, h_b$ are the width and height of the prior box; $w_{feature}, h_{feature}$ are the width and height of the feature layer; and $w_{img}, h_{img}$ are the width and height of the original image. The resulting $(x_{min}, y_{min}, x_{max}, y_{max})$ are the object box coordinates obtained by mapping the prior box with center $\left(\frac{a+0.5}{|f_k|}, \frac{b+0.5}{|f_k|}\right)$ and size $w_k, h_k$ on the k-th feature map back to the original image;
regression is performed on the position and the target category simultaneously on each output feature map; the target loss function is the weighted sum of the confidence (classification) loss and the location loss:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N is the total number of matched positive samples (if N = 0, L is set to 0); x and c are the classification indicator and confidence, respectively; l and g are the predicted box and the ground-truth box, respectively; α is the weight of the location loss; $L_{conf}(x, c)$ is the confidence loss function; and $L_{loc}(x, l, g)$ is the location loss function;

the location loss is the Smooth L1 loss between the predicted box l and the ground-truth box g:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right)$$

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \qquad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}$$

$$\hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \qquad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}$$

where Pos denotes the positive samples; $x_{ij}^{p}$ is an indicator that equals 1 when the i-th predicted box is matched to the j-th ground-truth box of class p, and 0 otherwise; cx, cy, w and h denote the center-point x coordinate, center-point y coordinate, width and height of a box; d is the prior box; $l_i^{m}$ ($m \in \{cx, cy, w, h\}$) are the offsets predicted for the center-point x coordinate, center-point y coordinate, width and height of the predicted box; $g_j^{cx}, g_j^{cy}, g_j^{w}, g_j^{h}$ are the center-point x coordinate, center-point y coordinate, width and height of the ground-truth box; and $\hat{g}_j^{cx}, \hat{g}_j^{cy}, \hat{g}_j^{w}, \hat{g}_j^{h}$ are, respectively, the offset of the center coordinate cx, the offset of the center coordinate cy, the scaling of the width w and the scaling of the height h of the ground-truth box relative to the prior box;
the classification loss is the softmax loss over the class confidences:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right)$$

$$\hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}$$

where $\hat{c}_i^{p}$ is the softmax probability that predicted box i belongs to class p, $\hat{c}_i^{0}$ is the probability that it is background, and $x_{ij}^{p}$ equals 1 when the i-th predicted box is matched to the j-th ground-truth box of class p and 0 otherwise (i.e., when the i-th predicted box has no matching ground-truth box); the classification loss covers both the positive samples Pos and the negative samples Neg;
in order to predict the detection result, a set of independent detection values is output for each prior box of each cell, corresponding to one bounding box, and divided into two main parts: the first part is the confidence or score of each category; when there are c category confidences, the number of real detection categories is only c−1, since the first confidence denotes the background. During prediction, the category with the highest confidence is the category of the bounding box; in particular, when the first confidence value is the highest, the bounding box contains no target. The second part is the location of the bounding box, consisting of 4 values (cx, cy, w, h) that denote the center coordinates, width and height of the bounding box. For a feature map of size m×n there are m·n cells in total; if the number of prior boxes per cell is k, each cell needs (c+4)k predicted values and all cells need (c+4)kmn predicted values; since the SSD used by the system performs detection with convolutions, detection on this feature map is completed by (c+4)k convolution kernels;
in order to keep the positive and negative samples as balanced as possible, the negative samples are subsampled: they are sorted in descending order of confidence error (the smaller the predicted background confidence, the larger the error), and the top-k samples with the largest errors are selected as the negative samples for training, so that the ratio of positive to negative samples is close to 1:3.
Further, the specific implementation method of step S4 is as follows: non-maximum suppression is adopted, comprising the following substeps:
s41, regarding the detection result obtained in the step S3 as a candidate set, sequencing the candidate set according to the confidence level for each category of targets, selecting the target with the highest confidence level, deleting the target from the candidate set, and adding the target into the detection result set;
s42, calculating the Jaccard overlap ratio between the elements in the candidate set and the target obtained in the S41, and deleting the elements corresponding to the candidate set with the Jaccard overlap ratio larger than a given threshold;
s43, repeating the steps S41 and S42 until the candidate set is empty, and outputting the result set as a final result.
The beneficial effects of the invention are as follows: (1) the SSD target detection algorithm adopted in the invention uses the regression idea of YOLO, which simplifies the computational complexity of the neural network and improves the real-time performance of the algorithm;
(2) the SSD target detection algorithm adopted in the invention uses the anchor mechanism of Faster R-CNN to extract hook features of different aspect ratios and sizes, and this local feature extraction method is more reasonable and effective for recognition;
(3) the SSD target detection algorithm adopted in the invention exploits the fact that features at different scales express different information: multi-scale target features are extracted and feature maps of different scales are used for detection, with large feature maps (the earlier layers) detecting small objects and small feature maps (the later layers) detecting large objects, which improves the robustness of detecting hoisting area pictures of different scales;
(4) the invention improves the accuracy of detecting whether the lifting hook is currently in a working state and whether personnel are present beneath the load while it is being hoisted.
Drawings
FIG. 1 is a flow chart of a method for dynamically monitoring a hoisting area based on a target detection algorithm;
FIG. 2 is a diagram of a conventional VGG-16 network architecture;
FIG. 3 is a diagram of the SSD target detection network of the present invention;
FIG. 4 is a schematic view of a feature pyramid of the present invention;
FIG. 5 is a graph showing the detection results of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the method for dynamically monitoring the hoisting area based on the target detection algorithm comprises the following steps:
S1, performing data enhancement on an input image of the hoisting area;
the input image of each lifting area is randomly sampled by one of the following three methods:
(1) Using the whole image, namely an acquired original image of the hoisting area;
(2) Randomly cropping on the original image;
(3) Random cropping with a Jaccard overlap constraint; the Jaccard overlap is calculated as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where A and B denote the sets of all ground-truth boxes in the original image and in the cropped image, respectively; the size of the crop is between 0.1 and 0.9 of the original picture, and its aspect ratio is between 1/2 and 2;
the input image is resized to a uniform size and flipped horizontally with a probability of 0.5.
This approach increases the number of training samples and constructs more targets of different shapes and sizes as network input, so that the network can learn more robust features and the performance of the subsequent algorithm improves; the resulting system is more robust to target translation and to targets of different sizes and aspect ratios.
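For illustration, a minimal Python sketch of this random-sampling step is given below; the function names, the 50-trial limit and the 0.5 Jaccard threshold are assumptions for illustration, since the patent only fixes the 0.1-0.9 size range and the 1/2-2 aspect-ratio range.

```python
import random

def jaccard(box_a, box_b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_patch(img_w, img_h, gt_boxes, min_jaccard=0.5, max_trials=50):
    """Sample a crop whose size is 0.1-0.9 of the original image, whose aspect
    ratio lies in [1/2, 2], and whose Jaccard overlap with at least one
    ground-truth box reaches min_jaccard; fall back to the whole image."""
    for _ in range(max_trials):
        scale = random.uniform(0.1, 0.9)       # crop size relative to the original
        ratio = random.uniform(0.5, 2.0)       # aspect ratio of the crop
        w = img_w * scale * ratio ** 0.5
        h = img_h * scale / ratio ** 0.5
        if w > img_w or h > img_h:
            continue
        x = random.uniform(0, img_w - w)
        y = random.uniform(0, img_h - h)
        patch = (x, y, x + w, y + h)
        if any(jaccard(patch, gt) >= min_jaccard for gt in gt_boxes):
            return patch
    return (0.0, 0.0, float(img_w), float(img_h))
```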
S2, extracting features from the image obtained in step S1 with an SSD target detection network;
The SSD target detection network is based on the VGG-16 network; the conventional VGG-16 network structure is shown in FIG. 2, and the SSD network of the present invention is shown in FIG. 3. Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv3_3, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2 and Conv5_3 (512) are likewise used for training;
FC6 and FC7 are changed from the original fully connected layers into a 3×3×1024 convolution and a 1×1×1024 convolution, and the additional layers comprise Conv6_1, Conv6_2, Conv7_1, Conv7_2, Conv8_1, Conv8_2, Conv9_1 and Conv9_2;
at the same time, the pooling layer Pool5 is changed from the original 2×2 with stride=2 to 3×3 with stride=1;
based on the Atrous algorithm, Conv6 adopts dilated (atrous) convolution, which expands the receptive field of the convolution exponentially without increasing the parameters or the model complexity, and uses a dilation rate parameter to indicate the amount of expansion;
the Conv4_3 layer is used as the first feature map for detection; the Conv4_3 feature map size is 38×38, but this layer is relatively shallow and its feature norm is relatively large, so an L2 normalization layer is added after Conv4_3 to ensure that it does not differ too much from the later detection layers.
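As an illustration, these layer modifications can be sketched in PyTorch as follows; the dilation rate of 6 and the initial L2Norm scale of 20 follow the public SSD300 reference implementation and are assumptions, since the patent does not give these values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2Norm(nn.Module):
    """L2-normalizes the Conv4_3 feature map along the channel axis and
    rescales it with a learnable per-channel weight."""
    def __init__(self, channels=512, scale=20.0):
        super().__init__()
        self.weight = nn.Parameter(torch.full((channels,), scale))

    def forward(self, x):
        x = F.normalize(x, p=2, dim=1)           # per-position L2 normalization
        return x * self.weight.view(1, -1, 1, 1)

# Pool5: the original 2x2 / stride 2 pooling becomes 3x3 / stride 1,
# so the spatial resolution is no longer halved at this point.
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

# FC6 / FC7 recast as convolutions; Conv6 is a dilated (atrous) convolution,
# which enlarges the receptive field without adding parameters.
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
```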
S3, extracting feature maps, constructing six prior boxes of different scales at each point of each feature map, and then regressing the categories and positions respectively;
The specific implementation method is as follows: the 19×19 Conv7, 10×10 Conv8_2, 5×5 Conv9_2, 3×3 Conv10_2 and 1×1 Conv11_2 layers are extracted from the convolutional layers as feature maps for detection; together with the Conv4_3 layer, 6 feature maps are extracted in total, forming a pyramid feature structure as shown in FIG. 4; six prior boxes of different scales are constructed at each point of the 6 feature maps, and then the categories and positions are regressed respectively;
the final detection result is shown in fig. 5, and the specific procedure is as follows: SSD uses the idea of anchor in the fast R-CNN to utilize each unit to set up the priori frame that the scale or length-width ratio are different, and the boundary frame (the prediction boxes) is based on these priori frames, reduces the training degree of difficulty to a certain extent. Typically, each cell will be provided with a number of a priori boxes, which vary in scale and aspect ratio. The number of a priori boxes set up by different feature maps is different (the a priori boxes set up by each cell on the same feature map are the same). The setting of the prior box includes two aspects of scale (or size) and aspect ratio. For the scale of the a priori block, it obeys a linear increasing rule: as the feature map size decreases, the a priori frame dimensions increase linearly.
The specific method is as follows: six feature maps of different sizes are obtained with the multi-scale method; if the system uses m layers of feature maps for detection, the prior box scale of the k-th feature map is calculated as:

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \qquad k \in [1, m]$$

where m is the number of feature maps, set to 5 in this embodiment because the first layer (the Conv4_3 layer) is set separately; $s_k$ is the ratio of the prior box size to the picture size; and $s_{min}$ and $s_{max}$ are the minimum and maximum values of this ratio, 0.2 and 0.9 respectively. For the first feature map, the scale ratio of the prior box is set to 0.1, i.e. a scale of 30; for the following feature maps, the prior box scale grows linearly according to the above formula, but the ratio is first enlarged by a factor of 100 and the increment step is 17, so that $s_k$ for each feature map is 20, 37, 54, 71 and 88; dividing these values by 100 and multiplying by the picture size gives the scale of each feature map. For the aspect ratio, the following values are selected:

$$a_r \in \{1,\; 2,\; 3,\; \tfrac{1}{2},\; \tfrac{1}{3}\}$$

For a given aspect ratio, the width and height of the prior box are calculated as:

$$w_k^a = s_k \sqrt{a_r}, \qquad h_k^a = \frac{s_k}{\sqrt{a_r}}$$

In addition to the prior box with $a_r = 1$ and scale $s_k$, each feature map in the fused multi-scale feature module is also given a prior box with $a_r = 1$ and scale

$$s_k' = \sqrt{s_k\, s_{k+1}}$$

so that each feature map has two square prior boxes with aspect ratio 1 but different sizes. Furthermore, the center point of the prior boxes of each cell is placed at the center of that cell, i.e.

$$\left(\frac{a + 0.5}{|f_k|},\; \frac{b + 0.5}{|f_k|}\right), \qquad a, b \in [0, |f_k|]$$

where $|f_k|$ is the size of the k-th feature map, and the prior box coordinates are clipped to lie within [0, 1]. The mapping between the prior box coordinates on the feature map and the coordinates in the original image is:

$$x_{min} = \frac{c_x - w_b/2}{w_{feature}} \cdot w_{img}, \qquad y_{min} = \frac{c_y - h_b/2}{h_{feature}} \cdot h_{img}$$

$$x_{max} = \frac{c_x + w_b/2}{w_{feature}} \cdot w_{img}, \qquad y_{max} = \frac{c_y + h_b/2}{h_{feature}} \cdot h_{img}$$

where $(c_x, c_y)$ are the coordinates of the prior box center on the feature layer; $w_b, h_b$ are the width and height of the prior box; $w_{feature}, h_{feature}$ are the width and height of the feature layer; and $w_{img}, h_{img}$ are the width and height of the original image. The resulting $(x_{min}, y_{min}, x_{max}, y_{max})$ are the object box coordinates obtained by mapping the prior box with center $\left(\frac{a+0.5}{|f_k|}, \frac{b+0.5}{|f_k|}\right)$ and size $w_k, h_k$ on the k-th feature map back to the original image.
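The prior-box construction described above can be sketched as follows (a minimal Python example; the layer sizes, the extra scale used for the last layer and the normalization to [0, 1] are assumptions consistent with the values in the text):

```python
import itertools
import math

def ssd_prior_boxes(feat_sizes=(38, 19, 10, 5, 3, 1),
                    scales=(0.10, 0.20, 0.37, 0.54, 0.71, 0.88),
                    extra_scale=1.05,
                    aspect_ratios=(2.0, 3.0)):
    """Generate prior boxes as (cx, cy, w, h), normalized to [0, 1], for the
    six detection feature maps; multiply by the input size (e.g. 300) to get
    pixel coordinates. Each cell receives six boxes: two squares (scales s_k
    and sqrt(s_k * s_{k+1})) and four boxes with aspect ratios 2, 3, 1/2, 1/3."""
    priors = []
    for k, f in enumerate(feat_sizes):
        s_k = scales[k]
        s_k1 = scales[k + 1] if k + 1 < len(scales) else extra_scale
        for i, j in itertools.product(range(f), repeat=2):
            cx, cy = (j + 0.5) / f, (i + 0.5) / f        # cell center
            priors.append((cx, cy, s_k, s_k))            # a_r = 1, scale s_k
            s_prime = math.sqrt(s_k * s_k1)              # second square box
            priors.append((cx, cy, s_prime, s_prime))
            for ar in aspect_ratios:                     # a_r = 2, 3 and 1/2, 1/3
                priors.append((cx, cy, s_k * math.sqrt(ar), s_k / math.sqrt(ar)))
                priors.append((cx, cy, s_k / math.sqrt(ar), s_k * math.sqrt(ar)))
    # clip all values to [0, 1], a simplification of the clipping described above
    return [tuple(min(max(v, 0.0), 1.0) for v in p) for p in priors]
```

With the assumed layer sizes 38, 19, 10, 5, 3 and 1, this yields (38² + 19² + 10² + 5² + 3² + 1²) × 6 = 11640 prior boxes in total.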
Regression is performed on the position and the target category simultaneously on each output feature map; the target loss function is the weighted sum of the confidence (classification) loss and the location loss:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N is the total number of matched positive samples (if N = 0, L is set to 0); x and c are the classification indicator and confidence, respectively; l and g are the predicted box and the ground-truth box, respectively; α is the weight of the location loss; $L_{conf}(x, c)$ is the confidence loss function; and $L_{loc}(x, l, g)$ is the location loss function;

the location loss is the Smooth L1 loss between the predicted box l and the ground-truth box g:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right)$$

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \qquad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}$$

$$\hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \qquad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}$$

where Pos denotes the positive samples; $x_{ij}^{p}$ is an indicator that equals 1 when the i-th predicted box is matched to the j-th ground-truth box of class p, and 0 otherwise; cx, cy, w and h denote the center-point x coordinate, center-point y coordinate, width and height of a box; d is the prior box (preset by the network itself), l is the predicted box (the network output, i.e. the prior box plus the predicted offsets), and g is the GT box (the ground-truth box from the dataset annotation); $l_i^{m}$ ($m \in \{cx, cy, w, h\}$) are the offsets predicted for the center-point x coordinate, center-point y coordinate, width and height of the predicted box; $g_j^{cx}, g_j^{cy}, g_j^{w}, g_j^{h}$ are the center-point x coordinate, center-point y coordinate, width and height of the ground-truth box; and $\hat{g}_j^{cx}, \hat{g}_j^{cy}, \hat{g}_j^{w}, \hat{g}_j^{h}$ are, respectively, the offset of the center coordinate cx, the offset of the center coordinate cy, the scaling of the width w and the scaling of the height h of the ground-truth box relative to the prior box;
the classification loss is the softmax loss over the class confidences:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right)$$

$$\hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}$$

where $\hat{c}_i^{p}$ is the softmax probability that predicted box i belongs to class p, $\hat{c}_i^{0}$ is the probability that it is background, and $x_{ij}^{p}$ equals 1 when the i-th predicted box is matched to the j-th ground-truth box of class p and 0 otherwise (i.e., when the i-th predicted box has no matching ground-truth box); the classification loss covers both the positive samples Pos and the negative samples Neg;
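A minimal PyTorch sketch of this combined loss is given below; hard negative mining is left to the separate sketch further on, and the tensor shapes and the default α = 1 are assumptions.

```python
import torch
import torch.nn.functional as F

def multibox_loss(loc_pred, conf_pred, loc_target, labels, alpha=1.0):
    """SSD target loss: Smooth L1 location loss over matched (positive) priors
    plus softmax confidence loss, divided by the number N of matched priors.
    loc_pred / loc_target: (batch, num_priors, 4) encoded offsets;
    conf_pred: (batch, num_priors, num_classes); labels: (batch, num_priors),
    where label 0 is the background class."""
    pos = labels > 0                                   # matched (positive) priors
    num_pos = pos.sum().clamp(min=1).float()           # N, guarded against N = 0

    # location loss: Smooth L1 between predicted and encoded ground-truth offsets
    loc_loss = F.smooth_l1_loss(loc_pred[pos], loc_target[pos], reduction='sum')

    # confidence loss: softmax cross-entropy (here over all priors; in training
    # the negatives would be restricted by hard negative mining)
    conf_loss = F.cross_entropy(conf_pred.reshape(-1, conf_pred.size(-1)),
                                labels.reshape(-1), reduction='sum')

    return (conf_loss + alpha * loc_loss) / num_pos
```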
in order to predict the detection result, a set of independent detection values is output for each prior box of each cell, corresponding to one bounding box, and divided into two main parts: the first part is the confidence or score of each category; when there are c category confidences, the number of real detection categories is only c−1, since the first confidence denotes the background. During prediction, the category with the highest confidence is the category of the bounding box; in particular, when the first confidence value is the highest, the bounding box contains no target. The second part is the location of the bounding box, consisting of 4 values (cx, cy, w, h) that denote the center coordinates, width and height of the bounding box. For a feature map of size m×n there are m·n cells in total; if the number of prior boxes per cell is k, each cell needs (c+4)k predicted values and all cells need (c+4)kmn predicted values; since the SSD used by the system performs detection with convolutions, detection on this feature map is completed by (c+4)k convolution kernels;
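A sketch of the (c+4)k convolutional prediction heads for one feature map follows; the channel counts and the class count of 3 (e.g. background, hook, person) are assumptions for illustration.

```python
import torch.nn as nn

def detection_heads(in_channels, num_priors=6, num_classes=3):
    """Per-feature-map prediction heads: a 3x3 convolution producing 4*k
    location values and another producing c*k confidences, i.e. (c+4)*k
    values per cell, where k = num_priors and c = num_classes."""
    loc = nn.Conv2d(in_channels, num_priors * 4, kernel_size=3, padding=1)
    conf = nn.Conv2d(in_channels, num_priors * num_classes, kernel_size=3, padding=1)
    return loc, conf
```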
in order to keep the positive and negative samples as balanced as possible, the negative samples are subsampled: they are sorted in descending order of confidence error (the smaller the predicted background confidence, the larger the error), and the top-k samples with the largest errors are selected as the negative samples for training, so that the ratio of positive to negative samples is close to 1:3.
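A hedged sketch of this hard negative mining (ranking negatives by their background loss, as in common SSD implementations of the confidence-error criterion described above):

```python
import torch
import torch.nn.functional as F

def hard_negative_mining(conf_pred, labels, neg_pos_ratio=3):
    """Keep all positive priors and the hardest negatives (largest background
    loss, i.e. smallest predicted background confidence), so that negatives are
    at most neg_pos_ratio times the positives. Returns a boolean keep-mask
    of shape (batch, num_priors)."""
    with torch.no_grad():
        loss = -F.log_softmax(conf_pred, dim=-1)[..., 0]   # per-prior background loss
        pos = labels > 0
        loss[pos] = 0.0                                     # positives are always kept
        _, idx = loss.sort(dim=-1, descending=True)         # rank priors by loss
        _, rank = idx.sort(dim=-1)
        num_neg = neg_pos_ratio * pos.sum(dim=-1, keepdim=True)
        neg = rank < num_neg
    return pos | neg
```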
S4, screening the result obtained in S3 with non-maximum suppression to obtain the output result;
For each prediction box, its category (the category with the highest confidence) and confidence value are first determined from the class confidences, and prediction boxes belonging to the background are filtered out. Prediction boxes below a confidence threshold (e.g., 0.5) are then filtered out. The remaining prediction boxes are decoded, i.e. their real position parameters are obtained from the prior boxes. After decoding, the boxes are sorted in descending order of confidence and only the top-k (e.g., 400) prediction boxes are kept. Finally, prediction boxes with large Jaccard overlap are filtered out by non-maximum suppression, and the remaining prediction boxes are the detection result.
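The decoding step mentioned above (recovering real box positions from the prior boxes) can be sketched as follows; the variance constants 0.1 and 0.2 are the usual SSD defaults and are an assumption, since the text does not give them.

```python
import torch

def decode(loc, priors, variances=(0.1, 0.2)):
    """Decode predicted offsets into (xmin, ymin, xmax, ymax) boxes in [0, 1].
    loc: (num_priors, 4) predicted offsets; priors: (num_priors, 4) as (cx, cy, w, h)."""
    cxcy = priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:]   # shift the center
    wh = priors[:, 2:] * torch.exp(loc[:, 2:] * variances[1])          # rescale w and h
    boxes = torch.cat([cxcy - wh / 2, cxcy + wh / 2], dim=1)           # corner form
    return boxes.clamp(0.0, 1.0)
```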
The specific implementation method is as follows: non-maximum suppression is adopted, comprising the following substeps:
s41, regarding the detection result obtained in the step S3 as a candidate set, sequencing the candidate set according to the confidence level for each category of targets, selecting the target with the highest confidence level, deleting the target from the candidate set, and adding the target into the detection result set;
s42, calculating the Jaccard overlap ratio between the elements in the candidate set and the target obtained in the S41, and deleting the elements corresponding to the candidate set with the Jaccard overlap ratio larger than a given threshold;
s43, repeating the steps S41 and S42 until the candidate set is empty, and outputting the result set as a final result.
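Steps S41-S43 correspond to the following greedy non-maximum suppression sketch; it reuses the jaccard() helper from the data-enhancement sketch, and the 0.5 threshold is an assumed default.

```python
def nms(detections, iou_threshold=0.5):
    """Greedy NMS over one class. `detections` is a list of (box, score) pairs
    with boxes as (xmin, ymin, xmax, ymax); returns the kept detections."""
    candidates = sorted(detections, key=lambda d: d[1], reverse=True)  # S41: sort by confidence
    keep = []
    while candidates:                                                  # S43: repeat until empty
        best = candidates.pop(0)                                       # S41: take the best
        keep.append(best)
        candidates = [d for d in candidates                            # S42: drop large overlaps
                      if jaccard(best[0], d[0]) <= iou_threshold]
    return keep
```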
Those of ordinary skill in the art will recognize that the embodiments described herein are for the purpose of aiding the reader in understanding the principles of the present invention and should be understood that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.

Claims (3)

1. A hoisting area dynamic monitoring method based on a target detection algorithm, characterized by comprising the following steps:
S1, performing data enhancement on an input image of the hoisting area;
S2, extracting features from the image obtained in step S1 with an SSD target detection network; the SSD target detection network is based on the VGG-16 network, and Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv3_3, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2 and Conv5_3 (512) are used for training;
FC6 and FC7 are changed from the original fully connected layers into a 3×3×1024 convolution and a 1×1×1024 convolution, and the additional layers comprise Conv6_1, Conv6_2, Conv7_1, Conv7_2, Conv8_1, Conv8_2, Conv9_1 and Conv9_2;
at the same time, the pooling layer Pool5 is changed from the original 2×2 with stride=2 to 3×3 with stride=1;
based on the Atrous algorithm, Conv6 adopts dilated (atrous) convolution, which expands the receptive field of the convolution exponentially without increasing the parameters or the model complexity, and uses a dilation rate parameter to indicate the amount of expansion;
the Conv4_3 layer is used as the first feature map for detection; the Conv4_3 feature map size is 38×38, and an L2 normalization layer is added after the Conv4_3 layer;
S3, extracting feature maps, constructing six prior boxes of different scales at each point of each feature map, and then regressing the categories and positions respectively; the specific implementation method is as follows: the 19×19 Conv7, 10×10 Conv8_2, 5×5 Conv9_2, 3×3 Conv10_2 and 1×1 Conv11_2 layers are extracted from the convolutional layers as feature maps for detection; together with the Conv4_3 layer, 6 feature maps are extracted in total; six prior boxes of different scales are constructed at each point of the 6 feature maps, and then the categories and positions are regressed respectively;

the specific method is as follows: six feature maps of different sizes are obtained with the multi-scale method; if the system uses m layers of feature maps for detection, the prior box scale of the k-th feature map is calculated as:

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \qquad k \in [1, m]$$

where m is the number of feature maps, $s_k$ is the ratio of the prior box size to the picture size, and $s_{min}$ and $s_{max}$ are the minimum and maximum values of this ratio; for the first feature map, the scale ratio of the prior box is set to 0.1, i.e. a scale of 30; for the following feature maps, the prior box scale grows linearly according to the above formula, but the ratio is first enlarged by a factor of 100 and the increment step is 17, so that $s_k$ for each feature map is 20, 37, 54, 71 and 88; dividing these values by 100 and multiplying by the picture size gives the scale of each feature map; for the aspect ratio, the following values are selected:

$$a_r \in \{1,\; 2,\; 3,\; \tfrac{1}{2},\; \tfrac{1}{3}\}$$

for a given aspect ratio, the width and height of the prior box are calculated as:

$$w_k^a = s_k \sqrt{a_r}, \qquad h_k^a = \frac{s_k}{\sqrt{a_r}}$$

in addition to the prior box with $a_r = 1$ and scale $s_k$, each feature map in the fused multi-scale feature module is also given a prior box with $a_r = 1$ and scale

$$s_k' = \sqrt{s_k\, s_{k+1}}$$

so that each feature map has two square prior boxes with aspect ratio 1 but different sizes; furthermore, the center point of the prior boxes of each cell is placed at the center of that cell, i.e.

$$\left(\frac{a + 0.5}{|f_k|},\; \frac{b + 0.5}{|f_k|}\right), \qquad a, b \in [0, |f_k|]$$

where $|f_k|$ is the size of the k-th feature map, and the prior box coordinates are clipped to lie within [0, 1]; the mapping between the prior box coordinates on the feature map and the coordinates in the original image is:

$$x_{min} = \frac{c_x - w_b/2}{w_{feature}} \cdot w_{img}, \qquad y_{min} = \frac{c_y - h_b/2}{h_{feature}} \cdot h_{img}$$

$$x_{max} = \frac{c_x + w_b/2}{w_{feature}} \cdot w_{img}, \qquad y_{max} = \frac{c_y + h_b/2}{h_{feature}} \cdot h_{img}$$

where $(c_x, c_y)$ are the coordinates of the prior box center on the feature layer; $w_b, h_b$ are the width and height of the prior box; $w_{feature}, h_{feature}$ are the width and height of the feature layer; and $w_{img}, h_{img}$ are the width and height of the original image; the resulting $(x_{min}, y_{min}, x_{max}, y_{max})$ are the object box coordinates obtained by mapping the prior box with center $\left(\frac{a+0.5}{|f_k|}, \frac{b+0.5}{|f_k|}\right)$ and size $w_k, h_k$ on the k-th feature map back to the original image;
regression is performed on the position and the target category simultaneously on each output feature map; the target loss function is the weighted sum of the confidence loss and the location loss:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N is the total number of matched positive samples (if N = 0, L is set to 0); x and c are the classification indicator and confidence, respectively; l and g are the predicted box and the ground-truth box, respectively; α is the weight of the location loss; $L_{conf}(x, c)$ is the confidence loss function; and $L_{loc}(x, l, g)$ is the location loss function;

the location loss is the Smooth L1 loss between the predicted box l and the ground-truth box g:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\!\left(l_i^m - \hat{g}_j^m\right)$$

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \qquad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}$$

$$\hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \qquad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}$$

where Pos denotes the positive samples; $x_{ij}^{p}$ is an indicator that equals 1 when the i-th predicted box is matched to the j-th ground-truth box of class p, and 0 otherwise; cx, cy, w and h denote the center-point x coordinate, center-point y coordinate, width and height of a box; d is the prior box; $l_i^{m}$ ($m \in \{cx, cy, w, h\}$) are the offsets predicted for the center-point x coordinate, center-point y coordinate, width and height of the predicted box; $g_j^{cx}, g_j^{cy}, g_j^{w}, g_j^{h}$ are the center-point x coordinate, center-point y coordinate, width and height of the ground-truth box; and $\hat{g}_j^{cx}, \hat{g}_j^{cy}, \hat{g}_j^{w}, \hat{g}_j^{h}$ are, respectively, the offset of the center coordinate cx, the offset of the center coordinate cy, the scaling of the width w and the scaling of the height h of the ground-truth box relative to the prior box;
the classification loss is the softmax loss over the class confidences:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right)$$

$$\hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}$$

where $\hat{c}_i^{p}$ is the softmax probability that predicted box i belongs to class p, $\hat{c}_i^{0}$ is the probability that it is background, and $x_{ij}^{p}$ equals 1 when the i-th predicted box is matched to the j-th ground-truth box of class p and 0 otherwise (i.e., when the i-th predicted box has no matching ground-truth box); the classification loss covers both the positive samples Pos and the negative samples Neg;
in order to predict the detection result, a set of independent detection values is output for each prior box of each cell, corresponding to one bounding box, and divided into two main parts: the first part is the confidence or score of each category; when there are c category confidences, the number of real detection categories is only c−1, since the first confidence denotes the background; during prediction, the category with the highest confidence is the category of the bounding box; in particular, when the first confidence value is the highest, the bounding box contains no target; the second part is the location of the bounding box, consisting of 4 values (cx, cy, w, h) that denote the center coordinates, width and height of the bounding box; for a feature map of size m×n there are m·n cells in total; if the number of prior boxes per cell is k, each cell needs (c+4)k predicted values and all cells need (c+4)kmn predicted values; since the SSD used by the system performs detection with convolutions, detection on this feature map is completed by (c+4)k convolution kernels;
in order to keep the positive and negative samples as balanced as possible, the negative samples are subsampled: they are sorted in descending order of confidence error, and the top-k samples with the largest errors are selected as the negative samples for training, so that the ratio of positive to negative samples is close to 1:3;
S4, screening the result obtained in S3 with non-maximum suppression to obtain the output result.
2. The method for dynamically monitoring the hoisting area based on the target detection algorithm according to claim 1, wherein the image of each hoisting area input in the step S1 is randomly sampled by one of the following three methods:
(1) Using the whole image, namely an acquired original image of the hoisting area;
(2) Randomly cropping on the original image;
(3) Random cropping with a Jaccard overlap constraint; the Jaccard overlap is calculated as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where A and B denote the sets of all ground-truth boxes in the original image and in the cropped image, respectively; the size of the crop is between 0.1 and 0.9 of the original picture, and its aspect ratio is between 1/2 and 2;
the input image is resized to a uniform size and flipped horizontally with a probability of 0.5.
3. The method for dynamically monitoring the hoisting area based on the target detection algorithm according to claim 1, wherein the specific implementation method of step S4 is as follows: non-maximum suppression is adopted, comprising the following substeps:
s41, regarding the detection result obtained in the step S3 as a candidate set, sequencing the candidate set according to the confidence level for each category of targets, selecting the target with the highest confidence level, deleting the target from the candidate set, and adding the target into the detection result set;
s42, calculating the Jaccard overlap ratio between the elements in the candidate set and the target obtained in the S41, and deleting the elements corresponding to the candidate set with the Jaccard overlap ratio larger than a given threshold;
s43, repeating the steps S41 and S42 until the candidate set is empty, and outputting the result set as a final result.
CN202010528652.8A 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm Active CN111753682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010528652.8A CN111753682B (en) 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010528652.8A CN111753682B (en) 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm

Publications (2)

Publication Number Publication Date
CN111753682A CN111753682A (en) 2020-10-09
CN111753682B true CN111753682B (en) 2023-05-23

Family

ID=72675082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010528652.8A Active CN111753682B (en) 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm

Country Status (1)

Country Link
CN (1) CN111753682B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215308B (en) * 2020-12-13 2021-03-30 之江实验室 Single-order detection method and device for hoisted object, electronic equipment and storage medium
CN112614121A (en) * 2020-12-29 2021-04-06 国网青海省电力公司海南供电公司 Multi-scale small-target equipment defect identification and monitoring method
CN112733671A (en) * 2020-12-31 2021-04-30 新大陆数字技术股份有限公司 Pedestrian detection method, device and readable storage medium
CN113158752A (en) * 2021-02-05 2021-07-23 国网河南省电力公司鹤壁供电公司 Intelligent safety management and control system for electric power staff approach operation
CN112560825B (en) * 2021-02-23 2021-05-18 北京澎思科技有限公司 Face detection method and device, electronic equipment and readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027547A (en) * 2019-12-06 2020-04-17 南京大学 Automatic detection method for multi-scale polymorphic target in two-dimensional image

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423760A (en) * 2017-07-21 2017-12-01 西安电子科技大学 Based on pre-segmentation and the deep learning object detection method returned
CN110580487A (en) * 2018-06-08 2019-12-17 Oppo广东移动通信有限公司 Neural network training method, neural network construction method, image processing method and device
CN109886359B (en) * 2019-03-25 2021-03-16 西安电子科技大学 Small target detection method and detection system based on convolutional neural network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027547A (en) * 2019-12-06 2020-04-17 南京大学 Automatic detection method for multi-scale polymorphic target in two-dimensional image

Also Published As

Publication number Publication date
CN111753682A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
CN111753682B (en) Hoisting area dynamic monitoring method based on target detection algorithm
CN111310861B (en) License plate recognition and positioning method based on deep neural network
CN110084292B (en) Target detection method based on DenseNet and multi-scale feature fusion
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN110991311B (en) Target detection method based on dense connection deep network
CN110796048B (en) Ship target real-time detection method based on deep neural network
CN111626128A (en) Improved YOLOv 3-based pedestrian detection method in orchard environment
CN111079739B (en) Multi-scale attention feature detection method
CN110991444B (en) License plate recognition method and device for complex scene
CN113850242B (en) Storage abnormal target detection method and system based on deep learning algorithm
CN114627052A (en) Infrared image air leakage and liquid leakage detection method and system based on deep learning
CN111462140B (en) Real-time image instance segmentation method based on block stitching
CN111079604A (en) Method for quickly detecting tiny target facing large-scale remote sensing image
CN105528575A (en) Sky detection algorithm based on context inference
CN111898419B (en) Partitioned landslide detection system and method based on cascaded deep convolutional neural network
CN110334656A (en) Multi-source Remote Sensing Images Clean water withdraw method and device based on information source probability weight
CN113888461A (en) Method, system and equipment for detecting defects of hardware parts based on deep learning
CN111008994A (en) Moving target real-time detection and tracking system and method based on MPSoC
CN115424017B (en) Building inner and outer contour segmentation method, device and storage medium
CN114022408A (en) Remote sensing image cloud detection method based on multi-scale convolution neural network
Zheng et al. Building recognition of UAV remote sensing images by deep learning
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN115578615A (en) Night traffic sign image detection model establishing method based on deep learning
Zhao et al. Boundary regularized building footprint extraction from satellite images using deep neural network
Hang et al. CNN based detection of building roofs from high resolution satellite images

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant