CN111753682A - Hoisting area dynamic monitoring method based on target detection algorithm - Google Patents

Hoisting area dynamic monitoring method based on target detection algorithm Download PDF

Info

Publication number
CN111753682A
CN111753682A (application CN202010528652.8A; granted publication CN111753682B)
Authority
CN
China
Prior art keywords
feature map
box
target
coordinate
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010528652.8A
Other languages
Chinese (zh)
Other versions
CN111753682B (en)
Inventor
马士伟
杨超
赵焕
王建
乐文
段钢
黄希
李炳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Underground Space Co Ltd
Original Assignee
China Construction Underground Space Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Underground Space Co Ltd filed Critical China Construction Underground Space Co Ltd
Priority to CN202010528652.8A priority Critical patent/CN111753682B/en
Publication of CN111753682A publication Critical patent/CN111753682A/en
Application granted granted Critical
Publication of CN111753682B publication Critical patent/CN111753682B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects (G06V 20/00 Scenes; scene-specific elements; G06V 20/50 Context or environment of the image)
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (G06F 18/00 Pattern recognition; G06F 18/20 Analysing; G06F 18/24 Classification techniques)
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045: Combinations of networks (G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/08: Learning methods
    • G06V 2201/07: Target detection (G06V 2201/00 Indexing scheme relating to image or video recognition or understanding)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a hoisting area dynamic monitoring method based on a target detection algorithm, which comprises the following steps: S1, performing data enhancement on the input images of the hoisting area; S2, extracting features from the images obtained in step S1 with an SSD target detection network; S3, extracting feature maps, constructing six prior frames of different scales at each point on each feature map, and then regressing categories and positions; and S4, screening the results obtained in S3 with non-maximum suppression to obtain the output result. Because feature maps of different scales express different characteristics, the SSD target detection algorithm adopts multi-scale target feature extraction and detects on feature maps of several scales, which improves the robustness of detection on hoisting-area pictures of different scales and improves the accuracy of detecting whether the lifting hook is in the working state under the current conditions and whether a person is underneath the load while it is working.

Description

Hoisting area dynamic monitoring method based on target detection algorithm
Technical Field
The invention belongs to the field of computer vision and image processing, and particularly relates to a dynamic monitoring method for a hoisting area based on a target detection algorithm.
Background
In recent years, target detection has become an important research direction and research hotspot in the fields of computer vision and image processing, with applications in unmanned driving, robot navigation, intelligent video monitoring, industrial inspection, aerospace and other fields. Target detection is also a core part of intelligent monitoring systems and plays an important role in subsequent tasks such as face recognition, gait recognition, crowd counting and instance segmentation. Before deep learning appeared, target detection was mainly carried out by building mathematical models from prior knowledge. With the wide application of deep learning in recent years, however, target detection algorithms have developed rapidly, and the accuracy and robustness of target detection have improved. Target detection models based on deep learning benefit from the fact that deep neural networks learn features of different levels autonomously; compared with traditional hand-crafted features, the learned features are richer and their expressive power is stronger. By design concept, these methods fall into two classes: target detection algorithms based on region proposals and target detection algorithms based on end-to-end learning. Region-proposal methods first propose candidate regions for the possible positions of target objects in the image; typical representatives include R-CNN (Region-CNN), Fast R-CNN, etc. Typical end-to-end methods are YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector); their main idea is to sample densely and uniformly at different positions of the picture, possibly with different scales and aspect ratios, extract features with a CNN, and then perform classification and regression directly, so the whole process needs only one step and is therefore fast. Among these algorithms, R-CNN is inefficient and occupies a large amount of disk space; although Fast R-CNN and Faster R-CNN improve on R-CNN, they still need to extract candidate regions from the detection area first in preparation for subsequent feature extraction and classification. YOLO has a high detection speed, a lower background false detection rate than R-CNN, and supports detection of non-natural images, but its object localization error is large and it can detect only one of two objects that fall in the same grid cell. In comparison, the SSD has relatively better detection performance, combining real-time speed with high accuracy.
The SSD is a single-shot detection deep neural network that combines the regression idea of YOLO with the anchors mechanism of Faster R-CNN. Adopting the regression idea simplifies the computational complexity of the neural network and improves the real-time performance of the algorithm; adopting the anchors mechanism to extract local features of different aspect ratios is more reasonable and effective for recognition than YOLO's extraction of global features at a given position. In other words, multi-scale regional features at all positions of the whole image are used for regression, which keeps YOLO's speed while making the window predictions as accurate as those of Faster R-CNN. In addition, because features of different scales express different characteristics, the SSD adopts multi-scale target feature extraction, which helps improve the robustness of detecting targets of different scales.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a hoisting area dynamic monitoring method based on the SSD (Single Shot MultiBox Detector) target detection algorithm. Because feature maps of different scales express different characteristics, the method adopts multi-scale target feature extraction and detects on feature maps of several scales, which improves the robustness of detection on hoisting-area pictures of different scales and improves the accuracy of detecting whether the hook is in the working state under the current conditions.
The purpose of the invention is realized by the following technical scheme: a hoisting area dynamic monitoring method based on a target detection algorithm comprises the following steps:
s1, performing data enhancement on the input image of the hoisting area;
s2, extracting the features of the image obtained in the step S1 by adopting an SSD target detection network;
s3, extracting feature maps, constructing six prior frames of different scales at each point on each feature map, and then regressing categories and positions;
and S4, screening the results obtained in S3 with non-maximum suppression to obtain the output result.
Further, each hoisting-area image input in step S1 is randomly sampled by one of the following three methods:
(1) using the whole image, namely the collected original image of the hoisting area;
(2) randomly cropping on the original image;
(3) random cropping constrained by the Jaccard overlap, which is calculated as:
J(A, B) = |A ∩ B| / |A ∪ B|
where A and B are the real (ground-truth) frames in the original image and the cropped patch, respectively; the ratio of the crop size to the original image is within [0.1, 0.9], and the aspect ratio is within [1/2, 2];
the input image is resized to a uniform size and horizontally flipped with a probability of 0.5.
Further, the SSD target detection network is based on the VGG-16 network and retains, for training, the convolutional layers Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv3_3, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2 and Conv5_3 (512 channels);
FC6 and FC7 are changed from the original fully connected layers to a 3 × 3 × 1024 convolution and a 1 × 1 × 1024 convolution, and the additional layers comprise Conv6_1, Conv6_2, Conv7_1, Conv7_2, Conv8_1, Conv8_2, Conv9_1 and Conv9_2;
meanwhile, the pooling layer Pool5 is changed from the original 2 × 2 with stride 2 to 3 × 3 with stride 1;
based on the Atrous algorithm, Conv6 employs a dilated (atrous) convolution, which enlarges the receptive field of the convolution exponentially without increasing the number of parameters or the model complexity, with the dilation rate parameter representing the size of the dilation;
the Conv4_3 layer is used as the first feature map for detection; the Conv4_3 feature map size is 38 × 38, but the feature norm of this layer differs from that of the later detection layers, so an L2 normalization layer is added after the Conv4_3 layer to ensure that the difference from the other detection layers is not too large.
Further, the specific implementation method of step S3 is as follows: the 19 × 19 Conv7, 10 × 10 Conv8_2, 5 × 5 Conv9_2, 3 × 3 Conv10_2 and 1 × 1 Conv11_2 layers are extracted from the convolutional layers as feature maps for detection; together with the Conv4_3 layer, 6 feature maps are extracted in total; six prior frames of different sizes are constructed at each point on these 6 feature maps, and categories and positions are then regressed respectively;
the method comprises the following steps: and obtaining a plurality of six feature maps with different sizes by adopting a multi-scale method, wherein if the m-layer feature map is adopted during system detection, the prior frame proportion calculation formula of the kth feature map is as follows:
Figure BDA0002534425870000031
wherein m denotes the number of characteristic diagrams, skRepresenting the ratio of the prior frame size to the picture, smin and smaxMinimum and maximum values representing ratios; for the first feature map, the scale ratio of the prior frame is set to be 0.1, and the scale is 30; for the later characteristic diagram, the prior frame scale is increased linearly according to the formula above, but the scale is firstly enlarged by 100 times, and the increasing step is17, s of each feature mapk20, 37, 54, 71 and 88, dividing the ratios by 100, and multiplying by the picture size to obtain the dimension of each feature map; for the aspect ratio, choose
Figure BDA0002534425870000032
For a particular aspect ratio, the width and height of the prior frame are calculated as follows:
w_k^a = s_k · √(a_r)
h_k^a = s_k / √(a_r)
In the module fusing features of different scales, besides the prior frame with a_r = 1 and scale s_k, each feature map is also given a prior frame with a_r = 1 and scale s'_k = √(s_k · s_(k+1)); each feature map therefore has two square prior frames with aspect ratio 1 but different sizes. Furthermore, the center point of the prior frame of each cell is placed at the center of that cell, i.e. ((a + 0.5) / |f_k|, (b + 0.5) / |f_k|), with a, b ∈ [0, |f_k|), where |f_k| is the size of the k-th feature map, and the prior frame coordinates are clipped to lie within [0, 1]; the mapping between the coordinates of the prior frame on the feature map and the coordinates in the original image is as follows:
x_min = (c_x − w_b / 2) / w_feature · w_img
y_min = (c_y − h_b / 2) / h_feature · h_img
x_max = (c_x + w_b / 2) / w_feature · w_img
y_max = (c_y + h_b / 2) / h_feature · h_img
where (c_x, c_y) are the coordinates of the prior frame center on the feature layer; w_b, h_b are the width and height of the prior frame; w_feature, h_feature are the width and height of the feature layer; and w_img, h_img are the width and height of the original image. The resulting (x_min, y_min, x_max, y_max) are the object frame coordinates in the original image, obtained by mapping the prior frame with center ((a + 0.5) / |f_k|, (b + 0.5) / |f_k|) and size w_k, h_k on the k-th feature map;
For each output feature map, the position and the target category are regressed simultaneously; the target loss function is the weighted sum of the confidence (classification) loss and the position loss:
L(x, c, l, g) = (1 / N) · ( L_conf(x, c) + α · L_loc(x, l, g) )
where N is the total number of matched positive samples (if N = 0, L is set to 0); x and c are the classification indicator and confidence; l and g are the prediction box and the real box; α is the weight of the position loss; L_conf(x, c) is the confidence loss function; and L_loc(x, l, g) is the position loss function;
The position loss is the Smooth L1 loss between the prediction box l and the real box g:
L_loc(x, l, g) = Σ_{i ∈ Pos}^{N} Σ_{m ∈ {cx, cy, w, h}} x_ij^k · smooth_L1( l_i^m − ĝ_j^m )
ĝ_j^cx = (g_j^cx − d_i^cx) / d_i^w
ĝ_j^cy = (g_j^cy − d_i^cy) / d_i^h
ĝ_j^w = log( g_j^w / d_i^w )
ĝ_j^h = log( g_j^h / d_i^h )
where Pos denotes the positive samples; x_ij^p is an indicator that equals 1 when the i-th prediction box is matched to the j-th real box of class p, and 0 otherwise; cx, cy, w and h are the center point x coordinate, the center point y coordinate, the width and the height of a box; d is the prior frame; l_i^m are the predicted coordinate offsets of the prediction box, namely the predicted offsets of the center point x coordinate l_i^cx, the center point y coordinate l_i^cy, the width l_i^w and the height l_i^h; and ĝ_j^cx, ĝ_j^cy, ĝ_j^w, ĝ_j^h are, respectively, the offset of the real box center coordinate cx, the offset of the center coordinate cy, the scaling of the width w and the scaling of the height h;
The classification loss is the softmax loss between the class confidences:
L_conf(x, c) = − Σ_{i ∈ Pos}^{N} x_ij^p · log( ĉ_i^p ) − Σ_{i ∈ Neg} log( ĉ_i^0 )
ĉ_i^p = exp( c_i^p ) / Σ_p exp( c_i^p )
where ĉ_i^p is the softmax probability that prediction box i belongs to class p, ĉ_i^0 is the probability that it is background, and x_ij^p = 1 means that the i-th prediction box is matched to the j-th real box of class p, otherwise the prediction box has no matching real box; the classification loss formula includes both the positive samples Pos and the negative samples Neg;
In order to predict the detection results, a set of independent detection values is output for each prior frame of each cell, corresponding to one bounding box; these values are mainly divided into two parts: the first part is the confidence or score of each category; with c category confidences there are only c − 1 real detection categories, because the first confidence denotes the background; during prediction, the category with the highest confidence is the category of the bounding box, and in particular, when the first confidence value is the highest, the bounding box contains no target; the second part is the position of the bounding box, comprising 4 values (cx, cy, w, h) that represent the center coordinates, the width and the height of the bounding box; for a feature map of size m × n there are m·n cells in total; if the number of prior frames set for each cell is denoted k, then (c + 4)·k predicted values are required for each cell and (c + 4)·k·m·n predicted values for all cells; since the SSD used by the system performs detection with convolution, (c + 4)·k convolution kernels are needed to complete the detection process of this feature map;
In order to keep the positive and negative samples as balanced as possible, the negative samples are subsampled: during sampling they are sorted in descending order of confidence error (the smaller the predicted background confidence, the larger the error), the top-k samples with the largest errors are selected as training negative samples, and the ratio of positive to negative samples is kept close to 1:3.
Further, the specific implementation method of step S4 is as follows: adopting a non-maximum suppression method, comprising the following sub-steps:
s41, regarding the detection result obtained in the step S3 as a candidate set, sorting the candidate set according to the confidence degrees aiming at each type of target, selecting the target with the highest confidence degree, deleting the target from the candidate set, and adding the target into the detection result set;
s42, calculating the Jaccard overlapping rate between the elements in the candidate set and the target obtained in S41, and deleting the elements corresponding to the candidate set with the Jaccard overlapping rate larger than a given threshold value;
and S43, repeating the steps S41 and S42 until the candidate set is empty, and outputting the result set as a final result.
The invention has the beneficial effects that: (1) the SSD target detection algorithm adopted in the invention utilizes the idea of YOLO regression, simplifies the computational complexity of a neural network, and improves the real-time performance of the algorithm;
(2) by using the anchors mechanism of Faster R-CNN, the SSD target detection algorithm adopted in the invention can extract hook features of different aspect ratios and sizes, and this way of extracting local features is more reasonable and effective for recognition;
(3) because features of different scales express different characteristics, the SSD target detection algorithm adopted in the invention uses multi-scale target feature extraction and detects on feature maps of different scales, where a large-scale feature map (closer to the front of the network) can be used to detect small objects and a small-scale feature map (closer to the back) is used to detect large objects, which improves the robustness of detecting hoisting-area pictures of different scales;
(4) the invention improves the accuracy of detecting whether the lifting hook is in the working state under the current conditions and the accuracy of detecting whether a person is underneath the load while the hook is in the working state.
Drawings
FIG. 1 is a flow chart of a dynamic monitoring method of a hoisting area based on a target detection algorithm of the present invention;
FIG. 2 is a diagram of a conventional VGG-16 network architecture;
FIG. 3 is a diagram of the SSD target detection network architecture of the present invention;
FIG. 4 is a schematic diagram of a characteristic pyramid of the present invention;
FIG. 5 is a graph showing the results of the detection of the present invention.
Detailed Description
The technical scheme of the invention is further explained below with reference to the accompanying drawings.
As shown in fig. 1, the hoisting area dynamic monitoring method based on the target detection algorithm of the present invention includes the following steps:
s1, performing data enhancement on the input image of the hoisting area;
the input image of each hoisting area is randomly sampled by one of the following three methods:
(1) using the whole image, namely the collected original image of the hoisting area;
(2) randomly cropping on the original image;
(3) random cropping constrained by the Jaccard overlap, which is calculated as:
J(A, B) = |A ∩ B| / |A ∪ B|
where A and B are the real (ground-truth) frames in the original image and the cropped patch, respectively; the ratio of the crop size to the original image is within [0.1, 0.9], and the aspect ratio is within [1/2, 2];
the input image is resized to a uniform size and horizontally flipped with a probability of 0.5.
Data enhancement increases the number of training samples and at the same time constructs more targets of different shapes and sizes as input to the network, so that the network can learn more robust features and the performance of the subsequent algorithm improves; in the end the system becomes less sensitive to target translation and more robust to targets of different sizes and aspect ratios.
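As an illustration of this sampling strategy, the following minimal Python sketch implements the three sampling modes and the Jaccard overlap; it is not the patent's code, and names such as random_sample, the min_jaccard threshold of 0.5 and the fallback behaviour are assumptions made for illustration.

```python
import random

def jaccard(box_a, box_b):
    """Jaccard (IoU) overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-10)

def random_sample(img_w, img_h, gt_boxes, min_jaccard=0.5, max_tries=50):
    """Pick one of: whole image, random crop, Jaccard-constrained crop."""
    mode = random.choice(["whole", "random_crop", "jaccard_crop"])
    if mode == "whole":
        return (0, 0, img_w, img_h)
    for _ in range(max_tries):
        # crop size in [0.1, 0.9] of the original, aspect ratio in [1/2, 2]
        scale = random.uniform(0.1, 0.9)
        ratio = random.uniform(0.5, 2.0)
        w = min(img_w, int(img_w * scale * ratio ** 0.5))
        h = min(img_h, int(img_h * scale / ratio ** 0.5))
        if w < 1 or h < 1:
            continue
        x = random.randint(0, img_w - w)
        y = random.randint(0, img_h - h)
        crop = (x, y, x + w, y + h)
        if mode == "random_crop":
            return crop
        # one reading of the constraint: at least one real frame overlaps the crop enough
        if any(jaccard(crop, gt) >= min_jaccard for gt in gt_boxes):
            return crop
    return (0, 0, img_w, img_h)  # fall back to the whole image
```

The sampled patch would then be resized to the uniform network input size and flipped horizontally with probability 0.5, as stated above.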
S2, extracting the features of the image obtained in the step S1 by adopting an SSD target detection network;
The SSD target detection network is based on the VGG-16 network; the conventional VGG-16 network structure is shown in FIG. 2 and the SSD network of the invention is shown in FIG. 3. During training, the convolutional layers Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv3_3, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2 and Conv5_3 (512 channels) are retained;
FC6 and FC7 are changed from the original fully connected layers to a 3 × 3 × 1024 convolution and a 1 × 1 × 1024 convolution, and the additional layers comprise Conv6_1, Conv6_2, Conv7_1, Conv7_2, Conv8_1, Conv8_2, Conv9_1 and Conv9_2;
meanwhile, the pooling layer Pool5 is changed from the original 2 × 2 with stride 2 to 3 × 3 with stride 1;
based on the Atrous algorithm, Conv6 employs a dilated (atrous) convolution, which enlarges the receptive field of the convolution exponentially without increasing the number of parameters or the model complexity, with the dilation rate parameter representing the size of the dilation;
the Conv4_3 layer is used as the first feature map for detection; the Conv4_3 feature map size is 38 × 38, but the feature norm of this layer differs from that of the later detection layers, so an L2 normalization layer is added after the Conv4_3 layer to ensure that the difference from the other detection layers is not too large.
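To make the layer changes concrete, here is a minimal PyTorch sketch of the modified backbone pieces (Pool5, the FC6/FC7 replacements and the additional layers). It is a reconstruction of the standard SSD modification under stated assumptions (dilation rate 6, the channel widths of the additional layers), not the patent's exact implementation; activations between layers are omitted for brevity.

```python
import torch.nn as nn

# Pool5: 2x2 / stride 2 in plain VGG-16 becomes 3x3 / stride 1 here
pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

# FC6 -> 3x3x1024 dilated (atrous) convolution: the dilation enlarges the
# receptive field without adding parameters (rate 6 is the usual SSD choice)
conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)

# FC7 -> 1x1x1024 convolution
conv7 = nn.Conv2d(1024, 1024, kernel_size=1)

# Additional layers (names as in the text; output sizes assume a 300x300 input)
extras = nn.Sequential(
    nn.Conv2d(1024, 256, kernel_size=1),                      # Conv6_1
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),  # Conv6_2 -> 10x10
    nn.Conv2d(512, 128, kernel_size=1),                       # Conv7_1
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),  # Conv7_2 -> 5x5
    nn.Conv2d(256, 128, kernel_size=1),                       # Conv8_1
    nn.Conv2d(128, 256, kernel_size=3),                       # Conv8_2 -> 3x3
    nn.Conv2d(256, 128, kernel_size=1),                       # Conv9_1
    nn.Conv2d(128, 256, kernel_size=3),                       # Conv9_2 -> 1x1
)
```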
S3, extracting feature maps, constructing six prior frames of different scales at each point on each feature map, and then regressing categories and positions;
The specific implementation method is as follows: the 19 × 19 Conv7, 10 × 10 Conv8_2, 5 × 5 Conv9_2, 3 × 3 Conv10_2 and 1 × 1 Conv11_2 layers are extracted from the convolutional layers as feature maps for detection; together with the Conv4_3 layer, 6 feature maps are extracted in total, forming the pyramid-shaped feature structure shown in FIG. 4; six prior frames of different sizes are constructed at each point on these 6 feature maps, and categories and positions are then regressed respectively;
FIG. 5 shows the final detection result, which is obtained as follows: the SSD borrows the concept of anchors from Faster R-CNN; each cell is given prior frames of different scales and aspect ratios, and the predicted bounding boxes are based on these prior frames, which reduces the training difficulty to a certain extent. In general, each cell is given several prior frames whose scales and aspect ratios differ. The number of prior frames differs between feature maps, while every cell on the same feature map has the same prior frames. The setting of the prior frames covers two aspects: scale (or size) and aspect ratio. The scale of the prior frames obeys a linear increase rule: as the feature map size decreases, the prior frame scale increases linearly.
The method is as follows: six feature maps of different sizes are obtained by the multi-scale method. If m feature maps are used for detection, the prior frame scale of the k-th feature map is calculated as:
s_k = s_min + (s_max − s_min) / (m − 1) · (k − 1),  k ∈ [1, m]
where m denotes the number of feature maps, set to 5 in the present embodiment because the first layer (the Conv4_3 layer) is set separately; s_k denotes the ratio of the prior frame size to the picture; and s_min and s_max, the minimum and maximum values of this ratio, are 0.2 and 0.9 respectively. For the first feature map, the scale ratio of the prior frame is set to 0.1, i.e. a scale of 30; for the later feature maps, the prior frame scale increases linearly according to the formula above, but the ratio is first multiplied by 100 and the increase step is 17, so that s_k of each feature map is 20, 37, 54, 71 and 88; dividing these values by 100 and multiplying by the picture size gives the prior frame scale on each feature map. For the aspect ratio, a_r ∈ {1, 2, 3, 1/2, 1/3} is chosen.
For a particular aspect ratio, the width and height of the prior frame are calculated as follows:
w_k^a = s_k · √(a_r)
h_k^a = s_k / √(a_r)
In the module fusing features of different scales, besides the prior frame with a_r = 1 and scale s_k, each feature map is also given a prior frame with a_r = 1 and scale s'_k = √(s_k · s_(k+1)); each feature map therefore has two square prior frames with aspect ratio 1 but different sizes. Furthermore, the center point of the prior frame of each cell is placed at the center of that cell, i.e. ((a + 0.5) / |f_k|, (b + 0.5) / |f_k|), with a, b ∈ [0, |f_k|), where |f_k| is the size of the k-th feature map, and the prior frame coordinates are clipped to lie within [0, 1]; the mapping between the coordinates of the prior frame on the feature map and the coordinates in the original image is as follows:
x_min = (c_x − w_b / 2) / w_feature · w_img
y_min = (c_y − h_b / 2) / h_feature · h_img
x_max = (c_x + w_b / 2) / w_feature · w_img
y_max = (c_y + h_b / 2) / h_feature · h_img
where (c_x, c_y) are the coordinates of the prior frame center on the feature layer; w_b, h_b are the width and height of the prior frame; w_feature, h_feature are the width and height of the feature layer; and w_img, h_img are the width and height of the original image. The resulting (x_min, y_min, x_max, y_max) are the object frame coordinates in the original image, obtained by mapping the prior frame with center ((a + 0.5) / |f_k|, (b + 0.5) / |f_k|) and size w_k, h_k on the k-th feature map;
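A small numpy sketch of the prior frame generation described above (the scale 0.1 for Conv4_3 and 20/37/54/71/88 divided by 100 for the later maps, aspect ratios {1, 2, 3, 1/2, 1/3}, the extra square prior with scale √(s_k·s_(k+1)), and centers at the middle of each cell). The function name generate_priors and the final scale 1.05 used for the last extra prior are assumptions made for illustration.

```python
import itertools
import math
import numpy as np

def generate_priors(feature_sizes=(38, 19, 10, 5, 3, 1),
                    scales=(0.10, 0.20, 0.37, 0.54, 0.71, 0.88, 1.05),
                    aspect_ratios=(2.0, 3.0, 0.5, 1.0 / 3.0)):
    """Return prior frames as (cx, cy, w, h), normalized to [0, 1]."""
    priors = []
    for k, f in enumerate(feature_sizes):
        s_k = scales[k]
        s_k_extra = math.sqrt(scales[k] * scales[k + 1])  # second square prior
        for a, b in itertools.product(range(f), repeat=2):
            cx, cy = (a + 0.5) / f, (b + 0.5) / f          # center of the cell
            boxes = [(s_k, s_k), (s_k_extra, s_k_extra)]   # two square priors (a_r = 1)
            for ar in aspect_ratios:                       # a_r in {2, 3, 1/2, 1/3}
                boxes.append((s_k * math.sqrt(ar), s_k / math.sqrt(ar)))
            for w, h in boxes:
                priors.append((cx, cy, w, h))
    return np.clip(np.array(priors), 0.0, 1.0)  # clip prior coordinates into [0, 1]

priors = generate_priors()
# 38^2 + 19^2 + 10^2 + 5^2 + 3^2 + 1^2 = 1940 locations, 6 priors each -> 11640 priors
print(priors.shape)  # (11640, 4)
```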
For each output feature map, the position and the target category are regressed simultaneously; the target loss function is the weighted sum of the confidence (classification) loss and the position loss:
L(x, c, l, g) = (1 / N) · ( L_conf(x, c) + α · L_loc(x, l, g) )
where N is the total number of matched positive samples (if N = 0, L is set to 0); x and c are the classification indicator and confidence; l and g are the prediction box and the real box; α is the weight of the position loss; L_conf(x, c) is the confidence loss function; and L_loc(x, l, g) is the position loss function;
The position loss is the Smooth L1 loss between the prediction box l and the real box g:
L_loc(x, l, g) = Σ_{i ∈ Pos}^{N} Σ_{m ∈ {cx, cy, w, h}} x_ij^k · smooth_L1( l_i^m − ĝ_j^m )
ĝ_j^cx = (g_j^cx − d_i^cx) / d_i^w
ĝ_j^cy = (g_j^cy − d_i^cy) / d_i^h
ĝ_j^w = log( g_j^w / d_i^w )
ĝ_j^h = log( g_j^h / d_i^h )
where Pos denotes the positive samples; x_ij^p is an indicator that equals 1 when the i-th prediction box is matched to the j-th real box of class p, and 0 otherwise; cx, cy, w and h are the center point x coordinate, the center point y coordinate, the width and the height of a box; d is the prior frame (the prior frame preset by the network itself), l is the prediction box (the box output by the network with the predicted offsets applied), and g is the GT box (the real box annotated in the data set); l_i^m are the predicted coordinate offsets of the prediction box, namely the predicted offsets of the center point x coordinate l_i^cx, the center point y coordinate l_i^cy, the width l_i^w and the height l_i^h; and ĝ_j^cx, ĝ_j^cy, ĝ_j^w, ĝ_j^h are, respectively, the offset of the real box center coordinate cx, the offset of the center coordinate cy, the scaling of the width w and the scaling of the height h;
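The offset encoding and Smooth L1 position loss above can be sketched as follows in PyTorch; this is a simplified sketch that assumes the positive prediction boxes have already been matched one-to-one with real boxes, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def encode_offsets(gt_boxes, priors):
    """Encode matched real boxes against their prior frames.

    gt_boxes, priors: tensors of shape (N, 4) in (cx, cy, w, h) form,
    already matched one-to-one (positive samples only).
    """
    g_cxcy = (gt_boxes[:, :2] - priors[:, :2]) / priors[:, 2:]  # (g_cx - d_cx)/d_w, (g_cy - d_cy)/d_h
    g_wh = torch.log(gt_boxes[:, 2:] / priors[:, 2:])           # log(g_w/d_w), log(g_h/d_h)
    return torch.cat([g_cxcy, g_wh], dim=1)

def location_loss(pred_offsets, gt_boxes, priors):
    """Smooth L1 loss between predicted offsets l and the encoded targets."""
    targets = encode_offsets(gt_boxes, priors)
    return F.smooth_l1_loss(pred_offsets, targets, reduction="sum")
```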
The classification loss is the softmax loss between the class confidences:
L_conf(x, c) = − Σ_{i ∈ Pos}^{N} x_ij^p · log( ĉ_i^p ) − Σ_{i ∈ Neg} log( ĉ_i^0 )
ĉ_i^p = exp( c_i^p ) / Σ_p exp( c_i^p )
where ĉ_i^p is the softmax probability that prediction box i belongs to class p, ĉ_i^0 is the probability that it is background, and x_ij^p = 1 means that the i-th prediction box is matched to the j-th real box of class p, otherwise the prediction box has no matching real box; the classification loss formula includes both the positive samples Pos and the negative samples Neg;
In order to predict the detection results, a set of independent detection values is output for each prior frame of each cell, corresponding to one bounding box; these values are mainly divided into two parts: the first part is the confidence or score of each category; with c category confidences there are only c − 1 real detection categories, because the first confidence denotes the background; during prediction, the category with the highest confidence is the category of the bounding box, and in particular, when the first confidence value is the highest, the bounding box contains no target; the second part is the position of the bounding box, comprising 4 values (cx, cy, w, h) that represent the center coordinates, the width and the height of the bounding box; for a feature map of size m × n there are m·n cells in total; if the number of prior frames set for each cell is denoted k, then (c + 4)·k predicted values are required for each cell and (c + 4)·k·m·n predicted values for all cells; since the SSD used by the system performs detection with convolution, (c + 4)·k convolution kernels are needed to complete the detection process of this feature map;
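The "(c + 4)·k convolution kernels per feature map" can be illustrated with a short PyTorch sketch of one detection head; the values c = 3 (background, hook, person) and k = 6 priors per cell are assumptions for the example only.

```python
import torch
import torch.nn as nn

c, k = 3, 6  # assumed: 3 class confidences (including background), 6 priors per cell

conf_head = nn.Conv2d(512, c * k, kernel_size=3, padding=1)  # c*k classification kernels
loc_head = nn.Conv2d(512, 4 * k, kernel_size=3, padding=1)   # 4*k localization kernels

feature_map = torch.randn(1, 512, 38, 38)  # e.g. the Conv4_3 feature map
conf = conf_head(feature_map)              # (1, c*k, 38, 38)
loc = loc_head(feature_map)                # (1, 4*k, 38, 38)
# (c + 4)*k predictions per cell and (c + 4)*k*38*38 predictions for this feature map
```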
In order to keep the positive and negative samples as balanced as possible, the negative samples are subsampled: during sampling they are sorted in descending order of confidence error (the smaller the predicted background confidence, the larger the error), the top-k samples with the largest errors are selected as training negative samples, and the ratio of positive to negative samples is kept close to 1:3.
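A minimal sketch of this hard negative mining step, assuming the per-prior confidence losses and the positive-sample mask have already been computed; the helper name and tensor layout are illustrative.

```python
import torch

def hard_negative_mining(conf_loss, positive_mask, neg_pos_ratio=3):
    """Keep the negatives with the largest confidence loss, at most 3 per positive.

    conf_loss:     (num_priors,) per-prior confidence loss
    positive_mask: (num_priors,) boolean mask of matched (positive) priors
    """
    num_pos = int(positive_mask.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~positive_mask).sum()))
    neg_loss = conf_loss.clone()
    neg_loss[positive_mask] = float("-inf")          # consider negatives only
    _, order = neg_loss.sort(descending=True)        # descending confidence error
    negative_mask = torch.zeros_like(positive_mask)
    negative_mask[order[:num_neg]] = True            # top-k hardest negatives
    return negative_mask
```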
S4, screening the results obtained in step S3 with non-maximum suppression to obtain the output result;
For each prediction box, its category (the one with the highest confidence) and its confidence value are first determined according to the class confidences, and prediction boxes belonging to the background are filtered out. Prediction boxes whose confidence falls below a confidence threshold (e.g. 0.5) are then filtered out. The remaining prediction boxes are decoded, and their real position parameters are obtained from the prior frames. After decoding, the boxes are sorted in descending order of confidence and only the top-k (e.g. 400) prediction boxes are kept. Finally, prediction boxes with a large Jaccard overlap are filtered out with the non-maximum suppression algorithm; the remaining prediction boxes are the detection result.
The specific implementation method comprises the following steps: adopting a non-maximum suppression method, comprising the following sub-steps:
s41, regarding the detection result obtained in the step S3 as a candidate set, sorting the candidate set according to the confidence degrees aiming at each type of target, selecting the target with the highest confidence degree, deleting the target from the candidate set, and adding the target into the detection result set;
s42, calculating the Jaccard overlapping rate between the elements in the candidate set and the target obtained in S41, and deleting the elements corresponding to the candidate set with the Jaccard overlapping rate larger than a given threshold value;
and S43, repeating the steps S41 and S42 until the candidate set is empty, and outputting the result set as a final result.
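Sub-steps S41 to S43 correspond to classical greedy non-maximum suppression; the following Python sketch shows one per-class pass, reusing the jaccard helper from the data-enhancement sketch above. The overlap threshold of 0.5 is an assumption; the text only speaks of "a given threshold".

```python
def non_maximum_suppression(boxes, scores, overlap_threshold=0.5):
    """Greedy NMS for the detections of a single class.

    boxes:  list of (xmin, ymin, xmax, ymax)
    scores: list of confidences, same length as boxes
    Returns the indices of the kept boxes.
    """
    # S41: sort the candidate set by confidence, highest first
    candidates = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    results = []
    while candidates:                       # S43: repeat until the candidate set is empty
        best = candidates.pop(0)            # highest-confidence target
        results.append(best)                # move it to the detection result set
        # S42: drop candidates whose Jaccard overlap with the kept box exceeds the threshold
        candidates = [i for i in candidates
                      if jaccard(boxes[best], boxes[i]) <= overlap_threshold]
    return results
```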
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the scope of protection is not limited to the specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and these changes and combinations remain within the scope of the invention.

Claims (5)

1. A hoisting area dynamic monitoring method based on a target detection algorithm is characterized by comprising the following steps:
s1, performing data enhancement on the input image of the hoisting area;
s2, extracting the features of the image obtained in the step S1 by adopting an SSD target detection network;
s3, extracting feature maps, constructing six prior frames of different scales at each point on each feature map, and then regressing categories and positions;
and S4, screening the results obtained in S3 with non-maximum suppression to obtain the output result.
2. The method for dynamically monitoring the hoisting area based on the object detection algorithm as claimed in claim 1, wherein the image of each hoisting area input in the step S1 is randomly sampled by one of the following three methods:
(1) using the whole image, namely the collected original image of the hoisting area;
(2) randomly cropping on the original image;
(3) random cropping constrained by the Jaccard overlap, which is calculated as:
J(A, B) = |A ∩ B| / |A ∪ B|
where A and B are the real (ground-truth) frames in the original image and the cropped patch, respectively; the ratio of the crop size to the original image is within [0.1, 0.9], and the aspect ratio is within [1/2, 2];
the input image is resized to a uniform size and horizontally flipped with a probability of 0.5.
3. The method for dynamically monitoring the hoisting area based on the target detection algorithm as claimed in claim 1, wherein the SSD target detection network is based on the VGG-16 network and retains, for training, the convolutional layers Conv1_1, Conv1_2, Conv2_1, Conv2_2, Conv3_1, Conv3_2, Conv3_3, Conv4_1, Conv4_2, Conv4_3, Conv5_1, Conv5_2 and Conv5_3 (512 channels);
FC6 and FC7 are changed from the original fully connected layers to a 3 × 3 × 1024 convolution and a 1 × 1 × 1024 convolution, and the additional layers comprise Conv6_1, Conv6_2, Conv7_1, Conv7_2, Conv8_1, Conv8_2, Conv9_1 and Conv9_2;
meanwhile, the pooling layer Pool5 is changed from the original 2 × 2 with stride 2 to 3 × 3 with stride 1;
based on the Atrous algorithm, Conv6 employs a dilated (atrous) convolution, which enlarges the receptive field of the convolution exponentially without increasing the number of parameters or the model complexity, with the dilation rate parameter representing the size of the dilation;
wherein the Conv4_3 layer is used as the first feature map for detection; the Conv4_3 feature map size is 38 × 38, with an L2 normalization layer added after the Conv4_3 layer.
4. The method for dynamically monitoring the hoisting area based on the target detection algorithm as claimed in claim 3, wherein step S3 is implemented as follows: the 19 × 19 Conv7, 10 × 10 Conv8_2, 5 × 5 Conv9_2, 3 × 3 Conv10_2 and 1 × 1 Conv11_2 layers are extracted from the convolutional layers as feature maps for detection; together with the Conv4_3 layer, 6 feature maps are extracted in total; six prior frames of different sizes are constructed at each point on these 6 feature maps, and categories and positions are then regressed respectively;
the method comprises the following steps: six feature maps of different sizes are obtained by the multi-scale method; if m feature maps are used for detection, the prior frame scale of the k-th feature map is calculated as:
s_k = s_min + (s_max − s_min) / (m − 1) · (k − 1),  k ∈ [1, m]
where m denotes the number of feature maps, s_k denotes the ratio of the prior frame size to the picture, and s_min and s_max denote the minimum and maximum values of this ratio; for the first feature map, the scale ratio of the prior frame is set to 0.1, i.e. a scale of 30; for the later feature maps, the prior frame scale increases linearly according to the formula above, but the ratio is first multiplied by 100 and the increase step is 17, so that s_k of each feature map is 20, 37, 54, 71 and 88; dividing these values by 100 and multiplying by the picture size gives the prior frame scale on each feature map; for the aspect ratio, a_r ∈ {1, 2, 3, 1/2, 1/3} is chosen;
For a particular aspect ratio, the width and height of the prior frame are calculated as follows:
w_k^a = s_k · √(a_r)
h_k^a = s_k / √(a_r)
In the module fusing features of different scales, besides the prior frame with a_r = 1 and scale s_k, each feature map is also given a prior frame with a_r = 1 and scale s'_k = √(s_k · s_(k+1)); each feature map therefore has two square prior frames with aspect ratio 1 but different sizes. Furthermore, the center point of the prior frame of each cell is placed at the center of that cell, i.e. ((a + 0.5) / |f_k|, (b + 0.5) / |f_k|), with a, b ∈ [0, |f_k|), where |f_k| is the size of the k-th feature map, and the prior frame coordinates are clipped to lie within [0, 1]; the mapping between the coordinates of the prior frame on the feature map and the coordinates in the original image is as follows:
x_min = (c_x − w_b / 2) / w_feature · w_img
y_min = (c_y − h_b / 2) / h_feature · h_img
x_max = (c_x + w_b / 2) / w_feature · w_img
y_max = (c_y + h_b / 2) / h_feature · h_img
where (c_x, c_y) are the coordinates of the prior frame center on the feature layer; w_b, h_b are the width and height of the prior frame; w_feature, h_feature are the width and height of the feature layer; and w_img, h_img are the width and height of the original image. The resulting (x_min, y_min, x_max, y_max) are the object frame coordinates in the original image, obtained by mapping the prior frame with center ((a + 0.5) / |f_k|, (b + 0.5) / |f_k|) and size w_k, h_k on the k-th feature map;
For each output feature map, the position and the target category are regressed simultaneously; the target loss function is the weighted sum of the confidence loss and the position loss:
L(x, c, l, g) = (1 / N) · ( L_conf(x, c) + α · L_loc(x, l, g) )
where N is the total number of matched positive samples (if N = 0, L is set to 0); x and c are the classification indicator and confidence; l and g are the prediction box and the real box; α is the weight of the position loss; L_conf(x, c) is the confidence loss function; and L_loc(x, l, g) is the position loss function;
The position loss is the Smooth L1 loss between the prediction box l and the real box g:
L_loc(x, l, g) = Σ_{i ∈ Pos}^{N} Σ_{m ∈ {cx, cy, w, h}} x_ij^k · smooth_L1( l_i^m − ĝ_j^m )
ĝ_j^cx = (g_j^cx − d_i^cx) / d_i^w
ĝ_j^cy = (g_j^cy − d_i^cy) / d_i^h
ĝ_j^w = log( g_j^w / d_i^w )
ĝ_j^h = log( g_j^h / d_i^h )
where Pos denotes the positive samples; x_ij^p is an indicator that equals 1 when the i-th prediction box is matched to the j-th real box of class p, and 0 otherwise; cx, cy, w and h are the center point x coordinate, the center point y coordinate, the width and the height of a box; d is the prior frame; l_i^m are the predicted coordinate offsets of the prediction box, namely the predicted offsets of the center point x coordinate l_i^cx, the center point y coordinate l_i^cy, the width l_i^w and the height l_i^h; and ĝ_j^cx, ĝ_j^cy, ĝ_j^w, ĝ_j^h are, respectively, the offset of the real box center coordinate cx, the offset of the center coordinate cy, the scaling of the width w and the scaling of the height h;
The classification loss is the softmax loss between the class confidences:
L_conf(x, c) = − Σ_{i ∈ Pos}^{N} x_ij^p · log( ĉ_i^p ) − Σ_{i ∈ Neg} log( ĉ_i^0 )
ĉ_i^p = exp( c_i^p ) / Σ_p exp( c_i^p )
where ĉ_i^p is the softmax probability that prediction box i belongs to class p, ĉ_i^0 is the probability that it is background, and x_ij^p = 1 means that the i-th prediction box is matched to the j-th real box of class p, otherwise the prediction box has no matching real box; the classification loss formula includes both the positive samples Pos and the negative samples Neg;
In order to predict the detection results, a set of independent detection values is output for each prior frame of each cell, corresponding to one bounding box; these values are mainly divided into two parts: the first part is the confidence or score of each category; with c category confidences there are only c − 1 real detection categories, because the first confidence denotes the background; during prediction, the category with the highest confidence is the category of the bounding box, and in particular, when the first confidence value is the highest, the bounding box contains no target; the second part is the position of the bounding box, comprising 4 values (cx, cy, w, h) that represent the center coordinates, the width and the height of the bounding box; for a feature map of size m × n there are m·n cells in total; if the number of prior frames set for each cell is denoted k, then (c + 4)·k predicted values are required for each cell and (c + 4)·k·m·n predicted values for all cells; since the SSD used by the system performs detection with convolution, (c + 4)·k convolution kernels are needed to complete the detection process of this feature map;
In order to keep the positive and negative samples as balanced as possible, the negative samples are subsampled: during sampling they are sorted in descending order of confidence error, the top-k samples with the largest errors are selected as training negative samples, and the ratio of positive to negative samples is kept close to 1:3.
5. The method for dynamically monitoring the hoisting area based on the target detection algorithm as recited in claim 4, wherein the step S4 is implemented by: adopting a non-maximum suppression method, comprising the following sub-steps:
s41, regarding the detection result obtained in the step S3 as a candidate set, sorting the candidate set according to the confidence degrees aiming at each type of target, selecting the target with the highest confidence degree, deleting the target from the candidate set, and adding the target into the detection result set;
s42, calculating the Jaccard overlapping rate between the elements in the candidate set and the target obtained in S41, and deleting the elements corresponding to the candidate set with the Jaccard overlapping rate larger than a given threshold value;
and S43, repeating the steps S41 and S42 until the candidate set is empty, and outputting the result set as a final result.
CN202010528652.8A 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm Active CN111753682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010528652.8A CN111753682B (en) 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010528652.8A CN111753682B (en) 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm

Publications (2)

Publication Number Publication Date
CN111753682A true CN111753682A (en) 2020-10-09
CN111753682B CN111753682B (en) 2023-05-23

Family

ID=72675082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010528652.8A Active CN111753682B (en) 2020-06-11 2020-06-11 Hoisting area dynamic monitoring method based on target detection algorithm

Country Status (1)

Country Link
CN (1) CN111753682B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423760A (en) * 2017-07-21 2017-12-01 西安电子科技大学 Based on pre-segmentation and the deep learning object detection method returned
US20190377949A1 (en) * 2018-06-08 2019-12-12 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Image Processing Method, Electronic Device and Computer Readable Storage Medium
CN109886359A (en) * 2019-03-25 2019-06-14 西安电子科技大学 Small target detecting method and detection model based on convolutional neural networks
CN111027547A (en) * 2019-12-06 2020-04-17 南京大学 Automatic detection method for multi-scale polymorphic target in two-dimensional image

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215308A (en) * 2020-12-13 2021-01-12 之江实验室 Single-order detection method and device for hoisted object, electronic equipment and storage medium
CN112215308B (en) * 2020-12-13 2021-03-30 之江实验室 Single-order detection method and device for hoisted object, electronic equipment and storage medium
CN112614121A (en) * 2020-12-29 2021-04-06 国网青海省电力公司海南供电公司 Multi-scale small-target equipment defect identification and monitoring method
CN112733671A (en) * 2020-12-31 2021-04-30 新大陆数字技术股份有限公司 Pedestrian detection method, device and readable storage medium
CN113158752A (en) * 2021-02-05 2021-07-23 国网河南省电力公司鹤壁供电公司 Intelligent safety management and control system for electric power staff approach operation
CN113688663A (en) * 2021-02-23 2021-11-23 北京澎思科技有限公司 Face detection method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN111753682B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN110084292B (en) Target detection method based on DenseNet and multi-scale feature fusion
CN108564097B (en) Multi-scale target detection method based on deep convolutional neural network
CN111753682B (en) Hoisting area dynamic monitoring method based on target detection algorithm
CN110991311B (en) Target detection method based on dense connection deep network
CN111310861A (en) License plate recognition and positioning method based on deep neural network
CN110796048B (en) Ship target real-time detection method based on deep neural network
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN109190752A (en) The image, semantic dividing method of global characteristics and local feature based on deep learning
CN110991444B (en) License plate recognition method and device for complex scene
CN109145836B (en) Ship target video detection method based on deep learning network and Kalman filtering
CN111079739B (en) Multi-scale attention feature detection method
CN111898432B (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN108734200B (en) Human target visual detection method and device based on BING (building information network) features
CN112364931A (en) Low-sample target detection method based on meta-feature and weight adjustment and network model
CN109902576B (en) Training method and application of head and shoulder image classifier
CN114022408A (en) Remote sensing image cloud detection method based on multi-scale convolution neural network
CN113159215A (en) Small target detection and identification method based on fast Rcnn
CN114155474A (en) Damage identification technology based on video semantic segmentation algorithm
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN115187530A (en) Method, device, terminal and medium for identifying ultrasonic automatic breast full-volume image
CN116912796A (en) Novel dynamic cascade YOLOv 8-based automatic driving target identification method and device
Li et al. Incremental learning of infrared vehicle detection method based on SSD
CN115861956A (en) Yolov3 road garbage detection method based on decoupling head
CN114494441B (en) Grape and picking point synchronous identification and positioning method and device based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant