CN110399905B - Method for detecting and describing wearing condition of safety helmet in construction scene


Info

Publication number
CN110399905B
CN110399905B (application CN201910593069.2A)
Authority
CN
China
Prior art keywords
safety helmet
wearing
target
picture
people
Prior art date
Legal status
Active
Application number
CN201910593069.2A
Other languages
Chinese (zh)
Other versions
CN110399905A (en)
Inventor
徐守坤
李宁
Current Assignee
Changzhou University
Original Assignee
Changzhou University
Priority date
Filing date
Publication date
Application filed by Changzhou University
Priority to CN201910593069.2A
Publication of CN110399905A
Application granted
Publication of CN110399905B
Status: Active

Classifications

    • G06F18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/23213 Pattern recognition: non-hierarchical clustering using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06V20/52 Scene-specific elements: surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V2201/07 Image or video recognition or understanding: target detection

Abstract

The invention provides a method for detecting and describing the wearing condition of safety helmets in a construction scene, which uses image processing and natural language processing to detect and describe whether workers wear safety helmets. Current neural-network-based image description methods lack interpretability and sufficient detail, and image description of construction scenes is under-studied; the method therefore adopts the YOLOv3 target detection algorithm and generates helmet-wearing description sentences by combining rules with templates. Anchor box parameter values are initialized with K-means clustering, the network is then trained and tested on a self-made data set, and an image description of helmet wearing is finally generated from predefined rules and sentence templates. The method offers clear advantages in detection efficiency while producing accurate descriptions, and can thereby help reduce the accident rate.

Description

Method for detecting and describing wearing condition of safety helmet in construction scene
Technical Field
The invention relates to the technical field of image understanding, in particular to a method for detecting and describing the wearing condition of a safety helmet in a construction scene.
Background
In recent years, with the accelerating urbanization of China, infrastructure construction has expanded continuously and construction accidents occur frequently. Construction scenes such as transformer substations, chemical plants and mine working areas are complex and carry inherent dangers; unsafe behavior by workers easily causes accidents, leading to casualties and economic losses. On a construction site the safety helmet is a life guarantee: wearing one meets the behavior specification, and it reduces the operational risk to workers to a certain extent. To ensure the personal safety of workers and reduce the accident rate caused by not wearing safety helmets, describing the helmet-wearing behavior of construction personnel is particularly important.

Image description expresses the content of a picture in natural language on the basis of image recognition, and is a further step beyond recognition itself. In a construction scene, researching image description of workers' helmet wearing has important significance and application value.

Most descriptions generated by current image description methods are global descriptions of the image, so detailed information is easily lost and accuracy suffers. For construction scene pictures, an image description generated from the viewpoint of constructors' helmet wearing is the basis for analyzing site conditions, and hence for judging construction safety and operability in order to eliminate potential safety hazards. Current research on helmet wearing addresses only the image recognition task: whether helmets are detected with traditional algorithms or with deep learning, considerable results have been achieved, but one limitation remains, namely that the helmet-wearing condition of workers is not described in natural language.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to overcome the defects of the prior art, the invention provides a method for detecting and describing the wearing condition of safety helmets in a construction scene. The method adopts the YOLOv3 target detection algorithm and generates helmet-wearing description sentences by combining rules with templates. These sentences allow a more accurate judgment and description of whether constructors wear safety helmets during construction, so as to eliminate potential safety hazards and improve the safety coefficient of the construction scene; the accurate detection effect and high detection speed can also provide theoretical support for an intelligent monitoring robot.
The technical scheme adopted for solving the technical problems is as follows: a method for detecting and describing the wearing condition of a safety helmet in a construction scene comprises the following steps:
s1: production of data sets
Images for the data set are collected by web crawler and by on-site capture. The collected data comprise pictures of helmet wearing at construction sites with various backgrounds, resolutions and qualities, and include constructors both with and without safety helmets. There are 5000 pictures in total, which ensures a degree of richness in the data set, covers various scene conditions, and reflects real scenes relatively completely. The data set is produced in two steps:
s1.1: helmet wearing detection data set manufacturing method
Using the open source labeling tool LabelImg, the picture samples are annotated with multiple labels according to the annotation format of the Pascal VOC2007 public data set, and corresponding xml annotation files are generated automatically; each xml file contains the object names and the coordinate information of the real bounding boxes. The labeled target categories are: people (man), helmets (helmet) and people wearing helmets (man wear helmet). A sketch of reading one such annotation file is given below.
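For illustration only (this sketch is not part of the patent), a Pascal VOC xml file produced by LabelImg can be read with the Python standard library; the file path and class names are assumptions:

    import xml.etree.ElementTree as ET

    def read_voc_annotation(xml_path):
        # Return (object name, (xmin, ymin, xmax, ymax)) pairs from one annotation file.
        root = ET.parse(xml_path).getroot()
        boxes = []
        for obj in root.iter("object"):
            name = obj.find("name").text  # "man", "helmet" or "man wear helmet"
            bb = obj.find("bndbox")
            coords = tuple(int(bb.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax"))
            boxes.append((name, coords))
        return boxes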
S1.2: the method for manufacturing the subtitle data set of the image worn by the helmet comprises the following steps: and (4) performing statement annotation on the data set marked in the step (S1.1). By adopting a mode of combining self-programming labeling software and manual labeling, the labeling of the caption data set is divided into the following steps:
s1.2.1: reading the name and size information (width and height, unit pixel) of each picture by using self-programming labeling software, and giving a unique picture id number to each picture;
s1.2.2: the method comprises the steps of performing caption labeling on pictures by using self-programming labeling software, manually labeling 5 descriptive sentences of each picture, mainly describing by wearing safety helmets of personnel in a construction scene, and endowing each sentence with a unique sentence id number. Each picture has a corresponding picture id number and 5 corresponding sentence id numbers, and picture subtitle labeling data are stored in a json format.
S2: target detection
S2.1: selection of detection model
Existing deep-learning target detection algorithms divide mainly into two types: two-stage methods based on region proposals and one-stage methods without them. A two-stage algorithm first obtains candidate regions and then classifies within them, which increases the time complexity of such candidate-region methods and lengthens detection time. Among one-stage algorithms, YOLOv3-based target detection has achieved good results; its idea is to predict the classes and positions of different targets directly with a single CNN, making it a fast and accurate detection technique. Compared with other methods, its detection precision is close to that of Faster R-CNN provided the targets to be detected are not very small, and compared with SSD, which is also a one-stage method, YOLOv3 is superior in both detection speed and accuracy. Considering detection speed and detection precision together, YOLOv3 is selected as the model for judging and describing whether safety helmets are worn in the construction scene, and the trained model can be readily applied in engineering.
S2.2: self-made data set preprocessing
A helmet-wearing data set is self-made in the Pascal VOC format; the annotation information comprises the target category and the bounding box coordinates, and is normalized and converted into a training format usable by YOLOv3.

The sample annotation data are normalized by dividing by the width and height of the image so that the final values lie between 0 and 1; this allows training sample data to be read quickly and also meets the requirement of multi-scale training. The normalization formulas are:
x = (x_min + x_max) / (2 · width),  w = (x_max - x_min) / width

y = (y_min + y_max) / (2 · height),  h = (y_max - y_min) / height

where x_max, x_min, y_max, y_min represent the bounding box annotation of the original sample, width and height represent the picture size, and x, y, w, h represent the normalized annotation: (x, y) are the coordinates of the centre point of the target, and w, h are the width and height of the target. In a normalized data sample, the bounding box information of each target of each picture comprises 5 parameters x, y, w, h, class_id, where class_id is the target class number.
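A direct sketch of this normalization, assuming VOC-style corner coordinates in pixels:

    def voc_to_yolo(x_min, y_min, x_max, y_max, width, height):
        # Convert corner coordinates (pixels) to normalized (x, y, w, h) in [0, 1].
        x = (x_min + x_max) / (2.0 * width)   # target centre, relative to picture width
        y = (y_min + y_max) / (2.0 * height)  # target centre, relative to picture height
        w = (x_max - x_min) / width           # target width, relative to picture width
        h = (y_max - y_min) / height          # target height, relative to picture height
        return x, y, w, h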
S2.3: k-means cluster initialization anchor frame
YOLOv3 initializes the anchor boxes with a K-means clustering algorithm in order to predict bounding box coordinates, and the anchor box sizes affect detection accuracy. The original YOLOv3 K-means clustering adopts a Euclidean distance formula, and its anchor box parameter values are obtained by clustering on public data sets; these values are general-purpose but unsuitable for the self-made helmet-wearing data set, so new anchor boxes must be designed before training to improve the bounding box detection rate. K-means clustering on the self-made helmet-wearing data set yields 9 anchor boxes, which are sorted from small to large and distributed evenly over the feature maps of 3 scales: the first 3 anchor boxes correspond to the 52 × 52 feature map, the middle 3 to the 26 × 26 feature map, and the last 3 to the 13 × 13 feature map. The final 9 anchor box parameter values are (26, 19), (49, 36), (58, 145), (76, 58), (101, 199), (123, 111), (152, 222), (223, 261), (372, 491), corresponding to the cluster centre points c1 to c9; the width and height of each anchor box are the width and height of the target box at its cluster centre, in pixels. A clustering sketch is given below.
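A minimal K-means sketch over (width, height) pairs, for illustration only; it follows the Euclidean distance mentioned above (a common alternative in the YOLO literature is the 1 - IoU distance):

    import random

    def kmeans_anchors(boxes, k=9, iters=100, seed=0):
        # boxes: list of (w, h) pixel pairs taken from the training annotations.
        random.seed(seed)
        centres = random.sample(boxes, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for w, h in boxes:
                j = min(range(k), key=lambda i: (w - centres[i][0]) ** 2 + (h - centres[i][1]) ** 2)
                clusters[j].append((w, h))
            centres = [
                (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c)) if c else centres[j]
                for j, c in enumerate(clusters)
            ]
        # Sort small to large, so the first 3 boxes go to the 52 x 52 scale, etc.
        return sorted(centres, key=lambda wh: wh[0] * wh[1])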
S2.4: training of network models
The main advantage of YOLOv3 as a one-stage target detection method is that a single CNN processes the whole picture, locating the targets in the image and predicting their categories, thereby converting target detection into a regression problem. During network training, besides locating the coordinate information of the targets to be detected, the confidence of each bounding box must be predicted and the scores of the predefined target classes obtained. The network model is trained in the following steps:
s2.4.1: target coordinate information positioning
The input picture is represented as a tensor of size n × m × 3, where n and m represent the width and height of the picture in pixels and 3 represents the number of RGB channels. Pictures of different sizes are first automatically resized to the fixed size 416 × 416; the original image is then divided into 13 × 13 grid cells, and the cell containing the centre point of a target is responsible for detecting that target. Each grid cell predicts 3 bounding boxes overlaid on it together with the confidences of these boxes, and each bounding box contains 6 predictions: x, y, w, h, confidence and class_id, where (x, y) represents the centre of the predicted box relative to the cell boundary, w, h represent the ratio of the width and height of the predicted box to the whole picture, confidence is used to eliminate boxes below a threshold, and class_id represents the target class number. The prediction information of each bounding box comprises its coordinates, width and height, calculated as follows:
b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y

b_w = p_w · e^(t_w),  b_h = p_h · e^(t_h)

where (b_x, b_y) represents the centre coordinates of the predicted bounding box and b_w, b_h represent its width and height; t_x, t_y, t_w, t_h represent the targets of network learning, c_x, c_y are the coordinate offsets of the grid cell, and p_w, p_h are the preset anchor box dimensions.
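These decoding equations as a small sketch (σ is the logistic sigmoid; all names follow the formulas above):

    import math

    def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
        sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
        bx = sigmoid(tx) + cx   # box centre x, offset from the grid cell corner
        by = sigmoid(ty) + cy   # box centre y
        bw = pw * math.exp(tw)  # box width, scaled from the anchor prior p_w
        bh = ph * math.exp(th)  # box height, scaled from the anchor prior p_h
        return bx, by, bw, bh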
S2.4.2: prediction of bounding box confidence
After the target coordinate information is located, the confidence of the bounding boxes must be predicted. With the 3 labeled target classes (C = 3): people (man), helmets (helmet) and people wearing helmets (man wear helmet), 3 bounding boxes are predicted per grid cell and each bounding box contains 6 predictions, so the number of channels is 3 × (4 + 1 + 3) = 24, and the 3 output scale feature maps are 13 × 13 × 24, 26 × 26 × 24 and 52 × 52 × 24 respectively.
S2.4.3: a score prediction for the target category is predefined.
After the bounding box confidences are predicted, the scores of the predefined target classes are predicted. To improve the detection effect on small targets, the idea of multi-scale prediction is adopted: the 3 feature maps of different scales, 13 × 13, 26 × 26 and 52 × 52, are each used for prediction.
S2.4.4: training of models
The configuration file of the YOLOv3 network is modified according to the characteristics of the self-made data set. Before training, the weight file provided by the official website is converted, following the modified network configuration file, into a weight file for the Keras framework, so that the pre-trained model can be loaded and initialization parameters are provided for training the model.

The batch size (batch) during training is set to 64, i.e. 64 samples are randomly selected for each training iteration, and the grouping (subdivision) is set to 8, i.e. the samples are divided into 8 groups and fed to the network in turn, reducing memory pressure. The network model is normalized with batch normalization (BN) to improve convergence speed. Momentum is set to 0.9 and weight decay to 0.0005 to prevent overfitting; the initial learning rate is set to 0.001 and decays to 1/10 of its value every 5000 iterations. The model is trained for 20000 iterations in total, taking 8 hours, and the loss decreases gradually as the iterations increase: the model fits quickly during the first 4000 iterations with a fast drop in the loss value, and after 10000 iterations the loss stabilizes with only slight oscillation.
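The stated learning rate schedule as a one-line sketch; the function and parameter names are generic, not the actual configuration keys:

    def learning_rate(iteration, base_lr=1e-3, step=5000, gamma=0.1):
        # 0.001 for iterations 0..4999, 0.0001 for 5000..9999, and so on.
        return base_lr * gamma ** (iteration // step)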
S2.5: target detection
First the input picture is resized to 416 × 416; then picture features are extracted with the Darknet-53 network; the feature vectors are fed into a feature pyramid structure for multi-scale prediction; and finally non-maximum suppression is applied to the predicted bounding boxes to eliminate repeated detections and obtain the final prediction result.
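A minimal non-maximum suppression sketch, for illustration only, assuming boxes given as (x_min, y_min, x_max, y_max, score) tuples; the IoU threshold of 0.45 is an assumption, not a value from the patent:

    def nms(boxes, iou_thresh=0.45):
        def iou(a, b):
            ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
            ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
            area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
            return inter / (area(a) + area(b) - inter + 1e-9)
        kept = []
        for box in sorted(boxes, key=lambda b: b[4], reverse=True):  # highest score first
            if all(iou(box, k) < iou_thresh for k in kept):          # drop overlapping repeats
                kept.append(box)
        return kept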
S3: statement generation
First the visual concepts in the image are detected with the target detection algorithm; then, combining predefined rules with a sentence template, the detected visual concepts are filled into the template, and finally the helmet-wearing description sentence is generated. The sentence description rules and templates are defined as follows:
s3.1: definition of sentence description rules
Three visual concepts are extracted by the target detection stage: people, safety helmets, and people wearing safety helmets. A triple (m, n, p) with initial value zero is set for the 3 target classes to count detections, where m is the total number of people detected, n the total number of helmets detected, and p the number of people detected wearing helmets. When 0 ≤ p ≤ m, the number of people wearing helmets does not exceed the total number of people on the construction site; otherwise, when p > m, the monitoring is regarded as erroneous and no helmet-wearing description sentence is generated. If the detected number of people wearing helmets equals the total number of people, i.e. p = m, everyone wears a helmet; if it differs, i.e. p ≠ m, some people wear helmets and some do not. A sketch of this rule is given below.
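The counting rule above as a small sketch; the returned phrases are illustrative wordings, not the patent's templates:

    def helmet_rule(m, n, p):
        # (m, n, p) = (people, helmets, people wearing helmets) detected in one picture;
        # n is carried along for completeness of the triple.
        if p > m:
            return None  # inconsistent detection: no sentence is generated
        if p == m:
            return "all people wear helmets"
        return "some people wear helmets and some do not"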
S3.2: definition of sentence description template
The sentence description template is generated from the picture caption annotations, and its words come either from the original caption annotations or from the visual concepts extracted by the target detection algorithm. A visual word is essentially a placeholder whose purpose is to reserve a slot for the word describing a particular region of the image. The target detection algorithm extracts the visual concepts, and the rule-and-template method then generates the image description sentence for a constructor's helmet wearing in the construction scene.
With the method provided by the invention for detecting and describing helmet wearing in a construction scene, the YOLOv3 algorithm allows a more accurate judgment of whether constructors wear safety helmets during construction, so as to eliminate potential safety hazards and improve the safety coefficient of the construction scene; the accurate detection effect and high detection speed can also provide theoretical support for an intelligent monitoring robot.
Drawings
The invention is further illustrated by the following figures and examples.
FIG. 1 is a flow chart of the algorithm of the present invention.
FIG. 2 is a diagram of the algorithm framework of the present invention.
Fig. 3 is a graph comparing the results of the image description by the NIC method and the embodiment method, wherein (a) is a single person wearing image description and (b) is a multi person wearing image description.
Fig. 4 is a graph of experimental results of example method visualization.
Detailed Description
The present invention will now be described in detail with reference to the accompanying drawings. The figures are simplified schematic diagrams that illustrate the basic structure of the invention, so they show only the parts relevant to the invention.
Referring to fig. 1, a method for detecting and describing a wearing condition of a safety helmet in a construction scene according to the present invention is described in detail with reference to specific embodiments.
The embodiment platform is built on Linux with Ubuntu 16.04 as the operating system, an NVIDIA GeForce GTX 1080Ti GPU with 12 GB of memory, CUDA 8.0 and cuDNN 6.0. Training and testing of the model use the Keras deep learning framework. The one-stage target detection algorithm YOLOv3 is selected to detect the helmet wearing of constructors in pictures, and the rule-and-template method is combined with the helmet-wearing detection algorithm to generate the image description of constructors' helmet wearing; the algorithm flow chart is shown in figure 1.
(1) Production of data sets
Images for the data set are collected by web crawler and by on-site capture. The collected data comprise pictures of helmet wearing at construction sites with various backgrounds, resolutions and qualities, and include constructors both with and without safety helmets. There are 5000 pictures in total, which ensures a degree of richness in the data set, covers various scene conditions, and reflects real scenes relatively completely. The data set is produced in two steps:
1) Helmet wearing detection data set manufacturing method
Using the open source labeling tool LabelImg, the picture samples are annotated with multiple labels according to the annotation format of the Pascal VOC2007 public data set, and corresponding xml annotation files are generated automatically; each xml file contains the object names and the coordinate information of the real bounding boxes. The labeled target categories are: people (man), helmets (helmet) and people wearing helmets (man wear helmet).
2) Production of the helmet-wearing image caption data set: sentence annotation is performed on the data set labeled in step 1). Using self-developed labeling software combined with manual annotation, the caption data set is labeled in the following steps:

a) The self-developed labeling software reads the name and size information (width and height) of each picture and assigns each picture a unique picture id number;

b) The self-developed labeling software is used for caption annotation: 5 descriptive sentences are manually written for each picture, mainly describing the helmet wearing of the people in the construction scene, and each sentence is given a unique sentence id number. Each picture therefore has one picture id number and 5 corresponding sentence id numbers, and the picture caption annotation data are stored in json format.
(2) Helmet wear detection
In the embodiment, the self-made helmet-wearing detection data set is divided into three groups, a training set, a validation set and a test set, in the ratio 7:2:1. The training and validation sets both contain annotation information; the test samples contain none, so as to verify the effectiveness of the trained model.
1) Preprocessing of safety helmet donning detection data sets
The helmet-wearing data set is self-made in the Pascal VOC format; the annotation information comprises the target category and the bounding box coordinates, and is normalized and converted into a training format usable by YOLOv3.

The sample annotation data are normalized by dividing by the width and height of the image so that the final values lie between 0 and 1; this allows training sample data to be read quickly and also meets the requirement of multi-scale training. The normalization formulas are:
x = (x_min + x_max) / (2 · width),  w = (x_max - x_min) / width

y = (y_min + y_max) / (2 · height),  h = (y_max - y_min) / height

where x_max, x_min, y_max, y_min represent the bounding box annotation of the original sample, width and height represent the picture size, and x, y, w, h represent the normalized annotation: (x, y) are the coordinates of the centre point of the target, and w, h are the width and height of the target. In a normalized data sample, the bounding box information of each target of each picture comprises 5 parameters x, y, w, h, class_id, where class_id is the target class number.
The anchor box parameter values of the public data sets are not applicable to the data set of the invention, so they must be re-determined from the self-made helmet-wearing data set. Cluster analysis of the data set with the K-means algorithm yields 9 anchor box parameter values: (26, 19), (49, 36), (58, 145), (76, 58), (101, 199), (123, 111), (152, 222), (223, 261), (372, 491), corresponding to the cluster centre points c1 to c9; the width and height of each anchor box are the width and height of the target box at its cluster centre.
2) Training and testing of models
The configuration file of the YOLOv3 network is modified according to the characteristics of the self-made data set. Before training, the weight file must be converted: the weight file provided by the official website is converted, following the modified network configuration file, into a weight file for the Keras framework, so that the pre-trained model can be loaded and initialization parameters are provided for training the model.

The batch size (batch) during training is set to 64, i.e. 64 samples are randomly selected for each training iteration, and the grouping (subdivision) is set to 8, i.e. the samples are divided into 8 groups and fed to the network in turn, reducing memory pressure. The network model is normalized with batch normalization (BN) to improve convergence speed. Momentum is set to 0.9 and weight decay to 0.0005 to prevent overfitting; the initial learning rate is set to 0.001 and decays to 1/10 of its value every 5000 iterations. The model is trained for 20000 iterations in total, taking 8 hours, and testing shows that the loss decreases gradually as the iterations increase: the model fits quickly during the first 4000 iterations with a fast drop in the loss value, and after 10000 iterations the loss stabilizes with only slight oscillation.
The invention uses the YOLOv3 target detection algorithm for helmet-wearing detection and runs comparison experiments against the Faster R-CNN and SSD algorithms. After training, the resulting model weight file is loaded and the model is tested and evaluated on the test set: the algorithm of the invention is slightly lower than Faster R-CNN in average precision (AP) for helmet detection, but is superior to the other algorithms in detection speed and in mean average precision (mAP) over the 3 target classes.
(3) Helmet worn image description sentence generation
And detecting the visual concept in the image by using a target detection algorithm, filling the detected visual concept into a sentence template by combining a predefined rule and the sentence template, and finally generating a description sentence worn by the safety helmet. The algorithm frame diagram is shown in fig. 2.
The self-made helmet-wearing image caption data set is divided into three groups in the ratio 7:2:1. The sizes of the training, validation and test sets are 3500, 1000 and 500 respectively; the training and validation sets contain pictures with their caption annotations, while the test set contains no caption annotations, so as to verify the effectiveness of the method.
1) Preprocessing of helmet worn image subtitle data sets
The self-made image caption data set is preprocessed with the following main operations: a) caption annotation sentences longer than 15 words are truncated; b) occurrences of 'and' are deleted from the annotation samples and word case is unified by converting uppercase words to lowercase; c) word frequencies are counted and each word in the annotation samples is given a unique id number; d) a vocabulary table containing 3 fields (word id, word, word frequency) is built: words appearing at least 3 times in the annotation samples are stored in the vocabulary, and the remaining words are treated as uncommon and represented by 'UNK'.
The vocabulary is built on the self-made picture caption training set: the total word count is 183047 with 2872 distinct words; filtering with a threshold of 3 yields an effective vocabulary of size 1343, i.e. the description vocabulary contains 1343 distinct words. A sketch of this construction follows.
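The vocabulary construction above as a small sketch (tokenization is assumed to have been done already):

    from collections import Counter

    def build_vocab(captions, min_count=3):
        # captions: list of token lists; returns {word: (word id, word frequency)}.
        freq = Counter(tok for sent in captions for tok in sent)
        kept = sorted(w for w, c in freq.items() if c >= min_count)
        vocab = {w: (i, freq[w]) for i, w in enumerate(kept)}
        rare = sum(c for w, c in freq.items() if c < min_count)
        vocab["UNK"] = (len(vocab), rare)  # all uncommon words map to "UNK"
        return vocab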
2) Sentence description rules and template definition
a) Definition of the sentence description rules. Three classes of visual concepts are extracted by the target detection stage: people, safety helmets, and people wearing safety helmets. A triple (m, n, p) with initial value zero is set for the 3 target classes to count detections, where m is the total number of people detected, n the total number of helmets detected, and p the number of people detected wearing helmets. When 0 ≤ p ≤ m, the number of people wearing helmets does not exceed the total number of people on the construction site; otherwise, when p > m, the monitoring is regarded as erroneous and no helmet-wearing description sentence is generated. If the detected number of people wearing helmets equals the total number of people, i.e. p = m, everyone wears a helmet; if it differs, i.e. p ≠ m, some people wear helmets and some do not.

b) Definition of the sentence description template. The sentence description template is generated from the picture caption annotations, and its words come from the original caption annotations and from the visual concepts extracted by the target detection algorithm. A visual word is essentially a placeholder whose purpose is to reserve a slot for the word describing a particular region of the image. The target detection algorithm extracts the visual concepts, and the rule-and-template method then generates the image description sentence for a constructor's helmet wearing in the construction scene.
c) Sentence generation
The sentence is finally generated by the sentence description rules defined above combined with the sentence description template. For example, the sentence template may be "<num-1> men <verb-1> <noun-1> on their heads"; the YOLOv3 algorithm extracts the visual concepts of the region (man, wear, helmet), and combined with the predefined rules, m = 2 and p = 2, meaning all constructors in the picture wear helmets. Filling the template (<num-1> → two, <verb-1> → wear, <noun-1> → helmets) finally generates the description sentence of the image: "two men wear helmets on their heads". A sketch of this filling step follows.
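The filling step above as a small sketch; the helper function and the wording of the not-all-wearing case are assumptions for illustration:

    NUM_WORDS = {1: "one", 2: "two", 3: "three"}

    def fill_template(m, p):
        # m people detected, p of them wearing helmets (rule: 0 <= p <= m).
        num = lambda k: NUM_WORDS.get(k, str(k))
        if p == m:  # everyone wears a helmet, so use the template above
            return "{} men wear helmets on their heads".format(num(m))
        return "{} men wear helmets and {} without helmets".format(num(p), num(m - p))

    print(fill_template(2, 2))  # -> two men wear helmets on their heads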
(4) Analysis of results
To verify the effectiveness of the algorithm, it is compared with image description algorithms such as NIC, Soft-Attention and Adaptive on the self-made helmet-wearing image caption data set. Its BLEU-4 score equals that of the Adaptive algorithm, while its scores on the other evaluation metrics improve. Because the algorithm performs helmet-wearing detection, the correspondence between image regions and description sentences is strengthened, and describing helmet wearing with the combined rule-and-template method allows the numbers of people in the picture with and without helmets to be described accurately.
Meanwhile, to further verify the effectiveness of the method, test pictures without caption annotation are tested, and the description sentences generated for the same picture are compared with those of other algorithms, as shown in fig. 3; since the annotation language is English, the output description sentences are in English. In fig. 3, (a) shows single-person helmet-wearing descriptions. The left side of the figure is the description under good illumination: the NIC sentence is "the working man is wearing a yellow helmet", while the method of the invention describes it as "a man wears a helmet on his head". The right side is the description under insufficient light: the NIC sentence is "the man with a white helmet is wearing a blue shirt", while the method of the invention gives "a man is wearing a helmet on his head". The sentences generated by the two algorithms differ slightly, but both describe single-person wearing well. In fig. 3, (b) shows multi-person helmet-wearing descriptions. The left side shows the case where only some people wear helmets, which the method of the invention describes as "two men wear helmets and one without a helmet". The right side shows the case where target sizes differ greatly: the NIC sentence is "two men wear helmets on the field", while the method of the invention describes it as "three men wear helmets". It can be seen that the two algorithms each have strengths and weaknesses. The sentences generated by the NIC algorithm are more diverse, but because that algorithm easily loses detailed information it cannot accurately describe the number of people wearing helmets. Because the invention generates the image description with the combined rule-and-template method, its sentences are slightly lacking in diversity, but the number of people wearing helmets is described better.
Fig. 4 shows the visualized experimental results of the invention; it contains 6 pictures, and from left to right and top to bottom the image descriptions generated for each picture are:
a man without a helmet is working hard.

the man wearing a helmet is standing in the construction site.

two persons without helmets are at work.

a man in an orange helmet and an orange vest.

a man in a white helmet is smiling.

a man is wearing a blue helmet on his head.
The experimental results in fig. 4 show that, whether a single person or several people are involved, and whether the construction scene is simple or complex, the method achieves good image semantic description of constructors' helmet wearing. The method can therefore produce fairly accurate image semantic descriptions of the helmet-wearing condition of constructors in different complex scenes.
In light of the foregoing description of preferred embodiments in accordance with the invention, it is to be understood that numerous changes and modifications may be made by those skilled in the art without departing from the scope of the invention. The technical scope of the present invention is not limited to the contents of the specification, and must be determined according to the scope of the claims.

Claims (3)

1. A method for detecting and describing the wearing condition of a safety helmet in a construction scene is characterized by comprising the following steps: the method comprises the following steps:
s1: making a data set;
collecting images for the construction scene data set by web crawler collection or on-site image capture; the collected data comprise pictures of helmet wearing at construction sites with various backgrounds, resolutions and qualities, containing constructors with and without safety helmets, and all collected pictures form the helmet-wearing data set; production of the helmet-wearing data set comprises: production of the helmet-wearing detection data set and production of the helmet-wearing image caption data set;
s2: detecting a target;
s2.1: selecting a detection model, comprehensively considering two aspects of detection speed and detection precision of an algorithm, and selecting YOLOv3 as a judgment and description model for judging whether a safety helmet is worn in a construction scene;
s2.2: preprocessing the self-made data set, namely performing normalization processing on the labeling information of the self-made data set worn by the helmet in the step S1, and converting the labeling information into a training format available for YOLOv 3;
s2.3: initializing an anchor frame by K-means clustering;
performing a K-means clustering algorithm on the safety helmet wearing data set normalized in the step S2.2 to initialize an anchor frame so as to predict the coordinates of the boundary frame;
s2.4: training a network model;
firstly the coordinate information of the labeled targets is located, then the bounding box confidence of the labeled targets is predicted and the scores of the predefined target classes are predicted; finally an unlabeled test picture is fed into the trained target detection network model, and if the score of a detected target exceeds a set threshold the target is framed in the picture and output, otherwise no target is detected in the picture;
s2.5: network testing
Firstly, resetting the size of an input picture to 416 multiplied by 416, then extracting picture characteristics by utilizing a Darknet-53 network, then sending a characteristic vector to a characteristic pyramid structure for multi-scale prediction, and finally carrying out non-maximum suppression on a predicted boundary frame to eliminate repeated detection to obtain a final prediction result;
s3: generating a statement;
firstly, detecting a visual concept in an image by using a target detection algorithm, then combining a predefined rule and a sentence template, filling the detected visual concept into the sentence template, and finally generating a description sentence worn by the safety helmet;
the manufacturing of the safety helmet wearing data set in the step S1 specifically comprises the following steps:
s1.1: manufacturing a safety helmet wearing detection data set;
according to the labeling format of the Pascal VOC2007 public data set, performing multi-label labeling on the picture sample by using an open source labeling tool LabelImg, and automatically generating a corresponding xml-format labeling file, wherein the xml-format labeling file comprises an object name and coordinate information of a real boundary box; the labeled target categories are: people, safety helmets, and people wearing safety helmets;
s1.2: and the image subtitle data set is made for wearing the safety helmet;
performing sentence annotation on the data set labeled in step S1.1, using self-developed labeling software combined with manual annotation; the caption data set annotation is divided into the following steps:
s1.2.1: reading the name and the size information of each picture by using self-programming labeling software, and giving a unique picture id number to each picture;
s1.2.2: the method comprises the following steps of performing caption labeling on pictures by using self-programming labeling software, manually labeling 5 descriptive sentences of each picture, describing mainly by wearing safety helmets of personnel in a construction scene, and endowing each sentence with a unique sentence id number; each picture has a corresponding picture id number and 5 corresponding sentence id numbers, and picture subtitle labeling data are stored in a json format;
the method for defining sentence description rules and templates in step S3 specifically includes:
s3.1: definition of sentence description rules
three visual concepts are extracted in the preceding target detection stage: people, safety helmets and people wearing safety helmets; a triple (m, n, p) with initial value zero is set for the 3 target classes to count detections, wherein m represents the total number of people detected, n the total number of helmets detected, and p the number of people detected wearing helmets; when 0 ≤ p ≤ m, the number of people wearing helmets does not exceed the total number of people on the construction site; otherwise, when p > m, the monitoring is regarded as erroneous and no helmet-wearing description sentence can be generated; if the detected number of people wearing helmets equals the total number of people, i.e. p = m, everyone wears a helmet; if it differs, i.e. p ≠ m, some people wear helmets and some do not;
s3.2: definition of sentence description template
the sentence description template is generated from the picture caption annotations, and its words come either from the original caption annotations or from the visual concepts extracted by the target detection algorithm; a visual word is essentially a placeholder whose purpose is to reserve a slot for the word describing a particular region of the image; the target detection algorithm extracts the visual concepts, and the rule-and-template method then generates the image description sentence for a constructor's helmet wearing in the construction scene.
2. The method for detecting and describing the wearing condition of the safety helmet in the construction scene according to claim 1, wherein: the step S2.2 of normalizing the annotation information specifically includes:
the sample annotation data are normalized by dividing by the width and height of the image so that the final values lie between 0 and 1, which allows training sample data to be read quickly while meeting the requirement of multi-scale training; the specific normalization formulas are:
x = (x_min + x_max) / (2 · width),  w = (x_max - x_min) / width

y = (y_min + y_max) / (2 · height),  h = (y_max - y_min) / height

wherein x_max, x_min, y_max, y_min represent the bounding box annotation of the original sample, width and height represent the picture size, and x, y, w, h represent the normalized annotation, (x, y) being the coordinates of the centre point of the target and w, h the width and height of the target; in a normalized data sample, the bounding box information of each target of each picture comprises 5 parameters x, y, w, h, class_id, where class_id is the target class number.
3. The method for detecting and describing the wearing condition of the safety helmet in the construction scene according to claim 1, wherein: the step S2.4 of training the network model specifically includes:
s2.4.1: positioning target coordinate information;
the input picture is expressed as a tensor of size n × m × 3, wherein n and m represent the width and height of the picture in pixels and 3 represents the number of RGB channels; firstly, images of different sizes are automatically adjusted to the fixed size 416 × 416, the original image is divided into 13 × 13 grid cells, and the grid cell where the centre point of a target is located is responsible for detecting that target; each grid cell predicts 3 bounding boxes overlaid on it and the confidences of these bounding boxes, each bounding box containing 6 predictions: x, y, w, h, confidence and class_id, wherein (x, y) represents the centre of the predicted bounding box relative to the grid cell boundary, w, h represent the ratio of the width and height of the predicted bounding box to the whole picture, confidence is used for eliminating bounding boxes below a threshold, and class_id represents the target class number; the prediction information of each bounding box comprises its coordinates, width and height, calculated as follows:
b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y

b_w = p_w · e^(t_w),  b_h = p_h · e^(t_h)

wherein (b_x, b_y) represents the centre coordinates of the predicted bounding box, and b_w, b_h represent its width and height; t_x, t_y, t_w, t_h represent the targets of network learning, c_x, c_y are the coordinate offsets of the grid cell, and p_w, p_h are the preset anchor box dimensions;
s2.4.2: predicting the confidence of the bounding box;
after the target coordinate information is located, the confidence of the bounding boxes needs to be predicted; according to the 3 labeled target classes: people, safety helmets and people wearing safety helmets, 3 bounding boxes are predicted for each grid cell and each bounding box contains 6 predictions, so the number of channels is 3 × (4 + 1 + 3) = 24, and the 3 output scale feature maps are 13 × 13 × 24, 26 × 26 × 24 and 52 × 52 × 24 respectively;
s2.4.3: pre-defining a score prediction for a target category;
after the bounding box confidence prediction is completed, the scores of the predefined target classes are predicted; to improve the detection effect on small targets, the idea of multi-scale prediction is adopted, and the 3 feature maps of different scales, 13 × 13, 26 × 26 and 52 × 52 (in pixels), are each used for prediction;
s2.4.4: training of models
According to the characteristics of a self-made safety helmet wearing data set, correspondingly modifying the configuration file of the YOLOv3 network; before training, the weight file is converted into the weight file under the Keras framework according to the modified network configuration file, so that loading of the pre-trained model is facilitated, and initialization parameters are provided for training of the model.
CN201910593069.2A 2019-07-03 2019-07-03 Method for detecting and describing wearing condition of safety helmet in construction scene Active CN110399905B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910593069.2A CN110399905B (en) 2019-07-03 2019-07-03 Method for detecting and describing wearing condition of safety helmet in construction scene


Publications (2)

Publication Number Publication Date
CN110399905A 2019-11-01
CN110399905B 2023-03-24

Family

ID=68322708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910593069.2A Active CN110399905B (en) 2019-07-03 2019-07-03 Method for detecting and describing wearing condition of safety helmet in construction scene

Country Status (1)

Country Link
CN (1) CN110399905B (en)





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant