CN110399905B - Method for detecting and describing wearing condition of safety helmet in construction scene


Info

Publication number
CN110399905B
CN110399905B (application CN201910593069.2A)
Authority
CN
China
Prior art keywords
safety helmet
wearing
target
picture
people
Prior art date
Legal status
Active
Application number
CN201910593069.2A
Other languages
Chinese (zh)
Other versions
CN110399905A (en)
Inventor
徐守坤
李宁
Current Assignee
Changzhou University
Original Assignee
Changzhou University
Priority date
Filing date
Publication date
Application filed by Changzhou University
Priority to CN201910593069.2A
Publication of CN110399905A
Application granted
Publication of CN110399905B
Status: Active

Classifications

    • G06F18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/23213 Pattern recognition: non-hierarchical clustering using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06V20/52 Scene-specific elements: surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V2201/07 Image or video recognition or understanding: target detection

Abstract

The invention provides a method for detecting and describing the wearing condition of safety helmets in a construction scene, which uses image processing and natural language processing to detect and describe whether workers wear safety helmets. Current neural-network-based image description methods lack interpretability and sufficient detail, and image description of construction scenes is under-studied; the method therefore adopts the YOLOv3 target detection algorithm and generates helmet-wearing description sentences by combining rules with templates. Anchor box parameter values are initialized with K-means clustering, the network is then trained and tested on a self-made data set, and an image description of helmet wearing is finally generated from predefined rules and sentence templates. The method offers clear advantages in detection efficiency while producing accurate descriptions, and can thereby help reduce the accident rate.

Description

Method for detecting and describing wearing condition of safety helmet in construction scene
Technical Field
The invention relates to the technical field of image understanding, in particular to a method for detecting and describing the wearing condition of a safety helmet in a construction scene.
Background
In recent years, with the accelerating urbanization of China, infrastructure construction has expanded continuously and construction accidents occur frequently. Construction scenes such as transformer substations, chemical plants and mine working areas are complex and carry inherent dangers; unsafe behavior by workers easily causes accidents, leading to casualties and economic losses. On a construction site the safety helmet is a life guarantee: wearing one meets the behavior specification, and it reduces the operational risk to workers to a certain extent. To ensure the personal safety of workers and reduce the accident rate caused by not wearing safety helmets, describing the helmet-wearing behavior of construction personnel is particularly important.

Image description expresses the content of a picture in natural language on the basis of image recognition, and is a further step beyond recognition itself. In a construction scene, researching image description of workers' helmet wearing has important significance and application value.

Most descriptions generated by current image description methods are global descriptions of the image, so detailed information is easily lost and accuracy suffers. For construction scene pictures, an image description generated from the viewpoint of constructors' helmet wearing is the basis for analyzing site conditions, and hence for judging construction safety and operability in order to eliminate potential safety hazards. Current research on helmet wearing addresses only the image recognition task: whether helmets are detected with traditional algorithms or with deep learning, considerable results have been achieved, but one limitation remains, namely that the helmet-wearing condition of workers is not described in natural language.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to overcome the defects of the prior art, the invention provides a method for detecting and describing the wearing condition of safety helmets in a construction scene. The method adopts the YOLOv3 target detection algorithm and generates helmet-wearing description sentences by combining rules with templates. These sentences allow a more accurate judgment and description of whether constructors wear safety helmets during construction, so as to eliminate potential safety hazards and improve the safety coefficient of the construction scene; the accurate detection effect and high detection speed can also provide theoretical support for an intelligent monitoring robot.
The technical scheme adopted for solving the technical problems is as follows: a method for detecting and describing the wearing condition of a safety helmet in a construction scene comprises the following steps:
s1: production of data sets
Images for the data set are collected by web crawler and by on-site capture. The collected data comprise pictures of helmet wearing at construction sites with various backgrounds, resolutions and qualities, and include constructors both with and without safety helmets. There are 5000 pictures in total, which ensures a degree of richness in the data set, covers various scene conditions, and reflects real scenes relatively completely. The data set is produced in two steps:
s1.1: helmet wearing detection data set manufacturing method
Using the open source labeling tool LabelImg, the picture samples are annotated with multiple labels according to the annotation format of the Pascal VOC2007 public data set, and corresponding xml annotation files are generated automatically; each xml file contains the object names and the coordinate information of the real bounding boxes. The labeled target categories are: people (man), helmets (helmet) and people wearing helmets (man wear helmet). A sketch of reading one such annotation file is given below.
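For illustration only (this sketch is not part of the patent), a Pascal VOC xml file produced by LabelImg can be read with the Python standard library; the file path and class names are assumptions:

    import xml.etree.ElementTree as ET

    def read_voc_annotation(xml_path):
        # Return (object name, (xmin, ymin, xmax, ymax)) pairs from one annotation file.
        root = ET.parse(xml_path).getroot()
        boxes = []
        for obj in root.iter("object"):
            name = obj.find("name").text  # "man", "helmet" or "man wear helmet"
            bb = obj.find("bndbox")
            coords = tuple(int(bb.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax"))
            boxes.append((name, coords))
        return boxes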
S1.2: the method for manufacturing the subtitle data set of the image worn by the helmet comprises the following steps: and (4) performing statement annotation on the data set marked in the step (S1.1). By adopting a mode of combining self-programming labeling software and manual labeling, the labeling of the caption data set is divided into the following steps:
s1.2.1: reading the name and size information (width and height, unit pixel) of each picture by using self-programming labeling software, and giving a unique picture id number to each picture;
s1.2.2: the method comprises the steps of performing caption labeling on pictures by using self-programming labeling software, manually labeling 5 descriptive sentences of each picture, mainly describing by wearing safety helmets of personnel in a construction scene, and endowing each sentence with a unique sentence id number. Each picture has a corresponding picture id number and 5 corresponding sentence id numbers, and picture subtitle labeling data are stored in a json format.
S2: target detection
S2.1: selection of detection model
Existing deep-learning target detection algorithms divide mainly into two types: two-stage methods based on region proposals and one-stage methods without them. A two-stage algorithm first obtains candidate regions and then classifies within them, which increases the time complexity of such candidate-region methods and lengthens detection time. Among one-stage algorithms, YOLOv3-based target detection has achieved good results; its idea is to predict the classes and positions of different targets directly with a single CNN, making it a fast and accurate detection technique. Compared with other methods, its detection precision is close to that of Faster R-CNN provided the targets to be detected are not very small, and compared with SSD, which is also a one-stage method, YOLOv3 is superior in both detection speed and accuracy. Considering detection speed and detection precision together, YOLOv3 is selected as the model for judging and describing whether safety helmets are worn in the construction scene, and the trained model can be readily applied in engineering.
S2.2: self-made data set preprocessing
A helmet-wearing data set is self-made in the Pascal VOC format; the annotation information comprises the target category and the bounding box coordinates, and is normalized and converted into a training format usable by YOLOv3.

The sample annotation data are normalized by dividing by the width and height of the image so that the final values lie between 0 and 1; this allows training sample data to be read quickly and also meets the requirement of multi-scale training. The normalization formulas are:
x = (x_min + x_max) / (2 · width),  w = (x_max - x_min) / width

y = (y_min + y_max) / (2 · height),  h = (y_max - y_min) / height

where x_max, x_min, y_max, y_min represent the bounding box annotation of the original sample, width and height represent the picture size, and x, y, w, h represent the normalized annotation: (x, y) are the coordinates of the centre point of the target, and w, h are the width and height of the target. In a normalized data sample, the bounding box information of each target of each picture comprises 5 parameters x, y, w, h, class_id, where class_id is the target class number.
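A direct sketch of this normalization, assuming VOC-style corner coordinates in pixels:

    def voc_to_yolo(x_min, y_min, x_max, y_max, width, height):
        # Convert corner coordinates (pixels) to normalized (x, y, w, h) in [0, 1].
        x = (x_min + x_max) / (2.0 * width)   # target centre, relative to picture width
        y = (y_min + y_max) / (2.0 * height)  # target centre, relative to picture height
        w = (x_max - x_min) / width           # target width, relative to picture width
        h = (y_max - y_min) / height          # target height, relative to picture height
        return x, y, w, h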
S2.3: k-means cluster initialization anchor frame
YOLOv3 initializes the anchor boxes with a K-means clustering algorithm in order to predict bounding box coordinates, and the anchor box sizes affect detection accuracy. The original YOLOv3 K-means clustering adopts a Euclidean distance formula, and its anchor box parameter values are obtained by clustering on public data sets; these values are general-purpose but unsuitable for the self-made helmet-wearing data set, so new anchor boxes must be designed before training to improve the bounding box detection rate. K-means clustering on the self-made helmet-wearing data set yields 9 anchor boxes, which are sorted from small to large and distributed evenly over the feature maps of 3 scales: the first 3 anchor boxes correspond to the 52 × 52 feature map, the middle 3 to the 26 × 26 feature map, and the last 3 to the 13 × 13 feature map. The final 9 anchor box parameter values are (26, 19), (49, 36), (58, 145), (76, 58), (101, 199), (123, 111), (152, 222), (223, 261), (372, 491), corresponding to the cluster centre points c1 to c9; the width and height of each anchor box are the width and height of the target box at its cluster centre, in pixels. A clustering sketch is given below.
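A minimal K-means sketch over (width, height) pairs, for illustration only; it follows the Euclidean distance mentioned above (a common alternative in the YOLO literature is the 1 - IoU distance):

    import random

    def kmeans_anchors(boxes, k=9, iters=100, seed=0):
        # boxes: list of (w, h) pixel pairs taken from the training annotations.
        random.seed(seed)
        centres = random.sample(boxes, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for w, h in boxes:
                j = min(range(k), key=lambda i: (w - centres[i][0]) ** 2 + (h - centres[i][1]) ** 2)
                clusters[j].append((w, h))
            centres = [
                (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c)) if c else centres[j]
                for j, c in enumerate(clusters)
            ]
        # Sort small to large, so the first 3 boxes go to the 52 x 52 scale, etc.
        return sorted(centres, key=lambda wh: wh[0] * wh[1])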
S2.4: training of network models
The main advantage of YOLOv3 as a one-stage target detection method is that a single CNN processes the whole picture, locating the targets in the image and predicting their categories, thereby converting target detection into a regression problem. During network training, besides locating the coordinate information of the targets to be detected, the confidence of each bounding box must be predicted and the scores of the predefined target classes obtained. The network model is trained in the following steps:
s2.4.1: target coordinate information positioning
The input picture is represented as a tensor of size n × m × 3, where n and m represent the width and height of the picture in pixels and 3 represents the number of RGB channels. Pictures of different sizes are first automatically resized to the fixed size 416 × 416; the original image is then divided into 13 × 13 grid cells, and the cell containing the centre point of a target is responsible for detecting that target. Each grid cell predicts 3 bounding boxes overlaid on it together with the confidences of these boxes, and each bounding box contains 6 predictions: x, y, w, h, confidence and class_id, where (x, y) represents the centre of the predicted box relative to the cell boundary, w, h represent the ratio of the width and height of the predicted box to the whole picture, confidence is used to eliminate boxes below a threshold, and class_id represents the target class number. The prediction information of each bounding box comprises its coordinates, width and height, calculated as follows:
b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y

b_w = p_w · e^(t_w),  b_h = p_h · e^(t_h)

where (b_x, b_y) represents the centre coordinates of the predicted bounding box and b_w, b_h represent its width and height; t_x, t_y, t_w, t_h represent the targets of network learning, c_x, c_y are the coordinate offsets of the grid cell, and p_w, p_h are the preset anchor box dimensions.
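These decoding equations as a small sketch (σ is the logistic sigmoid; all names follow the formulas above):

    import math

    def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
        sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
        bx = sigmoid(tx) + cx   # box centre x, offset from the grid cell corner
        by = sigmoid(ty) + cy   # box centre y
        bw = pw * math.exp(tw)  # box width, scaled from the anchor prior p_w
        bh = ph * math.exp(th)  # box height, scaled from the anchor prior p_h
        return bx, by, bw, bh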
S2.4.2: prediction of bounding box confidence
After the target coordinate information is located, the confidence of the bounding boxes must be predicted. With the 3 labeled target classes (C = 3): people (man), helmets (helmet) and people wearing helmets (man wear helmet), 3 bounding boxes are predicted per grid cell and each bounding box contains 6 predictions, so the number of channels is 3 × (4 + 1 + 3) = 24, and the 3 output scale feature maps are 13 × 13 × 24, 26 × 26 × 24 and 52 × 52 × 24 respectively.
S2.4.3: a score prediction for the target category is predefined.
After the bounding box confidences are predicted, the scores of the predefined target classes are predicted. To improve the detection effect on small targets, the idea of multi-scale prediction is adopted: the 3 feature maps of different scales, 13 × 13, 26 × 26 and 52 × 52, are each used for prediction.
S2.4.4: training of models
The configuration file of the YOLOv3 network is modified according to the characteristics of the self-made data set. Before training, the weight file provided by the official website is converted, following the modified network configuration file, into a weight file for the Keras framework, so that the pre-trained model can be loaded and initialization parameters are provided for training the model.

The batch size (batch) during training is set to 64, i.e. 64 samples are randomly selected for each training iteration, and the grouping (subdivision) is set to 8, i.e. the samples are divided into 8 groups and fed to the network in turn, reducing memory pressure. The network model is normalized with batch normalization (BN) to improve convergence speed. Momentum is set to 0.9 and weight decay to 0.0005 to prevent overfitting; the initial learning rate is set to 0.001 and decays to 1/10 of its value every 5000 iterations. The model is trained for 20000 iterations in total, taking 8 hours, and the loss decreases gradually as the iterations increase: the model fits quickly during the first 4000 iterations with a fast drop in the loss value, and after 10000 iterations the loss stabilizes with only slight oscillation.
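The stated learning rate schedule as a one-line sketch; the function and parameter names are generic, not the actual configuration keys:

    def learning_rate(iteration, base_lr=1e-3, step=5000, gamma=0.1):
        # 0.001 for iterations 0..4999, 0.0001 for 5000..9999, and so on.
        return base_lr * gamma ** (iteration // step)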
S2.5: target detection
First the input picture is resized to 416 × 416; then picture features are extracted with the Darknet-53 network; the feature vectors are fed into a feature pyramid structure for multi-scale prediction; and finally non-maximum suppression is applied to the predicted bounding boxes to eliminate repeated detections and obtain the final prediction result.
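A minimal non-maximum suppression sketch, for illustration only, assuming boxes given as (x_min, y_min, x_max, y_max, score) tuples; the IoU threshold of 0.45 is an assumption, not a value from the patent:

    def nms(boxes, iou_thresh=0.45):
        def iou(a, b):
            ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
            ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
            area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
            return inter / (area(a) + area(b) - inter + 1e-9)
        kept = []
        for box in sorted(boxes, key=lambda b: b[4], reverse=True):  # highest score first
            if all(iou(box, k) < iou_thresh for k in kept):          # drop overlapping repeats
                kept.append(box)
        return kept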
S3: statement generation
First the visual concepts in the image are detected with the target detection algorithm; then, combining predefined rules with a sentence template, the detected visual concepts are filled into the template, and finally the helmet-wearing description sentence is generated. The sentence description rules and templates are defined as follows:
s3.1: definition of sentence description rules
Three visual concepts are extracted by the target detection stage: people, safety helmets, and people wearing safety helmets. A triple (m, n, p) with initial value zero is set for the 3 target classes to count detections, where m is the total number of people detected, n the total number of helmets detected, and p the number of people detected wearing helmets. When 0 ≤ p ≤ m, the number of people wearing helmets does not exceed the total number of people on the construction site; otherwise, when p > m, the monitoring is regarded as erroneous and no helmet-wearing description sentence is generated. If the detected number of people wearing helmets equals the total number of people, i.e. p = m, everyone wears a helmet; if it differs, i.e. p ≠ m, some people wear helmets and some do not. A sketch of this rule is given below.
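The counting rule above as a small sketch; the returned phrases are illustrative wordings, not the patent's templates:

    def helmet_rule(m, n, p):
        # (m, n, p) = (people, helmets, people wearing helmets) detected in one picture;
        # n is carried along for completeness of the triple.
        if p > m:
            return None  # inconsistent detection: no sentence is generated
        if p == m:
            return "all people wear helmets"
        return "some people wear helmets and some do not"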
S3.2: definition of sentence description template
The sentence description template is generated from the picture caption annotations, and its words come either from the original caption annotations or from the visual concepts extracted by the target detection algorithm. A visual word is essentially a placeholder whose purpose is to reserve a slot for the word describing a particular region of the image. The target detection algorithm extracts the visual concepts, and the rule-and-template method then generates the image description sentence for a constructor's helmet wearing in the construction scene.
With the method provided by the invention for detecting and describing helmet wearing in a construction scene, the YOLOv3 algorithm allows a more accurate judgment of whether constructors wear safety helmets during construction, so as to eliminate potential safety hazards and improve the safety coefficient of the construction scene; the accurate detection effect and high detection speed can also provide theoretical support for an intelligent monitoring robot.
Drawings
The invention is further illustrated by the following figures and examples.
FIG. 1 is a flow chart of the algorithm of the present invention.
FIG. 2 is a diagram of the algorithm framework of the present invention.
Fig. 3 is a graph comparing the results of the image description by the NIC method and the embodiment method, wherein (a) is a single person wearing image description and (b) is a multi person wearing image description.
Fig. 4 is a graph of experimental results of example method visualization.
Detailed Description
The present invention will now be described in detail with reference to the accompanying drawings. The figures are simplified schematic diagrams that illustrate the basic structure of the invention, so they show only the parts relevant to the invention.
Referring to fig. 1, a method for detecting and describing a wearing condition of a safety helmet in a construction scene according to the present invention is described in detail with reference to specific embodiments.
The embodiment platform is built on Linux with Ubuntu 16.04 as the operating system, an NVIDIA GeForce GTX 1080Ti GPU with 12 GB of memory, CUDA 8.0 and cuDNN 6.0. Training and testing of the model use the Keras deep learning framework. The one-stage target detection algorithm YOLOv3 is selected to detect the helmet wearing of constructors in pictures, and the rule-and-template method is combined with the helmet-wearing detection algorithm to generate the image description of constructors' helmet wearing; the algorithm flow chart is shown in figure 1.
(1) Production of data sets
Images for the data set are collected by web crawler and by on-site capture. The collected data comprise pictures of helmet wearing at construction sites with various backgrounds, resolutions and qualities, and include constructors both with and without safety helmets. There are 5000 pictures in total, which ensures a degree of richness in the data set, covers various scene conditions, and reflects real scenes relatively completely. The data set is produced in two steps:
1) Helmet wearing detection data set manufacturing method
Using the open source labeling tool LabelImg, the picture samples are annotated with multiple labels according to the annotation format of the Pascal VOC2007 public data set, and corresponding xml annotation files are generated automatically; each xml file contains the object names and the coordinate information of the real bounding boxes. The labeled target categories are: people (man), helmets (helmet) and people wearing helmets (man wear helmet).
2) Production of the helmet-wearing image caption data set: sentence annotation is performed on the data set labeled in step 1). Using self-developed labeling software combined with manual annotation, the caption data set is labeled in the following steps:

a) The self-developed labeling software reads the name and size information (width and height) of each picture and assigns each picture a unique picture id number;

b) The self-developed labeling software is used for caption annotation: 5 descriptive sentences are manually written for each picture, mainly describing the helmet wearing of the people in the construction scene, and each sentence is given a unique sentence id number. Each picture therefore has one picture id number and 5 corresponding sentence id numbers, and the picture caption annotation data are stored in json format.
(2) Helmet wear detection
In the embodiment, the self-made helmet-wearing detection data set is divided into three groups, a training set, a validation set and a test set, in the ratio 7:2:1. The training and validation sets both contain annotation information; the test samples contain none, so as to verify the effectiveness of the trained model.
1) Preprocessing of safety helmet donning detection data sets
The helmet-wearing data set is self-made in the Pascal VOC format; the annotation information comprises the target category and the bounding box coordinates, and is normalized and converted into a training format usable by YOLOv3.

The sample annotation data are normalized by dividing by the width and height of the image so that the final values lie between 0 and 1; this allows training sample data to be read quickly and also meets the requirement of multi-scale training. The normalization formulas are:
x = (x_min + x_max) / (2 · width),  w = (x_max - x_min) / width

y = (y_min + y_max) / (2 · height),  h = (y_max - y_min) / height

where x_max, x_min, y_max, y_min represent the bounding box annotation of the original sample, width and height represent the picture size, and x, y, w, h represent the normalized annotation: (x, y) are the coordinates of the centre point of the target, and w, h are the width and height of the target. In a normalized data sample, the bounding box information of each target of each picture comprises 5 parameters x, y, w, h, class_id, where class_id is the target class number.
The anchor box parameter values of the public data sets are not applicable to the data set of the invention, so they must be re-determined from the self-made helmet-wearing data set. Cluster analysis of the data set with the K-means algorithm yields 9 anchor box parameter values: (26, 19), (49, 36), (58, 145), (76, 58), (101, 199), (123, 111), (152, 222), (223, 261), (372, 491), corresponding to the cluster centre points c1 to c9; the width and height of each anchor box are the width and height of the target box at its cluster centre.
2) Training and testing of models
The configuration file of the YOLOv3 network is modified according to the characteristics of the self-made data set. Before training, the weight file must be converted: the weight file provided by the official website is converted, following the modified network configuration file, into a weight file for the Keras framework, so that the pre-trained model can be loaded and initialization parameters are provided for training the model.

The batch size (batch) during training is set to 64, i.e. 64 samples are randomly selected for each training iteration, and the grouping (subdivision) is set to 8, i.e. the samples are divided into 8 groups and fed to the network in turn, reducing memory pressure. The network model is normalized with batch normalization (BN) to improve convergence speed. Momentum is set to 0.9 and weight decay to 0.0005 to prevent overfitting; the initial learning rate is set to 0.001 and decays to 1/10 of its value every 5000 iterations. The model is trained for 20000 iterations in total, taking 8 hours, and testing shows that the loss decreases gradually as the iterations increase: the model fits quickly during the first 4000 iterations with a fast drop in the loss value, and after 10000 iterations the loss stabilizes with only slight oscillation.
The invention uses the YOLOv3 target detection algorithm for helmet-wearing detection and runs comparison experiments against the Faster R-CNN and SSD algorithms. After training, the resulting model weight file is loaded and the model is tested and evaluated on the test set: the algorithm of the invention is slightly lower than Faster R-CNN in average precision (AP) for helmet detection, but is superior to the other algorithms in detection speed and in mean average precision (mAP) over the 3 target classes.
(3) Helmet worn image description sentence generation
And detecting the visual concept in the image by using a target detection algorithm, filling the detected visual concept into a sentence template by combining a predefined rule and the sentence template, and finally generating a description sentence worn by the safety helmet. The algorithm frame diagram is shown in fig. 2.
The self-made helmet-wearing image caption data set is divided into three groups in the ratio 7:2:1. The sizes of the training, validation and test sets are 3500, 1000 and 500 respectively; the training and validation sets contain pictures with their caption annotations, while the test set contains no caption annotations, so as to verify the effectiveness of the method.
1) Preprocessing of helmet worn image subtitle data sets
The self-made image caption data set is preprocessed with the following main operations: a) caption annotation sentences longer than 15 words are truncated; b) occurrences of 'and' are deleted from the annotation samples and word case is unified by converting uppercase words to lowercase; c) word frequencies are counted and each word in the annotation samples is given a unique id number; d) a vocabulary table containing 3 fields (word id, word, word frequency) is built: words appearing at least 3 times in the annotation samples are stored in the vocabulary, and the remaining words are treated as uncommon and represented by 'UNK'.
The vocabulary is built on the self-made picture caption training set: the total word count is 183047 with 2872 distinct words; filtering with a threshold of 3 yields an effective vocabulary of size 1343, i.e. the description vocabulary contains 1343 distinct words. A sketch of this construction follows.
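The vocabulary construction above as a small sketch (tokenization is assumed to have been done already):

    from collections import Counter

    def build_vocab(captions, min_count=3):
        # captions: list of token lists; returns {word: (word id, word frequency)}.
        freq = Counter(tok for sent in captions for tok in sent)
        kept = sorted(w for w, c in freq.items() if c >= min_count)
        vocab = {w: (i, freq[w]) for i, w in enumerate(kept)}
        rare = sum(c for w, c in freq.items() if c < min_count)
        vocab["UNK"] = (len(vocab), rare)  # all uncommon words map to "UNK"
        return vocab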
2) Sentence description rules and template definition
a) Definition of the sentence description rules. Three classes of visual concepts are extracted by the target detection stage: people, safety helmets, and people wearing safety helmets. A triple (m, n, p) with initial value zero is set for the 3 target classes to count detections, where m is the total number of people detected, n the total number of helmets detected, and p the number of people detected wearing helmets. When 0 ≤ p ≤ m, the number of people wearing helmets does not exceed the total number of people on the construction site; otherwise, when p > m, the monitoring is regarded as erroneous and no helmet-wearing description sentence is generated. If the detected number of people wearing helmets equals the total number of people, i.e. p = m, everyone wears a helmet; if it differs, i.e. p ≠ m, some people wear helmets and some do not.

b) Definition of the sentence description template. The sentence description template is generated from the picture caption annotations, and its words come from the original caption annotations and from the visual concepts extracted by the target detection algorithm. A visual word is essentially a placeholder whose purpose is to reserve a slot for the word describing a particular region of the image. The target detection algorithm extracts the visual concepts, and the rule-and-template method then generates the image description sentence for a constructor's helmet wearing in the construction scene.
c) Sentence generation
The sentence is finally generated by the sentence description rules defined above combined with the sentence description template. For example, the sentence template may be "<num-1> men <verb-1> <noun-1> on their heads"; the YOLOv3 algorithm extracts the visual concepts of the region (man, wear, helmet), and combined with the predefined rules, m = 2 and p = 2, meaning all constructors in the picture wear helmets. Filling the template (<num-1> → two, <verb-1> → wear, <noun-1> → helmets) finally generates the description sentence of the image: "two men wear helmets on their heads". A sketch of this filling step follows.
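The filling step above as a small sketch; the helper function and the wording of the not-all-wearing case are assumptions for illustration:

    NUM_WORDS = {1: "one", 2: "two", 3: "three"}

    def fill_template(m, p):
        # m people detected, p of them wearing helmets (rule: 0 <= p <= m).
        num = lambda k: NUM_WORDS.get(k, str(k))
        if p == m:  # everyone wears a helmet, so use the template above
            return "{} men wear helmets on their heads".format(num(m))
        return "{} men wear helmets and {} without helmets".format(num(p), num(m - p))

    print(fill_template(2, 2))  # -> two men wear helmets on their heads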
(4) Analysis of results
To verify the effectiveness of the algorithm, it is compared with image description algorithms such as NIC, Soft-Attention and Adaptive on the self-made helmet-wearing image caption data set. Its BLEU-4 score equals that of the Adaptive algorithm, while its scores on the other evaluation metrics improve. Because the algorithm performs helmet-wearing detection, the correspondence between image regions and description sentences is strengthened, and describing helmet wearing with the combined rule-and-template method allows the numbers of people in the picture with and without helmets to be described accurately.
Meanwhile, to further verify the effectiveness of the method, test pictures without caption annotation are tested, and the description sentences generated for the same picture are compared with those of other algorithms, as shown in fig. 3; since the annotation language is English, the output description sentences are in English. In fig. 3, (a) shows single-person helmet-wearing descriptions. The left side of the figure is the description under good illumination: the NIC sentence is "the working man is wearing a yellow helmet", while the method of the invention describes it as "a man wears a helmet on his head". The right side is the description under insufficient light: the NIC sentence is "the man with a white helmet is wearing a blue shirt", while the method of the invention gives "a man is wearing a helmet on his head". The sentences generated by the two algorithms differ slightly, but both describe single-person wearing well. In fig. 3, (b) shows multi-person helmet-wearing descriptions. The left side shows the case where only some people wear helmets, which the method of the invention describes as "two men wear helmets and one without a helmet". The right side shows the case where target sizes differ greatly: the NIC sentence is "two men wear helmets on the field", while the method of the invention describes it as "three men wear helmets". It can be seen that the two algorithms each have strengths and weaknesses. The sentences generated by the NIC algorithm are more diverse, but because that algorithm easily loses detailed information it cannot accurately describe the number of people wearing helmets. Because the invention generates the image description with the combined rule-and-template method, its sentences are slightly lacking in diversity, but the number of people wearing helmets is described better.
Fig. 4 shows the visualized experimental results of the invention; it contains 6 pictures, and from left to right and top to bottom the image descriptions generated for each picture are:
a man without a helmet is working hard.

the man wearing a helmet is standing in the construction site.

two persons without helmets are at work.

a man in an orange helmet and an orange vest.

a man in a white helmet is smiling.

a man is wearing a blue helmet on his head.
The experimental results in fig. 4 show that, whether a single person or several people are involved, and whether the construction scene is simple or complex, the method achieves good image semantic description of constructors' helmet wearing. The method can therefore produce fairly accurate image semantic descriptions of the helmet-wearing condition of constructors in different complex scenes.
In light of the foregoing description of preferred embodiments in accordance with the invention, it is to be understood that numerous changes and modifications may be made by those skilled in the art without departing from the scope of the invention. The technical scope of the present invention is not limited to the contents of the specification, and must be determined according to the scope of the claims.

Claims (3)

1. A method for detecting and describing the wearing condition of a safety helmet in a construction scene is characterized by comprising the following steps: the method comprises the following steps:
s1: making a data set;
collecting images for the construction scene data set by web crawler collection or on-site image capture; the collected data comprise pictures of helmet wearing at construction sites with various backgrounds, resolutions and qualities, containing constructors with and without safety helmets, and all collected pictures form the helmet-wearing data set; production of the helmet-wearing data set comprises: production of the helmet-wearing detection data set and production of the helmet-wearing image caption data set;
s2: detecting a target;
s2.1: selecting a detection model, comprehensively considering two aspects of detection speed and detection precision of an algorithm, and selecting YOLOv3 as a judgment and description model for judging whether a safety helmet is worn in a construction scene;
s2.2: preprocessing the self-made data set, namely performing normalization processing on the labeling information of the self-made data set worn by the helmet in the step S1, and converting the labeling information into a training format available for YOLOv 3;
s2.3: initializing an anchor frame by K-means clustering;
performing a K-means clustering algorithm on the safety helmet wearing data set normalized in the step S2.2 to initialize an anchor frame so as to predict the coordinates of the boundary frame;
s2.4: training a network model;
firstly the coordinate information of the labeled targets is located, then the bounding box confidence of the labeled targets is predicted and the scores of the predefined target classes are predicted; finally an unlabeled test picture is fed into the trained target detection network model, and if the score of a detected target exceeds a set threshold the target is framed in the picture and output, otherwise no target is detected in the picture;
s2.5: network testing
Firstly, resetting the size of an input picture to 416 multiplied by 416, then extracting picture characteristics by utilizing a Darknet-53 network, then sending a characteristic vector to a characteristic pyramid structure for multi-scale prediction, and finally carrying out non-maximum suppression on a predicted boundary frame to eliminate repeated detection to obtain a final prediction result;
s3: generating a statement;
firstly, detecting a visual concept in an image by using a target detection algorithm, then combining a predefined rule and a sentence template, filling the detected visual concept into the sentence template, and finally generating a description sentence worn by the safety helmet;
the manufacturing of the safety helmet wearing data set in the step S1 specifically comprises the following steps:
s1.1: manufacturing a safety helmet wearing detection data set;
according to the labeling format of the Pascal VOC2007 public data set, performing multi-label labeling on the picture sample by using an open source labeling tool LabelImg, and automatically generating a corresponding xml-format labeling file, wherein the xml-format labeling file comprises an object name and coordinate information of a real boundary box; the labeled target categories are: people, safety helmets, and people wearing safety helmets;
s1.2: and the image subtitle data set is made for wearing the safety helmet;
performing sentence annotation on the data set labeled in step S1.1, using self-developed labeling software combined with manual annotation; the caption data set annotation is divided into the following steps:
s1.2.1: reading the name and the size information of each picture by using self-programming labeling software, and giving a unique picture id number to each picture;
s1.2.2: the method comprises the following steps of performing caption labeling on pictures by using self-programming labeling software, manually labeling 5 descriptive sentences of each picture, describing mainly by wearing safety helmets of personnel in a construction scene, and endowing each sentence with a unique sentence id number; each picture has a corresponding picture id number and 5 corresponding sentence id numbers, and picture subtitle labeling data are stored in a json format;
the method for defining sentence description rules and templates in step S3 specifically includes:
s3.1: definition of sentence description rules
three visual concepts are extracted in the preceding target detection stage: people, safety helmets and people wearing safety helmets; a triple (m, n, p) with initial value zero is set for the 3 target classes to count detections, wherein m represents the total number of people detected, n the total number of helmets detected, and p the number of people detected wearing helmets; when 0 ≤ p ≤ m, the number of people wearing helmets does not exceed the total number of people on the construction site; otherwise, when p > m, the monitoring is regarded as erroneous and no helmet-wearing description sentence can be generated; if the detected number of people wearing helmets equals the total number of people, i.e. p = m, everyone wears a helmet; if it differs, i.e. p ≠ m, some people wear helmets and some do not;
s3.2: definition of sentence description template
the sentence description template is generated from the picture caption annotations, and its words come either from the original caption annotations or from the visual concepts extracted by the target detection algorithm; a visual word is essentially a placeholder whose purpose is to reserve a slot for the word describing a particular region of the image; the target detection algorithm extracts the visual concepts, and the rule-and-template method then generates the image description sentence for a constructor's helmet wearing in the construction scene.
2. The method for detecting and describing the wearing condition of the safety helmet in the construction scene according to claim 1, wherein: the step S2.2 of normalizing the annotation information specifically includes:
the sample annotation data are normalized by dividing by the width and height of the image so that the final values lie between 0 and 1, which allows training sample data to be read quickly while meeting the requirement of multi-scale training; the specific normalization formulas are:
x = (x_min + x_max) / (2 · width),  w = (x_max - x_min) / width

y = (y_min + y_max) / (2 · height),  h = (y_max - y_min) / height

wherein x_max, x_min, y_max, y_min represent the bounding box annotation of the original sample, width and height represent the picture size, and x, y, w, h represent the normalized annotation, (x, y) being the coordinates of the centre point of the target and w, h the width and height of the target; in a normalized data sample, the bounding box information of each target of each picture comprises 5 parameters x, y, w, h, class_id, where class_id is the target class number.
3. The method for detecting and describing the wearing condition of the safety helmet in the construction scene according to claim 1, wherein: the step S2.4 of training the network model specifically includes:
s2.4.1: positioning target coordinate information;
the input picture is expressed as a tensor of size n × m × 3, wherein n and m represent the width and height of the picture in pixels and 3 represents the number of RGB channels; firstly, images of different sizes are automatically adjusted to the fixed size 416 × 416, the original image is divided into 13 × 13 grid cells, and the grid cell where the centre point of a target is located is responsible for detecting that target; each grid cell predicts 3 bounding boxes overlaid on it and the confidences of these bounding boxes, each bounding box containing 6 predictions: x, y, w, h, confidence and class_id, wherein (x, y) represents the centre of the predicted bounding box relative to the grid cell boundary, w, h represent the ratio of the width and height of the predicted bounding box to the whole picture, confidence is used for eliminating bounding boxes below a threshold, and class_id represents the target class number; the prediction information of each bounding box comprises its coordinates, width and height, calculated as follows:
b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y

b_w = p_w · e^(t_w),  b_h = p_h · e^(t_h)

wherein (b_x, b_y) represents the centre coordinates of the predicted bounding box, and b_w, b_h represent its width and height; t_x, t_y, t_w, t_h represent the targets of network learning, c_x, c_y are the coordinate offsets of the grid cell, and p_w, p_h are the preset anchor box dimensions;
s2.4.2: predicting the confidence of the bounding box;
after the target coordinate information is located, the confidence of the bounding boxes needs to be predicted; according to the 3 labeled target classes: people, safety helmets and people wearing safety helmets, 3 bounding boxes are predicted for each grid cell and each bounding box contains 6 predictions, so the number of channels is 3 × (4 + 1 + 3) = 24, and the 3 output scale feature maps are 13 × 13 × 24, 26 × 26 × 24 and 52 × 52 × 24 respectively;
s2.4.3: pre-defining a score prediction for a target category;
after the bounding box confidence prediction is completed, the scores of the predefined target classes are predicted; to improve the detection effect on small targets, the idea of multi-scale prediction is adopted, and the 3 feature maps of different scales, 13 × 13, 26 × 26 and 52 × 52 (in pixels), are each used for prediction;
s2.4.4: training of models
According to the characteristics of a self-made safety helmet wearing data set, correspondingly modifying the configuration file of the YOLOv3 network; before training, the weight file is converted into the weight file under the Keras framework according to the modified network configuration file, so that loading of the pre-trained model is facilitated, and initialization parameters are provided for training of the model.
CN201910593069.2A 2019-07-03 2019-07-03 Method for detecting and describing wearing condition of safety helmet in construction scene Active CN110399905B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910593069.2A CN110399905B (en) 2019-07-03 2019-07-03 Method for detecting and describing wearing condition of safety helmet in construction scene


Publications (2)

Publication Number Publication Date
CN110399905A 2019-11-01
CN110399905B 2023-03-24

Family

ID=68322708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910593069.2A Active CN110399905B (en) 2019-07-03 2019-07-03 Method for detecting and describing wearing condition of safety helmet in construction scene

Country Status (1)

Country Link
CN (1) CN110399905B (en)





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant