CN111178206B - Building embedded part detection method and system based on improved YOLO - Google Patents

Building embedded part detection method and system based on improved YOLO

Info

Publication number
CN111178206B
CN111178206B (application number CN201911328091.0A)
Authority
CN
China
Prior art keywords
building
picture
yolo
embedded part
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911328091.0A
Other languages
Chinese (zh)
Other versions
CN111178206A (en)
Inventor
姜向远
邢金昊
于敦政
陈菲雨
贾磊
马思乐
陈纪旸
栾义忠
杜延丽
岳文斌
马晓静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN201911328091.0A priority Critical patent/CN111178206B/en
Publication of CN111178206A publication Critical patent/CN111178206A/en
Application granted granted Critical
Publication of CN111178206B publication Critical patent/CN111178206B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/02Affine transformations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/60Rotation of whole images or parts thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The present disclosure provides a building embedded part detection method and system based on improved YOLO. Pictures of building embedded parts are acquired, the embedded parts in the pictures are calibrated (annotated) to form a data set, and the data set is divided into a training set and a test set. A MobileNet network replaces the Darknet53 network in the YOLO detection algorithm as the feature extraction network to construct an improved YOLO detection model, which is trained with the training set until the test requirements on the test set are met, yielding the final detection model. An aerial picture of a building site under construction is then acquired, flipped, subjected to affine transformations of different scales and Gaussian blur, and fed to the final detection model as the input picture to obtain the embedded part detection result.

Description

Building embedded part detection method and system based on improved YOLO
Technical Field
The disclosure belongs to the technical field of detection of building engineering embedded parts, and relates to a building embedded part detection method and system based on improved YOLO.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the rapid development of the economy, the construction industry, as a basic industry, has developed quickly. The prosperity of the construction market brings challenges to many construction enterprises and places higher requirements on their construction quality and efficiency. Embedded parts are widely used in modern construction engineering and include structural members such as steel plates, bolts and junction boxes, as well as embedded pipes such as wiring conduits and drain pipes. The installation quality of embedded parts directly affects the construction progress and the structural safety of the project, so it must be strictly controlled. At present, the positions of the various embedded parts are inspected on site by workers before concrete is poured. However, for projects with a large construction area, the embedded parts are numerous and their positions are scattered, and the wiring of junction boxes, pipes and other embedded parts is complex and hard to inspect, so manual inspection is time-consuming, labor-intensive and inefficient; in high-rise construction, working at height also puts workers at risk. Using an unmanned aerial vehicle instead of manual inspection of embedded parts in high-rise, large-area construction projects is therefore a new idea.
In recent years, unmanned aerial vehicles have spread rapidly in the civil and commercial fields thanks to their small size, light weight and ability to carry various task payloads, and they have also been adopted in the construction field, for example for foundation construction surveying and construction site management. By flying the unmanned aerial vehicle above the site, images captured by its gimbal camera can be transmitted back to a computer in real time through the image transmission module for processing, so that embedded part positions can be quickly located from an elevated viewpoint and compared with the building design drawings; this saves manpower, speeds up embedded part inspection, and improves engineering efficiency and quality. Rapid detection of embedded parts from the image information returned by the unmanned aerial vehicle requires an efficient and accurate target detection algorithm. Traditional target detection algorithms, such as those based on Histogram of Oriented Gradient (HOG) feature extraction or on support vector machines, do not select a suitable sliding window for the target, so their computational time complexity is high, their window redundancy is high, and their detection efficiency is low.
To the inventors' knowledge, target detection algorithms based on deep learning are receiving more and more attention. These algorithms train convolutional neural networks (Convolutional Neural Networks, CNNs) with the data and labels in a data set and can be divided into two types. The first type comprises region-based target detection algorithms, such as RCNN (Regions with CNN) and Faster R-CNN, which extract candidate regions in advance according to the target object position; their detection accuracy is high, but their detection speed is low. The second type comprises end-to-end target detection algorithms, such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector), which omit the candidate region generation step and perform feature extraction, target classification and bounding box regression in the same convolutional neural network, greatly increasing detection speed. However, the neural networks used by deep-learning-based detection algorithms are large and complex in structure, which still limits detection speed.
Disclosure of Invention
In order to solve the problems, the disclosure provides a building embedded part detection method and system based on improved YOLO.
According to some embodiments, the present disclosure employs the following technical solutions:
a building embedded part detection method based on improved YOLO comprises the following steps:
acquiring a picture of the building embedded part, calibrating the building embedded part in the picture to form a data set, and dividing the data set into a training set and a testing set;
utilizing a MobileNet network to replace a Darknet53 network in a YOLO detection algorithm as a feature extraction network, constructing an improved YOLO detection model, and training the improved YOLO detection model by utilizing a training set until the testing requirement of a testing set is met, so as to obtain a final detection model;
and acquiring an aerial picture of a building site under construction, flipping the picture, applying affine transformations of different scales and Gaussian blur, and identifying the input picture with the final detection model to obtain the embedded part detection result.
As a further limitation, during detection, target detection is treated as a regression problem: the input picture is divided into an S × S grid, and if the center of a detected target falls within a certain cell, that cell is responsible for predicting the target; each cell produces B bounding boxes, each containing the offset of the target's center position relative to the cell position, the width and height of the bounding box, and the confidence of the target.
As a further limitation, a convolutional neural network is used to extract features of the target object and make predictions, and each cell gives C class probability values, representing the probability that the target in the bounding box the cell is responsible for predicting belongs to each class. The conditional probability that an object in a cell belongs to class i is Pr(class_i | object), and the probability that the identified object belongs to class i is Pr(class_i):

Pr(class_i | object) × Pr(object) × IOU_pred^truth = Pr(class_i) × IOU_pred^truth

The intersection over union of the predicted bounding box and the real area of the object is expressed as:

IOU_pred^truth = area(box_pred ∩ box_truth) / area(box_pred ∪ box_truth)
as a further limitation, the mean square error is used as a loss function to optimize the model parameters, i.e. the mean square error of the output vector of the network structure and the corresponding vector of the real image.
As a further limitation, the MobileNet network uses depthwise separable convolutions instead of standard convolutions, each decomposed into a depthwise convolution and a 1×1 pointwise convolution.
As a further limitation, aerial pictures of the building site under construction are acquired by a visible light camera carried by the unmanned aerial vehicle.
An improved YOLO-based building embedment detection system, comprising:
the data acquisition module is configured to acquire pictures of the building embedded parts, calibrate the building embedded parts in the pictures to form a data set, and divide the data set into a training set and a testing set;
the model construction and training module is configured to utilize a mobile Net network to replace a Darknet53 network in a YOLO detection algorithm as a feature extraction network, construct an improved YOLO detection model, and train the improved YOLO detection model by utilizing a training set until the testing requirements of a testing set are met, so as to obtain a final detection model;
the detection and identification module is configured to acquire an aerial picture of a building site under construction, flip the picture, apply affine transformations of different scales and Gaussian blur, take the result as the input picture, and identify the input picture with the final detection model to obtain the embedded part detection result.
A computer readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to perform the steps of the improved YOLO-based building embedment detection method.
A terminal device comprising a processor and a computer readable storage medium, the processor configured to implement instructions; the computer readable storage medium is for storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of the improved YOLO-based building embedment detection method.
Compared with the prior art, the beneficial effects of the present disclosure are:
the method and system can effectively reduce the size of the network model and the number of network parameters, improve detection performance, and effectively detect and identify items such as embedded junction boxes and embedded pipes with high detection precision.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram of a depth separable convolution structure;
FIG. 2 is a schematic diagram of a modified YOLO neural network structure;
FIG. 3 is a schematic diagram of the SE-Block basic structure;
FIG. 4 shows the basic structure of SEMobileNet;
FIGS. 5 (a) - (b) are data set sample pictures;
FIGS. 6 (a) - (d) are comparative illustrations of the detection effect of four methods.
Detailed description of embodiments:
the disclosure is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
A building site under construction was photographed with a visible light camera mounted on an unmanned aerial vehicle (for example, a DJI Matrice 100), and data set samples are shown in FIGS. 5 (a) - (b). To ensure data diversity, different flight attitudes of the unmanned aerial vehicle, such as hovering, ascending and descending, and steady flight, were used during shooting. A manual calibration (annotation) method was adopted: junction boxes and drain pipes, as representative embedded parts, were selected as detection targets, and the calibrated pictures were saved in the PASCAL VOC format. Because the construction site background is complex, embedded parts are often occluded by reinforcing bars, construction elements and other embedded parts during shooting; to ensure data validity, samples with more than 50% of their area occluded were not annotated. In addition, to make the training data more effective and improve the generalization ability of the network, the following expansion operations were performed on the data set pictures: flipping each picture left-right and up-down with a flip matrix; affine transformations of different scales; and Gaussian blur. The augmented data set, 3590 pictures in total, was divided into a training set and a test set at a ratio of 4:1.
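As an illustration of these expansion operations, the following is a minimal sketch using OpenCV; the rotation angle, scale factor and blur kernel size are illustrative assumptions rather than values fixed by this disclosure, and in practice the annotation boxes would have to be transformed together with each picture.

```python
import cv2

def expand_picture(image):
    """Produce the expanded samples described above for one data set picture:
    left-right and up-down flips, an affine transformation at a different scale,
    and Gaussian blur. Parameter values are illustrative assumptions."""
    h, w = image.shape[:2]
    samples = [image]
    samples.append(cv2.flip(image, 1))   # left-right flip
    samples.append(cv2.flip(image, 0))   # up-down flip
    # affine transformation (rotation plus scaling about the picture centre)
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), 10, 0.8)
    samples.append(cv2.warpAffine(image, M, (w, h)))
    samples.append(cv2.GaussianBlur(image, (5, 5), 0))  # Gaussian blur
    return samples
```

The expanded pictures would then be split 4:1 into training and test sets as described above.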
YOLO v3 adopts the new Darknet-53 feature extraction network. Darknet-53 borrows the idea of the ResNet structure and has 53 convolutional layers; the deeper network gives YOLO v3 stronger feature extraction capability. Darknet-53 also removes the pooling layers and fully connected layers; each base layer is followed by a BN (Batch Normalization) layer and a Leaky ReLU layer, and residual modules are added to the network, which alleviates the vanishing or exploding gradient problem of deep networks. However, because the Darknet-53 structure is complex, the network has a large number of weight parameters, so the algorithm places high demands on the image processing hardware and the picture detection speed suffers. Tiny-YOLO is a simplified version of YOLO v3 with a much simpler feature extraction network: the residual network is removed and only 7 convolutional layers and 6 pooling layers remain, which reduces the number of parameters and the hardware requirements and effectively improves detection speed, but at the cost of detection accuracy. In this disclosure, the feature extraction network is replaced with the lighter MobileNet network, which reduces network complexity and speeds up image detection by using a more efficient convolution scheme.
When performing object detection, YOLO treats detection as a regression problem and divides the input picture into an S × S grid; if the center of a detected object falls within a certain cell, that cell is responsible for predicting the object. Each cell generates B bounding boxes, and each bounding box contains 5 parameters (x, y, w, h, Confidence), where (x, y, w, h) are the offset of the object's center position relative to the cell position and the width and height of the bounding box, and Confidence reflects whether the bounding box contains a target object and how accurate the object's position is. It is calculated as follows:

Confidence = Pr(object) × IOU_pred^truth

where Pr(object) indicates whether the bounding box contains the object, taking 1 if it does and 0 otherwise, and IOU_pred^truth is the intersection over union of the predicted bounding box and the real area of the object.
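For illustration, the intersection-over-union term above can be computed as in the following sketch; the corner-coordinate box format (x1, y1, x2, y2) is an assumption made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)
    corner coordinates (the box format is an assumption for illustration)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```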
A convolutional neural network is used to extract features of the target object and make predictions. Each cell must give C class probability values, representing the probability that the target in the bounding box the cell is responsible for predicting belongs to each class. The conditional probability that an object in a cell belongs to class i is Pr(class_i | object), and the probability that the identified object belongs to class i is:

Pr(class_i | object) × Pr(object) × IOU_pred^truth = Pr(class_i) × IOU_pred^truth
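At test time the two factors are combined per box and per class. A small numpy sketch follows; the array shapes, (S, S, C) class probabilities and (S, S, B) box confidences, are assumptions for illustration.

```python
import numpy as np

def class_specific_confidence(class_probs, box_confidence):
    """Combine per-cell class probabilities Pr(class_i | object) with per-box
    confidences Pr(object) * IOU to obtain class-specific scores per box.
    Assumed shapes: (S, S, C) and (S, S, B); the result has shape (S, S, B, C)."""
    return class_probs[..., None, :] * box_confidence[..., :, None]
```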
YOLO uses the sum of squared errors as the loss function to optimize the model parameters, i.e. the sum of squared errors between the output vector of the network structure and the corresponding vector of the real image. The formula is:

loss = coordError + iouError + classError

where coordError is the error between the predicted data and the calibration data, iouError is the intersection-over-union (confidence) error, and classError is the classification error. The specific formula is:

loss = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (x_i − x̂_i)² + (y_i − ŷ_i)² ]
     + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (√w_i − √ŵ_i)² + (√h_i − √ĥ_i)² ]
     + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²
     + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
     + Σ_{i=0}^{S²} 1_i^{obj} Σ_{c ∈ classes} (p_i(c) − p̂_i(c))²

where the parameter λ_coord weights the coordinate loss and strengthens the importance of the bounding box in the loss calculation; λ_noobj weights the confidence loss for boxes containing no object, weakening the influence of non-target regions on the confidence of target regions; 1_{ij}^{obj} indicates whether the target object falls into the j-th bounding box of the i-th cell, taking 1 if it does and 0 otherwise; and Ĉ_i is the corresponding predicted confidence value.
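The following numpy sketch illustrates the structure of this sum-squared-error loss; the (S, S, B, 5 + C) tensor layout, the per-box class term and the "responsible box" assignment are simplifying assumptions for illustration, not the exact training implementation.

```python
import numpy as np

def yolo_like_loss(pred, truth, lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified sum-squared-error loss over tensors of shape (S, S, B, 5 + C)
    laid out as (x, y, w, h, confidence, class probabilities...)."""
    obj = truth[..., 4:5]        # 1 where a target falls into this box, else 0
    noobj = 1.0 - obj

    coord_err = lambda_coord * np.sum(
        obj * ((pred[..., 0:2] - truth[..., 0:2]) ** 2
               + (np.sqrt(np.maximum(pred[..., 2:4], 0.0))
                  - np.sqrt(truth[..., 2:4])) ** 2))
    conf_err = (np.sum(obj * (pred[..., 4:5] - truth[..., 4:5]) ** 2)
                + lambda_noobj * np.sum(noobj * (pred[..., 4:5] - truth[..., 4:5]) ** 2))
    class_err = np.sum(obj * (pred[..., 5:] - truth[..., 5:]) ** 2)
    return coord_err + conf_err + class_err
```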
The MobileNet network is a lightweight and efficient convolutional neural network proposed by Google. It replaces standard convolution with depthwise separable convolution, which can be decomposed into a depthwise convolution (Depthwise Convolution) and a 1×1 pointwise convolution (Pointwise Convolution), as shown in FIG. 1. This is not only more efficient in theory; the large number of 1×1 convolutions can also be carried out directly with highly optimized matrix multiplication, and roughly 95% of the multiply-add operations in MobileNet come from 1×1 convolutions, so the operation efficiency can be greatly improved.
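A minimal Keras sketch of the depthwise separable unit of FIG. 1 follows, assuming TensorFlow 2.x; the batch-normalization-plus-ReLU6 ordering after each convolution follows the usual MobileNet recipe and is an assumption here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable_block(x, pointwise_filters, strides=1):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution,
    each followed by batch normalization and ReLU6."""
    x = layers.DepthwiseConv2D(3, strides=strides, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)
    x = layers.Conv2D(pointwise_filters, 1, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)
    return x
```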
The improved YOLO network architecture is shown in FIG. 2. A 416×416 picture is used as input; a 3×3 standard convolution is followed by depthwise separable convolutions. X1 is output after the 5th depthwise separable convolution; the depthwise separable convolutions then continue, with X2 and X3 output at the 11th and 13th depthwise separable convolutions respectively, and X1, X2, X3 are used as inputs to the YOLO v3 network.
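Building on the sketch above, the backbone could tap the three feature maps as in the following sketch; only the tap points (after the 5th, 11th and 13th depthwise separable blocks) follow the description, while the filter counts and stride placement are assumptions taken from the standard MobileNet v1 configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def mobilenet_yolo_backbone(input_shape=(416, 416, 3)):
    """MobileNet-style feature extractor returning X1, X2, X3 for the YOLO v3 head."""
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, strides=2, padding='same', use_bias=False)(inputs)  # 3x3 standard conv
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)

    filters = [64, 128, 128, 256, 256, 512, 512, 512, 512, 512, 512, 1024, 1024]
    strides = [1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1]
    taps = {}
    for i, (f, s) in enumerate(zip(filters, strides), start=1):
        x = depthwise_separable_block(x, f, strides=s)
        if i in (5, 11, 13):
            taps[i] = x      # tap points X1, X2, X3 described above
    return tf.keras.Model(inputs, [taps[5], taps[11], taps[13]])
```

With a 416×416 input and these assumed strides, the three outputs have spatial sizes 52×52, 26×26 and 13×13, matching the three detection scales of YOLO v3.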
The depthwise convolution of MobileNet greatly reduces the scale of the network parameters and improves the computation speed of the network, but it has an obvious drawback: the depthwise convolution emphasizes the feature parameters of local receptive fields and improves the expressive power of the network to a certain extent, yet it ignores the correlation between the feature information of the individual channels; the subsequent 1×1 convolution reconnects the features across channels but cannot fully compensate for the loss of accuracy. Considering that the features describing an object are correlated and can be divided into main and non-main features, this embodiment adds a Squeeze-and-Excitation module to the MobileNet network.
The Squeeze-and-Excitation Block (SE-Block) is a network module, proposed by the team of Hu Jie, that models the importance of object features. Its core idea is to learn feature weights, increasing the weights of effective features and reducing the weights of ineffective or weakly effective ones, thereby strengthening the feature extraction capability of the network and bringing a significant performance improvement to the neural network structure at little additional computational cost. The basic structure of the module is shown in FIG. 3.
The SE-Block first converts an arbitrary input X ∈ R^{H'×W'×C'} into a feature map U ∈ R^{H×W×C} through a standard convolution operator. The subsequent operations are completed in three steps: compression (Squeeze), excitation (Excitation) and weight assignment (Scale). The compression operation compresses the global spatial information to obtain a single-channel descriptor; the statistics of each channel are generated by global average pooling (Global Average Pooling), and the statistic z_c is obtained by compressing the feature map (Feature map) u_c over its spatial dimensions H × W, using the following formula:

z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)

The excitation operation is completed mainly by two fully connected layers. The first fully connected layer compresses the feature mapping and reduces the amount of computation, and a ReLU layer between the two fully connected layers allows nonlinear relationships between channels to be learned. The second fully connected layer restores the feature mapping to the original number of channels, and the final feature-importance descriptor s is obtained through Sigmoid normalization, with the following formula:

s = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z))

where δ denotes the ReLU activation function. The weight assignment multiplies the feature-importance descriptor s obtained by the excitation operation into the features channel by channel, completing the reweighting of the original features, with the following formula:

x̃_c = F_scale(u_c, s_c) = s_c · u_c
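A minimal Keras sketch of the three SE-Block steps above follows; the reduction ratio of the first fully connected layer is an assumption.

```python
from tensorflow.keras import layers

def se_block(x, ratio=16):
    """Squeeze (global average pooling), excitation (two fully connected layers
    with ReLU then Sigmoid), and scale (channel-wise reweighting of the input)."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                     # squeeze: z_c
    s = layers.Dense(channels // ratio, activation='relu')(s)  # excitation: delta(W1 z)
    s = layers.Dense(channels, activation='sigmoid')(s)        # excitation: sigma(W2 ...)
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                           # scale: s_c * u_c
```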
in the embodiment, the SE-Block module is embedded into the MobileNet feature extraction network, and the SEMobileNet-YOLO detection network is provided in combination with the YOLO detection network. The SEMobileNet-YOLO network structure is similar to the MobileNet-YOLO network structure, and an SE-Block module is added after each pair of deep convolution sum and 1×1 convolution, importance weight distribution is carried out on the input feature map, then the next layer is input, and the SEMobileNet basic structure is shown in FIG. 4.
This embodiment was implemented under the Ubuntu 16.04 LTS system with an i9-9900K CPU, 16 GB of memory, an Nvidia Titan XP graphics card, and Keras under TensorFlow as the development framework. The training parameters were set as follows: the initial weight is set to 0.001; the weight decay factor is 0.0005; a momentum gradient descent algorithm with momentum 0.9 is used; the batch size (Batch Size) is set to 16. With an initial learning rate of 10^-3, the network is trained for 300 passes over the full data set; the network is then initialized with the trained parameters, the learning rate is adjusted to 10^-4, and another 300 full passes are trained. The learning rate adjustment strategy reduces the learning rate to 10% of its value whenever the test set loss has not decreased for 3 consecutive full passes (Epochs); when the test set loss has not decreased for 10 consecutive full passes, training is stopped early.
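A sketch of this training schedule with Keras callbacks follows; model, yolo_loss (a TensorFlow implementation of the loss above) and the training/validation arrays are placeholders, and mapping the 0.0005 weight decay to an L2 kernel regularizer on the convolution layers is an assumption.

```python
import tensorflow as tf

callbacks = [
    # reduce the learning rate to 10% of its value when the test-set loss
    # has not decreased for 3 consecutive epochs
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3),
    # stop training early when it has not decreased for 10 consecutive epochs
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10),
]

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
              loss=yolo_loss)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=16, epochs=300, callbacks=callbacks)

# second stage: keep the trained weights and fine-tune at a learning rate of 1e-4
tf.keras.backend.set_value(model.optimizer.learning_rate, 1e-4)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=16, epochs=300, callbacks=callbacks)
```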
After network training is completed, the pictures of the test set are input into the network for detection. The improved SEMobileNet-YOLO detection method is compared with MobileNet-YOLO, YOLO v3 and tiny-YOLO; tiny-YOLO is a simplified version of YOLO v3 that greatly improves detection speed at the cost of detection accuracy. A detection is considered successful when the intersection over union (IoU) between the target bounding box predicted by the model and the manually annotated bounding box is at least 0.5 and the target class is identified correctly; otherwise the target is considered missed. The final detection results are shown in FIGS. 6 (a) - (d); the gray dashed boxes are the annotation boxes drawn during calibration, and the white solid boxes are the boxes detected by the algorithm.
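The success criterion above can be expressed with the hypothetical iou helper sketched earlier; the box and class representations are assumptions.

```python
def is_successful_detection(pred_box, pred_cls, gt_box, gt_cls):
    """A prediction counts as successful when its class matches the annotation
    and its IoU with the annotated box is at least 0.5; otherwise the target
    is counted as missed."""
    return pred_cls == gt_cls and iou(pred_box, gt_box) >= 0.5
```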
It will be apparent to those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description of the preferred embodiments of the present disclosure is provided for illustration only and is not intended to limit the disclosure; various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure shall be included in the protection scope of the present disclosure.
While the specific embodiments of the present disclosure have been described above with reference to the drawings, it should be understood that the present disclosure is not limited to the embodiments, and that various modifications and changes can be made by one skilled in the art without inventive effort on the basis of the technical solutions of the present disclosure while remaining within the scope of the present disclosure.

Claims (9)

1. A building embedded part detection method based on improved YOLO, characterized by comprising the following steps:
acquiring a picture of the building embedded part, calibrating the building embedded part in the picture to form a data set, and dividing the data set into a training set and a testing set;
utilizing a MobileNet network to replace a Darknet53 network in a YOLO detection algorithm as a feature extraction network, constructing an improved YOLO detection model, and training the improved YOLO detection model by utilizing a training set until the testing requirement of a testing set is met, so as to obtain a final detection model;
and acquiring an aerial picture of a building site under construction, flipping the picture, applying affine transformations of different scales and Gaussian blur, and identifying the input picture with the final detection model to obtain the embedded part detection result.
2. The method for detecting a building embedded part based on improved YOLO as claimed in claim 1, characterized in that: during detection, target detection is treated as a regression problem, the input picture is divided into an S × S grid, and if the center of a detected target falls within a certain cell, that cell is responsible for predicting the target; each cell produces B bounding boxes, each containing the offset of the target's center position relative to the cell position, the width and height of the bounding box, and the confidence of the target.
3. The method for detecting a building embedded part based on improved YOLO as claimed in claim 1, characterized in that: the MobileNet network is used to extract features of the target object and make predictions; each cell gives C class probability values, representing the probability that the target in the bounding box the cell is responsible for predicting belongs to each class; the conditional probability that an object in a cell belongs to class i is Pr(class_i | object), Pr(object) indicates whether the bounding box contains the target, and the probability that the identified object belongs to class i is Pr(class_i):

Pr(class_i | object) × Pr(object) × IOU_pred^truth = Pr(class_i) × IOU_pred^truth

The intersection over union of the predicted bounding box and the real area of the object is expressed as:

IOU_pred^truth = area(box_pred ∩ box_truth) / area(box_pred ∪ box_truth)
4. The method for detecting a building embedded part based on improved YOLO as claimed in claim 1, characterized in that: the mean square error is used as the loss function to optimize the model parameters, i.e. the mean square error between the output vector of the network structure and the corresponding vector of the real image.
5. The method for detecting a building embedded part based on improved YOLO as claimed in claim 1, characterized in that: the MobileNet network uses depthwise separable convolutions instead of standard convolutions, each decomposed into a depthwise convolution and a 1×1 pointwise convolution.
6. The method for detecting a building embedded part based on improved YOLO as claimed in claim 1, characterized in that: the aerial picture of the building site under construction is acquired by a visible light camera carried by an unmanned aerial vehicle.
7. A building embedded part detection system based on improved YOLO, characterized by comprising:
the data acquisition module is configured to acquire pictures of the building embedded parts, calibrate the building embedded parts in the pictures to form a data set, and divide the data set into a training set and a testing set;
the model construction and training module is configured to utilize a mobile Net network to replace a Darknet53 network in a YOLO detection algorithm as a feature extraction network, construct an improved YOLO detection model, and train the improved YOLO detection model by utilizing a training set until the testing requirements of a testing set are met, so as to obtain a final detection model;
the detection and identification module is configured to acquire an aerial picture of a building site under construction, flip the picture, apply affine transformations of different scales and Gaussian blur, take the result as the input picture, and identify the input picture with the final detection model to obtain the embedded part detection result.
8. A computer-readable storage medium, characterized by: in which instructions are stored which are adapted to be loaded by a processor of a terminal device and to carry out the steps of a building embedment detection method based on improved YOLO according to any one of claims 1-6.
9. A terminal device, characterized by: comprising a processor and a computer-readable storage medium, the processor configured to implement instructions; a computer readable storage medium for storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of a method of improved YOLO-based building embedment detection of any one of claims 1-6.
CN201911328091.0A 2019-12-20 2019-12-20 Building embedded part detection method and system based on improved YOLO Active CN111178206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911328091.0A CN111178206B (en) 2019-12-20 2019-12-20 Building embedded part detection method and system based on improved YOLO

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911328091.0A CN111178206B (en) 2019-12-20 2019-12-20 Building embedded part detection method and system based on improved YOLO

Publications (2)

Publication Number Publication Date
CN111178206A CN111178206A (en) 2020-05-19
CN111178206B true CN111178206B (en) 2023-05-16

Family

ID=70647376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911328091.0A Active CN111178206B (en) 2019-12-20 2019-12-20 Building embedded part detection method and system based on improved YOLO

Country Status (1)

Country Link
CN (1) CN111178206B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191696B (en) * 2019-12-20 2023-04-07 山东大学 Deep learning algorithm-based steel bar layering method and system
CN111709346A (en) * 2020-06-10 2020-09-25 嘉应学院 Historical building identification and detection method based on deep learning and high-resolution images
CN111986187A (en) * 2020-08-26 2020-11-24 华中科技大学 Aerospace electronic welding spot defect detection method based on improved Tiny-YOLOv3 network
CN112149761B (en) * 2020-11-24 2021-06-22 江苏电力信息技术有限公司 Electric power intelligent construction site violation detection method based on YOLOv4 improved algorithm
CN112487915B (en) * 2020-11-25 2024-04-23 江苏科技大学 Pedestrian detection method based on Embedded YOLO algorithm
CN112699762A (en) * 2020-12-24 2021-04-23 广东工业大学 Food material identification method suitable for embedded equipment
CN113327240A (en) * 2021-06-11 2021-08-31 国网上海市电力公司 Visual guidance-based wire lapping method and system and storage medium
CN113408394A (en) * 2021-06-11 2021-09-17 通号智慧城市研究设计院有限公司 Safety helmet wearing detection method and system based on deep learning model
CN113468992B (en) * 2021-06-21 2022-11-04 四川轻化工大学 Construction site safety helmet wearing detection method based on lightweight convolutional neural network
CN114299366A (en) * 2022-03-10 2022-04-08 青岛海尔工业智能研究院有限公司 Image detection method and device, electronic equipment and storage medium
CN115439436B (en) * 2022-08-31 2023-07-28 成都建工第七建筑工程有限公司 Multi-type quality defect mobile sensing system for building structure

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493609A (en) * 2018-12-11 2019-03-19 杭州炬视科技有限公司 A kind of portable device and method for not giving precedence to the candid photograph of pedestrian's automatic identification
CN110084817A (en) * 2019-03-21 2019-08-02 西安电子科技大学 Digital elevation model production method based on deep learning
CN110443208A (en) * 2019-08-08 2019-11-12 南京工业大学 A kind of vehicle target detection method, system and equipment based on YOLOv2

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190378013A1 (en) * 2018-06-06 2019-12-12 Kneron Inc. Self-tuning model compression methodology for reconfiguring deep neural network and electronic device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493609A (en) * 2018-12-11 2019-03-19 杭州炬视科技有限公司 A kind of portable device and method for not giving precedence to the candid photograph of pedestrian's automatic identification
CN110084817A (en) * 2019-03-21 2019-08-02 西安电子科技大学 Digital elevation model production method based on deep learning
CN110443208A (en) * 2019-08-08 2019-11-12 南京工业大学 A kind of vehicle target detection method, system and equipment based on YOLOv2

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ernin Niswatul Ukhwah et al., "Asphalt Pavement Pothole Detection using Deep Learning Method Based on YOLO Neural Network", 2019 International Seminar on Intelligent Technology and Its Applications (ISITIA), 2019, entire document. *

Also Published As

Publication number Publication date
CN111178206A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN111178206B (en) Building embedded part detection method and system based on improved YOLO
CN113705478B (en) Mangrove single wood target detection method based on improved YOLOv5
CN110378222B (en) Method and device for detecting vibration damper target and identifying defect of power transmission line
CN111402227B (en) Bridge crack detection method
CN112183414A (en) Weak supervision remote sensing target detection method based on mixed hole convolution
Li et al. Automatic bridge crack identification from concrete surface using ResNeXt with postprocessing
CN112581443A (en) Light-weight identification method for surface damage of wind driven generator blade
CN111079604A (en) Method for quickly detecting tiny target facing large-scale remote sensing image
CN113408423A (en) Aquatic product target real-time detection method suitable for TX2 embedded platform
CN114596266B (en) Concrete crack detection method based on ConcreteCrackSegNet model
CN110992307A (en) Insulator positioning and identifying method and device based on YOLO
CN115564950A (en) Small sample rocket projectile bonding defect detection method based on deep learning
CN116206185A (en) Lightweight small target detection method based on improved YOLOv7
CN111985325A (en) Aerial small target rapid identification method in extra-high voltage environment evaluation
CN114565842A (en) Unmanned aerial vehicle real-time target detection method and system based on Nvidia Jetson embedded hardware
CN114078209A (en) Lightweight target detection method for improving small target detection precision
CN110334775B (en) Unmanned aerial vehicle line fault identification method and device based on width learning
CN115115924A (en) Concrete image crack type rapid intelligent identification method based on IR7-EC network
CN116152254A (en) Industrial leakage target gas detection model training method, detection method and electronic equipment
CN116385911A (en) Lightweight target detection method for unmanned aerial vehicle inspection insulator
CN109934151B (en) Face detection method based on movidius computing chip and Yolo face
CN113269717A (en) Building detection method and device based on remote sensing image
CN117541534A (en) Power transmission line inspection method based on unmanned plane and CNN-BiLSTM model
CN116363532A (en) Unmanned aerial vehicle image traffic target detection method based on attention mechanism and re-parameterization
CN116206169A (en) Intelligent gangue target detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant