CN111178206A - Building embedded part detection method and system based on improved YOLO - Google Patents

Building embedded part detection method and system based on improved YOLO

Info

Publication number
CN111178206A
CN111178206A (application CN201911328091.0A)
Authority
CN
China
Prior art keywords: building, detection, yolo, picture, network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911328091.0A
Other languages
Chinese (zh)
Other versions
CN111178206B (en)
Inventor
姜向远
邢金昊
于敦政
陈菲雨
贾磊
马思乐
陈纪旸
栾义忠
杜延丽
岳文斌
马晓静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN201911328091.0A priority Critical patent/CN111178206B/en
Publication of CN111178206A publication Critical patent/CN111178206A/en
Application granted granted Critical
Publication of CN111178206B publication Critical patent/CN111178206B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/02Affine transformations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/60Rotation of whole images or parts thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract

The invention provides a building embedded part detection method and system based on improved YOLO. Pictures of building embedded parts are acquired, the embedded parts in the pictures are calibrated to form a data set, and the data set is divided into a training set and a test set. A MobileNet network replaces the Darknet53 network in the YOLO detection algorithm as the feature extraction network to construct an improved YOLO detection model, which is trained with the training set until the test requirements of the test set are met, yielding the final detection model. An aerial picture of the building site to be constructed is then acquired; the picture is flipped and subjected to affine transformations of different scales and Gaussian blur processing to serve as input pictures, and the input pictures are identified with the final detection model to obtain the embedded part detection result.

Description

Building embedded part detection method and system based on improved YOLO
Technical Field
The disclosure belongs to the technical field of embedded part detection in construction engineering, and relates to a method and system for detecting building embedded parts based on improved YOLO.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the rapid development of the economy, the construction industry, as a basic industry, has grown quickly; the prosperity of the construction market brings both opportunities and challenges to construction enterprises and places higher requirements on construction quality and efficiency. Embedded parts are a very widely applied technology in modern construction engineering, and include structural parts such as steel plates, bolts and junction boxes, as well as embedded pipes such as wiring pipes and drain pipes. The construction quality of embedded parts directly influences the construction progress and structural safety of a building project, and must therefore be strictly controlled. In the prior art, workers perform site inspection before cement is poured. However, for projects with large construction areas, embedded parts are numerous and widely distributed, and parts such as junction boxes and pipes have complex wiring that is difficult to inspect; manual inspection is therefore time-consuming, labor-intensive and inefficient, and in high-rise construction projects it is also dangerous for workers to climb up for inspection. Using unmanned aerial vehicles instead of manual labor to detect embedded parts in high-rise and large-area construction projects is a new approach.
In recent years, unmanned aerial vehicles have been rapidly popularized in the civil and commercial fields thanks to their small size, light weight and ability to carry various task loads, and have also entered the building construction field, for example for basic construction surveying and construction site management. By transmitting the images captured by the UAV gimbal camera above the engineering site back to a computer in real time through an image transmission module for processing, embedded part positions can be quickly located from a high vantage point and conveniently compared with the architectural design drawings, saving labor, speeding up embedded part inspection, and improving engineering efficiency and quality. Rapid detection of embedded parts from the image information returned by the UAV requires an efficient and accurate target detection algorithm. Traditional target detection algorithms, such as those based on Histogram of Oriented Gradients (HOG) feature extraction or on support vector machines, do not select a suitable sliding window for the target, so their computational time complexity is high, window redundancy is large, and detection efficiency is low.
To the inventors' knowledge, target detection algorithms based on deep learning are receiving increasing attention. Such algorithms mostly use the data and labels in data sets to train Convolutional Neural Networks (CNNs), and can be divided into two types. The first type comprises region-based target detection algorithms such as R-CNN (Regions with CNN features) and Faster R-CNN, which extract candidate regions for the position of the target object in advance; they achieve high detection precision but low detection speed. The second type comprises end-to-end target detection algorithms such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector), which omit the candidate-region generation step and carry out feature extraction, target classification and bounding-box regression in the same convolutional neural network, greatly increasing detection speed. However, the neural network structures of deep-learning-based target detection algorithms are large and complex, which still limits detection speed.
Disclosure of Invention
To solve the above problems, the invention provides a building embedded part detection method and system based on improved YOLO.
According to some embodiments, the following technical scheme is adopted in the disclosure:
a building embedded part detection method based on improved YOLO comprises the following steps:
acquiring pictures of the building embedded parts, calibrating the building embedded parts in the pictures to form a data set, and dividing the data set into a training set and a testing set;
replacing a Darknet53 network in a YOLO detection algorithm with a MobileNet network as a feature extraction network, constructing an improved YOLO detection model, and training the improved YOLO detection model by using a training set until the test requirements of the test set are met to obtain a final detection model;
obtaining an aerial picture of the building site to be constructed, flipping the picture, and applying affine transformations of different scales and Gaussian blur processing to obtain input pictures; and identifying the input pictures with the final detection model to obtain the embedded part detection result.
As a further limitation, when performing detection, target detection is regarded as a regression problem: the input picture is divided into an S × S grid, and if the center of a detected target falls within a certain cell, that cell is responsible for predicting the target; each cell generates B bounding boxes, each containing the offset of the object's center position from the cell position, the width and height of the bounding box, and the confidence of the target.
By way of further limitation, a convolutional neural network is used to extract the features of the target object and make predictions, and each cell is given C class probability values representing the probability that the target in the bounding box the cell is responsible for predicting belongs to each class. With the conditional probability of an object being in the cell denoted Pr(Class_i|Object), the probability that the recognized object belongs to a certain class is

Pr(Class_i|Object) × Pr(Object) × IOU_pred^truth = Pr(Class_i) × IOU_pred^truth

where the intersection ratio of the predicted frame and the real area of the object is

IOU_pred^truth = area(box_pred ∩ box_truth) / area(box_pred ∪ box_truth)
As a further limitation, the sum-squared error between the output vector of the network structure and the corresponding vector of the real image is used as the loss function to optimize the model parameters.
As a further limitation, the MobileNet network uses depthwise separable convolutions instead of standard convolutions, each decomposed into a depthwise convolution and a 1 × 1 pointwise convolution.
As a further limitation, an unmanned aerial vehicle carrying a visible light camera acquires the aerial pictures of the building site to be constructed.
A building embedded part detection system based on improved YOLO, comprising:
the data acquisition module is configured to acquire pictures of the building embedded parts, calibrate the building embedded parts in the pictures to form a data set, and divide the data set into a training set and a test set;
the model building and training module is configured to utilize a MobileNet network to replace a Darknet53 network in a YOLO detection algorithm as a feature extraction network, build an improved YOLO detection model, and train the improved YOLO detection model by utilizing a training set until the test requirements of the test set are met to obtain a final detection model;
and the detection and identification module is configured to acquire an aerial picture of the building site to be constructed, flip the picture, apply affine transformations of different scales and Gaussian blur processing to obtain input pictures, and identify the input pictures with the final detection model to obtain the embedded part detection result.
A computer readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to perform the steps of the improved-YOLO-based building embedded part detection method.
A terminal device comprising a processor and a computer readable storage medium, the processor being configured to implement instructions; the computer readable storage medium is configured to store instructions adapted to be loaded by the processor and to perform the steps of the improved-YOLO-based building embedded part detection method.
Compared with the prior art, the beneficial effects of the present disclosure are as follows:
the size of the network model is effectively reduced and the number of network parameters decreased while detection performance is improved; objects such as embedded junction boxes and embedded pipes are effectively detected and identified with high detection precision.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a schematic diagram of a depth separable convolution structure;
FIG. 2 is a schematic diagram of an improved YOLO neural network structure;
FIG. 3 is a schematic diagram of the basic structure of SE-Block;
FIG. 4 is a schematic diagram of the basic SEMobileNet structure;
FIGS. 5(a) - (b) are sample graphs of datasets;
FIGS. 6(a) - (d) are schematic diagrams comparing the detection effects of the four methods.
Detailed Description
the present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
An unmanned aerial vehicle (such as a DJI Matrice 100) carrying a visible light camera was used to photograph a construction site under construction; data set sample pictures are shown in fig. 5(a) - (b). To guarantee the diversity of the data, different flight attitudes of the UAV, such as hovering, ascending/descending and level flight, were considered during shooting. A manual calibration method was adopted, selecting the relatively representative junction boxes and drain pipes among the embedded parts as detection targets; the calibrated pictures use the PASCAL VOC format. Because the construction site background is complex, various rebars, building elements and the like cover the embedded parts in the photographs; to guarantee the validity of the data, samples in which more than 50% of the area is occluded were therefore not labeled. In addition, considering that the UAV may assume attitudes such as inclined flight, and to make the training data more effective and improve the network's generalization ability, the data set pictures were expanded with the following operations: flipping each picture left-right and up-down with a flip matrix; affine transformations of different scales; and Gaussian blur processing. The expanded data set consists of 3590 pictures, divided into training and test sets at a 4:1 ratio.
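To make the expansion concrete, a minimal sketch of these three operations with OpenCV follows; the function name, the chosen scales and the Gaussian kernel size are illustrative assumptions, not values from the patent.

```python
import cv2

def expand_picture(img):
    """Sketch of the data set expansion: flips, scaled affine transforms, Gaussian blur."""
    out = []
    out.append(cv2.flip(img, 1))                     # left-right flip
    out.append(cv2.flip(img, 0))                     # up-down flip
    h, w = img.shape[:2]
    for scale in (0.8, 1.2):                         # affine transforms of different scales
        M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), 0, scale)
        out.append(cv2.warpAffine(img, M, (w, h)))
    out.append(cv2.GaussianBlur(img, (5, 5), 1.5))   # Gaussian blur processing
    return out
```

In practice the calibrated bounding boxes would be flipped and affine-transformed together with each picture so that the PASCAL VOC annotations remain valid.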
YOLO v3 adopts the new Darknet-53 feature extraction network. Darknet-53 draws on the idea of the ResNet structure and has 53 convolutional layers; the increased network depth gives YOLO v3 a stronger feature extraction capability. Darknet-53 also removes the pooling and fully-connected layers; a BN (Batch Normalization) layer and a Leaky ReLU layer are added to each basic layer, and residual modules are added to the network, which alleviates the gradient vanishing or explosion problems that appear in deep networks. However, because of the complexity of the Darknet-53 structure, the network has a large number of weight parameters, so the algorithm places higher demands on image processing equipment and picture detection speed suffers. Tiny-YOLO is a simplified version of YOLO v3: its feature extraction network is simple, the residual network is removed, and there are only 7 convolutional layers and 6 pooling layers, which reduces the number of parameters and the demands on equipment and effectively improves detection speed, at the cost of detection precision. In this disclosure, the feature extraction network is replaced with the lighter neural network MobileNet, which uses a more efficient convolution computation scheme to reduce network complexity and increase image detection speed.
When performing target detection, YOLO regards target detection as a regression problem: the input picture is divided into an S × S grid, and if the center of a detected target falls within a certain cell, that cell is responsible for predicting the target. Each cell generates B bounding boxes, each containing 5 parameters (x, y, w, h, Confidence), where (x, y) is the offset of the object's center position from the cell position, (w, h) are the width and height of the bounding box, and Confidence is the confidence of the target, reflecting whether a target object is present in the bounding box and how accurate the predicted position is. It is calculated as

Confidence = Pr(Object) × IOU_pred^truth

where Pr(Object) indicates whether the bounding box contains an object (1 if it does, 0 if not), and IOU_pred^truth represents the intersection ratio of the predicted frame and the real area of the object.
A convolutional neural network is used to extract the features of the target object and make predictions, and each cell is given C class probability values representing the probability that the target in the bounding box the cell is responsible for predicting belongs to each class. With the conditional probability of an object being in the cell denoted Pr(Class_i|Object), the probability that the identified object belongs to a certain class is

Pr(Class_i|Object) × Pr(Object) × IOU_pred^truth = Pr(Class_i) × IOU_pred^truth
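As a small numeric sketch of these quantities (the corner-format box layout and the function names are illustrative assumptions):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def class_confidence(class_prob, objectness, iou_val):
    """Pr(Class_i|Object) * Pr(Object) * IOU = Pr(Class_i) * IOU."""
    return class_prob * objectness * iou_val
```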
YOLO uses the sum-squared error between the output vector of the network structure and the corresponding vector of the real image as the loss function to optimize the model parameters. The formula is as follows:

loss = coordError + iouError + classError

where coordError represents the error between the predicted data and the calibration data, iouError the intersection-ratio error, and classError the classification error. The specific formula is:

loss = λ_coord · Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]
     + λ_coord · Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
     + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²
     + λ_noobj · Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
     + Σ_{i=0}^{S²} 1_{i}^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²

where the parameter λ_coord is the weight of the coordinate loss, enhancing the importance of the bounding box in the loss calculation; λ_noobj is the weight of the confidence loss for non-target regions, weakening their influence on the confidence calculation of target regions; 1_{ij}^{obj} indicates whether the target object falls into the j-th bounding box of the i-th cell (1 if it does, 0 otherwise); and the hatted symbols (x̂_i, ŷ_i, ŵ_i, ĥ_i, Ĉ_i, p̂_i) denote the corresponding predicted values.
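A simplified, illustrative NumPy rendering of these loss terms is sketched below; the (S, S, B, 5 + C) tensor layout, the weights λ_coord = 5 and λ_noobj = 0.5, and the per-box class term are assumptions for illustration rather than the patent's exact implementation.

```python
import numpy as np

def yolo_loss(pred, truth, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """pred, truth: (S, S, B, 5 + C) arrays laid out as (x, y, w, h, conf, classes);
    obj_mask: (S, S, B) indicator that box j of cell i is responsible for a target."""
    noobj_mask = 1.0 - obj_mask
    m = obj_mask[..., None]
    # coordError: only responsible boxes contribute; sqrt(w), sqrt(h) soften box-size effects.
    xy_err = np.sum(m * (pred[..., 0:2] - truth[..., 0:2]) ** 2)
    wh_err = np.sum(m * (np.sqrt(pred[..., 2:4]) - np.sqrt(truth[..., 2:4])) ** 2)
    # iouError: confidence error, with non-object boxes down-weighted by lambda_noobj.
    conf_sq = (pred[..., 4] - truth[..., 4]) ** 2
    iou_err = np.sum(obj_mask * conf_sq) + lambda_noobj * np.sum(noobj_mask * conf_sq)
    # classError: squared error over the C class probabilities.
    class_err = np.sum(m * (pred[..., 5:] - truth[..., 5:]) ** 2)
    return lambda_coord * (xy_err + wh_err) + iou_err + class_err
```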
The MobileNet network is a lightweight, efficient convolutional neural network proposed by Google. It uses depthwise separable convolutions instead of standard convolutions; as shown in fig. 1, each is decomposed into two operations, a depthwise convolution and a 1 × 1 pointwise convolution. This is not only more efficient in theory, it can also be implemented directly with highly optimized matrix multiplications. About 95% of the multiply-add operations in MobileNet come from 1 × 1 convolutions, so computational efficiency is greatly improved.
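A minimal Keras sketch of the unit in fig. 1, under the assumption of a standard BN + ReLU arrangement around each convolution:

```python
from tensorflow.keras import layers

def depthwise_separable_block(x, pointwise_filters, stride=1):
    """Depthwise 3x3 convolution followed by a 1x1 pointwise convolution."""
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(pointwise_filters, 1, padding="same", use_bias=False)(x)  # pointwise
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x
```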
The improved YOLO network architecture is shown in fig. 2. A picture of size 416 × 416 is used as input; after a 3 × 3 standard convolution, a stack of depthwise separable convolutions follows. X1 is output after the 5th depthwise separable convolution; the stack then continues, outputting X2 and X3 at the 11th and 13th depthwise separable convolutions respectively, and X1, X2 and X3 are connected as inputs into the YOLO v3 network.
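Under assumptions about the per-block filter counts and strides (the description only fixes the input size and the tap points), the backbone of fig. 2 could be wired up as follows, reusing the `depthwise_separable_block` helper sketched above:

```python
from tensorflow.keras import Input, Model, layers

def build_backbone():
    inp = Input(shape=(416, 416, 3))                 # 416 x 416 picture input
    x = layers.Conv2D(32, 3, strides=2, padding="same", use_bias=False)(inp)  # 3x3 standard conv
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    taps = []
    filters = [64, 128, 128, 256, 256, 512, 512, 512, 512, 512, 512, 1024, 1024]
    strides = [1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1]  # MobileNet-v1-style, assumed
    for i, (f, s) in enumerate(zip(filters, strides), start=1):
        x = depthwise_separable_block(x, f, stride=s)
        if i in (5, 11, 13):                          # X1, X2, X3 tap points
            taps.append(x)
    return Model(inp, taps)                           # outputs feed the YOLO v3 heads
```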
Although MobileNet's depthwise convolution operation greatly reduces the scale of the network parameters and increases the network's computation speed, its drawback is also obvious: the depthwise convolution emphasizes the feature parameters of the local receptive field and improves the expressive power of the network to a certain extent, but it ignores the correlation of the feature information across channels; the subsequent 1 × 1 convolution connects the features on each channel but cannot fully compensate for the loss of precision. Considering the correlation among the features describing the object and the distinction between primary and secondary features, this embodiment adds a Squeeze-and-Excitation module to the MobileNet network.
The core idea of the Squeeze-and-Excitation block (SE-Block) is the learning of feature weights: it increases effective feature weights and reduces ineffective or weakly effective ones, thereby enhancing the feature extraction capability of the network, and it brings a significant performance improvement to the neural network structure at little extra computational cost. The basic structure of the module is shown in fig. 3.
SE-Block first converts any input X ∈ ℝ^(H′ × W′ × C′) into a feature map U ∈ ℝ^(H × W × C) through a standard convolution operator. The subsequent operation is completed in three steps: compression (Squeeze), excitation (Excitation) and weight distribution (Scale). The compression operation compresses the global spatial information to obtain a single channel descriptor; statistical information for each channel is generated through Global Average Pooling, the statistic z_c being obtained by compressing the feature map u_c over the spatial dimensions H × W:

z_c = F_squeeze(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)
the excitation operation is mainly completed through two Fully-connected layers (Fully-connected layers), the first Fully-connected layer realizes the compression of feature mapping, the calculated amount is reduced, and a ReLU layer is added into the two Fully-connected layers, so that the nonlinear relation between channels can be learned. The second full-connection layer restores the feature mapping to the original channel number, and the final feature mapping importance description factor s is obtained through the Sigmoid function normalization, wherein the calculation formula is as follows:
s=FExcitation(z,W)+σ(g(z,W))=σ(W2δ(W1z))
where δ denotes the ReLU activation function. Weight distribution weights the feature-importance description factor s obtained by the excitation operation into the features channel by channel through multiplication, completing the weight redistribution of the original features:

x̃_c = F_scale(u_c, s_c) = s_c · u_c
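A compact Keras sketch of SE-Block as described (squeeze by global average pooling, excitation by two fully-connected layers, then channel-wise scaling); the reduction ratio of 16 in the first fully-connected layer is an assumption:

```python
from tensorflow.keras import layers

def se_block(u, reduction=16):
    """Squeeze-and-Excitation: reweight the channels of feature map u."""
    c = u.shape[-1]
    z = layers.GlobalAveragePooling2D()(u)                  # squeeze: (H, W, C) -> (C,)
    s = layers.Dense(c // reduction, activation="relu")(z)  # first FC + ReLU (delta)
    s = layers.Dense(c, activation="sigmoid")(s)            # second FC + Sigmoid (sigma)
    s = layers.Reshape((1, 1, c))(s)
    return layers.Multiply()([u, s])                        # scale: s_c * u_c, channel-wise
```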
In this embodiment, the SE-Block module is embedded into the MobileNet feature extraction network and combined with the YOLO detection network, yielding the proposed SEMobileNet-YOLO detection network. The SEMobileNet-YOLO network structure is similar to that of the MobileNet-YOLO network: an SE-Block module is added after each pair of depthwise convolution and 1 × 1 convolution, the input feature mapping is given an importance-based weight distribution, and the result is input into the next layer. The basic SEMobileNet structure is shown in fig. 4.
The specific implementation of this embodiment runs on an Ubuntu 16.04 LTS system with an i9-9900K CPU, 16 GB of memory and an Nvidia Titan XP graphics card; the development framework is Keras under TensorFlow. The training parameters are set as follows: the initial weight is set to 0.001; the weight attenuation coefficient is 0.0005; a momentum gradient descent algorithm with momentum 0.9 is adopted; the batch size is set to 16. With an initial learning rate of 10⁻³, 300 full data sets (epochs) are trained; the trained parameters are then used to initialize the network, the learning rate is adjusted to 10⁻⁴, and another 300 epochs are trained. The learning-rate adjustment strategy is as follows: when the test-set loss has not decreased after 3 consecutive epochs, the learning rate is reduced to 10% of its value; when the test-set loss has not decreased after 10 consecutive epochs, training is stopped early.
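In Keras terms, this schedule maps naturally onto the standard SGD optimizer and the ReduceLROnPlateau / EarlyStopping callbacks; the exact wiring below is an assumption, and the 0.0005 weight attenuation would be applied according to the framework version in use:

```python
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

optimizer = SGD(learning_rate=1e-3, momentum=0.9)  # momentum gradient descent, lr = 10^-3
callbacks = [
    # Reduce the learning rate to 10% when the test-set loss stalls for 3 epochs ...
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=3),
    # ... and stop training early after 10 epochs without improvement.
    EarlyStopping(monitor="val_loss", patience=10),
]
# model.compile(optimizer=optimizer, loss=...)
# model.fit(train_data, batch_size=16, epochs=300, validation_data=test_data,
#           callbacks=callbacks)
```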
After network training is complete, the test-set pictures are input into the network for detection. The improved SEMobileNet-YOLO detection method is compared with MobileNet-YOLO, YOLO v3 and tiny-YOLO; tiny-YOLO is a simplified version of YOLO v3 that greatly improves detection speed at the cost of detection precision. Detection is considered successful when the intersection-over-union (IoU) of the model-predicted target bounding box and the manually labeled bounding box is at least 0.5 and the target is recognized correctly; otherwise the target is considered missed. The resulting detection effects are shown in fig. 6(a) - (d), where the gray dashed frames are the calibration (ground-truth) boxes drawn during labeling and the white solid frames are the bounding boxes detected by the algorithm.
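The success criterion above can be stated as a short matching routine; `iou` is the helper sketched earlier, and the structure of the prediction and ground-truth lists is an assumption for illustration:

```python
def count_detections(predictions, ground_truths, threshold=0.5):
    """predictions / ground_truths: lists of (box, class_id) pairs for one picture."""
    hits = 0
    for gt_box, gt_cls in ground_truths:
        if any(iou(p_box, gt_box) >= threshold and p_cls == gt_cls
               for p_box, p_cls in predictions):
            hits += 1                        # detected: IoU >= 0.5 and correct class
    return hits, len(ground_truths) - hits   # (successes, misses)
```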
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (9)

1. A building embedded part detection method based on improved YOLO, characterized by comprising the following steps:
acquiring pictures of the building embedded parts, calibrating the building embedded parts in the pictures to form a data set, and dividing the data set into a training set and a testing set;
replacing a Darknet53 network in a YOLO detection algorithm with a MobileNet network as a feature extraction network, constructing an improved YOLO detection model, and training the improved YOLO detection model by using a training set until the test requirements of the test set are met to obtain a final detection model;
obtaining an aerial picture of the building site to be constructed, flipping the picture, and applying affine transformations of different scales and Gaussian blur processing to obtain input pictures; and identifying the input pictures with the final detection model to obtain the embedded part detection result.
2. The method for detecting the building embedded part based on the improved YOLO as claimed in claim 1, wherein: when performing detection, target detection is regarded as a regression problem; the input picture is divided into an S × S grid, and if the center of a detected target falls within a certain cell, that cell is responsible for predicting the target; each cell generates B bounding boxes, each containing the offset of the object's center position from the cell position, the width and height of the bounding box, and the confidence of the target.
3. The method for detecting the building embedded part based on the improved YOLO as claimed in claim 1, wherein: a convolutional neural network is used to extract the features of the target object and make predictions, and each cell is given C class probability values representing the probability that the target in the bounding box the cell is responsible for predicting belongs to each class; with the conditional probability of an object being in the cell denoted Pr(Class_i|Object), the probability that the identified object belongs to a certain class is

Pr(Class_i|Object) × Pr(Object) × IOU_pred^truth = Pr(Class_i) × IOU_pred^truth

where the intersection ratio of the predicted frame and the real area of the object is

IOU_pred^truth = area(box_pred ∩ box_truth) / area(box_pred ∪ box_truth)
4. The method for detecting the building embedded part based on the improved YOLO as claimed in claim 1, wherein: the sum-squared error between the output vector of the network structure and the corresponding vector of the real image is used as the loss function to optimize the model parameters.
5. The method for detecting the building embedded part based on the improved YOLO as claimed in claim 1, wherein: the MobileNet network uses depthwise separable convolutions instead of standard convolutions, each decomposed into a depthwise convolution and a 1 × 1 pointwise convolution.
6. The method for detecting the building embedded part based on the improved YOLO as claimed in claim 1, wherein: the aerial picture of the building site to be constructed is acquired by an unmanned aerial vehicle carrying a visible light camera.
7. A building embedded part detection system based on improved YOLO, characterized by comprising:
the data acquisition module is configured to acquire pictures of the building embedded parts, calibrate the building embedded parts in the pictures to form a data set, and divide the data set into a training set and a test set;
the model building and training module is configured to utilize a MobileNet network to replace a Darknet53 network in a YOLO detection algorithm as a feature extraction network, build an improved YOLO detection model, and train the improved YOLO detection model by utilizing a training set until the test requirements of the test set are met to obtain a final detection model;
and the detection and identification module is configured to acquire an aerial picture of the building site to be constructed, flip the picture, apply affine transformations of different scales and Gaussian blur processing to obtain input pictures, and identify the input pictures with the final detection model to obtain the embedded part detection result.
8. A computer-readable storage medium, characterized in that: it stores a plurality of instructions adapted to be loaded by a processor of a terminal device and to perform the steps of the improved-YOLO-based building embedded part detection method as claimed in any one of claims 1 to 6.
9. A terminal device, characterized in that: it comprises a processor and a computer-readable storage medium, the processor being configured to implement instructions, and the computer-readable storage medium storing instructions adapted to be loaded by the processor and to perform the steps of the improved-YOLO-based building embedded part detection method as claimed in any one of claims 1 to 6.
CN201911328091.0A 2019-12-20 2019-12-20 Building embedded part detection method and system based on improved YOLO Active CN111178206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911328091.0A CN111178206B (en) 2019-12-20 2019-12-20 Building embedded part detection method and system based on improved YOLO

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911328091.0A CN111178206B (en) 2019-12-20 2019-12-20 Building embedded part detection method and system based on improved YOLO

Publications (2)

Publication Number Publication Date
CN111178206A true CN111178206A (en) 2020-05-19
CN111178206B CN111178206B (en) 2023-05-16

Family

ID=70647376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911328091.0A Active CN111178206B (en) 2019-12-20 2019-12-20 Building embedded part detection method and system based on improved YOLO

Country Status (1)

Country Link
CN (1) CN111178206B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191696A (en) * 2019-12-20 2020-05-22 山东大学 Deep learning algorithm-based steel bar layering method and system
CN111709346A (en) * 2020-06-10 2020-09-25 嘉应学院 Historical building identification and detection method based on deep learning and high-resolution images
CN111986187A (en) * 2020-08-26 2020-11-24 华中科技大学 Aerospace electronic welding spot defect detection method based on improved Tiny-YOLOv3 network
CN112149761A (en) * 2020-11-24 2020-12-29 江苏电力信息技术有限公司 Electric power intelligent construction site violation detection method based on YOLOv4 improved algorithm
CN112487915A (en) * 2020-11-25 2021-03-12 江苏科技大学 Pedestrian detection method based on Embedded YOLO algorithm
CN112699762A (en) * 2020-12-24 2021-04-23 广东工业大学 Food material identification method suitable for embedded equipment
CN113327240A (en) * 2021-06-11 2021-08-31 国网上海市电力公司 Visual guidance-based wire lapping method and system and storage medium
CN113408394A (en) * 2021-06-11 2021-09-17 通号智慧城市研究设计院有限公司 Safety helmet wearing detection method and system based on deep learning model
CN113468992A (en) * 2021-06-21 2021-10-01 四川轻化工大学 Construction site safety helmet wearing detection method based on lightweight convolutional neural network
CN114299366A (en) * 2022-03-10 2022-04-08 青岛海尔工业智能研究院有限公司 Image detection method and device, electronic equipment and storage medium
CN115439436A (en) * 2022-08-31 2022-12-06 成都建工第七建筑工程有限公司 Mobile sensing system for multiple types of quality defects of building structure

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493609A (en) * 2018-12-11 2019-03-19 杭州炬视科技有限公司 A kind of portable device and method for not giving precedence to the candid photograph of pedestrian's automatic identification
CN110084817A (en) * 2019-03-21 2019-08-02 西安电子科技大学 Digital elevation model production method based on deep learning
CN110443208A (en) * 2019-08-08 2019-11-12 南京工业大学 YOLOv 2-based vehicle target detection method, system and equipment
US20190378013A1 (en) * 2018-06-06 2019-12-12 Kneron Inc. Self-tuning model compression methodology for reconfiguring deep neural network and electronic device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190378013A1 (en) * 2018-06-06 2019-12-12 Kneron Inc. Self-tuning model compression methodology for reconfiguring deep neural network and electronic device
CN109493609A (en) * 2018-12-11 2019-03-19 杭州炬视科技有限公司 A kind of portable device and method for not giving precedence to the candid photograph of pedestrian's automatic identification
CN110084817A (en) * 2019-03-21 2019-08-02 西安电子科技大学 Digital elevation model production method based on deep learning
CN110443208A (en) * 2019-08-08 2019-11-12 南京工业大学 YOLOv 2-based vehicle target detection method, system and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ERNIN NISWATUL UKHWAH et al.: "Asphalt Pavement Pothole Detection using Deep Learning Method based on YOLO Neural Network"

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191696A (en) * 2019-12-20 2020-05-22 山东大学 Deep learning algorithm-based steel bar layering method and system
CN111709346A (en) * 2020-06-10 2020-09-25 嘉应学院 Historical building identification and detection method based on deep learning and high-resolution images
CN111709346B (en) * 2020-06-10 2024-07-12 嘉应学院 Historical building identification and detection method based on deep learning and high-resolution images
CN111986187A (en) * 2020-08-26 2020-11-24 华中科技大学 Aerospace electronic welding spot defect detection method based on improved Tiny-YOLOv3 network
CN112149761A (en) * 2020-11-24 2020-12-29 江苏电力信息技术有限公司 Electric power intelligent construction site violation detection method based on YOLOv4 improved algorithm
CN112149761B (en) * 2020-11-24 2021-06-22 江苏电力信息技术有限公司 Electric power intelligent construction site violation detection method based on YOLOv4 improved algorithm
CN112487915B (en) * 2020-11-25 2024-04-23 江苏科技大学 Pedestrian detection method based on Embedded YOLO algorithm
CN112487915A (en) * 2020-11-25 2021-03-12 江苏科技大学 Pedestrian detection method based on Embedded YOLO algorithm
CN112699762A (en) * 2020-12-24 2021-04-23 广东工业大学 Food material identification method suitable for embedded equipment
CN113408394A (en) * 2021-06-11 2021-09-17 通号智慧城市研究设计院有限公司 Safety helmet wearing detection method and system based on deep learning model
CN113327240A (en) * 2021-06-11 2021-08-31 国网上海市电力公司 Visual guidance-based wire lapping method and system and storage medium
CN113468992B (en) * 2021-06-21 2022-11-04 四川轻化工大学 Construction site safety helmet wearing detection method based on lightweight convolutional neural network
CN113468992A (en) * 2021-06-21 2021-10-01 四川轻化工大学 Construction site safety helmet wearing detection method based on lightweight convolutional neural network
CN114299366A (en) * 2022-03-10 2022-04-08 青岛海尔工业智能研究院有限公司 Image detection method and device, electronic equipment and storage medium
CN115439436A (en) * 2022-08-31 2022-12-06 成都建工第七建筑工程有限公司 Mobile sensing system for multiple types of quality defects of building structure
CN115439436B (en) * 2022-08-31 2023-07-28 成都建工第七建筑工程有限公司 Multi-type quality defect mobile sensing system for building structure

Also Published As

Publication number Publication date
CN111178206B (en) 2023-05-16

Similar Documents

Publication Publication Date Title
CN111178206B (en) Building embedded part detection method and system based on improved YOLO
Ukhwah et al. Asphalt pavement pothole detection using deep learning method based on YOLO neural network
CN113705478B (en) Mangrove single wood target detection method based on improved YOLOv5
CN110569901B (en) Channel selection-based countermeasure elimination weak supervision target detection method
CN110443969B (en) Fire detection method and device, electronic equipment and storage medium
CN110378222B (en) Method and device for detecting vibration damper target and identifying defect of power transmission line
CN112183414A (en) Weak supervision remote sensing target detection method based on mixed hole convolution
CN113408423B (en) Aquatic product target real-time detection method suitable for TX2 embedded platform
Li et al. Automatic bridge crack identification from concrete surface using ResNeXt with postprocessing
CN111461291A Long-distance pipeline inspection method based on YOLOv3 pruning network and deep learning defogging model
CN113033520B (en) Tree nematode disease wood identification method and system based on deep learning
CN115115924A (en) Concrete image crack type rapid intelligent identification method based on IR7-EC network
CN112446870B (en) Pipeline damage detection method, device, equipment and storage medium
CN108764456B (en) Airborne target identification model construction platform, airborne target identification method and equipment
Cepni et al. Vehicle detection using different deep learning algorithms from image sequence
CN110910440B (en) Power transmission line length determination method and system based on power image data
CN114022812B (en) DeepSort water surface floater multi-target tracking method based on lightweight SSD
Xu et al. Vision-based multi-level synthetical evaluation of seismic damage for RC structural components: a multi-task learning approach
CN111985325A (en) Aerial small target rapid identification method in extra-high voltage environment evaluation
CN114973116A (en) Method and system for detecting foreign matters embedded into airport runway at night by self-attention feature
CN110992307A (en) Insulator positioning and identifying method and device based on YOLO
CN114565842A (en) Unmanned aerial vehicle real-time target detection method and system based on Nvidia Jetson embedded hardware
CN115512247A (en) Regional building damage grade assessment method based on image multi-parameter extraction
CN116363532A (en) Unmanned aerial vehicle image traffic target detection method based on attention mechanism and re-parameterization
CN109934151B (en) Face detection method based on movidius computing chip and Yolo face

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant