CN117237625A - Semantic segmentation network method under multi-stage context guidance in construction site scene - Google Patents
- Publication number: CN117237625A
- Application number: CN202310998675.9A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- Y02T10/40 — Engine management systems (Y02T: climate change mitigation technologies related to transportation; Y02T10/00: road transport of goods or passengers; Y02T10/10: internal combustion engine [ICE] based vehicles)
Abstract
The application provides a semantic segmentation network method under multi-stage context guidance in a construction site scene, aiming to solve the problem that most existing construction site semantic segmentation models are not specifically designed for construction site datasets, so that their performance on such datasets is unsatisfactory and still needs improvement. The proposed semantic segmentation network uses ResNet50 as the backbone network; the features of each layer are enhanced by the CWFAM and then input to the next layer to extract more effective information. The high-level semantic information is enhanced by a non-local attention module with coordinate-center pixel aggregation, and low-level features are gradually introduced in the decoding process to recover detailed information and obtain finer decoding results. Meanwhile, when calculating the attention matrix, the CFNLAM focuses only on the relationships between each pixel and the pixels in the same row and column, which greatly reduces the amount of computation and improves the calculation speed.
Description
Technical Field
The application relates to the technical field of building construction, in particular to a semantic segmentation network method under multi-stage context guidance in a building site scene.
Background
Computer vision has been widely used in fields such as monitoring, medical imaging, and autonomous driving. Semantic segmentation has attracted considerable attention among computer vision tasks because it allows the semantic information of objects in images to be accurately understood. In the practical application scenario of construction site monitoring, semantic segmentation technology can remarkably improve the safety and efficiency of construction. Construction sites are dense with personnel and complex in scene, and heavy machinery and unstable structures can cause accidents. Therefore, it is critical that the site be continuously monitored to identify potential risks and take precautions. Traditional monitoring methods such as manual inspection are time-consuming, labor-intensive, low in accuracy, and limited in coverage. With advanced semantic segmentation techniques, accurate real-time understanding of the entire construction site can be achieved over a wide coverage range.
Semantic segmentation involves pixel-by-pixel segmentation of objects in an image and gives corresponding semantic information for each pixel. This technique can be used to identify various elements of a job site, such as workers, equipment, materials, and structures. By analyzing the segmentation results, it can be determined whether there is a device failure and a security violation. In addition, semantic segmentation can be integrated with other monitoring systems (such as sensors and unmanned aerial vehicles) to achieve better safety monitoring effects.
One of the main advantages of semantic segmentation is its strong ability to handle complex, dynamic scenarios. Building sites are constantly changing, and semantic segmentation can adapt to these changes and update segmentation results accordingly. This function is particularly useful in real world large projects, as field conditions can vary greatly over time. Another advantage of semantic segmentation is that it can improve efficiency and productivity. The construction process is automatically monitored through the semantic segmentation technology without consuming manpower, so that workers can concentrate on more critical tasks such as maintenance and repair to ensure that resources of a construction site are better utilized and workflow safety compliance is achieved, and the effect of the whole project is improved.
Early deep learning-based visual tasks in construction site scenes focused primarily on the identification of specific targets. For example, Zhu et al. propose a method for tracking equipment and workers in a construction site scene. Luo et al. extend worker tracking to the identification of worker behaviors, such as standing, moving, and tilting, so that workers' activities can be captured and understood in remote surveillance video. Wu et al. use target detection to automatically detect whether a construction worker wears a helmet and identify the helmet's color. These works demonstrate the application of deep learning-based visual models in construction site scenes. However, since these methods are mostly based on object detection or tracking techniques, the semantic mapping information they generate is very limited, making it difficult to fully understand the construction site scene. In practice, it is more desirable to understand as many common objects as possible at different construction sites. Thus, more and more researchers are beginning to explore the application of semantic segmentation techniques on construction sites.
Currently, semantic segmentation methods for visual understanding of construction sites still have room for improvement, because most existing construction site semantic segmentation models are not specifically designed for construction site datasets, but rather migrate from models designed for other natural image datasets. Thus, there is still room for improvement in the performance of existing models on a construction site dataset.
Disclosure of Invention
In order to solve the problems described in the background art, namely that most existing construction site semantic segmentation models are not specifically designed for construction site datasets, that their performance on such datasets is unsatisfactory, and that they still need improvement, the application provides a semantic segmentation network method under multi-stage context guidance in a construction site scene.
The technical scheme of the application is as follows: a method of semantic segmentation network under multi-stage context guidance in a building site scene, comprising the steps of: s1, setting a supervision training set: after the site scene image is acquired, carrying out manual pixel-level labeling on the site scene image, labeling the category to which each pixel in the image belongs, and manually dividing all the labeled images into a training set and a testing set; performing supervision training on the ResNet50 network by using images of the training set;
s2, extracting multi-level convolution features of common construction site images: a ResNet50 network is adopted as the encoding module, a feature enhancement module (CWFAM) is adopted to enhance each layer of the ResNet50 network, the multi-level convolution features of the training set images are extracted, and the extracted features of each layer are recorded as F_i, i ∈ {1, 2, 3, 4, 5};
S3, enhancing the high-level feature representation with global information: a non-local attention module with coordinate-center pixel aggregation (CFNLAM) is adopted to enhance the multi-level convolution features extracted by the ResNet50 network; the enhanced features are added to the original input features through a residual block (ResBlock) to obtain the multi-level convolution enhancement features, guiding the subsequent multi-stage decoding process;
s4, a decoding module (Decoder module) is adopted to recover space information operation of the multi-level convolution enhancement features, so that feature decoding is realized;
s5, comparing the decoded features with the features of the test set, and performing cyclic supervision training until the training requirement is met.
Preferably, the feature enhancement module (CWFAM) decomposes a large-kernel convolution into a depth-wise dilated (hole) convolution with dilation d, a depth-wise convolution with kernel (2d-1)×(2d-1), and a 1×1 convolution, and enhances each layer of the ResNet50 network as follows:

Att = f_conv^1×1(f_dwDConv(f_dwConv(F))), F' = Att ⊗ F,

wherein F ∈ R^(C×H×W) represents the input feature, Att ∈ R^(C×H×W) represents the attention map, ⊗ represents pixel-wise multiplication (Pixel-wise Multiplication), f_conv^1×1 represents a convolution operator (Conv) with a kernel of size 1×1, f_dwConv represents the depth-wise convolution operator (Depth-wise convolution operator), and f_dwDConv represents the depth-wise dilated convolution operator.
The specific enhancement procedure of the non-local attention module with coordinate-center pixel aggregation (CFNLAM) is as follows:

the feature F_i ∈ R^(c×h×w) of the i-th stage is first fed into three separate convolution layers to obtain the query, key and value vectors:

Q = f_conv(F_i), K = f_conv(F_i), V = f_conv(F_i), giving Q, K, V ∈ R^(c×(h×w));

the vectors Q and K are operated on as follows to obtain an attention matrix:

A = f_softmax(Q × K^T),

wherein A ∈ R^((h×w)×(h×w)) represents the attention matrix (Attention matrix), f_softmax(·) represents the nonlinear activation operation (softmax), and any position in A is denoted d_(i,j), representing the relationship between Q_i and K_j;

for any position i in Q, i ∈ [1, …, h×w], the features in the same column or row as i in K are taken out and spliced into E_i ∈ R^((h+w-1)×c);

the long-range dependence between the i-th element of Q and the j-th element of E_i can be expressed as d_(i,j) = Q_i · (E_(i,j))^T, wherein d_(i,j) ∈ D, and D ∈ R^((h+w-1)×(h×w)) represents the association matrix;

the association matrix D further undergoes the nonlinear activation operation (softmax) to obtain the attention matrix A;

according to position n in V and F_i, the feature vector V_n ∈ R^c and the feature set Φ_n ∈ R^((h+w-1)×c) of F_i at n are obtained; Φ_n covers the feature vectors of all positions in the same row or column as n in V, and the distance-dependent features of position n are enhanced as follows:

F̃_n = Σ_i A_(i,n) Φ_(i,n),

wherein F̃ represents the output feature, F̃_n its feature vector at n, and A_(i,n) represents the attention value between positions i and n;

after the attention matrix A and Φ are aggregated (Aggregation), the result is further added pixel-wise (Pixel-wise addition) to the i-th stage feature F_i ∈ R^(c×h×w) to obtain the multi-level convolution enhancement feature, which is decoded by the decoding module (Decoder module) to output the decoded feature.
Preferably, when decoding is performed in step S4 with the decoding module (Decoder module), the decoding process is divided into three stages. In each decoding stage, the higher-level features are first interpolated to realize upsampling; the upsampled features are then concatenated with low-level features, and the channel size is changed by the convolution layer operation of step S3; the output features are further added to the upsampled features in residual form. After the three decoding steps, the semantic segmentation result is obtained, realizing understanding and interpretation of the construction site scene image.
Preferably, the features of the first stage are ignored during decoding to ensure the speed of reasoning.
Preferably, after step S4 is performed, class weights are further estimated to enhance the segmentation effect on small targets. The specific operation is as follows:

a weighted cross-entropy loss function is used to address class imbalance; the number of pixels of each class is counted over the training set to obtain the class proportions, from which the class weights are estimated, enhancing the segmentation effect on small targets.
Preferably, in the cyclic comparison training of step S5, the performance of the semantic segmentation model is measured by the accuracy (Acc) and the intersection over union (IoU), wherein Acc represents the proportion of correctly predicted pixels of a class among all pixels of that class, and IoU represents the overlap ratio between the ground-truth labels and the prediction results of each class; for a semantic segmentation task with n classes, the averages of Acc and IoU over all classes are denoted mAcc and mIoU, and the calculation formulas are:

mAcc = (1/n) Σ_i (p_ii / Σ_j p_ij),

mIoU = (1/n) Σ_i (p_ii / (Σ_j p_ij + Σ_j p_ji - p_ii)),

wherein p_ii and p_ij represent the number of pixels of class i correctly predicted as class i or incorrectly predicted as class j.
Preferably, the ResNet50 network of the encoding module (Encoder module) needs to be pre-trained on ImageNet before use; in actual use, a large model trained on natural scenes is migrated to the construction site data by means of transfer learning, so as to ensure the effectiveness of extracting construction site image features.
The application has the advantages that: the application provides a lightweight semantic segmentation network method dedicated to building construction scenes, which utilizes local and global background information to realize semantic segmentation of construction site scenes. Specifically, the proposed semantic segmentation network uses ResNet50 as the backbone network; the features of each layer are enhanced by the CWFAM and then input to the next layer to extract more effective information. The high-level semantic information is enhanced by the non-local attention module with coordinate-center pixel aggregation, and low-level features are gradually introduced in the decoding process to recover detailed information and obtain finer decoding results.
Meanwhile, when calculating the attention matrix, the CFNLAM focuses only on the relationships between each pixel and the pixels in the same row and column, which greatly reduces the amount of computation and improves the calculation speed.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of the overall framework of the proposed segmentation network in example 1, including the framework of the ResNet50 network (top), the CWFAM (bottom left), and the CFNLAM (bottom right);
Fig. 2 compares qualitative results of conventional semantic segmentation network methods and the semantic segmentation network method of the present application in example 1;
Fig. 3 is a visual presentation of feature maps at different stages of the semantic segmentation network of the present application in example 1;
Fig. 4 compares the Acc indicator across the 12 categories in example 1;
Fig. 5 compares the IoU indicator across the 12 categories in example 1.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without any inventive effort, are intended to be within the scope of the application.
Example 1: a semantic segmentation network method under multi-stage context guidance in a construction site scene. The system setup and environment used in this embodiment are as follows: Python is selected as the programming language, with deep learning and image processing libraries such as PyTorch, OpenCV and NumPy. The optimizer is Adadelta, and the learning rate is set to 0.0001. The network is trained for 50 epochs on a host equipped with a GTX 2080TI GPU and an Intel Core i7 CPU, with the batch size of the training process set to 4.
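As a minimal sketch of this training setup (the model and data here are random stand-ins, not the actual network or dataset), the reported configuration (Adadelta optimizer, learning rate 0.0001, batch size 4, 50 epochs) can be expressed in PyTorch as:

```python
import torch
import torch.nn as nn

# Placeholder 1x1-conv "network" standing in for the segmentation model;
# the real network is ResNet50 + CWFAM + CFNLAM + decoder.
model = nn.Conv2d(3, 12, kernel_size=1)  # 12 output classes

# Reported configuration: Adadelta with lr = 0.0001.
optimizer = torch.optim.Adadelta(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One random batch of size 4 standing in for the site-scene data loader.
images = torch.randn(4, 3, 32, 32)
labels = torch.randint(0, 12, (4, 32, 32))

for epoch in range(2):  # the patent trains for 50 epochs
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```

The two-epoch loop above is only to keep the sketch fast; in the described setup the outer loop would run for 50 epochs over the full training set.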
As shown in fig. 1, the semantic segmentation network in this embodiment is mainly composed of an encoding module (Encoder module), a non-local attention module (CFNLAM), and a decoding module (Decoder module).
The specific use of the semantic segmentation network is as follows:
s1, setting a supervision training set: after the site scene image is acquired, carrying out manual pixel-level labeling on the site scene image, labeling the category to which each pixel in the image belongs, and manually dividing all the labeled images into a training set and a testing set; the images of the training set are used to conduct supervision training on the ResNet50 network.
S2, extracting multi-level convolution features of common construction site images: the ResNet50 network is used as the encoding module (Encoder module); it needs to be pre-trained on ImageNet before use, and in actual use a large model trained on natural scenes is migrated to the construction site data by means of transfer learning, so as to ensure the effectiveness of extracting construction site image features;
a feature enhancement module (CWFAM) is adopted to enhance each layer of the ResNet50 network and extract the multi-level convolution features of the training set images; the extracted features of each layer are recorded as F_i, i ∈ {1, 2, 3, 4, 5};
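The collection of stage-wise features F_i, i ∈ {1, …, 5}, can be sketched as follows. A toy five-stage backbone stands in for the pretrained ResNet50 (loading actual ImageNet weights is omitted); the output of every stage is gathered as one F_i, each at half the previous spatial resolution:

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Toy stand-in for a five-stage backbone such as ResNet50."""
    def __init__(self):
        super().__init__()
        chans = [3, 8, 16, 32, 64, 128]  # illustrative channel widths
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                nn.ReLU())
            for i in range(5))

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # F_1 ... F_5, halving resolution each stage
        return feats

feats = TinyBackbone()(torch.randn(1, 3, 64, 64))
```

With a real ResNet50, the same pattern applies: each residual stage's output is kept as one F_i and later enhanced by the CWFAM before entering the next stage.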
The attention mechanism can be regarded as an adaptive selection process that selects more useful information from the input features and automatically ignores background disturbances. Using the self-attention mechanism and large-kernel convolutions are two common means of building long-range dependencies, but both methods are computationally intensive and require significant computational resources.
To exploit the advantages of the self-attention mechanism and large-kernel convolution while reducing the computational effort, as shown in fig. 1, the present application employs a feature enhancement module (CWFAM) that decomposes a large-kernel convolution into a depth-wise dilated (hole) convolution with dilation d, a depth-wise convolution with kernel (2d-1)×(2d-1), and a 1×1 convolution, enhancing each layer of the ResNet50 network as follows:

Att = f_conv^1×1(f_dwDConv(f_dwConv(F))), F' = Att ⊗ F,

wherein F ∈ R^(C×H×W) represents the input feature, Att ∈ R^(C×H×W) represents the attention map, ⊗ represents pixel-wise multiplication (Pixel-wise Multiplication), f_conv^1×1 represents a convolution operator (Conv) with a kernel of size 1×1, f_dwConv represents the depth-wise convolution operator (Depth-wise convolution operator), and f_dwDConv represents the depth-wise dilated convolution operator.
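A minimal PyTorch sketch of this decomposition is given below. The dilation d = 3 and the 7×7 dilated kernel are illustrative choices consistent with the description, not the exact values used by the application:

```python
import torch
import torch.nn as nn

class CWFAM(nn.Module):
    """Sketch of the feature enhancement module: a large-kernel convolution
    decomposed into a depth-wise conv with kernel (2d-1)x(2d-1), a depth-wise
    dilated ("hole") conv with dilation d, and a 1x1 conv; the resulting
    attention map multiplies the input pixel-wise."""
    def __init__(self, channels: int, d: int = 3, k: int = 7):
        super().__init__()
        # depth-wise convolution, kernel (2d-1)x(2d-1)
        self.dw = nn.Conv2d(channels, channels, 2 * d - 1,
                            padding=d - 1, groups=channels)
        # depth-wise dilated convolution, dilation d (illustrative kernel k)
        self.dw_dilated = nn.Conv2d(channels, channels, k,
                                    padding=(k // 2) * d, dilation=d,
                                    groups=channels)
        # 1x1 point-wise convolution producing the attention map
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        att = self.pw(self.dw_dilated(self.dw(f)))  # attention map Att
        return att * f                              # pixel-wise multiplication
```

Because every spatial convolution is depth-wise, the decomposition covers a large receptive field at a fraction of the cost of a dense large-kernel convolution.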
S3, enhancing the high-level feature representation with global information: convolution operators are limited to fixed geometric structures and small receptive fields, which makes it difficult to model long-range context information; in contrast, non-local networks exploit the self-attention mechanism to explore the context information of the full image; however, conventional non-local modules calculate the relationships between all locations in the feature map, and their computational complexity depends on the spatial size of the feature map, which may reduce the calculation speed of the overall network.
According to the principle that the contributions of long-distance information from different positions are highly unbalanced, as shown in fig. 1, the present application proposes a non-local attention module with coordinate-center pixel aggregation (CFNLAM), in which features are enhanced by the long-distance information of key positions; this can extract more semantic information in the horizontal and vertical dimensions while maintaining the calculation speed.
Specifically, in use, the non-local attention module with coordinate-center pixel aggregation (CFNLAM) is adopted to enhance the multi-level convolution features extracted by the ResNet50 network; the enhanced features are added to the original input features in residual form to obtain the multi-level convolution enhancement features, guiding the subsequent multi-stage decoding process;
as shown in fig. 1, the specific enhancement procedure of the non-local attention module with coordinate-center pixel aggregation (CFNLAM) is as follows:

the feature F_i ∈ R^(c×h×w) of the i-th stage is first fed into three separate convolution layers to obtain the query, key and value vectors:

Q = f_conv(F_i), K = f_conv(F_i), V = f_conv(F_i), giving Q, K, V ∈ R^(c×(h×w));

the vectors Q and K are operated on as follows to obtain an attention matrix:

A = f_softmax(Q × K^T),

wherein A ∈ R^((h×w)×(h×w)) represents the attention matrix (Attention matrix), f_softmax(·) represents the nonlinear activation operation (softmax), and any position in A is denoted d_(i,j), representing the relationship between Q_i and K_j;

for any position i in Q, i ∈ [1, …, h×w], the features in the same column or row as i in K are taken out and spliced into E_i ∈ R^((h+w-1)×c);

the long-range dependence between the i-th element of Q and the j-th element of E_i can be expressed as d_(i,j) = Q_i · (E_(i,j))^T, wherein d_(i,j) ∈ D, and D ∈ R^((h+w-1)×(h×w)) represents the association matrix;

the association matrix D further undergoes the nonlinear activation operation (softmax) to obtain the attention matrix A;

according to position n in V and F_i, the feature vector V_n ∈ R^c and the feature set Φ_n ∈ R^((h+w-1)×c) of F_i at n are obtained; Φ_n covers the feature vectors of all positions in the same row or column as n in V, and the distance-dependent features of position n are enhanced as follows:

F̃_n = Σ_i A_(i,n) Φ_(i,n),

wherein F̃ represents the output feature, F̃_n its feature vector at n, and A_(i,n) represents the attention value between positions i and n;

after the attention matrix A and Φ are aggregated (Aggregation), the result is further added pixel-wise (Pixel-wise addition) to the i-th stage feature F_i ∈ R^(c×h×w) to obtain the multi-level convolution enhancement feature;
the enhanced features are added with the original input features through residual blocks (ResBlock) to obtain multi-level convolution enhanced features, and a subsequent multi-layer decoding process is guided;
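The row-and-column attention described above can be sketched in PyTorch as follows. This is an illustrative approximation rather than the exact CFNLAM: it computes each pixel's affinity with its own row and column via `einsum`, and the center pixel appears in both branches (h+w entries), whereas the description counts it once (h+w-1):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFNLAM(nn.Module):
    """Sketch of coordinate-centered non-local attention: each pixel attends
    only to the pixels in its own row and column, reducing the (h*w)^2
    affinity matrix of full non-local attention to roughly (h*w)*(h+w)."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)  # query projection
        self.k = nn.Conv2d(channels, channels, 1)  # key projection
        self.v = nn.Conv2d(channels, channels, 1)  # value projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # affinities of each position with its row: (b, h, w, w)
        row = torch.einsum('bchw,bchv->bhwv', q, k)
        # affinities of each position with its column: (b, h, w, h)
        col = torch.einsum('bchw,bcuw->bhwu', q, k)
        # softmax jointly over the row+column neighborhood
        att = F.softmax(torch.cat([row, col], dim=-1), dim=-1)
        a_row, a_col = att[..., :w], att[..., w:]
        # aggregate values along the row and column paths
        out = (torch.einsum('bhwv,bchv->bchw', a_row, v) +
               torch.einsum('bhwu,bcuw->bchw', a_col, v))
        return out + x  # residual pixel-wise addition
```

Restricting attention to rows and columns is what keeps the memory and compute footprint small compared with a full non-local module, at the cost of needing the surrounding multi-stage design to propagate information diagonally.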
S4, decoding: a decoding module (Decoder module) is adopted to recover the spatial information of the multi-level convolution enhancement features, realizing feature decoding;

the enhanced high-level features serve as a guide for the decoding process, which can utilize the extracted context information to suppress background information. Low-level features are adopted to recover spatial details via the shortcut path, improving the accuracy of segmentation. Meanwhile, the decoding process is divided into three stages, and the features of the first stage are ignored to ensure the inference speed;
specifically, as shown in fig. 1, in each decoding stage the higher-level features are first interpolated to realize upsampling; the upsampled features are then concatenated with the lower-level features, and the channel size is changed by a convolution layer; the output features are further added to the upsampled features in residual form. After the three decoding steps, the semantic segmentation result is obtained, realizing understanding and interpretation of the construction site scene image.
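One decoding stage as described (interpolation upsampling, concatenation with the skip feature, a channel-restoring convolution, and a residual addition) can be sketched as follows; the fusion layer's exact composition is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """Sketch of one decoding stage: upsample the higher-level feature by
    interpolation, concatenate it with the low-level skip feature, fuse with
    a convolution that restores the channel size, then add the upsampled
    feature back as a residual."""
    def __init__(self, high_ch: int, low_ch: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(high_ch + low_ch, high_ch, 3, padding=1),
            nn.BatchNorm2d(high_ch),
            nn.ReLU(inplace=True))

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        # interpolate the higher-level feature to the skip feature's size
        up = F.interpolate(high, size=low.shape[-2:],
                           mode='bilinear', align_corners=False)
        fused = self.fuse(torch.cat([up, low], dim=1))
        return fused + up  # residual addition
```

Chaining three such stages over F_5 down to F_2 (skipping the first stage, as stated above) yields the decoder path; a final 1×1 classifier head would then map to per-class logits.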
S5, a weighted cross-entropy loss function is used to address class imbalance: the number of pixels of each class is counted over the training set to obtain the class proportions, from which the class weights are estimated, enhancing the segmentation effect on small targets.
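The class-weight estimation can be sketched as follows. The median-frequency weighting used here is one common concrete choice and is an assumption, since the text only states that weights are estimated from the class proportions:

```python
import numpy as np

def estimate_class_weights(label_maps, n_classes):
    """Count pixels per class over the training labels and derive
    median-frequency weights, so rare (small-object) classes contribute
    more to the weighted cross-entropy loss."""
    counts = np.zeros(n_classes, dtype=np.float64)
    for lab in label_maps:
        counts += np.bincount(lab.ravel(), minlength=n_classes)[:n_classes]
    freq = counts / counts.sum()                 # class proportions
    # weight = median class frequency / class frequency
    weights = np.median(freq[freq > 0]) / np.maximum(freq, 1e-12)
    return weights
```

The resulting vector would be passed as the `weight` argument of PyTorch's `nn.CrossEntropyLoss` (as a float tensor) so that under-represented classes are penalized more heavily.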
S6, comparing the decoded features with the features of the test set for cyclic supervision training until the training requirement is met.
In this embodiment, during the cyclic comparison training, the performance of the semantic segmentation model is measured by the accuracy (Acc) and the intersection over union (IoU), wherein Acc represents the proportion of correctly predicted pixels of a class among all pixels of that class, and IoU represents the overlap ratio between the ground-truth labels and the prediction results of each class. For a semantic segmentation task with n classes, the averages of Acc and IoU over all classes are denoted mAcc and mIoU, and the calculation formulas are:

mAcc = (1/n) Σ_i (p_ii / Σ_j p_ij),

mIoU = (1/n) Σ_i (p_ii / (Σ_j p_ij + Σ_j p_ji - p_ii)),

wherein p_ii and p_ij represent the number of pixels of class i correctly predicted as class i or incorrectly predicted as class j.
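The mAcc and mIoU computation from the confusion matrix p can be sketched as:

```python
import numpy as np

def macc_miou(pred, gt, n_classes):
    """Compute mAcc and mIoU from the confusion matrix, where p[i, j] is the
    number of class-i pixels predicted as class j (p[i, i] being correct)."""
    p = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, q in zip(gt.ravel(), pred.ravel()):
        p[t, q] += 1
    diag = np.diag(p).astype(np.float64)
    # per-class Acc: correct pixels / all ground-truth pixels of the class
    acc = diag / np.maximum(p.sum(axis=1), 1)
    # per-class IoU: intersection / union
    iou = diag / np.maximum(p.sum(axis=1) + p.sum(axis=0) - diag, 1)
    return acc.mean(), iou.mean()
```

The `np.maximum(..., 1)` guards against division by zero for classes absent from a given evaluation split; production metric code would typically exclude such classes from the mean instead.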
Use cases and qualitative comparison: to further verify the semantic segmentation advantages of the proposed method in construction site scenes, it is compared with the semantic segmentation results of SwiftNet, HRNet, GALDNet, and UNet++.
The dataset contains 859 images of construction sites, including 1720 instances. The instances can be categorized into 12 classes, such as people, excavators, and towers. These images are taken at various construction sites from different angles and locations.
The code is implemented in PyTorch. The application selects ResNet-50 as the backbone network of the encoding module (Encoder module) and pre-trains it on ImageNet. The optimizer of the network is Adadelta, with a learning rate of 0.0001. The network is trained for 50 epochs on a GTX 2080TI GPU, with the batch size set to 4.
As can be seen intuitively from fig. 2, the semantic segmentation network proposed by the present application can obtain better semantic segmentation results in various challenging scenarios.
Fig. 2 shows a visual comparison of the different methods. Rows 3 and 6 show that the proposed method can better capture the detail information in construction site scenes and generate clear boundaries. Meanwhile, the segmentation results in rows 4 and 7 show that the proposed method has good segmentation effects on both large and small targets. More importantly, as shown in row 5, the proposed method still achieves a good segmentation effect even in a multi-instance construction site scene.
Meanwhile, in fig. 3, the feature maps of several different stages of the proposed network are first averaged over the channel dimension and then displayed as images. With the proposed CWFAM and CFNLAM modules, semantic information is well extracted into the high-level features while details are retained in the low-level features. As can be seen from the feature maps of the CFNLAM and the decoder, background noise is suppressed and foreground objects obtain a higher response, which guides the decoder and realizes high-precision segmentation.
Figs. 4 and 5 show the quantitative comparison results. The method presented herein achieves the highest Acc and IoU in 6 of the 12 categories of the dataset. Thus, overall, the proposed semantic segmentation network method performs best among all compared approaches on the construction site segmentation dataset.
SwiftNet contains a compact encoder and a lightweight lateral-skip decoder; it achieves the best Acc performance in 4 categories and the best IoU performance in 3 categories, the second-best result among these methods.
Although UNet++ is recognized as having good semantic segmentation capability, the granularity of its decoder is still not fine enough, which causes the segmentation results to lose edge and position information; on the dataset herein, its results are less than ideal.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Claims (8)
1. A semantic segmentation network method under multi-stage context guidance in a construction site scene, comprising the following steps: S1, constructing a supervised training set: after construction site scene images are acquired, manual pixel-level labeling is performed on the images, marking the category to which each pixel belongs, and all labeled images are manually divided into a training set and a test set; supervised training of the ResNet50 network is performed using the training-set images;
S2, extracting multi-level convolution features of common construction site images: a ResNet50 network is adopted as the encoding module, a feature enhancement module (CWFAM) is adopted to enhance each layer of the ResNet50 network, multi-level convolution features of the training-set images are extracted, and the extracted features of each layer are recorded as F_i, i ∈ {1, 2, 3, 4, 5};
S3, enhancing the high-level feature representation with global information: a non-local attention module (CFNLAM) with center pixel aggregation is adopted to enhance the multi-level convolution features extracted by the ResNet50 network; the enhanced features are added to the original input features through a residual block (ResBlock) to obtain multi-level convolution-enhanced features, which guide the subsequent multi-level decoding process;
S4, a decoding module (Decoder module) is adopted to recover the spatial information of the multi-level convolution-enhanced features, realizing feature decoding;
S5, the decoded features are compared with the test-set labels, and cyclic supervised training is performed until the training requirement is met.
2. A multi-stage context-guided semantic segmentation network method according to claim 1, wherein: the feature enhancement module (CWFAM) decomposes a large-kernel convolution into a depth-wise dilated convolution (dilation rate d), a depth-wise convolution (kernel size (2d−1)×(2d−1)) and a 1×1 convolution; the enhancement process applied by the feature enhancement module to each layer of the ResNet50 network is:
Att = f_{1×1}(f_dwDConv(f_dwConv(F))),
F′ = Att ⊗ F,
wherein F ∈ R^{C×H×W} represents the input features, Att ∈ R^{C×H×W} represents the attention map, ⊗ represents pixel-wise multiplication (Pixel-wise Multiplication), f_{1×1} represents the convolution operator (Conv) with a 1×1 kernel, f_dwConv represents the depth-wise convolution operator (Depth-wise convolution operator), and f_dwDConv represents the depth-wise dilated convolution operator.
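The decomposition in claim 2 can be sketched in PyTorch as follows. This is a minimal illustrative sketch, not the patent's implementation: the class name, the kernel size of the dilated branch, and the ordering of the three convolutions are assumptions chosen to match the decomposition described above.

```python
import torch
import torch.nn as nn

class CWFAMSketch(nn.Module):
    """Sketch of the CWFAM decomposition: a depth-wise (2d-1)x(2d-1)
    convolution, a depth-wise dilated convolution with dilation d, and
    a 1x1 convolution produce an attention map Att, which multiplies
    the input pixel-wise (Att ⊗ F)."""

    def __init__(self, channels: int, d: int = 3):
        super().__init__()
        k = 2 * d - 1
        # depth-wise convolution, kernel (2d-1)x(2d-1)
        self.dw_conv = nn.Conv2d(channels, channels, k, padding=k // 2,
                                 groups=channels)
        # depth-wise dilated convolution, dilation rate d
        # (3x3 kernel is an assumption; the claim only fixes the dilation)
        self.dw_dconv = nn.Conv2d(channels, channels, 3, padding=d,
                                  dilation=d, groups=channels)
        # 1x1 convolution
        self.pw_conv = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        att = self.pw_conv(self.dw_dconv(self.dw_conv(x)))  # attention map
        return att * x  # pixel-wise multiplication Att ⊗ F
```

The padding choices keep the spatial size unchanged, so the attention map can be multiplied with the input element by element.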
3. A multi-stage context-guided semantic segmentation network method according to claim 1, wherein: the specific enhancement procedure of the non-local attention module (CFNLAM) with center pixel aggregation is as follows:
the feature F_i ∈ R^{c×h×w} of the i-th stage is first fed into three separate convolution layers to obtain query, key and value vectors:
Q = f_conv(F_i), K = f_conv(F_i), V = f_conv(F_i), obtaining Q, K, V ∈ R^{c×(h×w)};
the vectors Q and K are operated on as follows to obtain an attention matrix:
A = f_softmax(Q × K^T),
wherein A ∈ R^{(h×w)×(h×w)} represents the attention matrix (Attention matrix), f_softmax(·) represents the nonlinear activation operation (softmax), and any position in A is denoted d_{i,j}, representing the relationship between Q_i and K_j;
for any position i in Q, i ∈ [1, ..., (h×w)], the features in K located in the same column or row as i are taken out and concatenated into E_i ∈ R^{(h+w−1)×c};
the long-range dependency between the i-th element of Q and the j-th element of E_i can be expressed as d_{i,j} = Q_i × E_{i,j}^T, wherein d_{i,j} ∈ D and D ∈ R^{(h+w−1)×(h×w)} represents the association matrix;
the association matrix D ∈ R^{(h+w−1)×(h×w)} is further subjected to the nonlinear activation operation (softmax) to obtain the attention matrix A;
according to the position n in V and F_i, a feature vector V_n ∈ R^c and a feature set Φ_n ∈ R^{(h+w−1)×c} are obtained, the latter covering the feature vectors at all positions in the same row or column as n in V; the distance-dependent features at position n are enhanced as follows:
F̂_n = Σ_j A_{j,n} · Φ_{j,n},
wherein F̂ represents the output features, F̂_n is the output feature vector at position n, and A_{j,n} represents the attention value of channel j at position n;
after the attention matrix A and F̂ are aggregated (Aggregation), the result is further added pixel-wise (Pixel-wise addition) to the i-th stage feature F_i ∈ R^{c×h×w} to obtain the multi-level convolution-enhanced features, which, after decoding by the decoding module (Decoder module), yield the output decoded features.
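The row/column attention of claim 3 can be sketched in PyTorch as below. This is a simplified illustrative sketch under stated assumptions: the class name and channel reduction are inventions of this sketch, the softmax is taken jointly over row and column candidates, and the center pixel is counted in both the row and the column branch (h+w candidates rather than the patent's h+w−1), a simplification kept for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttentionSketch(nn.Module):
    """Sketch of row/column non-local attention: each position attends
    only to positions in its own row and its own column, and the
    aggregated result is added pixel-wise to the input (residual)."""

    def __init__(self, channels: int, reduced: int = None):
        super().__init__()
        reduced = reduced or max(channels // 8, 1)
        self.q = nn.Conv2d(channels, reduced, 1)   # query projection
        self.k = nn.Conv2d(channels, reduced, 1)   # key projection
        self.v = nn.Conv2d(channels, channels, 1)  # value projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Row branch: for each row, affinities across the w columns.
        qr = q.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        kr = k.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        vr = v.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        er = torch.bmm(qr, kr.transpose(1, 2)).reshape(b, h, w, w)
        # Column branch: for each column, affinities across the h rows.
        qc = q.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        kc = k.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        vc = v.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        ec = torch.bmm(qc, kc.transpose(1, 2)).reshape(b, w, h, h)
        ec = ec.permute(0, 2, 1, 3)                       # (b, h, w, h)
        # Softmax jointly over the row and column candidates.
        att = F.softmax(torch.cat([er, ec], dim=-1), dim=-1)
        ar, ac = att[..., :w], att[..., w:]
        # Aggregate values along the row and along the column.
        out_r = torch.bmm(ar.reshape(b * h, w, w), vr)
        out_c = torch.bmm(ac.permute(0, 2, 1, 3).reshape(b * w, h, h), vc)
        out_r = out_r.reshape(b, h, w, c).permute(0, 3, 1, 2)
        out_c = out_c.reshape(b, w, h, c).permute(0, 3, 2, 1)
        return out_r + out_c + x  # pixel-wise residual addition with F_i
```

Restricting attention to a position's row and column reduces the affinity computation from (h×w)² pairs to (h×w)·(h+w) pairs, which is the point of the sparse formulation in the claim.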
4. A multi-stage context-guided semantic segmentation network method according to claim 1, wherein: when decoding is performed in step S4 by the decoding module (Decoder module), the decoding process is divided into three stages; in each decoding stage, the higher-level features are first interpolated to realize upsampling, the upsampled features are then concatenated with the lower-level features, the channel size is changed by the convolution layer operation of step S3, and the output features are further added to the upsampled features in a residual manner; after the three decoding stages, the semantic segmentation result is obtained, realizing understanding and interpretation of the construction site scene image.
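One decoding stage of claim 4 can be sketched in PyTorch as follows. The class name, channel sizes, and the choice of a single 3×3 fusion convolution are assumptions of this sketch; the claim itself only fixes the upsample → concatenate → convolve → residual-add order.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStageSketch(nn.Module):
    """Sketch of one decoding stage: upsample the higher-level feature,
    concatenate it with the lower-level feature, fuse with a convolution
    that restores the channel size, then add the upsampled feature back
    as a residual."""

    def __init__(self, high_ch: int, low_ch: int):
        super().__init__()
        # Fusion convolution maps (high_ch + low_ch) back to high_ch
        # so the residual addition is shape-compatible.
        self.fuse = nn.Conv2d(high_ch + low_ch, high_ch, 3, padding=1)

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                           align_corners=False)       # upsampling
        out = self.fuse(torch.cat([up, low], dim=1))  # series connection
        return out + up                               # residual addition
```

Three such stages applied in sequence, as in the claim, progressively recover the spatial resolution lost in the encoder.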
5. A multi-stage context-guided semantic segmentation network method according to claim 4, wherein: during decoding, the features of the first stage are ignored to ensure inference speed.
6. A multi-stage context-guided semantic segmentation network method according to claim 1, 4 or 5, wherein: after step S4 is performed, class weights are estimated to enhance the segmentation effect on small targets, the specific operation being as follows:
a weighted cross-entropy loss function is used to address class imbalance; the number of pixels of each class is counted from the training-set results in the dataset to obtain the class proportions, from which the class weights are estimated, enhancing the segmentation effect on small targets.
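The class-weight estimation in claim 6 can be sketched as follows. The claim does not specify the exact weighting formula; the inverse-log weighting below is one common choice and is an assumption of this sketch, as is the function name.

```python
import numpy as np

def class_weights_from_masks(masks, num_classes, eps=1.02):
    """Count the pixels of each class over the training-set label masks,
    derive class proportions, and estimate weights that grow as a class
    becomes rarer, so small targets contribute more to the weighted
    cross-entropy loss."""
    counts = np.zeros(num_classes, dtype=np.int64)
    for m in masks:
        counts += np.bincount(m.ravel(), minlength=num_classes)
    freq = counts / counts.sum()          # class proportions
    return 1.0 / np.log(eps + freq)       # rarer class -> larger weight
```

The resulting vector can be passed directly as the per-class weight of a weighted cross-entropy loss.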
7. A multi-stage context-guided semantic segmentation network method according to claim 1, wherein: in the comparative cyclic training of step S5, the performance of the semantic segmentation model is measured by the accuracy (Acc) and the intersection over union (IoU); Acc represents the ratio of the number of correctly predicted pixels of a class to all pixels of that class, and IoU represents the overlap ratio between the ground-truth labels and the prediction results of each class; for a semantic segmentation task with n classes, the averages of Acc and IoU over all classes are denoted mAcc and mIoU, and the calculation formulas are:
mAcc = (1/n) Σ_{i=1}^{n} p_ii / Σ_{j=1}^{n} p_ij,
mIoU = (1/n) Σ_{i=1}^{n} p_ii / (Σ_{j=1}^{n} p_ij + Σ_{j=1}^{n} p_ji − p_ii),
wherein p_ii and p_ij denote the number of pixels of class i that are correctly predicted as class i or incorrectly predicted as class j, respectively.
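The mAcc and mIoU formulas of claim 7 can be computed from a confusion matrix as below; the function name is an invention of this sketch.

```python
import numpy as np

def macc_miou(conf):
    """Compute mAcc and mIoU from an n x n confusion matrix whose entry
    conf[i][j] = p_ij is the number of pixels of class i predicted as
    class j, following the per-class Acc and IoU definitions above."""
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)                                    # p_ii
    acc = tp / conf.sum(axis=1)                           # p_ii / sum_j p_ij
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    return acc.mean(), iou.mean()                         # mAcc, mIoU
```

For example, the confusion matrix [[3, 1], [1, 3]] gives a per-class Acc of 0.75 and a per-class IoU of 3/(4+4−3) = 0.6 for both classes.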
8. A multi-stage context-guided semantic segmentation network method according to claim 1, wherein: the ResNet50 network of the encoding module (Encoder module) is pre-trained on ImageNet before use; in actual use, a large model trained on natural scenes is transferred to the construction site data by means of transfer learning, ensuring effective extraction of construction site image features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310998675.9A CN117237625A (en) | 2023-08-09 | 2023-08-09 | Semantic segmentation network method under multi-stage context guidance in construction site scene |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117237625A true CN117237625A (en) | 2023-12-15 |
Family
ID=89086909
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310998675.9A Pending CN117237625A (en) | 2023-08-09 | 2023-08-09 | Semantic segmentation network method under multi-stage context guidance in construction site scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117237625A (en) |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||