CN114170422A - A Semantic Segmentation Method of Underground Image in Coal Mine - Google Patents

A Semantic Segmentation Method of Underground Image in Coal Mine

Info

Publication number
CN114170422A
CN114170422A (application CN202111248280.4A)
Authority
CN
China
Prior art keywords
feature map
stage
input
image
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111248280.4A
Other languages
Chinese (zh)
Inventor
程健
肖洪飞
闫鹏鹏
李昊
李和平
王广福
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Coal Research Institute CCRI
Original Assignee
China Coal Research Institute CCRI
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Coal Research Institute CCRI filed Critical China Coal Research Institute CCRI
Priority to CN202111248280.4A priority Critical patent/CN114170422A/en
Publication of CN114170422A publication Critical patent/CN114170422A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic segmentation method for underground coal mine images, belonging to the field of computer vision. First, acquired underground scene images are preprocessed to generate a data set. Then a feature extraction network with ResNet-101 as its backbone is constructed, and images at different scales are fed into each stage of the network to enhance the extracted features. A fused attention module is then constructed to fuse the features of each stage, and a global attention module enhances global information to obtain long-range dependencies. Finally, the obtained features are input into a classifier to generate a semantic map, completing the semantic segmentation of the image. The method greatly reduces computation and complexity, adopts an attention mechanism suited to the complexity of the scene, highlights the semantic information of the target region, and improves the image segmentation effect.

Description

Coal mine underground image semantic segmentation method
Technical Field
The invention relates to an image semantic segmentation method, in particular to a coal mine underground image semantic segmentation method suitable for use underground, and belongs to the field of computer vision.
Background
Studying the structural characteristics of underground tunnel visual scenes and methods for recovering them is of great significance for the structural analysis of complex underground scenes. Underground feature analysis must cope with the complex environment of coal mine scenes, caused by direct glare, dim light, dust, water mist, and smoke. Traditional image analysis methods mainly include the Lucas-Kanade algorithm, matching methods, energy-based methods, and phase-based methods. In a complex underground scene, however, brightness can change at any moment owing to abrupt illumination changes, and in a narrow scene even a small motion can produce a large positional shift, so traditional methods easily produce erroneous results. A method suited to analyzing complex underground scenes therefore needs to be studied.
Problems such as abrupt illumination change, shadow, and positional offset in complex underground scene analysis can be handled well by semantic analysis methods based on deep-learning image segmentation, since deep-learning segmentation models can approximate highly nonlinear mappings with great precision. However, research on structured scenes has mostly focused on indoor building scenes, and for the moment no related work has been carried out in coal mine underground tunnel scenes. The invention therefore proposes a semantic analysis method based on a multi-level feature fusion image segmentation theory, exploiting the large length-to-width ratio characteristic of coal mine tunnels while ensuring both segmentation accuracy and speed. For image segmentation in other scenes, a great deal of prior work exists.
One approach processes an image through combined frequency-spatial analysis to obtain its frequency characteristics, deconvolves the processed image, extracts high-dimensional features as feature points, and trains a convolutional neural network to segment the feature points in the image under test to obtain a detection result. The patent CN112598686B (Guangdong, 2021-06-04) encodes the image with a prior-knowledge vector to obtain a target feature map, decodes it into a first segmentation map, reconstructs the first segmentation map according to the prior-knowledge vector into several labeled segmentation images, and produces a second segmentation map from the target feature map; fusing the second segmentation map with the labeled results improves segmentation accuracy. Another approach counts the internal spectral histogram of a remote sensing image as a first classification feature, performs supervised classification with a curve matching algorithm, extracts the spatial correlation between the image and adjacent image targets as a second feature from the preliminary result, and segments the image by applying the curve matching algorithm again with both the spectral histogram and the spatial correlation. The patent CN112381101B (Jiangsu, 2021-05-28), an infrared road scene segmentation method based on class-prototype regression, regresses class-prototype features on the data set and clusters the network's deep features, making global class features more compact and amplifying inter-class differences; a relationship matrix and an attention module are constructed accordingly, further compacting the overall features and improving final segmentation accuracy. The patent CN112101369B (Beijing, 2021-02-05) uses the logical relationship between two at least partially overlapping target regions in an image together with the position information of their respective vertices, determines from the vertex positions the positions of the intersections between the two regions' boundaries, and segments the two different objects from the image to be processed according to the intersection positions and the logical relationship. In the literature (Chen C, Deng J, Lv N. Structures Detection in Remote Sensing Images Based on Multi-scale Semantic Segmentation [C]// 2020 IEEE International Conference on Smart Internet of Things (SmartIoT). IEEE, 2020), a multi-scale parallel structure replaces the traditional stack of convolutional layers; on this basis a new encoder-decoder semantic segmentation network is proposed, and a conditional random field constrains the segmentation result, yielding higher segmentation precision. The literature (Zhang F, Chen Y, Li Z, et al. ACFNet: Attentional Class Feature Network for Semantic Segmentation [C]// International Conference on Computer Vision (ICCV). IEEE, 2019) proposes the concept of class centers, extracting global context from a classification perspective; this class-level context describes the overall representation of each class in the image. An attentional class feature module then computes and adaptively combines different class centers for each pixel, giving a coarse-to-fine attentional class feature segmentation network and improving segmentation accuracy.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a coal mine underground image semantic segmentation method with simple steps, a good segmentation effect, and a robust description of scene features.
To overcome the defects of the prior art, the coal mine underground image semantic segmentation method of the invention comprises the following steps:
Step 1: collect underground pictures, perform annotation preprocessing on the picture data, and divide the preprocessed image data into a training sample data set and a test sample data set.
Step 2: input the training sample data set into a feature extraction network to extract features of the input pictures; the feature extraction network comprises an improved ResNet-101 network, in which the downsampling operations of the fourth and fifth stages of the conventional ResNet-101 are deleted while the other contents of those stages are retained;
Step 3: in the fourth stage of the improved ResNet-101 network, input simultaneously, through multi-scale input, the feature map output by the third stage and an additional input feature map, and output a low-level feature map; in the fifth stage, input simultaneously, through multi-scale input, the feature map output by the fourth stage and an additional input feature map, and output a high-level feature map; the additional input feature map is obtained by compressing the original input picture to the same size as the output feature map of the preceding stage and processing it with a residual unit;
Step 4: construct a fused attention module after the fifth stage of the improved ResNet-101 network, fuse the low-level feature map and the high-level feature map with the fused attention module, and output a new feature map containing global context semantic information;
Step 5: construct a global context enhancement module after the fused attention module to enhance the global representation of the new feature map, thereby obtaining the long-range dependencies between the pixels in the feature map and producing the final fused feature map;
Step 6: input the final fused feature map into a pre-trained classifier to generate a semantic map, then use the test sample data set to evaluate the generated semantic map and check the performance of the feature extraction network; if the performance meets the standard, the network can be used for semantic segmentation of coal mine underground images; otherwise, retrain;
Step 7: use the trained feature extraction network to perform semantic segmentation on input coal mine underground images.
The specific process of step 1) is as follows:
Step 11) Acquire clear images with an underground explosion-proof camera.
Step 12) Manually annotate the obtained images for semantic segmentation, i.e., classify each pixel in the image; different regions in the image are segmented from each other, each region being defined by its semantic information.
Step 13) Randomly construct a training sample set and a test sample set from the annotated images at a ratio of 4:1, as illustrated in the sketch below.
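
As an illustration of the 4:1 split in step 13), a minimal sketch; the file layout and names are assumptions, not taken from the patent:

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, train_ratio: float = 0.8, seed: int = 0):
    """Randomly split annotated images into training and test sets (4:1)."""
    images = sorted(Path(image_dir).glob("*.png"))  # hypothetical annotated frames
    random.Random(seed).shuffle(images)
    cut = int(len(images) * train_ratio)
    return images[:cut], images[cut:]               # (training set, test set)
```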
The specific process of step 2) is as follows:
Step 21) Improve on the original ResNet-101 network; the improved ResNet-101 network is divided into five stages in total, used to extract the features of the input image and obtain output feature maps at different levels;
Step 22) Each of the five stages of the improved ResNet-101 network contains multiple channels, and the information in each channel differs in importance for semantic segmentation; a channel attention mechanism is therefore added at each stage, assigning each channel a weight between 0 and 1 to represent the importance of the different channels (see the sketch after this list);
Step 23) Delete the downsampling operations of the fourth and fifth stages to enrich detail information, preventing the receptive fields of the fourth- and fifth-stage feature maps of the conventional ResNet-101 from growing with successive convolution and downsampling while the detail information of small targets in the feature maps is gradually lost;
Step 24) Use dilated convolution to preserve the output feature maps of the fourth and fifth stages, so that the feature maps of the third, fourth, and fifth stages have the same size, 1/8 that of the input image.
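
A minimal PyTorch sketch of steps 21) to 24), assuming torchvision's ResNet-101 as the skeleton (the patent's stages one to five correspond to the stem and layer1 to layer4): `replace_stride_with_dilation` removes the stride-2 downsampling of the fourth and fifth stages and substitutes dilated convolutions, holding stages three to five at 1/8 of the input resolution. The SE-style block is one plausible form of the 0-1 channel weighting of step 22), which the patent does not specify.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

class ChannelAttention(nn.Module):
    """SE-style channel attention: assigns each channel a weight in (0, 1)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                             # per-channel importance in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x)

# Stages four and five keep their content but lose downsampling: dilated
# convolutions hold the output at 1/8 of the input resolution (step 24).
backbone = resnet101(weights=None, replace_stride_with_dilation=[False, True, True])

x = torch.randn(1, 3, 512, 512)
stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
f3 = backbone.layer2(backbone.layer1(stem(x)))    # stage 3 output: 512 ch, 1/8 size
f4 = backbone.layer3(f3)                          # stage 4: dilated, still 1/8
f5 = backbone.layer4(f4)                          # stage 5: dilated, still 1/8
f5 = ChannelAttention(2048)(f5)                   # step 22: 0-1 channel weights
```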
The specific process of step 3) is as follows:
Step 31) Because the receptive field grows with successive convolution and downsampling, the detail information of small targets is gradually lost. To obtain more detail information, multi-scale input is adopted: basic residual units are added at the inputs of the fourth and fifth stages of the improved ResNet-101 network, and an additional 1/8-sized input image is fed directly into the basic residual unit to obtain the additional input feature maps of the fourth and fifth stages. These additional input feature maps have undergone one round of feature extraction and are low-level feature maps. In the improved ResNet-101 network, the input of every stage except the first is the output feature map of the previous stage; the inputs of the fourth and fifth stages are high-level feature maps, which contain less detail information than the low-level feature maps;
Step 32) Fuse the additional input feature maps processed by the basic residual units with the normal fourth- and fifth-stage input feature maps respectively, making full use of the shallow feature maps to enrich the small-target information in the deep feature maps;
Step 33) The feature maps at 1/8 of the input image size are enhanced with multi-scale input as follows. Assume the ResNet-101 network contains L_i convolution layers at stage i; the j-th layer convolution can then be defined as y_j = M_j(x_j), where y_j is the output tensor of the j-th layer and M_j comprises convolution, a ReLU activation function, and a regularization operation. The input picture x_i of stage i has size (N, H_i, W_i, C_i), where N denotes the batch size, H_i and W_i the height and width of the input feature map, and C_i the number of channels. The output feature map F_i of stage i can be expressed as:

F_i = M_{L_i}(M_{L_i−1}(⋯ M_1(x_i)))  (1)

Step 34) I_i denotes the additional input of stage i; its resolution is the same as that of the output tensor of stage i−1. Its feature map after feature extraction by the basic residual unit is:

Î_i = RCU(I_i)  (2)

Step 35) The fused input of stage i is then expressed as:

x̃_i = F_{i−1} ⊕ Î_i  (3)

where F_{i−1} denotes the output tensor of stage i−1 and ⊕ denotes the channel concatenation operation;
Step 36) The fifth stage outputs the high-level feature map χ_h and the fourth stage outputs the low-level feature map χ_l.
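
Formulas (1) to (3) might be realized as in the following sketch, under the assumption that the basic residual unit is a standard two-convolution residual block (the patent shows its structure only in FIG. 1): the raw image is compressed to the previous stage's resolution, passed once through the RCU (formula (2)), and channel-concatenated with the previous stage's output (formula (3)).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCU(nn.Module):
    """Basic residual unit: one round of feature extraction on the extra input."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return F.relu(self.body(x) + self.skip(x))    # residual connection

def fuse_stage_input(prev_out: torch.Tensor, image: torch.Tensor, rcu: RCU):
    """Formulas (2)-(3): compress the raw image to the previous stage's
    resolution, extract features once with the RCU, and channel-concatenate
    the result with the previous stage's output F_{i-1}."""
    extra = F.interpolate(image, size=prev_out.shape[-2:], mode="bilinear",
                          align_corners=False)        # 1/8-sized extra input I_i
    return torch.cat([prev_out, rcu(extra)], dim=1)   # ⊕: channel splicing
```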
The specific process of step 4) is as follows:
Step 41) Construct the fused attention module. The fused attention module has two inputs: the high-level feature map χ_h ∈ R^{C_h×H_h×W_h} output by the fifth stage in step 36) and the low-level feature map χ_l ∈ R^{C_l×H_l×W_l} output by the fourth stage. H_h×W_h is the number of spatial positions of the high-level feature map χ_h and H_l×W_l is the number of spatial positions of the low-level feature map χ_l; C_h and C_l are the channel counts of χ_h and χ_l, respectively. A 1×1 convolution W_θ transforms the features of the low-level feature map χ_l into ε_l ∈ R^{Ĉ×H_l×W_l}, where Ĉ is the number of channels of the transformed features and R denotes the real numbers; ε_l is the result of the feature transformation of χ_l, as shown in formula (4):

ε_l = W_θ(χ_l)  (4)

Step 42) Regularize the feature transformation result ε_l with the softmax function to obtain f(ε_l);
Step 43) Process f(ε_l) with a bottleneck feature transformation to obtain the channel dependencies; the 1×1 convolutions W_{γ1} and W_{γ2} perform the feature transformation applied to χ_h, giving the attention output O_F ∈ R^{C_h×H_h×W_h}, as in formula (5):

O_F = W_{γ2} ReLU(LN(W_{γ1}(f(ε_l))))  (5)

The output O_F reflects the compensation of χ_l to χ_h; this compensation is selected from all positions of χ_l.
Step 44) The finally output fused feature map Y_F is:

Y_F = cat(O_F, χ_h)  (6)
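
A sketch of formulas (4) to (6). The patent fixes the operator chain (softmax over ε_l, then W_{γ2}·ReLU·LN·W_{γ1}, then concatenation with χ_h) but not how f(ε_l) gathers compensation from χ_l's positions; below, a single attention map (Ĉ = 1) performs attention pooling over all positions of χ_l, and the bottleneck width is likewise an assumption.

```python
import torch
import torch.nn as nn

class FusedAttention(nn.Module):
    """Formulas (4)-(6): compensate the high-level map χ_h with context pooled
    from all positions of the low-level map χ_l (Ĉ = 1 assumed for clarity)."""
    def __init__(self, c_low: int, c_high: int, bottleneck: int = 256):
        super().__init__()
        self.w_theta = nn.Conv2d(c_low, 1, kernel_size=1)            # ε_l = W_θ(χ_l)
        self.w_gamma1 = nn.Conv2d(c_low, bottleneck, kernel_size=1)
        self.ln = nn.LayerNorm([bottleneck, 1, 1])                   # LN in formula (5)
        self.w_gamma2 = nn.Conv2d(bottleneck, c_high, kernel_size=1)

    def forward(self, x_low: torch.Tensor, x_high: torch.Tensor) -> torch.Tensor:
        b, c_low, _, _ = x_low.shape
        attn = self.w_theta(x_low).flatten(2).softmax(dim=-1)        # f(ε_l) over H_l·W_l
        # attention pooling: gather compensation from all positions of χ_l
        ctx = torch.einsum("bcn,bkn->bck", x_low.flatten(2), attn).view(b, c_low, 1, 1)
        o_f = self.w_gamma2(torch.relu(self.ln(self.w_gamma1(ctx))))  # formula (5)
        o_f = o_f.expand(-1, -1, x_high.size(2), x_high.size(3))      # broadcast to χ_h
        return torch.cat([o_f, x_high], dim=1)                        # formula (6)
```

Concatenation rather than addition follows formula (6), so the module doubles the channel count of χ_h.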
The specific process of step 5) is as follows:
Step 51) Construct a global attention module after the fifth stage of the ResNet-101 network to capture the long-range dependencies crucial to semantic segmentation. Let the input feature be X ∈ R^{C×H×W}, where C, H, and W are the number of channels, the spatial height, and the width, respectively. A 1×1 convolution W_θ transforms the feature X into θ ∈ R^{Ĉ×H×W}:

θ = W_θ(X)  (7)

where Ĉ is the number of channels of the transformed features;
Step 52) Regularization with the softmax function yields the similarity matrix f(θ);
Step 53) The output of the attention module is computed by the 1×1 convolutions W_{γ1} and W_{γ2} with intermediate normalization and a ReLU function, as in formula (8):

O_G = W_{γ2} ReLU(LN(W_{γ1}(f(θ))))  (8)

Step 54) The final output feature map Y_G ∈ R^{C×H×W} is expressed as:

Y_G = cat(O_G, X)  (9)
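
The global attention module of formulas (7) to (9) uses the same operator chain, applied to a single input X; a sketch under the same single-attention-map assumption:

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """Formulas (7)-(9): enhance the global representation of X with context
    pooled from every spatial position (one attention map assumed for clarity)."""
    def __init__(self, channels: int, bottleneck: int = 256):
        super().__init__()
        self.w_theta = nn.Conv2d(channels, 1, kernel_size=1)          # θ = W_θ(X)
        self.w_gamma1 = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.ln = nn.LayerNorm([bottleneck, 1, 1])
        self.w_gamma2 = nn.Conv2d(bottleneck, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        attn = self.w_theta(x).flatten(2).softmax(dim=-1)             # similarity f(θ)
        ctx = torch.einsum("bcn,bkn->bck", x.flatten(2), attn).view(b, c, 1, 1)
        o_g = self.w_gamma2(torch.relu(self.ln(self.w_gamma1(ctx))))  # formula (8)
        o_g = o_g.expand(-1, -1, h, w)                                # broadcast to X
        return torch.cat([o_g, x], dim=1)                             # formula (9)
```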
the specific process of the step 6) is as follows:
step 61) fusing the final output feature map Y obtained in the step 5) with the feature map YGInput classifierGenerating a channel semantic segmentation feature map;
step 62) comparing the generated feature map with the real label image labeled in the step 1) for supervising the training of the feature extraction network parameters, thereby obtaining a trained network model; inputting the test sample data set obtained in the step 1) as an input image into a trained network model, and checking the performance of the network model;
and 63) loading the trained model parameters, and performing scene semantic analysis on the next batch of pictures shot underground.
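
Step 62)'s supervision might look like the following sketch, assuming per-pixel cross-entropy against the annotated label maps and pixel accuracy as the acceptance check; the patent names neither a specific loss nor a performance threshold.

```python
import torch
import torch.nn as nn

def train_and_validate(model, train_loader, test_loader, epochs=80, lr=1e-3):
    """Supervise the network with annotated labels, then measure test performance."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4)
    ce = nn.CrossEntropyLoss(ignore_index=255)    # 255 = unlabeled pixels (assumption)
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            opt.zero_grad()
            logits = model(images)                # B × classes × H × W semantic map
            ce(logits, labels).backward()         # compare with ground-truth labels
            opt.step()
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            pred = model(images).argmax(dim=1)
            valid = labels != 255
            correct += (pred[valid] == labels[valid]).sum().item()
            total += valid.sum().item()
    return correct / max(total, 1)                # retrain if below the target standard
```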
Advantageous effects:
For the complex scenes in coal mine underground images, the method adopts an attention mechanism that highlights the semantic information of the target region and improves the image segmentation effect; compared with other segmentation methods it balances segmentation accuracy and speed, and is therefore more robust.
The method enhances the extracted features by constructing a multi-scale input network; constructs a fused attention module to fuse the features extracted at each stage; constructs a global attention module to enhance global information and obtain long-range dependencies; and finally uses a classifier to generate the semantic map and complete the semantic segmentation of the image, ensuring segmentation accuracy while improving the robustness of the algorithm.
Description of the drawings:
FIG. 1 is a schematic diagram of the basic residual network unit of the coal mine underground image semantic segmentation method.
FIG. 2 is a schematic diagram of the fused attention module of the coal mine underground image semantic segmentation method.
FIG. 3 is a schematic diagram of the global attention module of the coal mine underground image semantic segmentation method.
FIG. 4 is a network framework diagram of the multi-feature fusion image segmentation method of the invention.
Detailed description of the embodiments:
The invention is further described below with reference to the accompanying drawings.
The invention relates to a coal mine underground image semantic segmentation method. Underground scene pictures are collected with an underground explosion-proof camera and preprocessed to generate a data set. The data set is input, a feature extraction network is selected to extract the picture features, a multi-scale input module is constructed, and the extracted feature maps are enhanced. A fused attention module is then constructed to fuse the features extracted at each stage, and a global attention module is constructed to enhance global information and obtain long-range dependencies. Finally, a classifier generates the semantic map, completing the semantic segmentation of the image. Compared with other semantic segmentation methods, the method greatly reduces the computation and complexity of the algorithm, adopts an attention mechanism for scene complexity, highlights the semantic information of the target region, improves the image segmentation effect, and greatly enhances the robustness of the algorithm.
As shown in FIG. 4, the coal mine underground image semantic segmentation method comprises the following steps:
Step 1) Acquire underground images, perform annotation preprocessing on the image data, and divide the preprocessed image data into training and test sample data sets.
The specific process is as follows:
Step 11) Acquire clear images with an underground explosion-proof camera.
Step 12) Manually annotate the obtained images for semantic segmentation, i.e., classify each pixel in the image; different regions in the image are segmented from each other, each region being defined by its semantic information.
Step 13) Randomly construct a training sample set and a test sample set from the annotated images at a ratio of 4:1.
Step 2) Input the training sample data set obtained in step 1) into a feature extraction network with ResNet-101 as its skeleton to extract the input image features; the downsampling operations of the fourth and fifth of the five feature extraction stages of ResNet-101 are deleted, and the rest of those stages is retained, so that the feature maps are 1/8 the size of the input image.
The specific process is as follows:
Step 21) Use ResNet-101 as the skeleton network for feature extraction. ResNet-101 is divided into five stages, each composed of basic residual convolution units (RCUs), used to extract the features of the input image and obtain output feature maps at different levels.
Step 22) In the five stages of the feature extraction network ResNet-101, each stage contains multiple channels, and the information in each channel differs in importance for semantic segmentation; a channel attention mechanism is therefore added at each stage, assigning each channel a weight between 0 and 1 to represent the importance of the different channels.
Step 23) In the existing ResNet-101, the receptive field of the fourth- and fifth-stage feature maps grows with successive convolution and downsampling, and the detail information of small targets is gradually lost; the downsampling operations of the fourth and fifth stages are therefore deleted to enrich detail information.
Step 24) Use dilated convolution to preserve the output feature maps of the fourth and fifth stages, so that the feature maps of the third, fourth, and fifth stages have the same size, 1/8 that of the input image.
Step 3) In the fourth and fifth stages, where the downsampling operations have been deleted, enhance the 1/8-sized feature maps extracted in step 2) by adopting multi-scale input, and output the feature maps.
The specific process is as follows:
Step 31) The receptive field grows with successive convolution and downsampling, and the detail information of small targets is gradually lost. To obtain more detail information, multi-scale input is adopted: an additional input image is fed into a basic residual convolution unit (RCU), whose structure is shown in FIG. 1, to obtain the additional input feature maps of the fourth and fifth stages. These additional input feature maps have undergone one round of feature extraction and are low-level feature maps. In the ResNet-101 network, the input of every stage except the first is the output feature map of the previous stage; the inputs of the fourth and fifth stages are high-level feature maps, which contain less detail information than the low-level feature maps.
Step 32) Fuse the fourth- and fifth-stage additional input feature maps obtained in step 31) with the fourth- and fifth-stage input feature maps of ResNet-101, respectively, making full use of the shallow feature maps to enrich the small-target information in the deep feature maps.
Step 33) The process of multi-scale input is as follows. Assume the ResNet-101 network contains L_i convolution layers at stage i; the j-th layer convolution can then be defined as y_j = M_j(x_j), where y_j is the output tensor of the j-th layer and M_j comprises convolution, a ReLU activation function, and a regularization operation. The input picture x_i of stage i has size (N, H_i, W_i, C_i), where N denotes the batch size, H_i and W_i the height and width of the input feature map, and C_i the number of channels. The output feature map F_i of stage i can be expressed as:

F_i = M_{L_i}(M_{L_i−1}(⋯ M_1(x_i)))  (1)

Step 34) I_i denotes the additional input of stage i, at the same resolution as the output tensor of stage i−1. Its feature map after feature extraction is:

Î_i = RCU(I_i)  (2)

Step 35) The fused input of stage i can be expressed as:

x̃_i = F_{i−1} ⊕ Î_i  (3)

where F_{i−1} denotes the output tensor of stage i−1 and ⊕ denotes the channel concatenation operation.
Step 36) The fifth stage outputs the high-level feature map χ_h and the fourth stage outputs the low-level feature map χ_l.
Step 4) constructing a fusion attention module, fusing the feature maps of the input image 1/8 obtained in the fourth and fifth stages in the step 3), and outputting a new feature map containing global context semantic information, which is specifically shown in fig. 2;
the specific process is as follows:
step 41) constructing a fusion attention module: the fusion attention Module contains two inputs, from step 36) a high level feature map of the fifth stage output
Figure BDA0003321856200000101
And low-level feature maps of the fourth stage output
Figure BDA0003321856200000102
Hh×WhIs a high-grade characteristic diagram chihNumber of spatial positions of (H)l×WlIs a low-level characteristic diagram chilThe number of spatial locations of (a); chAnd ClRespectively a high-level characteristic diagram xhAnd low level characteristic diagram chil1 × 1 convolution WθFor matching the low-level characteristic diagram chilFeature transformation of
Figure BDA0003321856200000103
Wherein
Figure BDA0003321856200000104
Is the number of channels of the converted features, R being the real number, εlIs a low-level characteristic diagram χlThe result after feature transformation is shown as formula (4):
εl=Wθl) (4)
step 42) converting the feature into a result epsilonlF (epsilon) is obtained after regularization of softmax functionl)。
Step 43) processing f (epsilon) using bottleneck feature transformationl) Obtaining channel dependence, 1 × 1 convolution Wγ1And Wγ2Will be used forhThe feature conversion of the system obtains an attention output result
Figure BDA0003321856200000105
The results are as in formula (5):
OF=Wγ2ReLU(LN(Wγ1(f(εl)))) (5)
output OFReflect xlPair chihFrom χlIs selected from all the positions.
Step 44) finally outputting the fusion characteristic diagram YFComprises the following steps:
YF=cat(OF,χh) (6)
Step 5) Construct a global attention module after the fifth stage of the ResNet-101 network, as shown in FIG. 3, to enhance the global representation of the new feature map obtained in step 4) and to obtain the long-range dependencies among features of different levels, producing the final fused feature map.
The specific process is as follows:
Step 51) Construct the global attention enhancement block to capture the long-range dependencies crucial to semantic segmentation. Let the input feature be X ∈ R^{C×H×W}, where C, H, and W are the number of channels, the spatial height, and the width, respectively. A 1×1 convolution W_θ transforms the feature X into θ ∈ R^{Ĉ×H×W}:

θ = W_θ(X)  (7)

where Ĉ is the number of channels of the transformed features.
Step 52) Regularization with the softmax function yields the similarity matrix f(θ).
Step 53) The output of the attention module is computed by the 1×1 convolutions W_{γ1} and W_{γ2} with intermediate normalization and a ReLU function, as in formula (8):

O_G = W_{γ2} ReLU(LN(W_{γ1}(f(θ))))  (8)

Step 54) The final output feature map Y_G ∈ R^{C×H×W} can be expressed as:

Y_G = cat(O_G, X)  (9)
Step 6) Input the fused output feature map obtained in step 5) into a pre-trained classifier to generate a semantic map. Then input the test sample data set obtained in step 1) into the trained network and check the network's performance.
Step 61) Input the final output fused feature map Y_G obtained in step 5) into the classifier to generate the channel semantic segmentation feature map.
Step 62) Compare the generated feature map with the ground-truth label images annotated in step 1) to supervise the training of the network model parameters, thereby obtaining a trained network model; input the test sample data set from step 1) into the trained network model as input images and check the model's performance.
Step 63) Load the model parameters trained in step 62) and perform scene semantic analysis on the next batch of photos shot underground.
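
Reusing the FusedAttention and GlobalAttention sketches above, a hypothetical end-to-end shape check: channel widths follow ResNet-101's stage outputs, while the class count of 8 is an assumption not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 3, 512, 512)                 # an underground image
chi_l = torch.randn(1, 1024, 64, 64)            # stage-4 low-level map, 1/8 size
chi_h = torch.randn(1, 2048, 64, 64)            # stage-5 high-level map, 1/8 size

fuse = FusedAttention(c_low=1024, c_high=2048)
y_f = fuse(chi_l, chi_h)                        # formula (6): 1 × 4096 × 64 × 64
glob = GlobalAttention(channels=4096)
y_g = glob(y_f)                                 # formula (9): 1 × 8192 × 64 × 64

classifier = nn.Conv2d(8192, 8, kernel_size=1)  # 1×1 classifier, 8 classes assumed
logits = F.interpolate(classifier(y_g), size=x.shape[-2:],
                       mode="bilinear", align_corners=False)  # full-size semantic map
```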

Claims (7)

1. A coal mine underground image semantic segmentation method, characterized by the following steps:
Step 1: collect underground pictures, perform annotation preprocessing on the picture data, and divide the preprocessed picture data into a training sample data set and a test sample data set;
Step 2: input the training sample data set into a feature extraction network to extract the input picture features, the feature extraction network comprising an improved ResNet-101 network in which the original downsampling operations of the fourth and fifth stages are deleted while the other contents of the fourth and fifth stages are retained;
Step 3: in the fourth stage of the improved ResNet-101 network, input simultaneously, through multi-scale input, the feature map output by the third stage and an additional input feature map, and output a low-level feature map; in the fifth stage, input simultaneously, through multi-scale input, the feature map output by the fourth stage and an additional input feature map, and output a high-level feature map; the additional input feature map is obtained by processing the input picture with a residual unit, the original input picture first being compressed to the same size as the output feature map of the preceding stage;
Step 4: construct a fused attention module after the fifth stage of the improved ResNet-101 network, fuse the low-level feature map and the high-level feature map with the fused attention module, and output a new feature map containing global context semantic information;
Step 5: construct a global context enhancement module after the fused attention module to enhance the global representation of the new feature map, thereby obtaining the long-range dependencies between the pixels in the feature map and producing the final fused feature map;
Step 6: input the final fused feature map into a pre-trained classifier to generate a semantic map, then use the test sample data set to evaluate the generated semantic map and verify the performance of the feature extraction network; if the performance meets the standard, the network can be used for semantic segmentation of coal mine underground photo images; otherwise, retrain;
Step 7: use the trained feature extraction network to perform coal mine underground image semantic segmentation on the input coal mine underground pictures.
2. The coal mine underground image semantic segmentation method according to claim 1, characterized in that the specific process of step 1 is:
Step 11: acquire clear images with an underground explosion-proof camera;
Step 12: manually annotate the obtained images for semantic segmentation, i.e., classify each pixel in the image; different regions in the image are segmented from each other, each region being defined by semantic information;
Step 13: randomly construct a training sample set and a test sample set from the annotated images at a ratio of 4:1.
3. The coal mine underground image semantic segmentation method according to claim 1, characterized in that the specific process of step 2 is:
Step 21: improve on the original ResNet-101 network; the improved ResNet-101 network is divided into five stages in total, used to extract the features of the input image and obtain output feature maps at different levels;
Step 22: each of the five stages of the improved ResNet-101 network contains multiple channels, and the information contained in each channel differs in importance for semantic segmentation, so a channel attention mechanism is added at each stage, assigning each channel a weight between 0 and 1 to represent the importance of the different channels;
Step 23: delete the downsampling operations of the fourth and fifth stages to enrich detail information, thereby preventing the receptive fields of the fourth- and fifth-stage feature maps of the conventional ResNet-101 from growing with successive convolution and downsampling while the detail information of small targets is gradually lost;
Step 24: use dilated convolution to preserve the output feature maps of the fourth and fifth stages, so that the feature maps of the third, fourth, and fifth stages have the same size, 1/8 that of the input image.
4. The coal mine underground image semantic segmentation method according to claim 3, characterized in that the specific process of step 3 is:
Step 31: since the receptive field grows with successive convolution and downsampling, the detail information of small targets is gradually lost; to obtain more detail information, multi-scale input is adopted: basic residual units are added at the inputs of the fourth and fifth stages of the improved ResNet-101 network, and an additional 1/8-sized input image is fed directly into the basic residual unit to obtain the additional input feature maps of the fourth and fifth stages; these additional input feature maps have undergone one round of feature extraction and are low-level feature maps; in the improved ResNet-101 network, the input of every stage except the first is the output feature map of the previous stage, and the inputs of the fourth and fifth stages are high-level feature maps containing less detail information than the low-level feature maps;
Step 32: fuse the additional input feature maps of the fourth and fifth stages, processed by the basic residual units, with the normal fourth- and fifth-stage input feature maps respectively, making full use of the shallow feature maps to enrich the small-target information in the deep feature maps;
Step 33: use multi-scale input to enhance the 1/8-sized feature maps, the process being as follows: assume the ResNet-101 network contains L_i convolution layers at stage i; the j-th layer convolution can then be defined as y_j = M_j(x_j), where y_j is the output tensor of the j-th layer and M_j comprises convolution, a ReLU activation function, and a regularization operation; the input picture x_i of stage i has size (N, H_i, W_i, C_i), where N denotes the batch size, H_i and W_i the height and width of the input feature map, and C_i the number of channels; the output feature map F_i of stage i can be expressed as:
F_i = M_{L_i}(M_{L_i−1}(⋯ M_1(x_i)))  (1)
Step 34: I_i denotes the additional input of stage i, whose resolution is the same as that of the output tensor of stage i−1; its feature map after feature extraction is:
Î_i = RCU(I_i)  (2)
Step 35: the fused input of stage i is expressed as:
x̃_i = F_{i−1} ⊕ Î_i  (3)
where F_{i−1} denotes the output tensor of stage i−1 and ⊕ denotes the channel concatenation operation;
Step 36: the fifth stage outputs the high-level feature map χ_h and the fourth stage outputs the low-level feature map χ_l.
5. The coal mine underground image semantic segmentation method according to claim 4, characterized in that the specific process of step 4 is:
Step 41: construct the fused attention module: the fused attention module has two inputs, the high-level feature map χ_h ∈ R^{C_h×H_h×W_h} output by the fifth stage in step 36 and the low-level feature map χ_l ∈ R^{C_l×H_l×W_l} output by the fourth stage; H_h×W_h is the number of spatial positions of the high-level feature map χ_h and H_l×W_l is the number of spatial positions of the low-level feature map χ_l; C_h and C_l are the channel counts of χ_h and χ_l, respectively; a 1×1 convolution W_θ transforms the features of the low-level feature map χ_l into ε_l ∈ R^{Ĉ×H_l×W_l}, where Ĉ is the number of channels of the transformed features and R denotes the real numbers; ε_l is the result of the feature transformation of χ_l, as shown in formula (4):
ε_l = W_θ(χ_l)  (4)
Step 42: regularize the feature transformation result ε_l with the softmax function to obtain f(ε_l);
Step 43: process f(ε_l) with a bottleneck feature transformation to obtain the channel dependencies; the 1×1 convolutions W_{γ1} and W_{γ2} perform the feature transformation applied to χ_h, giving the attention output O_F ∈ R^{C_h×H_h×W_h}, as in formula (5):
O_F = W_{γ2} ReLU(LN(W_{γ1}(f(ε_l))))  (5)
the output O_F reflects the compensation of χ_l to χ_h, this compensation being selected from all positions of χ_l;
Step 44: the finally output fused feature map Y_F is:
Y_F = cat(O_F, χ_h)  (6)
6. The coal mine underground image semantic segmentation method according to claim 1, characterized in that the specific process of step 5 is:
Step 51: construct a global attention module after the fifth stage of the ResNet-101 network to capture the long-range dependencies crucial to semantic segmentation; let the input feature be X ∈ R^{C×H×W}, where C, H, and W are the number of channels, the spatial height, and the width, respectively; a 1×1 convolution W_θ transforms the feature X into θ ∈ R^{Ĉ×H×W}:
θ = W_θ(X)  (7)
where Ĉ is the number of channels of the transformed features;
Step 52: regularization with the softmax function yields the similarity matrix f(θ);
Step 53: the output of the attention module is computed by the 1×1 convolutions W_{γ1} and W_{γ2} with intermediate normalization and a ReLU function, as in formula (8):
O_G = W_{γ2} ReLU(LN(W_{γ1}(f(θ))))  (8)
Step 54: the final output feature map Y_G ∈ R^{C×H×W} is expressed as:
Y_G = cat(O_G, X)  (9)
7. The coal mine underground image semantic segmentation method according to claim 1, characterized in that the specific process of step 6 is:
Step 61: input the final output fused feature map Y_G obtained in step 5 into the classifier to generate a channel semantic segmentation feature map;
Step 62: compare the generated feature map with the ground-truth label images annotated in step 1 to supervise the training of the feature extraction network parameters, thereby obtaining a trained network model; input the test sample data set obtained in step 1 into the trained network model as input images and verify the performance of the network model;
Step 63: load the trained model parameters and perform scene semantic analysis on the next batch of photos taken underground.
CN202111248280.4A 2021-10-26 2021-10-26 A Semantic Segmentation Method of Underground Image in Coal Mine Pending CN114170422A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111248280.4A CN114170422A (en) 2021-10-26 2021-10-26 A Semantic Segmentation Method of Underground Image in Coal Mine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111248280.4A CN114170422A (en) 2021-10-26 2021-10-26 A Semantic Segmentation Method of Underground Image in Coal Mine

Publications (1)

Publication Number Publication Date
CN114170422A true CN114170422A (en) 2022-03-11

Family

ID=80477308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111248280.4A Pending CN114170422A (en) 2021-10-26 2021-10-26 A Semantic Segmentation Method of Underground Image in Coal Mine

Country Status (1)

Country Link
CN (1) CN114170422A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110751636A (en) * 2019-10-12 2020-02-04 天津工业大学 A method for detecting retinal arteriosclerosis in fundus images based on an improved codec network
CN111462126A (en) * 2020-04-08 2020-07-28 武汉大学 Semantic image segmentation method and system based on edge enhancement
CN111797779A (en) * 2020-07-08 2020-10-20 兰州交通大学 Remote sensing image semantic segmentation method based on regional attention multi-scale feature fusion
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ren Tianci; Huang Xiangsheng; Ding Weili; An Chongyang; Zhai Pengbo: "Semantic Segmentation Algorithm Based on a Global Bilateral Network", Computer Science (计算机科学), no. 1, 15 June 2020 (2020-06-15) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114943724A (en) * 2022-06-09 2022-08-26 合肥合安智为科技有限公司 A dust detection and positioning method
CN115700781A (en) * 2022-11-08 2023-02-07 广东技术师范大学 Visual positioning method and system based on image inpainting in dynamic scene
CN116363134A (en) * 2023-06-01 2023-06-30 深圳海清智元科技股份有限公司 Method and device for identifying and dividing coal and gangue and electronic equipment
CN116363134B (en) * 2023-06-01 2023-09-05 深圳海清智元科技股份有限公司 Method and device for identifying and dividing coal and gangue and electronic equipment

Similar Documents

Publication Publication Date Title
CN114202672B (en) A small object detection method based on attention mechanism
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN111931684B (en) A weak and small target detection method based on discriminative features of video satellite data
Zhang et al. Infrared image segmentation for photovoltaic panels based on Res-UNet
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
Wang et al. Deep Learning for Object Detection: A Survey.
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
CN114170422A (en) A Semantic Segmentation Method of Underground Image in Coal Mine
CN118212532B (en) A method for extracting building change areas in dual-temporal remote sensing images based on twin hybrid attention mechanism and multi-scale feature fusion
CN111460936A (en) Remote sensing image building extraction method, system and electronic equipment based on U-Net network
CN110555420B (en) Fusion model network and method based on pedestrian regional feature extraction and re-identification
CN105243154B (en) Remote sensing image retrieval method based on notable point feature and sparse own coding and system
CN114155474A (en) Damage identification technology based on video semantic segmentation algorithm
CN108764018A (en) A kind of multitask vehicle based on convolutional neural networks recognition methods and device again
CN117407557B (en) Zero sample instance segmentation method, system, readable storage medium and computer
Xi et al. Detection-driven exposure-correction network for nighttime drone-view object detection
CN118941526A (en) A road crack detection method, medium and product
Ye et al. An improved transformer-based concrete crack classification method
Yosry et al. Various frameworks for integrating image and video streams for spatiotemporal information learning employing 2D–3D residual networks for human action recognition
CN114926718A (en) Low-small slow target detection method with fusion of adjacent scale weight distribution characteristics
Kajabad et al. YOLOv4 for urban object detection: Case of electronic inventory in St. Petersburg
Balachandran et al. Moving scene-based video segmentation using fast convolutional neural network integration of VGG-16 net deep learning architecture
Balasundaram et al. Zero-DCE++ Inspired Object Detection in Less Illuminated Environment Using Improved YOLOv5.
Liu et al. Target detection of hyperspectral image based on faster R-CNN with data set adjustment and parameter turning
Zhang et al. Research on rainy day traffic sign recognition algorithm based on PMRNet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination