CN112837330A - Leaf segmentation method based on multi-scale double attention mechanism and full convolution neural network - Google Patents
- Publication number
- CN112837330A (application CN202110230518.4A)
- Authority
- CN
- China
- Prior art keywords
- network
- feature
- segmentation
- map
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T7/11 — Region-based segmentation
- G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06T3/02 — Affine transformations
- G06T5/90 — Dynamic range modification of images or parts thereof
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20112 — Image segmentation details
- G06T2207/20132 — Image cropping
- G06V2201/07 — Target detection
Abstract
The invention discloses a segmentation system based on a multi-scale double attention mechanism and a fully convolutional neural network, comprising a feature extraction backbone network, a feature pyramid network, a semantic segmentation network, an object detector, a coefficient predictor and a fusion module, where the semantic segmentation network comprises a first convolutional layer, an attention module and a second convolutional layer. The feature extraction backbone network is a VoVNet57 network, which extracts features from the training-set and test-set images and sends them to the feature pyramid network. The feature pyramid network performs same-level feature-map fusion to obtain the P3-P7 feature maps. The P3-P7 feature maps obtained from the feature pyramid fusion network are input to an FCOS object detector, which generates suggestion-box categories and positions pixel by pixel and applies a Soft NMS operation to the suggestion boxes to obtain the final detection boxes. The coefficient predictor performs weight prediction on the instance information of each detection box to generate the instance proportion corresponding to that box. The semantic segmentation network processes the P3-P6 feature maps obtained from the feature pyramid fusion network to generate 4 segmentation maps. The fusion module superimposes the 4 segmentation maps with the detection boxes and outputs the final segmentation map according to the corresponding instance proportions.
Description
Technical Field
The invention relates to an image processing method, in particular to a leaf segmentation method based on a multi-scale double attention mechanism and a fully convolutional neural network.
Background
Plant phenotyping plays an important role in genetics, botany and agronomy. Leaves make up the largest proportion of most plant organs and play a key role in vegetation growth and development, so estimating leaf morphological structure and physiological parameters is important for vegetation growth monitoring. Observing the leaves helps reveal their growth state and ultimately helps us distinguish genetic contributions, improve the genetic characteristics of plants and increase crop yield. In high-throughput phenotypic analysis, automated segmentation of plant leaves is a prerequisite for measuring more complex phenotypic traits. Although leaves have distinctive appearance and shape characteristics, occlusion, variation in leaf shape and pose, and imaging conditions make the problem challenging.
Since the 1980s, many effective methods have been proposed for leaf segmentation, but existing leaf segmentation methods cannot adapt to complex backgrounds, their post-processing is cumbersome, there is still considerable room for improvement in accuracy, and a certain gap remains before real practical application.
The invention aims to overcome the deficiencies of the prior art and provide a leaf segmentation method based on a multi-scale double attention mechanism and a fully convolutional neural network.
Disclosure of Invention
To achieve the object of the invention, the following technical scheme is adopted:
A segmentation system based on a multi-scale double attention mechanism and a fully convolutional neural network comprises a feature extraction backbone network, a feature pyramid network, a semantic segmentation network, a target detector, a coefficient predictor and a fusion module, wherein: the feature extraction backbone network extracts features from the training-set and test-set images and sends them to the feature pyramid network; the feature pyramid network performs same-level feature-map fusion to obtain the P3-P7 feature maps; the P3-P7 feature maps obtained from the feature pyramid fusion network are input to the target detector, which generates the suggestion-box category and position pixel by pixel to obtain the final detection boxes; the coefficient predictor performs weight prediction on the instance information of each detection box to generate the instance proportion corresponding to that box; the semantic segmentation network processes the P3-P6 feature maps obtained from the feature pyramid fusion network to generate 4 segmentation maps; and the fusion module superimposes the 4 segmentation maps with the detection boxes and multiplies them by the corresponding instance proportions to output the final segmentation map.
The segmentation system is characterized in that: the feature extraction backbone network is a VoVNet57 network comprising 3 convolutional layers followed by 4 OSA modules in sequence, each OSA module consisting of 5 convolutional layers with identical input/output channels. The input to the VoVNet57 network is the original RGB picture; the 3 convolutional layers output a set of 128-channel feature maps, which are fed to the first OSA module, whose output serves as the input to the next OSA module, and so on in turn. The output of each OSA module is retained, so the feature maps output by VoVNet57 have four levels, of sizes 1/4, 1/8, 1/16 and 1/32 of the original image, with 256, 512, 768 and 1024 channels respectively.
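As a rough illustration of the shape bookkeeping above, the four OSA stage outputs can be tabulated for a square input (a sketch only, not the patent's implementation; the strides and channel counts are taken from the text):

```python
# Sketch of the VoVNet57 backbone output shapes described above.
# Strides (1/4 ... 1/32) and channel counts (256 ... 1024) come from
# the text; the function itself is purely illustrative.
def backbone_output_shapes(height, width):
    """Return (channels, h, w) for the four OSA stage outputs."""
    strides = [4, 8, 16, 32]          # 1/4, 1/8, 1/16, 1/32 of the input
    channels = [256, 512, 768, 1024]  # per-stage output channels
    return [(c, height // s, width // s) for c, s in zip(channels, strides)]

shapes = backbone_output_shapes(512, 512)
```

For a 512x512 input this gives (256, 128, 128) for the first stage down to (1024, 16, 16) for the last, matching the 1/4 to 1/32 scales listed above.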
The segmentation system is characterized in that: the 1/32 feature map output by the last OSA module of the VoVNet57 network is input to a feature pyramid network, the feature pyramid network performs convolution and downsampling operations on the feature map to finally generate a feature map with the size of the original image 1/128, the feature pyramid network performs progressive upsampling on the feature map to respectively generate feature maps with the sizes of the original images 1/64, 1/32, 1/16 and 1/8, the feature maps with the sizes of the original images 1/32, 1/16 and 1/8 and feature maps with corresponding sizes generated by a feature extraction backbone network are fused, and the fused feature map and the feature maps of 1/128 and 1/64 are used as P3-P7 feature maps finally generated by the feature pyramid network.
The segmentation system is characterized in that: the target detector is an FCOS target detector, and the target detector obtains the category and the position coordinate value of the suggestion box of each feature map pixel by pixel through classification and regression calculation of the P3-P7 feature maps obtained through the feature pyramid fusion network.
The segmentation system is characterized in that: note Fi∈RH*W*CIs the feature at the ith layer of the feature pyramid fusion network, wherein H, W, C represents the height, width and channel number of the feature map respectively, and the real frame provided by the training set is defined asThe 4 coordinates respectively represent the abscissa and ordinate of the upper left corner and the abscissa and ordinate of the lower right corner of the real frame, and c represents the category of the real frame; for any position (x, y) on the feature map by the formula:mapping it back to the input image, s being the total step size before this layer, the FCOS head detection network regressing the detection frame directly for each position; if position (x, y) is inside the real box, then it is considered a positive sample, otherwise it is considered a negative sample; FCOS defines a 4-dimensional vector t*=(l*,t*,r*,b*) As a regression target, these four quantities represent the distances of this position to the four sides of the frame, respectively; if one position is in a plurality of suggestion boxes, selecting the suggestion box with the smallest area as a target suggestion box; the last layer of the network of the target detector predicts the 1-dimensional classification label and 4-dimensional frame information of the target suggestion frame, and for classification and regression, 4 convolution layers are added behind the features respectively, and because the regression result is positive, exp (x) is used for de-mapping the true value in the regression branch; the loss function is as follows:
wherein L isclsIs the adaptive loss, LregIs the loss of cross-over ratio, NposIs the number of positive samples, λ is 1; FCOS detects objects of different sizes in different layer features, and uses 5 feature layers { P }3,P4,P5,P6,P7Step s of 8, 16, 32, 64, 128, respectively; FCOS adds a single branch to predict the centrality of a location by:BCE loss is adopted during training, and the centrality needs to be multiplied to the score of the classification during the reasoning phase, so that the prediction generated by the position far away from the target center can be restrained.
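The centerness target defined above can be computed directly from the regression distances $(l^*, t^*, r^*, b^*)$; a minimal pure-Python sketch (positions at the exact box center score 1.0, positions near an edge score close to 0):

```python
import math

# Centerness target from the FCOS regression distances (l*, t*, r*, b*):
# sqrt( min(l, r)/max(l, r) * min(t, b)/max(t, b) )
def centerness(l, t, r, b):
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

# A position at the exact center of a box has l == r and t == b:
assert centerness(10, 10, 10, 10) == 1.0
```

A position 1 pixel from the left edge of an 11-pixel-wide box, e.g. `centerness(1, 10, 9, 10)`, scores about 0.33, which is what lets the multiplied classification score suppress off-center predictions.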
The segmentation system, wherein the target detector performs a Soft NMS operation on the suggestion boxes to obtain the final detection boxes, comprising:
a. first, compute the intersection-over-union (IoU) between the different suggestion boxes;
b. for a suggestion box whose IoU exceeds the threshold, the confidence score is not set to 0 directly; instead the score of the box is reduced, as shown in the following formula:

$$C_i = \begin{cases} C_i, & iou(M, d_i) < H_t \\ C_i \left(1 - iou(M, d_i)\right), & iou(M, d_i) \ge H_t \end{cases}$$

where $C_i$ is the suggestion-box score of each leaf, $M$ is the suggestion box with the highest current score, $d_i$ is one of the remaining suggestion boxes, $H_t$ is the preset confidence threshold and $iou$ is the intersection-over-union; the suggestion boxes whose score is greater than a preset value are output as detection boxes.
The segmentation system, wherein the Soft NMS operation on the suggestion boxes further comprises:
c. if the IoU between the current highest-scoring suggestion box $M$ and a remaining suggestion box $d_i$ is greater than or equal to the preset threshold $H_t$, a Gaussian weighting function replaces the lower branch of the formula in step b, as shown below:

$$C_i = C_i \, e^{-\frac{iou(M, d_i)^2}{\sigma}}, \quad \forall d_i \notin D$$

where $\sigma$ is the width parameter of the function and $D$ is the set of retained detection boxes.
The segmentation system of, wherein: the coefficient predictor is used for carrying out weight prediction on the example information of the detection frame output by the target detector so as to generate an example proportion corresponding to the detection frame.
The segmentation system of, wherein: the semantic segmentation network comprises a first convolution layer, an attention module and a second convolution layer, wherein the first convolution layer carries out feature extraction on a P3-P6 feature map obtained through the feature pyramid fusion network, the attention module further promotes network expression of features extracted by the first convolution layer and outputs the network expression to the second convolution layer, and the second convolution layer generates 4 global segmentation maps after upsampling the output of the attention module.
The segmentation system of, wherein: feature map of a feature pyramid networkInputting the first convolution layer to expand the channel and fully extract the features, thereby generating a new feature mapWherein R is a real number domain, C, H and W respectively represent the channel number, length and width of the characteristic diagram, and then a multi-scale double attention module is applied to sequentially infer the space attention diagramAnd channel attention mapWhere R is the real number domain, C, H and W represent the number of channels, length and width of the feature map, respectively, and the whole process can be summarized as follows:
wherein M isoutIs the final output of the multi-scale dual attention module,representing the product of the matrix elements.
The segmentation system of, wherein: the multi-scale dual attention module includes a spatial attention module and a channel attention module, the spatial attention module generates two feature descriptors along a channel using global average pooling and maximum pooling:andwhere R is the real number domain, C, H and W represent the number of channels, length and width of the feature map, respectively, and the spatial attention module concatenates the descriptors and applies the convolutional layer to generate a global spatial attention mapWhere R is the real number domain, C, H and W represent the number of channels, length and width of the feature map, respectively, and the first convolution kernel of the convolution layer of the spatial attention module is sized toWhere R is the real number field, C, H and W represent the number of channels, length and width of the feature map, respectively, where R is the rate of reduction of the channel size, and thus, the global spatial context attention map MS_GThrough the lower partCalculating the formula:
whereinRepresenting the splicing operation, B is batch normalization, δ is the activation function ReLU. Wherein W0And PW17 × 7 layers and point convolution layers respectively, the convolution kernel size of which is respectivelyAnd C × H × W;
the spatial attention module performs a convolution operation on each spatial location using point convolution as a local context extractor, the local contextCalculated by the following formula:
MS_L(Min)=B(PW1(δ(B(PW0(Min)))))
finally, the feature matrix is combined and output by using broadcast addition, and the feature graph output by the spatial attention moduleComprises the following steps:
the channel attention module puts the feature map output by the spatial attention module into a global space module of the channel attention module to generate two groups of channel descriptors:andboth descriptors are forwarded to a shared multi-layered convolutional subnet to generate a global channel attention mapThe shared subnet is composed of two point convolutional layers,
the channel attention module is inserted into the local branch in parallel and keeps the same architecture as the local space attention, and finally, the feature map of the multi-scale double attention module is summarized and output by using broadcast addition,the specific calculation is as follows:
the segmentation system of, wherein: the fusion module superposes the 4 global segmentation graphs with the detection frames, multiplies the instance proportion corresponding to the detection frames by the instance proportion, and then outputs a final segmentation graph, which comprises the following steps:
a. crop the 4 global segmentation maps with all detection boxes to obtain the segmentation-map region corresponding to each detection box;
b. interpolate each cropped region to resize it to match the instance-proportion matrix;
c. multiply each resized region by its corresponding instance proportion to obtain a segmentation map for each detection box;
d. add and merge the segmentation maps of all detection boxes to generate the final segmentation map.
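The steps above can be sketched in plain Python (a simplified illustration: nearest-neighbour resizing stands in for the interpolation, boxes are (x0, y0, x1, y1) in map coordinates, and none of the function names come from the patent):

```python
def crop(seg, box):
    """Crop an H x W segmentation map to box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in seg[y0:y1]]

def resize_nn(region, out_h, out_w):
    """Nearest-neighbour resize of a nested-list region (stand-in for interpolation)."""
    in_h, in_w = len(region), len(region[0])
    return [[region[i * in_h // out_h][j * in_w // out_w]
             for j in range(out_w)] for i in range(out_h)]

def fuse(seg_map, boxes, coeffs, out_h, out_w):
    """Crop per box, resize, scale by its instance coefficient, and sum."""
    out = [[0.0] * out_w for _ in range(out_h)]
    for box, k in zip(boxes, coeffs):
        r = resize_nn(crop(seg_map, box), out_h, out_w)
        for i in range(out_h):
            for j in range(out_w):
                out[i][j] += k * r[i][j]
    return out
```

A real implementation would operate on tensors and bilinear interpolation, but the crop-resize-scale-sum structure is the same as steps a-d.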
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of the overall network architecture;
FIG. 3 is a schematic diagram of a multi-scale dual attention module;
FIG. 4 is a schematic view of a channel attention module;
FIG. 5 is a schematic view of a spatial attention module;
FIG. 6 is a diagram illustrating a split branch operation;
FIG. 7 is a diagram of the segmentation results.
Detailed Description
The invention is described in detail below with reference to FIGS. 1-7.
As shown in fig. 2, the present invention provides a segmentation system based on a multi-scale double-attention mechanism and a full convolution neural network, which includes a feature extraction backbone network, a feature pyramid network, a semantic segmentation network, an object detector, a coefficient predictor and a fusion module, wherein the semantic segmentation network includes a first convolution layer, an attention module and a second convolution layer.
As shown in FIG. 1, the leaf segmentation method based on the multi-scale double attention mechanism and the full convolution neural network of the invention comprises the following steps:
(1) Obtain the dataset provided by the Leaf Segmentation Challenge (LSC) and decompress the H5 files to obtain the original usable pictures. The original usable pictures include the original RGB images, label maps, binary maps and leaf-center maps.
The primary directory of the dataset contains four folders: A1, A2 and A4 mainly store time-lapse top-view images of Arabidopsis (993 original RGB pictures), while A3 mainly stores time-lapse top-view images of tobacco (83 original RGB pictures). Each of the four folders includes both a training set and a test set; the training set contains the original RGB images, label maps, binary maps and leaf-center maps, while the test set contains only the original RGB images and binary maps.
(2) The training set is converted to the COCO_2017 data format for ease of manipulation and processing.
The primary directory of a dataset in COCO_2017 format includes two folders: an annotations folder storing the annotation json files, and a train2017 folder storing the original images.
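The COCO annotation layout referred to above is a single JSON file per split; here is a minimal sketch of the structure (field names follow the COCO convention; the file name, sizes and coordinates are invented placeholders, not values from the patent's dataset):

```python
import json

# Minimal COCO-style annotation skeleton (illustrative values only).
coco = {
    "images": [
        # one entry per picture in train2017/
        {"id": 1, "file_name": "plant_0001.png", "width": 500, "height": 530},
    ],
    "annotations": [
        # one entry per leaf instance; bbox is [x, y, width, height]
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [120, 80, 60, 45], "area": 2700.0,
         "segmentation": [[120, 80, 180, 80, 180, 125, 120, 125]],
         "iscrowd": 0},
    ],
    "categories": [{"id": 1, "name": "leaf"}],
}

text = json.dumps(coco)   # this is what goes into annotations/*.json
loaded = json.loads(text)
```

Converting the LSC label maps amounts to emitting one `annotations` entry per labelled leaf region with its bounding box and polygon.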
(3) And carrying out image enhancement operation on the sample pictures in the training set so as to expand the training samples.
Randomly select sample pictures from the training set and apply at least one of the following operations to increase the number of samples: 1) horizontal and vertical flipping; 2) affine transformation, including translation, scaling and rotation; 3) brightness adjustment that darkens the image. This improves the generalization ability of the model, reduces overfitting and effectively improves detection and segmentation accuracy.
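A plain-Python sketch of these augmentations, applied to an image stored as an H x W nested list (the affine transforms are omitted here for brevity; a real pipeline would use an image library for those):

```python
import random

def hflip(img):
    """Horizontal flip of an H x W nested-list image."""
    return [row[::-1] for row in img]

def vflip(img):
    """Vertical flip."""
    return img[::-1]

def darken(img, factor=0.7):
    """Simple brightness reduction (factor < 1 darkens the image)."""
    return [[p * factor for p in row] for row in img]

def augment(img, rng=random):
    """Apply at least one randomly chosen operation, as in step (3)."""
    ops = [hflip, vflip, darken]
    for op in rng.sample(ops, k=rng.randint(1, len(ops))):
        img = op(img)
    return img
```

Note that for detection/segmentation training the same geometric transforms must also be applied to the label maps, which this sketch does not show.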
(4) Input the augmented training-set leaf images and the test-set leaf images into the feature extraction network. The feature extraction network comprises a feature extraction backbone network and a feature pyramid network; the feature extraction backbone network is a VoVNet57 network.
(5) Extract the features of the training-set and test-set images using a pre-trained VoVNet57 network, which comprises 3 convolutional layers followed by 4 OSA modules in sequence. The input to VoVNet57 is the original RGB picture; the 3 convolutional layers output a set of 128-channel feature maps, which are fed to the first OSA module, whose output serves as the input to the next OSA module, and so on in turn. The output of each OSA module is retained, so the feature maps output by VoVNet57 have four levels, of sizes 1/4, 1/8, 1/16 and 1/32 of the original image, with 256, 512, 768 and 1024 channels respectively. The core modules of the VoVNet57 network are the OSA modules: each consists of 5 convolutional layers with identical input/output channels whose features are aggregated simultaneously into the last layer. Each convolutional layer has two connections, one to the next layer to produce features with a larger receptive field, the other aggregated only into the final output feature map. This effectively solves the information redundancy caused by the dense connections of traditional feature extraction backbones, enhances the feature extraction effect, and improves extraction speed and GPU computing efficiency; the pre-trained VoVNet57 network also accelerates model convergence and improves model performance.
(6) Input the feature maps obtained by the VoVNet57 feature extraction backbone into the feature pyramid network, which builds a feature pyramid from the hierarchical semantic features of the convolutional network. The 1/32-scale feature map output by the last OSA module of the VoVNet57 network is input to the feature pyramid network, which applies convolution and downsampling to generate a feature map of 1/128 the original image size. The feature pyramid network then progressively upsamples this map to generate feature maps of 1/64, 1/32, 1/16 and 1/8 the original image size; the 1/32, 1/16 and 1/8 maps are fused with the corresponding-size maps produced by the feature extraction backbone, and the fused maps together with the 1/128 and 1/64 maps form the finally generated pyramid levels. The feature pyramid processing combines a top-down pathway with lateral connections: the small top-level feature map is enlarged by upsampling to the size of the feature map of the preceding stage, exploiting both the strong top-level semantics (helpful for classification) and the high-resolution bottom-level information (helpful for localization); the upsampling is realized by nearest-neighbour interpolation.
To combine the high-level semantic features with the precise localization capability of the bottom levels, a lateral connection structure similar to a residual network is adopted: each lateral connection fuses, by addition, the upsampled feature map of the level above with the same-resolution map of the current level, finally producing the P3-P7 feature maps (i.e., the feature maps of sizes 1/8, 1/16, 1/32, 1/64 and 1/128 of the original image). In this way the feature pyramid network further improves semantic expression through context information and increases feature-map resolution, better preserving information about small target objects and outputting features with stronger expressive power.
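The top-down pathway just described (nearest-neighbour upsampling plus an additive lateral connection) can be sketched in plain Python on nested-list feature maps (illustrative only; real feature maps also carry a channel dimension):

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of an H x W nested-list map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

def lateral_add(top, lateral):
    """Upsample the coarser map and add the same-resolution lateral map."""
    up = upsample2x(top)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(up, lateral)]
```

Applied repeatedly from the 1/128 map downward, this is the fusion that produces the P5, P4 and P3 levels from their coarser neighbours and the backbone's lateral maps.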
(7) Input the P3-P7 feature maps (i.e., the feature maps of sizes 1/8, 1/16, 1/32, 1/64 and 1/128 of the original image) obtained from the feature pyramid fusion network into the FCOS target detector, which obtains the suggestion-box category and position coordinates of each feature map pixel by pixel through classification and regression. Let $F_i \in \mathbb{R}^{H \times W \times C}$ be the feature at the $i$-th level of the feature pyramid fusion network, where $\mathbb{R}$ denotes the real domain and $H$, $W$, $C$ are the feature-map height, width and number of channels. A ground-truth box provided by the training set is defined as $B = (x_0, y_0, x_1, y_1, c)$, the four coordinates being the abscissa and ordinate of the upper-left and lower-right corners and $c$ the box category. Any position $(x, y)$ on the feature map can be mapped back to the input image by

$$\left(\left\lfloor \tfrac{s}{2} \right\rfloor + xs, \; \left\lfloor \tfrac{s}{2} \right\rfloor + ys\right)$$

where $s$ is the total stride before this level, and the FCOS target detector regresses a suggestion box directly for each position. If position $(x, y)$ on the feature map lies inside a ground-truth box it is treated as a positive sample, otherwise as a negative sample. The FCOS target detector defines a 4-dimensional vector $t^* = (l^*, t^*, r^*, b^*)$ as the regression target, the four quantities being the distances from the position to the four sides of the suggestion box. If a position lies inside several suggestion boxes, the one with the smallest area is selected as the target suggestion box. The last layer of the target detector predicts the 1-dimensional classification label and 4-dimensional box information of the target suggestion box, with the loss function:

$$L(\{p_{x,y}\}, \{t_{x,y}\}) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}(p_{x,y}, c^*_{x,y}) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}} \, L_{reg}(t_{x,y}, t^*_{x,y})$$

where $L_{cls}$ is the adaptive classification loss, $L_{reg}$ is the IoU loss, $N_{pos}$ is the number of positive samples, $\lambda = 1$, $\sum_{x,y}$ sums over all positions, $c^*_{x,y}$ denotes the category of the suggestion box at position $(x, y)$, $\mathbb{1}_{\{c^*_{x,y} > 0\}}$ is the indicator function (1 if $c^*_{x,y} > 0$, otherwise 0), $t_{x,y}$ and $t^*_{x,y}$ represent the predicted and ground-truth box positions, and $p_{x,y}$ is the class probability, i.e., the confidence score of the suggestion box at that position, obtained by the SoftMax function:

$$p_i = \frac{e^{V_i}}{\sum_j e^{V_j}}$$

where $e$ is the exponential base, $V_i$ is the value on the $i$-th neuron and $\sum_j$ sums over all neurons. FCOS detects objects of different sizes on different feature levels, using the 5 feature levels $\{P_3, P_4, P_5, P_6, P_7\}$ with strides $s$ of 8, 16, 32, 64 and 128 respectively. FCOS adds a single branch to predict the centerness of a position by:

$$centerness^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}$$

where $\min$ and $\max$ are the minimum and maximum functions and $l^*$, $r^*$, $t^*$ and $b^*$ are the distances from the position to the four sides of the box. Binary cross-entropy loss is adopted during training, and at the inference stage the centerness is multiplied into the classification score, so that predictions generated by positions far from the target center are suppressed.
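The position mapping and regression-target assignment above can be sketched in plain Python (boxes in the (x0, y0, x1, y1) format used here):

```python
def map_to_image(x, y, s):
    """Map feature-map position (x, y) back to the input image (stride s)."""
    return (s // 2 + x * s, s // 2 + y * s)

def regression_target(px, py, box):
    """(l*, t*, r*, b*) for image point (px, py) inside box, else None."""
    x0, y0, x1, y1 = box
    if not (x0 <= px <= x1 and y0 <= py <= y1):
        return None  # negative sample
    return (px - x0, py - y0, x1 - px, y1 - py)
```

For example, position (3, 2) on the stride-8 level P3 maps back to image point (28, 20); if that point lies inside a box, the four returned distances are the FCOS regression targets for that position.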
(8) The target detector performs a Soft-NMS operation on the proposed boxes to obtain the final detection boxes. To ensure the recall rate of object detection, traditional target detection methods usually apply non-maximum suppression (NMS): the generated proposed boxes are sorted by confidence score, the box with the highest score is kept, and other boxes whose overlap with it exceeds a certain proportion are deleted. Although simple and effective, NMS has a notable problem: it forces the scores of all adjacent proposed boxes to zero, so if a real object lies in the overlap region, detection of that object fails. Soft-NMS proceeds as follows:
a. First, sort all proposed boxes by confidence score;
b. Select the proposed box with the highest confidence score and compute its intersection ratio with each of the remaining proposed boxes;
c. Recalculate the confidence score of each remaining proposed box from the intersection ratio, as given by: C_i = C_i, if iou(M, d_i) < H_t; C_i · (1 − iou(M, d_i)), if iou(M, d_i) ≥ H_t;
where C_i is the confidence score of each leaf proposed box, M is the proposed box with the highest current confidence score, d_i is one of the remaining proposed boxes, H_t is the set confidence threshold and iou is the intersection ratio; proposed boxes whose score is greater than a preset value (for example, 0.6) are output as detection boxes;
d. When the intersection ratio is smaller than the threshold, the confidence scores of the remaining proposed boxes are unchanged;
e. When the intersection ratio is greater than or equal to the threshold, the confidence score of the remaining proposed box is not set to 0 directly; instead, its score is decreased: if the intersection ratio of the current proposed box M and a remaining proposed box d_i is greater than or equal to the set threshold H_t, the confidence score of that box decreases linearly. However, the value C_i given by the formula in step c is not a continuous function: when the intersection ratio of a proposed box d_i with M exceeds the threshold H_t, the score jumps, and this jump causes large fluctuations in the detection result. A more stable and continuous score-resetting function is therefore needed in place of the lower half of the formula in step c, and a Gaussian weighting function is used: C_i = C_i · e^(−iou(M, d_i)² / σ), for all d_i ∈ D, where σ is the width parameter of the function and D is the set of remaining proposed boxes;
f. Update the confidence score of each remaining proposed box in turn;
g. Re-sort the remaining proposed boxes by confidence score and repeat the above steps until all proposed boxes have been processed.
The computational complexity of Soft-NMS is the same as that of NMS, and its score-decay scheme effectively improves the recall rate of the model, thereby improving the detection results.
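A minimal sketch of steps a-g with the Gaussian decay (plain-Python helper names are hypothetical; boxes are (x0, y0, x1, y1) tuples):

```python
import math

def iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: pick the highest-scoring box, decay the
    scores of the rest by exp(-iou^2 / sigma) instead of zeroing
    them, drop boxes falling below score_thresh, and repeat."""
    dets = list(zip(boxes, scores))
    kept = []
    while dets:
        i = max(range(len(dets)), key=lambda k: dets[k][1])
        m, s = dets.pop(i)
        kept.append((m, s))
        dets = [(b, sj * math.exp(-iou(m, b) ** 2 / sigma))
                for b, sj in dets]
        dets = [(b, sj) for b, sj in dets if sj > score_thresh]
    return kept
```

Note that a fully overlapping duplicate is kept with a heavily decayed score rather than being deleted outright, which is exactly the recall-preserving behaviour motivated above.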
(9) The detection boxes are sent to a coefficient predictor, which performs weight prediction on the instance information of each detection box to generate the instance coefficients corresponding to that box. The coefficient predictor is a convolutional layer whose output is a 3D tensor that encodes instance-level information such as the rough shape and pose of an object.
(10) Features are extracted from the P3-P6 feature maps obtained by the feature pyramid fusion network through the first convolution layer of the semantic segmentation network; an attention module further refines the features extracted by the first convolution layer and outputs the adjusted feature map to a second convolution layer, which upsamples the output of the attention module to generate 4 global segmentation maps. To better improve the expressive power of the segmentation network and focus correctly on the target object, the invention provides a novel multi-scale dual attention module with spatial and channel descriptors. As shown in fig. 3, the feature map output by the first convolution layer passes through the spatial attention module and then the channel attention module to generate an attention weight map, and this weight map is multiplied element-wise with the input feature map to produce the final adjusted feature map. The module aggregates global and local features (as shown in fig. 2). The feature map F ∈ R^(C×H×W) of the feature pyramid network is taken as input, where R is the real number domain and C, H and W denote the number of channels, height and width of the feature map, respectively. First, the feature map is input into the first convolution layer to expand the channels and fully extract features, generating a new feature map M_in ∈ R^(C×H×W). The multi-scale dual attention module is then applied to sequentially infer a spatial attention map W_s ∈ R^(1×H×W) and a channel attention map W_c ∈ R^(C×1×1).
The whole process can be summarized as: M_out = W_c ⊙ (W_s ⊙ M_in), where M_out is the final output of the multi-scale dual attention module, M_in is the input feature map, ⊙ denotes the element-wise product of matrix elements, W_s is the spatial attention map and W_c is the channel attention map. Considering the arrangement order of the two modules, an ablation experiment was designed to select the optimal arrangement; the experiment showed that placing the spatial attention module before the channel attention module gives the highest accuracy, so this strategy was chosen.
As shown in fig. 4, the spatial attention module focuses mainly on the spatial dependencies of the convolutional features and generates a spatial attention matrix to highlight information-rich areas. When computing spatial attention, the module generates two feature descriptors along the channel axis using global average pooling and global max pooling: F_avg ∈ R^(1×H×W) and F_max ∈ R^(1×H×W), where R is the real number domain and H and W are the height and width of the feature map. These descriptors summarize the max-pooled and average-pooled features. Next, the spatial attention module concatenates the descriptors and applies convolutional layers to generate the global spatial attention map M_S_G ∈ R^(1×H×W). To reduce the number of parameters and improve training robustness, the first convolution kernel of the convolutional layer is sized (C/r)×7×7, where r is the channel reduction rate. The global spatial attention map M_S_G can thus be calculated as:
M_S_G(M_in) = B(PW_1(δ(B(W_0([MaxPool(M_in); AvgPool(M_in)])))))
where [;] denotes the concatenation operation, B is batch normalization, MaxPool is max pooling, AvgPool is average pooling, M_in is the input feature map, δ is the ReLU activation function, and W_0 and PW_1 are a 7×7 convolutional layer and a point convolutional layer, respectively; this composition of W_0 and PW_1 forms the multilayer convolution MLConv.
A parallel local branch is added to the spatial attention module to enrich per-element context and improve multi-scale information expression. In this branch, the spatial attention module uses point convolutions (kernel size 1) as a local context extractor, performing a convolution at each spatial location. The local context M_S_L can thus be calculated as:
M_S_L(M_in) = B(PW_1(δ(B(PW_0(M_in)))))
where B is batch normalization, PW_0 and PW_1 are point convolutional layers, M_in is the input feature map, and δ is the ReLU activation function. Finally, the feature matrices are combined using broadcast addition, and the feature map M_out' output by the improved spatial attention module is:
M_out' = M_in ⊙ σ(C_G ⊕ C_L)
where ⊕ denotes matrix (broadcast) addition, ⊙ denotes the element-wise product of matrix elements, σ is the Sigmoid activation function, M_in is the input feature map, and C_G and C_L are the global spatial attention and the local spatial attention, respectively.
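A toy numpy sketch of the spatial-attention data flow (channel-wise max/average pooling → attention map → sigmoid gating); the learned convolutions, batch normalization and the local branch are collapsed into a simple sum here, so this only illustrates the shapes and the element-wise reweighting, not trained behaviour:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat):
    """feat: (C, H, W). Pool over the channel axis to get the two
    1 x H x W descriptors, fuse them (a stand-in for the learned
    conv layers), squash with sigmoid and reweight the input."""
    max_desc = feat.max(axis=0, keepdims=True)    # (1, H, W) max-pooled
    avg_desc = feat.mean(axis=0, keepdims=True)   # (1, H, W) avg-pooled
    attn = sigmoid(max_desc + avg_desc)           # (1, H, W) attention map
    return feat * attn                            # broadcast over channels
```

The output has the same shape as the input, with each spatial location scaled by its attention weight in (0, 1).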
As shown in FIG. 5, unlike the spatial attention module, the channel attention module captures the inter-dependencies between channels and learns the inter-channel relationships of the features, with the goal of using more information to assign higher weights to the informative channels. To compute the channel attention map efficiently, the feature map output by the spatial attention module is fed into the global branch of the channel attention module, generating two sets of channel descriptors: F_max ∈ R^(C×1×1) and F_avg ∈ R^(C×1×1). Both descriptors are forwarded to a shared multilayer convolutional subnet to generate the global channel attention map M_C_G ∈ R^(C×1×1). The shared subnet consists of two point convolutional layers instead of two fully connected layers, and is calculated as:
M_C_G(M_out') = B(PW_1(δ(B(PW_0(MaxPool(M_out')))))) ⊕ B(PW_1(δ(B(PW_0(AvgPool(M_out'))))))
where MaxPool is global max pooling, AvgPool is global average pooling, M_out' is the feature map output by the spatial attention module, PW_0 and PW_1 are point convolutional layers, B is batch normalization and δ is the ReLU activation function. Similar to the spatial attention module, a local branch is inserted into the channel attention module in parallel; in this branch, point convolution is used as a local context extractor to perform a convolution at each position. Finally, broadcast addition is used to combine the branches, and the output feature map of the multi-scale dual attention module is:
M_out = M_out' ⊙ σ(C_G ⊕ C_L)
where M_out' is the feature map output by the spatial attention module, σ is the Sigmoid activation function, C_G and C_L are the global channel attention and the local channel attention, respectively, ⊙ denotes the element-wise product of matrix elements, and ⊕ denotes matrix (broadcast) addition.
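A numpy sketch of the channel-attention branch under the same simplifications (the shared two-layer point-convolution subnet is reduced to two small weight matrices w0 and w1, which are hypothetical stand-ins for the learned layers; the local branch is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w0, w1):
    """feat: (C, H, W). Global max/avg pooling gives two (C,)
    channel descriptors; both pass through the shared subnet
    w1 @ ReLU(w0 @ d), are summed, squashed with sigmoid, and
    broadcast-multiplied back onto the feature map per channel."""
    max_d = feat.max(axis=(1, 2))     # (C,) global max descriptor
    avg_d = feat.mean(axis=(1, 2))    # (C,) global average descriptor
    shared = lambda d: w1 @ np.maximum(w0 @ d, 0.0)  # shared subnet
    attn = sigmoid(shared(max_d) + shared(avg_d))    # (C,) weights
    return feat * attn[:, None, None]                # broadcast multiply
```

Each channel is scaled by a single weight, in contrast to the spatial branch, which scales each location.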
(11) The fusion module superimposes the 4 global segmentation maps with the detection boxes, multiplies them by the instance coefficients corresponding to each detection box, and outputs the final segmentation map, as shown in fig. 6.
a. Crop the 4 global segmentation maps with all detection boxes to obtain the segmentation-map region corresponding to each detection box;
b. Interpolate each cropped region so that its size matches the instance coefficient matrix;
c. Multiply each adjusted region by the corresponding instance coefficients to obtain a segmentation map for each detection box;
d. Add the segmentation maps of all detection boxes together to generate the final segmentation map.
The fusion module itself is translation-variant, which enables the network to use different activations to distinguish and locate individual leaves. The whole process of steps a-d above can be expressed as: S = Σ ( I(RoIAlign(bases, proposals)) ⊙ coefficients ), where proposals are the detection boxes, bases are the regions of the segmentation maps corresponding to the detection boxes, coefficients are the predicted instance coefficients of the detection boxes, I is the linear interpolation operation and RoIAlign is the operation that fixes the size of the detection box region.
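A toy numpy sketch of steps a-d, with a scalar coefficient per box standing in for the full coefficient tensor and the resizing step omitted, so only the crop-weight-sum structure is shown:

```python
import numpy as np

def fuse(seg_map, boxes, coeffs):
    """seg_map: (H, W) global segmentation map; boxes: integer
    (x0, y0, x1, y1) detection boxes; coeffs: one scalar instance
    coefficient per box. Crop each box region, weight it by its
    coefficient, and sum the per-box maps (steps a, c, d)."""
    out = np.zeros_like(seg_map, dtype=float)
    for (x0, y0, x1, y1), c in zip(boxes, coeffs):
        out[y0:y1, x0:x1] += seg_map[y0:y1, x0:x1] * c
    return out
```

Overlapping boxes simply accumulate their weighted contributions, matching the "add and combine" of step d.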
(12) Train the instance segmentation model, iteratively optimizing the model parameters based on the defined loss function until the model converges, and save the trained model; then test the trained model on the test data set to realize real-time segmentation of leaf images. FIG. 7 illustrates the segmentation effect of the trained model on pictures in the test set.
The invention can obtain the following beneficial effects:
(1) The invention adopts a one-stage target detection branch, which improves detection speed;
(2) The method performs data enhancement on the training samples using operations such as flipping, affine transformation, illumination adjustment and brightness/contrast transformation, enriching the image data, expanding the scale of the data set, alleviating the shortage of samples, and enhancing the robustness and generalization ability of the model;
(3) The invention adopts an FPN to extract features, removing the parameter-setting burden of traditional detection methods based on manually extracted features such as edges, contours and textures;
(4) The invention realizes automatic leaf segmentation using computer vision technology; compared with manual inspection, it saves labor cost, improves production efficiency and genuinely enables unmanned agricultural management;
(5) The invention provides a novel multi-scale dual-attention mechanism that improves the expressive power of the segmentation network in both local and global dimensions;
(6) The invention effectively embeds the attention module into the segmentation network and generates corresponding position-sensitive segmentation maps, which helps distinguish between leaves.
Claims (3)
1. A segmentation system based on a multi-scale dual-attention mechanism and a fully convolutional neural network, comprising a feature extraction backbone network, a feature pyramid network, a semantic segmentation network, a target detector, a coefficient predictor and a fusion module, characterized in that: the feature extraction backbone network extracts features from the training set images and test set images and sends them to the feature pyramid network; the feature pyramid network performs same-level feature map fusion to obtain the P3-P7 feature maps; the P3-P7 feature maps obtained through the feature pyramid fusion network are input into the target detector, which generates the proposed box category and position pixel by pixel to obtain the final detection boxes; the coefficient predictor performs weight prediction on the instance information of each detection box to generate the instance coefficients corresponding to that box; the semantic segmentation network processes the P3-P6 feature maps obtained from the feature pyramid fusion network to generate 4 segmentation maps; and the fusion module superimposes the 4 segmentation maps with the detection boxes and multiplies them by the corresponding instance coefficients to output the final segmentation map.
2. The segmentation system of claim 1, wherein: the feature extraction backbone network is the VoVNet57 network.
3. The segmentation system of claim 1, wherein: the coefficient predictor performs weight prediction on the instance information of the detection boxes output by the target detector.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110230518.4A CN112837330B (en) | 2021-03-02 | 2021-03-02 | Leaf segmentation method based on multi-scale double-attention mechanism and full convolution neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112837330A true CN112837330A (en) | 2021-05-25 |
CN112837330B CN112837330B (en) | 2024-05-10 |
Family
ID=75934347
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110230518.4A Active CN112837330B (en) | 2021-03-02 | 2021-03-02 | Leaf segmentation method based on multi-scale double-attention mechanism and full convolution neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112837330B (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113269139A (en) * | 2021-06-18 | 2021-08-17 | 中电科大数据研究院有限公司 | Self-learning large-scale police officer image classification model aiming at complex scene |
CN113379770A (en) * | 2021-06-30 | 2021-09-10 | 华南理工大学 | Nasopharyngeal carcinoma MR image segmentation network construction method, image segmentation method and device |
CN113469287A (en) * | 2021-07-27 | 2021-10-01 | 北京信息科技大学 | Spacecraft multi-local component detection method based on instance segmentation network |
CN113486930A (en) * | 2021-06-18 | 2021-10-08 | 陕西大智慧医疗科技股份有限公司 | Small intestinal lymphoma segmentation model establishing and segmenting method and device based on improved RetinaNet |
CN113486879A (en) * | 2021-07-27 | 2021-10-08 | 平安科技(深圳)有限公司 | Image area suggestion frame detection method, device, equipment and storage medium |
CN113538347A (en) * | 2021-06-29 | 2021-10-22 | 中国电子科技集团公司电子科学研究院 | Image detection method and system based on efficient bidirectional path aggregation attention network |
CN113658206A (en) * | 2021-08-13 | 2021-11-16 | 江南大学 | Plant leaf segmentation method |
CN113674142A (en) * | 2021-08-30 | 2021-11-19 | 国家计算机网络与信息安全管理中心 | Method, device, computer equipment and medium for ablating target object in image |
CN113780187A (en) * | 2021-09-13 | 2021-12-10 | 南京邮电大学 | Traffic sign recognition model training method, traffic sign recognition method and device |
CN113887455A (en) * | 2021-10-11 | 2022-01-04 | 东北大学 | Face mask detection system and method based on improved FCOS |
CN114037833A (en) * | 2021-11-18 | 2022-02-11 | 桂林电子科技大学 | Semantic segmentation method for Miao-nationality clothing image |
CN114418999A (en) * | 2022-01-20 | 2022-04-29 | 哈尔滨工业大学 | Retinopathy detection system based on lesion attention pyramid convolution neural network |
CN114511576A (en) * | 2022-04-19 | 2022-05-17 | 山东建筑大学 | Image segmentation method and system for scale self-adaptive feature enhanced deep neural network |
CN114581670A (en) * | 2021-11-25 | 2022-06-03 | 哈尔滨工程大学 | Ship instance segmentation method based on spatial distribution attention |
CN114693930A (en) * | 2022-03-31 | 2022-07-01 | 福州大学 | Example segmentation method and system based on multi-scale features and context attention |
CN114913428A (en) * | 2022-04-26 | 2022-08-16 | 哈尔滨理工大学 | Remote sensing image target detection system based on deep learning |
CN115661694A (en) * | 2022-11-08 | 2023-01-31 | 国网湖北省电力有限公司经济技术研究院 | Intelligent detection method, system, storage medium and electronic equipment for light-weight main transformer focusing on key characteristics |
CN116188479A (en) * | 2023-02-21 | 2023-05-30 | 北京长木谷医疗科技有限公司 | Hip joint image segmentation method and system based on deep learning |
CN117152443A (en) * | 2023-10-30 | 2023-12-01 | 江西云眼视界科技股份有限公司 | Image instance segmentation method and system based on semantic lead guidance |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150254843A1 (en) * | 2012-09-13 | 2015-09-10 | The Regents Of The University Of California | Lung, lobe, and fissure imaging systems and methods |
CN111192277A (en) * | 2019-12-31 | 2020-05-22 | 华为技术有限公司 | Instance partitioning method and device |
CN112381835A (en) * | 2020-10-29 | 2021-02-19 | 中国农业大学 | Crop leaf segmentation method and device based on convolutional neural network |
Non-Patent Citations (3)
Title |
---|
ASHISH SINHA ET.AL: "Multi-scale self-guided attention for medical image segmentation", 《ARXIV:1906.02849V3 [CS.CV] 》, pages 3 - 6 * |
HAO CHEN ET.AL: "BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation", 《ARXIV:2001.00309V3 [CS.CV] 》, pages 1 - 9 * |
RUOHAO GUO ET.AL: "LeafMask: Towards Greater Accuracy on Leaf Segmentation", 《ARXIV:2108.03568V1 [CS.CV] 》, pages 1 - 10 * |
Also Published As
Publication number | Publication date |
---|---|
CN112837330B (en) | 2024-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112837330A (en) | Leaf segmentation method based on multi-scale double attention mechanism and full convolution neural network | |
CN110428428B (en) | Image semantic segmentation method, electronic equipment and readable storage medium | |
CN109919108B (en) | Remote sensing image rapid target detection method based on deep hash auxiliary network | |
CN109584248B (en) | Infrared target instance segmentation method based on feature fusion and dense connection network | |
CN108830326B (en) | Automatic segmentation method and device for MRI (magnetic resonance imaging) image | |
CN108986058B (en) | Image fusion method for brightness consistency learning | |
CN111027493B (en) | Pedestrian detection method based on deep learning multi-network soft fusion | |
CN111797779A (en) | Remote sensing image semantic segmentation method based on regional attention multi-scale feature fusion | |
CN113627228B (en) | Lane line detection method based on key point regression and multi-scale feature fusion | |
CN110533022B (en) | Target detection method, system, device and storage medium | |
CN112541508A (en) | Fruit segmentation and recognition method and system and fruit picking robot | |
CN112381764A (en) | Crop disease and insect pest detection method | |
CN112541904A (en) | Unsupervised remote sensing image change detection method, storage medium and computing device | |
CN113706581B (en) | Target tracking method based on residual channel attention and multi-level classification regression | |
CN107506792B (en) | Semi-supervised salient object detection method | |
CN111160407A (en) | Deep learning target detection method and system | |
CN113657560A (en) | Weak supervision image semantic segmentation method and system based on node classification | |
CN112749675A (en) | Potato disease identification method based on convolutional neural network | |
CN115564983A (en) | Target detection method and device, electronic equipment, storage medium and application thereof | |
CN112380917A (en) | A unmanned aerial vehicle for crops plant diseases and insect pests detect | |
Chimakurthi | Application of convolution neural network for digital image processing | |
CN115222998A (en) | Image classification method | |
CN117576079A (en) | Industrial product surface abnormality detection method, device and system | |
CN112418358A (en) | Vehicle multi-attribute classification method for strengthening deep fusion network | |
CN115330759B (en) | Method and device for calculating distance loss based on Hausdorff distance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||