CN113361373A - Real-time semantic segmentation method for aerial image in agricultural scene - Google Patents


Info

Publication number: CN113361373A
Application number: CN202110612989.1A
Authority: CN (China)
Prior art keywords: semantic segmentation, convolution, real, images, texture
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Language: Chinese (zh)
Inventors: 熊盛武, 刘江梁, 王晓楠, 詹昶, 余涛
Assignee (current and original): Wuhan University of Technology (WUT) (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Filing date: 2021-06-02
Publication date: 2021-09-07
Application filed by Wuhan University of Technology


Classifications

    • G06F18/22 Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
    • G06F18/2415 Pattern recognition; Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045 Computing arrangements based on biological models; Neural networks; Combinations of networks
    • G06N3/08 Computing arrangements based on biological models; Neural networks; Learning methods
    • G06Q50/02 ICT specially adapted for implementation of business processes of specific business sectors; Agriculture; Fishing; Forestry; Mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Animal Husbandry (AREA)
  • Agronomy & Crop Science (AREA)
  • Marine Sciences & Fisheries (AREA)
  • Mining & Mineral Resources (AREA)
  • Probability & Statistics with Applications (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a real-time semantic segmentation method for aerial images in agricultural scenes. A camera carried by an unmanned aerial vehicle (UAV) collects raw farmland image data and transmits it to a server; the server processes the raw images and generates the dataset required for network training. A corresponding semantic segmentation network model is constructed from lightweight modules, so that it meets real-time requirements while retaining a good segmentation effect, and the network is trained with a weighted cross-entropy loss function. In addition, the rich texture information in agricultural-scene images is exploited to improve semantic segmentation. After training, the network model is deployed on the UAV: the UAV captures images of the actual scene, the cropped images are fed to the on-board semantic segmentation network model to generate segmentation results, and the results are transmitted to the server, where the user performs analysis and decision-making.

Description

Real-time semantic segmentation method for aerial image in agricultural scene
Technical Field
The invention relates to the field of image recognition in agricultural scenes, and in particular to real-time semantic segmentation of aerial images in agricultural scenes and the marking of specific regions.
Background
Information collection and analysis in traditional agricultural settings consume substantial labor and are not very efficient. Image recognition technology based on deep learning is developing continuously and being applied ever more widely, and it is now being brought to agricultural scenes as well.
Semantic segmentation assigns a category to every pixel in an image, the categories being predefined and practically meaningful; it is already widely used in autonomous driving and medical image analysis. Semantic segmentation of images in agricultural scenes is of great significance, in particular for monitoring and analyzing farmland conditions: after obtaining the analysis results, farmers can take corresponding countermeasures, increasing the potential income of the whole growing season or reducing losses. However, image data acquisition in agricultural scenes is inefficient, and the resulting datasets suffer from class imbalance, which degrades the performance of image semantic segmentation methods. In addition, the many repeated irregular structures in farm aerial images contain rich texture information, which can be learned to improve the segmentation effect.
As UAV technology is applied ever more widely in agriculture, UAVs can be used to collect and analyze farmland images. However, the hardware a UAV can carry is limited, and in practical applications the segmentation method must achieve real-time performance; it therefore needs to analyze farmland images quickly while occupying little memory.
Disclosure of Invention
In view of these technical problems, the invention provides a real-time semantic segmentation method for aerial images in agricultural scenes. A UAV collects farmland image data and transmits it to a server; the server processes the image data to obtain a training set and a test set; a lightweight semantic segmentation network model is built on deep-learning principles; and the model is then ported to the UAV, so that the UAV can both collect and segment images, meeting real-time requirements with a good segmentation effect and ultimately improving the economic benefit of agriculture.
The technical solution of the invention is a real-time semantic segmentation method for aerial images in agricultural scenes, comprising the following steps:
step 1, collecting original agricultural scene image data;
step 2, preprocessing the original agricultural-scene image data, generating corresponding label images, and then splitting the data into a training set and a validation set;
step 3, constructing a real-time semantic segmentation network model, wherein the semantic segmentation network model comprises a backbone feature extraction network, an atrous spatial pyramid pooling (ASPP) module, a texture feature extraction module, and an upsampling module;
the backbone feature extraction network generates a shallow feature map and a deep feature map; the deep feature map is passed to the ASPP module, which performs multi-scale feature extraction and then concatenates the extracted multi-scale feature maps, improving segmentation accuracy for regions of different scales;
the texture feature extraction module takes the shallow feature map from the backbone feature extraction network as input and extracts multi-scale texture features;
the multi-scale feature map output by the ASPP module is concatenated with the texture feature map output by the texture feature extraction module and fed to the upsampling module, which restores the result to the original image size; a softmax function then computes the per-class probability of each pixel, and the segmentation image is generated;
step 4, training the constructed semantic segmentation network model;
and step 5, feeding the cropped test image data into the trained semantic segmentation network model to generate semantic segmentation results.
Further, the specific implementation of step 2 comprises the following sub-steps:
step 2.1, annotating the original agricultural-scene image data, the annotation categories comprising shadow, drought, nutrient deficiency, weeds, standing water, and canals, and generating the corresponding label images;
step 2.2, cropping the original images and the corresponding label images into multiple images of a fixed size;
step 2.3, deleting images that contain no annotated region, as well as images whose annotated region exceeds a certain threshold, so that every image retains sufficient context information;
step 2.4, computing each category's share of the total number of annotated pixels over all images, and downsampling the images of any category whose share is too large, so as to prevent extreme class imbalance from degrading the training of the semantic segmentation network;
and step 2.5, splitting the processed dataset and label maps in a certain ratio to obtain a training set and a validation set, each with its corresponding label maps.
Further, the backbone feature extraction network first downsamples the image with a 3 × 3 convolution to obtain a shallow feature map, and then applies n bottleneck modules. The bottleneck modules come in stride-1 and stride-2 variants: the stride-1 bottleneck consists of a 1 × 1 convolution, ReLU6 activation, 3 × 3 depthwise separable convolution, ReLU6 activation, 1 × 1 convolution, linear activation, and a skip connection to the initial feature map; the stride-2 bottleneck consists of a 1 × 1 convolution, ReLU6 activation, 3 × 3 depthwise separable convolution with stride 2, ReLU6 activation, 1 × 1 convolution, and linear activation. After the n bottleneck modules, a 1 × 1 convolution, an average pooling operation, and another 1 × 1 convolution are applied, and the deep feature map is output.
Further, the ASPP module consists of a 1 × 1 convolution, a 3 × 3 atrous convolution with dilation rate 6, a 3 × 3 atrous convolution with dilation rate 12, a 3 × 3 atrous convolution with dilation rate 16, and global average pooling; it realizes multi-scale feature extraction and finally performs feature fusion, improving segmentation accuracy for regions of different scales.
Further, the texture feature extraction module takes the shallow feature map from the backbone feature extraction network as input and feeds it into 4 branches to extract multi-scale texture features: the first branch applies a 1 × 1 convolution, the second a 2 × 2 convolution, the third a 3 × 3 convolution, and the fourth an 8 × 8 convolution. A statistical texture quantization operation is then applied to each branch's convolved feature map, followed by a multilayer perceptron operation and upsampling, and finally the outputs of the different branches are concatenated to obtain the final texture features.
Furthermore, statistical texture quantization is built on the idea of statistical texture in traditional digital image processing. Let A denote the feature map produced by the first convolution of a branch of the texture feature extraction module. A global average pooling operation is first applied to A to obtain the average feature g; the cosine similarity between the feature vector of each pixel i and g is then computed, yielding the similarity feature map S:

S_i = \frac{A_i^{\top} g}{\lVert A_i \rVert_2 \, \lVert g \rVert_2}

where \lVert g \rVert_2 denotes the 2-norm of a vector. Quantization statistics are computed over the similarity feature map S to extract an informative representation with N quantization-level features, the nth quantization level being

L_n = \min(S) + \frac{n-1}{N}\bigl(\max(S) - \min(S)\bigr), \qquad n = 1, \dots, N.

S is then quantization-encoded, each pixel value S_i being encoded into an N-dimensional vector E_{i,n} by

E_{i,n} = \begin{cases} 1 - \dfrac{N \lvert S_i - L_n \rvert}{\max(S) - \min(S)}, & \lvert S_i - L_n \rvert < \dfrac{\max(S) - \min(S)}{N}, \\ 0, & \text{otherwise.} \end{cases}

The quantization encoding E_{i,n} is concatenated with the quantization-level feature L_n and passed through a multilayer perceptron; the average feature g is then upsampled, and the upsampled result is concatenated with the output of the multilayer perceptron, finally yielding the statistical texture feature.
Further, the semantic segmentation network model is trained with a weighted cross-entropy loss function L. Specifically, class weights are computed over the training set by the median frequency method: with freq_c the frequency with which class c appears in the training set and median_freq the median of all class frequencies, the weight coefficient of each class is

w_c = \frac{\mathrm{median\_freq}}{\mathrm{freq}_c},

and the corresponding weighted cross-entropy loss function is established as

L = -\sum_{c=1}^{M} w_c \, y_c \log(p_c),

where M denotes the total number of classes, y_c is the ground-truth indicator (1 if the predicted class matches the true class c, 0 otherwise), and p_c is the predicted probability of class c.
Further, when training the semantic segmentation network model, SGD is selected as the optimizer with an initial learning rate of 0.01, weight decay of 0.0001, and batch size of 4; the loss function is the weighted cross-entropy loss computed in the previous step. The training and validation sets are fed in, the semantic segmentation network model is trained for 200,000 iterations, and the trained model parameters are saved.
Further, in step 1, a specific image acquisition path is planned in advance, and a camera carried by a UAV is used to acquire image data over a fixed farm area.
The advantages of the invention are mainly the following. Semantic segmentation of aerial images in agricultural scenes is achieved with deep learning; because in practice the method must be ported to a UAV with limited hardware and must segment in real time, a lightweight network module is used for backbone feature extraction, giving high speed and few parameters. The method also exploits the rich texture information in farm aerial images to improve the segmentation effect, and the ASPP module improves segmentation accuracy across region scales. Finally, to address the class-imbalance problem, dataset downsampling and a weighted cross-entropy loss function are adopted to improve segmentation accuracy for low-share classes.
Drawings
FIG. 1 is a flow chart of the real-time semantic segmentation method for aerial images in an agricultural scene, in accordance with an embodiment of the invention;
FIG. 2 is a network architecture diagram of the semantic segmentation network of the invention;
FIG. 3 is a flow chart of dataset generation in the invention;
FIG. 4 is a network architecture diagram of the backbone feature extraction network of the invention;
FIG. 5 is a flow chart of texture feature extraction in the invention;
FIG. 6 is a flow chart of statistical texture quantization in the invention.
Detailed Description
The method of the invention designs a network structure for the specific problems of semantic segmentation of aerial images in agricultural scenes and adopts several supporting techniques so that the method performs well in real scenes. The embodiments are described in detail below with reference to the accompanying drawings.
Step 1: a camera carried by the UAV collects image data over a fixed farm area, following a specific image acquisition path planned in advance. The aerial images are transmitted to the server in real time over a wireless network, and the server stores the image data;
step 2: processing all the acquired image data by using the server, thereby obtaining a training set and a verification set required by the network;
furthermore, step 2 comprises the following substeps:
step 2.1: the labelme tool is used for marking original aerial images of farms, and marking categories comprise shadows, dryness, nutrient deficiency, weeds, ponding and canals, and the existence of the categories can affect the growth of crops and the final income. The labeling modes belong to important information in the agricultural field and can guide a user to make a next decision. Each category is allocated with a pixel value which is 1 to 6 respectively, the background which does not contain the labeling area is allocated with a pixel point of 0, and a corresponding label image can be generated after the labeling is finished;
step 2.2: cutting an original image and a corresponding label image into a plurality of images with the size of 512 multiplied by 512 by using a sliding window mode;
step 2.3: traversing the label graphs of all the images, and deleting the label graphs and the corresponding images which do not contain the labeling areas and have the labeling areas larger than 90%, so that all the images can keep enough context information, and simultaneously, some redundant information is reduced, and the network can learn enough information;
step 2.4: and traversing all the images, calculating the total number of the labeled pixel points of each category, calculating the respective occupation ratio, and randomly deleting the images corresponding to the categories with the occupation ratio exceeding 30% while only keeping 80% of the number of the images. The network can only learn the information of the category with a large proportion due to the unbalanced category, so that the segmentation effect of the category with a small proportion is poor, and the step is a data down-sampling method for relieving the problem of the extreme unbalanced category;
step 2.5: and dividing the processed data set and the label graph according to the proportion of 7:3 to obtain a training set and a verification set, wherein the training set and the verification set both have corresponding label graphs.
Step 3: a real-time semantic segmentation network model is constructed for the problems of the agricultural scene. The network model mainly comprises a backbone feature extraction network, an ASPP module, a texture feature extraction module, and an upsampling module; the network takes 512 × 512 images as input and outputs a segmentation result map;
the main feature extraction network firstly performs convolution downsampling on an image by 3 x 3, and then 5 bottleeck modules are used, wherein the bottleeck is a lightweight feature extraction module provided by a network MobileNet V2. The bottleeck module is divided into 1 and 2 stride, the 1 stride bottleeck module is composed of 1 × 1 convolution, Relu6 activation function, 3 × 3 depth separable convolution, Relu6 activation function, 1 × 1 convolution, linear activation function and jump connection of the initial feature map, the 2 stride bottleeck module is different in that there is no jump connection and the 3 × 3 depth separable convolution has a step size of 2, and is composed of 1 × 1 convolution, Relu6 activation function, 3 × 3 depth separable convolution (step size of 2), Relu6 activation function, 1 × 1 convolution, linear activation function. The bottleeck module can well improve the network computing rate and reduce the model parameter quantity. After 5 bottleeck modules, performing 1 × 1 convolution, performing average pooling operation and 1 × 1 convolution, and finally outputting a feature map;
and (3) transmitting the characteristic diagram obtained in the previous step into a cavity space pyramid pooling module, wherein cavity convolution is used for improving the receptive field and better acquiring the context information of the image so as to improve the final segmentation precision. The cavity space pyramid pooling module is composed of 1 × 1 convolution, 3 × 3 cavity convolution with expansion rate of 6, 3 × 3 cavity convolution with expansion rate of 12, 3 × 3 cavity convolution with expansion rate of 16 and global average pooling, multi-scale feature extraction is realized, and finally feature fusion is performed, so that the segmentation precision of different scale regions is improved;
the texture feature extraction module takes a shallow feature map obtained by the first layer of convolution in the trunk feature extraction network as an input, because the texture features are mainly contained in the low-dimensional features. Then the multi-scale texture features are transmitted into 4 branches for extraction. Performing 1 × 1 convolution operation on a first branch, performing 2 × 2 convolution operation on a second branch, performing 3 × 3 convolution operation on a third branch, performing 8 × 8 convolution operation on a fourth branch, performing statistical texture quantization operation on feature maps obtained by respective convolution calculation, performing mlp multi-layer perceptron operation and upsampling after quantization is completed, and finally connecting the outputs of different branches to obtain final texture features;
the statistical texture quantization is constructed based on the idea of statistical texture in traditional digital image processing, and is similar to a histogram for modeling the statistical texture of an image. Firstly, inputting a feature map A obtained by the first convolution of each branch of the texture feature extraction module, and firstly carrying out global average pooling operation on the input feature map A to obtain an average feature g. Then calculating the cosine similarity of the feature vector and the average feature g of each pixel in the space to obtain a similarity feature map S, wherein the formula is as follows:
Figure BDA0003096734070000081
wherein | g |2A 2-norm representing a vector; carrying out quantitative statistics on the similarity feature map S, extracting information representation, and obtaining N quantization level features, wherein N is set to be 150, and the nth quantization level is represented as:
Figure BDA0003096734070000082
then, the S is quantized and coded, and each pixel point S is subjected to quantization codingiEncoding into an N-dimensional vector Ei,nThe concrete formula is as follows:
Figure BDA0003096734070000091
encoding quantization characteristic Ei,nAnd a quantization level characteristic LnThe result of the connection is transmitted to the multilayer perceptron mlp, then the average characteristic g is up-sampled, and the up-sampled result is connected with the result output by the multilayer perceptron, and finally the statistical texture characteristic is obtained;
and connecting the characteristic graph output by the cavity space pyramid pooling module with the characteristic graph output by the texture characteristic extraction module. And then, performing up-sampling operation to restore the original image to the original image, calculating the probability of different types of each pixel point by using a softmax function, and then generating a segmentation image.
Step 4: the constructed semantic segmentation network model is trained;
Because data downsampling only mitigates the class-imbalance problem, a weighted cross-entropy loss function L is also required. Class weights are computed over the training set by the median frequency method: with freq_c the frequency with which class c appears in the training set and median_freq the median of all class frequencies, the weight coefficient of each class is

w_c = \frac{\mathrm{median\_freq}}{\mathrm{freq}_c},

and the corresponding weighted cross-entropy loss function is established as

L = -\sum_{c=1}^{M} w_c \, y_c \log(p_c),

where M denotes the total number of classes, y_c is the ground-truth indicator (1 if the predicted class matches the true class c, 0 otherwise), and p_c is the predicted probability of class c.
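A minimal sketch of the weight computation and the loss follows; mapping the per-pixel formula onto nn.CrossEntropyLoss with a weight vector is an assumption about implementation, and num_classes=7 again assumes 6 annotated categories plus background.

```python
import torch
import torch.nn as nn

def median_frequency_weights(label_maps, num_classes=7):
    """w_c = median_freq / freq_c, computed over the training-set label maps."""
    counts = torch.zeros(num_classes)
    for lab in label_maps:                       # lab: LongTensor of class ids
        counts += torch.bincount(lab.flatten(), minlength=num_classes).float()
    freq = counts / counts.sum()
    return freq.median() / freq.clamp_min(1e-12)

# For logits of shape (B, M, H, W) and targets of shape (B, H, W),
# CrossEntropyLoss(weight=w) realizes L = -sum_c w_c * y_c * log(p_c) per pixel.
def make_weighted_ce(label_maps, num_classes=7):
    return nn.CrossEntropyLoss(weight=median_frequency_weights(label_maps, num_classes))
```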
after a network model is constructed, SGD is selected as an optimizer, the initial learning rate is 0.01, and the loss function is a weight cross entropy loss function obtained through the last step of calculation. And transmitting a training set and a verification set, training the network model, wherein the initial learning rate is 0.01, the weight attenuation is 0.0001, the batch size is 4, training is carried out for 200000 times, and the trained model parameters are stored.
Step 5: the trained model and parameters are ported to the UAV;
step 6: the unmanned aerial vehicle collects test image data in an actual farm, firstly zooms the image, and then cuts the image to 512 x 512;
and 7: and transmitting the cut image into a semantic segmentation network model carried by the unmanned aerial vehicle, generating a semantic segmentation result, transmitting the semantic segmentation result to a server in real time, and analyzing and making a decision by a user according to the segmentation result.
The above is a specific embodiment of the invention, given to illustrate the technical solution rather than as an absolute limitation. Those skilled in the art may make modifications and additions without departing from the inventive concept. Processes not described in detail in this specification follow the prior art.

Claims (9)

1. A real-time semantic segmentation method for aerial images in agricultural scenes is characterized by comprising the following steps:
step 1, collecting original agricultural scene image data;
step 2, preprocessing the original agricultural-scene image data, generating corresponding label images, and then splitting the data into a training set and a validation set;
step 3, constructing a real-time semantic segmentation network model, wherein the semantic segmentation network model comprises a backbone feature extraction network, an atrous spatial pyramid pooling (ASPP) module, a texture feature extraction module, and an upsampling module;
the backbone feature extraction network generates a shallow feature map and a deep feature map; the deep feature map is passed to the ASPP module, which performs multi-scale feature extraction and then concatenates the extracted multi-scale feature maps, improving segmentation accuracy for regions of different scales;
the texture feature extraction module takes the shallow feature map from the backbone feature extraction network as input and extracts multi-scale texture features;
the multi-scale feature map output by the ASPP module is concatenated with the texture feature map output by the texture feature extraction module and fed to the upsampling module, which restores the result to the original image size; a softmax function then computes the per-class probability of each pixel, and the segmentation image is generated;
step 4, training the constructed semantic segmentation network model;
and step 5, feeding the cropped test image data into the trained semantic segmentation network model to generate semantic segmentation results.
2. The real-time semantic segmentation method for aerial images in agricultural scenes according to claim 1, characterized in that the specific implementation of step 2 comprises the following sub-steps:
step 2.1, annotating the original agricultural-scene image data, the annotation categories comprising shadow, drought, nutrient deficiency, weeds, standing water, and canals, and generating the corresponding label images;
step 2.2, cropping the original images and the corresponding label images into multiple images of a fixed size;
step 2.3, deleting images that contain no annotated region, as well as images whose annotated region exceeds a certain threshold, so that every image retains sufficient context information;
step 2.4, computing each category's share of the total number of annotated pixels over all images, and downsampling the images of any category whose share is too large, so as to prevent extreme class imbalance from degrading the training of the semantic segmentation network;
and step 2.5, splitting the processed dataset and label maps in a certain ratio to obtain a training set and a validation set, each with its corresponding label maps.
3. The real-time semantic segmentation method for aerial images in agricultural scenes according to claim 1, characterized in that the backbone feature extraction network first downsamples the image with a 3 × 3 convolution to obtain a shallow feature map, and then applies n bottleneck modules; the bottleneck modules come in stride-1 and stride-2 variants: the stride-1 bottleneck consists of a 1 × 1 convolution, ReLU6 activation, 3 × 3 depthwise separable convolution, ReLU6 activation, 1 × 1 convolution, linear activation, and a skip connection to the initial feature map; the stride-2 bottleneck consists of a 1 × 1 convolution, ReLU6 activation, 3 × 3 depthwise separable convolution with stride 2, ReLU6 activation, 1 × 1 convolution, and linear activation; after the n bottleneck modules, a 1 × 1 convolution, an average pooling operation, and another 1 × 1 convolution are applied, and the deep feature map is output.
4. The real-time semantic segmentation method for aerial images in agricultural scenes according to claim 1, characterized in that the ASPP module consists of a 1 × 1 convolution, a 3 × 3 atrous convolution with dilation rate 6, a 3 × 3 atrous convolution with dilation rate 12, a 3 × 3 atrous convolution with dilation rate 16, and global average pooling; it realizes multi-scale feature extraction and finally performs feature fusion, improving segmentation accuracy for regions of different scales.
5. The real-time semantic segmentation method for aerial images in agricultural scenes according to claim 1, characterized in that the texture feature extraction module takes the shallow feature map from the backbone feature extraction network as input and feeds it into 4 branches to extract multi-scale texture features: the first branch applies a 1 × 1 convolution, the second a 2 × 2 convolution, the third a 3 × 3 convolution, and the fourth an 8 × 8 convolution; a statistical texture quantization operation is then applied to each branch's convolved feature map, followed by a multilayer perceptron operation and upsampling, and finally the outputs of the different branches are concatenated to obtain the final texture features.
6. The real-time semantic segmentation method for aerial images in agricultural scenes according to claim 5, characterized in that statistical texture quantization is built on the idea of statistical texture in traditional digital image processing: let A denote the feature map produced by the first convolution of a branch of the texture feature extraction module; a global average pooling operation is first applied to the input feature map A to obtain the average feature g, and the cosine similarity between the feature vector of each pixel i and the average feature g is then computed, yielding the similarity feature map S:

S_i = \frac{A_i^{\top} g}{\lVert A_i \rVert_2 \, \lVert g \rVert_2}

where \lVert g \rVert_2 denotes the 2-norm of a vector; quantization statistics are computed over the similarity feature map S to extract an informative representation with N quantization-level features, the nth quantization level being

L_n = \min(S) + \frac{n-1}{N}\bigl(\max(S) - \min(S)\bigr), \qquad n = 1, \dots, N;

S is then quantization-encoded, each pixel value S_i being encoded into an N-dimensional vector E_{i,n} by

E_{i,n} = \begin{cases} 1 - \dfrac{N \lvert S_i - L_n \rvert}{\max(S) - \min(S)}, & \lvert S_i - L_n \rvert < \dfrac{\max(S) - \min(S)}{N}, \\ 0, & \text{otherwise}; \end{cases}

and the quantization encoding E_{i,n} is concatenated with the quantization-level feature L_n and passed through a multilayer perceptron, after which the average feature g is upsampled and the upsampled result is concatenated with the output of the multilayer perceptron, finally yielding the statistical texture feature.
7. The real-time semantic segmentation method for aerial images in agricultural scenes according to claim 1, characterized in that the semantic segmentation network model is trained with a weighted cross-entropy loss function L; specifically, class weights are computed over the training set by the median frequency method: with freq_c the frequency with which class c appears in the training set and median_freq the median of all class frequencies, the weight coefficient of each class is

w_c = \frac{\mathrm{median\_freq}}{\mathrm{freq}_c},

and the corresponding weighted cross-entropy loss function is established as

L = -\sum_{c=1}^{M} w_c \, y_c \log(p_c),

where M denotes the total number of classes, y_c is the ground-truth indicator (1 if the predicted class matches the true class c, 0 otherwise), and p_c is the predicted probability of class c.
8. The real-time semantic segmentation method for aerial images in agricultural scenes according to claim 7, characterized in that when the semantic segmentation network model is trained, SGD is selected as the optimizer with an initial learning rate of 0.01, weight decay of 0.0001, and batch size of 4; the loss function is the weighted cross-entropy loss computed in the previous step; the training and validation sets are fed in, the semantic segmentation network model is trained for 200,000 iterations, and the trained model parameters are saved.
9. The real-time semantic segmentation method for aerial images in agricultural scenes according to claim 1, characterized in that in step 1, a specific image acquisition path is planned in advance, and a camera carried by a UAV is used to acquire image data over a fixed farm area.
CN202110612989.1A — priority date 2021-06-02, filing date 2021-06-02 — Real-time semantic segmentation method for aerial image in agricultural scene — Pending — CN113361373A (en)

Priority Applications (1)

CN202110612989.1A — priority date 2021-06-02, filing date 2021-06-02 — Real-time semantic segmentation method for aerial image in agricultural scene

Applications Claiming Priority (1)

CN202110612989.1A — priority date 2021-06-02, filing date 2021-06-02 — Real-time semantic segmentation method for aerial image in agricultural scene

Publications (1)

CN113361373A — published 2021-09-07

Family

ID=77531185

Family Applications (1)

CN202110612989.1A (pending, published as CN113361373A) — priority date 2021-06-02, filing date 2021-06-02 — Real-time semantic segmentation method for aerial image in agricultural scene

Country Status (1)

CN (1): CN113361373A (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190005356A1 (en) * 2017-06-30 2019-01-03 Canon Kabushiki Kaisha Image recognition apparatus, learning apparatus, image recognition method, learning method, and storage medium
CN109255334A (en) * 2018-09-27 2019-01-22 中国电子科技集团公司第五十四研究所 Remote sensing image terrain classification method based on deep learning semantic segmentation network
WO2020238560A1 (en) * 2019-05-27 2020-12-03 腾讯科技(深圳)有限公司 Video target tracking method and apparatus, computer device and storage medium
CN111079649A (en) * 2019-12-17 2020-04-28 西安电子科技大学 Remote sensing image ground feature classification method based on lightweight semantic segmentation network
CN111259898A (en) * 2020-01-08 2020-06-09 西安电子科技大学 Crop segmentation method based on unmanned aerial vehicle aerial image
CN111950349A (en) * 2020-06-22 2020-11-17 华中农业大学 Semantic segmentation based field navigation line extraction method
CN112004085A (en) * 2020-08-14 2020-11-27 北京航空航天大学 Video coding method under guidance of scene semantic segmentation result
CN112634276A (en) * 2020-12-08 2021-04-09 西安理工大学 Lightweight semantic segmentation method based on multi-scale visual feature extraction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Lanyun Zhu et al., "Learning Statistical Texture for Semantic Segmentation," arXiv.org. *
Liang-Chieh Chen et al., "Rethinking Atrous Convolution for Semantic Image Segmentation," arXiv.org. *
Mark Sandler et al., "MobileNetV2: Inverted Residuals and Linear Bottlenecks," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743417A (en) * 2021-09-03 2021-12-03 北京航空航天大学 Semantic segmentation method and semantic segmentation device
CN113743417B (en) * 2021-09-03 2024-02-23 北京航空航天大学 Semantic segmentation method and semantic segmentation device
CN113516657A (en) * 2021-09-14 2021-10-19 中国石油大学(华东) Self-adaptive weight-based fully-polarized SAR image sea surface oil spill detection method
WO2023087597A1 (en) * 2021-11-19 2023-05-25 苏州浪潮智能科技有限公司 Image processing method and system, device, and medium
CN113822287A (en) * 2021-11-19 2021-12-21 苏州浪潮智能科技有限公司 Image processing method, system, device and medium
CN113822287B (en) * 2021-11-19 2022-02-22 苏州浪潮智能科技有限公司 Image processing method, system, device and medium
CN114037922A (en) * 2021-11-29 2022-02-11 南京审计大学 Aerial image segmentation method based on hierarchical context network
CN114037922B (en) * 2021-11-29 2023-04-07 南京审计大学 Aerial image segmentation method based on hierarchical context network
CN114119621A (en) * 2021-11-30 2022-03-01 云南电网有限责任公司输电分公司 SAR remote sensing image water area segmentation method based on depth coding and decoding fusion network
CN113902765A (en) * 2021-12-10 2022-01-07 聚时科技(江苏)有限公司 Automatic semiconductor partitioning method based on panoramic segmentation
CN114462559A (en) * 2022-04-14 2022-05-10 中国科学技术大学 Target positioning model training method, target positioning method and device
CN114462559B (en) * 2022-04-14 2022-07-15 中国科学技术大学 Target positioning model training method, target positioning method and device
CN114943835A (en) * 2022-04-20 2022-08-26 西北工业大学 Real-time semantic segmentation method for aerial images of ice slush unmanned aerial vehicle in yellow river
CN114943835B (en) * 2022-04-20 2024-03-12 西北工业大学 Real-time semantic segmentation method for yellow river ice unmanned aerial vehicle aerial image
CN117882546A (en) * 2024-03-13 2024-04-16 山西诚鼎伟业科技有限责任公司 Intelligent planting method for agricultural operation robot
CN117882546B (en) * 2024-03-13 2024-05-24 山西诚鼎伟业科技有限责任公司 Intelligent planting method for agricultural operation robot

Similar Documents

Publication Publication Date Title
CN113361373A (en) Real-time semantic segmentation method for aerial image in agricultural scene
CN112052886B (en) Intelligent human body action posture estimation method and device based on convolutional neural network
CN111259898B (en) Crop segmentation method based on unmanned aerial vehicle aerial image
CN110245709B (en) 3D point cloud data semantic segmentation method based on deep learning and self-attention
CN113159051B (en) Remote sensing image lightweight semantic segmentation method based on edge decoupling
CN112991354B (en) High-resolution remote sensing image semantic segmentation method based on deep learning
CN111696101A (en) Light-weight solanaceae disease identification method based on SE-Inception
CN111429460A (en) Image segmentation method, image segmentation model training method, device and storage medium
CN113705580B (en) Hyperspectral image classification method based on deep migration learning
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN104077742B (en) Human face sketch synthetic method and system based on Gabor characteristic
CN112950780B (en) Intelligent network map generation method and system based on remote sensing image
CN101667292B (en) SAR image segmentation system and segmentation method based on immune clone and projection pursuit
CN112241937B (en) Hyperspectral image reconstruction method based on neural network
CN115272828A (en) Intensive target detection model training method based on attention mechanism
CN113269224A (en) Scene image classification method, system and storage medium
CN114445715A (en) Crop disease identification method based on convolutional neural network
CN110969182A (en) Convolutional neural network construction method and system based on farmland image
CN112597919A (en) Real-time medicine box detection method based on YOLOv3 pruning network and embedded development board
CN115410087A (en) Transmission line foreign matter detection method based on improved YOLOv4
CN114494910A (en) Facility agricultural land multi-class identification and classification method based on remote sensing image
CN111242028A (en) Remote sensing image ground object segmentation method based on U-Net
CN116543165B (en) Remote sensing image fruit tree segmentation method based on dual-channel composite depth network
CN117349622A (en) Wind power plant wind speed prediction method based on hybrid deep learning mechanism
CN112132207A (en) Target detection neural network construction method based on multi-branch feature mapping

Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
RJ01 — Rejection of invention patent application after publication (application publication date: 20210907)