CN110399840B - Rapid lawn semantic segmentation and boundary detection method - Google Patents

Rapid lawn semantic segmentation and boundary detection method

Info

Publication number
CN110399840B
CN110399840B (application CN201910683100.1A)
Authority
CN
China
Prior art keywords
lawn
image
window
sampling
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910683100.1A
Other languages
Chinese (zh)
Other versions
CN110399840A (en)
Inventor
李小霞
叶远征
王学渊
孙维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University of Science and Technology
Original Assignee
Southwest University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University of Science and Technology filed Critical Southwest University of Science and Technology
Publication of CN110399840A
Application granted
Publication of CN110399840B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Abstract

In order to quickly and accurately identify lawn and non-lawn areas and their boundary positions across different scenes, environments, and seasons, the invention provides a rapid lawn semantic segmentation and boundary detection method. The method comprises the following steps: step 1, acquiring a video frame from a camera; step 2, segmenting the current frame with the rapid semantic segmentation model to obtain a segmentation result mask image; step 3, binarizing the segmentation result mask image and detecting the lawn boundary with an eight-neighborhood coding method; step 4, mapping the detection result onto the original image as the output image; steps 2 to 4 are repeated until the system is closed. The method performs semantic segmentation of the lawn and detects the lawn boundary quickly and accurately.

Description

Rapid lawn semantic segmentation and boundary detection method
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a rapid lawn semantic segmentation and boundary detection method.
Background
With the rapid development of artificial intelligence and big data, visual data is growing exponentially. The aim of computer vision research is to extract semantically meaningful targets from massive video and image data so that computers can better understand and solve real-world problems, bringing great convenience to people. Although object detection can identify the position and type of a target in an image, it cannot detect the target's specific boundary, and large-area or irregular targets such as lawns, lake water, sky, and wall cracks cannot be detected accurately.
However, in applications such as medical treatment, intelligent robots, and unmanned aerial vehicles, operations are generally performed within a large specific area, and the computer must identify the target area and the position of its boundary. This identification and boundary-localization problem for target regions is summarized here as a boundary detection problem. For lawn semantic segmentation and boundary detection, the semantics of the scene in the image must be analyzed and lawn and non-lawn areas identified; on this basis, the boundary between the lawn and the non-lawn areas in contact with it is located.
Target-region identification is an image segmentation problem, and image segmentation methods fall mainly into two families: segmentation based on hand-crafted features and semantic segmentation based on convolutional neural networks (CNN). Segmentation based on hand-crafted features mainly comprises thresholding, clustering, texture analysis, and similar methods. These methods run in real time, but they easily produce holes, mutual 'pollution' between regions with similar features, and false recognition, so boundaries are located inaccurately. CNN-based semantic segmentation learns features automatically, and different layers learn different features: low-level convolution layers express the detail information of the image and learn local region features, which helps localize the boundary of each target region; high-level convolution layers express the semantic information of the image and learn deep abstract features, which helps classify each target region. CNN-based methods therefore achieve better segmentation than methods based on hand-crafted features.
With the development of deep learning, Long J. et al. first applied fully convolutional CNNs to semantic segmentation, proposing the FCN model, which uses a learnable deconvolution structure for up-sampling to compensate for the detail lost through repeated standard convolution and pooling layers and performs pixel-by-pixel classification. However, the learnable deconvolution structure increases the computational load, and the model lacks local detail information and semantic information, leading to serious intra-class inconsistency. Semantic segmentation models such as SegNet, DeepLab, PSPNet, and ICNet appeared later, but each has shortcomings: SegNet has a small effective receptive field and insufficient high-level semantic information; DeepLab lacks image detail information; PSPNet is computationally heavy and very slow; and the expression capability of ICNet's high-level semantic information is weak.
In summary, current image segmentation methods struggle to meet practical application requirements in recognition rate, recognition speed, real-time performance, and functionality.
Disclosure of Invention
Aiming at the problem of detecting lawn areas, the invention provides a rapid lawn semantic segmentation and boundary detection method characterized by a high recognition rate and high speed.
The technical solution of the invention comprises the following steps:
step 1, obtaining a video frame through a camera;
step 2, segmenting the current frame by using the rapid lawn semantic segmentation model to obtain a segmentation result mask image;
step 3, binarizing the segmentation result mask image, and detecting the lawn boundary by using an eight-neighborhood coding method;
step 4, mapping the detection result onto the original image to serve as an output image;
and repeating the steps 2 to 4 until the system is closed.
Compared with the prior art, the invention has the following remarkable advantages: 1) high detection speed, meeting real-time requirements; 2) high accuracy, detecting the lawn area and its boundary precisely, which makes the method practical.
Drawings
FIG. 1 is a flow chart of lawn detection according to the present invention;
FIG. 2 is a block diagram of the PULNet model of the present invention;
FIG. 3 is a diagram of the Dilated_ResNet50 network in the PULNet model of the present invention;
FIG. 4 is the pooling pyramid structure in the PULNet model of the present invention;
FIG. 5 is the convolution process of the up-sampling dimension-reduction structure in the PULNet model of the present invention;
FIG. 6 is the image local detail information network in the PULNet model of the present invention;
FIG. 7 is a schematic diagram of the eight-neighborhood coding method of the present invention locating boundary points of a binary image;
FIG. 8 compares the lawn segmentation results of the present invention with those of other methods;
FIG. 9 shows partial examples of the self-built dataset.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific examples.
The flow chart of rapid lawn semantic segmentation and boundary detection is shown in fig. 1 and comprises video acquisition, PULNet lawn segmentation, lawn class output, mask binarization, lawn boundary localization, and video output.
The method comprises the following specific steps:
step 1, collecting video, and collecting video frames through a camera to serve as input of a follow-up detection network.
Step 2, segmenting the current frame with the rapid semantic segmentation model to obtain a segmentation result mask image: the current frame is input into the PULNet model, which separates a lawn-class mask and a non-lawn-class mask, wherein the color of the non-lawn-class mask is black (0, 0, 0) and the color of the lawn mask is green (4, 250, 7).
Fig. 2 is a block diagram of PULNet, built from an image local detail information network, a Dilated_ResNet50 base network, an up-sampling dimension-reduction structure, and a pooling pyramid. In the figure, green squares are standard-convolution output feature maps, red squares are dilated-convolution output feature maps, yellow squares are pooling feature maps, and purple squares are feature maps used for prediction.
To meet the real-time and accuracy requirements of semantic segmentation while reducing model complexity and improving generalization, the ResNet50 network is redesigned as the Dilated_ResNet50 structure shown in fig. 3. First, the final average pooling, feature flattening, and fully connected layers of ResNet50 are discarded, leaving only the feature-extraction layers for semantic features. Second, the output channel numbers of the modules after Conv1_x are changed to 128, 256, 512, and 1024 to reduce the feature dimension of the network, and the output feature map of Conv3_1 is bilinearly interpolated to half the size of its input feature map to further speed up semantic feature extraction. Finally, to avoid insufficient semantic expression and to enlarge the effective receptive field, the 3×3 standard convolutions in Conv4_x and Conv5_x are replaced with 3×3 dilated convolutions (Dilated Convolution) with dilation rate 2.
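For illustration, a minimal PyTorch sketch of this backbone modification follows (a sketch under assumptions, not the original implementation: torchvision's layer3/layer4 stand in for Conv4_x/Conv5_x, and the halved channel widths and the Conv3_1 interpolation are omitted, since they would require a custom ResNet):

import torch
import torch.nn as nn
from torchvision.models import resnet50

def build_dilated_resnet50() -> nn.Module:
    net = resnet50(weights=None)
    # Conv4_x / Conv5_x correspond to layer3 / layer4 in torchvision.
    for layer in (net.layer3, net.layer4):
        for m in layer.modules():
            if isinstance(m, nn.Conv2d):
                if m.stride == (2, 2):
                    m.stride = (1, 1)                       # remove down-sampling
                if m.kernel_size == (3, 3):
                    m.dilation, m.padding = (2, 2), (2, 2)  # rate-2 dilation
    # Keep only the feature-extraction layers; drop avgpool / flatten / fc.
    return nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                         net.layer1, net.layer2, net.layer3, net.layer4)

features = build_dilated_resnet50()(torch.randn(1, 3, 480, 848))  # 1/8-size map

Removing the stride from both the 3×3 convolutions and the 1×1 downsample branches keeps the residual shapes consistent, while the dilation preserves the enlarged receptive field described above.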
Meanwhile, the method designs a pooling pyramid structure, see fig. 4. The input and output feature maps of the pooling pyramid are 1/32 of the image size. First, global average pooling is applied to the input feature map together with average pooling whose window sizes are 1/2, 1/3, and 1/4 of the input feature map, giving four pooling feature maps that form the pyramid; second, the four pooled feature maps are bilinearly interpolated to 1/32 of the image size; finally, they are fused by addition. The pooling pyramid fuses context information from feature maps of different regions and enlarges the effective receptive field, strengthening the semantic expression of the feature maps and reducing the detail loss a single pooling layer would cause. The pyramid's pooling layers greatly improve the effective receptive field and the invariance of the features to rotation, translation, and multi-scale change of the image while losing little detail information.
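A hedged sketch of the pooling pyramid follows (PyTorch assumed; whether the input feature map itself joins the additive fusion is not stated in the text, and this sketch includes it):

import torch
import torch.nn.functional as F

def pooling_pyramid(x: torch.Tensor) -> torch.Tensor:
    """x: backbone feature map at 1/32 of the image size."""
    h, w = x.shape[-2:]
    fused = x                                    # input joins the fusion (assumed)
    for div in (None, 2, 3, 4):                  # None -> global average pooling
        if div is None:
            p = F.adaptive_avg_pool2d(x, 1)
        else:
            k = (max(h // div, 1), max(w // div, 1))   # window = 1/div of the input
            p = F.avg_pool2d(x, kernel_size=k, stride=k)
        fused = fused + F.interpolate(p, size=(h, w), mode="bilinear",
                                      align_corners=False)
    return fused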
To further speed up the semantic segmentation network, reduce the complexity of the model's feature maps, and enhance their detail information, the method designs an up-sampling dimension-reduction structure that fuses in low-level feature maps rich in detail while reducing the feature-map dimension. FIG. 5 shows the convolution process of the structure, with the specific convolution parameters on the right: "Conv1×1, 256, 1, BN, ReLU" denotes a standard convolution with filter size 1×1, 256 channels, and stride 1, followed by batch normalization (BN) and a ReLU activation; "Dilated_Conv3×3, 128, 2, BN" denotes a dilated convolution with filter size 3×3, 128 channels, and dilation rate 2, followed by BN; up-sampling uses bilinear interpolation. The structure works as follows: first, the 1024-channel pooling feature map is reduced to 256 channels with a 1×1 filter, up-sampled by a factor of 2, and passed through a dilated convolution with 128 channels, filter size 3×3, and dilation rate 2; second, the output feature map of the Conv3_1 residual block in the Dilated_ResNet50 base network is reduced to 128 channels with a 1×1 filter; finally, the output feature maps of the two branches are fused by addition, and the same up-sampling and dilated convolution are applied again. Regression and prediction are performed on the feature maps after the two up-sampling steps in the structure, and the prediction results are denoted P_1/16 and P_1/8.
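The following sketch assembles the structure just described (a sketch under assumptions: PyTorch is used, the Conv3_1 feature map is taken to have 512 channels and to arrive at 1/16 scale, and the points where P_1/16 and P_1/8 would be read off are marked in comments):

import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleReduce(nn.Module):
    """Sketch of the two-step up-sampling dimension-reduction structure."""
    def __init__(self, skip_channels: int = 512):        # Conv3_1 width assumed
        super().__init__()
        self.reduce = nn.Sequential(                     # "Conv1x1, 256, 1, BN, ReLU"
            nn.Conv2d(1024, 256, 1), nn.BatchNorm2d(256), nn.ReLU(inplace=True))
        self.dil1 = nn.Sequential(                       # "Dilated_Conv3x3, 128, 2, BN"
            nn.Conv2d(256, 128, 3, padding=2, dilation=2), nn.BatchNorm2d(128))
        self.skip = nn.Conv2d(skip_channels, 128, 1)     # Conv3_1 projection to 128
        self.dil2 = nn.Sequential(
            nn.Conv2d(128, 128, 3, padding=2, dilation=2), nn.BatchNorm2d(128))

    def forward(self, pooled, conv3_1):
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear",
                                     align_corners=False)
        x = self.dil1(up(self.reduce(pooled)))   # first up-sample: P_1/16 read here
        x = up(x + self.skip(conv3_1))           # additive fusion, second up-sample
        return self.dil2(x)                      # P_1/8 read from this map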
To further compensate for the detail loss caused by the pooling pyramid and the up-sampling dimension-reduction structure, an image local detail information network is constructed, see fig. 6, which takes the image at its original size as input. The specific process is as follows: first, two convolutions are performed with a standard convolution layer with 32 channels, filter size 3×3, and stride 2 (each convolution is followed by batch normalization BN and a ReLU activation); second, one further convolution with 64 channels and the same filter size and stride is applied, a 1×1 filter changes the channel number to 128, and the result is fused by addition with the output feature map of the up-sampling dimension-reduction structure and then up-sampled by a factor of 2; finally, dilated convolutions with dilation rates 4, 7, and 9 are applied and concatenated, up-sampled by a factor of 2, and the output layer is obtained by changing the channel number and up-sampling again. Regression and prediction are performed on the feature map after up-sampling in the local detail information network, and the prediction result is denoted P_1/4. The image local detail information network not only extracts image detail information but also combines context information, which helps improve the accuracy of semantic segmentation.
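A compact sketch of this branch follows (assumptions: PyTorch, two output classes, and a 128-channel fused map at 1/8 scale coming from the up-sampling dimension-reduction structure above):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalDetailNet(nn.Module):
    """Sketch of the image local detail information network (2 classes assumed)."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        def cbr(cin, cout):                        # 3x3 conv, stride 2, BN, ReLU
            return nn.Sequential(nn.Conv2d(cin, cout, 3, 2, 1),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.stem = nn.Sequential(cbr(3, 32), cbr(32, 32), cbr(32, 64))  # -> 1/8
        self.proj = nn.Conv2d(64, 128, 1)          # channels -> 128 before fusion
        self.branches = nn.ModuleList(
            nn.Conv2d(128, 128, 3, padding=r, dilation=r) for r in (4, 7, 9))
        self.head = nn.Conv2d(3 * 128, num_classes, 1)

    def forward(self, image, fused):               # fused: 128 ch at 1/8 scale
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear",
                                     align_corners=False)
        x = up(self.proj(self.stem(image)) + fused)          # 1/4 scale, P_1/4 here
        x = up(torch.cat([b(x) for b in self.branches], 1))  # rates 4/7/9, then x2
        return up(self.head(x))                    # channels -> classes, final x2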
The method denotes the prediction results of the three feature maps at 1/4, 1/8, and 1/16 of the original image size as P_1/4, P_1/8, and P_1/16, from which three cross-entropy losses L_1, L_2, and L_3 are obtained. The total cross-entropy loss L is computed as follows:
(1) A filter of size 1×1 changes the channel number of the three feature maps to the number of training classes n, and each feature map is reshaped into vector form;
(2) The label image (whose pixel values equal the class indices) is first scaled to the sizes of the three feature maps in (1) and reshaped into vector form, and then masked. The purpose of the mask is to extract from the three scaled label images the values not exceeding the class count, forming the label vectors G = (G_1, G_2, G_3); the position index of each extracted gray value in the label image is recorded, and the results of (1) are gathered by these indices to form the prediction vectors P = (P_1, P_2, P_3);
(3) For each of the three pairs P_i and G_i, the cross-entropy losses L_1, L_2, and L_3 are computed according to equation (1):

L_i = -(1/n) Σ_{j=1}^{n} G_ij log P_ij    (1)

where n in equation (1) is the number of training samples. The total cross-entropy loss is computed according to equation (2):

L = λ‖W‖² + α_1 L_1 + α_2 L_2 + α_3 L_3    (2)

where λ‖W‖² denotes the regularization of the parameters W, and α_1, α_2, and α_3 are the weight coefficients of L_1, L_2, and L_3, respectively.
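The loss computation can be sketched as follows (assumptions: PyTorch, the weight coefficients alphas and the regularization factor lam are placeholders, and the mask keeps label values below the class count):

import torch
import torch.nn.functional as F

def total_loss(preds, label, params, n_classes=2,
               alphas=(1.0, 1.0, 1.0), lam=1e-4):
    """preds: [P_1/4, P_1/8, P_1/16] logits (N, n_classes, h, w); label: (N, H, W)."""
    terms = []
    for p in preds:
        g = F.interpolate(label[:, None].float(), size=p.shape[-2:],
                          mode="nearest").long().reshape(-1)    # scaled label vector
        logits = p.permute(0, 2, 3, 1).reshape(-1, n_classes)   # prediction vectors
        keep = g < n_classes                                    # mask: valid class ids
        terms.append(F.cross_entropy(logits[keep], g[keep]))    # equation (1)
    reg = lam * sum((w ** 2).sum() for w in params)             # lambda * ||W||^2
    return reg + sum(a * t for a, t in zip(alphas, terms))      # equation (2)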
Step 3, binarizing the segmentation result mask image and detecting the lawn boundary with the eight-neighborhood coding method; the specific extraction method is as follows:
(1) Obtain the segmentation result mask image from the PULNet semantic segmentation model, where the color of the non-lawn mask is black (0, 0, 0) and the color of the lawn mask is green (4, 250, 7);
(2) Binarize the mask image: the lawn class is 1 and the non-lawn class is 0;
(3) Traverse the image from bottom to top and from left to right with a 3×3 window of a given step size, and count the number of lawn pixel points N_d in the window:

N_d = Σ_{k,m ∈ [0,2], (k,m) ≠ (1,1)} C_km    (3)

where C_km is the code of the 8 neighborhood points of the window center (lawn class is 1, non-lawn class is 0), the subscripts k, m ∈ [0, 2] with k and m not both equal to 1, and d is the step size of the 3×3 window traversal. When N_d < 4, the window has not yet scanned a lawn boundary point. When N_d ≥ 4, a lawn boundary point is pending and the window moves right; if N_d keeps increasing, the window has found a lawn boundary point, the center coordinate (i, j) of the current window is the boundary-point coordinate, and the rightward traversal stops. If the window scans from left to right with a starting value N_d ≥ 7 that remains constant or changes only slightly as the window moves right, the scanned area is entirely lawn; if N_d decreases continuously, the area contains an obstacle. FIG. 7 is a schematic diagram of eight-neighborhood coding locating binary-image boundary points.
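A minimal NumPy sketch of equation (3) and the rightward search follows (the all-lawn and obstacle branches are noted only in comments; the helper names n_d and scan_row are hypothetical, not from the original):

import numpy as np

def n_d(mask: np.ndarray, i: int, j: int) -> int:
    """Equation (3): sum the 8 neighborhood codes C_km around center (i, j)."""
    window = mask[i - 1:i + 2, j - 1:j + 2]
    return int(window.sum() - window[1, 1])       # exclude the center point

def scan_row(mask: np.ndarray, i: int, step: int = 1):
    """Scan row i left to right; return the first lawn boundary point or None."""
    prev = None
    for j in range(1, mask.shape[1] - 1, step):
        n = n_d(mask, i, j)
        if prev is not None and n >= 4 and n > prev:
            return (i, j)       # N_d >= 4 and rising: center (i, j) is a boundary point
        prev = n                # N_d >= 7 and steady means pure lawn; falling, an obstacle
    return None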
Step 4, map the detection result onto the original image as the output image, and repeat steps 2 to 4 until the system is closed.
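Putting steps 1 to 4 together, a sketch of the detection loop (OpenCV assumed for capture and display; segment is a hypothetical wrapper around the PULNet model, and scan_row is the search sketched in step 3):

import cv2
import numpy as np

def run(segment):
    """segment(frame) -> color mask image; Esc closes the system."""
    cap = cv2.VideoCapture(0)                           # step 1: camera frames
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        mask = segment(frame)                           # step 2: PULNet mask image
        binary = (mask[..., 1] > 128).astype(np.uint8)  # step 3: lawn green -> 1
        for i in range(binary.shape[0] - 2, 1, -1):     # bottom-to-top traversal
            pt = scan_row(binary, i)
            if pt is not None:
                cv2.circle(frame, (pt[1], pt[0]), 2, (0, 0, 255), -1)
        cv2.imshow("lawn boundary", frame)              # step 4: mapped output image
        if cv2.waitKey(1) == 27:
            break
    cap.release()
    cv2.destroyAllWindows()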
The lawn segmentation results are shown in FIG. 8; the segmented image is predicted from P_1/4, and the segmentation mask image is obtained from the label image.
Table 1 compares metrics on a self-built lawn dataset (see fig. 9 for partial examples); the input image size is 848×480, and the server uses a GTX 1080Ti GPU and an i7-7700K CPU. The experimental results show that the method is excellent in both detection accuracy and detection speed: the mean intersection-over-union (IoU) reaches 96.32% and the speed reaches 67.3 frames/second, giving the method good practicability.
Table 1 comparison of lawn test set index

Claims (2)

1. A rapid lawn semantic segmentation and boundary detection method comprises the following steps:
step 1, obtaining a video frame through a camera;
step 2, segmenting the current frame by using the rapid lawn semantic segmentation model to obtain a segmentation result mask image;
step 3, binarizing the segmentation result mask image, and detecting the lawn boundary by using an eight-neighborhood coding method;
step 4, mapping the detection result onto the original image to serve as an output image;
repeating the steps 2 to 4 until the system is closed;
the method for obtaining the segmentation result mask image by segmenting the current frame by using the rapid lawn semantic segmentation model comprises the following steps:
inputting the current frame into a PULNet model and separating a lawn mask and a non-lawn mask, wherein the segmentation result mask image comprises the lawn mask and the non-lawn mask, the color of the non-lawn mask is black (0, 0, 0), and the color of the lawn mask is green (4, 250, 7);
the PULNet model is a semantic segmentation model capable of expressing rich image detail information and semantic information:
(1.1) establishing a Dilated_ResNet50 base network, expanding an effective receptive field of a model and reducing complexity of the model;
(1.2) constructing a pooling pyramid structure, and improving invariance of a feature map to rotation, translation and multi-scale change of an image;
(1.3) designing an up-sampling dimension-reduction structure, further accelerating the semantic segmentation network and enhancing the detail information of the feature maps; regression and prediction are performed on the feature maps after the two up-sampling steps in the structure, and the prediction results are denoted P_1/16 and P_1/8, where P_1/16 is the prediction result of the feature map at 1/16 of the original image size and P_1/8 is the prediction result of the feature map at 1/8 of the original image size;
the up-sampling dimension-reduction structure performs the up-sampling and dimension-reduction process as follows: first, the 1024-channel pooling feature map output by the pooling pyramid structure is reduced to 256 channels with a 1×1 filter, up-sampled by a factor of 2, and passed through a dilated convolution with 128 channels, filter size 3×3, and dilation rate 2; second, the output feature map of the Conv3_1 residual block in the Dilated_ResNet50 base network is reduced to 128 channels with a 1×1 filter; finally, the output feature maps of the two branches are fused by addition, and the same up-sampling and dilated convolution are applied again;
(1.4) constructing an image local detail information network, compensating for the detail loss caused by the pooling pyramid and the up-sampling dimension-reduction structure; regression and prediction are performed on the feature map after up-sampling in the local detail information network, and the prediction result is denoted P_1/4, where P_1/4 is the prediction result of the feature map at 1/4 of the original image size;
the local detail information network of the image takes the image with the original size as input, and the specific process is as follows: firstly, carrying out convolution twice by using a standard convolution layer with the channel number of 32, the filter size of 3 multiplied by 3 and the step length of 2, wherein each convolution has batch normalization BN and an activation function ReLU; secondly, using a filter with the channel number of 64 and the same size as the previous two convolutions and a step length convolution for one time, using a filter with the size of 1 multiplied by 1 to change the channel number into 128, and carrying out additive fusion with an output characteristic diagram of an up-sampling dimension-reducing structure and then up-sampling to 2 times of the original one; and finally, splicing after expansion convolution with expansion ratios of 4, 7 and 9, up-sampling by 2 times, and up-sampling after changing the number of channels to obtain an output layer.
2. The method of claim 1, wherein in step 3 the segmentation result mask image is binarized and the lawn boundary is detected with the eight-neighborhood coding method, specifically as follows:
(2.1) obtaining the segmentation result mask image of the PULNet semantic segmentation model, wherein the color of the non-lawn mask is black (0, 0, 0) and the color of the lawn mask is green (4, 250, 7);
(2.2) binarizing the mask image, wherein the lawn class is 1, and the non-lawn class is 0;
(2.3) traversing the image from bottom to top and from left to right with a 3×3 window of a given step size, and counting the number of lawn pixel points N_d in the window according to equation (3), where C_km is the code of the 8 neighborhood points of the window center, the lawn class is 1 and the non-lawn class is 0, the subscripts k, m ∈ [0, 2] with k and m not both equal to 1, and the subscript d denotes the step size of the window traversal; when N_d < 4, the window has not scanned a lawn boundary point; when N_d ≥ 4, a lawn boundary point is pending and the window moves right; if N_d then increases, the window has found a lawn boundary point, the center coordinate (i, j) of the current window is the boundary-point coordinate, and the rightward traversal stops; if the window scans from left to right with a starting value N_d ≥ 7 that remains constant or changes only slightly as the window moves right, the area the window scans from left to right is entirely lawn; if N_d decreases continuously, the area the window scans from left to right contains an obstacle.
CN201910683100.1A 2019-05-22 2019-07-26 Rapid lawn semantic segmentation and boundary detection method Active CN110399840B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN2019104271081 2019-05-22
CN201910427108 2019-05-22
CN201910427108.1 2019-05-22

Publications (2)

Publication Number Publication Date
CN110399840A (en) 2019-11-01
CN110399840B (en) 2024-04-02

Family

ID=68326273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910683100.1A Active CN110399840B (en) 2019-05-22 2019-07-26 Rapid lawn semantic segmentation and boundary detection method

Country Status (1)

Country Link
CN (1) CN110399840B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111104962B (en) * 2019-11-05 2023-04-18 北京航空航天大学青岛研究院 Semantic segmentation method and device for image, electronic equipment and readable storage medium
CN111259904B (en) * 2020-01-16 2022-12-27 西南科技大学 Semantic image segmentation method and system based on deep learning and clustering
CN113495553A (en) * 2020-03-19 2021-10-12 苏州科瓴精密机械科技有限公司 Automatic work system, automatic walking device, control method thereof, and computer-readable storage medium
CN111582353B (en) * 2020-04-30 2022-01-21 恒睿(重庆)人工智能技术研究院有限公司 Image feature detection method, system, device and medium
CN114663316B (en) * 2022-05-17 2022-11-04 深圳市普渡科技有限公司 Method for determining edgewise path, mobile device and computer storage medium
CN114708519B (en) * 2022-05-25 2022-09-27 中国科学院精密测量科学与技术创新研究院 Elk identification and morphological contour parameter extraction method based on unmanned aerial vehicle remote sensing
CN117115774B (en) * 2023-10-23 2024-03-15 锐驰激光(深圳)有限公司 Lawn boundary identification method, device, equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451528A (en) * 2017-07-04 2017-12-08 山东师范大学 Ground mulching picture automatic identifying method and system based on deep learning
CN107909597A (en) * 2017-11-14 2018-04-13 西安建筑科技大学 A kind of Multiscale Markov Random Field Models image partition method kept with edge
CN107945185A (en) * 2017-11-29 2018-04-20 北京工商大学 Image partition method and system based on wide residual pyramid pond network
CN108596920A (en) * 2018-05-02 2018-09-28 北京环境特性研究所 A kind of Target Segmentation method and device based on coloured image
CN108647568A (en) * 2018-03-30 2018-10-12 电子科技大学 Grassland degeneration extraction method based on full convolutional neural networks
CN108764051A (en) * 2018-04-28 2018-11-06 Oppo广东移动通信有限公司 Image processing method, device and mobile terminal
WO2018232592A1 (en) * 2017-06-20 2018-12-27 Microsoft Technology Licensing, Llc. Fully convolutional instance-aware semantic segmentation
CN109325534A (en) * 2018-09-22 2019-02-12 天津大学 A kind of semantic segmentation method based on two-way multi-Scale Pyramid
CN109685067A (en) * 2018-12-26 2019-04-26 江西理工大学 A kind of image, semantic dividing method based on region and depth residual error network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10685446B2 (en) * 2018-01-12 2020-06-16 Intel Corporation Method and system of recurrent semantic segmentation for image processing

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018232592A1 (en) * 2017-06-20 2018-12-27 Microsoft Technology Licensing, Llc. Fully convolutional instance-aware semantic segmentation
CN107451528A (en) * 2017-07-04 2017-12-08 山东师范大学 Ground mulching picture automatic identifying method and system based on deep learning
CN107909597A (en) * 2017-11-14 2018-04-13 西安建筑科技大学 A kind of Multiscale Markov Random Field Models image partition method kept with edge
CN107945185A (en) * 2017-11-29 2018-04-20 北京工商大学 Image partition method and system based on wide residual pyramid pond network
CN108647568A (en) * 2018-03-30 2018-10-12 电子科技大学 Grassland degeneration extraction method based on full convolutional neural networks
CN108764051A (en) * 2018-04-28 2018-11-06 Oppo广东移动通信有限公司 Image processing method, device and mobile terminal
CN108596920A (en) * 2018-05-02 2018-09-28 北京环境特性研究所 A kind of Target Segmentation method and device based on coloured image
CN109325534A (en) * 2018-09-22 2019-02-12 天津大学 A kind of semantic segmentation method based on two-way multi-Scale Pyramid
CN109685067A (en) * 2018-12-26 2019-04-26 江西理工大学 A kind of image, semantic dividing method based on region and depth residual error network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Fine-Grained Segmentation Using Hierarchical Dilated Neural Networks; Sihang Zhou et al.; MICCAI 2018; 488-496 *
Semantic Segmentation with Multi-path Refinement and Pyramid Pooling Dilated-ResNet; Zhipeng Cui et al.; 2017 IEEE International Conference on Image Processing; 3100-3104 *
Research on Image Semantic Segmentation Technology Based on Deep Learning; Wen Hongdiao; China Master's Theses Full-text Database, Information Science and Technology; vol. 2018, no. 9; I138-256 *
A Survey of Image Semantic Segmentation Methods Based on Deep Learning; Tian Xuan et al.; Journal of Software; vol. 30, no. 2; 440-468 *

Also Published As

Publication number Publication date
CN110399840A (en) 2019-11-01

Similar Documents

Publication Publication Date Title
CN110399840B (en) Rapid lawn semantic segmentation and boundary detection method
CN109190752B (en) Image semantic segmentation method based on global features and local features of deep learning
CN108648233B (en) Target identification and capture positioning method based on deep learning
CN108562589B (en) Method for detecting surface defects of magnetic circuit material
CN104050471B (en) Natural scene character detection method and system
CN112101175A (en) Expressway vehicle detection and multi-attribute feature extraction method based on local images
CN106875395B (en) Super-pixel-level SAR image change detection method based on deep neural network
CN110334762B (en) Feature matching method based on quad tree combined with ORB and SIFT
CN109086777B (en) Saliency map refining method based on global pixel characteristics
CN110084302B (en) Crack detection method based on remote sensing image
CN112766136B (en) Space parking space detection method based on deep learning
CN109299303B (en) Hand-drawn sketch retrieval method based on deformable convolution and depth network
CN109165658B (en) Strong negative sample underwater target detection method based on fast-RCNN
CN110717921B (en) Full convolution neural network semantic segmentation method of improved coding and decoding structure
CN113436227A (en) Twin network target tracking method based on inverted residual error
CN105931241A (en) Automatic marking method for natural scene image
CN113255837A (en) Improved CenterNet network-based target detection method in industrial environment
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN111414954A (en) Rock image retrieval method and system
CN113159215A (en) Small target detection and identification method based on fast Rcnn
CN114283162A (en) Real scene image segmentation method based on contrast self-supervision learning
CN114943876A (en) Cloud and cloud shadow detection method and device for multi-level semantic fusion and storage medium
CN115937626A (en) Automatic generation method of semi-virtual data set based on instance segmentation
CN114998890A (en) Three-dimensional point cloud target detection algorithm based on graph neural network
CN111161213A (en) Industrial product defect image classification method based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant