CN114266952A - Real-time semantic segmentation method based on deep supervision

Real-time semantic segmentation method based on deep supervision

Info

Publication number
CN114266952A
Authority
CN
China
Prior art keywords
image
training
semantic segmentation
loss
deep supervision
Prior art date
Legal status
Pending
Application number
CN202111600850.1A
Other languages
Chinese (zh)
Inventor
柯逍
蒋培龙
曾淦雄
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN202111600850.1A
Publication of CN114266952A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a real-time semantic segmentation method based on deep supervision, which comprises the following steps: step S1, collecting scene image data for deep supervision for a specific application scene and constructing a scene image database; step S2, carrying out pixel-level annotation on the scene images in the database and exporting the annotation files in PASCAL VOC format so that they meet the training requirements of the semantic segmentation task; step S3, constructing a deep supervision-based real-time semantic segmentation network, CFSegNet; step S4, training the CFSegNet neural network model with the annotated data set; step S5, preprocessing the image data collected in the application scene and then inputting it into the CFSegNet neural network model to obtain the image semantic segmentation result. The method offers high accuracy, good real-time performance and low demands on computing hardware, and is therefore suitable for deployment on terminal devices with limited performance.

Description

Real-time semantic segmentation method based on deep supervision
Technical Field
The invention relates to the technical field of pattern recognition and computer vision, in particular to a real-time semantic segmentation method based on deep supervision.
Background
In recent years, computer vision technologies have appeared in more and more fields, including autonomous driving and medical image segmentation, and computer vision is driving a new wave of research. Like a biological visual system, computer vision uses computers and other hardware to process images and videos and extract scene information, which helps people make decisions.
The main tasks of computer vision include object localization and detection, i.e. marking the positions of objects and identifying their categories.
Usually only certain objects or regions of an image are of interest, and separating these regions of interest from the rest of the picture requires image segmentation. Image segmentation divides an image into regions according to certain rules (such as object edges or pixel-value boundaries), so that features within the same region are similar while features of different regions differ. In short, image segmentation partitions a picture into regions with different meanings: the important regions are called objects or foreground, and the remaining regions are called background, so that the foreground can be separated from the background and analyzed further, leading to a clearer understanding of the whole picture. Image semantic segmentation requires dividing different objects along their boundaries and assigning a pixel-level class label to each region. In scenarios such as autonomous driving, the semantic segmentation model is deployed on edge devices, which requires the model to run inference quickly while maintaining high accuracy; achieving a good trade-off between speed and accuracy is a very challenging problem.
Disclosure of Invention
The real-time semantic segmentation method based on deep supervision provided by the invention offers high accuracy, good real-time performance and low demands on device computing power, and is suitable for deployment on terminal devices with limited performance.
The real-time semantic segmentation method based on deep supervision comprises the following steps:
step S1, collecting scene image data for deep supervision for a specific application scene, and constructing a scene image database;
step S2, carrying out pixel-level annotation on the scene images in the database, and exporting the annotation files in PASCAL VOC format so that they meet the training requirements of the semantic segmentation task;
step S3, constructing a deep supervision-based real-time semantic segmentation network, CFSegNet;
step S4, training the CFSegNet neural network model with the annotated data set;
step S5, preprocessing the image data collected in the application scene, and then inputting the preprocessed image data into the CFSegNet neural network model to obtain the image semantic segmentation result.
The step S1 specifically includes the following steps:
step S11: analyzing the influence of various factors in the application scene, such as weather and illumination, on the image semantic segmentation result;
step S12: according to the analysis result of step S11, overcoming the adverse effects by sampling a large number of images, i.e., capturing as many application-scene images as possible so as to cover the various conditions likely to occur;
step S13: sorting the collected images and removing duplicate or erroneous images that are unsuitable for the training task, so as to obtain the corresponding scene image database.
The step S2 specifically includes the following steps;
step S21: analyzing, according to the application requirements and the collected image information, the semantic categories to be segmented in the application scene;
step S22: downloading and installing the image annotation software labelme, and configuring labelme according to the semantic categories obtained in step S21;
step S23: outlining the category boundaries in each image obtained in step S1 with the labelme annotation software, and saving the annotation information to a JSON file with the same name as the image;
step S24: converting the JSON files generated in step S23 into the PASCAL VOC format with the labelme2voc script provided with labelme, so as to meet the training requirements of the semantic segmentation task.
Step S3 specifically includes the following steps:
step S31: adopting ResNet-18 as the encoder of CFSegNet, where the bottleneck layer of ResNet-18 downsamples the input image by a factor of 4, and, except for the first stage, each of the three subsequent stages of ResNet-18 downsamples the feature map by a further factor of 2;
step S32: preserving the representations of the downsampling stages through dense connections in the first to third stages of ResNet-18, and introducing deep supervision modules to supervise the representations output by the encoder in the second to fourth stages, thereby reducing the loss of spatial information in the encoding stage;
step S33: feeding the output of the fourth stage of the encoder into a pyramid pooling module (PPM) to obtain a representation rich in multi-scale information;
step S34: feeding the representation obtained in step S33 into a cascaded upsampling path, and upsampling it by a factor of 2 three times with a channel fusion module (CFM) combined with the dense connections of step S32, to obtain a representation that fuses semantic and spatial information;
step S35: upsampling the representation obtained in step S34 by a factor of 8 with bilinear interpolation, and outputting the prediction result through a 1 × 1 convolution.
Step S4 specifically includes the following steps:
step S41: training the model constructed in step S3 with the following initial parameters:
learning rate: 0.01;
weight decay: 0.0005;
momentum: 0.9;
in the training stage, polynomial ("poly") decay is adopted as the learning-rate decay strategy, where the minimum learning rate is set to 0.0001 and the decay factor (power) is set to 0.9; the batch size is determined by the size of the images collected in the application scene and the GPU memory of the training server;
step S42, the final loss function of the model is:
$$\mathrm{Loss}_{final} = \mathrm{Loss}_{main} + \alpha\sum_{s=1}^{K}\mathrm{Loss}_{aux}^{(s)}$$
where Loss_final, Loss_main and Loss_aux denote the final loss, the main loss and the auxiliary loss of the model, respectively; α is the weight of the auxiliary loss and is set to 0.4; K is the number of deep supervision modules and is set to 3; and s is the index of a deep supervision module. Cross-entropy is adopted as the loss function, as given below:
$$\mathrm{Loss} = -\sum_{c=1}^{M} y_{c}\log(p_{c})$$
where Loss denotes the loss value, M denotes the number of semantic categories, c denotes the category index, y_c is a one-hot indicator taking only the values 0 and 1 (1 if c matches the ground-truth class of the sample, 0 otherwise), and p_c denotes the predicted probability that the pixel belongs to class c;
step S43: in the training stage, using stochastic gradient descent (SGD) as the optimizer to compute the updated weights and biases of the convolutional neural network;
step S44: applying a random affine transformation to part of the training samples, applying the same transformation to the corresponding label files, and adding the transformed pairs to the training set of the model;
step S45: cropping part of the training samples at random positions, cropping the corresponding regions of the label files, and adding the cropped pairs to the training set of the model;
step S46: stopping training after 160000 iterations and saving the trained model.
Step S5 specifically includes the following steps:
step S51: acquiring image data as input through a camera in the application scene;
step S52: resizing the input image to 2048 × 1024;
step S53: feeding the image obtained in step S52 into CFSegNet to obtain a prediction map;
step S54: rescaling the prediction map obtained in step S53 to the original input size with bilinear interpolation to obtain the final result image.
The method provided by the invention focuses on the implementation of semantic segmentation in real application scenes and has both innovative significance and practical value; it achieves high accuracy and good real-time performance, and is suitable for deployment on terminal devices with limited performance.
Compared with the prior art, the invention has the following beneficial effects:
1. The real-time semantic segmentation method based on deep supervision constructed by the invention can effectively perform semantic segmentation in different scenes and improves the image segmentation effect.
2. The invention provides a combined training loss function that speeds up training, converges better, and keeps the model size small.
3. Compared with traditional methods, the proposed method is comparatively faster when processing higher-resolution image data.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic flow diagram of the present invention;
fig. 2 is a schematic diagram of a network structure of the method of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in the figures, the real-time semantic segmentation method based on deep supervision comprises the following steps:
step S1, collecting scene image data for deep supervision for a specific application scene, and constructing a scene image database;
step S2, carrying out pixel-level annotation on the scene images in the database, and exporting the annotation files in PASCAL VOC format so that they meet the training requirements of the semantic segmentation task;
step S3, constructing a deep supervision-based real-time semantic segmentation network, CFSegNet;
step S4, training the CFSegNet neural network model with the annotated data set;
step S5, preprocessing the image data collected in the application scene, and then inputting the preprocessed image data into the CFSegNet neural network model to obtain the image semantic segmentation result.
The step S1 specifically includes the following steps:
step S11: analyzing the influence of various factors in the application scene, such as weather and illumination, on the image semantic segmentation result;
step S12: according to the analysis result of step S11, overcoming the adverse effects by sampling a large number of images, i.e., capturing as many application-scene images as possible so as to cover the various conditions likely to occur;
step S13: sorting the collected images and removing duplicate or erroneous images that are unsuitable for the training task, so as to obtain the corresponding scene image database.
The step S2 specifically includes the following steps;
step S21: analyzing, according to the application requirements and the collected image information, the semantic categories to be segmented in the application scene;
step S22: downloading and installing the image annotation software labelme, and configuring labelme according to the semantic categories obtained in step S21;
step S23: outlining the category boundaries in each image obtained in step S1 with the labelme annotation software, and saving the annotation information to a JSON file with the same name as the image;
step S24: converting the JSON files generated in step S23 into the PASCAL VOC format with the labelme2voc script provided with labelme, so as to meet the training requirements of the semantic segmentation task.
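The conversion in step S24 is performed with the labelme2voc script shipped with labelme; purely as an illustration of what that conversion produces, the sketch below rasterises the polygons of a single labelme JSON file into an indexed label PNG, assuming the standard labelme JSON fields. The category-to-index mapping, the file names and the helper name labelme_json_to_mask are assumptions for illustration, not part of the patented method.

```python
import json
from pathlib import Path

from PIL import Image, ImageDraw

# Assumed class-name -> index mapping matching the labelme configuration of step S21;
# index 0 is background and the actual categories are application specific.
LABEL_MAP = {"_background_": 0, "road": 1, "person": 2, "vehicle": 3}

def labelme_json_to_mask(json_path: str, out_png: str) -> None:
    """Rasterise the polygons of one labelme JSON file into an indexed label PNG."""
    ann = json.loads(Path(json_path).read_text(encoding="utf-8"))
    h, w = ann["imageHeight"], ann["imageWidth"]
    mask = Image.new("L", (w, h), 0)                 # background = 0
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:                      # each polygon outlined in step S23
        cls = LABEL_MAP.get(shape["label"])
        if cls is None:
            continue                                 # skip categories outside the task
        polygon = [tuple(pt) for pt in shape["points"]]
        draw.polygon(polygon, outline=cls, fill=cls)
    mask.save(out_png)

# Example: labelme_json_to_mask("scene_0001.json", "scene_0001.png")
```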
Step S3 specifically includes the following steps:
step S31: adopting ResNet-18 as the encoder of CFSegNet, where the bottleneck layer of ResNet-18 downsamples the input image by a factor of 4, and, except for the first stage, each of the three subsequent stages of ResNet-18 downsamples the feature map by a further factor of 2;
step S32: preserving the representations of the downsampling stages through dense connections in the first to third stages of ResNet-18, and introducing deep supervision modules to supervise the representations output by the encoder in the second to fourth stages, thereby reducing the loss of spatial information in the encoding stage;
step S33: feeding the output of the fourth stage of the encoder into a pyramid pooling module (PPM) to obtain a representation rich in multi-scale information;
step S34: feeding the representation obtained in step S33 into a cascaded upsampling path, and upsampling it by a factor of 2 three times with a channel fusion module (CFM) combined with the dense connections of step S32, to obtain a representation that fuses semantic and spatial information;
step S35: upsampling the representation obtained in step S34 by a factor of 8 with bilinear interpolation, and outputting the prediction result through a 1 × 1 convolution.
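The following PyTorch sketch shows one possible reading of the structure described in steps S31–S35: a ResNet-18 encoder, a pyramid pooling module on the stage-4 output, three 2× channel-fusion upsamplings that reuse the encoder features, auxiliary heads on the stage-2 to stage-4 outputs for deep supervision, and a final 1 × 1 convolution head. The internals of the CFM, the channel widths, the PPM bin sizes and the restoration of the input resolution in a single interpolation are assumptions; the patent does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class PPM(nn.Module):
    """Pyramid pooling module: pool at several scales, project, upsample and fuse."""
    def __init__(self, in_ch, out_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // len(bins), 1, bias=False),
                          nn.BatchNorm2d(in_ch // len(bins)), nn.ReLU(inplace=True))
            for b in bins)
        self.project = nn.Sequential(nn.Conv2d(in_ch * 2, out_ch, 3, padding=1, bias=False),
                                     nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        size = x.shape[2:]
        feats = [x] + [F.interpolate(s(x), size, mode="bilinear", align_corners=False)
                       for s in self.stages]
        return self.project(torch.cat(feats, dim=1))


class CFM(nn.Module):
    """Channel fusion module (internals assumed): fuse an upsampled decoder feature
    with the densely connected encoder feature of the same resolution."""
    def __init__(self, dec_ch, enc_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(nn.Conv2d(dec_ch + enc_ch, out_ch, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, dec, enc):
        dec = F.interpolate(dec, size=enc.shape[2:], mode="bilinear", align_corners=False)  # 2x up
        return self.fuse(torch.cat([dec, enc], dim=1))


class CFSegNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        r = resnet18(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)  # 1/4 resolution
        self.stage1, self.stage2, self.stage3, self.stage4 = r.layer1, r.layer2, r.layer3, r.layer4
        self.ppm = PPM(512, 256)
        self.cfm3 = CFM(256, 256, 128)   # 1/32 -> 1/16, fused with the stage-3 feature
        self.cfm2 = CFM(128, 128, 64)    # 1/16 -> 1/8,  fused with the stage-2 feature
        self.cfm1 = CFM(64, 64, 64)      # 1/8  -> 1/4,  fused with the stage-1 feature
        self.head = nn.Conv2d(64, num_classes, 1)
        # Deep supervision heads on the stage-2..4 encoder outputs (training only).
        self.aux_heads = nn.ModuleList(nn.Conv2d(c, num_classes, 1) for c in (128, 256, 512))

    def forward(self, x):
        size = x.shape[2:]
        f1 = self.stage1(self.stem(x))    # 1/4
        f2 = self.stage2(f1)              # 1/8
        f3 = self.stage3(f2)              # 1/16
        f4 = self.stage4(f3)              # 1/32
        d = self.ppm(f4)
        d = self.cfm1(self.cfm2(self.cfm3(d, f3), f2), f1)   # three 2x upsamplings
        main = F.interpolate(self.head(d), size, mode="bilinear", align_corners=False)
        if self.training:                 # auxiliary predictions for the deep supervision loss
            aux = [F.interpolate(h(f), size, mode="bilinear", align_corners=False)
                   for h, f in zip(self.aux_heads, (f2, f3, f4))]
            return main, aux
        return main
```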
Step S4 specifically includes the following steps:
step S41: training the model constructed in step S3 with the following initial parameters:
learning rate: 0.01;
weight decay: 0.0005;
momentum: 0.9;
in the training stage, polynomial ("poly") decay is adopted as the learning-rate decay strategy, where the minimum learning rate is set to 0.0001 and the decay factor (power) is set to 0.9; the batch size is determined by the size of the images collected in the application scene and the GPU memory of the training server;
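A minimal sketch of this training setup (together with the SGD optimizer named in step S43 below), assuming the CFSegNet class sketched earlier; the number of classes is application specific and chosen here only for illustration.

```python
import torch

# Hyper-parameters from step S41; "poly" decay with power 0.9 and a floor of 1e-4,
# and the 160000-iteration budget of step S46.
base_lr, min_lr, power, max_iters = 0.01, 0.0001, 0.9, 160000

model = CFSegNet(num_classes=19)             # class from the sketch above; 19 is illustrative
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=0.0005)

def poly_lr(cur_iter: int) -> float:
    """Polynomial learning-rate decay used in the training stage."""
    lr = base_lr * (1 - cur_iter / max_iters) ** power
    return max(lr, min_lr)

# Inside the training loop:
# for it in range(max_iters):
#     for g in optimizer.param_groups:
#         g["lr"] = poly_lr(it)
#     ...forward pass / loss / backward / optimizer.step()...
```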
step S42, the final loss function of the model is:
$$\mathrm{Loss}_{final} = \mathrm{Loss}_{main} + \alpha\sum_{s=1}^{K}\mathrm{Loss}_{aux}^{(s)}$$
where Loss_final, Loss_main and Loss_aux denote the final loss, the main loss and the auxiliary loss of the model, respectively; α is the weight of the auxiliary loss and is set to 0.4; K is the number of deep supervision modules and is set to 3; and s is the index of a deep supervision module. Cross-entropy is adopted as the loss function, as given below:
$$\mathrm{Loss} = -\sum_{c=1}^{M} y_{c}\log(p_{c})$$
where Loss denotes the loss value, M denotes the number of semantic categories, c denotes the category index, y_c is a one-hot indicator taking only the values 0 and 1 (1 if c matches the ground-truth class of the sample, 0 otherwise), and p_c denotes the predicted probability that the pixel belongs to class c;
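A short sketch of the combined loss of step S42, using PyTorch's cross-entropy; the ignore_index value is an assumption (the patent does not state how unlabeled pixels are handled).

```python
import torch.nn.functional as F

def cfseg_loss(main_logits, aux_logits_list, target, alpha=0.4, ignore_index=255):
    """Deep-supervision loss of step S42: cross-entropy on the main prediction plus
    an alpha-weighted sum of the K auxiliary cross-entropy losses."""
    loss_main = F.cross_entropy(main_logits, target, ignore_index=ignore_index)
    loss_aux = sum(F.cross_entropy(a, target, ignore_index=ignore_index)
                   for a in aux_logits_list)
    return loss_main + alpha * loss_aux

# Example with the network sketched above (training mode returns main + K=3 aux outputs):
# main, aux = model(images)
# loss = cfseg_loss(main, aux, labels)
# loss.backward()
```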
step S43: in the training stage, using stochastic gradient descent (SGD) as the optimizer to compute the updated weights and biases of the convolutional neural network;
step S44: applying a random affine transformation to part of the training samples, applying the same transformation to the corresponding label files, and adding the transformed pairs to the training set of the model;
step S45: cropping part of the training samples at random positions, cropping the corresponding regions of the label files, and adding the cropped pairs to the training set of the model (a sketch of the augmentations of steps S44 and S45 follows step S46);
step S46: stopping training after 160000 iterations and saving the trained model.
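Steps S44 and S45 apply the same geometric transformation to an image and to its label map; the sketch below shows one way to do this with torchvision, using nearest-neighbour resampling for the labels so that class indices stay valid. The transformation ranges and the crop size are assumptions, not values taken from the patent.

```python
import random

import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def random_affine_pair(image, label):
    """Step S44: one random affine transform applied identically to image and label."""
    angle = random.uniform(-10, 10)                 # ranges are assumptions, not from the patent
    translate = (random.randint(-20, 20), random.randint(-20, 20))
    scale = random.uniform(0.9, 1.1)
    shear = random.uniform(-5, 5)
    image = TF.affine(image, angle, translate, scale, shear,
                      interpolation=InterpolationMode.BILINEAR)
    label = TF.affine(label, angle, translate, scale, shear,
                      interpolation=InterpolationMode.NEAREST)
    return image, label

def random_crop_pair(image, label, crop_size=(512, 1024)):
    """Step S45: crop image and label map at the same random position."""
    w, h = image.size                                # PIL images report (width, height)
    ch, cw = crop_size
    top = random.randint(0, max(h - ch, 0))
    left = random.randint(0, max(w - cw, 0))
    return TF.crop(image, top, left, ch, cw), TF.crop(label, top, left, ch, cw)
```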
Step S5 specifically includes the following steps:
step S51: acquiring image data as input through a camera in the application scene;
step S52: resizing the input image to 2048 × 1024;
step S53: feeding the image obtained in step S52 into CFSegNet to obtain a prediction map;
step S54: rescaling the prediction map obtained in step S53 to the original input size with bilinear interpolation to obtain the final result image.
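Steps S51–S54 can be strung together as in the sketch below, assuming the CFSegNet model from the earlier sketch; input normalization matching the training pipeline is omitted here and would be needed in practice.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms.functional import to_tensor

@torch.no_grad()
def segment_frame(model, frame: Image.Image) -> Image.Image:
    """Steps S51-S54: resize the camera frame to 2048x1024, run CFSegNet, and rescale
    the prediction back to the original resolution (bilinear resampling is applied to
    the logits before taking the per-pixel argmax)."""
    model.eval()
    orig_w, orig_h = frame.size
    x = to_tensor(frame.resize((2048, 1024), Image.BILINEAR)).unsqueeze(0)   # step S52
    logits = model(x)                                                        # step S53
    logits = F.interpolate(logits, size=(orig_h, orig_w),                    # step S54
                           mode="bilinear", align_corners=False)
    pred = logits.argmax(dim=1).squeeze(0).byte().cpu().numpy()
    return Image.fromarray(pred)                                             # indexed class map

# Example (model as sketched above):
# result = segment_frame(model, Image.open("frame.jpg").convert("RGB"))
```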
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.

Claims (6)

1. A real-time semantic segmentation method based on deep supervision, characterized by comprising the following steps:
step S1, collecting scene image data for deep supervision for a specific application scene, and constructing a scene image database;
step S2, carrying out pixel-level annotation on the scene images in the database, and exporting the annotation files in PASCAL VOC format so that they meet the training requirements of the semantic segmentation task;
step S3, constructing a deep supervision-based real-time semantic segmentation network, CFSegNet;
step S4, training the CFSegNet neural network model with the annotated data set;
step S5, preprocessing the image data collected in the application scene, and then inputting the preprocessed image data into the CFSegNet neural network model to obtain the image semantic segmentation result.
2. The deep supervision based real-time semantic segmentation method according to claim 1, characterized in that: the step S1 specifically includes the following steps:
step S11: analyzing the influence of various factors in the application scene, such as weather and illumination, on the image semantic segmentation result;
step S12: according to the analysis result of step S11, overcoming the adverse effects by sampling a large number of images, i.e., capturing as many application-scene images as possible so as to cover the various conditions likely to occur;
step S13: sorting the collected images and removing duplicate or erroneous images that are unsuitable for the training task, so as to obtain the corresponding scene image database.
3. The deep supervision based real-time semantic segmentation method according to claim 1, characterized in that: the step S2 specifically includes the following steps;
step S21: analyzing, according to the application requirements and the collected image information, the semantic categories to be segmented in the application scene;
step S22: downloading and installing the image annotation software labelme, and configuring labelme according to the semantic categories obtained in step S21;
step S23: outlining the category boundaries in each image obtained in step S1 with the labelme annotation software, and saving the annotation information to a JSON file with the same name as the image;
step S24: converting the JSON files generated in step S23 into the PASCAL VOC format with the labelme2voc script provided with labelme, so as to meet the training requirements of the semantic segmentation task.
4. The deep supervision based real-time semantic segmentation method according to claim 1, characterized in that: step S3 specifically includes the following steps:
step S31: adopting ResNet-18 as the encoder of CFSegNet, where the bottleneck layer of ResNet-18 downsamples the input image by a factor of 4, and, except for the first stage, each of the three subsequent stages of ResNet-18 downsamples the feature map by a further factor of 2;
step S32: preserving the representations of the downsampling stages through dense connections in the first to third stages of ResNet-18, and introducing deep supervision modules to supervise the representations output by the encoder in the second to fourth stages, thereby reducing the loss of spatial information in the encoding stage;
step S33: feeding the output of the fourth stage of the encoder into a pyramid pooling module (PPM) to obtain a representation rich in multi-scale information;
step S34: feeding the representation obtained in step S33 into a cascaded upsampling path, and upsampling it by a factor of 2 three times with a channel fusion module (CFM) combined with the dense connections of step S32, to obtain a representation that fuses semantic and spatial information;
step S35: upsampling the representation obtained in step S34 by a factor of 8 with bilinear interpolation, and outputting the prediction result through a 1 × 1 convolution.
5. The deep supervision based real-time semantic segmentation method according to claim 1, characterized in that: step S4 specifically includes the following steps:
step S41: training the model constructed in step S3 with the following initial parameters:
learning rate: 0.01;
weight decay: 0.0005;
momentum: 0.9;
in the training stage, polynomial ("poly") decay is adopted as the learning-rate decay strategy, where the minimum learning rate is set to 0.0001 and the decay factor (power) is set to 0.9; the batch size is determined by the size of the images collected in the application scene and the GPU memory of the training server;
step S42, the final loss function of the model is:
$$\mathrm{Loss}_{final} = \mathrm{Loss}_{main} + \alpha\sum_{s=1}^{K}\mathrm{Loss}_{aux}^{(s)}$$
where Loss_final, Loss_main and Loss_aux denote the final loss, the main loss and the auxiliary loss of the model, respectively; α is the weight of the auxiliary loss and is set to 0.4; K is the number of deep supervision modules and is set to 3; and s is the index of a deep supervision module. Cross-entropy is adopted as the loss function, as given below:
$$\mathrm{Loss} = -\sum_{c=1}^{M} y_{c}\log(p_{c})$$
where Loss denotes the loss value, M denotes the number of semantic categories, c denotes the category index, y_c is a one-hot indicator taking only the values 0 and 1 (1 if c matches the ground-truth class of the sample, 0 otherwise), and p_c denotes the predicted probability that the pixel belongs to class c;
step S43: in the training stage, using stochastic gradient descent (SGD) as the optimizer to compute the updated weights and biases of the convolutional neural network;
step S44: applying a random affine transformation to part of the training samples, applying the same transformation to the corresponding label files, and adding the transformed pairs to the training set of the model;
step S45: cropping part of the training samples at random positions, cropping the corresponding regions of the label files, and adding the cropped pairs to the training set of the model;
step S46: stopping training after 160000 iterations and saving the trained model.
6. The deep supervision based real-time semantic segmentation method according to claim 1, characterized in that: step S5 specifically includes the following steps:
step S51: acquiring image data as input through a camera in the application scene;
step S52: resizing the input image to 2048 × 1024;
step S53: feeding the image obtained in step S52 into CFSegNet to obtain a prediction map;
step S54: rescaling the prediction map obtained in step S53 to the original input size with bilinear interpolation to obtain the final result image.
CN202111600850.1A 2021-12-24 2021-12-24 Real-time semantic segmentation method based on deep supervision Pending CN114266952A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111600850.1A CN114266952A (en) 2021-12-24 2021-12-24 Real-time semantic segmentation method based on deep supervision

Publications (1)

Publication Number Publication Date
CN114266952A true CN114266952A (en) 2022-04-01

Family

ID=80829888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111600850.1A Pending CN114266952A (en) 2021-12-24 2021-12-24 Real-time semantic segmentation method based on deep supervision

Country Status (1)

Country Link
CN (1) CN114266952A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145983A (en) * 2018-08-21 2019-01-04 电子科技大学 A kind of real-time scene image, semantic dividing method based on lightweight network
WO2021243787A1 (en) * 2020-06-05 2021-12-09 中国科学院自动化研究所 Intra-class discriminator-based method for weakly supervised image semantic segmentation, system, and apparatus
CN112541916A (en) * 2020-12-11 2021-03-23 华南理工大学 Waste plastic image segmentation method based on dense connection
CN113011427A (en) * 2021-03-17 2021-06-22 中南大学 Remote sensing image semantic segmentation method based on self-supervision contrast learning

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863223A (en) * 2022-06-30 2022-08-05 中国自然资源航空物探遥感中心 Hyperspectral weak supervision classification method combining denoising autoencoder and scene enhancement
CN117593528A (en) * 2024-01-18 2024-02-23 中数智科(杭州)科技有限公司 Rail vehicle bolt loosening detection method based on machine vision
CN117593528B (en) * 2024-01-18 2024-04-16 中数智科(杭州)科技有限公司 Rail vehicle bolt loosening detection method based on machine vision


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination