CN111553391A

CN111553391A - Feature fusion method in semantic segmentation technology

Info

Publication number: CN111553391A
Application number: CN202010274552.7A
Authority: CN
Inventors: 杨绿溪; 顾恒瑞; 朱紫辉; 王路; 李春国; 黄永明
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2020-04-09
Filing date: 2020-04-09
Publication date: 2020-08-18

Abstract

A feature fusion method in semantic segmentation technology. A semantic segmentation network based on an encoder-decoder structure proposes a feature fusion mode in a decoder part. The feature fusion method comprises operations of splicing, pooling, convolution, activation, addition and the like, and can effectively fuse different features, improve the expression capability of the network on the features and finally achieve the purpose of improving the accuracy of network segmentation. Meanwhile, in the network training process, the method can accelerate the convergence process of the loss function and shorten the training time. The invention tests on the semantic segmentation public data set and obtains good experimental results. Compared with a network without feature fusion or using other feature fusion methods, the network using the feature fusion method provided by the invention has higher accuracy in testing and the loss function can be converged faster during training.

Description

Feature fusion method in semantic segmentation technology

Technical Field

The invention relates to the field of computer vision, in particular to a feature fusion method in a semantic segmentation technology.

Background

In recent years, with continuous breakthrough of parallel computing theory and hardware implementation level, the field of computer vision has been greatly developed. Especially, in the il svrc (image net Large Scale Visual recognitionchallenge) game in 2012, AlexNet based on the convolutional neural network takes the champion of the classified item, which causes the hot tide of deep learning, and the deep learning technology starts to play a great splendid. At present, in the field of computer vision, deep learning, especially convolutional neural networks, play an increasingly important role in various visual recognition tasks.

Semantic segmentation technology has been attracting attention as one of the key technologies in the field of computer vision. Semantic segmentation is a task applied to scene understanding, and is classified at the pixel level. Semantic segmentation has wide application prospect and covers a plurality of fields including automatic driving, man-machine interaction, robots, augmented reality and the like.

At present, in the semantic segmentation task, most of the mainstream algorithms are based on deep learning, especially the convolutional neural network.

In 2015, Jonathan Long et al of UC Berkeley proposed a full convolution neural network (FCN) for semantic segmentation, which performed pioneering work in semantic segmentation and solved the problem of pixel segmentation. The full convolutional neural network proposes to replace all the fully connected layers behind the traditional neural network with convolutional layers, which is also the origin of the full convolutional name. Semantic segmentation techniques based on convolutional neural networks have also evolved rapidly from this.

In the same year, U-Net network also proposes that the U-Net network is a typical 'coder-decoder' structure, and the structure is a semantic segmentation structure which is more mainstream at present, and SegNet is also adopted in a similar structure. The semantic segmentation networks based on the structure of the encoder-decoder of U-Net and SegNet have good performance and better performance in segmentation tasks.

In general, in a semantic segmentation network based on an "encoder-decoder" structure, feature fusion is performed in a "decoder" part, for example, in an FCN, the feature fusion adopts an addition (add) mode, and in a U-Net, the feature fusion adopts a concatenation (concat) mode. The feature fusion is to improve the expression ability of the network to the features, so that the network can obtain more accurate segmentation results.

Different feature fusion methods have different effects, how to search for more effective feature methods has further improvement on network performance, and the method is a hot problem of current semantic segmentation research.

Disclosure of Invention

In order to solve the existing problems, the invention provides a feature fusion method in a semantic segmentation technology, which can improve the expression capability of a network on features so as to achieve the purpose of improving the accuracy of semantic segmentation results.

To achieve the object, the present invention provides a feature fusion method in semantic segmentation technology, comprising the following steps:

step 1: constructing a semantic segmentation network based on an encoder-decoder structure;

step 2: performing feature fusion in a decoder part;

and step 3: the feature fusion method comprises splicing, pooling, convolution, activation and addition operations;

and 4, step 4: training the network subjected to the feature fusion on a training set;

and 5: training a network which is not subjected to feature fusion and uses other feature fusion methods on a training set;

step 6: testing the trained network on a test set;

and 7: analyzing and comparing the effects of the characteristic fusion method.

As a further improvement of the present invention, in the step 1, a semantic segmentation network based on an "encoder-decoder" is constructed, in this architecture, the "encoder" part extracts features through a convolutional layer and a pooling layer, the depth of a feature map is continuously deepened, but the size is also continuously reduced; after receiving the features extracted by the encoder, the decoder performs up-sampling by deconvolution to restore the size of the feature map, and finally obtains a semantic segmentation result equivalent to the size of the original image.

As a further improvement of the present invention, in the step 2, feature fusion is performed in the "decoder" part, where feature fusion means that the network layer in the "decoder" not only takes the output of the last network layer in the "decoder" as an output, but also receives the output of the corresponding network layer in the "encoder" as an input, so that different features can be fused together, and the feature expression capability of the network can be improved.

As a further improvement of the present invention, in the step 3, the proposed feature fusion method includes operations of splicing, pooling, convolution, activation, and addition, where the output of the previous network layer in the "decoder" and the output of the corresponding network layer in the "encoder" are used as inputs in the network layer in the "decoder", the two inputs are spliced in channel dimension, and then subjected to pooling, convolution, and activation operations, and then added to the feature map that is not processed after splicing, so as to obtain the output after feature fusion.

As a further improvement of the present invention, in step 4, the network after feature fusion is trained on a training set, an open data set is used during training, a change curve of a loss function in the training process, an accuracy curve of the network on a verification set and training duration are recorded, and an influence of the feature fusion method on the network training process is studied.

As a further improvement of the present invention, in the step 5, the network without feature fusion and using other feature fusion methods is trained on a training set, the training set is consistent with the training set in the step 4, and a change curve of a loss function in the training process, an accuracy curve of the network on a verification set and a training time length are also recorded.

As a further improvement of the present invention, in step 6, the networks trained in step 4 and step 5 are tested on the same test set, the test set uses an open data set, and records test results respectively, including accuracy and average cross-over ratio indexes, and outputs semantic segmentation results at the same time.

As a further improvement of the present invention, in step 7, the proposed feature fusion method is analyzed and compared in terms of its effects, and compared with a network without feature fusion and using other feature fusion methods, the proposed feature fusion method is analyzed in terms of its performances during testing and training, and the indexes of segmentation accuracy and average cross-over ratio are compared during testing; and comparing the convergence speed of the loss function with the training time length during training.

The invention provides a feature fusion method in a semantic segmentation technology, which constructs a semantic segmentation network based on an encoder-decoder structure and provides a feature fusion mode in a decoder part. The feature fusion method comprises operations of splicing, pooling, convolution, activation, addition and the like, and can effectively fuse different features, improve the expression capability of the network on the features and finally achieve the purpose of improving the accuracy of network segmentation. Meanwhile, in the network training process, the method can accelerate the convergence process of the loss function and shorten the training time. The invention tests on the semantic segmentation public data set and obtains good experimental results. Compared with a network without feature fusion or using other feature fusion methods, the network using the feature fusion method provided by the invention has higher accuracy in testing and the loss function can be converged faster during training.

Drawings

FIG. 1 is a diagram of a semantically partitioned network architecture;

FIG. 2 is a schematic view of feature fusion;

FIG. 3 is an additive feature fusion approach;

FIG. 4 is a feature fusion approach to stitching;

FIG. 5 is a feature fusion approach of the present invention;

FIG. 6 is a diagram of semantic segmentation results;

FIG. 7 is a IoU calculation diagram;

fig. 8 is a comparison of the training process.

Detailed Description

The invention is described in further detail below with reference to the following detailed description and accompanying drawings:

the invention provides a feature fusion method in a semantic segmentation technology, aiming at improving the expression capability of a network on features so as to improve the accuracy of a segmentation result.

The specific embodiment of the invention is as follows:

step 1: and constructing a semantic segmentation network based on an encoder-decoder. Fig. 1 shows a semantic-partitioned network structure diagram constructed in the present invention, in an "encoder" part, a network mainly consists of a convolutional layer and a pooling layer, and as the network deepens, the depth of a feature diagram also continuously increases, but the size thereof continuously decreases; in the 'decoder' part, the network mainly consists of a deconvolution layer, the deconvolution layer can continuously recover the size of the feature graph, and finally a semantic segmentation result with the size equivalent to that of the input is output.

Step 2: feature fusion is performed in the "decoder" part. Fig. 2 gives a schematic representation of feature fusion. The feature fusion means that the network layer in the "decoder" not only takes the output of the last network layer in the "decoder" as the output, but also receives the output of the network layer in the corresponding "encoder" as the input, so that different features can be fused, and the feature expression capability of the network can be improved.

At present, the common feature fusion methods include addition and splicing. The feature fusion method of addition is, as shown in fig. 3, to obtain an output result by adding values at corresponding positions in two feature maps. The feature fusion method of splicing is shown in fig. 4, and means that two feature maps are spliced in channel dimension to obtain an output result.

And step 3: the feature fusion method proposed by the present invention is shown in fig. 5, and includes operations of splicing, pooling, convolution, activation, addition, and the like. Recording the feature map of two inputs as feature_AAnd feature_B，feature_AAnd feature_BSplicing on the dimension of the channel to obtain feature_C，feature_CObtaining feature after global pooling (global pooling), convolution (conv), ReLU activation function, convolution (conv) and sigmoid activation function_D，feature_DRe-harmonizing feature_CAnd adding to obtain an output.

And 4, step 4: training the network subjected to feature fusion on a training set, wherein an open data set CamVid is used during training, the CamVid is a road and driving scene understanding data set and comprises 5 video sequences, and the data comes from a 960 x 720 resolution camera installed on an automobile instrument panel. The data set samples a total of 701 frames from the video sequences (4 at 1 frame per second and 1 at 15 frames per second), which were manually labeled with 32 categories. Sturgess et al divided the data set into 367 training sets, 100 validation sets, and 233 test sets.

In the training process, a change curve of a loss function in the network, an accuracy curve of the network on a verification set, training time and the like are recorded, and the influence of the characteristic fusion method on the network training process is researched.

And 5: and (4) training the network which is not subjected to the feature fusion and uses other feature fusion methods on a training set, wherein the training set is consistent with the training set in the step 4, and a change curve of a loss function in the training process, an accuracy curve of the network on a verification set, training duration and the like are also recorded.

Step 6: and (5) testing the networks trained in the step (4) and the step (5) on the same test set, wherein the test set uses an open data set, respectively records test results including indexes such as accuracy and average cross-over ratio, and simultaneously outputs semantic segmentation results. FIG. 6 is a diagram illustrating a set of semantic segmentation results.

And 7: analyzing and comparing the effect of the characteristic fusion method provided by the invention. Compared with a network which does not perform characteristic fusion and uses other characteristic fusion methods, the performance of the characteristic fusion method provided by the invention is analyzed during testing and training, and indexes of segmentation accuracy and average cross-over ratio are compared during testing; and during training, comparing the convergence speed of the loss function, the training time and the like.

TABLE 1

Table 1 shows the comparison of the accuracy of different feature fusion modes. The accuracy index uses a mean intersection over Union (mlou). The Intersection over Union (IoU) refers to the area of Intersection between the predicted value and the true value and the ratio of the areas of Intersection between the predicted value and the true value, and the calculation diagram is given in fig. 7. mlou refers to the average of IoU for each class.

It can be seen that without feature fusion, the average intersection-to-parallel ratio of the network on the test set is 64, 72%, while with feature fusion, the value of mlio u can be increased, the addition mode can be increased by 2.81% to 57.53%, and the splicing mode can be increased by 2.91% to 57.62%. The addition is similar to how much the concatenation improves mlou. In comparison, when the network using the feature fusion mode provided by the invention is tested, the mIoU value can reach 58.94%, and is increased by 4.02%. Therefore, the method for fusing the characteristics can better improve the accuracy of network segmentation.

FIG. 8 is a graph showing loss functions for a network without feature fusion, a network with additive feature fusion, a network with splice feature fusion, and a network with the feature fusion method of the present invention during training. The loss functions are all cross-entropy loss functions, defined as follows:

wherein, y_iA label representing the authenticity of the tag,

indicates the predicted label, and N is the total number of categories.

It can be seen that, in a network without feature fusion, the loss function curve converges slowest; the network adopting the mode of adding and splicing feature fusion ensures that the convergence of the loss function curve is quicker; and by adopting the network of the characteristic fusion mode in the invention, the loss function curve is converged fastest, so the effect of shortening the training time can be achieved.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, but any modifications or equivalent variations made according to the technical spirit of the present invention are within the scope of the present invention as claimed.

Claims

1. The feature fusion method in the semantic segmentation technology is characterized by comprising the following steps of:

step 2: performing feature fusion in a decoder part;

step 6: testing the trained network on a test set;

and 7: analyzing and comparing the effects of the characteristic fusion method.

2. The method for feature fusion in semantic segmentation technology according to claim 1, characterized in that: in the step 1, a semantic segmentation network based on an encoder-decoder is constructed, in the framework, the encoder part extracts features through a convolutional layer and a pooling layer, the depth of a feature map is continuously deepened, and the size is also continuously reduced; after receiving the features extracted by the encoder, the decoder performs up-sampling by deconvolution to restore the size of the feature map, and finally obtains a semantic segmentation result equivalent to the size of the original image.

3. The method for feature fusion in semantic segmentation technology according to claim 1, characterized in that: in the step 2, feature fusion is performed in the "decoder" part, where the feature fusion means that the network layer in the "decoder" not only takes the output of the last network layer in the "decoder" as an output, but also receives the output of the network layer in the corresponding "encoder" as an input, so that different features can be fused together, and the feature expression capability of the network can be improved.

4. The method for feature fusion in semantic segmentation technology according to claim 1, characterized in that: in the step 3, the proposed feature fusion method includes operations of splicing, pooling, convolution, activation and addition, the network layer in the "decoder" takes the output of the last network layer in the "decoder" and the output of the corresponding network layer in the "encoder" as inputs, the two inputs are spliced in channel dimension, and then subjected to pooling, convolution and activation operations, and then added to the feature map which is not processed after splicing, so as to obtain the output after feature fusion.

5. The method for feature fusion in semantic segmentation technology according to claim 1, characterized in that: in the step 4, the network after feature fusion is trained on a training set, an open data set is used during training, a change curve of a loss function in the training process, an accuracy curve of the network on a verification set and training duration are recorded, and the influence of a feature fusion method on the network training process is researched.

6. The method for feature fusion in semantic segmentation technology according to claim 1, characterized in that: in the step 5, training is performed on the network which is not subjected to the feature fusion and uses other feature fusion methods on the training set, the training set is kept consistent with the training set in the step 4, and the change curve of the loss function in the training process, the accuracy curve of the network on the verification set and the training duration are also recorded.

7. The method for feature fusion in semantic segmentation technology according to claim 1, characterized in that: in the step 6, the networks trained in the step 4 and the step 5 are tested on the same test set, the test set uses an open data set, test results including accuracy and average cross-over ratio indexes are respectively recorded, and meanwhile, semantic segmentation results are output.

8. The method for feature fusion in semantic segmentation technology according to claim 1, characterized in that: in the step 7, the functions of the proposed feature fusion method are analyzed and compared, and compared with a network which does not perform feature fusion and uses other feature fusion methods, the performance of the proposed feature fusion method is analyzed during testing and training, and indexes of segmentation accuracy and average cross-over ratio are compared during testing; and comparing the convergence speed of the loss function with the training time length during training.