
Image semantic segmentation method and device, electronic equipment and storage medium

Info

Publication number
CN114821058A
CN114821058A (application CN202210461550.8A)
Authority
CN
China
Prior art keywords
semantic segmentation
module
image
feature map
image semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210461550.8A
Other languages
Chinese (zh)
Inventor
周涛
李天鹏
庄林志
邵蒙悦
吴婕
王清如
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Boguan Intelligent Technology Co Ltd
Original Assignee
Jinan Boguan Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan Boguan Intelligent Technology Co Ltd filed Critical Jinan Boguan Intelligent Technology Co Ltd
Priority to CN202210461550.8A
Publication of CN114821058A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The application discloses an image semantic segmentation method applied to an electronic device comprising an image semantic segmentation model, wherein the image semantic segmentation model comprises a down-sampling module and a sparse multilayer perceptron module, and the image semantic segmentation method comprises the following steps: acquiring an original image, and extracting a down-sampling feature map of the original image by using the down-sampling module; carrying out linear coding on the down-sampling feature map to obtain a linear coding feature map; performing feature extraction on the linear coding feature map by using the sparse multilayer perceptron module to obtain output features; and training the image semantic segmentation model by using the output features, and executing an image semantic segmentation task by using the trained image semantic segmentation model. The application also discloses an image semantic segmentation apparatus, an electronic device and a storage medium, which have the same beneficial effects.

Description

Image semantic segmentation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image semantic segmentation method and apparatus, an electronic device, and a storage medium.
Background
Image semantic segmentation is a fundamental task in the field of computer vision that aims to assign a specified semantic label to each pixel in an image; it is essentially a pixel-level multi-classification problem. With the development of deep learning in computer vision, more and more image semantic segmentation algorithms are applied in autonomous driving, video surveillance, medical image analysis, human-computer interaction, and other fields. In the related art, a convolutional neural network or a Transformer network is generally used to extract image features. However, image features extracted by a convolutional neural network lack global semantic information, so the precision of image semantic segmentation is low, while extracting image features with a Transformer network requires a huge amount of computation, so image semantic segmentation is slow.
Therefore, how to balance the speed and precision of image semantic segmentation is a technical problem that those skilled in the art currently need to solve.
Disclosure of Invention
The present application provides an image semantic segmentation method, an image semantic segmentation apparatus, an electronic device, and a storage medium, which can balance the speed and precision of image semantic segmentation, so that the image semantic segmentation operation runs faster while achieving higher precision.
In order to solve the technical problem, the present application provides an image semantic segmentation method, which is applied to an electronic device including an image semantic segmentation model, where the image semantic segmentation model includes a down-sampling module and a sparse multi-layer perceptron module, and the image semantic segmentation method includes:
acquiring an original image, and extracting a downsampling feature map of the original image by using the downsampling module;
carrying out linear coding on the down-sampling feature map to obtain a linear coding feature map;
performing feature extraction on the linear coding feature map by using the sparse multilayer perceptron module to obtain output features; the output features comprise global semantic information and local semantic information, and the sparse multilayer perceptron module comprises a sparse multilayer perceptron, a depth separable convolution structure and a residual structure;
and training the image semantic segmentation model by using the output features, and executing an image semantic segmentation task by using the trained image semantic segmentation model.
Optionally, the performing feature extraction on the linear coding feature map by using the sparse multilayer perceptron module to obtain an output feature includes:
channel splitting is carried out on the linear coding feature map in a preset proportion to obtain a first sub-feature map and a second sub-feature map;
extracting global semantic information of the first sub-feature map by using the sparse multilayer perceptron;
extracting local semantic information of the second sub-feature map by using the depth separable convolution structure;
performing feature fusion on the global semantic information and the local semantic information to obtain fusion features;
and inputting the fusion feature into the residual structure to obtain the output feature.
Optionally, the sparse multi-layer perceptron includes a first processing branch and a second processing branch;
correspondingly, the extracting, by using the sparse multi-layer perceptron, global semantic information of the first sub-feature map includes:
interacting the pixels in the same row of the first sub-feature map through a fully connected layer in the first processing branch to obtain a row processing result;
interacting the pixels in the same column of the first sub-feature map through a fully connected layer in the second processing branch to obtain a column processing result;
and generating global semantic information of the first sub-feature map according to the row processing result and the column processing result.
Optionally, the depth-separable convolution structure includes a depth-separable convolution layer, a batch normalization layer, and an activation function;
correspondingly, extracting the local semantic information of the second sub-feature map by using the depth separable convolution structure comprises:
inputting the second sub-feature map into a depth separable convolution layer to obtain a depth separable convolution result;
and processing the depth separable convolution result by using the batch normalization layer and the activation function in sequence to obtain the local semantic information of the second sub-feature map.
Optionally, performing feature fusion on the global semantic information and the local semantic information to obtain a fusion feature, including:
performing feature fusion on the global semantic information and the local semantic information to obtain a fusion result;
and adjusting the channel sequence of the fusion result through channel Shuffle operation to obtain the fusion characteristic.
Optionally, the down-sampling module includes a plurality of parallel convolution modules and a channel attention module; all the convolution modules have the same stride, and the convolution kernels of any two convolution modules differ in size;
correspondingly, the extracting the downsampling feature map of the original image by using the downsampling module comprises the following steps:
inputting the original image into each convolution module respectively;
and performing feature screening on output results of all the convolution modules by using the same channel attention module to obtain a down-sampling feature map of the original image.
Optionally, training the image semantic segmentation model by using the output features includes:
training the image semantic segmentation model by using a randomly sampled loss function and the output features;
and the randomly sampled loss function calculates loss values from a randomly sampled subset of the pixel points in the output features.
The application also provides an image semantic segmentation device, which comprises a down-sampling module, a linear coding module, a sparse multilayer perceptron module and a training module;
the down-sampling module is used for acquiring an original image and extracting a down-sampling feature map of the original image;
the linear coding module is used for performing linear coding on the down-sampling feature map to obtain a linear coding feature map;
the sparse multilayer perceptron module is used for performing feature extraction on the linear coding feature map to obtain output features; the output features comprise global semantic information and local semantic information, and the sparse multilayer perceptron module comprises a sparse multilayer perceptron, a depth separable convolution structure and a residual structure;
and the training module is used for training the image semantic segmentation model by using the output features, so as to execute an image semantic segmentation task with the trained image semantic segmentation model.
The application also provides a storage medium on which a computer program is stored; when the computer program is executed, the steps of the above image semantic segmentation method are implemented.
The application also provides an electronic device comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor implements the steps of the above image semantic segmentation method when calling the computer program in the memory.
The application provides an image semantic segmentation method applied to an electronic device comprising an image semantic segmentation model, wherein the image semantic segmentation model comprises a down-sampling module and a sparse multilayer perceptron module, and the image semantic segmentation method comprises the following steps: acquiring an original image, and extracting a down-sampling feature map of the original image by using the down-sampling module; carrying out linear coding on the down-sampling feature map to obtain a linear coding feature map; performing feature extraction on the linear coding feature map by using the sparse multilayer perceptron module to obtain output features, where the output features comprise global semantic information and local semantic information, and the sparse multilayer perceptron module comprises a sparse multilayer perceptron, a depth separable convolution structure and a residual structure; and training the image semantic segmentation model by using the output features, and executing an image semantic segmentation task by using the trained image semantic segmentation model.
According to the method, the original image is down-sampled and linearly coded to obtain a linear coding feature map, and feature extraction is performed on the linear coding feature map by the sparse multilayer perceptron module to obtain output features. The output features comprise both global semantic information and local semantic information, and an image semantic segmentation model trained on these output features can balance the speed and precision of image semantic segmentation, so that the image semantic segmentation operation runs faster while achieving higher precision. The application also provides an image semantic segmentation apparatus, a storage medium and an electronic device with the same beneficial effects, which are not repeated herein.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of an image semantic segmentation method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a network structure of a Swin Transformer according to an embodiment of the present application;
fig. 3 is a diagram of an aggregation structure of multi-scale spatial information provided in an embodiment of the present application;
FIG. 4 is a block diagram of a sparse multi-layered perceptron module according to an embodiment of the present disclosure;
FIG. 5 is a flowchart of obtaining global context information of a feature according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of a depth separable convolution module according to an embodiment of the present application;
fig. 7 is a structural diagram of a feature fusion module according to an embodiment of the present application;
fig. 8 is a diagram illustrating random sampling comparison of a loss function according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a flowchart of an image semantic segmentation method according to an embodiment of the present disclosure.
The specific steps may include:
s101: acquiring an original image, and extracting a downsampling feature map of the original image by using the downsampling module;
the embodiment can be applied to electronic equipment comprising an image semantic segmentation model, and the image semantic segmentation model realizes operations such as matting and background removal through image semantic segmentation. The present embodiment does not limit the type of the image semantic segmentation model, and may include Swin Transformer, SegNet, PSPNet, and the like, for example. The image semantic segmentation model can comprise a down-sampling module and a sparse multilayer perceptron module.
The step may be preceded by an operation of acquiring raw images from the sample set, and the number of raw images is not limited herein. After the original image is input into the down-sampling module, the down-sampling module can perform feature extraction on the original image to obtain a down-sampling feature map. In particular, the downsampling module may be a single convolution layer, a single pooling layer, a Focus structure, or a plurality of convolution modules connected in parallel.
S102: carrying out linear coding on the down-sampling feature map to obtain a linear coding feature map;
the image semantic segmentation model may further include a linear coding module (e.g., an embedding module), and the downsampled feature map is input into the linear coding module and then is subjected to linear coding to obtain a linear coding feature map.
S103: performing feature extraction on the linear coding feature map by using the sparse multilayer perceptron module to obtain output features;
wherein the sparse multilayer perceptron module comprises a sparse multilayer perceptron, a depth separable convolution structure and a residual structure; the sparse multilayer perceptron is used for extracting global semantic information of the linear coding feature map; the depth separable convolution structure is used for extracting local semantic information of the linear coding feature map; and the residual structure, also called a residual connection, is used for alleviating the gradient vanishing or gradient explosion caused by increasing depth in a deep neural network. In this embodiment, fully connected layers (rather than self-attention) may be used to obtain the global semantic information, with the depth separable convolution providing the local semantic information.
S104: and training the image semantic segmentation model by using the output characteristics, and executing an image semantic segmentation task by using the trained image semantic segmentation model.
After the output features including the global semantic information and the local semantic information are obtained, the image semantic segmentation model is trained by using the output features. And after receiving the image semantic segmentation task, executing the image semantic segmentation task by using the trained image semantic segmentation model.
Specifically, the image semantic segmentation model trained in this embodiment may be applied to the decision device of an autonomous vehicle, and the image semantic segmentation task is executed as follows: the camera of the autonomous vehicle captures an image of the surrounding environment, an image semantic segmentation task is generated from that image, and the trained image semantic segmentation model processes the surrounding environment image and outputs an image semantic segmentation result. The segmentation result may include road information such as other vehicles and pedestrians in the surrounding environment image, and a driving route is planned according to the segmentation result.
In this embodiment, the original image is down-sampled and linearly coded to obtain a linear coding feature map, and feature extraction is performed on the linear coding feature map by the sparse multilayer perceptron module to obtain output features. The output features comprise both global semantic information and local semantic information, and an image semantic segmentation model trained on these output features can balance the speed and precision of image semantic segmentation, so that the image semantic segmentation operation runs faster while achieving higher precision.
As a further introduction to the embodiment corresponding to fig. 1, the down-sampling module may include a plurality of parallel convolution modules and a channel attention module; each of the parallel convolution modules has the same stride, and the convolution kernels of any two convolution modules differ in size. Further, the process of extracting the down-sampling feature map of the original image by using the down-sampling module in S101 includes: inputting the original image into each convolution module respectively; and performing feature screening on the output results of all the convolution modules by using the same channel attention module to obtain the down-sampling feature map of the original image.
To address the severe information loss and poor small-target segmentation that occur when a Transformer down-samples the input, this scheme provides a multi-scale spatial information aggregation method: several convolution kernels of different sizes extract features from the image at different scales, which enhances the feature extraction capability of the model and improves its segmentation of small targets. The above embodiments may also use a channel attention mechanism to screen the extracted features, both preserving important features and reducing the influence of useless features on the deeper network.
As a further description of the embodiment corresponding to fig. 1, S103 may obtain the output features through the following steps:
step 1: and splitting a channel of a preset proportion on the linear coding characteristic diagram to obtain a first sub characteristic diagram and a second sub characteristic diagram.
In the step, the linear coding feature map is split according to the channel dimension in a preset proportion, and the obtained first sub feature map and the second sub feature map have the same channel number. The preset ratio may be any value, and optionally, the preset ratio may be 1: 1, channel splitting is carried out in equal proportion.
Step 2: extracting global semantic information of the first sub-feature map by using the sparse multilayer perceptron.
The sparse multilayer perceptron comprises a first processing branch and a second processing branch. The first sub-feature map can be copied into two parts, which are input into the first processing branch and the second processing branch respectively for processing; the specific process is as follows:
correspondingly, extracting the global semantic information of the first sub-feature map by using the sparse multilayer perceptron includes: interacting the pixels in the same row of the first sub-feature map through a fully connected layer in the first processing branch to obtain a row processing result; and interacting the pixels in the same column of the first sub-feature map through a fully connected layer in the second processing branch to obtain a column processing result. The operations of the two processing branches may be executed simultaneously or sequentially.
After the operations of both processing branches are completed, the global semantic information of the first sub-feature map may be generated from the row processing result and the column processing result.
Step 3: extracting local semantic information of the second sub-feature map by using the depth separable convolution structure.
the depth separable convolution structure comprises a depth separable convolution layer, a batch normalization layer and an activation function, and the process of extracting the local semantic information is as follows: inputting the second sub-feature map into the depth separable convolution layer to obtain a depth separable convolution result; and processing the depth separable convolution result by using a batch normalization layer and the activation function in sequence to obtain the local semantic information of the second sub-feature map.
Step 4: performing feature fusion on the global semantic information and the local semantic information to obtain a fusion feature.
specifically, the present embodiment may perform feature fusion on the global semantic information and the local semantic information to obtain a fusion result; and adjusting the channel sequence of the fusion result through channel Shuffle operation to obtain the fusion characteristic.
Step 5: inputting the fusion feature into the residual structure to obtain the output feature.
Specifically, this embodiment may add the fusion feature to the skip-connection branch of the residual structure to obtain the output feature. To address the large amount of matrix computation in the self-attention of a Transformer network and its lack of local semantic information, this embodiment provides a sparse multilayer perceptron structure, which reduces the number of model parameters through sparse connections and parameter sharing, reduces the complexity of the model, and reduces model inference time while maintaining model precision.
As a further introduction to the embodiment corresponding to fig. 1, S104 may train the image semantic segmentation model as follows: training the image semantic segmentation model by using a randomly sampled loss function and the output features, where the randomly sampled loss function calculates loss values from a randomly sampled subset of the pixel points in the output features. To address the difficulty of acquiring semantic segmentation data and to alleviate model overfitting, this scheme provides a semantic segmentation training scheme based on a randomly sampled loss function: randomly sampling the labels when computing the model gradient brings a regularization effect and accelerates model convergence.
The flow described in the above embodiment is explained below by an embodiment in practical use.
In the field of autonomous driving, balancing the speed and precision of image semantic segmentation algorithms, and fusing global semantic information with local position information, have gradually become bottlenecks in the development of autonomous driving technology. Semantic segmentation algorithms based on convolutional neural networks are limited by the receptive field of the convolution kernel, and a shallow network has difficulty acquiring the global semantic information of the image, so the edge segmentation of large targets in real scenes is poor. To solve the global semantic information problem, this scheme adopts a Transformer-based network to extract image features; however, when a Transformer network down-samples the initial image, information loss is severe and small-target segmentation is extremely poor. This scheme therefore provides a multi-scale spatial information aggregation method as the down-sampling method of the Transformer network, so that the network model has rich spatial position information in its shallow layers, improving small-target segmentation. Because the self-attention in a Transformer network involves a large amount of matrix computation, convergence during network model training is slow, and self-attention lacks the local semantic information of the image; this scheme therefore provides a sparse multilayer perceptron structure that reduces the computation of the model and enhances its local feature expression capability. To alleviate overfitting during model training, the loss is computed over a random sample of the network model's output and the labels, the loss value is back-propagated to obtain the gradient, and the gradient then updates the parameters of the model, accelerating model convergence.
The semantic segmentation method based on the sparse multilayer perceptron in this scheme is divided into three links: an image down-sampling process based on the multi-scale spatial information aggregation layer, a feature extraction process of the sparse multilayer perceptron structure, and a model training process based on randomly sampled loss computation.
Taking Swin Transformer as an example, the 4×4 convolution in the image down-sampling Patch Partition layer is replaced with the multi-scale spatial information aggregation layer, and the self-attention module is removed and replaced with the sparse multilayer perceptron. Referring to fig. 2, fig. 2 is a schematic diagram of the network structure of the Swin Transformer according to an embodiment of the present application, where N1, N2, N3 and N4 are the stacking counts of the sparse MLP modules. The four stages of the whole network produce feature maps whose width and height are 1/4, 1/8, 1/16 and 1/32 of the original input image, respectively. In fig. 2, H denotes height, W denotes width, C denotes the number of channels, Patch Partition denotes the image splitting layer, Linear Embedding denotes the linear coding module, Sparse MLP denotes the sparse multilayer perceptron, and Patch Gathering denotes the multi-scale spatial information aggregation layer.
The image downsampling process based on the multi-scale spatial information aggregation layer is as follows:
the multi-scale spatial information aggregation module is a down-sampling module for extracting features of an original image, and the specific operation of the multi-scale spatial information aggregation module is to introduce parallel convolution modules with the same step size but different convolution kernel sizes. The present embodiment takes convolution kernels with scales of 4 × 4, 8 × 8, 16 × 16, and 32 × 32 and a standard convolution module with a step size of 4 as examples, and the calculation formula is shown in formula 1. In other scenes, the number of corresponding convolution modules, the size of convolution kernels and the step length can be adjusted according to actual conditions.
Out = Concat(Conv1(in), Conv2(in), Conv3(in), Conv4(in))    (Equation 1)
where Out denotes the concatenated output of all convolution modules, Concat denotes the merge function, and Conv1(in), Conv2(in), Conv3(in) and Conv4(in) denote the feature maps output by each convolution module.
After the parallel convolution structure, a feature selection operation is performed on the output of each convolution layer: feature screening is carried out with a shared channel attention module (Channel Attention), where parameter sharing reduces the parameter count of the model; finally, a Concat operation merges the four feature maps into the down-sampling feature map. Fig. 3 is the multi-scale spatial information aggregation structure diagram provided by this embodiment of the application; as shown in fig. 3, this embodiment preserves the important features while reducing the influence of image noise on the deep network features.
This embodiment provides the multi-scale spatial information aggregation structure, which effectively alleviates the severe loss of small-target position information when the initial image is down-sampled in a Transformer, significantly improves small-target semantic segmentation, and can be applied to other computer vision fields in a plug-and-play manner.
The feature extraction process of the sparse multi-layer perceptron structure is as follows:
after down-sampling the image, the feature map is sent to an embedding module to perform linear coding operation (as shown in fig. 2) to obtain a linear coding feature map, and then a sparse multi-layer perceptron module is used to perform feature extraction on the feature map, so as to extract global semantic information (also called high-level semantic information) and local semantic information (also called local position information). The sparse multilayer perceptron module consists of a sparse multilayer perceptron, a depth separable convolution and a residual error structure. Fig. 4 is a block diagram of the whole module, and fig. 4 is a block diagram of a sparse multi-layered perceptron module according to an embodiment of the present application. In FIG. 4, Channel Split denotes the ratio as (1-r): r, Sparse MLP represents a Sparse multi-layer perceptron, 3 × 3DWConv (Depthwise convolution) represents depth separable convolution with a convolution kernel size of 3, BN (batch normalization) is an algorithm for accelerating neural network training, convergence speed and stability, which is often used in a depth network, ReLU represents an activation function, and Shuffle represents a function for randomly ordering all elements of a sequence.
The specific operation of the sparse multilayer perceptron module for processing the feature map is divided into the following five parts: (1) splitting the feature map channels; (2) extracting global information with the MLP module; (3) extracting local information with the depth separable convolution; (4) fusing the global and local information; and (5) building blocks with the residual structure.
(1) Splitting the feature map channels: for an input linear coding feature map f ∈ R^(C×H×W), the channel dimension of f is split in a preset proportion to obtain a first sub-feature map f_1 ∈ R^(C1×H×W) and a second sub-feature map f_2 ∈ R^(C2×H×W). Taking an equal split (i.e., r = 0.5) as an example, C1 = C2 and C1 + C2 = C, where C1, C2 and C denote the numbers of channels in the feature maps, and H, W denote the height and width of the feature map.
(2) The MLP module extracts global semantic information: for the obtained first sub-feature map f_1 ∈ R^(C1×H×W), two branches of fully connected layers calculate global semantic information over the pixels of the same row and the same column, respectively. For the row branch, a permute operation is carried out on the first sub-feature map to obtain a feature map of shape C1×W×H; this is processed by a fully connected layer with H input and output channels, and a permute operation is performed on the result to obtain the row processing result f_H ∈ R^(C1×H×W), namely:

f_H = linear(permute(f_1))    (Equation 2)

Similarly, for the column branch, the first sub-feature map is processed by a fully connected layer with W input and output channels to obtain the column processing result f_W ∈ R^(C1×H×W), namely:

f_W = linear(f_1)    (Equation 3)

Finally, dot multiplication of the row processing result and the column processing result gives a matrix of dimension [C1, H, W], which is dot-multiplied with the first sub-feature map to obtain the global semantic information f_smlp ∈ R^(C1×H×W), namely:

f_smlp = (f_H × f_W) × f_1    (Equation 4)
In the above process, a flowchart for obtaining global context information of a feature by using parallel branches is shown in fig. 5.
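A minimal PyTorch sketch of this sparse multilayer perceptron follows. H and W must be fixed at construction time because the fully connected layers act along those axes, and the "dot multiplication" of Equation 4 is interpreted here as element-wise multiplication (consistent with the [C1, H, W] dimensions stated above); both points are assumptions of the sketch.

```python
import torch.nn as nn

class SparseMLP(nn.Module):
    """Row/column fully connected branches of Equations 2-4 (sketch)."""
    def __init__(self, h, w):
        super().__init__()
        self.fc_h = nn.Linear(h, h)  # H input and output channels (row branch)
        self.fc_w = nn.Linear(w, w)  # W input and output channels (column branch)

    def forward(self, f1):  # f1: (B, C1, H, W)
        # Equation 2: permute so H is the last axis, mix, then permute back.
        f_h = self.fc_h(f1.transpose(2, 3)).transpose(2, 3)
        # Equation 3: the last axis is already W, so no permute is needed.
        f_w = self.fc_w(f1)
        # Equation 4: combine both results with the input itself, element-wise.
        return f_h * f_w * f1
```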
(3) The depth separable convolution extracts local semantic information: this embodiment uses a depth separable convolution to enhance the local information of the features. The second sub-feature map f_2 obtained in step (1) is used as the input feature and processed by a depth separable convolution with a kernel size of 3, giving an output of dimension C2×H×W. A batch normalization layer (BN) and a nonlinear activation function (ReLU) are added after the depth separable convolution layer to process this output and finally obtain the local semantic information f_dw ∈ R^(C2×H×W). Normalization and the nonlinear activation function are used to speed up model convergence and enhance the generalization capability of the model. The whole process of the depth separable convolution module is shown in fig. 6, and the calculation formula is:

f_dw = ReLU(BN(DWConv(f_2)))    (Equation 5)
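A PyTorch sketch of this branch is given below; the kernel size of 3 and the BN/ReLU ordering follow Equation 5, while the class name and the padding choice (padding=1 to preserve H×W) are assumptions.

```python
import torch.nn as nn

class LocalBranch(nn.Module):
    """Depth separable convolution branch of Equation 5 (sketch)."""
    def __init__(self, c2):
        super().__init__()
        # groups=c2 makes the 3x3 convolution depthwise (one filter per channel).
        self.dw = nn.Conv2d(c2, c2, kernel_size=3, padding=1, groups=c2)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, f2):  # f2: (B, C2, H, W)
        return self.act(self.bn(self.dw(f2)))  # f_dw = ReLU(BN(DWConv(f2)))
```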
(4) Fusing the global and local information: the output f_smlp of the sparse multilayer perceptron layer and the output f_dw of the depth separable convolution module undergo Concat fusion, and the channel order of the feature map is rearranged by the channel Shuffle operation to obtain the fusion feature f_out. The Shuffle operation performs a channel exchange, which reduces both the parameter count and the computation of the model. The structure of the feature fusion module is shown in fig. 7, where C1 = C2 = C/2, and the calculation formula is:

f_out = Shuffle(Concat(f_smlp, f_dw))    (Equation 6)
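The fusion step can be sketched as follows; the deterministic ShuffleNet-style grouped shuffle and the group count of 2 (one group per branch) are assumptions consistent with the channel-exchange role described above.

```python
import torch

def shuffle_concat(f_smlp, f_dw, groups=2):
    """Equation 6 (sketch): Concat the two branches, then channel shuffle."""
    f = torch.cat([f_smlp, f_dw], dim=1)  # fuse along the channel dimension
    b, c, h, w = f.shape
    # ShuffleNet-style shuffle: split channels into groups, transpose, flatten.
    return f.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)
```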
(5) Building blocks with the residual structure: a residual structure is used to alleviate the gradient vanishing or gradient explosion caused by increasing depth in a deep neural network, and the following calculation gives the output f_res of the whole sparse multilayer perceptron module (i.e., the output features). The whole sparse MLP module serves as one block structure, and a plurality of block structures are constructed to complete the final design of the network.

f_res = f + f_out    (Equation 7)
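Putting the pieces together, a sketch of one whole block (channel split, two branches, shuffle fusion, residual) might read as follows; it reuses the SparseMLP, LocalBranch and shuffle_concat sketches above, and the stacking of several such blocks per stage (N1 to N4 in fig. 2) is left to the surrounding network. The shapes in the usage comment are illustrative.

```python
import torch
import torch.nn as nn

class SparseMLPBlock(nn.Module):
    """One block of Fig. 4, assembled from the sketches above (assumptions carried over)."""
    def __init__(self, c, h, w, r=0.5):
        super().__init__()
        self.c1 = int(c * (1 - r))            # Channel Split in the ratio (1-r):r
        self.global_branch = SparseMLP(h, w)  # defined in the earlier sketch
        self.local_branch = LocalBranch(c - self.c1)

    def forward(self, f):  # f: (B, C, H, W)
        f1, f2 = torch.split(f, [self.c1, f.shape[1] - self.c1], dim=1)
        f_out = shuffle_concat(self.global_branch(f1), self.local_branch(f2))
        return f + f_out  # Equation 7: the residual connection

# Example usage on a stage-1-sized feature map:
# block = SparseMLPBlock(c=96, h=56, w=56)
# y = block(torch.randn(2, 96, 56, 56))  # y keeps the shape (2, 96, 56, 56)
```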
This embodiment provides an efficient sparse multilayer perceptron structure. The structure uses only two MLP branches to compute, for each pixel position on the feature map, the context information of its row and column, while a depth separable convolution extracts the local features of the image. Compared with self-attention, the computation of the model is significantly reduced, the local semantic information of the image is captured, and the feature expression capability of the model is improved; compared with a convolutional neural network architecture, global semantic information is easier to acquire. In addition, the structure can be embedded into other neural network models in a plug-and-play manner. This embodiment thus extracts the global and local semantic information of the image simply and efficiently and improves the feature expression capability of the semantic segmentation model.
The model training process based on randomly sampled loss computation is as follows:
the loss functions used in the conventional semantic segmentation basically use cross entropy loss functions and Dice loss functions. When the depth and the width of the model are small, the overfitting risk of the model is small, but because the semantic segmentation data acquisition difficulty is large, the labeling is time-consuming, the available data is relatively few, the model with the large depth and the large width is generally adopted, and the overfitting of the model is easily caused. Aiming at the problem, the scheme provides a random sampling loss function training method, the final gradient is calculated through random sampling, the model convergence can be effectively accelerated, and an additional regularization effect is brought.
The cross entropy loss function commonly used at present is calculated by the following formula:

CE = -Σ_i y_i · log(p_i)

where y_i is the label of pixel i and p_i is the predicted probability for pixel i.
The conventional cross entropy loss function works well when the network has few parameters. As shown in fig. 8, the usual semantic segmentation loss computation directly computes the loss between the output and all the corresponding labels, then computes all gradients through back propagation and updates the parameters. The randomly sampled loss computation proposed by this scheme instead randomly samples the output image and computes the loss value only between the selected pixels and their corresponding labels; unselected pixels take part in no computation. Fig. 8 shows the feature map before loss computation, the usual loss function computation, and the randomly sampled computation, where the gray squares represent the pixel points participating in the loss computation. The model training method that computes the loss by random sampling is implemented as follows:
taking a cityscaps data set as an example, the size of an input image is 512 × 1024, feature maps with the sizes of 1/4, 1/8, 1/16 and 1/32 of the width and the height of the input image are obtained through the processing of the first step and the second step, and the four feature maps are aggregated through FPN (feature Pyramid networks) information and processed through a UPERNet network to obtain a 512 × 1024 semantic segmentation result map. And constructing a random sampling loss function based on binary cross entropy after model forward reasoning, and randomly sampling 128 multiplied by 256 pixel points to calculate a cross entropy loss value. I.e. 32768 pixels are sampled for the calculation of the loss value, followed by the corresponding back propagation and parameter update. The loss calculation formula loss is as follows:
Figure BDA0003622349640000131
k is the number of selected pixels, p represents the probability of correct prediction, y i Is the label of sample i.
Calculating the loss in this way accelerates model training and convergence and reduces GPU memory occupancy during training. Because the sampling is random, the parameter update carries randomness, and this random parameter update is more effective for semantic segmentation than common regularization methods; when the model is more complex, computing the loss value by random sampling slows down overfitting more easily while making the model easier to converge, greatly alleviating model overfitting.
This embodiment thus provides an efficient semantic segmentation model training method based on randomly sampled loss computation: computing the loss over a random sample of the network model's output brings a better regularization effect, promotes model convergence, and improves the generalization performance of the model.
The image semantic segmentation device provided by the embodiment of the application comprises a down-sampling module, a linear coding module, a sparse multilayer perceptron module and a training module;
the down-sampling module is used for acquiring an original image and extracting a down-sampling feature map of the original image;
the linear coding module is used for performing linear coding on the down-sampling feature map to obtain a linear coding feature map;
the sparse multilayer perceptron module is used for performing feature extraction on the linear coding feature map to obtain output features; the output features comprise global semantic information and local semantic information, and the sparse multilayer perceptron module comprises a sparse multilayer perceptron, a depth separable convolution structure and a residual structure;
and the training module is used for training the image semantic segmentation model by using the output features, so as to execute an image semantic segmentation task with the trained image semantic segmentation model.
In this embodiment, the original image is down-sampled and linearly coded to obtain a linear coding feature map, and feature extraction is performed on the linear coding feature map by the sparse multilayer perceptron module to obtain output features. The output features comprise both global semantic information and local semantic information, and an image semantic segmentation model trained on these output features can balance the speed and precision of image semantic segmentation, so that the image semantic segmentation operation runs faster while achieving higher precision.
Further, the sparse multilayer perceptron module is used for carrying out channel splitting on the linear coding feature map in a preset proportion to obtain a first sub-feature map and a second sub-feature map; the sparse multilayer perceptron is further used for extracting global semantic information of the first sub-feature map; the depth separable convolution structure is further used for extracting local semantic information of the second sub-feature map; the module is further used for performing feature fusion on the global semantic information and the local semantic information to obtain a fusion feature; and for inputting the fusion feature into the residual structure to obtain the output feature.
Further, the sparse multilayer perceptron comprises a first processing branch and a second processing branch;
correspondingly, the process by which the sparse multilayer perceptron module extracts the global semantic information of the first sub-feature map with the sparse multilayer perceptron comprises the following steps: interacting the pixels in the same row of the first sub-feature map through a fully connected layer in the first processing branch to obtain a row processing result; interacting the pixels in the same column of the first sub-feature map through a fully connected layer in the second processing branch to obtain a column processing result; and generating the global semantic information of the first sub-feature map according to the row processing result and the column processing result.
Further, the depth-separable convolution structure includes a depth-separable convolution layer, a bulk normalization layer, and an activation function;
correspondingly, the process of the sparse multi-layer perceptron module for extracting the local semantic information of the second sub-feature map by using the depth separable convolution structure comprises the following steps: inputting the second sub-feature map into a depth separable convolution layer to obtain a depth separable convolution result; and processing the depth separable convolution result by using the batch normalization layer and the activation function in sequence to obtain the local semantic information of the second sub-feature map.
Further, the process of performing feature fusion on the global semantic information and the local semantic information by the sparse multilayer perceptron module to obtain fusion features includes: performing feature fusion on the global semantic information and the local semantic information to obtain a fusion result; and adjusting the channel sequence of the fusion result through channel Shuffle operation to obtain the fusion characteristic.
Further, the down-sampling module comprises a plurality of parallel convolution modules and a channel attention module; all the convolution modules have the same stride, and the convolution kernels of any two convolution modules differ in size;
correspondingly, the down-sampling module is used for inputting the original image into each convolution module respectively, and for performing feature screening on the output results of all the convolution modules with the same channel attention module to obtain the down-sampling feature map of the original image.
Further, the training module is used for training the image semantic segmentation model by using a randomly sampled loss function and the output features; the randomly sampled loss function calculates loss values from a randomly sampled subset of the pixel points in the output features.
Since the embodiments of the apparatus portion and the method portion correspond to each other, please refer to the description of the embodiments of the method portion for the embodiments of the apparatus portion, which is not repeated here.
The present application also provides a storage medium having a computer program stored thereon, which when executed may implement the steps provided by the above-described embodiments. The storage medium may include: various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The application further provides an electronic device, which may include a memory and a processor, where the memory stores a computer program, and the processor may implement the steps provided by the foregoing embodiments when calling the computer program in the memory. Of course, the electronic device may also include various network interfaces, power supplies, and the like.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. An image semantic segmentation method is applied to electronic equipment comprising an image semantic segmentation model, wherein the image semantic segmentation model comprises a down-sampling module and a sparse multi-layer perceptron module, and the image semantic segmentation method comprises the following steps:
acquiring an original image, and extracting a downsampling feature map of the original image by using the downsampling module;
carrying out linear coding on the down-sampling feature map to obtain a linear coding feature map;
performing feature extraction on the linear coding feature map by using the sparse multilayer perceptron module to obtain output features; the output features comprise global semantic information and local semantic information, and the sparse multilayer perceptron module comprises a sparse multilayer perceptron, a depth separable convolution structure and a residual structure;
and training the image semantic segmentation model by using the output features, and executing an image semantic segmentation task by using the trained image semantic segmentation model.
2. The image semantic segmentation method according to claim 1, wherein the extracting features of the linear coding feature map by using the sparse multilayer perceptron module to obtain output features comprises:
channel splitting is carried out on the linear coding feature map in a preset proportion to obtain a first sub-feature map and a second sub-feature map;
extracting global semantic information of the first sub-feature map by using the sparse multilayer perceptron;
extracting local semantic information of the second sub-feature map by using the depth separable convolution structure;
performing feature fusion on the global semantic information and the local semantic information to obtain fusion features;
and inputting the fusion feature into the residual structure to obtain the output feature.
3. The image semantic segmentation method according to claim 2, wherein the sparse multilayer perceptron comprises a first processing branch and a second processing branch;
correspondingly, the extracting, by using the sparse multi-layer perceptron, global semantic information of the first sub-feature map includes:
interacting the pixels in the same row of the first sub-feature map through a fully connected layer in the first processing branch to obtain a row processing result;
interacting the pixels in the same column of the first sub-feature map through a fully connected layer in the second processing branch to obtain a column processing result;
and generating global semantic information of the first sub-feature map according to the row processing result and the column processing result.
4. The image semantic segmentation method according to claim 2, wherein the depth-separable convolution structure includes a depth-separable convolution layer, a batch normalization layer, and an activation function;
correspondingly, extracting the local semantic information of the second sub-feature map by using the depth separable convolution structure comprises:
inputting the second sub-feature map into a depth separable convolution layer to obtain a depth separable convolution result;
and processing the depth separable convolution result by using the batch normalization layer and the activation function in sequence to obtain the local semantic information of the second sub-feature map.
5. The image semantic segmentation method according to claim 2, wherein the performing feature fusion on the global semantic information and the local semantic information to obtain a fusion feature comprises:
performing feature fusion on the global semantic information and the local semantic information to obtain a fusion result;
and adjusting the channel sequence of the fusion result through channel Shuffle operation to obtain the fusion characteristic.
6. The image semantic segmentation method according to claim 1, wherein the downsampling module comprises a plurality of parallel convolution modules and a channel attention module; all the convolution modules have the same stride, and the convolution kernels of any two convolution modules differ in size;
correspondingly, the extracting the downsampling feature map of the original image by using the downsampling module comprises the following steps:
inputting the original image into each convolution module respectively;
and performing feature screening on the output results of all the convolution modules by using the same channel attention module to obtain a down-sampling feature map of the original image.
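One way claim 6's down-sampling module could look, assuming three odd kernel sizes, stride 2, and an SE-style squeeze-and-excitation block as the shared channel attention module (the claim does not fix the attention mechanism). With odd kernels, padding of k // 2, and equal strides, all branch outputs have identical spatial sizes, so they can be concatenated.

```python
import torch
import torch.nn as nn

class ParallelDownsample(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_sizes=(3, 5, 7), stride=2):
        super().__init__()
        # same stride everywhere, different kernel size per branch
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2)
            for k in kernel_sizes
        )
        total = out_ch * len(kernel_sizes)
        # one channel attention module shared across all branch outputs (assumed SE-style)
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(total, total // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(total // 4, total, 1), nn.Sigmoid(),
        )
        self.project = nn.Conv2d(total, out_ch, 1)  # fuse back to out_ch channels

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.project(feats * self.attention(feats))  # feature screening
```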
7. The image semantic segmentation method according to claim 1, wherein training the image semantic segmentation model by using the output features comprises:
training the image semantic segmentation model by using a random-sampling loss function and the output features;
wherein the random-sampling loss function calculates a loss value from a randomly sampled subset of the pixel points in the output features.
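Claim 7's loss touches only a random subset of output pixels per step; below is a sketch assuming per-pixel cross entropy and a 10% sampling ratio, neither of which is specified by the claim.

```python
import torch
import torch.nn.functional as F

def random_sample_loss(logits, target, sample_ratio=0.1):
    # logits: (N, C, H, W) class scores; target: (N, H, W) integer labels
    n, c, h, w = logits.shape
    flat_logits = logits.permute(0, 2, 3, 1).reshape(-1, c)
    flat_target = target.reshape(-1)
    num = max(1, int(flat_target.numel() * sample_ratio))
    idx = torch.randperm(flat_target.numel(), device=flat_target.device)[:num]
    # the loss is computed only on the randomly sampled pixel points
    return F.cross_entropy(flat_logits[idx], flat_target[idx])
```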
8. An image semantic segmentation device, comprising a down-sampling module, a linear coding module, a sparse multilayer perceptron module, and a training module; wherein
the down-sampling module is configured to acquire an original image and extract a down-sampling feature map of the original image;
the linear coding module is configured to perform linear coding on the down-sampling feature map to obtain a linear coding feature map;
the sparse multilayer perceptron module is configured to perform feature extraction on the linear coding feature map to obtain output features, wherein the output features comprise global semantic information and local semantic information, and the sparse multilayer perceptron module comprises a sparse multilayer perceptron, a depth separable convolution structure, and a residual structure;
and the training module is configured to train the image semantic segmentation model by using the output features, so that an image semantic segmentation task is performed by using the trained image semantic segmentation model.
9. An electronic device, comprising a memory storing a computer program and a processor, wherein the processor, when invoking the computer program in the memory, implements the steps of the image semantic segmentation method according to any one of claims 1 to 7.
10. A storage medium having stored thereon computer-executable instructions which, when loaded and executed by a processor, carry out the steps of the image semantic segmentation method according to any one of claims 1 to 7.
CN202210461550.8A 2022-04-28 2022-04-28 Image semantic segmentation method and device, electronic equipment and storage medium Pending CN114821058A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210461550.8A CN114821058A (en) 2022-04-28 2022-04-28 Image semantic segmentation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210461550.8A CN114821058A (en) 2022-04-28 2022-04-28 Image semantic segmentation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114821058A 2022-07-29

Family

ID=82508567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210461550.8A Pending CN114821058A (en) 2022-04-28 2022-04-28 Image semantic segmentation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114821058A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471765A (en) * 2022-11-02 2022-12-13 广东工业大学 Semantic segmentation method, device and equipment for aerial image and storage medium
CN116071376A (en) * 2023-04-04 2023-05-05 江苏势通生物科技有限公司 Image segmentation method, related device, equipment and storage medium
CN116612288A (en) * 2023-07-19 2023-08-18 南京信息工程大学 Multi-scale lightweight real-time semantic segmentation method and system
CN116612288B (en) * 2023-07-19 2023-11-07 南京信息工程大学 Multi-scale lightweight real-time semantic segmentation method and system
CN117708706A (en) * 2024-02-06 2024-03-15 山东未来网络研究院(紫金山实验室工业互联网创新应用基地) Method and system for classifying breast tumors by enhancing and selecting end-to-end characteristics
CN117708706B (en) * 2024-02-06 2024-05-28 山东未来网络研究院(紫金山实验室工业互联网创新应用基地) Method and system for classifying breast tumors by enhancing and selecting end-to-end characteristics

Similar Documents

Publication Publication Date Title
CN114821058A (en) Image semantic segmentation method and device, electronic equipment and storage medium
CN111144329B (en) Multi-label-based lightweight rapid crowd counting method
CN112308200B (en) Searching method and device for neural network
CN110458085B (en) Video behavior identification method based on attention-enhanced three-dimensional space-time representation learning
CN112561027A (en) Neural network architecture searching method, image processing method, device and storage medium
CN111523546A (en) Image semantic segmentation method, system and computer storage medium
CN111696110B (en) Scene segmentation method and system
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN112487207A (en) Image multi-label classification method and device, computer equipment and storage medium
CN112927209B (en) CNN-based significance detection system and method
CN110866938B (en) Full-automatic video moving object segmentation method
WO2023024406A1 (en) Data distillation method and apparatus, device, storage medium, computer program, and product
CN114627035A (en) Multi-focus image fusion method, system, device and storage medium
CN112861727A (en) Real-time semantic segmentation method based on mixed depth separable convolution
CN113159236A (en) Multi-focus image fusion method and device based on multi-scale transformation
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
CN114596233A (en) Attention-guiding and multi-scale feature fusion-based low-illumination image enhancement method
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN115082306A (en) Image super-resolution method based on blueprint separable residual error network
CN111814693A (en) Marine ship identification method based on deep learning
CN116168197A (en) Image segmentation method based on Transformer segmentation network and regularization training
CN116612283A (en) Image semantic segmentation method based on large convolution kernel backbone network
CN110599495A (en) Image segmentation method based on semantic information mining
CN116797830A (en) Image risk classification method and device based on YOLOv7
CN113887419B (en) Human behavior recognition method and system based on extracted video space-time information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination