CN114445430A - Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion - Google Patents

Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion

Info

Publication number
CN114445430A
Authority
CN
China
Prior art keywords
feature map
fusion
feature
image
module
Prior art date
Legal status
Granted
Application number
CN202210365378.6A
Other languages
Chinese (zh)
Other versions
CN114445430B (en)
Inventor
石敏
沈佳林
易清明
骆爱文
戴国帅
Current Assignee
Jinan University
Original Assignee
Jinan University
Priority date
Filing date
Publication date
Application filed by Jinan University
Priority to CN202210365378.6A
Publication of CN114445430A
Application granted
Publication of CN114445430B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a lightweight multi-scale feature fusion real-time image semantic segmentation method, which comprises the following steps: constructing an image semantic segmentation network comprising an initialization module, a separation-extraction-merging bottleneck module, a partition-fusion channel attention module, a feature fusion module and a multi-scale attention decoder; extracting multi-scale features from an initial feature map of the image to be processed through the image semantic segmentation network to obtain a multi-scale feature map; extracting and fusing global channel information and local channel information of the initial feature map to obtain a channel information feature map; performing feature fusion on the multi-scale feature map, the channel information feature map and the original image to be processed to obtain a fusion feature map; and recovering the image precision based on the fusion feature map to obtain an image semantic segmentation result. With a lightweight image semantic segmentation network model having a relatively small number of parameters, the invention maintains the precision and accuracy of the model, improves its inference speed, and realizes real-time semantic segmentation of images.

Description

Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
Technical Field
The invention relates to the technical field of computer vision, in particular to a method and a system for semantic segmentation of a light-weight multi-scale feature fusion real-time image.
Background
With the development of computer vision technology, image semantic segmentation is being applied ever more deeply and widely in fields such as autonomous driving, intelligent healthcare and robotics. Image semantic segmentation classifies each pixel in a picture into its corresponding category, locates every object of each category in the picture, and segments the image accurately to provide precise object boundary information. With the development of deep learning and the continuous improvement of computer hardware performance, image semantic segmentation algorithms based on deep convolutional neural networks have achieved good results.
Many existing image semantic segmentation models based on deep convolutional neural networks generally pursue higher accuracy by using deeper convolutional layers and wider feature channels, and therefore have a very large number of parameters. For example, PSPNet (Pyramid Scene Parsing Network) introduces a pyramid pooling module and achieves 80.2% mIoU on the Cityscapes test set, but the model has 65.7 million parameters, its inference speed falls far short of the real-time standard, and it is difficult to deploy on hardware terminals such as mobile phones, automotive systems, wearable devices and Internet-of-Things devices.
Disclosure of Invention
To address the low inference speed of existing image semantic segmentation models, the invention provides a lightweight multi-scale feature fusion real-time image semantic segmentation method and system that improve the inference speed of the model while maintaining its accuracy.
In order to solve the technical problems, the technical scheme of the invention is as follows:
in a first aspect, the invention provides a lightweight multi-scale feature fusion real-time image semantic segmentation method, which comprises the following steps:
s1: constructing an image semantic segmentation network, wherein the image semantic segmentation network comprises an initialization module, a separation-extraction-merging bottleneck module, a partition-fusion channel attention module, a feature fusion module and a multi-scale attention decoder; the separation-extraction-merging bottleneck module comprises a plurality of separation-extraction-merging bottleneck structures which are connected in sequence.
S2: and inputting the image to be processed into the image semantic segmentation network, and outputting an image semantic segmentation result by the image semantic segmentation network.
The initialization module extracts features of an image to be processed to obtain an initial feature map of the image to be processed; the separation-extraction-combination bottleneck module adopts a separation characteristic channel to extract multi-scale characteristics of the initial characteristic diagram to obtain a multi-scale characteristic diagram; the partition-fusion channel attention module extracts and fuses global channel information and local channel information of the initial feature map to obtain a channel information feature map; the feature fusion module performs feature fusion on the multi-scale feature map, the channel information feature map and the original image to be processed to obtain a fusion feature map; and the multi-scale attention decoder recovers the image precision based on the fusion characteristic graph to obtain an image semantic segmentation result.
In a second aspect, the present invention provides a lightweight multi-scale feature fusion real-time image semantic segmentation system, including:
the image semantic segmentation network comprises an initialization module, a separation-extraction-merging bottleneck module, a partition-fusion channel attention module, a feature fusion module and a multi-scale attention decoder. The separation-extraction-merging bottleneck module comprises a plurality of separation-extraction-merging bottleneck structures which are connected in sequence. The initialization module extracts the features of the image to be processed to obtain an initial feature map of the image to be processed. And the separation-extraction-combination bottleneck module adopts a separation characteristic channel to extract the multi-scale characteristics of the characteristic diagram of the image to be processed to obtain a multi-scale characteristic diagram. And the partition-fusion channel attention module extracts and fuses the global channel information and the local channel information of the initial characteristic diagram to obtain a channel information characteristic diagram. And the feature fusion module performs feature fusion on the multi-scale feature map, the channel information feature map and the original image to be processed to obtain a fusion feature map. And the multi-scale attention decoder recovers the image precision according to the fusion characteristic graph to obtain an image semantic segmentation result.
Compared with prior-art approaches that improve model accuracy by stacking powerful hardware resources and power consumption, the technical scheme of the invention has the following beneficial effects: multi-scale features of the image are extracted by the separation-extraction-merging bottleneck module; global channel information and local channel information of the initial feature map are extracted and fused by the partition-fusion channel attention module; effective information and detail information of the image are fused and supplemented by the feature fusion module; and the precision of objects at different scales in the image is recovered by the multi-scale attention decoder. In a lightweight image semantic segmentation network model with a relatively small number of parameters, the Precision and Accuracy of the model are thereby ensured and the Inference Speed of the model is improved. The method not only improves the segmentation precision of image targets but also realizes fast semantic segmentation of images, ultimately achieving a good performance balance between model accuracy and computation speed and satisfying the timeliness and accuracy requirements of practical application scenarios.
Drawings
Fig. 1 is a flowchart of a semantic segmentation method for a lightweight multi-scale feature fusion real-time image in embodiment 1.
Fig. 2 is a structural diagram of a parallel concatenation module in embodiment 2.
Fig. 3 is a structural diagram of the separation-extraction-combination bottleneck structure in example 2.
FIG. 4 is a structural diagram of the partition-fusion channel attention module in embodiment 2.
Fig. 5 is a structural diagram of a feature fusion module in embodiment 2.
Fig. 6 is a diagram of the entire network structure of the image semantic segmentation network in embodiment 3.
Fig. 7 is a block diagram of a multi-scale attention decoder in embodiment 3.
Fig. 8 shows comparison graphs of the performance results of different semantic segmentation networks evaluated on the Cityscapes dataset in example 4: comprehensive accuracy versus computational efficiency, and comprehensive accuracy versus parameter quantity.
FIG. 9 is a graph comparing the visualization of the segmentation effect of part of the existing semantic segmentation network in example 4 and LMFFNet-3-8 in the Cityscapes data set.
FIG. 10 is a graph comparing the visualization of the segmentation effect of another part of the existing semantic segmentation network in the Cityscapes data set in example 4 with LMFFNet-3-8.
FIG. 11 is a visualization contrast diagram of the segmentation effect of different semantic segmentation networks in the CamVid data set in example 4.
Fig. 12 is an architecture diagram of a real-time image semantic segmentation system with lightweight multi-scale feature fusion in embodiment 5.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
The embodiment provides a lightweight multi-scale feature fusion real-time image semantic segmentation method, as shown in fig. 1, fig. 1 is a flowchart of the lightweight multi-scale feature fusion real-time image semantic segmentation method, and the method includes the following steps:
s1: an image semantic segmentation network is constructed, and comprises an initialization module, a separation-extraction-merging bottleneck module (SEM-B Block), a partition-fusion channel attention module (PMCA), a Feature Fusion Module (FFM) and a multi-scale attention decoder (MAD). The separation-extraction-merger bottleneck module comprises a plurality of separation-extraction-merger bottleneck structures (SEM-B).
S2: and inputting the image to be processed into the image semantic segmentation network, and outputting an image semantic segmentation result by the image semantic segmentation network.
The initialization module extracts the features of the image to be processed to obtain an initial feature map of the image to be processed.
And the separation-extraction-combination bottleneck module adopts a separation characteristic channel to extract the multi-scale characteristics of the characteristic diagram of the image to be processed to obtain a multi-scale characteristic diagram. In this embodiment, the separation-extraction-merging bottleneck structure in the separation-extraction-merging bottleneck module follows the concept of separation-extraction-merging, and first, a 3 × 3 standard convolution layer is used to compress useful information of an initial feature map of an image to be processed, then, Split feature channel separation is adopted, feature processing is performed on the compressed feature map by using parallel double branches with receptive field differences to generate feature maps of different scales, and finally, the 3 × 3 standard convolution layer is used again to fuse multi-scale information on the feature maps of different scales to obtain a multi-scale feature map.
And the partition-fusion channel attention module extracts and fuses global channel information and local channel information of the initial feature map to obtain a channel information feature map. In this embodiment, the partition-fusion channel attention module includes a global attention generation branch and a local attention fusion branch. In the local attention fusion branch, the attention weight of the regional channel of the initial feature map is acquired in a partitioning mode, then the attention weight of each regional channel is fused in a self-adaptive learning mode, and the relation of the sub-regions of the feature map is considered from the aspect of partitioning. The global attention branch generates a global channel attention weight by acquiring global information of the initial feature map. And finally, obtaining the final channel attention weight by fusing the regional channel attention weight and the global channel attention weight. Meanwhile, global information and local information are fused to generate channel attention weight, the regional relation and the global information of the feature map are considered, and the accuracy of the image semantic segmentation network is improved.
And the feature fusion module performs feature fusion on the multi-scale feature map, the channel information feature map and the original image to be processed to obtain a fusion feature map. In the embodiment, the multi-scale feature map, the channel information feature map and the original image to be processed are subjected to feature fusion, effective information and detail information of the image are fused and supplemented, and the accuracy of the image semantic segmentation network is improved.
And the multi-scale attention decoder recovers the image precision based on the fusion characteristic graph to obtain an image semantic segmentation result. In the embodiment, the fusion feature diagram fused with the multi-scale features guides the multi-scale attention decoder to recover the image precision, performs weighted attention on important information in an image space region, suppresses non-important information, and effectively improves the recovery precision of the decoder on objects with different scales.
The method comprises the steps of extracting multi-scale features of an image through a separating-extracting-merging bottleneck module, extracting and merging global channel information and local channel information of the image through a partitioning-merging channel attention module, extracting and merging effective information and detail information of the image through a feature merging module, recovering precision of objects with different scales in the image through a multi-scale attention decoder, ensuring the precision and accuracy of the model in a lightweight image semantic segmentation network model with relatively small parameter quantity, reducing the parameter quantity and the calculated quantity of the model, reducing the consumption of resources, improving the reasoning speed of the model, and ensuring the lightweight and real-time performance of the network.
Example 2
The embodiment is an improvement on the basis of the real-time image semantic segmentation method for lightweight multi-scale feature fusion provided in embodiment 1.
In this embodiment, the initialization module comprises 3 × 3 standard convolutional layers. In a specific implementation, the resolution of the image to be processed may be customized as needed, and before the image is input to the network it may be scaled to a preset resolution; for example, for a 3-channel RGB image, the input resolution of each channel may be preset to 512 × 1024. The image to be processed is input to the initialization module, where a 3 × 3 standard convolutional layer with a stride of 2 performs a convolution operation that halves the size of the original input image and expands the number of output feature-map channels to 32; the result is then passed through two further 3 × 3 standard convolutional layers to extract contextual semantic information and reduce the loss of image information, yielding a first feature map. The first feature map is input to a parallel splicing module (FFM-A1), which splices the first feature map with the original image to be processed in parallel to obtain a second feature map. Fig. 2 shows the structure of one embodiment of the parallel splicing module (FFM-A1). In the initial stage of the image semantic segmentation network, the initialization module extracts features of the image to be processed to obtain the first feature map, but part of the useful information is lost. Therefore, keeping the number of channels unchanged, the original image to be processed is downsampled to obtain 3-channel downsampled image information. This downsampled information is spliced in parallel with the first feature map as compensation information, and the result is finally processed by a 1 × 1 standard convolutional layer to obtain the second feature map, so that more useful information is extracted with only a small number of model parameters.
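For illustration, the following minimal PyTorch sketch follows the initialization stage and the FFM-A1 parallel splicing described above. The conv-BN-PReLU ordering, the bilinear downsampling of the RGB image, and the names InitBlock and out_ch are assumptions made for this sketch rather than details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InitBlock(nn.Module):
    """Initialization module plus FFM-A1 parallel splicing (illustrative sketch)."""
    def __init__(self, out_ch=32):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, out_ch, 3, stride=2, padding=1, bias=False),  # halve resolution, 3 -> 32 channels
            nn.BatchNorm2d(out_ch), nn.PReLU(out_ch),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),       # two further 3x3 convolutions
            nn.BatchNorm2d(out_ch), nn.PReLU(out_ch),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.PReLU(out_ch),
        )
        # FFM-A1: splice the downsampled 3-channel image back in, then a 1x1 convolution
        self.ffm_a1 = nn.Conv2d(out_ch + 3, out_ch, kernel_size=1, bias=False)

    def forward(self, image):
        first = self.stem(image)                                   # first feature map
        img_ds = F.interpolate(image, size=first.shape[2:],
                               mode='bilinear', align_corners=False)
        return self.ffm_a1(torch.cat([first, img_ds], dim=1))      # second feature map

# Example: a 512x1024 RGB image yields a 32-channel, 256x512 second feature map.
x = torch.randn(1, 3, 512, 1024)
print(InitBlock()(x).shape)  # torch.Size([1, 32, 256, 512])
```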
In this embodiment, a plurality of separation-extraction-merging bottleneck structures extract multi-scale features of the image to be processed using split feature channels to obtain a multi-scale feature map. Fig. 3 shows the structure of the separation-extraction-merging bottleneck structure in this embodiment. In a specific implementation, a 3 × 3 standard convolutional layer serves as the input bottleneck of the separation-extraction-merging bottleneck structure; the initial feature map passes through this input bottleneck to produce a feature map with half the number of input channels, which is then split along the channel dimension into two feature maps that enter a first branch and a second branch respectively, so that the input of each branch has 1/4 of the original number of channels. This simplifies the overall structure of the image semantic segmentation network and reduces the computational complexity.
To improve the acquisition of multi-scale feature information, the feature map entering the first branch is convolved by a 3 × 3 depthwise separable convolutional layer, while the feature map entering the second branch is convolved by a 3 × 3 depthwise separable dilated (atrous) convolutional layer; context information at different scales is acquired through the difference in receptive fields between the two branches, and information loss on each channel is reduced.
The convolution results of the first branch and the second branch are merged, restoring the number of feature-map channels to 1/2 of the original number, which improves the correlation among channels while multi-scale feature information is obtained. A 3 × 3 standard convolutional layer is then used as the output bottleneck of the separation-extraction-merging bottleneck structure: it convolves the merged result, integrates the feature maps of different scales, and restores the number of channels to be consistent with the number of input channels. Finally, the convolution output with the restored channel number is combined by feature mapping (a residual connection) with the feature map that was input to the current separation-extraction-merging bottleneck structure, and the result is passed to the next separation-extraction-merging bottleneck structure for processing;
and obtaining a multi-scale characteristic diagram after a plurality of separation-extraction-combination bottleneck structure treatments.
In the separation-extraction-merging bottleneck structure, the outputs of the 3 × 3 standard convolutional layer, the 3 × 3 depthwise separable convolutional layer and the 3 × 3 depthwise separable dilated convolutional layer are each followed by a PReLU activation layer and a BN layer; applying PReLU activation and batch normalization with each convolution operation improves the convergence speed of the model.
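To make the data flow of one SEM-B concrete, a minimal PyTorch sketch following the split-extract-merge steps described above is given below. The conv-BN-PReLU block, the default dilation rate of 2, and the names SEMB and conv_bn_prelu are illustrative assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

def conv_bn_prelu(in_ch, out_ch, k=3, dilation=1, groups=1):
    # convolution followed by batch normalization and PReLU activation
    pad = dilation * (k // 2)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation,
                  groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.PReLU(out_ch),
    )

class SEMB(nn.Module):
    """One split-extract-merge bottleneck (SEM-B); channels must be divisible by 4."""
    def __init__(self, channels, dilation=2):
        super().__init__()
        assert channels % 4 == 0
        half, quarter = channels // 2, channels // 4
        self.in_bottleneck = conv_bn_prelu(channels, half, k=3)        # compress to C/2
        # branch 1: 3x3 depthwise separable convolution (depthwise + pointwise)
        self.branch1 = nn.Sequential(
            conv_bn_prelu(quarter, quarter, k=3, groups=quarter),
            conv_bn_prelu(quarter, quarter, k=1),
        )
        # branch 2: 3x3 depthwise separable dilated convolution (larger receptive field)
        self.branch2 = nn.Sequential(
            conv_bn_prelu(quarter, quarter, k=3, dilation=dilation, groups=quarter),
            conv_bn_prelu(quarter, quarter, k=1),
        )
        self.out_bottleneck = conv_bn_prelu(half, channels, k=3)       # restore C channels

    def forward(self, x):
        y = self.in_bottleneck(x)                                      # C -> C/2
        y1, y2 = torch.chunk(y, 2, dim=1)                              # split into two C/4 branches
        y = torch.cat([self.branch1(y1), self.branch2(y2)], dim=1)     # merge back to C/2
        y = self.out_bottleneck(y)                                     # C/2 -> C
        return x + y                                                   # residual mapping to next SEM-B

# Example: a 64-channel map keeps its shape, so SEM-B structures can be stacked.
print(SEMB(64)(torch.randn(1, 64, 64, 128)).shape)  # torch.Size([1, 64, 64, 128])
```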
Adding an attention mechanism to a neural network allows information of interest to be extracted during training while irrelevant information is ignored, which improves the accuracy of the model. Fig. 4 is a structural diagram of the partition-fusion channel attention module in this embodiment. In this embodiment, the partition-fusion channel attention module adopts a dual-branch channel attention mechanism comprising a global attention generation branch and a local attention fusion branch; the steps by which the partition-fusion channel attention module extracts and fuses the global channel information and local channel information of the initial feature map to obtain the channel information feature map are as follows:
inputting the feature map A output by the first separating-extracting-merging bottleneck structure in the separating-extracting-merging bottleneck module into the partitioning-merging channel attention module and then respectively entering the global attention generation branch and the local attention merging branch.
To capture global channel information, a profile A of a branch is generated for input global attention1Performing global average pooling, and learning characteristic diagram A by multilayer neural network1Obtaining a characteristic graph A by the channel weight1The global channel information of (c).
To compensate for the loss of channel information after global average pooling, the feature map A of the input local attention fusion branch is2Partitioning the feature map A2Divided into 4 areas; for characteristic diagram A2Respectively carrying out global average pooling, and learning average pooling values of 4 regions by neural network self-adaptation to obtain a characteristic diagram A2The local channel information of (1).
And performing weighted fusion on the global channel information and the local channel information to obtain a group of final summary weights, guiding the neural network to extract interesting information of the output characteristic diagram according to the summary weights to obtain a channel information characteristic diagram, and efficiently finishing the integration of the characteristic information.
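The following minimal PyTorch sketch illustrates this dual-branch idea, assuming a 2 × 2 spatial partition for the 4 regions, a small two-layer MLP for the global branch, a depthwise 2 × 2 convolution to learn the fusion of the four regional pooled values, and a single learned mixing weight; these implementation details are illustrative assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PMCA(nn.Module):
    """Partition-fusion channel attention: global branch + 4-region local branch."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # global attention generation branch: GAP followed by a small MLP
        self.global_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # local attention fusion branch: learn to fuse the 4 regional pooled values per channel
        self.region_fuse = nn.Conv2d(channels, channels, kernel_size=2,
                                     groups=channels, bias=False)
        self.mix = nn.Parameter(torch.tensor(0.5))   # adaptive global/local mixing weight

    def forward(self, x):
        b, c, _, _ = x.shape
        # global channel information
        g = F.adaptive_avg_pool2d(x, 1).flatten(1)    # (B, C)
        g = self.global_mlp(g).view(b, c, 1, 1)
        # local channel information from a 2x2 partition (4 regions)
        r = F.adaptive_avg_pool2d(x, 2)               # (B, C, 2, 2): one pooled value per region
        l = self.region_fuse(r)                       # (B, C, 1, 1): fused regional weights
        # weighted fusion of global and local information, then sigmoid gating
        w = torch.sigmoid(self.mix * g + (1.0 - self.mix) * l)
        return x * w                                  # channel information feature map

# Example: re-weight a 64-channel feature map.
print(PMCA(64)(torch.randn(1, 64, 32, 64)).shape)  # torch.Size([1, 64, 32, 64])
```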
The feature fusion module splices the multi-scale feature map, the channel information feature map and the original image to be processed in the channel direction, and convolves the spliced result with a standard convolutional layer to obtain a fusion feature map.
Fig. 5 is a structural diagram of the feature fusion module in this embodiment. In this embodiment, the feature fusion module establishes a long-range connection by adding the downsampled original image to be processed into the feature fusion module to compensate for detail information, and connects to the head-end feature map of the feature fusion module through a short-range skip connection. The feature fusion module thus not only integrates the downsampled information of the original image to be processed but also introduces the information from the separation-extraction-merging bottleneck structures. In addition, unlike many existing methods that fuse input feature maps directly using feature mapping and skip connections, when connecting the separation-extraction-merging bottleneck modules the feature fusion module introduces the partition-fusion channel attention module to extract the important channel-wise information of the feature maps, enhancing the fusion of important feature information. By combining long-range and short-range skip connections, the feature fusion module ensures that the model captures information relationships over different distances, improving the accuracy of the model. Furthermore, as a lightweight network, considering how to extract enough information from the feature maps to improve model accuracy while keeping the scale of the model parameters as small as possible, the invention prefers to stack only a limited number of feature fusion modules to ensure that the model remains sufficiently lightweight.
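A minimal sketch of the splicing-and-projection step performed by a feature fusion module is shown below, assuming a 1 × 1 projection convolution after channel splicing (as in the later embodiment) and bilinear downsampling of the raw image; the class name FFMB and the channel arguments are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFMB(nn.Module):
    """Splice multi-scale features, attention features and the downsampled image, then project."""
    def __init__(self, ms_channels, attn_channels, out_channels, img_channels=3):
        super().__init__()
        self.project = nn.Sequential(
            nn.Conv2d(ms_channels + attn_channels + img_channels, out_channels,
                      kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.PReLU(out_channels),
        )

    def forward(self, ms_feat, attn_feat, image):
        # long-range connection: downsample the raw image to the feature-map resolution
        img_ds = F.interpolate(image, size=ms_feat.shape[2:],
                               mode='bilinear', align_corners=False)
        fused = torch.cat([ms_feat, attn_feat, img_ds], dim=1)   # splice along channels
        return self.project(fused)                                # fusion feature map

# Example: fuse 64-channel multi-scale and attention maps at 1/4 resolution of a 512x1024 image.
ffm = FFMB(ms_channels=64, attn_channels=64, out_channels=128)
out = ffm(torch.randn(1, 64, 128, 256), torch.randn(1, 64, 128, 256), torch.randn(1, 3, 512, 1024))
print(out.shape)  # torch.Size([1, 128, 128, 256])
```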
Example 3
The embodiment is an improvement on the basis of the real-time image semantic segmentation method for lightweight multi-scale feature fusion provided in the embodiment 2.
The network structure of the image semantic segmentation network LMFFNet of the present embodiment is shown in table 1.
Table 1 network structure of image semantic segmentation network LMFFNet
As shown in fig. 6, fig. 6 is a diagram of an overall network structure of the image semantic segmentation network in this embodiment, where (i) standard convolution, (ii) down-sampling module, (iii) point-by-point convolution, (iv) depth separable convolution, and (v) up-sampling unit.
In this embodiment, the split-extract-merge bottleneck module comprises a first split-extract-merge bottleneck module (SEM-B Block 1) and a second split-extract-merge bottleneck module (SEM-B Block 2); the partition-fusion channel attention module includes a first partition-fusion channel attention module and a second partition-fusion channel attention module; the feature fusion module includes a first feature fusion module (FFM-B1) and a second feature fusion module (FFM-B2).
Carrying out convolution operation on the image to be processed through 3 standard convolution layers to obtain a first characteristic diagram; and performing parallel splicing on the first characteristic diagram and the original image to be processed, and performing point-by-point convolution on a splicing result through a1 × 1 convolution layer to obtain a second characteristic diagram.
The second feature map is downsampled once by a downsampling module and then input to the first separation-extraction-merging bottleneck module, where M separation-extraction-merging bottleneck structures perform multi-scale feature extraction to obtain a first multi-scale feature map. The output feature map of the first separation-extraction-merging bottleneck structure in the first separation-extraction-merging bottleneck module is passed to the first partition-fusion channel attention module, which extracts and fuses its global channel information and local channel information to obtain a first channel information feature map. The first feature fusion module splices the first multi-scale feature map, the first channel information feature map and the original image to be processed after 4× downsampling in the channel direction, and performs point-by-point convolution on the spliced result through a 1 × 1 convolutional layer to obtain a first fusion feature map.
The first fusion feature map is downsampled once by a downsampling module and then input to the second separation-extraction-merging bottleneck module, where N separation-extraction-merging bottleneck structures perform multi-scale feature extraction to obtain a second multi-scale feature map. The output feature map of the first separation-extraction-merging bottleneck structure in the second separation-extraction-merging bottleneck module is passed to the second partition-fusion channel attention module, which extracts and fuses its global channel information and local channel information to obtain a second channel information feature map. The second multi-scale feature map and the second channel information feature map are input to the second feature fusion module, which splices the second multi-scale feature map, the second channel information feature map and the original image to be processed after 8× downsampling in the channel direction, and performs point-by-point convolution on the spliced result through a 1 × 1 convolutional layer to obtain a second fusion feature map. The first fusion feature map and the second fusion feature map are input to the multi-scale attention decoder, which recovers the image precision from them to obtain the image semantic segmentation result.
In an encoder-decoder semantic segmentation architecture, the encoder generates dense feature maps and the decoder upsamples the feature maps to match the resolution of the original input image; a well-designed decoder can effectively recover spatial details and improve semantic segmentation accuracy.
In this embodiment, the first fusion feature map comprises feature map F_11, and the second fusion feature map comprises feature maps F_21 and F_31. Fig. 7 is a structural diagram of the multi-scale attention decoder in this embodiment; the steps of inputting the first fusion feature map and the second fusion feature map into the multi-scale attention decoder to obtain the image semantic segmentation result are as follows:
will feature mapF 11Performing point-by-point convolution by a1 × 1 convolution layer to obtain characteristic channel number C1Characteristic diagram ofF 12Drawing the characteristicsF 21After point-by-point convolution is carried out on a1 multiplied by 1 convolution layer, the number of characteristic channels C is obtained2Characteristic diagram ofF 22To characteristic diagramF 22Bilinear upsampling to obtain a feature mapF 23The expression is as follows:
Figure 267836DEST_PATH_IMAGE002
wherein the content of the first and second substances,
Figure 714998DEST_PATH_IMAGE003
representing a convolution calculation of 1 x 1,
Figure 232567DEST_PATH_IMAGE004
representing bilinear upsampling;
the characteristic diagram is combinedF 12And characteristic diagramsF 23Splicing in the channel direction, and then passing the splicing result through a depth separable convolution layer and a Sigmoid activation function to obtain multi-scale attentionForce characteristic diagramM mam. The expression is as follows:
Figure 397969DEST_PATH_IMAGE005
wherein, the first and the second end of the pipe are connected with each other,
Figure 29326DEST_PATH_IMAGE006
indicating that the Sigmoid-activated function,
Figure 862153DEST_PATH_IMAGE007
representing a point-by-point convolution of the depth separable convolution layer,
Figure 550623DEST_PATH_IMAGE008
representing a depth convolution of the depth separable convolution layer,
Figure 203321DEST_PATH_IMAGE009
the method comprises the step of splicing a plurality of feature maps in the channel direction.
By collecting the mid-level features F_11 and the high-level features F_21, the input information is recovered with low computational complexity, enabling fast inference. The mid-level and high-level features are fused with Concat, and a depthwise separable convolutional layer with a Sigmoid activation function outputs the multi-scale attention feature map M_mam.
Feature map F_31 is convolved by a depthwise separable convolutional layer to obtain feature map F_32, and bilinear upsampling of F_32 yields feature map F_33. The expression is as follows:

F_32 = DSConv(F_31), F_33 = Up(F_32)

The depthwise separable convolutional layer reduces the number of feature channels of F_31 to the number of pixel classes, and the subsequent bilinear upsampling produces feature map F_33 with the same size as the multi-scale attention feature map M_mam. Finally, M_mam and F_33 are multiplied pixel by pixel, applying weighted attention to the regions of interest in the image space and suppressing pixels of non-important parts; the feature map is then restored to the resolution of the original input image, yielding the pixel class matrix F_out as the image semantic segmentation result. The expression is as follows:

F_out = Up(M_mam ⊗ F_33)

where DSConv(·) denotes the depthwise separable convolution and ⊗ denotes pixel-level multiplication.
the first and second fused feature maps already contain enough information to recover the spatial information of the original image to be processed and have better accuracy. Therefore, when generating a multi-scale attention feature map by fusing features of different depths, only the middle layer feature map is adoptedF 11And deep level feature mapF 21. Therefore, the multi-scale attention decoder can recover spatial details of the feature map well on the basis of a parameter size of 0.09M.
In addition, the 1 × 1 convolutional layers adopted in this embodiment reduce the number of feature-map channels, and the use of grouped convolutions and depthwise separable convolutional layers reduces the number of convolution-kernel parameters; together these measures effectively reduce the parameters of the image semantic segmentation network and improve its real-time performance.
Example 4
The embodiment is an improvement on the basis of the real-time image semantic segmentation method for lightweight multi-scale feature fusion provided in embodiment 3.
This embodiment evaluates the image semantic segmentation network LMFFNet on 4 datasets: the Cityscapes dataset, the CamVid dataset, the KITTI dataset, and the WildDash2 dataset.
In this embodiment, when training the image semantic segmentation network LMFFNet, mini-batch stochastic gradient descent (SGD) is used as the optimization strategy in the training phase, with the momentum set to 0.9 and weight decay applied. Furthermore, the learning-rate decay strategy used in this embodiment is "poly", with the power set to 0.9; the current learning rate is expressed as follows:

lr = lr_init × (1 − iter / max_iter)^power

where lr denotes the current learning rate, lr_init denotes the initial learning rate, iter denotes the current number of iterations, max_iter denotes the maximum number of iterations, and power denotes the shape-control parameter of the learning-rate decay curve.
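A short sketch of this poly schedule, using the symbols defined above (the function name and the example initial rate are illustrative):

```python
def poly_lr(lr_init, iteration, max_iter, power=0.9):
    """Current learning rate under the "poly" decay schedule."""
    return lr_init * (1.0 - iteration / max_iter) ** power

# Halfway through training the rate has decayed to roughly 0.54 * lr_init.
print(poly_lr(0.045, 500, 1000))  # 0.045 is only an illustrative initial rate
```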
When training LMFFNet with the Cityscapes and CamVid datasets, data augmentation techniques such as random scaling and horizontal flipping are adopted, the maximum number of training epochs is set to 1000, and the random scale parameters are set to 0.75, 1.0, 1.25, 1.5, 1.75 and 2.0, respectively, to transform the training images to different scales. For the Cityscapes dataset, the training images are randomly cropped to a resolution of 512 × 1024 during the training phase owing to the limited GPU memory. For the CamVid dataset, ablation experiments were performed at both 720 × 960 and 360 × 480 resolutions. The class-imbalance problem is handled with the online hard example mining (OHEM) loss on the Cityscapes dataset and with a class weighting scheme on the CamVid dataset. The class weight W_class is expressed as follows:

W_class = 1 / ln(c + P_class)

where c is a hyper-parameter set to 1.10 and P_class denotes the class sample distribution.
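The class-weighting scheme can be sketched as follows, assuming that P_class is the per-class pixel frequency computed over the training set:

```python
import numpy as np

def class_weights(pixel_counts, c=1.10):
    """W_class = 1 / ln(c + P_class), with P_class the class sample distribution."""
    p_class = np.asarray(pixel_counts, dtype=np.float64)
    p_class = p_class / p_class.sum()          # normalize counts to a distribution
    return 1.0 / np.log(c + p_class)

# Rare classes (small P_class) receive larger weights than frequent ones.
print(class_weights([1_000_000, 10_000]))
```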
In this embodiment, the depth of LMFFNet is determined automatically by a Bayesian algorithm. A Bayesian optimization method is defined to specify the depths of FFM-B1 and FFM-B2, and a new evaluation index I_auto is introduced. In the mathematical model of I_auto, w is the weight coefficient of the mIoU, m_i is the mIoU of the model in the i-th iteration of the Bayesian optimization algorithm, m_b is the mIoU baseline, f_i is the fps of the model in the i-th iteration, f_b is the fps baseline, p_i is the parameter quantity of the model in the i-th iteration, and d is a sensitivity suppression factor for the parameter quantity. m_b and f_b respectively specify the lowest tolerable mIoU and fps during the search. In this embodiment, w, m_b and f_b are set to 65, and d is set to 20.
An index I_i is used to jointly measure the mIoU, fps and parameter quantity of LMFFNet, where m_max, f_max and p_max are respectively the maximum mIoU, fps and parameter quantity among the networks in the comparison table.
The effect of SEM-B Block on network performance is shown in tables 2,3 and 4.
TABLE 2 Experimental results of searching for different M and N values in SEM-B Block using Bayesian optimization on the Cityscapes dataset
TABLE 3 Experimental results of ablation experiments on SEM-B Block depth on the Cityscapes dataset
Table 4 Experimental results of evaluating different bottleneck structures on the Cityscapes dataset
As shown in Table 2, the two parameters M and N denote the numbers of SEM-B structures in SEM-B Block1 and SEM-B Block2 of LMFFNet. In this embodiment, the optimal network depths M and N of SEM-B Block1 and SEM-B Block2 are determined automatically by a Bayesian optimization algorithm, and Table 2 shows that Bayesian optimization finds the best I_auto balancing mIoU, forward inference speed and parameter quantity. According to the experimental results, the self-defined evaluation index I_auto is maximal when M and N are 3 and 8 respectively, indicating that LMFFNet makes the most appropriate trade-off between parameters and accuracy when M = 3 and N = 8. Therefore, M = 3 and N = 8 are set to construct the backbone of LMFFNet-3-8.
In FFM-B1 and FFM-B2, the depth of the SEM-B Block always affects the overall performance of the network. In this embodiment, a parameter α is used to control the numbers M and N of SEM-B structures in SEM-B Block1 and SEM-B Block2 respectively, and the influence of α on the performance of LMFFNet is explored. As shown in Table 3, M = α is set first while N is fixed to 8. As can be seen from the table, as α increases, the mIoU of LMFFNet remains essentially unchanged until α = 6 and decreases again when α = 8: the dilation rate of the SEM-B structures in SEM-B Block1 is designed to be 2, so the receptive field grows little as the number of SEM-B structures increases and the mIoU is basically not improved; once the SEM-B depth grows to a certain point, the SEM-B structures stacked later lie in the deep part of the network and capture more global semantic information, so the mIoU improves, but as the depth continues to increase the mIoU decreases due to overfitting. Therefore, from the analysis of Table 3, M = 3 for SEM-B Block1 best trades off mIoU, parameter quantity and forward inference speed against the number of SEM-B structures. When M = 3 and N = α are set, the mIoU of LMFFNet clearly increases with α and reaches its maximum at α = 8. Because the dilation rates of the SEM-B structures in SEM-B Block2 follow the pattern 2, 2, 4, 4, ..., 16, 16, 32, 32, the dilation rate is doubled progressively, the larger dilation rates are largely maintained and the receptive field is larger, so the mIoU keeps increasing. Setting N to 8 therefore achieves the best parametric performance.
Consequently, when M = 3 and N = 8, LMFFNet keeps a good balance among mIoU, forward inference speed and parameter quantity, which is consistent with the conclusion obtained by the Bayesian-optimization automatic search for the depths M and N and verifies the effectiveness of automatically optimizing the optimal M and N depths with the self-defined balance index.
To evaluate the effectiveness of SEM-B, this embodiment replaces SEM-B in LMFFNet with the bottleneck structure of ResNet, the non-bottleneck structure Non-bt-1D of ERFNet, the bottleneck structure SS-nbt of LEDNet, and the DAB module of DABNet, respectively, constructing four segmentation networks. According to the experimental results in Table 4, the semantic segmentation network constructed with the ResNet bottleneck structure has the smallest parameter quantity and the fastest inference speed, but suffers a larger loss of accuracy. The semantic segmentation network constructed with Non-bt-1D reaches an accuracy of 74.3% mIoU, but its parameter scale is the largest. Compared with the other four bottleneck structures, the semantic segmentation network constructed with SEM-B obtains the best segmentation accuracy of 74.9% mIoU at a high inference speed of 118.9 fps with only 1.35M model parameters, achieving the best balance among segmentation accuracy, inference speed and model parameters.
The impact of the PMCA module on network performance is shown in table 5.
TABLE 5 FFM structural Performance without attention and with different attention mechanisms
To demonstrate the effectiveness of PMCA, different attention modules were inserted into the FFM in this embodiment and their performance was evaluated. As shown in Table 5, PMCA achieves 0.7% higher mIoU than the channel attention module proposed by SENet with only a very small loss of inference speed. This suggests that in real-time semantic segmentation application scenarios, PMCA can replace the currently popular SENet channel attention module.
The impact of the FFM module on network performance is shown in tables 6 and 7.
TABLE 6 Experimental results of ablation experiments on FFM-B on the Cityscapes dataset
TABLE 7 Experimental results for evaluation of different amounts of FFM-B on the Cityscapes dataset
As shown in Table 6, adding one or more of the short-range, mid-range and long-range connections to FFM-B can greatly improve the accuracy of the model. On the Cityscapes validation set, the model with short-range and long-range feature connections reaches 74.9% mIoU, while adding the mid-range connection on top of this reduces the mIoU by 1.2%. The results show that FFM-B achieves better performance by fusing the shallow features from the short-range connection with the downsampled original image from the long-range connection. The features of the mid-range connection are close to the output feature map of SEM-B, so the two feature maps have similar scales; therefore, using the mid-range connection for multi-scale fusion in FFM-B harms the accuracy of the model.
To investigate the impact of the number of FFM-B modules on the network, this embodiment uses different numbers of FFM-B modules in LMFFNet and evaluates their performance, as shown in Table 7. As the number of FFM-B modules increases, the mIoU of the model rises continuously; however, when the number reaches 4, the mIoU begins to decrease. A possible reason is that increasing the number of FFM-B modules increases the number of downsampling operations in the model, which hinders the recovery of spatial detail information and harms accuracy. In addition, as the number of FFM-B modules grows, the number of model parameters increases rapidly, which is unfavourable for lightweight application scenarios. Therefore, this embodiment sets the number of FFM-B modules to 2 to construct the real-time semantic segmentation network.
The impact of the MAD decoder on network performance is shown in Tables 8 and 9.
TABLE 8 Experimental results for MAD evaluated on the Cityscapes dataset
Figure 97512DEST_PATH_IMAGE027
TABLE 9 Experimental results of ablation experiments on MADs on the Cityscapes dataset
As shown in Table 8, this embodiment generates the attention feature map by fusing features of different depths and explores its influence on decoder performance. It can be seen from Table 8 that using only the shallow, middle or deep feature map to generate the attention feature map impairs the performance of the model: an attention map generated from a single-scale feature map easily causes the decoder to focus only on features of that scale during feature recovery, and for images containing multi-scale features the decoder then cannot attend to the recovery of every scale at the same time, reducing model performance. When two or more features are used, including the shallow features also causes a loss of accuracy, because shallow features contain only local activations of objects while the middle and deep layers contain semantic activations; fusing shallow features that carry only local activations with the semantically activated middle and deep layers to generate the multi-scale attention feature map interferes with the semantic activations and affects the feature recovery of the decoder.
Therefore, when the MAD generates the multi-scale feature map by fusing the middle-layer features and the deep-layer features, good segmentation results can be generated for objects of all scales in the multi-scale feature map.
To verify the effectiveness of MAD and FFM, this embodiment performed a series of experiments. First, a certain number of SEM-B structures are used to construct the base network of LMFFNet; the FFM is then added to the base network, and "Base + FFM" is constructed as the backbone of LMFFNet. MAD, PAD and ERFD are added to the "Base + FFM" backbone respectively to observe the performance changes of LMFFNet. The comparative results are shown in Table 9: adding the FFM to the base network to form the "Base + FFM" backbone improves the mIoU by about 1.1%, though the model parameters increase and the inference speed becomes slower. The combined network "Base + FFM + PAD" achieves 74.0% mIoU, and the combined network "Base + FFM + ERFD" achieves 74.3% mIoU. The "Base + FFM + MAD" network reaches 74.9% mIoU, an improvement of 1.3% mIoU over the "Base + FFM" network. The results show that the combination of FFM and MAD improves real-time segmentation accuracy while keeping the increase in model size within an acceptable range. Compared with PAD or ERFD, the LMFFNet constructed with FFM and MAD achieves a better balance among segmentation accuracy, inference speed and model size.
The results of comparing LMFFNet with the prior art most advanced method are shown in table 10, table 11, table 12, table 13, table 14 and table 15.
TABLE 10 Per-class results for different models on the Cityscapes dataset
TABLE 11 Performance comparison of LMFFNet with the most advanced semantic segmentation method on the Cityscapes dataset
TABLE 12 Performance comparison of LMFFNet with the most advanced semantic segmentation method on the CamVid dataset
TABLE 13 comparison of recall, precision and F1 scores of LMFFNet on the Cityscapes dataset with the most advanced semantic segmentation methods
TABLE 14 Performance comparison of LMFFNet with the most advanced semantic segmentation method on KITTI datasets
Table 15 comparison of LMFFNet performance with the most advanced semantic segmentation method on the WildDash2 dataset
As shown in Table 10, the LMFFNet-3-8 of the present embodiment has higher segmentation accuracy on the Cityscapes data set than the prior art. In addition, LMFFNet-3-8 performs better than most other networks in segmenting small objects such as traffic lights and traffic signs, and shape-uncertain objects such as terrain and vegetation.
Table 11 shows the results of experiments on the Cityscapes dataset with LMFFNet-3-8 and the most popular other models. LMFFNet-3-8 achieves an mIoU of about 75.1%. Compared with other high-quality real-time semantic segmentation networks such as BiSeNetV1_R (BiSeNetV1_ResNet18), SwiftNet and ShelfNet, LMFFNet-3-8 achieves quite good accuracy without any pre-training. BiSeNetV1_R achieves 74.7% mIoU, but its parameter quantity is as high as 49.0M; LMFFNet-3-8 achieves higher accuracy at a slightly smaller image resolution, 0.4% higher than BiSeNetV1_R, with very few model parameters, approximately 44 times fewer than BiSeNetV1_R. With pre-training, SwiftNet has higher segmentation accuracy on the Cityscapes validation and test sets, but on the Cityscapes validation set SwiftNet obtains only 70.4% mIoU without pre-training, and its model size is 10 times larger than LMFFNet-3-8. Using the self-defined balance index I_i, which weighs mIoU, forward inference speed and parameter quantity, to measure the balance of each network, the I_i of LMFFNet-3-8 reaches 1.60, the best value, showing that LMFFNet-3-8 achieves a better balance among mIoU, forward inference speed and parameter quantity than the other networks. Table 11 also gives the inference speed and FLOPs of each model, where FLOPs@1024 denotes the FLOPs (floating-point operations) after normalizing the resolution of the input image to 1024 × 1024. For a fair speed comparison, this embodiment evaluates the inference speed of all networks with the PyTorch platform on the same 3090 GPU graphics card. Clearly, compared with other high-accuracy networks such as SwiftNet, ESNet and BiSeNetV1_R, LMFFNet-3-8 has relatively small GFLOPs@1024.
In addition, this embodiment evaluates the performance of the model at two resolutions on the CamVid dataset. As summarized in Table 12, LMFFNet-3-8 has a clear advantage in accuracy, reaching 69.1% and 72.0% mIoU on the 360 × 480 and 720 × 960 images respectively. Moreover, the inference speed of LMFFNet-3-8 far exceeds the requirement for real-time segmentation.
Table 13 shows recall, accuracy and F1-score for each network in the Cityscapes and CamVid data sets, and it is clear from Table 13 that LMFFNet-3-8 achieves the best results in this index.
To verify the robustness of LMFFNet-3-8, this embodiment also evaluates it on the KITTI and WildDash2 datasets, using models trained on the Cityscapes dataset to predict on the KITTI dataset and the WildDash2 dataset respectively. As shown in Tables 14 and 15, on the KITTI dataset the mIoU of LMFFNet-3-8 reaches 49.3%, which is 6.4% higher than ERFNet and the best result, and its balance index I_i is the highest. On the WildDash2 dataset, LMFFNet-3-8 achieves the highest accuracy and the fastest inference speed, and its balance index I_i also attains the best result.
As shown in Fig. 8, panel (a) compares the comprehensive accuracy and computational efficiency of different semantic segmentation networks on the Cityscapes dataset, and panel (b) compares their comprehensive accuracy and parameter quantity. Compared with prior-art frameworks, LMFFNet-3-8 has an obvious advantage in accuracy while maintaining a good inference speed for real-time segmentation tasks. Furthermore, Fig. 8(b) shows that for edge computing platforms with limited hardware resources, LMFFNet-3-8 achieves a sufficient trade-off between accuracy and parameter size.
As shown in fig. 9 and fig. 10, fig. 9 is a visual comparison of the segmentation results of some existing semantic segmentation networks and LMFFNet-3-8 on the Cityscapes dataset, and fig. 10 is a visual comparison for another group of existing networks and LMFFNet-3-8 on the same dataset. As indicated by the boxes in fig. 9 and fig. 10, LMFFNet-3-8 segments small-scale objects better, such as traffic signs, bus front mirrors, traffic lights and riders far from the camera, whereas other networks usually ignore these objects outright or misclassify them into other categories. This demonstrates the superiority of LMFFNet-3-8 over currently popular semantic segmentation networks in terms of actual segmentation visualization.
As shown in fig. 11, which visually compares the segmentation results of different real-time semantic segmentation networks on the CamVid dataset, LMFFNet-3-8 cleanly separates tree trunks from pedestrians instead of merging the trunk into the person as other networks do, showing that LMFFNet-3-8 can determine object boundaries clearly and achieve a good segmentation effect. In addition, on the CamVid dataset, LMFFNet-3-8 also segments smaller-scale targets well, such as distant pedestrians and traffic lights, which further illustrates that LMFFNet-3-8 is well suited to semantic segmentation in road scenes.
In summary, LMFFNet achieves 75.1% mIoU on the Cityscapes test set with only 1.35M parameters, and its forward inference speed reaches 118.9 fps at an input resolution of 512 × 1024. On the CamVid dataset, LMFFNet reaches 69.1% mIoU at an input resolution of 360 × 480 and 72.0% at 720 × 960, with forward inference speeds of 120.8 fps and 116.4 fps at the two resolutions, respectively. LMFFNet reaches 49.3% and 23.1% on the KITTI and WildDash2 datasets, respectively, the best results compared with popular semantic segmentation networks.
The image semantic segmentation method provided by the invention maintains the Precision and Accuracy of the lightweight LMFFNet model with a relatively small parameter count while improving its Inference Speed, thereby realizing real-time semantic segmentation of images.
Example 5
Referring to fig. 12, the present embodiment provides a lightweight multi-scale feature fusion real-time image semantic segmentation system, applied to the lightweight multi-scale feature fusion real-time image semantic segmentation method described in the foregoing embodiments. The system comprises an image semantic segmentation network including an initialization module 1, a separation-extraction-merging bottleneck module 2, a partition-fusion channel attention module 3, a feature fusion module 4 and a multi-scale attention decoder 5; the separation-extraction-merging bottleneck module 2 comprises a plurality of separation-extraction-merging bottleneck structures which are connected in sequence.
The initialization module 1 extracts the features of the image to be processed to obtain an initial feature map of the image to be processed.
The separation-extraction-merging bottleneck module 2 extracts multi-scale features of the image to be processed using separated feature channels to obtain a multi-scale feature map. In this embodiment, the separation-extraction-merging bottleneck structure in module 2 follows the separate-extract-merge concept: first, a 3 × 3 standard convolution layer compresses the useful information of the initial feature map of the image to be processed; then, Split feature channel separation is adopted, and two parallel branches with different receptive fields process the compressed feature map to generate feature maps of different scales; finally, another 3 × 3 standard convolution layer fuses the multi-scale information of these feature maps to obtain the multi-scale feature map.
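As an illustration of this separate-extract-merge flow, a minimal PyTorch sketch is given below; the equal channel split, the dilation rate and the residual addition are assumptions consistent with the description, not the exact structure.

```python
import torch
import torch.nn as nn

class SEMBottleneck(nn.Module):
    """Sketch of a separate-extract-merge bottleneck: compress, split channels,
    process two branches with different receptive fields, then merge."""
    def __init__(self, channels, dilation=2):
        super().__init__()
        half = channels // 2
        self.compress = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.PReLU(), nn.BatchNorm2d(channels))
        # branch 1: 3x3 depth-separable convolution
        self.branch1 = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),
            nn.Conv2d(half, half, 1, bias=False),
            nn.PReLU(), nn.BatchNorm2d(half))
        # branch 2: 3x3 depth-separable dilated convolution (larger receptive field)
        self.branch2 = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=dilation, dilation=dilation,
                      groups=half, bias=False),
            nn.Conv2d(half, half, 1, bias=False),
            nn.PReLU(), nn.BatchNorm2d(half))
        self.merge = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.PReLU(), nn.BatchNorm2d(channels))

    def forward(self, x):
        y = self.compress(x)
        a, b = torch.chunk(y, 2, dim=1)            # split the feature channels
        y = torch.cat([self.branch1(a), self.branch2(b)], dim=1)
        return self.merge(y) + x                   # fuse scales and add the input mapping
```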
The partition-fusion channel attention module 3 extracts and fuses the global channel information and local channel information of the initial feature map to obtain a channel information feature map. In this embodiment, the partition-fusion channel attention module 3 includes a global attention generation branch and a local attention fusion branch. In the local attention fusion branch, the channel attention weights of sub-regions of the initial feature map are obtained by partitioning, and the attention weights of the regions are then fused by adaptive learning, so that the relations among sub-regions of the feature map are taken into account. The global attention generation branch obtains the global information of the initial feature map to generate a global channel attention weight. Finally, the regional channel attention weights and the global channel attention weight are fused to obtain the final channel attention weight. Generating channel attention weights from both global and local information considers both the regional relations and the global information of the feature map, which improves the accuracy of the image semantic segmentation network.
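A minimal PyTorch sketch of such a partition-fusion channel attention block is given below; the k × k grid partition, the reduction ratio and the learnable per-region fusion weights are illustrative assumptions rather than the exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PFCAttention(nn.Module):
    """Sketch: global channel attention fused with partition-wise (local) channel attention."""
    def __init__(self, channels, k=2, reduction=4):
        super().__init__()
        self.k = k
        hidden = max(channels // reduction, 4)
        self.global_fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels))
        self.local_fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels))
        # learnable weights that adaptively fuse the k*k regional pooling results
        self.region_weight = nn.Parameter(torch.ones(k * k) / (k * k))

    def forward(self, x):
        b, c, h, w = x.shape
        # global branch: global average pooling -> global channel weights
        g = self.global_fc(x.mean(dim=(2, 3)))                       # (B, C)
        # local branch: pool each sub-region of a k x k grid, then fuse adaptively
        regions = F.adaptive_avg_pool2d(x, self.k).flatten(2)        # (B, C, k*k)
        l = self.local_fc((regions * self.region_weight).sum(-1))    # (B, C)
        w_attn = torch.sigmoid(g + l).view(b, c, 1, 1)               # fused channel attention
        return x * w_attn
```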
The feature fusion module 4 performs feature fusion on the multi-scale feature map, the channel information feature map and the original image to be processed to obtain a fusion feature map. In this embodiment, fusing these three inputs supplements the effective information of the features with the detail information of the image and improves the accuracy of the image semantic segmentation network.
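A minimal PyTorch sketch of such a feature fusion module is given below; resizing the raw input image to the feature-map resolution before splicing is an assumption about how the original image is brought to a compatible size, and the channel widths are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionModule(nn.Module):
    """Sketch: concatenate multi-scale features, channel-attention features and the
    (resized) input image along channels, then fuse them with a 1x1 convolution."""
    def __init__(self, feat_channels, attn_channels, out_channels, img_channels=3):
        super().__init__()
        self.fuse = nn.Conv2d(feat_channels + attn_channels + img_channels,
                              out_channels, kernel_size=1, bias=False)

    def forward(self, multi_scale_feat, channel_info_feat, image):
        # resize the raw input image to the feature-map resolution (assumption)
        img = F.interpolate(image, size=multi_scale_feat.shape[2:],
                            mode="bilinear", align_corners=False)
        x = torch.cat([multi_scale_feat, channel_info_feat, img], dim=1)
        return self.fuse(x)
```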
The multi-scale attention decoder 5 recovers image precision according to the fusion feature map to obtain the image semantic segmentation result. In this embodiment, the fusion feature map, which already fuses multi-scale features, guides the multi-scale attention decoder 5 to recover image precision: important information in the spatial regions of the image receives weighted attention while non-important information is suppressed, which effectively improves the decoder's recovery precision for objects of different scales.
The terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.

Claims (10)

1. The real-time image semantic segmentation method for the lightweight multi-scale feature fusion is characterized by comprising the following steps of:
s1: constructing an image semantic segmentation network, wherein the image semantic segmentation network comprises an initialization module, a separation-extraction-merging bottleneck module, a partition-fusion channel attention module, a feature fusion module and a multi-scale attention decoder; the separation-extraction-merging bottleneck module comprises a plurality of separation-extraction-merging bottleneck structures which are connected in sequence;
s2: inputting an image to be processed into the image semantic segmentation network, and outputting an image semantic segmentation result by the image semantic segmentation network;
the initialization module extracts features of an image to be processed to obtain an initial feature map of the image to be processed; the separation-extraction-merging bottleneck module extracts multi-scale features of the initial feature map using separated feature channels to obtain a multi-scale feature map; the partition-fusion channel attention module extracts and fuses global channel information and local channel information of the initial feature map to obtain a channel information feature map; the feature fusion module performs feature fusion on the multi-scale feature map, the channel information feature map and the original image to be processed to obtain a fusion feature map; and the multi-scale attention decoder recovers image precision based on the fusion feature map to obtain an image semantic segmentation result.
2. The real-time image semantic segmentation method for lightweight multi-scale feature fusion according to claim 1, wherein the step of the initialization module performing feature extraction on the image to be processed to obtain the initial feature map of the image to be processed comprises:
performing a convolution operation on the image to be processed through a convolution layer to obtain a first feature map;
and splicing the first feature map with the original image to be processed in parallel, and performing a convolution operation on the spliced result through a 1 × 1 convolution layer to obtain a second feature map.
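A minimal PyTorch sketch of this initialization stage is shown below; the channel width, the bilinear resizing used to match resolutions before splicing, and the convolution stack (taken from the stride-2 3 × 3 convolution plus two 3 × 3 convolutions of claim 7) are illustrative assumptions rather than the exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InitBlock(nn.Module):
    """Sketch of the initialization module: convolutions produce a first feature map,
    which is concatenated with the (resized) input image and fused by a 1x1 convolution."""
    def __init__(self, out_channels=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, out_channels, 3, stride=2, padding=1, bias=False),  # 3x3, stride 2
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False))
        self.fuse = nn.Conv2d(out_channels + 3, out_channels, kernel_size=1, bias=False)

    def forward(self, image):
        first = self.convs(image)                                    # first feature map
        img = F.interpolate(image, size=first.shape[2:],
                            mode="bilinear", align_corners=False)    # match resolution (assumption)
        return self.fuse(torch.cat([first, img], dim=1))             # second feature map
```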
3. The real-time image semantic segmentation method for lightweight multi-scale feature fusion according to claim 2, wherein the step of the separation-extraction-merging bottleneck module extracting multi-scale features of the initial feature map using separated feature channels to obtain the multi-scale feature map comprises:
inputting the second feature map into a first separation-extraction-merging bottleneck structure; after the second feature map is convolved by a 3 × 3 standard convolution layer, separating the convolution result by channel to generate two feature maps, which are convolved respectively by a 3 × 3 depth-separable convolution layer and a 3 × 3 depth-separable dilated convolution layer;
merging the two convolution results, performing a convolution operation on the merged result through a 3 × 3 standard convolution layer, performing feature mapping between this result and the feature map input into the current separation-extraction-merging bottleneck structure, and inputting the feature mapping result into the next separation-extraction-merging bottleneck structure for processing;
and obtaining the multi-scale feature map after processing by the plurality of separation-extraction-merging bottleneck structures.
4. The real-time image semantic segmentation method for lightweight multi-scale feature fusion according to claim 3, wherein, in each separation-extraction-merging bottleneck structure, the output ends of the 3 × 3 standard convolution layer, the 3 × 3 depth-separable convolution layer and the 3 × 3 depth-separable dilated convolution layer are sequentially connected with a PReLU activation layer and a BN layer.
5. The real-time image semantic segmentation method for lightweight multi-scale feature fusion according to claim 3, wherein the partition-fusion channel attention module comprises a global attention generation branch and a local attention fusion branch, and the step of the partition-fusion channel attention module extracting and fusing global channel information and local channel information of the initial feature map to obtain the channel information feature map comprises:
inputting a feature map A output by the first separation-extraction-merging bottleneck structure in the separation-extraction-merging bottleneck module into the partition-fusion channel attention module, where it enters the global attention generation branch and the local attention fusion branch respectively;
for the feature map A1 input to the global attention generation branch, performing global average pooling and learning the channel weights of the feature map A1 to obtain the global channel information of the feature map A1;
for the feature map A2 input to the local attention fusion branch, partitioning the feature map A2 into k regions; performing average pooling on each of the k regions of the feature map A2 and learning the pooled values to obtain the local channel information of the feature map A2;
and performing weighted fusion of the global channel information and the local channel information, and performing feature extraction on the output feature map according to the weighted fusion result to obtain the channel information feature map.
6. The real-time image semantic segmentation method for lightweight multi-scale feature fusion according to claim 5, wherein the step of the feature fusion module performing feature fusion on the multi-scale feature map, the channel information feature map and the original image to be processed to obtain the fusion feature map comprises:
splicing, by the feature fusion module, the multi-scale feature map, the channel information feature map and the original image to be processed in the channel direction, and performing a convolution operation on the spliced result through a 1 × 1 convolution layer to obtain the fusion feature map.
7. The real-time image semantic segmentation method for lightweight multi-scale feature fusion according to claim 2, wherein the step of performing a convolution operation on the image to be processed through a convolution layer to obtain the first feature map comprises:
inputting the image to be processed into the initialization module, performing a convolution operation through a 3 × 3 standard convolution layer with a stride of 2, and then performing convolution operations through two consecutive 3 × 3 standard convolution layers to obtain the first feature map.
8. The real-time image semantic segmentation method for lightweight multi-scale feature fusion according to claim 1, wherein the separation-extraction-merging bottleneck module comprises a first separation-extraction-merging bottleneck module and a second separation-extraction-merging bottleneck module; the partition-fusion channel attention module comprises a first partition-fusion channel attention module and a second partition-fusion channel attention module; the feature fusion module comprises a first feature fusion module and a second feature fusion module; and the step of inputting the image to be processed into the image semantic segmentation network and outputting the image semantic segmentation result by the image semantic segmentation network comprises:
performing a convolution operation on the image to be processed through a convolution layer to obtain a first feature map; performing parallel splicing feature fusion of the first feature map and the original image to be processed to obtain a second feature map;
inputting the second feature map, after one downsampling, into the first separation-extraction-merging bottleneck module, and performing multi-scale feature extraction on the second feature map through a plurality of separation-extraction-merging bottleneck structures to obtain a first multi-scale feature map; transmitting the output feature map of the first separation-extraction-merging bottleneck structure in the first separation-extraction-merging bottleneck module to the first partition-fusion channel attention module, wherein the first partition-fusion channel attention module extracts and fuses global channel information and local channel information of the output feature map to obtain a first channel information feature map; inputting the first multi-scale feature map, the first channel information feature map and the original image to be processed into the first feature fusion module, and performing feature fusion on them by the first feature fusion module to obtain a first fusion feature map;
inputting the first fusion feature map, after one downsampling, into the second separation-extraction-merging bottleneck module, and performing multi-scale feature extraction on the first fusion feature map through a plurality of separation-extraction-merging bottleneck structures to obtain a second multi-scale feature map; transmitting the output feature map of the first separation-extraction-merging bottleneck structure in the second separation-extraction-merging bottleneck module to the second partition-fusion channel attention module, wherein the second partition-fusion channel attention module extracts and fuses global channel information and local channel information of the output feature map to obtain a second channel information feature map; inputting the second multi-scale feature map, the second channel information feature map and the original image to be processed into the second feature fusion module, and performing feature fusion on them by the second feature fusion module to obtain a second fusion feature map;
and inputting the first fusion feature map and the second fusion feature map into the multi-scale attention decoder, and performing image precision recovery by the multi-scale attention decoder according to the first fusion feature map and the second fusion feature map to obtain an image semantic segmentation result.
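Since claim 8 walks through the complete two-stage data flow, a structural PyTorch sketch of that flow may help; the wiring shown is an interpretation of the claim, the submodules (initialization block, downsampling layers, SEM-B structures, PFCA modules, feature fusion modules and decoder) are supplied by the caller, and none of the hyperparameters are specified here.

```python
import torch.nn as nn

class LMFFNetSketch(nn.Module):
    """Structural sketch of the encoder-decoder flow described in claim 8;
    only the data flow is shown, all submodules are passed in externally."""
    def __init__(self, init_block, down1, sem_blocks1, pfca1, ffm1,
                 down2, sem_blocks2, pfca2, ffm2, decoder):
        super().__init__()
        self.init_block, self.down1, self.down2 = init_block, down1, down2
        self.sem1 = nn.ModuleList(sem_blocks1)   # SEM-B structures of stage 1
        self.sem2 = nn.ModuleList(sem_blocks2)   # SEM-B structures of stage 2
        self.pfca1, self.ffm1 = pfca1, ffm1
        self.pfca2, self.ffm2 = pfca2, ffm2
        self.decoder = decoder

    def forward(self, image):
        x = self.down1(self.init_block(image))          # second feature map, downsampled
        first_out = None
        for i, block in enumerate(self.sem1):
            x = block(x)
            if i == 0:
                first_out = x                            # output of the 1st SEM structure
        fuse1 = self.ffm1(x, self.pfca1(first_out), image)    # first fusion feature map

        y = self.down2(fuse1)
        first_out2 = None
        for i, block in enumerate(self.sem2):
            y = block(y)
            if i == 0:
                first_out2 = y
        fuse2 = self.ffm2(y, self.pfca2(first_out2), image)   # second fusion feature map

        return self.decoder(fuse1, fuse2)                # multi-scale attention decoder
```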
9. The method of claim 8, wherein the first fusion feature map comprises a feature map F11, and the second fusion feature map comprises a feature map F21 and a feature map F31; the step of inputting the first fusion feature map and the second fusion feature map into the multi-scale attention decoder to obtain the image semantic segmentation result comprises:
performing a convolution operation on the feature map F11 through a 1 × 1 standard convolution layer to obtain a feature map F12 with C1 feature channels; performing a convolution operation on the feature map F21 through a 1 × 1 standard convolution layer to obtain a feature map F22 with C2 feature channels; and performing bilinear upsampling on the feature map F22 to obtain a feature map F23, expressed as:
F23 = U(C1×1(F21))
wherein C1×1(·) denotes a 1 × 1 convolution and U(·) denotes bilinear upsampling;
splicing the feature map F12 and the feature map F23 in the channel direction, and passing the splicing result through a depth-separable convolution layer and a Sigmoid activation function to obtain a multi-scale attention feature map Mmam, expressed as:
Mmam = σ(Cpw(Cdw(Concat(F12, F23))))
wherein σ(·) denotes the Sigmoid activation function, Cpw(·) denotes the point-wise convolution of the depth-separable convolution layer, Cdw(·) denotes the depth-wise convolution of the depth-separable convolution layer, and Concat(·) denotes splicing a plurality of feature maps in the channel direction;
performing a convolution operation on the feature map F31 through a depth-separable convolution layer to obtain a feature map F32, and performing bilinear upsampling on the feature map F32 to obtain a feature map F33, expressed as:
F33 = U(Cpw(Cdw(F31)))
and performing pixel-level multiplication of the multi-scale attention feature map Mmam and the feature map F33, so that the spatial feature information of F33 is weighted and attended to, to obtain a pixel class matrix Foutput as the image semantic segmentation result, expressed as:
Foutput = Mmam ⊗ F33
wherein ⊗ denotes pixel-level multiplication.
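For illustration, a minimal PyTorch sketch of a decoder following the operations of claim 9 is given below; the channel numbers C1 and C2, the use of 19 output classes, and the upsampling targets are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MADecoderSketch(nn.Module):
    """Sketch of the multi-scale attention decoder of claim 9 (channel widths assumed)."""
    def __init__(self, c11, c21, c31, c1, c2, num_classes=19):
        super().__init__()
        self.conv11 = nn.Conv2d(c11, c1, kernel_size=1, bias=False)   # F11 -> F12
        self.conv21 = nn.Conv2d(c21, c2, kernel_size=1, bias=False)   # F21 -> F22
        # depth-separable convolution producing the attention map M_mam
        self.attn = nn.Sequential(
            nn.Conv2d(c1 + c2, c1 + c2, 3, padding=1, groups=c1 + c2, bias=False),
            nn.Conv2d(c1 + c2, num_classes, kernel_size=1, bias=False))
        # depth-separable convolution mapping F31 to F32 (class-wise maps assumed)
        self.proj = nn.Sequential(
            nn.Conv2d(c31, c31, 3, padding=1, groups=c31, bias=False),
            nn.Conv2d(c31, num_classes, kernel_size=1, bias=False))

    def forward(self, f11, f21, f31):
        f12 = self.conv11(f11)
        f23 = F.interpolate(self.conv21(f21), size=f12.shape[2:],
                            mode="bilinear", align_corners=False)       # F22 upsampled -> F23
        m_mam = torch.sigmoid(self.attn(torch.cat([f12, f23], dim=1)))  # multi-scale attention map
        f33 = F.interpolate(self.proj(f31), size=m_mam.shape[2:],
                            mode="bilinear", align_corners=False)       # F32 upsampled -> F33
        return m_mam * f33                                              # pixel-level weighting -> F_output
```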
10. A real-time image semantic segmentation system for lightweight multi-scale feature fusion, applied to the real-time image semantic segmentation method for lightweight multi-scale feature fusion, characterized by comprising:
the image semantic segmentation network comprises an initialization module, a separation-extraction-merging bottleneck module, a partition-fusion channel attention module, a feature fusion module and a multi-scale attention decoder; the separation-extraction-merging bottleneck module comprises a plurality of separation-extraction-merging bottleneck structures which are connected in sequence;
the initialization module extracts the features of the image to be processed to obtain an initial feature map of the image to be processed;
the separation-extraction-merging bottleneck module extracts multi-scale features of the initial feature map of the image to be processed using separated feature channels to obtain a multi-scale feature map;
the partition-fusion channel attention module extracts and fuses global channel information and local channel information of the initial feature map to obtain a channel information feature map;
the feature fusion module performs feature fusion on the multi-scale feature map, the channel information feature map and the original image to be processed to obtain a fusion feature map;
and the multi-scale attention decoder recovers the image precision according to the fusion characteristic graph to obtain an image semantic segmentation result.
CN202210365378.6A 2022-04-08 2022-04-08 Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion Active CN114445430B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210365378.6A CN114445430B (en) 2022-04-08 2022-04-08 Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210365378.6A CN114445430B (en) 2022-04-08 2022-04-08 Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion

Publications (2)

Publication Number Publication Date
CN114445430A true CN114445430A (en) 2022-05-06
CN114445430B CN114445430B (en) 2022-06-21

Family

ID=81359899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210365378.6A Active CN114445430B (en) 2022-04-08 2022-04-08 Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion

Country Status (1)

Country Link
CN (1) CN114445430B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268870A (en) * 2018-01-29 2018-07-10 重庆理工大学 Multi-scale feature fusion ultrasonoscopy semantic segmentation method based on confrontation study
CN112330681A (en) * 2020-11-06 2021-02-05 北京工业大学 Attention mechanism-based lightweight network real-time semantic segmentation method
CN113221969A (en) * 2021-04-25 2021-08-06 浙江师范大学 Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion
CN114037833A (en) * 2021-11-18 2022-02-11 桂林电子科技大学 Semantic segmentation method for Miao-nationality clothing image
CN114048822A (en) * 2021-11-19 2022-02-15 辽宁工程技术大学 Attention mechanism feature fusion segmentation method for image
CN114119638A (en) * 2021-12-02 2022-03-01 上海理工大学 Medical image segmentation method integrating multi-scale features and attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHI Min et al.: "An effective adaptive motion estimation search algorithm", 《计算机应用》 (Journal of Computer Applications) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115564789A (en) * 2022-12-01 2023-01-03 北京矩视智能科技有限公司 Method and device for segmenting defect region of workpiece by cross-level fusion and storage medium
CN116229065A (en) * 2023-02-14 2023-06-06 湖南大学 Multi-branch fusion-based robotic surgical instrument segmentation method
CN116229065B (en) * 2023-02-14 2023-12-01 湖南大学 Multi-branch fusion-based robotic surgical instrument segmentation method
CN117237397A (en) * 2023-07-13 2023-12-15 天翼爱音乐文化科技有限公司 Portrait segmentation method, system, equipment and storage medium based on feature fusion
CN116721302A (en) * 2023-08-10 2023-09-08 成都信息工程大学 Ice and snow crystal particle image classification method based on lightweight network
CN116721302B (en) * 2023-08-10 2024-01-12 成都信息工程大学 Ice and snow crystal particle image classification method based on lightweight network
CN117058507A (en) * 2023-08-17 2023-11-14 浙江航天润博测控技术有限公司 Fourier convolution-based visible light and infrared image multi-scale feature fusion method
CN117058507B (en) * 2023-08-17 2024-03-19 浙江航天润博测控技术有限公司 Fourier convolution-based visible light and infrared image multi-scale feature fusion method
CN117522884A (en) * 2024-01-05 2024-02-06 武汉理工大学三亚科教创新园 Ocean remote sensing image semantic segmentation method and device and electronic equipment
CN117522884B (en) * 2024-01-05 2024-05-17 武汉理工大学三亚科教创新园 Ocean remote sensing image semantic segmentation method and device and electronic equipment

Also Published As

Publication number Publication date
CN114445430B (en) 2022-06-21

Similar Documents

Publication Publication Date Title
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN111563508B (en) Semantic segmentation method based on spatial information fusion
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN112528961B (en) Video analysis method based on Jetson Nano
CN115035361A (en) Target detection method and system based on attention mechanism and feature cross fusion
CN113269133A (en) Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning
CN112183240A (en) Double-current convolution behavior identification method based on 3D time stream and parallel space stream
Behera et al. Vegetation extraction from UAV-based aerial images through deep learning
CN115375781A (en) Data processing method and device
CN114267025A (en) Traffic sign detection method based on high-resolution network and light-weight attention mechanism
CN113297959A (en) Target tracking method and system based on corner attention twin network
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN114973199A (en) Rail transit train obstacle detection method based on convolutional neural network
CN115861756A (en) Earth background small target identification method based on cascade combination network
Patel et al. A novel approach for semantic segmentation of automatic road network extractions from remote sensing images by modified UNet
CN114202747A (en) Real-time lane line segmentation method, device, terminal and storage medium
Wang Remote sensing image semantic segmentation algorithm based on improved ENet network
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN116503602A (en) Unstructured environment three-dimensional point cloud semantic segmentation method based on multi-level edge enhancement
CN116311154A (en) Vehicle detection and identification method based on YOLOv5 model optimization
CN115861861A (en) Lightweight acceptance method based on unmanned aerial vehicle distribution line inspection
Meng et al. Fast-armored target detection based on multi-scale representation and guided anchor
Xue et al. Multiscale feature extraction network for real-time semantic segmentation of road scenes on the autonomous robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant