CN114187311A

CN114187311A - Image semantic segmentation method, device, equipment and storage medium

Info

Publication number: CN114187311A
Application number: CN202111530231.XA
Authority: CN
Inventors: 徐鑫
Original assignee: Jingdong Kunpeng Jiangsu Technology Co Ltd
Current assignee: Jingdong Kunpeng Jiangsu Technology Co Ltd
Priority date: 2021-12-14
Filing date: 2021-12-14
Publication date: 2022-03-15

Abstract

The invention discloses a training method of an image semantic segmentation model. The method comprises the following steps: inputting a first sample image into an initial image semantic segmentation model; extracting a first image feature and a second image feature of the first sample image through a feature extraction network; inputting the first image characteristic into an edge detection network to obtain an edge characteristic, and inputting the second image characteristic into a semantic segmentation network to obtain a semantic characteristic; inputting the semantic features and the edge features into a feature fusion network, obtaining fusion features obtained after fusion of the semantic features and the edge features, and obtaining a first semantic category prediction confidence map based on the fusion features; and calculating a first loss function value based on the first semantic category prediction confidence map and semantic category label data corresponding to the first sample image, and adjusting network parameters in the initial image semantic segmentation model. The invention improves the problem of detail blurring caused by lack of rich spatial information in image semantic segmentation.

Description

Image semantic segmentation method, device, equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of computer vision processing, in particular to a method, a device, equipment and a storage medium for image semantic segmentation.

Background

With the development of artificial intelligence, the image semantic segmentation technology has wide application in application scenes such as automatic driving, path planning and collision avoidance systems.

However, the prior art has at least the following technical problems:

the semantic segmentation network adopting the encoder-decoder structure has the problem of detail blurring due to lack of rich spatial information.

Disclosure of Invention

The embodiment of the invention provides an image semantic segmentation method, device, equipment and storage medium, which are used for fusing edge features with higher resolution and semantic features with abundant spatial information based on a self-attention mechanism, fully utilizing complementary information to provide abundant spatial structure details for image semantic segmentation and solving the problem of detail blurring caused by the lack of abundant spatial information in a common semantic segmentation network.

In a first aspect, an embodiment of the present invention provides a training method for an image semantic segmentation model, including:

inputting a first sample image into an initial image semantic segmentation model; wherein the initial image semantic segmentation model comprises: the system comprises a feature extraction network, a semantic segmentation network, an edge detection network and a feature fusion network;

performing feature extraction on the first sample image through the feature extraction network to obtain a first image feature and a second image feature; the resolution of the first image feature is higher than the resolution of the second image feature;

inputting the first image characteristic into the edge detection network to obtain the edge characteristic of the first sample image output by the edge detection network, and inputting the second image characteristic into the semantic segmentation network to obtain the semantic characteristic of the first sample image output by the semantic segmentation network;

inputting the semantic features and the edge features into the feature fusion network, obtaining fusion features obtained by fusing the semantic features and the edge features by the feature fusion network based on an attention mechanism, and obtaining a first semantic category prediction confidence map based on the fusion features;

calculating a first loss function value based on the first semantic category prediction confidence map and semantic category label data corresponding to the first sample image, and adjusting network parameters in the initial image semantic segmentation model based on the first loss function value.

In a second aspect, an embodiment of the present invention provides an image semantic segmentation method, including:

acquiring an image to be analyzed;

inputting the image to be analyzed into a target image semantic segmentation model obtained by training by adopting the training method of the image semantic segmentation model;

and acquiring a target semantic category prediction confidence map output by the target image semantic segmentation model, and determining a semantic segmentation result of the image to be analyzed based on the target semantic category prediction confidence map.

In a third aspect, an embodiment of the present invention further provides a training apparatus for an image semantic segmentation model, where the apparatus includes:

the input module is used for inputting the first sample image into the initial image semantic segmentation model; wherein the initial image semantic segmentation model comprises: the system comprises a feature extraction network, a semantic segmentation network, an edge detection network and a feature fusion network;

the characteristic extraction module is used for extracting the characteristics of the first sample image through the characteristic extraction network to obtain a first image characteristic and a second image characteristic; the resolution of the first image feature is higher than the resolution of the second image feature;

the edge detection module is used for inputting the first image feature into the edge detection network, obtaining the edge feature of the first sample image output by the edge detection network, inputting the second image feature into the semantic segmentation network, and obtaining the semantic feature of the first sample image output by the semantic segmentation network;

the feature fusion module is used for inputting the semantic features and the edge features into the feature fusion network, obtaining fusion features obtained by fusing the semantic features and the edge features by the feature fusion network based on an attention mechanism, and obtaining a first semantic category prediction confidence map based on the fusion features;

and the parameter adjusting module is used for calculating a first loss function value based on the first semantic category prediction confidence map and the semantic category label data corresponding to the first sample image, and adjusting network parameters in the initial image semantic segmentation model based on the first loss function value.

In a fourth aspect, an embodiment of the present invention further provides an image semantic segmentation apparatus, where the apparatus includes:

the acquisition module is used for acquiring an image to be analyzed;

the input module is used for inputting the image to be analyzed into a target image semantic segmentation model obtained by training through the image semantic segmentation model;

and the determining module is used for acquiring a target semantic category prediction confidence map output by the target image semantic segmentation model and determining a semantic segmentation result of the image to be analyzed based on the target semantic category prediction confidence map.

In a fifth aspect, an embodiment of the present invention further provides a terminal device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the training method for the image semantic segmentation model or the image semantic segmentation method when executing the program.

In a sixth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the training method for the image semantic segmentation model or the image semantic segmentation method.

Extracting a first image feature and a second image feature of a first sample image through a feature extraction network of an initial image semantic segmentation model, inputting the first image feature into an edge detection network to obtain an edge feature of the first sample image, inputting the second image feature into the semantic segmentation network to obtain a semantic feature of the first sample image, fusing the semantic feature and the edge feature based on an attention mechanism to obtain a fusion feature, and obtaining a first semantic category prediction confidence map based on the fusion feature; the method comprises the steps of predicting a confidence map based on a first semantic category and semantic category label data corresponding to a first sample image, calculating a first loss function value, adjusting network parameters in an initial image semantic segmentation model based on the first loss function value, fusing edge features with high resolution and semantic features with abundant spatial information based on an attention mechanism, providing abundant spatial structure details for image semantic segmentation by fully utilizing complementary information, and solving the problem that the semantic segmentation network adopting an encoder-decoder structure is blurred in details due to the lack of the abundant spatial information.

Drawings

FIG. 1 is a flowchart of a training method for an image semantic segmentation model according to an embodiment of the present invention;

fig. 2A is a flowchart of a feature extraction method according to an embodiment of the present invention;

fig. 2B is a schematic structural diagram of an edge detection network according to an embodiment of the present invention;

fig. 2C is a schematic structural diagram of an image semantic segmentation model according to an embodiment of the present invention;

FIG. 2D is a flowchart of a training method of an image semantic segmentation model according to a second embodiment of the present invention;

FIG. 2E is a schematic diagram of a feature fusion method according to a second embodiment of the present invention;

FIG. 3 is a flowchart of a training method for an image semantic segmentation model according to a third embodiment of the present invention;

FIG. 4 is a flowchart of an image semantic segmentation method according to a fourth embodiment of the present invention;

fig. 5 is a block diagram of a training apparatus for an image semantic segmentation model according to a fifth embodiment of the present invention;

fig. 6 is a block diagram of an image semantic segmentation apparatus according to a sixth embodiment of the present invention;

fig. 7 is a block diagram of a terminal device according to a seventh embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only a part of the structures related to the present invention, not all of the structures, are shown in the drawings, and furthermore, embodiments of the present invention and features of the embodiments may be combined with each other without conflict.

Example one

Fig. 1 is a flowchart of a training method for an image semantic segmentation model according to an embodiment of the present invention, where the embodiment is applicable to training the image semantic segmentation model, and the method may be executed by a training apparatus for the image semantic segmentation model, and the apparatus may be implemented by software and/or hardware.

As shown in fig. 1, the method specifically includes the following steps:

step 110, inputting a first sample image into an initial image semantic segmentation model; the initial image semantic segmentation model comprises the following steps: the system comprises a feature extraction network, a semantic segmentation network, an edge detection network and a feature fusion network.

Wherein the first sample image may be a sample image in a training sample set used for training an initial image semantic segmentation model; the first sample image may be pre-labeled with semantic category labels that identify semantic categories for portions of the set of pixels in the sample image.

The initial image semantic segmentation model refers to an untrained or completely untrained image semantic segmentation model. The initial image semantic segmentation model is used for performing semantic classification on the first sample image, and may include: the system comprises a feature extraction network, a semantic segmentation network, an edge detection network and a feature fusion network. The characteristic extraction network is used for extracting the image characteristics of the first sample image; the semantic segmentation network is used for performing semantic segmentation on the first sample image and extracting semantic features in the first sample image; the edge detection network is used for extracting edge features of the first sample image; the feature fusion network is used for fusing the semantic features and the edge features of the first sample image.

Step 120, performing feature extraction on the first sample image through a feature extraction network to obtain a first image feature and a second image feature; the resolution of the first image feature is higher than the resolution of the second image feature.

Wherein, the first image characteristic and the second image characteristic extracted by the characteristic extraction network are both in the form of characteristic graphs. The feature extraction network can select a residual error network or a U-Net structure model, the feature extraction network is not limited too much in the embodiment of the invention, as long as the first image feature and the second image feature with semantic information can be extracted, and the resolution of the first image feature is higher than that of the second image feature.

Specifically, feature extraction is carried out on the first sample image through a feature extraction network of the initial image semantic segmentation model, and a first image feature with high resolution and a second image feature with low resolution are obtained.

Step 130, inputting the first image feature into the edge detection network to obtain the edge feature of the first sample image output by the edge detection network, and inputting the second image feature into the semantic segmentation network to obtain the semantic feature of the first sample image output by the semantic segmentation network.

Wherein the edge detection network is used for identifying the edge feature of the first sample image.

Specifically, the first image features with high resolution are input into an edge detection network to extract more accurate edge information, and the second image features with low resolution but more semantic information are input into a semantic segmentation network to extract richer semantic information.

Step 140, inputting the semantic features and the edge features into a feature fusion network, obtaining fusion features obtained by fusing the semantic features and the edge features by the feature fusion network based on an attention mechanism, and obtaining a first semantic category prediction confidence map based on the fusion features.

Among them, Attention Mechanism (Attention Mechanism) is derived from the study of human vision. In cognitive science, due to the bottleneck of information processing, a part of all information is selectively focused while other visible information is ignored, and the mechanism is generally called attention mechanism.

The first semantic class prediction confidence map is based on a map formed by confidence degrees of semantic classes of all parts of pixel sets in the first sample image predicted by the fusion features.

Specifically, the feature fusion network fuses the input semantic features and the edge features based on an attention mechanism to obtain fusion features, solves the problem of information overload, and allocates computing resources to more important feature fusion tasks.

Step 150, calculating a first loss function value based on the first semantic category prediction confidence map and semantic category label data corresponding to the first sample image, and adjusting network parameters in the initial image semantic segmentation model based on the first loss function value.

The first loss function is used for measuring the difference between semantic categories obtained by performing image semantic segmentation on the first sample image through the initial image semantic segmentation model and semantic category label data corresponding to the first sample image.

Specifically, a first loss function value is determined according to semantic category label data corresponding to the first sample image and the first semantic category prediction confidence map, and network parameters in the initial image semantic segmentation model are adjusted based on the first loss function value.

For example, the network parameters in the initial image semantic segmentation model may include untrained initial network parameters, and may also include pre-trained network parameters.

According to the technical scheme of the embodiment, a first sample image is input into an initial image semantic segmentation model; performing feature extraction on the first sample image through a feature extraction network to obtain a first image feature and a second image feature; inputting the first image characteristic into an edge detection network to obtain the edge characteristic of a first sample image output by the edge detection network, and inputting the second image characteristic into a semantic segmentation network to obtain the semantic characteristic of the first sample image output by the semantic segmentation network; inputting the semantic features and the edge features into a feature fusion network, obtaining fusion features obtained by fusing the semantic features and the edge features by the feature fusion network based on an attention mechanism, and obtaining a first semantic category prediction confidence map based on the fusion features; the method comprises the steps of predicting a confidence map based on a first semantic category and semantic category label data corresponding to a first sample image, calculating a first loss function value, adjusting network parameters in an initial image semantic segmentation model based on the first loss function value, fusing edge features with high resolution and semantic features with abundant spatial information based on a self-attention mechanism, providing abundant spatial structure details for image semantic segmentation by fully utilizing complementary information, and improving the detail blurring problem caused by the fact that a common semantic segmentation network is lack of the abundant spatial information.

Optionally, the feature extraction network includes: a plurality of convolutional layers connected in series; the first image features comprise image features extracted from the N convolutional layers at the bottommost layer in the plurality of convolutional layers, and the second image features comprise image features extracted from other convolutional layers except the N convolutional layers at the bottommost layer in the plurality of convolutional layers; wherein N is an integer not less than 1 and less than the total number of convolutional layers included in the feature extraction network.

Specifically, the feature extraction network extracts the features of the N convolutional layers by connecting the plurality of convolutional layers in series and adopting a bottom-to-top feature extraction idea. Since the underlying feature resolution is higher and can be used to provide detailed edge location and structure information, a first image feature of the first sample image is extracted from the N convolutional layers at the bottom of the plurality of convolutional layers, and a second image feature of the first sample image is extracted from the other convolutional layers except the N convolutional layers at the bottom of the plurality of convolutional layers.

Optionally, the feature extraction network adopts a depth residual error network structure that removes the average pooling layer and the full-link layer and retains a preset number of convolution layers.

The conventional deep convolutional neural network includes: a preset number of convolutional layers, an average pooling layer, and a full-link layer. The recognition performance of the deep convolutional neural network on the image depends on the depth of the network to a great extent, but as the depth of the network is deepened continuously, the convergence of the deep convolutional neural network becomes more and more difficult, and phenomena such as gradient dispersion or gradient explosion and the like easily occur, so that the recognition accuracy of classification is reduced on the contrary. Based on the network depth, the deep residual error network ResNet well solves the problem caused by the network depth. In the embodiment of the invention, the depth residual error network is further improved to be used as a feature extraction network, the average pooling layer and the full-link layer of the depth residual error network are removed, and a preset number of convolutional layers are reserved. The depth residual error network structure selected by the embodiment of the invention can be a 34-layer depth residual error network, a 50-layer depth residual error network or a 101-layer depth residual error network.

Optionally, the preset number is 5; the first image features comprise image features respectively extracted from the 1 st convolutional layer, the 2 nd convolutional layer, the 3 rd convolutional layer and the 4 th convolutional layer in the 5 convolutional layers connected in series; the second image features include image features extracted from the 5 th convolutional layer of the 5 convolutional layers in the series.

Specifically, as shown in fig. 2A, the feature extraction network in the embodiment of the present invention includes 5 convolutional layers connected in series, and the image features of the first sample image are extracted from the first 4 convolutional layers of the 5 convolutional layers connected in series, that is, the 1 st convolutional layer, the 2 nd convolutional layer, the 3 rd convolutional layer, and the 4 th convolutional layer, respectively, to make first image features; the image feature of the first sample image is extracted from the 5 th convolution layer of the 5 convolution layers connected in series as the second image feature.

Exemplary, parameters of the image features extracted for 5 convolutional layers are shown in table 1.

TABLE 1

Parameter(s)	The 1 st convolutional layer	2 nd convolutional layer	The 3 rd convolutional layer	The 4 th convolutional layer	The 5 th convolutional layer
						Size of	128×H/2×C/2	256×H/4×C/4	512×H/8×C/8	1024×H/8×C/8	2048×H/8×C/8
Receptive field	11	39	95	831	1087

Where H is the height of the first sample image and C is the width of the first sample image.

It should be noted that, the first convolutional layer of the conventional depth residual error network structure is 7 × 7 convolution, the step size of the 4 th convolutional layer and the 5 th convolutional layer is 2, the first convolutional layer can be set to be three-layer 3x3 convolution in order to reduce the amount of computation, and meanwhile, in order to maintain the receptive field range of the conventional depth residual error network structure, the step size of the convolution of the 4 th convolutional layer and the first 3x3 of the 5 th convolutional layer is reduced from 2 to 1, and is replaced by the dilation convolution with a dilation rate of 2.

Optionally, when N is not less than 2, the edge detection network includes at least one propagation (Propagate) module and an edge module;

fusing the first image characteristics through each Propagate module and inputting the fused first image characteristics to the edge module;

and after the edge module performs convolution operation and nonlinear operation on the fusion features input by the Propagate module, outputting the edge features.

Specifically, when N is not less than 2, that is, when the first image feature includes image features extracted from at least two convolutional layers at the bottommost layer of the plurality of convolutional layers, the edge detection network includes at least one propgate module, the N first image features are fused by the propgate module, the fused first image feature is input to the edge module, the fusion feature input by the propgate module is subjected to convolution operation and nonlinear regression analysis operation by a preset convolution kernel in the edge module, and the edge feature of the first sample image is output.

When N is 1, the edge detection network does not include the propgate module, and the first image features extracted by 1 convolutional layer are directly input to the edge module without being fused.

Illustratively, the features extracted by the N bottommost convolutional layers are fused by a Propagate module, which allows information to be stepped back in order to Propagate context information to higher resolution feature layers. Features extracted by higher convolutional layers may change the number of channels by 1 × 1 convolutional layers, and then the spatial size is adjusted by bilinear interpolation so that features from higher layers are concatenated with features from lower layers to obtain fused features. The process of fusing the first image features by the respective Propagate modules may be expressed as:

wherein, f ([ theta ])_i) Representing a convolution kernel parameter of theta_iThe convolution operation of (2); ReLU (×) represents a regular function; up (, x) represents a bilinear interpolation operation, the purpose of which is to increase the resolution of the first parameter to the same size as the second parameter; concat [, ] C]Representing the concatenation of channel dimensions, i represents the number of layers of the convolutional layer,

representing the image characteristics of the first image characteristics fused by a Propagate module corresponding to the ith convolutional layer; f⁽ⁱ⁾The first image feature extracted from the ith convolution layer is shown.

Resulting fusion features

The hierarchical characteristics of the N bottommost convolutional layers are integrated.

Optionally, if N is greater than 2, the edge detection network includes a propgate module for each two adjacent convolutional layers in the N convolutional layers, respectively; wherein, the ith convolutional layer and the (i +1) th convolutional layer correspond to the ith propgate module, and i takes a value in [1, N-1 ]; and the number of the first and second groups,

when i is equal to N-1, the (N-1) th propgate module is used for executing fusion operation on the image features respectively extracted by the (N-1) th convolutional layer and the Nth convolutional layer and inputting the fusion result to the (N-2) th propgate module;

when i is equal to 1, the 1 st propgate module is used for executing fusion operation on the image features extracted from the 1 st convolutional layer and the fusion result input by the 2 nd propgate module, and inputting the obtained fusion features into the edge module;

and when i is equal to other values, the ith propgate module is used for executing fusion operation on the image features extracted from the ith convolutional layer and the fusion result input by the (i +1) th propgate module, and inputting the fusion result into the (i-1) th propgate module.

For example, in order to fully exert the function of the underlying features, an information trace-back route opposite to the feature extraction process from bottom to top is adopted, and an edge detection network as shown in fig. 2B is designed by using a propagation idea from top to bottom, if N is 4, the edge detection network respectively comprises a propate module for every two adjacent convolutional layers of the 4 convolutional layers, the 1 st convolutional layer and the 2 nd convolutional layer correspond to the 1 st propate module, the 2 nd convolutional layer and the 3 rd convolutional layer correspond to the 2 nd propate module, and the 3 rd convolutional layer and the 4 th convolutional layer correspond to the 3 rd propate module. The 3 rd propgate module is used for executing fusion operation on the image features extracted by the 3 rd convolutional layer and the 4 th convolutional layer respectively, inputting the fusion result to the 2 nd propgate module, the 2 nd propgate module is used for executing fusion operation on the image features extracted by the 2 nd convolutional layer and the fusion result input by the propgate module corresponding to the 3 rd convolutional layer, and inputting the fusion result to the 1 st propgate module, and the 1 st propgate module is used for executing fusion operation on the image features extracted by the 1 st convolutional layer and the fusion result input by the propgate module corresponding to the 2 nd convolutional layer, and inputting the obtained fusion features into the edge module.

When N is 2, the edge detection network includes 1 superpage module, and the first image features extracted from the 2 convolutional layers are fused and then the obtained fused features are input to the edge module.

As shown in fig. 2C, the initial image semantic segmentation model provided by the embodiment of the present invention includes: the system comprises a feature extraction network, a semantic segmentation network, an edge detection network and a feature fusion network. The training method of the image semantic segmentation model comprises the following steps: inputting a first sample image into an initial image semantic segmentation model; performing feature extraction on the first sample image through a feature extraction network to obtain a first image feature and a second image feature; inputting the first image characteristic into an edge detection network to obtain the edge characteristic of a first sample image output by the edge detection network, and inputting the second image characteristic into a semantic segmentation network to obtain the semantic characteristic of the first sample image output by the semantic segmentation network; inputting the semantic features and the edge features into a feature fusion network, obtaining fusion features obtained by fusing the semantic features and the edge features by the feature fusion network based on an attention mechanism, and obtaining a first semantic category prediction confidence map based on the fusion features; and calculating a first loss function value based on the first semantic category prediction confidence map and semantic category label data corresponding to the first sample image, and adjusting network parameters in the initial image semantic segmentation model based on the first loss function value.

Example two

Fig. 2D is a flowchart of a training method for an image semantic segmentation model according to the second embodiment of the present invention. In this embodiment, on the basis of the above embodiment, the training method of the image semantic segmentation model is further optimized, and in this embodiment, the feature fusion network includes an attention module, an upsampling module, and a concatenation module: obtaining an attention map based on the edge features through an attention module; performing upsampling operation on semantic features through an upsampling module to obtain an upsampling result; and performing preset calculation operation on the attention map and the up-sampling result through a series module, and then performing channel dimension series connection on the calculation operation result and the up-sampling result to obtain fusion characteristics.

As shown in fig. 2D, the method specifically includes:

step 210, inputting a first sample image into an initial image semantic segmentation model; the initial image semantic segmentation model comprises the following steps: the system comprises a feature extraction network, a semantic segmentation network, an edge detection network and a feature fusion network.

Step 220, performing feature extraction on the first sample image through a feature extraction network to obtain a first image feature and a second image feature; the resolution of the first image feature is higher than the resolution of the second image feature.

Step 230, inputting the first image feature into the edge detection network to obtain the edge feature of the first sample image output by the edge detection network, and inputting the second image feature into the semantic segmentation network to obtain the semantic feature of the first sample image output by the semantic segmentation network.

Step 240, obtaining an attention map based on the edge features through an attention module; performing upsampling operation on semantic features through an upsampling module to obtain an upsampling result; and performing preset calculation operation on the attention map and the up-sampling result through a series module, and then performing channel dimension series connection on the calculation operation result and the up-sampling result to obtain fusion characteristics.

Specifically, as shown in fig. 2E, the edge features are input into the attention module to obtain an attention map; the semantic features are input into an upsampling module to perform upsampling operation to obtain an upsampling result, after weighted calculation operation is performed on the upsampling result of the attention map and the semantic features, the calculation operation result and the upsampling result are connected in series through channel dimensions to obtain a fusion feature, and therefore the feature fusion network has the capability of ignoring irrelevant information and paying attention to key information.

Illustratively, the edge feature input attention module derives an attention map, which may be expressed as

W＝sigmod[f(F_E,θ_f)]；

Wherein, F_ERepresenting edge features, theta_fAs a parameter of the attention module, F (F)_E,θ_f) An attention map, sigmoid [ ] corresponding to the edge features under the parameters representing the known attention module]The function is an activation function, the function is to normalize the attention map, and W represents the normalized attention map.

For example, the fusion feature obtained by concatenating the channel dimensions of the calculation operation result and the upsampling result may be represented as:

F_f＝Concat[W*Up(F_S),Up(F_S)]；

wherein, F_fIndicates the fusion characteristics, Concat [, ]]Showing the concatenation of the dimensions of the channels, F_sIndicating edge features and Up (×) indicating upsampling.

And step 250, obtaining a first semantic type prediction confidence map based on the fusion features.

Specifically, the fusion features are input into a preset convolution layer; obtaining a first semantic category prediction confidence map which is output after the preset convolution layer reduces the dimension of the fusion features to K channels; and K is the number of elements in a preset semantic category set.

Illustratively, the fused features are input into a 1 × 1 convolutional layer, and the dimensions of the fused features are reduced to K channels, where K is the number of elements in a preset semantic category set.

And step 260, calculating a first loss function value based on the first semantic category prediction confidence map and semantic category label data corresponding to the first sample image, and adjusting network parameters in the initial image semantic segmentation model based on the first loss function value.

Optionally, calculating a first loss function value based on the first semantic class prediction confidence map and the semantic class label data corresponding to the first sample image, includes:

determining a first type of relevant prediction confidence map based on the semantic features, and determining a first edge prediction confidence map based on the edge features;

and calculating a first loss function value based on the first-class related prediction confidence map, the first edge prediction confidence map, the first semantic-class prediction confidence map and semantic-class label data and edge label data corresponding to the first sample image.

The first-class correlation prediction confidence map is a map formed by confidence degrees of semantic categories of all parts of pixel sets in the first sample image predicted based on semantic features; the first edge prediction confidence map is a map formed by the confidence of edge pixel points in the first sample image predicted based on the edge features.

Specifically, a first-class related prediction confidence map is determined based on semantic features, a first-edge prediction confidence map is determined based on edge features, in order to achieve end-to-end training of the initial image semantic segmentation model, a loss function of the initial image semantic segmentation model is determined according to the first-class related prediction confidence map, the first-edge prediction confidence map, the first semantic-class prediction confidence map, semantic-class label data corresponding to the first sample image and the edge label data, and a first loss function value is calculated.

For example, determining the first class of relevant prediction confidence maps based on semantic features may include: inputting the semantic features into a first preset convolutional layer; obtaining a first-class related prediction confidence map which is output after the preset convolution layer reduces the semantic features to K channels; and K is the number of elements in a preset semantic category set. Wherein, the first predetermined convolutional layer may be a 1 × 1 convolutional layer. Determining the first edge prediction confidence map based on the edge features may include: inputting the edge characteristics into a second preset convolution layer; and processing the output result of the second preset convolution layer based on the preset activation function to obtain a first edge prediction confidence map. Wherein the second predetermined convolutional layer may be a 1 × 1 convolutional layer. The preset activation function may be a Sigmoid function, and the output result of the second preset convolution layer may be mapped to 0 and 1 for normalization due to the properties of single increase of the Sigmoid function and single increase of the inverse function.

Optionally, calculating a first loss function value based on the first-class correlated prediction confidence map, the first edge prediction confidence map, the first semantic-class prediction confidence map, and semantic-class label data and edge label data corresponding to the first sample image, includes:

calculating a first function value of a first scene surveillance loss function based on the first-class related prediction confidence map and semantic category label data corresponding to the first sample image;

calculating a second function value of the edge supervision loss function based on the first edge prediction confidence map and the edge label data corresponding to the first sample image;

calculating a third function value of a second scene surveillance loss function based on the first semantic category prediction confidence map and semantic category label data corresponding to the first sample image;

a first loss function value is calculated based on the first function value, the second function value, and the third function value.

Specifically, the calculation formula of the first loss function value may be expressed as:

L(X；W)＝λ₁L_seg(Y；W_S)+λ₂L_edge(Z；W_E)+λ₃L_final(Y；W_FF)；

wherein X ∈ R^(3×H*W)Is a first sample image, L_seg(Y；W_S) Semantic category label data Y belonging to R corresponding to the first sample image X^(K×H*W)Substituting a first function value of the first scene surveillance loss function; w_SSegmenting network parameters of the network for semantics; l is_edge(Z；W_E) Corresponding edge label data Z epsilon R for the first sample image X^(1×H*W)Substituting the second function value into the edge supervision loss function to obtain a second function value; w_EDetecting a network parameter of the network for the edge; l is_final(Y；W_FF) Semantic category label data Y belonging to R corresponding to the first sample image X^(K×H*W)Substituting into a second scene supervision loss function meterCalculating a third function value; w_FFNetwork parameters for the feature fusion network. Lambda [ alpha ]₁、λ₂And λ₃Three hyperparameters between 0 and 1, for weighting the first, second and third function values, respectively.

Wherein the first scene surveillance loss function is:

the edge-supervised loss function is:

the second scenario supervised loss function is:

wherein,

predicting a confidence map for the first class of correlations; logPr (y)_m＝1|F_E；W_E) Predicting a confidence map for the first edge;

predicting a confidence map for the first semantic category; j represents the semantic class label data and,

representing a positive set of semantic class labels, Y_- ^kRepresenting a negative set corresponding to the semantic category label; m represents edge label data; z₊Representing a positive set corresponding to the edge label; z_-And a and beta are respectively a first parameter and a second parameter of the edge surveillance loss function.

Illustratively, set λ₁＝0.4，λ₂＝0.4，λ₃After determining the first loss function, the network parameters in the initial image semantic segmentation model are adjusted based on the first loss function value, which may be converted into an optimization problem for the parameters, that is, the first loss function is determined

W＝argmin(0.4L_seg(Y；W_S)+0.4L_edge(Z；W_E)+L_final(Y；W_FF))；

Wherein W represents a network parameter W in the initial image semantic segmentation model_S、W_E、W_FFThe optimal parameter set of (2).

EXAMPLE III

Fig. 3 is a flowchart of a training method for an image semantic segmentation model according to a third embodiment of the present invention. In this embodiment, before the first sample image is input into the initial image semantic segmentation model, the method further includes: and pre-training a semantic segmentation network and an edge detection network in the initial image semantic segmentation model based on the second sample image.

As shown in fig. 3, the method specifically includes:

and 310, pre-training a semantic segmentation network and an edge detection network in the initial image semantic segmentation model based on the second sample image.

Specifically, in order to obtain clearly visible semantic features and edge features, the training process of the initial image semantic segmentation model can be divided into two stages, and in the first stage, a semantic segmentation network and an edge detection network in the initial image semantic segmentation model are pre-trained on the basis of a second sample image; in the second stage, based on the training method of the image semantic segmentation model provided by the embodiment of the invention, the first sample image is input into the initial image semantic segmentation model, and model training is performed on the feature extraction network, the semantic segmentation network, the edge detection network and the feature fusion network of the initial image semantic segmentation model.

Step 320, inputting the first sample image into an initial image semantic segmentation model; the initial image semantic segmentation model comprises the following steps: the system comprises a feature extraction network, a semantic segmentation network, an edge detection network and a feature fusion network.

Step 330, performing feature extraction on the first sample image through a feature extraction network to obtain a first image feature and a second image feature; the resolution of the first image feature is higher than the resolution of the second image feature.

Step 340, inputting the first image feature into an edge detection network to obtain an edge feature of the first sample image output by the edge detection network, and inputting the second image feature into a semantic segmentation network to obtain a semantic feature of the first sample image output by the semantic segmentation network.

Step 350, obtaining an attention map based on the edge features through an attention module; performing upsampling operation on semantic features through an upsampling module to obtain an upsampling result; and performing preset calculation operation on the attention map and the up-sampling result through a series module, and then performing channel dimension series connection on the calculation operation result and the up-sampling result to obtain fusion characteristics.

And 360, obtaining a first semantic type prediction confidence map based on the fusion features.

Step 370, calculating a first loss function value based on the first semantic class prediction confidence map and the semantic class label data corresponding to the first sample image, and adjusting the network parameters in the initial image semantic segmentation model based on the first loss function value.

According to the technical scheme of the embodiment, a semantic segmentation network and an edge detection network in an initial image semantic segmentation model are pre-trained on the basis of a second sample image; inputting a first sample image into an initial image semantic segmentation model; performing feature extraction on the first sample image through a feature extraction network to obtain a first image feature and a second image feature; inputting the first image characteristic into an edge detection network to obtain the edge characteristic of a first sample image output by the edge detection network, and inputting the second image characteristic into a semantic segmentation network to obtain the semantic characteristic of the first sample image output by the semantic segmentation network; inputting the semantic features and the edge features into a feature fusion network, obtaining fusion features obtained by fusing the semantic features and the edge features by the feature fusion network based on an attention mechanism, and obtaining a first semantic category prediction confidence map based on the fusion features; the method comprises the steps of predicting a confidence map based on a first semantic category and semantic category label data corresponding to a first sample image, calculating a first loss function value, adjusting network parameters in an initial image semantic segmentation model based on the first loss function value, fusing edge features with high resolution and semantic features with abundant spatial information based on a self-attention mechanism, providing abundant spatial structure details for image semantic segmentation by fully utilizing complementary information, improving the detail blurring problem caused by the lack of abundant spatial information of a common semantic segmentation network, enabling the initial image semantic segmentation model to be fast converged, and improving the model training speed.

Optionally, pre-training a semantic segmentation network and an edge detection network in the initial image semantic segmentation model based on the second sample image, including:

inputting a second sample image into an initial image semantic segmentation model;

performing feature extraction on the second sample image through a feature extraction network to obtain a third image feature and a fourth image feature; the resolution of the third image feature is higher than the resolution of the fourth image feature;

inputting the third image characteristic into an edge detection network to obtain the edge characteristic of a second sample image output by the edge detection network, and inputting the fourth image characteristic into a semantic segmentation network to obtain the semantic characteristic of the second sample image output by the semantic segmentation network;

determining a second type of related prediction confidence map based on semantic features of a second sample image; determining a second edge prediction confidence map based on the edge features of the second sample image;

and calculating a second loss function value based on the second-class related prediction confidence map, the second edge prediction confidence map and the semantic category label data and the edge label data corresponding to the second sample image, and adjusting network parameters in the semantic segmentation network and the edge detection network based on the second loss function value.

Specifically, firstly, feature extraction is carried out on a second sample image through a feature extraction network based on the same method for extracting the edge feature and the semantic feature of the first sample image in the second stage, so as to obtain a third image feature and a fourth image feature; the resolution of the third image feature is higher than the resolution of the fourth image feature. Inputting the third image characteristic into an edge detection network to obtain an edge characteristic of a second sample image output by the edge detection network; and inputting the fourth image characteristic into a semantic segmentation network to obtain the semantic characteristic of the second sample image output by the semantic segmentation network. Secondly, determining a second type of related prediction confidence map based on semantic features of a second sample image; determining a second edge prediction confidence map based on the edge features of the second sample image; and finally, calculating a second loss function value based on the second-class related prediction confidence map, the second edge prediction confidence map and the semantic category label data and the edge label data corresponding to the second sample image, and adjusting network parameters in the semantic segmentation network and the edge detection network based on the second loss function value.

Optionally, calculating a second loss function value based on the second-class correlated prediction confidence map, the second edge prediction confidence map, and the semantic category label data and the edge label data corresponding to the second sample image, includes:

calculating a fourth function value of the first scene surveillance loss function based on the second-class correlation prediction confidence map and semantic category label data corresponding to the second sample image;

calculating a fifth function value of the edge surveillance loss function based on the second edge prediction confidence map and the edge label data corresponding to the second sample image;

and calculating a second loss function value based on the fourth function value and the fifth function value.

The second loss function is used for measuring the difference between the semantic category obtained by performing image semantic segmentation on the second sample image through the initial image semantic segmentation model and the semantic category label data corresponding to the second sample image.

For example, the calculation formula of the second loss function value may be expressed as:

L(X′；W′)＝λ′₁L_seg(Y′；W_S)+λ′₂L_edge(Z′；W_E)；

wherein X' is belonged to R^(3×H*W)For the second sample image, L_seg(Y′；W_S) Semantic category label data Y' belonged to R corresponding to the second sample image X^(K×H*W)Substituting a fourth function value of the first scene surveillance loss function; w_SSegmenting network parameters of the network for semantics; l is_edge(Z′；W_E) The edge label data Z' epsilon R corresponding to the second sample image X^(1×H*W)Substituting the function into a fifth function value obtained by calculating an edge supervision loss function; w_EDetecting a network parameter of the network for the edge; lambda'₁And λ'₂Is two hyperparameters between 0 and 1, which are used to weight the fourth function value and the fifth function value, respectively.

Wherein the first scene surveillance loss function is:

the edge-supervised loss function is:

wherein,

predicting a confidence map for the first class of correlations; logPr (ym ═ 1| F)_E；W_E) Predicting a confidence map for the first edge; j represents the semantic class label data and,

to representPositive set of semantic class labels, Y_- ^kRepresenting a negative set corresponding to the semantic category label; m represents edge label data; z + represents a positive set corresponding to the edge label; z-represents the negative set to which the edge label corresponds.

Illustratively, λ 'is provided'₁＝1，λ′₂After determining the second loss function value, the network parameters in the semantic segmentation network and the edge detection network are adjusted based on the second loss function value, which can be converted into an optimization problem for the parameters, that is, the second loss function value is determined as 1

W′＝argmin(L_seg(Y′；W_S)+L_edge(Z；W_E))；

Wherein W' represents a network parameter W of a semantic segmentation network in an initial image semantic segmentation model_SAnd network parameters W of the edge detection network_EA collection of (a).

Optionally, the first scene surveillance loss function and the second scene surveillance loss function are standard cross entropy loss functions, and the edge surveillance loss function value is a binary cross entropy loss function.

Specifically, Cross Entropy (Cross Entropy) is an important concept in shannon information theory, and is mainly used for measuring difference information between two probability distributions, and the Cross Entropy is used as a loss function in a network model and is used for measuring a difference between semantic category label data corresponding to a sample image and semantic categories corresponding to the sample image output by an initial image semantic segmentation model. The binary cross entropy loss function is a cross entropy loss function applicable to two classification categories; the standard cross entropy loss function is a cross entropy loss function that is applicable to multiple classification classes.

Example four

Fig. 4 is a flowchart of an image semantic segmentation method according to a fourth embodiment of the present invention, where this embodiment is applicable to a case of performing semantic segmentation on an image to be analyzed, and the method may be executed by a training device of an image semantic segmentation model, and the training device may be implemented by software and/or hardware.

As shown in fig. 4, the method specifically includes the following steps:

and step 410, acquiring an image to be analyzed.

Specifically, the mode of acquiring the image to be processed may be to acquire the image to be analyzed through a camera, and may also be to acquire the image to be analyzed through an interception mode.

Step 420, inputting the image to be analyzed into the target image semantic segmentation model obtained by training with the training method of the image semantic segmentation model according to any one of the first to third embodiments of the present invention.

The target image semantic segmentation model is a model which is completely trained by adopting the training method of the image semantic segmentation model of any one of the first embodiment to the third embodiment of the invention.

Specifically, an image to be analyzed is input into a target image semantic segmentation model for image semantic segmentation, and a target semantic category prediction confidence map is obtained. The target image semantic segmentation model comprises the following steps: the system comprises a feature extraction network, a semantic segmentation network, an edge detection network and a feature fusion network. Inputting an image to be analyzed into a target image semantic segmentation model; performing feature extraction on an image to be analyzed through a feature extraction network to obtain a first analysis image feature and a second analysis image feature, wherein the resolution of the first analysis image feature is higher than that of the second analysis image feature; inputting the first analytic image characteristic into an edge detection network to obtain the edge characteristic of an image to be analyzed output by the edge detection network, and inputting the second analytic image characteristic into a semantic segmentation network to obtain the semantic characteristic of the image to be analyzed output by the semantic segmentation network; and inputting the semantic features and the edge features into a feature fusion network, obtaining fusion features obtained by fusing the semantic features and the edge features by the feature fusion network based on an attention mechanism, and obtaining a target semantic category prediction confidence map of the image to be analyzed based on the fusion features.

And 430, acquiring a target semantic category prediction confidence map output by the target image semantic segmentation model, and determining a semantic segmentation result of the image to be analyzed based on the target semantic category prediction confidence map.

The target semantic category prediction confidence map is used for representing the confidence corresponding to the target semantic category of each pixel set in the image to be analyzed.

Specifically, according to a target semantic category prediction confidence map output by a target image semantic segmentation model, the confidence corresponding to the target semantic category of each pixel set in the image to be analyzed is determined, and the semantic segmentation result of the image to be analyzed is determined based on the confidence.

For example, the manner of determining the semantic segmentation result of the image to be analyzed based on the target semantic category prediction confidence map may be as follows: classifying a target pixel set of an image to be analyzed into a target semantic category corresponding to the maximum confidence; or may be: and if the confidence corresponding to the target semantic category of the target pixel set in the image to be analyzed is greater than a preset confidence threshold, classifying the target pixel set in the image to be analyzed into the target semantic category.

According to the technical scheme of the embodiment, the image to be analyzed is obtained and is input into a target image semantic segmentation model obtained by training through the training method of the image semantic segmentation model in any embodiment of the invention; the method comprises the steps of obtaining a target semantic category prediction confidence map output by a target image semantic segmentation model, determining a semantic segmentation result of an image to be analyzed based on the target semantic category prediction confidence map, fusing edge features with higher resolution and semantic features with abundant spatial information based on a self-attention mechanism, providing abundant spatial structure details for image semantic segmentation by fully utilizing complementary information, and improving the detail blurring problem caused by the fact that a common semantic segmentation network lacks abundant spatial information.

EXAMPLE five

Fig. 5 is a schematic structural diagram of a training device for an image semantic segmentation model according to a fifth embodiment of the present invention. The embodiment may be applicable to the case of training the image semantic segmentation model, the apparatus may be implemented in a software and/or hardware manner, and the apparatus may be integrated in any device that provides a function of training the image semantic segmentation model, as shown in fig. 5, the apparatus for training the image semantic segmentation model specifically includes: an input module 510, a feature extraction module 520, an edge detection module 530, a feature fusion module 540, and a parameter adjustment module 550.

An input module 510, configured to input the first sample image into an initial image semantic segmentation model; the initial image semantic segmentation model comprises the following steps: the system comprises a feature extraction network, a semantic segmentation network, an edge detection network and a feature fusion network;

a feature extraction module 520, configured to perform feature extraction on the first sample image through a feature extraction network to obtain a first image feature and a second image feature; the resolution of the first image feature is higher than the resolution of the second image feature;

an edge detection module 530, configured to input the first image feature into an edge detection network, obtain an edge feature of the first sample image output by the edge detection network, input the second image feature into a semantic segmentation network, and obtain a semantic feature of the first sample image output by the semantic segmentation network;

the feature fusion module 540 is configured to input the semantic features and the edge features into a feature fusion network, obtain fusion features obtained by fusing the semantic features and the edge features by the feature fusion network based on an attention mechanism, and obtain a first semantic category prediction confidence map based on the fusion features;

the parameter adjusting module 550 is configured to calculate a first loss function value based on the first semantic category prediction confidence map and the semantic category label data corresponding to the first sample image, and adjust a network parameter in the initial image semantic segmentation model based on the first loss function value.

Optionally, when N is not less than 2, the edge detection network includes at least one propagation propgate module and an edge module;

Optionally, the feature fusion network includes an attention module, an upsampling module, and a concatenation module:

obtaining an attention map based on the edge features through an attention module;

performing upsampling operation on semantic features through an upsampling module to obtain an upsampling result;

and performing preset calculation operation on the attention map and the up-sampling result through a series module, and then performing channel dimension series connection on the calculation operation result and the up-sampling result to obtain fusion characteristics.

Optionally, the parameter adjusting module 550 includes:

the first determining unit is used for determining a first-class related prediction confidence map based on the semantic features and determining a first edge prediction confidence map based on the edge features;

and the calculating unit is used for calculating a first loss function value based on the first-class related prediction confidence map, the first edge prediction confidence map, the first semantic-class prediction confidence map and semantic-class label data and edge label data corresponding to the first sample image.

Optionally, the first computing unit is specifically configured to:

Optionally, the method further includes:

and the pre-training module is used for pre-training the semantic segmentation network and the edge detection network in the initial image semantic segmentation model based on the second sample image before the first sample image is input into the initial image semantic segmentation model.

Optionally, the pre-training module includes:

the input unit is used for inputting the second sample image into the initial image semantic segmentation model;

the feature extraction unit is used for extracting features of the second sample image through a feature extraction network to obtain a third image feature and a fourth image feature; the resolution of the third image feature is higher than the resolution of the fourth image feature;

the feature output unit is used for inputting the third image feature into the edge detection network, obtaining the edge feature of the second sample image output by the edge detection network, inputting the fourth image feature into the semantic segmentation network, and obtaining the semantic feature of the second sample image output by the semantic segmentation network;

a second determining unit, configured to determine a second-class correlation prediction confidence map based on semantic features of a second sample image; determining a second edge prediction confidence map based on the edge features of the second sample image;

and the parameter adjusting unit is used for calculating a second loss function value based on the second-class related prediction confidence map, the second edge prediction confidence map and the semantic category label data and the edge label data corresponding to the second sample image, and adjusting network parameters in the semantic segmentation network and the edge detection network based on the second loss function value.

Optionally, the parameter adjusting unit is specifically configured to:

The product can execute the training method of the image semantic segmentation model provided by any one of the first embodiment to the third embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

EXAMPLE six

Fig. 6 is a schematic structural diagram of an image semantic segmentation apparatus according to a sixth embodiment of the present invention. The embodiment may be applicable to the case of performing semantic segmentation on an image to be analyzed, the apparatus may be implemented in a software and/or hardware manner, and the apparatus may be integrated in any device providing an image semantic segmentation function, as shown in fig. 6, the image semantic segmentation apparatus specifically includes: an acquisition module 610, an input module 620, and a determination module 630.

An obtaining module 610, configured to obtain an image to be analyzed;

an input module 620, configured to input an image to be analyzed into a target image semantic segmentation model obtained by training using the training method of the image semantic segmentation model according to any one of the first to third embodiments of the present invention;

the determining module 630 is configured to obtain a target semantic category prediction confidence map output by the target image semantic segmentation model, and determine a semantic segmentation result of the image to be analyzed based on the target semantic category prediction confidence map.

The product can execute the image semantic segmentation method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

EXAMPLE seven

Fig. 7 is a block diagram of a terminal device according to a seventh embodiment of the present invention, as shown in fig. 7, the computer device includes a processor 710, a memory 720, an input device 730, and an output device 740; the number of the processors 710 in the computer device may be one or more, and one processor 710 is taken as an example in fig. 7; the processor 710, the memory 720, the input device 730, and the output device 740 in the computer apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 7.

The memory 720 is used as a computer readable storage medium for storing software programs, computer executable programs, and modules, such as program instructions/modules corresponding to the training method of the image semantic segmentation model in the embodiment of the present invention (for example, the input module 510, the feature extraction module 520, the edge detection module 530, the feature fusion module 540, and the parameter adjustment module 550 in the training device of the image semantic segmentation model) or program instructions/modules corresponding to the image semantic segmentation method in the embodiment of the present invention (for example, the acquisition module 610, the input module 620, and the determination module 630 in the image semantic segmentation device). The processor 710 executes various functional applications and data processing of the computer device, i.e., implementing the training method of the image semantic segmentation model or the image semantic segmentation method described above, by executing the software programs, instructions and modules stored in the memory 720.

The memory 720 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 720 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 720 may further include memory located remotely from processor 710, which may be connected to the terminal device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 730 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the terminal apparatus. The output device 740 may include a display device such as a display screen.

Example eight

An eighth embodiment of the present invention further provides a computer-readable storage medium containing a computer program, where the computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for implementing a training method for an image semantic segmentation model, where the method includes: inputting a first sample image into an initial image semantic segmentation model; the initial image semantic segmentation model comprises the following steps: the system comprises a feature extraction network, a semantic segmentation network, an edge detection network and a feature fusion network; performing feature extraction on the first sample image through a feature extraction network to obtain a first image feature and a second image feature; the resolution of the first image feature is higher than the resolution of the second image feature; inputting the first image characteristic into an edge detection network to obtain the edge characteristic of a first sample image output by the edge detection network, and inputting the second image characteristic into a semantic segmentation network to obtain the semantic characteristic of the first sample image output by the semantic segmentation network; inputting the semantic features and the edge features into a feature fusion network, obtaining fusion features obtained by fusing the semantic features and the edge features by the feature fusion network based on an attention mechanism, and obtaining a first semantic category prediction confidence map based on the fusion features; and calculating a first loss function value based on the first semantic category prediction confidence map and semantic category label data corresponding to the first sample image, and adjusting network parameters in the initial image semantic segmentation model based on the first loss function value.

Alternatively, the program when executed by a processor implements a method of image semantic segmentation, the method comprising: acquiring an image to be analyzed; inputting an image to be analyzed into a target image semantic segmentation model obtained by training by adopting the training method of the image semantic segmentation model in any one of the first embodiment to the third embodiment of the invention; and acquiring a target semantic category prediction confidence map output by the target image semantic segmentation model, and determining a semantic segmentation result of the image to be analyzed based on the target semantic category prediction confidence map.

Of course, the storage medium containing the computer-executable instructions provided by the embodiments of the present invention is not limited to the above method, and may also perform the training method of the image semantic segmentation model or the related operations in the image semantic segmentation method provided by any embodiment of the present invention.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods of the embodiments of the present invention.

It should be noted that, in the embodiment of the training apparatus for the image semantic segmentation model or the image semantic segmentation apparatus, each included unit and module is only divided according to functional logic, but is not limited to the above division, as long as the corresponding function can be realized; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments illustrated herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A training method of an image semantic segmentation model is characterized by comprising the following steps:

2. The method of claim 1, wherein the feature extraction network comprises: a plurality of convolutional layers connected in series; the first image features comprise image features extracted from the N convolutional layers at the bottommost layer of the plurality of convolutional layers, and the second image features comprise image features extracted from other convolutional layers of the plurality of convolutional layers except the N convolutional layers at the bottommost layer; wherein N is an integer not less than 1 and less than the total number of convolutional layers included in the feature extraction network.

3. The method of claim 2, wherein the feature extraction network employs a deep residual network structure that removes average pooling layers and full-link layers, and retains a preset number of convolutional layers.

4. The method according to claim 3, wherein the preset number is 5; the first image features comprise image features respectively extracted from the 1 st convolutional layer, the 2 nd convolutional layer, the 3 rd convolutional layer and the 4 th convolutional layer in the 5 convolutional layers connected in series; the second image features include image features extracted from a 5 th convolutional layer of the 5 convolutional layers in the series.

5. The method according to claim 2, wherein when N is not less than 2, the edge detection network includes at least one propagation propgate module and one edge module;

and after the edge module executes convolution operation and nonlinear operation on the fusion features input by the Propagate module, outputting the edge features.

6. The method of claim 5, wherein if N is greater than 2, the edge detection network comprises a Propagate module for each two adjacent convolutional layers of the N convolutional layers; wherein, the ith convolutional layer and the (i +1) th convolutional layer correspond to the ith propgate module, and i takes a value in [1, N-1 ]; and the number of the first and second groups,

7. The method of claim 1, wherein the feature fusion network comprises an attention module, an upsampling module, and a concatenation module:

obtaining, by the attention module, an attention map based on the edge features;

performing, by the upsampling module, upsampling operation on the semantic features to obtain an upsampling result;

and performing preset calculation operation on the attention map and the up-sampling result through the series module, and then performing series connection of channel dimensions on the calculation operation result and the up-sampling result to obtain fusion characteristics.

8. The method according to any of claims 1-7, wherein the calculating a first loss function value based on the first semantic class prediction confidence map and corresponding semantic class label data of the first sample image comprises:

determining a first class of related prediction confidence maps based on the semantic features, and determining a first edge prediction confidence map based on the edge features;

9. The method of claim 8, wherein the calculating a first loss function value based on the first class of associated prediction confidence maps, the first edge prediction confidence map, the first semantic class prediction confidence map, and semantic class label data and edge label data corresponding to the first sample image comprises:

calculating a first function value of a first scene surveillance loss function based on the first-class correlation prediction confidence map and semantic category label data corresponding to the first sample image;

calculating a second function value of the edge surveillance loss function based on the first edge prediction confidence map and the edge label data corresponding to the first sample image;

calculating a first loss function value based on the first function value, the second function value, and the third function value.

10. The method of claim 8, wherein prior to inputting the first sample image into the initial image semantic segmentation model, the method further comprises:

and pre-training a semantic segmentation network and an edge detection network in the initial image semantic segmentation model based on the second sample image.

11. The method of claim 10, wherein the pre-training of the semantic segmentation network and the edge detection network in the initial image semantic segmentation model based on the second sample image comprises:

inputting a second sample image into the initial image semantic segmentation model;

performing feature extraction on the second sample image through the feature extraction network to obtain a third image feature and a fourth image feature; the resolution of the third image feature is higher than the resolution of the fourth image feature;

inputting the third image feature into the edge detection network to obtain an edge feature of the second sample image output by the edge detection network, and inputting the fourth image feature into the semantic segmentation network to obtain a semantic feature of the second sample image output by the semantic segmentation network;

determining a second type of relevant prediction confidence map based on semantic features of the second sample image; determining a second edge prediction confidence map based on the edge features of the second sample image;

and calculating a second loss function value based on the second-class related prediction confidence map, the second edge prediction confidence map and the semantic class label data and the edge label data corresponding to the second sample image, and adjusting network parameters in the semantic segmentation network and the edge detection network based on the second loss function value.

12. The method of claim 11, wherein calculating a second loss function value based on the second class of associated prediction confidence maps, the second edge prediction confidence map, and the semantic class label data and edge label data corresponding to the second sample image comprises:

calculating a fourth function value of the first scene surveillance loss function based on the second-class correlated prediction confidence map and semantic category label data corresponding to the second sample image;

calculating a fifth function value of an edge surveillance loss function based on the second edge prediction confidence map and the edge label data corresponding to the second sample image;

calculating a second loss function value based on the fourth function value and the fifth function value.

13. The method of claim 9, wherein the first scene supervised loss function and the second scene supervised loss function are standard cross entropy loss functions and the edge supervised loss function value is a binary cross entropy loss function.

14. A method for semantic segmentation of an image, the method comprising:

acquiring an image to be analyzed;

inputting the image to be analyzed into a target image semantic segmentation model obtained by training by adopting the training method of the image semantic segmentation model according to any one of claims 1 to 13;

15. An apparatus for training a semantic segmentation model of an image, comprising:

16. An image semantic segmentation apparatus, comprising:

the acquisition module is used for acquiring an image to be analyzed;

an input module, configured to input the image to be analyzed into a target image semantic segmentation model obtained by training using the training method of the image semantic segmentation model according to any one of claims 1 to 13;

17. A terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a method for training an image semantic segmentation model according to any one of claims 1 to 13 or implements a method for image semantic segmentation according to claim 14 when executing the program.

18. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method for training an image semantic segmentation model according to any one of claims 1 to 13 or a method for carrying out an image semantic segmentation as claimed in claim 14.