CN111444923A

CN111444923A - Image semantic segmentation method and device under natural scene

Info

Publication number: CN111444923A
Application number: CN202010286607.6A
Authority: CN
Inventors: 李硕豪; 张军; 何华; 周浩; 王风雷
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2020-04-13
Filing date: 2020-04-13
Publication date: 2020-07-24

Abstract

The application relates to a method and a device for segmenting image semantics under a natural scene. The method comprises the following steps: extracting a preliminary feature matrix of an image to be semantically segmented through a convolutional layer of a convolutional neural network, respectively calculating the maximum value and the average value of pixels in a receptive field of the preliminary feature matrix through a pooling layer, obtaining the edge gradient feature of the preliminary feature matrix according to the difference information of the maximum value and the average value, obtaining a fusion feature according to the edge gradient feature and the preliminary feature matrix, performing feature fusion and extraction on the fusion feature according to a preset depth residual error network to obtain a depth feature, establishing a long dependency relationship among all pixels in the depth feature to obtain a dependency feature map, and classifying the dependency feature map through an output layer to obtain the classification corresponding to each pixel. By adopting the method, the accuracy of image semantic segmentation can be improved.

Description

Image semantic segmentation method and device under natural scene

Technical Field

The application relates to the technical field of machine learning, in particular to a method and a device for segmenting image semantics in a natural scene.

Background

Unlike image-level processing in high-level computer vision, image semantic segmentation is the basis and difficulty in low-level computer vision. Belonging to pixel-level image processing, image segmentation divides each pixel into specific semantic labels. It can make the computer conveniently know the scene and accurately find the corresponding object position. Image semantic segmentation plays an important role in the fields of computer vision and artificial intelligence, such as autopilot, robotic environmental perception, and hospital image measurement.

In the existing image semantic segmentation technology, the features and the pixel classification of the image are mainly extracted through deep convolutional networks (DCNNs), but the deep convolutional networks not only reduce the feature resolution and the positioning accuracy of the target object in the image, but also generally cause the loss of partial components in the case that the target object has different scales. Therefore, the mainstream solution for semantic segmentation of images is to increase feature resolution. By analyzing the existing model, the definition of the target edge and the boundary can greatly influence the image segmentation result. In the structure of DCNNs, the target boundary accuracy is mainly affected by two factors. On the one hand, the spatial resolution of feature mapping is reduced by downsampling in convolutional and pooling layers, resulting in boundary blurring and shifting. On the other hand, multi-scaling of objects may lead to a series of problems such as loss of large object parts and small object classification errors.

Disclosure of Invention

Therefore, in order to solve the technical problem, it is necessary to provide a method and an apparatus for segmenting image semantics in a natural scene, which can solve the problem of inaccurate image semantics segmentation performed by a deep convolutional network.

A method for semantic segmentation of images in natural scenes comprises the following steps:

extracting a preliminary characteristic matrix of the image to be semantically segmented through a convolution layer of a convolution neural network;

respectively calculating the maximum value and the average value of pixels in the receptive field of the preliminary characteristic matrix through a pooling layer, and obtaining the edge gradient characteristic of the preliminary characteristic matrix according to the difference information of the maximum value and the average value;

obtaining fusion characteristics according to the edge gradient characteristics and the preliminary characteristic matrix, and performing characteristic fusion and extraction on the fusion characteristics according to a preset depth residual error network to obtain depth characteristics;

establishing a long dependency relationship among all pixels in the depth feature to obtain a dependency feature graph;

and classifying the dependency feature graph through an output layer to obtain the classification corresponding to each pixel.

In one embodiment, the method further comprises the following steps: and performing feature extraction on the image to be semantically segmented with a preset size through a convolution layer of a convolution neural network, and obtaining a preliminary feature matrix of a target size after batch standardization layer processing.

In one embodiment, the method further comprises the following steps: the maximum value of the pixels in the receptive field is calculated by the maximum pooling layer and the average value of the pixels in the receptive field is calculated by the average pooling layer.

In one embodiment, the method further comprises the following steps: calculating the difference value between the maximum value and the average value through an Eltwise layer to obtain the difference information, and obtaining edge gradient characteristics according to the difference information; and fusing the edge gradient feature and the preliminary feature matrix through the set parameters of the Eltwise layer to obtain a fused feature.

In one embodiment, the method further comprises the following steps: and establishing a long dependency relationship among all pixels in the depth feature through the pyramid pooling layer and the hole convolution pyramid layer respectively to obtain a dependency feature graph.

In one embodiment, the method further comprises the following steps: acquiring multi-level pooled output through a pyramid pooling layer, and sampling the multi-level pooled output by adopting bilinear interpolation to obtain a two-dimensional feature matrix with the same size as the depth feature; fusing the two-dimensional feature matrix to obtain a prior feature, and fusing the prior feature and the depth feature to obtain a fused feature map; inputting the fused feature map into the hole convolution pyramid layer to obtain a plurality of hole feature matrixes with the same size as the depth features; the hole convolution pyramid layer comprises a plurality of pooling layers with the same convolution kernel and different convolution kernel intervals; and sampling the hole characteristic matrix by using a bilinear interpolation value to obtain a dependency characteristic diagram.

In one embodiment, the method further comprises the following steps: and classifying the dependence characteristic graph through a softmax layer to obtain the classification corresponding to each pixel.

An apparatus for semantic segmentation of images in natural scenes, the apparatus comprising:

the preliminary feature extraction module is used for extracting a preliminary feature matrix of the image to be semantically segmented through a convolution layer of the convolution neural network;

the edge feature extraction module is used for respectively calculating the maximum value and the average value of pixels in the preliminary feature matrix receptive field through a pooling layer, and obtaining the edge gradient feature of the preliminary feature matrix according to the difference information of the maximum value and the average value;

the depth feature extraction module is used for obtaining fusion features according to the edge gradient features and the preliminary feature matrix, and performing feature fusion and extraction on the fusion features according to a preset depth residual error network to obtain depth features;

the dependency establishing module is used for establishing a long dependency relationship among pixels in the depth feature to obtain a dependency feature graph;

and the classification module is used for classifying the dependency feature graph through an output layer to obtain the classification corresponding to each pixel.

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

According to the image semantic segmentation method, the device, the computer equipment and the storage medium in the natural scene, the deep neural network, the edge gradient theory and the long-distance dependence principle are combined, and when an image is input to the deep neural network, a final image semantic segmentation result can be directly obtained. Therefore, the invention can realize end-to-end image semantic segmentation, can more accurately judge the edge of a specific object in the image, can overcome the influence of the object multi-scale problem, and can realize better segmentation on large objects and small objects in the image.

Drawings

FIG. 1 is a schematic flow chart illustrating a method for semantic segmentation of images in natural scenes in one embodiment;

FIG. 2 is a design framework diagram of edge gradient feature extraction in one embodiment;

FIG. 3 is a diagram illustrating the relationship between a fully connected conditional random field and a layer of hole convolution pyramids in another embodiment;

FIG. 4 is a design framework diagram of the long distance dependence of the pyramid pooling layer and the aperture convolution pyramid layer in one embodiment;

FIG. 5 is a general flowchart of a method for semantic segmentation of images in natural scenes in one embodiment;

FIG. 6 is a block diagram illustrating an embodiment of an apparatus for semantic segmentation of images in natural scenes;

FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

In one embodiment, as shown in fig. 1, there is provided a method for semantic segmentation of an image in a natural scene, including the following steps:

and 102, extracting a preliminary characteristic matrix of the image to be semantically segmented through a convolution layer of the convolution neural network.

The convolutional layer is a layer in a convolutional neural network, and can sample an image to be semantically segmented.

The image to be semantically segmented may be a picture in a natural scene, such as a photo taken by a mobile phone, a camera, etc., or a picture drawn by an artist by hand and stored in a computer device. The image to be semantically segmented can be a color image or a gray-scale image.

The preliminary feature matrix contains preliminary information in the image to be semantically segmented, and the classification of the pixels cannot be determined through the preliminary feature matrix.

And 104, respectively calculating the maximum value and the average value of the pixels in the receptive field of the preliminary characteristic matrix through the pooling layer, and obtaining the edge gradient characteristic of the preliminary characteristic matrix according to the difference information of the maximum value and the average value.

The receptive field refers to the area of pixels covered by the convolution kernel size in the pooling layer, for example, a convolution kernel of 2 × 2 would include 4 pixels in the receptive field.

Taking the convolution kernel 2 × 2 as an example, the edge gradient extraction of the feature is realized by simulating a Roberts operator through the pooling layer, and the gradient calculation of the Roberts operator is as follows:

wherein G represents a gradient. The pixel value in the receptive field is a₁₁,a₁₂,a₂₁,a₂₂Let a be₁₁≥a₂₂,a₁₂≥a₂₁After the four pixel values are substituted into the Roberts operator gradient formula, the gradient formula can be expressed as:

wherein, a_max＝max(a₁₁,a₁₂,a₂₁,a₂₂)，a_mean＝mean(a₁₁,a₁₂,a₂₁,a₂₂) It is found from the derivation that the gradient can be calculated by the difference between the maximum and mean values of the pixels in the receptive field. When the gradient G (x, y) is small, the maximum and mean differences in the receptive field are small, indicating that these pixel values are similar, andthe probability of edges in this region is less. When the gradient G (x, y) is large, the maximum value and the average value difference in the receptive field are large, indicating that the variation in the pixel values is large, and the probability that an edge exists in the region is high. Thus, the gradient between pixels can be replaced by the statistical maximum and mean values in the receptive field. This has the advantage that edges can be detected in each direction, not just the vertical direction in the Roberts operator.

And 106, obtaining fusion characteristics according to the edge gradient characteristics and the preliminary characteristic matrix, and performing characteristic fusion and extraction on the fusion characteristics according to a preset depth residual error network to obtain depth characteristics.

The depth residual error network can be realized by a network with a ResNet structure, and the size of the depth feature can be further reduced by fusion and extraction.

And step 108, establishing a long dependency relationship among all pixels in the depth feature to obtain a dependency feature map.

The long dependency relationship refers to the internal relation of each pixel, and the accuracy of pixel classification can be further improved by establishing the dependency between the pixels.

And step 110, classifying the dependency feature graph through an output layer to obtain a classification corresponding to each pixel.

In the image semantic segmentation method under the natural scene, the deep neural network, the edge gradient theory and the long-distance dependence principle are combined, and when an image is input to the deep neural network, a final image semantic segmentation result can be directly obtained. Therefore, the invention can realize end-to-end image semantic segmentation, can more accurately judge the edge of a specific object in the image, can overcome the influence of the object multi-scale problem, and can realize better segmentation on large objects and small objects in the image.

In one embodiment, feature extraction is performed on a to-be-semantically segmented image with a preset size through a convolutional layer of a convolutional neural network, and a preliminary feature matrix of a target size is obtained after batch normalization layer processing.

Specifically, after receiving an image to be semantically segmented, converting the image to be semantically segmented into a preset size, for example, setting the preset size to be 321 × 321, and then performing feature extraction on the input image by using a convolutional layer and a batch normalization layer, for example, setting parameters of the convolutional layer to be convolution kernel size of 7, sliding interval of 2 and expansion size of 3, and obtaining a preliminary matrix size of 161 × 161, wherein the batch normalization layer aims to adjust distribution of intermediate result data and has no parameters.

In one embodiment, the maximum value of the pixels in the receptive field is calculated by the maximum pooling layer and the average value of the pixels in the receptive field is calculated by the average pooling layer.

In this embodiment, the maximum pooling layer and the average pooling layer in the DCNN model may be used to complete the calculation.

In one embodiment, the difference value between the maximum value and the average value is calculated through the Eltwise layer to obtain difference information, the edge gradient characteristic is obtained according to the difference information, and the edge gradient characteristic and the preliminary characteristic matrix are fused through the set parameters of the Eltwise layer to obtain a fusion characteristic.

In this embodiment, the operation of the Eltwise layer includes: product (dot product), sum (addition or subtraction) and max (in large value). Therefore, after the maximum value and the average value are calculated, the calculation of the difference information can be completed through the Eltwise layer.

Specifically, as shown in fig. 2, the size of the maximum pooling layer and the average pooling layer is 2 × 2, the sliding interval is set to 1, two parameters of the Eltwise layer are set to 2 and-1, respectively, which indicate that the edge gradient feature in the image is additionally added on the basis of the preliminary appearance feature as a supplement, and finally the preliminary appearance feature and the edge feature are fused by a convolution layer.

After the fusion features are obtained, further fusion and extraction are required to be performed on the fusion features to obtain depth features, which can be specifically realized through a depth residual error network.

Specifically, the depth residual network adopts a commonly-used partial-depth residual network ResNet, the structures of the partial-depth residual network ResNet are sequentially (a convolution layer 1, a batch normalization layer 1, a Scale layer 1, a Relu layer 1, a convolution layer 2, a batch normalization layer 2, a Scale layer 2, a Relu layer 2, a convolution layer 3, a batch normalization layer 3, a Scale layer 3, an Eltwise layer 1, a Relu layer 3) × from input to output, the convolution kernel sizes of the three convolution layers are respectively 1 × 1 × 064, 3 ×, 1 ×, 1 ×, (a convolution layer 4, a batch normalization layer 4, a Relu layer 4, a convolution layer 5, a batch normalization layer 5, a Scale layer 5, a Relu layer 5, a convolution layer 6, a batch normalization layer 6, an Eltwuse layer 6, a batch normalization layer 54, a batch normalization layer 72, a standard layer 72, a batch normalization layer 72, a standard layer 72, a batch normalization layer 72, a standard layer 72, a batch normalization layer 72, a standard layer 72, a batch normalization layer 72, a standard layer 72, a batch normalization layer 72, a batch parameter, a batch normalization layer, a batch parameter.

In one embodiment, the long dependency relationship among pixels in the depth feature is established through the pyramid pooling layer and the hole convolution pyramid layer respectively, so that a dependency feature graph is obtained.

The pyramid pooling layer includes multiple pooling layers, each level of convolution kernel is different, the hole convolution pyramid layer includes multiple convolution layers, each convolution kernel has the same size, but the convolution kernel intervals are different, as shown in fig. 3.

In another embodiment, the multilevel pooled output is obtained through a pyramid pooling layer, the multilevel pooled output is sampled by bilinear interpolation to obtain a two-dimensional feature matrix with the same size as the depth feature, the two-dimensional feature matrix is fused to obtain a prior feature, the prior feature and the depth feature are fused to obtain a fused feature map, the fused feature map is input into a hole convolution pyramid layer to obtain a plurality of hole feature matrices with the same size as the depth feature, the hole convolution pyramid layer comprises pooling layers with the same convolution kernels and different convolution kernel intervals, and the hole feature matrices are sampled by bilinear interpolation to obtain a dependency feature map.

In the embodiment, the pyramid pooling layer and the empty convolution pyramid layer are combined to form a simplified full-connection conditional random field, and the multi-scale problem of the object and the context modeling between all parts of the object are solved by establishing long-distance dependence between all nodes. The pyramid pooling layer module obtains a layered global prior by adopting average pooling, and combines a plurality of local context information with global context information to solve the multi-scale object problem. The hole convolution pyramid layer can model the relation among all the nodes and establish long-distance dependence among the nodes, so that the structural modeling of the relation among all the parts in the object is realized. Energy function E of conditional random field_centreThe following were used:

wherein phi represents a potential function of the potential,

a function of a unary potential is represented,

representing a binary potential function, whereas in the present invention it is also possible to establish long distance dependent connections between nodes by a combination of a pyramid pooling layer and a hole convolution pyramid layer, a hole convolution pyramid layer (convolution kernel: 3 × 3; hole set-upA potential function F of r)^(r)Can be expressed as:

thus, the energy function E of the pyramid pooling layer and the hole convolution pyramid layer_centreCan be expressed as:

taking the size of the depth feature 41 × 41 as an example, as shown in fig. 4, based on the above derivation, the invention adds a pyramid pooling layer after the last feature map of the DCNN, which consists of four average pooling layers with kernel sizes 1, 2, 4, 5, the outputs of the four average pooling layers 41 × 41, 21 × 21, 11 × 11, and 9 × 9, respectively, then upsamples the four-level outputs respectively using bilinear interpolation, and obtains four two-dimensional feature matrices with resolutions 41 × 41 to fuse into a global prior and connect them with the input feature matrices, next, structurally modeling the relationship between the parts in the object through the hole convolution pyramid layer, which consists of five convolutional layers with convolution kernel sizes of 3 × 3, but with different hole sizes between the convolutional kernels, in the invention, the space between the convolutional kernels is set to 0, 6, 12, 18, 24, sliding to 1, thereby obtaining 5 features with resolution 41 × 41, the final feature matrix is obtained by interpolation and classifying the pixel output of the bilinear interpolation by the method of 321 × 321.

In one embodiment, the dependency feature map is classified through a softmax layer, and a classification corresponding to each pixel is obtained.

To sum up, the overall flow of the embodiment of the present invention is shown in fig. 5, and in fig. 5, the image semantic segmentation method in the natural scene is divided into four steps of preliminary feature extraction, edge feature extraction, feature fusion and refinement, long-distance dependency establishment, and context feature extraction.

It should be understood that, although the steps in the flowchart of fig. 1 are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least a portion of the steps in fig. 1 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the order of performance of the sub-steps or stages is not necessarily sequential, but may be performed in turn or alternately with other steps or at least a portion of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 6, there is provided an apparatus for semantic segmentation of an image in a natural scene, including: a preliminary feature extraction module 602, an edge feature extraction module 604, a depth feature extraction module 606, a dependency establishment module 608, and a classification module 610, wherein:

a preliminary feature extraction module 602, configured to extract a preliminary feature matrix of the image to be semantically segmented by using a convolutional layer of a convolutional neural network;

an edge feature extraction module 604, configured to calculate a maximum value and an average value of pixels in the preliminary feature matrix receptive field through a pooling layer, respectively, and obtain an edge gradient feature of the preliminary feature matrix according to difference information of the maximum value and the average value;

a depth feature extraction module 606, configured to obtain a fusion feature according to the edge gradient feature and the preliminary feature matrix, and perform feature fusion and extraction on the fusion feature according to a preset depth residual error network to obtain a depth feature;

a dependency building module 608, configured to build a long dependency relationship between pixels in the depth feature to obtain a dependency feature map;

and the classification module 610 is configured to classify the dependency feature map through an output layer to obtain a classification corresponding to each pixel.

In one embodiment, the preliminary feature extraction module 602 is further configured to perform feature extraction on a to-be-semantically segmented image of a preset size through a convolutional layer of a convolutional neural network, and obtain a preliminary feature matrix of a target size after batch normalization layer processing.

In one embodiment, the edge feature extraction module 604 is further configured to calculate a maximum value of the pixels in the receptive field by the maximum pooling layer, and calculate an average value of the pixels in the receptive field by the average pooling layer.

In one embodiment, the depth feature extraction module 606 is further configured to calculate a difference between the maximum value and the average value through an Eltwise layer to obtain the difference information, and obtain an edge gradient feature according to the difference information; and fusing the edge gradient feature and the preliminary feature matrix through the set parameters of the Eltwise layer to obtain a fused feature.

In one embodiment, the dependency building module 608 is further configured to build a long dependency relationship between each pixel in the depth feature through the pyramid pooling layer and the hole convolution pyramid layer, respectively, to obtain a dependency feature map.

In one embodiment, the dependency building module 608 is further configured to obtain a multi-level pooled output through the pyramid pooling layer, and sample the multi-level pooled output by using bilinear interpolation to obtain a two-dimensional feature matrix having the same size as the depth feature; fusing the two-dimensional feature matrix to obtain a prior feature, and fusing the prior feature and the depth feature to obtain a fused feature map; inputting the fused feature map into the hole convolution pyramid layer to obtain a plurality of hole feature matrixes with the same size as the depth features; the hole convolution pyramid layer comprises a plurality of pooling layers with the same convolution kernel and different convolution kernel intervals; and sampling the hole characteristic matrix by using a bilinear interpolation value to obtain a dependency characteristic diagram.

In one embodiment, the classification module 610 is further configured to classify the dependency feature map through a softmax layer, so as to obtain a classification corresponding to each pixel.

For specific limitations of the image semantic segmentation apparatus under the natural scene, reference may be made to the above limitations on the image semantic segmentation method under the natural scene, and details are not described herein again. All or part of the modules in the image semantic segmentation device under the natural scene can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing image data to be semantically segmented. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to realize a method for semantic segmentation of images in natural scenes.

Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the embodiments of the method in the above embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out embodiments of the method in the above-mentioned embodiments.

It will be understood by those of ordinary skill in the art that all or a portion of the processes of the methods of the embodiments described above may be implemented by a computer program that may be stored on a non-volatile computer-readable storage medium, which when executed, may include the processes of the embodiments of the methods described above, wherein any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for semantic segmentation of images in natural scenes comprises the following steps:

2. The method according to claim 1, wherein extracting a preliminary feature matrix of the image to be semantically segmented by convolutional layers of a convolutional neural network comprises:

and performing feature extraction on the image to be semantically segmented with a preset size through a convolution layer of a convolution neural network, and obtaining a preliminary feature matrix of a target size after batch standardization layer processing.

3. The method according to claim 1, wherein the calculating the maximum value and the average value of the pixels in the preliminary feature matrix receptive field by the pooling layer respectively comprises:

the maximum value of the pixels in the receptive field is calculated by the maximum pooling layer and the average value of the pixels in the receptive field is calculated by the average pooling layer.

4. The method of claim 1, wherein deriving a fused feature from the edge gradient feature and the preliminary feature matrix comprises:

calculating the difference value between the maximum value and the average value through an Eltwise layer to obtain the difference information, and obtaining edge gradient characteristics according to the difference information;

and fusing the edge gradient feature and the preliminary feature matrix through the set parameters of the Eltwise layer to obtain a fused feature.

5. The method according to any one of claims 1 to 4, wherein establishing long dependency relationships among pixels in the depth feature to obtain a dependency feature map comprises:

and establishing a long dependency relationship among all pixels in the depth feature through the pyramid pooling layer and the hole convolution pyramid layer respectively to obtain a dependency feature graph.

6. The method of claim 5, wherein the establishing long dependency relationships between pixels in the depth features through a pyramid pooling layer and a hole convolution pyramid layer, respectively, to obtain a dependency feature map comprises:

acquiring multi-level pooled output through a pyramid pooling layer, and sampling the multi-level pooled output by adopting bilinear interpolation to obtain a two-dimensional feature matrix with the same size as the depth feature;

fusing the two-dimensional feature matrix to obtain a prior feature, and fusing the prior feature and the depth feature to obtain a fused feature map;

inputting the fused feature map into the hole convolution pyramid layer to obtain a plurality of hole feature matrixes with the same size as the depth features; the hole convolution pyramid layer comprises a plurality of pooling layers with the same convolution kernel and different convolution kernel intervals;

and sampling the hole characteristic matrix by using a bilinear interpolation value to obtain a dependency characteristic diagram.

7. The method according to any one of claims 1 to 4, wherein classifying the dependency feature map through an output layer to obtain a classification corresponding to each pixel comprises:

and classifying the dependence characteristic graph through a softmax layer to obtain the classification corresponding to each pixel.

8. An apparatus for semantic segmentation of an image in a natural scene, the apparatus comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.