CN111444923A - Image semantic segmentation method and device under natural scene - Google Patents

Image semantic segmentation method and device under natural scene

Info

Publication number
CN111444923A
CN111444923A (application CN202010286607.6A)
Authority
CN
China
Prior art keywords
feature
layer
preliminary
dependency
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010286607.6A
Other languages
Chinese (zh)
Inventor
李硕豪
张军
何华
周浩
王风雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202010286607.6A priority Critical patent/CN111444923A/en
Publication of CN111444923A publication Critical patent/CN111444923A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a method and a device for image semantic segmentation under a natural scene. The method comprises the following steps: extracting a preliminary feature matrix of an image to be semantically segmented through a convolutional layer of a convolutional neural network; calculating, through pooling layers, the maximum value and the average value of the pixels in each receptive field of the preliminary feature matrix; obtaining the edge gradient feature of the preliminary feature matrix from the difference between the maximum value and the average value; obtaining a fused feature from the edge gradient feature and the preliminary feature matrix; performing feature fusion and extraction on the fused feature through a preset depth residual network to obtain a depth feature; establishing long-distance dependency relationships among the pixels in the depth feature to obtain a dependency feature map; and classifying the dependency feature map through an output layer to obtain the class of each pixel. The method improves the accuracy of image semantic segmentation.

Description

Image semantic segmentation method and device under natural scene
Technical Field
The application relates to the technical field of machine learning, and in particular to a method and an apparatus for image semantic segmentation in natural scenes.
Background
Unlike the image-level processing of high-level computer vision, image semantic segmentation is a fundamental and difficult problem of low-level, pixel-level computer vision: it assigns each pixel to a specific semantic label. This lets a computer conveniently understand a scene and accurately locate the corresponding objects. Image semantic segmentation therefore plays an important role in computer vision and artificial intelligence, for example in autonomous driving, robotic environment perception, and medical image measurement.
In existing image semantic segmentation techniques, image features are mainly extracted, and pixels classified, by deep convolutional neural networks (DCNNs). However, DCNNs reduce the feature resolution and hence the positioning accuracy of target objects in the image, and when target objects appear at different scales they often lose parts of those objects. The mainstream remedy for semantic segmentation is therefore to increase the feature resolution. Analysis of existing models shows that the sharpness of target edges and boundaries strongly influences the segmentation result. In the structure of DCNNs, the accuracy of target boundaries is mainly affected by two factors. On the one hand, downsampling in the convolutional and pooling layers reduces the spatial resolution of the feature maps, causing boundaries to blur and shift. On the other hand, the multiple scales of objects lead to a series of problems such as the loss of parts of large objects and the misclassification of small objects.
Disclosure of Invention
Therefore, in view of the above technical problem, it is necessary to provide a method and an apparatus for image semantic segmentation in natural scenes that can solve the problem of inaccurate segmentation by deep convolutional networks.
A method for semantic segmentation of images in natural scenes comprises the following steps:
extracting a preliminary feature matrix of the image to be semantically segmented through a convolutional layer of a convolutional neural network;
calculating, through pooling layers, the maximum value and the average value of the pixels in each receptive field of the preliminary feature matrix, and obtaining the edge gradient feature of the preliminary feature matrix from the difference between the maximum value and the average value;
obtaining a fused feature from the edge gradient feature and the preliminary feature matrix, and performing feature fusion and extraction on the fused feature through a preset depth residual network to obtain a depth feature;
establishing long-distance dependency relationships among the pixels in the depth feature to obtain a dependency feature map;
and classifying the dependency feature map through an output layer to obtain the class of each pixel.
In one embodiment, the method further comprises: performing feature extraction on the image to be semantically segmented, which has a preset size, through a convolutional layer of a convolutional neural network, and obtaining a preliminary feature matrix of a target size after batch normalization processing.
In one embodiment, the method further comprises: calculating the maximum value of the pixels in the receptive field through a max pooling layer, and calculating the average value of the pixels in the receptive field through an average pooling layer.
In one embodiment, the method further comprises: calculating the difference between the maximum value and the average value through an Eltwise layer to obtain the difference information, and obtaining the edge gradient feature from the difference information; and fusing the edge gradient feature and the preliminary feature matrix through the set coefficients of the Eltwise layer to obtain a fused feature.
In one embodiment, the method further comprises: establishing long-distance dependency relationships among the pixels in the depth feature through a pyramid pooling layer and a hole convolution pyramid layer respectively, to obtain a dependency feature map.
In one embodiment, the method further comprises: obtaining multi-level pooled outputs through the pyramid pooling layer, and upsampling the multi-level pooled outputs by bilinear interpolation to obtain two-dimensional feature matrices of the same size as the depth feature; fusing the two-dimensional feature matrices into a prior feature, and fusing the prior feature with the depth feature to obtain a fused feature map; inputting the fused feature map into the hole convolution pyramid layer to obtain a plurality of hole feature matrices of the same size as the depth feature, the hole convolution pyramid layer comprising a plurality of convolutional layers with the same kernel size but different dilation rates; and upsampling the hole feature matrices by bilinear interpolation to obtain the dependency feature map.
In one embodiment, the method further comprises: classifying the dependency feature map through a softmax layer to obtain the class of each pixel.
An apparatus for semantic segmentation of images in natural scenes, the apparatus comprising:
the preliminary feature extraction module is used for extracting a preliminary feature matrix of the image to be semantically segmented through a convolutional layer of the convolutional neural network;
the edge feature extraction module is used for calculating, through pooling layers, the maximum value and the average value of the pixels in each receptive field of the preliminary feature matrix, and obtaining the edge gradient feature of the preliminary feature matrix from the difference between the maximum value and the average value;
the depth feature extraction module is used for obtaining a fused feature from the edge gradient feature and the preliminary feature matrix, and performing feature fusion and extraction on the fused feature through a preset depth residual network to obtain a depth feature;
the dependency establishing module is used for establishing long-distance dependency relationships among the pixels in the depth feature to obtain a dependency feature map;
and the classification module is used for classifying the dependency feature map through an output layer to obtain the class of each pixel.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
extracting a preliminary feature matrix of the image to be semantically segmented through a convolutional layer of a convolutional neural network;
calculating, through pooling layers, the maximum value and the average value of the pixels in each receptive field of the preliminary feature matrix, and obtaining the edge gradient feature of the preliminary feature matrix from the difference between the maximum value and the average value;
obtaining a fused feature from the edge gradient feature and the preliminary feature matrix, and performing feature fusion and extraction on the fused feature through a preset depth residual network to obtain a depth feature;
establishing long-distance dependency relationships among the pixels in the depth feature to obtain a dependency feature map;
and classifying the dependency feature map through an output layer to obtain the class of each pixel.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
extracting a preliminary feature matrix of the image to be semantically segmented through a convolutional layer of a convolutional neural network;
calculating, through pooling layers, the maximum value and the average value of the pixels in each receptive field of the preliminary feature matrix, and obtaining the edge gradient feature of the preliminary feature matrix from the difference between the maximum value and the average value;
obtaining a fused feature from the edge gradient feature and the preliminary feature matrix, and performing feature fusion and extraction on the fused feature through a preset depth residual network to obtain a depth feature;
establishing long-distance dependency relationships among the pixels in the depth feature to obtain a dependency feature map;
and classifying the dependency feature map through an output layer to obtain the class of each pixel.
According to the image semantic segmentation method, apparatus, computer device and storage medium for natural scenes described above, a deep neural network is combined with the edge gradient theory and the long-distance dependency principle, so that a final segmentation result is obtained directly when an image is input into the deep neural network. The invention therefore realizes end-to-end image semantic segmentation, judges the edges of specific objects in the image more accurately, overcomes the influence of the object multi-scale problem, and achieves better segmentation of both large and small objects in the image.
Drawings
FIG. 1 is a schematic flow chart illustrating a method for semantic segmentation of images in natural scenes in one embodiment;
FIG. 2 is a design framework diagram of edge gradient feature extraction in one embodiment;
FIG. 3 is a diagram illustrating the relationship between a fully connected conditional random field and the hole convolution pyramid layer in another embodiment;
FIG. 4 is a design framework diagram of the long-distance dependency established by the pyramid pooling layer and the hole convolution pyramid layer in one embodiment;
FIG. 5 is a general flowchart of a method for semantic segmentation of images in natural scenes in one embodiment;
FIG. 6 is a block diagram illustrating an embodiment of an apparatus for semantic segmentation of images in natural scenes;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided a method for semantic segmentation of an image in a natural scene, including the following steps:
and 102, extracting a preliminary characteristic matrix of the image to be semantically segmented through a convolution layer of the convolution neural network.
The convolutional layer is a layer in a convolutional neural network, and can sample an image to be semantically segmented.
The image to be semantically segmented may be a picture in a natural scene, such as a photo taken by a mobile phone, a camera, etc., or a picture drawn by an artist by hand and stored in a computer device. The image to be semantically segmented can be a color image or a gray-scale image.
The preliminary feature matrix contains preliminary information about the image to be semantically segmented; the class of each pixel cannot yet be determined from the preliminary feature matrix alone.
Step 104: calculate, through pooling layers, the maximum value and the average value of the pixels in each receptive field of the preliminary feature matrix, and obtain the edge gradient feature of the preliminary feature matrix from the difference between the maximum value and the average value.
The receptive field refers to the area of pixels covered by the convolution kernel size in the pooling layer, for example, a convolution kernel of 2 × 2 would include 4 pixels in the receptive field.
Taking a 2 × 2 convolution kernel as an example, edge gradient extraction is realized by simulating the Roberts operator with pooling layers. The gradient of the Roberts operator is calculated as:

$$G(x, y) = |a_{11} - a_{22}| + |a_{12} - a_{21}|$$

where $G$ represents the gradient and $a_{11}, a_{12}, a_{21}, a_{22}$ are the pixel values in the receptive field. Let $a_{11} \geq a_{22}$ and $a_{12} \geq a_{21}$. After the four pixel values are substituted into the Roberts gradient formula, the gradient can be expressed as:

$$G(x, y) = (a_{11} + a_{12}) - (a_{21} + a_{22}) = 4\left(\frac{a_{11} + a_{12}}{2} - a_{mean}\right) \approx 4\,(a_{max} - a_{mean})$$

where $a_{max} = \max(a_{11}, a_{12}, a_{21}, a_{22})$ and $a_{mean} = \mathrm{mean}(a_{11}, a_{12}, a_{21}, a_{22})$. The derivation shows that the gradient can be computed from the difference between the maximum and the mean of the pixels in the receptive field. When the gradient $G(x, y)$ is small, the difference between the maximum and the mean in the receptive field is small, indicating that the pixel values are similar and the probability of an edge in this region is low. When $G(x, y)$ is large, the difference between the maximum and the mean is large, indicating that the pixel values vary strongly and an edge is likely to exist in this region. The gradient between pixels can therefore be replaced by the statistical maximum and mean in the receptive field. This has the advantage that edges can be detected in every direction, not only in the fixed directions of the Roberts operator.
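As a minimal sketch of this derivation (not part of the patent text; PyTorch is assumed purely for illustration), the max-minus-mean gradient can be reproduced with two stride-1 pooling operations:

```python
import torch
import torch.nn.functional as F

def edge_gradient(feat: torch.Tensor) -> torch.Tensor:
    """Approximate the per-pixel edge gradient as max - mean over each
    2x2 receptive field (stride 1), following the derivation above."""
    gmax = F.max_pool2d(feat, kernel_size=2, stride=1)
    gmean = F.avg_pool2d(feat, kernel_size=2, stride=1)
    return gmax - gmean  # large values indicate likely edges

# Toy check on a vertical step edge: the response is nonzero only where
# a 2x2 window straddles the boundary, regardless of edge direction.
x = torch.zeros(1, 1, 4, 4)
x[..., 2:] = 1.0
print(edge_gradient(x).squeeze())
```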
Step 106: obtain a fused feature from the edge gradient feature and the preliminary feature matrix, and perform feature fusion and extraction on the fused feature through a preset depth residual network to obtain a depth feature.
The depth residual network can be realized by a network with the ResNet structure; the fusion and extraction further reduce the size of the depth feature.
Step 108: establish long-distance dependency relationships among the pixels in the depth feature to obtain a dependency feature map.
The long-distance dependency relationship refers to the intrinsic relations among the pixels; establishing dependencies between pixels further improves the accuracy of pixel classification.
Step 110: classify the dependency feature map through an output layer to obtain the class of each pixel.
In the image semantic segmentation method for natural scenes described above, a deep neural network is combined with the edge gradient theory and the long-distance dependency principle, so that a final segmentation result is obtained directly when an image is input into the deep neural network. The invention therefore realizes end-to-end image semantic segmentation, judges the edges of specific objects in the image more accurately, overcomes the influence of the object multi-scale problem, and achieves better segmentation of both large and small objects in the image.
In one embodiment, feature extraction is performed on the image to be semantically segmented, which has a preset size, through a convolutional layer of a convolutional neural network, and a preliminary feature matrix of a target size is obtained after batch normalization processing.
Specifically, after the image to be semantically segmented is received, it is converted to a preset size, for example 321 × 321. A convolutional layer and a batch normalization layer are then used to extract features from the input image. For example, with the convolutional layer parameters set to a kernel size of 7, a stride of 2 and a padding of 3, the resulting preliminary feature matrix has size 161 × 161. The batch normalization layer adjusts the distribution of the intermediate results and has no parameters.
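For illustration only, the stated geometry can be verified with a short sketch; PyTorch and the channel counts are assumptions, since the text only fixes the kernel size, stride and padding:

```python
import torch
import torch.nn as nn

# Hypothetical channel counts; the text only fixes kernel=7, stride=2, padding=3.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.BatchNorm2d(64),  # adjusts the distribution of intermediate results
)
x = torch.randn(1, 3, 321, 321)
print(stem(x).shape)  # torch.Size([1, 64, 161, 161]): (321 + 2*3 - 7) // 2 + 1 = 161
```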
In one embodiment, the maximum value of the pixels in the receptive field is calculated by the maximum pooling layer and the average value of the pixels in the receptive field is calculated by the average pooling layer.
In this embodiment, the maximum pooling layer and the average pooling layer in the DCNN model may be used to complete the calculation.
In one embodiment, the difference between the maximum value and the average value is calculated through an Eltwise layer to obtain the difference information, the edge gradient feature is obtained from the difference information, and the edge gradient feature and the preliminary feature matrix are fused through the set coefficients of the Eltwise layer to obtain a fused feature.
In this embodiment, the operations supported by the Eltwise layer include: product (element-wise multiplication), sum (addition or subtraction) and max (element-wise maximum). After the maximum value and the average value have been calculated, the difference information can therefore be computed by an Eltwise layer.
Specifically, as shown in FIG. 2, the max pooling layer and the average pooling layer both have kernel size 2 × 2 with a stride of 1, and the two coefficients of the Eltwise layer are set to 2 and -1 respectively, which means that the edge gradient feature of the image is added on top of the preliminary appearance feature as a supplement; finally, a convolutional layer fuses the preliminary appearance feature and the edge feature.
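A minimal sketch of this branch follows (PyTorch assumed; the Caffe Eltwise SUM with coefficients 2 and -1 becomes a weighted sum, and the 3 × 3 fusion kernel is a hypothetical choice, since the text only states that a convolutional layer fuses the features):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeGradientBranch(nn.Module):
    """Sketch of FIG. 2: 2*maxpool - 1*avgpool = maxpool + (maxpool - avgpool),
    i.e. the pooled appearance feature plus the edge gradient as a supplement."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.fuse = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        gmax = F.max_pool2d(feat, kernel_size=2, stride=1)
        gmean = F.avg_pool2d(feat, kernel_size=2, stride=1)
        fused = 2.0 * gmax - 1.0 * gmean  # Eltwise coefficients 2 and -1
        return self.fuse(fused)
```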
After the fused feature is obtained, it needs to be further fused and refined to obtain the depth feature, which can be realized by a depth residual network.
Specifically, the depth residual network adopts part of the commonly used residual network ResNet. From input to output, each bottleneck block consists of (convolutional layer 1, batch normalization layer 1, Scale layer 1, ReLU layer 1, convolutional layer 2, batch normalization layer 2, Scale layer 2, ReLU layer 2, convolutional layer 3, batch normalization layer 3, Scale layer 3, Eltwise layer 1, ReLU layer 3), where the three convolutional layers have kernel sizes of 1 × 1, 3 × 3 and 1 × 1 respectively (for example 1 × 1 × 64 and 3 × 3 × 64 in the first stage, as in the standard ResNet bottleneck). The Eltwise layer adds the block input to the block output to form the residual connection, and the subsequent blocks (convolutional layers 4 to 6 with their batch normalization, Scale, ReLU and Eltwise layers) repeat this structure with increasing channel widths.
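A sketch of one such bottleneck block (PyTorch assumed; BatchNorm2d here subsumes Caffe's separate batch normalization and Scale layers, and the Eltwise sum is the residual addition):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet bottleneck matching the layer list in the text:
    1x1 conv -> BN/Scale -> ReLU -> 3x3 conv -> BN/Scale -> ReLU
    -> 1x1 conv -> BN/Scale -> Eltwise (sum with input) -> ReLU."""
    def __init__(self, in_ch: int = 256, mid_ch: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, in_ch, 1, bias=False), nn.BatchNorm2d(in_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.body(x))  # Eltwise sum forms the residual connection
```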
In one embodiment, long-distance dependency relationships among the pixels in the depth feature are established through the pyramid pooling layer and the hole convolution pyramid layer respectively, so that the dependency feature map is obtained.
The pyramid pooling layer consists of multiple pooling layers whose kernel sizes differ from level to level; the hole convolution pyramid layer consists of multiple convolutional layers whose kernels have the same size but different dilation rates, as shown in FIG. 3.
In another embodiment, multi-level pooled outputs are obtained through the pyramid pooling layer and upsampled by bilinear interpolation into two-dimensional feature matrices of the same size as the depth feature; the two-dimensional feature matrices are fused into a prior feature, and the prior feature is fused with the depth feature to obtain a fused feature map; the fused feature map is input into the hole convolution pyramid layer, which comprises convolutional layers with the same kernel size but different dilation rates, to obtain a plurality of hole feature matrices of the same size as the depth feature; and the hole feature matrices are upsampled by bilinear interpolation to obtain the dependency feature map.
In this embodiment, the pyramid pooling layer and the hole convolution pyramid layer are combined into a simplified fully connected conditional random field; by establishing long-distance dependencies between all nodes, both the multi-scale problem of objects and the context modeling between object parts are addressed. The pyramid pooling module uses average pooling to obtain a layered global prior and combines multiple pieces of local context information with the global context information to solve the multi-scale object problem. The hole convolution pyramid layer models the relations among all nodes and establishes long-distance dependencies between them, realizing structural modeling of the relations among the parts of an object. The energy function $E$ of a fully connected conditional random field is as follows:

$$E(x) = \sum_{i} \phi_u(x_i) + \sum_{i,j} \phi_p(x_i, x_j)$$

where $\phi$ denotes a potential function, $\phi_u(x_i)$ the unary potential function, and $\phi_p(x_i, x_j)$ the binary potential function. In the present invention, long-distance dependent connections between nodes can instead be established by the combination of the pyramid pooling layer and the hole convolution pyramid layer. For a hole convolution layer of the pyramid (convolution kernel: 3 × 3; dilation rate $r$), its potential function $F^{(r)}$ can be expressed as a binary potential restricted to the dilated neighborhood $N_r(i)$ of each node:

$$F^{(r)}(x_i) = \sum_{j \in N_r(i)} \phi_p(x_i, x_j)$$

Thus, the energy function $E_{centre}$ of the pyramid pooling layer and the hole convolution pyramid layer can be expressed as:

$$E_{centre}(x) = \sum_{i} \phi_u(x_i) + \sum_{r \in R} \sum_{i} F^{(r)}(x_i)$$

where $R$ is the set of dilation rates.
taking the size of the depth feature 41 × 41 as an example, as shown in fig. 4, based on the above derivation, the invention adds a pyramid pooling layer after the last feature map of the DCNN, which consists of four average pooling layers with kernel sizes 1, 2, 4, 5, the outputs of the four average pooling layers 41 × 41, 21 × 21, 11 × 11, and 9 × 9, respectively, then upsamples the four-level outputs respectively using bilinear interpolation, and obtains four two-dimensional feature matrices with resolutions 41 × 41 to fuse into a global prior and connect them with the input feature matrices, next, structurally modeling the relationship between the parts in the object through the hole convolution pyramid layer, which consists of five convolutional layers with convolution kernel sizes of 3 × 3, but with different hole sizes between the convolutional kernels, in the invention, the space between the convolutional kernels is set to 0, 6, 12, 18, 24, sliding to 1, thereby obtaining 5 features with resolution 41 × 41, the final feature matrix is obtained by interpolation and classifying the pixel output of the bilinear interpolation by the method of 321 × 321.
In one embodiment, the dependency feature map is classified through a softmax layer to obtain the class of each pixel.
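As a sketch (PyTorch assumed; the channel count and the number of classes are hypothetical, since the text fixes neither), per-pixel classification of the dependency feature map amounts to a 1 × 1 convolution followed by a softmax over the class dimension:

```python
import torch
import torch.nn as nn

num_classes = 21  # hypothetical; depends on the dataset
classifier = nn.Conv2d(512, num_classes, kernel_size=1)

dep_map = torch.randn(1, 512, 321, 321)            # dependency feature map
probs = torch.softmax(classifier(dep_map), dim=1)  # per-pixel class probabilities
labels = probs.argmax(dim=1)                       # (1, 321, 321): one class per pixel
```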
To sum up, the overall flow of the embodiment of the present invention is shown in FIG. 5. In FIG. 5, the image semantic segmentation method for natural scenes is divided into four steps: preliminary feature extraction, edge feature extraction, feature fusion and refinement, and long-distance dependency establishment together with context feature extraction.
It should be understood that although the steps in the flowchart of FIG. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to this order and may be performed in other orders. Moreover, at least a portion of the steps in FIG. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided an apparatus for semantic segmentation of an image in a natural scene, including: a preliminary feature extraction module 602, an edge feature extraction module 604, a depth feature extraction module 606, a dependency establishment module 608, and a classification module 610, wherein:
a preliminary feature extraction module 602, configured to extract a preliminary feature matrix of the image to be semantically segmented through a convolutional layer of a convolutional neural network;
an edge feature extraction module 604, configured to calculate, through pooling layers, the maximum value and the average value of the pixels in each receptive field of the preliminary feature matrix, and to obtain the edge gradient feature of the preliminary feature matrix from the difference between the maximum value and the average value;
a depth feature extraction module 606, configured to obtain a fused feature from the edge gradient feature and the preliminary feature matrix, and to perform feature fusion and extraction on the fused feature through a preset depth residual network to obtain a depth feature;
a dependency building module 608, configured to establish long-distance dependency relationships among the pixels in the depth feature to obtain a dependency feature map;
and a classification module 610, configured to classify the dependency feature map through an output layer to obtain the class of each pixel.
In one embodiment, the preliminary feature extraction module 602 is further configured to perform feature extraction on the image to be semantically segmented, which has a preset size, through a convolutional layer of a convolutional neural network, and to obtain a preliminary feature matrix of a target size after batch normalization processing.
In one embodiment, the edge feature extraction module 604 is further configured to calculate the maximum value of the pixels in the receptive field through a max pooling layer, and to calculate the average value of the pixels in the receptive field through an average pooling layer.
In one embodiment, the depth feature extraction module 606 is further configured to calculate the difference between the maximum value and the average value through an Eltwise layer to obtain the difference information, and to obtain the edge gradient feature from the difference information; and to fuse the edge gradient feature and the preliminary feature matrix through the set coefficients of the Eltwise layer to obtain a fused feature.
In one embodiment, the dependency building module 608 is further configured to establish long-distance dependency relationships among the pixels in the depth feature through the pyramid pooling layer and the hole convolution pyramid layer respectively, to obtain the dependency feature map.
In one embodiment, the dependency building module 608 is further configured to obtain multi-level pooled outputs through the pyramid pooling layer and to upsample them by bilinear interpolation into two-dimensional feature matrices of the same size as the depth feature; to fuse the two-dimensional feature matrices into a prior feature and fuse the prior feature with the depth feature to obtain a fused feature map; to input the fused feature map into the hole convolution pyramid layer, which comprises convolutional layers with the same kernel size but different dilation rates, to obtain a plurality of hole feature matrices of the same size as the depth feature; and to upsample the hole feature matrices by bilinear interpolation to obtain the dependency feature map.
In one embodiment, the classification module 610 is further configured to classify the dependency feature map through a softmax layer to obtain the class of each pixel.
For the specific limitations of the image semantic segmentation apparatus for natural scenes, reference may be made to the limitations of the image semantic segmentation method above, which are not repeated here. The modules in the apparatus can be implemented wholly or partially by software, hardware, or a combination thereof. Each module can be embedded in hardware in, or independent of, a processor in the computer device, or stored in software in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the module.
In one embodiment, a computer device is provided, which may be a server, whose internal structure may be as shown in FIG. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the image data to be semantically segmented. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for image semantic segmentation in natural scenes.
Those skilled in the art will appreciate that the architecture shown in FIG. 7 is merely a block diagram of part of the structure related to the present solution and does not limit the computer devices to which the present solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor that implements the method embodiments described above when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored; the computer program, when executed by a processor, carries out the method embodiments described above.
It will be understood by those of ordinary skill in the art that all or a portion of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored on a non-volatile computer-readable storage medium; when executed, the program may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features contains no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for image semantic segmentation in natural scenes, comprising the following steps:
extracting a preliminary feature matrix of the image to be semantically segmented through a convolutional layer of a convolutional neural network;
calculating, through pooling layers, the maximum value and the average value of the pixels in each receptive field of the preliminary feature matrix, and obtaining the edge gradient feature of the preliminary feature matrix from the difference between the maximum value and the average value;
obtaining a fused feature from the edge gradient feature and the preliminary feature matrix, and performing feature fusion and extraction on the fused feature through a preset depth residual network to obtain a depth feature;
establishing long-distance dependency relationships among the pixels in the depth feature to obtain a dependency feature map;
and classifying the dependency feature map through an output layer to obtain the class of each pixel.
2. The method according to claim 1, wherein extracting the preliminary feature matrix of the image to be semantically segmented through the convolutional layer of the convolutional neural network comprises:
performing feature extraction on the image to be semantically segmented, which has a preset size, through a convolutional layer of a convolutional neural network, and obtaining a preliminary feature matrix of a target size after batch normalization processing.
3. The method according to claim 1, wherein calculating, through pooling layers, the maximum value and the average value of the pixels in each receptive field of the preliminary feature matrix comprises:
calculating the maximum value of the pixels in the receptive field through a max pooling layer, and calculating the average value of the pixels in the receptive field through an average pooling layer.
4. The method according to claim 1, wherein obtaining a fused feature from the edge gradient feature and the preliminary feature matrix comprises:
calculating the difference between the maximum value and the average value through an Eltwise layer to obtain the difference information, and obtaining the edge gradient feature from the difference information;
and fusing the edge gradient feature and the preliminary feature matrix through the set coefficients of the Eltwise layer to obtain a fused feature.
5. The method according to any one of claims 1 to 4, wherein establishing long-distance dependency relationships among the pixels in the depth feature to obtain a dependency feature map comprises:
establishing long-distance dependency relationships among the pixels in the depth feature through a pyramid pooling layer and a hole convolution pyramid layer respectively, to obtain the dependency feature map.
6. The method according to claim 5, wherein establishing long-distance dependency relationships among the pixels in the depth feature through the pyramid pooling layer and the hole convolution pyramid layer respectively, to obtain the dependency feature map, comprises:
obtaining multi-level pooled outputs through the pyramid pooling layer, and upsampling the multi-level pooled outputs by bilinear interpolation to obtain two-dimensional feature matrices of the same size as the depth feature;
fusing the two-dimensional feature matrices into a prior feature, and fusing the prior feature with the depth feature to obtain a fused feature map;
inputting the fused feature map into the hole convolution pyramid layer to obtain a plurality of hole feature matrices of the same size as the depth feature, the hole convolution pyramid layer comprising a plurality of convolutional layers with the same kernel size but different dilation rates;
and upsampling the hole feature matrices by bilinear interpolation to obtain the dependency feature map.
7. The method according to any one of claims 1 to 4, wherein classifying the dependency feature map through an output layer to obtain the class of each pixel comprises:
classifying the dependency feature map through a softmax layer to obtain the class of each pixel.
8. An apparatus for semantic segmentation of an image in a natural scene, the apparatus comprising:
a preliminary feature extraction module, configured to extract a preliminary feature matrix of the image to be semantically segmented through a convolutional layer of the convolutional neural network;
an edge feature extraction module, configured to calculate, through pooling layers, the maximum value and the average value of the pixels in each receptive field of the preliminary feature matrix, and to obtain the edge gradient feature of the preliminary feature matrix from the difference between the maximum value and the average value;
a depth feature extraction module, configured to obtain a fused feature from the edge gradient feature and the preliminary feature matrix, and to perform feature fusion and extraction on the fused feature through a preset depth residual network to obtain a depth feature;
a dependency establishing module, configured to establish long-distance dependency relationships among the pixels in the depth feature to obtain a dependency feature map;
and a classification module, configured to classify the dependency feature map through an output layer to obtain the class of each pixel.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010286607.6A 2020-04-13 2020-04-13 Image semantic segmentation method and device under natural scene Pending CN111444923A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010286607.6A CN111444923A (en) 2020-04-13 2020-04-13 Image semantic segmentation method and device under natural scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010286607.6A CN111444923A (en) 2020-04-13 2020-04-13 Image semantic segmentation method and device under natural scene

Publications (1)

Publication Number Publication Date
CN111444923A true CN111444923A (en) 2020-07-24

Family

ID=71651648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010286607.6A Pending CN111444923A (en) 2020-04-13 2020-04-13 Image semantic segmentation method and device under natural scene

Country Status (1)

Country Link
CN (1) CN111444923A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985542A (en) * 2020-08-05 2020-11-24 华中科技大学 Representative graph structure model, visual understanding model establishing method and application
CN112617850A (en) * 2021-01-04 2021-04-09 苏州大学 Premature beat and heart beat detection method for electrocardiosignals
CN113052247A (en) * 2021-03-31 2021-06-29 清华苏州环境创新研究院 Garbage classification method and garbage classifier based on multi-label image recognition
CN114528746A (en) * 2020-11-04 2022-05-24 中国石油化工股份有限公司 Complex lithology identification method, identification system, electronic device and storage medium
CN117991093A (en) * 2024-04-03 2024-05-07 成都航天凯特机电科技有限公司 Permanent magnet synchronous motor fault diagnosis method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875596A (en) * 2018-05-30 2018-11-23 西南交通大学 A kind of railway scene image, semantic dividing method based on DSSNN neural network
CN109800806A (en) * 2019-01-14 2019-05-24 中山大学 A kind of corps diseases detection algorithm based on deep learning
CN109829926A (en) * 2019-01-30 2019-05-31 杭州鸿泉物联网技术股份有限公司 Road scene semantic segmentation method and device
CN110136141A (en) * 2019-04-24 2019-08-16 佛山科学技术学院 A kind of image, semantic dividing method and device towards complex environment
CN110348445A (en) * 2019-06-06 2019-10-18 华中科技大学 A kind of example dividing method merging empty convolution sum marginal information
CN110490265A (en) * 2019-08-23 2019-11-22 安徽大学 A kind of image latent writing analysis method based on two-way convolution sum Fusion Features
CN110598771A (en) * 2019-08-30 2019-12-20 北京影谱科技股份有限公司 Visual target identification method and device based on deep semantic segmentation network
CN110992320A (en) * 2019-11-22 2020-04-10 电子科技大学 Medical image segmentation network based on double interleaving

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875596A (en) * 2018-05-30 2018-11-23 西南交通大学 A kind of railway scene image, semantic dividing method based on DSSNN neural network
CN109800806A (en) * 2019-01-14 2019-05-24 中山大学 A kind of corps diseases detection algorithm based on deep learning
CN109829926A (en) * 2019-01-30 2019-05-31 杭州鸿泉物联网技术股份有限公司 Road scene semantic segmentation method and device
CN110136141A (en) * 2019-04-24 2019-08-16 佛山科学技术学院 A kind of image, semantic dividing method and device towards complex environment
CN110348445A (en) * 2019-06-06 2019-10-18 华中科技大学 A kind of example dividing method merging empty convolution sum marginal information
CN110490265A (en) * 2019-08-23 2019-11-22 安徽大学 A kind of image latent writing analysis method based on two-way convolution sum Fusion Features
CN110598771A (en) * 2019-08-30 2019-12-20 北京影谱科技股份有限公司 Visual target identification method and device based on deep semantic segmentation network
CN110992320A (en) * 2019-11-22 2020-04-10 电子科技大学 Medical image segmentation network based on double interleaving

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHARLES-WAN: "Caffe Discussion (2): Building ResNet from Scratch, Network Construction (Part 1)", cnblogs blog: HTTPS://WWW.CNBLOGS.COM/CHARLES-WAN/P/6535395.HTML *
HAO ZHOU et al.: "Edge gradient feature and long distance dependency for image semantic segmentation", IET Computer Vision *
JUN ZHANG et al.: "Accurate Moving Target Detection Based on Background Subtraction and SUSAN", International Journal of Computer and Electrical Engineering *
XU Shukui et al.: "Weakly supervised image semantic segmentation from object bounding box annotations", Journal of National University of Defense Technology *
WEN Changbao et al.: "Theory and Applications of Artificial Neural Networks", 31 March 2019, Xidian University Press *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985542A (en) * 2020-08-05 2020-11-24 华中科技大学 Representative graph structure model, visual understanding model establishing method and application
CN111985542B (en) * 2020-08-05 2022-07-12 华中科技大学 Representative graph structure model, visual understanding model establishing method and application
CN114528746A (en) * 2020-11-04 2022-05-24 中国石油化工股份有限公司 Complex lithology identification method, identification system, electronic device and storage medium
CN112617850A (en) * 2021-01-04 2021-04-09 苏州大学 Premature beat and heart beat detection method for electrocardiosignals
CN112617850B (en) * 2021-01-04 2022-08-30 苏州大学 Premature beat and heart beat detection system for electrocardiosignals
CN113052247A (en) * 2021-03-31 2021-06-29 清华苏州环境创新研究院 Garbage classification method and garbage classifier based on multi-label image recognition
CN117991093A (en) * 2024-04-03 2024-05-07 成都航天凯特机电科技有限公司 Permanent magnet synchronous motor fault diagnosis method

Similar Documents

Publication Publication Date Title
CN110119728B (en) Remote sensing image cloud detection method based on multi-scale fusion semantic segmentation network
CN111444923A (en) Image semantic segmentation method and device under natural scene
CN109559320B (en) Method and system for realizing visual SLAM semantic mapping function based on hole convolution deep neural network
CN110781756A (en) Urban road extraction method and device based on remote sensing image
US9633282B2 (en) Cross-trained convolutional neural networks using multimodal images
CN112183414A (en) Weak supervision remote sensing target detection method based on mixed hole convolution
CN113205142B (en) Target detection method and device based on incremental learning
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
CN109063549B (en) High-resolution aerial video moving target detection method based on deep neural network
CN108564102A (en) Image clustering evaluation of result method and apparatus
CN112862774B (en) Accurate segmentation method for remote sensing image building
CN107506792B (en) Semi-supervised salient object detection method
CN113901900A (en) Unsupervised change detection method and system for homologous or heterologous remote sensing image
CN112528974B (en) Distance measuring method and device, electronic equipment and readable storage medium
CN109635714B (en) Correction method and device for document scanning image
CN110969171A (en) Image classification model, method and application based on improved convolutional neural network
CN111768415A (en) Image instance segmentation method without quantization pooling
CN113436220B (en) Image background estimation method based on depth map segmentation
CN111507288A (en) Image detection method, image detection device, computer equipment and storage medium
CN114419406A (en) Image change detection method, training method, device and computer equipment
CN116229066A (en) Portrait segmentation model training method and related device
CN117853596A (en) Unmanned aerial vehicle remote sensing mapping method and system
CN116310832A (en) Remote sensing image processing method, device, equipment, medium and product
CN110880003A (en) Image matching method and device, storage medium and automobile
CN110321794B (en) Remote sensing image oil tank detection method integrated with semantic model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200724