CN115115835A - Image semantic segmentation method, device, equipment, storage medium and program product

Image semantic segmentation method, device, equipment, storage medium and program product

Info

Publication number
CN115115835A
CN115115835A · CN202210685972.3A
Authority
CN
China
Prior art keywords
coding
result
image
depth
layer
Prior art date
Legal status
Pending
Application number
CN202210685972.3A
Other languages
Chinese (zh)
Inventor
聂聪冲
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210685972.3A priority Critical patent/CN115115835A/en
Publication of CN115115835A publication Critical patent/CN115115835A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The application provides a semantic segmentation method, apparatus, device, storage medium and program product for an image. The method includes: acquiring an image to be segmented that includes at least two objects, and a depth image corresponding to the image to be segmented; coding the depth image to obtain a depth coding result; calling at least two segmentation coding networks and performing iterative fusion coding, including spatial screening and channel recombination, on the depth coding result and the image to be segmented to obtain a target coding result, where spatial screening performs feature screening on the image to be segmented in the spatial dimension and channel recombination performs feature screening on the image to be segmented in the channel dimension; and performing semantic segmentation on the image to be segmented based on the target coding result to obtain a semantic segmentation result for each object. The method fully mines the complementarity and interdependence between the semantics of the image and effectively improves the accuracy of semantic segmentation.

Description

Image semantic segmentation method, device, equipment, storage medium and program product
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, a storage medium, and a program product for semantic segmentation of an image.
Background
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
In the related art, semantic segmentation of an image usually processes the image to be segmented directly to obtain a corresponding semantic segmentation result. Because only the image itself is processed, the complementarity and interdependence between the semantics of the image cannot be sufficiently mined.
Disclosure of Invention
The embodiment of the application provides a semantic segmentation method and device for an image, a computer-readable storage medium and a computer program product, which can fully mine complementarity and interdependency among semantics of the image and effectively improve the precision of semantic segmentation.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a semantic segmentation method of an image, which comprises the following steps:
acquiring an image to be segmented comprising at least two objects and a depth image corresponding to the image to be segmented;
coding the depth image to obtain a depth coding result;
calling at least two segmentation coding networks, and performing iterative fusion coding comprising spatial screening and channel recombination on the depth coding result and the image to be segmented to obtain a target coding result, wherein the spatial screening is used for performing feature screening on the image to be segmented in a spatial dimension, and the channel recombination is used for performing feature screening on the image to be segmented in a channel dimension;
and performing semantic segmentation on the image to be segmented based on the target coding result to obtain a semantic segmentation result corresponding to each object.
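The following minimal PyTorch sketch illustrates how these four steps could be wired together. It is an illustration only, not the patent's implementation: the module names, their internals, and the pairing of depth coding results with segmentation coding networks are assumptions.

```python
import torch
import torch.nn as nn

class DepthGuidedSegmenter(nn.Module):
    """Illustrative skeleton of the claimed pipeline (names are assumptions)."""
    def __init__(self, depth_encoder: nn.Module,
                 seg_encoders: nn.ModuleList, seg_head: nn.Module):
        super().__init__()
        self.depth_encoder = depth_encoder  # codes the depth image (step 2)
        self.seg_encoders = seg_encoders    # >= 2 segmentation coding networks (step 3)
        self.seg_head = seg_head            # per-pixel classifier (step 4)

    def forward(self, image: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        depth_codes = self.depth_encoder(depth)  # one depth coding result per scale
        x = image
        for encoder, d in zip(self.seg_encoders, depth_codes):
            # each stage performs fusion coding (spatial screening +
            # channel recombination) of the running result with a depth code
            x = encoder(x, d)
        return self.seg_head(x)  # semantic segmentation from the target coding result
```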
In some embodiments, invoking the channel attention layer and performing channel recombination on the spatial screening result to obtain a channel recombination result includes: performing convolution processing on the spatial screening result to obtain a second convolution processing result of the spatial screening result; performing convolution processing on the first convolution processing result to obtain a third convolution processing result; performing normalization processing on the third convolution processing result to obtain a normalization processing result; performing a dot product of the normalization processing result and the second convolution processing result to obtain a second dot product result; and determining the second dot product result as the channel recombination result.
The embodiment of the application provides a semantic segmentation device for an image, which comprises:
the system comprises an acquisition module, a segmentation module and a processing module, wherein the acquisition module is used for acquiring an image to be segmented comprising at least two objects and a depth image corresponding to the image to be segmented;
the coding module is used for coding the depth image to obtain a depth coding result;
the fusion coding module is used for calling at least two segmentation coding networks and carrying out iterative fusion coding comprising spatial screening and channel recombination on the depth coding result and the image to be segmented to obtain a target coding result, wherein the spatial screening is used for carrying out feature screening on the image to be segmented in a spatial dimension, and the channel recombination is used for carrying out feature screening on the image to be segmented in a channel dimension;
and the semantic segmentation module is used for performing semantic segmentation on the image to be segmented based on the target coding result to obtain a semantic segmentation result corresponding to each object.
In some embodiments, the encoding module is further configured to perform downsampling on the depth image to obtain a downsampling result of the depth image; pooling the downsampling processing result of the depth image to obtain a pooling processing result of the depth image; and calling at least two depth coding networks, and performing iterative coding processing on the pooling processing result of the depth image to obtain the depth coding result.
In some embodiments, the coding module is further configured to invoke the 1st depth coding network and perform coding processing on the pooled processing result of the depth image to obtain a 1st depth coding result; invoke the (i+1)th depth coding network and code the ith depth coding result to obtain the (i+1)th depth coding result; and determine the Nth depth coding result as the depth coding result; wherein 1 ≤ i ≤ N-1, N represents the number of the depth coding networks, and the size of the (i+1)th depth coding network is smaller than that of the ith depth coding network.
In some embodiments, the depth coding network comprises at least two structurally identical coding layers, including a first coding layer and a second coding layer; the coding module is further configured to invoke the first coding layer, and perform coding processing on the ith depth coding result to obtain a first coding result; calling the second coding layer, and coding the first coding result to obtain a second coding result; determining the second encoding result as the i +1 th depth encoding result.
In some embodiments, the fusion coding module is further configured to perform downsampling on the image to be segmented to obtain a downsampling result of the image to be segmented; pooling the down-sampling processing result of the image to be segmented to obtain a pooling processing result of the image to be segmented; and calling the at least two segmentation coding networks, and carrying out iterative fusion coding comprising space screening and channel recombination on the pooling processing result of the image to be segmented and the depth coding result to obtain the target coding result.
In some embodiments, the depth coding results comprise i depth coding results, where 1 ≤ i ≤ N-1 and N represents the number of depth coding networks that encode the depth image; the fusion coding module is further configured to invoke the 1st segmentation coding network and perform fusion coding, including spatial screening and channel recombination, on the pooling processing result of the image to be segmented and the 1st depth coding result to obtain a 1st segmentation coding result; invoke the (i+1)th segmentation coding network and perform fusion coding, including spatial screening and channel recombination, on the ith segmentation coding result and the ith depth coding result to obtain an (i+1)th segmentation coding result; and determine the Nth segmentation coding result as the target coding result; wherein the size of the (i+1)th segmentation coding network is smaller than that of the ith segmentation coding network.
In some embodiments, the split coding network comprises at least two residual layers and at least one attention residual layer; the fusion coding module is further configured to invoke the at least two residual error layers, and perform feature extraction on the ith segmentation coding result to obtain a feature extraction result of the ith segmentation coding result; and calling the at least one attention residual error layer, and carrying out fusion coding comprising space screening and channel recombination on the feature extraction result and the ith depth coding result to obtain the (i + 1) th segmentation coding result.
In some embodiments, when the number of the attention residual layers is at least two, the fusion coding module is further configured to call the 1st attention residual layer and perform fusion coding, including spatial screening and channel recombination, on the feature extraction result and the ith depth coding result to obtain a 1st fusion coding result; call the jth attention residual layer and perform fusion coding, including spatial screening and channel recombination, on the ith depth coding result and the (j-1)th fusion coding result to obtain a jth fusion coding result, wherein 2 ≤ j ≤ M and M represents the number of the attention residual layers; and determine the Mth fusion coding result as the (i+1)th segmentation coding result.
In some embodiments, the attention residual layer includes a spatial attention layer, a channel attention layer, and a residual connection layer; the fusion coding module is further configured to call a spatial attention layer of the 1 st attention residual layer, and perform spatial screening on the feature extraction result and the ith depth coding result to obtain a spatial screening result of the 1 st attention residual layer; calling a channel attention layer of the 1 st attention residual error layer, and performing channel recombination on the space screening result to obtain a channel recombination result of the 1 st attention residual error layer; and calling the residual connecting layer of the 1 st attention residual layer, and fusing the channel recombination result and the feature extraction result to obtain a 1 st fusion coding result.
In some embodiments, the attention residual layer includes a spatial attention layer, a channel attention layer, and a residual connection layer; when the number of the attention residual error layers is one, the fusion coding module is further configured to call the spatial attention layer, and perform spatial screening on the feature extraction result and the ith depth coding result to obtain a spatial screening result; calling the channel attention layer, and performing channel recombination on the space screening result to obtain a channel recombination result; and calling the residual connecting layer, and fusing the channel recombination result and the feature extraction result to obtain the (i + 1) th segmentation coding result.
In some embodiments, the fusion coding module is further configured to perform convolution processing on the feature extraction result to obtain a first convolution processing result of the feature extraction result; performing dot product on the first convolution processing result and the ith depth coding result to obtain a first dot product result; and determining the first dot product result as the space screening result.
In some embodiments, the fusion coding module is further configured to perform convolution processing on the spatial screening result to obtain a second convolution processing result of the spatial screening result; performing convolution processing on the first convolution processing result to obtain a third convolution processing result; carrying out normalization processing on the third convolution processing result to obtain a normalization processing result; performing dot product on the normalization processing result and the second convolution processing result to obtain a second dot product result; and determining the second dot product result as the channel recombination result.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the semantic segmentation method of the image provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the application provides a computer-readable storage medium, which stores executable instructions for causing a processor to execute the method for semantic segmentation of an image provided by the embodiment of the application.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the semantic segmentation method for the image according to the embodiment of the present application.
The embodiment of the application has the following beneficial effects:
coding a depth image corresponding to an image to be segmented yields a depth coding result; the depth coding result and the image to be segmented are then fusion-coded, and semantic segmentation is performed on the image to be segmented based on the resulting target coding result to obtain the corresponding semantic segmentation result. Because the depth image carries the distance between each pixel point of the image to be segmented and the camera, performing iterative fusion coding, including spatial screening and channel recombination, on the depth coding result and the image to be segmented selects features in both the spatial and channel dimensions and fully mines the semantic information of the image. Performing semantic segmentation based on the resulting target coding result therefore fully mines the complementarity and interdependence between the semantics of the image and effectively improves the accuracy of semantic segmentation.
Drawings
FIG. 1 is a schematic structural diagram of a semantic segmentation system architecture of an image provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an apparatus for semantic segmentation of an image according to an embodiment of the present disclosure;
fig. 3A to fig. 3D are schematic flow charts of a semantic segmentation method for an image according to an embodiment of the present application;
fig. 4A to 4D are schematic diagrams illustrating a semantic segmentation method for an image according to an embodiment of the present disclosure.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without making creative efforts fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, references to the terms "first/second/third" are only used to distinguish similar objects and do not denote a particular order; where permitted, "first/second/third" may be interchanged in a specific order or sequence, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Semantic segmentation: semantic segmentation is a fundamental task in computer vision in which the visual input is divided into different, semantically interpretable categories; "semantically interpretable" means that the classification categories are meaningful in the real world.
2) Depth image (Depth Map): each pixel value of the depth image of the original image represents a distance of a corresponding pixel point in the original image from the camera.
3) Convolutional Neural Network (CNN): a type of Feedforward Neural Network (FNN) that includes convolution calculations and has a deep structure, and one of the representative algorithms of Deep Learning. A convolutional neural network has Representation Learning capability and can perform Shift-Invariant Classification of an input image according to its hierarchical structure.
4) Down-sampling: sampling a sample sequence every several samples yields a new sequence that is a down-sampled version of the original sequence; in practice, down-sampling is decimation. Reducing an image (also known as downsampling or subsampling) serves two main purposes: making the image fit the size of the display area, and generating a thumbnail of the corresponding image.
5) Pooling: pooling is used to highlight important feature information, compress features, reduce the amount of computation, and mitigate overfitting.
6) Self-Attention layer (Self-Attention): is a mechanism of attention that is used to focus on the correlation between different parts of the overall input.
7) Feedforward neural network (FFN): an artificial neural network in which the neurons are arranged in layers; each neuron is connected only to neurons of the previous layer, each layer receives the output of the previous layer and outputs to the next layer, and there is no feedback between layers. Feedforward neural networks include perceptron networks, BP networks, and RBF networks. The perceptron network is the simplest feedforward network; it is mainly used for pattern classification and can also be used in learning control and multi-modal control based on pattern classification. Perceptron networks can be divided into single-layer and multi-layer perceptron networks. A BP network is a feedforward network whose connection weights are adjusted with the Back Propagation learning algorithm; it differs from the perceptron in that the neuron transformation function of a BP network is a Sigmoid function, so the output is a continuous quantity between 0 and 1, and arbitrary nonlinear mappings from input to output can be realized. An RBF network is a feedforward network whose hidden-layer neurons are RBF neurons, i.e., neurons whose transformation function is a Radial Basis Function (RBF). A typical RBF network consists of three layers: an input layer, one or more RBF layers (hidden layers) consisting of RBF neurons, and an output layer consisting of linear neurons.
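As a toy illustration of this layered, feedback-free structure (layer sizes and input data are arbitrary assumptions), a small network with sigmoid transformation functions, as in a BP network, can be written as:

```python
import torch
import torch.nn as nn

# input layer -> hidden layer -> output layer; no feedback between layers
ffn = nn.Sequential(
    nn.Linear(4, 8),   # hidden layer, fully connected to the previous layer only
    nn.Sigmoid(),      # sigmoid transformation function, as in a BP network
    nn.Linear(8, 1),   # output layer
    nn.Sigmoid(),      # output is a continuous quantity between 0 and 1
)
y = ffn(torch.randn(2, 4))  # a single forward pass through the layers
```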
In the implementation process of the embodiment of the present application, the applicant finds that the following problems exist in the related art:
in the related art, for semantic segmentation, a single image to be segmented is generally used as an input of a segmentation network, and the segmentation network is called to process the image to be segmented, so as to obtain a corresponding semantic segmentation result. Therefore, information complementarity and interdependency of the image to be segmented cannot be fully mined, and the semantic segmentation accuracy is poor.
The embodiment of the application provides a semantic segmentation method and device for an image, electronic equipment, a computer-readable storage medium and a computer program product, which can fully mine complementarity and interdependency among semantics of the image and effectively improve the precision of semantic segmentation. An exemplary application of the semantic segmentation apparatus for images provided in the embodiments of the present application is described below, and the apparatus provided in the embodiments of the present application may be implemented as various types of user terminals such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, and a portable game device), and may also be implemented as a server.
Referring to fig. 1, fig. 1 is a schematic architecture diagram of a semantic segmentation system 100 for an image provided in an embodiment of the present application, and in order to implement an application scenario of defect detection, a terminal (an exemplary terminal 400 is shown) is connected to a server 200 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of the two.
The terminal 400 runs a client 410 and displays content on a graphical interface 410-1 (shown illustratively). The terminal 400 and the server 200 are connected to each other through a wired or wireless network.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal 400 may be a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, etc., but is not limited thereto. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited.
In some embodiments, the server 200 acquires the image to be segmented and the corresponding depth image from the terminal 400, and transmits the image to be segmented and the corresponding depth image to the terminal 400, and the terminal 400 determines a semantic segmentation result of each object of the image to be segmented based on the image to be segmented and the corresponding depth image, and transmits the semantic segmentation result to the server 200.
In other embodiments, the terminal 400 obtains an image to be segmented and a corresponding depth image, and sends the image to be segmented and the corresponding depth image to the server 200, and the server 200 determines a semantic segmentation result of each object of the image to be segmented based on the image to be segmented and the corresponding depth image, and sends the semantic segmentation result to the terminal 400.
In other embodiments, the vehicle-mounted camera shoots a driving picture, the vehicle-mounted terminal receives the driving picture shot by the vehicle-mounted camera and obtains a depth image corresponding to the driving picture, a semantic segmentation result of each object in the driving picture is determined based on the driving picture and the corresponding depth image, and a driving state of the vehicle is determined based on the semantic segmentation result of each object in the driving picture.
In other embodiments, the embodiments of the present application may be implemented by Cloud Technology (Cloud Technology), which refers to a hosting Technology for unifying resources of hardware, software, network, etc. in a wide area network or a local area network to implement calculation, storage, processing, and sharing of data.
Cloud technology is a general term for network, information, integration, management-platform, application, and other technologies based on the cloud-computing business model; it can form a resource pool that is used on demand and is flexible and convenient. Cloud computing will become an important support, as the background services of technical network systems require a large amount of computing and storage resources.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200 of a semantic segmentation method for an image according to an embodiment of the present application, where the server 200 shown in fig. 2 includes: at least one processor 210, memory 250, at least one network interface 220. The various components in server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 240 in fig. 2.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and can also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 250 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
The operating system 251, which includes system programs for handling various basic system services and performing hardware related tasks, such as a framework layer, a core library layer, a driver layer, etc., is used for implementing various basic services and for handling hardware based tasks.
A network communication module 252 for communicating with other electronic devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: Bluetooth, Wireless Fidelity (WiFi), and Universal Serial Bus (USB), among others.
In some embodiments, the semantic segmentation apparatus for images provided by the embodiments of the present application may be implemented in software, and fig. 2 illustrates the semantic segmentation apparatus 255 for images stored in the memory 250, which may be software in the form of programs and plug-ins, and includes the following software modules: the obtaining module 2551, the encoding module 2552, the fusion encoding module 2553, and the semantic segmentation module 2554 are logical, and thus may be arbitrarily combined or further split according to the implemented functions. The functions of the respective modules will be explained below.
In other embodiments, the semantic segmentation apparatus for images provided in the embodiments of the present Application may be implemented in hardware, and as an example, the semantic segmentation apparatus for images provided in the embodiments of the present Application may be a processor in the form of a hardware decoding processor, which is programmed to perform the semantic segmentation method for images provided in the embodiments of the present Application, for example, the processor in the form of the hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The semantic segmentation method for the image provided by the embodiment of the present application will be described in conjunction with an exemplary application and implementation of the server or the terminal provided by the embodiment of the present application.
In some embodiments, referring to fig. 4A, fig. 4A is a schematic diagram illustrating a semantic segmentation method for an image according to an embodiment of the present disclosure. Acquiring an image to be segmented 42 comprising at least two objects and a depth image 41 corresponding to the image to be segmented; and coding the depth image 41 to obtain a depth coding result. And performing fusion coding on the depth coding result and the image 42 to be segmented to obtain a target coding result. Based on the target coding result, the image to be segmented 42 is semantically segmented to obtain semantic segmentation results 43 corresponding to each object.
Referring to fig. 3A, fig. 3A is a schematic flowchart of a semantic segmentation method for an image according to an embodiment of the present application, described with reference to steps 101 to 104 shown in fig. 3A. The execution subject of steps 101 to 104 may be a server or a terminal; the following description takes the server as an example.
In step 101, a server acquires an image to be segmented including at least two objects and a depth image corresponding to the image to be segmented.
In some embodiments, the object may be an object in the image to be segmented, such as a tree, a cup, a person, or the like. Each pixel value of the depth image corresponding to the image to be segmented represents the distance between the corresponding pixel point in the original image and the camera.
By way of example, referring to fig. 4B, fig. 4B is a schematic diagram illustrating a semantic segmentation method for an image according to an embodiment of the present application. The image to be segmented shown in fig. 4B is an aerial image, and the aerial image shown in fig. 4B includes at least two objects, where the object may be an object in the image to be segmented, and the object may be a tree, a house, a road, and the like.
In step 102, the depth image is encoded to obtain a depth encoding result.
In some embodiments, encoding is also called image encoding, and refers to a technique of representing an image or information contained in an image with a smaller number of bits under a condition that a certain quality (a requirement of a signal-to-noise ratio or a subjective evaluation score) is satisfied.
In some embodiments, referring to fig. 3B, fig. 3B is a flowchart illustrating a semantic segmentation method for an image according to an embodiment of the present disclosure, and step 102 illustrated in fig. 3B may be implemented by following steps 1021 to 1023.
At step 1021, downsampling processing is performed on the depth image to obtain a downsampling processing result of the depth image.
In some embodiments, the downsampling process samples a sample sequence every several samples, so that the new sequence is a down-sampled version of the original sequence; in practice, down-sampling is decimation. Reducing an image (also known as downsampling or subsampling) serves two main purposes: making the image fit the size of the display area, and generating a thumbnail of the corresponding image.
In some embodiments, the step 1021 may be further implemented by: and calling a down-sampling layer, and performing down-sampling processing on the depth image to obtain a down-sampling processing result of the depth image.
By way of example, referring to fig. 4A, fig. 4A is a schematic diagram illustrating a semantic segmentation method for an image according to an embodiment of the present application. The down-sampling layer 51 is called to perform down-sampling processing on the depth image 41, and a down-sampling processing result of the depth image is obtained.
In step 1022, the downsampling processing result of the depth image is pooled to obtain a pooled depth image result.
In some embodiments, the pooling process is used to highlight important feature information, compress features, reduce the amount of computation, and mitigate overfitting.
In some embodiments, the above step 1022 may also be implemented by: and calling a pooling layer, and performing pooling on the down-sampling processing result of the depth image to obtain a pooling processing result of the depth image.
As an example, referring to fig. 4A, the pooling layer 53 is called, and the downsampling processing result of the depth image is pooled, so as to obtain the pooling processing result of the depth image.
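A minimal sketch of such a stem for the depth image, assuming a standard strided convolution for the down-sampling layer and max pooling for the pooling layer (kernel sizes, strides, and channel width are assumptions):

```python
import torch
import torch.nn as nn

depth_stem = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),  # down-sampling layer
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),      # pooling layer
)
# pooling processing result of the depth image, at 1/4 of the input resolution
pooled = depth_stem(torch.randn(1, 1, 480, 640))
```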
In step 1023, at least two depth coding networks are called to perform iterative coding processing on the depth image pooling processing result to obtain a depth coding result.
In some embodiments, the depth coding network is configured to code the result of the pooling process of the depth images, and the depth coding network includes at least two coding layers with the same structure, and the coding layers are configured to code the input data.
As an example, the depth coding network 54, the depth coding network 55, the depth coding network 56, and the depth coding network 57 are invoked to perform iterative coding processing on the pooled processing result of the depth images, so as to obtain a depth coding result.
In some embodiments, referring to fig. 3C, fig. 3C is a schematic flowchart of a semantic segmentation method for an image provided in an embodiment of the present application, and step 1023 illustrated in fig. 3C may be implemented by the following steps 10231 to 10233.
In step 10231, a 1 st depth coding network is called to perform coding processing on the depth image pooling processing result, so as to obtain a 1 st depth coding result.
For example, referring to fig. 4A, the 1 st depth coding network 54 is called to perform coding processing on the pooled processing result of the depth images, so as to obtain the 1 st depth coding result.
In step 10232, an (i + 1) th depth coding network is called to perform coding processing on an ith depth coding result to obtain an (i + 1) th depth coding result.
In some embodiments, 1 ≤ i ≤ N-1, N represents the number of the depth coding networks, the size of the (i+1)th depth coding network is smaller than that of the ith depth coding network, and the structure of the (i+1)th depth coding network is the same as that of the ith depth coding network.
As an example, when i is equal to 1, referring to fig. 4A, the 2 nd depth coding network 55 is called, and the 1 st depth coding result is subjected to coding processing, resulting in the 2 nd depth coding result.
As an example, when i is 2, referring to fig. 4A, the 3 rd depth coding network 56 is called, and the 2 nd depth coding result is subjected to coding processing, so as to obtain the 3 rd depth coding result.
As an example, when i is 3, referring to fig. 4A, the 4 th depth coding network 57 is called, and the 3 rd depth coding result is subjected to coding processing, resulting in a 4 th depth coding result.
In some embodiments, the depth coding network comprises at least two structurally identical coding layers, including a first coding layer and a second coding layer; the above step 10232 can be implemented by: calling a first coding layer, and coding an ith depth coding result to obtain a first coding result; calling a second coding layer, and coding the first coding result to obtain a second coding result; and determining the second coding result as the (i + 1) th depth coding result.
For example, referring to fig. 4A, a first coding layer 541 is invoked to perform coding processing on the ith depth coding result, so as to obtain a first coding result; calling a second coding layer 542, and coding the first coding result to obtain a second coding result; and determining the second coding result as the (i + 1) th depth coding result.
For example, referring to fig. 4A, a first coding layer 541 is invoked to perform coding processing on the ith depth coding result, so as to obtain a first coding result; calling a second coding layer 542, and coding the first coding result to obtain a second coding result; calling the third coding layer 543, and coding the second coding result to obtain a third coding result; and determining the third coding result as the (i + 1) th depth coding result.
In step 10233, the nth depth coding result is determined as the depth coding result.
Where 1 ≤ i ≤ N-1, N represents the number of the depth coding networks, and the size of the (i+1)th depth coding network is smaller than that of the ith depth coding network.
As an example, referring to fig. 4A, the 4th depth coding result is determined as the depth coding result, and the number of depth coding networks 54 to 57 is 4.
In this way, the depth image of the image to be segmented is coded by depth coding networks each comprising at least two structurally identical coding layers, so the multi-modal features of the depth image can be explicitly aggregated at several different scales. This facilitates the subsequent fusion coding based on the multi-modal features of the depth image and the image to be segmented, so that the complementarity and interdependence between the semantics of the image can be fully mined and the accuracy of semantic segmentation effectively improved.
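A minimal sketch of steps 10231 to 10233, assuming each depth coding network is a stage of two structurally identical convolutional coding layers whose first layer halves the feature-map size (channel widths and layer structure are assumptions):

```python
import torch
import torch.nn as nn

def coding_layer(c_in, c_out, stride=1):
    # one coding layer; all coding layers in a stage share this structure
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride, 1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class DepthCodingStage(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.first = coding_layer(c_in, c_out, stride=2)  # first coding layer
        self.second = coding_layer(c_out, c_out)          # second coding layer

    def forward(self, x):
        return self.second(self.first(x))

stages = nn.ModuleList(DepthCodingStage(ci, co)
                       for ci, co in [(64, 64), (64, 128), (128, 256), (256, 512)])

x = torch.randn(1, 64, 120, 160)  # pooling processing result of the depth image
depth_codes = []
for stage in stages:              # the (i+1)th network codes the ith result
    x = stage(x)
    depth_codes.append(x)         # depth_codes[-1] is the Nth (final) result
```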
In step 103, at least two segmentation coding networks are called, and iterative fusion coding including spatial screening and channel recombination is performed on the depth coding result and the image to be segmented, so as to obtain a target coding result.
In some embodiments, the spatial screening is used for performing the feature screening on the image to be segmented in a spatial dimension, and the channel reorganization is used for performing the feature screening on the image to be segmented in a channel dimension, wherein the spatial screening is implemented by a spatial attention layer in the segmentation coding network, and the channel reorganization is implemented by a channel attention layer in the segmentation coding network.
In some embodiments, the fusion encoding is used to encode at least two different inputs, resulting in an encoded result. The fusion coding may be implemented by a split coding network comprising at least two residual layers and at least one attention residual layer.
In some embodiments, referring to fig. 3B, fig. 3B is a flowchart illustrating a semantic segmentation method for an image provided in an embodiment of the present application, and step 103 illustrated in fig. 3B may be implemented by the following steps 1031 to 1033.
In step 1031, a downsampling process is performed on the image to be segmented to obtain a downsampling process result of the image to be segmented.
In some embodiments, the downsampling process samples a sample sequence every several samples, so that the new sequence is a down-sampled version of the original sequence; in practice, down-sampling is decimation. Reducing an image (also known as downsampling or subsampling) serves two main purposes: making the image fit the size of the display area, and generating a thumbnail of the corresponding image.
In some embodiments, the step 1031 may be further implemented as follows: and calling a down-sampling layer, and performing down-sampling processing on the image to be segmented to obtain a down-sampling processing result of the image to be segmented.
By way of example, referring to fig. 4A, fig. 4A is a schematic diagram illustrating a semantic segmentation method for an image according to an embodiment of the present application. The downsampling layer 521 is called to perform downsampling processing on the image to be segmented 42, and a downsampling processing result of the image to be segmented is obtained.
In step 1032, the downsampling processing result of the image to be segmented is pooled to obtain a pooled processing result of the image to be segmented.
In some embodiments, the pooling process is used to highlight important feature information, compress features, reduce the amount of computation, and mitigate overfitting.
In some embodiments, the step 1032 may be further implemented by: and calling a pooling layer, and pooling the down-sampling processing result of the image to be segmented to obtain a pooling processing result of the image to be segmented.
As an example, referring to fig. 4A, the pooling layer 522 is invoked to pool the downsampling processing result of the image to be segmented 42, resulting in a pooled processing result of the image to be segmented 42.
In step 1033, at least two segmentation coding networks are invoked, and the pooling processing result and the depth coding result of the image to be segmented are subjected to iterative fusion coding including spatial screening and channel recombination, so as to obtain a target coding result.
In some embodiments, the split coding network is configured to perform a fusion coding on at least two inputs to obtain a target coding result.
In some embodiments, the depth coding results include i depth coding results, where 1 ≤ i ≤ N-1 and N characterizes the number of depth coding networks that encode the depth image; referring to fig. 3D, fig. 3D is a flowchart illustrating a semantic segmentation method for an image according to an embodiment of the present disclosure, and step 1033 illustrated in fig. 3D may be implemented by steps 10331 to 10333.
In step 10331, the 1 st segmentation coding network is invoked, and fusion coding including spatial screening and channel recombination is performed on the pooling processing result of the image to be segmented and the 1 st depth coding result to obtain the 1 st segmentation coding result.
For example, referring to fig. 4A, the 1 st segmentation coding network 58 is called to perform fusion coding on the pooling processing result of the image to be segmented and the 1 st depth coding result, so as to obtain the 1 st segmentation coding result.
In step 10332, an (i + 1) th segmentation coding network is called, and fusion coding including spatial screening and channel recombination is performed on the ith segmentation coding result and the ith depth coding result to obtain an (i + 1) th segmentation coding result.
As an example, referring to fig. 4A, when i is equal to 1, the 2 nd segmentation coding network 59 is invoked to perform fusion coding on the 1 st segmentation coding result and the 1 st depth coding result, so as to obtain the 2 nd segmentation coding result.
As an example, referring to fig. 4A, when i is 2, the 3 rd split coding network 60 is invoked, and the 2 nd split coding result and the 2 nd depth coding result are fusion coded to obtain the 3 rd split coding result.
As an example, referring to fig. 4A, when i is 3, the 4 th segmentation coding network 61 is called, and the 3 rd segmentation coding result and the 3 rd depth coding result are fusion coded to obtain the 4 th segmentation coding result.
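A minimal sketch of this iteration, assuming each segmentation coding network is a module taking (features, depth code). Following the examples above, the 1st depth coding result is consumed by both the 1st and the 2nd networks, and spatial-size alignment between the inputs is glossed over:

```python
import torch
import torch.nn as nn

def fuse_iteratively(seg_stages: nn.ModuleList,
                     pooled_image: torch.Tensor,
                     depth_codes: list) -> torch.Tensor:
    # 1st network: pooled image features fused with the 1st depth coding result
    x = seg_stages[0](pooled_image, depth_codes[0])
    for i in range(1, len(seg_stages)):
        # (i+1)th network: ith segmentation coding result + ith depth coding result
        x = seg_stages[i](x, depth_codes[i - 1])
    return x  # Nth segmentation coding result = target coding result
```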
In some embodiments, the split coding network comprises at least two residual layers and at least one attention residual layer; step 10332 may be implemented as follows: calling at least two residual error layers, and performing feature extraction on the ith segmentation coding result to obtain a feature extraction result of the ith segmentation coding result; and calling at least one attention residual error layer, and carrying out fusion coding comprising space screening and channel recombination on the feature extraction result and the ith depth coding result to obtain an (i + 1) th segmentation coding result.
In some embodiments, the residual layer refers to a Residual Network, which is characterized by being easy to optimize and able to improve accuracy by adding considerable depth. Its internal residual blocks use skip connections, which alleviates the vanishing-gradient problem caused by increasing depth in deep neural networks.
As an example, referring to fig. 4A, at least two residual layers (residual layer 581 and residual layer 582) are called, and feature extraction is performed on the ith segmentation and coding result to obtain a feature extraction result of the ith segmentation and coding result; and calling at least one attention residual layer (attention residual layer 583), and performing fusion coding on the feature extraction result and the ith depth coding result to obtain an (i + 1) th segmentation coding result.
In this way, features are extracted from the segmentation coding result through at least two residual layers; because residual layers can improve accuracy while adding considerable depth, the vanishing-gradient problem caused by increasing depth in a deep neural network is effectively alleviated.
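A minimal sketch of one such segmentation coding network, assuming BasicBlock-style residual layers and an attention residual layer supplied as a module (its internals are sketched further below):

```python
import torch
import torch.nn as nn

class ResidualLayer(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, 1, 1), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, 1, 1), nn.BatchNorm2d(c))

    def forward(self, x):
        return torch.relu(self.body(x) + x)  # skip (jump) connection

class SegCodingStage(nn.Module):
    def __init__(self, c, attention_residual: nn.Module):
        super().__init__()
        self.res1, self.res2 = ResidualLayer(c), ResidualLayer(c)
        self.attn = attention_residual  # fuses features with the depth code

    def forward(self, seg_result, depth_code):
        f_c = self.res2(self.res1(seg_result))  # feature extraction result
        return self.attn(f_c, depth_code)       # (i+1)th segmentation coding result
```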
In some embodiments, when the number of the attention residual layers is at least two, the aforementioned invoking of at least one attention residual layer to perform fusion coding, including spatial screening and channel recombination, on the feature extraction result and the ith depth coding result to obtain the (i+1)th segmentation coding result may be implemented as follows: call the 1st attention residual layer and perform fusion coding, including spatial screening and channel recombination, on the feature extraction result and the ith depth coding result to obtain a 1st fusion coding result; call the jth attention residual layer and perform fusion coding, including spatial screening and channel recombination, on the ith depth coding result and the (j-1)th fusion coding result to obtain a jth fusion coding result, where 2 ≤ j ≤ M and M represents the number of the attention residual layers; and determine the Mth fusion coding result as the (i+1)th segmentation coding result.
As an example, referring to fig. 4A, when j is 2, the 1 st attention residual layer 591 is called, and fusion coding is performed on the feature extraction result and the ith depth coding result to obtain a 1 st fusion coding result; and calling a 2 nd attention residual layer 592, and performing fusion coding on the 1 st depth coding result and the 1 st fusion coding result to obtain a 2 nd fusion coding result.
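A minimal sketch of stacking M attention residual layers as described, with the layers supplied as assumed modules taking (features, depth code):

```python
import torch.nn as nn

class StackedAttentionResidual(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers  # M attention residual layers

    def forward(self, f_c, depth_code):
        x = self.layers[0](f_c, depth_code)  # 1st fusion coding result
        for layer in self.layers[1:]:
            # jth layer fuses the depth code with the (j-1)th fusion coding result
            x = layer(x, depth_code)
        return x  # Mth result = (i+1)th segmentation coding result
```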
In some embodiments, the attention residual layer includes a spatial attention layer, a channel attention layer, and a residual connection layer; the above-mentioned calling 1 st attention residual layer, to the feature extraction result and the ith depth coding result, carry on the fusion coding, get 1 st fusion coding result, can be realized through the following way: calling a spatial attention layer of the 1 st attention residual error layer, and carrying out spatial screening on the feature extraction result and the ith depth coding result to obtain a spatial screening result of the 1 st attention residual error layer; calling a channel attention layer of the 1 st attention residual error layer, and performing channel recombination on the spatial screening result to obtain a channel recombination result of the 1 st attention residual error layer; and calling a residual connecting layer of the 1 st attention residual layer, and fusing the channel recombination result and the feature extraction result to obtain a 1 st fusion coding result.
In some embodiments, the spatial attention layer is used for spatial screening of the feature extraction result and the ith depth coding result; the channel attention layer is used for channel recombination of the space screening result; the residual connecting layer is used for performing fusion processing on the channel recombination result and the feature extraction result, wherein the fusion processing may be summation of the channel recombination result and the feature extraction result.
For example, referring to fig. 4C, fig. 4C is a schematic diagram illustrating a semantic segmentation method for an image according to an embodiment of the present application. Calling a spatial attention layer of the 1 st attention residual error layer, and carrying out spatial screening on the feature extraction result and the ith depth coding result to obtain a spatial screening result of the 1 st attention residual error layer; calling a channel attention layer of the 1 st attention residual error layer, and performing channel recombination on the spatial screening result to obtain a channel recombination result of the 1 st attention residual error layer; and calling a residual connecting layer of the 1 st attention residual layer, and fusing the channel recombination result and the feature extraction result to obtain a 1 st fusion coding result.
In other embodiments, the attention residual layer includes a spatial attention layer, a channel attention layer, and a residual connection layer; when the number of the attention residual error layers is one, the above-mentioned calling at least one attention residual error layer, and performing fusion coding on the feature extraction result and the ith depth coding result to obtain an i +1 th segmentation coding result, which may be implemented by the following manners: calling a spatial attention layer, and carrying out spatial screening on the feature extraction result and the ith depth coding result to obtain a spatial screening result; calling a channel attention layer, and performing channel recombination on the spatial screening result to obtain a channel recombination result; and calling a residual connecting layer, and fusing the channel recombination result and the feature extraction result to obtain an i +1 th segmentation coding result.
As an example, referring to FIG. 4C, the spatial attention layer is called, and the feature extraction result F_C and the ith depth coding result F_D are spatially screened to obtain a spatial screening result F_N; the channel attention layer is called, and channel recombination is performed on the spatial screening result F_N to obtain a channel recombination result F_Y; and the residual connecting layer is called, and the channel recombination result F_Y and the feature extraction result F_C are fused to obtain the (i + 1)th segmentation coding result.
In some embodiments, the invoking of the spatial attention layer, and performing spatial filtering on the feature extraction result and the ith depth coding result to obtain a spatial filtering result may be implemented in the following manner: performing convolution processing on the feature extraction result to obtain a first convolution processing result of the feature extraction result; performing dot product on the first convolution processing result and the ith depth coding result to obtain a first dot product result; and determining the first dot product result as a space screening result.
In some embodiments, the dot product is mathematically called the scalar product; it refers to a binary operation that takes two vectors over the real numbers R and returns a real-valued scalar, and it is the standard inner product of Euclidean space.
In functional analysis, convolution is a mathematical operation that generates a third function from two functions f and g. It is essentially a special integral transformation that characterizes the integral of the product of f with a flipped and shifted copy of g over their region of overlap. If one of the functions participating in the convolution is regarded as an indicator function of an interval, convolution can also be considered a generalization of the "moving average".
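In standard notation, the convolution of two functions f and g over the real line may be written as:

    (f ∗ g)(t) = ∫_{−∞}^{+∞} f(τ) g(t − τ) dτ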
As an example, referring to FIG. 4C, convolution processing is performed on the feature extraction result F_C to obtain a first convolution processing result; a dot product is performed on the first convolution processing result and the ith depth coding result F_D to obtain a first dot product result; and the first dot product result is determined as the spatial screening result F_N.
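A minimal sketch of this spatial screening step, assuming PyTorch and illustrative channel counts (the exact convolution configuration is described later with reference to fig. 4C):

    import torch.nn as nn

    class SpatialScreening(nn.Module):
        # Sketch: c_rgb and c_depth are assumed channel counts.
        def __init__(self, c_rgb, c_depth):
            super().__init__()
            # dimensionality-reducing convolutions applied to the feature
            # extraction result F_C, yielding the first convolution result F_M
            self.reduce = nn.Sequential(
                nn.Conv2d(c_rgb, c_depth, kernel_size=1),
                nn.Conv2d(c_depth, c_depth, kernel_size=3, padding=1),
            )

        def forward(self, f_c, f_d):
            f_m = self.reduce(f_c)   # first convolution processing result F_M
            f_n = f_m * f_d          # element-wise product with the depth coding result
            return f_n, f_m          # spatial screening result F_N (and F_M for later reuse)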
In some embodiments, the invoking of the channel attention layer to perform channel reorganization on the spatial screening result to obtain a channel reorganization result may be implemented in the following manner: performing convolution processing on the space screening result to obtain a second convolution processing result of the space screening result; performing convolution processing on the first convolution processing result to obtain a third convolution processing result; carrying out normalization processing on the third convolution processing result to obtain a normalization processing result; performing dot product on the normalization processing result and the second convolution processing result to obtain a second dot product result; and determining the second dot product result as a channel recombination result.
In some embodiments, normalization is a way of simplifying computation: an expression with dimensions is transformed into a dimensionless expression, i.e., a scalar.
As an example, referring to FIG. 4C, convolution processing is performed on the spatial screening result F_N to obtain a second convolution processing result F_Q of the spatial screening result; convolution processing is performed on the first convolution processing result to obtain a third convolution processing result F_P; normalization processing is performed on the third convolution processing result F_P to obtain a normalization processing result F_X; a dot product is performed on the normalization processing result F_X and the second convolution processing result F_Q to obtain a second dot product result; and the second dot product result is determined as the channel recombination result F_Y.
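A minimal sketch of this channel recombination step, under the assumption (made explicit later with reference to fig. 4C) that the channel weights come from global average pooling, a fully connected layer and a softmax normalization:

    import torch
    import torch.nn as nn

    class ChannelReorganization(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.to_q = nn.Conv2d(channels, channels, kernel_size=1)  # F_N -> F_Q
            self.to_p = nn.Conv2d(channels, channels, kernel_size=1)  # F_M -> F_P
            self.fc = nn.Linear(channels, channels)

        def forward(self, f_n, f_m):
            f_q = self.to_q(f_n)                    # second convolution processing result
            f_p = self.to_p(f_m)                    # third convolution processing result
            w = f_p.mean(dim=(2, 3))                # global average pooling
            f_x = torch.softmax(self.fc(w), dim=1)  # normalized channel weight vector F_X
            return f_q * f_x[:, :, None, None]      # channel recombination result F_Y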
In some embodiments, the first convolution processing result and the third convolution processing result are different in size.
Therefore, through the design of the attention residual layer, the depth image serves as a spatial weight to screen the features of the image to be segmented in the spatial dimension; the channel attention layer of the attention residual layer performs feature selection in the channel dimension through the adaptive weights of different channels; and the residual connecting layer then fuses the selected features with the features of the input image to be segmented. This explicitly increases the difference between different classes of objects and makes full use of the complementary information in the multi-modal features, effectively improving the accuracy of semantic segmentation.
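Putting the pieces together, one attention residual layer can be sketched end-to-end, continuing the SpatialScreening and ChannelReorganization sketches above; the 1 × 1 output projection is an assumption used here to align channel counts for the residual sum:

    import torch.nn as nn

    class DualAttentionResidual(nn.Module):
        def __init__(self, c_rgb, c_depth):
            super().__init__()
            self.spatial = SpatialScreening(c_rgb, c_depth)
            self.channel = ChannelReorganization(c_depth)
            self.proj = nn.Conv2d(c_depth, c_rgb, kernel_size=1)  # assumed alignment

        def forward(self, f_c, f_d):
            f_n, f_m = self.spatial(f_c, f_d)  # spatial screening
            f_y = self.channel(f_n, f_m)       # channel recombination
            return f_c + self.proj(f_y)        # residual connecting layer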
In step 10333, the Nth segmentation coding result is determined as the target coding result.
The size of the (i + 1)th segmentation coding network is smaller than that of the ith segmentation coding network, and the structure of the (i + 1)th segmentation coding network is the same as that of the ith segmentation coding network.
In some embodiments, referring to fig. 4D, fig. 4D is a schematic diagram of a semantic segmentation method for an image provided by an embodiment of the present application. The network 50 comprises a depth coding network and a segmentation coding network, both comprising 5 layers. The first layer of the depth coding network of the network 50 outputs an image of size (H/2) × (W/2), where H represents the height of the image to be segmented and W represents the width of the image to be segmented; this first layer is used for down-sampling the depth image. The second layer of the depth coding network of the network 50 outputs an image of size (H/4) × (W/4) and comprises a pooling layer and at least two coding layers of size 64. The third layer of the depth coding network of the network 50 outputs an image of size (H/8) × (W/8) and comprises at least two coding layers of size 128. The fourth layer of the depth coding network of the network 50 outputs an image of size (H/8) × (W/8) and comprises at least two coding layers of size 256. The fifth layer of the depth coding network of the network 50 outputs an image of size (H/8) × (W/8) and comprises at least two coding layers of size 512.
In some embodiments, referring to FIG. 4D, the first layer of the segmentation coding network of the network 50 outputs an image of size (H/2) × (W/2) and is used for down-sampling processing. The second layer of the segmentation coding network of the network 50 outputs an image of size (H/4) × (W/4) and comprises a pooling layer, two residual layers and one attention residual layer. The third layer of the segmentation coding network of the network 50 outputs an image of size (H/8) × (W/8) and comprises three residual layers and one attention residual layer. The fourth layer of the segmentation coding network of the network 50 outputs an image of size (H/8) × (W/8) and comprises five residual layers and one attention residual layer. The fifth layer of the segmentation coding network of the network 50 outputs an image of size (H/8) × (W/8) and comprises two residual layers and one attention residual layer.
In some embodiments, referring to fig. 4D, the network 101 comprises a depth coding network and a segmentation coding network, both comprising 5 layers. The first layer of the depth coding network of the network 101 outputs an image of size (H/2) × (W/2), where H represents the height of the image to be segmented and W represents the width of the image to be segmented; this first layer is used for down-sampling the depth image. The second layer of the depth coding network of the network 101 outputs an image of size (H/4) × (W/4) and comprises a pooling layer and at least two coding layers of size 64. The third layer of the depth coding network of the network 101 outputs an image of size (H/8) × (W/8) and comprises at least two coding layers of size 128. The fourth layer of the depth coding network of the network 101 outputs an image of size (H/8) × (W/8) and comprises at least two coding layers of size 256. The fifth layer of the depth coding network of the network 101 outputs an image of size (H/8) × (W/8) and comprises at least two coding layers of size 512.
In some embodiments, referring to FIG. 4D, the first layer of the segmentation coding network of the network 101 outputs an image of size (H/2) × (W/2) and is used for down-sampling processing. The second layer of the segmentation coding network of the network 101 outputs an image of size (H/4) × (W/4) and comprises a pooling layer, two residual layers and one attention residual layer. The third layer of the segmentation coding network of the network 101 outputs an image of size (H/8) × (W/8) and comprises three residual layers and one attention residual layer. The fourth layer of the segmentation coding network of the network 101 outputs an image of size (H/8) × (W/8) and comprises 22 residual layers and one attention residual layer. The fifth layer of the segmentation coding network of the network 101 outputs an image of size (H/8) × (W/8) and comprises two residual layers and one attention residual layer.
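The per-stage output sizes described above can be summarized as follows; the fractions of the input height H and width W are a reconstruction based on the statement (see below) that the encoder output is kept at 1/8 of the input resolution by using dilated convolution in the last two stages:

    # Illustrative summary; the sizes are inferred, not quoted verbatim
    # from the source figures.
    STAGE_OUTPUT_SIZES = {
        1: "H/2 x W/2",   # down-sampling stem
        2: "H/4 x W/4",   # pooling layer + coding/residual layers
        3: "H/8 x W/8",
        4: "H/8 x W/8",   # dilated, no further down-sampling
        5: "H/8 x W/8",   # dilated, no further down-sampling
    }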
In step 104, semantic segmentation is performed on the image to be segmented based on the target coding result, so as to obtain semantic segmentation results corresponding to each object.
In some embodiments, the step 104 may be implemented by: and decoding the target coding result, and determining the decoding result as a semantic segmentation result of each object corresponding to the image to be segmented.
In some embodiments, the decoding process may be implemented in the following manner: calling the context module, and decoding the target coding result to obtain a decoding processing result.
In some embodiments, the context module includes a pyramid pooling module, a spatial pyramid pooling module, and a self-attention module. The Pyramid Pooling Module (PPM) pools features at multiple scales, which effectively enlarges the receptive field and improves the utilization efficiency of global information. The spatial pyramid pooling module (ASPP) can generate an output of fixed size regardless of the size of the input image; it uses a plurality of pooling windows and can take different sizes (scales) of the same image as input to obtain pooling features of the same length. The self-attention module is used to find the degree of association between each vector and the other vectors (including itself).
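As an illustration of the first of these modules, a minimal pyramid pooling sketch in PyTorch; the bin sizes 1/2/3/6 are the common PSPNet choice, assumed here rather than taken from the source:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidPoolingModule(nn.Module):
        def __init__(self, in_ch, bins=(1, 2, 3, 6)):
            super().__init__()
            self.stages = nn.ModuleList(
                nn.Sequential(
                    nn.AdaptiveAvgPool2d(b),                             # pool at one scale
                    nn.Conv2d(in_ch, in_ch // len(bins), kernel_size=1),
                )
                for b in bins
            )

        def forward(self, x):
            h, w = x.shape[2:]
            feats = [x]
            for stage in self.stages:
                y = stage(x)
                feats.append(F.interpolate(y, size=(h, w), mode="bilinear",
                                           align_corners=False))         # restore resolution
            return torch.cat(feats, dim=1)                               # aggregate global context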
In this way, a depth coding result is obtained by coding the depth image corresponding to the image to be segmented; the depth coding result and the image to be segmented are then fusion-coded, and semantic segmentation is performed on the image to be segmented based on the obtained target coding result to obtain the corresponding semantic segmentation result. Because the depth image carries the distance information between each pixel point in the image to be segmented and the camera, subjecting the depth coding result and the image to be segmented to iterative fusion coding comprising spatial screening and channel recombination performs feature selection on the image to be segmented in both the spatial dimension and the channel dimension, so that the semantic information of the image can be fully mined. Performing semantic segmentation based on the resulting target coding result can then fully exploit the complementarity and interdependence between image semantics, effectively improving the accuracy of semantic segmentation.
In the following, an exemplary application of the embodiment of the present application in an actual semantic segmentation application scenario will be described.
Image processing tasks performed in the field of computer vision can be roughly classified into the following three categories: image classification, target detection and image segmentation, where image segmentation comprises semantic segmentation and instance segmentation. The semantic segmentation method for an image provided by the embodiment of the present application belongs to the semantic segmentation task among image processing tasks. Semantic segmentation of an image assigns a high-level semantic label to each pixel, that is, it classifies each pixel in the image, where high-level semantic labels refer to the various object classes (e.g., human, animal, car, etc.) and background classes (e.g., sky, grassland, etc.) in the image. The semantic segmentation task places high requirements on both classification precision and positioning precision: on the one hand, the boundary of the outline of an object needs to be accurately positioned; on the other hand, the region inside the outline needs to be accurately classified, so that a specific object can be well segmented from the background. How to maintain the balance between positioning accuracy and classification accuracy is therefore an important problem in semantic segmentation. Generally speaking, improving classification accuracy requires enlarging the receptive field of the deep network so that more information can be fused, but enlarging the receptive field causes a great amount of image detail to be lost, which is unfavorable for locating boundaries. One of the goals of improving semantic segmentation is therefore to fuse more global information without losing local details of the image.
According to the image semantic segmentation method provided by the embodiment of the application, complementary information exists between the image to be segmented and the depth image corresponding to the image to be segmented, and the performance of semantic segmentation can be remarkably improved through multi-modal image data to be segmented and depth image data.
In some embodiments, referring to fig. 4A, a progressive attention fusion network provided in an embodiment of the present application includes an encoding network portion and a decoding network portion, where the encoding network portion employs a progressive fusion encoder, and the progressive fusion encoder includes two branches, and simultaneously extracts features from an image to be segmented and a depth image, and fuses the depth image into a branch corresponding to the image to be segmented at each scale, so as to enhance the discrimination capability of the encoding network for objects with different sizes.
In some embodiments, referring to fig. 4B, fig. 4B is a schematic diagram illustrating a semantic segmentation method for an image according to an embodiment of the present disclosure. When the image to be segmented is an aerial image, the segmentation result obtained after performing semantic segmentation on the image to be segmented by the image semantic segmentation method provided by the embodiment of the application is shown in fig. 4B.
In some embodiments, the semantic segmentation method for the image provided by the embodiment of the present application obtains multi-scale image features through a dual-stream progressive fusion encoder module and a decoder, and effectively aggregates depth features into branches of a multi-scale image to be segmented through a dual attention residual network. The decoding part comprises a pyramid pool module, a spatial pyramid pool module and a self-attention module so as to capture context information for semantic segmentation.
In some embodiments, referring to fig. 4D, fig. 4D is a schematic diagram of a semantic segmentation method for an image provided by an embodiment of the present application. In order to improve the recognition capability of the model for multi-scale targets in the image to be segmented, a dual-stream progressive fusion encoder is designed in the embodiment of the present application for extracting and fusing multi-modal features at multiple scales. Specifically, the image to be segmented and the depth image are feature-coded using ResNet-50/101 and ResNet-34, respectively. As shown in fig. 4D, considering that dilated convolution can reduce the loss of spatial resolution while maintaining a large receptive field, dilated convolution is used in the last two layers of the original backbone network and the down-sampling operation there is removed, so that the spatial size of the output is 1/8 of the input image. Meanwhile, in order to fuse the depth features into the branch of the image to be segmented at multiple scales, a dual attention residual module is proposed to replace the last residual block of each stage in the RGB branch.
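For reference, this kind of dual-stream backbone can be instantiated with torchvision; the use of torchvision here is an illustration of the described configuration, not the patented code:

    from torchvision.models import resnet34, resnet50

    # RGB branch: dilated convolution replaces the stride of the last two
    # stages, keeping the output at 1/8 of the input resolution.
    rgb_backbone = resnet50(replace_stride_with_dilation=[False, True, True])

    # Depth branch: a lighter ResNet-34 encoder, as described above.
    depth_backbone = resnet34()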
In some embodiments, referring to fig. 4C, fig. 4C is a schematic diagram of a semantic segmentation method for an image provided by an embodiment of the present application. As shown in fig. 4C, the dual attention residual module is used to fuse multi-modal features at each scale: given the features of the image to be segmented and the features of the depth image, the spatial attention layer and the channel attention layer are executed in sequence, the output of the channel attention layer is fused with the input of the dual attention residual module, and the fused result is the output of the dual attention residual module.
In some embodiments, referring to fig. 4C, for the spatial attention layer: since the depth image can reflect the category semantic distribution to some extent (for example, buildings always have a large height, while the height of a permeable surface is almost zero), the depth feature F_D is employed as a spatial attention weight, which can adaptively preserve the features of higher objects and filter out the features of lower objects. Specifically, the dimension-reduced feature F_M may be obtained by a convolution with a scale of 1 × 1 followed by a convolution with a scale of 3 × 3. The expression of the output of the spatial attention layer may be:
F_N = F_M ⊙ F_D    (1)

wherein F_N represents the output of the spatial attention layer, F_M represents the feature after dimensionality reduction, F_D represents the depth feature, and ⊙ represents the dot product.
In some embodiments, referring to FIG. 4C, for the channel attention layer: effective feature channels are selected from the feature F_N by channel rescaling. Specifically, 1 × 1 convolutions are respectively applied to the feature maps F_M and F_N to obtain the convolved feature maps F_P and F_Q, which ensures channel consistency between the two feature maps. Then, global average pooling and a fully connected layer adaptively obtain the weight of each channel, and a channel weight vector is obtained through a normalization operation. Finally, the expression of the output of the channel attention layer may be:
F_Y(i, j, k) = F_X(i) ⊙ F_Q(i, j, k)    (2)

where i, j and k index the channel, height and width dimensions of the feature, respectively.
In some embodiments, referring to fig. 4C, the spatial attention layer filters RGB features in spatial dimension using depth features as spatial weights, and the channel attention layer performs feature selection in channel dimension through adaptive weights of different channels. And finally, fusing the selected features with the features of the input image to be segmented by the residual connecting layer, and enhancing the features of the object with higher height. Therefore, the dual attention residual module provided by the embodiment of the application can explicitly increase the feature difference between the categories, and fully utilize complementary information in the multi-modal features to perform semantic segmentation.
In some embodiments, referring to fig. 4A, the network shown in fig. 4A may be trained in the following manner: considering that the number of pixels of each class of object in the training set differs greatly, the network shown in fig. 4A may be trained using a weighted cross-entropy loss function, whose expression may be:
L = -Σ_{i=1}^{N} w_i · y_i · log(p_i)    (3)

wherein y_i represents the real label of the current pixel i, p_i represents the normalized prediction result of the network shown in fig. 4A, w_i represents the weight of the ith pixel, and N represents the total number of pixel classes of the image to be segmented.
The weight of the ith pixel is given by an expression that appears only as an image in the source document.
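A weighted cross-entropy loss of this form is available directly in PyTorch; the class count and the weights below are placeholders, since the per-pixel weight expression is not legible in the source:

    import torch
    import torch.nn as nn

    num_classes = 6                              # assumed number of classes
    class_weights = torch.ones(num_classes)      # placeholder per-class weights
    criterion = nn.CrossEntropyLoss(weight=class_weights)

    logits = torch.randn(2, num_classes, 64, 64)         # network output (B, C, H, W)
    target = torch.randint(0, num_classes, (2, 64, 64))  # ground-truth labels
    loss = criterion(logits, target)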
in the present embodiment, by training the network shown in fig. 4A on two 1080Ti GPUs, each GPU may have 11G of memory. And (3) optimizing by adopting a random gradient descent method, wherein the weight attenuation is 0.0001, the momentum is 0.9, and the initial learning rate is 0.01. The learning rate strategy is employed to perform a learning rate update, with the learning rate updated after each iteration to
Figure BDA0003697882110000194
Meanwhile, in order to increase the effective batch size, this scheme also performs batch normalization statistics over the samples on all GPUs.
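The optimizer settings above map directly onto PyTorch; the "poly" decay used here is an assumption about the unspecified learning rate strategy:

    import torch
    import torch.nn as nn

    model = nn.Conv2d(3, 6, kernel_size=1)   # stand-in for the network of fig. 4A
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=0.0001)
    max_iter = 40000                          # assumed total number of iterations
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda it: (1 - it / max_iter) ** 0.9)  # assumed poly decay
    # after each training iteration: optimizer.step(); scheduler.step()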
The image semantic segmentation method provided by the embodiment of the application can accurately and efficiently realize image semantic segmentation under an aerial image scene by using optical data and radar data as fusion input conditions, can fully mine complementarity and interdependency between image semantics, and effectively improves the accuracy of semantic segmentation.
It is understood that, in the embodiments of the present application, data related to an image to be segmented and the like need to be approved or approved by a user when the embodiments of the present application are applied to specific products or technologies, and the collection, use and processing of the related data need to comply with relevant laws and regulations and standards of relevant countries and regions.
Continuing with the exemplary structure of the semantic segmentation apparatus 255 for images provided by the embodiments of the present application implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the semantic segmentation apparatus 255 for images in the memory 240 may include: an obtaining module 2551, configured to obtain an image to be segmented including at least two objects, and a depth image corresponding to the image to be segmented; an encoding module 2552, configured to encode the depth image to obtain a depth coding result; a fusion coding module 2553, configured to invoke at least two segmentation coding networks and perform iterative fusion coding, including spatial screening and channel recombination, on the depth coding result and the image to be segmented to obtain a target coding result, where the spatial screening is used to perform feature screening on the image to be segmented in the spatial dimension, and the channel recombination is used to perform feature screening on the image to be segmented in the channel dimension; and a semantic segmentation module 2554, configured to perform semantic segmentation on the image to be segmented based on the target coding result to obtain a semantic segmentation result corresponding to each object.
In some embodiments, the encoding module 2552 is further configured to perform downsampling on the depth image, so as to obtain a downsampling result of the depth image; performing pooling on the downsampling processing result of the depth image to obtain a pooling processing result of the depth image; and calling at least two depth coding networks, and performing iterative coding processing on the pooling processing result of the depth image to obtain a depth coding result.
In some embodiments, the encoding module 2552 is further configured to invoke a 1 st depth coding network, and perform coding processing on the pooling processing result of the depth image to obtain a 1 st depth coding result; calling an (i + 1) th depth coding network, and coding an ith depth coding result to obtain an (i + 1) th depth coding result; determining the Nth depth coding result as a depth coding result; wherein i is more than or equal to 1 and less than or equal to N-1, N represents the number of the depth coding networks, and the size of the (i + 1) th depth coding network is smaller than that of the (i) th depth coding network.
In some embodiments, the depth coding network comprises at least two structurally identical coding layers, including a first coding layer and a second coding layer; the coding module 2552 is further configured to invoke a first coding layer, and perform coding processing on the ith depth coding result to obtain a first coding result; calling a second coding layer, and coding the first coding result to obtain a second coding result; and determining the second encoding result as the (i + 1) th depth encoding result.
In some embodiments, the fusion coding module 2553 is further configured to perform downsampling on the image to be segmented to obtain a downsampling result of the image to be segmented; performing pooling treatment on the down-sampling treatment result of the image to be segmented to obtain a pooling treatment result of the image to be segmented; and calling at least two segmentation coding networks, and performing iterative fusion coding including space screening and channel recombination on the pooling processing result and the depth coding result of the image to be segmented to obtain a target coding result.
In some embodiments, the depth coding results include i depth coding results, i is greater than or equal to 1 and less than or equal to N-1, N characterizing the number of depth coding networks that encode the depth image; the fusion coding module 2553 is further configured to invoke the 1 st segmentation coding network, and perform fusion coding including spatial screening and channel recombination on the pooling processing result and the 1 st depth coding result of the image to be segmented, so as to obtain the 1 st segmentation coding result; calling an i +1 th segmentation coding network, and performing fusion coding including spatial screening and channel recombination on an i-th segmentation coding result and an i-th depth coding result to obtain an i +1 th segmentation coding result; determining the Nth segmentation coding result as a target coding result; wherein the size of the (i + 1) th split encoding network is smaller than that of the (i) th split encoding network.
In some embodiments, the segmentation coding network comprises at least two residual layers and at least one attention residual layer; the fusion coding module 2553 is further configured to call the at least two residual layers and perform feature extraction on the ith segmentation coding result to obtain a feature extraction result of the ith segmentation coding result; and to call the at least one attention residual layer and perform fusion coding comprising spatial screening and channel recombination on the feature extraction result and the ith depth coding result to obtain the (i + 1)th segmentation coding result.
In some embodiments, when the number of attention residual layers is at least two, the fusion coding module 2553 is further configured to call the 1st attention residual layer and perform fusion coding including spatial screening and channel recombination on the feature extraction result and the ith depth coding result to obtain a 1st fusion coding result; to call the jth attention residual layer and perform fusion coding including spatial screening and channel recombination on the ith depth coding result and the (j-1)th fusion coding result to obtain a jth fusion coding result, where j is greater than or equal to 2 and less than or equal to M, and M represents the number of attention residual layers; and to determine the Mth fusion coding result as the (i + 1)th segmentation coding result.
In some embodiments, the attention residual layer includes a spatial attention layer, a channel attention layer, and a residual connecting layer; the fusion coding module 2553 is further configured to call the spatial attention layer of the 1st attention residual layer and perform spatial screening on the feature extraction result and the ith depth coding result to obtain a spatial screening result of the 1st attention residual layer; to call the channel attention layer of the 1st attention residual layer and perform channel recombination on the spatial screening result to obtain a channel recombination result of the 1st attention residual layer; and to call the residual connecting layer of the 1st attention residual layer and fuse the channel recombination result and the feature extraction result to obtain the 1st fusion coding result.
In some embodiments, the attention residual layer includes a spatial attention layer, a channel attention layer, and a residual connecting layer; when the number of attention residual layers is one, the fusion coding module 2553 is further configured to invoke the spatial attention layer and perform spatial screening on the feature extraction result and the ith depth coding result to obtain a spatial screening result; to call the channel attention layer and perform channel recombination on the spatial screening result to obtain a channel recombination result; and to call the residual connecting layer and fuse the channel recombination result and the feature extraction result to obtain the (i + 1)th segmentation coding result.
In some embodiments, the fusion coding module 2553 is further configured to perform convolution processing on the feature extraction result to obtain a first convolution processing result of the feature extraction result; performing dot product on the first convolution processing result and the ith depth coding result to obtain a first dot product result; and determining the first dot product result as a space screening result.
In some embodiments, the fusion coding module 2553 is further configured to perform convolution processing on the spatial screening result to obtain a second convolution processing result of the spatial screening result; performing convolution processing on the first convolution processing result to obtain a third convolution processing result; carrying out normalization processing on the third convolution processing result to obtain a normalization processing result; performing dot product on the normalization processing result and the second convolution processing result to obtain a second dot product result; and determining the second dot product result as a channel recombination result.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to make the computer device execute the semantic segmentation method of the image described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, will cause the processor to perform a semantic segmentation method for an image provided by embodiments of the present application, for example, the semantic segmentation method for an image as shown in fig. 3A.
In some embodiments, the computer-readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the present application has the following beneficial effects:
(1) A depth coding result is obtained by coding the depth image corresponding to the image to be segmented; the depth coding result and the image to be segmented are then fusion-coded, and semantic segmentation is performed on the image to be segmented based on the obtained target coding result to obtain the corresponding semantic segmentation result. Because the depth image carries the distance information between each pixel point in the image to be segmented and the camera, subjecting the depth coding result and the image to be segmented to iterative fusion coding comprising spatial screening and channel recombination performs feature selection on the image to be segmented in both the spatial dimension and the channel dimension, so that the semantic information of the image can be fully mined; performing semantic segmentation based on the resulting target coding result can then fully exploit the complementarity and interdependence between image semantics, effectively improving the accuracy of semantic segmentation.
(2) The depth image of the image to be segmented is encoded through the depth encoding network comprising at least two encoding layers with the same structure, the multi-modal characteristics of the depth image can be explicitly aggregated on a plurality of different scales, and the subsequent fusion encoding of the multi-modal characteristics based on the depth image and the image to be segmented is facilitated, so that the complementarity and the interdependency between the semantics of the image can be fully excavated, and the accuracy of the semantic segmentation is effectively improved.
(3) The segmentation coding result is subjected to feature extraction through at least two residual layers; the residual layers can improve accuracy by increasing the effective depth, effectively alleviating the vanishing-gradient problem caused by increasing depth in deep neural networks.
(4) Through the design of the attention residual layer, the depth image serves as a spatial weight to screen the features of the image to be segmented in the spatial dimension; the channel attention layer of the attention residual layer performs feature selection in the channel dimension through the adaptive weights of different channels; and the residual connecting layer then fuses the selected features with the features of the input image to be segmented. This explicitly increases the difference between different classes of objects and makes full use of the complementary information in the multi-modal features, effectively improving the accuracy of semantic segmentation.
(5) The progressive attention fusion network provided by the embodiment of the application comprises a coding network part and a decoding network part, wherein the coding network part adopts a progressive fusion encoder, the progressive fusion encoder comprises two branches, the features are simultaneously extracted from an image to be segmented and a depth image, and the depth image is fused into the branch corresponding to the image to be segmented at each scale so as to enhance the distinguishing capability of the coding network on objects with different sizes.
(6) The image semantic segmentation method provided by the embodiment of the application can accurately and efficiently realize image semantic segmentation under an aerial image scene by using optical data and radar data as fusion input conditions, can fully mine complementarity and interdependency between image semantics, and effectively improves the accuracy of semantic segmentation.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method of semantic segmentation of an image, the method comprising:
acquiring an image to be segmented comprising at least two objects and a depth image corresponding to the image to be segmented;
coding the depth image to obtain a depth coding result;
calling at least two segmentation coding networks, and performing iterative fusion coding including spatial screening and channel recombination on the depth coding result and the image to be segmented to obtain a target coding result, wherein the spatial screening is used for performing feature screening on the image to be segmented in a spatial dimension, and the channel recombination is used for performing feature screening on the image to be segmented in a channel dimension;
and performing semantic segmentation on the image to be segmented based on the target coding result to obtain a semantic segmentation result corresponding to each object.
2. The method of claim 1, wherein the encoding the depth image to obtain a depth coding result comprises:
carrying out downsampling processing on the depth image to obtain a downsampling processing result of the depth image;
pooling the downsampling processing result of the depth image to obtain a pooling processing result of the depth image;
and calling at least two depth coding networks, and performing iterative coding processing on the pooling processing result of the depth image to obtain the depth coding result.
3. The method of claim 2, wherein the invoking at least two depth coding networks to perform iterative coding on the pooled processing result of the depth image to obtain the depth coding result comprises:
calling a 1 st depth coding network, and coding the pooling processing result of the depth image to obtain a 1 st depth coding result;
calling an (i + 1) th depth coding network, and coding the (i) th depth coding result to obtain an (i + 1) th depth coding result;
determining an Nth depth coding result as the depth coding result;
and N represents the number of the depth coding networks, wherein i is more than or equal to 1 and less than or equal to N-1, and the size of the (i + 1) th depth coding network is smaller than that of the (i) th depth coding network.
4. The method of claim 3, wherein the depth coding network comprises at least two structurally identical coding layers, and wherein the at least two structurally identical coding layers comprise a first coding layer and a second coding layer; the calling the (i + 1) th depth coding network, and performing coding processing on the (i) th depth coding result to obtain an (i + 1) th depth coding result, includes:
calling the first coding layer, and coding the ith depth coding result to obtain a first coding result;
calling the second coding layer, and coding the first coding result to obtain a second coding result;
determining the second encoding result as the i +1 th depth encoding result.
5. The method according to claim 1, wherein the invoking of the at least two segmentation coding networks, performing iterative fusion coding including spatial screening and channel reorganization on the depth coding result and the image to be segmented to obtain a target coding result, comprises:
carrying out downsampling processing on the image to be segmented to obtain a downsampling processing result of the image to be segmented;
pooling the down-sampling processing result of the image to be segmented to obtain a pooling processing result of the image to be segmented;
and calling the at least two segmentation coding networks, and performing iterative fusion coding comprising the spatial screening and the channel recombination on the pooling processing result of the image to be segmented and the depth coding result to obtain the target coding result.
6. The method of claim 5, wherein the depth coding results comprise i depth coding results, i is greater than or equal to 1 and less than or equal to N-1, and N represents the number of depth coding networks that encode the depth image;
the invoking the at least two segmentation coding networks, and performing iterative fusion coding including the spatial screening and the channel recombination on the pooling processing result of the image to be segmented and the depth coding result to obtain the target coding result includes:
calling a 1 st segmentation coding network, and performing fusion coding comprising the spatial screening and the channel recombination on the pooling processing result of the image to be segmented and the 1 st depth coding result to obtain a 1 st segmentation coding result;
calling an i +1 th segmentation coding network, and performing fusion coding comprising the spatial screening and the channel recombination on an i-th segmentation coding result and the i-th depth coding result to obtain an i +1 th segmentation coding result;
determining the Nth segmentation coding result as the target coding result;
wherein the size of the (i + 1) th segmentation coding network is smaller than that of the ith segmentation coding network.
7. The method of claim 6, wherein the segmentation coding network comprises at least two residual layers and at least one attention residual layer;
the invoking the (i + 1) th segmentation coding network, performing fusion coding including the spatial screening and the channel recombination on the ith segmentation coding result and the ith depth coding result, and obtaining an (i + 1) th segmentation coding result includes:
calling the at least two residual error layers, and performing feature extraction on the ith segmentation coding result to obtain a feature extraction result of the ith segmentation coding result;
and calling the at least one attention residual error layer, and carrying out fusion coding comprising the spatial screening and the channel recombination on the feature extraction result and the ith depth coding result to obtain the (i + 1) th segmentation coding result.
8. The method according to claim 7, wherein when the number of the attention residual layers is at least two, said invoking the at least one attention residual layer, and performing fusion coding including the spatial filtering and the channel reorganization on the feature extraction result and the ith depth coding result to obtain the i +1 th segmentation coding result comprises:
calling a 1 st attention residual layer, and performing fusion coding comprising the spatial screening and the channel recombination on the feature extraction result and the ith depth coding result to obtain a 1 st fusion coding result;
calling a jth attention residual layer, and performing fusion coding comprising the spatial screening and the channel recombination on the ith depth coding result and the (j-1) th fusion coding result to obtain a jth fusion coding result, wherein j is greater than or equal to 2 and less than or equal to M, and M represents the number of the attention residual layers;
and determining the Mth fusion coding result as the (i + 1) th segmentation coding result.
9. The method of claim 8, wherein the attention residual layer comprises a spatial attention layer, a channel attention layer, and a residual connection layer; and the calling of the 1 st attention residual layer, performing fusion coding including the spatial screening and the channel recombination on the feature extraction result and the ith depth coding result to obtain a 1 st fusion coding result, comprises:
calling the spatial attention layer of the 1 st attention residual layer, and performing spatial screening on the feature extraction result and the ith depth coding result to obtain a spatial screening result of the 1 st attention residual layer;
calling the channel attention layer of the 1 st attention residual layer, and performing channel recombination on the spatial screening result to obtain a channel recombination result of the 1 st attention residual layer;
and calling the residual connecting layer of the 1 st attention residual layer, and fusing the channel recombination result and the feature extraction result to obtain a 1 st fusion coding result.
10. The method of claim 7, wherein the attention residual layer comprises a spatial attention layer, a channel attention layer, and a residual connection layer;
when the number of the attention residual layers is one, the calling the at least one attention residual layer, and performing fusion coding including the spatial screening and the channel recombination on the feature extraction result and the ith depth coding result to obtain the (i + 1) th segmentation coding result, including:
calling the spatial attention layer, and carrying out spatial screening on the feature extraction result and the ith depth coding result to obtain a spatial screening result;
calling the channel attention layer, and performing channel recombination on the space screening result to obtain a channel recombination result;
and calling the residual connecting layer, and fusing the channel recombination result and the feature extraction result to obtain the (i + 1) th segmentation coding result.
11. The method of claim 10, wherein the invoking the spatial attention layer and performing spatial filtering on the feature extraction result and the ith depth coding result to obtain a spatial filtering result comprises:
performing convolution processing on the feature extraction result to obtain a first convolution processing result of the feature extraction result;
performing dot product on the first convolution processing result and the ith depth coding result to obtain a first dot product result;
and determining the first dot product result as the spatial screening result.
12. An apparatus for semantic segmentation of an image, the apparatus comprising:
the system comprises an acquisition module, a segmentation module and a processing module, wherein the acquisition module is used for acquiring an image to be segmented comprising at least two objects and a depth image corresponding to the image to be segmented;
the coding module is used for coding the depth image to obtain a depth coding result;
the fusion coding module is used for calling at least two segmentation coding networks and carrying out iterative fusion coding comprising space screening and channel recombination on the depth coding result and the image to be segmented to obtain a target coding result, wherein the space screening is used for carrying out feature screening on the image to be segmented in a space dimension, and the channel recombination is used for carrying out feature screening on the image to be segmented in a channel dimension;
and the semantic segmentation module is used for performing semantic segmentation on the image to be segmented based on the target coding result to obtain a semantic segmentation result corresponding to each object.
13. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the method of semantic segmentation of an image according to any one of claims 1 to 11 when executing executable instructions or a computer program stored in the memory.
14. A computer-readable storage medium storing executable instructions or a computer program, wherein the executable instructions, when executed by a processor, implement the method of semantic segmentation of images according to any one of claims 1 to 11.
15. A computer program product comprising a computer program or instructions, characterized in that the computer program or instructions, when executed by a processor, implement the method of semantic segmentation of an image according to any one of claims 1 to 11.
CN202210685972.3A 2022-06-16 2022-06-16 Image semantic segmentation method, device, equipment, storage medium and program product Pending CN115115835A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210685972.3A CN115115835A (en) 2022-06-16 2022-06-16 Image semantic segmentation method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210685972.3A CN115115835A (en) 2022-06-16 2022-06-16 Image semantic segmentation method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN115115835A true CN115115835A (en) 2022-09-27

Family

ID=83328389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210685972.3A Pending CN115115835A (en) 2022-06-16 2022-06-16 Image semantic segmentation method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN115115835A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471765A (en) * 2022-11-02 2022-12-13 广东工业大学 Semantic segmentation method, device and equipment for aerial image and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination