CN111292340A - Semantic segmentation method, device, equipment and computer readable storage medium
- Publication number
- CN111292340A (application number CN202010076535.2A)
- Authority
- CN
- China
- Legal status: Granted
Classifications
- G06T7/11 — Image analysis; Segmentation; Region-based segmentation
- G06F18/241 — Pattern recognition; Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/25 — Pattern recognition; Fusion techniques
- G06V10/40 — Image or video recognition or understanding; Extraction of image or video features
- G06T2207/10004 — Indexing scheme for image analysis or image enhancement; Image acquisition modality; Still image; Photographic image
Abstract
A semantic segmentation method, a semantic segmentation device, semantic segmentation equipment and a computer-readable storage medium are disclosed, wherein the method comprises the following steps: acquiring to-be-processed feature information of an image to be processed, wherein the image to be processed comprises a first mode image and a second mode image corresponding to the first mode image, and the to-be-processed feature information comprises first feature information of the first mode image and second feature information of the second mode image; connecting the first characteristic information and the second characteristic information to obtain first connection characteristic information; determining compensation information of the feature information to be processed according to the first connection feature information; correcting the characteristic information to be processed according to the compensation information to obtain corrected characteristic information; and determining the semantic classification result of each pixel in the image to be processed according to the correction characteristic information.
Description
Technical Field
The present disclosure relates to computer vision technologies, and in particular, to a semantic segmentation method, apparatus, device, and computer-readable storage medium.
Background
Semantic segmentation is an important problem in computer vision, and has applications in many fields such as automatic driving, geological monitoring, facial segmentation, precision agriculture, and the like.
With the development of depth sensors, depth information is also introduced into semantic segmentation. However, due to the uncertainty of depth information in real scenes and the large amount of noise in depth information, the semantic segmentation effect based on depth information is not ideal.
Disclosure of Invention
The embodiment of the disclosure provides a semantic segmentation scheme.
According to an aspect of the present disclosure, there is provided a semantic segmentation method, the method including: acquiring to-be-processed feature information of an image to be processed, wherein the image to be processed comprises a first mode image and a second mode image corresponding to the first mode image, and the to-be-processed feature information comprises first feature information of the first mode image and second feature information of the second mode image; connecting the first characteristic information and the second characteristic information to obtain first connection characteristic information; determining compensation information of the feature information to be processed according to the first connection feature information; correcting the characteristic information to be processed according to the compensation information to obtain corrected characteristic information; and determining the semantic classification result of each pixel in the image to be processed according to the correction characteristic information.
In combination with any one of the embodiments provided in the present disclosure, the compensation information includes compensation information of the first characteristic information and/or compensation information of the second characteristic information; the correcting the feature information to be processed according to the compensation information to obtain corrected feature information includes: under the condition that the compensation information comprises compensation information of first characteristic information, correcting the first characteristic information according to the compensation information of the first characteristic information to obtain corrected characteristic information corresponding to the first characteristic information; and/or under the condition that the compensation information comprises compensation information of second characteristic information, correcting the second characteristic information according to the compensation information of the second characteristic information to obtain corrected characteristic information corresponding to the second characteristic information.
In combination with any one of the embodiments provided by the present disclosure, the determining a semantic classification result of each pixel in the image to be processed according to the corrected feature information includes: fusing according to the correction feature information to obtain fused feature information; and acquiring a semantic classification result of each pixel in the first mode image and/or the second mode image according to the fusion characteristic information.
In combination with any one of the embodiments provided by the present disclosure, the determining compensation information of the feature information to be processed according to the first connection feature information includes: performing global pooling operation on the first connection characteristic information to obtain global information for each characteristic information channel; determining a weight coefficient of each channel of the feature information to be processed according to the global information; and multiplying the weight coefficient and the corresponding characteristic information to be processed according to channels to obtain the compensation information.
In combination with any one of the embodiments provided by the present disclosure, the correcting the to-be-processed feature information according to the compensation information to obtain corrected feature information includes: and multiplying the compensation information by a set scaling matrix, and adding the multiplication result and at least one of the first characteristic information and the second characteristic information corresponding to the compensation information to obtain the corrected characteristic information.
In combination with any one of the embodiments provided in the present disclosure, the corrected feature information includes corrected feature information corresponding to the first feature information, and/or corrected feature information corresponding to the second feature information; the fusing according to the correction feature information to obtain fused feature information includes: under the condition that the correction characteristic information comprises correction characteristic information corresponding to the first characteristic information, connecting the correction characteristic information corresponding to the first characteristic information with the second characteristic information to obtain second connection characteristic information; or, when the corrected feature information includes corrected feature information corresponding to the second feature information, connecting the first feature information with the corrected feature information corresponding to the second feature information to obtain second connection feature information; or, when the corrected feature information includes corrected feature information corresponding to the first feature information and corrected feature information corresponding to the second feature information, connecting the corrected feature information corresponding to the first feature information and the corrected feature information corresponding to the second feature information to obtain second connection feature information; reducing the dimension of the second connection characteristic information into single-channel characteristic information; acquiring a first weight coefficient corresponding to each position in the first characteristic information and a second weight coefficient corresponding to each position in the second characteristic information according to the single-channel characteristic information, wherein the sum of the first weight coefficient and the second weight coefficient corresponding to the same position is 1; and obtaining fusion characteristic information according to the first characteristic information and the first weight coefficient, and the second characteristic information and the second weight coefficient.
In combination with any one of the embodiments provided herein, the first mode image includes a color mode image and the second mode image includes a depth mode image.
In combination with any one of the embodiments provided by the present disclosure, the acquiring feature information to be processed of an image to be processed includes: coding the depth mode image to obtain a coded image with the same channel number as the color mode image; and performing feature extraction on the coded image to obtain the second feature information.
In combination with any one of the embodiments provided by the present disclosure, the method is applied to a semantic segmentation network, where the semantic segmentation network includes a first feature extraction network for extracting first feature information of the first mode image, a second feature extraction network for extracting second feature information of the second mode image, and a separation aggregation network for obtaining the fused feature information according to the first feature information and the second feature information, and the method further includes: and performing end-to-end training on a semantic segmentation network formed by the first feature extraction network, the second feature extraction network and the separation aggregation network.
In combination with any one of the embodiments provided by the present disclosure, the first feature extraction network comprises M first feature extraction sub-networks, the second feature extraction network comprises M second feature extraction sub-networks, and the separation aggregation network comprises M separation aggregation sub-networks; the method further comprises: obtaining input information of an (i+1)-th first feature extraction sub-network according to an output result of the i-th first feature extraction sub-network and an output result of the i-th separation aggregation sub-network, and obtaining input information of an (i+1)-th second feature extraction sub-network according to an output result of the i-th second feature extraction sub-network and an output result of the i-th separation aggregation sub-network, wherein i < M, and i and M are positive integers; and obtaining the fused feature information according to the output results of the M separation aggregation sub-networks.
In combination with any one of the embodiments provided by the present disclosure, the obtaining the fused feature information according to the output results of the M separation aggregation sub-networks includes: obtaining the fused feature information according to the output results of the 1st separation aggregation sub-network and the M-th separation aggregation sub-network.
According to an aspect of the present disclosure, there is provided a semantic segmentation apparatus, the apparatus including: the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring to-be-processed characteristic information of an image to be processed, the image to be processed comprises a first mode image and a second mode image corresponding to the first mode image, and the to-be-processed characteristic information comprises first characteristic information of the first mode image and second characteristic information of the second mode image; the connection unit is used for connecting the first characteristic information and the second characteristic information to obtain first connection characteristic information; the determining unit is used for determining compensation information of the feature information to be processed according to the first connection feature information; the correction unit is used for correcting the characteristic information to be processed according to the compensation information to obtain corrected characteristic information; and the classification unit is used for determining the semantic classification result of each pixel in the image to be processed according to the correction characteristic information.
In combination with any one of the embodiments provided in the present disclosure, the compensation information includes compensation information of the first characteristic information and/or compensation information of the second characteristic information; the correction unit is specifically configured to: under the condition that the compensation information comprises compensation information of first characteristic information, correcting the first characteristic information according to the compensation information of the first characteristic information to obtain corrected characteristic information corresponding to the first characteristic information; and/or under the condition that the compensation information comprises compensation information of second characteristic information, correcting the second characteristic information according to the compensation information of the second characteristic information to obtain corrected characteristic information corresponding to the second characteristic information.
In combination with any one of the embodiments provided by the present disclosure, the classification unit is specifically configured to: fusing according to the correction feature information to obtain fused feature information; and acquiring a semantic classification result of each pixel in the first mode image and/or the second mode image according to the fusion characteristic information.
In combination with any one of the embodiments provided by the present disclosure, the determining unit is specifically configured to: performing global pooling operation on the first connection characteristic information to obtain global information for each characteristic information channel; determining a weight coefficient of each channel of the feature information to be processed according to the global information; and multiplying the weight coefficient and the corresponding characteristic information to be processed according to channels to obtain the compensation information.
In combination with any one of the embodiments provided by the present disclosure, the correction unit is specifically configured to: and multiplying the compensation information by a set scaling matrix, and adding the multiplication result and at least one of the first characteristic information and the second characteristic information corresponding to the compensation information to obtain the corrected characteristic information.
In combination with any one of the embodiments provided in the present disclosure, the corrected feature information includes corrected feature information corresponding to the first feature information, and/or corrected feature information corresponding to the second feature information; the classification unit is configured to, when performing fusion according to the corrected feature information to obtain fused feature information, specifically: under the condition that the correction characteristic information comprises correction characteristic information corresponding to the first characteristic information, connecting the correction characteristic information corresponding to the first characteristic information with the second characteristic information to obtain second connection characteristic information; or, when the corrected feature information includes corrected feature information corresponding to the second feature information, connecting the first feature information with the corrected feature information corresponding to the second feature information to obtain second connection feature information; or, when the corrected feature information includes corrected feature information corresponding to the first feature information and corrected feature information corresponding to the second feature information, connecting the corrected feature information corresponding to the first feature information and the corrected feature information corresponding to the second feature information to obtain second connection feature information; reducing the dimension of the second connection characteristic information into single-channel characteristic information; acquiring a first weight coefficient corresponding to each position in the first characteristic information and a second weight coefficient corresponding to each position in the second characteristic information according to the single-channel characteristic information, wherein the sum of the first weight coefficient and the second weight coefficient corresponding to the same position is 1; and obtaining fusion characteristic information according to the first characteristic information and the first weight coefficient, and the second characteristic information and the second weight coefficient.
In combination with any one of the embodiments provided herein, the first mode image includes a color mode image and the second mode image includes a depth mode image.
In combination with any one of the embodiments provided by the present disclosure, the obtaining unit is specifically configured to: coding the depth mode image to obtain a coded image with the same channel number as the color mode image; and performing feature extraction on the coded image to obtain the second feature information.
In combination with any embodiment provided by the present disclosure, the apparatus is applied to a semantic segmentation network, where the semantic segmentation network includes a first feature extraction network for extracting first feature information of the first mode image, a second feature extraction network for extracting second feature information of the second mode image, and a separation aggregation network for obtaining the fused feature information according to the first feature information and the second feature information, and the apparatus further includes a first training unit for performing end-to-end training on the semantic segmentation network formed by the first feature extraction network, the second feature extraction network, and the separation aggregation network.
In combination with any one of the embodiments provided by the present disclosure, the first feature extraction network comprises M first feature extraction sub-networks, the second feature extraction network comprises M second feature extraction sub-networks, and the separation aggregation network comprises M separation aggregation sub-networks; the device further comprises a second training unit, configured to obtain input information of an (n+1)-th first feature extraction sub-network according to an output result of the n-th first feature extraction sub-network and an output result of the n-th separation aggregation sub-network, and obtain input information of an (n+1)-th second feature extraction sub-network according to an output result of the n-th second feature extraction sub-network and an output result of the n-th separation aggregation sub-network, wherein n < M, and both n and M are positive integers; and obtain the fused feature information according to the output results of the M separation aggregation sub-networks.
In combination with any one of the embodiments provided by the present disclosure, the second training unit is specifically configured to: and acquiring the fusion characteristic information according to the output results of the 1 st separation aggregation sub-network and the Mth separation aggregation sub-network.
According to an aspect of the present disclosure, an electronic device is provided, which includes a memory for storing computer instructions executable on a processor, and the processor is configured to implement the semantic segmentation method according to any one of the embodiments of the present disclosure when executing the computer instructions.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the semantic segmentation method according to any one of the embodiments of the present disclosure.
In the embodiments of the present disclosure, the first feature information of the image to be processed is connected with the corresponding second feature information, compensation information of the to-be-processed feature information is determined from the connection result and used to correct the to-be-processed feature information to obtain corrected feature information, and the semantic classification result of each pixel in the image to be processed is determined according to the corrected feature information, thereby realizing semantic segmentation of the image to be processed. By correcting the to-be-processed feature information with the compensation information, noise in the to-be-processed feature information is filtered out and the semantic segmentation effect is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
FIG. 1 is a flow diagram illustrating a method of semantic segmentation in accordance with at least one embodiment of the present disclosure;
Fig. 2A is a flowchart illustrating a method for determining compensation information of the to-be-processed feature information according to at least one embodiment of the present disclosure;
fig. 2B is a flow chart illustrating a method of fusing according to the correction feature information in accordance with at least one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a semantic segmentation network shown in at least one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of sub-networks included in a semantic segmentation network, as shown in at least one embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a semantic segmentation apparatus in accordance with at least one embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network pcs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
The computer system/server may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Fig. 1 is a flow chart illustrating a semantic segmentation method according to at least one embodiment of the present disclosure. As shown in fig. 1, the method includes steps 101 to 105.
In step 101, feature information to be processed of an image to be processed is acquired.
The image to be processed comprises a first mode image and a second mode image corresponding to the first mode image, and the feature information to be processed comprises first feature information of the first mode image and second feature information of the second mode image.
In the embodiment of the present disclosure, the first mode image and the corresponding second mode image may be acquired by the same image acquisition device, or acquired by image acquisition devices of the corresponding modes respectively. The correspondence means that each pixel point in the first mode image can find a corresponding pixel in the second mode image, and the positions of the two corresponding pixels in space are the same, that is, the two corresponding pixels represent the same point in space. The first mode image and the second mode image may be registered to achieve correspondence between them, i.e., a one-to-one correspondence between each pixel in the first mode image and each pixel in the second mode image. It will be appreciated by those skilled in the art that the first mode image and the corresponding second mode image may also be acquired by one or more other devices or by other means, and are not limited to those described above.
The feature information of the image to be processed, that is, the feature information of the first mode image and the feature information of the second mode image, may be obtained by using a pre-trained neural network, for example, a convolutional neural network. In order to distinguish the feature information of the first mode image from the feature information of the second mode image, the feature information of the first mode image is referred to as first feature information and the feature information of the second mode image is referred to as second feature information.
In one example, the first mode image may include a color mode image (color image), and the second mode image may include a depth mode image (depth image). It will be understood by those skilled in the art that the first mode image and the second mode image may also include other modes of images, and the present disclosure is not limited thereto.
In the case where the first mode image includes a color mode image (color image) and the second mode image includes a depth mode image, the depth mode image may first be encoded to convert the single-channel depth image into an image with the same number of channels as the color mode image.
For example, HHA encoding (horizontal disparity, height above ground, and the angle between the pixel's local surface normal and the inferred gravity direction) may be performed on the depth mode image to convert the single-channel depth image into a three-channel image, so that feature extraction can be performed directly through a convolutional neural network. The three channels obtained by HHA encoding are a horizontal disparity channel, a height-above-ground channel, and a surface-normal angle channel, respectively.
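As an illustration of the encoding idea only, the Python sketch below builds a three-channel image from a depth map. It is a rough stand-in: a faithful HHA encoding derives the height-above-ground and surface-normal-angle channels from the depth map and the camera intrinsics, which is omitted here, and the function name and per-channel normalization are assumptions rather than part of this disclosure.

```python
import numpy as np

def encode_depth_three_channels(depth: np.ndarray,
                                height_above_ground: np.ndarray,
                                normal_angle: np.ndarray) -> np.ndarray:
    """Rough stand-in for HHA encoding: stack a disparity channel, a
    height-above-ground channel and a surface-normal-angle channel.
    The last two channels are assumed to be precomputed."""
    disparity = 1.0 / np.clip(depth, 1e-3, None)         # horizontal disparity channel

    def to_uint8(x: np.ndarray) -> np.ndarray:           # scale each channel to [0, 255]
        return (255.0 * (x - x.min()) / max(float(x.max() - x.min()), 1e-6)).astype(np.uint8)

    return np.stack([to_uint8(disparity),
                     to_uint8(height_above_ground),
                     to_uint8(normal_angle)], axis=-1)   # (H, W, 3), same channel count as RGB
```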
In step 102, the first feature information and the second feature information are connected to obtain first connection feature information.
The first feature information and the second feature information are connected (concatenated) to obtain the first connection feature information, thereby preliminarily fusing the first mode and the second mode together.
In step 103, according to the first connection characteristic information, determining compensation information of the to-be-processed characteristic information.
The first feature information and the second feature information are typically feature information of multiple channels, that is, feature information obtained from multiple angles, and noise features of the multiple channels are typically different. For the first characteristic information of each channel of the first mode, the compensation information of the first characteristic information may be determined according to the noise characteristic, and for the second characteristic information of each channel of the second mode, the compensation information of the second characteristic information may also be determined according to the noise characteristic.
Taking the RGB modality and the depth modality as an example, if a channel in the RGB modality carries little geometric-feature information, the feature information of a channel in the depth modality that carries rich geometric-feature information may be used as the compensation information for that channel of the RGB modality. The compensation information of the first feature information of the color image is thus second feature information of the depth mode, and the compensation information of the second feature information of the depth image is first feature information of the color mode.
In step 104, the feature information to be processed is corrected according to the compensation information, so as to obtain corrected feature information.
For the first mode, the compensation information of the second mode can be used to correct the first feature information, that is, features carrying more information in the second mode are used for correction so as to filter out noise features in the first mode; for the second mode, features carrying more information in the first mode are used for correction so as to filter out noise features in the second mode.
In step 105, according to the correction feature information, determining a semantic classification result of each pixel in the image to be processed.
After the correction, the noise in the first characteristic information in the first mode and the noise in the second characteristic information in the second mode are respectively suppressed, and the semantic segmentation is performed on the image to be processed according to the corrected characteristic information, so that the accuracy of semantic classification of each pixel is improved.
In the embodiment of the disclosure, the first feature information of the image to be processed is connected with the corresponding second feature information, compensation information of the to-be-processed feature information is determined from the connection result, the to-be-processed feature information is corrected according to the compensation information to obtain corrected feature information, and the semantic classification result of each pixel in the image to be processed is determined according to the corrected feature information, thereby realizing semantic segmentation of the image to be processed. Because the to-be-processed feature information is corrected using the compensation information, noise in the to-be-processed feature information is filtered out and the semantic segmentation effect is improved.
In an embodiment of the present disclosure, the compensation information includes compensation information of the first characteristic information and/or compensation information of the second characteristic information.
And under the condition that the compensation information comprises compensation information of first characteristic information, correcting the first characteristic information according to the compensation information of the first characteristic information to obtain corrected characteristic information corresponding to the first characteristic information.
And under the condition that the compensation information comprises compensation information of second characteristic information, correcting the second characteristic information according to the compensation information of the second characteristic information to obtain corrected characteristic information corresponding to the second characteristic information.
Under the condition that the compensation information comprises compensation information of first characteristic information and compensation information of second characteristic information, correcting the first characteristic information according to the compensation information of the first characteristic information to obtain corrected characteristic information corresponding to the first characteristic information; and correcting the second characteristic information according to the compensation information of the second characteristic information to obtain corrected characteristic information corresponding to the second characteristic information.
In some embodiments, the semantic classification result of each pixel in the image to be processed may be determined according to the correction feature information in the following manner.
Firstly, fusion is carried out according to the correction characteristic information to obtain fusion characteristic information.
Steps 102-104 are processes of separating the first mode and the second mode based on the first connection characteristic information, and correcting at least one mode. After the correction feature information is obtained, the two modes can be fused by fusing the correction feature information on the channel dimension.
After the correction, the noise in the corrected feature information is suppressed, and thus the noise in the fused feature information obtained by channel fusion is also suppressed compared to the first connection feature information.
And then, according to the fusion characteristic information, obtaining a semantic classification result of each pixel in the first mode image and/or the second mode image.
And according to the fusion characteristic information, performing semantic classification on each pixel in the first mode image and/or the second mode image to obtain a semantic classification result, thereby realizing semantic segmentation on the color image or the depth image. Since each pixel in the first mode image and the second mode image corresponds, the semantic segmentation result obtained for the first mode image and the second mode image is the same.
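For illustration only, a per-pixel classification head of the following form could turn the fused feature information into a semantic classification result for each pixel; the patent does not specify this head, so the 1 × 1 convolution classifier, the bilinear upsampling, and all names below (PyTorch is used here) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    """Hypothetical head: project fused features to class scores and take a
    per-pixel arg-max as the semantic classification result."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, fused: torch.Tensor, out_size) -> torch.Tensor:
        logits = self.classifier(fused)                               # (N, num_classes, h, w)
        logits = F.interpolate(logits, size=out_size,
                               mode="bilinear", align_corners=False)  # back to image resolution
        return logits.argmax(dim=1)                                   # (N, H, W) class label per pixel
```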
The following describes the semantic segmentation method according to the embodiment of the present disclosure in detail by taking the first mode image as a color mode image and the second mode image as a depth mode image as an example. It will be understood by those skilled in the art that the method is applicable to other modes of images and is not limited to the following.
The first feature information of the color mode image (color image) can be expressed as RGB_in, which may be a C × H × W feature map obtained by a convolutional neural network; the second feature information of the depth mode image (depth image) can be expressed as HHA_in, which may also be a C × H × W feature map obtained by a convolutional neural network. Here C denotes the number of channels of the feature map, H denotes the height of the feature map, and W denotes the width of the feature map.
Connecting (concatenating) the first feature information and the second feature information yields the first connection feature information F_con1 = RGB_in || HHA_in, thereby preliminarily fusing the color RGB mode and the depth mode together.
In some embodiments, the compensation information of the feature information to be processed may be determined according to the first connection feature information in the following manner. As shown in FIG. 2A, the method includes steps 21-23.
In step 21, a global pooling operation is performed on the first connection feature information to obtain global information for each feature information channel.
Performing global pooling operation on the first connection feature information to obtain global information I for each feature information channel, see formula (1):
I = F_gp(RGB_in || HHA_in)    (1)
where F_gp denotes the global pooling operation, RGB_in denotes the first feature information of the RGB modality, HHA_in denotes the second feature information of the depth modality, || denotes connecting the first feature information and the second feature information, and RGB_in || HHA_in is the obtained first connection feature information F_con1; I = (I_1, …, I_k, …, I_2C), where C is the number of channels of the first feature information (and of the second feature information), k and C are positive integers, and k ≤ 2C.
The global information is established on the channels of the feature information of the RGB mode and the depth mode. Each element of the global information I corresponds to one channel; that is, I is a cross-modal global descriptor that aggregates the expression statistics of the whole input (the RGB mode input and the depth mode input).
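A minimal sketch of equation (1) is given below in PyTorch; the choice of framework and the use of global average pooling for F_gp are assumptions, since the text only requires some global pooling operation.

```python
import torch
import torch.nn.functional as F

def cross_modal_descriptor(RGB_in: torch.Tensor, HHA_in: torch.Tensor) -> torch.Tensor:
    """Sketch of equation (1): concatenate the two modalities and pool globally.
    RGB_in and HHA_in are feature maps of shape (N, C, H, W)."""
    F_con1 = torch.cat([RGB_in, HHA_in], dim=1)       # first connection feature information, (N, 2C, H, W)
    I = F.adaptive_avg_pool2d(F_con1, 1).flatten(1)   # global pooling F_gp, descriptor of shape (N, 2C)
    return I
```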
In step 22, the weighting coefficients of the channels of the feature information to be processed are determined according to the global information.
The process of obtaining the weight coefficient of each characteristic information channel is a process of obtaining a cross modal attention vector for depth input, and is also a process of performing compression coding on global information.
For the first characteristic information of the RGB mode, the compensation information is second characteristic information of the depth mode; for the second feature information of the depth modality, the compensation information is the first feature information of the RGB modality.
For the RGB mode, the weight vector W_hha of the depth mode is obtained from the global information I, which can be expressed by equation (2):
W_hha = σ(F_mlp(I))    (2)
For the depth mode, the weight vector W_rgb of the RGB mode is obtained from the global information I, which can be expressed by equation (3):
W_rgb = σ(F_mlp(I))    (3)
In formulas (2) and (3), W_hha is a weight vector containing the weight coefficients corresponding to the C feature information channels of the second feature information; W_rgb is a weight vector containing the weight coefficients corresponding to the C feature information channels of the first feature information; F_mlp denotes a multilayer perceptron, which processes the global information I to obtain the weight value corresponding to each channel; and σ denotes scaling the weight values into the range (0, 1), for example by normalization with a sigmoid function.
In step 23, multiplying the weight coefficient by the corresponding feature information to be processed according to a channel to obtain the compensation information. The channel-by-channel multiplication is to multiply the characteristic information of one channel with the weight coefficient corresponding to the channel.
For the RGB modality, the weighting coefficients of the feature information channels of the second feature information may be multiplied by the second feature information according to the channels, and the obtained filtered feature is also the compensation information of the first feature information of the RGB modality, which is shown in formula (4):
HHA_filtered = W_hha ⊗ HHA_in    (4)
where ⊗ denotes channel-wise multiplication between the second feature information and the weight vector of the depth modality.
By multiplying the second feature information channel by channel with the weight vector of the depth mode, the channels of the second feature information that carry richer visual appearance and geometric features are used to effectively suppress the noise in the depth information stream, so that less noisy and more accurate depth feature information HHA_filtered is obtained. The depth feature information HHA_filtered is then used as the compensation information of the RGB mode to correct the first feature information, so that the feature information channels of the RGB mode that carry little information are compensated and the noise in the RGB mode is suppressed, thereby reducing, as much as possible, the transmission of erroneous information to deeper layers of the network.
Similarly, for the depth mode, the weight coefficient of each feature information channel of the first feature information may be multiplied by the first feature information according to the channel, and the obtained filtered feature is also the compensation information of the second feature information, which is shown in formula (5):
RGB_filtered = W_rgb ⊗ RGB_in    (5)
where ⊗ denotes channel-wise multiplication between the first feature information and the weight vector of the RGB modality.
By multiplying the first feature information channel by channel with the weight vector of the RGB mode, the channels of the first feature information that carry richer visual appearance and geometric features are used to effectively suppress the noise in the RGB information stream, so that less noisy and more accurate RGB feature information RGB_filtered is obtained. The RGB feature information RGB_filtered is then used as the compensation information of the depth mode to correct the second feature information, so that the feature information channels of the depth mode that carry little information are compensated and the noise in the depth mode is suppressed, thereby reducing, as much as possible, the transmission of erroneous information to deeper layers of the network.
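The channel weighting and filtering of equations (2) to (5) can be sketched as follows. The two-layer structure of F_mlp, the reduction ratio, and the use of separately parameterized perceptrons for the two modalities are assumptions; the text above only specifies a multilayer perceptron followed by scaling into (0, 1).

```python
import torch
import torch.nn as nn

class ChannelCompensation(nn.Module):
    """Sketch of equations (2)-(5): the global descriptor I is turned into
    per-channel weights, which gate each modality channel by channel."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.mlp_hha = nn.Sequential(nn.Linear(2 * channels, hidden), nn.ReLU(inplace=True),
                                     nn.Linear(hidden, channels))
        self.mlp_rgb = nn.Sequential(nn.Linear(2 * channels, hidden), nn.ReLU(inplace=True),
                                     nn.Linear(hidden, channels))

    def forward(self, I, RGB_in, HHA_in):
        W_hha = torch.sigmoid(self.mlp_hha(I))            # eq. (2): weights for the depth channels
        W_rgb = torch.sigmoid(self.mlp_rgb(I))            # eq. (3): weights for the RGB channels
        HHA_filtered = HHA_in * W_hha[:, :, None, None]   # eq. (4): compensation info for the RGB modality
        RGB_filtered = RGB_in * W_rgb[:, :, None, None]   # eq. (5): compensation info for the depth modality
        return RGB_filtered, HHA_filtered
```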
After the compensation information is obtained, the compensation information may be multiplied by a set scaling matrix, and the multiplication result may be added to the first feature information or the second feature information corresponding to the compensation information, so as to obtain the corrected feature information.
The compensation information HHA_filtered of the first feature information and the compensation information RGB_filtered of the second feature information are then used to correct the first feature information and the second feature information, respectively.
In one example, the first feature information may be corrected by multiplying compensation information of the first feature information by a first set scaling matrix and adding the multiplication result to the first feature information, see equation (6):
RGB_rec = S1 · HHA_filtered + RGB_in    (6)
where S1 is the first set scaling matrix, used to adjust the magnitude of the compensation information HHA_filtered so as to avoid over-adjusting the features of the RGB modality, and RGB_rec denotes the corrected first feature information. In the embodiments of the present disclosure, the size of the first set scaling matrix is not limited and may be adjusted according to actual requirements.
Instead of directly using the depth features as correction coefficients to reset the weights of the RGB modality features, the embodiments of the present disclosure adopt an offset approach, which enhances the RGB feature response at the corresponding positions while preserving the original features of the RGB modality.
In the disclosed embodiment, the above correction steps can be performed in a symmetric as well as bi-directional manner. That is, the second feature information may be corrected by multiplying the compensation information of the second feature information by a second set scaling matrix, and adding the multiplication result to the second feature information, see equation (7):
HHA_rec = S2 · RGB_filtered + HHA_in    (7)
where S2 is the second set scaling matrix, used to adjust the magnitude of the compensation information RGB_filtered so as to avoid over-adjusting the features of the depth modality, and HHA_rec denotes the corrected second feature information.
Likewise, the offset approach enhances the depth feature response at the corresponding positions while preserving the original features of the depth mode.
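A sketch of the bidirectional correction of equations (6) and (7) follows; implementing the set scaling matrices S1 and S2 as learnable per-channel scales is an assumption, since their exact form and size are not fixed above.

```python
import torch
import torch.nn as nn

class CrossModalCorrection(nn.Module):
    """Sketch of equations (6)-(7): each modality is corrected by adding the other
    modality's filtered (compensation) feature, scaled by a set scaling matrix."""
    def __init__(self, channels: int):
        super().__init__()
        self.S1 = nn.Parameter(torch.ones(1, channels, 1, 1))  # first set scaling matrix
        self.S2 = nn.Parameter(torch.ones(1, channels, 1, 1))  # second set scaling matrix

    def forward(self, RGB_in, HHA_in, RGB_filtered, HHA_filtered):
        RGB_rec = self.S1 * HHA_filtered + RGB_in   # eq. (6): corrected first feature information
        HHA_rec = self.S2 * RGB_filtered + HHA_in   # eq. (7): corrected second feature information
        return RGB_rec, HHA_rec
```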
As described above, the process described in steps 102 to 104 may be referred to as a "separation" process, that is, a process of separating compensation information from the first connection feature information and correcting the corresponding feature information. The "aggregation" process, i.e., the process of fusing the correction feature information, is described next.
Since the features of the RGB modality and the features of the depth modality are strongly complementary, in the embodiments of the present disclosure the features of the RGB modality and the depth modality are fused at each spatial position through a gating mechanism. Fig. 2B is a flowchart illustrating a method for performing fusion according to the correction feature information according to at least one embodiment of the present disclosure. The correction feature information includes correction feature information corresponding to the first feature information and/or correction feature information corresponding to the second feature information. As shown in fig. 2B, the method includes steps 201-204.
Step 201 includes sub-steps 2011, 2012, and 2013, which are alternatives to one another.
In sub-step 2011, when the corrected feature information includes corrected feature information corresponding to the first feature information, the corrected feature information corresponding to the first feature information is connected to the second feature information to obtain second connection feature information.
In sub-step 2012, if the corrected feature information includes corrected feature information corresponding to the second feature information, the first feature information and the corrected feature information corresponding to the second feature information are connected to obtain the second connection feature information.
In sub-step 2013, when the corrected feature information includes corrected feature information corresponding to the first feature information and corrected feature information corresponding to the second feature information, the corrected feature information corresponding to the first feature information and the corrected feature information corresponding to the second feature information are connected to obtain the second connection feature information.
Taking as an example the connection of the correction feature information RGB_rec corresponding to the RGB mode and the correction feature information HHA_rec corresponding to the depth mode obtained in the separation process, the obtained second connection feature information is F_con2 = RGB_rec || HHA_rec; that is, connecting the correction feature information RGB_rec and HHA_rec amounts to splicing the corrected feature maps of the two modes.
In step 202, dimension reduction is performed on the second connection feature information to obtain single-channel feature information.
In the embodiment of the present disclosure, the high-dimensional second connection feature information may be mapped to single-channel feature information by a predefined mapping function to perform dimension reduction, that is, mapped to a space-wise gate.
For RGB modalities, the mapping process is seen in equation (8):
G_rgb = f_rgb(F_con2)    (8)
where f_rgb denotes the predefined mapping function for the RGB modality, and G_rgb is the space-wise gate of the RGB modality, i.e. single-channel feature information. The mapping may be implemented, for example, by a 1 × 1 convolution operation.
For depth modalities, the mapping process is seen in equation (9):
G_hha = f_hha(F_con2)    (9)
where f_hha denotes the predefined mapping function for the depth modality, and G_hha is the space-wise gate of the depth modality, i.e. single-channel feature information. This mapping can also be implemented by a 1 × 1 convolution operation.
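The mapping of equations (8) and (9) can be sketched with two 1 × 1 convolutions, as the text suggests; treating the two gates as separate convolutions over the second connection feature information is an assumption.

```python
import torch
import torch.nn as nn

class SpatialGates(nn.Module):
    """Sketch of equations (8)-(9): reduce the second connection feature
    information to one single-channel space-wise gate per modality."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_gate_rgb = nn.Conv2d(2 * channels, 1, kernel_size=1)
        self.to_gate_hha = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, RGB_rec, HHA_rec):
        F_con2 = torch.cat([RGB_rec, HHA_rec], dim=1)   # second connection feature information, (N, 2C, H, W)
        G_rgb = self.to_gate_rgb(F_con2)                # eq. (8): space-wise gate of the RGB modality, (N, 1, H, W)
        G_hha = self.to_gate_hha(F_con2)                # eq. (9): space-wise gate of the depth modality, (N, 1, H, W)
        return G_rgb, G_hha
```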
In step 203, according to the single-channel feature information, a first weight coefficient corresponding to each position in the first feature information and a second weight coefficient corresponding to each position in the second feature information are obtained, where a sum of the first weight coefficient and the second weight coefficient corresponding to the same position is 1.
In the above "separation" process, the performed operation may be regarded as being performed for the channel of the feature information, and in the "aggregation" process, the performed operation may be regarded as being performed for each position in the feature information (each position of the feature map).
In one example, a softmax function may be applied to the space-wise gate G_rgb of the RGB mode and the space-wise gate G_hha of the depth mode, so as to obtain the weight coefficient corresponding to each position in the RGB mode and the weight coefficient corresponding to each position in the depth mode, respectively, see formula (10):
A_rgb^(i,j) = exp(G_rgb^(i,j)) / (exp(G_rgb^(i,j)) + exp(G_hha^(i,j))),  A_hha^(i,j) = exp(G_hha^(i,j)) / (exp(G_rgb^(i,j)) + exp(G_hha^(i,j)))    (10)
where A_rgb^(i,j) is the first weight coefficient corresponding to position (i, j) in the RGB mode, A_hha^(i,j) is the second weight coefficient corresponding to position (i, j) in the depth mode, and A_rgb^(i,j) + A_hha^(i,j) = 1; that is, for each position (i, j), the sum of the corresponding first weight coefficient and second weight coefficient is 1, where i and j are positive integers, i ≤ W, j ≤ H, H denotes the height of the feature map, and W denotes the width of the feature map.
In step 204, fusion feature information is obtained according to the first feature information and the first weight coefficient, and the second feature information and the second weight coefficient.
In the embodiment of the present disclosure, the fused feature information may be obtained by weighting the first feature information and the second feature information with the first weight coefficient and the second weight coefficient, respectively, as shown in equation (11):
M^(i,j) = A_rgb^(i,j) · RGB_in^(i,j) + A_hha^(i,j) · HHA_in^(i,j)    (11)
After the fused feature information corresponding to each position is obtained through formula (11), the final fused feature information M is obtained.
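Equations (10) and (11) amount to a position-wise softmax over the two gates followed by a weighted blend, sketched below; which feature maps are blended (the original or the corrected ones) is left to the caller, since the description above refers to the first and second feature information.

```python
import torch

def spatial_fusion(G_rgb: torch.Tensor, G_hha: torch.Tensor,
                   RGB_feat: torch.Tensor, HHA_feat: torch.Tensor) -> torch.Tensor:
    """Sketch of equations (10)-(11): per-position weights that sum to 1
    blend the two modality feature maps into the fused feature information."""
    gates = torch.cat([G_rgb, G_hha], dim=1)   # (N, 2, H, W)
    A = torch.softmax(gates, dim=1)            # eq. (10): weights sum to 1 at every position
    A_rgb, A_hha = A[:, 0:1], A[:, 1:2]        # first / second weight coefficients
    M = A_rgb * RGB_feat + A_hha * HHA_feat    # eq. (11): fused feature information
    return M
```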
In some embodiments, the semantic segmentation method described above may be performed by a pre-trained semantic segmentation network. As shown in fig. 3, the semantic segmentation network 300 may include a first feature extraction network 301, a second feature extraction network 302, and a separation aggregation network 303. The first feature extraction network 301 is configured to extract first feature information of the color image, the second feature extraction network 302 is configured to extract second feature information of the depth image, and the separation aggregation network 303 is configured to obtain the fusion feature information according to corrected feature information by using the method according to any implementation manner of the present disclosure.
The whole semantic segmentation network 300 comprising the first feature extraction network 301, the second feature extraction network 302 and the separation aggregation network 303 can be trained end to end, so as to complete a joint fine-tuning process. For example, a color image sample labeled with the semantics of each pixel (i.e., labeled with semantic truth values) and a corresponding depth image sample may be input to the semantic segmentation network 300, and the network parameters of the first feature extraction network 301, the second feature extraction network 302, and the separation aggregation network 303 are adjusted according to the difference between the predicted semantic classification result of each pixel and the corresponding semantic truth value; when this difference is smaller than a set threshold, or a set number of iterations is reached, the training of the semantic segmentation network 300 is completed.
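An end-to-end training loop of the kind described above could look like the following sketch; the optimizer, learning rate, cross-entropy loss and the names seg_net and loader are assumptions, as the text only requires joint adjustment of the three sub-networks against per-pixel semantic truth values.

```python
import torch
import torch.nn as nn

def train_end_to_end(seg_net, loader, epochs: int = 50, lr: float = 1e-3):
    """Sketch: jointly fine-tune the feature extraction networks and the
    separation aggregation network from per-pixel labelled samples."""
    optimizer = torch.optim.SGD(seg_net.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for color, hha, label in loader:    # registered color image, encoded depth image, per-pixel labels
            logits = seg_net(color, hha)    # predicted per-pixel class scores
            loss = criterion(logits, label) # difference from the semantic truth values
            optimizer.zero_grad()
            loss.backward()                 # gradients flow through all three sub-networks
            optimizer.step()
```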
In some embodiments, the first feature extraction network comprises M first feature extraction subnetworks, the second feature extraction network comprises M second feature extraction subnetworks, and the separate aggregation network comprises M separate aggregation subnetworks. As shown in fig. 4, the first feature extraction network comprises first feature extraction sub-networks 401, 402, …, 40M, the second feature extraction network comprises second feature extraction sub-networks 411, 412, …, 41M, and the split aggregation network comprises split aggregation sub-networks 421, 422, …, 42M.
By using the semantic segmentation network with this structure, the fused feature information can be obtained as follows: the input information of the (n+1)th first feature extraction sub-network is obtained from the output result of the nth first feature extraction sub-network and the output result of the nth separation aggregation sub-network, and the input information of the (n+1)th second feature extraction sub-network is obtained from the output result of the nth second feature extraction sub-network and the output result of the nth separation aggregation sub-network, where n < M, and both n and M are positive integers. As shown in fig. 4, taking n = 1 as an example, the first feature information obtained by the first feature extraction sub-network 401 from the color image and the second feature information obtained by the second feature extraction sub-network 411 from the three-channel depth image are input to the separation aggregation sub-network 421. The first fused feature information output by the separation aggregation sub-network 421 may be divided equally into two paths, each carrying 0.5 × the first fused feature information: one path is combined with the first feature information output by the first feature extraction sub-network 401 and input to the first feature extraction sub-network 402 of the next layer; the other path is combined with the second feature information output by the second feature extraction sub-network 411 and input to the second feature extraction sub-network 412 of the next layer.
In one example, the fused feature information may be obtained from the output results of the first and the Mth separation aggregation sub-networks. Combining the shallow fused feature information produced by the first separation aggregation sub-network with the deep fused feature information produced by the last separation aggregation sub-network can improve the accuracy of the semantic segmentation result.
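The cascade of Fig. 4 and the shallow/deep combination just described might be organised as in the sketch below; it assumes every stage keeps the same spatial resolution and that "combining" means addition, neither of which is fixed by the disclosure.

```python
import torch.nn as nn

class CascadedFusionEncoder(nn.Module):
    """Illustrative cascade of M stages: at each stage the separation aggregation
    sub-network fuses the two branches, and half of the fused map is fed back
    into each branch before the next extraction stage."""

    def __init__(self, rgb_stages, depth_stages, sa_stages):
        super().__init__()
        self.rgb_stages = nn.ModuleList(rgb_stages)       # first feature extraction sub-networks
        self.depth_stages = nn.ModuleList(depth_stages)   # second feature extraction sub-networks
        self.sa_stages = nn.ModuleList(sa_stages)         # separation aggregation sub-networks

    def forward(self, rgb, depth):
        fused_maps = []
        x_rgb, x_dep = rgb, depth
        for rgb_stage, dep_stage, sa_stage in zip(self.rgb_stages, self.depth_stages, self.sa_stages):
            x_rgb = rgb_stage(x_rgb)
            x_dep = dep_stage(x_dep)
            fused = sa_stage(x_rgb, x_dep)                # fused feature information of this stage
            fused_maps.append(fused)
            x_rgb = x_rgb + 0.5 * fused                   # half of the fused map goes to each branch
            x_dep = x_dep + 0.5 * fused
        # One option from the text: combine the shallow (1st) and deep (Mth) fusion results.
        return fused_maps[0], fused_maps[-1]
```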
Fig. 5 is a schematic diagram of a semantic segmentation apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus includes an obtaining unit 501, configured to obtain feature information to be processed of an image to be processed, where the image to be processed includes a first mode image and a second mode image corresponding to the first mode image, and the feature information to be processed includes first feature information of the first mode image and second feature information of the second mode image; a connection unit 502, configured to connect the first feature information and the second feature information to obtain first connection feature information; a determining unit 503, configured to determine compensation information of the feature information to be processed according to the first connection feature information; a correcting unit 504, configured to correct the feature information to be processed according to the compensation information, so as to obtain corrected feature information; and a classifying unit 505, configured to determine a semantic classification result of each pixel in the image to be processed according to the correction feature information.
In some embodiments, the compensation information comprises compensation information for the first characteristic information and/or compensation information for the second characteristic information; the correction unit 504 is specifically configured to: under the condition that the compensation information comprises compensation information of first characteristic information, correcting the first characteristic information according to the compensation information of the first characteristic information to obtain corrected characteristic information corresponding to the first characteristic information; and/or under the condition that the compensation information comprises compensation information of second characteristic information, correcting the second characteristic information according to the compensation information of the second characteristic information to obtain corrected characteristic information corresponding to the second characteristic information.
In some embodiments, classification unit 505 is specifically configured to: fusing according to the correction feature information to obtain fused feature information; and acquiring a semantic classification result of each pixel in the first mode image and/or the second mode image according to the fusion characteristic information.
In some embodiments, the determining unit 503 is specifically configured to: performing global pooling operation on the first connection characteristic information to obtain global information for each characteristic information channel; determining a weight coefficient of each channel of the feature information to be processed according to the global information; and multiplying the weight coefficient and the corresponding characteristic information to be processed according to channels to obtain the compensation information.
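One possible reading of this determining step is a squeeze-and-excitation-style block, sketched below for the case where both modalities contribute C channels; the bottleneck layout, reduction ratio and sigmoid activation are assumptions.

```python
import torch
import torch.nn as nn

class ChannelCompensation(nn.Module):
    """Global pooling on the first connection feature -> per-channel weights ->
    channel-wise multiplication with the to-be-processed features."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, (2 * channels) // reduction),
            nn.ReLU(inplace=True),
            nn.Linear((2 * channels) // reduction, 2 * channels),
            nn.Sigmoid(),
        )

    def forward(self, feat_rgb, feat_depth):
        concat = torch.cat([feat_rgb, feat_depth], dim=1)            # first connection feature, (B, 2C, H, W)
        global_info = concat.mean(dim=(2, 3))                        # global pooling, (B, 2C)
        weights = self.fc(global_info).unsqueeze(-1).unsqueeze(-1)   # one weight per channel, (B, 2C, 1, 1)
        w_rgb, w_depth = weights.chunk(2, dim=1)
        return w_rgb * feat_rgb, w_depth * feat_depth                # compensation information per modality
```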
In some embodiments, the correction unit 504 is specifically configured to: and multiplying the compensation information by a set scaling matrix, and adding the multiplication result and at least one of the first characteristic information and the second characteristic information corresponding to the compensation information to obtain the corrected characteristic information.
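A minimal sketch of this correction step, assuming the "set scaling matrix" is represented by a single learnable scalar initialised to zero; a full per-channel or per-element scaling would be applied the same way.

```python
import torch
import torch.nn as nn

class FeatureCorrection(nn.Module):
    """Corrected feature = original feature + scale * compensation."""

    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1))   # stands in for the set scaling matrix

    def forward(self, feature, compensation):
        return feature + self.scale * compensation  # corrected feature information
```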
In some embodiments, the correction characteristic information includes correction characteristic information corresponding to the first characteristic information, and/or correction characteristic information corresponding to the second characteristic information; the classifying unit 505 is specifically configured to, when being configured to perform fusion according to the corrected feature information to obtain fused feature information: under the condition that the correction characteristic information comprises correction characteristic information corresponding to the first characteristic information, connecting the correction characteristic information corresponding to the first characteristic information with the second characteristic information to obtain second connection characteristic information; or, when the corrected feature information includes corrected feature information corresponding to the second feature information, connecting the first feature information with the corrected feature information corresponding to the second feature information to obtain second connection feature information; or, when the corrected feature information includes corrected feature information corresponding to the first feature information and corrected feature information corresponding to the second feature information, connecting the corrected feature information corresponding to the first feature information and the corrected feature information corresponding to the second feature information to obtain second connection feature information; reducing the dimension of the second connection characteristic information into single-channel characteristic information; acquiring a first weight coefficient corresponding to each position in the first characteristic information and a second weight coefficient corresponding to each position in the second characteristic information according to the single-channel characteristic information, wherein the sum of the first weight coefficient and the second weight coefficient corresponding to the same position is 1; and obtaining fusion characteristic information according to the first characteristic information and the first weight coefficient, and the second characteristic information and the second weight coefficient.
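One way the single-channel map could yield two per-position weights that sum to 1 is a sigmoid gate, as sketched below; the 1×1 convolution and the sigmoid are assumptions, and a two-gate softmax as in formula (10) would be an equivalent alternative.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Concatenate the (corrected) features, reduce to one channel, and blend the
    two modalities with the resulting per-position weights."""

    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, 1, kernel_size=1)  # second connection feature -> single channel

    def forward(self, feat_rgb, feat_depth):
        gate = self.reduce(torch.cat([feat_rgb, feat_depth], dim=1))  # (B, 1, H, W)
        w_rgb = torch.sigmoid(gate)        # first weight coefficient per position
        w_depth = 1.0 - w_rgb              # second weight coefficient; the two sum to 1
        return w_rgb * feat_rgb + w_depth * feat_depth                # fused feature information
```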
In some embodiments, the first mode image comprises a color mode image and the second mode image comprises a depth mode image.
In some embodiments, the obtaining unit 501 is specifically configured to: coding the depth mode image to obtain a coded image with the same channel number as the color mode image; and performing feature extraction on the coded image to obtain the second feature information.
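For illustration only, the sketch below encodes a single-channel depth map into three channels by normalizing and repeating it so the depth branch matches the color branch's channel count; this is a stand-in, and an HHA-style encoding (suggested by the "hha" notation used earlier) could equally be used.

```python
import numpy as np

def encode_depth_three_channels(depth):
    """Turn an (H, W) depth map into an (H, W, 3) image with the same channel
    count as the color modality."""
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-6)   # normalize to [0, 1]
    return np.stack([d, d, d], axis=-1)              # repeat into three channels
```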
In some embodiments, the apparatus is applied to a semantic segmentation network, where the semantic segmentation network includes a first feature extraction network for extracting first feature information of the first pattern image, a second feature extraction network for extracting second feature information of the second pattern image, and a separation aggregation network for obtaining the fused feature information according to the first feature information and the second feature information, and the apparatus further includes a first training unit for performing end-to-end training on a semantic segmentation network formed by the first feature extraction network, the second feature extraction network, and the separation aggregation network.
In some embodiments, the first feature extraction network comprises M first feature extraction subnetworks, the second feature extraction network comprises M second feature extraction subnetworks, and the split aggregation network comprises M split aggregation subnetworks; the device further comprises a second training unit, which is used for obtaining the input information of an n +1 th first feature sub-network according to the output result of the nth first feature extraction sub-network and the output result of the nth separation aggregation sub-network, and obtaining the input information of an n +1 th second feature sub-network according to the output result of the nth second feature extraction sub-network and the output result of the nth separation aggregation sub-network, wherein n < M, and both n and M are positive integers; and obtaining the fusion characteristic information according to the output results of the M separation aggregation sub-networks.
In some embodiments, the second training unit is specifically configured to: and acquiring the fusion characteristic information according to the output results of the 1 st separation aggregation sub-network and the Mth separation aggregation sub-network.
Fig. 6 shows an electronic device provided by at least one embodiment of the present disclosure. The device includes a memory and a processor, where the memory is used to store computer instructions executable on the processor, and the processor is used to implement the semantic segmentation method according to any one of the implementations of the present disclosure when executing the computer instructions.
At least one embodiment of the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the semantic segmentation method according to any one of the implementations of the present disclosure.
As will be appreciated by one skilled in the art, one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present description also provides a computer-readable storage medium on which a computer program may be stored; when the program is executed by a processor, the steps of the semantic segmentation method described in any one of the embodiments of the present description are implemented. Herein, "and/or" means having at least one of the two; for example, "A and/or B" covers three cases: A, B, and "A and B".
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the data processing apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.
Claims (20)
1. A method of semantic segmentation, the method comprising:
acquiring to-be-processed feature information of an image to be processed, wherein the image to be processed comprises a first mode image and a second mode image corresponding to the first mode image, and the to-be-processed feature information comprises first feature information of the first mode image and second feature information of the second mode image;
connecting the first characteristic information and the second characteristic information to obtain first connection characteristic information;
determining compensation information of the feature information to be processed according to the first connection feature information;
correcting the characteristic information to be processed according to the compensation information to obtain corrected characteristic information;
and determining the semantic classification result of each pixel in the image to be processed according to the correction characteristic information.
2. The method according to claim 1, wherein the compensation information comprises compensation information of the first characteristic information and/or compensation information of the second characteristic information;
the correcting the feature information to be processed according to the compensation information to obtain corrected feature information includes:
under the condition that the compensation information comprises compensation information of first characteristic information, correcting the first characteristic information according to the compensation information of the first characteristic information to obtain corrected characteristic information corresponding to the first characteristic information;
and/or under the condition that the compensation information comprises compensation information of second characteristic information, correcting the second characteristic information according to the compensation information of the second characteristic information to obtain corrected characteristic information corresponding to the second characteristic information.
3. The method according to claim 1 or 2, wherein the determining a semantic classification result of each pixel in the image to be processed according to the corrected feature information comprises:
fusing according to the correction feature information to obtain fused feature information;
and acquiring a semantic classification result of each pixel in the first mode image and/or the second mode image according to the fusion characteristic information.
4. The method according to any one of claims 1 to 3, wherein the determining compensation information of the feature information to be processed according to the first connection feature information comprises:
performing global pooling operation on the first connection characteristic information to obtain global information for each characteristic information channel;
determining a weight coefficient of each channel of the feature information to be processed according to the global information;
and multiplying the weight coefficient and the corresponding characteristic information to be processed according to channels to obtain the compensation information.
5. The method according to any one of claims 1 to 4, wherein the correcting the feature information to be processed according to the compensation information to obtain corrected feature information includes:
and multiplying the compensation information by a set scaling matrix, and adding the multiplication result and at least one of the first characteristic information and the second characteristic information corresponding to the compensation information to obtain the corrected characteristic information.
6. The method according to any one of claims 1 to 5, wherein the corrected feature information comprises corrected feature information corresponding to the first feature information and/or corrected feature information corresponding to the second feature information;
the fusing according to the correction feature information to obtain fused feature information includes:
under the condition that the correction characteristic information comprises correction characteristic information corresponding to the first characteristic information, connecting the correction characteristic information corresponding to the first characteristic information with the second characteristic information to obtain second connection characteristic information;
or, when the corrected feature information includes corrected feature information corresponding to the second feature information, connecting the first feature information with the corrected feature information corresponding to the second feature information to obtain second connection feature information;
or, when the corrected feature information includes corrected feature information corresponding to the first feature information and corrected feature information corresponding to the second feature information, connecting the corrected feature information corresponding to the first feature information and the corrected feature information corresponding to the second feature information to obtain second connection feature information;
reducing the dimension of the second connection characteristic information into single-channel characteristic information;
acquiring a first weight coefficient corresponding to each position in the first characteristic information and a second weight coefficient corresponding to each position in the second characteristic information according to the single-channel characteristic information, wherein the sum of the first weight coefficient and the second weight coefficient corresponding to the same position is 1;
and obtaining fusion characteristic information according to the first characteristic information and the first weight coefficient, and the second characteristic information and the second weight coefficient.
7. The method of any of claims 1 to 6, wherein the first mode image comprises a color mode image and the second mode image comprises a depth mode image.
8. The method according to claim 7, wherein the obtaining of the feature information to be processed of the image to be processed comprises:
coding the depth mode image to obtain a coded image with the same channel number as the color mode image;
and performing feature extraction on the coded image to obtain the second feature information.
9. The method according to any one of claims 1 to 8, applied to a semantic segmentation network including a first feature extraction network for extracting first feature information of the first mode image, a second feature extraction network for extracting second feature information of the second mode image, and a separation aggregation network for deriving the fused feature information from the first feature information and the second feature information, the method further comprising:
and performing end-to-end training on a semantic segmentation network formed by the first feature extraction network, the second feature extraction network and the separation aggregation network.
10. The method of claim 9, wherein the first feature extraction network comprises M first feature extraction subnetworks, wherein the second feature extraction network comprises M second feature extraction subnetworks, and wherein the split aggregation network comprises M split aggregation subnetworks; the method further comprises the following steps:
obtaining input information of an i +1 th first feature sub-network according to an output result of the i-th first feature extraction sub-network and an output result of the i-th separation aggregation sub-network, and obtaining input information of an i +1 th second feature sub-network according to an output result of the i-th second feature extraction sub-network and an output result of the i-th separation aggregation sub-network, wherein i < M, i and M are positive integers;
and obtaining the fusion characteristic information according to the output results of the M separation aggregation sub-networks.
11. The method according to claim 10, wherein obtaining the fused feature information according to the output results of the M disjoint aggregation sub-networks comprises:
and acquiring the fusion characteristic information according to the output results of the 1 st separation aggregation sub-network and the Mth separation aggregation sub-network.
12. An apparatus for semantic segmentation, the apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring to-be-processed characteristic information of an image to be processed, the image to be processed comprises a first mode image and a second mode image corresponding to the first mode image, and the to-be-processed characteristic information comprises first characteristic information of the first mode image and second characteristic information of the second mode image;
the connection unit is used for connecting the first characteristic information and the second characteristic information to obtain first connection characteristic information;
the determining unit is used for determining compensation information of the feature information to be processed according to the first connection feature information;
the correction unit is used for correcting the characteristic information to be processed according to the compensation information to obtain corrected characteristic information;
and the classification unit is used for determining the semantic classification result of each pixel in the image to be processed according to the correction characteristic information.
13. The apparatus according to claim 12, wherein the compensation information comprises compensation information of the first characteristic information and/or compensation information of the second characteristic information; the correction unit is specifically configured to:
under the condition that the compensation information comprises compensation information of first characteristic information, correcting the first characteristic information according to the compensation information of the first characteristic information to obtain corrected characteristic information corresponding to the first characteristic information;
and/or under the condition that the compensation information comprises compensation information of second characteristic information, correcting the second characteristic information according to the compensation information of the second characteristic information to obtain corrected characteristic information corresponding to the second characteristic information.
14. The apparatus according to claim 12 or 13, wherein the classification unit is specifically configured to: fusing according to the correction feature information to obtain fused feature information; obtaining a semantic classification result of each pixel in the first mode image and/or the second mode image according to the fusion characteristic information;
the determining unit is specifically configured to: performing global pooling operation on the first connection characteristic information to obtain global information for each characteristic information channel; determining a weight coefficient of each channel of the feature information to be processed according to the global information; and multiplying the weight coefficient and the corresponding characteristic information to be processed according to channels to obtain the compensation information.
15. The apparatus according to any one of claims 12 to 14, wherein the correction unit is specifically configured to: multiply the compensation information by a set scaling matrix, and add the multiplication result and at least one of the first characteristic information and the second characteristic information corresponding to the compensation information to obtain the corrected characteristic information.
16. The apparatus according to any one of claims 12 to 15, wherein the corrected feature information comprises corrected feature information corresponding to the first feature information and/or corrected feature information corresponding to the second feature information; the classification unit is specifically configured to, when performing fusion according to the corrected feature information to obtain fused feature information:
under the condition that the correction characteristic information comprises correction characteristic information corresponding to the first characteristic information, connecting the correction characteristic information corresponding to the first characteristic information with the second characteristic information to obtain second connection characteristic information;
or, when the corrected feature information includes corrected feature information corresponding to the second feature information, connecting the first feature information with the corrected feature information corresponding to the second feature information to obtain second connection feature information;
or, when the corrected feature information includes corrected feature information corresponding to the first feature information and corrected feature information corresponding to the second feature information, connecting the corrected feature information corresponding to the first feature information and the corrected feature information corresponding to the second feature information to obtain second connection feature information;
reducing the dimension of the second connection characteristic information into single-channel characteristic information;
acquiring a first weight coefficient corresponding to each position in the first characteristic information and a second weight coefficient corresponding to each position in the second characteristic information according to the single-channel characteristic information, wherein the sum of the first weight coefficient and the second weight coefficient corresponding to the same position is 1;
and obtaining fusion characteristic information according to the first characteristic information and the first weight coefficient, and the second characteristic information and the second weight coefficient.
17. The apparatus of any of claims 12 to 16, wherein the first mode image comprises a color mode image and the second mode image comprises a depth mode image; the obtaining unit is specifically configured to: coding the depth mode image to obtain a coded image with the same channel number as the color mode image; and performing feature extraction on the coded image to obtain the second feature information.
18. The apparatus according to any one of claims 12 to 17, wherein the apparatus is applied to a semantic segmentation network, the semantic segmentation network includes a first feature extraction network for extracting first feature information of the first mode image, a second feature extraction network for extracting second feature information of the second mode image, and a separation aggregation network for obtaining the fused feature information according to the first feature information and the second feature information, and the apparatus further includes a first training unit for performing end-to-end training on a semantic segmentation network composed of the first feature extraction network, the second feature extraction network, and the separation aggregation network;
the first feature extraction network comprises M first feature extraction subnetworks, the second feature extraction network comprises M second feature extraction subnetworks, and the split aggregation network comprises M split aggregation subnetworks; the device further comprises a second training unit, which is used for obtaining the input information of an n +1 th first feature sub-network according to the output result of the nth first feature extraction sub-network and the output result of the nth separation aggregation sub-network, and obtaining the input information of an n +1 th second feature sub-network according to the output result of the nth second feature extraction sub-network and the output result of the nth separation aggregation sub-network, wherein n < M, and both n and M are positive integers; obtaining the fusion characteristic information according to the output results of the M separation aggregation sub-networks;
the second training unit is specifically configured to: and acquiring the fusion characteristic information according to the output results of the 1 st separation aggregation sub-network and the Mth separation aggregation sub-network.
19. An electronic device, comprising a memory for storing computer instructions executable on a processor, the processor being configured to implement the method of any one of claims 1 to 11 when executing the computer instructions.
20. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010076535.2A CN111292340B (en) | 2020-01-23 | 2020-01-23 | Semantic segmentation method, device, equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111292340A true CN111292340A (en) | 2020-06-16 |
CN111292340B CN111292340B (en) | 2022-03-08 |
Family
ID=71023506
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010076535.2A Active CN111292340B (en) | 2020-01-23 | 2020-01-23 | Semantic segmentation method, device, equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111292340B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112053439A (en) * | 2020-09-28 | 2020-12-08 | 腾讯科技(深圳)有限公司 | Method, device and equipment for determining instance attribute information in image and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107403430A (en) * | 2017-06-15 | 2017-11-28 | 中山大学 | A kind of RGBD image, semantics dividing method |
CN108280451A (en) * | 2018-01-19 | 2018-07-13 | 北京市商汤科技开发有限公司 | Semantic segmentation and network training method and device, equipment, medium, program |
CN108664974A (en) * | 2018-04-03 | 2018-10-16 | 华南理工大学 | A kind of semantic segmentation method based on RGBD images Yu Complete Disability difference network |
CN108694731A (en) * | 2018-05-11 | 2018-10-23 | 武汉环宇智行科技有限公司 | Fusion and positioning method and equipment based on low line beam laser radar and binocular camera |
CN109002837A (en) * | 2018-06-21 | 2018-12-14 | 网易(杭州)网络有限公司 | A kind of image application processing method, medium, device and calculate equipment |
KR101970488B1 (en) * | 2017-12-28 | 2019-04-19 | 포항공과대학교 산학협력단 | RGB-D Multi-layer Residual Feature Fusion Network for Indoor Semantic Segmentation |
WO2019127102A1 (en) * | 2017-12-27 | 2019-07-04 | 深圳前海达闼云端智能科技有限公司 | Information processing method and apparatus, cloud processing device, and computer program product |
CN110111346A (en) * | 2019-05-14 | 2019-08-09 | 西安电子科技大学 | Remote sensing images semantic segmentation method based on parallax information |
CN110298361A (en) * | 2019-05-22 | 2019-10-01 | 浙江省北大信息技术高等研究院 | A kind of semantic segmentation method and system of RGB-D image |
Non-Patent Citations (2)
Title |
---|
FAHIMEH FOOLADGAR ET AL: "Learning strengths and weaknesses of classifiers for RGB-D semantic segmentation", 《2015 9TH IRANIAN CONFERENCE ON MACHINE VISION AND IMAGE PROCESSING (MVIP)》 * |
YUAN LIBIN: "Research on key technologies of 3D reconstruction of indoor scenes based on RGB-D sensors", China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology, 2019/08 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112053439A (en) * | 2020-09-28 | 2020-12-08 | 腾讯科技(深圳)有限公司 | Method, device and equipment for determining instance attribute information in image and storage medium |
CN112053439B (en) * | 2020-09-28 | 2022-11-25 | 腾讯科技(深圳)有限公司 | Method, device and equipment for determining instance attribute information in image and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111292340B (en) | 2022-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108961327B (en) | Monocular depth estimation method and device, equipment and storage medium thereof | |
CN109035319B (en) | Monocular image depth estimation method, monocular image depth estimation device, monocular image depth estimation apparatus, monocular image depth estimation program, and storage medium | |
CN108229479B (en) | Training method and device of semantic segmentation model, electronic equipment and storage medium | |
WO2022160753A1 (en) | Image processing method and apparatus, and electronic device and storage medium | |
CN109643383B (en) | Domain split neural network | |
TW202036461A (en) | System for disparity estimation and method for disparity estimation of system | |
WO2023165093A1 (en) | Training method for visual inertial odometer model, posture estimation method and apparatuses, electronic device, computer-readable storage medium, and program product | |
CN109300151B (en) | Image processing method and device and electronic equipment | |
CN110910327B (en) | Unsupervised deep completion method based on mask enhanced network model | |
CN115908992B (en) | Binocular stereo matching method, device, equipment and storage medium | |
CN111292340B (en) | Semantic segmentation method, device, equipment and computer readable storage medium | |
US12045998B2 (en) | Systems and methods for neural implicit scene representation with dense, uncertainty-aware monocular depth constraints | |
CN113269823A (en) | Depth data acquisition method and device, storage medium and electronic equipment | |
CN111932466B (en) | Image defogging method, electronic equipment and storage medium | |
CN116503686B (en) | Training method of image correction model, image correction method, device and medium | |
CN115965961B (en) | Local-global multi-mode fusion method, system, equipment and storage medium | |
GB2623399A (en) | System, devices and/or processes for image anti-aliasing | |
CN112990305B (en) | Method, device and equipment for determining occlusion relationship and storage medium | |
CN114972465A (en) | Image target depth detection method and device, electronic equipment and storage medium | |
CN115330851A (en) | Monocular depth estimation method and device, electronic equipment, storage medium and vehicle | |
CN116883770A (en) | Training method and device of depth estimation model, electronic equipment and storage medium | |
CN115830408B (en) | Pseudo tag generation method, pseudo tag generation device, pseudo tag generation equipment and storage medium | |
EP4456435A1 (en) | Transmitting sensor data in distributed sensor networks | |
CN115457101B (en) | Edge-preserving multi-view depth estimation and ranging method for unmanned aerial vehicle platform | |
CN116310408B (en) | Method and device for establishing data association between event camera and frame camera |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||