CN114419323A - Cross-modal learning and domain-adaptive RGBD image semantic segmentation method

Info

Publication number
CN114419323A
Authority
CN
China
Prior art keywords
semantic segmentation, image, domain, RGBD, cross
Prior art date
Legal status
Granted
Application number
CN202210328137.4A
Other languages
Chinese (zh)
Other versions
CN114419323B (en)
Inventor
刘伟
郭永发
余晓霞
刘家伟
张苗辉
Current Assignee
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date
Filing date
Publication date
Application filed by East China Jiaotong University
Priority to CN202210328137.4A
Publication of CN114419323A
Application granted
Publication of CN114419323B
Legal status: Active

Classifications

    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/253 Fusion techniques of extracted features
    • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F 18/256 Fusion techniques of classification results relating to different input data, e.g. multimodal recognition
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The method performs RGBD image semantic segmentation based on cross-modal learning and domain adaptation. It takes data of two modalities, RGB images and depth images, as input and constructs a cross-modal image semantic segmentation network, and it adopts the Jensen-Shannon divergence to keep the semantic segmentation results of the network's branches as consistent as possible. The method further designs an adversarial generative domain adaptation scheme: taking the semantic segmentation network as the generator, three semantic segmentation results are obtained, and three discriminators are designed, each taking one of the three semantic segmentation results as input. The generator tries to make the source-domain and target-domain semantic segmentations as consistent as possible in distribution, while each discriminator aims to correctly distinguish which domain a semantic segmentation result comes from. Since these goals oppose each other, the generator and the discriminators improve one another through a continual game, finally aligning the different domains at the output level, i.e., achieving high-precision cross-domain labeling of RGBD data.

Description

Cross-modal learning and domain-adaptive RGBD image semantic segmentation method
Technical Field
The invention relates to a cross-modal learning and domain-adaptive RGBD image semantic segmentation method, and belongs to the technical field of image semantic segmentation.
Background
In recent years, image semantic segmentation based on deep convolutional networks has advanced significantly. Training a segmentation network requires a large amount of annotated data, yet manually producing large pixel-level semantic segmentation annotation sets is quite difficult in terms of time and labor. Accurately predicting the class of every pixel in an image therefore remains a challenging problem, especially when a model is trained on one dataset (the source domain) and applied to another (the target domain): the difference between the source domain and the target domain degrades the accuracy of a source-trained model on the target domain. Moreover, safety-critical intelligent systems, such as autonomous driving systems, require that autonomous vehicles perform robustly in a variety of test environments even where no training data exist. For these systems, a semantic segmentation model trained on sunny-day images of Beijing should also predict well on foggy-day images of Kunming.
Although autonomous vehicles and robots are equipped with various sensors, including RGB and depth cameras, most existing image semantic segmentation techniques focus on semantic segmentation of RGB images alone, and few domain-adaptation-based schemes attend to both the RGB and depth modalities. To bridge the gap between different domains, the current mainstream approach applies domain adaptation to align source-domain data carrying manual annotations with target-domain data lacking them.
Many semantic segmentation methods based on domain adaptation use adversarial learning to narrow the gap between the source domain and the target domain. In adversarial domain-adaptive learning, the objectives of the generator and the discriminator contradict each other; the two improve one another through a continual game until, even with a reliable discriminator, it can no longer be told whether an input comes from the source or the target domain, thereby reducing the difference between the two domains. Few current methods, however, combine cross-modal learning with domain adaptation to achieve high-precision cross-dataset semantic segmentation of image scenes.
Previous research and practice have shown that fully mining the complementarity of multi-modal data can enhance semantic understanding of a scene. It is therefore worthwhile to fully exploit the two modalities, RGB image and depth image, and to improve the accuracy of image scene semantic segmentation with domain adaptation based on adversarial learning.
Disclosure of Invention
The invention aims to solve the technical problem of achieving cross-dataset semantic segmentation of image scenes by fully mining the two modalities of RGB image and depth image and combining this with domain adaptation based on adversarial generative learning.
The technical scheme of the invention is a cross-modal learning and domain-adaptive RGBD image semantic segmentation method. Data of two modalities, an RGB image and a depth image, serve as input to construct a cross-modal image semantic segmentation network, and the method trains the semantic segmentation network supervisedly on the source domain $\mathcal{D}_s$. To make full use of the data of the two modalities, the Jensen-Shannon (JS) divergence is adopted to measure the difference between different probability distributions. The method designs an adversarial generative domain adaptation scheme: with the semantic segmentation network as the generator, three semantic segmentation probability outputs of the RGBD image are obtained; three discriminators based on convolutional neural networks are designed, and the weight information maps derived from the three semantic segmentation probability outputs of the semantic segmentation network serve as the discriminators' inputs. The generator aims to make the source and target domains as similar as possible in the distribution of the outputs; each discriminator aims to tell, as well as possible, whether the corresponding semantic segmentation probability output comes from a target-domain or a source-domain sample, i.e., which domain the input sample comes from. These goals oppose each other; the generator and the discriminators improve one another through a continual game until, even with a reliable discriminator, the domain of an input sample can no longer be distinguished. The source and target domains are thereby aligned at the output level, i.e., their difference with respect to image semantic segmentation is reduced.
The method adopts two deep neural networks (semantic segmentation networks such as DeepLab) to respectively extract 256-dimensional RGB image features $F_{rgb}$ and 256-dimensional depth image features $F_{d}$; the RGB image features and the depth image features are directly fused into the 512-dimensional fused feature $F_{fuse}$:

$$F_{fuse} = [F_{rgb}, F_{d}]$$

where $[\cdot,\cdot]$ denotes the feature concatenation operation. $F_{rgb}$, $F_{d}$ and $F_{fuse}$ each pass through convolution, softmax and up-sampling operations to produce the semantic segmentation probability outputs $P_{rgb}$, $P_{d}$ and $P_{fuse}$ respectively. Assuming the image input to the semantic segmentation network has height $H$ and width $W$, and the predefined number of semantic categories is $K$, then $P_{rgb}$, $P_{d}$ and $P_{fuse}$ are matrices of dimension $(H, W, K)$ whose elements represent the model's predicted class probabilities for the pixels at the corresponding spatial positions of the RGBD image.
The image semantic segmentation method trains the semantic segmentation network supervisedly on the source domain $\mathcal{D}_s$:
Assume a sample of the source domain $\mathcal{D}_s$ is a labelled pair of RGB and depth images denoted $(x^{s}_{rgb}, x^{s}_{d}, y^{s})$, where $x^{s}_{rgb}$ represents the RGB image, $x^{s}_{d}$ represents the depth image, and $y^{s}$ represents the manually annotated ground-truth label.
The supervised segmentation losses of the RGBD image semantic segmentation model outputs $P_{rgb}$, $P_{d}$ and $P_{fuse}$ on the sample $(x^{s}_{rgb}, x^{s}_{d}, y^{s})$ can be expressed as:

$$\mathcal{L}^{m}_{seg} = -\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{c=1}^{K} y^{(h,w,c)} \log P^{(h,w,c)}_{m}, \qquad m \in \{rgb,\, d,\, fuse\}$$

where $H$ and $W$ respectively represent the height and width of the RGBD image, and $K$ represents the number of semantic label categories; $P^{(h,w,c)}_{rgb}$, $P^{(h,w,c)}_{d}$ and $P^{(h,w,c)}_{fuse}$ denote the probabilities that the matrices $P_{rgb}$, $P_{d}$ and $P_{fuse}$ assign to semantic class $c$ at image-space position $(h,w)$; $y^{(h,w,c)}$ denotes the value of the label at image-space position $(h,w)$ for semantic class $c$.
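A hedged sketch of the supervised term above: the per-pixel cross-entropy between a probability output and the one-hot label, applied to each of the three outputs. The tensor layout (B, K, H, W) and the averaging over pixels (rather than a plain sum) are assumptions of this illustration:

```python
import torch

def seg_loss(p, y_onehot, eps=1e-8):
    # -sum_c y * log P at every pixel, then averaged over pixels and batch
    return -(y_onehot * torch.log(p + eps)).sum(dim=1).mean()

def supervised_loss(p_rgb, p_d, p_fuse, y_onehot):
    # one supervised segmentation term per probability output
    return (seg_loss(p_rgb, y_onehot)
            + seg_loss(p_d, y_onehot)
            + seg_loss(p_fuse, y_onehot))
```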
The JS divergence loss $\mathcal{L}^{rgb,d}_{js}$ between the probability outputs $P_{rgb}$ and $P_{d}$ is expressed as:

$$\mathcal{L}^{rgb,d}_{js} = \frac{1}{2}\, D_{KL}\!\left(P_{rgb} \,\|\, M\right) + \frac{1}{2}\, D_{KL}\!\left(P_{d} \,\|\, M\right), \qquad M = \frac{1}{2}\left(P_{rgb} + P_{d}\right)$$

where $D_{KL}(\cdot\,\|\,\cdot)$ denotes the KL divergence, which measures the degree of difference between the two probability outputs $P_{d}$ and $P_{rgb}$; the divergence is computed over the $K$ class probabilities $P^{(h,w,c)}_{rgb}$ and $P^{(h,w,c)}_{d}$ at each of the $H \times W$ image-space positions.
Similarly, the JS divergence loss $\mathcal{L}^{fuse,d}_{js}$ between the probability outputs $P_{fuse}$ and $P_{d}$ is expressed as:

$$\mathcal{L}^{fuse,d}_{js} = \frac{1}{2}\, D_{KL}\!\left(P_{fuse} \,\|\, M'\right) + \frac{1}{2}\, D_{KL}\!\left(P_{d} \,\|\, M'\right), \qquad M' = \frac{1}{2}\left(P_{fuse} + P_{d}\right)$$

where $H$ is the height of the RGBD image, $W$ is its width, and $K$ is the number of semantic categories; $P^{(h,w,c)}_{fuse}$ and $P^{(h,w,c)}_{d}$ are the probabilities that the matrices $P_{fuse}$ and $P_{d}$ assign to semantic class $c$ at image-space position $(h,w)$.
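Both consistency terms can be sketched with one helper that follows the JS definition above, i.e., the KL divergence of each distribution against their equal mixture. Shapes (B, K, H, W) and the function names are assumptions:

```python
import torch

def kl_div(p, q, eps=1e-8):
    # KL(p || q), summed over the class dimension, averaged over pixels
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=1).mean()

def js_loss(p, q):
    # JS(p, q) = KL(p || m)/2 + KL(q || m)/2 with m the equal mixture
    m = 0.5 * (p + q)
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
```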
The input of the cross-modal image semantic segmentation network is an RGB image and a depth image.
In the training phase, the outputs of the network are the probability outputs $P_{rgb}$, $P_{d}$ and $P_{fuse}$, each of dimension $(H, W, K)$; these three probability outputs are essentially the distributions over semantic classes that the segmentation network currently predicts for the cross-modal input sample.
In the testing stage, the network takes the weighted sum of the corresponding elements of the three probability outputs to obtain the final semantic segmentation result of the RGBD image.
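As a small illustration of this test-stage fusion, one might weight and sum the three outputs and take the per-pixel argmax; the equal weights below are an assumption, since the text only specifies a weighted sum of corresponding elements:

```python
def predict(p_rgb, p_d, p_fuse, w=(1.0 / 3, 1.0 / 3, 1.0 / 3)):
    # weighted sum of corresponding elements, then per-pixel argmax
    p = w[0] * p_rgb + w[1] * p_d + w[2] * p_fuse   # (B, K, H, W)
    return p.argmax(dim=1)                          # (B, H, W) class labels
```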
The three discriminators based on convolutional neural networks, $D_{rgb}$, $D_{d}$ and $D_{fuse}$, are convolutional neural networks with the same network structure; their input size is $(H, W, K)$ and their output values are 0 and 1, where 0 and 1 correspond to the target domain and the source domain respectively.
From the probability outputs $P_{rgb}$, $P_{d}$ and $P_{fuse}$, the weight information maps are respectively calculated as follows:

$$I^{(h,w,k)}_{m} = -P^{(h,w,k)}_{m} \log P^{(h,w,k)}_{m}, \qquad m \in \{rgb,\, d,\, fuse\}$$

where $I^{(h,w,k)}_{rgb}$, $I^{(h,w,k)}_{d}$ and $I^{(h,w,k)}_{fuse}$ respectively represent the values of the prediction weight information maps $I_{rgb}$, $I_{d}$ and $I_{fuse}$ at the corresponding position $(h,w,k)$. Inputting the RGBD images of the source domain and the target domain into the semantic segmentation network and applying this formula yields the weight information maps $I_{rgb}$, $I_{d}$ and $I_{fuse}$.
The losses of the corresponding discriminators $D_{rgb}$, $D_{d}$ and $D_{fuse}$ can be expressed as:

$$\mathcal{L}_{D_{m}}(\theta_{m}) = -\frac{1}{N_s}\sum_{n=1}^{N_s} \log D_{m}\!\left(I^{s}_{m}\right) - \frac{1}{N_t}\sum_{n=1}^{N_t} \log\!\left(1 - D_{m}\!\left(I^{t}_{m}\right)\right), \qquad m \in \{rgb,\, d,\, fuse\}$$

where $\theta_{rgb}$, $\theta_{d}$ and $\theta_{fuse}$ are respectively the parameters of the discriminators $D_{rgb}$, $D_{d}$ and $D_{fuse}$ to be solved; $\mathcal{L}_{D_{rgb}}$, $\mathcal{L}_{D_{d}}$ and $\mathcal{L}_{D_{fuse}}$ denote the domain cross-entropy losses corresponding to $D_{rgb}$, $D_{d}$ and $D_{fuse}$; $N_s$ and $N_t$ respectively represent the numbers of RGB and depth image pairs used for training in the source domain and the target domain; $(x^{s}_{rgb}, x^{s}_{d})$ represents an RGBD image pair in the source domain and $(x^{t}_{rgb}, x^{t}_{d})$ an RGBD image pair in the target domain; $I^{s}_{rgb}$, $I^{s}_{d}$ and $I^{s}_{fuse}$ respectively represent the $I_{rgb}$, $I_{d}$ and $I_{fuse}$ corresponding to $(x^{s}_{rgb}, x^{s}_{d})$, and $I^{t}_{rgb}$, $I^{t}_{d}$ and $I^{t}_{fuse}$ those corresponding to $(x^{t}_{rgb}, x^{t}_{d})$.
The objective function of the cross-modal image semantic segmentation network can be expressed as the sum of the supervised loss on the source domain and the adversarial loss on the target domain:

$$\min_{\theta}\; \mathcal{L}(\theta) = \mathcal{L}_{sup}(\theta) + \lambda_{adv}\,\mathcal{L}_{adv}(\theta)$$

where $\theta$ denotes the network parameters of the semantic segmentation model to be solved; $\lambda_{adv}$ is the weight corresponding to the adversarial loss $\mathcal{L}_{adv}$, typically set manually by cross-validation; and $\mathcal{L}_{sup}$ is the supervised loss of the RGBD image semantic segmentation network on the source domain.
The supervised loss of the RGBD image semantic segmentation network on the source domain is expressed as the sum of cross-entropy losses and cross-modal losses:

$$\mathcal{L}_{sup} = \mathcal{L}^{rgb}_{seg} + \mathcal{L}^{d}_{seg} + \mathcal{L}^{fuse}_{seg} + \lambda_{1}\,\mathcal{L}^{rgb,d}_{js} + \lambda_{2}\,\mathcal{L}^{fuse,d}_{js}$$

where $\lambda_{1}$ and $\lambda_{2}$ are the weights of the corresponding losses, typically set manually by cross-validation; $\mathcal{L}^{rgb,d}_{js}$ denotes the JS divergence loss between the probability outputs $P_{rgb}$ and $P_{d}$; $\mathcal{L}^{fuse,d}_{js}$ denotes the JS divergence loss between the probability outputs $P_{fuse}$ and $P_{d}$; and $\mathcal{L}^{rgb}_{seg}$, $\mathcal{L}^{d}_{seg}$ and $\mathcal{L}^{fuse}_{seg}$ are respectively the supervised segmentation losses of $P_{rgb}$, $P_{d}$ and $P_{fuse}$ on the sample $(x^{s}_{rgb}, x^{s}_{d}, y^{s})$.
The adversarial loss $\mathcal{L}_{adv}$ of the semantic segmentation network is expressed as:

$$\mathcal{L}_{adv} = -\frac{1}{N_t}\sum_{n=1}^{N_t}\left[\log D_{rgb}\!\left(I^{t}_{rgb}\right) + \log D_{d}\!\left(I^{t}_{d}\right) + \log D_{fuse}\!\left(I^{t}_{fuse}\right)\right]$$

where $I^{s}_{rgb}$, $I^{s}_{d}$ and $I^{s}_{fuse}$ respectively represent the weight information maps $I_{rgb}$, $I_{d}$ and $I_{fuse}$ corresponding to $(x^{s}_{rgb}, x^{s}_{d})$, and $I^{t}_{rgb}$, $I^{t}_{d}$ and $I^{t}_{fuse}$ those corresponding to $(x^{t}_{rgb}, x^{t}_{d})$; $\mathcal{L}_{D_{rgb}}$, $\mathcal{L}_{D_{d}}$ and $\mathcal{L}_{D_{fuse}}$ denote the domain cross-entropy losses corresponding to the discriminators $D_{rgb}$, $D_{d}$ and $D_{fuse}$; and $N_t$ represents the number of RGB and depth image pairs used for target-domain training.
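A sketch of this generator-side term: the segmentation network is rewarded when the discriminators classify target-domain weight maps as source (label 1). Function and argument names are assumptions:

```python
import torch
import torch.nn.functional as F

def adversarial_loss(d_rgb, d_d, d_fuse, i_rgb_t, i_d_t, i_fuse_t):
    # push each discriminator to read the target maps as source (label 1)
    loss = 0.0
    for D, i_t in [(d_rgb, i_rgb_t), (d_d, i_d_t), (d_fuse, i_fuse_t)]:
        out = D(i_t)
        loss = loss + F.binary_cross_entropy_with_logits(out, torch.ones_like(out))
    return loss
```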
The training steps of this domain-adaptive RGBD image semantic segmentation method are as follows (a condensed sketch follows the list):
S1: on the labelled source domain, train the semantic segmentation network with the supervised source-domain loss of the RGBD image semantic segmentation network until convergence or until a fixed number of iterations is reached;
S2: on the source and target domains, train the discriminators $D_{rgb}$, $D_{d}$ and $D_{fuse}$ with their loss formulas;
S3: on the source and target domains, train the semantic segmentation network with the objective function of the RGBD image semantic segmentation network;
S4: repeat steps S2 and S3 until a fixed number of iterations is reached, or until the semantic segmentation network converges and the discriminators can no longer correctly distinguish which domain the training data came from.
The beneficial effects of the invention are as follows: the cross-modal image semantic segmentation network based on the RGB image and the depth image fuses the two different modalities of data more effectively; because the designed image semantic segmentation method rests on domain adaptation, images in the target domain need no manually annotated semantic segmentation labels, which reduces the workload of manual image annotation; the designed segmentation method takes the two different modalities, RGB image and depth image, as input in both the training and testing stages and uses cross-modal learning to fully mine the correlation between them, which can improve the accuracy of image semantic segmentation; and by combining adversarial learning with cross-modal learning, the method can reduce the difference between the source domain and the target domain, which helps improve the model's generalization across different datasets.
Drawings
FIG. 1 is a schematic diagram of an RGB image;
FIG. 2 is a diagram illustrating image semantic segmentation results of an RGB image;
FIG. 3 is a schematic diagram of a cross-modality based RGBD image semantic segmentation network according to the present invention;
FIG. 4 is a diagram illustrating the adversarial generative domain adaptation and the associated loss functions according to the present invention.
Detailed Description
The task of image semantic segmentation is to predict the class of each pixel in an image. As shown in fig. 1 and fig. 2, according to a certain segmentation method or rule, the RGB image in fig. 1 is semantically segmented to obtain the segmentation result shown in fig. 2, where the pixels of each category are replaced by a corresponding gray level; for example, the pixels belonging to the car are rendered black.
The model designed in this embodiment inputs RGBD multi-modal data in the training and testing stages, that is, one RGB image and one corresponding depth image. The output is the result of semantic segmentation of the RGBD image, i.e. the class label of each pixel on each image. It is assumed here that the height and width of the RGBD image are H and W, respectively, and the preset number of semantic categories is K.
As shown in FIG. 3, the invention adopts two deep neural networks to respectively extract 256-dimensional RGB image features $F_{rgb}$ and 256-dimensional depth image features $F_{d}$. Here, the deep neural network may use a common convolutional neural network, such as DeepLabV3. The features of the RGB image and the depth image are directly fused into the 512-dimensional fused feature $F_{fuse}$:

$$F_{fuse} = [F_{rgb}, F_{d}]$$

where $[\cdot,\cdot]$ denotes the feature concatenation operation. After $F_{rgb}$, $F_{d}$ and $F_{fuse}$ pass through convolution, softmax and up-sampling operations, the probability outputs $P_{rgb}$, $P_{d}$ and $P_{fuse}$ are obtained. Each is a matrix of dimension $(H, W, K)$ whose elements represent the model's predicted class probabilities for the pixels at the corresponding spatial positions of the RGBD image.
The semantic segmentation method designed in this embodiment trains the semantic segmentation network supervisedly on the source domain $\mathcal{D}_s$. Assume a sample of the source domain $\mathcal{D}_s$ is a labelled pair of RGB and depth images denoted $(x^{s}_{rgb}, x^{s}_{d}, y^{s})$, where $x^{s}_{rgb}$ represents the RGB image, $x^{s}_{d}$ represents the depth image, and $y^{s}$ represents the manually annotated ground-truth label.
The supervised segmentation losses of the RGBD image semantic segmentation model outputs $P_{rgb}$, $P_{d}$ and $P_{fuse}$ on the sample $(x^{s}_{rgb}, x^{s}_{d}, y^{s})$ can be expressed as:

$$\mathcal{L}^{m}_{seg} = -\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{c=1}^{K} y^{(h,w,c)} \log P^{(h,w,c)}_{m}, \qquad m \in \{rgb,\, d,\, fuse\}$$

where $H$ and $W$ respectively represent the height and width of the RGBD image, $K$ represents the number of semantic label categories, $P^{(h,w,c)}_{m}$ denotes the probability that the matrix $P_{m}$ assigns to semantic class $c$ at image-space position $(h,w)$, and $y^{(h,w,c)}$ denotes the value of the label at image-space position $(h,w)$ for semantic class $c$.
In the RGBD segmentation network designed in this embodiment, the input of the network is an RGB image and a depth image. In the training stage, the outputs of the network are the probability outputs $P_{rgb}$, $P_{d}$ and $P_{fuse}$, each of dimension $(H, W, K)$. The three probability outputs are essentially the distributions over semantic classes that the segmentation network currently predicts for the cross-modal input sample; therefore, for the same RGBD image, $P_{rgb}$, $P_{d}$ and $P_{fuse}$ should have similar probability distributions.
This embodiment uses the JS divergence (Jensen-Shannon divergence) as the cross-modal loss measure of the difference between different probability distributions. The JS divergence loss between $P_{rgb}$ and $P_{d}$ can be expressed as:

$$\mathcal{L}^{rgb,d}_{js} = \frac{1}{2}\, D_{KL}\!\left(P_{rgb} \,\|\, M\right) + \frac{1}{2}\, D_{KL}\!\left(P_{d} \,\|\, M\right), \qquad M = \frac{1}{2}\left(P_{rgb} + P_{d}\right)$$

where $D_{KL}(\cdot\,\|\,\cdot)$ denotes the KL divergence, which measures the degree of difference between the two probability outputs $P_{d}$ and $P_{rgb}$, computed from the probability values $P^{(h,w,c)}_{rgb}$ and $P^{(h,w,c)}_{d}$ at each image-space position $(h,w)$ for each semantic class $c$.
Similarly, the JS divergence loss between the probability outputs $P_{fuse}$ and $P_{d}$ can be expressed as:

$$\mathcal{L}^{fuse,d}_{js} = \frac{1}{2}\, D_{KL}\!\left(P_{fuse} \,\|\, M'\right) + \frac{1}{2}\, D_{KL}\!\left(P_{d} \,\|\, M'\right), \qquad M' = \frac{1}{2}\left(P_{fuse} + P_{d}\right)$$
Therefore, the supervised loss of the RGBD image semantic segmentation network on the source domain can be expressed as the sum of cross-entropy losses and cross-modal losses:

$$\mathcal{L}_{sup} = \mathcal{L}^{rgb}_{seg} + \mathcal{L}^{d}_{seg} + \mathcal{L}^{fuse}_{seg} + \lambda_{1}\,\mathcal{L}^{rgb,d}_{js} + \lambda_{2}\,\mathcal{L}^{fuse,d}_{js} \tag{1}$$

where $\lambda_{1}$ and $\lambda_{2}$ represent the weights of the corresponding losses, typically set manually by cross-validation.
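Formula (1) can be assembled in a few lines from the helper sketches given earlier in this document; supervised_loss and js_loss refer to those illustrative functions, and lam1 and lam2 stand for the cross-validated weights:

```python
def source_supervised_objective(p_rgb, p_d, p_fuse, y_onehot, lam1, lam2):
    # formula (1): three cross-entropy terms plus two JS consistency terms
    return (supervised_loss(p_rgb, p_d, p_fuse, y_onehot)
            + lam1 * js_loss(p_rgb, p_d)
            + lam2 * js_loss(p_fuse, p_d))
```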
FIG. 4 illustrates the adversarial generative domain adaptation and the associated loss functions.
This embodiment adopts an adversarial generative approach to realize domain adaptation for RGBD image semantic segmentation.
The adversarial generative learning framework designed in this embodiment comprises one generator and three discriminators based on convolutional neural networks, where the generator is the RGBD semantic segmentation network.
The core idea of this embodiment's adversarial generative scheme is to train the generator, i.e., the RGBD semantic segmentation network. The generator aims to make the probability outputs $P_{rgb}$, $P_{d}$ and $P_{fuse}$ of a target-domain RGBD image as similar in distribution as possible to the probability maps of the source domain. Each discriminator aims to tell, as well as possible, whether the corresponding weight information map comes from a target-domain or a source-domain sample, i.e., which domain the input sample comes from. These goals oppose each other; the generator and the discriminators improve one another through a continual game until, even with a reliable discriminator, the domain of an input sample can no longer be distinguished.
For the probability outputs $P_{rgb}$, $P_{d}$ and $P_{fuse}$, this embodiment designs three discriminators based on convolutional neural networks. The three discriminators are convolutional neural networks with the same network structure; the number of input channels equals the number of semantic labels, and the output values are 0 and 1, where 0 and 1 correspond to the target domain and the source domain respectively. First, the weight information maps are respectively calculated from $P_{rgb}$, $P_{d}$ and $P_{fuse}$:

$$I^{(h,w,k)}_{m} = -P^{(h,w,k)}_{m} \log P^{(h,w,k)}_{m}, \qquad m \in \{rgb,\, d,\, fuse\}$$

where $I^{(h,w,k)}_{rgb}$, $I^{(h,w,k)}_{d}$ and $I^{(h,w,k)}_{fuse}$ respectively represent the values of the prediction weight information maps $I_{rgb}$, $I_{d}$ and $I_{fuse}$ at the corresponding position $(h,w,k)$. As described above, inputting the RGBD images of the source domain and the target domain into the semantic segmentation network yields the weight information maps $I_{rgb}$, $I_{d}$ and $I_{fuse}$. The losses of the corresponding discriminators $D_{rgb}$, $D_{d}$ and $D_{fuse}$ can be expressed as:

$$\mathcal{L}_{D_{rgb}}(\theta_{rgb}) = -\frac{1}{N_s}\sum_{n=1}^{N_s} \log D_{rgb}\!\left(I^{s}_{rgb}\right) - \frac{1}{N_t}\sum_{n=1}^{N_t} \log\!\left(1 - D_{rgb}\!\left(I^{t}_{rgb}\right)\right) \tag{2}$$

$$\mathcal{L}_{D_{d}}(\theta_{d}) = -\frac{1}{N_s}\sum_{n=1}^{N_s} \log D_{d}\!\left(I^{s}_{d}\right) - \frac{1}{N_t}\sum_{n=1}^{N_t} \log\!\left(1 - D_{d}\!\left(I^{t}_{d}\right)\right) \tag{3}$$

$$\mathcal{L}_{D_{fuse}}(\theta_{fuse}) = -\frac{1}{N_s}\sum_{n=1}^{N_s} \log D_{fuse}\!\left(I^{s}_{fuse}\right) - \frac{1}{N_t}\sum_{n=1}^{N_t} \log\!\left(1 - D_{fuse}\!\left(I^{t}_{fuse}\right)\right) \tag{4}$$
where $\theta_{rgb}$, $\theta_{d}$ and $\theta_{fuse}$ are respectively the parameters of the discriminators $D_{rgb}$, $D_{d}$ and $D_{fuse}$ to be solved; $\mathcal{L}_{D_{rgb}}$, $\mathcal{L}_{D_{d}}$ and $\mathcal{L}_{D_{fuse}}$ denote the domain cross-entropy losses corresponding to $D_{rgb}$, $D_{d}$ and $D_{fuse}$; $N_s$ and $N_t$ respectively represent the numbers of RGB and depth image pairs used for training in the source domain and the target domain; $(x^{s}_{rgb}, x^{s}_{d})$ represents an RGBD image pair in the source domain and $(x^{t}_{rgb}, x^{t}_{d})$ an RGBD image pair in the target domain; $I^{s}_{rgb}$, $I^{s}_{d}$ and $I^{s}_{fuse}$ respectively represent the $I_{rgb}$, $I_{d}$ and $I_{fuse}$ corresponding to $(x^{s}_{rgb}, x^{s}_{d})$, and $I^{t}_{rgb}$, $I^{t}_{d}$ and $I^{t}_{fuse}$ those corresponding to $(x^{t}_{rgb}, x^{t}_{d})$.
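A minimal fully convolutional discriminator consistent with this description (input channels equal to the number of semantic classes K, binary source/target output) might look as follows; the channel widths and depth are assumptions, since the patent does not fix them:

```python
import torch.nn as nn

def make_discriminator(num_classes, ch=64):
    # input channels = number of semantic classes K; output = source/target logit
    return nn.Sequential(
        nn.Conv2d(num_classes, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(ch * 2, ch * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(ch * 4, 1, 4, stride=2, padding=1),
    )
```

With binary cross-entropy on logits, the per-location outputs of this fully convolutional design train against constant 0/1 maps, matching the 0 = target, 1 = source convention above.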
The adversarial loss of this embodiment's semantic segmentation network can be expressed as:

$$\mathcal{L}_{adv} = -\frac{1}{N_t}\sum_{n=1}^{N_t}\left[\log D_{rgb}\!\left(I^{t}_{rgb}\right) + \log D_{d}\!\left(I^{t}_{d}\right) + \log D_{fuse}\!\left(I^{t}_{fuse}\right)\right]$$

The objective function of the RGBD image semantic segmentation network can be expressed as the sum of the supervised loss on the source domain and the adversarial loss on the target domain, i.e.:

$$\min_{\theta}\; \mathcal{L}(\theta) = \mathcal{L}_{sup}(\theta) + \lambda_{adv}\,\mathcal{L}_{adv}(\theta) \tag{5}$$

where $\theta$ denotes the network parameters of the semantic segmentation model that need to be solved, and $\lambda_{adv}$ is the weight corresponding to the adversarial loss $\mathcal{L}_{adv}$; the weight is typically set manually by cross-validation.
The invention designs a domain-adaptive RGBD image semantic segmentation method whose training steps mainly comprise:
S1: train the semantic segmentation network on the labelled source domain with formula (1) until convergence or until a fixed number of iterations is reached;
S2: train the discriminators on the source and target domains with formulas (2), (3) and (4);
S3: train the semantic segmentation network on the source and target domains with formula (5);
S4: repeat steps S2 and S3 until a fixed number of iterations is reached, or until the semantic segmentation network converges and the discriminators can no longer correctly distinguish which domain the training data came from.

Claims (9)

1. A cross-modal learning and domain-adaptive RGBD image semantic segmentation method, characterized in that data of two different modalities, an RGB image and a depth image, are used as input to construct a cross-modal image semantic segmentation network, and the image semantic segmentation algorithm trains the semantic segmentation network supervisedly on the source domain $\mathcal{D}_s$; in order to make full use of the data of the two modalities, the JS divergence is adopted to measure the difference between different probability distributions, so that the outputs of the different modalities are as consistent as possible; the method designs a domain-adaptive algorithm based on adversarial generation, obtaining three semantic segmentation probability maps with the semantic segmentation network as the generator; three discriminators based on convolutional neural networks are designed, and the information maps generated from the three semantic segmentation probability maps of the semantic segmentation network are used as the inputs of the discriminators; the goals of the generator and the discriminators contradict each other, and the two improve one another through a continual game until, even with a reliable discriminator, it cannot be distinguished whether an input sample comes from the source domain or the target domain, so that the source domain and the target domain are aligned at the output level and high-precision cross-domain labeling of RGBD data is realized.
2. The cross-modal learning and domain-adaptive RGBD image semantic segmentation method according to claim 1, characterized in that the method adopts two deep neural networks to respectively extract 256-dimensional RGB image features $F_{rgb}$ and 256-dimensional depth image features $F_{d}$; the RGB image features and the depth image features are directly fused into the 512-dimensional fused feature $F_{fuse}$:

$$F_{fuse} = [F_{rgb}, F_{d}]$$

where $[\cdot,\cdot]$ denotes the feature concatenation operation; after $F_{rgb}$, $F_{d}$ and $F_{fuse}$ pass through convolution, softmax and up-sampling operations, the probability outputs $P_{rgb}$, $P_{d}$ and $P_{fuse}$ are obtained; assuming the image input to the semantic segmentation network has height $H$ and width $W$, and the predefined number of semantic categories is $K$, then $P_{rgb}$, $P_{d}$ and $P_{fuse}$ are matrices of dimension $(H, W, K)$ whose elements represent the model's predicted class probabilities for the pixels at the corresponding spatial positions of the RGBD image.
3. The cross-modal learning and domain-adaptive RGBD image semantic segmentation method according to claim 1, characterized in that the image semantic segmentation algorithm trains the semantic segmentation network supervisedly on the source domain $\mathcal{D}_s$:
assume a sample of the source domain $\mathcal{D}_s$ is a labelled pair of RGB and depth images denoted $(x^{s}_{rgb}, x^{s}_{d}, y^{s})$, where $x^{s}_{rgb}$ represents the RGB image, $x^{s}_{d}$ represents the depth image, and $y^{s}$ represents the manually annotated ground-truth label;
the supervised segmentation losses of the RGBD image semantic segmentation model outputs $P_{rgb}$, $P_{d}$ and $P_{fuse}$ on the sample $(x^{s}_{rgb}, x^{s}_{d}, y^{s})$ can be expressed as:

$$\mathcal{L}^{m}_{seg} = -\sum_{h=1}^{H}\sum_{w=1}^{W}\sum_{c=1}^{K} y^{(h,w,c)} \log P^{(h,w,c)}_{m}, \qquad m \in \{rgb,\, d,\, fuse\}$$

where $H$ and $W$ respectively represent the height and width of the RGBD image, and $K$ represents the number of semantic label categories; $P^{(h,w,c)}_{rgb}$, $P^{(h,w,c)}_{d}$ and $P^{(h,w,c)}_{fuse}$ denote the probabilities that the matrices $P_{rgb}$, $P_{d}$ and $P_{fuse}$ assign to semantic class $c$ at image-space position $(h,w)$; $y^{(h,w,c)}$ denotes the value of the label at image-space position $(h,w)$ for semantic class $c$.
4. The cross-modal learning and domain-adaptive RGBD image semantic segmentation method according to claim 2, characterized in that the JS divergence loss $\mathcal{L}^{rgb,d}_{js}$ between the probability outputs $P_{rgb}$ and $P_{d}$ is expressed as:

$$\mathcal{L}^{rgb,d}_{js} = \frac{1}{2}\, D_{KL}\!\left(P_{rgb} \,\|\, M\right) + \frac{1}{2}\, D_{KL}\!\left(P_{d} \,\|\, M\right), \qquad M = \frac{1}{2}\left(P_{rgb} + P_{d}\right)$$

where $D_{KL}(\cdot\,\|\,\cdot)$ denotes the KL divergence, which measures the degree of difference between the two probability outputs $P_{d}$ and $P_{rgb}$, and $P^{(h,w,c)}_{rgb}$ denotes the probability value that the matrix $P_{rgb}$ assigns to semantic class $c$ at image-space position $(h,w)$.
5. The cross-modal learning and domain-adaptive RGBD image semantic segmentation method according to claim 2, characterized in that the JS divergence loss $\mathcal{L}^{fuse,d}_{js}$ between the probability outputs $P_{fuse}$ and $P_{d}$ is expressed as:

$$\mathcal{L}^{fuse,d}_{js} = \frac{1}{2}\, D_{KL}\!\left(P_{fuse} \,\|\, M'\right) + \frac{1}{2}\, D_{KL}\!\left(P_{d} \,\|\, M'\right), \qquad M' = \frac{1}{2}\left(P_{fuse} + P_{d}\right)$$

where $H$ and $W$ are respectively the height and width of the RGBD image; $K$ is the number of semantic categories; $P^{(h,w,c)}_{rgb}$ and $P^{(h,w,c)}_{d}$ denote the probabilities that the matrices $P_{rgb}$ and $P_{d}$ assign to semantic class $c$ at image-space position $(h,w)$; and $D_{KL}(\cdot\,\|\,\cdot)$ denotes the KL divergence measuring the degree of difference between two probability outputs.
6. The cross-modal learning and domain-adaptive RGBD image semantic segmentation method according to claim 1, characterized in that the input of the cross-modal image semantic segmentation network is an RGB image and a depth image; in the training phase, the outputs of the network are the probability outputs $P_{rgb}$, $P_{d}$ and $P_{fuse}$, each of dimension $(H, W, K)$; the three probability outputs are essentially the distributions over semantic classes that the segmentation network currently predicts for the cross-modal input sample; in the testing stage, the network takes the weighted sum of the corresponding elements of the three probability outputs to obtain the final semantic segmentation result of the RGBD image.
7. The cross-modal learning and domain-adaptive RGBD image semantic segmentation method according to claim 1, characterized in that the three discriminators based on convolutional neural networks, $D_{rgb}$, $D_{d}$ and $D_{fuse}$, are convolutional neural networks with the same network structure, their input size is $(H, W, K)$, and their output values are 0 and 1, where 0 and 1 correspond to the target domain and the source domain respectively;
the weight information maps are calculated from the probability maps $P_{rgb}$, $P_{d}$ and $P_{fuse}$ as follows:

$$I^{(h,w,k)}_{m} = -P^{(h,w,k)}_{m} \log P^{(h,w,k)}_{m}, \qquad m \in \{rgb,\, d,\, fuse\}$$

where $I^{(h,w,k)}_{rgb}$, $I^{(h,w,k)}_{d}$ and $I^{(h,w,k)}_{fuse}$ respectively represent the values of the prediction weight information maps $I_{rgb}$, $I_{d}$ and $I_{fuse}$ at the corresponding position $(h,w,k)$; inputting the RGBD images of the source domain and the target domain into the semantic segmentation network and applying this formula yields the weight information maps $I_{rgb}$, $I_{d}$ and $I_{fuse}$;
the losses of the corresponding discriminators $D_{rgb}$, $D_{d}$ and $D_{fuse}$ can be expressed as:

$$\mathcal{L}_{D_{m}}(\theta_{m}) = -\frac{1}{N_s}\sum_{n=1}^{N_s} \log D_{m}\!\left(I^{s}_{m}\right) - \frac{1}{N_t}\sum_{n=1}^{N_t} \log\!\left(1 - D_{m}\!\left(I^{t}_{m}\right)\right), \qquad m \in \{rgb,\, d,\, fuse\}$$

where $\theta_{rgb}$, $\theta_{d}$ and $\theta_{fuse}$ are respectively the parameters of the discriminators $D_{rgb}$, $D_{d}$ and $D_{fuse}$ to be solved; $\mathcal{L}_{D_{rgb}}$, $\mathcal{L}_{D_{d}}$ and $\mathcal{L}_{D_{fuse}}$ denote the domain cross-entropy losses corresponding to $D_{rgb}$, $D_{d}$ and $D_{fuse}$; $N_s$ and $N_t$ respectively represent the numbers of RGB and depth image pairs used for source-domain and target-domain training; $(x^{s}_{rgb}, x^{s}_{d})$ represents an RGBD image pair in the source domain and $(x^{t}_{rgb}, x^{t}_{d})$ an RGBD image pair in the target domain; $I^{s}_{rgb}$, $I^{s}_{d}$ and $I^{s}_{fuse}$ respectively represent the $I_{rgb}$, $I_{d}$ and $I_{fuse}$ corresponding to $(x^{s}_{rgb}, x^{s}_{d})$, and $I^{t}_{rgb}$, $I^{t}_{d}$ and $I^{t}_{fuse}$ those corresponding to $(x^{t}_{rgb}, x^{t}_{d})$.
8. The cross-modal learning and domain-adaptive RGBD image semantic segmentation method according to claim 6, characterized in that the objective function of the cross-modal image semantic segmentation network is expressed as the sum of the supervised loss on the source domain and the adversarial loss on the target domain:

$$\min_{\theta}\; \mathcal{L}(\theta) = \mathcal{L}_{sup}(\theta) + \lambda_{adv}\,\mathcal{L}_{adv}(\theta)$$

where $\theta$ denotes the network parameters of the semantic segmentation model to be solved; $\lambda_{adv}$ is the weight corresponding to the adversarial loss $\mathcal{L}_{adv}$, typically set manually by cross-validation; and $\mathcal{L}_{sup}$ is the supervised loss of the RGBD image semantic segmentation network on the source domain;
the supervised loss of the RGBD image semantic segmentation network on the source domain is expressed as the sum of cross-entropy losses and cross-modal losses:

$$\mathcal{L}_{sup} = \mathcal{L}^{rgb}_{seg} + \mathcal{L}^{d}_{seg} + \mathcal{L}^{fuse}_{seg} + \lambda_{1}\,\mathcal{L}^{rgb,d}_{js} + \lambda_{2}\,\mathcal{L}^{fuse,d}_{js}$$

where $\lambda_{1}$ and $\lambda_{2}$ represent the weights of the corresponding losses, typically set manually by cross-validation; $\mathcal{L}^{rgb,d}_{js}$ denotes the JS divergence loss between the probability outputs $P_{rgb}$ and $P_{d}$; $\mathcal{L}^{fuse,d}_{js}$ denotes the JS divergence loss between the probability outputs $P_{fuse}$ and $P_{d}$; and $\mathcal{L}^{rgb}_{seg}$, $\mathcal{L}^{d}_{seg}$ and $\mathcal{L}^{fuse}_{seg}$ are respectively the supervised segmentation losses of $P_{rgb}$, $P_{d}$ and $P_{fuse}$ on the sample $(x^{s}_{rgb}, x^{s}_{d}, y^{s})$;
the adversarial loss $\mathcal{L}_{adv}$ of the semantic segmentation network is expressed as:

$$\mathcal{L}_{adv} = -\frac{1}{N_t}\sum_{n=1}^{N_t}\left[\log D_{rgb}\!\left(I^{t}_{rgb}\right) + \log D_{d}\!\left(I^{t}_{d}\right) + \log D_{fuse}\!\left(I^{t}_{fuse}\right)\right]$$

where $I^{s}_{rgb}$, $I^{s}_{d}$ and $I^{s}_{fuse}$ respectively represent the weight information maps corresponding to $(x^{s}_{rgb}, x^{s}_{d})$, and $I^{t}_{rgb}$, $I^{t}_{d}$ and $I^{t}_{fuse}$ those corresponding to $(x^{t}_{rgb}, x^{t}_{d})$; $\mathcal{L}_{D_{rgb}}$, $\mathcal{L}_{D_{d}}$ and $\mathcal{L}_{D_{fuse}}$ denote the domain cross-entropy losses corresponding to the discriminators $D_{rgb}$, $D_{d}$ and $D_{fuse}$; and $N_t$ represents the number of RGB and depth image pairs used for target-domain training.
9. The cross-modal learning and domain-adaptive RGBD image semantic segmentation method according to claim 1, characterized in that the training steps comprise:
S1: on the labelled source domain, train the semantic segmentation network with the supervised source-domain loss of the RGBD image semantic segmentation network until convergence or until a fixed number of iterations is reached;
S2: on the source and target domains, train the discriminators $D_{rgb}$, $D_{d}$ and $D_{fuse}$ with their loss formulas;
S3: on the source and target domains, train the semantic segmentation network with the objective function of the RGBD image semantic segmentation network;
S4: repeat steps S2 and S3 until a fixed number of iterations is reached, or until the semantic segmentation network converges and the discriminators can no longer correctly distinguish which domain the training data came from.
CN202210328137.4A 2022-03-31 2022-03-31 Cross-modal learning and domain-adaptive RGBD image semantic segmentation method Active CN114419323B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210328137.4A CN114419323B (en) Cross-modal learning and domain-adaptive RGBD image semantic segmentation method


Publications (2)

Publication Number Publication Date
CN114419323A (en) 2022-04-29
CN114419323B (en) 2022-06-24

Family

ID=81262781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210328137.4A Active CN114419323B (en) Cross-modal learning and domain-adaptive RGBD image semantic segmentation method

Country Status (1)

Country Link
CN (1) CN114419323B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809187A (en) * 2015-04-20 2015-07-29 南京邮电大学 Indoor scene semantic annotation method based on RGB-D data
WO2019137915A1 (en) * 2018-01-09 2019-07-18 Connaught Electronics Ltd. Generating input data for a convolutional neuronal network
CN109712105A (en) * 2018-12-24 2019-05-03 浙江大学 A kind of image well-marked target detection method of combination colour and depth information
CN111832592A (en) * 2019-04-20 2020-10-27 南开大学 RGBD significance detection method and related device
CN111340814A (en) * 2020-03-03 2020-06-26 北京工业大学 Multi-mode adaptive convolution-based RGB-D image semantic segmentation method
CN112233124A (en) * 2020-10-14 2021-01-15 华东交通大学 Point cloud semantic segmentation method and system based on countermeasure learning and multi-modal learning
CN113627433A (en) * 2021-06-18 2021-11-09 中国科学院自动化研究所 Cross-domain self-adaptive semantic segmentation method and device based on data disturbance

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZIQIANG ZHENG et al.: "Instance Map Based Image Synthesis With a Denoising Generative Adversarial Network", IEEE Access *
LI Xiaoyang et al.: "RGBD image co-segmentation algorithm combining saliency detection and graph cut", Journal of System Simulation *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115272681A (en) * 2022-09-22 2022-11-01 中国海洋大学 Ocean remote sensing image semantic segmentation method and system based on high-order feature class decoupling
CN115272681B (en) * 2022-09-22 2022-12-20 中国海洋大学 Ocean remote sensing image semantic segmentation method and system based on high-order feature class decoupling
CN116051830A (en) * 2022-12-20 2023-05-02 中国科学院空天信息创新研究院 Cross-modal data fusion-oriented contrast semantic segmentation method
CN116051830B (en) * 2022-12-20 2023-06-20 中国科学院空天信息创新研究院 Cross-modal data fusion-oriented contrast semantic segmentation method
CN115797642A (en) * 2023-02-13 2023-03-14 华东交通大学 Self-adaptive image semantic segmentation algorithm based on consistency regularization and semi-supervision field
CN117036891A (en) * 2023-08-22 2023-11-10 睿尔曼智能科技(北京)有限公司 Cross-modal feature fusion-based image recognition method and system
CN117036891B (en) * 2023-08-22 2024-03-29 睿尔曼智能科技(北京)有限公司 Cross-modal feature fusion-based image recognition method and system

Also Published As

Publication number Publication date
CN114419323B (en) 2022-06-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant