CN114708436A - Training method of semantic segmentation model, semantic segmentation method, semantic segmentation device and semantic segmentation medium - Google Patents


Info

Publication number
CN114708436A
CN114708436A
Authority
CN
China
Prior art keywords
output
semantic
pooling
model
information
Prior art date
Legal status
Granted
Application number
CN202210620456.2A
Other languages
Chinese (zh)
Other versions
CN114708436B (en)
Inventor
涂鹏
凌明
杨作兴
艾国
Current Assignee
Shenzhen MicroBT Electronics Technology Co Ltd
Original Assignee
Shenzhen MicroBT Electronics Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen MicroBT Electronics Technology Co Ltd filed Critical Shenzhen MicroBT Electronics Technology Co Ltd
Priority to CN202210620456.2A
Publication of CN114708436A
Application granted
Publication of CN114708436B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques

Abstract

The embodiment of the application provides a training method for a semantic segmentation model, a semantic segmentation method, a semantic segmentation device and a semantic segmentation medium. The training method specifically comprises the following steps: determining a first output and a second output corresponding to a plurality of unlabeled images by using a teacher model and a student model respectively; determining loss information based on the first output and the second output; and updating a first parameter of the student model according to the loss information. Determining the loss information based on the first output and the second output comprises: generating a pseudo label according to the first semantic representation, and determining first loss information according to the second semantic representation and the pseudo label; and/or performing pooling processing with multiple kinds of scale information on the first coding feature and the second coding feature respectively to obtain a first pooling feature and a second pooling feature, and determining second loss information according to the second pooling feature and the pooled fusion feature of the first pooling features under the multiple kinds of scale information. The embodiment of the application can improve the performance of the semantic segmentation model.

Description

Training method of semantic segmentation model, semantic segmentation method, semantic segmentation device and semantic segmentation medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a training method for a semantic segmentation model, a semantic segmentation method, an apparatus, and a medium.
Background
Semantic segmentation is an important research topic in the field of computer vision. It aims to segment an image into regions with different semantic information and to label each region with its corresponding semantic tag. In the field of semantic segmentation, labels are often difficult and costly to obtain, because mask annotations are required to adhere closely to target edges; otherwise unreasonable supervision noise is introduced into the training of the segmentation model. Semi-supervised learning can effectively utilize unlabeled data to supplement labeled samples and thereby reduce annotation cost.
A training method for a semantic segmentation model can generate pseudo labels for unlabeled data and use them as potential real labels of the unlabeled data, so that the unlabeled data can be utilized.

In practical applications, a pseudo label may be inconsistent with the potential real label. In such a case, the learning process on unlabeled data may learn wrong information from the incorrect pseudo label, causing a performance degradation of the semantic segmentation model.
Disclosure of Invention
The embodiment of the application provides a training method of a semantic segmentation model, which can improve the performance of the semantic segmentation model by means of label-free data.
Correspondingly, the embodiment of the application also provides a semantic segmentation method, a training device of a semantic segmentation model, a semantic segmentation device, electronic equipment and a machine readable medium, which are used for ensuring the realization and application of the method.
In order to solve the above problem, an embodiment of the present application discloses a training method for a semantic segmentation model, where the semantic segmentation model includes: a teacher model and a student model, and the training data of the semantic segmentation model includes: a plurality of unlabeled images of at least one unlabeled image under multiple kinds of scale information; the method comprises the following steps:
determining a first output and a second output corresponding to the plurality of the non-labeled images by respectively using a teacher model and a student model;
determining loss information based on the first output and the second output;
updating a first parameter of the student model according to the loss information;
wherein the type corresponding to the first output and the second output comprises: coding feature type and/or semantic representation type; in the case where the first output and the second output correspond to a type of encoding feature, the first output comprises: a first encoding feature, the second output comprising: a second coding feature; in the case that the first output and the second output correspond to semantic representation types, the first output comprises: a first semantic representation, the second output comprising: a second semantic representation;
said determining loss information from said first output and said second output comprises:
generating pseudo labels corresponding to the multiple label-free images according to the first semantic representation; determining first loss information according to the second semantic representation and the pseudo label; and/or
Performing pooling processing with multiple kinds of scale information on the first coding feature and the second coding feature respectively to obtain a first pooling feature and a second pooling feature; and determining second loss information according to the second pooling feature and the pooled fusion feature of the first pooling features under the multiple kinds of scale information.
Optionally, the generating, according to the first semantic representation, pseudo labels corresponding to the multiple label-free images includes:
carrying out scale alignment processing on the first semantic representations respectively corresponding to the multiple non-labeled images to obtain first alignment semantic representations respectively corresponding to the multiple non-labeled images;
performing fusion processing on first alignment semantic representations respectively corresponding to a plurality of non-labeled images to obtain fusion semantic representations;
and generating pseudo labels corresponding to the multiple unmarked images according to the fusion semantic representation.
Optionally, the determining first loss information according to the second semantic representation and the pseudo tag includes:
carrying out scale alignment processing on the second semantic representations respectively corresponding to the multiple non-labeled images to obtain second alignment semantic representations respectively corresponding to the multiple non-labeled images; determining first loss information according to the second alignment semantic representation and the pseudo label; or
Converting the pseudo-label to a target pseudo-label that matches the second semantic representation; and determining first loss information according to the second semantic representation and the target pseudo label.
Optionally, the determining process of the pooled fusion features comprises: performing fusion processing on the first pooling features corresponding to each single kind of scale information respectively, to obtain pooled fusion features corresponding to the multiple kinds of scale information.
Optionally, the method further comprises:
and updating the second parameter of the teacher model according to the updated first parameter.
In order to solve the above problem, an embodiment of the present application discloses a semantic segmentation method, where the method includes:
receiving an image to be processed;
performing semantic segmentation on the image to be processed by using a teacher model or a student model of a semantic segmentation model to obtain a corresponding segmentation result;
wherein the training data of the semantic segmentation model comprises: a plurality of unlabeled images of one unlabeled image under multiple kinds of scale information; the training process of the semantic segmentation model comprises the following steps: determining a first output and a second output corresponding to the plurality of unlabeled images by using a teacher model and a student model respectively; determining loss information based on the first output and the second output; and updating a first parameter of the student model according to the loss information;
wherein the type corresponding to the first output and the second output comprises: coding feature type and/or semantic representation type; in the case where the first output and the second output correspond to a type of encoding feature, the first output comprises: a first encoding feature, the second output comprising: a second coding feature; in the case that the first output and the second output correspond to semantic representation types, the first output comprises: a first semantic representation, the second output comprising: a second semantic representation;
the loss information includes: first loss information and/or second loss information; the first loss information is obtained according to the second semantic representation and pseudo labels corresponding to the plurality of unlabeled images, the pseudo labels being obtained according to the first semantic representation; the second loss information is obtained according to the second pooling feature and the pooled fusion feature of the first pooling features under the multiple kinds of scale information; the first pooling feature is obtained by performing pooling processing with multiple kinds of scale information on the first coding feature; and the second pooling feature is obtained by performing pooling processing with multiple kinds of scale information on the second coding feature.
In order to solve the above problem, an embodiment of the present application discloses a training device for a semantic segmentation model, where the semantic segmentation model includes: a teacher model and a student model, and the training data of the semantic segmentation model includes: a plurality of unlabeled images of one unlabeled image under multiple kinds of scale information; the device comprises:
the model processing module is used for determining a first output and a second output corresponding to the multiple unlabeled images by respectively utilizing a teacher model and a student model;
a loss processing module for determining loss information according to the first output and the second output;
the first parameter updating module is used for updating a first parameter of the student model according to the loss information;
wherein the type corresponding to the first output and the second output comprises: coding feature type and/or semantic representation type; in the case where the first output and the second output correspond to a type of encoding feature, the first output comprises: a first encoding feature, the second output comprising: a second coding feature; in the case that the first output and the second output correspond to semantic representation types, the first output comprises: a first semantic representation, the second output comprising: a second semantic representation;
the loss processing module includes:
the first loss processing module is used for generating pseudo labels corresponding to the multiple unlabeled images according to the first semantic representation; determining first loss information according to the second semantic representation and the pseudo label; and/or
The second loss processing module is used for performing pooling processing with multiple kinds of scale information on the first coding feature and the second coding feature respectively to obtain a first pooling feature and a second pooling feature, and determining second loss information according to the second pooling feature and the pooled fusion feature of the first pooling features under the multiple kinds of scale information.
In order to solve the above problem, an embodiment of the present application discloses a semantic segmentation apparatus, where the apparatus includes:
the receiving module is used for receiving the image to be processed;
the semantic segmentation module is used for performing semantic segmentation on the image to be processed by utilizing a teacher model or a student model of a semantic segmentation model to obtain a corresponding segmentation result;
wherein the training data of the semantic segmentation model comprises: a plurality of unlabeled images of one unlabeled image under multiple kinds of scale information; the training process of the semantic segmentation model comprises the following steps: determining a first output and a second output corresponding to the plurality of unlabeled images by using a teacher model and a student model respectively; determining loss information based on the first output and the second output; and updating a first parameter of the student model according to the loss information;
wherein the type corresponding to the first output and the second output comprises: coding feature type and/or semantic representation type; in the case where the first output and the second output correspond to a type of encoding feature, the first output comprises: a first encoding feature, the second output comprising: a second coding feature; in the case that the first output and the second output correspond to semantic representation types, the first output comprises: a first semantic representation, the second output comprising: a second semantic representation;
the loss information includes: first loss information and/or second loss information; the first loss information is obtained according to the second semantic representation and pseudo labels corresponding to the plurality of unlabeled images, the pseudo labels being obtained according to the first semantic representation; the second loss information is obtained according to the second pooling feature and the pooled fusion feature of the first pooling features under the multiple kinds of scale information; the first pooling feature is obtained by performing pooling processing with multiple kinds of scale information on the first coding feature; and the second pooling feature is obtained by performing pooling processing with multiple kinds of scale information on the second coding feature.
The embodiment of the application also discloses an electronic device, which comprises: a processor; and a memory having executable code stored thereon that, when executed, causes the processor to perform a method as described in embodiments of the present application.
The embodiment of the application also discloses a machine-readable medium, wherein executable codes are stored on the machine-readable medium, and when the executable codes are executed, a processor is caused to execute the method according to the embodiment of the application.
The embodiment of the application has the following advantages:
in the embodiment of the application, in the training process of the semantic segmentation model, a teacher model and a student model are respectively utilized to determine a first output and a second output corresponding to a plurality of unlabeled images; determining loss information based on the first output and the second output; and updating the first parameter of the student model according to the loss information.
The loss information of the embodiment of the present application may include: at least one of the first loss information and the second loss information. The first loss information can represent the loss of the segmentation accuracy dimension represented by the output of the semantic segmentation model in the aspect of semantic representation types. The second loss information can represent the loss of the dimensionality of the coding feature represented by the output of the semantic segmentation model in the aspect of the coding feature type.
According to the embodiment of the application, under the condition that the segmentation accuracy of the semantic segmentation model in the multi-scale scene is improved and/or the accuracy of the coding features of the semantic segmentation model in the multi-scale scene is improved, the accuracy of pseudo labels corresponding to a plurality of label-free images can be improved, the matching degree between the pseudo labels and potential real labels can be improved, and the performance of the semantic segmentation model can be improved by means of label-free data.
Drawings
FIG. 1 is a schematic structural diagram of a semantic segmentation model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a labeled data training method of a semantic segmentation model according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating steps of a method for training a semantic segmentation model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a training process of a semantic segmentation model according to one embodiment of the present application;
FIG. 5 is a schematic diagram of a training process for a semantic segmentation model according to one embodiment of the present application;
FIG. 6 is a schematic diagram of a training process for a semantic segmentation model according to one embodiment of the present application;
FIG. 7 is a flow chart illustrating the steps of a semantic segmentation method according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a training apparatus for a semantic segmentation model according to an embodiment of the present application;
FIG. 9 is a block diagram of a semantic segmentation apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an apparatus provided in an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
The embodiment of the application can be applied to semantic segmentation scenes. In a semantic segmentation scenario, an image may be semantically segmented by a semantic segmentation model, for example, semantic tags (such as a table, a wall, a sky, a person, a dog, etc.) may be added to image objects in the image. The segmentation result obtained by the semantic segmentation model may include: the corresponding image area of the image object in the image and the corresponding semantic label.
The semantic segmentation model of the embodiment of the present application can be used to characterize a first mapping relationship between the image to be processed and the segmentation result. The embodiment of the present application can train a mathematical model to obtain the semantic segmentation model. A mathematical model is a scientific or engineering model constructed using mathematical logic and mathematical language: for the characteristics or quantity dependencies of a certain object system, it generally or approximately expresses a mathematical structure in mathematical language, the mathematical structure being a relational structure described by means of mathematical symbols. The mathematical model may be one or a set of algebraic, differential, integral or statistical equations, or a combination thereof, by which the interrelations or causal relationships between the variables of the system are described quantitatively or qualitatively. In addition to models described by equations, there are also models described by other mathematical tools, such as algebra, geometry, topology and mathematical logic. A mathematical model describes the behavior and characteristics of a system rather than its actual structure. The mathematical model can be trained using machine learning and deep learning methods; machine learning methods may include: linear regression, decision trees, random forests and the like, and deep learning methods may include: CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit) and the like.
Referring to fig. 1, a schematic structural diagram of a semantic segmentation model according to an embodiment of the present application is shown, where the semantic segmentation model specifically includes: an encoding module 101 and a decoding module 102.
The encoding module 101 may be configured to perform feature extraction on an image to be processed to obtain an encoding feature corresponding to the image to be processed. The encoding module 101 may downsize the feature map into a lower dimensional representation via a convolutional layer and a downsampled layer. The purpose of the encoding module 101 may be to extract low-level features and high-level features, thereby improving the accuracy of semantic segmentation using the extracted spatial information and global information.
The encoding module 101 may be configured to characterize a second mapping relationship between the image to be processed and the coding features. Examples of the encoding module 101 may include: a VGG (Visual Geometry Group) network, a ResNet (Residual Network), a lightweight network, and the like. It is understood that the embodiment of the present application does not limit the specific network corresponding to the encoding module 101.
Wherein the residual network may be a convolutional network. The convolution network can be a deep feedforward artificial neural network and has better performance in image recognition. The convolutional network may specifically include a convolutional layer (convolutional layer) and a pooling layer (pooling layer). The convolutional layer is used to automatically extract features from an input image to obtain a feature map (feature map). The pooling layer is used for pooling the feature map to reduce the number of features in the feature map. The pooling treatment of the pooling layer includes maximum pooling, average pooling, random pooling and the like, and an appropriate method can be selected according to actual needs.
The decoding module 102 is configured to determine a semantic representation corresponding to the image to be processed according to the encoding feature output by the encoding module 101. The processing of the decoding module 102 may include: convolution processing, stacking processing, depth separable convolution and sampling processing, and the like. The semantic representations may include: and semantic labels corresponding to the pixel points in the image to be processed. The decoding module 102 may restore the spatial dimension by using an upsampling operation, merge the features extracted in the encoding process, and complete the semantic representation output with the same scale as the image to be processed on the premise of reducing the information loss as much as possible. The semantic representation may be semantic information corresponding to a pixel point in the image to be processed, and the semantic information may be probability that the pixel point belongs to a preset category. The preset category is usually plural.
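For illustration only, the following is a minimal sketch of an encoder-decoder segmentation model of the kind described above. It is not the patented model itself; the use of PyTorch, the layer sizes and all names (e.g. TinySegmentationModel) are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySegmentationModel(nn.Module):
    """Hypothetical encoder-decoder model: the encoder extracts coding features,
    and the decoder maps them back to a per-pixel semantic representation."""
    def __init__(self, num_classes: int):
        super().__init__()
        # Encoding module: convolution + downsampling extracts coding features.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoding module: 1x1 convolution produces per-pixel class scores.
        self.decoder = nn.Conv2d(64, num_classes, 1)

    def forward(self, x: torch.Tensor):
        h, w = x.shape[-2:]
        feat = self.encoder(x)        # coding features
        logits = self.decoder(feat)
        # Upsampling restores the spatial dimension of the input image.
        logits = F.interpolate(logits, size=(h, w), mode="bilinear",
                               align_corners=False)
        return feat, logits           # softmax over dim=1 gives the semantic representation
```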
One semantic segmentation model training method adopts semi-supervised learning: specifically, pseudo labels can be generated for unlabeled data and used as potential real labels of the unlabeled data, so that the pseudo labels play a supervisory role in the learning process on unlabeled data. In practical applications, a pseudo label may not match the potential real label; in such a case, the learning process on unlabeled data may learn wrong information from the incorrect pseudo label, resulting in a decrease in the performance of the semantic segmentation model.
Aiming at the technical problem that the performance of a semantic segmentation model is degraded when pseudo labels are inconsistent with potential real labels, the embodiment of the present application provides a training method for a semantic segmentation model. The semantic segmentation model specifically includes: a teacher model and a student model, and the training data of the semantic segmentation model can include: a plurality of unlabeled images of at least one unlabeled image under multiple kinds of scale information;
the training method can comprise the following steps:
determining a first output and a second output corresponding to the multiple unlabelled images by using the teacher model and the student model respectively;
determining loss information based on the first output and the second output;
updating a first parameter of the student model according to the loss information;
wherein the type corresponding to the first output and the second output comprises: coding feature type and/or semantic representation type; in the case where the first output and the second output correspond to a type of encoding feature, the first output includes: a first encoding feature, the second output comprising: a second coding feature; in the case where the first output and the second output correspond to semantic representation types, the first output includes: a first semantic representation, the second output comprising: a second semantic representation;
determining loss information based on the first output and the second output, comprising:
generating pseudo labels corresponding to a plurality of label-free images according to the first semantic representation; determining first loss information based on the second semantic representation and the pseudo tag; and/or
Performing pooling processing on the first coding feature and the second coding feature with multiple kinds of scale information respectively to obtain a first pooling feature and a second pooling feature; and determining second loss information according to the pooling fusion characteristics of the first pooling characteristics under the condition of the scale information and the second pooling characteristics.
Teacher models and student models fall under knowledge distillation technology. Knowledge distillation is an information extraction method based on neural networks and an effective form of network compression: a teacher model is generated through ensembling or large-scale training, and the output labels of the teacher model are then softened, which increases the amount of information shared among different classes and makes the model more adaptable to different classification tasks. The teacher model can guide the student model in solving the actual semantic segmentation problem, and the student model can effectively inherit the excellent classification and prediction capabilities of the teacher model. The teacher model and the student model may have the same network structure; for example, both may include: the encoding module 101 and the decoding module 102 shown in fig. 1. The teacher model and the student model may have different parameters; for example, the parameter of the teacher model may be the second parameter and the parameter of the student model may be the first parameter.
The scale of an image object in an image can change due to factors such as the distance between the image object (such as an object or a person) and the acquisition device (such as a camera or video camera). In the related art, a semantic segmentation model usually gives different segmentation results for different images with the same image content at different scales; this is one reason why pseudo labels do not match potential real labels in the semi-supervised learning process of a semantic segmentation model.
The training data set of the semantic segmentation model in the embodiment of the present application may include: a plurality of unlabeled images of one unlabeled image under multiple kinds of scale information. The plurality of unlabeled images can be used as unlabeled training data in the semi-supervised learning process of the semantic segmentation model, and can represent multi-scale scenes.
In the training process of the semantic segmentation model, a teacher model and a student model are respectively utilized to determine a first output and a second output corresponding to a plurality of unlabeled images; determining loss information based on the first output and the second output; and updating the first parameter of the student model according to the loss information.
The loss information of the embodiment of the present application may include: at least one of the first loss information and the second loss information. The first loss information can represent the loss of the segmentation accuracy dimension represented by the output of the semantic segmentation model in the aspect of semantic representation types. The second loss information can represent the loss of the dimension of the coding feature represented by the output of the semantic segmentation model in the aspect of the coding feature type.
According to the embodiment of the application, under the condition that the segmentation accuracy of the semantic segmentation model in the multi-scale scene is improved and/or the accuracy of the coding features of the semantic segmentation model in the multi-scale scene is improved, the accuracy of pseudo labels corresponding to a plurality of label-free images can be improved, the matching degree between the pseudo labels and potential real labels can be improved, and further the performance of the semantic segmentation model can be improved.
By applying the technical solution of the embodiment of the present application, no matter how the scale information of an image changes, the semantic segmentation model of the embodiment of the present application can obtain coding features with higher similarity and/or semantic representations with a higher matching degree across the different scale versions of the image. Therefore, the embodiment of the present application can improve robustness and other aspects of the performance of the semantic segmentation model.
Method embodiment 1
This embodiment explains the training process of the semantic segmentation model. The training process can include: a labeled-data training process and an unlabeled-data training process, where the labeled-data training can be performed first, followed by the unlabeled-data training.
In the training process with the labeled data, the student model can be trained by utilizing the labeled data.
The training process of the student model may include: forward propagation and backward propagation.
Forward propagation can compute, layer by layer from the input layer to the output layer, according to the first parameters of the student model, and finally obtain output information (e.g., a segmentation result). The output information can be used to determine error information.

Back propagation can compute layer by layer from the output layer to the input layer according to the error information, and update the first parameters of the student model. In the back propagation process, gradient information of the first parameters of the student model can be determined, and the first parameters can be updated using the gradient information. For example, back propagation may, following the chain rule of calculus, sequentially compute and store the gradient information of the first parameters of the processing layers (including the input layer, intermediate layers and output layer) of the student model in order from the output layer to the input layer.
The training data set of the embodiment of the present application can be denoted as $X = \{x_k\}_{k=1}^{m}$, where $x_k$ represents the k-th image sample in the training set, and both k and m can be positive integers. In the semi-supervised semantic segmentation task, the training data set typically comprises: a data set with pixel-level annotations and an unlabeled data set.

For convenience of description, suppose that the first $n$ image samples in the training data set X are labeled data and the remaining $m - n$ image samples are unlabeled data. The labeled data set is denoted as $X_L = \{(x_i, y_i)\}_{i=1}^{n}$, where $y$ refers to the pixel-level annotation of the corresponding image; the unlabeled data set is denoted as $X_U = \{x_j\}_{j=n+1}^{m}$.

For labeled data $(x_i, y_i)$, $x_i$ characterizes the i-th input image and $y_i$ is the pixel-level annotation corresponding to $x_i$. Suppose the segmentation result output by the student model for $x_i$ is $p_i$; error information can then be determined based on $p_i$ and $y_i$, and the first parameter of the student model can be updated according to the error information. The embodiment of the present application can determine the error information using a cross-entropy loss function, a logarithmic loss function, a mean square error loss function, or the like.
The method for updating the first parameter of the student model can comprise the following steps: a gradient descent method, a newton method, a quasi-newton method, or a conjugate gradient method, etc., and it is understood that the embodiment of the present application is not limited to a specific update method.
The embodiment of the application can characterize the mapping relation between the error information and the first parameter through the loss function. In practical applications, a partial derivative may be obtained for the first parameter, and the obtained partial derivative is written out in a form of a vector, and a vector corresponding to the partial derivative may be referred to as gradient information corresponding to the first parameter. The updating amount corresponding to the first parameter can be obtained according to the gradient information and the step length information.
When the gradient descent method is used, a batch gradient descent method, a random gradient descent method, a small batch gradient descent method, or the like may be used. In a particular implementation, the iteration may be performed from one input image; alternatively, the iteration may be performed from a plurality of input images. The convergence condition of the iteration may be: the error information meets a first preset condition. The first preset condition may be: the absolute value of the difference between the error information and the first preset value is smaller than the threshold value of the difference; or the number of iterations exceeds a threshold number of times, etc. In other words, in case the error information meets the first preset condition, the iteration may be ended; in this case, a first target value of the first parameter of the student model may be obtained.
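As a hedged illustration of the labeled-data training step described above (not the patent's reference implementation), the sketch below assumes PyTorch, the TinySegmentationModel from the earlier sketch, cross-entropy loss and plain SGD:

```python
import torch
import torch.nn as nn

model = TinySegmentationModel(num_classes=21)      # student model (assumed)
criterion = nn.CrossEntropyLoss()                  # one possible error-information choice
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def supervised_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    _, logits = model(images)         # forward propagation -> segmentation result p_i
    loss = criterion(logits, labels)  # error information from p_i and y_i
    optimizer.zero_grad()
    loss.backward()                   # back propagation: gradients of the first parameters
    optimizer.step()                  # gradient-descent update of the first parameters
    return loss.item()
```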
In the training process with the labeled data, after the first parameter is updated once, the second parameter of the teacher model can be updated according to the updated first parameter. The method for updating the second parameter may include: an exponentially weighted average method, etc.
It is assumed that the first parameter of the current time may refer to the first parameter of the ith time, the second parameter of the last time may refer to the second parameter of the (i-1) th time, i may refer to the number of iterations, and i may be a positive integer. Specifically, a first weight and a second weight corresponding to the current first parameter and the last second parameter may be set, respectively, and the current first parameter and the last second parameter may be weighted according to the first weight and the second weight. Wherein the first weight and the second weight may be between [0,1], the sum of the first weight and the second weight may be 1, and the second weight may be a value close to 1, such as 0.99, etc.
The updating process of the second parameter is shown in formula (1):

$$\theta'_{i} = \beta\,\theta'_{i-1} + (1 - \beta)\,\theta_{i} \tag{1}$$

where $\theta'_{i}$ denotes the second parameter after the i-th update, $\theta_{i}$ denotes the current first parameter, and $\beta$ represents the second weight.
Referring to fig. 2, a schematic diagram of a labeled-data training method of a semantic segmentation model according to an embodiment of the present application is shown. The semantic segmentation model may include: a student model and a teacher model; the student model may include: a first encoding module and a first decoding module, and the teacher model may include: a second encoding module and a second decoding module. The teacher model and the student model may have the same network structure and may have different parameters; for example, the parameter of the teacher model may be the second parameter, and the parameter of the student model may be the first parameter.
The embodiment of the application can train the student model by utilizing the labeled data. The forward propagation of the student model can obtain a segmentation result, error information can be determined according to the segmentation result and the pixel level label, and the backward propagation of the student model is performed according to the error information, so that the first parameter of the student model can be updated in the backward propagation process. After one update of the first parameter is completed, the second parameter of the teacher model may be updated according to the updated first parameter.
After training with labeled data is completed, training without labeled data may be performed.
Referring to fig. 3, a flowchart illustrating the steps of a training method of a semantic segmentation model according to an embodiment of the present application is shown. The semantic segmentation model specifically includes: a teacher model and a student model, and the training data of the semantic segmentation model can include: a plurality of unlabeled images of at least one unlabeled image under multiple kinds of scale information. Of course, the training data mentioned here may include N (N may be a natural number greater than 1) unlabeled images, with a plurality of unlabeled images under multiple kinds of scale information corresponding to each of them; the method specifically comprises the following steps:
step 301, determining a first output and a second output corresponding to a plurality of unmarked images by respectively using a teacher model and a student model;
step 302, determining loss information according to the first output and the second output;
step 303, updating the first parameter of the student model according to the loss information;
wherein the type corresponding to the first output and the second output may include: coding feature type and/or semantic representation type; in the case where the first output and the second output correspond to a type of encoding feature, the first output includes: a first encoding feature, the second output comprising: a second coding feature; in the case where the first output and the second output correspond to semantic representation types, the first output includes: a first semantic representation, the second output comprising: a second semantic representation;
step 302 determines loss information based on the first output and the second output, which may include:
step 321, generating pseudo labels corresponding to the multiple label-free images according to the first semantic representation; determining first loss information based on the second semantic representation and the pseudo tag; and/or
Step 322, performing pooling processing with multiple kinds of scale information on the first coding feature and the second coding feature respectively to obtain a first pooling feature and a second pooling feature; and determining second loss information according to the second pooling feature and the pooled fusion feature of the first pooling features under the multiple kinds of scale information.
The embodiment of the present application can take any image in the unlabeled data set $X_U$ as an unlabeled image, and obtain a plurality of unlabeled images of that unlabeled image under multiple kinds of scale information. There may be two kinds of scale information, or more than two kinds. For example, the multiple kinds of scale information may include: (0.75X, 1.00X, 1.25X); alternatively: (0.75X, 1.00X); alternatively: (1.00X, 1.25X); alternatively: (0.5X, 0.75X, 1.00X, 1.25X, 1.5X). The number before X represents the scaling factor relative to the unlabeled image; according to the scaling factor, the embodiment of the present application can scale an unlabeled image to obtain a corresponding unlabeled image. It can be understood that those skilled in the art can determine the multiple kinds of scale information according to actual application requirements, and the embodiment of the present application does not limit the specific multiple kinds of scale information.
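A sketch of how the plurality of unlabeled images might be produced from one unlabeled image, assuming bilinear interpolation for the scaling (the helper name and default factors are illustrative):

```python
import torch
import torch.nn.functional as F

def multi_scale_views(image: torch.Tensor, factors=(0.75, 1.0, 1.25)):
    """image: (N, C, H, W) unlabeled image batch; returns one view per scaling factor."""
    h, w = image.shape[-2:]
    return [F.interpolate(image, size=(int(h * s), int(w * s)),
                          mode="bilinear", align_corners=False)
            for s in factors]
```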
In step 301, the plurality of label-free images may be input into a teacher model and a student model, respectively, and a first output and a second output corresponding to the plurality of label-free images may be output by the teacher model and the student model, respectively.
It should be noted that, in the unlabeled-data training process, the training data may include m (m may be a natural number greater than 1) unlabeled images, and each of these unlabeled images corresponds to a plurality of unlabeled images under N (N may be a natural number greater than 1) kinds of scale information. The training corresponding to the m unlabeled images may be performed in parallel or in series, and the embodiment of the present application does not limit the specific training order of the m unlabeled images.
The method shown in fig. 3 can be used for training without labeled data. In the training process without the labeled data, the initial value of the first parameter of the student model can be obtained through training with the labeled data, and the initial value of the second parameter of the teacher model can be obtained through training with the labeled data.
Specifically, training with labeled data may be performed before training without labeled data to obtain an initial value of the first parameter and an initial value of the second parameter. In the training process with the labeled data, under the condition that the error information meets a first preset condition, the iteration can be ended; in this case, a first target value of the first parameter of the student model may be obtained; assume that a second target value of the second parameter of the teacher model is obtained from the first target value of the first parameter using formula (1). Then, in the training process without the labeled data, the initial value of the first parameter may be: the first target value of the first parameter, and the initial value of the second parameter may be: a second target value of the second parameter.
The type to which the first output and the second output correspond may include: a coding feature type, and/or a semantic representation type.
Wherein, in case that the first output and the second output correspond to the type of the encoding feature, the first output may include: the first encoding feature, the second output may include: a second encoding feature. The first coding feature may be output by a coding module of the teacher model and the second coding feature may be output by a coding module of the student model.
In the case where the first output and the second output correspond to semantic representation types, the first output may include: the first semantic representation, the second output may include: and a second semantic representation. The first semantic representation may be output by a decoding module of the teacher model and the second semantic representation may be output by a decoding module of the student model.
In step 302, the determined loss information may include: at least one of the first loss information and the second loss information. The first loss information can represent the loss of the segmentation accuracy dimension represented by the output of the semantic segmentation model in the aspect of semantic representation types. The second loss information can represent the loss of the dimension of the coding feature represented by the output of the semantic segmentation model in the aspect of the coding feature type.
The first loss information can be used for improving the matching degree of the semantic representation under the multi-scale change scene, and the semantic representation can influence the segmentation accuracy of the model, so that the first loss information can be used for improving the segmentation accuracy of the semantic segmentation model under the multi-scale scene.
Referring to fig. 4, a schematic diagram of a training process of a semantic segmentation model according to an embodiment of the present application is shown, in which a plurality of unlabeled images corresponding to (0.75X, 1.00X, 1.25X) and other multi-scale information may be input into a teacher model and a student model, respectively.
In the branch of the teacher model, a first semantic representation can be output through the teacher model, and pseudo labels corresponding to a plurality of unmarked images are generated according to the first semantic representation. In the branch of the student model, a second semantic representation can be output through the student model, and first loss information is determined according to the second semantic representation and the pseudo label; further, back propagation of the student model can be performed according to the first loss information.
For the same unlabeled image sample $x_u$, the corresponding plurality of unlabeled images can be expressed as: (1.25X $x_u^{1.25}$, 1.0X $x_u^{1.0}$ and 0.75X $x_u^{0.75}$). The second semantic representations output by the student model can be recorded as $p^{1.25}$, $p^{1.0}$ and $p^{0.75}$, and the first semantic representations output by the teacher model can be recorded as $\tilde{p}^{1.25}$, $\tilde{p}^{1.0}$ and $\tilde{p}^{0.75}$, each with C2 channels, where C2 may represent the number of preset categories that need to be segmented. The goal of the first loss information is that the student model and the teacher model obtain consistent semantic representations across the multi-scale information.
The step 321 of generating pseudo labels corresponding to a plurality of label-free images may further include: carrying out scale alignment processing on the first semantic representations respectively corresponding to the multiple non-labeled images to obtain first alignment semantic representations respectively corresponding to the multiple non-labeled images; performing fusion processing on the first alignment semantic representations respectively corresponding to the multiple non-labeled images to obtain fusion semantic representations; and generating pseudo labels corresponding to the multiple label-free images according to the fusion semantic representation.
Because the plurality of unlabeled images differ in scale information, the first semantic representations output by the teacher model also differ in spatial resolution. To align these scale differences, scale alignment processing such as linear interpolation can be used to align the spatial resolutions of $\tilde{p}^{1.25}$ and $\tilde{p}^{0.75}$ with that of $\tilde{p}^{1.0}$. After alignment, the first alignment semantic representations $\tilde{p}^{1.25}$, $\tilde{p}^{1.0}$ and $\tilde{p}^{0.75}$ have the same spatial resolution, and fusion processing such as a mean operation can be performed on them to obtain the fusion semantic representation.
The mean operation can be represented as formula (2), where mean() represents the averaging function:

$$\bar{p} = \mathrm{mean}\left(\tilde{p}^{0.75},\ \tilde{p}^{1.0},\ \tilde{p}^{1.25}\right) \tag{2}$$

Further, the embodiment of the present application can generate the pseudo label $\hat{y}$ according to the fusion semantic representation $\bar{p}$; the generation process can refer to formula (3):

$$\hat{y} = \arg\max_{c}\ \bar{p} \tag{3}$$

where each pixel value of the pseudo label $\hat{y}$ is a non-negative integer less than C2, i.e. $\hat{y} \in \{0, 1, \ldots, C2 - 1\}$.
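The following sketch traces formulas (2) and (3) under the assumption that the teacher's first semantic representations are softmax probability maps of shape (N, C2, H, W); the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def make_pseudo_label(p075: torch.Tensor, p100: torch.Tensor, p125: torch.Tensor):
    size = p100.shape[-2:]
    # Scale alignment: bring all first semantic representations to the 1.0X resolution.
    aligned = [F.interpolate(p, size=size, mode="bilinear", align_corners=False)
               for p in (p075, p100, p125)]
    fused = torch.stack(aligned).mean(dim=0)   # formula (2): mean fusion
    return fused.argmax(dim=1)                 # formula (3): pixel values in {0, ..., C2-1}
```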
The process of determining the first loss information according to the second semantic representation and the pseudo tag in the embodiment of the application may include:
carrying out scale alignment processing on the second semantic representations respectively corresponding to the plurality of unlabeled images to obtain second alignment semantic representations respectively corresponding to the plurality of unlabeled images, and determining first loss information according to the second alignment semantic representations and the pseudo label; or

converting the pseudo label into a target pseudo label that matches the second semantic representation, and determining first loss information according to the second semantic representation and the target pseudo label.
The second semantic representations of the three scales output by the first decoding module of the student model can be respectively recorded as: $p^{1.25}$, $p^{1.0}$ and $p^{0.75}$. Performing scale alignment processing such as linear interpolation on the second semantic representations of the three scales yields the following second alignment semantic representations: $\hat{p}^{1.25}$, $\hat{p}^{1.0}$ and $\hat{p}^{0.75}$.

Further, the pseudo label $\hat{y}$ can be taken as the potential label of the second alignment semantic representations of the three scales, and the first loss information between the pseudo label and the second alignment semantic representations can be determined using a loss function such as cross entropy. The first loss information can be determined according to formula (4):

$$L_{1} = \sum_{s \in \{0.75,\ 1.0,\ 1.25\}} \mathrm{CE}\left(\hat{p}^{s},\ \hat{y}\right) \tag{4}$$

where $s$ ranges over the scaling factors of the plurality of unlabeled images, $\mathrm{CE}$ represents the cross-entropy function, and $L_{1}$ denotes the first loss information.
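A sketch of formula (4), assuming the student outputs per-scale softmax probability maps and that the scale alignment uses bilinear interpolation as above; the function name is hypothetical:

```python
import torch
import torch.nn.functional as F

def first_loss(student_probs_by_scale, pseudo_label: torch.Tensor) -> torch.Tensor:
    size = pseudo_label.shape[-2:]
    loss = torch.zeros(())
    for p in student_probs_by_scale:   # one (N, C2, h_s, w_s) tensor per scaling factor
        # Scale alignment of the second semantic representation to the pseudo label.
        p = F.interpolate(p, size=size, mode="bilinear", align_corners=False)
        # Cross entropy between the aligned representation and the pseudo label.
        loss = loss + F.nll_loss(p.clamp_min(1e-8).log(), pseudo_label)
    return loss
```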
In step 322, a pooling operator may be first utilized to pool the first and second coding features with multiple kinds of scale information, respectively, to obtain first and second pooled features.
The first and second encoding features may be output by the encoding modules of the teacher model and the student model, respectively. The coding characteristics output by the coding modules of the teacher model and the student model can have information difference on the receptive field. The receptive field may refer to the size of the area on the input image where the pixels on the feature map of the convolutional neural network are mapped. According to the embodiment of the application, the first coding feature and the second coding feature are respectively subjected to pooling processing of multiple kinds of scale information, and information of different corresponding receptive fields can be captured aiming at specific scale information, so that the consistency of the information among different receptive fields can be improved.
Those skilled in the art can determine various kinds of scale information according to the actual application requirements. For example, the various dimensional information may include: at least two of 1 × 1, 2 × 2, 4 × 4 and 8 × 8, it is to be understood that the embodiment of the present application does not impose any limitation on the specific scale information corresponding to the pooling process.
Referring to fig. 5, a schematic diagram of a training process of a semantic segmentation model according to an embodiment of the present application is shown, in which a plurality of label-free images corresponding to (0.75X, 1.00X, 1.25X) and other multi-scale information can be input into a teacher model and a student model, respectively.
In the branch of the teacher model, the first coding feature can be output through a coding module of the teacher model, and the first coding feature is subjected to pooling processing of multiple kinds of scale information. In the branch of the student model, a second coding feature can be output through a coding module of the student model, and the second coding feature is subjected to pooling processing of multiple kinds of scale information.
Assume that the first encoding features from the teacher model are $M'^{1.25}$, $M'^{1.0}$ and $M'^{0.75}$, corresponding respectively to the 1.25X, 1.00X and 0.75X inputs of the original input X; and assume that the second encoding features from the student model are $M^{1.25}$, $M^{1.0}$ and $M^{0.75}$, likewise corresponding to the 1.25X, 1.00X and 0.75X inputs of the original input X.

Assuming four scales of pooling (1×1, 2×2, 4×4 and 8×8) are applied to the first encoding features M' from the teacher model, the first pooling features corresponding to $M'^{1.25}$, $M'^{1.0}$ and $M'^{0.75}$ may be recorded as $P'^{1.25}$, $P'^{1.0}$ and $P'^{0.75}$; each first pooling feature may comprise pooled features with spatial resolutions of 1×1, 2×2, 4×4 and 8×8 respectively.
In the embodiment of the application, the first pooling features corresponding to each single kind of scale information may be fused separately to obtain pooling fusion features corresponding to the multiple kinds of scale information. For example, the pooled features of the same resolution in $T_{1.25}$, $T_{1.00}$ and $T_{0.75}$ may be added and then averaged to obtain the pooling fusion feature $\bar{T}$.
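As a hedged sketch of this pooling and fusion step (PyTorch assumed; POOL_SIZES and the helper names are illustrative, not fixed by the patent):

```python
import torch
import torch.nn.functional as F

POOL_SIZES = (1, 2, 4, 8)  # the 1x1, 2x2, 4x4 and 8x8 pooling scales

def multi_scale_pool(feat):
    """Pool one encoding feature map [B, C, H, W] down to each spatial
    resolution in POOL_SIZES."""
    return [F.adaptive_avg_pool2d(feat, size) for size in POOL_SIZES]

def fuse_pooled(pooled_per_input):
    """Add the pooled features of the same resolution across the
    0.75X / 1.00X / 1.25X inputs, then average, yielding one pooling
    fusion feature per resolution."""
    return [torch.stack(same_res, dim=0).mean(dim=0)
            for same_res in zip(*pooled_per_input)]
```

For example, fuse_pooled([multi_scale_pool(m) for m in (m_075, m_100, m_125)]) would produce the pooling fusion features corresponding to $\bar{T}$.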
Assuming that the second encoding features M from the student model are pooled at the same four scales (1×1, 2×2, 4×4 and 8×8), the second pooling features corresponding to $M_{1.25}$, $M_{1.00}$ and $M_{0.75}$ may be denoted as $S_{1.25}$, $S_{1.00}$ and $S_{0.75}$, respectively; wherein each second pooling feature may comprise pooled features with spatial resolutions of 1×1, 2×2, 4×4 and 8×8, respectively.
the second loss information can be determined according to the pooling fusion characteristics of the first pooling characteristics under the condition of the scale information and the second pooling characteristics. The second loss information may be determined according to a distance metric method between the pooled fusion feature and the second pooled feature. The distance measurement method may include: manhattan distance, or euclidean distance, etc. Equation (5) is an example of a process of determining the second loss information:
Figure 331916DEST_PATH_IMAGE065
(5)
wherein the content of the first and second substances,
Figure 914207DEST_PATH_IMAGE066
representing the input image at different scaling factors,
Figure 432913DEST_PATH_IMAGE067
for calculating
Figure 391642DEST_PATH_IMAGE068
The distance, ilrlloss, represents the second loss information.
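Continuing the sketch above, the second loss of formula (5) might be computed as follows (the L1 distance is assumed here; the Euclidean distance would work equally):

```python
def second_loss(fusion_feats, student_pooled_per_input):
    """Sketch of formula (5): accumulate the distance between each
    pooling fusion feature and the student's second pooling feature
    of the same resolution, over all scale inputs."""
    loss = 0.0
    for pooled in student_pooled_per_input:         # one list per scale input
        for t_bar, s in zip(fusion_feats, pooled):  # one tensor per pool size
            loss = loss + (t_bar - s).abs().mean()  # mean L1 distance
    return loss
```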
The training process shown in fig. 4 may correspond to technical solution A: according to the first loss information, technical solution A may perform back propagation on the first decoding module and the first encoding module included in the student model to update the first parameters of the first decoding module and the first encoding module. After one update of the first parameters is completed, the second parameter of the teacher model (including the second decoding module and the second encoding module) may be updated according to the updated first parameters with reference to formula (1).
The training process shown in fig. 5 may be the technical solution B, and the technical solution B may perform back propagation on the first encoding module included in the student model according to the second loss information, so as to update the first parameter of the first encoding module. After the first parameter of the first encoding module is updated once, the second parameter of the second encoding module of the teacher model may be updated according to the updated first parameter with reference to formula (1).
Referring to fig. 6, a schematic diagram of a training process of a semantic segmentation model according to an embodiment of the present application is shown, wherein a student model can be propagated backward according to the first loss information and the second loss information at the same time. Specifically, the updating of the first parameter of the first decoding module may be based on the first loss information; the updating of the first parameter of the first encoding module may be based on both the first loss information and the second loss information. After the update of the first parameter is completed once, the second parameter of the teacher model (including the second decoding module and the second encoding module) may be updated according to the updated first parameter with reference to formula (1).
The training process shown in fig. 6 may correspond to technical solution C, in which back propagation is performed on the student model according to the first loss information and the second loss information simultaneously.

One difference among technical solutions A, B and C lies in the loss information adopted: technical solution A adopts the first loss information, technical solution B adopts the second loss information, and technical solution C adopts both the first loss information and the second loss information.

Another difference lies in the back propagation range of the loss information: the back propagation range of the first loss information covers the first decoding module and the first encoding module, while the back propagation range of the second loss information covers only the first encoding module.

In practical applications, a person skilled in the art can adopt any one of technical solutions A, B and C according to actual application requirements.
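Whichever solution is adopted, the teacher update referenced as formula (1) is an exponentially weighted average of the student's parameters. A minimal sketch follows (the decay value 0.99 is an illustrative assumption):

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, decay=0.99):
    """Blend the student's updated first parameters into the teacher's
    second parameters by exponential moving average."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)
```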
The method for updating the first parameter of the student model may include: gradient descent, Newton's method, quasi-Newton methods, the conjugate gradient method, and the like; it is understood that the embodiment of the present application is not limited to a specific update method.

In the training process without labeled data shown in fig. 3, the embodiment of the present application may characterize the mapping relationship between the first loss information or the second loss information and the first parameter via a loss function. In practical applications, partial derivatives may be taken with respect to the first parameter and written in the form of a vector; this vector may be referred to as the gradient information corresponding to the first parameter. The update amount corresponding to the first parameter can then be obtained from the gradient information and the step length information.
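As an arithmetic illustration of turning gradient information and step length information into an update amount (a plain gradient-descent step; the learning rate value is an assumed hyperparameter):

```python
import torch

def sgd_step(params, lr=0.01):
    """One gradient-descent update: the update amount is the gradient
    scaled by the step length lr, applied against the gradient."""
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr)
                p.grad.zero_()
```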
When the gradient descent method is used, batch gradient descent, stochastic gradient descent, mini-batch gradient descent, or the like may be adopted. In a specific implementation, the iteration may proceed from one input image, or from a plurality of input images. The convergence condition of the iteration may be that the first loss information or the second loss information meets a second preset condition. The second preset condition may be: the absolute value of the difference between the first loss information or the second loss information and a second preset value is smaller than a difference threshold; or the number of iterations exceeds a threshold number, etc. In other words, in case the first loss information or the second loss information meets the second preset condition, the iteration may be ended; in this case, a third target value of the first parameter of the student model can be obtained, that is, the training of the student model according to the unlabeled data is completed, and the third target value can be used in the subsequent semantic segmentation process. With reference to the training process without labeled data, the embodiment of the present application may update the second parameter of the teacher model according to the updated first parameter; when the training of the teacher model according to the unlabeled data is completed, the second parameter of the teacher model may be a fourth target value, and the fourth target value may be used in the subsequent semantic segmentation process.
In summary, in the training method of the semantic segmentation model according to the embodiment of the application, in the training process of the semantic segmentation model, a teacher model and a student model are respectively used to determine a first output and a second output corresponding to a plurality of unlabeled images; determining loss information based on the first output and the second output; and updating the first parameter of the student model according to the loss information.
The loss information of the embodiment of the present application may include: at least one of the first loss information and the second loss information. The first loss information can represent a loss in the dimension of segmentation accuracy when the output of the semantic segmentation model is of the semantic representation type. The second loss information can represent a loss in the dimension of encoding features when the output of the semantic segmentation model is of the encoding feature type.

In the embodiment of the application, by improving the segmentation accuracy of the semantic segmentation model in the multi-scale scene and/or improving the accuracy of the encoding features of the semantic segmentation model in the multi-scale scene, the accuracy of the pseudo labels corresponding to the plurality of unlabeled images can be improved, the matching degree between the pseudo labels and the potential real labels can be improved, and further the performance of the semantic segmentation model can be improved.
Method embodiment two
In this embodiment, a semantic segmentation process of a semantic segmentation model is described, and the semantic segmentation model can perform semantic segmentation on an image to be processed to obtain a corresponding segmentation result.
Referring to fig. 7, a schematic flow chart illustrating steps of a semantic segmentation method according to an embodiment of the present application is shown, where the method may specifically include the following steps:
step 701, receiving an image to be processed;
step 702, performing semantic segmentation on the image to be processed by using a teacher model or a student model of a semantic segmentation model to obtain a corresponding segmentation result;
wherein the training data of the semantic segmentation model may include: a plurality of unmarked images under the information of various scales corresponding to at least one unmarked image; the training process of the semantic segmentation model can comprise the following steps: determining a first output and a second output corresponding to the multiple unmarked images by respectively using a teacher model and a student model; determining loss information based on the first output and the second output; updating a first parameter of the student model according to the loss information;
wherein the type corresponding to the first output and the second output may include: coding feature type and/or semantic representation type; in the case where the first output and the second output correspond to a type of encoding feature, the first output includes: a first encoding feature, the second output comprising: a second coding feature; in the case where the first output and the second output correspond to semantic representation types, the first output includes: a first semantic representation, the second output comprising: a second semantic representation;
the loss information may include: first loss information and/or second loss information; the first loss information may be obtained according to the second semantic representation and pseudo labels corresponding to the multiple unlabeled images, and the pseudo labels may be obtained according to the first semantic representation; the second loss information may be obtained according to the pooling fusion features of the first pooling features under the multiple kinds of scale information and the second pooling features; the first pooling feature can be obtained by pooling the first coding feature at multiple kinds of scale information; the second pooling feature can be obtained by pooling the second coding feature at multiple kinds of scale information.
In step 702, the semantic segmentation model may perform semantic segmentation on the image to be processed according to the process shown in fig. 1. Specifically, the coding module in the semantic segmentation model can extract the coding features of the image to be processed. And a decoding module in the semantic segmentation model can determine a segmentation result corresponding to the image to be processed according to the coding characteristics.
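A minimal sketch of this inference flow (the attribute names encoder and decoder are illustrative assumptions standing in for the coding and decoding modules):

```python
import torch

@torch.no_grad()
def segment(model, image):
    """The coding module extracts encoding features of the image to be
    processed; the decoding module maps them to per-pixel class scores;
    argmax over classes yields the segmentation result."""
    features = model.encoder(image)    # encoding features
    logits = model.decoder(features)   # [B, num_classes, H, W]
    return logits.argmax(dim=1)        # per-pixel class indices
```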
The method and the device can perform semantic segmentation on the image to be processed by utilizing a teacher model or a student model of the semantic segmentation model. The first parameter of the student model is obtained under the iterative convergence condition, and the student model is used for semantic segmentation, so that the performance of the semantic segmentation can be improved. The second parameter of the teacher model is obtained by performing exponential weighted average on the first parameter, and the exponential weighted average can improve the smoothness of the second parameter, so that the teacher model can improve the generalization capability of semantic segmentation.
In summary, in the semantic segmentation method according to the embodiment of the present application, the adopted semantic segmentation model updates the first parameter of the student model according to the loss information. The loss information may include: at least one of the first loss information and the second loss information. The first loss information can represent a loss in the dimension of segmentation accuracy when the output of the semantic segmentation model is of the semantic representation type. The second loss information can represent a loss in the dimension of encoding features when the output of the semantic segmentation model is of the encoding feature type.

In the embodiment of the application, by improving the segmentation accuracy of the semantic segmentation model in the multi-scale scene and/or improving the accuracy of the encoding features of the semantic segmentation model in the multi-scale scene, the accuracy of the pseudo labels corresponding to the plurality of unlabeled images can be improved, the matching degree between the pseudo labels and the potential real labels can be improved, the performance of the semantic segmentation model can be improved, and the accuracy of the segmentation result can be improved.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the embodiments. Further, those of skill in the art will recognize that the embodiments described in this specification are presently preferred embodiments and that no particular act is required to implement the embodiments of the disclosure.
On the basis of the foregoing embodiment, this embodiment further provides a training apparatus for a semantic segmentation model, and referring to fig. 8, the semantic segmentation model may include: teacher model and student model, the training data of this semantic segmentation model includes: a plurality of unmarked images under the information of various scales corresponding to at least one unmarked image; the device may specifically include: a model processing module 801, a loss processing module 802 and a first parameter update module 803.
The model processing module 801 is configured to determine a first output and a second output corresponding to the multiple unlabeled images by using a teacher model and a student model respectively;
a loss processing module 802, configured to determine loss information according to the first output and the second output;
a first parameter updating module 803, configured to update a first parameter of the student model according to the loss information;
wherein the type corresponding to the first output and the second output may include: coding feature type and/or semantic representation type; in the case where the first output and the second output correspond to a type of encoding feature, the first output may include: the first encoding feature, the second output may include: a second coding feature; in the case where the first output and the second output correspond to semantic representation types, the first output may include: the first semantic representation, the second output may include: a second semantic representation;
the loss processing module 802 may include:
a first loss processing module 821, configured to generate pseudo labels corresponding to the multiple unlabeled images according to the first semantic representation; determining first loss information based on the second semantic representation and the pseudo tag; and/or
A second loss processing module 822, configured to perform pooling processing on the first coding feature and the second coding feature with multiple kinds of scale information, respectively, to obtain a first pooled feature and a second pooled feature; and determining second loss information according to the pooled fusion features of the first pooled features under the condition of the scale information and the second pooled features.
Optionally, the first loss processing module 821 may specifically include:
the first scale alignment processing module is used for carrying out scale alignment processing on the first semantic representations respectively corresponding to the multiple non-labeled images so as to obtain the first alignment semantic representations respectively corresponding to the multiple non-labeled images;
the first fusion processing module is used for performing fusion processing on the first alignment semantic representations respectively corresponding to the multiple unlabeled images to obtain fusion semantic representations;
and the pseudo label generating module is used for generating the pseudo labels corresponding to the multiple label-free images according to the fusion semantic representation.
Optionally, the first loss processing module 821 may specifically include:
the first loss determining module is used for respectively representing the corresponding second alignment semantics; determining first loss information according to the second alignment semantic representation and the pseudo label; or alternatively
A second loss determination module to convert the pseudo-label to a target pseudo-label that matches the second semantic representation; and determining first loss information according to the second semantic representation and the target pseudo label.
Optionally, the determining process of the pooled fusion features comprises: and respectively carrying out fusion processing on the first pooling features corresponding to the single scale information to obtain pooling fusion features corresponding to the multiple scale information.
Optionally, the apparatus may further include:
and the second parameter updating module is used for updating the second parameter of the teacher model according to the updated first parameter.
On the basis of the foregoing embodiment, this embodiment further provides a semantic segmentation apparatus, and with reference to fig. 9, the apparatus may include:
a receiving module 901, configured to receive an image to be processed;
a semantic segmentation module 902, configured to perform semantic segmentation on the to-be-processed image by using a teacher model or a student model of a semantic segmentation model to obtain a corresponding segmentation result;
wherein, the training data of the semantic segmentation model comprises: a plurality of unmarked images under various scale information corresponding to at least one unmarked image; the training process of the semantic segmentation model comprises the following steps: determining a first output and a second output corresponding to the multiple unmarked images by respectively using a teacher model and a student model; determining loss information based on the first output and the second output; updating a first parameter of the student model according to the loss information;
wherein the type corresponding to the first output and the second output comprises: coding feature type and/or semantic representation type; in the case where the first output and the second output correspond to a type of encoding feature, the first output includes: a first encoding feature, the second output comprising: a second coding feature; in the case where the first output and the second output correspond to semantic representation types, the first output includes: a first semantic representation, the second output comprising: a second semantic representation;
the loss information may include: first loss information and/or second loss information; the second loss information is obtained according to the second semantic representation and pseudo labels corresponding to the multiple label-free images, and the pseudo labels can be obtained according to the first semantic representation; the second loss information can be obtained according to the pooling fusion characteristics of the first pooling characteristics under the condition of the scale information and the second pooling characteristics; the first pooling feature can be obtained by pooling multiple kinds of scale information of the first coding feature; the second pooling feature may be obtained by pooling information of multiple scales of the second coding feature.
The present application further provides a non-transitory, readable storage medium, where one or more modules (programs) are stored, and when the one or more modules are applied to a device, the device may execute instructions (instructions) of method steps in this application.
Embodiments of the present application provide one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an electronic device to perform the methods as described in one or more of the above embodiments. In the embodiment of the present application, the electronic device includes various types of devices such as a terminal device and a server (cluster).
Embodiments of the disclosure may be implemented as an apparatus for performing desired configurations using any suitable hardware, firmware, software, or any combination thereof, which may include: electronic devices such as terminal devices and servers (clusters). Fig. 10 schematically illustrates an example apparatus 1100 that may be used to implement various embodiments described herein.
For one embodiment, fig. 10 illustrates an example apparatus 1100 having one or more processors 1102, a control module (chipset) 1104 coupled to at least one of the processor(s) 1102, a memory 1106 coupled to the control module 1104, a non-volatile memory (NVM)/storage 1108 coupled to the control module 1104, one or more input/output devices 1110 coupled to the control module 1104, and a network interface 1112 coupled to the control module 1104.
The processor 1102 may include one or more single-core or multi-core processors, and the processor 1102 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 1100 can be implemented as a terminal device, a server (cluster), or the like in the embodiments of the present application.
In some embodiments, the apparatus 1100 may include one or more computer-readable media (e.g., the memory 1106 or the NVM/storage 1108) having instructions 1114 and one or more processors 1102 in combination with the one or more computer-readable media configured to execute the instructions 1114 to implement modules to perform the actions described in this disclosure.
For one embodiment, control module 1104 may include any suitable interface controllers to provide any suitable interface to at least one of the processor(s) 1102 and/or to any suitable device or component in communication with control module 1104.
The control module 1104 may include a memory controller module to provide an interface to the memory 1106. The memory controller module may be a hardware module, a software module, and/or a firmware module.
The memory 1106 may be used, for example, to load and store data and/or instructions 1114 for the device 1100. For one embodiment, memory 1106 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, the memory 1106 may comprise a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, control module 1104 may include one or more input/output controllers to provide an interface to NVM/storage 1108 and input/output device(s) 1110.
For example, NVM/storage 1108 may be used to store data and/or instructions 1114. NVM/storage 1108 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 1108 may include storage resources that are physically part of the device on which apparatus 1100 is installed, or it may be accessible by the device and need not be part of the device. For example, NVM/storage 1108 may be accessed over a network via input/output device(s) 1110.
Input/output device(s) 1110 may provide an interface for apparatus 1100 to communicate with any other suitable device; input/output devices 1110 may include communication components, audio components, sensor components, and so forth. Network interface 1112 may provide an interface for device 1100 to communicate over one or more networks; device 1100 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols, such as a communication-standard-based wireless network, e.g., WiFi, 2G, 3G, 4G, 5G, etc., or a combination thereof.
For one embodiment, at least one of the processor(s) 1102 may be packaged together with logic for one or more controller(s) (e.g., memory controller module) of the control module 1104. For one embodiment, at least one of the processor(s) 1102 may be packaged together with logic for one or more controllers of control module 1104 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 1102 may be integrated on the same die with logic for one or more controller(s) of the control module 1104. For one embodiment, at least one of the processor(s) 1102 may be integrated on the same die with logic for one or more controller(s) of control module 1104 to form a system on chip (SoC).
In various embodiments, the apparatus 1100 may be, but is not limited to being: a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.), among other terminal devices. In various embodiments, the apparatus 1100 may have more or fewer components and/or different architectures. For example, in some embodiments, device 1100 includes one or more cameras, keyboards, Liquid Crystal Display (LCD) screens (including touch screen displays), non-volatile memory ports, multiple antennas, graphics chips, Application Specific Integrated Circuits (ASICs), and speakers.
The detection device can adopt a main control chip as a processor or a control module, sensor data, position information and the like are stored in a memory or an NVM/storage device, a sensor group can be used as an input/output device, and a communication interface can comprise a network interface.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The above is a detailed description of the training method and apparatus for a semantic segmentation model, the semantic segmentation method, the electronic device, and the machine-readable medium provided in the embodiments of the present application. Specific examples are applied herein to explain the principles and embodiments of the present application, and the description of the above embodiments is only used to help understand the method and core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims (10)

1. A training method of a semantic segmentation model, characterized in that the semantic segmentation model comprises: a teacher model and a student model, and the training data of the semantic segmentation model includes: a plurality of unlabeled images under multiple kinds of scale information corresponding to at least one unlabeled image; the method comprises the following steps:
determining a first output and a second output corresponding to the plurality of the non-labeled images by respectively using a teacher model and a student model;
determining loss information based on the first output and the second output;
updating a first parameter of the student model according to the loss information;
wherein the type corresponding to the first output and the second output comprises: coding feature type and/or semantic representation type; in the case where the first output and the second output correspond to a type of encoding feature, the first output comprises: a first encoding feature, the second output comprising: a second coding feature; in the case that the first output and the second output correspond to semantic representation types, the first output comprises: a first semantic representation, the second output comprising: a second semantic representation;
said determining loss information from said first output and said second output comprises:
generating pseudo labels corresponding to the multiple label-free images according to the first semantic representation; determining first loss information according to the second semantic representation and the pseudo label; and/or
Performing pooling processing on the first coding feature and the second coding feature with multiple kinds of scale information respectively to obtain a first pooling feature and a second pooling feature; and determining second loss information according to the pooling fusion characteristics of the first pooling characteristics under the scale information condition and the second pooling characteristics.
2. The method of claim 1, wherein generating pseudo labels corresponding to the plurality of label-free images according to the first semantic representation comprises:
carrying out scale alignment processing on the first semantic representations respectively corresponding to the multiple non-labeled images to obtain first alignment semantic representations respectively corresponding to the multiple non-labeled images;
performing fusion processing on the first alignment semantic representations respectively corresponding to the multiple non-labeled images to obtain fusion semantic representations;
and generating pseudo labels corresponding to the multiple label-free images according to the fusion semantic representation.
3. The method of claim 1, wherein determining first loss information based on the second semantic representation and the pseudo tag comprises:
carrying out scale alignment processing on the second semantic representations respectively corresponding to the multiple non-labeled images to obtain second alignment semantic representations respectively corresponding to the multiple non-labeled images; determining first loss information according to the second alignment semantic representation and the pseudo label; or
Converting the pseudo-label to a target pseudo-label that matches the second semantic representation; and determining first loss information according to the second semantic representation and the target pseudo label.
4. The method of claim 1, wherein the determining of the pooled fusion features comprises: and respectively carrying out fusion processing on the first pooling features corresponding to the single scale information to obtain pooling fusion features corresponding to the multiple scale information.
5. The method according to any one of claims 1 to 4, further comprising:
and updating the second parameter of the teacher model according to the updated first parameter.
6. A method of semantic segmentation, the method comprising:
receiving an image to be processed;
performing semantic segmentation on the image to be processed by using a teacher model or a student model of a semantic segmentation model to obtain a corresponding segmentation result;
wherein the training data of the semantic segmentation model comprises: a plurality of unmarked images under the information of various scales corresponding to at least one unmarked image; the training process of the semantic segmentation model comprises the following steps: determining a first output and a second output corresponding to the plurality of the non-labeled images by respectively using a teacher model and a student model; determining loss information based on the first output and the second output; updating a first parameter of the student model according to the loss information;
wherein the type corresponding to the first output and the second output comprises: coding feature type and/or semantic representation type; in the case where the first output and the second output correspond to a type of encoding feature, the first output comprises: a first encoding feature, the second output comprising: a second coding feature; in the case that the first output and the second output correspond to semantic representation types, the first output comprises: a first semantic representation, the second output comprising: a second semantic representation;
the loss information includes: first loss information and/or second loss information; the first loss information is obtained according to the second semantic representation and pseudo labels corresponding to a plurality of unlabeled images, and the pseudo labels are obtained according to the first semantic representation; the second loss information is obtained according to the pooling fusion features of the first pooling features under the multiple kinds of scale information and the second pooling features; the first pooling feature is obtained by pooling the first coding feature at multiple kinds of scale information; the second pooling feature is obtained by pooling the second coding feature at multiple kinds of scale information.
7. An apparatus for training a semantic segmentation model, wherein the semantic segmentation model comprises: a teacher model and a student model, and the training data of the semantic segmentation model includes: a plurality of unlabeled images under multiple kinds of scale information corresponding to at least one unlabeled image; the apparatus comprises:
the model processing module is used for determining a first output and a second output corresponding to the multiple unlabeled images by respectively utilizing a teacher model and a student model;
a loss processing module for determining loss information according to the first output and the second output;
the first parameter updating module is used for updating a first parameter of the student model according to the loss information;
wherein the type corresponding to the first output and the second output comprises: coding feature type and/or semantic representation type; in the case where the first output and the second output correspond to a type of encoding feature, the first output comprises: a first encoding feature, the second output comprising: a second coding feature; in the case that the first output and the second output correspond to semantic representation types, the first output comprises: a first semantic representation, the second output comprising: a second semantic representation;
the loss processing module includes:
the first loss processing module is used for generating pseudo labels corresponding to the multiple unlabeled images according to the first semantic representation; determining first loss information according to the second semantic representation and the pseudo label; and/or
The second loss processing module is used for respectively carrying out pooling processing on the first coding characteristic and the second coding characteristic on multiple kinds of scale information to obtain a first pooling characteristic and a second pooling characteristic; and determining second loss information according to the pooling fusion characteristics of the first pooling characteristics under the condition of scale information and the second pooling characteristics.
8. An apparatus for semantic segmentation, the apparatus comprising:
the receiving module is used for receiving the image to be processed;
the semantic segmentation module is used for performing semantic segmentation on the image to be processed by utilizing a teacher model or a student model of a semantic segmentation model to obtain a corresponding segmentation result;
wherein the training data of the semantic segmentation model comprises: a plurality of unmarked images under the information of various scales corresponding to at least one unmarked image; the training process of the semantic segmentation model comprises the following steps: determining a first output and a second output corresponding to the plurality of the non-labeled images by respectively using a teacher model and a student model; determining loss information based on the first output and the second output; updating a first parameter of the student model according to the loss information;
wherein the type corresponding to the first output and the second output comprises: coding feature type and/or semantic representation type; in the case where the first output and the second output correspond to a type of encoding feature, the first output comprises: a first encoding feature, the second output comprising: a second coding feature; in the case that the first output and the second output correspond to semantic representation types, the first output comprises: a first semantic representation, the second output comprising: a second semantic representation;
the loss information includes: first loss information and/or second loss information; the first loss information is obtained according to the second semantic representation and pseudo labels corresponding to a plurality of unlabeled images, and the pseudo labels are obtained according to the first semantic representation; the second loss information is obtained according to the pooling fusion features of the first pooling features under the multiple kinds of scale information and the second pooling features; the first pooling feature is obtained by pooling the first coding feature at multiple kinds of scale information; the second pooling feature is obtained by pooling the second coding feature at multiple kinds of scale information.
9. An electronic device, comprising: a processor; and
memory having stored thereon executable code which, when executed, causes the processor to perform the method of any of claims 1-6.
10. A machine readable medium having executable code stored thereon, which when executed, causes a processor to perform the method of any of claims 1-6.
CN202210620456.2A 2022-06-02 2022-06-02 Training method of semantic segmentation model, semantic segmentation method, semantic segmentation device and semantic segmentation medium Active CN114708436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210620456.2A CN114708436B (en) 2022-06-02 2022-06-02 Training method of semantic segmentation model, semantic segmentation method, semantic segmentation device and semantic segmentation medium

Publications (2)

Publication Number Publication Date
CN114708436A true CN114708436A (en) 2022-07-05
CN114708436B CN114708436B (en) 2022-09-02

Family

ID=82177947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210620456.2A Active CN114708436B (en) 2022-06-02 2022-06-02 Training method of semantic segmentation model, semantic segmentation method, semantic segmentation device and semantic segmentation medium

Country Status (1)

Country Link
CN (1) CN114708436B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596916A (en) * 2023-06-09 2023-08-15 北京百度网讯科技有限公司 Training of defect detection model and defect detection method and device

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087303A (en) * 2018-08-15 2018-12-25 中山大学 The frame of semantic segmentation modelling effect is promoted based on transfer learning
CN111080645A (en) * 2019-11-12 2020-04-28 中国矿业大学 Remote sensing image semi-supervised semantic segmentation method based on generating type countermeasure network
CN111489365A (en) * 2020-04-10 2020-08-04 上海商汤临港智能科技有限公司 Neural network training method, image processing method and device
CN113343775A (en) * 2021-05-13 2021-09-03 武汉大学 Deep learning generalization method for remote sensing image ground surface coverage classification
CN113496512A (en) * 2021-09-06 2021-10-12 北京字节跳动网络技术有限公司 Tissue cavity positioning method, device, medium and equipment for endoscope
US20210329267A1 (en) * 2020-04-17 2021-10-21 Qualcomm Incorporated Parallelized rate-distortion optimized quantization using deep learning
CN113569852A (en) * 2021-06-09 2021-10-29 中国科学院自动化研究所 Training method and device of semantic segmentation model, electronic equipment and storage medium
CN113763406A (en) * 2021-07-28 2021-12-07 华中师范大学 Infant brain MRI segmentation method based on semi-supervised learning
CN113850012A (en) * 2021-06-11 2021-12-28 腾讯科技(深圳)有限公司 Data processing model generation method, device, medium and electronic equipment
CN113936140A (en) * 2021-11-18 2022-01-14 上海电力大学 Evaluation method of sample attack resisting model based on incremental learning
CN113989585A (en) * 2021-10-13 2022-01-28 北京科技大学 Medium-thickness plate surface defect detection method based on multi-feature fusion semantic segmentation
CN114004973A (en) * 2021-12-30 2022-02-01 深圳比特微电子科技有限公司 Decoder for image semantic segmentation and implementation method thereof
CN114120319A (en) * 2021-10-09 2022-03-01 苏州大学 Continuous image semantic segmentation method based on multi-level knowledge distillation
CN114140390A (en) * 2021-11-02 2022-03-04 广州大学 Crack detection method and device based on semi-supervised semantic segmentation
CN114255237A (en) * 2021-11-12 2022-03-29 深圳大学 Semi-supervised learning-based image segmentation model training method and segmentation method
CN114283151A (en) * 2021-08-16 2022-04-05 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium for medical image
CN114283285A (en) * 2021-11-17 2022-04-05 华能盐城大丰新能源发电有限责任公司 Cross consistency self-training remote sensing image semantic segmentation network training method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIAFENG XIE 等: "Improving fast segmentation with teacher-student learning", 《ARXIV在线公开:HTTPS://ARXIV.ORG/ABS/1810.08476》 *
PEI HE 等: "MANet: Multi-Scale Aware-Relation Network for Semantic Segmentation in Aerial Scenes", 《IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING》 *
王逸尘: "基于深度学习的零样本语义分割研究", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *

Also Published As

Publication number Publication date
CN114708436B (en) 2022-09-02

Similar Documents

Publication Publication Date Title
CN111079532B (en) Video content description method based on text self-encoder
US20200117906A1 (en) Space-time memory network for locating target object in video content
AU2019200270A1 (en) Concept mask: large-scale segmentation from semantic concepts
CN111210446B (en) Video target segmentation method, device and equipment
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
CN109902588B (en) Gesture recognition method and device and computer readable storage medium
CN114926835A (en) Text generation method and device, and model training method and device
CN113869138A (en) Multi-scale target detection method and device and computer readable storage medium
CN111738169A (en) Handwriting formula recognition method based on end-to-end network model
CN111723238A (en) Method, device, equipment and medium for clipping video multiple events and describing text
GB2579262A (en) Space-time memory network for locating target object in video content
CN114283352A (en) Video semantic segmentation device, training method and video semantic segmentation method
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN110709855A (en) Techniques for dense video description
CN114708436B (en) Training method of semantic segmentation model, semantic segmentation method, semantic segmentation device and semantic segmentation medium
CN114708437A (en) Training method of target detection model, target detection method, device and medium
US20230042221A1 (en) Modifying digital images utilizing a language guided image editing model
CN116665110B (en) Video action recognition method and device
Qiao et al. Two-Stream Convolutional Neural Network for Video Action Recognition.
TWI803243B (en) Method for expanding images, computer device and storage medium
US11570318B2 (en) Performing global image editing using editing operations determined from natural language requests
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN113807354B (en) Image semantic segmentation method, device, equipment and storage medium
CN115049546A (en) Sample data processing method and device, electronic equipment and storage medium
CN111325068B (en) Video description method and device based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant