CN114565812A - Training method and device of semantic segmentation model and semantic segmentation method of image

Info

Publication number
CN114565812A
Authority
CN
China
Prior art keywords
pixel
semantic segmentation
data
training
loss
Prior art date
Legal status
Granted
Application number
CN202210200623.8A
Other languages
Chinese (zh)
Other versions
CN114565812B (en)
Inventor
梁俪倩
单言虎
苏治中
廖杰
Current Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd filed Critical Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN202210200623.8A priority Critical patent/CN114565812B/en
Publication of CN114565812A publication Critical patent/CN114565812A/en
Application granted granted Critical
Publication of CN114565812B publication Critical patent/CN114565812B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217 - Validation; Performance evaluation; Active pattern learning techniques
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the disclosure disclose a training method and apparatus for a semantic segmentation model and a semantic segmentation method for an image. The method comprises the following steps: acquiring training data; and training a pre-established semantic segmentation network with the training data in a semi-supervised learning mode that fuses pixel-level contrast learning and cross supervision, to obtain the semantic segmentation model. According to the embodiments of the disclosure, through cross-supervised semi-supervised learning, labeled image data and unlabeled image data can be used for training together, which effectively improves the robustness of the semantic segmentation model; combined with pixel-level contrast learning, the distance between pixel features of the same type is effectively shortened while the distance between pixel features of different types is widened, which improves the discriminability of the pixel features and thereby effectively improves the performance of the semantic segmentation model.

Description

Training method and device of semantic segmentation model and semantic segmentation method of image
Technical Field
The present disclosure relates to computer vision technologies, and in particular, to a training method and apparatus for a semantic segmentation model, and a semantic segmentation method for an image.
Background
Semi-supervised semantic segmentation is a hot topic in the computer vision field in recent years. It performs model training with a small amount of labeled image data and a large amount of unlabeled image data to obtain a semantic segmentation model, which can then classify each pixel in an image. Existing semi-supervised semantic segmentation methods usually first train an initial model on the labeled image data, and then predict on the unlabeled image data with the initial model to expand the labeled data and continue training the model. However, the performance of the semantic segmentation model obtained by such methods depends too much on the accuracy of the initial model; if the accuracy of the initial model is low, the performance of the resulting semantic segmentation model is poor.
Disclosure of Invention
To solve the technical problem of poor performance of semantic segmentation models, the embodiments of the present disclosure provide a training method and apparatus for a semantic segmentation model and a semantic segmentation method for an image.
According to an aspect of the embodiments of the present disclosure, there is provided a training method of a semantic segmentation model, including: acquiring training data; and training a pre-established semantic segmentation network by using the training data and adopting a semi-supervised learning training mode of fusing pixel-level contrast learning and cross supervision to obtain a semantic segmentation model.
According to another aspect of the embodiments of the present disclosure, there is provided a training apparatus for a semantic segmentation model, including: the first acquisition module is used for acquiring training data; and the first processing module is used for training a pre-established semantic segmentation network by using the training data and adopting a semi-supervised learning mode of fusing pixel-level contrast learning and cross supervision to obtain a semantic segmentation model.
According to still another aspect of the embodiments of the present disclosure, there is provided a semantic segmentation method for an image, including: acquiring image data to be processed; performing semantic segmentation on the image data to be processed based on a pre-obtained semantic segmentation model to obtain a semantic segmentation result corresponding to the image data to be processed; the semantic segmentation model is obtained by the training method of the semantic segmentation model according to any one of the embodiments of the present disclosure.
According to a further aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the method according to any one of the above-mentioned embodiments of the present disclosure.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method according to any of the above embodiments of the present disclosure.
Based on the training method and apparatus for the semantic segmentation model and the image semantic segmentation method provided by the embodiments of the present disclosure, a pre-established semantic segmentation network is trained in a semi-supervised learning mode that fuses pixel-level contrast learning and cross supervision to obtain the semantic segmentation model. Through cross-supervised semi-supervised learning, labeled image data and unlabeled image data can be used for training together, which effectively improves the robustness of the semantic segmentation model; combined with pixel-level contrast learning, the distance between pixel features of the same type can be effectively shortened while the distance between pixel features of different types is increased, which improves the discriminability of the pixel features and effectively improves the performance of the semantic segmentation model.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 is an exemplary application scenario of a training method of a semantic segmentation model provided by the present disclosure;
FIG. 2 is a flowchart illustrating a training method of a semantic segmentation model according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flowchart of step 202 provided by an exemplary embodiment of the present disclosure;
FIG. 4 is a flowchart of step 202 provided by another exemplary embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the overall architecture of a DeepLabV3+ network according to an exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a model training principle provided by an exemplary embodiment of the present disclosure;
FIG. 7 is a flowchart of step 2029 provided by an exemplary embodiment of the present disclosure;
FIG. 8 is a block flow diagram of a sampling process provided by an exemplary embodiment of the present disclosure;
FIG. 9 is a flowchart of step 202 provided by yet another exemplary embodiment of the present disclosure;
FIG. 10 is a flowchart illustrating a semantic segmentation method for an image according to an exemplary embodiment of the disclosure;
FIG. 11 is a schematic structural diagram of a training apparatus for a semantic segmentation model provided in an exemplary embodiment of the present disclosure;
FIG. 12 is a schematic structural diagram of a first processing module 502 according to an exemplary embodiment of the disclosure;
FIG. 13 is a schematic structural diagram of a first processing module 502 according to another exemplary embodiment of the present disclosure;
FIG. 14 is a schematic structural diagram of a fifth determining unit 50229 provided in an exemplary embodiment of the present disclosure;
FIG. 15 is a schematic structural diagram of a fourth determining unit 50228 provided in an exemplary embodiment of the present disclosure;
FIG. 16 is a schematic structural diagram of an apparatus for semantic segmentation of an image according to an exemplary embodiment of the present disclosure;
FIG. 17 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those within the art that the terms "first", "second", etc. in the embodiments of the present disclosure are used only for distinguishing between different steps, devices or modules, etc., and do not denote any particular technical meaning or necessary logical order therebetween.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects and indicates that three kinds of relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the disclosure
In the process of implementing the present disclosure, the inventor finds that, in the existing semi-supervised semantic segmentation method, an initial model is obtained through labeled image data training, and then unlabeled image data is predicted based on the initial model to expand the labeled data and continue training the model. However, the performance of the semantic segmentation model of the existing semi-supervised semantic segmentation method depends too much on the accuracy of the initial model, and if the accuracy of the initial model is not high, the performance of the obtained semantic segmentation model is poor.
Brief description of the drawings
Fig. 1 is an exemplary application scenario of the training method of the semantic segmentation model provided in the present disclosure.
For the training of the semantic segmentation model, the training method of the semantic segmentation model is executed by the training apparatus of the semantic segmentation model. Training combines cross-supervised semi-supervised learning and pixel-level contrast learning on the basis of labeled image data and unlabeled image data: the labeled image data and the unlabeled image data are input simultaneously in batches and trained together, which effectively improves the robustness of the semantic segmentation model. Supervised learning is carried out on the labeled image data, while cross-supervised learning and pixel-level contrast learning are carried out on the unlabeled image data, so that the distance between pixel features of the same type is effectively shortened while the distance between pixel features of different types is widened, which improves the discriminability of the pixel features and effectively improves the performance of the semantic segmentation model.
The training method of the semantic segmentation model can be applied to training of the semantic segmentation model required by any scene or field, such as fields of automatic driving scenes, geographic information systems, medical image analysis, robots and the like, and is not limited specifically.
Exemplary method
Fig. 2 is a flowchart illustrating a training method of a semantic segmentation model according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an electronic device, such as a server or a terminal, as shown in fig. 2, and includes the following steps:
step 201, training data is acquired.
The training data comprises labeled image data and unlabeled image data, and can also comprise labels corresponding to the labeled image data. The training data may be collected in advance in any practicable manner. For example, in the field of automatic driving, a data acquisition vehicle may be adopted to travel on a road to be acquired, a camera device, such as a camera, is arranged on the data acquisition vehicle, an image of a surrounding environment is acquired through the camera in the traveling process, image data is acquired, and a label of the image data with the label may be acquired in any implementable labeling manner, such as acquired through manual labeling, acquired through an implementable automatic labeling manner, acquired through trained classification model prediction, and the like, which are not specifically limited.
Step 202, training a pre-established semantic segmentation network by using training data and adopting a semi-supervised learning training mode integrating pixel-level contrast learning and cross supervision to obtain a semantic segmentation model.
Here, semi-supervised means that supervised learning and unsupervised learning are combined: the labeled image data is used for supervised learning, and the unlabeled image data is used for unsupervised learning. Cross supervision means that, for the unlabeled image data, output results are obtained simultaneously from two semantic segmentation networks; network 2 is supervised based on the output result of network 1, and network 1 is supervised based on the output result of network 2. Pixel-level contrast learning means that, for the feature maps output by preset layers of the two networks on the unlabeled image data, a pixel-level contrast loss is determined according to a preset pixel-level contrast rule and is used to update the network parameters, so that the distances between pixel features of the same type are pulled closer and the distances between pixel features of different types are pushed farther apart, thereby improving the discriminability of the pixel features.
The training mode of semi-supervised learning that fuses pixel-level contrast learning and cross supervision means that the supervised learning loss, the cross-supervised learning loss, and the pixel-level contrast loss are combined and jointly used to update the network parameters. In other words, the labeled image data and the unlabeled image data are trained cooperatively, rather than first training an initial model on the labeled image data, so the training method of the semantic segmentation model of the present disclosure does not depend on an initial model. For example, the labeled image data and the unlabeled image data may be input simultaneously in batches and trained together to obtain the supervised learning loss, the cross-supervised learning loss, and the pixel-level contrast loss, and the combination of the three losses is used to update the network parameters.
In the training method for the semantic segmentation model provided by this embodiment, a pre-established semantic segmentation network is trained in a semi-supervised learning mode that fuses pixel-level contrast learning and cross supervision to obtain the semantic segmentation model. Through cross-supervised semi-supervised learning, labeled image data and unlabeled image data can be trained together, which effectively improves the robustness of the semantic segmentation model; combined with pixel-level contrast learning, the distance between pixel features of the same type is effectively shortened while the distance between pixel features of different types is increased, which improves the discriminability of the pixel features and thereby effectively improves the performance of the semantic segmentation model. Because this training method does not depend on an initial model but trains cooperatively on labeled and unlabeled image data, the problem of a low-accuracy initial model does not arise, and the poor performance caused in the prior art by the model's excessive dependence on the accuracy of the initial model is effectively avoided.
In an optional example, fig. 3 is a flowchart of step 202 provided in an exemplary embodiment of the present disclosure, in this example, the training data includes first image data, first label data corresponding to the first image data, and second image data, and step 202 may specifically include the following steps:
2021a, respectively performing supervised learning training on a first semantic segmentation network and a second semantic segmentation network which are established in advance by using the first image data and the first label data; and performing cross supervised learning training on the first semantic segmentation network and the second semantic segmentation network by using the second image data, and performing pixel level comparison learning training on the first semantic segmentation network and the second semantic segmentation network based on a pixel level comparison rule.
The first semantic segmentation network and the second semantic segmentation network may adopt any implementable semantic segmentation network, such as a DeepLabV3+-based semantic segmentation network, a UNet-based semantic segmentation network, or an FCN (Fully Convolutional Networks)-based semantic segmentation network, and may be set according to actual requirements, which is not limited in this disclosure. It should be noted that the first semantic segmentation network and the second semantic segmentation network have the same network structure but different initialized network parameters.
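For illustration only (not part of the original patent), the following minimal Python/PyTorch sketch shows one way to instantiate two segmentation networks with identical structure but different initial parameters; the use of torchvision's DeepLabV3 as a stand-in for DeepLabV3+, the backbone choice, the number of classes, and a recent torchvision API are assumptions:

```python
# Hypothetical sketch: two networks with the same structure but different
# random initialization, as required for cross supervision. DeepLabV3 from
# torchvision stands in for the DeepLabV3+ network mentioned above.
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

NUM_CLASSES = 19  # assumed number of semantic classes

torch.manual_seed(0)
net1 = deeplabv3_resnet50(weights=None, num_classes=NUM_CLASSES)

torch.manual_seed(1)  # a different seed yields different initial parameters
net2 = deeplabv3_resnet50(weights=None, num_classes=NUM_CLASSES)
```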
Step 2022a, determining the current loss based on a preset loss function, where the preset loss function includes a first loss function corresponding to supervised learning, a second loss function corresponding to cross-supervised learning, and a third loss function corresponding to pixel-level contrast learning.
The first loss function, the second loss function, and the third loss function may be set according to actual requirements, for example, a cross entropy loss function or any other implementable loss function may be adopted, and the disclosure is not limited.
Taking the first semantic segmentation network as an example, the supervised learning principle is as follows: and inputting the first image data into a first semantic segmentation network to obtain a first output result, comparing the first output result with corresponding first label data, and determining a first loss corresponding to supervised learning based on a first loss function corresponding to the supervised learning.
The principle of cross supervised learning is that the second image data without labels is respectively input into a first semantic segmentation network and a second semantic segmentation network, the labels of the second semantic segmentation network are determined to supervise the second semantic segmentation network based on the output result of the first semantic segmentation network, meanwhile, the labels of the first semantic segmentation network are determined to supervise the first semantic segmentation network based on the output result of the second semantic segmentation network, so that cross supervision is realized, and the second loss corresponding to the cross supervised learning is determined based on the second loss function corresponding to the cross supervised learning.
The pixel-level comparison rule can be set according to actual requirements, and is used for performing pixel-level comparison on feature maps generated by the first semantic segmentation network and the second semantic segmentation network, and determining a third loss corresponding to pixel-level comparison learning based on a third loss function corresponding to the pixel-level comparison learning.
And then determining the current loss by combining the first loss, the second loss and the third loss, adjusting network parameters of the first semantic segmentation network and the second semantic segmentation network, and performing next iteration to train the first semantic segmentation network and the second semantic segmentation network.
Step 2023a, determining that the current loss meets a preset condition, ending the training, and using the first semantic segmentation network or the second semantic segmentation network obtained by the training as a semantic segmentation model.
The preset condition may be set according to an actual requirement, and the present disclosure is not limited, for example, the current loss enters a convergence state, specifically, for example, the current loss is less than a preset loss threshold, and N consecutive losses are less than the preset loss threshold, and the like.
When the current loss is determined to meet the preset condition, the training can be finished, and the first semantic segmentation network and the second semantic segmentation network obtained through training can be used as semantic segmentation models for semantic segmentation scenes because the first semantic segmentation network and the second semantic segmentation network are trained in the same process.
In this way, through supervised learning on the labeled image data and cross-supervised learning and pixel-level contrast learning on the unlabeled image data, the first loss corresponding to supervised learning, the second loss corresponding to cross-supervised learning, and the third loss corresponding to pixel-level contrast learning are used together to determine the current loss and update the network parameters. This achieves an effective fusion of cross-supervised semi-supervised learning and pixel-level contrast learning, improves the effectiveness of learning from the unlabeled image data while preserving the learning value of the labeled image data, makes effective use of the information contained in the unlabeled data, and improves the robustness of the model.
In an alternative example, fig. 4 is a flowchart of step 202 provided by another exemplary embodiment of the present disclosure, in this example, the training data includes first image data, first label data corresponding to the first image data, and second image data, and step 202 may specifically include the following steps:
step 2021b, predicting the first image data by using the first semantic segmentation network to obtain first probability data corresponding to the first image data.
The first probability data includes probability values that each pixel in the first image data belongs to each type, and the types may be set according to actual requirements, for example, the types may include vehicles, pedestrians, obstacles, and the like, and are not limited specifically.
And predicting the first image data by using the first semantic segmentation network, namely inputting the first image data into the first semantic segmentation network, and obtaining an output result through processing of the first semantic segmentation network, wherein a specific prediction principle is not repeated.
Optionally, the first image data may be image data that meets input requirements of the first semantic segmentation network after preprocessing the original image data.
Step 2022b, predicting the first image data by using the second semantic segmentation network to obtain second probability data corresponding to the first image data.
For a specific prediction principle, refer to the foregoing steps, which are not described herein again.
Step 2023b, predicting the second image data by using the first semantic segmentation network to obtain a third feature map and third probability data corresponding to the second image data.
Wherein, the third feature map is a feature map output by a preset layer (such as the last convolutional layer) in the first semantic segmentation network. The specific preset layer may be determined according to a specific network structure of the first semantic segmentation network, which is not limited herein.
For example, FIG. 5 is a schematic diagram of the overall architecture of a DeepLabV3+ network according to an exemplary embodiment of the present disclosure, where the third feature map is the feature map output by the last convolutional layer in the encoder (the last convolutional layer feature map labeled in the figure). The DeepLabV3+ network architecture is known in the art and is not described here again.
For the specific prediction principle, refer to the previous step, and are not described herein again.
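As a rough illustration only (not from the patent), the sketch below continues the earlier example and captures an intermediate feature map together with the per-pixel probabilities; using the torchvision backbone output as the "preset layer" is an assumption, since the exact layer depends on the concrete network structure:

```python
# Hypothetical sketch: grab an intermediate feature map (standing in for the
# "last convolutional layer of the encoder") with a forward hook, and turn the
# network output into per-pixel class probabilities (third probability data).
import torch
import torch.nn.functional as F

features = {}

def save_feature(name):
    def hook(module, inputs, output):
        # torchvision's backbone returns a dict of feature maps; keep "out"
        features[name] = output["out"] if isinstance(output, dict) else output
    return hook

net1.backbone.register_forward_hook(save_feature("net1"))  # net1 from the sketch above

images_u = torch.randn(2, 3, 256, 256)   # an unlabeled batch (second image data)
logits1 = net1(images_u)["out"]          # (B, C, H, W) raw scores
probs1 = F.softmax(logits1, dim=1)       # per-pixel probabilities
feat1 = features["net1"]                 # third feature map (lower resolution)
```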
Step 2024b, predicting the second image data by using the second semantic segmentation network to obtain a fourth feature map and fourth probability data corresponding to the second image data.
The detailed operation of this step is referred to the aforementioned step 2023b, and is not described herein again.
Wherein, the steps 2021b-2024b are not in sequence.
Step 2025, determining the first pseudo tag data as the tag data of the second image data under the second semantic segmentation network based on the third probability data.
The first pseudo tag data includes a tag of a type to which each pixel in the second image data belongs, and is obtained by encoding the third probability data, and a specific encoding mode may be set according to an actual requirement, for example, by using one-hot (one-hot) encoding.
Step 2026, determining the second pseudo tag data as the tag data of the second image data under the first semantic segmentation network based on the fourth probability data.
The specific operation of this step is referred to as step 2025, and details are not repeated.
Step 2025 is not in sequence with step 2026.
Step 2027, determining a first cross entropy loss based on the first probability data, the second probability data, and the first label data.
Specifically, first probability data and second probability data obtained through prediction are respectively compared with first label data, and corresponding first cross entropy loss is determined based on a first loss function.
Illustratively, the first loss function may be expressed as L_S:

L_S = (1 / (|D_l| · W · H)) · Σ_{x∈D_l} Σ_i [ l_ce(p_{1i}, y*_i) + l_ce(p_{2i}, y*_i) ]

where D_l is the labeled data set (i.e., the first image data), |D_l| represents the number of samples in the labeled data set, W and H are the width and height of the image, l_ce denotes the cross-entropy calculation, p_{1i} and p_{2i} are the prediction results of the first semantic segmentation network and the second semantic segmentation network, namely the probability value corresponding to the i-th pixel in the first probability data and the probability value corresponding to the i-th pixel in the second probability data, and y*_i is the label value corresponding to the i-th pixel in the first label data. For the same pixel, both networks are supervised by the same label y*_i.
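As a non-authoritative illustration, the supervised loss above can be sketched in PyTorch as follows; the ignore_index value is an assumption, and F.cross_entropy operates on raw logits (applying the softmax internally) rather than on probability data:

```python
# Hypothetical sketch of the supervised loss L_S: both networks are supervised
# by the ground-truth labels of the labeled image data with per-pixel cross
# entropy, and the two terms are summed.
import torch.nn.functional as F

def supervised_loss(logits1, logits2, labels, ignore_index=255):
    # logits1, logits2: (B, C, H, W) outputs of the two networks on labeled data
    # labels:           (B, H, W) integer class map (first label data)
    loss1 = F.cross_entropy(logits1, labels, ignore_index=ignore_index)
    loss2 = F.cross_entropy(logits2, labels, ignore_index=ignore_index)
    return loss1 + loss2
```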
step 2028, determining a second cross entropy loss based on the third probability data, the second pseudo label data, the fourth probability data, and the first pseudo label data.
Specifically, the second cross entropy loss is determined based on the second loss function, the specific principle is similar to that in step 2027, and details are not repeated here, except that in this step, the second pseudo tag data is a tag compared with the third probability data, and the first pseudo tag data is a tag compared with the fourth probability data.
Step 2029, comparing the third feature map and the fourth feature map at pixel level based on the third probability data and the fourth probability data to obtain a pixel-level contrast loss.
Specifically, a pixel-level comparison rule may be set in advance according to actual requirements, data required by the third loss function is determined, and then the pixel-level comparison loss is determined based on the third loss function.
Step 2027-step 2029 are not in sequence.
And step 2030, adjusting parameters of the semantic segmentation network based on the first cross entropy loss, the second cross entropy loss and the pixel level contrast loss until a preset training end condition is met, and obtaining a semantic segmentation model.
Specifically, the comprehensive loss can be obtained by weighting the first cross entropy loss, the second cross entropy loss and the pixel-level contrast loss, and is used for adjusting parameters of the semantic segmentation network (including the first semantic segmentation network and the second semantic segmentation network), and the weighted weight can be set according to actual requirements. The preset training end condition may be set according to actual requirements, for example, convergence of the synthetic loss is achieved, and details are not described again.
Illustratively, fig. 6 is a schematic diagram of a model training principle provided in an exemplary embodiment of the present disclosure. Wherein, for labeled image data, probability map 1 and probability map 2 represent first probability data and second probability data, respectively, and for unlabeled image data, probability map 1 and probability map 2 represent third probability data and fourth probability data, respectively; the pseudo tag 1 and the pseudo tag 2 represent first pseudo tag data and second pseudo tag data, respectively. And performing cross supervision and pixel-level comparison learning on the unlabeled image data, and performing supervised learning on the labeled image data.
In an alternative example, fig. 7 is a flowchart of step 2029 provided by an exemplary embodiment of the present disclosure, in which the pixel-level contrast loss includes a first pixel-level directional contrast loss and a second pixel-level directional contrast loss; the step 2029 of comparing the third feature map and the fourth feature map at the pixel level based on the third probability data and the fourth probability data to obtain a pixel-level contrast loss includes:
step 20291, perform channel dimension reduction on the third feature map to obtain a fifth feature map, and perform channel dimension reduction on the fourth feature map to obtain a sixth feature map.
The channel dimension reduction may be implemented in any implementable manner, for example, the channel dimension reduction is performed through 1 × 1 convolution layers, which is not limited in this disclosure.
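A minimal sketch of such a projection head is shown below; the channel sizes are assumptions and depend on the backbone actually used:

```python
# Hypothetical sketch: a 1x1 convolution reduces the channel dimension of the
# third/fourth feature maps, producing the fifth/sixth feature maps used for
# pixel-level contrast learning.
import torch.nn as nn

class PixelProjector(nn.Module):
    def __init__(self, in_channels=2048, embed_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=1)

    def forward(self, feat):      # feat: (B, in_channels, h, w)
        return self.proj(feat)    # (B, embed_dim, h, w)
```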
Step 20292, based on the fifth feature map, the sixth feature map and a preset sampling rule, performing pixel sample pair sampling to obtain a first positive sample pair set, a second positive sample pair set, a first negative sample pair set and a second negative sample pair set.
The first positive sample pair set and the first negative sample pair set are respectively a positive sample pair set and a negative sample pair set obtained by sampling with the pixel in the fifth feature map as a reference sample, and the second positive sample pair set and the second negative sample pair set are respectively a positive sample pair set and a negative sample pair set obtained by sampling with the pixel in the sixth feature map as a reference sample. The preset sampling rule can be set according to actual requirements.
Illustratively, pixel sample pair sampling is carried out based on the position relation and probability magnitude situation of any first pixel in the fifth feature map and any second pixel in the sixth feature map.
At step 20293, a first pixel-level directional contrast loss is determined based on the first set of positive sample pairs and the first set of negative sample pairs.
Specifically, the first pixel-level directional contrast loss may be determined based on a first pixel-level directional contrast loss function, and the first pixel-level directional contrast loss function may be set according to actual needs.
At step 20294, a second pixel-level directional contrast loss is determined based on the second set of positive sample pairs and the second set of negative sample pairs.
Specifically, the second pixel-level directional contrast loss may be determined based on a second pixel-level directional contrast loss function, and the second pixel-level directional contrast loss function may be set according to actual needs.
Step 20293 is not in sequence with step 20294.
At step 20295, a pixel-level contrast loss is obtained based on the first pixel-level directional contrast loss and the second pixel-level directional contrast loss.
Specifically, the pixel-level contrast loss may be obtained by weighting the first pixel-level directional contrast loss and the second pixel-level directional contrast loss.
The method further improves the discrimination of the image characteristics through the pixel-level directional contrast loss, and further improves the model performance.
In an optional example, the sampling of the pixel sample pairs based on the fifth feature map, the sixth feature map and the preset sampling rule in step 20292 to obtain a first positive sample pair set, a second positive sample pair set, a first negative sample pair set and a second negative sample pair set includes:
For any first pixel in the fifth feature map and any second pixel in the sixth feature map: if the first pixel and the second pixel are located at the same position, the maximum probability corresponding to the first pixel is smaller than the maximum probability corresponding to the second pixel, and the maximum probability corresponding to the second pixel is larger than a preset probability threshold, the first pixel and the second pixel are taken as a positive sample pair with the pixel in the fifth feature map as the reference sample, thereby obtaining the first positive sample pair set. If the first pixel and the second pixel are located at the same position, the maximum probability corresponding to the first pixel is greater than the maximum probability corresponding to the second pixel, and the maximum probability corresponding to the first pixel is greater than the preset probability threshold, the first pixel and the second pixel are taken as a positive sample pair with the pixel in the sixth feature map as the reference sample, thereby obtaining the second positive sample pair set. If the first pixel and the second pixel are located at different positions and the type corresponding to the maximum probability of the first pixel is inconsistent with the type corresponding to the maximum probability of the second pixel, the first pixel and the second pixel are taken both as a negative sample pair with the pixel in the fifth feature map as the reference sample and as a negative sample pair with the pixel in the sixth feature map as the reference sample, thereby obtaining the first negative sample pair set and the second negative sample pair set.
The probability corresponding to the first pixel may be obtained based on the third probability data, and the probability corresponding to the second pixel may be obtained based on the fourth probability data. Taking the third probability data as an example, it includes the probability values of each pixel belonging to the respective types, i.e., each pixel corresponds to probability values for a plurality of types, and the type corresponding to the maximum probability may represent the type to which the pixel belongs. Since the fifth feature map and the sixth feature map are obtained by predicting the same second image data through the first semantic segmentation network and the second semantic segmentation network respectively, the first pixel and the second pixel at the same position are considered to belong to the same type, or whether they belong to the same type may be further determined based on the third probability data and the fourth probability data. The first positive sample pair set and the second positive sample pair set are then determined based on the relative magnitude of the maximum probabilities of the first pixel and the second pixel at the same position and on whether the larger maximum probability exceeds a preset probability threshold, and the first negative sample pair set and the second negative sample pair set are determined based on the inconsistency of the types corresponding to the maximum probabilities of the first pixel and the second pixel at different positions. Whether a sample pair is positive or negative indicates whether the two pixels belong to the same type: a positive sample pair indicates that the two pixels belong to the same type, and a negative sample pair indicates that they belong to different types. Therefore, the pixel-level directional contrast loss determined based on the positive and negative sample pair sets can guide the semantic segmentation network to pull closer the distance between pixel features of the same type while pushing farther apart the distance between pixel features of different types, improving the discriminability of the pixel features.
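For illustration only, the sampling rule described above might be expressed as mask construction over the two probability maps; the probability threshold, the dense (h*w by h*w) negative mask, and the resizing of the probability data to the feature-map resolution are assumptions of this sketch:

```python
# Hypothetical sketch of the sampling rule: positive pairs come from pixels at
# the same position whose more confident prediction exceeds a threshold;
# negative pairs come from pixels at different positions whose predicted
# classes differ.
import torch

def build_pair_masks(probs1, probs2, threshold=0.75):
    # probs1, probs2: (B, C, h, w) probabilities of the two networks, assumed
    # already resized to the spatial size of the fifth/sixth feature maps
    conf1, cls1 = probs1.max(dim=1)    # (B, h, w) maximum probability and class
    conf2, cls2 = probs2.max(dim=1)

    # Positive-pair masks (same position; reference pixel in the fifth / sixth map)
    pos1 = (conf1 < conf2) & (conf2 > threshold)
    pos2 = (conf1 > conf2) & (conf1 > threshold)

    # Negative pairs: positions whose predicted classes differ (used in both
    # directions); the same-position diagonal is excluded.
    b, h, w = cls1.shape
    c1 = cls1.reshape(b, -1)
    c2 = cls2.reshape(b, -1)
    neg = c1.unsqueeze(2) != c2.unsqueeze(1)                  # (B, h*w, h*w)
    eye = torch.eye(h * w, dtype=torch.bool, device=neg.device)
    neg = neg & ~eye.unsqueeze(0)
    return pos1, pos2, neg
```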
In an alternative example, the determining the first pixel-level directional contrast loss based on the first set of positive sample pairs and the first set of negative sample pairs of step 20293 includes:
generating a first positive sample pair mask based on the first set of positive sample pairs; generating a first negative exemplar pair mask based on the first set of negative exemplar pairs; a first pixel-level directional contrast loss is determined based on the first positive-sample pair mask and the first negative-sample pair mask.
The first positive sample pair mask includes the position of each first positive sample pair in the feature map, and similarly, the first negative sample pair mask includes the position of each first negative sample pair in the feature map.
In an alternative example, the determining of the second pixel-level directional contrast loss based on the second set of positive sample pairs and the second set of negative sample pairs of step 20294 includes:
generating a second positive sample pair mask based on the second set of positive sample pairs; generating a second negative exemplar pair mask based on the second set of negative exemplar pairs; a second pixel-level directional contrast loss is determined based on the second positive-sample pair mask and the second negative-sample pair mask.
Illustratively, FIG. 8 is a block flow diagram of a sampling process provided by an exemplary embodiment of the present disclosure. The sampling is performed on the basis of the third probability data, the fourth probability data, the fifth feature map, and the sixth feature map, where P() denotes the maximum probability of a pixel.
Illustratively, the first pixel-level directional contrast loss may be expressed as L_C1 and the second pixel-level directional contrast loss as L_C2:

L_C1 = -(1/N) · Σ_i M^p1_i · log( r(f_{1,i}, f_{2,i}) / ( r(f_{1,i}, f_{2,i}) + Σ_{f_n∈Ω} M^n1_i · r(f_{1,i}, f_n) ) )

L_C2 = -(1/N) · Σ_i M^p2_i · log( r(f_{2,i}, f_{1,i}) / ( r(f_{2,i}, f_{1,i}) + Σ_{f_n∈Ω} M^n2_i · r(f_{2,i}, f_n) ) )

where W and H are the width and height of the image (the sums over i run over the W × H pixel positions), f_1 denotes the fifth feature map and f_2 denotes the sixth feature map, whose pixels serve as the reference samples of L_C1 and L_C2 respectively, N represents the number of reference samples contained in the positive sample pairs (the first positive sample pairs or the second positive sample pairs), M^p1 and M^p2 respectively denote the first positive sample pair mask and the second positive sample pair mask generated after sampling with the pixels of f_1 and f_2 as reference samples, M^n1 and M^n2 respectively denote the corresponding first negative sample pair mask and second negative sample pair mask, r(·,·) represents an exponential function of the cosine similarity, and Ω denotes the negative sample pair set, which may be the first negative sample pair set or the second negative sample pair set.
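As a rough, non-authoritative sketch, a directional contrast loss of this kind could be computed from the projected feature maps and the sampled masks as follows; the temperature and the exact normalization are assumptions, and the loss in the opposite direction is obtained by swapping the two feature maps and using the second masks:

```python
# Hypothetical sketch of a pixel-level directional contrast loss: reference
# pixels selected by the positive mask are pulled towards the pixel at the same
# position in the other network's feature map and pushed away from negatives.
import torch
import torch.nn.functional as F

def directional_contrast_loss(emb_ref, emb_other, pos_mask, neg_mask, temperature=0.1):
    # emb_ref, emb_other: (B, D, h, w) projected feature maps (fifth / sixth)
    # pos_mask: (B, h, w) bool, reference pixels that have a positive partner
    # neg_mask: (B, h*w, h*w) bool, class-mismatch pairs at different positions
    b, d, h, w = emb_ref.shape
    f1 = F.normalize(emb_ref.flatten(2).transpose(1, 2), dim=-1)    # (B, h*w, D)
    f2 = F.normalize(emb_other.flatten(2).transpose(1, 2), dim=-1)  # (B, h*w, D)

    # r(.,.) as an exponential of cosine similarity, scaled by a temperature
    pos_sim = torch.exp((f1 * f2).sum(dim=-1) / temperature)              # (B, h*w)
    all_sim = torch.exp(torch.bmm(f1, f2.transpose(1, 2)) / temperature)  # (B, h*w, h*w)
    neg_sum = (all_sim * neg_mask.float()).sum(dim=2)                     # (B, h*w)

    pos = pos_mask.reshape(b, -1).float()
    n_ref = pos.sum().clamp(min=1.0)          # number of reference samples N
    loss = -(pos * torch.log(pos_sim / (pos_sim + neg_sum + 1e-8))).sum() / n_ref
    return loss
```

Under the naming of the previous sketch, L_C1 would roughly correspond to directional_contrast_loss(emb1, emb2, pos1, neg) and L_C2 to directional_contrast_loss(emb2, emb1, pos2, neg).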
Fig. 9 is a flowchart of step 202 provided by yet another exemplary embodiment of the present disclosure.
In an alternative example, the determining the first pseudo tag data based on the third probability data of step 2025 includes:
step 20251, performing one-hot encoding on the third probability data to obtain first pseudo tag data.
One-hot encoding, also called one-bit effective encoding, uses an N-bit status register to encode N states; each state has its own independent register bit, and only one bit is valid at any time. In this disclosure, the valid bit indicates the type to which the pixel belongs. For example, for a certain pixel A, if the probability of belonging to a pedestrian is 0.1, the probability of belonging to a vehicle is 0.6, the probability of belonging to a road edge is 0.1, and the probability of belonging to an obstacle is 0.2, the pseudo label corresponding to pixel A obtained through encoding is 0100. The specific encoding principle is not described here again.
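A tiny illustrative sketch of this encoding (assuming four types in the order pedestrian, vehicle, road edge, obstacle):

```python
# Hypothetical sketch: the pseudo label is the one-hot encoding of the most
# probable type for each pixel.
import torch
import torch.nn.functional as F

probs_a = torch.tensor([0.1, 0.6, 0.1, 0.2])   # pixel A from the example above
pseudo_a = F.one_hot(probs_a.argmax(), num_classes=4)
print(pseudo_a)                                 # tensor([0, 1, 0, 0]) -> "0100"
```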
Determining second pseudo tag data based on the fourth probability data of step 2026 includes:
step 20261, performing one-hot encoding on the fourth probability data to obtain second pseudo tag data.
For the specific principle, refer to the foregoing steps, and are not described herein again.
In an alternative example, the determining the second cross entropy loss based on the third probability data, the second pseudo tag data, the fourth probability data and the first pseudo tag data of step 2028 includes:
step 20281, determining first cross entropies respectively corresponding to the pixels of the second image data under the first semantic segmentation network based on the third probability data and the second pseudo label data.
Specifically, the second pseudo tag data is used as tag data to be compared with the third probability data, and the first cross entropy corresponding to each pixel is determined based on the first cross entropy calculation rule, and any implementable calculation mode can be adopted for specific cross entropy calculation, which is not described herein again.
Step 20282, determining second cross entropies respectively corresponding to the pixels of the second image data under the second semantic segmentation network based on the fourth probability data and the first pseudo label data.
Similarly, the first pseudo tag data is used as tag data compared with the fourth probability data, and the second cross entropy is determined based on the second cross entropy calculation rule.
Step 20281 is not in sequence with step 20282.
Step 20283, determining a second cross entropy loss based on the first cross entropy and the second cross entropy corresponding to each pixel of the second image data.
Illustratively, the second cross-entropy loss may be expressed as L_U:

L_U = (1 / (|D_u| · W · H)) · Σ_{x∈D_u} Σ_i [ l_ce(p_{1i}, y_{2i}) + l_ce(p_{2i}, y_{1i}) ]

where D_u represents the unlabeled data set (i.e., the second image data), W and H are the width and height of the image, y_{1i} and y_{2i} respectively denote the first pseudo label (i.e., the pseudo label corresponding to the i-th pixel in the first pseudo label data) and the second pseudo label (i.e., the pseudo label corresponding to the i-th pixel in the second pseudo label data), l_ce denotes the cross-entropy calculation, l_ce(p_{1i}, y_{2i}) represents the first cross entropy corresponding to the i-th pixel, l_ce(p_{2i}, y_{1i}) represents the second cross entropy corresponding to the i-th pixel, and p_{1i} and p_{2i} are the prediction results of the first semantic segmentation network and the second semantic segmentation network, namely the probability value corresponding to the i-th pixel in the third probability data and the probability value corresponding to the i-th pixel in the fourth probability data.
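As an illustrative, non-authoritative sketch, the cross-supervision term can be written with argmax pseudo labels detached from the gradient graph (the use of detach and of mean reduction are assumptions):

```python
# Hypothetical sketch of the second cross-entropy loss L_U: each network on the
# unlabeled data is supervised by the other network's pseudo labels.
import torch.nn.functional as F

def cross_supervision_loss(logits1_u, logits2_u):
    # logits1_u, logits2_u: (B, C, H, W) outputs of the two networks on unlabeled data
    pseudo1 = logits1_u.argmax(dim=1).detach()   # first pseudo label data
    pseudo2 = logits2_u.argmax(dim=1).detach()   # second pseudo label data
    loss1 = F.cross_entropy(logits1_u, pseudo2)  # first cross entropy (network 1 vs pseudo label 2)
    loss2 = F.cross_entropy(logits2_u, pseudo1)  # second cross entropy (network 2 vs pseudo label 1)
    return loss1 + loss2
```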
In an alternative example, the adjusting the parameters of the semantic segmentation network in step 2030 based on the first cross entropy loss, the second cross entropy loss and the pixel-level contrast loss until the preset training end condition is satisfied includes:
step 20301, performing weighted summation on the first cross entropy loss, the second cross entropy loss and the pixel level contrast loss based on the preset weight to obtain a comprehensive loss.
Step 20302, determining that the comprehensive loss does not meet a preset training end condition, and adjusting parameters of the semantic segmentation network based on the comprehensive loss; or determining that the comprehensive loss meets a preset training end condition, and ending the training.
Illustratively, the composite loss may be expressed as L:

L = α_1·L_S + α_2·L_U + α_3·L_C1 + α_4·L_C2

where α_1, α_2, α_3, and α_4 respectively represent the weights of the first cross-entropy loss L_S, the second cross-entropy loss L_U, the first pixel-level directional contrast loss L_C1, and the second pixel-level directional contrast loss L_C2. The specific weight values may be set according to actual requirements, for example: α_1 = 1, α_2 = 1, α_3 = 0.5, α_4 = 0.5; other values may also be set according to actual requirements, which is not specifically limited.
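For illustration only, combining the four losses and updating both networks might look like the sketch below, continuing the earlier sketches; the optimizer and its hyperparameters are assumptions:

```python
# Hypothetical sketch of the composite loss and joint parameter update.
import itertools
import torch

optimizer = torch.optim.SGD(
    itertools.chain(net1.parameters(), net2.parameters()),
    lr=0.01, momentum=0.9, weight_decay=1e-4)

def training_step(l_s, l_u, l_c1, l_c2, a1=1.0, a2=1.0, a3=0.5, a4=0.5):
    loss = a1 * l_s + a2 * l_u + a3 * l_c1 + a4 * l_c2   # L = a1*L_S + a2*L_U + a3*L_C1 + a4*L_C2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```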
In this way, the prediction probabilities of the unlabeled image data in the two semantic segmentation networks guide the sampling of positive and negative sample pairs on the feature maps, so that the pixel-level directional contrast losses are determined and participate in updating the parameters of the semantic segmentation networks. This effectively shortens the distance between pixel features of the same type while pushing farther apart the distance between pixel features of different types, improves the discriminability of the pixel features, and effectively improves the performance of the semantic segmentation model.
The various embodiments or optional examples of the disclosure described above may be implemented individually or in any combination without conflict.
Any of the training methods for semantic segmentation models provided by the embodiments of the present disclosure may be performed by any suitable device with data processing capabilities, including but not limited to: terminal equipment, a server and the like. Alternatively, the training method of any semantic segmentation model provided by the embodiments of the present disclosure may be executed by a processor, for example, the processor may execute the training method of any semantic segmentation model mentioned in the embodiments of the present disclosure by calling a corresponding instruction stored in a memory. And will not be described in detail below.
Another embodiment of the present disclosure provides a semantic segmentation method for an image. The semantic segmentation method of the present disclosure may be applied to an electronic device, such as a server or a terminal, specifically a vehicle-mounted computing platform in the automatic driving field, or an electronic device in other scenarios or fields. FIG. 10 is a flowchart illustrating a semantic segmentation method for an image according to an exemplary embodiment of the present disclosure. The method comprises the following steps:
step 301, acquiring image data to be processed.
The image data to be processed may be any image data that needs to be subjected to semantic segmentation, and this embodiment is not limited.
Step 302, performing semantic segmentation on the image data to be processed based on a pre-obtained semantic segmentation model, and obtaining a semantic segmentation result corresponding to the image data to be processed.
The semantic segmentation model is obtained by the training method of the semantic segmentation model provided in any of the above embodiments or examples. For a specific obtaining process, reference is made to the foregoing embodiments or examples, which are not described herein again. Under the condition that the semantic segmentation model is obtained, the principle of performing semantic segmentation on the image data to be processed based on the semantic segmentation model is a conventional technology, and is not described herein again.
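A minimal illustrative sketch of this inference step, reusing one of the trained networks from the earlier sketches as the semantic segmentation model; the input size and the dict-style output are assumptions tied to the torchvision stand-in:

```python
# Hypothetical sketch: semantic segmentation of image data to be processed,
# assigning each pixel the type with the highest probability.
import torch

model = net1          # either trained network may serve as the semantic segmentation model
model.eval()
with torch.no_grad():
    image = torch.randn(1, 3, 512, 512)   # stand-in for the image data to be processed
    logits = model(image)["out"]          # (1, C, H, W)
    segmentation = logits.argmax(dim=1)   # (1, H, W) per-pixel type indices
```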
According to the image semantic segmentation method, the semantic segmentation model is obtained through a semi-supervised training approach that fuses pixel-level contrast learning and cross supervision. Through cross-supervised semi-supervised learning, labeled image data and unlabeled image data can be used for training together, effectively improving the robustness of the semantic segmentation model. Combined with pixel-level contrast learning, the distance between pixel features of the same class is effectively shortened while the distance between pixel features of different classes is pushed farther apart, improving the discriminability of the pixel features and thereby effectively improving the performance of the semantic segmentation model.
Any of the image semantic segmentation methods provided by the embodiments of the present disclosure may be performed by any suitable device with data processing capabilities, including but not limited to terminal devices, servers, and the like. Alternatively, any semantic segmentation method for an image provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor may execute the method by calling corresponding instructions stored in a memory. This is not described in detail below.
Exemplary devices
Fig. 11 is a schematic structural diagram of a training apparatus for a semantic segmentation model according to an exemplary embodiment of the present disclosure. The apparatus of this embodiment may be used to implement the embodiment of the training method for semantic segmentation models corresponding to the present disclosure, and the apparatus shown in fig. 11 includes: a first obtaining module 501 and a first processing module 502.
A first obtaining module 501 is configured to obtain training data.
The first processing module 502 is configured to train a pre-established semantic segmentation network by using the training data acquired by the first obtaining module 501 and adopting a training mode of semi-supervised learning that fuses pixel-level contrast learning and cross supervision, so as to obtain a semantic segmentation model.
In an alternative example, fig. 12 is a schematic structural diagram of the first processing module 502 according to an exemplary embodiment of the disclosure. In this example, the training data includes first image data, first label data corresponding to the first image data, and second image data; the first processing module 502 includes: a first processing unit 50211, a second processing unit 50212, and a third processing unit 50213.
The first processing unit 50211 is used for performing supervised learning training on a first semantic segmentation network and a second semantic segmentation network which are established in advance by using first image data and first label data, performing cross supervised learning training on the first semantic segmentation network and the second semantic segmentation network by using second image data, and performing pixel-level contrast learning training on the first semantic segmentation network and the second semantic segmentation network based on a pixel-level contrast rule; the second processing unit 50212 is configured to determine a current loss based on a preset loss function, where the preset loss function includes a first loss function corresponding to supervised learning, a second loss function corresponding to cross-supervised learning, and a third loss function corresponding to pixel-level contrast learning; the third processing unit 50213 is configured to determine that the current loss meets a preset condition, end training, and use the trained first semantic segmentation network or second semantic segmentation network as a semantic segmentation model.
In an alternative example, fig. 13 is a schematic structural diagram of a first processing module 502 provided in another exemplary embodiment of the present disclosure. In this example, the training data includes first image data, first label data corresponding to the first image data, and second image data; the first processing module 502 includes: first prediction unit 50221, second prediction unit 50222, third prediction unit 50223, fourth prediction unit 50224, first determination unit 50225, second determination unit 50226, third determination unit 50227, fourth determination unit 50228, fifth determination unit 50229, and parameter adjustment unit 50230.
A first prediction unit 50221, configured to predict the first image data by using a first semantic segmentation network, and obtain first probability data corresponding to the first image data; a second prediction unit 50222, configured to predict the first image data by using a second semantic segmentation network, to obtain second probability data corresponding to the first image data; a third prediction unit 50223, configured to predict the second image data by using the first semantic segmentation network, and obtain a third feature map and third probability data corresponding to the second image data; a fourth predicting unit 50224, configured to predict the second image data by using the second semantic segmentation network, to obtain a fourth feature map and fourth probability data corresponding to the second image data; a first determining unit 50225, configured to determine, based on the third probability data obtained by the third predicting unit 50223, first pseudo tag data as tag data of the second image data under the second semantic segmentation network; a second determining unit 50226, configured to determine second pseudo label data as label data of the second image data under the first semantic segmentation network based on the fourth probability data obtained by the fourth predicting unit 50224; a third determining unit 50227, configured to determine a first cross entropy loss based on the first probability data obtained by the first predicting unit 50221, the second probability data obtained by the second predicting unit 50222, and the first label data; a fourth determining unit 50228, configured to determine a second cross entropy loss based on the third probability data obtained by the third predicting unit 50223, the second pseudo label data obtained by the second determining unit 50226, the fourth probability data obtained by the fourth predicting unit 50224, and the first pseudo label data obtained by the first determining unit 50225; a fifth determining unit 50229, configured to perform pixel-level comparison between the third feature map obtained by the third predicting unit 50223 and the fourth feature map obtained by the fourth predicting unit 50224 based on the third probability data obtained by the third predicting unit 50223 and the fourth probability data obtained by the fourth predicting unit 50224, so as to obtain pixel-level comparison loss; a parameter adjusting unit 50230, configured to adjust parameters of the semantic segmentation network based on the first cross entropy loss obtained by the third determining unit 50227, the second cross entropy loss obtained by the fourth determining unit 50228, and the pixel-level contrast loss obtained by the fifth determining unit 50229, until a preset training end condition is met, to obtain a semantic segmentation model.
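To make the cooperation of these units concrete, the following is a minimal, non-authoritative sketch of a single training iteration, assuming each network returns a (feature map, logits) pair, that the optimizer holds the parameters of both networks, and that a helper contrast_loss_fn computes the two pixel-level directional contrast losses; all names and the use of argmax pseudo labels are illustrative assumptions rather than the exact implementation of the disclosure.

```python
import torch
import torch.nn.functional as F

def train_step(net1, net2, optimizer, labeled_img, labels, unlabeled_img,
               contrast_loss_fn, weights=(1.0, 1.0, 0.5, 0.5)):
    """One illustrative iteration combining supervised learning, cross
    supervision with pseudo labels, and pixel-level contrast losses.

    Assumptions: each network returns (feature map, logits) for a batch;
    the optimizer holds the parameters of both networks; contrast_loss_fn
    returns the two directional contrast losses from features and probabilities.
    """
    a1, a2, a3, a4 = weights

    # Supervised branch: both networks predict the labeled images.
    _, logits1_l = net1(labeled_img)
    _, logits2_l = net2(labeled_img)
    loss_s = F.cross_entropy(logits1_l, labels) + F.cross_entropy(logits2_l, labels)

    # Unlabeled branch: each network's prediction supervises the other one.
    feat1, logits1_u = net1(unlabeled_img)
    feat2, logits2_u = net2(unlabeled_img)
    prob1_u, prob2_u = logits1_u.softmax(dim=1), logits2_u.softmax(dim=1)
    pseudo_from_1 = prob1_u.argmax(dim=1).detach()  # pseudo labels for net2
    pseudo_from_2 = prob2_u.argmax(dim=1).detach()  # pseudo labels for net1
    loss_u = (F.cross_entropy(logits1_u, pseudo_from_2) +
              F.cross_entropy(logits2_u, pseudo_from_1))

    # Pixel-level directional contrast losses on the unlabeled feature maps.
    loss_c1, loss_c2 = contrast_loss_fn(feat1, feat2, prob1_u, prob2_u)

    loss = a1 * loss_s + a2 * loss_u + a3 * loss_c1 + a4 * loss_c2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```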
In an alternative example, fig. 14 is a schematic structural diagram of the fifth determining unit 50229 provided in an exemplary embodiment of the present disclosure, where the pixel-level contrast loss includes a first pixel-level directional contrast loss and a second pixel-level directional contrast loss in this example; the fifth determining unit 50229 includes: a first processing subunit 502291, a first sampling subunit 502292, a first determining subunit 502293, and a second determining subunit 502294.
A first processing subunit 502291, configured to perform channel dimension reduction processing on the third feature map to obtain a fifth feature map, and perform channel dimension reduction processing on the fourth feature map to obtain a sixth feature map; a first sampling subunit 502292, configured to perform pixel sample pair sampling based on the fifth feature map and the sixth feature map obtained by the first processing subunit 502291 and a preset sampling rule, to obtain a first positive sample pair set, a second positive sample pair set, a first negative sample pair set, and a second negative sample pair set; the first positive sample pair set and the first negative sample pair set are respectively a positive sample pair set and a negative sample pair set obtained by sampling with the pixel in the fifth feature map as a reference sample, and the second positive sample pair set and the second negative sample pair set are respectively a positive sample pair set and a negative sample pair set obtained by sampling with the pixel in the sixth feature map as a reference sample; a first determining subunit 502293, configured to determine a first pixel-level directional contrast loss based on the first set of positive sample pairs and the first set of negative sample pairs obtained by the first sampling subunit 502292; a second determining subunit 502294, configured to determine a second pixel-level directional contrast loss based on the second set of positive sample pairs and the second set of negative sample pairs obtained by the first sampling subunit 502292.
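The channel dimension reduction could, for example, be realized by a lightweight 1x1-convolution projection head, as in the following sketch; the output channel count and the module itself are assumptions rather than the disclosure's exact design.

```python
import torch

class PixelProjectionHead(torch.nn.Module):
    """Reduces the channel dimension of a (B, C, H, W) feature map with a
    1x1 convolution, one plausible form of the channel dimension reduction
    applied to the third and fourth feature maps (illustrative only)."""

    def __init__(self, in_channels: int, out_channels: int = 64):
        super().__init__()
        self.proj = torch.nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.proj(feat)
```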
In an alternative example, the first sampling sub-unit 502292 is specifically configured to: for any first pixel in the fifth feature map and any second pixel in the sixth feature map, determining that the first pixel and the second pixel are located at the same position, the maximum probability corresponding to the first pixel is smaller than the maximum probability corresponding to the second pixel, and the maximum probability corresponding to the second pixel is larger than a preset probability threshold, and then taking the first pixel and the second pixel as a positive sample pair with the pixel in the fifth feature map as a reference sample; thereby obtaining a first set of positive sample pairs; determining that the first pixel and the second pixel are located at the same position, the maximum probability corresponding to the first pixel is greater than the maximum probability corresponding to the second pixel, and the maximum probability corresponding to the first pixel is greater than a preset threshold, and then taking the first pixel and the second pixel as a positive sample pair taking a pixel in the sixth feature map as a reference sample; thereby obtaining a second set of positive sample pairs; determining that the first pixel and the second pixel are located at different positions, and the maximum probability of the first pixel is inconsistent with the maximum probability of the second pixel, taking the first pixel and the second pixel as a negative sample pair taking the pixel in the fifth feature map as a reference sample, and taking the first pixel and the second pixel as a negative sample pair taking the pixel in the sixth feature map as a reference sample; thereby obtaining a first set of negative sample pairs and a second set of negative sample pairs.
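A rough sketch of this sampling rule is given below, assuming the fifth and sixth feature maps have been flattened so that the i-th row of each probability tensor corresponds to the i-th pixel, and interpreting "the maximum probability ... is inconsistent" as the predicted classes differing; the threshold value and the decision to return a single negative mask shared by both directions are illustrative assumptions.

```python
import torch

def sample_pair_masks(prob1, prob2, threshold: float = 0.75):
    """Build boolean masks over all (i, j) pixel pairs between the two maps.

    prob1, prob2: (N, C) per-pixel class probabilities from the first and
    second networks for the same unlabeled image, pixels flattened in the
    same order. Returns (pos1, pos2, neg): positive-pair masks with the
    fifth / sixth feature map's pixel as reference sample, and one
    negative-pair mask shared by both directions.
    """
    max1, cls1 = prob1.max(dim=1)   # max probability and predicted class, network 1
    max2, cls2 = prob2.max(dim=1)   # max probability and predicted class, network 2
    n = prob1.size(0)
    same_pos = torch.eye(n, dtype=torch.bool, device=prob1.device)

    # Positive pair (reference in fifth map): same position, network 1 is less
    # confident than network 2, and network 2 is confident enough.
    pos1 = same_pos & (max1.unsqueeze(1) < max2.unsqueeze(0)) & (max2 > threshold).unsqueeze(0)
    # Positive pair (reference in sixth map): same position, network 1 is more
    # confident than network 2, and network 1 is confident enough.
    pos2 = same_pos & (max1.unsqueeze(1) > max2.unsqueeze(0)) & (max1 > threshold).unsqueeze(1)

    # Negative pairs: different positions whose predicted classes disagree.
    neg = (~same_pos) & (cls1.unsqueeze(1) != cls2.unsqueeze(0))
    return pos1, pos2, neg
```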
In an alternative example, the first determining subunit 502293 is specifically configured to: generating a first positive sample pair mask based on the first set of positive sample pairs; generating a first negative sample pair mask based on the first set of negative sample pairs; and determining a first pixel-level directional contrast loss based on the first positive sample pair mask and the first negative sample pair mask.
In an alternative example, the second determining subunit 502294 is specifically configured to: generating a second positive sample pair mask based on the second set of positive sample pairs; generating a second negative sample pair mask based on the second set of negative sample pairs; and determining a second pixel-level directional contrast loss based on the second positive sample pair mask and the second negative sample pair mask.
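One plausible (InfoNCE-style) realization of a directional contrast loss computed from such masks is sketched below; the cosine similarity, temperature, and mean reduction are assumptions and not necessarily the exact loss of the disclosure.

```python
import torch
import torch.nn.functional as F

def directional_contrast_loss(anchor_feat, other_feat, pos_mask, neg_mask,
                              temperature: float = 0.1):
    """InfoNCE-style pixel contrast loss for one direction (illustrative only).

    anchor_feat: (N, D) reference pixel features; other_feat: (N, D) pixel
    features from the other network; pos_mask/neg_mask: (N, N) boolean masks
    marking positive and negative pairs with the anchor as reference sample.
    """
    a = F.normalize(anchor_feat, dim=1)
    b = F.normalize(other_feat, dim=1)
    sim = a @ b.t() / temperature                # (N, N) pairwise similarities

    exp_sim = sim.exp()
    pos_term = (exp_sim * pos_mask).sum(dim=1)   # positives per anchor pixel
    neg_term = (exp_sim * neg_mask).sum(dim=1)   # negatives per anchor pixel

    valid = pos_mask.any(dim=1)                  # anchors with at least one positive
    loss = -torch.log(pos_term[valid] / (pos_term[valid] + neg_term[valid] + 1e-8))
    return loss.mean() if valid.any() else sim.new_zeros(())
```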
In an optional example, the first determining unit 50225 is specifically configured to: and carrying out one-hot coding on the third probability data to obtain first pseudo tag data.
In an optional example, the second determining unit 50226 is specifically configured to: and carrying out one-hot coding on the fourth probability data to obtain second pseudo tag data.
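A minimal sketch of this one-hot pseudo-label step is shown below, assuming (N, C) probability maps where N is the number of pixels; taking the argmax before one-hot coding is an assumption consistent with the description.

```python
import torch
import torch.nn.functional as F

def probabilities_to_pseudo_labels(prob: torch.Tensor) -> torch.Tensor:
    """Convert (N, C) per-pixel class probabilities into (N, C) one-hot
    pseudo labels by taking the most probable class for each pixel."""
    num_classes = prob.size(1)
    return F.one_hot(prob.argmax(dim=1), num_classes=num_classes).to(prob.dtype)
```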
In an alternative example, fig. 15 is a schematic structural diagram of a fourth determining unit 50228 provided in an exemplary embodiment of the present disclosure; in this example, the fourth determining unit 50228 includes: a second processing sub-unit 502281, a third processing sub-unit 502282, and a fourth processing sub-unit 502283.
A second processing subunit 502281, configured to determine, based on the third probability data and the second pseudo tag data, first cross entropies respectively corresponding to pixels of the second image data in the first semantic segmentation network; a third processing subunit 502282, configured to determine, based on the fourth probability data and the first pseudo tag data, second cross entropies respectively corresponding to pixels of the second image data in the second semantic segmentation network; a fourth processing subunit 502283, configured to determine a second cross entropy loss based on the first cross entropy corresponding to each pixel of the second image data obtained by the second processing subunit 502281 and the second cross entropy corresponding to each pixel of the second image data obtained by the third processing subunit 502282.
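The per-pixel cross entropies and the resulting second cross entropy loss could be computed as in the following sketch, assuming (N, C) probability maps and one-hot pseudo labels; averaging over pixels as the reduction is an assumption.

```python
import torch

def cross_supervision_loss(prob3, pseudo2, prob4, pseudo1, eps: float = 1e-8):
    """Second cross entropy loss between each network's probabilities on the
    unlabeled image and the pseudo labels derived from the other network.

    prob3/prob4: (N, C) probabilities of the first/second network;
    pseudo1/pseudo2: (N, C) one-hot pseudo labels from the first/second network.
    """
    # Per-pixel cross entropy of network 1 against network 2's pseudo labels.
    ce1 = -(pseudo2 * torch.log(prob3 + eps)).sum(dim=1)
    # Per-pixel cross entropy of network 2 against network 1's pseudo labels.
    ce2 = -(pseudo1 * torch.log(prob4 + eps)).sum(dim=1)
    return ce1.mean() + ce2.mean()
```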
In an optional example, the parameter adjusting unit 50230 is specifically configured to: weighting and summing the first cross entropy loss, the second cross entropy loss and the pixel level contrast loss based on preset weights to obtain comprehensive loss; determining that the comprehensive loss does not meet a preset training end condition, and adjusting parameters of a semantic segmentation network based on the comprehensive loss; or determining that the comprehensive loss meets a preset training end condition, and ending the training.
Yet another embodiment of the present disclosure further provides a semantic segmentation apparatus for an image, where the apparatus of this embodiment may be used to implement the embodiment of the semantic segmentation method for an image according to the present disclosure, and fig. 16 is a schematic structural diagram of the semantic segmentation apparatus for an image according to an exemplary embodiment of the present disclosure. The semantic segmentation device of the image comprises: a second obtaining module 601 and a second processing module 602.
A second obtaining module 601, configured to obtain image data to be processed; the second processing module 602 is configured to perform semantic segmentation on the image data to be processed based on a pre-obtained semantic segmentation model, and obtain a semantic segmentation result corresponding to the image data to be processed; wherein, the semantic segmentation model is obtained by the training method of the semantic segmentation model provided in any of the above embodiments or examples.
Exemplary electronic device
An embodiment of the present disclosure further provides an electronic device, including: a memory for storing a computer program;
a processor for executing the computer program stored in the memory, wherein the computer program, when executed, implements the method of any of the above embodiments of the present disclosure.
Fig. 17 is a schematic structural diagram of an application embodiment of the electronic device of the present disclosure. In this embodiment, the electronic device 10 includes one or more processors 11 and a memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 11 to implement the methods of the various embodiments of the disclosure described above and/or other desired functionality. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input means 13 may be, for example, a microphone or a microphone array as described above for capturing an input signal of a sound source.
The input device 13 may also include, for example, a keyboard, a mouse, and the like.
The output device 14 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present disclosure are shown in fig. 17, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform steps in a method according to various embodiments of the present disclosure as described in the "exemplary methods" section of this specification above.
The computer program product may write program code for carrying out operations for embodiments of the present disclosure in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in methods according to various embodiments of the present disclosure as described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of the devices, apparatuses, equipment, and systems involved in the present disclosure are given as illustrative examples only and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, or configured in any manner. Words such as "including", "comprising", "having", and the like are open-ended words that mean "including, but not limited to", and are used interchangeably therewith. The words "or" and "and" as used herein refer to, and are used interchangeably with, the word "and/or", unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The method and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (13)

1. A training method of a semantic segmentation model comprises the following steps:
acquiring training data;
and training a pre-established semantic segmentation network by using the training data and adopting a semi-supervised learning training mode of fusing pixel-level contrast learning and cross supervision to obtain a semantic segmentation model.
2. The method of claim 1, wherein the training data comprises first image data, first label data corresponding to the first image data, and second image data;
the method for training the pre-established semantic segmentation network by using the training data and adopting a semi-supervised learning training mode integrating pixel-level contrast learning and cross supervision to obtain a semantic segmentation model comprises the following steps:
respectively carrying out supervised learning training on a first semantic segmentation network and a second semantic segmentation network which are established in advance by using the first image data and the first label data;
performing cross supervised learning training on the first semantic segmentation network and the second semantic segmentation network by using the second image data, and performing pixel level contrast learning training on the first semantic segmentation network and the second semantic segmentation network based on a pixel level contrast rule;
determining the current loss based on a preset loss function, wherein the preset loss function comprises a first loss function corresponding to the supervised learning, a second loss function corresponding to the cross supervised learning and a third loss function corresponding to the pixel level comparison learning;
and determining that the current loss meets a preset condition, finishing training, and taking the first semantic segmentation network or the second semantic segmentation network obtained by training as the semantic segmentation model.
3. The method of claim 1, wherein the training data comprises first image data, first label data corresponding to the first image data, and second image data;
the method for training the pre-established semantic segmentation network by using the training data and adopting a semi-supervised learning training mode integrating pixel-level contrast learning and cross supervision to obtain a semantic segmentation model comprises the following steps:
predicting the first image data by utilizing a first semantic segmentation network to obtain first probability data corresponding to the first image data;
predicting the first image data by utilizing a second semantic segmentation network to obtain second probability data corresponding to the first image data;
predicting the second image data by utilizing the first semantic segmentation network to obtain a third feature map and third probability data corresponding to the second image data;
predicting the second image data by using the second semantic segmentation network to obtain a fourth feature map and fourth probability data corresponding to the second image data;
determining first pseudo tag data based on the third probability data as tag data of the second image data under the second semantic segmentation network;
determining second pseudo tag data based on the fourth probability data as tag data of the second image data under the first semantic segmentation network;
determining a first cross entropy loss based on the first probability data, the second probability data, and the first label data;
determining a second cross entropy loss based on the third probability data, the second pseudo label data, the fourth probability data, and the first pseudo label data;
performing pixel-level comparison on the third feature map and the fourth feature map based on the third probability data and the fourth probability data to obtain pixel-level comparison loss;
and adjusting parameters of the semantic segmentation network based on the first cross entropy loss, the second cross entropy loss and the pixel level comparison loss until a preset training end condition is met, and obtaining the semantic segmentation model.
4. The method of claim 3, wherein the pixel-level contrast loss comprises a first pixel-level directional contrast loss and a second pixel-level directional contrast loss;
the performing pixel-level comparison on the third feature map and the fourth feature map based on the third probability data and the fourth probability data to obtain pixel-level comparison loss includes:
performing channel dimension reduction processing on the third feature map to obtain a fifth feature map, and performing channel dimension reduction processing on the fourth feature map to obtain a sixth feature map;
based on the fifth feature map, the sixth feature map and a preset sampling rule, performing pixel sample pair sampling to obtain a first positive sample pair set, a second positive sample pair set, a first negative sample pair set and a second negative sample pair set; the first positive sample pair set and the first negative sample pair set are respectively a positive sample pair set and a negative sample pair set obtained by sampling with a pixel in the fifth feature map as a reference sample, and the second positive sample pair set and the second negative sample pair set are respectively a positive sample pair set and a negative sample pair set obtained by sampling with a pixel in the sixth feature map as a reference sample;
determining the first pixel-level directional contrast loss based on the first set of positive sample pairs and the first set of negative sample pairs;
determining the second pixel-level directional contrast loss based on the second set of positive sample pairs and the second set of negative sample pairs.
5. The method according to claim 4, wherein the sampling pixel sample pairs based on the fifth feature map, the sixth feature map and a preset sampling rule to obtain a first positive sample pair set, a second positive sample pair set, a first negative sample pair set and a second negative sample pair set comprises:
for any first pixel in the fifth feature map and any second pixel in the sixth feature map, determining that the first pixel and the second pixel are located at the same position, the maximum probability corresponding to the first pixel is smaller than the maximum probability corresponding to the second pixel, and the maximum probability corresponding to the second pixel is greater than a preset probability threshold, and then taking the first pixel and the second pixel as a positive sample pair with the pixel in the fifth feature map as a reference sample; thereby obtaining the first set of positive sample pairs;
determining that the first pixel and the second pixel are located at the same position, the maximum probability corresponding to the first pixel is greater than the maximum probability corresponding to the second pixel, and the maximum probability corresponding to the first pixel is greater than a preset threshold, and then taking the first pixel and the second pixel as a positive sample pair with a pixel in the sixth feature map as a reference sample; thereby obtaining the second set of positive sample pairs;
determining that the first pixel and the second pixel are located at different positions, and the maximum probability of the first pixel is inconsistent with the maximum probability of the second pixel, taking the first pixel and the second pixel as a negative sample pair taking the pixel in the fifth feature map as a reference sample, and taking the first pixel and the second pixel as a negative sample pair taking the pixel in the sixth feature map as a reference sample; thereby obtaining the first set of negative example pairs and the second set of negative example pairs.
6. The method of claim 4, wherein the determining the first pixel-level directional contrast loss based on the first set of positive sample pairs and the first set of negative sample pairs comprises:
generating a first positive sample pair mask based on the first set of positive sample pairs;
generating a first negative exemplar pair mask based on the first set of negative exemplar pairs;
determining the first pixel-level directional contrast loss based on the first positive-sample pair mask and the first negative-sample pair mask;
said determining the second pixel-level directional contrast loss based on the second set of positive sample pairs and the second set of negative sample pairs, comprising:
generating a second positive sample pair mask based on the second set of positive sample pairs;
generating a second negative-sample pair mask based on the second set of negative-sample pairs;
determining the second pixel-level directional contrast loss based on the second positive-sample pair mask and the second negative-sample pair mask.
7. The method of claim 3, wherein the determining first pseudo tag data based on the third probability data comprises:
performing one-hot coding on the third probability data to obtain the first pseudo tag data;
said determining second pseudo tag data based on said fourth probability data comprises:
and carrying out one-hot coding on the fourth probability data to obtain the second pseudo tag data.
8. The method of claim 3, wherein the determining a second cross entropy loss based on the third probability data, the second pseudo tag data, the fourth probability data, and the first pseudo tag data comprises:
determining first cross entropy respectively corresponding to each pixel of the second image data under the first semantic segmentation network based on the third probability data and the second pseudo label data;
determining second cross entropy corresponding to each pixel of the second image data under the second semantic segmentation network based on the fourth probability data and the first pseudo label data;
and determining the second cross entropy loss based on the first cross entropy and the second cross entropy corresponding to each pixel of the second image data.
9. The method of claim 3, wherein the adjusting parameters of the semantic segmentation network based on the first cross-entropy loss, the second cross-entropy loss, and the pixel-level contrast loss until a preset training end condition is met comprises:
weighting and summing the first cross entropy loss, the second cross entropy loss and the pixel level contrast loss based on preset weight to obtain comprehensive loss;
determining that the comprehensive loss does not meet a preset training end condition, and adjusting parameters of the semantic segmentation network based on the comprehensive loss; or determining that the comprehensive loss meets a preset training end condition, and ending the training.
10. A semantic segmentation method of an image, comprising:
acquiring image data to be processed;
performing semantic segmentation on the image data to be processed based on a pre-obtained semantic segmentation model to obtain a semantic segmentation result corresponding to the image data to be processed;
the semantic segmentation model is obtained by a training method of the semantic segmentation model according to any one of claims 1 to 9.
11. A training apparatus for a semantic segmentation model, comprising:
the first acquisition module is used for acquiring training data;
and the first processing module is used for training a pre-established semantic segmentation network by using the training data and adopting a semi-supervised learning training mode of fusing pixel-level contrast learning and cross supervision to obtain a semantic segmentation model.
12. A computer-readable storage medium, the storage medium storing a computer program for performing the method of any of the preceding claims 1-10.
13. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-10.
CN202210200623.8A 2022-03-01 2022-03-01 Training method and device of semantic segmentation model and semantic segmentation method of image Active CN114565812B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210200623.8A CN114565812B (en) 2022-03-01 2022-03-01 Training method and device of semantic segmentation model and semantic segmentation method of image

Publications (2)

Publication Number Publication Date
CN114565812A true CN114565812A (en) 2022-05-31
CN114565812B CN114565812B (en) 2024-10-18

Family

ID=81716116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210200623.8A Active CN114565812B (en) 2022-03-01 2022-03-01 Training method and device of semantic segmentation model and semantic segmentation method of image

Country Status (1)

Country Link
CN (1) CN114565812B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837836A (en) * 2019-11-05 2020-02-25 中国科学技术大学 Semi-supervised semantic segmentation method based on maximized confidence
CN111666953A (en) * 2020-06-04 2020-09-15 电子科技大学 Tidal zone surveying and mapping method and device based on semantic segmentation
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN112699892A (en) * 2021-01-08 2021-04-23 北京工业大学 Unsupervised field self-adaptive semantic segmentation method
CN113223037A (en) * 2021-05-31 2021-08-06 南开大学 Unsupervised semantic segmentation method and unsupervised semantic segmentation system for large-scale data

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294774A (en) * 2022-06-20 2022-11-04 桂林电子科技大学 Non-motor vehicle road illegal parking detection method and device based on deep learning
CN115294774B (en) * 2022-06-20 2023-12-29 桂林电子科技大学 Non-motor vehicle road stopping detection method and device based on deep learning
CN114972313A (en) * 2022-06-22 2022-08-30 北京航空航天大学 Image segmentation network pre-training method and device
CN114972313B (en) * 2022-06-22 2024-04-19 北京航空航天大学 Image segmentation network pre-training method and device
CN115423031A (en) * 2022-09-20 2022-12-02 腾讯科技(深圳)有限公司 Model training method and related device
CN116883673A (en) * 2023-09-08 2023-10-13 腾讯科技(深圳)有限公司 Semantic segmentation model training method, device, equipment and storage medium
CN116883673B (en) * 2023-09-08 2023-12-26 腾讯科技(深圳)有限公司 Semantic segmentation model training method, device, equipment and storage medium
CN117421657A (en) * 2023-10-27 2024-01-19 江苏开放大学(江苏城市职业学院) Sampling and learning method and system for noisy labels based on oversampling strategy

Also Published As

Publication number Publication date
CN114565812B (en) 2024-10-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant