WO2024012255A1 - Semantic segmentation model training method, device, electronic device and storage medium

Semantic segmentation model training method, device, electronic device and storage medium

Info

Publication number
WO2024012255A1
Authority
WO
WIPO (PCT)
Prior art keywords
segmentation
map
loss
semantic
sample image
Prior art date
Application number
PCT/CN2023/104539
Other languages
English (en)
French (fr)
Inventor
覃杰
吴捷
李明
肖学锋
Original Assignee
北京字跳网络技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京字跳网络技术有限公司
Publication of WO2024012255A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G06N3/09 Supervised learning
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • Embodiments of the present disclosure relate to the field of image processing technology, and in particular, to a semantic segmentation model training method, device, electronic device, and storage medium.
  • Image semantic segmentation refers to the technology of segmenting objects expressing different meanings in an image into different targets by identifying the content in the image. Semantic segmentation of images is usually achieved by deploying a trained semantic segmentation model, and it is widely used in various applications.
  • a lightweight semantic segmentation model needs to be trained and deployed on the terminal device.
  • Embodiments of the present disclosure provide a semantic segmentation model training method, device, electronic device, and storage medium.
  • embodiments of the present disclosure provide a semantic segmentation model training method, including:
  • the teacher semantic segmentation model includes a first teacher network and a second teacher network, wherein the first teacher network has the structural characteristics of low depth and high width, and the second teacher network has the structural characteristics of high depth and low width; process the sample image based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, where the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network; and train a lightweight student semantic segmentation model according to the sample image, the first segmentation map and the second segmentation map to obtain the target semantic segmentation model.
  • embodiments of the present disclosure provide a semantic segmentation model training device, including:
  • An acquisition module is used to acquire a pre-trained teacher semantic segmentation model.
  • the teacher semantic segmentation model includes a first teacher network and a second teacher network, wherein the first teacher network has the structural characteristics of low depth and high width, and the second teacher network has the structural characteristics of high depth and low width;
  • a processing module configured to process the sample image based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, wherein the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network;
  • a training module configured to train a lightweight student semantic segmentation model based on the sample image, the first segmentation map, and the second segmentation map to obtain a target semantic segmentation model.
  • an electronic device including:
  • a processor and a memory communicatively connected to the processor
  • the memory stores computer-executable instructions;
  • the processor executes the computer-executable instructions stored in the memory to implement the semantic segmentation model training method described in the first aspect and the various possible designs of the first aspect.
  • embodiments of the present disclosure provide a computer-readable storage medium.
  • Computer-executable instructions are stored in the computer-readable storage medium.
  • When a processor executes the computer-executable instructions, the semantic segmentation model training method described in the first aspect and the various possible designs of the first aspect is implemented.
  • embodiments of the present disclosure provide a computer program product, including a computer program that, when executed by a processor, implements the semantic segmentation model training method described in the first aspect and the various possible designs of the first aspect.
  • embodiments of the present disclosure provide a computer program that, when executed by a processor, implements the semantic segmentation model training method described in the first aspect and various possible designs of the first aspect.
  • In the semantic segmentation model training method, device, electronic device and storage medium provided by this embodiment, a pre-trained teacher semantic segmentation model is obtained.
  • the teacher semantic segmentation model includes a first teacher network and a second teacher network, wherein the first teacher network has the structural characteristics of low depth and high width, and the second teacher network has the structural characteristics of high depth and low width; the sample image is processed based on the teacher semantic segmentation model to obtain the first segmentation map and the second segmentation map, wherein the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network, and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network; according to the sample image, the first segmentation map and the second segmentation map, a lightweight student semantic segmentation model is trained to obtain a target semantic segmentation model.
  • Since the student semantic segmentation model is trained through the teacher semantic segmentation model composed of the first teacher network and the second teacher network with differentiated structural characteristics, the specific strengths of the first teacher network and the second teacher network can be fully utilized, and the two complementary dimensions (width and depth) provide learnable knowledge and knowledge supervision for the training of the student semantic segmentation model.
  • Figure 1 is an application scenario diagram of the semantic segmentation model training method provided by an embodiment of the present disclosure;
  • Figure 2 is a schematic flowchart 1 of the semantic segmentation model training method provided by an embodiment of the present disclosure;
  • Figure 3 is a schematic structural diagram of a first teacher network provided by an embodiment of the present disclosure;
  • Figure 4 is a schematic structural diagram of a second teacher network provided by an embodiment of the present disclosure;
  • Figure 5 is a flow chart of specific implementation steps of step S103 in the embodiment shown in Figure 2;
  • Figure 6 is a schematic diagram of a process for generating a target supervised loss provided by an embodiment of the present disclosure;
  • Figure 7 is a schematic flowchart 2 of the semantic segmentation model training method provided by an embodiment of the present disclosure;
  • Figure 8 is a flow chart of specific implementation steps of step S207 in the embodiment shown in Figure 7;
  • Figure 9 is a flow chart of specific implementation steps of step S208 in the embodiment shown in Figure 7;
  • Figure 10 is a schematic diagram of a process for obtaining a target unsupervised loss provided by an embodiment of the present disclosure;
  • Figure 11 is a structural block diagram of a semantic segmentation model training device provided by an embodiment of the present disclosure;
  • Figure 12 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure;
  • Figure 13 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present disclosure.
  • Figure 1 is an application scenario diagram of the semantic segmentation model training method provided by the embodiment of the present disclosure.
  • the semantic segmentation model training method provided by the embodiment of the present disclosure can be applied to the application scenario of model training before deploying a lightweight semantic segmentation model.
  • the method provided by the embodiments of the present disclosure can be applied to terminal devices, servers and other devices used for model training.
  • Here, the server is taken as an example: a pre-trained teacher semantic segmentation model and the lightweight student semantic segmentation model to be trained (shown as the lightweight model in the figure) are pre-stored in the server.
  • the server receives a training instruction sent by a developer user through a development terminal device, and uses the semantic segmentation model training method provided by the embodiments of the present disclosure to train the lightweight model until the model convergence conditions are met, obtaining the target semantic segmentation model. Afterwards, the server receives a deployment instruction (not shown in the figure) sent by the terminal device and performs lightweight model deployment, that is, deploys the lightweight target semantic segmentation model to the user terminal device. After deployment is completed, the target semantic segmentation model running in the user terminal device can provide image semantic segmentation services in response to application requests.
  • FIG 2 is a schematic flowchart 1 of a semantic segmentation model training method provided by an embodiment of the present disclosure.
  • the method of this embodiment can be applied to electronic devices with computing capabilities, such as model training servers, terminal devices, etc.
  • This embodiment is introduced with the terminal device as the execution subject.
  • the semantic segmentation model training method includes:
  • Step S101 Obtain the pre-trained teacher semantic segmentation model.
  • the teacher semantic segmentation model includes a first teacher network and a second teacher network.
  • the first teacher network has the structural characteristics of low depth and high width,
  • and the second teacher network has the structural characteristics of high depth and low width.
  • the teacher semantic segmentation model is a pre-trained model with image semantic segmentation capabilities.
  • the teacher semantic segmentation model includes a pre-trained first teacher network and a pre-trained second teacher network, both of which have image semantic segmentation capabilities.
  • The structural characteristics of low depth and high width mean that the first teacher network has fewer network layers but more network output channels, that is, a "shallow and wide" network structure.
  • Figure 3 is a schematic structural diagram of a first teacher network provided by an embodiment of the present disclosure.
  • the first teacher network can be an encoder-decoder network structure, which includes four symmetrically arranged network layers.
  • the first teacher network has the characteristic of low depth, that is, it has fewer network layers, but it also has the characteristic of high width, that is, its (one or more) network layers have a relatively large number of channels.
  • the second teacher network has the structural characteristics of high depth and low width, that is, the second teacher network has more network layers but fewer network output channels, that is, a "deep and narrow" network structure.
  • Figure 4 is a schematic structural diagram of a second teacher network provided by an embodiment of the present disclosure.
  • the second teacher network can be an encoder-decoder network structure, which includes six symmetrically arranged network layers (shown as L1, L2, L3, L4, L5, L6 in the figure); the second teacher network has the characteristic of high depth, that is, a large number of network layers, but it also has the characteristic of low width, that is, the number of channels in its network layer(s) is relatively small.
  • the depth-width ratio coefficient of the first teacher network is less than or equal to a first threshold;
  • the depth-width ratio coefficient of the second teacher network is greater than or equal to a second threshold;
  • the first threshold is less than the second threshold;
  • the depth-width ratio coefficient represents the ratio of the number of network layers to the number of network output channels.
  • the corresponding first threshold and second threshold can be selected based on different business requirements (e.g., accuracy requirements, real-time requirements), and a lightweight student semantic segmentation model can then be trained based on the corresponding first teacher network and second teacher network.
  • the first teacher network can be a Wide ResNet-34 network
  • the second teacher network can be a ResNet-101 network.
  • the specific implementation methods of the first teacher network and the second teacher network can be set according to specific needs and are not limited here.
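  • As an illustration of the depth-width ratio coefficient, the following minimal sketch uses hypothetical layer/channel counts and threshold values; none of the numbers are from this disclosure:

```python
def depth_width_ratio(num_layers, num_output_channels):
    """Depth-width ratio coefficient: the ratio of the number of network
    layers to the number of network output channels."""
    return num_layers / num_output_channels

# Illustrative thresholds; the disclosure leaves them to be chosen per
# business requirements (accuracy, real-time performance, etc.).
FIRST_THRESHOLD = 0.05   # "shallow and wide" first teacher: ratio <= first threshold
SECOND_THRESHOLD = 0.20  # "deep and narrow" second teacher: ratio >= second threshold

assert depth_width_ratio(34, 1024) <= FIRST_THRESHOLD    # 34/1024 ~= 0.033
assert depth_width_ratio(101, 256) >= SECOND_THRESHOLD   # 101/256 ~= 0.395
```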
  • Step S102 Process the sample image based on the teacher's semantic segmentation model to obtain a first segmentation map and a second segmentation map.
  • the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network, and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network.
  • the preset sample images are input to the first teacher network and the second teacher network for processing, and the prediction results output by the first teacher network and the second teacher network, namely the first segmentation map and the second segmentation map, can be obtained respectively.
  • Due to the difference in network structure between the first teacher network and the second teacher network, the first segmentation map and the second segmentation map they output are also different.
  • Based on its structural characteristics of low depth and high width, the first teacher network has a sufficient number of channels, is good at capturing diverse local content perception information, and is conducive to modeling the contextual relationships between pixels; based on its structural characteristics of high depth and low width, the second teacher network has more network layers, which is more conducive to extracting global information, and it has the ability to abstract high-level semantics and perform global classification.
  • the first segmentation map output by the first teacher network can better express local information
  • the second segmentation map output by the second teacher network can better express global information.
  • The processing of the sample image by the first teacher network and the second teacher network is therefore equivalent to extracting the information in the sample image from two complementary dimensions; the lightweight student semantic segmentation model is then trained based on the obtained first segmentation map and second segmentation map, thereby optimizing the student semantic segmentation model.
  • By setting up a first teacher network and a second teacher network with differentiated network structures, information can be extracted from image samples along two complementary dimensions, improving the effect of subsequent training of the student semantic segmentation model.
  • Step S103 Train a lightweight student semantic segmentation model based on the sample image, the first segmentation map and the second segmentation map to obtain the target semantic segmentation model.
  • the lightweight student semantic segmentation model is a preset small neural network model.
  • the student semantic segmentation model has a small amount of calculation and parameters, and can be easily deployed on devices with limited resources. More specifically, it can be a network model with both low depth and low width.
  • the number of network layers of the student semantic segmentation model can be the same as the number of network layers of the first teacher network.
  • the process of training the lightweight student semantic segmentation model based on the first segmentation map and the second segmentation map is equivalent to the process of knowledge supervision of the student semantic segmentation model.
  • the parameters of the first teacher network and the second teacher network are fixed. Therefore, this process is a process of improving the performance of the student model by performing offline distillation through the first teacher network and the second teacher network.
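  • As an illustration of this offline distillation setup, the following is a minimal sketch assuming PyTorch; the model objects and function names are placeholders, not from this disclosure:

```python
import torch

def distillation_forward(teacher_wide, teacher_deep, student, sample_image):
    """One forward pass of the offline distillation setup: both pre-trained
    teachers are frozen, so gradients only reach the student's parameters."""
    for teacher in (teacher_wide, teacher_deep):
        teacher.eval()
        for p in teacher.parameters():
            p.requires_grad_(False)  # teacher parameters stay fixed

    with torch.no_grad():  # no gradients flow into the teachers
        first_seg_map = teacher_wide(sample_image)   # "shallow and wide" first teacher
        second_seg_map = teacher_deep(sample_image)  # "deep and narrow" second teacher

    prediction = student(sample_image)  # student output used by the losses below
    return first_seg_map, second_seg_map, prediction
```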
  • the sample image includes a labeled sample image and an unlabeled sample image.
  • the first segmentation map includes a first labeled segmentation map generated from the labeled sample image and a first unlabeled segmentation map generated from the unlabeled sample image.
  • the second segmentation map includes a second labeled segmentation map generated from the labeled sample image and a second unlabeled segmentation map generated from the unlabeled sample image.
  • the specific implementation steps of step S103 include:
  • Step S1031 Obtain the target supervision loss based on the labeled sample image, the first labeled segmentation map and the second labeled segmentation map.
  • a labeled sample image is data including an image and corresponding annotation information.
  • By processing the labeled sample image with the student semantic segmentation model, the result of semantic segmentation of the labeled sample image by the student semantic segmentation model, that is, the first prediction result, can be obtained.
  • Then, a first supervised loss and/or a second supervised loss can be obtained, where the first supervised loss represents the difference between the annotation information and the first prediction result, and the second supervised loss represents the pixel-level consistency difference of the first labeled segmentation map and the second labeled segmentation map relative to the first prediction result.
  • the target supervision loss can be the first supervision loss, the second supervision loss, or the weighted sum of the first supervision loss and the second supervision loss.
  • the method of calculating the first supervised loss includes: after obtaining the first prediction result, calculating based on the preset supervised loss function, using the first prediction result and the annotation information of the labeled sample image as input, to obtain the first supervised loss.
  • the specific implementation method of calculating the corresponding supervision loss based on the supervision loss function will not be described again here.
  • the method of calculating the second supervised loss includes: after obtaining the first prediction result, using the first labeled segmentation map and the second labeled segmentation map corresponding to the labeled sample image as pseudo labels for the first prediction result to constrain it, thereby obtaining the corresponding pixel-level consistency difference. Specifically, based on the preset pixel-level consistency loss function for labeled data, the first prediction result, the first labeled segmentation map and the second labeled segmentation map are used as input for calculation, and the second supervised loss is obtained.
  • the pixel-level consistency loss function for labeled data, Equation (1), takes a form such as:

    L_pc^l = (1 / (H × W)) Σ_{i=1}^{H×W} [ (y_i − ŷ_i^(1))² + (y_i − ŷ_i^(2))² ]    (1)

  • where y_i represents the i-th pixel of the first prediction result, ŷ_i^(1) and ŷ_i^(2) represent the corresponding pixels of the first labeled segmentation map and the second labeled segmentation map, respectively, and H × W represents the total number of pixels of the first prediction result.
  • Since the three networks segment the same labeled sample image, the segmentation results predicted by the three should ideally have pixel-level consistency.
  • Through the second supervised loss, the prediction results of the multi-branch outputs can be kept consistent, thereby achieving auxiliary supervision of the student semantic segmentation model and improving its training effect.
  • Based on the first supervised loss and/or the second supervised loss, the target supervised loss can be obtained, for example by weighted summation; the specific implementation can be set as needed and will not be described again here.
  • Figure 6 is a schematic diagram of a process for generating the target supervised loss provided by an embodiment of the present disclosure. As shown in Figure 6, after the labeled image data is input into the first teacher network, the second teacher network and the student semantic segmentation model respectively, the first teacher network outputs the first labeled segmentation map, the second teacher network outputs the second labeled segmentation map, and the student semantic segmentation model outputs the first prediction result.
  • the first prediction result is combined with the label information to generate the first supervised loss;
  • the first labeled segmentation map and the second labeled segmentation map are used as pseudo labels of the first prediction result and, combined with the first prediction result, generate the second supervised loss; the first supervised loss and the second supervised loss are weighted and summed to obtain the target supervised loss.
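  • A minimal sketch of this supervised-loss computation is shown below, assuming PyTorch and outputs of shape (batch, classes, H, W); cross-entropy for the first supervised loss, the squared-error consistency form of Equation (1), and the weights w1 and w2 are illustrative assumptions:

```python
import torch.nn.functional as F

def pixel_consistency(pred, seg_a, seg_b):
    """Pixel-level consistency of a prediction against two pseudo-label maps,
    averaged over all pixels (the form assumed here for Equations (1) and (2))."""
    return F.mse_loss(pred, seg_a) + F.mse_loss(pred, seg_b)

def target_supervised_loss(first_prediction, labels,
                           first_labeled_seg, second_labeled_seg,
                           w1=1.0, w2=0.5):
    # First supervised loss: difference between annotation info and the prediction.
    l_sup1 = F.cross_entropy(first_prediction, labels)
    # Second supervised loss: the teachers' labeled segmentation maps act as
    # pseudo labels constraining the student's prediction.
    l_sup2 = pixel_consistency(first_prediction.softmax(dim=1),
                               first_labeled_seg.softmax(dim=1),
                               second_labeled_seg.softmax(dim=1))
    return w1 * l_sup1 + w2 * l_sup2  # weighted sum -> target supervised loss
```

  • The same pixel_consistency form can serve the unlabeled branch (Equation (2) below), with the first and second unlabeled segmentation maps as the pseudo labels for the second prediction result.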
  • Step S1032 Obtain the target unsupervised loss based on the unlabeled sample image, the first unlabeled segmentation map and the second unlabeled segmentation map.
  • an unlabeled sample image includes only the image, without corresponding annotation information.
  • Unlabeled sample images are cheaper to acquire and more numerous. Therefore, by extracting information from unlabeled sample images for full training, the performance of the student semantic segmentation model can be improved and the problem of limited performance of the lightweight student semantic segmentation model can be avoided.
  • the unlabeled sample image is processed by the student semantic segmentation model, and the result of the semantic segmentation of the unlabeled sample image by the student semantic segmentation model can be obtained, that is, the second prediction result.
  • This process is the same as the process by which the student semantic segmentation model processes labeled sample images, and will not be described again.
  • the first unlabeled segmentation map and the second unlabeled segmentation map are used as pseudo labels corresponding to the second prediction result, and the loss function is calculated to obtain the corresponding target unsupervised loss.
  • the target unsupervised loss includes a first unsupervised loss, and the first unsupervised loss represents the pixel-level consistency difference of the first unlabeled segmentation map and the second unlabeled segmentation map relative to the second prediction result.
  • the method of calculating the first unsupervised loss includes: after obtaining the second prediction result, using the first unlabeled segmentation map and the second unlabeled segmentation map corresponding to the unlabeled sample image as pseudo labels for the second prediction result to constrain it, thereby obtaining the corresponding pixel-level consistency difference. Specifically, based on the preset pixel-level consistency loss function for unlabeled data, the second prediction result, the first unlabeled segmentation map and the second unlabeled segmentation map are used as input for calculation, and the first unsupervised loss is obtained. The pixel-level consistency loss function for unlabeled data, Equation (2), takes a form analogous to Equation (1):

    L_pc^u = (1 / (H × W)) Σ_{j=1}^{H×W} [ (y_j − ŷ_j^(1))² + (y_j − ŷ_j^(2))² ]    (2)

  • where y_j represents the j-th pixel of the second prediction result, ŷ_j^(1) is the corresponding pixel of the first unlabeled segmentation map, ŷ_j^(2) is the corresponding pixel of the second unlabeled segmentation map, and H × W represents the total number of pixels of the second prediction result; the calculation is otherwise the same as for the second supervised loss.
  • Step S1033 Perform weighted fusion based on the target supervised loss and the target unsupervised loss to obtain the output loss, and perform reverse gradient propagation based on the output loss to adjust the network parameters of the student semantic segmentation model to obtain the target semantic segmentation model.
  • the target supervised loss and the target unsupervised loss are weighted and fused to obtain the output loss, where the weighting coefficients corresponding to the target supervised loss and the target unsupervised loss can be set based on specific needs and adjusted dynamically. For example, in the early stage of training the student semantic segmentation model, the target supervised loss corresponding to the labeled sample images can be given a larger weight coefficient to improve the model convergence speed;
  • in the later stage of training, the target unsupervised loss corresponding to the unlabeled sample images can be given a larger (or slightly larger) weight coefficient, so as to make full use of the information in the unlabeled sample images and improve the performance of the student semantic segmentation model.
  • Then, reverse gradient propagation is performed based on the output loss to adjust the network parameters of the student semantic segmentation model, obtaining an optimized student semantic segmentation model; this is repeated multiple times until the model converges.
  • the converged student semantic segmentation model is the target semantic segmentation model.
  • the output loss obtained in this way makes full use of the information in the labeled sample images and unlabeled sample images, and at the same time combines the specialized information extraction capabilities of the differentiated first teacher network and second teacher network, which can improve the learning ability of the student semantic segmentation model.
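  • One training iteration with such a dynamically weighted output loss might look like the sketch below; the linear ramp-up schedule and the step counts are illustrative assumptions, not values from this disclosure:

```python
def output_loss_weights(step, ramp_steps=10_000):
    """Illustrative schedule: emphasize the supervised loss early for faster
    convergence, then ramp the unsupervised weight up to exploit unlabeled data."""
    w_unsup = min(1.0, step / ramp_steps)
    return 1.0, w_unsup

def train_step(optimizer, target_supervised, target_unsupervised, step):
    w_sup, w_unsup = output_loss_weights(step)
    output_loss = w_sup * target_supervised + w_unsup * target_unsupervised
    optimizer.zero_grad()
    output_loss.backward()  # reverse gradient propagation through the student only
    optimizer.step()        # adjusts the student's network parameters
    return output_loss.item()
```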
  • the teacher semantic segmentation model includes a first teacher network and a second teacher network, where the first teacher network has the structural characteristics of low depth and high width, and the second teacher network has the structural characteristics of high depth and low width; the sample image is processed based on the teacher semantic segmentation model to obtain the first segmentation map and the second segmentation map.
  • the first segmentation map is the result of the semantic segmentation of the sample image by the first teacher network.
  • the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network; based on the sample image, the first segmentation map and the second segmentation map, a lightweight student semantic segmentation model is trained to obtain the target semantic segmentation model.
  • Since the student semantic segmentation model is trained through the teacher semantic segmentation model composed of the first teacher network and the second teacher network with differentiated structural characteristics, the specific strengths of the first teacher network and the second teacher network can be fully utilized, and the two complementary dimensions (width and depth) provide learnable knowledge and knowledge supervision for the training of the student semantic segmentation model, thereby improving the performance of the resulting model.
  • Figure 7 is a schematic flowchart 2 of the semantic segmentation model training method provided by an embodiment of the present disclosure. Based on the embodiment shown in Figure 2, this embodiment further refines the specific implementation of step S102.
  • the semantic segmentation model training method includes:
  • Step S201 Obtain a pre-trained teacher semantic segmentation model.
  • the teacher semantic segmentation model includes a first teacher network and a second teacher network.
  • the first teacher network has the structural characteristics of low depth and high width
  • the second teacher network has the structural characteristics of high depth and low width.
  • Step S202 Process the sample image based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, where the sample image includes a labeled sample image and an unlabeled sample image, the first segmentation map includes a first labeled segmentation map and a first unlabeled segmentation map,
  • and the second segmentation map includes a second labeled segmentation map and a second unlabeled segmentation map.
  • the labeled sample image and the unlabeled sample image are processed respectively based on the first teacher network and the second teacher network to obtain the corresponding first labeled segmentation map, first unlabeled segmentation map, second labeled segmentation map and second unlabeled segmentation map,
  • wherein the order of processing the labeled sample images and the unlabeled sample images can be set according to specific needs, and is not limited here.
  • the above-mentioned specific implementation of obtaining the first labeled segmentation map, the first unlabeled segmentation map, the second labeled segmentation map and the second unlabeled segmentation map has been introduced in the embodiment shown in Figure 2 and will not be repeated here.
  • Step S203 Obtain the target supervision loss based on the labeled sample image, the first labeled segmentation map and the second labeled segmentation map.
  • Step S204 Based on the student semantic segmentation model, process the unlabeled sample image to obtain the second prediction result.
  • Step S205 Obtain a first unsupervised loss based on the first label-free segmentation map, the second label-free segmentation map and the second prediction result.
  • the first unsupervised loss represents the pixel-level consistency difference of the first unlabeled segmentation map and the second unlabeled segmentation map relative to the second prediction result.
  • step S203 is the step of obtaining the target supervised loss based on the labeled sample image, which has been introduced in the embodiment shown in Figure 2.
  • for details, refer to step S1031 of that embodiment; it will not be repeated here.
  • Steps S204-S205 obtain the second prediction result and the first unsupervised loss based on the unlabeled sample image.
  • these steps have been introduced in the embodiment shown in Figure 2; for details, refer to step S1032 of that embodiment, and they will not be described again here.
  • Step S206 Obtain the first feature map of the unlabeled sample image output by the decoder of the first teacher network and the second feature map of the unlabeled sample image output by the decoder of the student semantic segmentation model.
  • Step S207 Obtain a second unsupervised loss based on the first feature map and the second feature map.
  • the second unsupervised loss represents the difference between the regional texture correlation of the second prediction result and the regional texture correlation of the first unlabeled segmentation map.
  • the first teacher network is an encoder-decoder network structure and has the structural characteristics of low depth and high width, which make it good at capturing diverse local content-aware information that is helpful for modeling contextual relationships between pixels.
  • Therefore, the first feature map (Features) of the unlabeled sample image output by the decoder of the first teacher network and the second feature map (Features) of the unlabeled sample image output by the decoder of the student semantic segmentation model are obtained.
  • the first feature map can represent the regional texture correlation of the unlabeled sample image as captured by the first teacher network, and the second feature map can represent the regional texture correlation of the unlabeled sample image as captured by the student semantic segmentation model.
  • by calculating the difference between the two, the second unsupervised loss, which represents the difference between the regional texture correlation of the second prediction result and that of the first unlabeled segmentation map, can be obtained; this second unsupervised loss can also be called the region-level content-aware loss.
  • This region-level content-aware loss aims to take advantage of the channels of the wider teacher model (first teacher network) to provide rich local context information. It can provide auxiliary supervision to guide the student model (student semantic segmentation model) to model contextual relationships between pixels. It utilizes the correlation of image patch regions input to the teacher model to guide the texture correlation between regions of the student model.
  • The specific implementation steps of step S207 include:
  • Step S2071 Map the first feature map to a first feature vector set, and map the second feature map to a second feature vector set.
  • the first feature vector set represents the first teacher network's evaluation of the regional content of the unlabeled sample image.
  • the second set of feature vectors characterizes the student semantic segmentation model's evaluation of the region-level content of the unlabeled sample image.
  • Step S2072 Obtain the corresponding first autocorrelation matrix and second autocorrelation matrix according to the first feature vector set and the second feature vector set.
  • the first autocorrelation matrix represents the correlation between the region-level content corresponding to the first feature vector set.
  • the second autocorrelation matrix represents the correlation between the region-level content corresponding to the second feature vector set.
  • Step S2073 Obtain the second unsupervised loss based on the difference between the first autocorrelation matrix and the second autocorrelation matrix.
  • the features (first feature map) of the teacher model (first teacher network) and the features (second feature map) of the student model (student semantic segmentation model) are extracted from the feature space after the decoder.
  • These features (the first feature map and the second feature map) are respectively mapped to feature vector sets of region-level content; that is, the first feature map is mapped to the first feature vector set and the second feature map is mapped to the second feature vector set. Here, H_v × W_v is the number of region-level positions in each set V, and each feature vector v ∈ R^(C×1×1) in V represents the local area content of the original feature (the local feature size is C × H/H_v × W/W_v).
  • Then, the corresponding autocorrelation matrix is obtained from the feature vector set V; the calculation process is shown in Equation (3):

    m_ij = sim(v_i, v_j) = (v_i · v_j) / (‖v_i‖ ‖v_j‖)    (3)

  • where m_ij refers to the value located at coordinates (i, j) in the autocorrelation matrix, calculated by the cosine similarity sim(); v_i and v_j are the i-th and j-th vectors in the flattened feature vector set.
  • the calculated autocorrelation matrix represents the feature region-level correlation and reflects the relationship between different areas of the image. Therefore, the region-level content-aware loss function, that is, the second unsupervised loss, can be obtained by minimizing the difference between the autocorrelation matrices of the different models. Specifically, the second unsupervised loss, Equation (4), takes a form such as:

    L_rc = (1 / (H_v × W_v)²) Σ_{i,j} (m_ij^T − m_ij^S)²    (4)

  • where M^T is the first autocorrelation matrix with entries m_ij^T, and M^S is the second autocorrelation matrix with entries m_ij^S.
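  • The region-level content-aware loss of steps S2071-S2073 can be sketched as follows, assuming PyTorch decoder features of shape (batch, C, H, W); pooling to an Hv × Wv grid to form the region-level vector set and the squared-error comparison of the autocorrelation matrices in Equation (4) are assumed implementation choices:

```python
import torch.nn.functional as F

def region_autocorrelation(feat, hv=8, wv=8):
    """Map a feature map to Hv*Wv region-level vectors and compute their
    cosine-similarity autocorrelation matrix (Equation (3))."""
    regions = F.adaptive_avg_pool2d(feat, (hv, wv))  # each cell summarizes a local area
    v = regions.flatten(2).transpose(1, 2)           # (B, Hv*Wv, C) region vectors
    v = F.normalize(v, dim=-1)                       # unit norm -> dot product = cosine sim
    return v @ v.transpose(1, 2)                     # (B, Hv*Wv, Hv*Wv), m_ij = sim(v_i, v_j)

def region_content_aware_loss(teacher_feat, student_feat):
    """Second unsupervised loss: difference between the wide teacher's and the
    student's region-level autocorrelation matrices (assumed form of Eq. (4))."""
    m_t = region_autocorrelation(teacher_feat).detach()  # first (wide) teacher network
    m_s = region_autocorrelation(student_feat)
    return F.mse_loss(m_s, m_t)
```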
  • Step S208 Based on the second unlabeled segmentation map and the second prediction result, obtain a third unsupervised loss.
  • the third unsupervised loss represents the difference between the global semantic categories corresponding to the second prediction result and the global semantic categories corresponding to the second unlabeled segmentation map.
  • the second teacher network is an encoder-decoder network structure and has the structural characteristics of high depth and low width.
  • the second teacher network has more network layers, which is more conducive to extracting global information, and has the ability of high-level semantic and global classification abstraction.
  • Therefore, based on these characteristics of the second teacher network, after predicting the unlabeled sample image and obtaining the second unlabeled segmentation map and the second prediction result, the high-dimensional semantic abstract information is distilled from the deeper second teacher network to the lightweight student semantic segmentation model, thereby improving the performance of the student semantic segmentation model.
  • The specific implementation steps of step S208 include:
  • Step S2081 Obtain the first global semantic vector corresponding to the second unlabeled segmentation map and the second global semantic vector corresponding to the second prediction result.
  • the first global semantic vector represents the number and semantic categories of segmented objects in the second unlabeled segmentation map, and the second global semantic vector represents the number and semantic categories of segmented objects in the second prediction result.
  • Step S2082 Obtain the third unsupervised loss based on the difference between the first global semantic vector and the second global semantic vector.
  • the global semantic vector of each category is calculated through the global average pooling (GAP) operation.
  • Taking the second unlabeled segmentation map Y ∈ R^(N×H×W) as an example, the first global semantic vector z = G(Y) is calculated as shown in Equation (5):

    z_n = (1 / (H × W)) Σ_{h=1}^{H} Σ_{w=1}^{W} Y_{n,h,w},  n = 1, …, N    (5)

  • where the first global semantic vector z represents the global semantic category vector over the N categories, and G represents the global average pooling operation within each channel.
  • Similarly, the second prediction result can be processed to obtain the second global semantic vector corresponding to the second prediction result; the details will not be described again.
  • In the above, N represents the number of categories and u denotes the unlabeled sample image.
  • the student semantic segmentation model attempts to learn higher-dimensional semantic category representations, which helps provide global guidance for the discrimination of semantic categories in semantic segmentation tasks.
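  • The global semantic vector of Equation (5) and the resulting third unsupervised loss can be sketched as follows, assuming PyTorch class-score maps of shape (batch, N, H, W); applying softmax before pooling and comparing the two vectors by squared error are assumptions:

```python
import torch.nn.functional as F

def global_semantic_vector(seg_scores):
    """Equation (5): global average pooling G within each of the N category
    channels, yielding one global semantic category vector per image."""
    probs = seg_scores.softmax(dim=1)  # (B, N, H, W) per-pixel category scores
    return probs.mean(dim=(2, 3))      # (B, N): GAP over the H*W pixels of each channel

def third_unsupervised_loss(second_unlabeled_seg, second_prediction):
    z_teacher = global_semantic_vector(second_unlabeled_seg).detach()  # deep teacher
    z_student = global_semantic_vector(second_prediction)
    # Assumed squared-error form of the difference between the two global vectors.
    return F.mse_loss(z_student, z_teacher)
```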
  • Step S209 Obtain the target unsupervised loss based on at least one of the first unsupervised loss, the second unsupervised loss and the third unsupervised loss.
  • the target unsupervised loss can be obtained from one or more of them; for example, the first unsupervised loss, the second unsupervised loss and the third unsupervised loss can be weighted and summed to calculate the target unsupervised loss, and the specific weighting coefficients can be set as needed and will not be detailed here.
  • Figure 10 is a schematic diagram of a process for obtaining target unsupervised loss provided by an embodiment of the present disclosure.
  • As shown in Figure 10, the unlabeled sample image is input to the first teacher network, the second teacher network and the student semantic segmentation model respectively. Then, on the one hand, the first feature map output by the decoder of the first teacher network and the second feature map output by the decoder of the student semantic segmentation model are obtained, and the second unsupervised loss is obtained based on the first feature map and the second feature map.
  • On the other hand, the second unlabeled segmentation map output by the second teacher network and the second prediction result output by the student semantic segmentation model are obtained, and the third unsupervised loss is obtained based on the second unlabeled segmentation map and the second prediction result.
  • On another hand, based on the first unlabeled segmentation map output by the first teacher network, the second unlabeled segmentation map output by the second teacher network, and the second prediction result output by the student semantic segmentation model, the first unsupervised loss is obtained. Finally, the first unsupervised loss, the second unsupervised loss and the third unsupervised loss are weighted and fused to obtain the target unsupervised loss.
  • Step S210 Perform weighted fusion according to the target supervised loss and the target unsupervised loss to obtain the output loss, and perform reverse gradient propagation based on the output loss to adjust the network parameters of the student semantic segmentation model to obtain the target semantic segmentation model.
  • step S210 is a step of generating an output loss and training the student semantic segmentation model based on the output loss. It has been introduced in the embodiment shown in Figure 2. For details, please refer to step S1033 corresponding to the embodiment shown in Figure 2. The relevant introduction will not be repeated here.
  • FIG. 11 is a structural block diagram of a semantic segmentation model training device provided by an embodiment of the present disclosure. For convenience of explanation, only parts related to the embodiments of the present disclosure are shown.
  • the semantic segmentation model training device 3 includes:
  • the acquisition module 31 is used to acquire the pre-trained teacher semantic segmentation model.
  • the teacher semantic segmentation model includes a first teacher network and a second teacher network,
  • wherein the first teacher network has the structural characteristics of low depth and high width, and the second teacher network has the structural characteristics of high depth and low width;
  • the processing module 32 is used to process the sample image based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, where the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network, and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network;
  • the training module 33 is used to train a lightweight student semantic segmentation model based on the sample image, the first segmentation map and the second segmentation map to obtain the target semantic segmentation model.
  • the depth-width ratio coefficient of the first teacher network is less than or equal to a first threshold;
  • the depth-width ratio coefficient of the second teacher network is greater than or equal to a second threshold;
  • the first threshold is less than the second threshold;
  • the depth-width ratio coefficient represents the ratio of the number of network layers to the number of network output channels.
  • the sample image includes a labeled sample image and an unlabeled sample image
  • the first segmentation map includes a first labeled segmentation map generated from the labeled sample image and a first unlabeled segmentation map generated from the unlabeled sample image;
  • the second segmentation map includes a second labeled segmentation map generated from the labeled sample image and a second unlabeled segmentation map generated from the unlabeled sample image;
  • the training module 33 is specifically used to: obtain the target supervised loss according to the labeled sample image, the first labeled segmentation map and the second labeled segmentation map; obtain the target unsupervised loss based on the unlabeled sample image, the first unlabeled segmentation map and the second unlabeled segmentation map; and perform weighted fusion based on the target supervised loss and the target unsupervised loss to obtain the output loss, and perform reverse gradient propagation based on the output loss to adjust the network parameters of the student semantic segmentation model to obtain the target semantic segmentation model.
  • when the training module 33 obtains the target supervised loss according to the labeled sample image, the first labeled segmentation map and the second labeled segmentation map, it is specifically used to: process the labeled sample image based on the student semantic segmentation model to obtain the first prediction result; obtain the first supervised loss based on the annotation information of the labeled sample image and the first prediction result, where the first supervised loss represents the difference between the annotation information and the first prediction result; obtain the second supervised loss based on the first labeled segmentation map, the second labeled segmentation map and the first prediction result, where the second supervised loss represents the pixel-level consistency difference of the first labeled segmentation map and the second labeled segmentation map relative to the first prediction result; and obtain the target supervised loss based on the first supervised loss and the second supervised loss.
  • when the training module 33 obtains the target unsupervised loss based on the unlabeled sample image, the first unlabeled segmentation map and the second unlabeled segmentation map, it is specifically used to: process the unlabeled sample image based on the student semantic segmentation model to obtain the second prediction result; obtain the first unsupervised loss based on the first unlabeled segmentation map, the second unlabeled segmentation map and the second prediction result, where the first unsupervised loss represents the pixel-level consistency difference of the first unlabeled segmentation map and the second unlabeled segmentation map relative to the second prediction result; and obtain the target unsupervised loss based on the first unsupervised loss.
  • the processing module 32 is also configured to: obtain the first feature map of the unlabeled sample image output by the decoder of the first teacher network and the second feature map of the unlabeled sample image output by the decoder of the student semantic segmentation model;
  • the training module 33 is also configured to: obtain a second unsupervised loss based on the first feature map and the second feature map, where the second unsupervised loss represents the difference between the regional texture correlation of the second prediction result and that of the first unlabeled segmentation map; when obtaining the target unsupervised loss based on the first unsupervised loss, the training module 33 is specifically used to: obtain the target unsupervised loss based on the first unsupervised loss and the second unsupervised loss.
  • when the training module 33 obtains the second unsupervised loss based on the first feature map and the second feature map, it is specifically used to: map the first feature map to a first feature vector set and map the second feature map to a second feature vector set, where the first feature vector set represents the first teacher network's evaluation of the region-level content of the unlabeled sample image and the second feature vector set represents the student semantic segmentation model's evaluation of the region-level content of the unlabeled sample image; obtain the corresponding first autocorrelation matrix and second autocorrelation matrix according to the first feature vector set and the second feature vector set, where the first autocorrelation matrix represents the correlation between the region-level content corresponding to the first feature vector set and the second autocorrelation matrix represents the correlation between the region-level content corresponding to the second feature vector set; and obtain the second unsupervised loss according to the difference between the first autocorrelation matrix and the second autocorrelation matrix.
  • the training module 33 is also used to: obtain a third unsupervised loss based on the second unlabeled segmentation map and the second prediction result, where the third unsupervised loss represents the difference between the global semantic categories corresponding to the second prediction result and the global semantic categories corresponding to the second unlabeled segmentation map.
  • when the training module 33 obtains the third unsupervised loss based on the second unlabeled segmentation map and the second prediction result, it is specifically used to: obtain the first global semantic vector corresponding to the second unlabeled segmentation map and the second global semantic vector corresponding to the second prediction result, where the first global semantic vector represents the number and semantic categories of the segmented objects in the second unlabeled segmentation map and the second global semantic vector represents the number and semantic categories of the segmented objects in the second prediction result; and obtain the third unsupervised loss based on the difference between the first global semantic vector and the second global semantic vector.
  • the semantic segmentation model training device 3 provided in this embodiment can execute the technical solution of the above method embodiment. Its implementation principles and technical effects are similar, and will not be described again in this embodiment.
  • Figure 12 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. As shown in Figure 12, the electronic device 4 includes:
  • a processor 401 and a memory 402 communicatively connected to the processor 401;
  • the memory 402 stores computer-executable instructions;
  • the processor 401 executes the computer-executable instructions stored in the memory 402 to implement the semantic segmentation model training method in the embodiments shown in Figures 2 to 10.
  • processor 401 and the memory 402 are connected through a bus 403.
  • the electronic device 900 may be a terminal device or a server.
  • terminal devices may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDA), tablet computers (PAD), portable multimedia players (PMP) and vehicle-mounted terminals (such as vehicle-mounted navigation terminals), and fixed terminals such as digital televisions (TV) and desktop computers.
  • the electronic device shown in FIG. 13 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 900 may include a processing device (such as a central processing unit, a graphics processor, etc.) 901, which may perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 into a random access memory (RAM) 903.
  • In the RAM 903, various programs and data required for the operation of the electronic device 900 are also stored.
  • the processing device 901, ROM 902 and RAM 903 are connected to each other via a bus 904.
  • An input/output (I/O) interface 905 is also connected to bus 904.
  • the following devices can be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, etc.; a storage device 908 including a magnetic tape, a hard disk, etc.; and a communication device 909.
  • the communication device 909 may allow the electronic device 900 to communicate wirelessly or wiredly with other devices to exchange data.
  • Although FIG. 13 illustrates the electronic device 900 with various means, it should be understood that implementing or providing all of the illustrated means is not required; more or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via communication device 909, or from storage device 908, or from ROM 902.
  • When the computer program is executed by the processing device 901, the above-mentioned functions defined in the method of the embodiments of the present disclosure are performed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code contained on a computer-readable medium can be transmitted using any appropriate medium, including but not limited to: wires, optical cables, radio frequency (Radio Frequency, RF), etc., or any suitable combination of the above.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being assembled into the electronic device.
  • the computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
  • computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • each block of the block diagram and/or flowchart illustration, and combinations of blocks in the block diagram and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure can be implemented in software or hardware.
  • the name of a unit does not, under certain circumstances, constitute a limitation on the unit itself; for example, the first acquisition unit can also be described as "the unit that acquires at least two Internet Protocol addresses".
  • exemplary types of hardware logic components that can be used include: Field-Programmable Gate Arrays (FPGA), Application-Specific Integrated Circuits (ASIC), Application-Specific Standard Parts (ASSP), Systems On Chip (SOC), Complex Programmable Logic Devices (CPLD), etc.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices or devices, or any suitable combination of the foregoing.
  • machine-readable storage media may include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a semantic segmentation model training method including:
  • the teacher semantic segmentation model includes a first teacher network and a second teacher network, wherein the first teacher network has the structural characteristics of low depth and high width, and the second teacher network has the structural characteristics of high depth and low width; the sample image is processed based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, where the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network; a lightweight student semantic segmentation model is trained according to the sample image, the first segmentation map, and the second segmentation map to obtain the target semantic segmentation model.
  • the depth-to-width ratio coefficient of the first teacher network is less than or equal to a first threshold, the depth-to-width ratio coefficient of the second teacher network is greater than or equal to a second threshold, and the first threshold is smaller than the second threshold, the depth-to-width ratio coefficient representing the ratio of the number of network layers to the number of network output channels.
  • the sample image includes a labeled sample image and an unlabeled sample image; the first segmentation map includes a first labeled segmentation map generated from the labeled sample image and a first unlabeled segmentation map generated from the unlabeled sample image; the second segmentation map includes a second labeled segmentation map generated from the labeled sample image and a second unlabeled segmentation map generated from the unlabeled sample image; training a lightweight student semantic segmentation model according to the sample image, the first segmentation map, and the second segmentation map to obtain a target semantic segmentation model includes: obtaining a target supervised loss based on the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map; obtaining a target unsupervised loss based on the unlabeled sample image, the first unlabeled segmentation map, and the second unlabeled segmentation map; and performing weighted fusion of the target supervised loss and the target unsupervised loss to obtain an output loss, performing reverse gradient propagation based on the output loss, and adjusting the network parameters of the student semantic segmentation model to obtain the target semantic segmentation model.
  • obtaining the target supervised loss based on the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map includes: processing the labeled sample image based on the student semantic segmentation model to obtain a first prediction result; obtaining a first supervised loss based on the annotation information of the labeled sample image and the first prediction result, the first supervised loss representing the difference between the annotation information and the first prediction result; obtaining a second supervised loss based on the first labeled segmentation map, the second labeled segmentation map, and the first prediction result, the second supervised loss representing the pixel-level consistency difference of the first segmentation map and the second segmentation map relative to the first prediction result; and obtaining the target supervised loss according to the first supervised loss and the second supervised loss.
  • obtaining the target unsupervised loss based on the unlabeled sample image, the first unlabeled segmentation map, and the second unlabeled segmentation map includes: processing the unlabeled sample image based on the student semantic segmentation model to obtain a second prediction result; obtaining a first unsupervised loss based on the first unlabeled segmentation map, the second unlabeled segmentation map, and the second prediction result, the first unsupervised loss representing the pixel-level consistency difference of the first unlabeled segmentation map and the second unlabeled segmentation map relative to the second prediction result; and obtaining the target unsupervised loss according to the first unsupervised loss.
  • the method further includes: obtaining a first feature map of the unlabeled sample image output by the decoder of the first teacher network and a second feature map of the unlabeled sample image output by the decoder of the student semantic segmentation model; and obtaining a second unsupervised loss according to the first feature map and the second feature map, the second unsupervised loss characterizing the difference between the regional texture correlation of the second prediction result and the regional texture correlation of the first unlabeled segmentation map; obtaining the target unsupervised loss according to the first unsupervised loss includes: obtaining the target unsupervised loss according to the first unsupervised loss and the second unsupervised loss.
  • obtaining the second unsupervised loss based on the first feature map and the second feature map includes: mapping the first feature map to a first feature vector set and the second feature map to a second feature vector set, the first feature vector set representing the first teacher network's evaluation of the region-level content of the unlabeled sample image and the second feature vector set representing the student semantic segmentation model's evaluation of the region-level content of the unlabeled sample image; obtaining the corresponding first autocorrelation matrix and second autocorrelation matrix according to the first feature vector set and the second feature vector set, the first autocorrelation matrix representing the correlation between the region-level contents corresponding to the first feature vector set and the second autocorrelation matrix representing the correlation between the region-level contents corresponding to the second feature vector set; and obtaining the second unsupervised loss based on the difference between the first autocorrelation matrix and the second autocorrelation matrix.
  • the method further includes: obtaining a third unsupervised loss based on the second unlabeled segmentation map and the second prediction result, the third unsupervised loss representing the difference between the global semantic category corresponding to the second prediction result and the global semantic category corresponding to the second unlabeled segmentation map; obtaining the target unsupervised loss according to the first unsupervised loss includes: obtaining the target unsupervised loss according to the first unsupervised loss and the third unsupervised loss.
  • obtaining the third unsupervised loss based on the second unlabeled segmentation map and the second prediction result includes: obtaining a first global semantic vector corresponding to the second unlabeled segmentation map and a second global semantic vector corresponding to the second prediction result, the first global semantic vector representing the number and semantic categories of the objects segmented in the second unlabeled segmentation map and the second global semantic vector representing the number and semantic categories of the objects segmented in the second prediction result; and obtaining the third unsupervised loss based on the difference between the first global semantic vector and the second global semantic vector.
  • a semantic segmentation model training device including:
  • an acquisition module, configured to acquire a pre-trained teacher semantic segmentation model, the teacher semantic segmentation model including a first teacher network and a second teacher network, wherein the first teacher network has the structural characteristics of low depth and high width and the second teacher network has the structural characteristics of high depth and low width;
  • a processing module, configured to process the sample image based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, wherein the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network;
  • a training module configured to train a lightweight student semantic segmentation model based on the sample image, the first segmentation map, and the second segmentation map to obtain a target semantic segmentation model.
  • the depth-to-width ratio coefficient of the first teacher network is less than or equal to a first threshold, the depth-to-width ratio coefficient of the second teacher network is greater than or equal to a second threshold, and the first threshold is smaller than the second threshold, the depth-to-width ratio coefficient representing the ratio of the number of network layers to the number of network output channels.
  • the sample image includes a labeled sample image and an unlabeled sample image; the first segmentation map includes a first labeled segmentation map generated from the labeled sample image and a first unlabeled segmentation map generated from the unlabeled sample image; the second segmentation map includes a second labeled segmentation map generated from the labeled sample image and a second unlabeled segmentation map generated from the unlabeled sample image; the training module is specifically configured to: obtain a target supervised loss according to the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map; obtain a target unsupervised loss according to the unlabeled sample image, the first unlabeled segmentation map, and the second unlabeled segmentation map; and perform weighted fusion of the target supervised loss and the target unsupervised loss to obtain an output loss, perform reverse gradient propagation based on the output loss, and adjust the network parameters of the student semantic segmentation model to obtain the target semantic segmentation model.
  • when obtaining the target supervised loss based on the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map, the training module is specifically configured to: process the labeled sample image based on the student semantic segmentation model to obtain a first prediction result; obtain a first supervised loss based on the annotation information of the labeled sample image and the first prediction result, the first supervised loss representing the difference between the annotation information and the first prediction result; obtain a second supervised loss based on the first labeled segmentation map, the second labeled segmentation map, and the first prediction result, the second supervised loss representing the pixel-level consistency difference of the first segmentation map and the second segmentation map relative to the first prediction result; and obtain the target supervised loss according to the first supervised loss and the second supervised loss.
  • when obtaining the target unsupervised loss based on the unlabeled sample image, the first unlabeled segmentation map, and the second unlabeled segmentation map, the training module is specifically configured to: process the unlabeled sample image based on the student semantic segmentation model to obtain a second prediction result; obtain a first unsupervised loss based on the first unlabeled segmentation map, the second unlabeled segmentation map, and the second prediction result, the first unsupervised loss representing the pixel-level consistency difference of the first unlabeled segmentation map and the second unlabeled segmentation map relative to the second prediction result; and obtain the target unsupervised loss according to the first unsupervised loss.
  • the processing module is further configured to: acquire a first feature map of the unlabeled sample image output by the decoder of the first teacher network and a second feature map of the unlabeled sample image output by the decoder of the student semantic segmentation model; the training module is further configured to: obtain a second unsupervised loss based on the first feature map and the second feature map, the second unsupervised loss representing the difference between the regional texture correlation of the second prediction result and the regional texture correlation of the first unlabeled segmentation map; and when obtaining the target unsupervised loss based on the first unsupervised loss, the training module is specifically configured to: obtain the target unsupervised loss according to the first unsupervised loss and the second unsupervised loss.
  • when obtaining the second unsupervised loss based on the first feature map and the second feature map, the training module is specifically configured to: map the first feature map to a first feature vector set and the second feature map to a second feature vector set, the first feature vector set representing the first teacher network's evaluation of the region-level content of the unlabeled sample image and the second feature vector set representing the student semantic segmentation model's evaluation of the region-level content of the unlabeled sample image; obtain the corresponding first autocorrelation matrix and second autocorrelation matrix according to the first feature vector set and the second feature vector set, the first autocorrelation matrix representing the correlation between the region-level contents corresponding to the first feature vector set and the second autocorrelation matrix representing the correlation between the region-level contents corresponding to the second feature vector set; and obtain the second unsupervised loss according to the difference between the first autocorrelation matrix and the second autocorrelation matrix.
  • the training module is further configured to: obtain a third unsupervised loss based on the second unlabeled segmentation map and the second prediction result, the third unsupervised loss representing the difference between the global semantic category corresponding to the second prediction result and the global semantic category corresponding to the second unlabeled segmentation map; and when obtaining the target unsupervised loss based on the first unsupervised loss, the training module is specifically configured to: obtain the target unsupervised loss based on the first unsupervised loss and the third unsupervised loss.
  • when obtaining the third unsupervised loss based on the second unlabeled segmentation map and the second prediction result, the training module is specifically configured to: obtain a first global semantic vector corresponding to the second unlabeled segmentation map and a second global semantic vector corresponding to the second prediction result, the first global semantic vector representing the number and semantic categories of the objects segmented in the second unlabeled segmentation map and the second global semantic vector representing the number and semantic categories of the objects segmented in the second prediction result; and obtain the third unsupervised loss according to the difference between the first global semantic vector and the second global semantic vector.
  • an electronic device including: a processor, and a memory communicatively connected to the processor;
  • the memory stores computer-executable instructions;
  • the processor executes the computer-executable instructions stored in the memory to implement the semantic segmentation model training method described in the first aspect and various possible designs of the first aspect.
  • a computer-readable storage medium is provided, in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the semantic segmentation model training method described in the first aspect and various possible designs of the first aspect is implemented.
  • a computer program product is provided, including a computer program that, when executed by a processor, implements the semantic segmentation model training method described in the above first aspect and various possible designs of the first aspect.
  • a computer program is provided, the computer program being used to implement the semantic segmentation model training method described in the first aspect and various possible designs of the first aspect.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A semantic segmentation model training method and apparatus, an electronic device, and a storage medium. A pre-trained teacher semantic segmentation model is acquired, the teacher semantic segmentation model comprising a first teacher network and a second teacher network, wherein the first teacher network has a low-depth, high-width structural characteristic and the second teacher network has a high-depth, low-width structural characteristic; a sample image is processed on the basis of the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, wherein the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network; and a lightweight student semantic segmentation model is trained according to the sample image, the first segmentation map, and the second segmentation map to obtain a target semantic segmentation model.

Description

Semantic segmentation model training method and apparatus, electronic device, and storage medium
Cross-reference to related applications
This application claims priority to the Chinese patent application No. 202210814989.4, filed with the China National Intellectual Property Administration on July 11, 2022 and entitled "Semantic segmentation model training method and apparatus, electronic device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
Embodiments of the present disclosure relate to the field of image processing technology, and in particular to a semantic segmentation model training method and apparatus, an electronic device, and a storage medium.
Background
Image semantic segmentation is a technique that recognizes the content of an image so as to separate objects expressing different meanings into different targets. It is usually realized by deploying a trained semantic segmentation model and is widely used in various applications.
In the related art, in order for terminal devices with limited computing resources to perform image semantic segmentation, a lightweight semantic segmentation model needs to be trained and deployed on such terminal devices.
Summary
Embodiments of the present disclosure provide a semantic segmentation model training method and apparatus, an electronic device, and a storage medium.
In a first aspect, embodiments of the present disclosure provide a semantic segmentation model training method, including:
acquiring a pre-trained teacher semantic segmentation model, the teacher semantic segmentation model including a first teacher network and a second teacher network, wherein the first teacher network has a low-depth, high-width structural characteristic and the second teacher network has a high-depth, low-width structural characteristic; processing a sample image based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, wherein the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network; and training a lightweight student semantic segmentation model according to the sample image, the first segmentation map, and the second segmentation map to obtain a target semantic segmentation model.
In a second aspect, embodiments of the present disclosure provide a semantic segmentation model training apparatus, including:
an acquisition module, configured to acquire a pre-trained teacher semantic segmentation model, the teacher semantic segmentation model including a first teacher network and a second teacher network, wherein the first teacher network has a low-depth, high-width structural characteristic and the second teacher network has a high-depth, low-width structural characteristic;
a processing module, configured to process a sample image based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, wherein the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network;
a training module, configured to train a lightweight student semantic segmentation model according to the sample image, the first segmentation map, and the second segmentation map to obtain a target semantic segmentation model.
In a third aspect, embodiments of the present disclosure provide an electronic device, including:
a processor, and a memory communicatively connected to the processor;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory to implement the semantic segmentation model training method described in the above first aspect and various possible designs of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the semantic segmentation model training method described in the above first aspect and various possible designs of the first aspect is implemented.
In a fifth aspect, embodiments of the present disclosure provide a computer program product including a computer program that, when executed by a processor, implements the semantic segmentation model training method described in the above first aspect and various possible designs of the first aspect.
In a sixth aspect, embodiments of the present disclosure provide a computer program that, when executed by a processor, implements the semantic segmentation model training method described in the above first aspect and various possible designs of the first aspect.
According to the semantic segmentation model training method and apparatus, electronic device, and storage medium provided by this embodiment, a pre-trained teacher semantic segmentation model is acquired, the teacher semantic segmentation model including a first teacher network with a low-depth, high-width structural characteristic and a second teacher network with a high-depth, low-width structural characteristic; a sample image is processed based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, the first segmentation map being the result of semantic segmentation of the sample image by the first teacher network and the second segmentation map being the result of semantic segmentation of the sample image by the second teacher network; and a lightweight student semantic segmentation model is trained according to the sample image, the first segmentation map, and the second segmentation map to obtain a target semantic segmentation model. Since the student semantic segmentation model is trained by a teacher semantic segmentation model composed of a first teacher network and a second teacher network with differentiated structural characteristics, the characteristics of the two teacher networks can be fully exploited to provide learnable knowledge for the student semantic segmentation model from two complementary dimensions (width and depth), providing knowledge supervision for the training of the student semantic segmentation model.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present disclosure, and a person of ordinary skill in the art may still derive other drawings from these drawings without creative effort.
FIG. 1 is a diagram of an application scenario of the semantic segmentation model training method provided by an embodiment of the present disclosure;
FIG. 2 is a first schematic flowchart of the semantic segmentation model training method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a first teacher network provided by an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a second teacher network provided by an embodiment of the present disclosure;
FIG. 5 is a flowchart of the specific implementation steps of step S103 in the embodiment shown in FIG. 2;
FIG. 6 is a schematic diagram of a process of generating a target supervised loss provided by an embodiment of the present disclosure;
FIG. 7 is a second schematic flowchart of the semantic segmentation model training method provided by an embodiment of the present disclosure;
FIG. 8 is a flowchart of the specific implementation steps of step S207 in the embodiment shown in FIG. 7;
FIG. 9 is a flowchart of the specific implementation steps of step S208 in the embodiment shown in FIG. 7;
FIG. 10 is a schematic diagram of a process of obtaining a target unsupervised loss provided by an embodiment of the present disclosure;
FIG. 11 is a structural block diagram of the semantic segmentation model training apparatus provided by an embodiment of the present disclosure;
FIG. 12 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure;
FIG. 13 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present disclosure.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
The application scenario of the embodiments of the present disclosure is explained below:
FIG. 1 is a diagram of an application scenario of the semantic segmentation model training method provided by an embodiment of the present disclosure. The method can be applied to model training before a lightweight semantic segmentation model is deployed. Specifically, the method provided by the embodiments of the present disclosure can be applied to devices used for model training such as terminal devices and servers; FIG. 1 takes a server as an example. As shown in FIG. 1, exemplarily, the server stores a pre-trained teacher semantic segmentation model and a lightweight student semantic segmentation model to be trained (shown in the figure as the lightweight model). After receiving a training instruction sent by a developer user through a development terminal device, the server trains the lightweight model using the semantic segmentation model training method provided by the embodiments of the present disclosure until the model convergence condition is met, obtaining the target semantic segmentation model. Afterwards, the server receives a deployment instruction sent by the terminal device (not shown in the figure) and performs lightweight model deployment, that is, deploys the lightweight target semantic segmentation model to user terminal devices. After deployment, the target semantic segmentation model running in a user terminal device can provide an image semantic segmentation service in response to application requests.
In the prior art, lightweight models are usually trained by knowledge distillation from a pre-trained large model (the teacher model), so that the lightweight model (the student model) learns the knowledge in the large model and realizes the corresponding model functions. However, in image semantic segmentation scenarios, the pixel-level segmentation task places high demands on model performance, and the prior-art scheme of distilling knowledge through a conventional teacher model often causes severe performance degradation of the trained lightweight student model, impairing its image segmentation capability and leaving it with poor generalization and stability. The training methods in the prior art thus cause performance degradation of lightweight semantic segmentation models and affect the normal realization of their functions. Embodiments of the present disclosure provide a semantic segmentation model training method to solve the above problems.
Referring to FIG. 2, FIG. 2 is a first schematic flowchart of the semantic segmentation model training method provided by an embodiment of the present disclosure. The method of this embodiment can be applied to electronic devices with computing capability, such as model training servers and terminal devices; this embodiment is described with a terminal device as the execution subject. The semantic segmentation model training method includes:
Step S101: acquiring a pre-trained teacher semantic segmentation model, the teacher semantic segmentation model including a first teacher network and a second teacher network, wherein the first teacher network has a low-depth, high-width structural characteristic and the second teacher network has a high-depth, low-width structural characteristic.
Exemplarily, the teacher semantic segmentation model is a pre-trained model with image semantic segmentation capability. Specifically, the teacher semantic segmentation model includes a pre-trained first teacher network and a pre-trained second teacher network, both of which have image semantic segmentation capability after training. The first teacher network has a low-depth, high-width structural characteristic, that is, it has fewer network layers but a larger number of network output channels, a "shallow and wide" network structure. FIG. 3 is a schematic structural diagram of a first teacher network provided by an embodiment of the present disclosure. As shown in FIG. 3, exemplarily, the first teacher network may be an encoder-decoder network structure including four symmetrically arranged network layers (shown as L1, L2, L3, L4). The first teacher network has the characteristic of low depth, i.e., few network layers, but at the same time the characteristic of high width, i.e., one or more of its network layers have a large number of channels; see the "width" and "depth" annotations in FIG. 3.
Correspondingly, the second teacher network has a high-depth, low-width structural characteristic, that is, it has more network layers but fewer network output channels, a "deep and narrow" network structure. FIG. 4 is a schematic structural diagram of a second teacher network provided by an embodiment of the present disclosure. As shown in FIG. 4, exemplarily, the second teacher network may be an encoder-decoder network structure including six symmetrically arranged network layers (shown as L1, L2, L3, L4, L5, L6). The second teacher network has the characteristic of high depth, i.e., many network layers, but at the same time the characteristic of low width, i.e., one or more of its network layers have few channels; see the "width" and "depth" annotations in FIG. 4.
Further, exemplarily, the depth-to-width ratio coefficient of the first teacher network is less than or equal to a first threshold, the depth-to-width ratio coefficient of the second teacher network is greater than or equal to a second threshold, and the first threshold is smaller than the second threshold, the depth-to-width ratio coefficient representing the ratio of the number of network layers to the number of network output channels. The corresponding first and second thresholds can be selected according to different business requirements (accuracy requirements, real-time requirements, etc.), and the lightweight student semantic segmentation model is then trained with the corresponding first and second teacher networks. In one possible implementation, the first teacher network may be a Wide ResNet-34 network and the second teacher network may be a ResNet-101 network. The specific implementations of the first and second teacher networks can be set as needed and are not limited here. A small illustrative check of this ratio rule is sketched below.
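The following is a minimal sketch of the depth-to-width selection rule, not part of the disclosure: the candidate list, the way "width" is measured (here, the largest number of output channels over all layers), and the threshold values are all assumptions made for illustration.

```python
def depth_width_ratio(num_layers: int, num_output_channels: int) -> float:
    """Depth-to-width ratio coefficient: network layer count / output channel count."""
    return num_layers / num_output_channels

def select_teachers(candidates, first_threshold: float, second_threshold: float):
    """Split candidates into a 'shallow and wide' pool and a 'deep and narrow' pool."""
    assert first_threshold < second_threshold
    wide_pool = [c for c in candidates
                 if depth_width_ratio(c["layers"], c["channels"]) <= first_threshold]
    deep_pool = [c for c in candidates
                 if depth_width_ratio(c["layers"], c["channels"]) >= second_threshold]
    return wide_pool, deep_pool

# Hypothetical candidates in the spirit of the example networks named above:
# a Wide ResNet-34 style network (few layers, many channels) and a
# ResNet-101 style network (many layers, fewer channels per layer).
candidates = [
    {"name": "wide_resnet34", "layers": 34, "channels": 2048},
    {"name": "resnet101", "layers": 101, "channels": 512},
]
wide, deep = select_teachers(candidates, first_threshold=0.05, second_threshold=0.15)
```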
Step S102: processing a sample image based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, wherein the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network.
Exemplarily, after the above first and second teacher networks are obtained, a preset sample image is input into the first and second teacher networks for processing, yielding the prediction results output by each, i.e., the first segmentation map and the second segmentation map. Because of the structural differences between the two teacher networks, the output maps also differ. Based on its low-depth, high-width structural characteristic and its ample number of channels, the first teacher network is good at capturing diverse local content-aware information, which helps model the contextual relationships between pixels; based on its high-depth, low-width structural characteristic and its larger number of network layers, the second teacher network is better at extracting global information and has high-level semantic and global classification abstraction capabilities.
Therefore, the first segmentation map output by the first teacher network better expresses local information, while the second segmentation map output by the second teacher network better expresses global information; processing the sample image with the two teacher networks amounts to extracting information from the sample image along two complementary dimensions. The lightweight student semantic segmentation model is then trained on the basis of the obtained first and second segmentation maps, thereby optimizing the student semantic segmentation model. In this embodiment, by providing a first teacher network and a second teacher network with differentiated network structures, information is extracted from image samples along two complementary dimensions, improving the effect of the subsequent training of the student semantic segmentation model. A minimal sketch of this dual-teacher forward pass is given below.
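A sketch of the dual-teacher inference step, assuming both teachers are frozen PyTorch modules mapping a batch of images (B, 3, H, W) to per-pixel class logits (B, N, H, W); the helper name and the softmax normalization are assumptions for illustration.

```python
import torch

@torch.no_grad()
def teacher_segmentation_maps(wide_teacher, deep_teacher, images):
    """Run both frozen teachers on the same sample images."""
    wide_teacher.eval()
    deep_teacher.eval()
    first_map = torch.softmax(wide_teacher(images), dim=1)   # local-context view
    second_map = torch.softmax(deep_teacher(images), dim=1)  # global-semantic view
    return first_map, second_map
```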
Step S103: training a lightweight student semantic segmentation model according to the sample image, the first segmentation map, and the second segmentation map, to obtain a target semantic segmentation model.
Exemplarily, the lightweight student semantic segmentation model is a preset small neural network model with very low computation and parameter counts, which can easily be deployed on resource-constrained devices. More specifically, it may be a network model with both low depth and low width; optionally, the number of network layers of the student semantic segmentation model may be the same as that of the first teacher network.
After the first and second segmentation maps are obtained, the process of training the lightweight student semantic segmentation model on their basis amounts to a process of knowledge supervision of the student semantic segmentation model. In this process, the parameters of the first and second teacher networks are fixed; it is therefore a process of improving the performance of the student model by offline distillation through the first and second teacher networks.
Exemplarily, the sample images include labeled sample images and unlabeled sample images. Correspondingly, the first segmentation map includes a first labeled segmentation map generated from a labeled sample image and a first unlabeled segmentation map generated from an unlabeled sample image; the second segmentation map includes a second labeled segmentation map generated from a labeled sample image and a second unlabeled segmentation map generated from an unlabeled sample image. Exemplarily, as shown in FIG. 5, the specific implementation of step S103 includes:
Step S1031: obtaining a target supervised loss according to the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map.
Exemplarily, a labeled sample image is data that includes both an image and the corresponding annotation information. Processing the labeled sample image with the student semantic segmentation model yields the student model's semantic segmentation result for it, i.e., the first prediction result. Then, exemplarily, based on the first prediction result together with the first and second labeled segmentation maps, a first supervised loss and/or a second supervised loss can be obtained, where the first supervised loss characterizes the difference between the annotation information and the first prediction result, and the second supervised loss characterizes the pixel-level consistency difference of the first and second labeled segmentation maps relative to the first prediction result. The target supervised loss may be the first supervised loss, the second supervised loss, or a weighted sum of the two.
The methods for determining the first and second supervised losses are introduced below:
Exemplarily, the method for computing the first supervised loss includes: after the first prediction result is obtained, a preset supervised loss function is evaluated with the first prediction result and the annotation information of the labeled sample image as inputs, yielding the first supervised loss. The specific implementation of computing a supervised loss from a supervised loss function is not repeated here.
Exemplarily, the method for computing the second supervised loss includes: after the first prediction result is obtained, the first and second labeled segmentation maps corresponding to the labeled sample image are each used as pseudo-labels for the first prediction result to constrain it, yielding the corresponding pixel-level consistency difference. Specifically, a preset labeled-data pixel-level consistency loss function takes the first prediction result and the first and second labeled segmentation maps as inputs, producing the second supervised loss. The labeled-data pixel-level consistency loss function is given by equation (1):

$$\mathcal{L}_{pc}^{l}=\frac{1}{H\times W}\sum_{i=1}^{H\times W}\left(\left\|y_i-\hat{y}_i^{t_2}\right\|^2+\left\|y_i-\hat{y}_i^{t_1}\right\|^2\right)\tag{1}$$

where $y_i$ denotes the first prediction result, $\hat{y}_i^{t_2}$ denotes the second labeled segmentation map corresponding to the labeled sample image, $\hat{y}_i^{t_1}$ denotes the first labeled segmentation map corresponding to the labeled sample image, $H\times W$ denotes the total number of pixels of the first prediction result, and $\mathcal{L}_{pc}^{l}$ is the second supervised loss.
Since the first teacher network, the second teacher network, and the student semantic segmentation model process the same set of labeled sample data, the segmentation results predicted by the three should, ideally, be consistent at the pixel level. Through the second supervised loss, the predictions of the multi-branch outputs can be kept consistent, providing auxiliary supervision for the student semantic segmentation model and improving its training effect. Afterwards, the target supervised loss is obtained from one of the first and second supervised losses or from their weighted sum; the specific implementation can be set as needed and is not repeated here. A sketch of this consistency loss is given below.
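A sketch of the pixel-level consistency loss of equation (1), assuming class-probability maps of shape (B, N, H, W) and a mean squared difference as the per-pixel distance (the exact distance in the original is a reconstruction assumption). The same function also serves equation (2) when called on unlabeled batches, where the teacher maps act as pseudo-labels.

```python
import torch.nn.functional as F

def pixelwise_consistency_loss(student_probs, wide_teacher_probs, deep_teacher_probs):
    """All inputs: (B, N, H, W) class-probability maps; teacher maps are detached."""
    loss_wide = F.mse_loss(student_probs, wide_teacher_probs.detach())
    loss_deep = F.mse_loss(student_probs, deep_teacher_probs.detach())
    return loss_wide + loss_deep
```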
FIG. 6 is a schematic diagram of a process of generating a target supervised loss provided by an embodiment of the present disclosure. As shown in FIG. 6, after labeled image data is input into the first teacher network, the second teacher network, and the student semantic segmentation model respectively, the first teacher network outputs the first labeled segmentation map, the second teacher network outputs the second labeled segmentation map, and the student semantic segmentation model outputs the first prediction result. Afterwards, the first prediction result is combined with the annotation information to generate the first supervised loss; the first and second labeled segmentation maps serve as pseudo-labels for the first prediction result and, combined with the first prediction result, generate the second supervised loss; the first and second supervised losses are weighted and summed to obtain the target supervised loss.
Step S1032: obtaining a target unsupervised loss according to the unlabeled sample image, the first unlabeled segmentation map, and the second unlabeled segmentation map.
Exemplarily, an unlabeled sample image is data that includes only an image without the corresponding annotation information. Unlabeled sample images are cheaper to obtain and more plentiful; therefore, fully training on the information extracted from unlabeled sample images can improve the performance of the student semantic segmentation model and avoid the performance degradation of the lightweight student semantic segmentation model.
Exemplarily, first, processing the unlabeled sample image with the student semantic segmentation model yields the student model's semantic segmentation result for it, i.e., the second prediction result; this process is the same as the student model's processing of labeled sample images and is not repeated here. Then, exemplarily, the first and second unlabeled segmentation maps are used as pseudo-labels corresponding to the second prediction result for loss function computation, yielding the corresponding target unsupervised loss. In one possible implementation, the target unsupervised loss includes a first unsupervised loss, which characterizes the pixel-level consistency difference of the first and second unlabeled segmentation maps relative to the second prediction result.
The method for computing the first unsupervised loss includes: after the second prediction result is obtained, the first and second unlabeled segmentation maps corresponding to the unlabeled sample image are each used as pseudo-labels for the second prediction result to constrain it, yielding the corresponding pixel-level consistency difference. Specifically, a preset unlabeled-data pixel-level consistency loss function takes the second prediction result and the first and second unlabeled segmentation maps as inputs, producing the first unsupervised loss. The unlabeled-data pixel-level consistency loss function is given by equation (2):

$$\mathcal{L}_{pc}^{u}=\frac{1}{H\times W}\sum_{j=1}^{H\times W}\left(\left\|y_j-\hat{y}_j^{t_2}\right\|^2+\left\|y_j-\hat{y}_j^{t_1}\right\|^2\right)\tag{2}$$

where $y_j$ denotes the second prediction result, $\hat{y}_j^{t_2}$ denotes the second unlabeled segmentation map corresponding to the unlabeled sample image, $\hat{y}_j^{t_1}$ denotes the first unlabeled segmentation map corresponding to the unlabeled sample image, $H\times W$ denotes the total number of pixels of the second prediction result, and $\mathcal{L}_{pc}^{u}$ is the first unsupervised loss.
Step S1033: performing weighted fusion of the target supervised loss and the target unsupervised loss to obtain an output loss, performing reverse gradient propagation based on the output loss, and adjusting the network parameters of the student semantic segmentation model to obtain the target semantic segmentation model.
Exemplarily, after the target supervised and unsupervised losses are obtained, they are fused with weights to obtain the output loss. Exemplarily, the weighting coefficients corresponding to the target supervised and unsupervised losses can be set according to specific needs and adjusted dynamically: for example, in the early stage of training the student semantic segmentation model, the target supervised loss corresponding to labeled sample images can be given a larger weighting coefficient to speed up model convergence; in the later stage of training, the target unsupervised loss corresponding to unlabeled sample images can be given a larger (or slightly larger) weighting coefficient, so as to fully exploit the information in unlabeled sample images and improve the performance of the student model. Afterwards, reverse gradient propagation is performed based on the output loss and the network parameters of the student semantic segmentation model are adjusted, yielding an optimized student model; this is repeated over multiple iterations, and when the student semantic segmentation model meets the convergence condition, the converged student model is the target semantic segmentation model. One such optimization step is sketched below.
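A sketch of one optimization step with the weighted fusion and reverse gradient propagation described above; the linear ramp that grows the unsupervised weight over training is an assumed schedule standing in for the dynamic adjustment in the text, and both loss arguments are assumed to be scalar tensors.

```python
def train_step(student, optimizer, supervised_loss, unsupervised_loss,
               step, total_steps):
    """One update of the student; teacher parameters stay fixed throughout."""
    w_sup = 1.0
    w_unsup = min(1.0, step / (0.5 * total_steps))  # ramp up unlabeled weight
    output_loss = w_sup * supervised_loss + w_unsup * unsupervised_loss
    optimizer.zero_grad()
    output_loss.backward()   # reverse gradient propagation
    optimizer.step()         # adjust student network parameters only
    return output_loss.item()
```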
In the steps of this embodiment, by processing labeled and unlabeled data, the resulting output loss makes full use of the information in both labeled and unlabeled sample images and, combined with the differentiated information extraction capabilities of the first and second teacher networks, improves the learning ability of the student semantic segmentation model.
In this embodiment, a pre-trained teacher semantic segmentation model is acquired, the teacher semantic segmentation model including a first teacher network with a low-depth, high-width structural characteristic and a second teacher network with a high-depth, low-width structural characteristic; a sample image is processed based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, the first segmentation map being the result of semantic segmentation of the sample image by the first teacher network and the second segmentation map being the result of semantic segmentation of the sample image by the second teacher network; and a lightweight student semantic segmentation model is trained according to the sample image, the first segmentation map, and the second segmentation map to obtain a target semantic segmentation model. Since the student model is trained by a teacher semantic segmentation model composed of two teacher networks with differentiated structural characteristics, the characteristics of the first and second teacher networks can be fully exploited to provide learnable knowledge from two complementary dimensions (width and depth) and knowledge supervision for the student model's training, thereby improving the training efficiency and effect of the student semantic segmentation model and the model performance of the finally generated target semantic segmentation model.
Referring to FIG. 7, FIG. 7 is a second schematic flowchart of the semantic segmentation model training method provided by an embodiment of the present disclosure. On the basis of the embodiment shown in FIG. 2, this embodiment further refines the specific implementation of step S102. The semantic segmentation model training method includes:
Step S201: acquiring a pre-trained teacher semantic segmentation model, the teacher semantic segmentation model including a first teacher network and a second teacher network, wherein the first teacher network has a low-depth, high-width structural characteristic and the second teacher network has a high-depth, low-width structural characteristic.
Step S202: processing sample images based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, wherein the sample images include labeled sample images and unlabeled sample images, the first segmentation map includes a first labeled segmentation map and a first unlabeled segmentation map, and the second segmentation map includes a second labeled segmentation map and a second unlabeled segmentation map.
Through steps S201-S202, the labeled and unlabeled sample images are processed by the first and second teacher networks respectively, yielding the corresponding first labeled segmentation map, first unlabeled segmentation map, second labeled segmentation map, and second unlabeled segmentation map; the order in which labeled and unlabeled sample images are processed can be set according to specific needs and is not limited here. The specific implementations of obtaining these segmentation maps have been introduced in the embodiment shown in FIG. 2 and are not repeated here.
Step S203: obtaining a target supervised loss according to the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map.
Step S204: processing the unlabeled sample image based on the student semantic segmentation model to obtain a second prediction result.
Step S205: obtaining a first unsupervised loss based on the first unlabeled segmentation map, the second unlabeled segmentation map, and the second prediction result, the first unsupervised loss characterizing the pixel-level consistency difference of the first and second segmentation maps relative to the second prediction result.
Step S203 is a step of obtaining the target supervised loss based on labeled sample images and has been introduced in the embodiment shown in FIG. 2; see the related introduction in step S1031 of that embodiment, which is not repeated here. Steps S204-S205 are steps of obtaining the second prediction result and the first unsupervised loss based on unlabeled sample images and have also been introduced in the embodiment shown in FIG. 2; see the related introduction in step S1032 of that embodiment, which is not repeated here.
Step S206: acquiring a first feature map of the unlabeled sample image output by the decoder of the first teacher network and a second feature map of the unlabeled sample image output by the decoder of the student semantic segmentation model.
Step S207: obtaining a second unsupervised loss according to the first feature map and the second feature map, the second unsupervised loss characterizing the difference between the regional texture correlation of the second prediction result and the regional texture correlation of the first unlabeled segmentation map.
Exemplarily, based on the introduction of the first teacher network in the above embodiments, the first teacher network is an encoder-decoder network structure with a low-depth, high-width structural characteristic, which makes it good at capturing diverse local content-aware information and helps model the contextual relationships between pixels. In the steps of this embodiment, the first feature map (features) of the unlabeled sample image output by the decoder of the first teacher network and the second feature map (features) of the unlabeled sample image output by the decoder of the student semantic segmentation model are acquired. The first feature map can characterize the regional texture correlation of the unlabeled sample image as captured by the first teacher network, and the second feature map can characterize the regional texture correlation of the unlabeled sample image as captured by the student model. Computing on the two yields the difference between the regional texture correlation of the second prediction result and that of the first unlabeled segmentation map, i.e., the second unsupervised loss, which may also be called a region-level content-aware loss. This region-level content-aware loss aims to use the channel advantage of the wider teacher model (the first teacher network) to provide rich local context information. It can provide auxiliary supervision to guide the student model (the student semantic segmentation model) in modeling the contextual relationships between pixels, using the correlation of the image patch regions fed into the teacher model to guide the inter-region texture correlation of the student model.
Exemplarily, as shown in FIG. 8, the specific implementation of step S207 includes:
Step S2071: mapping the first feature map to a first feature vector set and the second feature map to a second feature vector set, the first feature vector set characterizing the first teacher network's evaluation of the region-level content of the unlabeled sample image and the second feature vector set characterizing the student semantic segmentation model's evaluation of the region-level content of the unlabeled sample image.
Step S2072: obtaining the corresponding first autocorrelation matrix and second autocorrelation matrix according to the first feature vector set and the second feature vector set, the first autocorrelation matrix characterizing the correlation between the region-level contents corresponding to the first feature vector set and the second autocorrelation matrix characterizing the correlation between the region-level contents corresponding to the second feature vector set.
Step S2073: obtaining the second unsupervised loss according to the difference between the first autocorrelation matrix and the second autocorrelation matrix.
Exemplarily, the features of the teacher model (the first teacher network), i.e., the first feature map, and the features of the student model (the student semantic segmentation model), i.e., the second feature map, are extracted in the feature space after the decoder. These features are mapped to feature vector sets of region-level content: the first feature map is mapped to a first feature vector set $V^{t_1}$ and the second feature map to a second feature vector set $V^{S}$, where $H_v\times W_v$ is the number of region-level elements and each feature vector $v\in\mathbb{R}^{C\times 1\times 1}$ in $V$ represents the local region content of the original features (the local feature size is $C\times H/H_v\times W/W_v$). Afterwards, the corresponding autocorrelation matrix $M$ is obtained from the feature vector set $V$; the computation is shown in equation (3):

$$m_{ij}=\mathrm{sim}\left(v_i,v_j\right)=\frac{v_i^{\top}v_j}{\left\|v_i\right\|\left\|v_j\right\|}\tag{3}$$

where $m_{ij}$ is the value of the autocorrelation matrix at coordinates $(i,j)$, computed by the cosine similarity $\mathrm{sim}(\cdot)$, and $v_i$ and $v_j$ are the $i$-th and $j$-th of the flattened feature vectors. The computed autocorrelation matrix represents the region-level correlation of the features and reflects the relationships between different regions of the image. The region-level content-aware loss function, i.e., the second unsupervised loss, can therefore be obtained by minimizing the difference between the autocorrelation matrices of the different models. Specifically, the computation of the second unsupervised loss is shown in equation (4):

$$\mathcal{L}_{ra}=\frac{1}{\left(H_v\times W_v\right)^2}\sum_{i=1}^{H_v\times W_v}\sum_{j=1}^{H_v\times W_v}\left\|m_{ij}^{S}-m_{ij}^{t_1}\right\|^2\tag{4}$$

where $M^{S}$ is the second autocorrelation matrix, $M^{t_1}$ is the first autocorrelation matrix, $m_{ij}^{S}$ are the values in the second autocorrelation matrix, and $m_{ij}^{t_1}$ are the values in the first autocorrelation matrix. A sketch of this computation follows.
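A sketch of equations (3)-(4): decoder features are pooled into a grid of region vectors, the cosine-similarity autocorrelation matrix is built for the student and for the wide teacher, and the mean squared difference between the matrices gives the region-level content-aware loss. The region grid size (Hv = Wv = 8) and the use of average pooling to form region vectors are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def autocorrelation_matrix(features, hv=8, wv=8):
    """features: (B, C, H, W) -> (B, Hv*Wv, Hv*Wv) cosine-similarity matrix."""
    regions = F.adaptive_avg_pool2d(features, (hv, wv))  # (B, C, Hv, Wv)
    v = regions.flatten(2).transpose(1, 2)               # (B, Hv*Wv, C)
    v = F.normalize(v, dim=-1)                           # unit-length vectors
    return v @ v.transpose(1, 2)                         # m_ij = cos(v_i, v_j)

def region_content_loss(student_features, wide_teacher_features):
    """Second unsupervised loss: match the two autocorrelation matrices."""
    m_student = autocorrelation_matrix(student_features)
    m_teacher = autocorrelation_matrix(wide_teacher_features).detach()
    return F.mse_loss(m_student, m_teacher)
```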
Step S208: obtaining a third unsupervised loss based on the second unlabeled segmentation map and the second prediction result, the third unsupervised loss characterizing the difference between the global semantic category corresponding to the second prediction result and the global semantic category corresponding to the second unlabeled segmentation map.
Further, exemplarily, based on the introduction of the second teacher network in the above embodiments, the second teacher network is an encoder-decoder network structure with a high-depth, low-width structural characteristic; with its larger number of network layers it is better at extracting global information and has high-level semantic and global classification abstraction capabilities. In the steps of this embodiment, after prediction is performed on the unlabeled sample image to obtain the second unlabeled segmentation map and the second prediction result, the characteristics of the second teacher network are used to distill the high-dimensional semantic abstraction information from the deeper second teacher network into the lightweight student semantic segmentation model, thereby improving the performance of the student semantic segmentation model.
Exemplarily, as shown in FIG. 9, the specific implementation of step S208 includes:
Step S2081: acquiring a first global semantic vector corresponding to the second unlabeled segmentation map and a second global semantic vector corresponding to the second prediction result, the first global semantic vector characterizing the number and semantic categories of the objects segmented in the second unlabeled segmentation map and the second global semantic vector characterizing the number and semantic categories of the objects segmented in the second prediction result.
Step S2082: obtaining the third unsupervised loss according to the difference between the first global semantic vector and the second global semantic vector.
Exemplarily, first, the global semantic vector of each category is computed by a global average pooling (GAP) operation. Specifically, with the second unlabeled segmentation map $Y\in\mathbb{R}^{N\times H\times W}$, the computation of the first global semantic vector is shown in equation (5):

$$z^{t_2}=G\left(Y^{t_2}\right),\qquad z_n=\frac{1}{H\times W}\sum_{h=1}^{H}\sum_{w=1}^{W}Y_{n,h,w}\tag{5}$$

where the first global semantic vector $z^{t_2}\in\mathbb{R}^{N\times 1\times 1}$ represents the global semantic category vector of the $N$ categories, and $G$ denotes the global average pooling operation in each channel. Likewise, processing the second prediction result by the method of the above equation (5) yields the second global semantic vector $z^{S}$ corresponding to the second prediction result, which is not detailed again.
Afterwards, the third unsupervised loss is obtained from the difference between the first and second global semantic vectors; the specific computation is shown in equation (6):

$$\mathcal{L}_{gs}=\frac{1}{N}\sum_{n=1}^{N}\left\|z_n^{S,u}-z_n^{t_2,u}\right\|^2\tag{6}$$

where $\mathcal{L}_{gs}$ is the third unsupervised loss, $z^{S,u}$ and $z^{t_2,u}$ respectively denote the semantic categories output by the student semantic segmentation model and the second teacher network, $N$ denotes the number of categories, and the superscript $u$ denotes the unlabeled sample image. In this way, the student semantic segmentation model tries to learn a higher-dimensional semantic category representation, which helps provide global guidance for the discrimination of semantic categories in the semantic segmentation task. This step is sketched in code below.
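A sketch of equations (5)-(6): per-class global average pooling produces the global semantic vectors, and the student's vector is matched to the deep teacher's vector. The mean-squared-error matching is an assumption consistent with the reconstruction above.

```python
import torch
import torch.nn.functional as F

def global_semantic_vector(probs):
    """probs: (B, N, H, W) -> (B, N) per-class global semantic vector via GAP."""
    return probs.mean(dim=(2, 3))  # global average pooling over H and W

def global_semantic_loss(student_probs, deep_teacher_probs):
    """Third unsupervised loss: match global semantic vectors of student and deep teacher."""
    z_student = global_semantic_vector(student_probs)
    z_teacher = global_semantic_vector(deep_teacher_probs).detach()
    return F.mse_loss(z_student, z_teacher)
```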
Step S209: obtaining the target unsupervised loss according to at least one of the first unsupervised loss, the second unsupervised loss, and the third unsupervised loss.
Exemplarily, after the first, second, and third unsupervised losses are obtained through the above steps, the target unsupervised loss can be obtained from one or more of them, for example by a weighted computation over the first, second, and third unsupervised losses; the specific weighting coefficients can be set as needed and are not detailed here. A minimal sketch of this fusion follows.
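The weighted computation mentioned above reduces to a small helper; the default weights are placeholders to be tuned per task, not values from the disclosure.

```python
def target_unsupervised_loss(first_loss, second_loss, third_loss,
                             w1=1.0, w2=1.0, w3=1.0):
    """Weighted fusion of the first, second, and third unsupervised losses."""
    return w1 * first_loss + w2 * second_loss + w3 * third_loss
```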
FIG. 10 is a schematic diagram of a process of obtaining a target unsupervised loss provided by an embodiment of the present disclosure. As shown in FIG. 10, exemplarily, the unlabeled sample image is input into the first teacher network, the second teacher network, and the student semantic segmentation model respectively. Then, on the one hand, the first feature map output by the decoder of the first teacher network and the second feature map output by the decoder of the student semantic segmentation model are obtained, and the second unsupervised loss is obtained according to the first and second feature maps; on the other hand, the second unlabeled segmentation map output by the second teacher network and the second prediction result output by the student model are obtained, and the third unsupervised loss is obtained according to the second unlabeled segmentation map and the second prediction result; and further, the first unsupervised loss is obtained based on the first unlabeled segmentation map output by the first teacher network, the second unlabeled segmentation map output by the second teacher network, and the second prediction result output by the student model. Finally, the first, second, and third unsupervised losses are weighted and fused to obtain the target unsupervised loss.
Step S210: performing weighted fusion of the target supervised loss and the target unsupervised loss to obtain an output loss, performing reverse gradient propagation based on the output loss, and adjusting the network parameters of the student semantic segmentation model to obtain the target semantic segmentation model.
Step S210 is a step of generating the output loss and training the student semantic segmentation model based on it; it has been introduced in the embodiment shown in FIG. 2, and reference may be made to the related introduction in step S1033 of that embodiment, which is not repeated here.
Corresponding to the semantic segmentation model training method of the above embodiments, FIG. 11 is a structural block diagram of the semantic segmentation model training apparatus provided by an embodiment of the present disclosure. For ease of description, only the parts related to the embodiments of the present disclosure are shown. Referring to FIG. 11, the semantic segmentation model training apparatus 3 includes:
an acquisition module 31, configured to acquire a pre-trained teacher semantic segmentation model, the teacher semantic segmentation model including a first teacher network and a second teacher network, wherein the first teacher network has a low-depth, high-width structural characteristic and the second teacher network has a high-depth, low-width structural characteristic;
a processing module 32, configured to process a sample image based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, wherein the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network;
a training module 33, configured to train a lightweight student semantic segmentation model according to the sample image, the first segmentation map, and the second segmentation map to obtain a target semantic segmentation model.
In an embodiment of the present disclosure, the depth-to-width ratio coefficient of the first teacher network is less than or equal to a first threshold, the depth-to-width ratio coefficient of the second teacher network is greater than or equal to a second threshold, the first threshold is smaller than the second threshold, and the depth-to-width ratio coefficient represents the ratio of the number of network layers to the number of network output channels.
In an embodiment of the present disclosure, the sample image includes a labeled sample image and an unlabeled sample image; the first segmentation map includes a first labeled segmentation map generated from the labeled sample image and a first unlabeled segmentation map generated from the unlabeled sample image; the second segmentation map includes a second labeled segmentation map generated from the labeled sample image and a second unlabeled segmentation map generated from the unlabeled sample image; and the training module 33 is specifically configured to: obtain a target supervised loss according to the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map; obtain a target unsupervised loss according to the unlabeled sample image, the first unlabeled segmentation map, and the second unlabeled segmentation map; and perform weighted fusion of the target supervised loss and the target unsupervised loss to obtain an output loss, perform reverse gradient propagation based on the output loss, and adjust the network parameters of the student semantic segmentation model to obtain the target semantic segmentation model.
In an embodiment of the present disclosure, when obtaining the target supervised loss according to the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map, the training module 33 is specifically configured to: process the labeled sample image based on the student semantic segmentation model to obtain a first prediction result; obtain a first supervised loss based on the annotation information of the labeled sample image and the first prediction result, the first supervised loss characterizing the difference between the annotation information and the first prediction result; obtain a second supervised loss based on the first labeled segmentation map, the second labeled segmentation map, and the first prediction result, the second supervised loss characterizing the pixel-level consistency difference of the first segmentation map and the second segmentation map relative to the first prediction result; and obtain the target supervised loss according to the first supervised loss and the second supervised loss.
In an embodiment of the present disclosure, when obtaining the target unsupervised loss according to the unlabeled sample image, the first unlabeled segmentation map, and the second unlabeled segmentation map, the training module 33 is specifically configured to: process the unlabeled sample image based on the student semantic segmentation model to obtain a second prediction result; obtain a first unsupervised loss based on the first unlabeled segmentation map, the second unlabeled segmentation map, and the second prediction result, the first unsupervised loss characterizing the pixel-level consistency difference of the first unlabeled segmentation map and the second unlabeled segmentation map relative to the second prediction result; and obtain the target unsupervised loss according to the first unsupervised loss.
In an embodiment of the present disclosure, the processing module 32 is further configured to: acquire a first feature map of the unlabeled sample image output by the decoder of the first teacher network and a second feature map of the unlabeled sample image output by the decoder of the student semantic segmentation model; the training module 33 is further configured to: obtain a second unsupervised loss according to the first feature map and the second feature map, the second unsupervised loss characterizing the difference between the regional texture correlation of the second prediction result and the regional texture correlation of the first unlabeled segmentation map; and when obtaining the target unsupervised loss according to the first unsupervised loss, the training module 33 is specifically configured to: obtain the target unsupervised loss according to the first unsupervised loss and the second unsupervised loss.
In an embodiment of the present disclosure, when obtaining the second unsupervised loss according to the first feature map and the second feature map, the training module 33 is specifically configured to: map the first feature map to a first feature vector set and the second feature map to a second feature vector set, the first feature vector set characterizing the first teacher network's evaluation of the region-level content of the unlabeled sample image and the second feature vector set characterizing the student semantic segmentation model's evaluation of the region-level content of the unlabeled sample image; obtain the corresponding first autocorrelation matrix and second autocorrelation matrix according to the first feature vector set and the second feature vector set, the first autocorrelation matrix characterizing the correlation between the region-level contents corresponding to the first feature vector set and the second autocorrelation matrix characterizing the correlation between the region-level contents corresponding to the second feature vector set; and obtain the second unsupervised loss according to the difference between the first autocorrelation matrix and the second autocorrelation matrix.
In an embodiment of the present disclosure, the training module 33 is further configured to: obtain a third unsupervised loss based on the second unlabeled segmentation map and the second prediction result, the third unsupervised loss characterizing the difference between the global semantic category corresponding to the second prediction result and the global semantic category corresponding to the second unlabeled segmentation map; and when obtaining the target unsupervised loss according to the first unsupervised loss, the training module 33 is specifically configured to: obtain the target unsupervised loss according to the first unsupervised loss and the third unsupervised loss.
In an embodiment of the present disclosure, when obtaining the third unsupervised loss based on the second unlabeled segmentation map and the second prediction result, the training module 33 is specifically configured to: acquire a first global semantic vector corresponding to the second unlabeled segmentation map and a second global semantic vector corresponding to the second prediction result, the first global semantic vector characterizing the number and semantic categories of the objects segmented in the second unlabeled segmentation map and the second global semantic vector characterizing the number and semantic categories of the objects segmented in the second prediction result; and obtain the third unsupervised loss according to the difference between the first global semantic vector and the second global semantic vector.
The acquisition module 31, the processing module 32, and the training module 33 are connected in sequence. The semantic segmentation model training apparatus 3 provided in this embodiment can execute the technical solutions of the above method embodiments; its implementation principles and technical effects are similar and are not repeated here.
FIG. 12 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 12, the electronic device 4 includes:
a processor 401, and a memory 402 communicatively connected to the processor 401;
the memory 402 stores computer-executable instructions;
the processor 401 executes the computer-executable instructions stored in the memory 402 to implement the semantic segmentation model training method in the embodiments shown in FIGS. 2-10.
Optionally, the processor 401 and the memory 402 are connected via a bus 403.
For related descriptions, reference may be made to the descriptions and effects corresponding to the steps in the embodiments corresponding to FIGS. 2-10, which are not repeated here.
Referring to FIG. 13, it shows a schematic structural diagram of an electronic device 900 suitable for implementing the embodiments of the present disclosure. The electronic device 900 may be a terminal device or a server. The terminal device may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (Personal Digital Assistant, PDA), tablet computers (Portable Android Device, PAD), portable multimedia players (Portable Media Player, PMP), and vehicle-mounted terminals (e.g., vehicle navigation terminals), as well as fixed terminals such as digital televisions (Television, TV) and desktop computers. The electronic device shown in FIG. 13 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 13, the electronic device 900 may include a processing device (such as a central processing unit or a graphics processor) 901, which may perform various appropriate actions and processing according to a program stored in a read-only memory (Read Only Memory, ROM) 902 or a program loaded from a storage device 908 into a random access memory (Random Access Memory, RAM) 903. The RAM 903 also stores various programs and data required for the operation of the electronic device 900. The processing device 901, the ROM 902, and the RAM 903 are connected to one another via a bus 904. An input/output (Input/Output, I/O) interface 905 is also connected to the bus 904.
Generally, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 907 including, for example, a liquid crystal display (Liquid Crystal Display, LCD), speaker, vibrator, etc.; storage devices 908 including, for example, a magnetic tape, hard disk, etc.; and a communication device 909. The communication device 909 may allow the electronic device 900 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 13 shows an electronic device 900 with various devices, it should be understood that it is not required to implement or have all of the devices shown; more or fewer devices may alternatively be implemented or provided.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network through the communication device 909, installed from the storage device 908, or installed from the ROM 902. When the computer program is executed by the processing device 901, the above-described functions defined in the methods of the embodiments of the present disclosure are executed.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (Compact Disc Read Only Memory, CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, the computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; the computer-readable signal medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to an electric wire, an optical cable, radio frequency (Radio Frequency, RF), etc., or any suitable combination of the above.
The above computer-readable medium may be contained in the above electronic device, or may exist alone without being assembled into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to execute the methods shown in the above embodiments.
The computer program code for executing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, the programming languages including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (Local Area Network, LAN) or a wide area network (Wide Area Network, WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the architectures, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or part of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that executes the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by software or by hardware, where the name of a unit does not, under certain circumstances, constitute a limitation on the unit itself; for example, the first acquisition unit may also be described as "a unit that acquires at least two Internet Protocol addresses".
The functions described herein above may be executed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field-programmable gate arrays (Field-Programmable Gate Array, FPGA), application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), application-specific standard parts (Application Specific Standard Parts, ASSP), systems on chip (System On Chip, SOC), complex programmable logic devices (Complex Programmable Logic Device, CPLD), and so forth.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of the machine-readable storage medium may include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
In a first aspect, according to one or more embodiments of the present disclosure, a semantic segmentation model training method is provided, including:
acquiring a pre-trained teacher semantic segmentation model, the teacher semantic segmentation model including a first teacher network and a second teacher network, wherein the first teacher network has a low-depth, high-width structural characteristic and the second teacher network has a high-depth, low-width structural characteristic; processing a sample image based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, wherein the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network; and training a lightweight student semantic segmentation model according to the sample image, the first segmentation map, and the second segmentation map to obtain a target semantic segmentation model.
According to one or more embodiments of the present disclosure, the depth-to-width ratio coefficient of the first teacher network is less than or equal to a first threshold, the depth-to-width ratio coefficient of the second teacher network is greater than or equal to a second threshold, the first threshold is smaller than the second threshold, and the depth-to-width ratio coefficient represents the ratio of the number of network layers to the number of network output channels.
According to one or more embodiments of the present disclosure, the sample image includes a labeled sample image and an unlabeled sample image; the first segmentation map includes a first labeled segmentation map generated from the labeled sample image and a first unlabeled segmentation map generated from the unlabeled sample image; the second segmentation map includes a second labeled segmentation map generated from the labeled sample image and a second unlabeled segmentation map generated from the unlabeled sample image; and training the lightweight student semantic segmentation model according to the sample image, the first segmentation map, and the second segmentation map to obtain the target semantic segmentation model includes: obtaining a target supervised loss according to the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map; obtaining a target unsupervised loss according to the unlabeled sample image, the first unlabeled segmentation map, and the second unlabeled segmentation map; and performing weighted fusion of the target supervised loss and the target unsupervised loss to obtain an output loss, performing reverse gradient propagation based on the output loss, and adjusting the network parameters of the student semantic segmentation model to obtain the target semantic segmentation model.
According to one or more embodiments of the present disclosure, obtaining the target supervised loss according to the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map includes: processing the labeled sample image based on the student semantic segmentation model to obtain a first prediction result; obtaining a first supervised loss based on the annotation information of the labeled sample image and the first prediction result, the first supervised loss characterizing the difference between the annotation information and the first prediction result; obtaining a second supervised loss based on the first labeled segmentation map, the second labeled segmentation map, and the first prediction result, the second supervised loss characterizing the pixel-level consistency difference of the first segmentation map and the second segmentation map relative to the first prediction result; and obtaining the target supervised loss according to the first supervised loss and the second supervised loss.
According to one or more embodiments of the present disclosure, obtaining the target unsupervised loss according to the unlabeled sample image, the first unlabeled segmentation map, and the second unlabeled segmentation map includes: processing the unlabeled sample image based on the student semantic segmentation model to obtain a second prediction result; obtaining a first unsupervised loss based on the first unlabeled segmentation map, the second unlabeled segmentation map, and the second prediction result, the first unsupervised loss characterizing the pixel-level consistency difference of the first unlabeled segmentation map and the second unlabeled segmentation map relative to the second prediction result; and obtaining the target unsupervised loss according to the first unsupervised loss.
According to one or more embodiments of the present disclosure, the method further includes: acquiring a first feature map of the unlabeled sample image output by the decoder of the first teacher network and a second feature map of the unlabeled sample image output by the decoder of the student semantic segmentation model; and obtaining a second unsupervised loss according to the first feature map and the second feature map, the second unsupervised loss characterizing the difference between the regional texture correlation of the second prediction result and the regional texture correlation of the first unlabeled segmentation map; and obtaining the target unsupervised loss according to the first unsupervised loss includes: obtaining the target unsupervised loss according to the first unsupervised loss and the second unsupervised loss.
According to one or more embodiments of the present disclosure, obtaining the second unsupervised loss according to the first feature map and the second feature map includes: mapping the first feature map to a first feature vector set and the second feature map to a second feature vector set, the first feature vector set characterizing the first teacher network's evaluation of the region-level content of the unlabeled sample image and the second feature vector set characterizing the student semantic segmentation model's evaluation of the region-level content of the unlabeled sample image; obtaining the corresponding first autocorrelation matrix and second autocorrelation matrix according to the first feature vector set and the second feature vector set, the first autocorrelation matrix characterizing the correlation between the region-level contents corresponding to the first feature vector set and the second autocorrelation matrix characterizing the correlation between the region-level contents corresponding to the second feature vector set; and obtaining the second unsupervised loss according to the difference between the first autocorrelation matrix and the second autocorrelation matrix.
According to one or more embodiments of the present disclosure, the method further includes: obtaining a third unsupervised loss based on the second unlabeled segmentation map and the second prediction result, the third unsupervised loss characterizing the difference between the global semantic category corresponding to the second prediction result and the global semantic category corresponding to the second unlabeled segmentation map; and obtaining the target unsupervised loss according to the first unsupervised loss includes: obtaining the target unsupervised loss according to the first unsupervised loss and the third unsupervised loss.
According to one or more embodiments of the present disclosure, obtaining the third unsupervised loss based on the second unlabeled segmentation map and the second prediction result includes: acquiring a first global semantic vector corresponding to the second unlabeled segmentation map and a second global semantic vector corresponding to the second prediction result, the first global semantic vector characterizing the number and semantic categories of the objects segmented in the second unlabeled segmentation map and the second global semantic vector characterizing the number and semantic categories of the objects segmented in the second prediction result; and obtaining the third unsupervised loss according to the difference between the first global semantic vector and the second global semantic vector.
In a second aspect, according to one or more embodiments of the present disclosure, a semantic segmentation model training apparatus is provided, including:
an acquisition module, configured to acquire a pre-trained teacher semantic segmentation model, the teacher semantic segmentation model including a first teacher network and a second teacher network, wherein the first teacher network has a low-depth, high-width structural characteristic and the second teacher network has a high-depth, low-width structural characteristic;
a processing module, configured to process a sample image based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, wherein the first segmentation map is the result of semantic segmentation of the sample image by the first teacher network and the second segmentation map is the result of semantic segmentation of the sample image by the second teacher network;
a training module, configured to train a lightweight student semantic segmentation model according to the sample image, the first segmentation map, and the second segmentation map to obtain a target semantic segmentation model.
According to one or more embodiments of the present disclosure, the depth-to-width ratio coefficient of the first teacher network is less than or equal to a first threshold, the depth-to-width ratio coefficient of the second teacher network is greater than or equal to a second threshold, the first threshold is smaller than the second threshold, and the depth-to-width ratio coefficient represents the ratio of the number of network layers to the number of network output channels.
According to one or more embodiments of the present disclosure, the sample image includes a labeled sample image and an unlabeled sample image; the first segmentation map includes a first labeled segmentation map generated from the labeled sample image and a first unlabeled segmentation map generated from the unlabeled sample image; the second segmentation map includes a second labeled segmentation map generated from the labeled sample image and a second unlabeled segmentation map generated from the unlabeled sample image; and the training module is specifically configured to: obtain a target supervised loss according to the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map; obtain a target unsupervised loss according to the unlabeled sample image, the first unlabeled segmentation map, and the second unlabeled segmentation map; and perform weighted fusion of the target supervised loss and the target unsupervised loss to obtain an output loss, perform reverse gradient propagation based on the output loss, and adjust the network parameters of the student semantic segmentation model to obtain the target semantic segmentation model.
According to one or more embodiments of the present disclosure, when obtaining the target supervised loss according to the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map, the training module is specifically configured to: process the labeled sample image based on the student semantic segmentation model to obtain a first prediction result; obtain a first supervised loss based on the annotation information of the labeled sample image and the first prediction result, the first supervised loss characterizing the difference between the annotation information and the first prediction result; obtain a second supervised loss based on the first labeled segmentation map, the second labeled segmentation map, and the first prediction result, the second supervised loss characterizing the pixel-level consistency difference of the first segmentation map and the second segmentation map relative to the first prediction result; and obtain the target supervised loss according to the first supervised loss and the second supervised loss.
According to one or more embodiments of the present disclosure, when obtaining the target unsupervised loss according to the unlabeled sample image, the first unlabeled segmentation map, and the second unlabeled segmentation map, the training module is specifically configured to: process the unlabeled sample image based on the student semantic segmentation model to obtain a second prediction result; obtain a first unsupervised loss based on the first unlabeled segmentation map, the second unlabeled segmentation map, and the second prediction result, the first unsupervised loss characterizing the pixel-level consistency difference of the first unlabeled segmentation map and the second unlabeled segmentation map relative to the second prediction result; and obtain the target unsupervised loss according to the first unsupervised loss.
According to one or more embodiments of the present disclosure, the processing module is further configured to: acquire a first feature map of the unlabeled sample image output by the decoder of the first teacher network and a second feature map of the unlabeled sample image output by the decoder of the student semantic segmentation model; the training module is further configured to: obtain a second unsupervised loss according to the first feature map and the second feature map, the second unsupervised loss characterizing the difference between the regional texture correlation of the second prediction result and the regional texture correlation of the first unlabeled segmentation map; and when obtaining the target unsupervised loss according to the first unsupervised loss, the training module is specifically configured to: obtain the target unsupervised loss according to the first unsupervised loss and the second unsupervised loss.
According to one or more embodiments of the present disclosure, when obtaining the second unsupervised loss according to the first feature map and the second feature map, the training module is specifically configured to: map the first feature map to a first feature vector set and the second feature map to a second feature vector set, the first feature vector set characterizing the first teacher network's evaluation of the region-level content of the unlabeled sample image and the second feature vector set characterizing the student semantic segmentation model's evaluation of the region-level content of the unlabeled sample image; obtain the corresponding first autocorrelation matrix and second autocorrelation matrix according to the first feature vector set and the second feature vector set, the first autocorrelation matrix characterizing the correlation between the region-level contents corresponding to the first feature vector set and the second autocorrelation matrix characterizing the correlation between the region-level contents corresponding to the second feature vector set; and obtain the second unsupervised loss according to the difference between the first autocorrelation matrix and the second autocorrelation matrix.
According to one or more embodiments of the present disclosure, the training module is further configured to: obtain a third unsupervised loss based on the second unlabeled segmentation map and the second prediction result, the third unsupervised loss characterizing the difference between the global semantic category corresponding to the second prediction result and the global semantic category corresponding to the second unlabeled segmentation map; and when obtaining the target unsupervised loss according to the first unsupervised loss, the training module is specifically configured to: obtain the target unsupervised loss according to the first unsupervised loss and the third unsupervised loss.
According to one or more embodiments of the present disclosure, when obtaining the third unsupervised loss based on the second unlabeled segmentation map and the second prediction result, the training module is specifically configured to: acquire a first global semantic vector corresponding to the second unlabeled segmentation map and a second global semantic vector corresponding to the second prediction result, the first global semantic vector characterizing the number and semantic categories of the objects segmented in the second unlabeled segmentation map and the second global semantic vector characterizing the number and semantic categories of the objects segmented in the second prediction result; and obtain the third unsupervised loss according to the difference between the first global semantic vector and the second global semantic vector.
In a third aspect, according to one or more embodiments of the present disclosure, an electronic device is provided, including: a processor, and a memory communicatively connected to the processor;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory to implement the semantic segmentation model training method described in the above first aspect and various possible designs of the first aspect.
In a fourth aspect, according to one or more embodiments of the present disclosure, a computer-readable storage medium is provided, in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the semantic segmentation model training method described in the above first aspect and various possible designs of the first aspect is implemented.
In a fifth aspect, according to one or more embodiments of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the semantic segmentation model training method described in the above first aspect and various possible designs of the first aspect.
In a sixth aspect, according to one or more embodiments of the present disclosure, a computer program is provided, the computer program being used to implement the semantic segmentation model training method described in the above first aspect and various possible designs of the first aspect.
The above description is only a preferred embodiment of the present disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
In addition, although the operations are depicted in a specific order, this should not be understood as requiring that these operations be executed in the specific order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be interpreted as limitations on the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment; conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or method logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. On the contrary, the specific features and actions described above are merely example forms of implementing the claims.

Claims (14)

  1. A semantic segmentation model training method, comprising:
    acquiring a pre-trained teacher semantic segmentation model, the teacher semantic segmentation model comprising a first teacher network and a second teacher network, wherein the first teacher network has a structural characteristic of low depth and high width, and the second teacher network has a structural characteristic of high depth and low width;
    processing a sample image based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, wherein the first segmentation map is a result of semantic segmentation of the sample image by the first teacher network, and the second segmentation map is a result of semantic segmentation of the sample image by the second teacher network; and
    training a lightweight student semantic segmentation model according to the sample image, the first segmentation map, and the second segmentation map to obtain a target semantic segmentation model.
  2. The method according to claim 1, wherein a depth-to-width ratio coefficient of the first teacher network is less than or equal to a first threshold, a depth-to-width ratio coefficient of the second teacher network is greater than or equal to a second threshold, the first threshold is smaller than the second threshold, and the depth-to-width ratio coefficient represents a ratio of the number of network layers to the number of network output channels.
  3. The method according to claim 1 or 2, wherein the sample image comprises a labeled sample image and an unlabeled sample image, the first segmentation map comprises a first labeled segmentation map generated from the labeled sample image and a first unlabeled segmentation map generated from the unlabeled sample image, and the second segmentation map comprises a second labeled segmentation map generated from the labeled sample image and a second unlabeled segmentation map generated from the unlabeled sample image;
    training the lightweight student semantic segmentation model according to the sample image, the first segmentation map, and the second segmentation map to obtain the target semantic segmentation model comprises:
    obtaining a target supervised loss according to the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map;
    obtaining a target unsupervised loss according to the unlabeled sample image, the first unlabeled segmentation map, and the second unlabeled segmentation map; and
    performing weighted fusion of the target supervised loss and the target unsupervised loss to obtain an output loss, performing reverse gradient propagation based on the output loss, and adjusting network parameters of the student semantic segmentation model to obtain the target semantic segmentation model.
  4. The method according to claim 3, wherein obtaining the target supervised loss according to the labeled sample image, the first labeled segmentation map, and the second labeled segmentation map comprises:
    processing the labeled sample image based on the student semantic segmentation model to obtain a first prediction result;
    obtaining a first supervised loss based on annotation information of the labeled sample image and the first prediction result, the first supervised loss characterizing a difference between the annotation information and the first prediction result;
    obtaining a second supervised loss based on the first labeled segmentation map, the second labeled segmentation map, and the first prediction result, the second supervised loss characterizing a pixel-level consistency difference of the first segmentation map and the second segmentation map relative to the first prediction result; and
    obtaining the target supervised loss according to the first supervised loss and the second supervised loss.
  5. The method according to claim 3 or 4, wherein obtaining the target unsupervised loss according to the unlabeled sample image, the first unlabeled segmentation map, and the second unlabeled segmentation map comprises:
    processing the unlabeled sample image based on the student semantic segmentation model to obtain a second prediction result;
    obtaining a first unsupervised loss based on the first unlabeled segmentation map, the second unlabeled segmentation map, and the second prediction result, the first unsupervised loss characterizing a pixel-level consistency difference of the first unlabeled segmentation map and the second unlabeled segmentation map relative to the second prediction result; and
    obtaining the target unsupervised loss according to the first unsupervised loss.
  6. The method according to claim 5, wherein the method further comprises:
    acquiring a first feature map of the unlabeled sample image output by a decoder of the first teacher network and a second feature map of the unlabeled sample image output by a decoder of the student semantic segmentation model; and
    obtaining a second unsupervised loss according to the first feature map and the second feature map, the second unsupervised loss characterizing a difference between a regional texture correlation of the second prediction result and a regional texture correlation of the first unlabeled segmentation map;
    obtaining the target unsupervised loss according to the first unsupervised loss comprises:
    obtaining the target unsupervised loss according to the first unsupervised loss and the second unsupervised loss.
  7. The method according to claim 6, wherein obtaining the second unsupervised loss according to the first feature map and the second feature map comprises:
    mapping the first feature map to a first feature vector set and the second feature map to a second feature vector set, the first feature vector set characterizing the first teacher network's evaluation of region-level content of the unlabeled sample image, and the second feature vector set characterizing the student semantic segmentation model's evaluation of region-level content of the unlabeled sample image;
    obtaining a corresponding first autocorrelation matrix and a corresponding second autocorrelation matrix according to the first feature vector set and the second feature vector set, the first autocorrelation matrix characterizing correlations between the region-level contents corresponding to the first feature vector set, and the second autocorrelation matrix characterizing correlations between the region-level contents corresponding to the second feature vector set; and
    obtaining the second unsupervised loss according to a difference between the first autocorrelation matrix and the second autocorrelation matrix.
  8. The method according to claim 5, wherein the method further comprises:
    obtaining a third unsupervised loss based on the second unlabeled segmentation map and the second prediction result, the third unsupervised loss characterizing a difference between a global semantic category corresponding to the second prediction result and a global semantic category corresponding to the second unlabeled segmentation map;
    obtaining the target unsupervised loss according to the first unsupervised loss comprises:
    obtaining the target unsupervised loss according to the first unsupervised loss and the third unsupervised loss.
  9. The method according to claim 8, wherein obtaining the third unsupervised loss based on the second unlabeled segmentation map and the second prediction result comprises:
    acquiring a first global semantic vector corresponding to the second unlabeled segmentation map and a second global semantic vector corresponding to the second prediction result, the first global semantic vector characterizing the number and semantic categories of objects segmented in the second unlabeled segmentation map, and the second global semantic vector characterizing the number and semantic categories of objects segmented in the second prediction result; and
    obtaining the third unsupervised loss according to a difference between the first global semantic vector and the second global semantic vector.
  10. A semantic segmentation model training apparatus, comprising:
    an acquisition module, configured to acquire a pre-trained teacher semantic segmentation model, the teacher semantic segmentation model comprising a first teacher network and a second teacher network, wherein the first teacher network has a structural characteristic of low depth and high width, and the second teacher network has a structural characteristic of high depth and low width;
    a processing module, configured to process a sample image based on the teacher semantic segmentation model to obtain a first segmentation map and a second segmentation map, wherein the first segmentation map is a result of semantic segmentation of the sample image by the first teacher network, and the second segmentation map is a result of semantic segmentation of the sample image by the second teacher network; and
    a training module, configured to train a lightweight student semantic segmentation model according to the sample image, the first segmentation map, and the second segmentation map to obtain a target semantic segmentation model.
  11. An electronic device, comprising: a processor, and a memory communicatively connected to the processor;
    the memory stores computer-executable instructions;
    the processor executes the computer-executable instructions stored in the memory to implement the semantic segmentation model training method according to any one of claims 1 to 9.
  12. A computer-readable storage medium, in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the semantic segmentation model training method according to any one of claims 1 to 9 is implemented.
  13. A computer program product, comprising a computer program that, when executed by a processor, implements the semantic segmentation model training method according to any one of claims 1 to 9.
  14. A computer program, the computer program being used to implement the semantic segmentation model training method according to any one of claims 1 to 9.
PCT/CN2023/104539 2022-07-11 2023-06-30 Semantic segmentation model training method and apparatus, electronic device, and storage medium WO2024012255A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210814989.4A CN117437411A (zh) 2022-07-11 2022-07-11 Semantic segmentation model training method and apparatus, electronic device, and storage medium
CN202210814989.4 2022-07-11

Publications (1)

Publication Number Publication Date
WO2024012255A1 true WO2024012255A1 (zh) 2024-01-18

Family

ID=89535416

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/104539 WO2024012255A1 (zh) 2022-07-11 2023-06-30 Semantic segmentation model training method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN117437411A (zh)
WO (1) WO2024012255A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118037651A (zh) * 2024-01-29 2024-05-14 浙江工业大学 Medical image segmentation method based on multi-teacher networks and pseudo-label contrastive generation
CN118314352A (zh) * 2024-06-07 2024-07-09 安徽农业大学 Crop remote-sensing image segmentation method based on patch-level classification labels
CN118334277A (zh) * 2024-06-17 2024-07-12 浙江有鹿机器人科技有限公司 Self-distillation occupancy grid generation method and device based on hard voxel mining
CN118583888A (zh) * 2024-08-01 2024-09-03 四川省华兴宇电子科技有限公司 Method and system for analyzing copper-embedding quality of printed circuit boards
CN118644603A (zh) * 2024-08-16 2024-09-13 北京工业大学 Online vector map construction method and device based on volume rendering knowledge distillation
CN118644603B (zh) * 2024-08-16 2024-10-29 北京工业大学 Online vector map construction method and device based on volume rendering knowledge distillation

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118015431B (zh) * 2024-04-03 2024-07-26 阿里巴巴(中国)有限公司 Image processing method, device, storage medium, and program product

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019234564A1 (en) * 2018-06-08 2019-12-12 International Business Machines Corporation Constructing a mixed-domain model
CN111985523A (zh) * 2020-06-28 2020-11-24 合肥工业大学 Power-of-two-exponent deep neural network quantization method based on knowledge distillation training
US20220138633A1 * 2020-11-05 2022-05-05 Samsung Electronics Co., Ltd. Method and apparatus for incremental learning
CN112508169A (zh) * 2020-11-13 2021-03-16 华为技术有限公司 Knowledge distillation method and system
KR20220069225A (ko) 2020-11-20 2022-05-27 서울대학교산학협력단 Knowledge distillation method for lightweighting a transformer neural network and apparatus for performing the same
CN113792871A (zh) * 2021-08-04 2021-12-14 北京旷视科技有限公司 Neural network training method, target recognition method, apparatus, and electronic device
CN114120319A (zh) * 2021-10-09 2022-03-01 苏州大学 Continual image semantic segmentation method based on multi-level knowledge distillation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG, Run-Zheng; GAO, Jian; HUANG, Shu-Hua; TONG, Xin: "Malicious Code Family Detection Method Based on Knowledge Distillation", Computer Science, vol. 48, no. 1, 15 January 2021 (2021-01-15), pages 280-286, XP093127701 *
YOU, Shan; XU, Chang; XU, Chao; TAO, Dacheng: "Learning from Multiple Teacher Networks", Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 4 August 2017 (2017-08-04), New York, NY, USA, pages 1285-1294, XP058787071, ISBN: 978-1-4503-9140-5, DOI: 10.1145/3097983.3098135 *

Also Published As

Publication number Publication date
CN117437411A (zh) 2024-01-23

Similar Documents

Publication Publication Date Title
WO2024012255A1 (zh) Semantic segmentation model training method and apparatus, electronic device, and storage medium
CN111476309B (zh) Image processing method, model training method, apparatus, device, and readable medium
WO2020155907A1 (zh) Method and apparatus for generating a cartoon style conversion model
WO2024012251A1 (zh) Semantic segmentation model training method and apparatus, electronic device, and storage medium
WO2020228405A1 (zh) Image processing method and apparatus, and electronic device
WO2022012179A1 (zh) Method and apparatus for generating a feature extraction network, device, and computer-readable medium
WO2023143178A1 (zh) Object segmentation method and apparatus, device, and storage medium
CN113515942A (zh) Text processing method and apparatus, computer device, and storage medium
CN112149699B (zh) Method and apparatus for generating a model, and method and apparatus for recognizing an image
CN110211017B (zh) Image processing method and apparatus, and electronic device
CN113033682B (zh) Video classification method and apparatus, readable medium, and electronic device
WO2022161302A1 (zh) Action recognition method and apparatus, device, storage medium, and computer program product
CN112364829A (zh) Face recognition method and apparatus, device, and storage medium
CN118097157B (zh) Image segmentation method and system based on fuzzy clustering algorithm
CN113515994A (zh) Video feature extraction method and apparatus, device, and storage medium
CN112232422A (zh) Target pedestrian re-identification method and apparatus, electronic device, and storage medium
JP2022541832A (ja) Method and apparatus for retrieving images
CN117633228A (zh) Model training method and apparatus
CN113610034B (zh) Method and apparatus for recognizing person entities in video, storage medium, and electronic device
WO2024007958A1 (zh) Image semantic segmentation model optimization method and apparatus, electronic device, and storage medium
CN118397147A (zh) Image text generation method and apparatus based on deep learning
WO2023169334A1 (zh) Image semantic segmentation method and apparatus, electronic device, and storage medium
WO2024061311A1 (zh) Model training method, and image classification method and apparatus
CN111915689B (zh) Method and apparatus for generating an objective function, electronic device, and computer-readable medium
CN117171573A (zh) Multimodal model training method and apparatus, device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23838747

Country of ref document: EP

Kind code of ref document: A1