CN117079276B

CN117079276B - Semantic segmentation method, system, equipment and medium based on knowledge distillation

Info

Publication number: CN117079276B
Application number: CN202310748610.9A
Authority: CN
Inventors: 苟建平; 周夏斌; 马忠臣; 宋和平; 刘金华; 欧卫华
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2023-06-21
Filing date: 2023-06-21
Publication date: 2024-02-09
Anticipated expiration: 2043-06-21
Also published as: CN117079276A

Abstract

The invention discloses a semantic segmentation method, system, device and medium based on knowledge distillation. It relates to semantic segmentation in the field of artificial intelligence and aims to solve the technical problem that existing semantic segmentation methods ignore the high-difference regions between what the teacher model and the student model have learned, which results in poor segmentation performance. In the method, the teacher and student network models each output an intermediate feature map and logit features; the logit features are input into a difference-aware knowledge distillation model, which performs a logit difference calculation and a probability distribution difference calculation to obtain a logit difference mask and a probability distribution difference mask. With the features masked jointly by the two mechanisms, the feature generation module drives the student model to generate the teacher's features. By minimizing the reconstruction error of the difference regions, the student learns from the high-difference regions and pays more attention to boundary regions when recovering the feature map generated by the teacher, so that a better representation is obtained and the performance of semantic segmentation is greatly improved.

Description

Semantic segmentation method, system, equipment and medium based on knowledge distillation

Technical Field

The invention belongs to the technical field of artificial intelligence, relates to semantic segmentation, and in particular relates to a semantic segmentation method, system, equipment and medium based on knowledge distillation.

Background

In recent years, deep learning has advanced rapidly and is widely used, particularly in computer vision, in fields such as autonomous driving, geological exploration and medical image diagnosis. Introducing more parameters generally improves the accuracy of a model. Semantic segmentation is an important and practical task in computer vision: each pixel of an image is classified by feeding the image into a network that learns a rich, context-aware feature representation from which the pixel class is determined.

Applications such as object detection and semantic segmentation are developing rapidly with the support of deep convolutional neural networks (DNNs). For example, to recognize and segment urban street views in the field of autonomous driving, a large number of urban street view images must be collected as training samples and annotated with the corresponding ground-truth label data, after which the training samples and label data are input into a network model for training. The patent application with application number 202310359928.8 discloses a semantic segmentation method for urban road scenes based on laser point clouds; it detects and analyzes targets based on road point cloud data and can assist vehicles in measuring data such as distance and speed so as to recognize and avoid obstacles. The method first performs voxel downsampling on the original urban street point cloud; then randomly assigns probability values to all points of the downsampled point cloud to construct the input point set of the network; then trains a neural network based on the fusion of graph convolution and attention; and finally performs semantic segmentation prediction with that network. The method preprocesses the urban street scene point cloud using computer graphics and makes full use of the distribution characteristics of street scenes. It predicts categories in a sampling-projection manner, which reduces the amount of data for network inference, and the neural network fusing graph convolution and attention improves the accuracy and efficiency of semantic segmentation of urban street point clouds.

At present, most advanced semantic segmentation methods require a large amount of computing resources to achieve accurate segmentation. Although the performance of DNNs has improved significantly, efficiency is also very important for semantic segmentation: the large memory footprint and heavy computation of these deep networks make it difficult to deploy the trained networks directly in real-time applications such as embedded systems and autonomous vehicles. Model compression techniques have emerged to address these issues, including lightweight network design, pruning, quantization and knowledge distillation. Knowledge distillation (KD) is a model compression technique that transfers knowledge from a large, complex network with rich expressive power (the teacher network) to a small network (the student network), so that the student network achieves performance comparable to, or even better than, that of the teacher network.

Semantic segmentation tasks prioritize spatial information and global information. Several KD techniques have been developed for semantic segmentation, including: SKD, which performs pairwise and global distillation by extracting structured knowledge for the student; IFVD, which transfers intra-class feature variation (IFV) from the teacher network to the student network; and CWD, which normalizes the feature map of each channel into a soft probability distribution.

The invention patent application with application number 202211180513.6 discloses an adaptive knowledge distillation method for semantic segmentation based on channel features. The method first obtains a teacher segmentation model and a student segmentation model; extracts the knowledge of each layer of the backbone networks of the teacher segmentation model and the student segmentation model and performs adaptive feature distillation; then calculates the channel correlation matrices of the teacher and student feature maps; performs channel-by-channel normalization on the label prediction feature maps finally output by the teacher segmentation model and the student segmentation model; and finally calculates the total loss of model training and trains the student model. That invention automatically learns the correlation of each layer of the backbone network through a self-attention mechanism and makes full use of the channel knowledge of the model; by learning the channel correlation and saliency features of the teacher model, it reduces the amount of redundant teacher knowledge learned by the student model and effectively improves the segmentation accuracy of the student model.

The patent application with application number 201911277549.4 discloses a knowledge distillation method based on intra-class feature differences for semantic segmentation. It addresses the balance between accuracy and efficiency of semantic segmentation models from a new perspective, and aims to transfer the dark knowledge learned by a complex model (the teacher model) to a simplified model (the student model), so as to improve the accuracy of the semantic segmentation model while keeping its speed. First, convolutional features are obtained from the teacher model and the student model respectively; then a feature map of each class center is obtained through a mask-guided average pooling operation, and the feature similarity between each pixel and its corresponding class center is calculated to obtain an intra-class feature difference map; finally, the intra-class feature difference map of the student model is aligned with that of the teacher model, thereby improving the accuracy of the student model. According to that application, the distillation method is novel, the resulting semantic segmentation model performs well in both accuracy and speed, and the method can be conveniently combined with other related techniques, giving it strong practical value.

However, as in the above patent applications, semantic segmentation methods based on knowledge distillation in the prior art generally train the student model to directly mimic the intermediate features or logits of the teacher model, or distill relational or structural knowledge between features. These methods ignore the high-difference regions between what the teacher model and the student model have learned, especially differences at the edges of instances, so segmentation performance still needs to be improved.

Disclosure of Invention

The invention provides a semantic segmentation method, system, device and medium based on knowledge distillation, to solve the technical problem that prior-art semantic segmentation methods based on knowledge distillation ignore the high-difference regions between what the teacher model and the student model have learned and therefore have poor segmentation performance.

In order to solve the technical problems, the invention adopts the following technical scheme:

a semantic segmentation method based on knowledge distillation comprises the following steps:

Step S1, obtaining sample data

Obtaining city street view sample image data, and annotating the ground truth in the city street view sample image data to form label data;

Step S2, constructing a semantic segmentation model

The semantic segmentation model comprises a teacher network model and a student network model which have the same structure but different numbers of layers, wherein the teacher network model is a pre-trained teacher network model;

Step S3, extracting sample features

Respectively inputting the city street view sample image data into the teacher network model and the student network model; the last layer of the backbone network of the teacher network model outputs a first intermediate feature map, and the teacher network model finally outputs a first logit feature; the last layer of the backbone network of the student network model outputs a second intermediate feature map, and the student network model finally outputs a second logit feature;

Step S4, constructing a knowledge distillation loss function

Inputting the first logit feature and the second logit feature into a difference-aware knowledge distillation model, obtaining a logit difference mask through logit difference calculation, and obtaining a probability distribution difference mask through probability distribution difference calculation; performing element-wise multiplication of the logit difference mask and of the probability distribution difference mask with the second intermediate feature map respectively, generating a double-mask student feature map by convolution, and constructing a knowledge distillation loss function based on the double-mask student feature map and the first intermediate feature map;

Step S5, constructing a segmentation loss function

Processing the second logit feature into a probability distribution, and calculating its cross entropy with the label data to obtain a segmentation loss function;

Step S6, constructing a total loss function

The total loss function comprises the segmentation loss function and the knowledge distillation loss function;

Step S7, training the student network model

Training the student network model with the total loss function, performing back-propagation, and updating the parameters of the student network model to obtain a trained student network model;

Step S8, real-time classification of city street view images

Acquiring a real-time city street view image and inputting it into the student network model for semantic segmentation, the student network model outputting the segmentation result.
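As a minimal sketch of step S8, and assuming the student network emits per-pixel class logits of shape (H, W, C), the label map can be read off with an argmax over the class axis. The patent text does not spell out this read-out; argmax is the usual convention and is an assumption here, as are the function name `segment` and the toy values.

```python
import numpy as np

def segment(logits: np.ndarray) -> np.ndarray:
    """Turn per-pixel class logits of shape (H, W, C) into a label map of shape (H, W)."""
    return np.argmax(logits, axis=-1)

# Toy logits for a 2 x 2 image with 3 classes (values are made up for illustration).
logits = np.array([[[2.0, 0.1, 0.3], [0.2, 1.5, 0.1]],
                   [[0.0, 0.2, 3.0], [1.0, 0.9, 0.8]]])
label_map = segment(logits)  # one class index per pixel
```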

Further, the backbone network of the teacher network model is ResNet101, and the backbone network of the student network model is ResNet18.

Further, in step S3, the method for obtaining the logit features is as follows:

Step S3-1, respectively inputting the city street view sample image data into the teacher network model and the student network model; the last layer of the backbone network of the teacher network model outputs a first intermediate feature map $F^T$, and the last layer of the backbone network of the student network model outputs a second intermediate feature map $F^S$; the first intermediate feature map $F^T$ and the second intermediate feature map $F^S$ are expressed as:

$$F^T \in \mathbb{R}^{H \times W \times D}, \quad F^S \in \mathbb{R}^{H \times W \times D}$$

wherein H, W and D represent the length, width and dimension, respectively, of the intermediate feature map;

Step S3-2, performing feature fusion on the first intermediate feature map $F^T$ and the second intermediate feature map $F^S$ respectively, to obtain the first logit feature $Z^T$ finally output by the teacher network model and the second logit feature $Z^S$ finally output by the student network model; the first logit feature $Z^T$ and the second logit feature $Z^S$ are expressed as:

$$Z^T \in \mathbb{R}^{H \times W \times C}, \quad Z^S \in \mathbb{R}^{H \times W \times C}$$

where H, W and C represent the length, width and dimension of the logit feature, respectively.

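Further, in step S4, the specific method for obtaining the logit difference mask is as follows:

Step S4-1-1, converting the label data into a label matrix $Y$ composed of 0 and 1:

$$Y \in \{0, 1\}^{H' \times W' \times C}$$

wherein C represents the number of categories, and H' and W' represent the height and width of the label matrix, respectively;

Step S4-1-2, downsampling the label matrix $Y$ by average pooling, and multiplying it by the absolute difference of the first logit feature and the second logit feature to obtain a logit difference matrix $\Delta$:

$$\Delta = \operatorname{AvgPool}(Y) \odot \left| Z^T - Z^S \right|, \quad \Delta \in \mathbb{R}^{H \times W \times C}$$

wherein H, W and C represent the length, width and dimension, respectively, of the logit feature, $Z^T$ and $Z^S$ represent the first logit feature and the second logit feature respectively, $\odot$ represents element-wise multiplication, and $\operatorname{AvgPool}$ represents average pooling;

Step S4-1-3, compressing the logit difference matrix $\Delta$ into a two-dimensional matrix through spatial attention, and outputting a spatial attention mask $A$:

$$A = \sigma\big(\operatorname{Conv}([\operatorname{Avg}(\Delta); \operatorname{Max}(\Delta)])\big), \quad A \in \mathbb{R}^{H \times W}$$

wherein H and W represent the length and width of the logit feature respectively, $[\cdot\,;\cdot]$ represents concatenation of matrices along the channel dimension, $\operatorname{Avg}$ and $\operatorname{Max}$ represent the mean and the maximum of each pixel over the channel dimension respectively, $\operatorname{Conv}$ represents a convolution layer that reduces the channel dimension from 2 to 1, and $\sigma$ represents the activation function applied to the convolution output;

Step S4-1-4, generating a logit difference mask $M^{\mathrm{logit}}$ from the spatial attention mask $A$:

$$M^{\mathrm{logit}}_{i,j} = \begin{cases} 1, & A_{i,j} \ge \tau_{q_1} \\ 0, & \text{otherwise} \end{cases}, \quad M^{\mathrm{logit}} \in \{0, 1\}^{H \times W}$$

wherein H and W represent the length and width of the logit feature respectively, $(i, j)$ represents the pixel position, $q_1$ represents a percentile value in the range [0, 1], and $\tau_{q_1}$ is the $q_1$-quantile of $A$.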

Further, in step S4, the specific method for obtaining the probability distribution difference mask is as follows:

Step S4-2-1, activating the first logit feature and the second logit feature with the softmax function to generate the teacher prediction probability $P^T$ and the student prediction probability $P^S$:

$$P^T_{i,j,c} = \frac{\exp(Z^T_{i,j,c})}{\sum_{c'=1}^{C} \exp(Z^T_{i,j,c'})}, \quad P^S_{i,j,c} = \frac{\exp(Z^S_{i,j,c})}{\sum_{c'=1}^{C} \exp(Z^S_{i,j,c'})}$$

wherein H, W and C represent the length, width and dimension, respectively, of the logit feature, $(i, j, c)$ represents a position in the logit feature, $Z^T_{i,j,c}$ represents the value of the first logit feature at position $(i, j, c)$, and $Z^S_{i,j,c}$ represents the value of the second logit feature at position $(i, j, c)$;

Step S4-2-2, calculating the absolute difference between the teacher prediction probability and the student prediction probability, and taking the maximum value of each pixel over the channel dimension to generate a maximum mask $M^{\max}$:

$$M^{\max}_{i,j} = \max_{c} \left| P^T_{i,j,c} - P^S_{i,j,c} \right|, \quad M^{\max} \in \mathbb{R}^{H \times W}$$

wherein H and W represent the length and width of the logit feature respectively, $c \in \{1, \dots, C\}$ represents the channel index, $P^T$ represents the teacher prediction probability, and $P^S$ represents the student prediction probability;

Step S4-2-3, setting a threshold $\tau_{q_2}$ corresponding to the $q_2$-quantile of the maximum mask $M^{\max}$, and generating a probability distribution difference mask $M^{\mathrm{prob}}$ by comparing $M^{\max}$ with $\tau_{q_2}$:

$$M^{\mathrm{prob}}_{i,j} = \begin{cases} 1, & M^{\max}_{i,j} \ge \tau_{q_2} \\ 0, & \text{otherwise} \end{cases}, \quad M^{\mathrm{prob}} \in \{0, 1\}^{H \times W}$$

wherein H and W represent the length and width of the logit feature respectively, $(i, j)$ represents the pixel position, and $q_2$ represents a percentile value in the range [0, 1].

Further, in step S4, when constructing the knowledge distillation loss function, the specific method is as follows:

Step S4-3-1, fusing with the logit difference mask and the probability distribution difference mask to generate a double-mask student feature map $\hat{F}^S$:

$$\hat{F}^S = G\big(\operatorname{Conv}_{3 \times 3}([M^{\mathrm{logit}} \odot F^S; M^{\mathrm{prob}} \odot F^S])\big)$$

wherein $M^{\mathrm{logit}}$ represents the logit difference mask, $M^{\mathrm{prob}}$ represents the probability distribution difference mask, $F^S$ represents the second intermediate feature map, $\operatorname{Conv}_{3 \times 3}$ represents a convolution layer with a 3 x 3 convolution kernel, $[\cdot\,;\cdot]$ represents the concatenation of two matrices along the channel dimension, $G$ represents a convolution block, $\odot$ represents element-wise multiplication, and $\delta$ represents the activation function used inside the convolution block $G$;

Step S4-3-2, constructing a knowledge distillation loss function $L_{kd}$ from the double-mask student feature map $\hat{F}^S$ and the first intermediate feature map $F^T$:

$$L_{kd} = \frac{1}{HWD} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{d=1}^{D} \left( F^T_{i,j,d} - \hat{F}^S_{i,j,d} \right)^2$$

wherein H, W and D represent the length, width and dimension, respectively, of the first intermediate feature map, $(i, j, d)$ represents a position of the first intermediate feature map within the (H, W, D) range, $F^T_{i,j,d}$ represents the value of the first intermediate feature map at position $(i, j, d)$, and $\hat{F}^S_{i,j,d}$ represents the value of the double-mask student feature map at position $(i, j, d)$.

Further, in step S6, when constructing the total loss function, the total loss function $L$ is expressed as:

$$L = L_{seg} + \lambda L_{kd}$$

wherein $L_{seg}$ represents the segmentation loss function, $L_{kd}$ represents the knowledge distillation loss function, and $\lambda$ represents the weight.
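The combination of the segmentation loss (step S5) and the distillation loss (step S6) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the weight is taken to multiply the distillation term, the cross entropy is averaged over pixels, and the names `cross_entropy` and `total_loss` are hypothetical rather than taken from the patent.

```python
import numpy as np

def cross_entropy(student_logits, labels):
    """Mean per-pixel cross entropy.

    student_logits: (H, W, C) raw scores; labels: (H, W) integer class ids.
    """
    # Log-softmax over the class axis, shifted for numerical stability.
    shifted = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    # Pick the log-probability of the labelled class at every pixel.
    picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(-picked.mean())

def total_loss(seg_loss, kd_loss, weight):
    """Assumed form: segmentation loss plus weighted distillation loss."""
    return seg_loss + weight * kd_loss

# Toy example: uniform logits over 4 classes give a cross entropy of log(4).
toy_logits = np.zeros((2, 2, 4))
toy_labels = np.array([[0, 1], [2, 3]])
seg = cross_entropy(toy_logits, toy_labels)
loss = total_loss(seg, kd_loss=2.0, weight=0.5)
```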

A knowledge distillation based semantic segmentation system comprising:

the sample data acquisition module is used for acquiring city street view sample image data and annotating the ground truth in the city street view sample image data to form label data;

the semantic segmentation model construction module is used for constructing a semantic segmentation model, wherein the semantic segmentation model comprises a teacher network model and a student network model which have the same structure but different numbers of layers, and the teacher network model is a pre-trained teacher network model;

the sample feature extraction module is used for respectively inputting the city street view sample image data into the teacher network model and the student network model; the last layer of the backbone network of the teacher network model outputs a first intermediate feature map, and the teacher network model finally outputs a first logit feature; the last layer of the backbone network of the student network model outputs a second intermediate feature map, and the student network model finally outputs a second logit feature;

the knowledge distillation loss function construction module is used for inputting the first logit feature and the second logit feature into a difference-aware knowledge distillation model, obtaining a logit difference mask through logit difference calculation, and obtaining a probability distribution difference mask through probability distribution difference calculation; performing element-wise multiplication of the logit difference mask and of the probability distribution difference mask with the second intermediate feature map respectively, generating a double-mask student feature map by convolution, and constructing a knowledge distillation loss function based on the double-mask student feature map and the first intermediate feature map;

the segmentation loss function construction module is used for processing the second logit feature into a probability distribution and calculating its cross entropy with the label data to obtain a segmentation loss function;

the total loss function construction module is used for constructing a total loss function, which comprises the segmentation loss function and the knowledge distillation loss function;

the student network model training module is used for training the student network model with the total loss function, performing back-propagation, and updating the parameters of the student network model to obtain a trained student network model;

and the city street view image real-time classification module is used for acquiring a real-time city street view image and inputting it into the student network model for semantic segmentation, the student network model outputting the segmentation result.

A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method described above.

A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method described above.

Compared with the prior art, the invention has the beneficial effects that:

In the invention, to address the differences between the student network and the teacher network, the student features in regions where the logits of the teacher and of the student differ greatly are masked, and the masked features are then used to reconstruct the teacher's feature map, so that the student learns from the differences by minimizing the reconstruction error of the difference regions. Differences between the teacher model and the student model in the logit space are detected through two masking mechanisms (a mask derived from the teacher-student logit difference map combined with the ground-truth labels, and a mask derived from the difference in probability distribution between the teacher and the student), and the student model is guided to recover the teacher's features and to focus on these high-difference regions, which improves segmentation performance. With the features masked jointly by the two mechanisms, the feature generation module drives the student model to generate the teacher's features; the student learns from the high-difference regions and pays more attention to boundary regions when recovering the feature map generated by the teacher, so that a better representation is obtained and the performance of semantic segmentation is greatly improved.

Drawings

FIG. 1 is a schematic flow chart of the present invention;

FIG. 2 is a schematic flow chart of the difference-aware knowledge distillation model in the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings. Embodiments of the present invention include, but are not limited to, the following examples.

Example 1

The embodiment provides a semantic segmentation method based on knowledge distillation, which is used for carrying out semantic segmentation on a city street image. As shown in fig. 1, it includes the steps of:

Step S1, obtaining sample data

Acquiring city street view sample image data, and annotating the ground truth in the city street view sample image data to form label data.

Sample data in the training set may come from the Cityscapes Dataset - Semantic Understanding of Urban Street Scenes (cityscapes-dataset.com). The pictures in the training set are city street view pictures, and the corresponding ground truth is annotated to form the label data.

The training set uses the Cityscapes dataset, a collection of images captured from street views of 50 different cities that focuses on semantic understanding of city street views. The finely annotated images of the dataset are divided into a training set, a validation set and a test set, consisting of 2975, 500 and 1525 images respectively. Each image in the dataset is annotated with 30 common classes, but only 19 classes are used for evaluation and testing.

In addition, the city street view sample image data can be preprocessed; the preprocessing includes operations such as cropping, scaling and rotation, so as to augment the sample data.
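The preprocessing can be pictured with a small NumPy sketch that applies the same crop, rotation and scaling to an image and its label map so the two stay aligned. The specific choices here (square random crop, rotation by a multiple of 90 degrees, 2x nearest-neighbour upscaling, the function name `augment`) are illustrative assumptions; the patent only names the operation types.

```python
import numpy as np

def augment(image, label, rng, crop=4):
    """Apply one random crop, rotation and 2x upscaling to image (H, W, 3) and label (H, W)."""
    h, w = label.shape
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    image = image[top:top + crop, left:left + crop]
    label = label[top:top + crop, left:left + crop]
    # Rotate both arrays by the same multiple of 90 degrees in the image plane.
    k = int(rng.integers(0, 4))
    image = np.rot90(image, k, axes=(0, 1))
    label = np.rot90(label, k, axes=(0, 1))
    # Nearest-neighbour upscaling keeps label values valid class ids.
    image = image.repeat(2, axis=0).repeat(2, axis=1)
    label = label.repeat(2, axis=0).repeat(2, axis=1)
    return image, label

rng = np.random.default_rng(0)
label = rng.integers(0, 19, size=(8, 8))
# Toy image whose channels copy the label, so alignment can be checked afterwards.
image = np.repeat(label[..., None], 3, axis=-1).astype(np.float32)
aug_image, aug_label = augment(image, label, rng)
```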

Step S2, constructing a semantic segmentation model

The semantic segmentation model comprises a teacher network model and a student network model, which have the same structure but different numbers of layers; the teacher network model is a pre-trained teacher network model.

The teacher network model can be constructed as a PSPNet-Res101 structure, whose backbone network is ResNet101; the student network model can be constructed as a PSPNet-Res18 structure, whose backbone network is ResNet18. ResNet101 has more layers than ResNet18, so the teacher network model has greater learning capacity and achieves better results.

In addition, the teacher network model is pre-trained with the sample data to obtain the pre-trained teacher network model.

Step S3, extracting sample characteristics

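Respectively inputting the city street view sample image data into the teacher network model and the student network model; the last layer of the backbone network of the teacher network model outputs a first intermediate feature map, and the teacher network model finally outputs a first logit feature; the last layer of the backbone network of the student network model outputs a second intermediate feature map, and the student network model finally outputs a second logit feature.

In this step, the specific method for obtaining the logit features is as follows:

Step S3-1, the last layer of the backbone network of the teacher network model outputs a first intermediate feature map $F^T$, and the last layer of the backbone network of the student network model outputs a second intermediate feature map $F^S$, expressed as:

$$F^T \in \mathbb{R}^{H \times W \times D}, \quad F^S \in \mathbb{R}^{H \times W \times D}$$

wherein H, W and D represent the length, width and dimension, respectively, of the intermediate feature map.

Step S3-2, performing feature fusion on the first intermediate feature map $F^T$ and the second intermediate feature map $F^S$ respectively, to obtain the first logit feature $Z^T$ finally output by the teacher network model and the second logit feature $Z^S$ finally output by the student network model; the first logit feature $Z^T$ and the second logit feature $Z^S$ are expressed as:

$$Z^T \in \mathbb{R}^{H \times W \times C}, \quad Z^S \in \mathbb{R}^{H \times W \times C}$$

where H, W and C represent the length, width and dimension of the logit feature, respectively.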

Step S4, constructing a knowledge distillation loss function

Inputting the first logit feature and the second logit feature into a difference-aware knowledge distillation model, obtaining a logit difference mask through logit difference calculation, and obtaining a probability distribution difference mask through probability distribution difference calculation; performing element-wise multiplication (multiplication of corresponding matrix elements) of the logit difference mask and of the probability distribution difference mask with the second intermediate feature map output by the student network model, generating a double-mask student feature map by convolution, and constructing a knowledge distillation loss function based on the double-mask student feature map and the first intermediate feature map output by the teacher network model.

In this embodiment, the difference-aware knowledge distillation model comprises a logit difference module, a probability distribution difference module and a feature generation module; the specific flow is shown in FIG. 2.

The logit difference module is mainly used to obtain the logit difference mask; the specific method is as follows:

Step S4-1-1, to locate the regions with the greatest difference, this embodiment converts the label data (i.e., the ground-truth labels) into a one-hot label matrix $Y$ consisting of 0 and 1:

$$Y \in \{0, 1\}^{H' \times W' \times C}$$

wherein C represents the number of categories, and H' and W' represent the height and width of the label matrix, respectively.
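A minimal NumPy sketch of the one-hot conversion in step S4-1-1 is shown below. The channel-last layout (height, width, classes) and the function name `one_hot` are assumptions made for illustration; the text only states that the matrix consists of 0 and 1 with one slice per category.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert integer labels of shape (H', W') into a 0/1 matrix of shape (H', W', C)."""
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype=np.float32)[labels]

labels = np.array([[0, 2],
                   [1, 1]])
label_matrix = one_hot(labels, num_classes=3)
```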

Step S4-1-2, downsampling the label matrix $Y$ by average pooling to match the spatial dimension of the logit feature map (i.e., $H \times W$), and then multiplying it by the difference between the first logit feature output by the teacher network model and the second logit feature output by the student network model, namely the absolute difference between $Z^T$ and $Z^S$, to obtain the logit difference matrix $\Delta$. This operation is expressed as:

$$\Delta = \operatorname{AvgPool}(Y) \odot \left| Z^T - Z^S \right|, \quad \Delta \in \mathbb{R}^{H \times W \times C}$$

wherein H, W and C represent the length, width and dimension, respectively, of the logit feature, $Z^T$ and $Z^S$ represent the first logit feature and the second logit feature respectively, $\odot$ represents element-wise multiplication (multiplication of corresponding elements), and $\operatorname{AvgPool}$ represents average pooling.
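The pooling-and-weighting operation of step S4-1-2 can be sketched as follows, assuming non-overlapping average pooling with label dimensions that are integer multiples of the logit dimensions. That pooling configuration and the helper names `avg_pool` and `logit_difference` are illustrative assumptions, not details given in the patent.

```python
import numpy as np

def avg_pool(x, out_h, out_w):
    """Non-overlapping average pooling from (H', W', C) down to (out_h, out_w, C)."""
    h, w, c = x.shape
    fh, fw = h // out_h, w // out_w  # pooling window size per axis
    return x.reshape(out_h, fh, out_w, fw, c).mean(axis=(1, 3))

def logit_difference(label_matrix, teacher_logits, student_logits):
    """Weight the absolute teacher-student logit gap by the pooled one-hot labels."""
    h, w, _ = teacher_logits.shape
    pooled = avg_pool(label_matrix, h, w)
    return pooled * np.abs(teacher_logits - student_logits)

# Toy case: every pixel belongs to class 0, so only channel 0 keeps its difference.
label_matrix = np.zeros((4, 4, 2))
label_matrix[..., 0] = 1.0
teacher_logits = np.full((2, 2, 2), 3.0)
student_logits = np.full((2, 2, 2), 1.0)
diff = logit_difference(label_matrix, teacher_logits, student_logits)
```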

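Step S4-1-3, this step further uses spatial attention to compress the logit difference matrix $\Delta$ into a two-dimensional matrix, so that the model pays more attention to the high-difference regions of $\Delta$, and outputs a spatial attention mask $A$:

$$A = \sigma\big(\operatorname{Conv}([\operatorname{Avg}(\Delta); \operatorname{Max}(\Delta)])\big), \quad A \in \mathbb{R}^{H \times W}$$

wherein H and W represent the length and width of the logit feature respectively, $[\cdot\,;\cdot]$ represents concatenation of matrices along the channel dimension, $\operatorname{Avg}$ and $\operatorname{Max}$ represent the mean and the maximum of each pixel over the channel dimension respectively, and $\operatorname{Conv}$ represents a convolution layer that reduces the channel dimension from 2 to 1. That is, the mean and maximum operations each reduce the channel dimension of $\Delta$ from C to 1, the two results are concatenated, the channel dimension is finally reduced to 1 by the convolution layer, and the result is activated by the activation function $\sigma$.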
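A hedged sketch of the spatial attention step follows. Several details are assumptions because the source does not give them: the convolution is modelled as a 1 x 1 kernel (two weights plus an optional bias), the activation is taken to be a sigmoid, as is common in spatial attention, and the weights are fixed by hand instead of being learned.

```python
import numpy as np

def spatial_attention(diff, conv_weight, conv_bias=0.0):
    """Compress a (H, W, C) difference matrix into a (H, W) attention map in (0, 1)."""
    avg = diff.mean(axis=-1)                    # per-pixel mean over channels
    mx = diff.max(axis=-1)                      # per-pixel maximum over channels
    stacked = np.stack([avg, mx], axis=-1)      # (H, W, 2)
    fused = stacked @ conv_weight + conv_bias   # assumed 1x1 convolution: 2 channels -> 1
    return 1.0 / (1.0 + np.exp(-fused))         # assumed sigmoid activation

diff = np.zeros((2, 2, 3))
diff[0, 0] = [0.0, 2.0, 4.0]                    # one pixel with a large difference
attention = spatial_attention(diff, conv_weight=np.array([1.0, 1.0]))
```

With these toy weights the pixel carrying the large difference receives an attention value close to 1, while pixels with no difference sit at 0.5.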

Step S4-1-4, the pixels of $A$ that obtain a larger attention weight correspond to the regions of the instances on which the teacher network model and the student network model focus most; to single out these pixels, a mask is generated with a percentile-based threshold function. That is, the logit difference mask $M^{\mathrm{logit}}$ is generated from the spatial attention mask $A$:

$$M^{\mathrm{logit}}_{i,j} = \begin{cases} 1, & A_{i,j} \ge \tau_{q_1} \\ 0, & \text{otherwise} \end{cases}, \quad M^{\mathrm{logit}} \in \{0, 1\}^{H \times W}$$

wherein $(i, j)$ represents the pixel position, $q_1$ represents a percentile value in the range [0, 1], and $\tau_{q_1}$ is the $q_1$-quantile of $A$. Pixels whose value is greater than or equal to the $q_1$-quantile are set to 1, representing high-difference regions between the predictions of the teacher network model and the student network model, and pixels whose value is smaller than the $q_1$-quantile are set to 0.

For example, when $q_1$ is set to 0.6, the pixels whose attention weight is greater than or equal to the 60th percentile will be masked.
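The percentile thresholding can be illustrated with `numpy.quantile`, which by default interpolates linearly between neighbouring sorted values. Whether the original implementation interpolates in the same way is not stated, so this is a sketch under that assumption; the function name `percentile_mask` and the attention values are made up.

```python
import numpy as np

def percentile_mask(scores, q):
    """Return a 0/1 mask marking pixels whose score reaches the q-quantile of all scores."""
    threshold = np.quantile(scores, q)
    return (scores >= threshold).astype(np.float32)

attention = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
                      [0.6, 0.7, 0.8, 0.9, 1.0]])
mask = percentile_mask(attention, q=0.6)
```

For these ten values the 0.6-quantile falls between 0.6 and 0.7, so the four largest pixels are marked with 1.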

The probability distribution difference module is mainly used to obtain the probability distribution difference mask; the specific method is as follows:

Step S4-2-1, activating the first logit feature and the second logit feature respectively with the softmax function to generate the teacher prediction probability $P^T$ and the student prediction probability $P^S$:

$$P^T_{i,j,c} = \frac{\exp(Z^T_{i,j,c})}{\sum_{c'=1}^{C} \exp(Z^T_{i,j,c'})}, \quad P^S_{i,j,c} = \frac{\exp(Z^S_{i,j,c})}{\sum_{c'=1}^{C} \exp(Z^S_{i,j,c'})}$$

wherein H, W and C represent the length, width and dimension, respectively, of the logit feature, $(i, j, c)$ represents a position in the logit feature, $Z^T_{i,j,c}$ represents the value of the first logit feature at position $(i, j, c)$, and $Z^S_{i,j,c}$ represents the value of the second logit feature at position $(i, j, c)$.
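As an illustration of turning logits into per-pixel class probabilities, the sketch below applies a softmax over the channel axis. Subtracting the per-pixel maximum is a standard numerical-stability step added here and is not something the patent text mentions; the toy logits are made up.

```python
import numpy as np

def softmax(logits):
    """Softmax over the last (class) axis of a (H, W, C) logit array."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy 1 x 2 image with 2 classes.
toy_logits = np.array([[[1.0, 1.0], [0.0, np.log(3.0)]]])
probs = softmax(toy_logits)
```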

Step S4-2-2, calculating the absolute difference between the teacher prediction probability $P^T$ and the student prediction probability $P^S$, and taking the maximum value of each pixel over the channel dimension; this operation produces a maximum mask $M^{\max}$:

$$M^{\max}_{i,j} = \max_{c} \left| P^T_{i,j,c} - P^S_{i,j,c} \right|, \quad M^{\max} \in \mathbb{R}^{H \times W}$$

wherein H and W represent the length and width of the logit feature respectively, $c \in \{1, \dots, C\}$ represents the channel index, $P^T$ represents the teacher prediction probability, and $P^S$ represents the student prediction probability.

A large difference between the teacher prediction probability $P^T$ and the student prediction probability $P^S$ results in a large value of $M^{\max}_{i,j}$, and such pixels form the difference regions of interest in this embodiment.

Step S4-2-3, similar to the logit difference module, a quantile-based threshold function is applied to $M^{\max}$ to limit the regions of interest to regions with relatively large differences. Specifically, a threshold $\tau_{q_2}$ corresponding to the $q_2$-quantile of the maximum mask $M^{\max}$ is set, and the probability distribution difference mask $M^{\mathrm{prob}}$ is generated by comparing $M^{\max}$ with $\tau_{q_2}$:

$$M^{\mathrm{prob}}_{i,j} = \begin{cases} 1, & M^{\max}_{i,j} \ge \tau_{q_2} \\ 0, & \text{otherwise} \end{cases}, \quad M^{\mathrm{prob}} \in \{0, 1\}^{H \times W}$$

wherein H and W represent the length and width of the logit feature respectively, $(i, j)$ represents the pixel position, and $q_2$ represents a percentile value in the range [0, 1]. Pixels whose value is greater than or equal to the $q_2$-quantile are set to 1, representing high-difference regions between the predictions of the teacher network model and the student network model, and pixels whose value is smaller than the $q_2$-quantile are set to 0. $q_2$ is set in the same way as $q_1$.
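Putting the per-pixel maximum difference and the quantile threshold together gives the following sketch of the probability distribution difference mask. It assumes probabilities are already available with shape (H, W, C) and uses `numpy.quantile` with its default linear interpolation; the function name and the toy probabilities are hypothetical.

```python
import numpy as np

def probability_difference_mask(teacher_prob, student_prob, q):
    """Return (mask, max_diff): a 0/1 mask of high-difference pixels and the raw per-pixel gap."""
    # Largest absolute probability gap over the class axis, per pixel.
    max_diff = np.abs(teacher_prob - student_prob).max(axis=-1)
    threshold = np.quantile(max_diff, q)
    return (max_diff >= threshold).astype(np.float32), max_diff

# Toy 1 x 4 image with 2 classes.
teacher_prob = np.array([[[0.9, 0.1], [0.6, 0.4], [0.5, 0.5], [0.2, 0.8]]])
student_prob = np.array([[[0.8, 0.2], [0.1, 0.9], [0.5, 0.5], [0.5, 0.5]]])
prob_mask, max_diff = probability_difference_mask(teacher_prob, student_prob, q=0.5)
```

Here the per-pixel gaps are roughly 0.1, 0.5, 0.0 and 0.3, so with a median threshold only the second and fourth pixels are marked.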

The feature generation module is mainly used for constructing the knowledge distillation loss function, and the specific method is as follows:

Step S4-3-1, the logic difference mask and the probability distribution difference mask are used simultaneously to mask the student's features, forcing the generation module to focus on the regions of difference. Specifically, the logic difference mask and the probability distribution difference mask are each multiplied element-wise with the second intermediate feature map, the two masked results are connected in the channel dimension, and the double-mask student feature map is generated from them through a convolution layer and a convolution block.

Here the convolution layer has a 3 x 3 convolution kernel, the connection joins two matrices in the channel dimension, and the dot product denotes element-wise multiplication. The convolution block comprises two 3 x 3 convolution layers, each followed by an activation function, and then a 1 x 1 convolution layer.
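The data flow of step S4-3-1 can be sketched in NumPy under several loudly stated assumptions, since the original formula is not legible here: the fusing 3 x 3 convolution is applied before the convolution block, the activation is ReLU, the masks have the same spatial size as the feature map, all layers are bias-free, and the weights are random placeholders. The helper names (`conv2d_same`, `dual_mask_features`, `params`) are hypothetical.

```python
import numpy as np

def conv2d_same(x, w):
    """Zero-padded, stride-1 convolution. x: H x W x Cin, w: k x k x Cin x Cout."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h, wd = x.shape[:2]
    out = np.zeros((h, wd, w.shape[-1]))
    for di in range(k):
        for dj in range(k):
            # Each kernel offset contributes one shifted, channel-mixed copy.
            out += xp[di:di + h, dj:dj + wd, :] @ w[di, dj]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def dual_mask_features(student_feat, logit_mask, prob_mask, params):
    # Mask the student feature map with each H x W mask (broadcast over channels).
    masked_l = student_feat * logit_mask[..., None]
    masked_p = student_feat * prob_mask[..., None]
    # Connect the two masked maps in the channel dimension: H x W x 2D.
    fused = np.concatenate([masked_l, masked_p], axis=-1)
    x = conv2d_same(fused, params["fuse"])       # assumed 3x3 conv: 2D -> D
    x = relu(conv2d_same(x, params["block1"]))   # block: 3x3 conv + activation
    x = relu(conv2d_same(x, params["block2"]))   # block: 3x3 conv + activation
    return conv2d_same(x, params["proj"])        # block: final 1x1 conv

rng = np.random.default_rng(0)
H, W, D = 4, 4, 8  # hypothetical sizes
params = {
    "fuse": rng.normal(size=(3, 3, 2 * D, D)) * 0.1,
    "block1": rng.normal(size=(3, 3, D, D)) * 0.1,
    "block2": rng.normal(size=(3, 3, D, D)) * 0.1,
    "proj": rng.normal(size=(1, 1, D, D)) * 0.1,
}
student_feat = rng.normal(size=(H, W, D))
logit_mask = (rng.random((H, W)) > 0.5).astype(float)
prob_mask = (rng.random((H, W)) > 0.5).astype(float)
generated = dual_mask_features(student_feat, logit_mask, prob_mask, params)
```

In a real training setup the convolution weights would be learned jointly with the student; the sketch only shows how the two masks gate the student features before the convolutions.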

Step S4-3-2, based on the masking mechanism, this embodiment designs the distillation loss function as the mean square error between the teacher feature of the teacher network and the double-mask student feature map. Specifically, the knowledge distillation loss function is constructed from the double-mask student feature map and the first intermediate feature map.

Wherein H, W and D respectively represent the length, width and dimension of the first intermediate feature map. The loss runs over every position within the (H, W, D) range of the first intermediate feature map, comparing the value of the first intermediate feature map at that position with the value of the double-mask student feature map at the same position.
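The mean square error of step S4-3-2 can be sketched as below. The sketch assumes both feature maps are H x W x D arrays of identical shape and that the error is averaged over all positions; whether the original formula normalizes by H·W·D or only sums is an assumption here. The names `distillation_loss`, `teacher_feat` and `generated_feat` are illustrative.

```python
import numpy as np

def distillation_loss(teacher_feat: np.ndarray, generated_feat: np.ndarray) -> float:
    """Mean of squared differences over all (H, W, D) positions."""
    return float(np.mean((teacher_feat - generated_feat) ** 2))

teacher_feat = np.ones((2, 2, 3))     # stands in for the first intermediate feature map
generated_feat = np.zeros((2, 2, 3))  # stands in for the double-mask student feature map
loss_kd = distillation_loss(teacher_feat, generated_feat)
```

The loss is zero only when the generated student features reproduce the teacher features exactly.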

Step S5, constructing a segmentation loss function

The second logic feature is converted into a probability distribution, and its cross entropy with the tag data is calculated to obtain the segmentation loss function. Specifically, the segmentation loss function $L_{seg}$ is constructed from the student prediction probability $P^S$ and the tag probability $P$:

$$L_{seg} = \mathrm{CE}\left(P, P^S\right) = -\sum_{i} P(i) \log P^S(i)$$

Wherein CE(·) denotes the cross entropy loss function (Cross Entropy Loss), $P(i)$ represents the actual probability value of the i-th class, and $P^S(i)$ represents the predicted probability value of the i-th class.
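A minimal NumPy sketch of the segmentation loss of step S5 follows. It assumes integer class labels that are turned into one-hot tag probabilities, and it averages the per-pixel cross entropy over all pixels; the averaging, the small `eps` guard and the function name `segmentation_loss` are assumptions made for the example.

```python
import numpy as np

def segmentation_loss(student_prob: np.ndarray, labels: np.ndarray, eps: float = 1e-12) -> float:
    """Mean per-pixel cross entropy between one-hot labels and predictions."""
    num_classes = student_prob.shape[-1]
    one_hot = np.eye(num_classes)[labels]  # H x W x C tag probability P
    # eps guards against log(0) when a predicted probability is exactly zero.
    per_pixel = -(one_hot * np.log(student_prob + eps)).sum(axis=-1)
    return float(per_pixel.mean())

student_prob = np.array([[[0.5, 0.5], [0.25, 0.75]]])  # 1 x 2 pixels, 2 classes
labels = np.array([[0, 1]])                             # true class per pixel
loss_seg = segmentation_loss(student_prob, labels)
```

With one-hot labels only the predicted probability of the true class contributes at each pixel.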

Step S6, constructing a total loss function

The total loss function includes the segmentation loss function and the knowledge distillation loss function, and is obtained by combining the two.
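One way to combine the two terms is sketched below. The text states only that the total loss includes both the segmentation loss and the knowledge distillation loss; the weighted-sum form and the `kd_weight` parameter (default 1.0) are assumptions introduced for illustration, not values taken from the source.

```python
def total_loss(seg_loss: float, kd_loss: float, kd_weight: float = 1.0) -> float:
    """Segmentation loss plus a (hypothetically weighted) distillation loss."""
    return seg_loss + kd_weight * kd_loss

loss = total_loss(seg_loss=0.5, kd_loss=0.25)
```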

Step S7, training a student network model

The student network model is trained with the total loss function: the loss is back-propagated and the parameters of the student network model are updated, yielding a trained student network model.

The student network model is trained in an existing, conventional manner, which is not the point of innovation of this application.

Step S8, real-time classification of city street images

According to published data, the average semantic segmentation accuracy of the previous methods SKD, IFVD, CWD and MGD on the Cityscapes data set is 72.5%, 72.73%, 75.77% and 76.22% respectively. In experiments, the difference-aware knowledge distillation method of this embodiment reaches an accuracy of 76.72% on the Cityscapes data set, which is 0.50 percentage points higher than the best of these prior methods (MGD) and demonstrates the effectiveness of the method.

Example 2

This embodiment provides a semantic segmentation system based on knowledge distillation, which is used for performing semantic segmentation on city street images. The system comprises the following modules:

The sample data acquisition module is used for acquiring urban street view sample image data together with the ground-truth tag data of the urban street view sample image data.

The semantic segmentation model construction module is used for constructing a semantic segmentation model. The semantic segmentation model comprises a teacher network model and a student network model, which have the same structure but different numbers of layers, and the teacher network model is pre-trained.

The sample feature extraction module is used for inputting the urban street view sample image data into the teacher network model and the student network model respectively. The last layer of the backbone network of the teacher network model outputs a first intermediate feature map, and the teacher network model finally outputs a first logic feature; the last layer of the backbone network of the student network model outputs a second intermediate feature map, and the student network model finally outputs a second logic feature.

The knowledge distillation loss function construction module is used for inputting the first logic feature and the second logic feature into a difference-aware knowledge distillation model, obtaining a logic difference mask through logic difference calculation, and obtaining a probability distribution difference mask through probability distribution difference calculation; the logic difference mask and the probability distribution difference mask are each multiplied element-wise with the second intermediate feature map output by the student network model, a double-mask student feature map is generated by convolution, and the knowledge distillation loss function is constructed based on the double-mask student feature map and the first intermediate feature map output by the teacher network model.

Wherein H, W and C represent the length, width and dimension of the logic feature respectively, $Z^T$ and $Z^S$ respectively represent the first logic feature and the second logic feature, $\odot$ represents dot multiplication (multiplication of corresponding elements), and $\mathrm{AvgPool}(\cdot)$ represents average pooling.
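The logic differential matrix that these symbols define (the tag matrix of 0 and 1 is downsampled by average pooling and multiplied element-wise by the difference of the two logic features) can be sketched as follows. The sketch assumes one-hot tags at a resolution that is an integer multiple of the logic feature resolution, non-overlapping pooling windows, and a signed difference (teacher minus student); the function names and the toy sizes are hypothetical.

```python
import numpy as np

def one_hot_labels(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Integer label map (H0 x W0) -> tag matrix of 0 and 1 (H0 x W0 x C)."""
    return np.eye(num_classes)[labels]

def avg_pool(x: np.ndarray, factor: int) -> np.ndarray:
    """Non-overlapping average pooling; H0 and W0 must be divisible by factor."""
    h, w, c = x.shape
    return x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def logit_difference(labels, teacher_logits, student_logits, factor):
    tag = avg_pool(one_hot_labels(labels, teacher_logits.shape[-1]), factor)
    # Element-wise product of the pooled tags with the logit difference.
    return tag * (teacher_logits - student_logits)

labels = np.ones((4, 4), dtype=int)        # every pixel belongs to class 1
teacher_logits = np.full((2, 2, 2), 3.0)   # H=2, W=2, C=2
student_logits = np.full((2, 2, 2), 1.0)
diff = logit_difference(labels, teacher_logits, student_logits, factor=2)
```

The pooled tags act as per-class weights, so only channels whose class is actually present in a pooled cell keep a non-zero difference.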

Step S4-1-3, spatial attention is further used to compress the logic differential matrix $D$ into a two-dimensional matrix, so that the model focuses more on the logic differences, and a spatial attention mask $M_s$ is output:

$$M_s = \sigma\left(\mathrm{Conv}\left(\left[\mathrm{Mean}(D);\ \mathrm{Max}(D)\right]\right)\right), \qquad M_s \in \mathbb{R}^{H \times W}$$

Wherein H and W respectively represent the length and width of the logic feature, $[\cdot;\cdot]$ represents the connection of matrices along the channel dimension, $\mathrm{Mean}(\cdot)$ and $\mathrm{Max}(\cdot)$ respectively take the mean and the maximum value of each pixel over the channel dimension, and $\mathrm{Conv}(\cdot)$ represents a convolution layer that reduces the dimension from 2 to 1. That is, the channel dimension of $D$ is first reduced from C to 1 by the mean and by the maximum, the two results are connected, the channel dimension is finally converted into 1 by the convolution layer, and the result is activated by the activation function $\sigma$.
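The spatial attention of step S4-1-3 can be sketched in NumPy as below. The sketch assumes a 1 x 1 convolution (the kernel size is not stated in this text) and a sigmoid activation, both of which are assumptions; the weights are placeholders, and the names `spatial_attention`, `conv_weight` and `conv_bias` are hypothetical.

```python
import numpy as np

def spatial_attention(diff: np.ndarray, conv_weight: np.ndarray, conv_bias: float = 0.0) -> np.ndarray:
    """Compress an H x W x C difference matrix into an H x W attention mask."""
    mean_map = diff.mean(axis=-1)                       # per-pixel mean over channels
    max_map = diff.max(axis=-1)                         # per-pixel max over channels
    stacked = np.stack([mean_map, max_map], axis=-1)    # H x W x 2
    fused = stacked @ conv_weight + conv_bias           # assumed 1x1 conv: 2 -> 1
    return 1.0 / (1.0 + np.exp(-fused))                 # assumed sigmoid activation

rng = np.random.default_rng(0)
diff = rng.normal(size=(4, 4, 19))      # hypothetical H=4, W=4, C=19
conv_weight = np.array([0.5, 0.5])      # placeholder weights for the 2 -> 1 conv
attention = spatial_attention(diff, conv_weight)
```

A larger kernel would let each mask value depend on neighboring pixels as well; the 1 x 1 form is used here only to keep the example short.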

Step S4-1-4, the pixels that obtain a larger attention weight in the spatial attention mask correspond to the areas on which the teacher network model and the student network model most need to focus in this embodiment. To single out these pixels, a mask is generated using a percentile-based threshold function; that is, the logic difference mask is generated from the spatial attention mask as follows.

Pixels whose value is greater than or equal to the $q_1$-th quantile are set to 1, representing a high-difference region between the predictions of the teacher network model and the student network model, and pixels whose value is smaller than the $q_1$-th quantile are set to 0.

Wherein $\sigma$ represents the activation function.

The segmentation loss function construction module is used for converting the second logic feature into a probability distribution and calculating its cross entropy with the tag data to obtain the segmentation loss function. Specifically, the segmentation loss function is constructed from the student prediction probability and the tag probability P.

The total loss function construction module is used for constructing the total loss function, which includes the segmentation loss function and the knowledge distillation loss function.

The student network model training module is used for training the student network model with the total loss function: the loss is back-propagated and the parameters of the student network model are updated, yielding a trained student network model.

Example 3

A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of a knowledge distillation based semantic segmentation method.

The computer equipment can be computing equipment such as a desktop computer, a notebook computer, a palm computer, a cloud server and the like. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.

The memory includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash memory card. Of course, the memory may also include both internal storage units of the computer device and external storage devices. In this embodiment, the memory is typically used to store the operating system and the various application software installed on the computer device, for example, the program code of the semantic segmentation method based on knowledge distillation. In addition, the memory may be used to temporarily store various types of data that have been output or are to be output.

The processor may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to execute the program code stored in the memory or process data, for example, the program code of the semantic segmentation method based on knowledge distillation.

Example 4

A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of a knowledge distillation based semantic segmentation method.

Wherein the computer-readable storage medium stores an interface display program executable by at least one processor to cause the at least one processor to perform the steps of the knowledge distillation based semantic segmentation method as described above.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server or a network device, etc.) to perform the semantic segmentation method based on knowledge distillation according to the embodiments of the present application.

The above is an embodiment of the present invention. The above embodiments and specific parameters in the embodiments are only for clearly describing the inventive verification process of the inventor, and are not intended to limit the scope of the invention, which is defined by the claims, and all equivalent structural changes made by applying the descriptions and the drawings of the invention are included in the scope of the invention.

Claims

1. The semantic segmentation method based on knowledge distillation is characterized by comprising the following steps of:

step S1, obtaining sample data

step S2, constructing a semantic segmentation model

step S3, extracting sample characteristics

Step S4, constructing a knowledge distillation loss function

step S5, constructing a segmentation loss function

step S6, constructing a total loss function

step S7, training a student network model

step S8, real-time classification of city street images

Acquiring a real-time city street view image, inputting the city street view image into a student network model for semantic segmentation, and outputting a real ground segmentation result by the student network model;

In step S4, the specific method for obtaining the logic difference mask is as follows:

step S4-1-1, converting the tag data into a tag matrix composed of 0 and 1：

Wherein H, W and C respectively represent the length, width and dimension of the logic feature, $Z^T$ and $Z^S$ respectively represent the first logic feature and the second logic feature, $\odot$ represents dot product, and $\mathrm{AvgPool}(\cdot)$ represents average pooling;

Wherein H and W respectively represent the length and width of the logic feature, $[\cdot;\cdot]$ represents the connection of matrices along the channel dimension, $\mathrm{Mean}(\cdot)$ and $\mathrm{Max}(\cdot)$ respectively represent the mean and the maximum value of each pixel in the channel dimension, and $\mathrm{Conv}(\cdot)$ represents a convolution layer by which the dimension is reduced from 2 to 1;

Wherein H and W respectively represent the length and width of the logic feature, $(i, j)$ represents the pixel position, and $q_1$ represents a percentile value in the range [0, 1];

in step S4, the specific method for obtaining the probability distribution difference mask is as follows:

Wherein H and W respectively represent the length and width of the logic feature, $C$ represents the channel dimension with $c \in \{1, \dots, C\}$, $P^T$ represents the teacher prediction probability, and $P^S$ represents the student prediction probability;

step S4-2-3, setting the $q_2$-th quantile $T_{q_2}$ of the maximum mask $M_{\max}$ as a threshold, and generating the probability distribution difference mask $M_p$ by comparing the maximum mask $M_{\max}$ with $T_{q_2}$:

Wherein H and W respectively represent the length and width of the logic feature, $(i, j)$ represents the pixel position, and $q_2$ represents a percentile value in the range [0, 1];

in step S4, when constructing the knowledge distillation loss function, the specific method is as follows:

Wherein $M_l$ represents the logic difference mask, $M_p$ represents the probability distribution difference mask, $F^S$ represents the second intermediate feature map, $\mathrm{Conv}(\cdot)$ represents a convolution layer with a 3 x 3 convolution kernel, $[\cdot;\cdot]$ represents the connection of two matrices in the channel dimension, $\mathcal{G}(\cdot)$ represents a convolution block, $\odot$ represents dot product, and $\sigma$ represents the activation function;

step S4-3-2, constructing the knowledge distillation loss function according to the double-mask student feature map and the first intermediate feature map:

2. The semantic segmentation method based on knowledge distillation according to claim 1, wherein: the backbone network of the teacher network model is ResNet101, and the backbone network of the student network model is ResNet18.

3. The semantic segmentation method based on knowledge distillation according to claim 1, wherein: in step S3, the method for obtaining the logic feature is as follows:

4. The semantic segmentation method based on knowledge distillation according to claim 1, wherein: in step S6, the total loss function is expressed as:

5. A knowledge distillation-based semantic segmentation system, comprising:

the urban street view image real-time classification module is used for acquiring real-time urban street view images, inputting the urban street view images into the student network model for semantic segmentation, and outputting a real ground segmentation result by the student network model;

in the knowledge distillation loss function construction module, the specific method for obtaining the logic difference mask is as follows:

step S4-1-1, converting the tag data into a tag matrix composed of 0 and 1：

Step S4-1-2, downsampling the tag matrix by average pooling, and multiplying it by the difference between the first logic feature and the second logic feature to obtain a logic differential matrix:

Wherein H, W and C represent the length, width and dimension of the logic feature respectively, $(i, j, c)$ represents a position in the logic feature, $Z^T_{i,j,c}$ represents the value of the first logic feature at the (i, j, c) position, and $Z^S_{i,j,c}$ represents the value of the second logic feature at the (i, j, c) position;

Wherein $M_l$ represents the logic difference mask, $M_p$ represents the probability distribution difference mask, $F^S$ represents the second intermediate feature map, $\mathrm{Conv}(\cdot)$ represents a convolution layer with a 3 x 3 convolution kernel, $[\cdot;\cdot]$ represents the connection of two matrices in the channel dimension, $\mathcal{G}(\cdot)$ represents a convolution block, $\odot$ represents dot product, and $\sigma$ represents the activation function;

6. A computer device, characterized by: comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 4.

7. A computer-readable storage medium, characterized by: a computer program is stored which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 4.