CN115761235A - Zero sample semantic segmentation method, system, equipment and medium based on knowledge distillation - Google Patents

Zero sample semantic segmentation method, system, equipment and medium based on knowledge distillation

Info

Publication number: CN115761235A (application CN202211472238.5A)
Authority: CN (China)
Prior art keywords: picture, semantic segmentation, text, encoder, model
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202211472238.5A
Other languages: Chinese (zh)
Inventors: Shen Fengli (沈冯立), Li Fusheng (李福生), Zhao Yanchun (赵彦春), Tang Rongjiang (唐荣江)
Current and original assignee: Yangtze River Delta Research Institute of UESTC, Huzhou (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed 2022-11-22 by Yangtze River Delta Research Institute of UESTC, Huzhou
Priority: CN202211472238.5A, priority date 2022-11-22 (the priority date is an assumption and is not a legal conclusion)
Publication of CN115761235A: 2023-03-07
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision and discloses a knowledge distillation-based zero sample semantic segmentation method, system, equipment and medium, comprising the following steps: extracting the region picture features of the various classes in a training picture with a pre-trained picture encoder, taking these region picture features as visual supervision for the feature map extracted by the semantic segmentation model, and applying the visual knowledge in the picture encoder to the training of the semantic segmentation model by distillation; extracting, with a pre-trained text encoder, the text features of the class names converted into text, and taking these text features as the classification basis for the semantic segmentation model's feature map; and classifying each pixel of the picture according to its score for each class. By using the pre-trained picture encoder and text encoder obtained by the CLIP model from a large amount of paired text-picture data, the invention gives the semantic segmentation model the ability to recognize unseen classes and expands its recognition range.

Description

Knowledge distillation-based zero sample semantic segmentation method, system, equipment and medium
Technical Field
The invention belongs to the technical field of computer vision and particularly relates to a knowledge distillation-based zero sample semantic segmentation method, system, equipment and medium.
Background
Existing semantic segmentation models can only segment pictures containing the classes labeled in a semantic segmentation data set. If a semantic segmentation model is expected to recognize more classes, pictures of those classes must be collected, semantically labeled, and the model trained again. When collecting data, one may also find that a sufficient amount of data cannot be collected for some categories. This hinders the popularization and application of semantic segmentation models in practice. On the other hand, the internet is flooded with huge amounts of paired text and picture data. This huge amount of data (more than 400 million text-picture pairs) has been used to train the CLIP (Contrastive Language-Image Pre-training) model (Radford, Alec, et al. "Learning Transferable Visual Models From Natural Language Supervision." International Conference on Machine Learning, PMLR, 2021), which comprises a text encoder and a picture encoder. Its most important capability is to classify, with the text encoder, any class appearing in a picture. Although this approach is very successful at extracting picture features, applying the model to semantic segmentation remains a challenging task.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) Most current semantic segmentation models cannot easily expand the number of classes they recognize; the generalization ability of a model can be improved only by collecting labeled data for more classes.
(2) Most current semantic segmentation models can only perform downstream training of pre-trained models on semantic segmentation data sets, and cannot exploit knowledge from lower-level data sets, such as picture classification data sets.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a zero sample semantic segmentation method, a system, equipment and a medium based on knowledge distillation.
The invention is realized in such a way that a knowledge distillation-based zero sample semantic segmentation method comprises the following steps:
extracting region picture features with a pre-trained picture encoder to serve as visual supervision for a semantic segmentation model; and extracting, with a pre-trained text encoder, the text features of the class names converted into text to serve as the classification basis for the semantic segmentation model's feature map, and on that basis training the semantic segmentation model to realize picture classification.
Further, the zero sample semantic segmentation method based on knowledge distillation comprises the following specific processes:
pre-training a picture encoder and a text encoder with paired text and picture data; collecting a training data set and a test data set with pixel-level labels; constructing a semantic segmentation model to extract a feature map of the training data set; extracting, with the picture encoder, the region picture features of the different objects in the training data set, using them to supervise the picture blocks of the feature map and to distill the information contained in the region picture features, and computing loss 1; converting the labels of the training data set into text form, obtaining the corresponding label text features with the text encoder, taking these as the classification weights of the feature-map picture blocks, and computing loss 2; and computing the total loss from loss 1 and loss 2, back-propagating it, updating and training the semantic segmentation model until the model weights are stable, and then performing the model test.
Further, the picture encoder is the vision Transformer of the ViT-B/16 model in CLIP, with 12 Transformer layers in total; its input is a 224 × 224 picture and its output is a 512-dimensional picture feature;
the text encoder is the text Transformer of the ViT-B/16 model in CLIP, with 12 Transformer layers in total; its input is a text and its output is a 512-dimensional text feature.
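As a concrete reference point, a minimal sketch of loading these two frozen encoders (Python, assuming the open-source `clip` package released with the CLIP paper; the device handling and the dummy inputs are illustrative assumptions, not part of the claims):

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT-B/16 bundles both encoders: a 12-layer vision Transformer
# (224x224 input -> 512-d feature) and a 12-layer text Transformer.
model, preprocess = clip.load("ViT-B/16", device=device)
model.eval()  # the CLIP weights stay frozen during distillation

with torch.no_grad():
    image_feat = model.encode_image(torch.randn(1, 3, 224, 224, device=device))
    text_feat = model.encode_text(clip.tokenize(["a picture of a bus"]).to(device))
print(image_feat.shape, text_feat.shape)  # torch.Size([1, 512]) torch.Size([1, 512])
```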
Further, the training data set and the testing data set are VOC2012 data sets, in which the training data set has 10582 pictures and the testing data set has 1449 pictures.
Further, the semantic segmentation model adopts a DeepLab V3 model as a backbone network.
Further, before extracting the region picture features, the picture encoder obtains a region picture of a certain class in the training picture with a black background by using the labeled class mask of the training picture, and then extracts the region picture features.
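A minimal sketch of this masking step, under the assumption that images are (3, H, W) tensors and masks are (H, W) integer label maps (the tensor layout and the bilinear resize to the encoder's 224 × 224 input are assumptions):

```python
import torch
import torch.nn.functional as F

def region_picture(image: torch.Tensor, mask: torch.Tensor, cls: int) -> torch.Tensor:
    """Black out everything except the pixels labeled `cls`.

    image: (3, H, W) float tensor, mask: (H, W) integer label map.
    Returns a (3, 224, 224) region picture ready for the CLIP picture encoder.
    """
    keep = (mask == cls).unsqueeze(0).float()      # (1, H, W) binary class mask
    region = image * keep                          # black background elsewhere
    return F.interpolate(region.unsqueeze(0), size=(224, 224),
                         mode="bilinear", align_corners=False).squeeze(0)
```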
Further, the text form is expressed as: "a picture of a [category name]".
Further, the loss 1 is calculated by a squared error loss function, and the expression is as follows:

$$\mathcal{L}_1=\sum_{c=1}^{C}\sum_{(i,j)\in c}\left\|f_{i,j}-f_c\right\|_2^2$$

in the formula, f_{i,j} is the feature of a picture block belonging to class c in the training picture, f_c is the region picture feature of class c in the feature map, and C is the total number of classes;
the loss 2 is obtained by calculating various confidence degrees of the picture blocks, taking the highest confidence degree as a classification result of each block and calculating by utilizing a cross entropy loss function according to a real semantic segmentation result;
the confidence is obtained by calculating cosine similarities between the picture block features and the text features of the classes.
Further, the model test includes:
inputting the test picture into the trained semantic segmentation model to obtain the feature map of the test picture; converting the category names of both the seen categories and the unseen categories of the test stage into text, and obtaining the corresponding text features with the text encoder; and, after computing the scores between the picture-block features of the test picture's feature map and the text features, taking the class with the maximum score as each picture block's classification result and integrating the classification results of all blocks to obtain the predicted semantic segmentation result of the test picture.
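A minimal sketch of this test procedure (Python/PyTorch; the prompt template follows the text, while the assumption that the segmentation model emits a (1, 512, h, w) feature map aligned with CLIP's 512-dimensional space follows from the distillation setup):

```python
import clip
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict(seg_model, clip_model, image, class_names, device="cuda"):
    # Text features for seen *and* unseen classes -- this is what enables zero-shot.
    tokens = clip.tokenize([f"a picture of a {c}" for c in class_names]).to(device)
    text_feats = F.normalize(clip_model.encode_text(tokens).float(), dim=1)  # (C, 512)

    feat_map = seg_model(image.unsqueeze(0).to(device))   # (1, 512, h, w), assumed shape
    feats = F.normalize(feat_map, dim=1)                  # prepare cosine similarity
    scores = torch.einsum("bdhw,cd->bchw", feats, text_feats)  # per-class score map
    return scores.argmax(dim=1).squeeze(0)                # (h, w) predicted label map
```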
Another object of the present invention is to provide a knowledge-based distillation zero-sample semantic segmentation system, which includes:
the picture coding module is used for extracting the regional picture characteristics of different objects in the picture;
the text coding module is used for obtaining the text characteristics of the label corresponding to the picture;
and the semantic segmentation module is used for acquiring a pixel level classification result of the picture.
Another object of the present invention is to provide a computer device, which comprises a memory and a processor, the memory storing a computer program, which when executed by the processor, causes the processor to perform the steps of the zero sample semantic segmentation method based on knowledge distillation.
It is a further object of the present invention to provide a computer readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the knowledge distillation based zero sample semantic segmentation method.
Another object of the present invention is to provide an information data processing terminal, which is used for implementing the knowledge distillation-based zero sample semantic segmentation system.
By combining the technical scheme and the technical problem to be solved, the technical scheme to be protected by the invention has the advantages and positive effects that:
first, in view of the technical problems in the prior art and the difficulty of solving them, the technical problems solved by the claimed technical scheme, combined with the results and data from the research and development process, and the creative technical effects brought about after the problems are solved, are analyzed in detail below. The specific description is as follows:
(1) The method uses a pre-trained picture encoder to extract the region picture features of the various classes in a training picture, takes these region picture features as visual supervision for the feature map extracted by the semantic segmentation model, and distills the knowledge learned during pre-training out of the picture encoder to help the semantic segmentation model better extract features from the training picture.
(2) The invention uses a pre-trained text encoder to extract the features of the category names converted into text, and uses these features as the pixel classification weights for the feature map extracted by the semantic segmentation model.
Secondly, considering the technical scheme as a whole or from the perspective of products, the technical effect and advantages of the technical scheme to be protected by the invention are specifically described as follows:
according to the invention, the pre-training picture encoder and the text encoder which are obtained by using the CLIP model under a large amount of text picture pairing data help the semantic segmentation model to have the capability of identifying unseen classes, and the identification range of the semantic segmentation model is expanded.
Third, as an inventive supplementary proof of the claims of the present invention, there are also presented several important aspects:
(1) The expected income and commercial value after the technical scheme of the invention is converted are as follows:
the invention borrows the knowledge learned in the CLIP model and uses the knowledge in the invention, so that the invention has the capability of carrying out semantic segmentation on the pictures containing unseen classes, and has great commercial value in the field of semantic segmentation.
(2) The technical scheme of the invention fills the technical blank in the industry at home and abroad:
if the semantic segmentation model is expected to have the capability of identifying more classes, pictures of the classes need to be collected, semantically labeled and then trained again. The invention utilizes the visual characteristic encoder and the text encoder obtained in the CLIP method to help the semantic segmentation model to realize the semantic segmentation capability on the picture containing unseen classes, and fills the technical blank.
(3) The technical scheme of the invention solves a technical problem that has long been eagerly pursued but never successfully solved:
at present, most semantic segmentation models are difficult to expand the number of identification classes, and the generalization capability of the models can be improved only by collecting more classes of labeled data.
(4) Whether the technical scheme of the invention overcomes a technical prejudice:
In the field of semantic segmentation, it has been held that the generalization ability of a model can be improved only by collecting labeled data for more classes, and for some classes a sufficient data set may be impossible to collect. The invention, on one hand, uses the knowledge learned by a pre-trained picture encoder during pre-training to help the semantic segmentation model better extract the visual features of a picture; on the other hand, it uses the pre-trained text encoder's ability to convert classes into text, giving the semantic segmentation model the ability to segment unseen classes.
Drawings
FIG. 1 is a schematic flow chart of a training phase provided by an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a testing phase provided by an embodiment of the present invention;
fig. 3 is an original drawing, a semantic segmentation labeling diagram and a semantic segmentation prediction diagram provided by the embodiment of the invention, (a) a computer display, (b) a sofa and a chair, and (c) a bus;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
This section is an explanatory embodiment expanding on the claims, so that those skilled in the art can fully understand how the present invention is embodied.
As shown in fig. 1 and fig. 2, a knowledge distillation-based zero sample semantic segmentation method provided by an embodiment of the present invention includes:
the method comprises the following steps: the fixed CLIP model utilizes the parameters of a text encoder and a picture encoder trained on 4 billion pairs of text and picture data, and is employed herein as a text encoder and a picture encoder in its ViT-B/16 model.
Step two: the VOC2012 data set is used as the experimental data set of the method, with 10582 training pictures and 1449 test pictures.
Step three: use the semantic labels of the training picture to mask it, obtaining region pictures of each class with black backgrounds.
Step four: take DeepLab V3 as the backbone network of the method and extract the feature map of the training picture.
Step five: use the picture encoder from step one to extract the visual features of the region pictures obtained in step three, use these visual features to supervise the feature map extracted in step four, and compute loss 1 with the squared error loss.
Step six: convert the class labels of the training picture into text of the form "a picture of a [class name]", then extract the text features with the text encoder from step one.
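A minimal sketch of step six (Python; the truncated VOC class list is for illustration only):

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/16", device=device)

# VOC class names, truncated here for illustration
class_names = ["aeroplane", "bicycle", "bird", "boat", "bus"]
prompts = [f"a picture of a {name}" for name in class_names]

with torch.no_grad():
    text_feats = clip_model.encode_text(clip.tokenize(prompts).to(device))
print(text_feats.shape)  # torch.Size([5, 512])
```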
Step seven: take the text features from step six as the classification basis for the pixels of the feature map from step four, compute the cosine similarities, obtain each pixel's prediction from the highest score, and then compute each pixel's loss with cross entropy to obtain loss 2.
Step eight: compute the total loss function from loss 1 and loss 2 of steps five and seven, back-propagate it, update the DeepLab V3 model from step four, and iterate multiple times to obtain stable model weights.
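A minimal sketch of one iteration of steps five through eight (the SGD optimizer, learning rate, and equal weighting of the two losses are assumptions; `extract_blocks`, `clip_region_features`, `train_loader`, `seg_model`, and `text_feats` are hypothetical stand-ins, and `loss1`/`loss2` refer to the sketch given earlier in this document):

```python
import torch

# `seg_model` is the DeepLab V3 backbone emitting 512-d block features (assumed).
optimizer = torch.optim.SGD(seg_model.parameters(), lr=1e-3, momentum=0.9)

for images, masks in train_loader:                        # pixel-labeled VOC batches
    # hypothetical helper: flatten the feature map into picture blocks + labels
    block_feats, block_labels = extract_blocks(seg_model, images, masks)
    # hypothetical helper: frozen CLIP features of the black-background region pictures
    region_feats = clip_region_features(images, masks)

    total = (loss1(block_feats, region_feats, block_labels)
             + loss2(block_feats, text_feats, block_labels))
    optimizer.zero_grad()
    total.backward()          # back-propagate the total loss
    optimizer.step()          # only the segmentation model's weights are updated
```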
Step nine: in the test stage, as shown in fig. 2, input the test picture into the DeepLab V3 semantic segmentation model obtained after the training of step eight, to obtain the feature map of the test picture.
Step ten: convert the category names of both the seen and unseen categories of the test stage into text, and obtain the corresponding text features with the text encoder from step one.
Step eleven: compute the cosine similarity between the feature map of the test picture obtained in step nine and the text features obtained in step ten, then take the class with the maximum score as the classification of each feature, and integrate the classification results of all blocks for comparison with the true semantic labels of the test picture. The final semantic segmentation results are shown in fig. 3: the first row is the original image, the second row is the semantic segmentation label image, and the third row is the prediction of this method. The results show that the predictions are very similar to the labels and that the shapes of the objects are basically recognizable. The computer display in (a) and the sofa in (b) are seen classes, which the method segments well; for unseen classes (such as the chair on the right of (b) and the bus in (c)), the method produces approximate semantic segmentation through the semantic class relations in CLIP.
As shown in FIG. 1, S1 and S2 denote the text features of seen classes obtained by the text encoder, N1 and N2 denote the text features of unseen classes obtained by the text encoder, and I1 and I2 denote the visual features of the region pictures obtained by the picture encoder.
The embodiment of the invention adopts the text encoder and picture encoder of the CLIP model, where the picture encoder is the vision Transformer of the ViT-B/16 model in CLIP, with 12 layers in total, whose input is a 224 × 224 picture and whose output is a 512-dimensional picture feature; the text encoder is the text Transformer of the ViT-B/16 model in CLIP, with 12 layers, whose input is text and whose output is a 512-dimensional text feature.
The calculation of the loss 1 provided by the embodiment of the invention is as follows:

$$\mathcal{L}_1=\sum_{c=1}^{C}\sum_{(i,j)\in c}\left\|f_{i,j}-f_c\right\|_2^2$$

where f_{i,j} is the feature of a picture block belonging to class c in the training picture and f_c is the region picture feature of class c in the feature map of step four.
According to the embodiment of the invention, the pre-trained picture encoder and text encoder obtained by the CLIP model from a large amount of paired text-picture data help the semantic segmentation model recognize unseen classes and expand its recognition range. Specifically, on one hand, the knowledge learned during pre-training by the pre-trained picture encoder helps the semantic segmentation model better extract the visual features of a picture; on the other hand, the pre-trained text encoder's ability to convert classes into text gives the model the ability to segment unseen classes; a segmentation example is shown in (c) of fig. 3.
The zero sample semantic segmentation system based on knowledge distillation provided by the embodiment of the invention specifically comprises the following steps:
a picture coding module: used for extracting the region picture features of different objects in a picture;
a text encoding module: used for obtaining the text features of the labels corresponding to the picture;
a semantic segmentation module: used for obtaining the pixel-level classification result of the picture.
In order to prove the creativity and the technical value of the technical scheme of the invention, the part is the application example of the technical scheme of the claims on specific products or related technologies.
It should be noted that embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portions may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided on a carrier medium such as a disk, CD-or DVD-ROM, programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier, for example. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.
The embodiment of the invention has some positive effects in the process of research and development or use, and indeed has great advantages compared with the prior art, and the following contents are described by combining data, graphs and the like in the experimental process.
Through tests, as shown in Table 1, the evaluation indexes of the method on the VOC2012 test set are a mean intersection-over-union (mIoU) of 49.0% on seen classes, an mIoU of 30.2% on unseen classes, and a harmonic-mean IoU of 37.4%, all slightly higher than the three corresponding indexes of the similar ZS3Net method (Bucher, Maxime, et al. "Zero-shot semantic segmentation." Advances in Neural Information Processing Systems 32 (2019)). The method uses the pre-trained picture encoder and text encoder obtained by the CLIP model from a large amount of paired text-picture data to give the semantic segmentation model the ability to recognize unseen classes, expanding its recognition range. Specifically, on one hand, the knowledge learned during pre-training by the pre-trained picture encoder helps the semantic segmentation model better extract the visual features of a picture; on the other hand, the pre-trained text encoder's ability to convert classes into text gives the model the ability to segment unseen classes.
TABLE 1 Evaluation results
VOC2012 | Seen-class mIoU | Unseen-class mIoU | Harmonic-mean IoU
ZS3Net | 47.7% | 25.2% | 33.0%
This method | 49.0% | 30.2% | 37.4%
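The harmonic-mean IoU column is the harmonic mean of the seen- and unseen-class mIoU, which can be checked directly (Python):

```python
def harmonic_iou(seen_miou: float, unseen_miou: float) -> float:
    """Harmonic mean of seen- and unseen-class mIoU, as used in zero-shot segmentation."""
    return 2 * seen_miou * unseen_miou / (seen_miou + unseen_miou)

print(round(harmonic_iou(49.0, 30.2), 1))  # 37.4 (this method)
print(round(harmonic_iou(47.7, 25.2), 1))  # 33.0 (ZS3Net)
```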
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any modification, equivalent replacement, and improvement made by those skilled in the art within the technical scope of the present invention disclosed in the present invention should be covered within the scope of the present invention.

Claims (10)

1. A knowledge distillation-based zero sample semantic segmentation method is characterized by comprising the following steps:
extracting region picture features with a pre-trained picture encoder to serve as visual supervision for a semantic segmentation model; and extracting, with a pre-trained text encoder, the text features of the category names converted into text to serve as the classification basis for the semantic segmentation model's feature map, and on that basis training the semantic segmentation model to realize picture classification.
2. The knowledge distillation-based zero sample semantic segmentation method as set forth in claim 1, wherein the knowledge distillation-based zero sample semantic segmentation method comprises the following specific processes:
pre-training a picture encoder and a text encoder with paired text and picture data; collecting a training data set and a test data set with pixel-level labels; constructing a semantic segmentation model to extract a feature map of the training data set; extracting, with the picture encoder, the region picture features of the different objects in the training data set, using them to supervise the picture blocks of the feature map and to distill the information contained in the region picture features, and computing loss 1; converting the labels of the training data set into text form, obtaining the corresponding label text features with the text encoder, taking these as the classification weights of the feature-map picture blocks, and computing loss 2; and summing loss 1 and loss 2 to obtain the total loss, back-propagating it, updating and training the semantic segmentation model until the model weights are stable, and then performing the model test.
3. The knowledge distillation-based zero-sample semantic segmentation method as claimed in claim 2, wherein the picture encoder is the vision Transformer of the ViT-B/16 model in CLIP, with 12 Transformer layers in total; the input is a 224 × 224 picture and the output is a 512-dimensional picture feature;
the text encoder is the text Transformer of the ViT-B/16 model in CLIP, with 12 Transformer layers in total; the input is a text and the output is a 512-dimensional text feature;
the semantic segmentation model adopts the DeepLab V3 model as its backbone network.
4. The knowledge-distillation-based zero-sample semantic segmentation method as claimed in claim 2, wherein before extracting the region picture features, the picture encoder obtains a region picture of a certain class in the training picture with black background by using a labeled class mask of the training picture, and then extracts the region picture features.
5. The knowledge distillation-based zero-sample semantic segmentation method as set forth in claim 2, wherein the text form converts the class name of a label into the text "a picture of a [class name]".
6. The knowledge distillation-based zero-sample semantic segmentation method as set forth in claim 2, wherein the loss 1 is calculated by a squared error loss function, and the expression is as follows:

$$\mathcal{L}_1=\sum_{c=1}^{C}\sum_{(i,j)\in c}\left\|f_{i,j}-f_c\right\|_2^2$$

in the formula, f_{i,j} is the feature of a picture block belonging to class c in the training picture, f_c is the region picture feature of class c in the feature map, and C is the total number of classes;
the loss 2 is obtained by calculating various confidence degrees of the picture blocks, taking the highest confidence degree as a classification result of each block and calculating by utilizing a cross entropy loss function according to a real semantic segmentation result;
the confidence is obtained by calculating cosine similarities between the picture block features and the text features of the classes.
7. The knowledge-distillation-based zero-sample semantic segmentation method as set forth in claim 2, wherein the model test comprises:
inputting the test picture into the trained semantic segmentation model to obtain the feature map of the test picture; converting the category names of both the seen categories and the unseen categories of the test stage into text, and obtaining the corresponding text features with the text encoder; and, after computing the scores between the picture-block features of the test picture's feature map and the text features, taking the class with the maximum score as each picture block's classification result and integrating the classification results of all blocks to obtain the predicted semantic segmentation result of the test picture.
8. A knowledge distillation-based zero sample semantic segmentation system implementing the method according to any one of claims 1 to 7, characterized in that the knowledge distillation-based zero sample semantic segmentation system comprises:
a picture coding module, used for extracting the region picture features of different objects in a picture;
a text coding module, used for obtaining the text features of the labels corresponding to the picture;
a semantic segmentation module, used for obtaining the pixel-level classification result of the picture.
9. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to carry out the steps of the knowledge distillation-based zero sample semantic segmentation method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the knowledge distillation based zero sample semantic segmentation method according to any one of claims 1 to 7.
CN202211472238.5A 2022-11-22 2022-11-22 Zero sample semantic segmentation method, system, equipment and medium based on knowledge distillation Pending CN115761235A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211472238.5A | 2022-11-22 | 2022-11-22 | Zero sample semantic segmentation method, system, equipment and medium based on knowledge distillation

Publications (1)

Publication Number | Publication Date
CN115761235A | 2023-03-07

Family

ID=85335768

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202211472238.5A (pending) | Zero sample semantic segmentation method, system, equipment and medium based on knowledge distillation | 2022-11-22 | 2022-11-22

Country Status (1)

Country | Link
CN | CN115761235A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117671426A (en) * | 2023-12-07 | 2024-03-08 | Beijing Academy of Artificial Intelligence (北京智源人工智能研究院) | Concept distillation and CLIP-based hintable segmentation model pre-training method and system
CN117671426B (en) * | 2023-12-07 | 2024-05-28 | Beijing Academy of Artificial Intelligence (北京智源人工智能研究院) | Concept distillation and CLIP-based hintable segmentation model pre-training method and system

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN108121702B (en) Method and system for evaluating and reading mathematical subjective questions
CN111159356B (en) Knowledge graph construction method based on teaching content
CN113449801B (en) Image character behavior description generation method based on multi-level image context coding and decoding
CN115858758A (en) Intelligent customer service knowledge graph system with multiple unstructured data identification
CN106227836B (en) Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters
CN112329767A (en) Contract text image key information extraction system and method based on joint pre-training
CN110689018A (en) Intelligent marking system and processing method thereof
CN113723330A (en) Method and system for understanding chart document information
Fadhilah et al. Non-halal ingredients detection of food packaging image using convolutional neural networks
CN114691864A (en) Text classification model training method and device and text classification method and device
CN115512096A (en) CNN and Transformer-based low-resolution image classification method and system
CN115761235A (en) Zero sample semantic segmentation method, system, equipment and medium based on knowledge distillation
CN111461121A (en) Electric meter number identification method based on YOLOv3 network
Rakowski et al. Hand shape recognition using very deep convolutional neural networks
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN107992482B (en) Protocol method and system for solving steps of mathematic subjective questions
CN114840680A (en) Entity relationship joint extraction method, device, storage medium and terminal
CN113297376A (en) Legal case risk point identification method and system based on meta-learning
CN113837167A (en) Text image recognition method, device, equipment and storage medium
Lin et al. Design and implementation of intelligent scoring system for handwritten short answer based on deep learning
CN112070060A (en) Method for identifying age, and training method and device of age identification model
Le et al. An Attention-Based Encoder–Decoder for Recognizing Japanese Historical Documents
CN116597437B (en) End-to-end Laos license plate identification method and device integrating double-layer attention network
CN103971118A (en) Detection method of wine bottles in static pictures

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination