CN115661465A - Image multi-label segmentation method and device, computer equipment and storage medium

Info

Publication number: CN115661465A
Authority: CN (China)
Prior art keywords: prediction, mask, classification, query characterization
Legal status: Pending
Application number: CN202211601112.3A
Other languages: Chinese (zh)
Inventors: 王楠, 王远, 李志权, 周尧, 刘枢, 吕江波, 沈小勇
Current Assignee: Shenzhen Smartmore Technology Co Ltd
Original Assignee: Shenzhen Smartmore Technology Co Ltd
Application filed by Shenzhen Smartmore Technology Co Ltd
Priority: CN202211601112.3A
Publication: CN115661465A

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to an image multi-label segmentation method and apparatus, a computer device, and a storage medium. The method includes: determining initial image features extracted from a target image; inputting the initial image features into an attention network for processing to obtain a plurality of query characterization vectors output by the attention network, where each query characterization vector characterizes one type of image feature learned and extracted by the attention network; performing mask prediction based on each query characterization vector and the initial image features to obtain a plurality of prediction masks corresponding to the target image, where the number of prediction masks matches the number of query characterization vectors; performing classification prediction according to the query characterization vectors to obtain a multi-class probability distribution corresponding to the prediction masks; and determining mask classification results for the plurality of prediction masks based on the multi-class probability distribution. With this method, the efficiency of image multi-label segmentation can be improved.

Description

Image multi-label segmentation method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and an apparatus for multi-label segmentation of an image, a computer device, and a storage medium.
Background
With the development of image processing techniques, semantic segmentation has emerged. Semantic segmentation is usually treated as a pixel-level classification task, and most attention has focused on the multi-class, single-label setting: given an input image, a semantic segmentation model outputs a single class label for each pixel.
In the conventional technology, such a semantic segmentation model is only effective for single-label tasks in which pixel classes are mutually exclusive; that is, a single semantic segmentation model cannot handle overlapping labels. When several labels overlap, a corresponding number of semantic segmentation models is required, which significantly increases inference time and computational overhead and results in low efficiency.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device, a computer readable storage medium, and a computer program product for multi-label segmentation of an image, which can improve the efficiency of multi-label segmentation of the image.
In a first aspect, the present application provides a method for multi-label segmentation of an image, including:
determining initial image features extracted from a target image;
inputting the initial image characteristics into an attention network for processing to obtain a plurality of query characterization vectors output by the attention network; each query characterization vector is used for characterizing the same type of image features extracted by the attention network learning;
performing mask prediction based on each query characterization vector and the initial image characteristics to obtain a plurality of prediction masks corresponding to the target image; the number of the plurality of prediction masks is matched with the number of the plurality of query characterization vectors;
performing classification prediction according to the plurality of query characterization vectors to obtain multi-class probability distribution corresponding to the plurality of prediction masks;
and determining mask classification results of the plurality of predicted masks according to the multi-class probability distribution.
In some embodiments, before performing classification prediction based on the query characterization vectors and obtaining multi-class probability distributions corresponding to the prediction masks, the method further includes:
performing characteristic dimension transformation on the plurality of query characterization vectors through a linear transformation layer to obtain a plurality of query characterization vectors subjected to dimension transformation;
performing classification prediction according to the plurality of query characterization vectors to obtain multi-class probability distribution corresponding to the plurality of prediction masks, including:
and performing multi-dimensional classification prediction on the query characterization vectors subjected to the dimension transformation through a linear classifier to obtain multi-class probability distributions corresponding to the multiple prediction masks.
In some embodiments, performing mask prediction based on each query characterization vector and initial image features to obtain multiple prediction masks corresponding to a target image, includes:
carrying out multi-layer up-sampling processing on the initial image characteristics to obtain target image characteristics; the target image features have richer detail information than the initial image features;
and performing mask prediction according to the query characterization vector and the target image feature after each dimension transformation to obtain a plurality of prediction masks corresponding to the target image.
In some embodiments, performing a classification prediction based on the plurality of query characterization vectors to obtain multi-class probability distributions corresponding to the plurality of prediction masks includes:
performing multi-dimensional classification prediction according to the plurality of query characterization vectors to obtain initial multi-class probability distribution corresponding to the plurality of prediction masks; the initial multi-class probability distribution comprises a probability distribution of empty classes;
and removing the probability distribution of the empty category from the initial multi-category probability distribution to obtain the multi-category probability distribution corresponding to the plurality of prediction masks.
In some embodiments, the mask classification result is an output result from taking the target image as an input to a multi-label segmentation model of the image; before determining the initial image features extracted from the target image, the method further comprises:
in each iteration training, bipartite graph matching is carried out on the basis of a plurality of sample prediction masks corresponding to the sample images and multi-class probability distribution corresponding to the sample prediction masks, and label masks and label classes corresponding to the sample prediction masks are determined;
generating a classification prediction loss value based on the multi-class probability distribution corresponding to the sample image and the difference between the label classes corresponding to the sample prediction masks;
generating a mask prediction loss value based on a difference between each sample prediction mask and the corresponding label mask;
and training the multi-label segmentation model of the image to be trained towards the direction that the mask prediction loss value and the classification prediction loss value become smaller until a training stop condition is reached, and obtaining the multi-label segmentation model of the trained image.
In some embodiments, the categories to which the target image corresponds include a plurality of defect categories and a plurality of surface categories; pixel points under the defect categories may overlap with pixel points under the surface categories; performing classification prediction according to the plurality of query characterization vectors to obtain multi-class probability distribution corresponding to the plurality of prediction masks, including:
performing surface class prediction and defect class prediction according to the plurality of query characterization vectors to obtain multi-class probability distribution corresponding to the plurality of prediction masks; the multi-class probability distribution for the plurality of prediction masks includes a probability for each class for each query characterization vector.
In some embodiments, the mask classification results include masks under mask classes of different classification dimensions; the method further comprises the following steps:
determining masks under the mask classes of all the classification dimensions from the plurality of predicted masks according to the mask classification results to obtain at least one mask to be combined corresponding to all the classification dimensions;
combining at least one mask to be combined corresponding to each classification dimension to obtain a mask combination result corresponding to each classification dimension; the mask combination result is used for indicating the distribution condition of the mask classes of the same classification dimension.
In a second aspect, the present application further provides an apparatus for multi-label segmentation of an image, including:
the first determination module is used for determining initial image features extracted from the target image;
the extraction module is used for inputting the initial image characteristics into the attention network for processing to obtain a plurality of query characterization vectors output by the attention network; each query characterization vector is used for characterizing the same type of image features extracted by the attention network learning;
the first prediction module is used for performing mask prediction based on the query characterization vectors and the initial image characteristics to obtain a plurality of prediction masks corresponding to the target image; the number of the plurality of prediction masks is matched with the number of the plurality of query characterization vectors;
the second prediction module is used for carrying out classification prediction according to the plurality of query characterization vectors to obtain multi-class probability distribution corresponding to the plurality of prediction masks;
and the second determining module is used for determining the mask classification results of the plurality of predicted masks according to the multi-class probability distribution.
In a third aspect, the present application further provides a computer device, where the computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the steps in the embodiments of the method when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps in the embodiments of the method of the present application.
In a fifth aspect, the present application also provides a computer program product comprising a computer program that, when executed by a processor, performs the steps of the embodiments of the method of the present application.
The image multi-label segmentation method, apparatus, computer device, storage medium, and computer program product determine initial image features extracted from a target image, and use the attention network's ability to identify patterns and textures to learn the same type of image features within the initial image features, obtaining a plurality of query characterization vectors that each represent one type of image feature learned and extracted by the attention network. Unlike the conventional pixel-by-pixel classification approach, the image multi-label segmentation task is split into two subtasks, class prediction and mask prediction: mask prediction is performed according to the query characterization vectors and the initial image features to obtain a plurality of prediction masks corresponding to the target image, and classification prediction is performed according to the query characterization vectors to obtain a multi-class probability distribution corresponding to the prediction masks. Subsequently, the mask classification results of the plurality of prediction masks are determined according to the multi-class probability distribution, so that multi-label segmentation of the image is realized and the efficiency of image multi-label segmentation is improved.
Drawings
Fig. 1 is an application environment diagram of a multi-label segmentation method for an image according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a multi-label segmentation method for an image according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a multi-label segmentation model of an image according to an embodiment of the present disclosure;
fig. 4 is a block diagram illustrating a structure of an apparatus for multi-label segmentation of an image according to an embodiment of the present disclosure;
fig. 5 is an internal structural diagram of a computer device according to an embodiment of the present application;
FIG. 6 is a diagram of an internal structure of another computer device according to an embodiment of the present application;
fig. 7 is an internal structural diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The multi-label segmentation method for the image provided by the embodiment of the application can be applied to the application environment shown in fig. 1. Wherein the computer device 102 communicates with the server 104 over a communication network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104, or may be located on a cloud server or other network server. Server 104 may determine initial image features extracted from the target image; the server 104 may input the initial image features into the attention network for processing, to obtain a plurality of query characterization vectors output by the attention network; each query characterization vector is used for characterizing the same type of image features extracted by the attention network learning; the server 104 may perform mask prediction based on each query characterization vector and the initial image feature to obtain a plurality of prediction masks corresponding to the target image; the number of the plurality of prediction masks is matched with the number of the plurality of query characterization vectors; the server 104 may perform classification prediction according to the plurality of query characterization vectors to obtain multi-class probability distributions corresponding to the plurality of prediction masks; server 104 may determine mask classification results for a plurality of predictive masks based on the multi-class probability distribution. It is to be appreciated that the server 104 can send the mask classification results to the computer device 102, and the computer device 102 can present the mask classification results.
The computer device 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart car-mounted devices, and the like. The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
In some embodiments, as shown in fig. 2, a multi-label segmentation method for an image is provided, which is described by taking the method as an example applied to a computer device, it is understood that the computer device may be a terminal or a server, and the method may be implemented by the terminal or the server alone, or by an interaction between the terminal and the server. In this embodiment, the method includes the steps of:
s202, determining initial image features extracted from the target image.
Illustratively, the computer device may pre-process the target image. For example, the pre-processing may include normalization processing. The computer device can then extract features from the pre-processed target image through a backbone network to obtain the initial image features.
In some embodiments, the backbone Network may be any one of a High-Resolution Network (HRNet), a Residual Network (ResNet), and the like.
In some embodiments, the initial image features may be high resolution features and the backbone network may be a high resolution network. It can be understood that since the network structure of the HRNet maintains a path of high resolution features, it is friendly to the detection of small defects in an industrial scene, so that the HRNet can be used as a backbone network in the case of the detection of small defects in an industrial scene.
In some embodiments, the computer device may perform feature extraction and downsampling on the pre-processed target image through the backbone network to obtain the initial image features. For example, the dimension of the initial image feature Ω may be C_Ω × (H/d) × (W/d), where C_Ω denotes the number of feature channels, d denotes the downsampling factor, and H and W denote the height and width of the target image, respectively.
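As a non-limiting illustration of the feature-extraction step above, the following PyTorch sketch assumes a torchvision ResNet-50 backbone with downsampling factor d = 32; the function name extract_initial_features and the input size are illustrative assumptions, not details of the disclosure.

```python
# Minimal sketch: extract initial image features Omega of shape (B, C_Omega, H/d, W/d)
# from a pre-processed target image, assuming a ResNet-50 backbone (d = 32).
import torch
import torchvision

def extract_initial_features(image: torch.Tensor) -> torch.Tensor:
    """image: (B, 3, H, W), already normalized. Returns (B, C_Omega, H/d, W/d)."""
    backbone = torchvision.models.resnet50(weights=None)
    # Keep everything up to the last convolutional stage (drop avgpool / fc).
    stem = torch.nn.Sequential(*list(backbone.children())[:-2])
    with torch.no_grad():
        features = stem(image)            # (B, 2048, H/32, W/32)
    return features

if __name__ == "__main__":
    target_image = torch.randn(1, 3, 512, 512)    # pre-processed (normalized) image
    omega = extract_initial_features(target_image)
    print(omega.shape)                            # torch.Size([1, 2048, 16, 16])
```

An HRNet backbone could be substituted in the same position when, as noted above, small-defect detection in an industrial scene calls for a path of high-resolution features.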
And S204, inputting the initial image characteristics into the attention network for processing to obtain a plurality of query characterization vectors output by the attention network.
Wherein, each query characterization vector is used for characterizing the same type of image features extracted by the attention network learning.
Illustratively, the computer device may extract a plurality of query characterization vectors from the initial image features through a transformer attention network. After self-attention and cross-attention calculation over the initial image features in the transformer, a plurality of query characterization vectors can be obtained. It can be understood that the queries in the transformer can adaptively detect different image features; since different queries have different "preferences" for different patterns, textures, and other features, each query ends up being responsible for detecting a different type of image feature. The query characterization vectors are in fact the characterization vectors of the queries in the transformer: one query corresponds to one query characterization vector, and a query characterization vector is equivalent to the image features corresponding to that query.
In some embodiments, the attention network includes an attention decoder (transformer decoder). The computer device may extract a plurality of query characterization vectors from the initial image features according to an attention decoder. It will be appreciated that the attention decoder comprises a plurality of object queries which are able to adaptively detect different image features.
In some embodiments, the computer device may extract a plurality of query characterization vectors from the initial image features by the trained object query in the attention decoder. It can be understood that, in the training process, the computer device may obtain a plurality of initial object queries preset for the attention decoder, and obtain the trained object queries by performing self-attention calculation on the plurality of initial object queries and performing cross-attention calculation on initial image features of the sample images to optimize the initial object queries. It should be noted that the self-attention calculation is used to evaluate the mutual influence among the object queries, and the de-duplication of the object queries can be realized. The cross attention calculation is used for extracting rich global information from the initial image characteristics through object queries, so that the object queries output different results according to the image characteristics and modes.
In some embodiments, in a first iterative training of the attention network, the computer device may take as input the initial image features of the sample image and a preset plurality of initial object queries. It will be appreciated that in each subsequent iteration of training, the computer device may update the object queries based on the results of the self-attention calculations and the results of the cross-attention calculations.
In some embodiments, the query characterization vectors include global information of the target image. It can be understood that each query characterization vector actually corresponds to the pattern and texture of one type of image feature learned by the attention network, and the global information in a query characterization vector can fully characterize the category corresponding to that query characterization vector. For example, the dimension of the N query characterization vectors Q may be C_Q × N, where N is the number of query characterization vectors and C_Q is the number of feature channels of each query characterization vector.
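As a non-limiting illustration of how N learnable object queries can attend to the initial image features, the following sketch uses PyTorch's nn.TransformerDecoder; the class name QueryExtractor, the query count N = 100, and the channel sizes are assumptions, and positional encodings are omitted for brevity.

```python
# Minimal sketch: learnable object queries attend to the flattened initial image
# features via cross-attention; self-attention among the queries handles de-duplication.
import torch
import torch.nn as nn

class QueryExtractor(nn.Module):
    def __init__(self, num_queries: int = 100, c_omega: int = 2048, c_q: int = 256):
        super().__init__()
        self.input_proj = nn.Conv2d(c_omega, c_q, kernel_size=1)   # project features to C_Q channels
        self.object_queries = nn.Embedding(num_queries, c_q)       # learnable initial object queries
        layer = nn.TransformerDecoderLayer(d_model=c_q, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)

    def forward(self, omega: torch.Tensor) -> torch.Tensor:
        # omega: (B, C_Omega, H/d, W/d) -> memory: (B, (H/d)*(W/d), C_Q)
        memory = self.input_proj(omega).flatten(2).transpose(1, 2)
        tgt = self.object_queries.weight.unsqueeze(0).expand(omega.size(0), -1, -1)
        return self.decoder(tgt, memory)                            # (B, N, C_Q) query characterization vectors

queries = QueryExtractor()(torch.randn(1, 2048, 16, 16))
print(queries.shape)   # torch.Size([1, 100, 256])
```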
And S206, performing mask prediction based on the query characterization vectors and the initial image characteristics to obtain a plurality of prediction masks corresponding to the target image.
Wherein the number of the plurality of prediction masks matches the number of the plurality of query characterization vectors; for example, the number of prediction masks is equal to the number of query characterization vectors.
For example, the computer device may perform upsampling processing on the initial image features, and perform mask prediction based on each query characterization vector and the upsampled initial image features to obtain multiple prediction masks corresponding to the target image. It will be appreciated that the dimensions of the upsampled initial image features match the target image size. For example, if the dimension of the initial image feature Ω is C_Ω × (H/d) × (W/d), the dimension of the upsampled initial image feature may be C_Φ × H × W.
In some embodiments, the computer device may perform mask prediction based on the query characterization vector and the initial image features after each dimension transformation.
And S208, performing classified prediction according to the plurality of query characterization vectors to obtain multi-class probability distribution corresponding to the plurality of prediction masks.
Wherein the multi-class probability distribution includes the probability that each query characterization vector corresponds to each class. It can be understood that each query characterization vector prefers features of one pattern, and each prediction mask actually contains, for each pixel, the probability that the pixel has the pattern of features preferred by the corresponding query characterization vector; the probabilities of the classes corresponding to a query characterization vector are therefore consistent with the probabilities of the classes corresponding to its prediction mask. It should be noted that the multi-dimensional classification prediction step does not actually involve computing the multiple prediction masks, and there is no required execution order between the classification prediction step and the mask prediction step; describing the multi-class probability distribution obtained by classification prediction as corresponding to the multiple prediction masks only expresses the correspondence between the two.
Illustratively, the computer device may perform feature dimension transformation on the plurality of query characterization vectors, and perform multi-dimensional classification prediction according to the plurality of query characterization vectors after dimension transformation, so as to obtain multi-class probability distribution corresponding to the target image.
In some embodiments, the computer device may perform feature mapping on the plurality of query characterization vectors to obtain multi-class probability distributions corresponding to the plurality of prediction masks.
S210, determining mask classification results of the multiple prediction masks according to the multi-class probability distribution.
The mask classification result comprises masks in different mask categories.
For example, the computer device may classify the plurality of prediction masks corresponding to the target image according to the multi-class probability distribution corresponding to the plurality of prediction masks, so as to obtain the mask classification result. It can be understood that a query characterization vector is equivalent to the features corresponding to a query in the transformer, and the initial image features include the features corresponding to the pixel points of the target image; the multi-class probability distribution corresponding to the query characterization vectors remains highly consistent with the class distribution obtained by combining the query characterization vectors and the initial image features, so the multi-class probability distribution can accurately indicate the class distribution of the multiple prediction masks, which ensures the accuracy of the mask classification result.
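As a non-limiting illustration, one simple way to read this step is to assign each prediction mask the class with the highest probability in its row of the multi-class probability distribution; the sketch below assumes an (N, K) probability matrix and (N, H, W) masks, and the grouping rule is an assumption rather than the disclosed post-processing.

```python
# Minimal sketch: group prediction masks by their most probable class.
import torch

def classify_masks(class_probs: torch.Tensor, pred_masks: torch.Tensor) -> dict:
    """class_probs: (N, K), pred_masks: (N, H, W) with values in [0, 1]."""
    scores, labels = class_probs.max(dim=-1)          # best class and its probability per mask
    result = {}
    for mask, label, score in zip(pred_masks, labels, scores):
        result.setdefault(int(label), []).append((mask, float(score)))
    return result                                     # class id -> list of (mask, confidence)

grouped = classify_masks(torch.softmax(torch.randn(100, 44), -1),
                         torch.rand(100, 64, 64))
```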
In the image multi-label segmentation method, initial image features extracted from a target image are determined, and the attention network's ability to identify patterns and textures is used to learn the same type of image features within the initial image features, yielding a plurality of query characterization vectors that each represent one type of image feature learned and extracted by the attention network. Unlike the conventional pixel-by-pixel classification approach, the image multi-label segmentation task is split into two subtasks, class prediction and mask prediction: classification prediction is performed according to the plurality of query characterization vectors to obtain a multi-class probability distribution corresponding to the plurality of prediction masks, and mask prediction is performed according to the query characterization vectors and the initial image features to obtain the plurality of prediction masks corresponding to the target image. Subsequently, the mask classification results of the plurality of prediction masks are determined according to the multi-class probability distribution, so that multi-label segmentation of the image is realized and efficiency is improved.
In some embodiments, before performing classification prediction based on the plurality of query characterization vectors and obtaining multi-class probability distributions corresponding to the plurality of prediction masks, the method further includes:
performing characteristic dimension transformation on the plurality of query characterization vectors through a linear transformation layer to obtain a plurality of query characterization vectors after dimension transformation;
performing classification prediction according to the plurality of query characterization vectors to obtain multi-class probability distribution corresponding to the plurality of prediction masks, including:
and performing multi-dimensional classification prediction on the query characterization vectors subjected to the dimension transformation through a linear classifier to obtain multi-class probability distributions corresponding to the multiple prediction masks.
For example, the computer device may perform feature dimension transformation on the plurality of query characterization vectors by using a multi-layer Perceptron (MLP) to obtain a plurality of dimension-transformed query characterization vectors. It can be understood that the number of feature channels of the query characterization vector after dimension transformation is consistent with the number of feature channels of the initial image feature after upsampling. The computer equipment can perform feature mapping on the query characterization vectors subjected to the dimensionality transformation through a linear classifier to realize multi-dimensional classification prediction, and obtain multi-class probability distribution corresponding to a plurality of prediction masks.
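A minimal PyTorch sketch of the linear transformation layer (implemented as a multi-layer perceptron) and the linear classifier described above; the channel sizes C_Q = C_Φ = 256 and the class count K = 44 are illustrative assumptions.

```python
# Minimal sketch: MLP dimension transformation followed by a linear classifier
# that outputs one probability per class plus one for the empty class.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, c_q: int = 256, c_phi: int = 256, num_classes: int = 44):
        super().__init__()
        # MLP acting as the linear transformation layer: C_Q -> C_Phi per query.
        self.mlp = nn.Sequential(nn.Linear(c_q, c_q), nn.ReLU(), nn.Linear(c_q, c_phi))
        # Linear classifier: K real classes plus the empty class.
        self.classifier = nn.Linear(c_phi, num_classes + 1)

    def forward(self, queries: torch.Tensor):
        # queries: (B, N, C_Q)
        m = self.mlp(queries)                              # dimension-transformed queries (B, N, C_Phi)
        class_probs = self.classifier(m).softmax(dim=-1)   # multi-class probabilities (B, N, K + 1)
        return m, class_probs

m, probs = ClassificationHead()(torch.randn(1, 100, 256))
print(m.shape, probs.shape)   # torch.Size([1, 100, 256]) torch.Size([1, 100, 45])
```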
In one embodiment, a multi-label segmentation model of an image may include a backbone network, an attention network, a linear transformation layer, and a linear classifier. It can be understood that the multi-label segmentation method for the image provided by the application can be realized by a multi-label segmentation model of the image.
In this embodiment, the feature dimension transformation is performed on the plurality of query characterization vectors to obtain the query characterization vectors after the dimension transformation, and the multi-dimensional classification prediction is performed on the query characterization vectors after the dimension transformation to obtain multi-class probability distributions corresponding to the plurality of prediction masks, where the multi-class probability distributions corresponding to the query characterization vectors and the class distributions of the prediction masks obtained subsequently based on the query characterization vectors and the initial image features maintain higher consistency, so that the mask classification result determined based on the multi-class probability distributions is more accurate.
In some embodiments, performing mask prediction based on each query characterization vector and initial image features to obtain multiple prediction masks corresponding to a target image, includes:
carrying out multi-layer up-sampling processing on the initial image characteristics to obtain target image characteristics; the target image features have richer detail information than the initial image features;
and performing mask prediction according to the query characterization vector and the target image characteristics after the dimensionality transformation to obtain a plurality of prediction masks corresponding to the target image.
Illustratively, the computer device may perform multi-layer upsampling processing on the initial image features through a feature pyramid network to obtain the target image features. It can be understood that the target image features are in fact the upsampled initial image features: they include a feature for each pixel point of the target image, i.e. they are pixel-wise features. The computer device can perform matrix multiplication on the dimension-transformed query characterization vectors and the target image features to implement mask prediction and obtain multiple prediction masks corresponding to the target image.
In some embodiments, the computer device may perform matrix multiplication on the dimension-transformed query characterization vectors and the target image features, and normalize the result of the matrix multiplication to obtain multiple prediction masks corresponding to the target image. It will be appreciated that each prediction mask contains values in the [0, 1] range, corresponding to the probability that each pixel point belongs to that mask.
In some embodiments, the computer device may normalize, by the activation layer, results from the matrix multiplication operation. The activation function of the activation layer may be a Sigmoid function (Sigmoid).
In some embodiments, the computer device may input the initial image features extracted by the backbone network to the feature pyramid network, and perform layer-by-layer upsampling on the initial image features to the size of the target image through the feature pyramid network to restore detail information of the target image, so as to obtain target image features including features of each pixel. It can be appreciated that in industrial quality inspection scenarios, defects have a large range of size variations, and therefore Feature Pyramid Networks (FPN) take full advantage of multi-scale features.
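A minimal sketch of the mask-prediction step, assuming PyTorch: the pixel-wise target image features are obtained here by a 1×1 projection plus bilinear upsampling standing in for a full feature pyramid network, and the dimension-transformed query vectors are combined with them by matrix multiplication followed by a sigmoid.

```python
# Minimal sketch: masks(n, h, w) = sigmoid( M(n, c) . Phi(c, h, w) ).
# The simple projection + upsampling below is an assumption replacing a real FPN.
import torch
import torch.nn as nn
import torch.nn.functional as F

def predict_masks(omega: torch.Tensor, m: torch.Tensor, image_size: tuple) -> torch.Tensor:
    """omega: (B, C_Omega, H/d, W/d), m: (B, N, C_Phi) -> masks (B, N, H, W)."""
    c_phi = m.size(-1)
    pixel_proj = nn.Conv2d(omega.size(1), c_phi, kernel_size=1)
    phi = F.interpolate(pixel_proj(omega), size=image_size,
                        mode="bilinear", align_corners=False)      # target image features (B, C_Phi, H, W)
    logits = torch.einsum("bnc,bchw->bnhw", m, phi)                 # per-query mask logits
    return logits.sigmoid()                                         # mask values in [0, 1]

masks = predict_masks(torch.randn(1, 2048, 16, 16), torch.randn(1, 100, 256), (512, 512))
print(masks.shape)   # torch.Size([1, 100, 512, 512])
```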
In this embodiment, multi-layer upsampling processing is performed on the initial image features to obtain target image features, mask prediction is performed according to the query characterization vector after each dimension is transformed and the target image features to obtain multiple prediction masks corresponding to the target image, so that multi-label segmentation prediction of the image is performed in a manner of classifying the target image masks, and efficiency is improved.
In some embodiments, performing a classification prediction based on the plurality of query characterization vectors to obtain multi-class probability distributions corresponding to the plurality of prediction masks includes:
performing multi-dimensional classification prediction according to the plurality of query characterization vectors to obtain initial multi-class probability distribution corresponding to the plurality of prediction masks; the initial multi-class probability distribution comprises a probability distribution of empty classes;
the probability distribution of empty classes is removed from the initial multi-class probability distribution to obtain multi-class probability distributions corresponding to the multiple prediction masks.
For example, the computer device may perform feature mapping on the plurality of query characterization vectors to implement multi-dimensional classification prediction and obtain initial multi-class probability distributions corresponding to the plurality of prediction masks. It is understood that the number of queries in the transformer is greater than the number of categories corresponding to the target image; queries that are not related to any category of the target image are redundant, and these redundant queries correspond to the empty category. The initial multi-class probability distributions corresponding to the multiple prediction masks therefore include the probability distribution of the empty category. The computer device may remove the probability distribution over the empty category from the initial multi-class probability distributions to obtain the multi-class probability distributions corresponding to the plurality of prediction masks.
In some embodiments, the multi-class probability distribution comprises a multi-class probability matrix. It will be appreciated that the elements of the multi-class probability matrix are used to characterize the probability of each query characterization vector corresponding to each class. The computer device can perform matrix multiplication operation on the multi-class probability distribution and a plurality of prediction masks corresponding to the target image to obtain a mask classification result.
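A minimal sketch of removing the empty category and combining the remaining multi-class probability matrix with the prediction masks by matrix multiplication, assuming the empty category occupies the last column; the resulting (K, H, W) score map is one plausible form of the mask classification result.

```python
# Minimal sketch: drop the empty-class column, then combine probabilities and masks.
import torch

def mask_classification(initial_probs: torch.Tensor, pred_masks: torch.Tensor) -> torch.Tensor:
    """initial_probs: (N, K + 1), pred_masks: (N, H, W) -> per-class score maps (K, H, W)."""
    class_probs = initial_probs[:, :-1]                         # remove the empty-class column
    return torch.einsum("nk,nhw->khw", class_probs, pred_masks) # matrix multiplication over the N queries

scores = mask_classification(torch.softmax(torch.randn(100, 45), -1),
                             torch.rand(100, 64, 64))
print(scores.shape)   # torch.Size([44, 64, 64])
```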
In the embodiment, multi-dimensional classification prediction is carried out according to a plurality of query characterization vectors to obtain initial multi-class probability distribution corresponding to a plurality of prediction masks; the probability distribution on the empty category is removed from the initial multi-category probability distribution corresponding to the plurality of prediction masks to obtain the multi-category probability distribution corresponding to the plurality of prediction masks, so that mask prediction is performed on the category corresponding to the target image, and the mask prediction is more accurate.
In some embodiments, the mask classification result is an output result from taking the target image as an input to a multi-label segmentation model of the image; before determining the initial image features extracted from the target image, the method further comprises:
in each iteration training, bipartite graph matching is carried out on the basis of a plurality of sample prediction masks corresponding to the sample images and multi-class probability distribution corresponding to the sample prediction masks, and label masks and label classes corresponding to the sample prediction masks are determined;
generating a classification prediction loss value based on the multi-class probability distribution corresponding to the sample image and the difference between the label classes corresponding to the sample prediction masks;
generating a mask prediction loss value based on a difference between each sample prediction mask and the corresponding label mask;
and training the multi-label segmentation model of the image to be trained towards the direction that the mask prediction loss value and the classification prediction loss value become smaller until a training stopping condition is reached, and obtaining the multi-label segmentation model of the trained image.
For example, in each iteration training round, the computer device may determine class matching errors between the label classes and multi-class probability distributions corresponding to the plurality of sample prediction masks using a bipartite graph matching algorithm, determine mask matching errors between the sample prediction masks and the label masks using the bipartite graph matching algorithm, and weight the class matching errors and the mask matching errors to obtain the matching errors. The computer device may obtain the label mask and the label category corresponding to each sample prediction mask by determining the sample prediction mask with the smallest matching error corresponding to each pair of label mask and label category. It can be understood that the label masks and label categories are in one-to-one correspondence, the label categories are used for indicating the categories of the label masks, a pair of label masks and label categories correspond to a sample prediction mask, and redundant sample prediction masks and corresponding query characterization vectors correspond to empty categories.
The computer device may calculate a multi-class probability distribution for the plurality of sample prediction masks and a class prediction loss value between label classes corresponding to each sample prediction mask using a cross entropy loss function, and calculate a mask prediction loss value between each sample prediction mask and a corresponding label mask using an exponential logarithmic loss function.
The computer device can train the multi-label segmentation model of the image to be trained in the direction in which the mask prediction loss value and the classification prediction loss value become smaller until the training stop condition is reached, obtaining the trained multi-label segmentation model of the image. It is to be appreciated that the mask prediction loss value and the classification prediction loss value are the constraints of the mask prediction subtask and the classification prediction subtask, respectively.
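A minimal sketch of one training iteration's matching and loss computation, assuming SciPy's Hungarian solver (linear_sum_assignment); the matching cost, the cost weights, and the use of binary cross entropy in place of the exponential logarithmic loss mentioned above are assumptions made to keep the sketch short.

```python
# Minimal sketch: bipartite matching between sample prediction masks and label
# masks/classes, then the classification and mask prediction loss values.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def match_and_losses(class_probs, pred_masks, gt_labels, gt_masks, w_cls=1.0, w_mask=1.0):
    """class_probs: (N, K+1) probabilities, pred_masks: (N, H, W) in [0, 1],
    gt_labels: (G,) long, gt_masks: (G, H, W) binary label masks (float)."""
    # Matching cost: classification term plus a simple mask error term, weighted and summed.
    cost_cls = -class_probs[:, gt_labels]                                    # (N, G)
    cost_mask = (pred_masks.flatten(1)[:, None, :] -
                 gt_masks.flatten(1)[None, :, :]).abs().mean(-1)             # (N, G)
    cost = (w_cls * cost_cls + w_mask * cost_mask).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)
    pred_idx, gt_idx = torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)

    # Classification targets: matched queries take their label class, the rest
    # take the empty class (which still contributes to the classification loss).
    empty_class = class_probs.size(1) - 1
    targets = torch.full((class_probs.size(0),), empty_class, dtype=torch.long)
    targets[pred_idx] = gt_labels[gt_idx]
    loss_cls = F.nll_loss(torch.log(class_probs + 1e-8), targets)

    # Mask loss only on matched predictions; masks assigned to the empty class are excluded.
    loss_mask = F.binary_cross_entropy(pred_masks[pred_idx], gt_masks[gt_idx])
    return loss_cls, loss_mask

loss_cls, loss_mask = match_and_losses(torch.softmax(torch.randn(100, 45), -1),
                                       torch.rand(100, 64, 64),
                                       torch.tensor([3, 42]),
                                       torch.randint(0, 2, (2, 64, 64)).float())
```

The model would then be optimized on a weighted sum of the two loss values, matching the training direction described above.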
In one embodiment, the computer device may adjust the number of queries in the attention network to be trained towards a direction in which the mask predicted loss value and the classification predicted loss value become smaller until a training stop condition is reached. It is understood that the attention network to be trained is included in the multi-label segmentation model of the image to be trained.
In some embodiments, the training stop condition may be at least one of reaching a preset number of training rounds or convergence of the mask predicted loss value and the class predicted loss value, or the like.
In some embodiments, the probability distribution over the empty class in the multi-class probability distribution participates in the computation of the classification prediction loss value. It will be appreciated that the probability distribution over the empty class participates in the calculation of the classification prediction loss value, whereas the masks assigned to the empty class through bipartite graph matching do not participate in the calculation of the mask prediction loss value. The probability distribution over the empty class serves as a constraint on the classification prediction subtask and thus improves the accuracy of the multi-class probability distribution, but it does not affect mask classification, so it is discarded at inference.
In some embodiments, the number of queries in the attention network to be trained may be different for different sample images; the number of queries need only be greater than the number of classes corresponding to the sample images. It is understood that in the first iteration of training, the user can manually set the number of queries in the transformer.
In this embodiment, in each iteration of training, the label mask and label category corresponding to each sample prediction mask are determined through bipartite graph matching, and the classification prediction loss value and the mask prediction loss value are determined; the multi-label segmentation model of the image to be trained is trained in the direction in which the mask prediction loss value and the classification prediction loss value become smaller until the training stop condition is reached, yielding the trained multi-label segmentation model of the image. When the trained model is used, the attention network can extract the same type of image features from the target image, so that mask classification of the target image is realized and the efficiency of image multi-label segmentation is improved.
In some embodiments, the corresponding classes of the target image include a plurality of defect classes and a plurality of surface classes; pixel points under the defect categories may overlap with pixel points under the surface categories; performing classification prediction according to the plurality of query characterization vectors to obtain multi-class probability distribution corresponding to the plurality of prediction masks, including:
performing surface class prediction and defect class prediction according to the plurality of query characterization vectors to obtain multi-class probability distribution corresponding to the plurality of prediction masks; the multi-class probability distribution for the plurality of prediction masks includes a probability for each class for each query characterization vector.
For example, the computer device may perform feature mapping on the plurality of query characterization vectors to perform surface class prediction and defect class prediction, and obtain multi-class probability distributions corresponding to the plurality of prediction masks.
In some embodiments, the target image may be any one of a road image, a printed circuit board image, and the like. It can be understood that for the image multi-label segmentation task, the training and testing stages need to be adjusted accordingly. Taking the printed circuit board image as an example, the sample image data set contains 42 types of defects, and the gold surface and the ink surface also need to be segmented. It should be noted that there is no overlap among these 42 types of defects and no overlap between the gold surface and the ink surface, but defects do overlap with the gold surface or the ink surface. During training, the computer device can use the binary masks of the 42 defect types, the gold surface, and the ink surface as label masks participating in the calculation of the mask prediction loss value, and the number of categories corresponding to the images is set to 45. For the training supervision of the classification prediction subtask, the 42 defect types are labeled 0 to 41, 42 denotes the gold surface, 43 denotes the ink surface, and 44 denotes the empty category; that is, non-overlapping categories are numbered consecutively as far as possible to facilitate the post-processing operations.
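The label layout for this printed circuit board example can be written down as follows; the constant names are illustrative only.

```python
# Illustrative label layout: 42 defect classes (0-41), gold surface = 42,
# ink surface = 43, empty class = 44 (training target only, discarded at inference).
NUM_DEFECT_CLASSES = 42
GOLD_SURFACE_ID = 42
INK_SURFACE_ID = 43
EMPTY_CLASS_ID = 44

DEFECT_IDS = list(range(NUM_DEFECT_CLASSES))       # 0 .. 41
SURFACE_IDS = [GOLD_SURFACE_ID, INK_SURFACE_ID]    # 42, 43
NUM_CLASSES_WITH_EMPTY = 45                        # classifier output dimension
```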
In the embodiment, the surface type prediction and the defect type prediction are carried out according to the plurality of query characterization vectors to obtain the multi-type probability distribution corresponding to the plurality of prediction masks, and then the mask classification is carried out according to the multi-type probability distribution, so that the prediction of the image defect and the surface two-dimensional type can be realized, and the defect detection of an industrial quality inspection scene can be well adapted.
In some embodiments, the mask classification results include masks under mask classes of different classification dimensions; the method further comprises the following steps:
determining masks under the mask categories of all the classification dimensions from the plurality of predicted masks according to the mask classification result to obtain at least one mask to be combined corresponding to all the classification dimensions;
combining at least one mask to be combined corresponding to each classification dimension to obtain a mask combination result corresponding to each classification dimension; the mask combination result is used for indicating the distribution condition of the mask classes of the same classification dimension.
For example, the computer device may determine, from the plurality of prediction masks, the masks under the mask classes of each classification dimension according to the mask classes included in the mask classification result, thereby obtaining a plurality of masks to be merged corresponding to each classification dimension. The computer device can then merge the plurality of masks to be merged corresponding to each classification dimension to obtain a mask merging result corresponding to each classification dimension.
In some embodiments, for the printed circuit board image, in the inference stage the computer device may obtain a 44 × H × W result; the first 42 channels are taken out and merged to obtain a single-channel merged mask covering the 42 defect types, and channels 42 and 43 are then taken out and merged to obtain the merged mask of the gold surface and the ink surface. In this way, the segmentation results of objects at overlapping positions can be assigned to different merged masks, realizing multi-label segmentation of the image.
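A minimal sketch of this merging step, assuming the 44 × H × W result is a tensor of per-class score maps and using a 0.5 threshold (an assumption) before taking the per-dimension union.

```python
# Minimal sketch: merge channels 0-41 into one defect mask and channels 42-43
# into one surface mask, so overlapping pixels can appear in both merged masks.
import torch

def merge_by_dimension(class_maps: torch.Tensor, threshold: float = 0.5):
    """class_maps: (44, H, W) per-class score maps -> (defect_mask, surface_mask)."""
    defect_mask = (class_maps[:42] > threshold).any(dim=0)      # union of the 42 defect classes
    surface_mask = (class_maps[42:44] > threshold).any(dim=0)   # union of gold and ink surfaces
    return defect_mask, surface_mask

defects, surfaces = merge_by_dimension(torch.rand(44, 64, 64))
```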
In the embodiment, at least one mask to be combined corresponding to each classification dimension is obtained according to the mask classification result; and combining at least one mask to be combined corresponding to each classification dimension to obtain a mask combination result corresponding to each classification dimension, and distributing the segmentation results of the pixel points at the overlapped positions to different combined masks to realize multi-label segmentation of the image.
In some embodiments, an architectural schematic of the multi-label segmentation model of the image is provided as shown in FIG. 3. The multi-label segmentation model of the image comprises a backbone network, an attention network, a multi-layer perceptron, and a linear classifier. It is understood that the multi-layer perceptron actually serves as the linear transformation layer. The computer device can extract features from the target image through the backbone network to obtain the initial image features. For example, the size of the target image is H × W, where H is the height of the target image and W is the width of the target image. The dimension of the initial image feature Ω is C_Ω × (H/d) × (W/d), where C_Ω is the number of feature channels of the initial image feature and d is the downsampling factor of the backbone network. A plurality of query characterization vectors are obtained from the initial image features through the attention network, for example N query characterization vectors Q with dimension C_Q × N, where C_Q is the number of feature channels of a query characterization vector and N is the number of query characterization vectors. The N query characterization vectors are passed through the multi-layer perceptron to obtain N dimension-transformed query characterization vectors M with dimension C_Φ × N, where C_Φ is the number of feature channels of a dimension-transformed query characterization vector. The N dimension-transformed query characterization vectors M are passed through the linear classifier to obtain the multi-class probability distribution. The multi-class probability distribution includes the probability that each query characterization vector corresponds to each class, and may take the form of a multi-class probability matrix of dimension N × (K + 1), where K is the number of classes corresponding to the target image and 1 corresponds to the empty class.
The initial image features are upsampled layer by layer through a feature pyramid network to obtain the target image features. The dimension of the target image feature Φ is C_Φ × H × W, where C_Φ is the number of feature channels of the target image feature. The target image features and the N dimension-transformed query characterization vectors are matrix-multiplied to obtain a plurality of prediction masks corresponding to the target image. It will be appreciated that, in the training phase, the multi-class probability distribution is constrained by the classification prediction loss value and the plurality of prediction masks are constrained by the mask prediction loss value.
It is to be understood that in the iterative training of the attention network, the input of the attention network includes the initial image feature of the sample image and a preset plurality of initial object queries. The computer device can adaptively optimize the object queries based on results of the self-attention computation and the cross-attention computation during the training process, and adjust the number of the plurality of object queries based on the mask prediction loss value and the class prediction loss value.
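For reference, the tensor shapes of the FIG. 3 pipeline can be traced end to end with the following sketch; all sizes (H = W = 512, d = 32, C_Ω = 2048, C_Q = C_Φ = 256, N = 100, K = 44) are illustrative assumptions, and random tensors stand in for the real network outputs.

```python
# Shape walkthrough of the FIG. 3 pipeline with illustrative sizes.
import torch

B, H, W, d = 1, 512, 512, 32
C_OMEGA, C_Q, C_PHI, N, K = 2048, 256, 256, 100, 44

omega = torch.randn(B, C_OMEGA, H // d, W // d)            # backbone output: C_Omega x (H/d) x (W/d)
Q = torch.randn(B, N, C_Q)                                 # attention network output: N queries of size C_Q
M = torch.randn(B, N, C_PHI)                               # after the multi-layer perceptron: C_Phi x N
probs = torch.softmax(torch.randn(B, N, K + 1), -1)        # linear classifier output: N x (K + 1)
phi = torch.randn(B, C_PHI, H, W)                          # feature pyramid output: C_Phi x H x W
masks = torch.einsum("bnc,bchw->bnhw", M, phi).sigmoid()   # N prediction masks of size H x W
print(masks.shape, probs.shape)   # torch.Size([1, 100, 512, 512]) torch.Size([1, 100, 45])
```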
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated otherwise, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the embodiments described above may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; the execution order of these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a multi-label segmentation device of the image. The implementation scheme for solving the problem provided by the apparatus is similar to the implementation scheme described in the method, so specific limitations in the embodiment of the multi-tag segmentation apparatus for one or more images provided below can be referred to the limitations on the multi-tag segmentation method for images in the foregoing, and details are not described here again.
In some embodiments, as shown in fig. 4, there is provided an apparatus 400 for multi-label segmentation of an image, comprising:
a first determining module 402 for determining initial image features extracted from a target image;
an extracting module 404, configured to input the initial image features into an attention network for processing, so as to obtain a plurality of query characterization vectors output by the attention network; each query characterization vector is used for characterizing the same type of image features extracted by the attention network learning;
a first prediction module 406, configured to perform mask prediction based on each query characterization vector and the initial image features to obtain multiple prediction masks corresponding to the target image; the number of the plurality of prediction masks matches the number of the plurality of query characterization vectors;
a second prediction module 408, configured to perform classification prediction according to the plurality of query characterization vectors to obtain multi-class probability distributions corresponding to the plurality of prediction masks;
a second determining module 410, configured to determine mask classification results of the plurality of predicted masks according to the multi-class probability distribution.
In some embodiments, the second prediction module 408 is further configured to:
performing characteristic dimension transformation on the plurality of query characterization vectors through a linear transformation layer to obtain a plurality of query characterization vectors after dimension transformation;
in terms of performing classification prediction according to the query characterization vectors to obtain multi-class probability distributions corresponding to the prediction masks, the second prediction module 408 is specifically configured to:
and performing multi-dimensional classification prediction on the query characterization vectors subjected to the dimension transformation through a linear classifier to obtain multi-class probability distributions corresponding to the multiple prediction masks.
In some embodiments, in terms of performing mask prediction based on each query characterization vector and initial image feature to obtain multiple prediction masks corresponding to the target image, the first prediction module 406 is specifically configured to:
carrying out multi-layer up-sampling processing on the initial image characteristics to obtain target image characteristics; the target image features have richer detail information than the initial image features;
and performing mask prediction according to the query characterization vector and the target image characteristics after the dimensionality transformation to obtain a plurality of prediction masks corresponding to the target image.
In some embodiments, in performing classification prediction according to the query characterization vectors to obtain multi-class probability distributions corresponding to the prediction masks, the second prediction module 408 is specifically configured to:
performing multi-dimensional classification prediction according to the plurality of query characterization vectors to obtain initial multi-class probability distributions corresponding to the plurality of prediction masks; the initial multi-class probability distributions include the probability of an empty class;
and removing the probability of the empty class from the initial multi-class probability distributions to obtain the multi-class probability distributions corresponding to the plurality of prediction masks.
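The following snippet illustrates one way the empty-class probability could be dropped, assuming the last column of the initial distribution corresponds to the empty class; whether to renormalize the remaining probabilities afterwards is an implementation choice not specified here.

```python
# Assumes the empty class occupies the last column of the logits.
import torch
import torch.nn.functional as F

logits = torch.randn(100, 21)               # 20 real classes + 1 empty class per query
init_probs = F.softmax(logits, dim=-1)      # initial multi-class probability distributions
probs = init_probs[:, :-1]                  # remove the empty-class probability
```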
In some embodiments, the mask classification result is an output result obtained by inputting the target image into a multi-label segmentation model for images, and the first determining module 402 is further configured to:
in each training iteration, perform bipartite graph matching based on a plurality of sample prediction masks corresponding to a sample image and the multi-class probability distributions corresponding to the sample prediction masks, and determine the label mask and the label class corresponding to each sample prediction mask;
generating a classification prediction loss value based on the difference between the multi-class probability distributions corresponding to the sample image and the label classes corresponding to the sample prediction masks;
generating a mask prediction loss value based on a difference between each sample prediction mask and the corresponding label mask;
and training the image multi-label segmentation model to be trained in the direction of making the mask prediction loss value and the classification prediction loss value smaller until a training stop condition is reached, to obtain the trained image multi-label segmentation model.
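For orientation, a training-step sketch is given below, assuming Hungarian matching (scipy) for the bipartite graph matching and simple cross-entropy / binary-cross-entropy terms for the classification and mask prediction losses; the actual cost terms, loss weights, and the handling of unmatched queries (typically supervised toward the empty class) are not specified by this sketch.

```python
# Hypothetical matching-based loss; cost terms and loss weights are assumptions.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def match_and_loss(pred_logits, pred_masks, gt_classes, gt_masks):
    # pred_logits: (Q, K+1), pred_masks: (Q, H, W),
    # gt_classes: (G,) long, gt_masks: (G, H, W) float in {0, 1}.
    probs = pred_logits.softmax(-1)
    cls_cost = -probs[:, gt_classes]                             # (Q, G) classification cost
    mask_cost = torch.cdist(pred_masks.flatten(1).sigmoid(),
                            gt_masks.flatten(1), p=1)            # (Q, G) L1 mask cost
    cost = (cls_cost + mask_cost).detach().cpu().numpy()
    q_idx, g_idx = linear_sum_assignment(cost)                   # bipartite graph matching
    q_idx, g_idx = torch.as_tensor(q_idx), torch.as_tensor(g_idx)

    # Classification prediction loss: matched queries vs. their label classes.
    cls_loss = F.cross_entropy(pred_logits[q_idx], gt_classes[g_idx])
    # Mask prediction loss: matched prediction masks vs. their label masks.
    mask_loss = F.binary_cross_entropy_with_logits(pred_masks[q_idx], gt_masks[g_idx])
    return cls_loss + mask_loss
```

The model would then be updated by back-propagating this combined loss until the training stop condition is reached.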
In some embodiments, the classes corresponding to the target image include a plurality of defect classes and a plurality of surface classes; pixel points under a defect class may overlap with pixel points under a surface class; in terms of performing classification prediction according to the query characterization vectors to obtain multi-class probability distributions corresponding to the prediction masks, the second prediction module 408 is specifically configured to:
performing surface class prediction and defect class prediction according to the plurality of query characterization vectors to obtain the multi-class probability distributions corresponding to the plurality of prediction masks; the multi-class probability distributions include, for each query characterization vector, the probability corresponding to each class.
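A sketch of this multi-dimensional classification is shown below, assuming two parallel linear heads, one over surface classes and one over defect classes, so that a pixel region can receive both a surface label and a defect label; the class counts are hypothetical.

```python
# Hypothetical two-head classification over query characterization vectors.
import torch
import torch.nn as nn

surface_head = nn.Linear(256, 4 + 1)     # 4 assumed surface classes + empty class
defect_head = nn.Linear(256, 6 + 1)      # 6 assumed defect classes + empty class

query_vecs = torch.randn(100, 256)
surface_probs = surface_head(query_vecs).softmax(-1)   # per-query surface class distribution
defect_probs = defect_head(query_vecs).softmax(-1)     # per-query defect class distribution
```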
In some embodiments, the mask classification results include masks under mask classes of different classification dimensions; the second determining module 410 is further configured to:
determining, from the plurality of prediction masks according to the mask classification results, the masks under the mask classes of each classification dimension, to obtain at least one mask to be combined corresponding to each classification dimension;
combining the at least one mask to be combined corresponding to each classification dimension to obtain a mask combination result corresponding to that classification dimension; the mask combination result is used for indicating the distribution of the mask classes of the same classification dimension.
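The per-dimension combination could look like the sketch below, which merges all prediction masks whose predicted class belongs to the same classification dimension (for example, all defect classes) into a single label map for that dimension; the threshold and function names are assumptions.

```python
# Hypothetical combination of masks within one classification dimension.
import torch

def combine_masks(masks, mask_classes, classes_in_dim, threshold=0.5):
    # masks: (Q, H, W) prediction masks; mask_classes: (Q,) predicted class ids;
    # classes_in_dim: class ids belonging to one classification dimension.
    combined = torch.zeros(masks.shape[1:], dtype=torch.long)    # 0 = background
    for cls in classes_in_dim:
        selected = masks[mask_classes == cls]                    # masks to be combined for this class
        if selected.numel() == 0:
            continue
        region = (selected.sigmoid() > threshold).any(dim=0)
        combined[region] = cls                                   # mark this class's distribution
    return combined
```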
The modules in the above image multi-label segmentation apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the modules.
In some embodiments, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, an Input/Output (I/O) interface, and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store the attention network. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement the steps in the above-described method of multi-label segmentation of images.
In some embodiments, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 6. The computer apparatus includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory and the input/output interface are connected by a system bus, and the communication interface, the display unit and the input device are connected by the input/output interface to the system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operating system and the computer program to run on the non-volatile storage medium. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement the steps in the above-described method of multi-label segmentation of images. The display unit of the computer device is used for forming a visual picture and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the structures shown in fig. 5 and 6 are merely block diagrams of partial structures relevant to the solution of the present application and do not constitute a limitation on the computer devices to which the solution is applied; a specific computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In some embodiments, a computer device is provided, the computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above-described method embodiments when executing the computer program.
In some embodiments, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps in the above-described method embodiments; an internal structure diagram of the storage medium may be as shown in fig. 7.
In some embodiments, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps in the above-described method embodiments.
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant countries and regions.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases; the non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the present application. It should be noted that a person skilled in the art may make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method for multi-label segmentation of an image, comprising:
determining initial image features extracted from a target image;
inputting the initial image features into an attention network for processing to obtain a plurality of query characterization vectors output by the attention network; each query characterization vector is used for characterizing the same type of image features extracted by the attention network learning;
performing mask prediction based on each query characterization vector and the initial image features to obtain a plurality of prediction masks corresponding to the target image; the number of the plurality of prediction masks matches the number of the plurality of query characterization vectors;
performing classification prediction according to the plurality of query characterization vectors to obtain multi-class probability distribution corresponding to the plurality of prediction masks;
and determining the mask classification results of the plurality of prediction masks according to the multi-class probability distribution.
2. The method of claim 1, wherein before performing the classification prediction based on the query characterization vectors to obtain multi-class probability distributions corresponding to the prediction masks, the method further comprises:
performing characteristic dimension transformation on the plurality of query characterization vectors through a linear transformation layer to obtain a plurality of query characterization vectors after dimension transformation;
the performing classification prediction according to the plurality of query characterization vectors to obtain multi-class probability distributions corresponding to the plurality of prediction masks includes:
and performing multi-dimensional classification prediction on the dimension-transformed query characterization vectors through a linear classifier to obtain the multi-class probability distributions corresponding to the plurality of prediction masks.
3. The method of claim 2, wherein the performing mask prediction based on each query characterization vector and the initial image feature to obtain a plurality of prediction masks corresponding to the target image comprises:
carrying out multi-layer up-sampling processing on the initial image features to obtain target image features; the target image features have richer detail information than the initial image features;
and performing mask prediction according to the dimension-transformed query characterization vectors and the target image features to obtain the plurality of prediction masks corresponding to the target image.
4. The method of claim 1, wherein performing a classification prediction based on the query characterization vectors to obtain multi-class probability distributions corresponding to the prediction masks comprises:
performing multi-dimensional classification prediction according to the plurality of query characterization vectors to obtain initial multi-class probability distribution corresponding to the plurality of prediction masks; the initial multi-class probability distribution comprises a probability distribution of empty classes;
removing the probability distribution of the empty category from the initial multi-category probability distribution to obtain multi-category probability distributions corresponding to the plurality of prediction masks.
5. The method of claim 1, wherein the mask classification result is an output result obtained by inputting the target image into a multi-label segmentation model for images; before the determining the initial image feature extracted from the target image, the method further comprises:
in each iteration training, performing bipartite graph matching based on a plurality of sample prediction masks corresponding to sample images and multi-class probability distribution corresponding to the sample prediction masks, and determining label masks and label classes corresponding to the sample prediction masks;
generating a classification prediction loss value based on the difference between the multi-class probability distributions corresponding to the sample image and the label classes corresponding to the sample prediction masks;
generating mask prediction loss values based on differences between each of the sample prediction masks and the corresponding label mask;
and training the image multi-label segmentation model to be trained in the direction of making the mask prediction loss value and the classification prediction loss value smaller until a training stop condition is reached, to obtain the trained image multi-label segmentation model.
6. The method of claim 1, wherein the categories corresponding to the target image include a plurality of defect categories and a plurality of surface categories; pixel points under a defect category may overlap with pixel points under a surface category; the performing classification prediction according to the plurality of query characterization vectors to obtain multi-class probability distributions corresponding to the plurality of prediction masks includes:
performing surface class prediction and defect class prediction according to the plurality of query characterization vectors to obtain the multi-class probability distributions corresponding to the plurality of prediction masks; the multi-class probability distributions corresponding to the plurality of prediction masks include, for each query characterization vector, the probability corresponding to each class.
7. The method according to any one of claims 1 to 6, wherein the mask classification result comprises masks under mask classes of different classification dimensions; the method further comprises the following steps:
determining, from the plurality of prediction masks according to the mask classification result, the masks under the mask categories of each classification dimension to obtain at least one mask to be combined corresponding to each classification dimension;
combining the at least one mask to be combined corresponding to each classification dimension to obtain a mask combination result corresponding to that classification dimension; and the mask combination result is used for indicating the distribution of the mask classes of the same classification dimension.
8. An apparatus for multi-label segmentation of an image, comprising:
the first determination module is used for determining initial image features extracted from the target image;
the extraction module is used for inputting the initial image features into an attention network for processing to obtain a plurality of query characterization vectors output by the attention network; each query characterization vector is used for characterizing the same type of image features extracted by the attention network learning;
the first prediction module is used for performing mask prediction based on each query characterization vector and the initial image features to obtain a plurality of prediction masks corresponding to the target image; the number of the plurality of prediction masks matches the number of the plurality of query characterization vectors;
the second prediction module is used for carrying out classification prediction according to the plurality of query characterization vectors to obtain multi-class probability distribution corresponding to the plurality of prediction masks;
and a second determining module, configured to determine mask classification results of the multiple predicted masks according to the multi-class probability distribution.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202211601112.3A 2022-12-14 2022-12-14 Image multi-label segmentation method and device, computer equipment and storage medium Pending CN115661465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211601112.3A CN115661465A (en) 2022-12-14 2022-12-14 Image multi-label segmentation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115661465A true CN115661465A (en) 2023-01-31

Family

ID=85023392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211601112.3A Pending CN115661465A (en) 2022-12-14 2022-12-14 Image multi-label segmentation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115661465A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627402A (en) * 2021-10-12 2021-11-09 腾讯科技(深圳)有限公司 Image identification method and related device
US20210397876A1 (en) * 2020-06-19 2021-12-23 Adobe Inc. Similarity propagation for one-shot and few-shot image segmentation
CN114529817A (en) * 2022-02-21 2022-05-24 东南大学 Unmanned aerial vehicle photovoltaic fault diagnosis and positioning method based on attention neural network
CN114549840A (en) * 2022-02-23 2022-05-27 北京百度网讯科技有限公司 Training method of semantic segmentation model and semantic segmentation method and device
US20220230282A1 (en) * 2021-01-12 2022-07-21 Samsung Electronics Co., Ltd. Image processing method, image processing apparatus, electronic device and computer-readable storage medium
CN115082389A (en) * 2022-06-08 2022-09-20 阿里巴巴(中国)有限公司 Method, apparatus and medium for rib detection of medical image
CN115170934A (en) * 2022-09-05 2022-10-11 粤港澳大湾区数字经济研究院(福田) Image segmentation method, system, equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20230131)