CN117974693B - Image segmentation method, device, computer equipment and storage medium - Google Patents

Image segmentation method, device, computer equipment and storage medium

Info

Publication number
CN117974693B
CN117974693B CN202410393276.4A
Authority
CN
China
Prior art keywords
image
noise
vector quantization
model
variational self-encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410393276.4A
Other languages
Chinese (zh)
Other versions
CN117974693A (en)
Inventor
初春燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410393276.4A priority Critical patent/CN117974693B/en
Publication of CN117974693A publication Critical patent/CN117974693A/en
Application granted granted Critical
Publication of CN117974693B publication Critical patent/CN117974693B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression Of Band Width Or Redundancy In Fax (AREA)

Abstract

The present application relates to an image segmentation method, an apparatus, a computer device, a storage medium and a computer program product. The method comprises the following steps: acquiring a target image to be segmented and a Gaussian noise image; performing feature coding processing on the Gaussian noise image by taking the target image as conditional information, and applying a potential space representation constraint to the result of the feature coding processing to obtain a plurality of probabilistic potential feature vectors of the target image; performing feature decoding processing on the probabilistic latent feature vectors by taking the target image as conditional information to obtain a plurality of prediction noises; and denoising the Gaussian noise image through back diffusion based on the prediction noises to obtain an image segmentation result corresponding to each prediction noise. The method can effectively reconstruct a plurality of probabilistic segmentation masks, realize reconstruction processing of diversified segmentation masks of the target image, reduce the occurrence of missed detections and effectively improve the accuracy of image segmentation processing.

Description

Image segmentation method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular, to an image segmentation method, an image segmentation apparatus, a computer device, a storage medium, and a computer program product.
Background
With the development of computer technology and artificial intelligence technology, Computer Vision (CV) has emerged. Computer vision is the science of studying how to make a machine "see"; more specifically, it uses cameras and computers in place of human eyes to perform machine vision tasks such as recognition, tracking and measurement of targets, and further performs image processing so that the processed image becomes more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large model technology has brought an important transformation to the development of computer vision technology: pre-trained vision models such as Swin Transformer, ViT, V-MoE and MAE can be quickly and widely applied to downstream specific tasks through fine-tuning. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, simultaneous localization and mapping, and the like, as well as common biometric recognition techniques such as face recognition and fingerprint recognition. Image segmentation can be implemented by computer vision techniques; for example, image segmentation may be performed on medical images to assist doctors in lesion recognition, medical diagnosis, and the like.
However, current image segmentation based on diffusion models cannot generate diversified image segmentation results, and phenomena such as missed detection occur, which affects the accuracy of the image segmentation processing.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an image segmentation method, apparatus, computer device, computer-readable storage medium, and computer program product that can improve the accuracy of image segmentation processing.
In a first aspect, the present application provides an image segmentation method, including:
acquiring a target image to be segmented and a Gaussian noise image;
performing feature coding processing on the Gaussian noise image by using the target image as conditional information through a vector quantization variation self-encoder model, and applying potential space representation constraint to the result of the feature coding processing to obtain a plurality of probabilistic potential feature vectors of the Gaussian noise image;
Taking the target image as conditional information, carrying out feature decoding processing on the probabilistic latent feature vectors through a vector quantization variation self-decoder model to obtain a plurality of prediction noises; the vector quantization variation self-encoder model and the vector quantization variation self-decoder model are obtained by performing conditional diffusion training based on historical images;
and denoising the Gaussian noise image by back diffusion based on the prediction noise to obtain an image segmentation result corresponding to each prediction noise.
In a second aspect, the present application also provides an image segmentation apparatus, including:
the data acquisition module is used for acquiring a target image to be segmented and a Gaussian noise image;
The coding processing module is used for carrying out feature coding processing on the Gaussian noise image by taking the target image as conditional information through a vector quantization variation self-coder model, and applying potential space representation constraint to a result of the feature coding processing to obtain a plurality of probabilistic potential feature vectors of the Gaussian noise image;
The decoding processing module is used for carrying out feature decoding processing on the probabilistic latent feature vectors through a vector quantization variation self-decoder model by taking the target image as conditional information to obtain a plurality of prediction noises; the vector quantization variation self-encoder model and the vector quantization variation self-decoder model are obtained by performing conditional diffusion training based on historical images;
And the image segmentation module is used for carrying out denoising processing on the Gaussian noise image through back diffusion based on the prediction noise to obtain an image segmentation result corresponding to each prediction noise.
In a third aspect, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a target image to be segmented and a Gaussian noise image;
performing feature coding processing on the Gaussian noise image by using the target image as conditional information through a vector quantization variation self-encoder model, and applying potential space representation constraint to the result of the feature coding processing to obtain a plurality of probabilistic potential feature vectors of the Gaussian noise image;
Taking the target image as conditional information, carrying out feature decoding processing on the probabilistic latent feature vectors through a vector quantization variation self-decoder model to obtain a plurality of prediction noises; the vector quantization variation self-encoder model and the vector quantization variation self-decoder model are obtained by performing conditional diffusion training based on historical images;
and denoising the Gaussian noise image by back diffusion based on the prediction noise to obtain an image segmentation result corresponding to each prediction noise.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring a target image to be segmented and a Gaussian noise image;
performing feature coding processing on the Gaussian noise image by using the target image as conditional information through a vector quantization variation self-encoder model, and applying potential space representation constraint to the result of the feature coding processing to obtain a plurality of probabilistic potential feature vectors of the Gaussian noise image;
Taking the target image as conditional information, carrying out feature decoding processing on the probabilistic latent feature vectors through a vector quantization variation self-decoder model to obtain a plurality of prediction noises; the vector quantization variation self-encoder model and the vector quantization variation self-decoder model are obtained by performing conditional diffusion training based on historical images;
and denoising the Gaussian noise image by back diffusion based on the prediction noise to obtain an image segmentation result corresponding to each prediction noise.
In a fifth aspect, the application also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of:
acquiring a target image to be segmented and a Gaussian noise image;
performing feature coding processing on the Gaussian noise image by using the target image as conditional information through a vector quantization variation self-encoder model, and applying potential space representation constraint to the result of the feature coding processing to obtain a plurality of probabilistic potential feature vectors of the Gaussian noise image;
Taking the target image as conditional information, carrying out feature decoding processing on the probabilistic latent feature vectors through a vector quantization variation self-decoder model to obtain a plurality of prediction noises; the vector quantization variation self-encoder model and the vector quantization variation self-decoder model are obtained by performing conditional diffusion training based on historical images;
and denoising the Gaussian noise image by back diffusion based on the prediction noise to obtain an image segmentation result corresponding to each prediction noise.
According to the image segmentation method, the apparatus, the computer device, the storage medium and the computer program product, basic data suitable for diffusion model analysis are obtained by acquiring a target image to be segmented and a Gaussian noise image. Then, with the target image as conditional information, feature encoding processing is carried out on the Gaussian noise image through a vector quantization variation self-encoder model, and a potential space representation constraint is applied to the result of the feature encoding processing to obtain a plurality of probabilistic potential feature vectors of the target image; feature decoding processing is carried out on the probabilistic potential feature vectors through the vector quantization variation self-decoder model, again with the target image as conditional information, to obtain a plurality of prediction noises. That is, the vector quantization variation self-encoder model and self-decoder model obtained through diffusion training are used to realize segmentation mask reconstruction processing of the target image, the probabilistic, diversified segmentation sample distribution of the image is learned through the potential space representation capability of the vector quantization variation self-encoder model, and a plurality of prediction noises are output. Finally, the Gaussian noise image is denoised through back diffusion based on the prediction noises to obtain an image segmentation result corresponding to each prediction noise, so that diversified image segmentation processing of the target image is realized. In this embodiment, a plurality of probabilistic segmentation masks can be effectively reconstructed through the vector quantization variation self-encoder model and self-decoder model obtained through diffusion training, so as to realize reconstruction processing of diversified segmentation masks of the target image, reduce the occurrence of missed detections, and effectively improve the accuracy of the image segmentation processing.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings required in the description of the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained from these drawings by those skilled in the art without inventive effort.
FIG. 1 is an application environment diagram of an image segmentation method in one embodiment;
FIG. 2 is a flow chart of an image segmentation method in one embodiment;
FIG. 3 is a flow diagram of the encoding and decoding processes of the vector quantization variation self-decoder model in one embodiment;
FIG. 4 is a flow diagram of an image generation process of a vector quantization variation self-decoder model in one embodiment;
FIG. 5 is a schematic diagram of a conditional vector quantization diffusion model in one embodiment;
FIG. 6 is a schematic diagram of a potential spatial representation constraint module in one embodiment;
FIG. 7 is a flow diagram of a conditional diffusion training flow in one embodiment;
FIG. 8 is a flow chart of an image segmentation method according to another embodiment;
FIG. 9 is a block diagram showing the structure of an image segmentation apparatus in one embodiment;
FIG. 10 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The present application relates to the field of artificial intelligence. Artificial intelligence is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, enabling machines to have the functions of perception, reasoning and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, mechatronics, and the like. A pre-training model, also called a large model or a foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning. The present application specifically relates to computer vision techniques and machine learning (Machine Learning, ML) techniques in artificial intelligence.
Machine learning is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way of endowing computers with intelligence, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction. The pre-training model is the latest development of deep learning and integrates these techniques.
The technical terms related to the application include:
Variational self-encoder (Variational Autoencoder, VAE): a variational self-encoder is a generative model that combines the ideas of the self-encoder and variational inference. It can be used to learn and generate representations of high-dimensional data in a potential space.
Potential space (Latent Space): the potential space refers to a low-dimensional feature space obtained by encoding or representing data in machine learning and statistical modeling.
Convolutional neural network (Convolutional Neural Network, CNN): convolutional neural networks are a deep learning model that is used primarily to process and analyze data, such as images and video, that has a grid structure. It extracts features from the input data by applying convolution and pooling operations at different levels and uses these features to perform classification, identification, or regression tasks.
Vector quantization variation self-encoder (Vector Quantized Variational Autoencoder, VQ-VAE): the vector quantization variation self-encoder is a deep learning model for generating and learning low-dimensional representations of high-dimensional data. It combines the concepts of the self-encoder and the variational self-encoder and introduces vector quantization techniques to handle a discrete potential space.
The image segmentation method provided by the embodiment of the application can be applied to the application environment shown in fig. 1, wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process; it may be integrated on the server 104 or located on a cloud or other network server. When the user on the terminal 102 side wishes to perform segmentation processing on a target image to obtain a plurality of segmented images, the terminal 102 may submit the target image to be segmented and a random Gaussian noise image to the server 104. The server 104 acquires the target image to be segmented and the Gaussian noise image; performs feature coding processing on the Gaussian noise image by using the target image as conditional information through a vector quantization variation self-encoder model, and applies a potential space representation constraint to the result of the feature coding processing to obtain a plurality of probabilistic potential feature vectors of the target image; performs feature decoding processing on the probabilistic latent feature vectors through a vector quantization variation self-decoder model, with the target image as conditional information, to obtain a plurality of prediction noises, the vector quantization variation self-encoder model and the vector quantization variation self-decoder model being obtained by carrying out conditional diffusion training based on historical images; and denoises the Gaussian noise image through back diffusion based on the prediction noises to obtain an image segmentation result corresponding to each prediction noise. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, Internet of Things devices, and portable wearable devices, where the Internet of Things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.
In an exemplary embodiment, as shown in fig. 2, an image segmentation method is provided, and an example of application of the method to the server 104 in fig. 1 is described, including the following steps 201 to 209. Wherein:
In step 201, a target image to be segmented and a gaussian noise image are acquired.
The target image is the image processed by the image segmentation method; the image segmentation method can perform probabilistic segmentation processing on the target image to obtain a plurality of image segmentation results. In one embodiment, the target image may be a medical image, such as a CT (Computed Tomography) image or an MRI (Magnetic Resonance Imaging) image; performing probabilistic image segmentation processing on such medical images and outputting the corresponding segmentation masks (segmentation results) can assist doctors in lesion recognition, medical diagnosis, and the like. A Gaussian noise image is a noise image whose probability density function follows a Gaussian distribution (normal distribution).
For example, when the user on the terminal 102 side wishes to have the target image segmented, the target image to be processed may be submitted to the server 104 through the terminal 102, and after the target image is obtained, the server 104 may first apply Gaussian noise to the target image to obtain a noise image. The scheme of the application realizes probabilistic segmentation processing of the target image by combining a diffusion model with a vector quantization variation self-encoder model, and the Gaussian noise is associated with the diffusion model. The Gaussian noise image is a random noise image that is typically added to the input data of the diffusion model; this is done to help the diffusion model learn to generate new data similar to the training data even if the input is not perfect. The noise is generated using a Gaussian distribution, a probability distribution describing the likelihood of different values occurring within a given range. The diffusion model is a basic model that can generate new data from training data. Its principle of operation is to add Gaussian noise to the image, essentially random pixel or distortion variations that affect the original image; this process is referred to as the forward diffusion process. The diffusion model then learns to eliminate the added noise during back diffusion, gradually reducing the noise level until a clear, high-quality image is produced. Therefore, the scheme of the application completes the training of the vector quantization variation self-encoder and the vector quantization variation self-decoder in the training mode of the conditional diffusion model, thereby constructing a conditional vector quantization diffusion model (Conditional Vector Quantizer Diffusion Model, CVQDM) that combines the conditional diffusion model with the vector quantization variation self-encoder to realize probabilistic segmentation processing of the target image.
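As a concrete illustration of the forward diffusion process described above, the following is a minimal PyTorch-style sketch of superimposing Gaussian noise on an image according to a fixed noise schedule; the schedule values, step count and the function name q_sample are illustrative assumptions and not part of the claimed method.

```python
import torch

# Hypothetical linear noise schedule; the schedule actually used by the model is not specified here.
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Forward diffusion: produce the noisy image x_t from the clean image x0 at step t."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)              # cumulative signal-retention factor
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # DDPM-style noising

# Example: noise a batch of (B, 1, H, W) segmentation masks at random steps.
x0 = torch.rand(2, 1, 64, 64)
t = torch.randint(0, num_steps, (2,))
x_t = q_sample(x0, t, torch.randn_like(x0))
```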
Step 203, carrying out feature coding processing on the Gaussian noise image by using the target image as conditional information through a vector quantization variation self-encoder model, and applying a potential space representation constraint to the result of the feature coding processing to obtain a plurality of probabilistic potential feature vectors of the target image.
The method and the device can effectively solve the problems that the complex correlation structure of the real distribution is difficult to capture and that segmentation predictions lack diversity and are blurry. The feature encoding process refers to the process of mapping the Gaussian noise image, with the target image as conditional information, into a potential space through the encoder model to obtain the corresponding potential feature vectors. In this process, a potential space representation constraint needs to be applied to the result of the feature encoding processing; the constraint can be realized through a potential space representation constraint module (Latent Space Representation Constraint Module, LSRCM), which limits the distribution or distance measure of the potential space, prevents the representation capability of the potential space from being lost because the weight value of a channel becomes too small during feature encoding, and improves the robustness and stability of the model.
Illustratively, after the target image and the Gaussian noise image are obtained, feature extraction of the Gaussian noise image can be completed through the trained vector quantization variation self-encoder model. The Gaussian noise image is input into the vector quantization variation self-encoder model with the input target image as conditional information; the model encodes the Gaussian noise image, maps its representation into a potential space, and applies a potential space representation constraint to obtain probabilistic potential feature vectors in a discrete form. The vector quantization variation self-encoder model contains an embedded space (codebook) mechanism, so the codebook-based vector quantization process incorporates a discrete potential space, which allows discrete data to be represented more efficiently and more structured potential representations to be captured; that is, the probabilistic, diversified segmentation sample distribution of an image is learned through the powerful potential space characterization capability of the encoder model. At the same time, a potential space representation constraint needs to be applied to the result of the feature encoding processing to reduce the weights of weakly correlated representation values of the discrete vectors in the codebook, thereby preventing processing anomalies caused by the strong generalization of the vector quantization variation self-encoder model and yielding fine, accurate image segmentation results.
Step 207, performing feature decoding processing on the probabilistic latent feature vectors through the vector quantization variation self-decoder model, with the target image as conditional information, to obtain a plurality of prediction noises; the vector quantization variation self-encoder model and the vector quantization variation self-decoder model are obtained by performing conditional diffusion training based on historical images.
The vector quantization variation self-decoder model corresponds to the vector quantization variation self-encoder model and is a decoding module consisting of a fully connected layer, a conversion layer (Reshape) and transposed convolution layers, which can convert probabilistic latent feature vectors into prediction noises. For each prediction noise, a segmentation mask for the target image can be obtained by continuously removing the prediction noise from the Gaussian noise image, thereby realizing segmentation processing of the target image; the plurality of probabilistic potential feature vectors correspond to different prediction noises.
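Consistent with the decoder structure just described (a fully connected layer, a conversion/Reshape layer and transposed convolution layers), a minimal sketch might look as follows; the class name VQDecoder, the layer sizes and the 8x8 intermediate resolution are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class VQDecoder(nn.Module):
    """Maps a quantized latent vector back to image space as a predicted noise map."""
    def __init__(self, latent_dim: int = 64, base_ch: int = 64, out_ch: int = 1):
        super().__init__()
        self.base_ch = base_ch
        self.fc = nn.Linear(latent_dim, base_ch * 8 * 8)       # fully connected layer
        self.deconv = nn.Sequential(                           # transposed convolution layers
            nn.ConvTranspose2d(base_ch, base_ch // 2, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base_ch // 2, out_ch, 4, stride=2, padding=1),
        )

    def forward(self, z_q: torch.Tensor) -> torch.Tensor:
        h = self.fc(z_q)                                       # (B, base_ch * 8 * 8)
        h = h.view(-1, self.base_ch, 8, 8)                     # conversion (Reshape) layer
        return self.deconv(h)                                  # predicted noise in image space
```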
Illustratively, when the plurality of probabilistic latent feature vectors of the Gaussian noise image are obtained, the probabilistic latent feature vectors can be input into the vector quantization variation self-decoder model, and with the target image as conditional information, the decoder decodes the probabilistic latent feature vectors to obtain the corresponding prediction noises. In the process of decoding the prediction noise, the decoder can reconstruct an image based on the input probabilistic latent feature vectors; the probabilistic latent image features are then transformed through the codebook to obtain prediction noise hidden vectors, and the prediction noise hidden vectors are converted into the prediction noises after being mapped back to the image space. For the specific encoding and decoding processes of the vector quantization variation self-decoder model, reference may be made to fig. 3, and for the image generation process, reference may be made to fig. 4. Unlike the schemes of fig. 3 and fig. 4, the scheme of the present application adopts the network structure of conditional diffusion models (Conditional Diffusion Models) as the backbone model and then combines it with the vector quantization variation self-encoder model to obtain the conditional vector quantization diffusion model for probabilistic segmentation using multiple segmentation masks, whose structure is shown in fig. 5. The training of the vector quantization variation self-encoder model and the vector quantization variation self-decoder model is completed in the manner of a diffusion model, where the process from (Xm, 0) to (Xm, t) represents the conditional diffusion training of the model; meanwhile, in the encoding and decoding process, a potential space representation constraint (LSRCM) also needs to be applied to the output of the embedded space (codebook) in the vector quantization variation self-encoding process.
Step 209, denoising the Gaussian noise image by back diffusion based on the prediction noise to obtain the image segmentation result corresponding to each prediction noise.
Illustratively, image segmentation is the technique and process of dividing an image into a number of specific regions with unique properties and extracting objects of interest; it is a key step from image processing to image analysis. The scheme of the application realizes the processing of the target image through the prediction noises, i.e., on the basis of the original Gaussian noise image, denoising processing is performed on the Gaussian noise image through back diffusion of the prediction noises, so as to obtain the corresponding probabilistic segmentation mask hypotheses as image segmentation results. For the image segmentation process, noise can be gradually removed from the noise image during back diffusion based on the obtained prediction noises to restore the segmentation masks, thereby obtaining different probabilistic image segmentation results. In one embodiment, the image segmentation method of the present application is suitable for processing medical images; in this case, image segmentation is used to delineate regions of interest in the medical image to assist doctors in lesion recognition, medical diagnosis, and the like. The region of interest in the medical image can be predicted by the image segmentation method of the present application to obtain segmentation mask hypotheses for the medical image, and different prediction noises can be processed to obtain a plurality of probabilistic image segmentation results, which better mirrors the clinical reality that the collective insight of a group of experts is generally better than the best diagnosis of any individual. In contrast, existing segmentation methods output only the single most likely segmentation mask, which may lead to misdiagnosis and suboptimal treatment. Providing only pixel-by-pixel probabilities ignores all covariances between pixels, which makes subsequent analysis more difficult, if not impossible. If multiple probabilistic segmentation mask hypotheses are provided, they can be used for further diagnosis or for resolving ambiguity.
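The back-diffusion denoising described above can be sketched as follows. Here predict_noise stands in for the conditional encoder/decoder pair of the application, the update rule is a simplified DDPM-style step, and all names and schedule values are assumptions rather than the claimed implementation.

```python
import torch

@torch.no_grad()
def sample_masks(predict_noise, image, num_hypotheses=4, num_steps=1000):
    """Denoise independent Gaussian noise images into multiple probabilistic segmentation masks."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    masks = []
    for _ in range(num_hypotheses):                          # one hypothesis per noise sample
        x = torch.randn(1, 1, *image.shape[-2:])             # Gaussian noise image
        for t in reversed(range(num_steps)):
            eps = predict_noise(x, image, t)                 # noise predicted with the target image as condition
            a, a_bar = alphas[t], alphas_cumprod[t]
            mean = (x - (1 - a) / (1 - a_bar).sqrt() * eps) / a.sqrt()
            x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
        masks.append(x)                                      # restored segmentation mask hypothesis
    return masks
```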
According to the image segmentation method, basic data suitable for diffusion model analysis are obtained by acquiring a target image to be segmented and a Gaussian noise image. Then, with the target image as conditional information, feature encoding processing is carried out on the Gaussian noise image through a vector quantization variation self-encoder model, and a potential space representation constraint is applied to the result of the feature encoding processing to obtain a plurality of probabilistic potential feature vectors of the target image; feature decoding processing is carried out on the probabilistic potential feature vectors through a vector quantization variation self-decoder model, again with the target image as conditional information, to obtain a plurality of prediction noises. That is, segmentation mask reconstruction processing of the target image is achieved through the vector quantization variation self-encoder model and self-decoder model obtained through diffusion training, the probabilistic, diversified segmentation sample distribution of the image is learned through the potential space representation capability of the vector quantization variation self-encoder model, and a plurality of prediction noises are output. Finally, denoising processing is carried out on the Gaussian noise image through back diffusion based on the prediction noises, so that an image segmentation result corresponding to each prediction noise is obtained and diversified image segmentation processing is achieved. In this embodiment, a plurality of probabilistic segmentation masks can be effectively reconstructed through the vector quantization variation self-encoder model and self-decoder model obtained through diffusion training, so as to realize reconstruction processing of diversified segmentation masks of the target image, reduce the occurrence of missed detections, and effectively improve the accuracy of the image segmentation processing.
In an exemplary embodiment, step 203 includes: taking the target image as conditional information, carrying out vector coding processing on the Gaussian noise image through the vector quantization variation self-encoder model to obtain an intermediate characterization vector; searching for similarity vectors of the intermediate characterization vector in the embedded space through a nearest neighbor algorithm; applying a potential space representation constraint to the similarity vectors to obtain a plurality of probabilistic target similarity vectors; and constructing a probabilistic potential feature vector corresponding to each probabilistic target similarity vector.
The intermediate characterization vector is the intermediate representation obtained directly from the encoder and is used to directly find the nearest vectors in the embedded space. The nearest neighbor algorithm is an algorithm that determines the distance between features; the closer the distance, the more similar two vectors are. The embedded space, i.e. the codebook, is a mechanism of the vector quantization variation self-encoder model that can encode images into discrete vectors.
For example, for the processing of the vector quantization variation self-encoder model, the input Gaussian noise image may first be subjected to vector encoding processing by the vector quantization variation self-encoder model and encoded into an intermediate characterization vector. The intermediate characterization vector is then input into the embedded space (codebook) and the similarity vectors are queried in the codebook. In the present application, the codebook may be initialized as an embedding table (self._embedding), for example via torch.nn.Embedding. The similarity vectors of the intermediate characterization vectors are then queried within the codebook, i.e. the squared Euclidean distances between the intermediate characterization vectors and all embedded vectors (self._embedding.weight) in the codebook are calculated. A matrix d is thus obtained, which represents the distances between the encoded vectors and the codebook entries. The torch.argmin method is used to find the index of the minimum value in each row of d, which is the index of the nearest embedded vector for each encoded sample and corresponds to the similarity vector. Spatial representation constraints are then applied to the similarity vectors to obtain a plurality of probabilistic target similarity vectors, a probabilistic potential feature vector corresponding to each probabilistic target similarity vector is constructed, and the queried probabilistic target similarity vectors are placed at the positions corresponding to the probabilistic potential feature vectors to obtain the quantized probabilistic potential feature vectors. This position refers to a particular index or coordinate in the potential feature vector.
In the VQ-VAE, the encoder generates a continuous hidden vector for the input data. This hidden vector is then divided into a plurality of parts, each of which is quantized into a discrete embedded vector. Thus, "placing the queried similarity vector at the position of the corresponding probabilistic latent feature vector" means replacing the continuous hidden vector of each part with the closest discrete embedded vector in the embedded space; this "position" is the index or coordinate of the original continuous hidden vector within the entire hidden vector. By constructing the probabilistic potential feature vector corresponding to each probabilistic target similarity vector, the continuous hidden variables output by the vector quantization variation self-encoder model can be converted into a discrete form, so that a plurality of probabilistic potential feature vectors of the target image are obtained. In this embodiment, by querying the similarity vectors of the intermediate characterization vectors and then applying a potential space representation constraint to them, probabilistic potential feature vectors can be constructed, so that the strong potential space characterization capability of the embedded space codebook can be used to learn the probabilistic, diversified segmentation sample distribution of an image for image generation and subsequent segmentation processing.
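A minimal sketch of the codebook lookup described above (squared Euclidean distances followed by torch.argmin, with the queried vectors placed back at the corresponding positions) is given below; the codebook size, the dimensions and the straight-through gradient trick are illustrative assumptions.

```python
import torch
import torch.nn as nn

num_embeddings, embedding_dim = 512, 64
_embedding = nn.Embedding(num_embeddings, embedding_dim)       # embedded space (codebook)

def quantize(z_e: torch.Tensor) -> torch.Tensor:
    """Replace each continuous intermediate characterization vector with its nearest codebook vector."""
    flat = z_e.reshape(-1, embedding_dim)                       # (N, D)
    # Squared Euclidean distance between every latent vector and every codebook entry -> matrix d.
    d = (flat.pow(2).sum(1, keepdim=True)
         - 2 * flat @ _embedding.weight.t()
         + _embedding.weight.pow(2).sum(1))
    idx = torch.argmin(d, dim=1)                                # index of the nearest embedded vector per row
    z_q = _embedding(idx).view_as(z_e)                          # place queried vectors at the corresponding positions
    # Straight-through estimator so gradients can still flow back to the encoder.
    return z_e + (z_q - z_e).detach()
```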
In one exemplary embodiment, applying potential spatial representation constraints to the similarity vectors to obtain a plurality of probabilistic target similarity vectors includes: compressing the similar vector into a channel dimension vector through adaptive averaging pooling; carrying out convolution processing on the channel dimension vector, and obtaining importance weights of different channels in the channel dimension vector based on the convolution processing result; applying potential space constraint to importance weights of different channels based on relative entropy to obtain constraint weights of different channels; and obtaining a plurality of probabilistic target similarity vectors based on the constraint weights of different channels and the channel dimension vectors.
Adaptive average pooling is a pooling method that can map convolution kernels and spatial features into a specific coordinate system, so that high-dimensional features are projected into a low-dimensional space, and the spatial average pooling result is obtained by averaging the feature vectors in the coordinate system. The relative entropy, also called the KL divergence (Kullback-Leibler divergence), is a statistical measure that represents the degree of difference of one probability distribution relative to another. The constraint based on relative entropy can prevent the representation capability of the potential space of the codebook from being lost because the weight value of a channel becomes too small.
The constraint process of the potential space representation constraint can be realized as follows: adaptive average pooling is used to compress the discrete vectors obtained from the embedded space codebook into the channel dimension; the importance of each channel is then predicted through convolution operations to obtain the original importance weights of the different channels, where the dimension is reduced through convolution and recovered through transposed convolution, and the resulting weight values are superimposed on the channel dimension of the codebook. The relative entropy is then used to apply a potential space constraint to the obtained weights; by limiting the distribution or distance measure of the potential space, this constraint prevents too small channel weight values from destroying the representation capability of the codebook potential space, which helps to improve the robustness and stability of the model. Finally, the constraint weights of the different channels are combined with the channel dimension vectors to obtain a plurality of probabilistic target similarity vectors, i.e. the learned weight value of each channel is multiplied by the corresponding channel. By imposing the potential space representation constraint on the similarity vectors, the weights of weakly correlated representation values of the discrete vectors in the codebook can be effectively reduced, thereby preventing reconstruction anomalies caused by the strong generalization of the VQ-VAE. In one embodiment, the specific structure of the potential space representation constraint module that imposes the constraint on the similarity vectors may be as shown in FIG. 6. In one embodiment, the relative entropy may also be replaced with another measure of the difference between probability distributions, such as the JS divergence (Jensen-Shannon Divergence) or the TV distance (total variation distance). In this embodiment, by constraining the potential space representation of the similarity vectors output by the embedded space, the weights of weakly correlated representation values of the discrete vectors in the embedded space codebook can be effectively reduced, preventing reconstruction anomalies caused by the strong generalization of the VQ-VAE and improving the accuracy of the image segmentation processing.
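The potential space representation constraint described above (adaptive average pooling to the channel dimension, a dimension-reducing convolution followed by a transposed convolution to predict per-channel importance weights, and a relative-entropy constraint on the weight distribution) might be sketched as follows. The module name, the reduction ratio and the choice of a uniform prior for the original weight distribution are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentSpaceRepresentationConstraint(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                              # compress to a channel-dimension vector
        self.down = nn.Conv2d(channels, channels // reduction, 1)        # dimension-reducing convolution
        self.up = nn.ConvTranspose2d(channels // reduction, channels, 1) # transposed convolution recovers dimensions

    def forward(self, z_q: torch.Tensor):
        w = torch.sigmoid(self.up(F.relu(self.down(self.pool(z_q)))))    # per-channel importance weights (B, C, 1, 1)
        # Relative-entropy (KL) constraint toward a uniform prior keeps channel weights from collapsing.
        w_norm = w / (w.sum(dim=1, keepdim=True) + 1e-8)
        prior = torch.full_like(w_norm, 1.0 / w_norm.size(1))
        kl_loss = F.kl_div(w_norm.log(), prior, reduction="batchmean")
        return z_q * w, kl_loss                                          # reweighted latent and weight-distribution loss
```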
In an exemplary embodiment, step 207 includes: converting the multiple probabilistic potential feature vectors to obtain multiple prediction noise hidden vectors; and taking the target image as conditional information, decoding each prediction noise hidden vector through a vector quantization variation self-decoder model, and determining the prediction noise corresponding to each prediction noise hidden vector.
For example, the probabilistic latent feature vectors may first be converted into prediction noise hidden vectors, and the prediction noises are then derived from the prediction noise hidden vectors. For the conversion process, a transformation may be performed on each probabilistic latent feature vector through the embedded space codebook to obtain a prediction noise hidden vector. Each prediction noise hidden vector is then decoded through the vector quantization variation self-decoder model to determine the prediction noise corresponding to it: when predicting the noise, the vector quantization variation self-decoder reconstructs the image and maps the prediction noise hidden vector back to the original image space, thereby obtaining the prediction noise corresponding to each prediction noise hidden vector in the original image space.
In an exemplary embodiment, the method further comprises: acquiring a plurality of segmentation masks of a history image; and based on the segmentation mask, performing conditional diffusion training on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model to obtain a vector quantization variation self-encoder model and a vector quantization variation self-decoder model.
The historical images are image data in historical data and are used for model training. The historical images can be selected according to the application field of the image segmentation method of the present application; for example, medical images can be used as historical images when the method is applied to the medical image segmentation field, and scenery images may be selected as historical images when it is applied to the segmentation of scenery images. The segmentation mask (Segmentation Mask) refers to an image used to identify or mask specific regions in the image segmentation process; the region of interest can be highlighted or selected through the mask. Different objects in the image can be separated through the segmentation mask, and a corresponding binary or multi-value image is generated to represent the position and shape of each object.
Illustratively, the inventive solution also comprises a model training process for the vector quantization variation self-encoder model and the vector quantization variation self-decoder model, by means of unsupervised self-training of the encoder and decoder, available models for the image segmentation process can be obtained. For the training process, a plurality of segmentation masks of the historical image are acquired first, and the segmentation masks are obtained by processing the original historical image. And training the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model through the segmentation masks, wherein the training mode refers to the training mode of the conditional diffusion model, and finally the available vector quantization variation self-encoder model and vector quantization variation self-decoder model are obtained, so that effective image segmentation processing is realized. In this embodiment, by selecting the historical image, constructing a plurality of segmentation masks, and training the vector quantization variation self-encoder model and the vector quantization variation self-decoder model in a training manner of the diffusion model, the effectiveness of the processing in the encoding and decoding processes can be effectively ensured, thereby ensuring the accuracy of the image segmentation processing.
In one exemplary embodiment, based on the segmentation mask, performing conditional diffusion training on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model to obtain a vector quantization variation self-encoder model and a vector quantization variation self-decoder model includes: sequentially applying Gaussian noise of different times to each divided mask to obtain a noise mask of each divided mask under the influence of the Gaussian noise of different times; based on noise masks of each segmentation mask under the influence of Gaussian noise of different times, sequentially carrying out conditional diffusion treatment on an initial vector quantization variation self-encoder model and an initial vector quantization variation self-decoder model to obtain noise loss parameters, image reconstruction loss parameters and weight distribution loss parameters which are respectively corresponding to the Gaussian noise of different times; and according to the noise loss parameter, the image reconstruction loss parameter and the weight distribution loss parameter, carrying out optimization treatment on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model to obtain a vector quantization variation self-encoder model and a vector quantization variation self-decoder model.
Gaussian noise of different times means that noise is added to each original segmentation mask step by step for multiple times, the noise added at each step is recorded, and the noise mask of each segmentation mask under the influence of different times of Gaussian noise is obtained. The noise loss parameter refers to the loss of the noise prediction process, the image reconstruction loss parameter refers to the loss between the reconstructed image and the original image during image reconstruction, and the weight distribution loss parameter refers to the loss between the original weight distribution and the weight distribution learned after the constraint when the potential space representation constraint is applied to the result of the feature encoding processing.
For example, for a model training process of conditional diffusion processing, it may be split into two phases, the first phase being to add noise to the original segmentation mask multiple times, and record the noise added at each step. The second stage is to input a segmentation mask with noise and an original image under the current step number, predict the noise occurring in the previous step in the inversion process by using an initial model with a potential space representation constraint module, subtract the noise, reconstruct the segmentation mask by subtracting the noise for a plurality of times, calculate the loss of the reconstruction segmentation mask and a real segmentation mask, finally optimize the model, and finally obtain a model which can be used for image segmentation processing by a plurality of rounds of iteration. The model loss specifically comprises three parts of loss, namely noise loss parameters, image reconstruction loss parameters and weight distribution loss parameters, and the three parts are combined to obtain total loss, so that the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model are optimized, and the required vector quantization variation self-encoder model and the required vector quantization variation self-decoder model can be obtained. In this embodiment, the diffusion training processing of the initial model is completed by gradually adding noise, so that the model is optimized to obtain a usable model, and the accuracy of the subsequent image segmentation processing can be effectively improved.
In an exemplary embodiment, based on the noise mask of each segmentation mask under the influence of different times of Gaussian noise, sequentially performing conditional diffusion processing on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model, and obtaining the noise loss parameters corresponding to different times of Gaussian noise respectively includes: splicing the maximum noise mask of each segmentation mask with the historical image to obtain a mask spliced image of each segmentation mask, wherein the maximum noise mask is the noise mask with the maximum Gaussian noise; performing feature coding processing on the mask spliced images through the initial vector quantization variation self-encoder model, and applying a potential space representation constraint to the result of the feature coding processing to obtain probabilistic potential feature vectors corresponding to the mask spliced images; decoding the probabilistic latent feature vectors through the initial vector quantization variation self-decoder model to determine the predicted highest-layer Gaussian noise; constructing a predicted next-highest noise mask of each segmentation mask based on the maximum noise mask of each segmentation mask and the highest-layer Gaussian noise; and determining the noise loss parameter of the highest-layer Gaussian noise based on the predicted next-highest noise mask of each segmentation mask and the next-highest noise mask of each segmentation mask.
After all Gaussian noise is applied to the segmentation mask, the obtained noise mask is the maximum noise mask, and inverse processing can be performed on the basis of the maximum noise mask to calculate model loss in each noise adding process, so that the training process of the model is completed.
In the model training process, an image after noise addition and an original image are subjected to splicing processing, then a mask spliced image obtained by splicing is input into an initial vector quantization variation self-encoder model, and feature coding and potential space representation constraint processing are performed to obtain probability potential feature vectors corresponding to each mask spliced image. And then the initial vector quantization variation is used for carrying out corresponding decoding processing from the decoder model, so that the noise at the highest layer is predicted. In one embodiment, for the prediction process of the highest Gaussian noise, the image reconstruction processing can be performed on the historical image through the initial vector quantization variation self-decoder model to obtain a reconstructed image; mapping the probabilistic latent feature vector to an image reconstruction space of the reconstructed image to obtain predicted highest-layer Gaussian noise.
After the highest-layer noise is obtained, the highest-layer Gaussian noise is subtracted from the input maximum noise mask to obtain a predicted next-highest noise mask, and the predicted next-highest noise mask is compared with the recorded next-highest noise mask, so that the noise loss parameter of each segmentation mask at the highest-layer Gaussian noise can be determined. The final noise loss parameter can be obtained by adding the noise loss parameters of all segmentation masks. After the noise loss parameter of the highest-layer Gaussian noise and the other two loss parameters are obtained, the model can be optimized once on this basis. After the calculation of the noise loss parameters of the highest-layer Gaussian noise is completed, the loss parameters corresponding to the next level of Gaussian noise can be calculated on this basis to complete the subsequent iterative optimization of the model. The initial segmentation mask can be restored through the reverse training process corresponding to the number of times Gaussian noise was superimposed; after this training process, only noise conforming to a normal distribution and the image to be segmented need to be input in the subsequent inference process to obtain a plurality of probabilistic segmentation masks.
For the calculation of the other two loss parameters, the image reconstruction loss parameter corresponding to the highest layer of Gaussian noise can be obtained by comparing the historical image with the reconstructed image, and the importance weights and constraint weights of the probabilistic latent feature vectors in the different channels are acquired at the same time to obtain the weight distribution loss parameter corresponding to the highest layer of Gaussian noise. The definition of the three-part model loss function satisfies the following formula:

$L_{total} = L_{rec} + L_{noise} + L_{KL}$

The total model loss consists of three parts, where $L_{rec} = \lVert x - \hat{x} \rVert_2^2$ is the image reconstruction loss parameter used to measure the difference between the reconstructed and real images, with $x$ the original image and $\hat{x}$ the reconstructed image; $L_{noise} = \lVert \epsilon - D(z_q) \rVert_2^2 + \lVert \mathrm{sg}[E(x)] - z_q \rVert_2^2 + \beta \lVert E(x) - \mathrm{sg}[z_q] \rVert_2^2$ is the noise loss parameter, in which $\epsilon$ is the noise to be predicted, $z_q$ is the hidden vector of the potential space, $\mathrm{sg}[\cdot]$ is the stop-gradient operation, $E$ refers to the vector quantization variation self-encoder model and $D$ refers to the vector quantization variation self-decoder model; and $L_{KL} = \mathrm{KL}(p \,\Vert\, q)$ uses the KL divergence to constrain the distribution of the codebook weights, with $p$ the original weight value distribution and $q$ the learned weight value distribution.
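Under the assumption that the noise term takes the standard VQ-VAE form with stop-gradients as written above, the three-part loss can be sketched in code as follows; the loss weights beta and gamma are hypothetical hyperparameters.

```python
import torch
import torch.nn.functional as F

def total_loss(x, x_rec, eps_true, eps_pred, z_e, z_q, w_learned, w_prior,
               beta: float = 0.25, gamma: float = 1.0):
    """Combine the image reconstruction, noise and weight-distribution loss parameters."""
    l_rec = F.mse_loss(x_rec, x)                                      # image reconstruction loss
    l_noise = (F.mse_loss(eps_pred, eps_true)                         # noise prediction loss
               + F.mse_loss(z_q, z_e.detach())                        # codebook loss (stop-gradient on encoder output)
               + beta * F.mse_loss(z_e, z_q.detach()))                # commitment loss (stop-gradient on codebook vector)
    l_kl = F.kl_div(w_learned.log(), w_prior, reduction="batchmean")  # weight-distribution loss (relative entropy)
    return l_rec + l_noise + gamma * l_kl
```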
Regarding the conditional diffusion training process of the model, the main reason that the diffusion model needs to superimpose noise multiple times during training is related to the target distribution it models. A diffusion model is a generative model used to model the joint distribution of data. Its goal is to evolve gradually from a simple distribution (typically a Gaussian distribution) to the target distribution, thereby generating samples that conform to the target data distribution. The repeated superposition of noise is the core of this generation process: at each step, a small amount of noise is introduced so that the current data point gradually approaches the target distribution. This is achieved by repeatedly adding noise; after each noise addition, the data points are slightly adjusted in distribution so that the finally generated samples better conform to the target distribution.
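As an illustration of this repeated noise superposition, the forward process can be sketched as follows; the linear beta schedule and the closed-form reparameterisation are standard diffusion-model conventions adopted here as assumptions rather than details fixed by the application.

```python
import torch

def add_gaussian_noise(x0, t, betas):
    """Superimpose t + 1 steps of Gaussian noise on x0 using the closed-form forward process.

    betas: (T,) noise schedule, e.g. torch.linspace(1e-4, 0.02, T).
    t:     integer step index in [0, T).
    Returns the noised sample and the noise, which is recorded as the training target.
    """
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)[t]
    eps = torch.randn_like(x0)                  # small Gaussian noise introduced at this step
    xt = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps
    return xt, eps
```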
In one embodiment, the present application is applicable to a training process for medical images, where the training process may be illustrated with reference to fig. 7, and includes: a plurality of segmentation masks is established based on an original medical image; the n segmentation masks are then superimposed with T rounds of Gaussian noise, yielding n x T masks superimposed with noise of different degrees, and the noise added at each step is recorded. In the training process, the n segmentation masks superimposed with T rounds of noise are spliced with the corresponding medical image and input into the encoder. The encoder maps the joint representation of the spliced input image and mask into a potential space, obtaining a potential vector representation of the T-times-noised mask and the medical image. A nearest neighbor algorithm finds, according to a distance metric, the similar vector in the codebook closest to the input intermediate representation; the similar vector is processed under the potential space representation constraint, and the queried similar vector is placed at the position corresponding to the target domain hidden vector to be generated, giving the quantized target domain hidden vector. The decoder model then reconstructs the medical image on the basis of the target domain hidden vector and maps the predicted noise hidden vector back to the original image space to generate the predicted T-th superimposed noise; this predicted noise is subtracted from the n segmentation masks superimposed with T rounds of noise, predicting the segmentation masks superimposed with T-1 rounds of noise. Loss parameters are then calculated between the predicted T-1-noise segmentation masks and the real T-1-noise segmentation masks to reversely optimize the encoder model and the decoder model, and the original segmentation masks are recovered by repeating this T times, completing the training of the model. In this embodiment, the model training of the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model is completed by superimposing noise multiple times, so that the accuracy of subsequent image segmentation processing can be effectively improved.
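A minimal sketch of one such training pass is shown below; the `encoder`, `codebook` and `decoder` modules, their return values, and the use of squared-error losses at each noise level are assumptions, and comparing the predicted noise directly with the recorded noise is used here as a simplification of subtracting it from the noised mask and comparing with the mask at the previous step.

```python
import torch

def train_step(images, masks, encoder, codebook, decoder, betas, optimizer):
    """One conditional diffusion training pass over all T noise levels (illustrative sketch).

    images: (B, C, H, W) medical images; masks: (B, n, H, W) segmentation masks.
    """
    T = len(betas)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    total = 0.0
    for t in reversed(range(T)):                        # from the highest noise layer downwards
        eps = torch.randn_like(masks)                   # recorded noise for this level
        noisy = alpha_bar[t].sqrt() * masks + (1 - alpha_bar[t]).sqrt() * eps
        z = encoder(torch.cat([noisy, images], dim=1))  # splice masks with the image, then encode
        z_q, kl = codebook(z)                           # nearest-neighbour lookup + potential space constraint
        eps_pred, recon = decoder(z_q, images)          # predicted noise and reconstructed image
        loss = (((eps_pred - eps) ** 2).mean()
                + ((recon - images) ** 2).mean() + kl)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += float(loss)
    return total / T
```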
In an exemplary embodiment, optimizing the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model according to the noise loss parameter, the image reconstruction loss parameter and the weight distribution loss parameter to obtain the vector quantization variation self-encoder model and the vector quantization variation self-decoder model includes: optimizing the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model according to the noise loss parameter, the image reconstruction loss parameter and the weight distribution loss parameter to obtain a to-be-tested vector quantization variation self-encoder model and a to-be-tested vector quantization variation self-decoder model; acquiring a test image and the real segmentation results of the test image; inputting the test image into the to-be-tested vector quantization variation self-encoder model and the to-be-tested vector quantization variation self-decoder model to obtain test segmentation results; performing comparative analysis on the test segmentation results and the real segmentation results to obtain a similarity matching score, and computing the minimum variance and the maximum variance over the ground truth distribution of the test segmentation results and the real segmentation results; determining a diversity consistency score based on the minimum variance and the maximum variance; and obtaining a collective recognition score based on the similarity matching score and the diversity consistency score, and, when the collective recognition score is higher than a score threshold value, taking the to-be-tested vector quantization variation self-encoder model as the vector quantization variation self-encoder model and the to-be-tested vector quantization variation self-decoder model as the vector quantization variation self-decoder model.
The similarity matching score is a score for measuring the performance of a segmentation algorithm and is used to compare the similarity between a predicted segmentation result and a real label. The similarity matching score ranges from 0 to 1, where 0 indicates a complete mismatch and 1 indicates a complete match; the closer the Dice score is to 1, the more similar the segmentation result is to the real segmentation and the higher the segmentation quality. The diversity consistency score refers to the $D_a$ score, which is used to evaluate the diversity of the predicted segmentation results. The collective recognition score (Collective Insight Score, CI) is obtained from the similarity matching score and the diversity consistency score; this index takes into account the combined sensitivity, the maximum Dice matching score and the diversity consistency score. The CI score balances the weights of the components by taking the harmonic mean of each component.
Illustratively, the conditional vector quantized diffusion model (Conditional Vector Quantizer Diffusion Model, CVQDM) obtained by the present application is a probabilistic segmentation model that yields a predicted distribution rather than a deterministic result, and therefore needs to be evaluated against a distribution of ground truths. Although the generalized energy distance has previously been used to evaluate fuzzy segmentation models, this metric has been found to be inadequate because it disproportionately rewards sample diversity regardless of whether the samples match the real ones, which can be potentially dangerous, especially in pathological cases. In order to improve the accuracy of model evaluation, the application introduces the collective recognition score to evaluate the probabilistic segmentation model and judge whether the model is usable. This index comprehensively considers the combined sensitivity, the maximum Dice matching score and the diversity consistency score, and balances the weights of the components by taking their harmonic mean. The CI score has been demonstrated to evaluate the performance of fuzzy models more accurately. The formula is defined as follows:
$$CI = \frac{3}{\frac{1}{S_c} + \frac{1}{D_{max}} + \frac{1}{D_a}}$$

where $S_c$ is the combined sensitivity, $D_{max}$ is the maximum Dice matching score, and $D_a$ is the diversity consistency score. The maximum Dice matching score is defined as follows:

$$D_{max} = \frac{1}{|Y|} \sum_{y \in Y} \max_{\hat{y} \in \hat{Y}} \mathrm{Dice}(y, \hat{y})$$

where $\hat{Y}$ is the set of predicted segmentation masks, $Y$ is the set of real segmentation masks, $y$ is a single real segmentation mask, and $\hat{y}$ is a corresponding single predicted segmentation mask.
Therefore, when the trained model is evaluated, the real segmentation results and the test segmentation results of the test image can be obtained respectively, and the similarity matching score is obtained through comparative analysis of the test segmentation results and the real segmentation results. At the same time, the variances between all pairs in the ground truth distribution of a single input image are calculated, and the minimum and maximum variances are taken; the minimum variance is defined here as $V_{min}^{gt}$ and the maximum variance as $V_{max}^{gt}$. Similarly, the variances between all pairs in the predicted distribution of the input are taken, with the minimum and maximum variances defined as $V_{min}^{pred}$ and $V_{max}^{pred}$. For a particular input, the difference between the minimum variances of the ground truth and predicted distributions can be expressed as $\Delta_{min} = \lvert V_{min}^{gt} - V_{min}^{pred}\rvert$; similarly, the difference between the maximum variances is denoted $\Delta_{max} = \lvert V_{max}^{gt} - V_{max}^{pred}\rvert$. Finally, the diversity agreement $D_a$ is defined as:

$$D_a = 1 - \frac{\Delta_{max} + \Delta_{min}}{2}$$
Substituting the calculated similarity matching score and diversity consistency score into the formula of the collective recognition score yields the collective recognition score. When the collective recognition score is higher than the score threshold value, the to-be-tested vector quantization variation self-encoder model is taken as the vector quantization variation self-encoder model and the to-be-tested vector quantization variation self-decoder model is taken as the vector quantization variation self-decoder model; otherwise, training of the model continues until the obtained model passes the evaluation of the collective recognition score. In this embodiment, the testing of the vector quantization variation self-encoder model and the vector quantization variation self-decoder model obtained by training is completed through the collective recognition score, and a model with excellent segmentation performance can be effectively selected for subsequent image segmentation processing, thereby ensuring the accuracy of the image segmentation processing.
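A sketch of how these evaluation quantities could be computed is given below; the union-based form of the combined sensitivity and the interpretation of the pairwise variances as mean squared differences between masks are assumptions made to keep the example self-contained.

```python
import torch

def dice(a, b, eps=1e-6):
    """Dice score between two binary masks (float tensors in {0., 1.})."""
    inter = (a * b).sum()
    return (2 * inter + eps) / (a.sum() + b.sum() + eps)

def collective_insight(pred_masks, gt_masks, eps=1e-6):
    """Collective recognition (CI) score sketch: harmonic mean of S_c, D_max and D_a.

    pred_masks: (P, H, W) predicted masks; gt_masks: (G, H, W) real masks; both need >= 2 masks.
    """
    # combined sensitivity (assumed here as union-vs-union sensitivity)
    pred_union, gt_union = pred_masks.amax(dim=0), gt_masks.amax(dim=0)
    tp = (pred_union * gt_union).sum()
    fn = ((1 - pred_union) * gt_union).sum()
    s_c = tp / (tp + fn + eps)

    # maximum Dice matching score: best-matching prediction for each real mask
    d_max = torch.stack([max(dice(g, p) for p in pred_masks) for g in gt_masks]).mean()

    # minimum / maximum pairwise variance within one distribution of masks
    def min_max_var(masks):
        pair_vars = torch.stack([((masks[i] - masks[j]) ** 2).mean()
                                 for i in range(len(masks))
                                 for j in range(i + 1, len(masks))])
        return pair_vars.min(), pair_vars.max()

    v_min_gt, v_max_gt = min_max_var(gt_masks)
    v_min_pr, v_max_pr = min_max_var(pred_masks)
    d_a = 1 - ((v_max_gt - v_max_pr).abs() + (v_min_gt - v_min_pr).abs()) / 2

    return 3.0 / (1.0 / (s_c + eps) + 1.0 / (d_max + eps) + 1.0 / (d_a + eps))
```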
In an exemplary embodiment, the target image comprises a medical image. The method further comprises the steps of: performing medical image recognition processing based on the image segmentation result to generate a medical image processing result; and feeding back a medical image processing result.
The image segmentation method of the present application can be applied to the field of image processing in particular to accomplish probabilistic segmentation processing of an original medical image, for example. At this time, corresponding medical image recognition can be performed for each probabilistic image segmentation result, the specific content of the region of interest obtained by image segmentation is determined, a plurality of medical image processing results are recognized, and then the medical image processing results are fed back to specific medical image processing personnel, and the medical image processing personnel perform processes such as focus recognition and medical diagnosis. In one embodiment, the application is applied to the processing of medical images of lung lesions, where the model training may be performed using a lung lesion segmentation (LIDC-IDRI) dataset. The data set available for this disclosure contains 1018 lung CT scans from 1010 subjects, with manual annotations by four field specialists, making it a good representation of typical CT image ambiguities. A total of 12 radiologists provided an annotation mask for the dataset. Using the dataset after the second reading, the training set contained 13511 lesion images and the test set contained 1585 lesion images, with 4 expert hierarchies. In another embodiment, the application is applicable to the processing of medical images of bone surfaces, where a bone surface segmentation (B-US) dataset may be selected for model training, using a 2D C5-2/60 curve probe and an L14-5 linear probe to collect scan results from a subject. The depth setting and image resolution vary between 3-8 cm and 0.12-0.19 mm, respectively. All the collected scan images were manually segmented by an ultrasound expert and three novices trained in bone segmentation. The training set contains 1769 bone ultrasound scans and the test set contains 211 bone ultrasound scans. In this embodiment, the image recognition processing and the feedback processing of the image segmentation result can more efficiently feed back the recognition result of the probabilistic segmentation to the medical image processing personnel, and the medical image processing personnel can perform the processing such as the subsequent focus recognition and the medical diagnosis, thereby improving the processing efficiency.
The application also provides an application scenario, which is described below as an example of applying the above image segmentation method; in this scenario, the image segmentation method specifically includes the following content:
When a user needs to segment lung lesion positions of interest from acquired medical images so as to assist in identifying and diagnosing related lesions, the image segmentation processing of the medical images can be completed by the image segmentation method provided by the application, thereby obtaining a plurality of probabilistic image segmentation results.
First, the training of the model needs to be completed. The user selects relevant lung images to construct a training set and a test set according to the diagnosis requirements for lung lesions, and the training and testing of the model are then completed through the training set and the test set. The server carrying the image segmentation method first acquires the training set data submitted or selected by the target object, then constructs a plurality of segmentation masks corresponding to each image in the training set, sequentially applies Gaussian noise a different number of times to each segmentation mask to obtain the noise mask of each segmentation mask under the influence of the different numbers of Gaussian noise applications, and performs splicing processing on the maximum noise mask of each segmentation mask and the corresponding training image to obtain the mask spliced image of each segmentation mask; feature coding processing is performed on the mask spliced images through the initial vector quantization variation self-encoder model, and the potential space representation constraint is applied to the result of the feature coding processing to obtain the probabilistic potential feature vector corresponding to each mask spliced image; the probabilistic potential feature vector is decoded through the initial vector quantization variation self-decoder model to determine the predicted highest-layer Gaussian noise; the predicted next-highest noise mask of each segmentation mask is constructed based on the maximum noise mask and the highest-layer Gaussian noise of each segmentation mask; and the noise loss parameter of the highest-layer Gaussian noise is determined based on the predicted next-highest noise mask of each segmentation mask and the next-highest noise mask of each segmentation mask. Meanwhile, the input training image and the reconstructed image of the decoder can be compared to obtain the image reconstruction loss parameter corresponding to the highest-layer Gaussian noise; and, for the process of the potential space representation constraint, the importance weights and constraint weights of the probabilistic potential feature vectors in different channels are acquired to obtain the weight distribution loss parameter corresponding to the highest-layer Gaussian noise.
According to the noise loss parameter, the image reconstruction loss parameter and the weight distribution loss parameter, the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model are optimized to obtain a to-be-tested vector quantization variation self-encoder model and a to-be-tested vector quantization variation self-decoder model; a test image in the test set and the real segmentation results of the test image are acquired; the test image is input into the to-be-tested vector quantization variation self-encoder model and the to-be-tested vector quantization variation self-decoder model to obtain test segmentation results; comparative analysis is performed on the test segmentation results and the real segmentation results to obtain a similarity matching score, and the minimum variance and the maximum variance are computed over the ground truth distribution of the test segmentation results and the real segmentation results; a diversity consistency score is determined based on the minimum variance and the maximum variance; and a collective recognition score is obtained based on the similarity matching score and the diversity consistency score, and, when the collective recognition score is higher than the score threshold value, the to-be-tested vector quantization variation self-encoder model is taken as the vector quantization variation self-encoder model and the to-be-tested vector quantization variation self-decoder model is taken as the vector quantization variation self-decoder model.
After the model passes the test, actual image segmentation processing can be performed with the trained model. At this time, the lung lesion image to be processed and a Gaussian noise image can be input together into the server carrying the image segmentation method; then, with the target image as condition information, vector encoding processing is performed on the Gaussian noise image through the vector quantization variation self-encoder model to obtain an intermediate characterization vector; the similar vector of the intermediate characterization vector is found in the embedding space through a nearest neighbor algorithm; the similar vector is compressed into a channel dimension vector through adaptive average pooling; convolution processing is performed on the channel dimension vector, and the importance weights of different channels in the channel dimension vector are obtained based on the convolution processing result; a potential space constraint is applied to the importance weights of the different channels based on relative entropy to obtain the constraint weights of the different channels; a plurality of probabilistic target similar vectors is obtained based on the constraint weights of the different channels and the channel dimension vector; and the probabilistic potential feature vector corresponding to each probabilistic target similar vector is constructed. The plurality of probabilistic potential feature vectors is converted to obtain a plurality of prediction noise hidden vectors; with the target image as condition information, each prediction noise hidden vector is decoded through the vector quantization variation self-decoder model, and the prediction noise corresponding to each prediction noise hidden vector is determined. Finally, based on the prediction noise, the Gaussian noise image is denoised through back diffusion to obtain the image segmentation result corresponding to each prediction noise.
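A minimal sketch of the channel-weighting constraint described above is shown below; the 1x1 convolution, the sigmoid/softmax normalisation and the uniform prior used for the relative-entropy term are illustrative assumptions, as the application does not fix these operators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PotentialSpaceConstraint(nn.Module):
    """Sketch of the potential space representation constraint applied to the similar vector."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)   # assumed 1x1 convolution

    def forward(self, similar_vec):
        # compress the similar vector into a channel dimension vector
        pooled = F.adaptive_avg_pool2d(similar_vec, 1)              # (B, C, 1, 1)
        # convolution on the channel dimension vector -> importance weights of the channels
        importance = torch.sigmoid(self.conv(pooled))
        # constraint weights: normalised importance weights, penalised by their
        # relative entropy (KL divergence) against a uniform prior during training
        constraint = F.softmax(importance.flatten(1), dim=1).view_as(importance)
        prior = torch.full_like(constraint, 1.0 / constraint.shape[1])
        kl_penalty = F.kl_div(constraint.flatten(1).log(), prior.flatten(1),
                              reduction="batchmean")
        # weight the channel dimension vector with the constraint weights to obtain
        # one probabilistic target similar vector
        return similar_vec * constraint, kl_penalty
```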
In one embodiment, the image segmentation method of the present application specifically includes the steps of:
Step 801: a target image to be segmented and a Gaussian noise image are acquired. Step 803: with the target image as condition information, vector encoding processing is performed on the Gaussian noise image through the vector quantization variation self-encoder model to obtain an intermediate characterization vector. Step 805: the similar vector of the intermediate characterization vector is found in the embedding space through a nearest neighbor algorithm. Step 807: the similar vector is compressed into a channel dimension vector through adaptive average pooling. Step 809: convolution processing is performed on the channel dimension vector, and the importance weights of different channels in the channel dimension vector are obtained based on the convolution processing result. Step 811: a potential space constraint is applied to the importance weights of the different channels based on relative entropy to obtain the constraint weights of the different channels. Step 813: a plurality of probabilistic target similar vectors is obtained based on the constraint weights of the different channels and the channel dimension vector, and the corresponding probabilistic potential feature vectors are constructed. Step 815: the plurality of probabilistic potential feature vectors is converted to obtain a plurality of prediction noise hidden vectors. Step 817: with the target image as condition information, each prediction noise hidden vector is decoded through the vector quantization variation self-decoder model, and the prediction noise corresponding to each prediction noise hidden vector is determined. Step 819: based on the prediction noise, the Gaussian noise image is denoised through back diffusion to obtain the image segmentation result corresponding to each prediction noise.
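The full inference flow can be sketched as follows; the `encoder`, `codebook` and `decoder` modules and the DDPM-style update rule are placeholders and assumptions used to make the back-diffusion loop concrete, not the exact implementation of steps 801 to 819.

```python
import torch

@torch.no_grad()
def segment(image, encoder, codebook, decoder, betas, num_samples=4):
    """Inference sketch: denoise a Gaussian noise image by back diffusion,
    conditioned on the target image, to obtain several probabilistic segmentations."""
    T = len(betas)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    results = []
    for _ in range(num_samples):                          # one probabilistic result per sample
        x = torch.randn(1, 1, *image.shape[-2:])          # Gaussian noise image
        for t in reversed(range(T)):
            z = encoder(torch.cat([x, image], dim=1))     # target image as condition information
            z_q = codebook(z)                             # nearest-neighbour lookup + constraint
            eps = decoder(z_q, image)                     # prediction noise for this step
            # one back-diffusion (denoising) step
            x = (x - (1 - alphas[t]) / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
            if t > 0:
                x = x + betas[t].sqrt() * torch.randn_like(x)
        results.append(x)
    return results
```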
It should be understood that, although the steps in the flowcharts related to the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or multiple stages, which are not necessarily performed at the same moment but may be performed at different moments, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with at least some of the other steps or with sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides an image segmentation device for realizing the above related image segmentation method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in one or more embodiments of the image segmentation device provided below may refer to the limitation of the image segmentation method hereinabove, and will not be repeated herein.
In an exemplary embodiment, as shown in fig. 9, there is provided an image segmentation apparatus including:
A data acquisition module 902, configured to acquire a target image to be segmented and a gaussian noise image.
The encoding processing module 904 is configured to perform feature encoding processing on the gaussian noise image by using the target image as condition information and using the vector quantization variation self-encoder model, and apply a potential spatial representation constraint on a result of the feature encoding processing, so as to obtain a plurality of probabilistic potential feature vectors of the target image.
The decoding processing module 906 is configured to perform feature decoding processing on the probabilistic potential feature vectors by using the target image as condition information and using the vector quantization variation self-decoder model to obtain a plurality of prediction noise; the vector quantization variation self-encoder model and the vector quantization variation self-decoder model are obtained by performing conditional diffusion training based on historical images.
The image segmentation module 908 is configured to perform denoising processing on the gaussian noise image by back diffusion based on the prediction noise, so as to obtain an image segmentation result corresponding to each prediction noise.
In one embodiment, the encoding processing module 904 is specifically configured to: taking the target image as conditional information, carrying out vector coding processing on the Gaussian noise image through a vector quantization variation self-coder model to obtain an intermediate characterization vector; searching a similar vector of the intermediate characterization vector in an embedded space through a nearest neighbor algorithm; applying potential space representation constraint to the similarity vectors to obtain a plurality of probabilistic target similarity vectors; and constructing a probabilistic potential feature vector corresponding to each probabilistic target similarity vector.
In one embodiment, the encoding processing module 904 is specifically configured to: compressing the similar vector into a channel dimension vector through adaptive averaging pooling; carrying out convolution processing on the channel dimension vector, and obtaining importance weights of different channels in the channel dimension vector based on the convolution processing result; applying potential space constraint to importance weights of different channels based on relative entropy to obtain constraint weights of different channels; and obtaining a plurality of probabilistic target similarity vectors based on the constraint weights of different channels and the channel dimension vectors.
In one embodiment, the decoding processing module 906 is specifically configured to: converting the multiple probabilistic potential feature vectors to obtain multiple prediction noise hidden vectors; and taking the target image as conditional information, decoding each prediction noise hidden vector through a vector quantization variation self-decoder model, and determining the prediction noise corresponding to each prediction noise hidden vector.
In one embodiment, the method further comprises a diffusion model training module for: acquiring a plurality of segmentation masks of a history image; and based on the segmentation mask, performing conditional diffusion training on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model to obtain a vector quantization variation self-encoder model and a vector quantization variation self-decoder model.
In one embodiment, the diffusion model training module is specifically configured to: sequentially applying Gaussian noise of different times to each divided mask to obtain a noise mask of each divided mask under the influence of the Gaussian noise of different times; based on noise masks of each segmentation mask under the influence of Gaussian noise of different times, sequentially carrying out conditional diffusion treatment on an initial vector quantization variation self-encoder model and an initial vector quantization variation self-decoder model to obtain noise loss parameters, image reconstruction loss parameters and weight distribution loss parameters which are respectively corresponding to the Gaussian noise of different times; and according to the noise loss parameter, the image reconstruction loss parameter and the weight distribution loss parameter, carrying out optimization treatment on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model to obtain a vector quantization variation self-encoder model and a vector quantization variation self-decoder model.
In one embodiment, the diffusion model training module is specifically configured to: splicing the maximum noise mask of each division mask with the historical image to obtain a mask spliced image of each division mask, wherein the maximum noise mask is the noise mask with the maximum Gaussian noise; performing feature coding processing on the mask spliced images through an initial vector quantization variation self-encoder model, and applying potential space representation constraint to the result of the feature coding processing to obtain probability potential feature vectors corresponding to the mask spliced images; decoding the probabilistic latent feature vector through an initial vector quantization variation self-decoder model to determine predicted highest-layer Gaussian noise; constructing a predicted next-highest noise mask of each division mask based on the maximum noise mask and the highest Gaussian noise of each division mask; a noise loss parameter of the highest layer Gaussian noise is determined based on the predicted next highest noise mask for each split mask and the next highest noise mask for each split mask.
In one embodiment, the diffusion model training module is specifically configured to: performing image reconstruction processing on the historical image through an initial vector quantization variation self-decoder model to obtain a reconstructed image; mapping the probabilistic latent feature vector to an image reconstruction space of the reconstructed image to obtain predicted highest-layer Gaussian noise.
In one embodiment, the diffusion model training module is specifically configured to: comparing the historical image with the reconstructed image to obtain an image reconstruction loss parameter corresponding to the highest Gaussian noise; and acquiring importance weights and constraint weights of the probabilistic potential feature vectors in different channels to obtain weight distribution loss parameters corresponding to the Gaussian noise of the highest layer.
In one embodiment, the diffusion model training module is specifically configured to: according to the noise loss parameter, the image reconstruction loss parameter and the weight distribution loss parameter, carrying out optimization treatment on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model to obtain a to-be-detected vector quantization variation self-encoder model and a to-be-detected vector quantization variation self-decoder model; acquiring a test image and a real segmentation result of the test image; inputting the test image into a to-be-tested vector quantization variation self-encoder model and a to-be-tested vector quantization variation self-decoder model to obtain a test segmentation result; performing contrast analysis on the test segmentation result and the real segmentation result to obtain similarity matching scores, and performing ground truth value distribution on the test segmentation result and the real segmentation result on the minimum variance and the maximum variance; determining a diversity consensus score based on the minimum variance and the maximum variance; and obtaining a collective recognition score based on the similarity matching score and the diversity consistency score, and taking the vector quantization variation to be tested and the encoder model and the decoder model as the vector quantization variation self-encoder model and the vector quantization variation self-decoder model when the collective recognition score is higher than a score threshold value.
In one embodiment, the target image comprises a medical image. The apparatus further comprises a medical image processing module for: performing medical image recognition processing based on the image segmentation result to generate a medical image processing result; and feeding back a medical image processing result.
The respective modules in the above-described image dividing apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one exemplary embodiment, a computer device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 10. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing image segmentation related data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an image segmentation method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are both information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to meet the related regulations.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magneto-resistive random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (PHASE CHANGE Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (23)

1. An image segmentation method, the method comprising:
acquiring a target image to be segmented and a Gaussian noise image;
taking the target image as conditional information, and carrying out vector coding processing on the Gaussian noise image through a vector quantization variation self-coder model to obtain an intermediate characterization vector; searching the similarity vector of the intermediate characterization vector in an embedded space through a nearest neighbor algorithm; applying potential space representation constraint to the similarity vectors to obtain a plurality of probabilistic target similarity vectors; constructing a probabilistic potential feature vector corresponding to each probabilistic target similarity vector;
Taking the target image as conditional information, carrying out feature decoding processing on the probabilistic latent feature vector through a vector quantization variation self-decoder model to obtain a plurality of prediction noise; the vector quantization variation self-encoder model and the vector quantization variation self-decoder model are obtained by performing conditional diffusion training based on historical images;
and denoising the Gaussian noise image by back diffusion based on the prediction noise to obtain an image segmentation result corresponding to each prediction noise.
2. The method of claim 1, wherein the applying a potential spatial representation constraint to the similarity vectors to obtain a plurality of probabilistic target similarity vectors comprises:
compressing the similarity vector into a channel dimension vector through adaptive averaging pooling;
carrying out convolution processing on the channel dimension vector, and obtaining importance weights of different channels in the channel dimension vector based on a convolution processing result;
Applying potential space constraint to the importance weights of the different channels based on the relative entropy to obtain constraint weights of the different channels;
and obtaining a plurality of probabilistic target similar vectors based on the constraint weights of the different channels and the channel dimension vectors.
3. The method of claim 1, wherein performing feature decoding processing on the probabilistic latent feature vectors using the target image as conditional information and a vector quantization variation self-decoder model to obtain a plurality of prediction noise comprises:
converting the plurality of probabilistic latent feature vectors to obtain a plurality of prediction noise hidden vectors;
and taking the target image as conditional information, decoding each prediction noise hidden vector through a vector quantization variation self-decoder model, and determining the prediction noise corresponding to each prediction noise hidden vector.
4. The method according to claim 1, wherein the method further comprises:
Acquiring a plurality of segmentation masks of a history image;
and based on the segmentation mask, performing conditional diffusion training on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model to obtain a vector quantization variation self-encoder model and a vector quantization variation self-decoder model.
5. The method of claim 4, wherein the conditional diffusion training of the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model based on the segmentation mask to obtain the vector quantization variation self-encoder model and the vector quantization variation self-decoder model comprises:
Sequentially applying Gaussian noise for different times to each divided mask to obtain a noise mask of each divided mask under the influence of the Gaussian noise for different times;
Based on the noise masks of each segmentation mask under the influence of Gaussian noise of different times, sequentially carrying out conditional diffusion treatment on an initial vector quantization variation self-encoder model and an initial vector quantization variation self-decoder model to obtain noise loss parameters, image reconstruction loss parameters and weight distribution loss parameters corresponding to the Gaussian noise of different times;
And according to the noise loss parameter, the image reconstruction loss parameter and the weight distribution loss parameter, carrying out optimization processing on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model to obtain a vector quantization variation self-encoder model and a vector quantization variation self-decoder model.
6. The method of claim 5, wherein sequentially performing conditional diffusion processing on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model based on the noise masks of each segmentation mask under the influence of different times of gaussian noise, to obtain the noise loss parameters corresponding to the different times of gaussian noise respectively comprises:
Performing stitching processing on the maximum noise mask of each segmentation mask and the historical image to obtain mask stitching images of each segmentation mask, wherein the maximum noise mask is the noise mask with the maximum Gaussian noise;
Performing feature coding processing on the mask spliced images through an initial vector quantization variation self-encoder model, and applying potential space representation constraint to the result of the feature coding processing to obtain probability potential feature vectors corresponding to each mask spliced image;
decoding the probabilistic latent feature vector through an initial vector quantization variation self-decoder model to obtain predicted highest-layer Gaussian noise;
constructing a predicted next highest noise mask for each split mask based on the maximum noise mask and the highest gaussian noise for each split mask;
a noise loss parameter for the highest layer gaussian noise is determined based on the predicted next highest noise mask for each split mask and the next highest noise mask for each split mask.
7. The method of claim 6, wherein decoding the probabilistic latent feature vector via an initial vector quantization variation from a decoder model to determine a predicted highest-level gaussian noise comprises:
performing image reconstruction processing on the historical image through an initial vector quantization variation self-decoder model to obtain a reconstructed image;
and mapping the probabilistic latent feature vector to an image reconstruction space of the reconstructed image to obtain predicted highest-layer Gaussian noise.
8. The method of claim 7, wherein the method further comprises:
Comparing the historical image with the reconstructed image to obtain an image reconstruction loss parameter corresponding to the highest Gaussian noise;
and acquiring importance weights and constraint weights of the probabilistic latent feature vectors in different channels to obtain weight distribution loss parameters corresponding to the Gaussian noise of the highest layer.
9. The method of claim 5, wherein optimizing the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model based on the noise loss parameter, the image reconstruction loss parameter, and the weight distribution loss parameter comprises:
According to the noise loss parameter, the image reconstruction loss parameter and the weight distribution loss parameter, carrying out optimization treatment on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model to obtain a to-be-detected vector quantization variation self-encoder model and a to-be-detected vector quantization variation self-decoder model;
acquiring a test image and a real segmentation result of the test image;
Inputting the test image into the to-be-tested vector quantization variation self-encoder model and the to-be-tested vector quantization variation self-decoder model to obtain a test segmentation result;
performing comparative analysis on the test segmentation result and the real segmentation result to obtain a similarity matching score, and performing ground truth value distribution on the test segmentation result and the real segmentation result on a minimum variance and a maximum variance;
Determining a diversity consensus score based on the minimum variance and the maximum variance;
And obtaining a collective recognition score based on the similarity matching score and the diversity consistency score, and when the collective recognition score is higher than a score threshold value, taking the vector quantization variation to be tested as a vector quantization variation self-encoder model, and taking the vector quantization variation to be tested as a vector quantization variation self-decoder model.
10. The method according to any one of claims 1 to 9, wherein the target image comprises a medical image;
The method further comprises the steps of:
Performing medical image recognition processing based on the image segmentation result to generate a medical image processing result;
and feeding back the medical image processing result.
11. An image segmentation apparatus, the apparatus comprising:
the data acquisition module is used for acquiring a target image to be segmented and a Gaussian noise image;
The coding processing module is used for carrying out vector coding processing on the Gaussian noise image by taking the target image as conditional information and using a vector quantization variation self-coder model to obtain an intermediate representation vector; searching the similarity vector of the intermediate characterization vector in an embedded space through a nearest neighbor algorithm; applying potential space representation constraint to the similarity vectors to obtain a plurality of probabilistic target similarity vectors; constructing a probabilistic potential feature vector corresponding to each probabilistic target similarity vector;
The decoding processing module is used for carrying out feature decoding processing on the probabilistic latent feature vector through a vector quantization variation self-decoder model by taking the target image as conditional information to obtain a plurality of prediction noise; the vector quantization variation self-encoder model and the vector quantization variation self-decoder model are obtained by performing conditional diffusion training based on historical images;
And the image segmentation module is used for carrying out denoising processing on the Gaussian noise image through back diffusion based on the prediction noise to obtain an image segmentation result corresponding to each prediction noise.
12. The apparatus of claim 11, wherein the encoding processing module is specifically configured to: compressing the similarity vector into a channel dimension vector through adaptive averaging pooling; carrying out convolution processing on the channel dimension vector, and obtaining importance weights of different channels in the channel dimension vector based on a convolution processing result; applying potential space constraint to the importance weights of the different channels based on the relative entropy to obtain constraint weights of the different channels; and obtaining a plurality of probabilistic target similar vectors based on the constraint weights of the different channels and the channel dimension vectors.
13. The apparatus of claim 11, wherein the decoding processing module is specifically configured to: converting the plurality of probabilistic latent feature vectors to obtain a plurality of prediction noise hidden vectors; and taking the target image as conditional information, decoding each prediction noise hidden vector through a vector quantization variation self-decoder model, and determining the prediction noise corresponding to each prediction noise hidden vector.
14. The apparatus of claim 11, further comprising a diffusion model training module to: acquiring a plurality of segmentation masks of a history image; and based on the segmentation mask, performing conditional diffusion training on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model to obtain a vector quantization variation self-encoder model and a vector quantization variation self-decoder model.
15. The apparatus of claim 14, wherein the diffusion model training module is specifically configured to: sequentially applying Gaussian noise for different times to each divided mask to obtain a noise mask of each divided mask under the influence of the Gaussian noise for different times; based on the noise masks of each segmentation mask under the influence of Gaussian noise of different times, sequentially carrying out conditional diffusion treatment on an initial vector quantization variation self-encoder model and an initial vector quantization variation self-decoder model to obtain noise loss parameters, image reconstruction loss parameters and weight distribution loss parameters corresponding to the Gaussian noise of different times; and according to the noise loss parameter, the image reconstruction loss parameter and the weight distribution loss parameter, carrying out optimization processing on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model to obtain a vector quantization variation self-encoder model and a vector quantization variation self-decoder model.
16. The apparatus of claim 15, wherein the diffusion model training module is specifically configured to: performing stitching processing on the maximum noise mask of each segmentation mask and the historical image to obtain mask stitching images of each segmentation mask, wherein the maximum noise mask is the noise mask with the maximum Gaussian noise; performing feature coding processing on the mask spliced images through an initial vector quantization variation self-encoder model, and applying potential space representation constraint to the result of the feature coding processing to obtain probability potential feature vectors corresponding to each mask spliced image; decoding the probabilistic latent feature vector through an initial vector quantization variation self-decoder model to obtain predicted highest-layer Gaussian noise; constructing a predicted next highest noise mask for each split mask based on the maximum noise mask and the highest gaussian noise for each split mask; a noise loss parameter for the highest layer gaussian noise is determined based on the predicted next highest noise mask for each split mask and the next highest noise mask for each split mask.
17. The apparatus of claim 16, wherein the diffusion model training module is specifically configured to: performing image reconstruction processing on the historical image through an initial vector quantization variation self-decoder model to obtain a reconstructed image; and mapping the probabilistic latent feature vector to an image reconstruction space of the reconstructed image to obtain predicted highest-layer Gaussian noise.
18. The apparatus of claim 17, wherein the diffusion model training module is specifically configured to: comparing the historical image with the reconstructed image to obtain an image reconstruction loss parameter corresponding to the highest Gaussian noise; and acquiring importance weights and constraint weights of the probabilistic latent feature vectors in different channels to obtain weight distribution loss parameters corresponding to the Gaussian noise of the highest layer.
19. The apparatus of claim 15, wherein the diffusion model training module is specifically configured to: according to the noise loss parameter, the image reconstruction loss parameter and the weight distribution loss parameter, carrying out optimization treatment on the initial vector quantization variation self-encoder model and the initial vector quantization variation self-decoder model to obtain a to-be-detected vector quantization variation self-encoder model and a to-be-detected vector quantization variation self-decoder model; acquiring a test image and a real segmentation result of the test image; inputting the test image into the to-be-tested vector quantization variation self-encoder model and the to-be-tested vector quantization variation self-decoder model to obtain a test segmentation result; performing comparative analysis on the test segmentation result and the real segmentation result to obtain a similarity matching score, and performing ground truth value distribution on the test segmentation result and the real segmentation result on a minimum variance and a maximum variance; determining a diversity consensus score based on the minimum variance and the maximum variance; and obtaining a collective recognition score based on the similarity matching score and the diversity consistency score, and when the collective recognition score is higher than a score threshold value, taking the vector quantization variation to be tested as a vector quantization variation self-encoder model, and taking the vector quantization variation to be tested as a vector quantization variation self-decoder model.
20. The apparatus of any one of claims 11 to 19, wherein the target image comprises a medical image; the apparatus further comprises a medical image processing module for: performing medical image recognition processing based on the image segmentation result to generate a medical image processing result; and feeding back the medical image processing result.
21. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 10 when the computer program is executed.
22. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 10.
23. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 10.
CN202410393276.4A 2024-04-02 2024-04-02 Image segmentation method, device, computer equipment and storage medium Active CN117974693B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410393276.4A CN117974693B (en) 2024-04-02 2024-04-02 Image segmentation method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410393276.4A CN117974693B (en) 2024-04-02 2024-04-02 Image segmentation method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117974693A CN117974693A (en) 2024-05-03
CN117974693B true CN117974693B (en) 2024-06-25

Family

ID=90851751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410393276.4A Active CN117974693B (en) 2024-04-02 2024-04-02 Image segmentation method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117974693B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958554A (en) * 2023-08-01 2023-10-27 长江时代通信股份有限公司 Semi-supervised segmentation method based on potential diffusion model and multistage context cross consistency
CN117521672A (en) * 2023-12-22 2024-02-06 湖南大学 Method for generating continuous pictures by long text based on diffusion model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB202111188D0 (en) * 2021-08-03 2021-09-15 Deep Render Ltd Dr big book 5

Also Published As

Publication number Publication date
CN117974693A (en) 2024-05-03

Similar Documents

Publication Publication Date Title
Yu et al. Tensorizing GAN with high-order pooling for Alzheimer’s disease assessment
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN110211045B (en) Super-resolution face image reconstruction method based on SRGAN network
US9730643B2 (en) Method and system for anatomical object detection using marginal space deep neural networks
CN112767554B (en) Point cloud completion method, device, equipment and storage medium
CN111738363B (en) Alzheimer disease classification method based on improved 3D CNN network
CN111242948B (en) Image processing method, image processing device, model training method, model training device, image processing equipment and storage medium
CN111210382B (en) Image processing method, image processing device, computer equipment and storage medium
CN112132878B (en) End-to-end brain nuclear magnetic resonance image registration method based on convolutional neural network
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
Feng et al. Generative memory-guided semantic reasoning model for image inpainting
CN113592769B (en) Abnormal image detection and model training method, device, equipment and medium
CN113208641A (en) Pulmonary nodule auxiliary diagnosis method based on three-dimensional multi-resolution attention capsule network
CN117974693B (en) Image segmentation method, device, computer equipment and storage medium
CN114820636A (en) Three-dimensional medical image segmentation model and training method and application thereof
Xiang et al. Pseudo light field image and 4D Wavelet-transform-based reduced-reference light field image quality assessment
Lin et al. Ml-capsnet meets vb-di-d: A novel distortion-tolerant baseline for perturbed object recognition
CN114283301A (en) Self-adaptive medical image classification method and system based on Transformer
CN114841887A (en) Image restoration quality evaluation method based on multi-level difference learning
Yang et al. Fine-grained image quality caption with hierarchical semantics degradation
Wei et al. Cross-mapping net: Unsupervised change detection from heterogeneous remote sensing images using a transformer network
CN113569867A (en) Image processing method and device, computer equipment and storage medium
Molnár et al. Variational autoencoders for 3D data processing
CN117635962B (en) Multi-frequency fusion-based channel attention image processing method
Javidian et al. Multiple hallucinated deep network for image quality assessment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant