CN117710677A - Heart fat segmentation method based on language prompt - Google Patents

Heart fat segmentation method based on language prompt

Info

Publication number
CN117710677A
CN117710677A
Authority
CN
China
Prior art keywords
image
sequence
embedding
heart fat
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311802073.8A
Other languages
Chinese (zh)
Inventor
叶胡平
洪义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202311802073.8A priority Critical patent/CN117710677A/en
Publication of CN117710677A publication Critical patent/CN117710677A/en
Pending legal-status Critical Current

Links

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of medical data analysis and discloses a heart fat segmentation method based on language prompt, comprising the following steps. Step S1: acquire a multi-mode image and extract heart fat image features with a visual model. Step S2: convert the labels into text descriptions, extract the segmentation-intention features of the text descriptions with an existing large language model, and convert them into task prompt vectors that subsequent modules can understand. Step S3: construct a decoding network in which the input heart fat image encoding sequence is comprehensively decoded according to the requirements of the task prompt vector to obtain the segmentation result. The method mainly addresses the small quantity of heart fat data and the fact that heart fat is uncertain in shape, small in volume and very similar in brightness to surrounding tissues, and it has the advantages of reduced cost, improved performance, high flexibility, the ability to handle incomplete data, and reuse of pre-training.

Description

Heart fat segmentation method based on language prompt
Technical Field
The invention relates to the technical field of medical data analysis, in particular to a heart fat segmentation method based on language prompt.
Background
Heart fat comprises epicardial adipose tissue (Epicardial Adipose Tissue, EAT) and pericardial adipose tissue (Pericardial Adipose Tissue, PAT). The pericardium is a protective layer wrapping the heart and consists of the serous pericardium on the inner side and the fibrous pericardium on the outer side; the inner serous pericardium is further divided into a visceral layer and a parietal layer. EAT is the adipose tissue located around the visceral layer of the pericardium; PAT is the adipose tissue located in the region around the parietal layer. The size and shape of heart fat are closely related to a variety of cardiovascular diseases. Cardiovascular diseases account for about one third of deaths each year, so their clinical research has important social value. EAT lies closer to the heart than PAT, and there is evidence that it plays a direct role in causing coronary atherosclerosis and cardiomyopathy. Its thickness has been shown to be associated with metabolic syndrome and coronary artery disease, as well as with the progression of coronary calcification; its volume and density are associated with major adverse cardiac events in asymptomatic subjects. The amount of EAT is also considered a good marker for cardiovascular disease. In addition, EAT is associated with visceral adiposity, cardiac morphology, liver enzymes, insulin resistance and fasting glucose, and is a viable therapeutic target. For PAT, its volume is related to atrial fibrillation. There is evidence that computer-aided quantification of EAT and PAT, especially on large datasets, can be used to predict major adverse cardiovascular events, so identification of heart fat has great clinical importance for the diagnosis and treatment of related diseases. However, heart fat segmentation faces a number of difficulties: the relevant anatomical structures are small, fat shapes vary greatly and are unevenly distributed, epicardial and pericardial adipose tissue have similar brightness and are difficult to distinguish, the pericardium is usually too thin to be seen completely in CT or MRI images, and the boundaries of the different parts of the heart are difficult to delineate. As a result, automatic and accurate segmentation of heart fat remains unsolved, and manual or semi-automatic segmentation is still commonly used in clinical practice.
Traditional cardiac image segmentation usually requires considerable domain expertise and manual intervention, and different methods, or even combinations of methods, are adopted depending on the object to be segmented. These methods generally fall into three major categories: pixel-based clustering, graph-theory-based methods, and deep-learning-based semantic segmentation. Common pixel-clustering approaches include thresholding (based on a single value such as image brightness), clustering (assigning pixels to predefined classes according to the similarity of extracted features, e.g. k-means, spectral clustering, mean-shift, SLIC), histogram-based statistical methods, and edge detection (using domain prior knowledge). Graph-theory-based segmentation is represented by Normalized Cut, GraphCut and GrabCut. Deep semantic segmentation covers the many algorithms developed along with deep learning, from the early FCN, DeepLab and PSPNet to the currently popular U-Net, SegNet and ResUNet. With the advent of Transformers and large models in recent years, Transformers have also been used increasingly for complex segmentation. Heart fat segmentation is a subfield of image segmentation; the heart anatomy is complex, and the heart fat task is related to myocardium segmentation and whole-heart substructure segmentation. Most cardiac fat segmentation is based on CT images; segmentation from MRI is more difficult than from CT because the pericardium is often blurred by partial volume effects. Most existing EAT and PAT segmentation methods focus on quantifying EAT from CT.
Defects and deficiencies of the prior art:
Traditional methods applied to heart fat segmentation mainly include thresholding, region-filling algorithms, adaptive contour segmentation, and atlas-based methods (which use a manually pre-annotated atlas and register the image to be segmented to it by a registration technique, from which a segmentation result is obtained). Combined with deep-learning registration, the atlas-based approach can achieve faster segmentation; however, because of the complexity of fat, the data distribution of the atlas may differ greatly from real patient data, and this approach is gradually being replaced by other segmentation methods. Another approach uses the pericardium to assist heart fat segmentation: the pericardium is a smooth, very thin, almost oval contour which, once identified, conveniently limits the extent of heart fat. The difficulty is that the pericardium is very thin, typically less than 2 mm, and is not easily identified in the image. In deep-learning-based cardiac fat segmentation, U-Net-like frameworks are the most widely used, and data augmentation is employed to artificially increase the amount of data. Notably, heart fat and background are severely imbalanced, i.e. the number of heart fat pixels is much smaller than the number of background pixels; this imbalance can cause problems in CNN training, and although some recent works have introduced specially designed loss functions to mitigate it, they still perform poorly on problems such as heart fat.
In addition, a cardiac CMRI scan is a multi-slice scan with a large inter-slice spacing, so the image structure varies considerably between slices (the data are anisotropic). For segmenting such anisotropic data, 2.5D U-Net is currently the most common choice because it reduces the extra computational cost; however, its effect on this task is not ideal, either because the inter-slice differences are too large or because the labels are too small, so the model cannot truly and effectively learn the inter-slice associations. Because the shape of heart fat is uncertain, the shape of pericardial effusion adjacent to the fat is also uncertain, and their brightness is very similar to that of the pericardium in magnetic resonance images. Without prior constraints, such as anatomical priors, geometric priors or topological constraints, it is difficult to distinguish these structures effectively.
Furthermore, medical data in this field are very scarce: no publicly available MRI pericardial fat segmentation dataset exists to date, and datasets for other cardiac structures are also limited. This makes it difficult for the prior art to obtain satisfactory results from such small datasets. Even when other datasets are used, they differ in modality, format and labels, and publicly available heart fat datasets have so far not been found.
Finally, existing models also lack support for training on heterogeneous mixed data, for example data with missing labels or with label categories that are not fixed. Conventional classification/segmentation requires the number of categories to be fixed in advance, but as our data keep expanding, the number of categories keeps growing. In addition, some relationships between classes are difficult to express with conventional means, for example one label being a subset of another (such as a scar region inside the left ventricle) or one label containing several others (such as "ventricle / left ventricle / right ventricle"). Moreover, traditional models require the training data to contain complete labels (i.e. no label may be missing), whereas different datasets rarely share all labels; each dataset often contains only part of the labels, and the labels differ between datasets. To cope with these problems, instead of assigning a fixed code to each task, we propose to convert each label into a text description, use an existing large language model to extract the segmentation-intention features of that text description (the text prompt), and convert them into a task prompt vector (prompt embedding) that subsequent modules can understand. Combined with the feature extraction of the preceding pre-trained model, the image features (image embedding) and the text features (prompt embedding) are obtained separately and fed into the segmentation network to obtain the segmentation result, thereby solving the above problems. Because the large language model captures the relations between the input words, the prompt describing the content to be segmented has stronger expressive power. In addition, in the segmentation network each sample corresponds to only one result, which indirectly solves the problem of insufficient labels in some data. Experiments show that this approach is more effective. During training, each segmentation task is described in words; the tasks of different datasets differ more or less and may overlap, and the textual description expresses this information and the segmentation intention well, for example "please segment the left ventricle", "please segment the left ventricle and the left ventricular scar", "please segment the scar inside the left ventricle", "please segment the left ventricle and the right ventricle", and so on. In addition, several model training tasks are designed, which avoids the problem that previous models often use only the labels, and makes maximal use of the data information of the datasets.
In view of this, the present invention provides a heart fat segmentation method based on language prompt.
Disclosure of Invention
In order to overcome the problems in the prior art, the invention provides a heart fat segmentation method based on language prompt, which combines a large model and a large number of data sets in each field of heart to complete epicardial fat segmentation.
In order to solve the above problems, the invention provides the following technical scheme. In a first aspect, the present invention provides a heart fat segmentation method based on language prompt, comprising the following steps:
step S1: acquiring a multi-mode image, and extracting heart fat image features based on a visual model;
step S2: carrying out text description on the labels, extracting the characteristics of the segmentation intention of the text description by using the existing large language model, and converting the characteristics into task prompt vectors which can be understood by a subsequent module;
step S3: and constructing a decoding network, and comprehensively decoding the input heart fat image coding sequence according to the requirement of the task prompt vector in the decoding network to obtain a segmentation result.
As a preferred scheme of the invention, the application logic of the visual model is as follows:
step S11: processing the multi-mode image and the mode type information corresponding to the multi-mode image to generate an Embedding sequence;
step S12: performing feature extraction on the Embedding sequence by using a Transformer architecture to obtain the heart fat image features, wherein the heart fat image features are likewise an Embedding sequence.
As a preferable scheme of the invention, the acquisition logic of the Embedding sequence is as follows:
converting the image into an Embedding sequence: the multi-mode image is divided into image blocks of fixed size, and each block is mapped to an image Embedding, forming an image Embedding sequence;
converting the mode type into a corresponding mode Embedding: the mode type is encoded by using an integer, and the Embedding corresponding to the mode is used as the mode Embedding;
assembling the Embedding sequences of different information: and splicing the image Embedding sequence and the modal Embedding sequence to generate the Embedding sequence.
As a preferred embodiment of the present invention, the process of mapping an Embedding to its corresponding feature by the self-attention mechanism is as follows: first, each Embedding is mapped into three sets of vectors Q, K and V by a linear mapping module, where Q acts as the query with which the current Embedding searches, K acts as the key (index) by which an Embedding can be found, and V is the feature corresponding to that Embedding.
As a preferred solution of the present invention, the application logic of the language model is:
converting each label into a corresponding text description, and extracting text features of the segmentation intention of the text description by using a language model;
after the text feature is obtained, it needs to be converted into a task prompt vector, i.e. output as a fixed-length vector.
As a preferred embodiment of the present invention, the decoding processing logic includes the steps of:
step S31: based on the heart fat image features extracted in the step S1 and the task prompt vectors extracted in the step S2, marking an Embedding sequence corresponding to the heart fat image features as an A sequence, marking an Embedding sequence corresponding to the task prompt vectors as a B sequence,
step S32: simultaneously sending the A sequence and the B sequence into a bidirectional attention network to respectively obtain an A sequence sensing result and a B sequence sensing result;
step S33: the A-sequence perception result is passed through multi-layer deconvolution to generate a plurality of candidate images; the B-sequence perception result passes in turn through a B-to-A cross-attention Transformer block and a multi-layer perceptron to obtain the weights of the candidate images; finally the candidate images are weighted by these weights to obtain the segmentation result.
In a second aspect, the present invention provides a model training process of a cardiac fat segmentation method based on language cues, which is characterized by comprising the following steps:
step X1: collecting and processing data sets;
step X2: merging a plurality of data sets and generating a training data set;
step X3: extracting training tasks from the combined data set during training;
step X4: encoding the image and the mode, where the input image and mode information are encoded by the visual model to obtain the Embedding sequence of the image.
Step X5: and calculating the contrast similarity. Because model training is performed simultaneously on a batch of images, it is assumed that each batch of images has bs pieces of training data;
step X6: calculating a reconstruction similarity, wherein:
step X61: subtask one: computing the ability of the model to reconstruct an image;
step X62: subtask two: calculating model modal migration capacity;
step X63: subtask three: the ability of the model to segment the image is calculated.
In a third aspect, the present invention provides an electronic apparatus, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to cause the electronic device to perform the language hint based heart fat segmentation method.
In a fourth aspect, the present invention provides a storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the language-hint based heart fat segmentation method.
The heart fat segmentation method based on language prompt has the following technical effects and advantages: by utilizing a large amount of data from other medical fields, combining it with a small amount of heart fat data, and exploiting the ability of a large model to learn from large data, the invention mainly solves the problems that heart fat data are insufficient in quantity and that heart fat is uncertain in shape, small in volume and close in brightness to surrounding tissues, and it has the advantages of reduced cost, improved performance, high flexibility, the ability to handle incomplete data, and reuse of pre-training.
Drawings
FIG. 1 is a flow chart of a method for segmenting cardiac fat based on language prompt and model processing logic according to the present invention;
FIG. 2 is a schematic diagram of the visual model processing logic related attention mechanism of the present invention;
FIG. 3 is a logic diagram of the relationship chain of each part of the training data set of the heart fat segmentation method based on language prompt;
fig. 4 is a flow chart of training logic and training tasks of the language prompt based heart fat segmentation method of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1-2, the method for heart fat segmentation based on language prompt according to the present embodiment includes the following steps:
step S1: acquiring a multi-mode image, and extracting heart fat image features based on a visual model;
it should be noted that: the multi-mode images include MRI images, CT images and ultrasound images, and MRI images are further subdivided into modes such as T1 and T2. Different types of datasets come in different image modes; for example, current myocardial ring and ventricle segmentation datasets are mainly MRI images, whereas the images used for heart fat segmentation are mainly derived from CT image datasets. To make full use of these precious datasets, the model must be able to process images of different modes simultaneously. Images of different modes differ in spatial resolution, pixel intensity, anatomical structure information and other aspects of expression, so the input multi-mode images must be adapted through the multi-mode visual model, which understands the content of the input image and provides a unified encoding for the subsequent segmentation.
The application logic of the visual model is as follows:
step S11: processing the multi-mode image and the mode type information corresponding to the multi-mode image to generate an Embedding sequence;
it should be noted that: although images of different mode types may look somewhat similar, the information they contain is completely different. Specifying the mode type therefore makes the subsequent encoding more targeted to the specific mode, reduces the difficulty the system has in recognizing the image, and improves the visual model's ability to quickly recognize the feature information carried by the multi-mode image. Since the subsequent processing requires an Embedding sequence, this step aims to convert the image and the corresponding mode type into an Embedding sequence for use in the following steps.
The acquisition logic of the Embedding sequence is as follows:
converting the image into an Embedding sequence: the multi-mode image is divided into image blocks of fixed size, and each block is mapped to an image Embedding, forming an image Embedding sequence; this mapping is performed by a multi-layer perceptron.
Converting the mode type into the corresponding mode Embedding: the mode type is encoded as an integer, e.g. CT corresponds to 1, MRI T1 corresponds to 2, and so on; each integer corresponds to one Embedding vector, i.e. different mode types correspond to different Embedding vectors, and there are as many Embedding vectors as there are modes. When the mode type needs to be converted into a mode Embedding, the Embedding corresponding to that mode is looked up and used as the mode Embedding.
Assembling the Embedding sequences of the different information: the image Embedding sequence and the mode Embedding are spliced to generate the overall Embedding sequence; a position code for the corresponding position is then added to each Embedding of the sequence, finally yielding an Embedding sequence that a Transformer can process.
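By way of illustration, a minimal PyTorch sketch of this Embedding-sequence construction is given below. The patch size, Embedding width, mode vocabulary size and the use of learned position codes are assumptions for illustration only and are not fixed by this description.

```python
import torch
import torch.nn as nn

class EmbeddingSequenceBuilder(nn.Module):
    """Builds the Embedding sequence: image-block Embeddings + mode Embedding + position codes.
    Patch size, Embedding width and mode vocabulary are illustrative assumptions."""
    def __init__(self, patch=16, dim=256, num_modes=8, max_len=1024):
        super().__init__()
        # multi-layer perceptron mapping each flattened image block to an image Embedding
        self.patch_mlp = nn.Sequential(
            nn.Linear(patch * patch, dim), nn.GELU(), nn.Linear(dim, dim))
        # one Embedding vector per mode type (e.g. CT -> 1, MRI T1 -> 2, ...)
        self.mode_emb = nn.Embedding(num_modes, dim)
        # learned position codes added to every element of the sequence
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        self.patch = patch

    def forward(self, image, mode_id):
        # image: (B, H, W) single-channel slice; mode_id: (B,) integer mode codes
        B, H, W = image.shape
        p = self.patch
        patches = (image
                   .unfold(1, p, p).unfold(2, p, p)        # (B, H/p, W/p, p, p)
                   .reshape(B, -1, p * p))                 # (B, N, p*p)
        img_emb = self.patch_mlp(patches)                  # image Embedding sequence
        mode_emb = self.mode_emb(mode_id)[:, None]         # (B, 1, dim) mode Embedding
        seq = torch.cat([mode_emb, img_emb], dim=1)        # splice mode + image Embeddings
        return seq + self.pos_emb[:, :seq.shape[1]]        # add position codes

builder = EmbeddingSequenceBuilder()
seq = builder(torch.randn(2, 192, 192), torch.tensor([1, 2]))  # -> (2, 1 + 144, 256)
```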
Step S12: features are extracted from the Embedding sequence using a Transformer architecture and converted back into an Embedding sequence, which facilitates subsequent computation;
it should be noted that: a "self-attention" mechanism is used here to convert Embeddings into features. A Transformer block uses multi-head attention to obtain a corresponding group of feature sequences, which are then restored into an Embedding sequence by a linear mapping module; multiple Transformer blocks are connected in series to form a Transformer encoder, and the Embedding sequence output by the last block is the encoding result.
Wherein the process of mapping an Embedding to its corresponding feature by the self-attention mechanism is as follows. First, each Embedding is mapped into three sets of vectors Q, K and V by a linear mapping module, where Q acts as the query with which the current Embedding searches, K acts as the key (index) by which an Embedding can be found, and V is the feature corresponding to that Embedding. Then the distance between the Q of the current Embedding and the K of all Embeddings is computed to judge their relationship: if Q and K are close, the feature (V) of the corresponding Embedding is considered closely related to the current Embedding and receives a higher weight. Weights are thus computed from the distances and used to weight the features (V), which yields the converted feature of the current Embedding. The same operation is performed for every Embedding in the sequence to obtain the converted feature sequence. Multi-head attention applies the self-attention mechanism several times, each time with different weights, which is equivalent to giving the feature detection several chances, each detecting a different feature sequence. Thus, as shown in FIG. 2, the result of multi-head attention is to convert the Embedding sequence into a group of feature sequences.
In addition, since the input contains the Embedding corresponding to the mode type, every step of the feature extraction process takes the mode type into account.
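A minimal sketch of the multi-head self-attention described above is given below; the Embedding width and the number of heads are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention: each Embedding is mapped to Q, K, V,
    Q-K similarity gives weights, and the weighted sum of V is the new feature."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dk = heads, dim // heads
        self.to_qkv = nn.Linear(dim, dim * 3)    # linear mapping to Q, K, V
        self.proj = nn.Linear(dim, dim)          # map the features back to Embeddings

    def forward(self, x):                        # x: (B, N, dim) Embedding sequence
        B, N, _ = x.shape
        qkv = self.to_qkv(x).reshape(B, N, 3, self.heads, self.dk)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (B, heads, N, dk)
        attn = (q @ k.transpose(-2, -1)) / self.dk ** 0.5   # Q-K similarity
        attn = attn.softmax(dim=-1)              # turn similarities into weights
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)  # weighted sum of V per head
        return self.proj(out)                    # converted feature sequence

x = torch.randn(2, 145, 256)                     # mode Embedding + 144 image-block Embeddings
y = MultiHeadSelfAttention()(x)                  # same shape as x
```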
Step S2: and carrying out text description on the labels, extracting the characteristics of the segmentation intention of the text description by using the existing large language model, and converting the characteristics into task prompt vectors which can be understood by a subsequent module.
The application logic of the language model is as follows:
converting each label into a corresponding text description, and extracting text features of the segmentation intention of the text description with a language model; text descriptions are chosen to replace the traditional fixed label representation so that the various datasets can be better fused for training.
After the text feature is obtained, it needs to be converted into a task prompt vector (prompt embedding). For this conversion we use a smaller network such as an RNN or MLP that learns to extract task prompt vectors from text features; the input is the text feature and the output is a fixed-length Embedding.
It should be noted that, in this step, a text feature extraction network does not have to be trained from scratch. Because training large language models is expensive, many trained large language models, such as LLaMA and T5, are available in academia; the text features can be extracted by using these large models directly or after fine-tuning. Furthermore, since the goal is not a particular large language model but the extraction of text features, feature extraction can also be performed with pre-trained text encoder models such as BERT.
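A minimal sketch of such a conversion is given below, assuming an MLP head on top of a pooled text feature from a frozen text encoder; the input width, prompt length and prompt width are illustrative assumptions not specified by this description.

```python
import torch
import torch.nn as nn

class PromptHead(nn.Module):
    """Small MLP turning a text feature (e.g. the pooled output of a frozen
    BERT/T5/LLaMA encoder) into a fixed-length task prompt Embedding sequence."""
    def __init__(self, text_dim=768, prompt_dim=256, prompt_len=4):
        super().__init__()
        self.prompt_len, self.prompt_dim = prompt_len, prompt_dim
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, 512), nn.GELU(),
            nn.Linear(512, prompt_len * prompt_dim))

    def forward(self, text_feat):                             # (B, text_dim)
        out = self.mlp(text_feat)
        return out.view(-1, self.prompt_len, self.prompt_dim)  # fixed-length prompt Embeddings

# text_feat could, for example, be the pooled feature of "please segment the epicardial fat"
prompt = PromptHead()(torch.randn(2, 768))                    # -> (2, 4, 256)
```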
Step S3: a decoding network is constructed; in the decoding network, the input heart fat image encoding sequence is comprehensively decoded according to the requirements of the task prompt vector, and the segmentation target data are finally output.
It should be noted that: the decoding network consists of a bidirectional attention network followed by a fusion output network. The preceding steps encode the image and the task prompt vector separately, producing two Embedding sequences without considering the relationship between them; the role of the bidirectional attention network is to further decode the two Embedding sequences while taking their mutual relationship into account. "Cross attention" is used here. In contrast to self-attention, cross attention takes two Embedding sequences as input. Assuming these two sequences are A and B, the A-to-B cross-attention process is: A is mapped to Q_A by a linear mapping, B is mapped to K_B and V_B, and then Q_A, K_B, V_B are used as the Q, K, V of the self-attention mechanism to compute the output features. Multi-head cross attention applies cross attention several times with different weights to obtain a corresponding group of features. It can be seen that if A and B are the same sequence, this is equivalent to self-attention. The combination of the multi-head cross-attention mechanism and a linear mapping that maps the features back to Embeddings is called a cross-attention Transformer block, such as the "A-to-B attention (cross attention) logic" in FIG. 2.
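A minimal sketch of the A-to-B cross attention described above is given below; the widths and the number of heads are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """A-to-B cross attention: A provides the queries (Q_A), B provides keys and
    values (K_B, V_B). With A == B this reduces to ordinary self-attention."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.q = nn.Linear(dim, dim)        # maps A to Q_A
        self.kv = nn.Linear(dim, dim * 2)   # maps B to K_B and V_B
        self.proj = nn.Linear(dim, dim)     # maps features back to Embeddings

    def forward(self, a, b):                # a: (B, Na, dim), b: (B, Nb, dim)
        B, Na, _ = a.shape
        Nb = b.shape[1]
        q = self.q(a).reshape(B, Na, self.heads, self.dk).transpose(1, 2)
        k, v = self.kv(b).reshape(B, Nb, 2, self.heads, self.dk).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1) / self.dk ** 0.5).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Na, -1)
        return self.proj(out)               # new A sequence, informed by B

a_new = CrossAttention()(torch.randn(2, 145, 256), torch.randn(2, 4, 256))
```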
The decoding processing logic comprises the steps of:
step S31: based on the heart fat image features extracted in the step S1 and the task prompt vectors extracted in the step S2, marking an Embedding sequence corresponding to the heart fat image features as an A sequence, marking an Embedding sequence corresponding to the task prompt vectors as a B sequence,
step S32: simultaneously sending the A sequence and the B sequence into a bidirectional attention network to respectively obtain an A sequence sensing result and a B sequence sensing result;
specifically, the input A sequence and B sequence pass sequentially through a plurality of bidirectional attention blocks to obtain a new A sequence and a new B sequence.
The logic of a bidirectional attention block is as follows: the inputs are A and B, the outputs are again A and B, and these outputs serve as the A and B inputs of the next-layer network, so that multiple blocks can be connected in series. The B input first passes through a self-attention Transformer block, then through a B-to-A cross-attention Transformer block, and finally through a multi-layer perceptron to obtain the new B. The A input passes through an A-to-B cross-attention Transformer block to obtain the new A.
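A minimal sketch of one bidirectional attention block is given below, built on PyTorch's nn.MultiheadAttention for brevity; the residual connections and widths are illustrative choices not fixed by this description.

```python
import torch
import torch.nn as nn

class BidirectionalAttentionBlock(nn.Module):
    """One bidirectional attention block: B attends to itself, then to A,
    then passes an MLP; A then attends to the updated B."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.b_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)  # B queries A
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)  # A queries B
        self.b_mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, a, b):                 # a: image Embeddings, b: prompt Embeddings
        b = b + self.b_self(b, b, b)[0]      # self-attention on B
        b = b + self.b_to_a(b, a, a)[0]      # B-to-A cross attention
        b = b + self.b_mlp(b)                # multi-layer perceptron
        a = a + self.a_to_b(a, b, b)[0]      # A-to-B cross attention with the new B
        return a, b                          # fed into the next block in series

a, b = torch.randn(2, 145, 256), torch.randn(2, 4, 256)
for blk in [BidirectionalAttentionBlock() for _ in range(2)]:   # blocks in series
    a, b = blk(a, b)
```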
Step S33: the A-sequence perception result is passed through multi-layer deconvolution to generate a plurality of candidate images; the B-sequence perception result passes in turn through a B-to-A cross-attention Transformer block and a multi-layer perceptron to obtain the weights of the candidate images; finally the candidate images are weighted by these weights to obtain the segmentation result.
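A minimal sketch of this fusion output is given below, assuming a 12x12 token grid, four candidate images and the layout where the first Embedding is the mode Embedding; the deconvolution widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionOutput(nn.Module):
    """Fusion output sketch: the A (image) sequence is deconvolved into several
    candidate maps, the B (prompt) sequence yields one weight per candidate,
    and the weighted combination is the segmentation result."""
    def __init__(self, dim=256, grid=12, num_candidates=4):
        super().__init__()
        self.grid = grid
        self.deconv = nn.Sequential(                          # multi-layer deconvolution
            nn.ConvTranspose2d(dim, 64, 4, stride=4), nn.GELU(),
            nn.ConvTranspose2d(64, num_candidates, 4, stride=4))
        self.b_to_a = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.weight_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                        nn.Linear(dim, num_candidates))

    def forward(self, a, b):
        B = a.shape[0]
        patch_tokens = a[:, 1:]                               # drop the mode Embedding (layout assumption)
        fmap = patch_tokens.transpose(1, 2).reshape(B, -1, self.grid, self.grid)
        candidates = self.deconv(fmap)                        # (B, C, 192, 192) candidate images
        b = self.b_to_a(b, a, a)[0]                           # B-to-A cross attention
        w = self.weight_mlp(b.mean(dim=1)).softmax(dim=-1)    # one weight per candidate
        return (candidates * w[:, :, None, None]).sum(dim=1)  # weighted segmentation result

mask = FusionOutput()(torch.randn(2, 145, 256), torch.randn(2, 4, 256))  # (2, 192, 192)
```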
Example 2
Referring to fig. 3-4, the model training process of the heart fat segmentation method based on language prompt according to this embodiment includes the following steps:
the training aims to maximally utilize and mine the various information in the existing datasets and to accommodate the growing number of datasets and labels, so that the model has better universality.
Step X1: collection and processing of datasets. Our datasets come in multiple formats, such as MRI k-space, Matlab, nii and jpg, and we first read out both the images and the corresponding metadata of these datasets. The metadata include information such as the mode, size and resolution of the image, and all of it is converted into json format. According to the metadata, the in-plane resolution of each image is uniformly resampled to 1 mm/pix using bilinear interpolation, while the inter-slice resolution is left unchanged; the image is normalized to 0-1 per mode, discarding the top 1% of intensity outliers in the histogram during normalization, matching the histogram to the global average histogram of that mode, and setting the background value to 0. The dataset is then globally registered to an average template with the ANTs registration tool, and finally the in-plane image size is cropped to 192x192. Note that the scan directions of different datasets may differ greatly; registration is currently performed for 5 directions according to the scan direction: the 3 body planes plus the long axis and the short axis of the heart.
We define a unified numbering for the labels, with different labels corresponding to different numbers, and the annotation information is uniformly mapped to these numbers. The annotations are then resampled to the same 1 mm/pix resolution as the corresponding images using nearest-neighbor interpolation, transformed with the same transformation as the corresponding image registration, and cropped to 192x192.
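A minimal numpy sketch of the per-mode intensity normalization and cropping described above is given below; it is a simplification, and the resampling, histogram matching and ANTs registration steps are omitted.

```python
import numpy as np

def normalize_slice(img):
    """Clip away the highest 1% of intensities, scale to 0-1, keep background at 0."""
    upper = np.percentile(img, 99)           # discard the top 1% as noise/outliers
    img = np.clip(img, 0, upper)
    return img / upper if upper > 0 else img

def center_crop(img, size=192):
    """Crop the in-plane size to size x size (padding handling omitted)."""
    h, w = img.shape
    y, x = (h - size) // 2, (w - size) // 2
    return img[y:y + size, x:x + size]

slice_01 = center_crop(normalize_slice(np.random.rand(256, 256) * 1000.0))
```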
After preprocessing, the images need to be compressed to meet the requirements of joint-dataset training. This step first unstacks the image and the corresponding annotation (if present; the same applies below) into a 2D sequence and converts it to float16; the annotation is split into one-hot codes (i.e. split by annotation class, one mask per class, whose pixel value is 1 for that class and 0 otherwise), so that one image corresponds to multiple masks. Because a mask has the same size as the image, the masks are saved in the same Image file, and the location indexes of the images and annotations are recorded, as in "Image" in FIG. 3.
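A minimal sketch of splitting an annotation into per-class one-hot masks stored as float16 is given below; the class ids used here are illustrative only.

```python
import numpy as np

def split_to_onehot(label_map, class_ids):
    """Split an integer annotation into one binary mask per class
    (pixel value 1 for that class, 0 otherwise), stored as float16."""
    return {c: (label_map == c).astype(np.float16) for c in class_ids}

label = np.zeros((192, 192), dtype=np.int32)
label[50:90, 60:100] = 3                      # hypothetical class id for illustration
masks = split_to_onehot(label, class_ids=[1, 2, 3])
# each mask has the same size as the image, is saved alongside it,
# and its location index is recorded for the pair file
```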
Next, the training tasks of a single dataset are constructed. Each training task requires the image, the mode type and the training task type during training, plus the model output, i.e. 4 pieces of information. Since the model input and output are both stored in the Image file, they can be represented by integer indexes; the mode type and the training task type can likewise be represented by defined integers. Thus each training task can be expressed as (image mode id, image index, task type id, model output image index), as in "pair" in FIG. 3. When the dataset is constructed, all tasks that can be built from the current image are described and stored in a pair file.
Step X2: merging multiple datasets and generating the training dataset.
First, the textual description information of the training tasks is prepared. To enhance the generalization ability of the model, we prepare multiple similar text descriptions for each training task, and a language model is used to extract features from the descriptions in advance. Finally, the text features are saved together in a text-embedding file, as in "task description" in FIG. 3, where the text features of the different descriptions of each task are stored consecutively and the start index and count of each task are stored in an index, as in "task information" in FIG. 3.
The multiple datasets are then merged: the images of the datasets are combined into one file, as in "Image" in FIG. 3. Since the index position of each picture changes during image merging, the corresponding indexes in the task pair files need to be modified when merging, as in "pair" in FIG. 3.
Step X3: training tasks are extracted from the combined dataset during training.
First, training is divided into a number of epochs; before each epoch, the training tasks are randomly shuffled, and task information is extracted sequentially in units of the batch size. Extracting all tasks once completes one epoch of training.
Next, the input and output images are obtained from the Image file according to the indexes, and random data augmentation is applied to the image and the label simultaneously. The augmentations are: random rotation and flipping, random brightness and contrast adjustment, and random cropping to 144x144.
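A minimal numpy sketch of this augmentation, applied identically to image and label, is given below; the jitter ranges are illustrative assumptions.

```python
import numpy as np

def augment(image, label, rng=np.random.default_rng()):
    """Random rotation/flip, brightness/contrast jitter, and a 144x144 random crop,
    applied identically to the image and its label."""
    k = rng.integers(0, 4)
    image, label = np.rot90(image, k), np.rot90(label, k)        # random rotation
    if rng.random() < 0.5:
        image, label = np.fliplr(image), np.fliplr(label)        # random flip
    image = np.clip(image * rng.uniform(0.9, 1.1)                # contrast jitter
                    + rng.uniform(-0.1, 0.1), 0, 1)              # brightness jitter
    y = rng.integers(0, image.shape[0] - 144)
    x = rng.integers(0, image.shape[1] - 144)
    return image[y:y + 144, x:x + 144].copy(), label[y:y + 144, x:x + 144].copy()

img, lbl = augment(np.random.rand(192, 192), np.zeros((192, 192)))
```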
The text feature of the corresponding task is then randomly drawn from the text-embedding file according to the task id (each task has several description features).
Finally we extract the data required for training: the image (144x144), the mode type, the task description feature, and the expected model output (144x144).
Step X4: encoding the image and the mode. The input image and mode information are encoded by the visual model. First, S11 of Example 1 is applied to obtain the image Embedding sequence.
Then 3/4 of the Embedding sequence is randomly deleted, and S12 of Example 1 is subsequently applied.
The above procedure is performed twice, yielding two different sequences A1 and A2, as in the "multi-modal image encoder (during training)" in FIG. 4.
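A minimal sketch of randomly keeping 1/4 of the Embedding sequence is given below; calling it twice on the same sequence yields the two views A1 and A2 (before the Transformer encoding).

```python
import torch

def random_keep_quarter(seq, generator=None):
    """Randomly delete 3/4 of the Embedding sequence, keeping 1/4 of the positions;
    returns the kept Embeddings and the indices of the kept positions."""
    B, N, D = seq.shape
    keep = N // 4
    idx = torch.argsort(torch.rand(B, N, generator=generator), dim=1)[:, :keep]
    kept = torch.gather(seq, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, idx

seq = torch.randn(2, 144, 256)
a1, idx1 = random_keep_quarter(seq)   # first masked view
a2, idx2 = random_keep_quarter(seq)   # second, differently masked view
```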
Step X5: calculating the contrast similarity. Because model training is performed on a batch of images simultaneously, assume that each batch contains bs pieces of training data. We take the first Embedding of A1 and of A2 for each piece of training data, giving two Embedding sequences of length bs, and then compute the similarity between every pair of elements of the two sequences. Since Embeddings at the same position come from the same input image, their features should be similar; Embeddings at different positions come from different images, so their features should not be similar. From this principle a corresponding loss function, the contrast similarity loss, is defined.
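A minimal sketch of such a contrast similarity loss is given below, using the first Embedding of A1 and A2 and a symmetric cross-entropy over pairwise similarities; the temperature value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def contrast_similarity_loss(a1, a2, temperature=0.1):
    """Contrastive loss over a batch: the first Embeddings of A1 and A2 from the same
    image should be similar, Embeddings from different images should not be."""
    z1 = F.normalize(a1[:, 0], dim=-1)           # first Embedding of each A1, shape (bs, D)
    z2 = F.normalize(a2[:, 0], dim=-1)
    logits = z1 @ z2.t() / temperature           # pairwise similarities, (bs, bs)
    target = torch.arange(z1.shape[0])           # matching positions are the positives
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))

loss = contrast_similarity_loss(torch.randn(8, 36, 256), torch.randn(8, 36, 256))
```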
Step X6: the reconstruction similarity is calculated, as in the "training task" of FIG. 4.
Step X61: subtask one: calculating the ability of the model to reconstruct an image. This subtask is similar to an image-completion task: most of the input image is removed, and if the model can still predict the complete original content, it can be considered to truly understand the image and to encode its features. Since the A1 and A2 sequences output by X4 contain only 1/4 of the original positions, we fill the deleted positions of A1 and A2 with a special Embedding for the subsequent computation, resulting in complete sequences; this special Embedding simply tells the model that the feature at this location was deleted and must be predicted. Finally, S2 and S3 of Example 1 are applied to obtain the two model predictions R1 and R2. Since the model input is the same image, the model output should be consistent with the target image regardless of which part of the image was deleted. From this principle a corresponding loss function, the reconstruction loss, is defined.
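A minimal sketch of filling the deleted positions with a special learned Embedding, and of a reconstruction loss over the two predictions, is given below; the shapes and the mean-squared-error choice are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskFill(nn.Module):
    """Fill the positions that were deleted from A1/A2 with a special learned
    Embedding so that the decoder receives a complete sequence."""
    def __init__(self, dim=256, seq_len=144):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.seq_len = seq_len

    def forward(self, kept, idx):                 # kept: (B, N/4, D), idx: kept positions
        B, _, D = kept.shape
        full = self.mask_token.expand(B, self.seq_len, D).clone()
        return full.scatter(1, idx.unsqueeze(-1).expand(-1, -1, D), kept)

def reconstruction_loss(r1, r2, target):
    """Both predictions R1 and R2 should match the same target image."""
    return F.mse_loss(r1, target) + F.mse_loss(r2, target)

kept, idx = torch.randn(2, 36, 256), torch.randint(0, 144, (2, 36))
filled = MaskFill()(kept, idx)                    # (2, 144, 256), ready for decoding
```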
Step X62: subtask two: calculating the model's mode migration capability. This process is the same as X61, except that an image of another mode is output.
Step X63: subtask three: calculating the ability of the model to segment the image. This process is the same as X61, except that the output is a label. We learn each annotation with soft labels (i.e. the output values are no longer only 0 or 1 but lie between 0 and 1), so that segmentation can be performed as a reconstruction task, unifying the training process.
The invention, on the basis of a large model trained on a large number of other public datasets, lets the model indirectly learn knowledge from the various medical fields related to cardiac MRI images, thereby obtaining a clear improvement on our task and overcoming the defects of the prior art. Large models and large-scale training data have proven valuable in many tasks, but heart fat data alone are not large in quantity; the basic idea is therefore to also use other datasets in the medical field, even if their research directions differ, for example using the data of other tasks (such as a left atrium segmentation dataset) to assist the current task. Although the data of each individual task are not abundant, these small datasets add up to a considerable amount of data. The whole solution therefore requires a reasonable training scheme to combine so many different datasets towards the training goal, i.e. to handle not only the current pericardial fat segmentation task but also the other tasks.
The technical scheme is therefore divided into two parts: one corresponds to the model, the other to the data and training. The model part comprises three components: the visual large model and the language large model as data inputs, and the decoding network as the output. The dataset and training part mainly performs data processing in cooperation with the model to realize fusion training on the various datasets.
Example 3
In an exemplary embodiment, there is also provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to cause the electronic device to perform a language hint based heart fat segmentation method.
For example, the computer-readable storage medium can be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, etc.
In an exemplary embodiment, a storage medium is also provided, on which a computer program is stored which, when being executed by a processor, implements a language-hint based heart fat segmentation method.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
It should be appreciated that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
Those of ordinary skill in the art will appreciate that all or a portion of the steps implementing the above-described embodiments can be implemented by hardware, or can be implemented by a program instructing the relevant hardware, and the program can be stored in a computer readable storage medium, and the above-mentioned storage medium can be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description is only of alternative embodiments of the present application and is not intended to limit the present application, but any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the present application are intended to be included within the scope of the present application.
In the several embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a logical division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed may be through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Finally: the foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. The heart fat segmentation method based on language prompt is characterized by comprising the following steps of:
step S1: acquiring a multi-mode image, and extracting heart fat image features based on a visual model;
step S2: carrying out text description on the labels, extracting the characteristics of the segmentation intention of the text description by using the existing large language model, and converting the characteristics into task prompt vectors which can be understood by a subsequent module;
step S3: and constructing a decoding network, and comprehensively decoding the input heart fat image coding sequence according to the requirement of the task prompt vector in the decoding network to obtain a segmentation result.
2. The language hint based heart fat segmentation method of claim 1, wherein the application logic of the visual model is:
step S11: processing the multi-mode image and the mode type information corresponding to the multi-mode image to generate an Embedding sequence;
step S12: performing feature extraction on the Embedding sequence by using a Transformer architecture to obtain the heart fat image features, wherein the heart fat image features are likewise an Embedding sequence.
3. The language hint based heart fat segmentation method of claim 1, wherein the acquisition logic of the Embedding sequence is:
converting the image into an Embedding sequence: the multi-mode image is divided into image blocks of fixed size, and each block is mapped to an image Embedding, forming an image Embedding sequence;
converting the mode type into a corresponding mode Embedding: the mode type is encoded by using an integer, and the Embedding corresponding to the mode is used as the mode Embedding;
assembling the Embedding sequences of different information: and splicing the image Embedding sequence and the modal Embedding sequence to generate the Embedding sequence.
4. A language hint based heart fat segmentation method according to claim 1, wherein the process of mapping an Embedding to its corresponding feature by a self-attention mechanism is as follows: first, each Embedding is mapped into three sets of vectors Q, K and V by a linear mapping module, where Q acts as the query with which the current Embedding searches, K acts as the key (index) by which an Embedding can be found, and V is the feature corresponding to that Embedding.
5. The heart fat segmentation method based on language prompt according to claim 4, wherein the distance between the vector Q of the current Embedding and the vectors K of all Embeddings is calculated, weights are computed from the distances, and the features (V) are weighted accordingly to obtain the converted feature of the current Embedding.
6. The language hint based heart fat segmentation method of claim 1, wherein the application logic of the language model is:
converting each label into a corresponding text description, and extracting text features of the segmentation intention of the text description by using a language model;
after the text feature is obtained, it needs to be converted into a task prompt vector, i.e. output as a fixed-length vector.
7. The language hint based heart fat segmentation method of claim 1, wherein the decoding processing logic includes the steps of:
step S31: based on the heart fat image features extracted in the step S1 and the task prompt vectors extracted in the step S2, marking an Embedding sequence corresponding to the heart fat image features as an A sequence, marking an Embedding sequence corresponding to the task prompt vectors as a B sequence,
step S32: simultaneously sending the A sequence and the B sequence into a bidirectional attention network to respectively obtain an A sequence sensing result and a B sequence sensing result;
step S33: the A-sequence perception result is passed through multi-layer deconvolution to generate a plurality of candidate images; the B-sequence perception result passes in turn through a B-to-A cross-attention Transformer block and a multi-layer perceptron to obtain the weights of the candidate images; finally the candidate images are weighted by these weights to obtain the segmentation result.
8. The language-hint based heart fat segmentation method of claim 1, wherein the model training process of the language-hint based heart fat segmentation method comprises the steps of:
step X1: collecting and processing data sets;
step X2: merging a plurality of data sets and generating a training data set;
step X3: extracting training tasks from the combined data set during training;
step X4: encoding the image and the mode, where the input image and mode information are encoded by the visual model to obtain the Embedding sequence of the image.
Step X5: calculating contrast similarity;
step X6: calculating the reconstruction similarity;
wherein:
step X61: subtask one: computing the ability of the model to reconstruct an image;
step X62: subtask two: calculating model modal migration capacity;
step X63: subtask three: the ability of the model to segment the image is calculated.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to cause the electronic device to perform the language hint based heart fat segmentation method of any of claims 1-7.
10. A storage medium having stored thereon a computer program which, when executed by a processor, implements a language hint based heart fat segmentation method according to any of claims 1-7.
CN202311802073.8A 2023-12-26 2023-12-26 Heart fat segmentation method based on language prompt Pending CN117710677A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311802073.8A CN117710677A (en) 2023-12-26 2023-12-26 Heart fat segmentation method based on language prompt


Publications (1)

Publication Number Publication Date
CN117710677A true CN117710677A (en) 2024-03-15

Family

ID=90153192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311802073.8A Pending CN117710677A (en) 2023-12-26 2023-12-26 Heart fat segmentation method based on language prompt

Country Status (1)

Country Link
CN (1) CN117710677A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination