CN117556077B - Training method of text image model, related method and related product - Google Patents

Training method of text image model, related method and related product

Info

Publication number
CN117556077B
CN117556077B (application CN202410036431.7A)
Authority
CN
China
Prior art keywords
image
target
text
training
vector
Prior art date
Legal status
Active
Application number
CN202410036431.7A
Other languages
Chinese (zh)
Other versions
CN117556077A (en)
Inventor
谢卫国
黄炳顶
肖楚达
Current Assignee
Shenzhen Weide Precision Medical Technology Co., Ltd.
Original Assignee
Shenzhen Weide Precision Medical Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Weide Precision Medical Technology Co., Ltd.
Priority to CN202410036431.7A
Publication of CN117556077A
Application granted
Publication of CN117556077B
Legal status: Active

Classifications

    • G06F16/583: Retrieval of still image data characterised by using metadata automatically derived from the content
    • G06F16/334: Query execution (retrieval of unstructured textual data)
    • G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06N3/0455: Neural networks; auto-encoder networks, encoder-decoder networks
    • G06N3/0464: Neural networks; convolutional networks [CNN, ConvNet]
    • G06T7/60: Image analysis; analysis of geometric attributes
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis
    • G06V10/82: Image or video recognition or understanding using neural networks
    • G06T2207/10081: Image acquisition modality; computed x-ray tomography [CT]
    • G06T2207/10088: Image acquisition modality; magnetic resonance imaging [MRI]
    • G06T2207/10132: Image acquisition modality; ultrasound image
    • G06T2207/20081: Special algorithmic details; training, learning
    • G06T2207/20084: Special algorithmic details; artificial neural networks [ANN]
    • G06T2207/30084: Subject of image; kidney, renal

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a training method of a text image model, together with a related method and related products. The training method of the text image model comprises the following steps: acquiring a training image, a training text and a prediction label of the training text, wherein the dimension of the training image is greater than 1 and the training image includes a target organ; converting the training image into a first image vector; performing feature extraction on the first image vector to obtain a first feature vector; downsampling the first image vector to obtain a second image vector; performing feature extraction on the second image vector to obtain a second feature vector; predicting, by the model to be trained, the occluded content in the training text based on the first feature vector and the second feature vector to obtain a prediction result; and updating parameters of the model to be trained based on the difference between the prediction result and the prediction label to obtain a target text image model.

Description

Training method of text image model, related method and related product
Technical Field
The application relates to the technical field of medical images, in particular to a training method of a text image model, a related method and related products.
Background
In recent years, with the rapid development of artificial intelligence technology, its applications in the field of medical imaging have grown steadily, including the processing of medical images with models. How to train a model that can process medical images is therefore of great importance.
Disclosure of Invention
The application provides a training method of a text image model, a related method and related products, which are used to train a model for processing medical images.
In a first aspect, a training method for a text image model is provided, the method comprising:
Acquiring a training image, a training text and a prediction label of the training text, wherein the dimension of the training image is greater than 1, the training image includes a target organ, the training text is a text describing the target organ in the training image in which the content related to the target organ is occluded, and the prediction label comprises the occluded content of the training text;
Converting the training image into a first image vector;
extracting features of the first image vector to obtain a first feature vector;
Downsampling the first image vector to obtain a second image vector;
obtaining a second feature vector by carrying out feature extraction processing on the second image vector;
predicting, by the model to be trained, the occluded content in the training text based on the first feature vector and the second feature vector, to obtain a prediction result;
updating parameters of the model to be trained based on the difference between the prediction result and the prediction label to obtain a target text image model, wherein the target text image model is used for generating a target vector based on a target text, the target text is a text describing the target organ, and information of the target organ carried by the target vector is matched with the description of the target organ by the target text.
In combination with any one of the embodiments of the present application, the extracting features of the first image vector to obtain a first feature vector includes:
extracting features of the first image vector to obtain a third feature vector;
And expanding the size of the third feature vector to be the same as the size of the first image vector to obtain the first feature vector.
In combination with any one of the embodiments of the present application, the obtaining the second feature vector by performing feature extraction processing on the second image vector includes:
Extracting features of the second image vector to obtain a fourth feature vector;
And expanding the size of the fourth feature vector to be the same as the size of the second image vector to obtain the second feature vector.
In combination with any one of the embodiments of the present application, the training image is one of an ultrasound image and a three-dimensional computed tomography (CT) image.
In a second aspect, a method for training a target organ segmentation model is provided, the method comprising:
Acquiring a training image, a target text and a segmentation label, wherein the dimension of the training image is larger than 1, the training image comprises a target organ, the target text is a text describing the target organ in the training image, and the segmentation label comprises the position of the target organ in the training image;
acquiring a target text image model, wherein the target text image model is obtained through training according to the first aspect and any implementation mode thereof;
Processing the target text by using the target text image model to obtain a target vector, wherein the information of the target organ carried by the target vector is matched with the description of the target text on the target organ in the training image;
the target vector is utilized by the segmentation model to be trained to segment the target organ in the training image, and a segmentation result is obtained;
And updating parameters of the segmentation model to be trained based on the segmentation label and the segmentation result to obtain a target segmentation model.
In combination with any one of the embodiments of the present application, updating parameters of the to-be-trained segmentation model based on the segmentation label and the segmentation result to obtain a target segmentation model includes:
Determining the loss of the segmentation model to be trained based on the difference between the segmentation label and the segmentation result, wherein the loss of the segmentation model to be trained is positively correlated with the difference;
And updating parameters of the segmentation model to be trained based on the loss of the segmentation model to be trained to obtain the target segmentation model.
In a third aspect, there is provided a method of segmentation of a target organ, the method comprising:
acquiring an image to be segmented, a reference text, a target segmentation model and a target text image model, wherein the image to be segmented is one of an ultrasonic image and a three-dimensional CT image, the image to be segmented comprises a target organ, the reference text is a text describing the target organ in the image to be segmented, the target segmentation model is obtained by training according to the second aspect and any implementation mode thereof, and the target text image model is obtained by training according to the first aspect and any implementation mode thereof;
converting the reference text into a reference image vector by using the target text image model;
And dividing the target organ in the image to be divided according to the reference image vector by using the target division model to obtain a target division result.
In a fourth aspect, there is provided a training apparatus for a text image model, the training apparatus for a text image model including:
An obtaining unit, configured to obtain a training image, a training text, and a prediction label of the training text, where the dimension of the training image is greater than 1, the training image includes a target organ, the training text is a text describing the target organ in the training image in which the content related to the target organ is occluded, and the prediction label includes the occluded content of the training text;
a conversion unit configured to convert the training image into a first image vector;
the extraction unit is used for extracting the characteristics of the first image vector to obtain a first characteristic vector;
the downsampling unit is used for downsampling the first image vector to obtain a second image vector;
The extracting unit is used for extracting the characteristics of the second image vector to obtain a second characteristic vector;
the prediction unit is used for predicting, by a model to be trained, the occluded content in the training text based on the first feature vector and the second feature vector, to obtain a prediction result;
The updating unit is used for updating parameters of the model to be trained based on the difference between the prediction result and the prediction label to obtain a target text image model, the target text image model is used for generating a target vector based on a target text, the target text is a text describing the target organ, and the information of the target organ carried by the target vector is matched with the description of the target text on the target organ.
In combination with any one of the embodiments of the present application, the extraction unit is specifically configured to:
extracting features of the first image vector to obtain a third feature vector;
And expanding the size of the third feature vector to be the same as the size of the first image vector to obtain the first feature vector.
In combination with any one of the embodiments of the present application, the extraction unit is specifically configured to:
Extracting features of the second image vector to obtain a fourth feature vector;
And expanding the size of the fourth feature vector to be the same as the size of the second image vector to obtain the second feature vector.
In combination with any one of the embodiments of the present application, the training image is one of an ultrasound image and a three-dimensional CT image.
In a fifth aspect, there is provided a training apparatus of a target organ segmentation model, the training apparatus of a target organ segmentation model including:
The device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a training image, a target text and a segmentation label, the dimension of the training image is larger than 1, the training image comprises a target organ, the target text is a text describing the target organ in the training image, and the segmentation label comprises the position of the target organ in the training image;
The acquiring unit is used for acquiring a target text image model, and the target text image model is obtained through training according to the first aspect and any implementation mode thereof;
The processing unit is used for processing the target text by utilizing the target text image model to obtain a target vector, and the information of the target organ carried by the target vector is matched with the description of the target text on the target organ in the training image;
The segmentation unit is used for segmenting the target organ in the training image by utilizing the target vector through a segmentation model to be trained to obtain a segmentation result;
And the updating unit is used for updating the parameters of the to-be-trained segmentation model based on the segmentation label and the segmentation result to obtain a target segmentation model.
In combination with any one of the embodiments of the present application, the updating unit is specifically configured to:
Determining the loss of the segmentation model to be trained based on the difference between the segmentation label and the segmentation result, wherein the loss of the segmentation model to be trained is positively correlated with the difference;
And updating parameters of the segmentation model to be trained based on the loss of the segmentation model to be trained to obtain the target segmentation model.
In a sixth aspect, there is provided a segmentation apparatus of a target organ, the segmentation apparatus of the target organ including:
The device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring an image to be segmented, a reference text, a target segmentation model and a target text image model, the image to be segmented is one of an ultrasonic image and a three-dimensional CT image, the image to be segmented comprises a target organ, the reference text is a text describing the target organ in the image to be segmented, the target segmentation model is obtained through training according to the second aspect and any implementation mode thereof, and the target text image model is obtained through training according to the first aspect and any implementation mode thereof;
A conversion unit for converting the reference text into a reference image vector using the target text image model;
The segmentation unit is used for segmenting the target organ in the image to be segmented according to the reference image vector by utilizing the target segmentation model to obtain a target segmentation result.
In a seventh aspect, there is provided an electronic device comprising: a processor and a memory for storing computer program code, the computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method described in the first aspect and any one of its possible implementations, the method described in the second aspect and any one of its possible implementations, or the method described in the third aspect and any one of its possible implementations.
In an eighth aspect, there is provided another electronic device comprising: a processor, transmission means, input means, output means and a memory for storing computer program code, the computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method described in the first aspect, the second aspect or the third aspect and any one of their possible implementations.
In a ninth aspect, there is provided a computer-readable storage medium having stored therein a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method described in the first aspect, the second aspect or the third aspect and any one of their possible implementations.
In a tenth aspect, there is provided a computer program product comprising a computer program or instructions which, when run on a computer, cause the computer to perform the method described in the first aspect, the second aspect or the third aspect and any one of their possible implementations.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
In the application, the dimension of the training image is greater than 1, the training image includes a target organ, the training text is a text describing the target organ in the training image in which the content related to the target organ is occluded, and the prediction label comprises the occluded content of the training text. After acquiring the training image, the training text and the prediction label, the training device converts the training image into a first image vector and performs feature extraction on the first image vector to obtain a first feature vector. It then downsamples the first image vector to obtain a second image vector and performs feature extraction on the second image vector to obtain a second feature vector, so that the first and second feature vectors carry image information of the training image at different scales. The model to be trained predicts the occluded content in the training text based on the first and second feature vectors to obtain a prediction result; by using image information at different scales to predict the occluded content, the model to be trained establishes a mapping between the image information of the training image and the text information.
The training device updates parameters of the model to be trained based on the difference between the prediction result and the prediction label to obtain a target text image model. This improves the accuracy with which the model maps between image information and text information, so that the target text image model has the capability of mapping between the image information of a medical image and the text information describing it.
Drawings
In order to more clearly describe the embodiments of the present application or the technical solutions in the background art, the following description will describe the drawings that are required to be used in the embodiments of the present application or the background art.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flow chart of a training method of a text image model according to an embodiment of the present application;
FIG. 2 is a flowchart of another training method for text image model according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a training method of a target organ segmentation model according to an embodiment of the present application;
FIG. 4 is a flowchart of another training method of a target organ segmentation model according to an embodiment of the present application;
fig. 5 is a flow chart of a method for segmenting a target organ according to an embodiment of the present application;
FIG. 6 is a flowchart of another method for segmenting a target organ according to an embodiment of the present application;
FIG. 7 is a flowchart of another method for segmenting a target organ according to an embodiment of the present application;
FIG. 8 is a flowchart of another method for segmenting a target organ according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a training device for a text image model according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a training device for a target organ segmentation model according to an embodiment of the present application;
FIG. 11 is a schematic structural view of a target organ segmentation apparatus according to an embodiment of the present application;
Fig. 12 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The training method of the text image model provided by the embodiments of the application is executed by a training device for the text image model (hereinafter simply referred to as the training device), where the training device may be any electronic device capable of executing the technical solutions disclosed in the method embodiments of the application. Optionally, the training device may be one of the following: a mobile phone, a computer, a tablet computer, or a wearable smart device.
The text image model has the capability of generating, based on a text, image information matched with the content of the text, where the image information comprises an image vector from which an image can be generated. For example, if the text states that a kidney is present in the image, the information carried by the image vector generated from that text is that a kidney is present in the image.
It should be understood that the training method of the text image model provided by the application can also be implemented by executing the computer program code by a processor. Embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. Referring to fig. 1, fig. 1 is a flowchart of a training method of a text image model according to an embodiment of the present application.
101. And acquiring a training image, a training text and a prediction label of the training text.
In an embodiment of the present application, the training image includes a target organ, where the target organ may be any organ, for example a kidney, a lung, or a heart. The dimension of the training image is greater than 1: in one possible implementation the training image is a two-dimensional image, and in another possible implementation it is a three-dimensional image. For example, the training image may be an ultrasound image, a CT image, or a magnetic resonance imaging (MRI) image.
In the embodiment of the application, the training text is a text describing the target organ in the training image in which the content related to the target organ is occluded. For example, if the target organ is a kidney and the text describing it in the training image is "a kidney exists in the image", then the training text obtained by occluding the organ-related content is "XX exists in the image", where XX is the occluded content. For another example, if the target organ is a kidney and the text describing it is "the image includes the left kidney, and there are stones in the left kidney", then the training text is "the image includes XX, XXXXXX".
Optionally, the training text includes a mask, wherein the mask is used to occlude content related to the target organ.
In the embodiment of the application, the prediction label of the training text comprises the occluded content of the training text. For example, if the target organ is a kidney, the text describing it in the training image is "a kidney exists in the image", and the training text is "XX is present in the image", then the prediction label includes "kidney". For another example, if the text describing the kidney is "the image includes the left kidney, and there are stones in the left kidney", and the training text is "the image includes XX, XXXXXX", then the prediction label includes "the left kidney" and "stones in the left kidney".
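To make the masking scheme concrete, the following is a minimal sketch of how a training text and its prediction label could be constructed from a descriptive text; the [MASK] token and the helper `build_training_text` are illustrative assumptions, since the patent does not specify a tokenization or masking format.

```python
# Illustrative sketch only: the patent does not fix a masking format, so a
# simple [MASK]-token scheme over organ-related phrases is assumed here.
def build_training_text(description, organ_terms, mask_token="[MASK]"):
    """Occlude organ-related content; the occluded phrases become the prediction label."""
    labels = []
    masked = description
    for term in organ_terms:
        if term in masked:
            labels.append(term)
            masked = masked.replace(term, mask_token)
    return masked, labels

training_text, prediction_label = build_training_text(
    "The image includes the left kidney, and there are stones in the left kidney.",
    ["left kidney", "stones"],
)
# training_text    -> "The image includes the [MASK], and there are [MASK] in the [MASK]."
# prediction_label -> ["left kidney", "stones"]
```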
In one implementation of acquiring training images, a training device receives training images input by a user through an input component. The input assembly includes at least one of: keyboard, mouse, touch screen, touch pad, audio input device.
In another implementation of acquiring the training image, the training device receives the training image sent by a terminal. The terminal may be any of the following: a mobile phone, a computer, a tablet computer, or a server.
In one implementation of obtaining training text, a training device receives training text entered by a user through an input component.
In another implementation of obtaining the training text, the training device receives the training text sent by a terminal. The terminal may be any of the following: a mobile phone, a computer, a tablet computer, or a server.
In one implementation of obtaining predictive labels for training text, a training device receives predictive labels input by a user through an input component.
In another implementation of obtaining the predictive label of the training text, the training device receives the predictive label sent by the terminal.
It should be understood that, in the embodiment of the present application, the step of acquiring the training image, the step of acquiring the training text, and the step of acquiring the predictive label of the training text may be performed simultaneously or separately by the training device, which is not limited in the present application.
102. The training image is converted into a first image vector.
In the embodiment of the application, the first image vector is one-dimensional data. By converting the training image into the first image vector, the training device converts high-dimensional image data into one-dimensional data. In one possible implementation, the training device converts the training image into the first image vector by means of a Transformer.
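As a concrete illustration of this step, the sketch below flattens an image into a one-dimensional token sequence with a ViT-style patch embedding; the patch size, embedding dimension and two-dimensional input are illustrative assumptions, since the patent only states that a Transformer is used.

```python
import torch
import torch.nn as nn

class ImageToVector(nn.Module):
    """Cut the training image into patches and flatten them into a 1-D token sequence."""
    def __init__(self, in_channels=1, patch_size=16, embed_dim=256):
        super().__init__()
        # A convolution with stride == kernel size gives non-overlapping patch embeddings.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, image):
        patches = self.proj(image)                 # (B, D, H/p, W/p)
        return patches.flatten(2).transpose(1, 2)  # (B, num_patches, D): the first image vector

first_image_vector = ImageToVector()(torch.randn(1, 1, 224, 224))  # shape (1, 196, 256)
```

For a three-dimensional training image, an nn.Conv3d patch embedding with a cubic patch size would play the same role.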
103. And extracting the characteristics of the first image vector to obtain a first characteristic vector.
The training device performs feature extraction on the first image vector so as to obtain the first feature vector. In one possible implementation, the training device obtains the first feature vector by performing convolution processing on the first image vector.
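A minimal sketch of this convolution processing, assuming the convolution runs along the token dimension of the first image vector; the kernel size and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(
    nn.Conv1d(in_channels=256, out_channels=256, kernel_size=3, padding=1),
    nn.ReLU(),
)

first_image_vector = torch.randn(1, 196, 256)                # (batch, tokens, dim)
x = first_image_vector.transpose(1, 2)                       # Conv1d expects (B, C, L)
first_feature_vector = feature_extractor(x).transpose(1, 2)  # back to (B, L, C)
```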
104. And downsampling the first image vector to obtain a second image vector.
The training device can reduce the size of the first image vector by downsampling the first image vector to obtain a second image vector.
105. And obtaining a second feature vector by carrying out feature extraction processing on the second image vector.
Since the second image vector is smaller than the first image vector, the features the training device extracts from the second image vector differ in scale from those extracted from the first image vector. Specifically, the scale of the first feature vector differs from that of the second feature vector; that is, the two feature vectors carry feature information of the first image vector at different scales, and therefore carry image information of the training image at different scales.
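The sketch below shows one way steps 104 and 105 could be realized, assuming average pooling as the downsampling operation (the patent does not name one); extracting features at the pooled resolution yields the coarser-scale second feature vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv1d(256, 256, kernel_size=3, padding=1)      # illustrative extractor

first_image_vector = torch.randn(1, 196, 256)              # (B, L, D)
x = first_image_vector.transpose(1, 2)                     # (B, D, L)
second_image_vector = F.avg_pool1d(x, kernel_size=2)       # (B, D, L/2): halved length
# Features extracted at the coarser resolution carry image information at a second scale.
second_feature_vector = F.relu(conv(second_image_vector)).transpose(1, 2)  # (1, 98, 256)
```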
106. And predicting the blocked content in the training text by the model to be trained based on the first feature vector and the second feature vector to obtain a prediction result.
Because the content of the training text matches the image content of the training image, and the first and second feature vectors both carry image information of the training image, the model to be trained can predict the occluded content in the training text using the image information carried by the two feature vectors. For example, if the target organ is a kidney, the training image includes a kidney, and the training text is "XX is present in the image", then, since both feature vectors carry information about the kidney in the training image, the model to be trained can determine from them that a kidney is present in the image, and hence that the occluded content is "kidney". For another example, if the training image includes a left kidney with stones in it and the training text is "the image includes XX, XXXXXX", then, since both feature vectors carry information about the left kidney and its stones, the model to be trained can determine that the occluded content is "the left kidney" and "stones in the left kidney".
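A minimal sketch of how the model to be trained might condition on both feature vectors to predict the occluded tokens; the cross-attention design, vocabulary size and dimensions are illustrative assumptions, not details from the patent.

```python
import torch
import torch.nn as nn

class MaskedTextPredictor(nn.Module):
    """Predict occluded text tokens by attending over multi-scale image features."""
    def __init__(self, vocab_size=30000, dim=256, heads=8):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, text_ids, first_feat, second_feat):
        # Concatenate the two scales so both fine and coarse image information
        # condition the prediction of the occluded content.
        image_feats = torch.cat([first_feat, second_feat], dim=1)  # (B, L1+L2, D)
        text = self.token_embed(text_ids)                          # (B, T, D)
        attended, _ = self.cross_attn(text, image_feats, image_feats)
        return self.head(attended)                                 # (B, T, vocab) logits

logits = MaskedTextPredictor()(
    torch.randint(0, 30000, (1, 12)),   # masked training text as token ids
    torch.randn(1, 196, 256),           # first feature vector
    torch.randn(1, 98, 256),            # second feature vector
)
```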
107. And updating parameters of the model to be trained based on the difference between the prediction result and the prediction label to obtain a target text image model.
The model to be trained predicts the occluded content in the training text based on the first feature vector and the second feature vector to obtain a prediction result; that is, it converts the image information carried by the two feature vectors into text information. The prediction result is therefore the result of the model to be trained converting image information into text information, and specifically, the result of the model to be trained mapping between image information and text information.
As described above, the prediction label includes the occluded content of the training text; the difference between the prediction label and the prediction result therefore characterizes how accurately the model to be trained establishes a mapping between image information and text information. By updating the parameters of the model to be trained based on this difference, the training device can improve the accuracy of that mapping.
Based on the above, when the training device updates the parameters of the model to be trained based on the difference between the prediction result and the prediction label, the resulting target text image model has the capability of establishing a mapping between image information and text information. Specifically, the target text image model can convert image information into text information, and can also convert text information into image information. Thus, the target text image model can be used to generate a target vector based on a target text, where the target text is a text describing the target organ, and the information of the target organ carried by the target vector matches that description. For example, if the target text is "the left lung is included in the figure", the target vector generated by the target text image model includes the information that the image contains the left lung. For another example, if the target text is "the figure shows the right kidney, with a stone near its lateral wall", the target vector includes the information that the image shows the right kidney with a stone near its lateral wall.
Optionally, the training device establishes the mapping relation between image information and text information through Contrastive Language-Image Pre-training (CLIP), thereby training the model to be trained to obtain the target text image model.
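For reference, a minimal sketch of a standard CLIP-style contrastive objective that could serve as this pre-training method; the temperature value and the symmetric cross-entropy are conventional CLIP choices, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; matched image-text pairs lie on the diagonal."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```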
In the embodiment of the application, the dimension of the training image is greater than 1, the training image includes a target organ, the training text is a text describing the target organ in the training image in which the organ-related content is occluded, and the prediction label comprises the occluded content. After acquiring the training image, the training text and the prediction label, the training device converts the training image into a first image vector and extracts features from it to obtain a first feature vector. It then downsamples the first image vector to obtain a second image vector and extracts features from that to obtain a second feature vector, so that the two feature vectors carry image information of the training image at different scales. The model to be trained predicts the occluded content in the training text based on the two feature vectors to obtain a prediction result; by using image information at different scales to predict the occluded content, the model to be trained establishes a mapping between the image information of the training image and the text information.
The training device updates parameters of the model to be trained based on the difference between the prediction result and the prediction label to obtain a target text image model. This improves the accuracy with which the model maps between image information and text information, so that the target text image model has the capability of mapping between the image information of a medical image and the text information describing it.
As an alternative embodiment, the training device performs the following steps in performing step 103:
2001. and extracting the characteristics of the first image vector to obtain a third characteristic vector.
The implementation process of this step may refer to step 103, which will not be described here again. Specifically, the third feature vector in this step corresponds to the first feature vector in step 103.
2002. And expanding the size of the third feature vector to be the same as the size of the first image vector to obtain the first feature vector.
Since feature extraction reduces the size of the extracted vector, the third feature vector is smaller than the first image vector. The training device expands the size of the third feature vector until it equals the size of the first image vector, thereby obtaining the first feature vector. With the first feature vector the same size as the first image vector, the text content describing the image can then be determined from the information carried by the first feature vector.
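A minimal sketch of steps 2001-2002, assuming linear interpolation as the size-expansion operation (the patent only requires that the sizes match):

```python
import torch
import torch.nn.functional as F

first_image_vector = torch.randn(1, 196, 256)    # (B, L, D)
third_feature_vector = torch.randn(1, 98, 256)   # smaller after feature extraction

x = third_feature_vector.transpose(1, 2)         # (B, D, L')
expanded = F.interpolate(x, size=first_image_vector.shape[1], mode="linear")
first_feature_vector = expanded.transpose(1, 2)  # (1, 196, 256): same size as the input
```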
As an alternative embodiment, the training device performs the following steps in performing step 105:
3001. And extracting the characteristics of the second image vector to obtain a fourth characteristic vector.
The implementation process of this step may refer to step 105, and will not be described again here. Specifically, the fourth feature vector in this step corresponds to the second feature vector in step 105.
3002. And expanding the size of the fourth feature vector to be the same as the size of the second image vector to obtain the second feature vector.
Since the second image vector is obtained by downsampling the first image vector, it is smaller than the first image vector, and the fourth feature vector extracted from it is smaller still. The training device expands the size of the fourth feature vector until it equals the size of the second image vector, thereby obtaining the second feature vector. With the second feature vector the same size as the second image vector, the text content describing the image can likewise be determined from the information carried by the second feature vector.
Referring to Fig. 2, Fig. 2 is a flowchart of another training method for a text image model according to an embodiment of the application. As shown in Fig. 2, the datasets required for training include a B-ultrasound dataset, an MRI dataset, and a CT dataset. The images in the B-ultrasound dataset are acquired by B-mode ultrasound examination, the images in the MRI dataset by MRI, and the images in the CT dataset by CT. The images in all three datasets include organs; it should be understood that different images may include different organs.
The images in the datasets are then preprocessed. Specifically, a label processing module generates an image label for each image based on the image's modality and the organs it includes. For example, a CT image including kidneys gets image label 1, a CT image including lungs gets image label 2, and an ultrasound image including kidneys gets image label 3. After the label processing module generates the image labels, a text label generation module produces a text label for each image based on its image label; for example, for a CT image including the kidney with image label 1, the generated text label is "this is a CT image including the kidney". The training text can then be obtained by occluding the organ-related content in the text label.
After text labels and training texts are generated for the images in the dataset, the images, text labels and training texts serve as training data for a 2D/3D language-image pre-training model, which is the text image model to be trained. As shown in Fig. 2, the images in the training data are fed to the 2D/3D language-image pre-training model, which predicts the occluded content in the training text to obtain a prediction result. Finally, the cross-entropy loss is computed from the text labels and the prediction results and back-propagated to update the parameters of the 2D/3D language-image pre-training model, yielding the target text image model.
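A minimal sketch of this update loop: cross-entropy between the predicted token logits and the label tokens, followed by back-propagation. `model`, `optimizer` and the batch fields are assumed names, and ignoring non-occluded positions with index -100 is an illustrative convention.

```python
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    """One parameter update of the 2D/3D language-image pre-training model."""
    logits = model(batch["text_ids"], batch["first_feat"], batch["second_feat"])
    # Only occluded positions carry label tokens; other positions are ignored.
    loss = F.cross_entropy(
        logits.flatten(0, 1), batch["label_ids"].flatten(), ignore_index=-100
    )
    optimizer.zero_grad()
    loss.backward()   # back-propagate the cross-entropy loss
    optimizer.step()  # update the model parameters
    return loss.item()
```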
The embodiment of the application also provides a training method of a target organ segmentation model, which is likewise executed by the training device. Referring to Fig. 3, Fig. 3 is a flowchart illustrating a training method of a target organ segmentation model according to an embodiment of the application.
301. And acquiring a training image, a target text and a segmentation label.
In an embodiment of the present application, the training image includes a target organ, where the target organ may be any organ, for example a kidney, a lung, or a heart. The dimension of the training image is greater than 1: in one possible implementation the training image is a two-dimensional image, and in another possible implementation it is a three-dimensional image; for example, the training image may be an ultrasound image or a CT image.
In the embodiment of the application, the target text is a text describing a target organ in the training image. For example, the target text is: kidneys are present in the training images, where the target organ is the kidney. For another example, the target text is: the lung is present in the training image, and the target organ is the lung.
In the embodiment of the application, the segmentation label comprises the position of the target organ in the training image, so that the target organ can be determined from the training image according to the segmentation label. Optionally, the position of the target organ indicated by the segmentation label is pixel-level: according to the segmentation label, it can be determined which pixels in the training image belong to the target organ and which do not. For example, the segmentation label is an image corresponding to the training image whose pixel values are 0 or 1, where a value of 0 indicates that the corresponding pixel in the training image does not belong to the target organ, and a value of 1 indicates that it does.
302. And acquiring a target text image model.
In the embodiment of the application, the target text image model is obtained by training according to the training method of the text image model. The target text image model may be used to generate a target vector based on the target text, wherein the information of the target organ carried by the target vector matches the description of the target organ by the target text.
303. And processing the target text by using the target text image model to obtain a target vector.
In the embodiment of the application, the information of the target organ carried by the target vector is matched with the description of the target organ in the training image by the target text.
304. The segmentation model to be trained segments the target organ in the training image using the target vector, to obtain a segmentation result.
Because the target vector carries information of the target organ in the training image, the to-be-trained segmentation model can segment the target organ in the training image by using the target vector to obtain a segmentation result, wherein the target organ in the training image can be determined according to the segmentation result.
In one possible implementation, the segmentation model to be trained includes an encoder (Encoder) to encode the image and a decoder (Decoder) to decode the encoder's output.
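The sketch below illustrates one plausible encoder-decoder design conditioned on the target vector; injecting the text information at the bottleneck and all layer sizes are illustrative assumptions, not details from the patent.

```python
import torch
import torch.nn as nn

class TextConditionedSegmenter(nn.Module):
    """Encoder-decoder that segments the organ described by the target vector."""
    def __init__(self, in_channels=1, dim=64, text_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, dim, 3, stride=2, padding=1), nn.ReLU())
        self.text_proj = nn.Linear(text_dim, dim)      # inject the target vector
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 2, stride=2), nn.ReLU(),
            nn.Conv2d(dim, 1, 1))                      # per-pixel foreground logit

    def forward(self, image, target_vector):
        feats = self.encoder(image)                             # (B, dim, H/2, W/2)
        cond = self.text_proj(target_vector)[:, :, None, None]  # (B, dim, 1, 1)
        return self.decoder(feats + cond)                       # (B, 1, H, W) logits

seg_logits = TextConditionedSegmenter()(torch.randn(1, 1, 128, 128), torch.randn(1, 256))
```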
305. And updating parameters of the segmentation model to be trained based on the segmentation label and the segmentation result to obtain a target segmentation model.
Because the segmentation labels represent the positions of the target organs in the training images, and the segmentation results are the segmentation results of the segmentation model to be trained, the accuracy of the segmentation results can be determined by utilizing the segmentation labels, and then the parameters of the segmentation model to be trained can be updated to obtain the target segmentation model.
In one possible implementation manner, the training device updates parameters of the to-be-trained segmentation model based on the difference between the segmentation label and the segmentation result to obtain the target segmentation model.
In another possible implementation manner, the training device determines a loss of the segmentation model to be trained based on the difference between the segmentation label and the segmentation result, wherein the loss of the segmentation model to be trained is positively correlated with the difference. And updating parameters of the segmentation model to be trained based on the loss of the segmentation model to be trained to obtain a target segmentation model.
In yet another possible implementation, the training device calculates the cross entropy loss based on the segmentation labels and the segmentation results. And updating parameters of the segmentation model to be trained based on the cross entropy loss to obtain a target segmentation model.
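As a concrete example, a loss that is positively correlated with the difference between the segmentation label and the segmentation result could be the pixel-wise binary cross-entropy sketched below; the patent names cross-entropy, while treating segmentation as per-pixel binary classification is an assumption.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(seg_logits, seg_label):
    """Pixel-wise binary cross-entropy between result and 0/1 segmentation label."""
    # seg_logits: (B, 1, H, W) raw scores; seg_label: (B, 1, H, W) with values in {0, 1}
    return F.binary_cross_entropy_with_logits(seg_logits, seg_label.float())
```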
In the embodiment of the application, the dimension of the training image is greater than 1, and the training image comprises a target organ, wherein the target text is a text describing the target organ in the training image. The segmentation labels include the location of the target organ in the training image. The target text image model is trained according to the training method of the text image model, so that the target text image model has the capability of generating a target vector based on the target text, wherein the information of the target organ carried by the target vector is matched with the description of the target organ by the target text.
Then, after acquiring the training image, the target text, the segmentation label and the target text image model, the training device processes the target text with the target text image model to obtain a target vector, thereby generating, from the text describing the training image, an image vector carrying information about the target organ in the training image. The segmentation model to be trained then uses the target vector to segment the target organ in the training image, exploiting the information about the target organ that the vector carries, to obtain a segmentation result. Finally, the parameters of the segmentation model to be trained are updated based on the segmentation label and the segmentation result to obtain the target segmentation model, improving the accuracy with which the target segmentation model segments the target organ.
It should be appreciated that for the target segmentation model, the target organ is the organ described in the target text, that is, the target organ may be any organ, for example, the organ described in the target text is a kidney, and the target segmentation model may be used to segment the kidney in the image. For another example, where the organ described in the target text is a lung, the target segmentation model may be used to segment the lung in the image.
Referring to Fig. 4, Fig. 4 is a flowchart illustrating another training method of a target organ segmentation model according to an embodiment of the application. As shown in Fig. 4, the training input is training data comprising an image, a text label of the image, and a segmentation label of the image, where the image may be an image from the training set in Fig. 2, the text label may be a text label from Fig. 2, and the segmentation label includes the position of an organ in the image. First, the text data (i.e., the text labels) and the image data (i.e., the images in the training set of Fig. 2) are input into the 2D/3D language-image pre-training model module; this module contains the model to be trained, and training it yields the target text image model. The target text image model converts the text labels into vectors, thereby converting text information into image information. These vectors and the image data in the training data are then input into the segmentation model to be trained, which outputs the organ segmentation result. The cross-entropy loss is computed from the segmentation labels in the training data and the organ segmentation results, and the parameters of the segmentation model to be trained are updated by back-propagating this loss, yielding the target segmentation model.
The embodiment of the application further provides a method for segmenting a target organ. The execution subject of the method is a target organ segmentation device (hereinafter referred to as the segmentation device), and the segmentation device may be any electronic equipment capable of executing the technical scheme disclosed in the method embodiments of the application. Optionally, the segmentation device may be one of the following: a mobile phone, a computer, a tablet computer, or a wearable smart device.
Referring to fig. 5, fig. 5 is a flowchart illustrating a method for segmenting a target organ according to an embodiment of the application.
501. Acquire an image to be segmented, a reference text, a target segmentation model and a target text image model.
In the embodiment of the application, the image to be segmented is one of an ultrasonic image and a three-dimensional CT image, the image to be segmented includes a target organ, and the reference text is a text describing the target organ in the image to be segmented. For example, if the reference text is "there is a kidney in the image to be segmented", the target organ is the kidney; if the reference text is "there is a lung in the image to be segmented", the target organ is the lung.
In the embodiment of the application, the target segmentation model is obtained by training according to the training method of the target organ segmentation model described above, and may be used to segment the target organ from an image. The target text image model is trained according to the training method of the text image model described above.
502. Convert the reference text into a reference image vector by using the target text image model.
As described above, the target text image model has the capability of establishing a mapping between the image information and the text information, and therefore, the segmentation device can convert the text information in the reference text into the image information by using the target text image model, thereby obtaining the reference image vector.
503. Segment the target organ in the image to be segmented according to the reference image vector by using the target segmentation model to obtain a target segmentation result.
In the embodiment of the application, the target segmentation result comprises a target organ in the image to be segmented, and the target organ can be determined from the image to be segmented according to the target segmentation result.
Because the reference image vector carries information of the target organ in the image to be segmented, the target segmentation model can segment the target organ in the image to be segmented according to the reference image vector to obtain a target segmentation result.
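A minimal sketch of steps 502 and 503 is given below, assuming the same hypothetical text image model and segmentation model interfaces as in the training sketch above; the function and argument names are illustrative only.

```python
import torch

@torch.no_grad()
def segment_target_organ(image, reference_text, text_image_model, seg_model):
    # Step 502: the target text image model converts the text information in
    # the reference text into image information (the reference image vector).
    reference_vector = text_image_model(reference_text)
    # Step 503: the target segmentation model segments the target organ in the
    # image to be segmented according to the reference image vector.
    target_segmentation = seg_model(image, reference_vector)
    return target_segmentation
```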
In the embodiment of the application, the image to be segmented is one of an ultrasonic image and a three-dimensional CT image, the image to be segmented comprises a target organ, the reference text is a text describing the target organ in the image to be segmented, and the target segmentation model is obtained by training according to the training method of the target organ segmentation model. Therefore, after the image to be segmented, the reference text and the target segmentation model are acquired, the segmentation device can segment the target organ in the image to be segmented by using the target segmentation model to obtain a target segmentation result.
It should be understood that, since the segmentation device first converts the reference text into the reference image vector by using the target text image model and then segments the target organ in the image to be segmented according to that vector by using the target segmentation model, the segmentation device can segment the target organ regardless of the modality of the image, as long as the reference text describes the target organ in the image. That is, the segmentation device may segment the target organ in an image of any modality; specifically, the image may be an ultrasound image or a CT image.
Referring to fig. 6, fig. 6 is a flowchart illustrating another method for segmenting a target organ according to an embodiment of the present application. As shown in fig. 6, an image to be segmented and a reference text are input, where the reference text is: this is a CT image of the kidney. The image to be segmented is converted into a one-dimensional image vector through data preprocessing. The one-dimensional image vector and the reference text are then input to a medical image generic model, which comprises the target text image model and the target segmentation model. The target text image model converts the reference text into a reference image vector, and the target segmentation model processes the one-dimensional image vector according to the reference image vector, thereby segmenting the target organ in the image to be segmented and obtaining a target segmentation result. As shown in fig. 6, the output target segmentation result is the two kidneys.
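The data preprocessing that converts the image to be segmented into a one-dimensional image vector is not specified in detail; the sketch below is one plausible form, in which the intensity normalization and the flattening order are assumptions for illustration.

```python
import numpy as np

def preprocess(volume):
    """Normalize a CT/ultrasound volume and flatten it into a
    one-dimensional image vector, as in figs. 6 and 7 (assumed scheme)."""
    v = volume.astype(np.float32)
    # Scale intensities to [0, 1]; the epsilon guards against flat volumes.
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)
    return v.reshape(-1)  # one-dimensional image vector
```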
Referring to fig. 7, fig. 7 is a flowchart illustrating another method for segmenting a target organ according to an embodiment of the present application. As shown in fig. 7, an image to be segmented and a reference text are input, where the reference text is: kidney. The image to be segmented is converted into a one-dimensional image vector through data preprocessing. The one-dimensional image vector and the reference text are then input to the medical image generic model, which comprises the target text image model and the target segmentation model. The target text image model converts the reference text into a reference image vector, and the target segmentation model processes the one-dimensional image vector according to the reference image vector, thereby segmenting the target organ in the image to be segmented and obtaining a target segmentation result. As shown in fig. 7, the output target segmentation result is the kidney.
Referring to fig. 8, fig. 8 is a flowchart illustrating another method for segmenting a target organ according to an embodiment of the present application. As shown in fig. 8, the model inputs include ultrasound images, CT images, and MRI images, and each image is accompanied by a prompt, i.e., a reference text that matches the organ in the image. Specifically, the ultrasound images are two images whose prompts are "kidney" and "this is a ventricular ultrasound image" respectively; the CT images are four images whose prompts are "this is a CT image of the liver", "segment the kidneys", "spleen and stomach", and "kidneys and kidney stones" respectively; the MRI images are two images whose prompts are "ventricle" and "brain tumor" respectively. As shown in fig. 8, the model input is processed by the target text image model and the target segmentation model to obtain the model output, which is the target segmentation result.
It will be appreciated by those skilled in the art that, in the methods of the specific embodiments described above, the written order of the steps does not imply a strict order of execution; the specific execution order of the steps should be determined by their functions and possible inherent logic.
If the technical scheme of the application involves personal information, a product applying the technical scheme clearly informs the individual of the personal information processing rules and obtains the individual's voluntary consent before processing the personal information. If the technical scheme involves sensitive personal information, the product obtains the individual's separate consent before processing it and at the same time satisfies the requirement of "explicit consent". For example, a clear and conspicuous sign is placed at a personal information collection device such as a camera to inform the individual that he or she has entered the collection range and that personal information will be collected; if the individual voluntarily enters the collection range, consent to collection is deemed to be given. Alternatively, on a device that processes personal information, a conspicuous sign or notice states the personal information processing rules, and the individual's authorization is obtained through a pop-up message or by asking the individual to upload the personal information. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing mode, and the kinds of personal information to be processed.
The methods of the embodiments of the present application are described in detail above; the apparatuses of the embodiments of the present application are provided below.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a training device for text image model according to an embodiment of the present application. The training device 1 for a text image model includes: acquisition unit 11, conversion unit 12, extraction unit 13, downsampling unit 14, prediction unit 15, update unit 16, specifically:
An obtaining unit 11, configured to obtain a training image, a training text, and a prediction label of the training text, where a dimension of the training image is greater than 1, the training image includes a target organ, the training text is a text describing the target organ in the training image, and content related to the target organ in the training text is blocked, and the prediction label includes blocked content in the training text;
A conversion unit 12 for converting the training image into a first image vector;
an extracting unit 13, configured to obtain a first feature vector by performing feature extraction on the first image vector;
A downsampling unit 14, configured to downsample the first image vector to obtain a second image vector;
the extracting unit 13 is configured to obtain a second feature vector by performing feature extraction processing on the second image vector;
the predicting unit 15 is configured to predict, through a model to be trained, the blocked content in the training text based on the first feature vector and the second feature vector, so as to obtain a prediction result;
And the updating unit 16 is configured to update parameters of the model to be trained based on a difference between the prediction result and the prediction label, so as to obtain a target text image model, where the target text image model is used to generate a target vector based on a target text, the target text is a text describing the target organ, and information of the target organ carried by the target vector is matched with the description of the target organ by the target text.
In combination with any embodiment of the present application, the extracting unit 13 is specifically configured to:
extracting features of the first image vector to obtain a third feature vector;
And expanding the size of the third feature vector to be the same as the size of the first image vector to obtain the first feature vector.
In combination with any embodiment of the present application, the extracting unit 13 is specifically configured to:
Extracting features of the second image vector to obtain a fourth feature vector;
And expanding the size of the fourth feature vector to be the same as the size of the second image vector to obtain the second feature vector.
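A minimal sketch of this extract-then-expand scheme at two scales is shown below; the use of 3D convolutions, average pooling for the downsampling, and trilinear interpolation for the size expansion are assumptions for illustration, since the patent does not fix these operators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleExtractor(nn.Module):
    """Extract features from an image vector at two scales, then expand each
    feature map back to the size of the vector it was extracted from."""

    def __init__(self, channels=1, features=32):
        super().__init__()
        self.extract = nn.Sequential(
            nn.Conv3d(channels, features, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(features, features, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, first_image_vector):
        # Downsample the first image vector to obtain the second image vector.
        second_image_vector = F.avg_pool3d(first_image_vector, kernel_size=2)
        # Feature extraction yields the third and fourth feature vectors.
        third = self.extract(first_image_vector)
        fourth = self.extract(second_image_vector)
        # Expand each feature vector to the size of its source image vector,
        # giving the first and second feature vectors.
        first_feature = F.interpolate(third, size=first_image_vector.shape[2:],
                                      mode="trilinear", align_corners=False)
        second_feature = F.interpolate(fourth, size=second_image_vector.shape[2:],
                                       mode="trilinear", align_corners=False)
        return first_feature, second_feature
```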
In combination with any one of the embodiments of the present application, the training image is one of an ultrasound image and a three-dimensional CT image.
In the embodiment of the application, the dimension of the training image is greater than 1, the training image includes a target organ, the training text is a text describing the target organ in the training image, the content related to the target organ in the training text is blocked, and the prediction label includes the blocked content in the training text. After acquiring the training image, the training text and the prediction label of the training text, the training device converts the training image into a first image vector and extracts features from the first image vector to obtain a first feature vector. The first image vector is then downsampled to obtain a second image vector, and feature extraction is performed on the second image vector to obtain a second feature vector, so that the first feature vector and the second feature vector carry image information of the training image at different scales. The model to be trained predicts the blocked content in the training text based on the first feature vector and the second feature vector to obtain a prediction result; because it predicts the blocked content using image information at different scales, the model to be trained can establish a mapping between the image information of the training image at different scales and the text information.
The training device updates the parameters of the model to be trained based on the difference between the prediction result and the prediction label to obtain the target text image model. This improves the accuracy of the mapping that the model establishes between image information and text information, so that the target text image model has the capability of mapping between the image information of a medical image and the text information describing that image.
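The patent specifies only that the model to be trained predicts the blocked content from the two feature vectors and that its parameters are updated from the difference between prediction result and prediction label; one plausible realization of a single update step, with an assumed token-level cross entropy and an assumed model interface, is sketched below.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, first_feature, second_feature,
               masked_text_tokens, prediction_label_tokens):
    # The model fuses both feature scales with the masked training text and
    # predicts token logits of shape (batch, text_length, vocab_size).
    logits = model(first_feature, second_feature, masked_text_tokens)
    # Difference between the prediction result and the prediction label,
    # measured as cross entropy over the blocked token positions.
    loss = F.cross_entropy(logits.transpose(1, 2), prediction_label_tokens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # update the parameters of the model to be trained
    return loss.item()
```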
Referring to fig. 10, fig. 10 is a schematic structural diagram of a training device for a target organ segmentation model according to an embodiment of the application. The training device 2 for a target organ segmentation model includes: acquisition unit 21, processing unit 22, segmentation unit 23, updating unit 24, in particular:
An obtaining unit 21, configured to obtain a training image, a target text, and a segmentation label, where the dimension of the training image is greater than 1, the training image includes a target organ, the target text is a text describing the target organ in the training image, and the segmentation label includes a position of the target organ in the training image;
the acquiring unit 21 is configured to acquire a target text image model, where the target text image model is trained according to the training method of the text image model described above;
A processing unit 22, configured to process the target text by using the target text image model to obtain a target vector, where information of the target organ carried by the target vector is matched with a description of the target organ in the training image by the target text;
a segmentation unit 23, configured to segment the target organ in the training image by using the target vector through a to-be-trained segmentation model, so as to obtain a segmentation result;
and the updating unit 24 is configured to update parameters of the to-be-trained segmentation model based on the segmentation label and the segmentation result, so as to obtain a target segmentation model.
In combination with any embodiment of the present application, the updating unit 24 is specifically configured to:
Determining the loss of the segmentation model to be trained based on the difference between the segmentation label and the segmentation result, wherein the loss of the segmentation model to be trained is positively correlated with the difference;
And updating parameters of the segmentation model to be trained based on the loss of the segmentation model to be trained to obtain the target segmentation model.
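The patent requires only that the loss be positively correlated with the difference between the segmentation label and the segmentation result; besides the cross entropy of fig. 4, a soft Dice loss, common in medical image segmentation, would also satisfy this and is sketched below as an assumed alternative.

```python
import torch
import torch.nn.functional as F

def dice_loss(probs, seg_label, eps=1e-6):
    """Soft Dice loss: 0 for perfect overlap with the segmentation label,
    growing as the segmentation result diverges (positive correlation).
    `probs` are softmax probabilities of shape (B, C, D, H, W);
    `seg_label` holds class indices of shape (B, D, H, W)."""
    target = F.one_hot(seg_label, num_classes=probs.shape[1])
    target = target.movedim(-1, 1).float()      # -> (B, C, D, H, W)
    dims = tuple(range(2, probs.ndim))          # sum over spatial dimensions
    inter = (probs * target).sum(dim=dims)
    union = probs.sum(dim=dims) + target.sum(dim=dims)
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()
```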
In the embodiment of the application, the dimension of the training image is greater than 1 and the training image includes a target organ, where the target text is a text describing the target organ in the training image, and the segmentation label includes the position of the target organ in the training image. Because the target text image model is trained according to the training method of the text image model described above, it has the capability of generating a target vector based on the target text, where the information of the target organ carried by the target vector matches the description of the target organ in the target text.
After acquiring the training image, the target text, the segmentation label and the target text image model, the training device processes the target text with the target text image model to obtain the target vector; in other words, an image vector carrying information of the target organ in the training image is generated from the text describing the training image. The segmentation model to be trained then uses the target vector to segment the target organ in the training image, so that the information of the target organ carried by the vector assists the segmentation and a segmentation result is obtained. Finally, the parameters of the segmentation model to be trained are updated based on the segmentation label and the segmentation result to obtain the target segmentation model, which improves the accuracy with which the target segmentation model segments the target organ.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a target organ segmentation apparatus according to an embodiment of the application. The target organ segmentation apparatus 3 includes: acquisition unit 31, conversion unit 32, division unit 33, specifically:
An obtaining unit 31, configured to obtain an image to be segmented, a reference text, a target segmentation model and a target text image model, where the image to be segmented is one of an ultrasound image and a three-dimensional CT image, the image to be segmented includes a target organ, the reference text is a text describing the target organ in the image to be segmented, the target segmentation model is obtained by training according to the training method of the target organ segmentation model, and the target text image model is obtained by training according to the training method of the text image model;
A conversion unit 32 for converting the reference text into a reference image vector using the target text image model;
the segmentation unit 33 is configured to segment the target organ in the image to be segmented according to the reference image vector by using the target segmentation model, so as to obtain a target segmentation result.
In the embodiment of the application, the image to be segmented is one of an ultrasonic image and a three-dimensional CT image, the image to be segmented comprises a target organ, the reference text is a text describing the target organ in the image to be segmented, and the target segmentation model is obtained by training according to the training method of the target organ segmentation model. Therefore, after the image to be segmented, the reference text and the target segmentation model are acquired, the segmentation device can segment the target organ in the image to be segmented by using the target segmentation model to obtain a target segmentation result.
It should be understood that, since the segmentation device first converts the reference text into the reference image vector by using the target text image model and then segments the target organ in the image to be segmented according to that vector by using the target segmentation model, the segmentation device can segment the target organ regardless of the modality of the image, as long as the reference text describes the target organ in the image. That is, the segmentation device may segment the target organ in an image of any modality; specifically, the image may be an ultrasound image or a CT image.
In some embodiments, the functions or modules included in the apparatus provided by the embodiments of the present application may be used to perform the methods described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
Fig. 12 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application. The electronic device 4 comprises a processor 41 and a memory 42. Optionally, the electronic device 4 further comprises an input device 43 and an output device 44. The processor 41, the memory 42, the input device 43, and the output device 44 are coupled through connectors, which include various interfaces, transmission lines, buses, and the like; this is not limited in the embodiment of the application. It should be appreciated that, in the embodiments of the application, "coupled" means interconnected in a particular way, including directly connected or indirectly connected through other devices, for example, through various interfaces, transmission lines, buses, and the like.
The processor 41 may be one or more graphics processing units (GPUs); where the processor 41 is a GPU, the GPU may be a single-core GPU or a multi-core GPU. Alternatively, the processor 41 may be a processor group formed by a plurality of GPUs coupled to each other through one or more buses. The processor may also be another type of processor, which is not limited in the embodiment of the application.
The memory 42 may be used to store computer program instructions as well as various types of computer program code for performing aspects of the present application. Optionally, the memory includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM) for the associated instructions and data.
The input means 43 are for inputting data and/or signals and the output means 44 are for outputting data and/or signals. The input device 43 and the output device 44 may be separate devices or may be an integral device.
It will be appreciated that in the embodiment of the present application, the memory 42 may be used to store not only related instructions, but also related data, for example, the memory 42 may be used to store a training image, a training text, and a predictive label of the training text obtained through the input device 43, or the memory 42 may be used to store a target text image model obtained through the processor 41, etc., where the embodiment of the present application is not limited to the data specifically stored in the memory.
It will be appreciated that fig. 12 shows only a simplified design of an electronic device. In practical applications, the electronic device may further include other necessary elements, including but not limited to any number of input/output devices, processors, memories, etc., and all electronic devices that can implement the embodiments of the present application are within the scope of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein. It will be further apparent to those skilled in the art that the descriptions of the various embodiments of the present application are provided with emphasis, and that the same or similar parts may not be described in detail in different embodiments for convenience and brevity of description, and thus, parts not described in one embodiment or in detail may be referred to in description of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
In the above embodiments, implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disc (DVD)), a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that all or part of the flows of the above method embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium, and when executed, may include the flows of the above method embodiments. The aforementioned storage medium includes: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.

Claims (8)

1. A method of training a text image model, the method comprising:
Acquiring a training image, a training text and a prediction label of the training text, wherein the dimension of the training image is larger than 1, the training image comprises a target organ, the training text is a text describing the target organ in the training image, the content related to the target organ in the training text is blocked, and the prediction label comprises the blocked content in the training text;
Converting the training image into a first image vector;
Extracting features of the first image vector to obtain a first feature vector; the feature extraction is performed on the first image vector to obtain a first feature vector, which comprises the following steps: extracting features of the first image vector to obtain a third feature vector; expanding the size of the third feature vector to be the same as the size of the first image vector to obtain the first feature vector;
Downsampling the first image vector to obtain a second image vector;
obtaining a second feature vector by carrying out feature extraction processing on the second image vector; the obtaining a second feature vector by performing feature extraction processing on the second image vector includes: extracting features of the second image vector to obtain a fourth feature vector; expanding the size of the fourth feature vector to be the same as the size of the second image vector to obtain the second feature vector;
predicting, by a model to be trained, the blocked content in the training text based on the first feature vector and the second feature vector to obtain a prediction result;
updating parameters of the model to be trained based on the difference between the prediction result and the prediction label to obtain a target text image model, wherein the target text image model is used for generating a target vector based on a target text, the target text is a text describing the target organ, and information of the target organ carried by the target vector is matched with the description of the target organ by the target text.
2. A method of training a target organ segmentation model, the method comprising:
Acquiring a training image, a target text and a segmentation label, wherein the dimension of the training image is larger than 1, the training image comprises a target organ, the target text is a text describing the target organ in the training image, and the segmentation label comprises the position of the target organ in the training image;
Acquiring a target text image model, wherein the target text image model is trained according to the method of claim 1;
Processing the target text by using the target text image model to obtain a target vector, wherein the information of the target organ carried by the target vector is matched with the description of the target text on the target organ in the training image;
segmenting, by a segmentation model to be trained, the target organ in the training image by using the target vector to obtain a segmentation result;
And updating parameters of the segmentation model to be trained based on the segmentation label and the segmentation result to obtain a target segmentation model.
3. A method of segmenting a target organ, the method comprising:
Acquiring an image to be segmented, a reference text, a target segmentation model and a target text image model, wherein the image to be segmented is one of an ultrasonic image and a three-dimensional CT image, the image to be segmented comprises a target organ, the reference text is a text describing the target organ in the image to be segmented, the target segmentation model is obtained by training according to the method of claim 2, and the target text image model is obtained by training according to the method of claim 1;
converting the reference text into a reference image vector by using the target text image model;
And dividing the target organ in the image to be divided according to the reference image vector by using the target division model to obtain a target division result.
4. A training device for a text image model, characterized in that the training device for a text image model comprises:
An obtaining unit, configured to obtain a training image, a training text, and a prediction label of the training text, where a dimension of the training image is greater than 1, the training image includes a target organ, the training text is a text describing the target organ in the training image, and content related to the target organ in the training text is blocked, and the prediction label includes blocked content in the training text;
a conversion unit configured to convert the training image into a first image vector;
The extraction unit is used for extracting the characteristics of the first image vector to obtain a first characteristic vector; the extraction unit is specifically configured to: extracting features of the first image vector to obtain a third feature vector; expanding the size of the third feature vector to be the same as the size of the first image vector to obtain the first feature vector;
the downsampling unit is used for downsampling the first image vector to obtain a second image vector;
The extracting unit is used for extracting the characteristics of the second image vector to obtain a second characteristic vector; the extraction unit is specifically configured to: extracting features of the second image vector to obtain a fourth feature vector; expanding the size of the fourth feature vector to be the same as the size of the second image vector to obtain the second feature vector;
the prediction unit is used for predicting the blocked content in the training text based on the first feature vector and the second feature vector through a model to be trained to obtain a prediction result;
The updating unit is used for updating parameters of the model to be trained based on the difference between the prediction result and the prediction label to obtain a target text image model, the target text image model is used for generating a target vector based on a target text, the target text is a text describing the target organ, and the information of the target organ carried by the target vector is matched with the description of the target text on the target organ.
5. A training device for a target organ segmentation model, characterized in that the training device for a target organ segmentation model comprises:
The device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a training image, a target text and a segmentation label, the dimension of the training image is larger than 1, the training image comprises a target organ, the target text is a text describing the target organ in the training image, and the segmentation label comprises the position of the target organ in the training image;
the acquiring unit is used for acquiring a target text image model, and the target text image model is trained according to the method of claim 1;
The processing unit is used for processing the target text by utilizing the target text image model to obtain a target vector, and the information of the target organ carried by the target vector is matched with the description of the target text on the target organ in the training image;
a segmentation unit, configured to segment the target organ in the training image by using the target vector through a segmentation model to be trained, to obtain a segmentation result;
and an updating unit, configured to update parameters of the segmentation model to be trained based on the segmentation label and the segmentation result to obtain a target segmentation model.
6. A segmentation apparatus for a target organ, characterized in that the segmentation apparatus for a target organ comprises:
The device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring an image to be segmented, a reference text, a target segmentation model and a target text image model, the image to be segmented is one of an ultrasonic image and a three-dimensional CT image, the image to be segmented comprises a target organ, the reference text is a text describing the target organ in the image to be segmented, the target segmentation model is obtained by training according to the method of claim 2, and the target text image model is obtained by training according to the method of claim 1;
A conversion unit for converting the reference text into a reference image vector using the target text image model;
The segmentation unit is used for segmenting the target organ in the image to be segmented according to the reference image vector by utilizing the target segmentation model to obtain a target segmentation result.
7. An electronic device, comprising: a processor and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of claim 1, or to perform the method of claim 2, or to perform the method of claim 3.
8. A computer readable storage medium, in which a computer program is stored, the computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of claim 1, or to perform the method of claim 2, or to perform the method of claim 3.
CN202410036431.7A 2024-01-10 2024-01-10 Training method of text image model, related method and related product Active CN117556077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410036431.7A CN117556077B (en) 2024-01-10 2024-01-10 Training method of text image model, related method and related product

Publications (2)

Publication Number Publication Date
CN117556077A CN117556077A (en) 2024-02-13
CN117556077B (en) 2024-05-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant