WO2024001104A1 - Image-text data mutual-retrieval method and apparatus, and device and readable storage medium - Google Patents

Image-text data mutual-retrieval method and apparatus, and device and readable storage medium

Info

Publication number
WO2024001104A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
image
features
feature
information
Prior art date
Application number
PCT/CN2022/141374
Other languages
French (fr)
Chinese (zh)
Inventor
赵雅倩
王立
范宝余
Original Assignee
苏州元脑智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州元脑智能科技有限公司
Publication of WO2024001104A1 publication Critical patent/WO2024001104A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 - Querying
    • G06F16/532 - Query formulation, e.g. graphical querying
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 - ICT specially adapted for the handling or processing of medical images
    • G16H30/20 - ICT specially adapted for the handling or processing of medical images for handling medical images, e.g. DICOM, HL7 or PACS

Definitions

  • This application belongs to the field of computers, and specifically relates to an image-text data mutual-retrieval method, apparatus, device, and readable storage medium.
  • Single-modal retrieval can only query information of the same modality, such as retrieving text with text or images with images.
  • Cross-modal retrieval refers to using a sample of one modality to retrieve semantically similar samples of another modality, such as retrieving text with an image or images with text.
  • The cross-domain heterogeneity in this application is mainly reflected in the fact that image and text data lie in different spaces and are heterogeneous data. Correct retrieval therefore requires the retrieval method to have cross-domain capability, achieving alignment and ranking between modalities.
  • Compared with single-modal data, cross-modal retrieval must model not only the relationships within each modality's data but also the correlations between different modalities, so as to achieve cross-domain retrieval between modalities.
  • Cross-modal retrieval is highly flexible, has wide application scenarios and strong user demand, and is an important research topic in cross-modal machine learning, with very important academic value and significance.
  • This application proposes a medical image-text data mutual-retrieval method, including:
  • iteratively training, based on text features and image features and on a predetermined loss function, to generate an image-text data mutual-retrieval model; and
  • retrieving, through the image-text data mutual-retrieval model, the corresponding text information and/or image information for the input image-text data.
  • Another aspect of this application proposes a medical image-text data mutual-retrieval apparatus, including:
  • a preprocessing module, configured to classify the text information in the image-text data in a multi-level manner according to a predetermined method, and to pass the classified text information through a first neural network model to generate text features in a cascaded manner according to the classification relationship;
  • a first model calculation module, configured to generate image features from the image information in the image-text data, in the form of an image sequence, through a second neural network model;
  • a second model calculation module, configured to iteratively train, based on the text features and image features and on a predetermined loss function, to generate an image-text data mutual-retrieval model; and
  • an image-text mutual-retrieval module, configured to retrieve, through the image-text data mutual-retrieval model, the corresponding text information and/or image information for the input image-text data.
  • Another aspect of the present application proposes a computer device, including a memory and one or more processors. Computer-readable instructions are stored in the memory; when executed by the one or more processors, they cause the one or more processors to implement the steps of the above medical image-text data mutual-retrieval method.
  • Another aspect of the present application proposes one or more non-volatile computer-readable storage media storing computer-readable instructions; when the computer-readable instructions are executed by one or more processors, they cause the one or more processors to execute the steps of the above medical image-text data mutual-retrieval method.
  • Figure 1 is a flow chart of an embodiment of a medical image and text retrieval method provided by this application according to one or more embodiments;
  • Figure 2 is a schematic diagram of medical text data provided by this application according to one or more embodiments.
  • Figure 3 is a schematic diagram of a partial model structure of a medical image and text retrieval method provided by this application according to one or more embodiments;
  • Figure 4 is a schematic diagram of a partial model structure of a medical image and text retrieval method provided by this application according to one or more embodiments;
  • Figure 5 is a schematic diagram of a partial model structure of a medical image and text retrieval method provided by this application according to one or more embodiments;
  • Figure 6 is a schematic structural diagram of a model of a medical image and text retrieval method provided by this application according to one or more embodiments;
  • Figure 7 is a schematic diagram of a partial model structure of a medical image and text retrieval method provided by this application according to one or more embodiments;
  • Figure 8 is a schematic diagram of a partial model structure of a medical image and text retrieval method provided by this application according to one or more embodiments.
  • Figure 9 is a schematic structural diagram of a medical image and text data mutual detection device provided by the present application according to one or more embodiments.
  • Figure 10 is a schematic structural diagram of a computer device provided according to one or more embodiments of the present application.
  • Figure 11 is a schematic structural diagram of a non-volatile computer-readable storage medium provided by this application according to one or more embodiments.
  • the task data consists of two parts: medical images and medical text.
  • Medical images include many types, such as MRI, CT, and ultrasound images, all of which are sequence images.
  • Medical text includes medical record reports and the like. This is only an example; it does not mean that the method of this application can be applied only in this field.
  • This application proposes a medical image-text data mutual-retrieval method.
  • The method is described below as applied to a computer device by way of example, and includes:
  • Step S1: classify the text information in the image-text data in a multi-level manner according to a predetermined method, and pass the classified text information through a first neural network model to generate text features in a cascaded manner according to the classification relationship;
  • Step S2: generate image features from the image information in the image-text data, in the form of an image sequence, through a second neural network model;
  • Step S3: iteratively train, based on the text features and image features and on a predetermined loss function, to generate an image-text data mutual-retrieval model;
  • Step S4: retrieve, through the image-text data mutual-retrieval model, the corresponding text information and/or image information for the text information and/or image information in the input image-text data.
  • The image-text data refers to the text data and image data corresponding to a medical image; that is, the image-text data in this application refers to a medical sequence image together with the corresponding disease description, the patient's physical status information, and other information related to the patient's condition. See Figure 2 for details.
  • The text data in the image-text data is divided into multiple categories, as shown in Figure 2, and the classified text is input into the first neural network model category by category.
  • The first neural network model is a Transformer model. That is, in step S1, corresponding feature vectors are computed for the classified text data by multiple Transformer models; the feature vectors output by these Transformer models are then fed as input into a superior Transformer model, and the output of the superior Transformer model is taken as the text feature.
  • In step S2, the medical images in the image-text data are processed through the residual network model ResNet to obtain the corresponding image features. An image feature is a vector of a specified size.
  • There is at least one medical image in the image-text data, and usually there are multiple, since in practice medical imaging such as MRI or CT generally scans the lesion in multiple slices or from multiple angles. Therefore, when there are multiple medical images, the corresponding image feature needs to be generated from all of them.
  • In step S3, the above text features and image features are matched for similarity, the corresponding similarity loss value is computed according to the preset loss function, and that loss value is back-propagated to the Transformer models and the residual network model for repeated iterative training. Once the loss value meets the accuracy requirement, the Transformer models, the residual network model, and the corresponding parameters of the loss function are saved as the mutual-retrieval model.
  • In step S4, when the mutual-retrieval model is used for analysis or prediction, the text description of the corresponding case or condition and/or the corresponding medical image is input into the model, which returns a matching examination report for the input text or image, or filters out the diagnosis content of the corresponding condition via the corresponding medical image. This realizes image-text mutual retrieval for medical images and helps medical workers reduce their workload.
  • In some implementations, classifying the text information in the image-text data in a multi-level manner according to a predetermined method, and passing the classified text information through the first neural network model to generate text features in a cascaded manner according to the classification relationship, includes: classifying the text information according to text structure type, and computing, through the first neural network model, a feature vector for each classified structural text information.
  • Figure 2 shows the description of a patient's condition at a hospital, including the patient's personal information, such as age, marital status, and occupation, as well as allergy history, history of present illness, personal history, past history, family history, current condition, and much more.
  • The text information in Figure 2 is divided into multiple structural texts according to the above classification.
  • For example, the text content of the personal-history category (born and raised in the place of origin, good living environment, etc.) constitutes one class of text information.
  • That content is input into a Transformer model as its input data, and the Transformer model produces the feature vector of the corresponding text information under that category. That is, the content of the personal history is represented by a feature vector produced by a Transformer model, for use in subsequent model judgments.
  • The classified text content input into the Transformer model is not the original text; the corresponding text is first converted into word vectors with an appropriate tool and then input into the Transformer model. The tool can be a model such as BERT used for text vectorization.
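As an illustrative sketch only (the patent does not fix a specific tool), text vectorization with a BERT-style encoder could look as follows; the Hugging Face `transformers` API and the `bert-base-chinese` checkpoint are assumptions, not part of the application:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-chinese")

# One category of classified text, e.g. the personal-history content.
personal_history = "Born and raised in the place of origin; good living environment."
inputs = tokenizer(personal_history, return_tensors="pt", truncation=True)

with torch.no_grad():
    # Each token becomes a word vector; this sequence of vectors, not the raw
    # text, is what is fed into the category-level Transformer model.
    word_vectors = bert(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
```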
  • In some implementations, classifying the text information according to text structure type includes: classifying the text information according to text structure and/or time type.
  • text information can also be divided according to a combination of time and structure.
  • For example, the causes of some diseases are not influenced by past medical history but relate to the patient's recent living habits or other pre-existing symptoms. When all the condition content is mixed together with personal history, past history, or family history, a large number of irrelevant factors will affect the judgment of the mutual-retrieval model. Therefore, when classifying the text information, time factors can be combined in, highlighting the effect of the text content of certain conditions in the model's judgment.
  • In some implementations, classifying the text information in the image-text data in a multi-level manner according to a predetermined method, and passing the classified text information through the first neural network model to generate text features in a cascaded manner according to the classification relationship, further includes:
  • This application also uses a cascade of Transformer models to generate the text features of a category. Specifically, the classified text is divided into sections of content according to punctuation marks or semantics, called clauses in this application; that is, each category is represented by multiple clauses, whose natural-language content is the classified text content.
  • Each clause is used as the input of a Transformer model, which computes the feature vector corresponding to that clause (one clause per Transformer model). The feature vectors of the multiple clauses are then input into a further Transformer model, which fuses them into one feature vector; the result of this computation is the text feature of the classified text content.
  • For example, each sentence of the personal history in Figure 2 (a comma-delimited span can also be regarded as a sentence) is processed by a Transformer model, and the multiple outputs corresponding to the multiple sentences are then input into a total Transformer model, which outputs the text feature of the personal history.
  • Figure 3 shows how the text features of multiple classified texts are cascaded into a total Transformer model, which outputs the text features; for the clause level, simply replace the first-level text information in Figure 3 with the corresponding clauses.
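A minimal PyTorch sketch of this two-level cascade, under the assumption that each clause feature is read from the first output position (the patent also allows taking any position's output or a weighted average, as described below):

```python
import torch
import torch.nn as nn

d = 768  # assumed feature dimension

clause_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
total_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)

def encode_category(clauses):
    """clauses: list of (1, seq_len, d) word-vector tensors, one per clause."""
    # Lower level: one Transformer pass per clause; take one output position
    # as that clause's feature vector.
    clause_vecs = [clause_encoder(c)[:, 0] for c in clauses]   # each (1, d)
    stacked = torch.stack(clause_vecs, dim=1)                  # (1, n_clauses, d)
    # Upper level: the total Transformer fuses the clause vectors into the
    # text feature of this classified text.
    return total_encoder(stacked)[:, 0]                        # (1, d)
```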
  • The sentences are sorted by their order of appearance, and each sorted sentence is input as a parameter to the first neural network model to compute the text features of the structural text information:
  • The word vectors in each sentence are added to their corresponding sequence-number values and to the sentence's number within the text structure classification, and the sums are input into the first neural network model to compute the text features of the structural text information.
  • In Figure 4, a denotes the feature vector of the first clause.
  • The Emb row in the lower part of Figure 4 represents one input datum of the Transformer model.
  • Any piece of input data is combined with the number of the text category it belongs to, shown as the text-type row (second to last) in the lower part of the figure.
  • The values are added, then added to the sorting number of the input clause (the position information in Figure 4), and the resulting value is input into the Transformer model.
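A brief sketch of this input construction (all sizes illustrative): each input embedding is the elementwise sum of the word vector, its position-number embedding, and its text-category-number embedding, mirroring the Emb, position-information, and text-type rows of Figure 4:

```python
import torch
import torch.nn as nn

vocab_size, d, max_pos, n_types = 21128, 768, 512, 16  # illustrative sizes
tok_emb = nn.Embedding(vocab_size, d)   # word vector (the Emb row)
pos_emb = nn.Embedding(max_pos, d)      # sorting/position number of the input
type_emb = nn.Embedding(n_types, d)     # number of the text category

token_ids = torch.tensor([[101, 2769, 102]])               # hypothetical token ids
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # 0, 1, 2, ...
category = torch.full_like(token_ids, 3)                   # e.g. "personal history"

# The values are added, and the sum is what enters the Transformer model.
transformer_input = tok_emb(token_ids) + pos_emb(positions) + type_emb(category)
```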
  • the method further includes:
  • The text features of the structured text information are obtained by weighted averaging of the computation results that the first neural network model outputs for the plurality of sentences.
  • structured text information refers to classified text information.
  • The first way relies on the calculation principle of the Transformer model: for any input datum, the Transformer computes it against the other input data and outputs a result that is a feature vector of that input (different from the original input value), so the output at any input position can serve as the output of the Transformer at this level.
  • Accordingly, for the output result of a certain category of text information, the value computed by the total Transformer model at one of the clauses can be used as the text feature of the classified text.
  • That is, the output value of the Transformer model at one of the clauses is used as the text feature of the entire classified text; or the output values of the Transformer model at multiple clauses are weighted and averaged to obtain the text feature of the corresponding structural text information.
  • In some implementations, classifying the text information in the image-text data in a multi-level manner according to a predetermined method, and passing the classified text information through the first neural network model to generate text features in a cascaded manner according to the classification relationship, further includes:
  • Text features of the multiple structural text information are input into the first neural network model to obtain text features of the text information.
  • Figure 3 shows a schematic diagram of this application's Transformer-model cascade: multiple classified texts are computed by multiple lower-layer Transformer models to obtain the corresponding classified text features, which are then input into the final-level Transformer model to obtain the text features of the overall text.
  • inputting feature vectors of multiple structural text information into the first neural network model to obtain text features of the text information includes:
  • The text features of each structured text information are added to the corresponding sequence value and classification number of that structured text, and the sums are input into the first neural network model to compute the text features of the text information.
  • Structured text refers to classified text classified according to structure. Similar to the cascaded feature computation over clauses within a classified text, when computing the text features of the overall text information, the per-category text features must first be added to their corresponding classification numbers and then to their corresponding sequence numbers; the difference is that in some scenarios the two added values are the same.
  • the method further includes:
  • The text features of the text information are obtained by weighted averaging of the computation results that the first neural network model outputs for the multiple structural text information; or
  • the text features of multiple structural text information are spliced into long vectors, and the spliced long vectors are passed through the fully connected layer to obtain the text features of the text information.
  • A feature of the classified text output by the Transformer model can be selected as the feature of the overall text; that is, the output result of the total Transformer model at one of the categories can be selected as the text feature of the image-text data.
  • Alternatively, the text features of the multiple structural text information can be spliced head to tail, and the spliced long vector passed through a fully connected layer to obtain a feature vector of a new dimension as the text feature of the overall text information.
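A minimal sketch of this splice-and-project variant, assuming eight categories and a 768-dimensional feature (both sizes illustrative):

```python
import torch
import torch.nn as nn

n_categories, d = 8, 768                       # illustrative sizes
fc = nn.Linear(n_categories * d, d)            # the fully connected layer

category_feats = [torch.randn(1, d) for _ in range(n_categories)]  # placeholder features
long_vector = torch.cat(category_feats, dim=-1)   # head-to-tail splice: (1, 8*768)
text_feature = fc(long_vector)                    # overall text feature: (1, 768)
```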
  • using the image information in the graphic data in the form of an image sequence to generate image features through a second neural network model includes:
  • The image-feature weight vector is then applied to the image sequence feature vectors, and the weighted features are combined to obtain the image feature.
  • The image sequence shown in Figure 5 contains only three images for illustration. Specifically, the image sequence is computed through the residual network to obtain the corresponding feature vector of each image.
  • calculating the weight of the image sequence feature vector includes:
  • Figure 7 is a sub-figure of the overall network structure diagram of this application, illustrating the weight calculation structure, which includes two fully connected (FC) layers and one ReLU layer.
  • The images are passed through the backbone network to obtain the embedded features, i.e. the feature vector of each image.
  • The embedded features are passed through a fully connected layer to obtain the final embedded feature e of each image.
  • The final embedded features e are used to compute the weight of each feature through the attention structure.
  • Each weight is a single number, normalized through the sigmoid layer.
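A plausible PyTorch sketch of this attention structure (two FC layers with a ReLU between them, and sigmoid-normalized scalar weights); aggregating the weighted image features by summation is an assumption on our part:

```python
import torch
import torch.nn as nn

d = 768                                       # assumed embedding dimension
attention = nn.Sequential(                    # the two FC layers and ReLU of Figure 7
    nn.Linear(d, d // 4), nn.ReLU(), nn.Linear(d // 4, 1))

e = torch.randn(1, 12, d)              # final embedded features of a 12-image sequence
w = torch.sigmoid(attention(e))        # one normalized scalar weight per image: (1, 12, 1)
image_feature = (w * e).sum(dim=1)     # weighted aggregation into one group feature: (1, d)
```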
  • Iteratively training, based on text features and image features and on a predetermined loss function, to generate an image-text data mutual-retrieval model includes:
  • summing the text loss value and the image loss value to obtain the first loss value, and training the mutual-retrieval model with the first loss value.
  • this application proposes a new generalized pairwise hinge-loss function to evaluate the above model loss.
  • Within a batch, this application traverses the feature encoding of each image group (i.e. the image features above) and each text feature encoding (the text features corresponding to the overall text information) and averages the loss function over them, as shown in the formula below.
  • N denotes the total number of paired samples in the batch.
  • The image group features are traversed (N in total); the sample selected by the traversal is called the anchor sample, denoted a.
  • The distance between the anchor sample and its paired text feature encoding is denoted s_ap, where p stands for positive.
  • The distances to all remaining, unpaired samples are denoted s_np.
  • The margin m is a hyperparameter, fixed during training, and set to 0.4 in this application.
  • This application performs the same traversal operation for the text features: the sample selected in the traversal serves as the anchor, its corresponding positive image group feature sample yields s_ap, and the non-corresponding ones yield s_np.
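The formula itself did not survive extraction; from the definitions above (anchor-to-positive distance $s_{ap}$, hardest-negative distance $s_{np}$, margin $m = 0.4$, averaging over the $N$ pairs in both traversal directions), a plausible reconstruction is:

$$\mathcal{L}_1 \;=\; \frac{1}{N}\sum_{\text{image anchors } a}\max\!\big(0,\; m + s_{ap} - s_{np}\big) \;+\; \frac{1}{N}\sum_{\text{text anchors } a}\max\!\big(0,\; m + s_{ap} - s_{np}\big), \qquad m = 0.4$$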
  • This application uses the above loss function to perform gradient backpropagation during training, updating the parameters of the cascaded Transformer models and the ResNet network.
  • The text loss value refers to the term of the above formula in which the text features are traversed as anchors; there, s_np denotes the minimum Euclidean distance from the anchor text feature to the other text features and/or image features.
  • The image loss value correspondingly refers to the term in which the image group features are traversed as anchors.
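A runnable sketch of this bidirectional hinge loss under the reconstruction above; for brevity it mines the hardest negative only across the opposite modality, which is a simplification of the text in the preceding bullets:

```python
import torch

def hinge_loss(img, txt, m=0.4):
    """img, txt: (N, d) paired features; row i of img pairs with row i of txt."""
    dist = torch.cdist(img, txt)                   # Euclidean distances, (N, N)
    pos = dist.diag()                              # s_ap: distance to the paired sample
    masked = dist + torch.eye(len(img), device=img.device) * 1e9  # exclude positives
    i2t = torch.clamp(m + pos - masked.min(dim=1).values, min=0)  # image anchors
    t2i = torch.clamp(m + pos - masked.min(dim=0).values, min=0)  # text anchors
    return (i2t + t2i).mean()
```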
  • Iteratively training, based on text features and image features and on a predetermined loss function, to generate the image-text data mutual-retrieval model further includes:
  • taking, as the second loss value, the minimum of the sum of the distance between the text transformation features and the text features and the distance between the image transformation features and the image features, and training the mutual-retrieval model with the second loss value.
  • the two features describe information in the same semantic space.
  • F denotes the first conversion method;
  • G denotes the second conversion method;
  • X denotes e_csi, the image group feature;
  • Y denotes e_rec, the medical text feature.
  • $\mathbb{E}\big[\lVert F(G(X)) - X\rVert_2\big]$ means that the image features are transformed into the text feature space through the second conversion method to obtain the text-image features, which are then transformed back into the image feature space through the first conversion method to obtain the image transformation features; the term is the mean of the difference between the image transformation features and the original image features;
  • $\mathbb{E}\big[\lVert G(F(Y)) - Y\rVert_2\big]$ means that the first conversion method transforms the text features into the image feature space to obtain the image-text features, which the second conversion method then transforms back into the text feature space to obtain the text transformation features; the term is the mean of the difference between the text transformation features and the original text features;
  • $\mathcal{L}_c = \mathbb{E}\big[\lVert F(G(X)) - X\rVert_2\big] + \mathbb{E}\big[\lVert G(F(Y)) - Y\rVert_2\big]$ is the sum of the distance between the text transformation features and the text features and the distance between the image transformation features and the image features; its minimum is the minimum of the second loss function.
  • The mutual-retrieval model is iteratively trained using $\mathcal{L}_c$ as the loss function.
  • Iteratively training, based on text features and image features and on a predetermined loss function, to generate the image-text data mutual-retrieval model further includes:
  • computing, through a third conversion method, the loss values corresponding to the text features and to the image features respectively, determining the difference between the loss value corresponding to the text features and the loss value corresponding to the image features, taking the difference as the third loss value, and iteratively training the mutual-retrieval model with the third loss value.
  • The purpose of this application is to make the features of X (image features) and Y (text features) as close as possible, so this application designs a discriminant loss function:
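The formula was lost in extraction; from the term-by-term description in the bullets that follow, a plausible reconstruction is:

$$\mathcal{L}_d \;=\; \mathbb{E}\big[\log D(Y)\big] \;+\; \mathbb{E}\big[\log\big(1 - D(X)\big)\big]$$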
  • X is mapped through the conversion method D to a scalar feature Dx; that is, a given image feature is converted into a scalar Dx through the third conversion method D.
  • Y is likewise mapped to a scalar feature Dy via D. The aim is to make the Dx and Dy features as close as possible, so that it is effectively impossible to tell whether a value is Dx or Dy.
  • log D(Y) refers to the logarithm of the scalar obtained by converting a text feature through the third conversion method D.
  • E[log D(Y)] represents the mean of all such logarithms over a batch of samples;
  • log(1 - D(X)) refers to converting an image feature into a scalar through the third conversion method D and then taking the logarithm of one minus that scalar.
  • E[log(1 - D(X))] represents the mean of these logarithms over all the image data in a batch of samples.
  • L_d represents the loss value of the above discriminant loss function over one batch of samples, i.e. the loss value under one batch of iterative training. If the third conversion method D obtains appropriate parameter values through iterative training, then D(Y) and D(X) should be extremely close and L_d approaches 0 (under highly idealized conditions).
  • At that point, this application considers that the mutual-retrieval model has transformed the text features and image features into the same space, where text features and their corresponding image features have very similar meanings, i.e. are almost the same.
  • Iteratively training, based on text features and image features and on a predetermined loss function, to generate the image-text data mutual-retrieval model further includes:
  • iteratively training the mutual-retrieval model using the sum of the first loss value, the second loss value, and the third loss value as the loss value.
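A minimal sketch of this combined objective, reusing the `hinge_loss` sketch above; `F`, `G`, and `D` are assumed to be small mapping networks (with `D` ending in a sigmoid so its output lies in (0, 1)), not interfaces defined by the application:

```python
import torch

def total_loss(img_feat, txt_feat, F, G, D, m=0.4):
    l1 = hinge_loss(img_feat, txt_feat, m)                 # first loss value
    lc = ((F(G(img_feat)) - img_feat).norm(dim=1).mean()   # second: cycle consistency
          + (G(F(txt_feat)) - txt_feat).norm(dim=1).mean())
    ld = (torch.log(D(txt_feat)).mean()                    # third: discriminant loss
          + torch.log(1.0 - D(img_feat)).mean())
    return l1 + lc + ld                                    # sum, back-propagated together
```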
  • For the training process of this application, refer to Figure 6: a medical image-text retrieval network based on cascaded Transformers is built, comprising a text information feature encoder and a medical image sequence feature encoder (as shown in the figure).
  • the network is trained according to the above loss function to make it converge.
  • The network training process is as follows. The training of the neural network is divided into two stages: the stage in which data propagates from low level to high level, i.e. the forward propagation stage; and, when the results of forward propagation do not meet expectations, the stage in which the error propagates from the high level back to the bottom level, i.e. the backpropagation stage.
  • the training process is:
  • All network layer weights are initialized, generally using random initialization
  • The input image and text data are forward propagated through the neural network's convolutional layers, downsampling layers, fully connected layers, and other layers to obtain the output value;
  • Each layer of the network adjusts all weight coefficients in the network based on the backpropagation error of each layer, that is, updates the weights.
  • In the retrieval stage, the trained network weight coefficients are preloaded, features are extracted from the medical texts or medical image sequences, and the features are stored in the data set to be retrieved. The user supplies any medical text data or medical image sequence data, called the query data. Features of the query data are extracted with the cascaded-Transformer medical image-text retrieval network, and distance matching is performed between the query features and the features of all samples in the data set to be retrieved, i.e. the vector distance is computed; this application uses the Euclidean distance. For example, if the query data is medical text data, the distances to all medical image sequence features in the data set to be retrieved are computed; similarly, if the query data is medical image sequence data, the Euclidean distances to all medical text features in the data set are computed. The sample with the smallest distance is the recommended sample and is output.
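A short sketch of this retrieval step; `torch.cdist` computes the Euclidean distances, and the smallest-distance sample is returned as the recommendation:

```python
import torch

def retrieve(query_feat, gallery_feats, k=1):
    """query_feat: (1, d) feature of the query text or image sequence;
    gallery_feats: (M, d) features of the opposite modality in the data set
    to be retrieved. Returns the indices of the k closest samples."""
    dists = torch.cdist(query_feat, gallery_feats)      # Euclidean distances, (1, M)
    return dists.topk(k, largest=False).indices[0]      # smallest distance first
```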
  • In summary, the text features are computed from the text data in a multi-level, cascaded Transformer manner, and the corresponding image features of the image sequence are computed through the residual network.
  • The image-text data mutual-retrieval model is then trained on the text features and image features using the multiple loss functions proposed by this application. Further, the trained model predicts on input text data or image data, or retrieves the corresponding text data or image data.
  • As shown in Figure 9, another aspect of this application proposes a medical image-text data mutual-retrieval apparatus, including:
  • the preprocessing module 1, configured to classify the text information in the image-text data in a multi-level manner according to a predetermined method, and to pass the classified text information through a first neural network model to generate text features in a cascaded manner according to the classification relationship;
  • the first model calculation module 2, configured to generate image features from the image information in the image-text data, in the form of an image sequence, through a second neural network model;
  • the second model calculation module 3, configured to iteratively train, based on the text features and image features and on a predetermined loss function, to generate an image-text data mutual-retrieval model; and
  • the image-text mutual-retrieval module 4, configured to retrieve, through the image-text data mutual-retrieval model, the corresponding text information and/or image information for the input image-text data.
  • The computer device can be a terminal or a server, and includes:
  • the memory 22, which stores computer-readable instructions 23 that can run on the processor 21;
  • when the computer-readable instructions 23 are executed by the processor 21, the steps of any one of the methods in the above embodiments are implemented.
  • As shown in Figure 11, another aspect of the present application proposes a non-volatile computer-readable storage medium 401.
  • The non-volatile computer-readable storage medium 401 stores computer-readable instructions 402.
  • When the computer-readable instructions 402 are executed by a processor, the steps of any one of the methods in the above embodiments are implemented.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

A medical image-text data mutual-retrieval method, comprising: performing multi-level classification on text information in image-text data according to a predetermined manner, and respectively generating classified text information into a text feature in a cascaded manner by means of a first neural network model according to a classification relationship (S1); generating image information in the image-text data into an image feature in an image sequence manner by means of a second neural network model (S2); according to the text feature and the image feature, performing iterative training on the basis of a predetermined loss function, so as to generate an image-text data mutual-retrieval model (S3); and for text information and/or image information in input image-text data, retrieving corresponding text information and/or image information by means of the image-text data mutual-retrieval model (S4).

Description

Image-text data mutual-retrieval method, apparatus, device, and readable storage medium
Cross-reference to related applications
This application claims priority to the Chinese patent application filed with the China Patent Office on June 30, 2022, with application number 202210760827.7 and titled "Medical image-text data mutual-retrieval method, apparatus, device and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical field
This application belongs to the field of computers, and specifically relates to an image-text data mutual-retrieval method, apparatus, device, and readable storage medium.
Background
With the continuously improving informatization of the medical industry, the volume of medical image data is expanding day by day. The common situation in the industry is that effective management and retrieval methods have long been lacking for this multi-modal medical image data, so retrieval across multiple modalities of data has become an urgent problem to solve.
Existing medical retrieval tasks are mainly oriented to single-modal retrieval. Single-modal retrieval can only query information of the same modality, such as retrieving text with text or images with images. Cross-modal retrieval, by contrast, uses a sample of one modality to retrieve semantically similar samples of another modality, such as retrieving text with an image or images with text.
The cross-domain heterogeneity addressed by this application is mainly reflected in the fact that image and text data lie in different spaces and are heterogeneous data. Correct retrieval therefore requires a retrieval method with cross-domain capability, achieving alignment and ranking between modalities.
The inventors realized that, compared with single-modal data, cross-modal retrieval must model not only the relationships within each modality's data but also the correlations between different modalities, so as to achieve cross-domain retrieval between them. Cross-modal retrieval is highly flexible, has wide application scenarios and strong user demand, and is also an important research topic in cross-modal machine learning, with very important academic value and significance.
For example, with the vigorous development of medical informatization, hospital information systems have become increasingly complete and have collected a rich variety of medical data. Medical data has gradually become another special cross-modal data type after natural data sets. Radiologists generally make diagnoses directly by eye, relying on experience and on the characteristics of cases they have seen before. Owing to large data volumes and limited experience, misdiagnoses and missed diagnoses inevitably occur, posing great hidden risks to the accuracy of patient treatment. If doctors could quickly query similar data in medical databases to assist diagnosis, misdiagnosis would be reduced and work efficiency improved.
Summary of the invention
This application proposes a medical image-text data mutual-retrieval method, including:
classifying the text information in the image-text data in a multi-level manner according to a predetermined method, and passing the classified text information through a first neural network model to generate text features in a cascaded manner according to the classification relationship;
generating image features from the image information in the image-text data, in the form of an image sequence, through a second neural network model;
iteratively training, based on the text features and image features and on a predetermined loss function, to generate an image-text data mutual-retrieval model; and
retrieving, through the image-text data mutual-retrieval model, the text information and/or image information corresponding to the text information and/or image information in input image-text data.
Another aspect of this application further proposes a medical image-text data mutual-retrieval apparatus, including:
a preprocessing module, configured to classify the text information in the image-text data in a multi-level manner according to a predetermined method, and to pass the classified text information through a first neural network model to generate text features in a cascaded manner according to the classification relationship;
a first model calculation module, configured to generate image features from the image information in the image-text data, in the form of an image sequence, through a second neural network model;
a second model calculation module, configured to iteratively train, based on the text features and image features and on a predetermined loss function, to generate an image-text data mutual-retrieval model; and
an image-text mutual-retrieval module, configured to retrieve, through the image-text data mutual-retrieval model, the text information and/or image information corresponding to the text information and/or image information in input image-text data.
Yet another aspect of this application proposes a computer device, including a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to implement the steps of the above medical image-text data mutual-retrieval method.
A further aspect of this application proposes one or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute the steps of the above medical image-text data mutual-retrieval method.
Details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features and advantages of this application will become apparent from the description, the drawings, and the claims.
Description of drawings
To explain the embodiments of this application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a flow chart of an embodiment of a medical image-text retrieval method provided by this application according to one or more embodiments;
Figure 2 is a schematic diagram of medical text data provided by this application according to one or more embodiments;
Figure 3 is a schematic diagram of a partial model structure of a medical image-text retrieval method provided by this application according to one or more embodiments;
Figure 4 is a schematic diagram of a partial model structure of a medical image-text retrieval method provided by this application according to one or more embodiments;
Figure 5 is a schematic diagram of a partial model structure of a medical image-text retrieval method provided by this application according to one or more embodiments;
Figure 6 is a schematic diagram of the model structure of a medical image-text retrieval method provided by this application according to one or more embodiments;
Figure 7 is a schematic diagram of a partial model structure of a medical image-text retrieval method provided by this application according to one or more embodiments;
Figure 8 is a schematic diagram of a partial model structure of a medical image-text retrieval method provided by this application according to one or more embodiments;
Figure 9 is a schematic structural diagram of a medical image-text data mutual-retrieval apparatus provided by this application according to one or more embodiments;
Figure 10 is a schematic structural diagram of a computer device provided by this application according to one or more embodiments;
Figure 11 is a schematic structural diagram of a non-volatile computer-readable storage medium provided by this application according to one or more embodiments.
Detailed description
To make the purpose, technical solutions, and advantages of this application clearer, the embodiments of this application are described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of this application are intended to distinguish two entities or parameters that share a name but are not identical. "First" and "second" are thus merely for convenience of description and should not be understood as limiting the embodiments of this application; subsequent embodiments will not explain this point again.
This application is dedicated to the task of mutual retrieval between medical images and medical text. The task data consists of two parts: medical images and medical text. Medical images include many types of images, such as MRI, CT, and ultrasound images, all of which are sequence images. Medical text includes medical record reports and the like. This is only an example; it does not mean that the method of this application can be applied only in this field.
Traditional solutions mostly use single-modal retrieval methods to analyze a patient's condition. For example, some medical image detection models based on image processing technology can only provide a detection result for the corresponding medical image, i.e. disease present or absent, and it is difficult to analyze the patient comprehensively. A comprehensive analysis requires various patient information, such as medical history, family history, age, and living habits, most of which is recorded as text, so traditional technology lacks intelligent assistance for determining a condition from such varied patient information and cannot provide medical staff with a comprehensive analysis of the condition and its causes.
As shown in Figure 1, to solve the above problems, this application proposes a medical image-text data mutual-retrieval method, described here as applied to a computer device by way of example, including:
Step S1: classify the text information in the image-text data in a multi-level manner according to a predetermined method, and pass the classified text information through a first neural network model to generate text features in a cascaded manner according to the classification relationship;
Step S2: generate image features from the image information in the image-text data, in the form of an image sequence, through a second neural network model;
Step S3: iteratively train, based on the text features and image features and on a predetermined loss function, to generate an image-text data mutual-retrieval model;
Step S4: retrieve, through the image-text data mutual-retrieval model, the text information and/or image information corresponding to the text information and/or image information in the input image-text data.
In the embodiment of this application, in step S1 the image-text data refers to the text data and image data corresponding to a medical image; that is, the image-text data in this application refers to a medical sequence image together with the corresponding disease description, the patient's physical status information, and other information related to the patient's condition. See Figure 2 for details.
Further, the text data in the image-text data, i.e. the description of the condition, is divided into multiple categories as shown in Figure 2, and the classified text is input into the first neural network model category by category. In this embodiment, the first neural network model is a Transformer model; that is, in step S1 corresponding feature vectors are computed for the classified text data by multiple Transformer models, the feature vectors output by those Transformer models are then fed as input into a superior Transformer model, and the output of the superior Transformer model is taken as the text feature.
In step S2, the medical images in the image-text data are processed through the residual network model ResNet to obtain the corresponding image features. An image feature is a vector of a specified size.
It should be noted that in some embodiments of this application there is at least one medical image in the image-text data, and usually there are multiple, since in practice medical imaging such as MRI or CT generally scans the lesion in multiple slices or from multiple angles. Therefore, when there are multiple medical images, the corresponding image feature needs to be generated from all of them.
In step S3, the above text features and image features are matched for similarity, the corresponding similarity loss value is computed according to the preset loss function, and that loss value is back-propagated to the Transformer models and the residual network model for repeated iterative training until the loss value meets the accuracy requirement, whereupon the Transformer models, the residual network model, and the corresponding model parameters of the loss function are saved as the mutual-retrieval model.
In step S4, when the mutual-retrieval model is used for analysis or prediction, the text description of the corresponding case or condition and/or the corresponding medical image is input into the model, which returns a matching examination report based on the input text or image, or filters out the diagnosis content of the corresponding condition via the corresponding medical image. This realizes image-text mutual retrieval for medical images and helps medical workers reduce their workload.
In some embodiments of this application, classifying the text information in the image-text data into multiple levels in a predetermined manner, and feeding the classified text information through the first neural network model in a cascade that follows the classification relationship to generate the text features, includes:
classifying the text information by text structure type, and computing, through the first neural network model, the feature vector of each classified piece of structured text information.
In this embodiment, as shown in Figure 2, Figure 2 shows the description of a patient's condition at a hospital, including the patient's personal information, such as age, marital status and occupation, as well as allergy history, history of present illness, personal history, past history, family history, current condition, and much more. In this embodiment, the text information in Figure 2 is divided into multiple pieces of structured text according to the above categories; for example, the text content under the personal-history category, such as "born and raised in the place of origin, good living environment, ...", is treated as one piece of text information. That content is fed into one Transformer model as its input data, and the Transformer model outputs the feature vector of the corresponding text information under that category. That is, the content of the personal history is represented by the feature vector given by one Transformer model, for use in subsequent model judgments.
Further, the classified text content fed into the Transformer model is not the raw text: the text is first converted into word vectors with a corresponding tool and then fed into the Transformer model. The tool can be a model such as BERT for text vectorization.
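For illustration, this vectorization step might look as follows, assuming the Hugging Face transformers library (the bert-base-chinese checkpoint is an illustrative choice, since the text only says a model such as BERT can be used):

```python
import torch
from transformers import BertTokenizer, BertModel

# Hypothetical checkpoint choice; any BERT-style encoder would do.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def vectorize(clause: str) -> torch.Tensor:
    """Turn one clause into a (seq_len, hidden) matrix of word vectors."""
    inputs = tokenizer(clause, return_tensors="pt")
    with torch.no_grad():
        return bert(**inputs).last_hidden_state.squeeze(0)

word_vectors = vectorize("Born and raised in the place of origin, good living environment.")
print(word_vectors.shape)  # (seq_len, 768)
```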
In some embodiments of this application, classifying the text information by text structure type includes:
classifying the text information by text structure and/or time type.
In some embodiments of this application, the text information can also be classified by a combination of time and structure. For example, the causes of some diseases are not reflected in the past medical history, but are instead related to the patient's recent living habits or other recently emerging symptoms. If all the condition content is lumped together with the personal medical history, past history or family history, a large number of irrelevant factors will affect the judgment of the mutual-retrieval model. Therefore, when classifying the text information, the time factor can be combined into the classification, so that the text content of certain conditions stands out in the model's judgment.
In some embodiments of this application, classifying the text information in the image-text data into multiple levels in a predetermined manner, and feeding the classified text information through the first neural network model in a cascade that follows the classification relationship to generate the text features, further includes:
sorting the text content in the classified structured text information by the order in which the sentences appear, and feeding each sorted sentence as a parameter into the first neural network model to compute the text features of the structured text information.
In this embodiment, for the classified text content, this application likewise generates the text features of that category with a cascade of Transformer models. Specifically, the classified text is split by punctuation or semantics into consecutive segments, called clauses in this application; that is, each category is represented by multiple clauses, and in natural language the content of these clauses is the classified text content.
Further, each clause is fed into one Transformer model, which computes the clause's feature vector; that is, one clause corresponds to one Transformer model. The feature vectors of the multiple clauses are then fed into another Transformer model, which computes over the feature vectors of the multiple clauses, and the result is the text feature of the classified text content. For example, all the content of the personal history in Figure 2 is processed sentence by sentence (a comma-delimited segment can also be treated as a sentence), with each sentence handled by one Transformer model; the outputs of the Transformer models for the multiple sentences are fed into one top-level Transformer model, which then outputs the text feature of the personal history. Refer to the cascade structure shown in Figure 3, which illustrates how the text features of multiple classified texts are cascaded into one top-level Transformer model that outputs the text features; simply substitute the corresponding clauses for the first text information in Figure 3.
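A minimal sketch of this two-level cascade, assuming PyTorch (sharing encoder weights across clauses and taking the first output token as each feature are illustrative simplifications of the one-Transformer-per-clause description; all module names are assumptions):

```python
import torch
import torch.nn as nn

class ClauseCascade(nn.Module):
    """Two-level Transformer cascade: clauses -> category feature."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        lower = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Lower level: one encoder applied per clause (shared weights
        # stand in for the "one Transformer per clause" of the text).
        self.clause_encoder = nn.TransformerEncoder(lower, num_layers=2)
        upper = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Upper level: aggregates the clause vectors of one category.
        self.category_encoder = nn.TransformerEncoder(upper, num_layers=2)

    def forward(self, clauses: list[torch.Tensor]) -> torch.Tensor:
        # Each clause: (seq_len, dim) word vectors; take the first
        # output token as that clause's feature vector.
        clause_vecs = [self.clause_encoder(c.unsqueeze(0))[0, 0] for c in clauses]
        stacked = torch.stack(clause_vecs).unsqueeze(0)  # (1, n_clauses, dim)
        return self.category_encoder(stacked)[0, 0]      # category feature

category_feature = ClauseCascade()([torch.randn(12, 768), torch.randn(7, 768)])
```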
In some embodiments of this application, sorting by the order in which the sentences appear, and feeding each sorted sentence as a parameter into the first neural network model to compute the text features of the structured text information, includes:
adding, to each word in each sentence, its sequence-number value and the sentence number within the text-structure classification, and feeding the sum into the first neural network model to compute the text features of the structured text information.
In this embodiment, as described above, when computing the text features of the classified text information clause by clause, multiple clauses are fed into multiple Transformer models. When any clause is fed into a Transformer model, each word in the clause (represented as a word vector, i.e., numerically) is added to the position number of its clause within the classified text information. For example, if the value of the first word's word vector is 0.3 and its clause is the first clause of the classified text, the computation is 0.3 + 1; further adding the word's own position number, which is also 1, gives 0.3 + 1 + 1 = 2.3, and 2.3 is taken as the first input of the Transformer model. By analogy, if the second word is 0.4, the value fed into the Transformer model by the above procedure is 0.4 + 1 + 2, where 1 marks the first sentence and 2 means the word ranks second in the sentence. The feature vector corresponding to each clause is computed in this way, and the feature vectors of the multiple clauses are finally fed into the top-level Transformer model to obtain the text feature value of that category.
It should be noted that when the feature vectors of multiple clauses are fed into the Transformer model, the computation follows the clause-level procedure. That is, if the feature vector of the first clause is a = [0.1, 0.2, 0.3], it is no longer a single word vector; when it is fed into the Transformer model, since a is the feature vector of the first clause, a + 1 is computed, and the classification number is also added, which is likewise 1 for the first text category, giving a + 1 + 1 = [2.1, 2.2, 2.3], and so on, completing the cascaded Transformer computation for every text category. Referring specifically to Figure 4, each Emb at the bottom of Figure 4 represents one input to the Transformer model; before being fed into the Transformer model, every piece of data is first added to the number of the text category it belongs to (the text-type value in the second-to-last row at the bottom of the figure), then added to the sort number of the input clause (the position information in Figure 4), and only the final value is fed into the Transformer model.
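These index additions can be reproduced numerically as below (a sketch assuming PyTorch; the function names are illustrative):

```python
import torch

def add_position_info(word_vectors: torch.Tensor, clause_index: int) -> torch.Tensor:
    """Word-level input: word vector + clause number + word position,
    reproducing the 0.3 + 1 + 1 = 2.3 example (1-based indices)."""
    positions = torch.arange(1, word_vectors.shape[0] + 1).unsqueeze(1)
    return word_vectors + clause_index + positions

def add_category_info(clause_feature: torch.Tensor,
                      clause_index: int, category_index: int) -> torch.Tensor:
    """Clause-level input: feature vector + clause order + category number,
    reproducing a + 1 + 1 = [2.1, 2.2, 2.3] for a = [0.1, 0.2, 0.3]."""
    return clause_feature + clause_index + category_index

x = torch.tensor([[0.3], [0.4]])             # two one-dimensional "words"
print(add_position_info(x, clause_index=1))  # [[2.3], [3.4]]
print(add_category_info(torch.tensor([0.1, 0.2, 0.3]), 1, 1))  # [2.1, 2.2, 2.3]
```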
In some embodiments of this application, the method further includes:
selecting any one of the computation results output by the first neural network model for the corresponding multiple sentences as the text feature of the structured text information; or
taking a weighted average of the computation results output by the first neural network model for the multiple sentences to obtain the text feature of the structured text information.
In some embodiments of this application, structured text information means the classified text information. There are two ways to select the output of the Transformer model at any level. The first relies on the Transformer model's computation principle: for any one input, the Transformer model computes it against the other inputs and outputs a result for that input, namely that input's feature vector (different from the original input value). The output result for any one input can therefore serve as the output of the Transformer model at that level; that is, for the output of a certain category of text information, the value of one of its clauses after computation by the top-level Transformer model can be taken as the text feature of that classified text.
In other words, the Transformer output value of one of the clauses is taken as the text feature of the whole classified text; or the Transformer output values of multiple clauses are weighted and averaged to obtain the text feature of the corresponding structured text information.
In some embodiments of this application, classifying the text information in the image-text data into multiple levels in a predetermined manner, and feeding the classified text information through the first neural network model in a cascade that follows the classification relationship to generate the text features, further includes:
feeding the text features of multiple pieces of structured text information into the first neural network model to obtain the text features of the text information.
As shown in Figure 3, which is a schematic diagram of the Transformer cascade of this application: in the lower layer, multiple Transformer models compute over the multiple classified texts to obtain the corresponding multiple classified-text features, and the multiple classified text features are then fed into the last-level Transformer model to obtain the text features of the overall text.
In some embodiments of this application, feeding the feature vectors of multiple pieces of structured text information into the first neural network model to obtain the text features of the text information includes:
adding, to the text feature of each piece of structured text information, the order value and the classification number of its corresponding structured text, and feeding the sum into the first neural network model to compute the text features of the text information.
In this embodiment, structured text means the classified text obtained by structure-based classification. Further, similar to the cascaded feature computation over the clauses of a classified text, when computing the text features of the overall text information, the text feature of each category must first be added to its corresponding classification number and then to its corresponding order value; the difference is that in some scenarios the two added values are the same.
In some embodiments of this application, the method further includes:
selecting any one of the computation results output by the first neural network model for the corresponding multiple pieces of structured text information as the text feature of the text information; or
taking a weighted average of the computation results output by the first neural network model for the multiple pieces of structured text information to obtain the text features of the text information; or
concatenating the text features of the multiple pieces of structured text information into a long vector, and passing the concatenated long vector through a fully connected layer to obtain the text features of the text information.
In this embodiment, similar to determining the text features of a classified text as described above, when determining the text features of the overall text, the feature of one classified text output by the Transformer model can be chosen as the feature of the overall text; that is, the output of the top-level Transformer model corresponding to one of the categories can be selected as the text feature of the image-text data.
Alternatively, the text features of the multiple pieces of structured text information (classified text information) can be weighted and averaged to obtain the text features of the overall text information.
In some embodiments of this application, the text features of the multiple pieces of structured text information can be concatenated end to end, and the concatenated text features passed through a fully connected layer to obtain a feature vector of a new dimension as the text feature of the overall text information.
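A minimal sketch of this concatenate-then-project variant, assuming PyTorch (the dimensions and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 6 categories of 768-d features fused into 512-d.
n_categories, dim, out_dim = 6, 768, 512
fuse = nn.Linear(n_categories * dim, out_dim)

def fuse_categories(category_features: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate category features end to end, then project."""
    long_vector = torch.cat(category_features, dim=0)  # (n_categories * dim,)
    return fuse(long_vector)                           # (out_dim,)

text_feature = fuse_categories([torch.randn(dim) for _ in range(n_categories)])
print(text_feature.shape)  # torch.Size([512])
```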
In some embodiments of this application, feeding the image information in the image-text data, as an image sequence, through the second neural network model to generate the image features includes:
feeding the image sequence into the second neural network model and computing the image-sequence feature vectors corresponding to the image sequence;
computing the weights of the image-sequence feature vectors, and multiplying the weights by the image-sequence feature vectors to obtain image-sequence feature-weight vectors; and
adding the image feature-weight vectors to the image-sequence feature vectors to obtain the image features.
In this embodiment, as shown in Figure 5, the image sequence in Figure 5 shows only three images. Specifically, the image sequence is computed by the residual network, and the feature vector corresponding to each image is obtained.
The weight of the feature vector corresponding to each image is then computed; the weights are multiplied by the corresponding feature vectors, the products are added to the feature vectors of the corresponding images, and the multiple image feature-weight vectors are then transformed by a linear transformation to a new dimension as the image features of the image sequence.
In some embodiments of this application, computing the weights of the image-sequence feature vectors includes:
passing the image-sequence feature vectors through a first fully connected layer to obtain first fully-connected-layer vectors;
passing the first fully-connected-layer vectors through a pooling layer to obtain pooling-layer vectors;
passing the pooling-layer vectors through a second fully connected layer to obtain second fully-connected-layer vectors; and
normalizing the second fully-connected-layer vectors to obtain the weights of the corresponding image sequence. In this embodiment, as shown in Figure 7, which is a sub-figure of the overall network-structure diagram of this application, the weight-computation structure contains two fully connected layers (FC) and one ReLU layer. In this application, an image passes through the backbone network to obtain an embedded feature, i.e., the image's feature vector; the embedded feature then passes through a fully connected layer to obtain the final embedded feature e of each image. The final embedded feature e passes through the attention structure to compute the weight of each feature; the weight is a single number and is normalized by a sigmoid layer.
The feature weights of all the image sequences then enter a softmax layer together, to judge which medical image sequence is important. Finally, the feature weight of each image after the softmax layer is multiplied by that image's final embedded feature e. At the same time, we introduce the idea of the residual network: for each medical image sequence, the output of its attention structure is given by the following formula:
$$e^{att} = \mathrm{Attention}(e)\cdot e + e$$
The final image features are passed through a linear fully connected layer fc to obtain the final medical image features:
$$e^{csi} = fc(e^{att})$$
In the above formulas, $e^{att}$ denotes the image-sequence feature vectors corresponding to the multiple images, multiplied by their weights (Attention) and then added to the original image-sequence feature vectors; $e^{csi}$ denotes the finally computed image features: the feature vectors of the image sequence, after weight computation and addition to themselves, are passed through the fully connected layer to obtain feature vectors of a new dimension as the image features of the images. The Attention weights are computed by the attention model shown in Figure 7; fc denotes the fully connected layer.
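As a rough sketch of this attention structure, assuming PyTorch (the hidden sizes, the mean-pooling used to fuse the sequence into one vector, and all module names are illustrative assumptions; the per-image sigmoid normalization is folded into the scoring network):

```python
import torch
import torch.nn as nn

class SequenceAttention(nn.Module):
    """FC -> ReLU -> FC -> sigmoid per image, softmax across the
    sequence, weighted residual, then a final fc (Figure 7 sketch)."""
    def __init__(self, dim: int = 2048, hidden: int = 256, out_dim: int = 512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())
        self.fc = nn.Linear(dim, out_dim)

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (num_images, dim) embedded features of one sequence.
        w = torch.softmax(self.score(e), dim=0)   # (num_images, 1)
        e_att = w * e + e                         # attention + residual
        return self.fc(e_att.mean(dim=0))         # fused image feature e_csi

e_csi = SequenceAttention()(torch.randn(3, 2048))
print(e_csi.shape)  # torch.Size([512])
```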
In some embodiments of this application, iteratively training on the text features and image features with the predetermined loss function to generate the image-text data mutual-retrieval model includes:
for any text feature, computing the Euclidean distance between the text feature and the corresponding image feature and the minimum Euclidean distance between the text feature and the other text features and/or image features, and taking the difference between the Euclidean distance and the minimum Euclidean distance as the text loss value;
for any image feature, computing the Euclidean distance between the image feature and the corresponding text feature and the minimum Euclidean distance between the image feature and the other text features and/or image features, and taking the difference between the Euclidean distance and the minimum Euclidean distance as the image loss value; and
summing the text loss value and the image loss value to obtain a first loss value, and training the mutual-retrieval model with the first loss value.
In this embodiment, this application proposes a new generalized pairwise hinge-loss function to evaluate the loss of the above model. The formula is as follows:
$$L_{trip} = \frac{1}{N}\sum_{a=1}^{N}\Big(\big[\Delta + \lVert e^{csi}_a - e^{rec}_p\rVert_2 - \min_{np}\lVert e^{csi}_a - s_{np}\rVert_2\big]_+ + \big[\Delta + \lVert e^{rec}_a - e^{csi}_p\rVert_2 - \min_{np}\lVert e^{rec}_a - s_{np}\rVert_2\big]_+\Big)$$
where $[\cdot]_+ = \max(0,\cdot)$ and $\lVert\cdot\rVert_2$ denotes the Euclidean distance.
In the design of the loss function, as shown in Figure 8, for such paired data this application traverses every image-group feature encoding (the image features described above) and every text feature encoding (the text features corresponding to the overall text information) and takes the average of the loss function, as shown in the formula above.
Each iterative computation traverses N times, where N is the number of paired samples in the current batch. First the image-group features $e^{csi}_i$ are traversed (N in total); the one selected by the traversal is denoted $e^{csi}_a$, where a stands for anchor (the anchor sample). The text feature encoding paired with the anchor sample is denoted $e^{rec}_p$, where p stands for positive. Likewise, all the remaining samples in the batch that are not paired with $e^{csi}_a$ are denoted $s_{np}$. $\Delta$ is a hyperparameter that is fixed during training and set to 0.4 in this application.
Likewise, this application performs the same traversal operation for the text features: $e^{rec}_a$ denotes the sample selected in the traversal, the positive image-group feature sample corresponding to it is denoted $e^{csi}_p$, and the non-corresponding samples are denoted $s_{np}$. During training, this application uses the above loss function to back-propagate gradients and update the parameters of the cascaded Transformer models and the ResNet network.
In this embodiment, the text loss value refers to the following term of the above formula:
$$\big[\Delta + \lVert e^{rec}_a - e^{csi}_p\rVert_2 - \min_{np}\lVert e^{rec}_a - s_{np}\rVert_2\big]_+$$
where $\lVert e^{rec}_a - e^{csi}_p\rVert_2$ denotes the Euclidean distance between the text feature and the corresponding image feature, and $\min_{np}\lVert e^{rec}_a - s_{np}\rVert_2$ denotes the minimum Euclidean distance between the text feature and the other text features and/or image features.
The image loss value refers to the following term of the above formula:
$$\big[\Delta + \lVert e^{csi}_a - e^{rec}_p\rVert_2 - \min_{np}\lVert e^{csi}_a - s_{np}\rVert_2\big]_+$$
$L_{trip}$ denotes the first loss value (the loss value of the loss function after one round of iterative computation), $\lVert e^{csi}_a - e^{rec}_p\rVert_2$ denotes the Euclidean distance between the image feature and the corresponding text feature, and $\min_{np}\lVert e^{csi}_a - s_{np}\rVert_2$ denotes the minimum Euclidean distance between the image feature and the other text features and/or image features.
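A minimal sketch of this hinge loss, assuming PyTorch (for brevity only cross-modal negatives are mined here, while the text above also allows same-modality negatives among $s_{np}$; all names are illustrative):

```python
import torch

def generalized_pairwise_hinge_loss(img: torch.Tensor,
                                    txt: torch.Tensor,
                                    delta: float = 0.4) -> torch.Tensor:
    """img, txt: (N, dim) paired batch features; row i of img pairs
    with row i of txt."""
    n = img.shape[0]
    dist = torch.cdist(img, txt)              # (N, N) Euclidean distances
    pos = dist.diag()                         # d(anchor, positive)
    off = dist + torch.eye(n) * 1e9           # mask out the positive pairs
    # image anchors vs. hardest non-paired text, and vice versa
    img_loss = torch.clamp(delta + pos - off.min(dim=1).values, min=0)
    txt_loss = torch.clamp(delta + pos - off.min(dim=0).values, min=0)
    return (img_loss + txt_loss).mean()

loss = generalized_pairwise_hinge_loss(torch.randn(8, 512), torch.randn(8, 512))
```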
In some embodiments of this application, iteratively training on the text features and image features with the predetermined loss function to generate the image-text data mutual-retrieval model further includes:
transforming the text features into the image feature space by a first conversion method to obtain image-text features, and then transforming the image-text features into the text feature space by a second conversion method to obtain text-transformed features;
transforming the image features into the text feature space by the second conversion method to obtain text-image features, and then transforming the text-image features into the image feature space by the first conversion method to obtain image-transformed features; and
taking the minimum of the sum of the distance between the text-transformed features and the text features and the distance between the image-transformed features and the image features as a second loss value, and training the mutual-retrieval model with the second loss value. In this embodiment, in order to align the multi-structured text features and the image features, i.e., to make the two features describe information in the same semantic space, this application designs a semantic-alignment adversarial loss function:
$$G: X \rightarrow Y,\qquad F: Y \rightarrow X$$
where F denotes the first conversion method and G denotes the second conversion method.
As shown above, X denotes $e^{csi}$, our image-group feature, and Y denotes $e^{rec}$, our medical-text feature. We want the two features X and Y to be mapped into a common space.
To constrain the above objective, this application performs the following steps:
1. Map the feature X through method G to G(X), which represents the image features mapped into the text feature space.
2. Map G(X) through method F to F(G(X)).
3. This application requires F(G(X)) to be as close as possible to the original feature X.
Likewise:
4. Map the feature Y through method F to F(Y), which represents the text features mapped into the image feature space.
5. Map F(Y) through method G to G(F(Y)).
6. This application requires G(F(Y)) to be as close as possible to the original feature Y.
Therefore, the constraint formula is as follows:
$$L_c = \min\big\{E\big(\lVert F(G(X)) - X\rVert_2\big) + E\big(\lVert G(F(Y)) - Y\rVert_2\big)\big\}$$
where $E(\lVert F(G(X)) - X\rVert_2)$ denotes the mean of the difference between the image-transformed features and the original image features, obtained after the image features are transformed into the text feature space by the second conversion method to give the text-image features, and the text-image features are then transformed back into the image feature space by the first conversion method to give the image-transformed features;
$E(\lVert G(F(Y)) - Y\rVert_2)$ denotes the mean of the difference between the text-transformed features and the original text features, obtained after the first conversion method transforms the text features into the image feature space to give the image-text features, and the second conversion method then transforms the image-text features back into the text feature space to give the text-transformed features; and
$L_c$ denotes the minimum of the sum of the distance between the text-transformed features and the text features and the distance between the image-transformed features and the image features, i.e., the minimum of the second loss function.
In this embodiment, the mutual-retrieval model is iteratively trained with $L_c$ as the loss function.
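A minimal sketch of this cycle-style alignment term, assuming PyTorch (parameterizing the conversion methods F and G as small MLPs of matching dimensions is an illustrative assumption, as the text does not fix their form):

```python
import torch
import torch.nn as nn

dim = 512
F = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))  # text -> image space
G = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))  # image -> text space

def cycle_alignment_loss(x_img: torch.Tensor, y_txt: torch.Tensor) -> torch.Tensor:
    """L_c: mean round-trip distances F(G(X)) ~ X and G(F(Y)) ~ Y,
    minimized during training."""
    img_cycle = torch.norm(F(G(x_img)) - x_img, dim=1).mean()
    txt_cycle = torch.norm(G(F(y_txt)) - y_txt, dim=1).mean()
    return img_cycle + txt_cycle

loss_c = cycle_alignment_loss(torch.randn(8, dim), torch.randn(8, dim))
```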
In some embodiments of this application, iteratively training on the text features and image features with the predetermined loss function to generate the image-text data mutual-retrieval model further includes:
computing, by a third conversion method, the loss values corresponding to the matching text features and image features respectively, determining the gap between the loss value corresponding to the text features and the loss value corresponding to the image features, taking the gap as a third loss value, and iteratively training the mutual-retrieval model with the third loss value.
In this embodiment, the aim of this application is to make the features X (image features) and Y (text features) as close as possible, so this application designs a discrimination loss function:
that is, X is mapped by method D to the feature Dx (a scalar; i.e., an image feature is computed into a scalar Dx by the third conversion method D), and Y is mapped by method D to the feature Dy (a scalar). The aim is to make the features Dx and Dy as close as possible, so close that it cannot even be determined whether a value is Dx or Dy.
The formula is as follows:
$$L_d = E[\log D(Y)] + E[\log(1 - D(X))]$$
where $\log D(Y)$ is the logarithm taken after a text feature is transformed into a scalar by the third conversion method D, and $E[\log D(Y)]$ denotes the mean of all the corresponding logarithms over one batch of samples; $\log(1-D(X))$ is the logarithm taken after an image feature is transformed into a scalar by the third conversion method D, and $E[\log(1-D(X))]$ denotes the mean of the logarithms over all the image data in one batch after transformation by the third conversion method D.
$L_d$ denotes the loss value of the above discrimination loss function over one batch of samples, i.e., the loss of the loss function under one batch of iterative training. If the third conversion method D obtains suitable parameter values through iterative training, D(Y) and D(X) become extremely close and $L_d$ approaches 0 (in the ideal limit). This indicates that the mutual-retrieval model of this application has transformed the text features and image features into the same space, and that a text feature and its corresponding image feature have nearly the same meaning, i.e., are almost identical, so an image feature can stand in for a text feature. The practical significance is that a group of medical images can be matched with the corresponding textual description, or a piece of text information can be matched with corresponding images for display, i.e., image-text mutual retrieval is realized.
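A minimal sketch of this discrimination term, assuming PyTorch (the discriminator architecture D and the epsilon guard against log(0) are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Hypothetical discriminator D: feature -> scalar in (0, 1).
D = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                  nn.Linear(256, 1), nn.Sigmoid())

def discrimination_loss(x_img: torch.Tensor, y_txt: torch.Tensor) -> torch.Tensor:
    """L_d = E[log D(Y)] + E[log(1 - D(X))], computed over one batch."""
    eps = 1e-8  # numerical safety, not part of the formula above
    return (torch.log(D(y_txt) + eps).mean()
            + torch.log(1 - D(x_img) + eps).mean())

loss_d = discrimination_loss(torch.randn(8, 512), torch.randn(8, 512))
```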
In some embodiments of this application, iteratively training on the text features and image features with the predetermined loss function to generate the image-text data mutual-retrieval model further includes:
iteratively training the mutual-retrieval model with the sum of the first loss value, the second loss value and the third loss value as the loss value.
In this embodiment, the above three loss functions can be superimposed to train the mutual-retrieval model proposed in this application, i.e., as described by the formula:
$$L = L_{trip} + L_c + L_d$$
Embodiment:
For the training process of this application, refer to Figure 6: a cascaded-Transformer medical image-text retrieval network is built, comprising a text-information feature encoder and a medical-image-sequence feature encoder (as shown in the figure above).
Establish the generalized pairwise hinge-loss function:
$$L_{trip} = \frac{1}{N}\sum_{a=1}^{N}\Big(\big[\Delta + \lVert e^{csi}_a - e^{rec}_p\rVert_2 - \min_{np}\lVert e^{csi}_a - s_{np}\rVert_2\big]_+ + \big[\Delta + \lVert e^{rec}_a - e^{csi}_p\rVert_2 - \min_{np}\lVert e^{rec}_a - s_{np}\rVert_2\big]_+\Big)$$
The network is trained with the above loss function until it converges.
The network training process is as follows. The training of a neural network is divided into two stages. The first stage is the stage in which data propagates from the lower levels to the higher levels, i.e., the forward-propagation stage. The other stage is the stage in which, when the result of forward propagation does not match expectations, the error is propagated from the higher levels back to the lower levels for training, i.e., the back-propagation stage. The training process is:
1. All network-layer weights are initialized, generally by random initialization;
2. The input image and text data are forward-propagated through the layers of the neural network, such as the convolutional layers, down-sampling layers and fully connected layers, to obtain the output values;
3. The output values of the network are computed, and, per the loss-function formulas above, the sum of the generalized triplet loss function $L_{trip}$ of the network's output values and the semantic-alignment adversarial loss functions $(L_c + L_d)$ is computed;
4. The error is propagated back into the network, and the back-propagation errors of the network layers, such as the Transformer layers, fully connected layers and convolutional layers, are obtained in turn;
5. Each layer of the network adjusts all the weight coefficients in the network according to its back-propagation error, i.e., the weights are updated;
6. A new batch of image-text data is randomly selected, and the process returns to the second step to obtain output values by forward propagation through the network;
7. The iteration repeats indefinitely: when the error between the network's output values and the target values (labels) is smaller than a certain threshold, or the number of iterations exceeds a certain threshold, training ends;
8. The trained network parameters of all layers are saved.
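Putting steps 1-8 together, a skeleton of the training loop might look as follows. This is only a sketch: it assumes the encoder modules, the data loader, and the three loss functions sketched earlier are in scope, and every name here (text_encoder, image_encoder, loader, the thresholds) is an illustrative assumption:

```python
import torch

params = (list(text_encoder.parameters()) + list(image_encoder.parameters())
          + list(F.parameters()) + list(G.parameters()) + list(D.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)  # step 1 done at module init

for step, (images, texts) in enumerate(loader):          # step 6: new batch
    e_csi = image_encoder(images)                        # step 2: forward pass
    e_rec = text_encoder(texts)
    loss = (generalized_pairwise_hinge_loss(e_csi, e_rec)  # step 3: L_trip
            + cycle_alignment_loss(e_csi, e_rec)           # + L_c
            + discrimination_loss(e_csi, e_rec))           # + L_d
    optimizer.zero_grad()
    loss.backward()                                      # step 4: backprop
    optimizer.step()                                     # step 5: update weights
    if loss.item() < 1e-3 or step > 100_000:             # step 7: stop criteria
        break

torch.save({"text": text_encoder.state_dict(),           # step 8: save layers
            "image": image_encoder.state_dict()}, "ckpt.pt")
```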
The network inference process, i.e., the retrieval and matching process, is briefly described below:
During inference, the weight coefficients obtained by network training are preloaded. Features are extracted from the medical texts or medical image sequences and stored in the dataset to be searched. The user supplies arbitrary medical text data or medical-image-sequence data, which we call the query data. The features of the query's medical text data or medical-image-sequence data are extracted with our cascaded-Transformer medical image-text retrieval network. The features of the query data are then distance-matched against the features of all the samples in the dataset to be searched, i.e., vector distances are computed; this application uses the Euclidean distance. For example, if the query data is medical text data, the features of all the medical image sequences in the dataset to be searched are fetched and the distances computed; the same applies when the query data is medical-image-sequence data. The Euclidean distances to all the medical-image-sequence features in the dataset to be searched are computed, and the sample with the smallest distance is the recommended sample, which is output.
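A minimal sketch of this nearest-neighbor retrieval step, assuming PyTorch (the function name and feature sizes are illustrative):

```python
import torch

def retrieve(query_feature: torch.Tensor, gallery: torch.Tensor) -> int:
    """Return the index of the gallery sample (pre-extracted features of
    the other modality) closest to the query in Euclidean distance."""
    distances = torch.norm(gallery - query_feature, dim=1)
    return int(distances.argmin())

# e.g. a text query matched against 1000 stored image-sequence features:
gallery = torch.randn(1000, 512)
best = retrieve(torch.randn(512), gallery)
print(f"recommended sample: {best}")
```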
With the image-text mutual-retrieval method provided by this application, the corresponding text features of the text data are computed by the cascade of multi-level Transformer models, the image features of the image sequence are computed by the residual network, and the image-text data mutual-retrieval model is trained on the text features and image features with the multiple loss functions proposed by this application. The input text data or image data is then used, through the image-text data mutual-retrieval model, to predict or retrieve the corresponding text data or image data.
As shown in Figure 9, another aspect of this application further provides a medical image-text data mutual-retrieval apparatus, comprising:
a preprocessing module 1, configured to classify the text information in the image-text data into multiple levels in a predetermined manner, and to feed the classified text information through the first neural network model in a cascade that follows the classification relationship to generate text features;
a first model computation module 2, configured to feed the image information in the image-text data, as an image sequence, through the second neural network model to generate image features;
a second model computation module 3, configured to iteratively train on the text features and image features with a predetermined loss function to generate the image-text data mutual-retrieval model; and
an image-text mutual-retrieval module 4, configured to retrieve, through the image-text data mutual-retrieval model, the text information and/or image information corresponding to the text information and/or image information in the input image-text data.
As shown in Figure 10, a further aspect of this application provides a computer device, which may be a terminal or a server, comprising:
at least one processor 21; and
a memory 22, the memory 22 storing computer-readable instructions 23 executable on the processor 21, the readable instructions 23, when executed by the processor 21, implementing the steps of any one of the methods in the above embodiments.
As shown in Figure 11, yet another aspect of this application provides a non-volatile computer-readable storage medium 401, the non-volatile computer-readable storage medium 401 storing computer-readable instructions 402 which, when executed by a processor, implement the steps of any one of the methods in the above embodiments.
The above are exemplary embodiments disclosed by this application, but it should be noted that various changes and modifications can be made without departing from the scope of the embodiments disclosed by this application as defined by the claims. The functions, steps and/or actions of the method claims according to the disclosed embodiments described herein need not be performed in any particular order. In addition, although the elements disclosed in the embodiments of this application may be described or claimed in the singular, they may also be understood as plural unless explicitly limited to the singular.
It should be understood that, as used herein, the singular forms "a" and "an" are intended to include the plural forms as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by computer-readable instructions instructing the relevant hardware; the computer-readable instructions can be stored in a non-volatile computer-readable storage medium, and the computer-readable instructions, when executed, can include the processes of the embodiments of the above methods. Any reference to memory, storage, database or other media used in the embodiments provided by this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double-data-rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent application. It should be noted that a person of ordinary skill in the art can make several modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (21)

  1. A medical image-text data mutual-retrieval method, characterized by comprising:
    classifying text information in image-text data into multiple levels in a predetermined manner, and feeding the classified text information through a first neural network model in a cascade that follows the classification relationship to generate text features;
    feeding image information in the image-text data, as an image sequence, through a second neural network model to generate image features;
    iteratively training on the text features and the image features with a predetermined loss function to generate an image-text data mutual-retrieval model; and
    retrieving, through the image-text data mutual-retrieval model, text information and/or image information corresponding to the text information and/or image information in input image-text data.
  2. The method according to claim 1, wherein classifying the text information in the image-text data into multiple levels in a predetermined manner, and feeding the classified text information through the first neural network model in a cascade that follows the classification relationship to generate the text features, comprises:
    classifying the text information by text structure type, and computing, through the first neural network model, a feature vector of each classified piece of structured text information.
  3. The method according to claim 2, wherein classifying the text information by text structure type comprises:
    classifying the text information by text structure and/or time type.
  4. The method according to claim 2, wherein classifying the text information in the image-text data into multiple levels in a predetermined manner, and feeding the classified text information through the first neural network model in a cascade that follows the classification relationship to generate the text features, comprises:
    sorting the text content in the classified structured text information by the order in which the sentences appear, and feeding each sorted sentence as a parameter into the first neural network model to compute the text features of the structured text information.
  5. The method according to claim 4, wherein sorting by the order in which the sentences appear, and feeding each sorted sentence as a parameter into the first neural network model to compute the text features of the structured text information, comprises:
    adding, to each word in each sentence, its sequence-number value and the sentence number within the text-structure classification, and feeding the sum into the first neural network model to compute the text features of the structured text information.
  6. The method according to claim 4, wherein the method further comprises:
    selecting any one of the computation results output by the first neural network model for the corresponding multiple sentences as the text feature of the structured text information.
  7. The method according to claim 4, wherein the method further comprises:
    taking a weighted average of the computation results output by the first neural network model for the multiple sentences to obtain the text feature of the structured text information.
  8. The method according to claim 4, wherein classifying the text information in the image-text data into multiple levels in a predetermined manner, and feeding the classified text information through the first neural network model in a cascade that follows the classification relationship to generate the text features, comprises:
    feeding the text features of multiple pieces of the structured text information into the first neural network model to obtain the text features of the text information.
  9. The method according to claim 8, wherein feeding the feature vectors of the multiple pieces of structured text information into the first neural network model to obtain the text features of the text information comprises:
    adding, to the text feature of each piece of structured text information, the order value and the classification number of its corresponding structured text, and feeding the sum into the first neural network model to compute the text features of the text information.
  10. The method according to claim 8, wherein the method further comprises:
    selecting any one of the computation results output by the first neural network model for the corresponding multiple pieces of structured text information as the text feature of the text information.
  11. The method according to claim 8, wherein the method further comprises:
    taking a weighted average of the computation results output by the first neural network model for the multiple pieces of structured text information to obtain the text feature of the text information.
  12. The method according to claim 8, wherein the method further comprises:
    concatenating the text features of the multiple pieces of structured text information into a long vector, and passing the concatenated long vector through a fully connected layer to obtain the text features of the text information.
  13. The method according to claim 1, wherein feeding the image information in the image-text data, as an image sequence, through the second neural network model to generate the image features comprises:
    feeding the image sequence into the second neural network model and computing the image-sequence feature vectors corresponding to the image sequence;
    computing weights of the image-sequence feature vectors, and multiplying the weights by the image-sequence feature vectors to obtain image-sequence feature-weight vectors; and
    adding the image feature-weight vectors to the image-sequence feature vectors to obtain the image features.
  14. The method according to claim 13, wherein computing the weights of the image-sequence feature vectors comprises:
    passing the image-sequence feature vectors through a first fully connected layer to obtain first fully-connected-layer vectors;
    passing the first fully-connected-layer vectors through a pooling layer to obtain pooling-layer vectors;
    passing the pooling-layer vectors through a second fully connected layer to obtain second fully-connected-layer vectors; and
    normalizing the second fully-connected-layer vectors to obtain the weights of the corresponding image sequence.
  15. The method according to claim 1, wherein iteratively training, according to the text features and the image features and based on a predetermined loss function, to generate the image-text data mutual-retrieval model comprises:
    for any text feature, computing the Euclidean distance between the text feature and its corresponding image feature, and the minimum Euclidean distance between the text feature and the other text features and/or image features, and taking the difference between the two as a text loss value;
    for any image feature, computing the Euclidean distance between the image feature and its corresponding text feature, and the minimum Euclidean distance between the image feature and the other text features and/or image features, and taking the difference between the two as an image loss value; and
    summing the text loss value and the image loss value to obtain a first loss value, and training the mutual-retrieval model with the first loss value.
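Claim 15 describes a hardest-negative contrastive objective: for each anchor, the distance to its matched counterpart minus the smallest distance to any non-matching feature. A sketch assuming PyTorch and a batch in which row i of the text and image tensors forms a matched pair; that batch convention is an assumption.

```python
import torch

def first_loss(text_feats, image_feats):
    B = text_feats.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=text_feats.device)
    big = torch.finfo(text_feats.dtype).max

    d_ti = torch.cdist(text_feats, image_feats)   # text-to-image Euclidean distances
    d_tt = torch.cdist(text_feats, text_feats)    # text-to-text distances
    d_ii = torch.cdist(image_feats, image_feats)  # image-to-image distances

    pos = d_ti.diagonal()  # distance to the corresponding feature

    # Minimum distance to any non-matching text and/or image feature.
    neg_t = torch.cat([d_ti.masked_fill(eye, big),
                       d_tt.masked_fill(eye, big)], dim=1).min(dim=1).values
    neg_i = torch.cat([d_ti.t().masked_fill(eye, big),
                       d_ii.masked_fill(eye, big)], dim=1).min(dim=1).values

    text_loss = (pos - neg_t).sum()   # claim 15 uses the raw difference;
    image_loss = (pos - neg_i).sum()  # a hinge (clamp at 0) is a common variant
    return text_loss + image_loss     # first loss value
```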
  16. The method according to claim 1, wherein iteratively training, according to the text features and the image features and based on a predetermined loss function, to generate the image-text data mutual-retrieval model comprises:
    transforming the text feature into the image feature space through a first transformation method to obtain an image-text feature, and then transforming the image-text feature into the text feature space through a second transformation method to obtain a text transformation feature;
    transforming the image feature into the text feature space through the second transformation method to obtain a text-image feature, and then transforming the text-image feature into the image feature space through the first transformation method to obtain an image transformation feature; and
    taking the minimum of the sum of the distance between the text transformation feature and the text feature and the distance between the image transformation feature and the image feature as a second loss value, and training the mutual-retrieval model with the second loss value.
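Claim 16 is a cycle-consistency constraint: mapping a feature into the other modality's space and back should approximately recover it. A sketch assuming PyTorch, with two learned linear maps standing in for the first and second transformation methods; the linear form and the dimensions are assumptions.

```python
import torch.nn as nn

class CycleLoss(nn.Module):
    def __init__(self, text_dim=256, image_dim=512):
        super().__init__()
        self.to_image = nn.Linear(text_dim, image_dim)  # first transformation method
        self.to_text = nn.Linear(image_dim, text_dim)   # second transformation method

    def forward(self, text_feats, image_feats):
        # text -> image space -> back to text space (text transformation feature)
        text_cycled = self.to_text(self.to_image(text_feats))
        # image -> text space -> back to image space (image transformation feature)
        image_cycled = self.to_image(self.to_text(image_feats))
        # Minimizing this sum during training realizes the "minimum" of claim 16.
        return ((text_cycled - text_feats).norm(dim=1).sum()
                + (image_cycled - image_feats).norm(dim=1).sum())
```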
  17. The method according to claim 1, wherein iteratively training, according to the text features and the image features and based on a predetermined loss function, to generate the image-text data mutual-retrieval model comprises:
    computing, through a third transformation method, the loss values corresponding to matched text features and image features respectively, determining the gap between the loss value corresponding to the text feature and the loss value corresponding to the image feature, taking the gap as a third loss value, and iteratively training the mutual-retrieval model with the third loss value.
  18. The method according to any one of claims 15-17, wherein iteratively training, according to the text features and the image features and based on a predetermined loss function, to generate the image-text data mutual-retrieval model comprises:
    iteratively training the mutual-retrieval model with the sum of the first loss value, the second loss value and the third loss value as the loss value.
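Claim 17 leaves the third transformation method open; one plausible reading is a shared learned map between the two feature spaces whose per-modality reconstruction losses should agree, the gap between them being the third loss value. Claim 18 then trains on the sum of all three losses. A sketch under those assumptions, including the assumption that both feature spaces share one dimension.

```python
import torch.nn as nn

def third_loss(text_feats, image_feats, transform: nn.Module):
    # Per-modality loss values from a shared transformation (an assumption;
    # the claims do not pin down the third transformation method).
    text_loss = (transform(text_feats) - image_feats).norm(dim=1).mean()
    image_loss = (transform(image_feats) - text_feats).norm(dim=1).mean()
    return (text_loss - image_loss).abs()  # the gap is the third loss value

def total_loss(l1, l2, l3):
    # Claim 18: iterate training on the sum of the three loss values.
    return l1 + l2 + l3
```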
  19. A medical image-text data mutual-retrieval apparatus, comprising:
    a preprocessing module configured to classify the text information in the image-text data into multiple levels in a predetermined manner, and to generate text features from the classified text information through a first neural network model in a cascaded manner according to the classification relationship;
    a first model computation module configured to generate, through a second neural network model, image features from the image information in the image-text data in the form of an image sequence;
    a second model computation module configured to iteratively train, according to the text features and the image features and based on a predetermined loss function, to generate an image-text data mutual-retrieval model; and
    an image-text mutual-retrieval module configured to retrieve, through the image-text data mutual-retrieval model, the text information and/or image information corresponding to the text information and/or image information in the input image-text data.
  20. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of the method according to any one of claims 1-18.
  21. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the method according to any one of claims 1-18.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210760827.7 2022-06-30
CN202210760827.7A CN115408551A (en) 2022-06-30 2022-06-30 Medical image-text data mutual detection method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
WO2024001104A1 2024-01-04

Family

ID=84158085

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141374 WO2024001104A1 (en) 2022-06-30 2022-12-23 Image-text data mutual-retrieval method and apparatus, and device and readable storage medium

Country Status (2)

Country Link
CN (1) CN115408551A (en)
WO (1) WO2024001104A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115408551A (en) * 2022-06-30 2022-11-29 苏州浪潮智能科技有限公司 Medical image-text data mutual detection method, device, equipment and readable storage medium
CN117407518B (en) * 2023-12-15 2024-04-02 广州市省信软件有限公司 Information screening display method and system based on big data analysis

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239153A (en) * 2021-05-26 2021-08-10 清华大学深圳国际研究生院 Text and image mutual retrieval method based on example masking
US20210271707A1 (en) * 2020-02-27 2021-09-02 Adobe Inc. Joint Visual-Semantic Embedding and Grounding via Multi-Task Training for Image Searching
CN114357148A (en) * 2021-12-27 2022-04-15 之江实验室 Image text retrieval method based on multi-level network
CN114612749A (en) * 2022-04-20 2022-06-10 北京百度网讯科技有限公司 Neural network model training method and device, electronic device and medium
CN114661933A (en) * 2022-03-08 2022-06-24 重庆邮电大学 Cross-modal retrieval method based on fetal congenital heart disease ultrasonic image-diagnosis report
CN115408551A (en) * 2022-06-30 2022-11-29 苏州浪潮智能科技有限公司 Medical image-text data mutual detection method, device, equipment and readable storage medium


Also Published As

Publication number Publication date
CN115408551A (en) 2022-11-29

Similar Documents

Publication Publication Date Title
WO2022227207A1 (en) Text classification method, apparatus, computer device, and storage medium
US11580415B2 (en) Hierarchical multi-task term embedding learning for synonym prediction
CN107516110B (en) Medical question-answer semantic clustering method based on integrated convolutional coding
WO2020177230A1 (en) Medical data classification method and apparatus based on machine learning, and computer device and storage medium
US20210034813A1 (en) Neural network model with evidence extraction
WO2024001104A1 (en) Image-text data mutual-retrieval method and apparatus, and device and readable storage medium
CN112015868B (en) Question-answering method based on knowledge graph completion
CN110674850A (en) Image description generation method based on attention mechanism
CN112016295B (en) Symptom data processing method, symptom data processing device, computer equipment and storage medium
WO2020198855A1 (en) Method and system for mapping text phrases to a taxonomy
CN112149414B (en) Text similarity determination method, device, equipment and storage medium
WO2023029506A1 (en) Illness state analysis method and apparatus, electronic device, and storage medium
US11354599B1 (en) Methods and systems for generating a data structure using graphical models
WO2022227203A1 (en) Triage method, apparatus and device based on dialogue representation, and storage medium
US11625935B2 (en) Systems and methods for classification of scholastic works
WO2023160264A1 (en) Medical data processing method and apparatus, and storage medium
US20230244869A1 (en) Systems and methods for classification of textual works
US20220375576A1 (en) Apparatus and method for diagnosing a medical condition from a medical image
CN116956228A (en) Text mining method for technical transaction platform
US11783244B2 (en) Methods and systems for holistic medical student and medical residency matching
Wang et al. A BERT-based named entity recognition in Chinese electronic medical record
Gao et al. Accuracy analysis of triage recommendation based on CNN, RNN and RCNN models
CN116719840A (en) Medical information pushing method based on post-medical-record structured processing
US20220165430A1 (en) Leveraging deep contextual representation, medical concept representation and term-occurrence statistics in precision medicine to rank clinical studies relevant to a patient
Wang et al. TransH-RA: A Learning Model of Knowledge Representation by Hyperplane Projection and Relational Attributes

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22949180

Country of ref document: EP

Kind code of ref document: A1