CN113270199A - Medical cross-modal multi-scale fusion class guidance hash method and system thereof - Google Patents
- Publication number
- CN113270199A (application CN202110483387.0A)
- Authority
- CN
- China
- Prior art keywords
- hash
- network
- class
- text
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/55—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5866—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H30/00—ICT specially adapted for the handling or processing of medical images
Abstract
The invention discloses a medical cross-modal multi-scale fusion class-guided hashing method and system. Extensive experiments on the medical dataset MIMIC-CXR show that this approach outperforms existing baselines on the cross-modal retrieval task.
Description
Technical Field
The invention belongs to the field of cross-modal retrieval, and particularly relates to a medical cross-modal multi-scale fusion class guidance hash method and system.
Background
With the rapid development of medical technology, a large amount of medical data is generated, such as radiology reports, CT images, PET images, and X-ray images. Although these data differ in form, they carry similar semantics. Recently, many single-modality methods have been proposed to understand such data separately, such as medical image segmentation, medical image classification, and content-based medical image retrieval. Although much work has focused on clinical imaging, other forms of medical data, such as radiology reports, have been overlooked. To enable physicians to obtain comprehensive information about a query, retrieve semantically similar clinical profiles across modalities, and provide diagnostic results informed by previous medical recommendations, medical cross-modal retrieval is proposed, i.e., using an instance of one modality (e.g., an X-ray image) to retrieve an instance of another modality (e.g., a radiology report) with similar semantics.
Hashing is applied to cross-modal retrieval due to its high retrieval speed and low storage cost. Existing cross-modal hashing methods are generally divided into three categories: unsupervised, semi-supervised, and supervised methods. Although some labels may be corrupted or inaccurate, label information is generally useful for learning more discriminative features. Therefore, supervised cross-modal hashing methods usually achieve better retrieval performance.
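The speed and storage advantages come from representing every instance as a short binary code: storage is k bits per item, and similarity search reduces to XOR plus a bit count. A minimal NumPy sketch with hand-made toy codes (not the patent's learned ones):

```python
import numpy as np

def to_packed(codes):
    """Pack {-1, +1} hash codes into uint8 bitmaps for compact storage."""
    bits = (codes > 0).astype(np.uint8)      # map -1 -> 0, +1 -> 1
    return np.packbits(bits, axis=-1)

def hamming_distances(query, database):
    """Hamming distances between one packed query and a packed database,
    computed with XOR and a bit count."""
    xor = np.bitwise_xor(database, query)    # differing bits
    return np.unpackbits(xor, axis=-1).sum(axis=-1)

# toy database of four 8-bit codes
db = np.array([[ 1,  1,  1,  1, -1, -1, -1, -1],
               [ 1, -1,  1, -1,  1, -1,  1, -1],
               [-1, -1, -1, -1,  1,  1,  1,  1],
               [ 1,  1, -1, -1,  1,  1, -1, -1]])
q = db[2].copy()                             # query identical to entry 2
d = hamming_distances(to_packed(q), to_packed(db))
best = int(np.argmin(d))                     # nearest neighbour in Hamming space
```

Packed codes occupy k/8 bytes per item, and the XOR/popcount distance is what makes large-scale cross-modal retrieval cheap compared with real-valued similarity search.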
With the remarkable progress of deep learning, deep neural networks have shown strong potential in cross-modal retrieval. For example, Jiang et al. propose deep cross-modal hashing (DCMH), an end-to-end framework that learns deep features and hash functions simultaneously. Deep visual-semantic hashing (DVSH) uses a convolutional neural network (CNN) and long short-term memory (LSTM) to learn the hash code of each modality. Li et al. propose self-supervised adversarial hashing (SSAH), which designs a self-supervised semantic network incorporating adversarial learning to explore the semantic relationships between modalities. Compared with hand-crafted-feature cross-modal retrieval methods, deep cross-modal retrieval performance is greatly improved.
However, the cross-modal retrieval methods described above all rely on a semantic similarity matrix to supervise hash code generation. Specifically, two data points are defined as similar if their labels share at least one common category, and dissimilar otherwise. This definition clearly discards rich semantic information and cannot preserve semantic structure well. Meanwhile, these methods embed different modal data sharing the same semantics into a uniform hash code, and erroneous codes are inevitably generated due to inherent modality differences and noise.
In view of this situation, a medical cross-modal multi-scale fusion class-guided hashing (MCMFCH) method and system are provided.
Disclosure of Invention
Technical problem to be solved
The invention aims to provide a medical cross-modal multi-scale fusion class-guided hashing method and system. In addition, the joint network is used to guide the learning of the image and text hash codes, so that the modality semantics are associated with one another, which helps improve the semantic correlation between modalities.
(II) technical scheme
In order to achieve the above purpose, the invention adopts the following technical scheme:
a medical cross-modal multi-scale fusion class guidance hash method comprises the following specific steps:
s1, inputting category semantics, and establishing a category hash network for learning hash codes of various categories;
s2, inputting data of different modalities, establishing an image network and a text network to obtain the features and hash codes of each modality, and combining the image and text features to generate a joint hash code;
s3, representing labels by class hash codes as supervision information to train hash codes of images, texts and joint networks;
s4, using the joint network to guide the learning of the hash codes of the images and texts.
Further, the model of the class hash network in S1 is:
s.t. p_i = sgn(H^(c)) = sgn(f_c(c_i; θ_c))
where α is a hyperparameter; 1 is a vector with all elements equal to 1; sgn(·) is the sign function; and p_i denotes the hash code learned for category c_i. The class hash codes are finally obtained.
Further, in S2, establishing an image hash network and a text hash network to obtain features and hash codes of each modality, and generating a joint hash code by the joint hash network specifically includes the following steps:
s2.1, image hash network: to obtain high-resolution, high-semantic medical image features, a deep convolutional network (VGG) is combined with a feature pyramid network (FPN) to extract multi-scale image features; this combination is called the VFPN multi-scale network. The network fuses high-resolution, weak-semantic features with low-resolution, strong-semantic features to obtain high-resolution, strong-semantic features f_x(x; θ_x). In addition, three fully connected layers are added as the hash function to convert the features f_x(x; θ_x) into the binary code H^(x) = f_x(x; θ_x) ∈ {-1, 1}^k, where the first two fully connected layers are the same as the last two layers of VGG and the third has k hidden units with the tanh(·) function as its activation. Finally, the hash code of the image modality is obtained through B_x = sgn(H_x) ∈ {-1, 1}^k, where k is the length of the hash code;
s2.2, text hash network: a text-network multi-scale fusion model based on self-supervised adversarial hashing for cross-modal retrieval (SSAH) is adopted. First, five average pooling layers of sizes 1×1, 1×2, 1×3, 1×6, and 1×10 extract features from the text data at multiple scales, and one 1×1 convolutional layer then fuses these features. Next, the multi-scale text semantic features f_y(y; θ_y) are obtained through resizing and concatenation. The fused features are fed into a three-layer feedforward neural network serving as the hash function, which converts the features f_y(y; θ_y) into the binary code H^(y) = f_y(y; θ_y) ∈ {-1, 1}^k. Finally, the hash code of the text modality is obtained through B_y = sgn(H_y) ∈ {-1, 1}^k;
s2.3, joint hash network: this network takes as input the concatenation f_u(u; θ_u) of the multi-scale image features f_x(x; θ_x) generated by the VFPN multi-scale network and the multi-scale fusion features f_y(y; θ_y) of the text. The concatenated features f_u(u; θ_u) are fed into a three-layer feedforward neural network serving as the hash function, which converts them into the binary code H^(u) = f_u(u; θ_u) ∈ {-1, 1}^k. Finally, the hash code of the joint network is obtained through B_u = sgn(H_u) ∈ {-1, 1}^k;
Further, in step S3, supervising the learning of each modality's hash codes with the class hash codes comprises the following steps:
s3.1, cross-modal similarity and rich semantic structure information are preserved through the Hamming distance: the Hamming distance between the hash code H_*i and the class hash codes of the classes to which data point i belongs should be smaller than its Hamming distance to the class hash codes of the classes it does not belong to, which is modeled as:

where * ∈ {x, y, u} denotes the image, text, and joint modalities; μ ∈ [0, 1] is a predefined margin and k is the hash code length; E_i is the index set of the classes to which data point i belongs, i.e., the indices of the elements equal to "1" in the label vector l_i; Q_i = {1, ..., c} − E_i is the index set of the classes to which data point i does not belong, i.e., the indices of the elements equal to "0" in l_i; and dist(H_*i, p_e) is the Hamming distance between H_*i and p_e. H_*i should be similar to the average of the class hash codes of its own classes; moreover, if the class hash codes {p_e | e ∈ E_i} corresponding to H_*i are more similar to it than {p_q | q ∈ Q_i}, then H_*i preserves both semantic similarity and semantic structure information;
S3.2, the hash code H_* of each modality can be generated under the supervision of the class hash codes P, with the loss of each modality given by:

where λ is a hyperparameter; * ∈ {x, y, u} denotes the image, text, and joint modalities; the average is taken over the class hash codes similar to H_*i; and p_q is a class hash code dissimilar to H_*i.
Further, in S4, a joint network is used to guide the learning of hash codes of images and texts, and the specific model is as follows:
A retrieval model based on a medical cross-modal multi-scale fusion class guidance hash method is generated by adopting the medical cross-modal multi-scale fusion class guidance hash method, and the retrieval model is as follows:
where γ and η are hyperparameters; * ∈ {x, y, u} denotes the image, text, and joint modalities; the average is taken over the class hash codes similar to H_*i; p_q is a class hash code dissimilar to H_*i; and B_u, B_x, and B_y are the hash codes of the joint network, the image, and the text, respectively.
A retrieval system based on the medical cross-modal multi-scale fusion class-guided hashing method comprises:
the first input module, used for inputting category semantics;
the first feature processing module, used for establishing the class hash network to learn the hash codes of the categories;
the second input module, used for inputting data of different modalities;
the second feature processing module, used for establishing the image network and the text network to obtain the features and hash codes of each modality, and combining the image and text features to generate the joint hash code;
the learning and training module, used for training the hash codes of the image, text, and joint networks with the class hash codes representing the labels as supervision information, while the joint network guides the learning of the image and text hash codes and performs retrieval;
and the output module, used for outputting the retrieval result.
(III) advantageous effects
Compared with the prior art, the method obtains a modality-specific representation of each modality through multi-scale fusion and uses class hashing to guide the learning of each modality's hash codes. Experiments on two datasets show that the method achieves better retrieval performance.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of an algorithm architecture proposed by the method of the present invention;
FIG. 3 shows the top-10 retrieval results of CCA, DCMH, and the method of the present invention on the MIMIC-CXR dataset;
fig. 4 is a schematic structural diagram of a cross-modal retrieval system according to an embodiment of the present invention.
Detailed Description
As shown in FIG. 1, the invention provides a medical cross-modal multi-scale fusion class guidance hashing method, and a corresponding system is designed according to the method.
The medical cross-modal multi-scale fusion category guidance hashing method comprises the following specific steps:
s1, inputting category semantics, and establishing a category hash network for learning hash codes of various categories;
s2, inputting data of different modalities, establishing an image network and a text network to obtain the features and hash codes of each modality, and combining the image and text features to generate a joint hash code;
s3, representing labels by class hash codes as supervision information to train hash codes of images, texts and joint networks;
s4, using the joint network to guide the learning of the hash codes of the images and texts.
The class hash network is used to generate the hash codes of the classes, so that the learned class hash codes can represent the labels, and the model of the class hash network in S1, that is, the objective function, is as follows:
s.t. p_i = sgn(H^(c)) = sgn(f_c(c_i; θ_c))
where α is a hyperparameter; 1 is a vector with all elements equal to 1; sgn(·) is the sign function; and p_i denotes the hash code learned for category c_i. The class hash codes are finally obtained.
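The constraint p_i = sgn(f_c(c_i; θ_c)) can be sketched by passing one-hot category vectors through a small randomly initialized network. The two-layer architecture and layer sizes below are illustrative assumptions; the patent does not fix f_c beyond the sgn constraint, and a trained network would replace these random weights:

```python
import numpy as np

rng = np.random.default_rng(42)
num_classes, k = 5, 16                 # 5 categories, 16-bit class hash codes

# hypothetical two-layer class hash network f_c(.; theta_c)
C = np.eye(num_classes)                # one-hot category inputs c_i
W1 = rng.standard_normal((num_classes, 32))
W2 = rng.standard_normal((32, k))

H_c = np.tanh(C @ W1) @ W2             # continuous outputs H^(c)
P = np.sign(H_c)                       # p_i = sgn(f_c(c_i; theta_c)) in {-1, +1}^k
```

Each row of P is then the binary "class hash code" that stands in for a label during supervision.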
In S2, the image hash network, the text hash network, and the joint hash network learn features and hash codes of different modalities, and the specific implementation process is as follows:
s2.1, image hash network: first, a deep convolutional network (VGG) is combined with a feature pyramid network (FPN) to extract multi-scale image features; this combination is called the VFPN multi-scale network. The network fuses high-resolution, weak-semantic features with low-resolution, strong-semantic features to obtain high-resolution, strong-semantic features f_x(x; θ_x). Furthermore, three fully connected layers are added: the first two are the same as the last two layers of VGG, and the third has k hidden units with the tanh(·) function as its activation. These three layers serve as the hash function, converting the features f_x(x; θ_x) into the binary code H^(x) = f_x(x; θ_x) ∈ {-1, 1}^k. Then, the hash code of the image modality is obtained through B_x = sgn(H_x) ∈ {-1, 1}^k, where k is the length of the hash code.
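The fusion step of the VFPN network can be illustrated with a NumPy stand-in: a 1×1 lateral projection of the high-resolution, weak-semantic map plus an upsampled low-resolution, strong-semantic map, as in a standard FPN top-down pass. The shapes, random weights, and nearest-neighbour upsampling are illustrative assumptions, not the patent's trained network:

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def fpn_merge(fine, coarse, proj):
    """FPN-style top-down step: project the fine (high-resolution,
    weak-semantic) map to the common channel width, then add the
    upsampled coarse (low-resolution, strong-semantic) map."""
    lateral = np.einsum('dc,chw->dhw', proj, fine)   # 1x1 conv as channel mixing
    return lateral + upsample2x(coarse)

rng = np.random.default_rng(0)
fine = rng.standard_normal((64, 28, 28))     # high resolution, weak semantics
coarse = rng.standard_normal((256, 14, 14))  # low resolution, strong semantics
proj = rng.standard_normal((256, 64))        # hypothetical 1x1 lateral projection
merged = fpn_merge(fine, coarse, proj)       # high resolution AND strong semantics
```

The merged map keeps the fine level's spatial resolution while inheriting the coarse level's semantics, which is exactly the property the text claims for f_x(x; θ_x).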
S2.2, text hash network: a text-network multi-scale fusion model based on self-supervised adversarial hashing for cross-modal retrieval (SSAH) is adopted. The multi-scale fusion model uses five average pooling layers of sizes 1×1, 1×2, 1×3, 1×6, and 1×10 to extract features from the text data at multiple scales, and one 1×1 convolutional layer to fuse them. The multi-scale text semantic features f_y(y; θ_y) are then obtained through resizing and concatenation. The fused features are fed into a three-layer feedforward neural network serving as the hash function, which converts the features f_y(y; θ_y) into the binary code H^(y) = f_y(y; θ_y) ∈ {-1, 1}^k. Then, the hash code of the text modality is obtained through B_y = sgn(H_y) ∈ {-1, 1}^k.
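The multi-scale text pooling can be sketched as follows. A plain average stands in for the learned 1×1 fusion convolution, and a crude nearest-neighbour resize stands in for the resizing step; both are assumptions for illustration:

```python
import numpy as np

def avg_pool_1d(v, size):
    """Average pooling with window/stride `size` over a 1-D text vector
    (the vector is truncated to a multiple of `size`)."""
    n = len(v) // size
    return v[:n * size].reshape(n, size).mean(axis=1)

def multiscale_text_features(v, sizes=(1, 2, 3, 6, 10), out_len=30):
    """Pool at the five scales from the text, resize each result back to a
    common length (nearest-neighbour), and average-fuse the results."""
    feats = []
    for s in sizes:
        p = avg_pool_1d(v, s)
        idx = np.arange(out_len) * len(p) // out_len   # crude resize indices
        feats.append(p[idx])
    return np.mean(feats, axis=0)

v = np.arange(60, dtype=float)    # toy bag-of-words text vector
f = multiscale_text_features(v)   # fused multi-scale text feature
```

Small windows preserve word-level detail while large windows summarize broader context; fusing them gives the multi-scale semantics the text attributes to f_y(y; θ_y).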
S2.3, joint hash network: this network takes as input the concatenation f_u(u; θ_u) = concat(f_x(x; θ_x), f_y(y; θ_y)) of the multi-scale image features generated by the VFPN multi-scale network and the multi-scale fusion features of the text. The concatenated features f_u(u; θ_u) are fed into a three-layer feedforward neural network serving as the hash function, which converts them into the binary code H^(u) = f_u(u; θ_u) ∈ {-1, 1}^k. Then, the hash code of the joint network is obtained through B_u = sgn(H_u) ∈ {-1, 1}^k.
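The joint branch reduces to concatenation followed by a three-layer hash head. A hedged NumPy sketch with random, untrained weights (feature dimensions and hidden sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 16                                     # hash code length

def hash_head(f, W1, W2, W3):
    """Three-layer feedforward hash function ending in tanh, then sgn."""
    h = np.tanh(np.tanh(f @ W1) @ W2)
    H = np.tanh(h @ W3)                    # continuous code H in (-1, 1)^k
    return np.sign(H)                      # binary code B in {-1, +1}^k

f_img = rng.standard_normal(128)           # f_x(x; theta_x), image features
f_txt = rng.standard_normal(64)            # f_y(y; theta_y), text features
f_joint = np.concatenate([f_img, f_txt])   # f_u = concat(f_x, f_y)

W1 = rng.standard_normal((192, 64))
W2 = rng.standard_normal((64, 64))
W3 = rng.standard_normal((64, k))
B_u = hash_head(f_joint, W1, W2, W3)       # joint hash code B_u = sgn(H_u)
```

Because the joint code is computed from both modalities' features, it carries shared semantics that the later guidance losses can push the single-modality codes toward.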
In S3, supervising the learning of each modality's hash codes with the class hash codes comprises the following steps:
s3.1, cross-modal similarity and rich semantic structure information are preserved through the Hamming distance: the Hamming distance between the hash code H_*i and the class hash codes of the classes to which data point i belongs should be smaller than its Hamming distance to the class hash codes of the classes it does not belong to, which is modeled as:

where * ∈ {x, y, u} denotes the image, text, and joint modalities; μ ∈ [0, 1] is a predefined margin and k is the hash code length; E_i is the index set of the classes to which data point i belongs, i.e., the indices of the elements equal to "1" in the label vector l_i; Q_i = {1, ..., c} − E_i is the index set of the classes to which data point i does not belong, i.e., the indices of the elements equal to "0" in l_i; and dist(H_*i, p_e) is the Hamming distance between H_*i and p_e. H_*i should be similar to the average of the class hash codes of its own classes; moreover, if the class hash codes {p_e | e ∈ E_i} corresponding to H_*i are more similar to it than {p_q | q ∈ Q_i}, then H_*i preserves both semantic similarity and semantic structure information;
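For {-1, +1} codes the Hamming distance has the closed form dist(h, p) = (k − ⟨h, p⟩)/2, and the margin condition of S3.1 can then be checked directly. Toy hand-made codes, not learned ones:

```python
import numpy as np

def hamming(h, p):
    """Hamming distance between two {-1, +1} codes of length k:
    dist = (k - <h, p>) / 2."""
    k = len(h)
    return (k - float(h @ p)) / 2.0

k = 8
h = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
p_same = h.copy()                    # class hash code of h's own class
p_other = -h                         # a non-class hash code

d_in = hamming(h, p_same)            # identical codes -> distance 0
d_out = hamming(h, p_other)          # all bits differ -> distance k

mu = 0.3                             # predefined margin in [0, 1]
margin_ok = d_in + mu * k <= d_out   # the class-guided margin condition
```

The inner-product form avoids bit-by-bit comparison and is the identity the margin constraint is usually optimized through.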
S3.2, the hash code H_* of each modality can be generated under the supervision of the class hash codes P, with the loss of each modality given by:

where λ is a hyperparameter; * ∈ {x, y, u} denotes the image, text, and joint modalities; the average is taken over the class hash codes similar to H_*i; and p_q is a class hash code dissimilar to H_*i.
The category network is used to guide each modality to generate its hash code; the hash objective function is as follows:

where λ is a hyperparameter; * ∈ {x, y, u} denotes the image, text, and joint modalities; the average is taken over the class hash codes similar to H_*i; and p_q is a class hash code dissimilar to H_*i. The hash codes learned by the model of this embodiment preserve cross-modal similarity and rich semantic structure information well.
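A hedged hinge-loss reading of this objective (the exact formula is not reproduced in the text, so the margin form described in S3.1 is assumed: for each sample, the Hamming distance to the mean of its own class hash codes should undercut the distance to every non-class hash code by at least μ·k):

```python
import numpy as np

def class_guided_loss(H, P, labels, mu=0.3):
    """Assumed hinge form of the class-guided loss: penalize any non-class
    hash code that is not at least mu * k further away than the mean of
    the sample's own class hash codes."""
    n, k = H.shape
    loss = 0.0
    for i in range(n):
        own = labels[i].astype(bool)
        p_bar = P[own].mean(axis=0)                  # mean of similar class codes
        d_in = (k - H[i] @ p_bar) / 2.0
        for q in np.flatnonzero(~own):               # indices Q_i of absent labels
            d_out = (k - H[i] @ P[q]) / 2.0
            loss += max(0.0, d_in + mu * k - d_out)  # hinge on the margin
    return loss / n

k = 8
P = np.array([[1]*8, [-1]*8, [1]*4 + [-1]*4], dtype=float)  # 3 class hash codes
H = np.array([[1]*8], dtype=float)                          # one sample's code
labels = np.array([[1, 0, 0]])                              # belongs to class 0
loss = class_guided_loss(H, P, labels)
```

Here the sample's code coincides with its class code, so every margin is satisfied and the loss is zero; flipping the sample's bits would make the hinge terms positive.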
In S4, the hash code generation and learning of the image and the text are guided by using a joint hash network, so as to improve the correlation of modalities, that is:
Combining the above functions yields the retrieval model based on the medical cross-modal multi-scale fusion class-guided hashing method:
where γ and η are hyperparameters; * ∈ {x, y, u} denotes the image, text, and joint modalities; the average is taken over the class hash codes similar to H_*i; p_q is a class hash code dissimilar to H_*i; and B_u, B_x, and B_y are the hash codes of the joint network, the image, and the text, respectively.
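A hedged sketch of the joint-network guidance term: the exact model is not reproduced in the text, so an L2 consistency term weighted by γ and η, pulling the continuous image and text codes toward the joint code, is assumed here:

```python
import numpy as np

def guidance_loss(H_u, H_x, H_y, gamma=0.3, eta=0.3):
    """Assumed L2 consistency form of joint-network guidance: penalize
    disagreement of the image code H_x and text code H_y with the joint
    code H_u, weighted by gamma and eta."""
    return gamma * np.mean((H_x - H_u) ** 2) + eta * np.mean((H_y - H_u) ** 2)

H_u = np.array([0.9, -0.8, 0.7, -0.6])   # joint network's continuous code
H_x = H_u.copy()                          # image code already agrees
H_y = -H_u                                # text code disagrees everywhere
loss = guidance_loss(H_u, H_x, H_y)
```

Minimizing such a term drags both single-modality codes toward the shared joint code, which is the stated purpose of the guidance step: correlating the modality semantics.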
To verify the superiority of the method in cross-modal retrieval, the public medical dataset MIMIC-CXR is selected for experiments; mAP is adopted for cross-modal retrieval evaluation, and the Top-10 retrieval results are also displayed. In the experiments, the method of this embodiment is trained 5 times and the average is taken as the final result, with parameters set as follows: α = 0.05, β = 0.01, λ = 0.3, γ = 0.3, η = 0.3, and μ = 0.3.
Table 1: mAP values on MIMIC-CXR datasets
(1) Analysis of results of mAP values on two public data sets
The method of this embodiment is compared with 7 existing cross-modal retrieval methods: CCA, CMSSH, SCM, STMH, CMFH, SePH, and DCMH. All methods are compared on the two datasets. As shown in the table above, the mAP values of the proposed method are higher than those of the compared methods, which demonstrates the feasibility of substituting class hashing for the semantic similarity matrix and shows that the joint semantics help improve semantic relevance.
(2) Comparative analysis of Top-10 search results
As shown in fig. 3, the CCA and DCMH methods exhibit multiple failure cases. By comparison, even when the method of this embodiment fails on the image-to-text and text-to-image retrieval tasks, the relevant results are ranked earlier, and the retrieval results are intuitively semantically related to the query.
As shown in fig. 4, a retrieval system based on the medical cross-modal multi-scale fusion class-guided hashing method includes:
the first input module 1, used for inputting category semantics;
the first feature processing module 2, used for establishing the class hash network to learn the hash codes of the categories;
the second input module 3, used for inputting data of different modalities;
the second feature processing module 4, used for establishing the image network and the text network to obtain the features and hash codes of each modality, and combining the image and text features to generate the joint hash code;
the learning and training module 5, used for training the hash codes of the image, text, and joint networks with the class hash codes representing the labels as supervision information, while the joint network guides the learning of the image and text hash codes and performs retrieval;
and the output module 6, used for outputting the retrieval result.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any modification and replacement based on the technical solution and inventive concept provided by the present invention should be covered within the scope of the present invention.
Claims (7)
1. The medical cross-modal multi-scale fusion class guidance hash method is characterized by comprising the following steps: the method comprises the following specific steps:
s1, inputting category semantics, and establishing a category hash network for learning hash codes of various categories;
s2, inputting data of different modalities, establishing an image network and a text network to obtain the features and hash codes of each modality, and combining the image and text features to generate a joint hash code;
s3, representing labels by class hash codes as supervision information to train hash codes of images, texts and joint networks;
s4, using the joint network to guide the learning of the hash codes of the images and texts.
2. The medical cross-modal multi-scale fusion class-guided hashing method according to claim 1, wherein: the model of the class hash network in S1 is:
s.t. p_i = sgn(H^(c)) = sgn(f_c(c_i; θ_c))
3. The medical cross-modal multi-scale fusion class-guided hashing method according to claim 1, wherein: in S2, an image hash network and a text hash network are established to obtain features and hash codes of each modality, and a joint hash network generates a joint hash code, which is specifically implemented by the following steps:
s2.1, image hash network: to obtain high-resolution, high-semantic medical image features, a deep convolutional network (VGG) is combined with a feature pyramid network (FPN) to extract multi-scale image features; this combination is called the VFPN multi-scale network. The network fuses high-resolution, weak-semantic features with low-resolution, strong-semantic features to obtain high-resolution, strong-semantic features f_x(x; θ_x). In addition, three fully connected layers are added as the hash function to convert the features f_x(x; θ_x) into the binary code H^(x) = f_x(x; θ_x) ∈ {-1, 1}^k, where the first two fully connected layers are the same as the last two layers of VGG and the third has k hidden units with the tanh(·) function as its activation. Finally, the hash code of the image modality is obtained through B_x = sgn(H_x) ∈ {-1, 1}^k, where k is the length of the hash code;
s2.2, adopting a text-network multi-scale fusion model based on self-supervised adversarial hashing for cross-modal retrieval (SSAH); first, five average pooling layers of sizes 1×1, 1×2, 1×3, 1×6, and 1×10 extract features from the text data at multiple scales, and one 1×1 convolutional layer then fuses these features; next, the multi-scale text semantic features f_y(y; θ_y) are obtained through resizing and concatenation; the fused features are fed into a three-layer feedforward neural network serving as the hash function, which converts the features f_y(y; θ_y) into the binary code H^(y) = f_y(y; θ_y) ∈ {-1, 1}^k; finally, the hash code of the text modality is obtained through B_y = sgn(H_y) ∈ {-1, 1}^k;
s2.3, joint hash network: this network takes as input the concatenation f_u(u; θ_u) of the multi-scale image features f_x(x; θ_x) generated by the VFPN multi-scale network and the multi-scale fusion features f_y(y; θ_y) of the text; the concatenated features f_u(u; θ_u) are fed into a three-layer feedforward neural network serving as the hash function, which converts them into the binary code H^(u) = f_u(u; θ_u) ∈ {-1, 1}^k; finally, the hash code of the joint network is obtained through B_u = sgn(H_u) ∈ {-1, 1}^k.
4. The medical cross-modal multi-scale fusion class-guided hashing method according to claim 1, wherein: in S3, the step of monitoring the learning of the modal hash codes according to the class hash codes includes:
s3.1, cross-modal similarity and rich semantic structure information are preserved through the Hamming distance: the Hamming distance between the hash code H_*i and the class hash codes of the classes to which data point i belongs should be smaller than its Hamming distance to the class hash codes of the classes it does not belong to, which is modeled as:

where * ∈ {x, y, u} denotes the image, text, and joint modalities; μ ∈ [0, 1] is a predefined margin and k is the hash code length; E_i is the index set of the classes to which data point i belongs, i.e., the indices of the elements equal to "1" in the label vector l_i; Q_i = {1, ..., c} − E_i is the index set of the classes to which data point i does not belong, i.e., the indices of the elements equal to "0" in l_i; and dist(H_*i, p_e) is the Hamming distance between H_*i and p_e. H_*i should be similar to the average of the class hash codes of its own classes; moreover, if the class hash codes {p_e | e ∈ E_i} corresponding to H_*i are more similar to it than {p_q | q ∈ Q_i}, then H_*i preserves both semantic similarity and semantic structure information;
S3.2, the hash code learning of each modality can be supervised by the class hash codes P; the loss for each modality is:
5. The medical cross-modal multi-scale fusion class-guided hashing method according to claim 1, wherein in S4 the joint network is used to guide the learning of the image and text hash codes, the specific model being:
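The specific model of S4 is likewise not reproduced in this text; one common way such guidance is expressed, sketched below as an assumption rather than the claimed formula, is a squared-error term pulling each modality's continuous hash output toward the joint network's code:

```python
import numpy as np

def guidance_loss(H_x, H_y, B_u):
    """Hedged sketch of S4: penalize the squared distance between each
    modality's continuous hash output and the joint network's code B_u,
    so the joint network guides the image and text hash codes."""
    return float(np.sum((H_x - B_u) ** 2) + np.sum((H_y - B_u) ** 2))

B_u = np.array([1.0, -1.0, 1.0, -1.0])  # joint-network hash code (toy, k = 4)
H_x = np.array([0.5, -1.0, 1.0, -1.0])  # image hash output, one bit off by 0.5
H_y = B_u.copy()                        # text hash output already matching B_u
print(guidance_loss(H_x, H_y, B_u))     # 0.25
```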
6. A retrieval model based on the medical cross-modal multi-scale fusion class-guided hashing method, characterized in that the retrieval model is generated by the medical cross-modal multi-scale fusion class-guided hashing method of claim 1 and is as follows:
7. A retrieval system based on the medical cross-modal multi-scale fusion class-guided hashing method, characterized by comprising:
the first input module (1), used for inputting the category semantics;
the first feature processing module (2), used for building the class hash network to learn the hash code of each class;
the second input module (3), used for inputting the data of the different modalities;
the second feature processing module (4), used for building the image network and the text network to obtain the features and hash codes of each modality, and combining the image and text features to generate the joint hash code;
the learning and training module (5), used for representing the labels as class hash codes serving as supervision information to train the hash codes of the image, text and joint networks, with the joint network guiding the learning of the image and text hash codes, and for performing retrieval;
and the output module (6), used for outputting the retrieval result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110483387.0A CN113270199B (en) | 2021-04-30 | 2021-04-30 | Medical cross-mode multi-scale fusion class guide hash method and system thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113270199A true CN113270199A (en) | 2021-08-17 |
CN113270199B CN113270199B (en) | 2024-04-26 |
Family
ID=77229860
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113704537A (en) * | 2021-10-28 | 2021-11-26 | 南京码极客科技有限公司 | Fine-grained cross-media retrieval method based on multi-scale feature union |
CN117112829A (en) * | 2023-10-24 | 2023-11-24 | 吉林大学 | Medical data cross-modal retrieval method and device and related equipment |
WO2024087218A1 (en) * | 2022-10-28 | 2024-05-02 | 深圳先进技术研究院 | Cross-modal medical image generation method and apparatus |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017092183A1 (en) * | 2015-12-03 | 2017-06-08 | 中山大学 | Image retrieval method based on variable-length deep hash learning |
CN109299216A (en) * | 2018-10-29 | 2019-02-01 | 山东师范大学 | A kind of cross-module state Hash search method and system merging supervision message |
CN110110122A (en) * | 2018-06-22 | 2019-08-09 | 北京交通大学 | Image based on multilayer semanteme depth hash algorithm-text cross-module state retrieval |
CN110309331A (en) * | 2019-07-04 | 2019-10-08 | 哈尔滨工业大学(深圳) | A kind of cross-module state depth Hash search method based on self-supervisory |
CN110765281A (en) * | 2019-11-04 | 2020-02-07 | 山东浪潮人工智能研究院有限公司 | Multi-semantic depth supervision cross-modal Hash retrieval method |
CN111127385A (en) * | 2019-06-06 | 2020-05-08 | 昆明理工大学 | Medical information cross-modal Hash coding learning method based on generative countermeasure network |
WO2020182019A1 (en) * | 2019-03-08 | 2020-09-17 | 苏州大学 | Image search method, apparatus, device, and computer-readable storage medium |
Non-Patent Citations (4)
Title |
---|
Wu L et al.: "Cycle-consistent deep generative hashing for cross-modal retrieval", IEEE Transactions on Image Processing * |
Liu Haoxin; Wu Xiaojun; Yu Jun: "Cross-modal retrieval algorithm with joint hash feature and classifier learning", Pattern Recognition and Artificial Intelligence, no. 02 * |
Ou Weihua; Liu Bin; Zhou Yonghui et al.: "A survey of cross-modal retrieval", Journal of Guizhou Normal University (Natural Sciences) * |
Chen Fei; Lü Shaohe; Li Jun; Wang Xiaodong; Dou Yong: "Multi-label image retrieval with object extraction and hashing mechanism", Journal of Image and Graphics, no. 02 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li et al. | A survey of multi-view representation learning | |
CN110059217B (en) | Image text cross-media retrieval method for two-stage network | |
Cao et al. | Cross-modal hamming hashing | |
Arevalo et al. | Gated multimodal units for information fusion | |
CN108984724B (en) | Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation | |
Arevalo et al. | Gated multimodal networks | |
Zheng et al. | A deep and autoregressive approach for topic modeling of multimodal data | |
CN113270199B (en) | Medical cross-mode multi-scale fusion class guide hash method and system thereof | |
Shi et al. | Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval | |
Sharma et al. | A survey of methods, datasets and evaluation metrics for visual question answering | |
Yang et al. | Image captioning by incorporating affective concepts learned from both visual and textual components | |
Kim et al. | Gaining extra supervision via multi-task learning for multi-modal video question answering | |
Qiao et al. | Word-character attention model for Chinese text classification | |
CN116204706A (en) | Multi-mode content retrieval method and system for text content and image analysis | |
CN116561305A (en) | False news detection method based on multiple modes and transformers | |
Chen et al. | Leveraging unpaired out-of-domain data for image captioning | |
Zhang et al. | Category supervised cross-modal hashing retrieval for chest x-ray and radiology reports | |
Yu et al. | Multimodal multitask deep learning for X-ray image retrieval | |
CN117556067B (en) | Data retrieval method, device, computer equipment and storage medium | |
Wu et al. | Deep semantic hashing with dual attention for cross-modal retrieval | |
Bayoudh | A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges | |
Wu et al. | Visual Question Answering | |
CN112182273B (en) | Cross-modal retrieval method and system based on semantic constraint matrix decomposition hash | |
Huang et al. | Explore instance similarity: An instance correlation based hashing method for multi-label cross-model retrieval | |
Perdana et al. | Instance-based deep transfer learning on cross-domain image captioning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||