CN113889228A - Semantic enhanced Hash medical image retrieval method based on mixed attention - Google Patents

Semantic enhanced Hash medical image retrieval method based on mixed attention

Info

Publication number
CN113889228A
Authority
CN
China
Prior art keywords
medical
hash
images
medical image
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111106128.2A
Other languages
Chinese (zh)
Inventor
陈亚雄
李小玉
汤一博
王凡
熊盛武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202111106128.2A priority Critical patent/CN113889228A/en
Publication of CN113889228A publication Critical patent/CN113889228A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 30/00: ICT specially adapted for the handling or processing of medical images
    • G16H 30/20: ICT specially adapted for the handling or processing of medical images, e.g. DICOM, HL7 or PACS
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval of still image data
    • G06F 16/53: Querying
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Image Analysis (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention relates to a semantic enhanced hash medical image retrieval method based on mixed attention. First, a data set is divided into a training set and a test retrieval set, and images are randomly selected from the training set to form medical triplets. An overall network model is then constructed, with the medical triplet samples as its input. Finally, the overall network model is trained and retrieval results are obtained with the trained network. The invention combines a channel attention module and a spatial attention module into a mixed attention mechanism that can efficiently extract region-of-interest (ROI) information. Class-level semantic information is used to constrain the learning process of the hash codes, which helps distinguish similar hash codes belonging to different classes. When depth embeddings are mapped to discrete hash codes, a quantization loss term reduces the quantization error between the depth embedding and the hash code, further improving the precision of medical image retrieval.

Description

Semantic enhanced Hash medical image retrieval method based on mixed attention
Technical Field
The invention belongs to the field of medical image retrieval, and particularly relates to a semantic enhanced hash medical image retrieval method based on mixed attention.
Background
With the rapid development of radiographic imaging technology, medical data have gradually been digitized and the number of medical images has increased sharply. Mining useful information from large-scale medical images is critical to better assisting medical diagnosis and assessment. Medical image retrieval has therefore attracted wide attention.
Medical image retrieval can be divided into two categories: text-based and content-based. Text-based medical image retrieval appeared early; it avoids analyzing the visual elements of medical images, indexes them by name, size, type, etc., and typically queries them by keyword. However, it relies on highly subjective manual labeling, and text cannot fully express the rich semantic content of medical images. Content-based medical image retrieval instead extracts low-dimensional visual features and high-dimensional semantic features directly from the medical images, forming feature vectors that serve as an objective basis for indexing and matching the images to be retrieved. However, most existing content-based methods learn only the relative relationships of medical images to extract deep features, ignoring the class-level semantics of the images and their labels; this under-uses high-level semantic information and ultimately harms retrieval performance.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a semantic enhanced hash medical image retrieval method based on mixed attention. First, a data set is divided into a training set and a test retrieval set, and images are randomly selected from the training set to form medical triplets. An overall network model is then constructed, with the medical triplet samples as its input. Finally, the overall network model is trained and retrieval results are obtained with the trained network.
In order to achieve the above object, the technical solution provided by the present invention is a semantic enhanced hash medical image retrieval method based on mixed attention, comprising the following steps:
step 1, dividing a data set into a training set and a test retrieval set;
step 2, randomly selecting images to form a medical triple;
step 3, constructing an integral network model, and taking the medical triple sample as the input of the network model;
step 4, training an integral network model;
and 5, obtaining a retrieval result by using the trained network.
Furthermore, in step 1, three data sets are used: the chest X-ray image data set COVID-19 Radiography, the curated COVID-19 chest X-ray image data set Curated X-Ray, and the dermoscopy image data set HAM10000. For each data set, 70% of the data is selected as the training set and the remaining 30% as the test retrieval set. Medical images in the same data set are of the same class, and medical images in different data sets are of different classes.
Furthermore, in step 2, given m training images forming a training set I = {I_1, I_2, ..., I_m}, two medical images of the same class are randomly selected from the training set as the anchor image Q_i and the positive image P_i; then one medical image of a class different from Q_i and P_i is randomly selected as the negative image N_i, forming a medical triplet T = {Q_i, P_i, N_i}, i ∈ {1, ..., m}. In each triplet, the anchor image Q_i and the positive image P_i are similar, while the negative image N_i is dissimilar to both. When constructing the medical triplet sample units, medical images of classes with few samples are treated as rare images and used as the negative images of common samples, so that rare images are multiplexed during the training stage, alleviating the sample-imbalance problem in the medical image retrieval field.
Moreover, in step 3, for each triplet, the three medical images are simultaneously input into a weight-sharing twin neural network, which consists of a convolution block, a dense block, a convolution block and a fully connected layer that outputs the hash code. A channel attention module is added between the first convolution block and the dense block, and a spatial attention module between the dense block and the second convolution block, forming a mixed attention mechanism. The channel attention module and the spatial attention module acquire region-of-interest information, capturing both inter-channel dependencies and salient spatial features, and thus attend more effectively to the discriminative differences of medical images.
First, a medical image passes through the first convolution block to obtain a feature map X ∈ R^(C×H×W), where H and W denote the height and width of the feature map and C the number of channels. The channel attention module then compresses the input feature map using average pooling and max pooling operations. The channel attention module comprises two consecutive convolution layers: the first 1 × 1 convolution projects the pooled features onto a hidden layer with fewer parameters and uses the ReLU function as activation; the second 1 × 1 convolution restores the number of channels. The average-pooled and max-pooled outputs are then added element-wise, passed through a sigmoid function to obtain the channel weights, and finally multiplied with the feature map X.
The channel attention module can be expressed as:
M_C(X) = σ(Conv_1×1(ReLU(Conv_1×1(AvgPool(X)))) + Conv_1×1(ReLU(Conv_1×1(MaxPool(X)))))    (1)
X' = M_C(X) ⊗ X    (2)
where M_C(X) is the one-dimensional channel attention map of size C × 1 × 1; Conv_1×1 denotes a convolution operation with filter size 1 × 1; σ denotes the sigmoid function; AvgPool(·) is the average pooling function; MaxPool(·) is the max pooling function; ⊗ denotes element-wise multiplication; and X' is the channel-refined feature map.
To make full use of the feature map and enhance feature propagation, the feature map is input into a dense block composed of four dense layers. The output of each dense layer is passed to every subsequent layer, creating short paths from early layers to later layers. The spatial attention module complements the channel attention module by focusing on the most informative spatial locations of the sample. Let Y ∈ R^(C×H×W) denote the feature map extracted from the last dense layer, where H and W denote the height and width of the feature map and C the number of channels; the spatial attention module can then be expressed as:
M_S(Y) = σ(Conv_7×7([AvgPool(Y); MaxPool(Y)]))    (3)
Y' = M_S(Y) ⊗ Y    (4)
where M_S(Y) is the two-dimensional spatial attention map of size 1 × H × W; Conv_7×7 denotes a convolution operation with filter size 7 × 7; σ denotes the sigmoid function; AvgPool(·) and MaxPool(·) are the channel-wise average and max pooling functions; [·;·] denotes concatenation along the channel dimension; and Y' is the spatially refined feature map.
Finally, the depth embedding is mapped to the hash code generation layer, where it is constrained by the semantic enhancement loss, the quantization (regularization) loss and the triplet loss.
In step 4, based on the mixed attention mechanism and the twin neural network, the model is trained by optimizing an overall loss function comprising a hash triplet term, a semantic enhancement term and a quantization term.
The hash function maps each medical instance to a compact hash code while preserving the semantic information of the matched medical images and labels in the original space. Since the Hamming distance between discrete hash codes is inconvenient to optimize in a deep learning network, the invention replaces it with the Euclidean distance between the depth embeddings output by the linear layer. To capture relative relevance in the hash space, the basic triplet term over the medical images can be expressed as:
L_tri = Σ_{i=1}^{m} max(0, ‖h_i^Q - h_i^P‖_2^2 - ‖h_i^Q - h_i^N‖_2^2 + δ)    (5)
where ‖·‖_2 denotes the ℓ2 norm used to measure distance; h_i^Q, h_i^P and h_i^N denote the k-bit depth embeddings, not yet discretized, of Q_i, P_i and N_i; and δ denotes the margin threshold.
Class-level semantics help distinguish similar hash codes belonging to different classes. To capture the class-level semantics of medical images, the learning process of the hash codes is constrained using the matched images and their true labels. The semantic enhancement term can be expressed as:
L_se = Σ_{i=1}^{m} [ℓ_ce(h_i^Q, y_i^Q) + ℓ_ce(h_i^P, y_i^P) + ℓ_ce(h_i^N, y_i^N)]    (6)
where ℓ_ce(·,·) denotes the cross-entropy loss function, and y_i^Q, y_i^P and y_i^N denote the label information of Q_i, P_i and N_i, respectively.
Since the triplet loss is computed on depth embeddings that have not been discretized, a quantization error arises. Inspired by iterative quantization, a quantization term is used to reduce the quantization error between the depth embeddings and the hash codes. The quantization term can be expressed as:
L_qu = Σ_{i=1}^{m} (‖h_i^Q - b_i^Q‖_2^2 + ‖h_i^P - b_i^P‖_2^2 + ‖h_i^N - b_i^N‖_2^2)    (7)
where ‖·‖_2 denotes the ℓ2 norm used to measure distance; h_i^Q, h_i^P and h_i^N denote the k-bit depth embeddings not yet discretized; and b_i^Q, b_i^P and b_i^N denote the k-bit hash codes of Q_i, P_i and N_i, respectively.
Combining the above three parts, the overall loss function can be expressed as:
L_total = L_tri + α × L_se + β × L_qu    (8)
where α and β are hyperparameters that control the weights of the loss terms.
When training the overall network model, the medical triplet images are resized to 256 × 256 and random samples are taken as network input in each training round. The margin threshold δ of the triplet loss is set to 0.5, and the parameters α and β of the overall loss function are set to 1 and 0.8, respectively. The network optimizes the loss with the Adam optimizer at a learning rate of 0.001. Performance is evaluated for hash code lengths of 8, 16, 32, 48 and 64 bits and for the top 5, 10, 15, 20, 25 and 30 most similar images. The trained model is obtained after 100 rounds of training, or once the loss no longer decreases.
In step 5, the trained network is used to compute the mean hit rate (mHR), mean average precision (mAP) and mean reciprocal rank (mRR) over the sample images in the test data set, and retrieval performance is evaluated on these three indices. The hit rate (HR) measures how many images in the returned list are similar to the query image; the average precision (AP) averages precision over the ranked positions of the images similar to the query, thereby measuring ranking quality; and the reciprocal rank (RR) is the reciprocal of the position of the first similar image in the returned list.
Compared with the prior art, the invention has the following advantages: 1) the channel attention module and the spatial attention module are combined into a mixed attention mechanism that can efficiently extract region-of-interest (ROI) information; 2) class-level semantic information constrains the learning process of the hash codes, which helps distinguish similar hash codes belonging to different classes; 3) when the depth embeddings are mapped to discrete hash codes, a quantization loss term reduces the quantization error between the depth embedding and the hash code, further improving the precision of medical image retrieval.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a network structure diagram according to an embodiment of the present invention.
Fig. 3 compares the retrieval performance of the method of the present invention with other methods on different data sets: fig. 3(a) shows the top-10 mAP medical retrieval performance with different hash bit lengths on the COVID-19 Radiography data set; fig. 3(b) shows the same on the Curated X-Ray data set; and fig. 3(c) shows the same on the HAM10000 data set.
Fig. 4 compares the retrieval performance of the method of the present invention with other methods at different retrieval points: fig. 4(a) shows the medical retrieval performance of 48-bit hash codes at different retrieval points on the COVID-19 Radiography data set; fig. 4(b) shows the same on the Curated X-Ray data set; and fig. 4(c) shows the same on the HAM10000 data set.
FIG. 5 shows the top 10 similar images returned by the method of the present invention for query images on the Curated X-Ray and HAM10000 data sets: FIG. 5(a) for the Curated X-Ray data set and FIG. 5(b) for the HAM10000 data set, where incorrectly retrieved images are labeled with their differing class names below.
Detailed Description
The invention provides a semantic enhanced hash medical image retrieval method based on mixed attention: a data set is divided into a training set and a test retrieval set; images are randomly selected from the training set to form medical triplets; an overall network model is constructed with the medical triplet samples as input; the overall network model is trained; and retrieval results are obtained with the trained network.
The technical solution of the present invention is further explained with reference to the drawings and the embodiments.
As shown in fig. 1, the process of the embodiment of the present invention includes the following steps:
step 1, dividing a data set into a training set and a test retrieval set.
Three data sets were used: the chest X-ray image data set COVID-19 Radiography, the curated COVID-19 chest X-ray image data set Curated X-Ray, and the dermoscopy image data set HAM10000. For each data set, 70% of the data is selected as the training set and the remaining 30% as the test retrieval set. Medical images in the same data set are of the same class, and medical images in different data sets are of different classes.
And 2, randomly selecting images to form a medical triple.
Given m training images forming a training set I = {I_1, I_2, ..., I_m}, two medical images of the same class are randomly selected from the training set as the anchor image Q_i and the positive image P_i; then one medical image of a class different from Q_i and P_i is randomly selected as the negative image N_i, forming a medical triplet T = {Q_i, P_i, N_i}, i ∈ {1, ..., m}. In each triplet, the anchor image Q_i and the positive image P_i are similar, while the negative image N_i is dissimilar to both. When constructing the medical triplet sample units, medical images of classes with few samples are treated as rare images and used as the negative images of common samples, so that rare images are multiplexed during the training stage, alleviating the sample-imbalance problem in the medical image retrieval field.
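The triplet construction above can be sketched as follows. This is a minimal illustration, not the patent's exact sampler: `build_triplets` and the deterministic bias toward the rarest other class as the negative source are assumptions introduced here to mirror the rare-image multiplexing idea.

```python
import random
from collections import defaultdict

def build_triplets(labels, m, rng=random):
    """Sample m (anchor, positive, negative) index triplets from a labeled
    training set, preferentially reusing under-represented classes as
    negatives (illustrative rendering of the rare-image multiplexing idea)."""
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    classes = list(by_class)
    # rarest classes first, so their images are drawn most often as negatives
    rare_order = sorted(classes, key=lambda c: len(by_class[c]))
    triplets = []
    for _ in range(m):
        # anchor/positive: two distinct images of one class with >= 2 samples
        anchor_cls = rng.choice([c for c in classes if len(by_class[c]) >= 2])
        q, p = rng.sample(by_class[anchor_cls], 2)
        # negative: an image of the rarest class different from the anchor's
        neg_cls = next(c for c in rare_order if c != anchor_cls)
        n = rng.choice(by_class[neg_cls])
        triplets.append((q, p, n))
    return triplets
```

A real implementation would mix rare and common negatives stochastically rather than always taking the rarest class.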
And 3, constructing an integral network model, and taking the medical triple sample as the input of the network model.
For each triplet, the three medical images are simultaneously input into the weight-sharing twin neural network. As shown in fig. 2, the twin neural network is composed of a convolution block, a dense block, a convolution block and a fully connected layer for hash code output. A channel attention module is added between the first convolution block and the dense block, and a spatial attention module between the dense block and the second convolution block, forming a mixed attention mechanism. The channel attention module and the spatial attention module acquire region-of-interest (ROI) information, capturing both inter-channel dependencies and salient spatial features, and thus attend more effectively to the discriminative differences of medical images.
First, a medical image passes through the first convolution block to obtain a feature map X ∈ R^(C×H×W), where H and W denote the height and width of the feature map and C the number of channels. The channel attention module then compresses the input feature map using average pooling and max pooling operations. The channel attention module comprises two consecutive convolution layers: the first 1 × 1 convolution projects the pooled features onto a hidden layer with fewer parameters and uses the ReLU function as activation; the second 1 × 1 convolution restores the number of channels. The average-pooled and max-pooled outputs are then added element-wise, passed through a sigmoid function to obtain the channel weights, and finally multiplied with the feature map X.
The channel attention module can be expressed as:
M_C(X) = σ(Conv_1×1(ReLU(Conv_1×1(AvgPool(X)))) + Conv_1×1(ReLU(Conv_1×1(MaxPool(X)))))    (1)
X' = M_C(X) ⊗ X    (2)
where M_C(X) is the one-dimensional channel attention map of size C × 1 × 1; Conv_1×1 denotes a convolution operation with filter size 1 × 1; σ denotes the sigmoid function; AvgPool(·) is the average pooling function; MaxPool(·) is the max pooling function; ⊗ denotes element-wise multiplication; and X' is the channel-refined feature map.
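The channel attention computation described above can be sketched in NumPy, treating each 1 × 1 convolution on a pooled C-dimensional vector as a plain matrix multiply. The weight shapes, the reduction ratio and the function names are illustrative assumptions; the patent does not fix them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(X, W1, W2):
    """X: feature map of shape (C, H, W).
    W1: (C//r, C) weights of the first 1x1 conv (channel reduction + ReLU).
    W2: (C, C//r) weights of the second 1x1 conv (channel recovery).
    Returns the reweighted feature map M_C(X) * X, shape (C, H, W)."""
    avg = X.mean(axis=(1, 2))   # AvgPool over the spatial dims -> (C,)
    mx = X.max(axis=(1, 2))     # MaxPool over the spatial dims -> (C,)
    # a 1x1 conv applied to a (C,) pooled vector is just a matmul
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)
    m_c = sigmoid(mlp(avg) + mlp(mx))     # element-wise sum, sigmoid weights
    return X * m_c[:, None, None]         # broadcast multiply over H, W
```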
To make full use of the feature map and enhance feature propagation, the feature map is input into a dense block composed of four dense layers. The output of each dense layer is passed to every subsequent layer, creating short paths from early layers to later layers. The spatial attention module complements the channel attention module by focusing on the most informative spatial locations of the sample. Let Y ∈ R^(C×H×W) denote the feature map extracted from the last dense layer, where H and W denote the height and width of the feature map and C the number of channels; the spatial attention module can then be expressed as:
M_S(Y) = σ(Conv_7×7([AvgPool(Y); MaxPool(Y)]))    (3)
Y' = M_S(Y) ⊗ Y    (4)
where M_S(Y) is the two-dimensional spatial attention map of size 1 × H × W; Conv_7×7 denotes a convolution operation with filter size 7 × 7; σ denotes the sigmoid function; AvgPool(·) and MaxPool(·) are the channel-wise average and max pooling functions; [·;·] denotes concatenation along the channel dimension; and Y' is the spatially refined feature map.
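The spatial attention module can likewise be sketched in NumPy. The naive `conv2d_same` helper (a correlation-style 7 × 7 sliding window with zero padding) and the kernel shape are illustrative assumptions standing in for a learned convolution layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, k):
    """Naive 'same'-padded 2-D correlation: x (Cin, H, W), k (Cin, kh, kw) -> (H, W)."""
    cin, kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    H, W = x.shape[1:]
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[:, i:i + kh, j:j + kw] * k)
    return out

def spatial_attention(Y, K):
    """Y: (C, H, W) feature map from the last dense layer.
    K: (2, 7, 7) kernel of the 7x7 conv over the [AvgPool; MaxPool] maps.
    Returns M_S(Y) * Y."""
    pooled = np.stack([Y.mean(axis=0), Y.max(axis=0)])  # channel-wise pools -> (2, H, W)
    m_s = sigmoid(conv2d_same(pooled, K))               # (H, W) attention map
    return Y * m_s[None, :, :]
```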
Finally, the depth embedding is mapped to the hash code generation layer, where it is constrained by the semantic enhancement loss, the quantization (regularization) loss and the triplet loss.
And 4, training the whole network model.
Based on a hybrid attention mechanism and a twin neural network, a model is trained by optimizing an overall loss function, which includes hash triples, semantic enhancement terms, and quantization terms.
The hash function maps each medical instance to a compact hash code while preserving the semantic information of the matched medical images and labels in the original space. Since the Hamming distance between discrete hash codes is inconvenient to optimize in a deep learning network, the invention replaces it with the Euclidean distance between the depth embeddings output by the linear layer. To capture relative relevance in the hash space, the basic triplet term over the medical images can be expressed as:
L_tri = Σ_{i=1}^{m} max(0, ‖h_i^Q - h_i^P‖_2^2 - ‖h_i^Q - h_i^N‖_2^2 + δ)    (5)
where ‖·‖_2 denotes the ℓ2 norm used to measure distance; h_i^Q, h_i^P and h_i^N denote the k-bit depth embeddings, not yet discretized, of Q_i, P_i and N_i; and δ denotes the margin threshold.
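A minimal NumPy rendering of the triplet term over a batch of k-bit depth embeddings, assuming squared Euclidean distances and mean reduction; `triplet_term` is an illustrative name, not the patent's code.

```python
import numpy as np

def triplet_term(hq, hp, hn, delta=0.5):
    """Hinge triplet loss on depth embeddings of shape (batch, k):
    pull anchor/positive together, push anchor/negative apart by margin delta."""
    d_pos = np.sum((hq - hp) ** 2, axis=1)   # squared L2, anchor vs positive
    d_neg = np.sum((hq - hn) ** 2, axis=1)   # squared L2, anchor vs negative
    return np.maximum(0.0, d_pos - d_neg + delta).mean()
```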
Class-level semantics help distinguish similar hash codes belonging to different classes. To capture the class-level semantics of medical images, the learning process of the hash codes is constrained using the matched images and their true labels. The semantic enhancement term can be expressed as:
L_se = Σ_{i=1}^{m} [ℓ_ce(h_i^Q, y_i^Q) + ℓ_ce(h_i^P, y_i^P) + ℓ_ce(h_i^N, y_i^N)]    (6)
where ℓ_ce(·,·) denotes the cross-entropy loss function, and y_i^Q, y_i^P and y_i^N denote the label information of Q_i, P_i and N_i, respectively.
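A sketch of the semantic enhancement term, assuming the class-level constraint is a cross-entropy over a shared linear classifier `Wc` applied to each depth embedding; the classifier, its shape and the integer-label encoding are assumptions introduced for illustration.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy; logits (batch, n_cls), labels (batch,) int class ids."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))   # log-softmax
    return -logp[np.arange(len(labels)), labels].mean()

def semantic_term(hq, hp, hn, yq, yp, yn, Wc):
    """Constrain each depth embedding with its true label through a shared
    (n_cls, k) linear classifier Wc: L_se = ce(Q) + ce(P) + ce(N)."""
    return (cross_entropy(hq @ Wc.T, yq)
            + cross_entropy(hp @ Wc.T, yp)
            + cross_entropy(hn @ Wc.T, yn))
```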
Since the triplet loss is computed on depth embeddings that have not been discretized, a quantization error arises. Inspired by iterative quantization, a quantization term is used to reduce the quantization error between the depth embeddings and the hash codes. The quantization term can be expressed as:
L_qu = Σ_{i=1}^{m} (‖h_i^Q - b_i^Q‖_2^2 + ‖h_i^P - b_i^P‖_2^2 + ‖h_i^N - b_i^N‖_2^2)    (7)
where ‖·‖_2 denotes the ℓ2 norm used to measure distance; h_i^Q, h_i^P and h_i^N denote the k-bit depth embeddings not yet discretized; and b_i^Q, b_i^P and b_i^N denote the k-bit hash codes of Q_i, P_i and N_i, respectively.
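A sketch of the quantization term, assuming (as in iterative quantization) that the discrete hash code is obtained element-wise as b = sign(h) ∈ {-1, +1}; `quantization_term` is an illustrative name.

```python
import numpy as np

def quantization_term(hq, hp, hn):
    """Penalize the gap between each (batch, k) depth embedding and its
    discrete {-1, +1} hash code b = sign(h)."""
    loss = 0.0
    for h in (hq, hp, hn):
        b = np.where(h >= 0, 1.0, -1.0)           # discretized hash code
        loss += np.mean(np.sum((h - b) ** 2, axis=1))
    return loss
```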
Combining the above three parts, the overall loss function can be expressed as:
L_total = L_tri + α × L_se + β × L_qu    (8)
where α and β are hyperparameters that control the weights of the loss terms.
When training the overall network model, the medical triplet images are resized to 256 × 256 and random samples are taken as network input in each training round. The margin threshold δ of the triplet loss is set to 0.5, and the parameters α and β of the overall loss function are set to 1 and 0.8, respectively. The network optimizes the loss with the Adam optimizer at a learning rate of 0.001. Performance is evaluated for hash code lengths of 8, 16, 32, 48 and 64 bits and for the top 5, 10, 15, 20, 25 and 30 most similar images. Training runs for 100 rounds, or until the loss no longer decreases, yielding the trained model.
And 5, obtaining a retrieval result by using the trained network.
The trained network is used to compute the mean hit rate (mHR), mean average precision (mAP) and mean reciprocal rank (mRR) over the sample images in the test data set, and retrieval performance is evaluated on these three indices. The hit rate (HR) measures how many images in the returned list are similar to the query image; the average precision (AP) averages precision over the ranked positions of the images similar to the query, thereby measuring ranking quality; and the reciprocal rank (RR) is the reciprocal of the position of the first similar image in the returned list.
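The three per-query indices can be sketched for a single returned list as follows; the mean variants mHR, mAP and mRR simply average these values over all query images. Representing the returned list as a boolean relevance list is an assumption for illustration.

```python
def hit_rate(relevant):
    """relevant: list of booleans over the returned list (True = similar)."""
    return sum(relevant) / len(relevant)

def average_precision(relevant):
    """Average of precision@rank taken at each rank holding a similar image."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def reciprocal_rank(relevant):
    """Reciprocal of the rank of the first similar image, 0 if none."""
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0
```

For example, a returned list [similar, dissimilar, similar] gives HR = 2/3, AP = (1/1 + 2/3)/2 and RR = 1.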
To evaluate the effectiveness of the method of the invention, an ablation experiment was first performed: first, features are extracted without the channel attention module (HASE-C); second, the hash function is learned without the semantic enhancement loss (HASE-S); third, the hash function is learned without the quantization term (HASE-Q); finally, the full method of the invention (HASE) is applied. The method of the invention is then compared in retrieval performance with state-of-the-art methods such as ASH, ATH, DHN, DPSH, DSH, DTSH and IDHN.
TABLE 1
[Table 1: top-10 mAP of HASE versus HASE-C, HASE-S and HASE-Q on the COVID-19 Radiography data set for different hash bit lengths; the table is rendered as an image in the original and its values are not reproduced here.]
Table 1 shows the comparative results of the invention against HASE-C, HASE-S and HASE-Q on the COVID-19 Radiography data set for different hash bit lengths. The comparison shows that the proposed method achieves the highest mean average precision for the top 10 retrieval results on the COVID-19 Radiography data set at every hash bit length.
TABLE 2
[Table 2: mHR@10, mAP@10 and mRR@10 of the invention versus other methods on the COVID-19 Radiography, Curated X-Ray and HAM10000 data sets; the table is rendered as an image in the original and its values are not reproduced here.]
Table 2 shows the comparative results of the invention against other methods on the COVID-19 Radiography, Curated X-Ray and HAM10000 data sets under the mHR@10, mAP@10 and mRR@10 indices. The comparison shows that the proposed method achieves the highest mean average precision for the top 10 retrieval results on all three data sets.
In specific implementation, the above process can adopt computer software technology to realize automatic operation process.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments, or alternatives may be employed, by those skilled in the art without departing from the spirit of the invention or the scope defined in the appended claims.

Claims (9)

1. A semantic enhanced hash medical image retrieval method based on mixed attention is characterized by comprising the following steps:
step 1, dividing a data set into a training set and a test retrieval set;
step 2, randomly selecting images to form a medical triple;
step 3, constructing an integral network model, and taking the medical triple sample as the input of the network model;
step 4, training an integral network model;
and 5, obtaining a retrieval result by using the trained network.
2. The semantic enhanced hash medical image retrieval method based on mixed attention as claimed in claim 1, wherein: in step 1, n data sets are used, for each data set, 70% of data is selected as a training set, the remaining 30% of data is selected as a testing and searching set, medical images in the same data set are similar medical images, and medical images in different data sets are different medical images.
3. The semantic enhanced hash medical image retrieval method based on mixed attention as claimed in claim 2, characterized in that: in step 2, given m training images forming a training set I = {I_1, I_2, ..., I_m}, two medical images of the same class are randomly selected from the training set as the anchor image Q_i and the positive image P_i; then one medical image of a class different from Q_i and P_i is randomly selected as the negative image N_i, forming a medical triplet T = {Q_i, P_i, N_i}, i ∈ {1, ..., m}; when constructing the medical triplet sample units, medical images of classes with few samples are treated as rare images and used as the negative images of common samples, so that rare images are multiplexed during the training stage, alleviating the sample-imbalance problem in the medical image retrieval field.
4. The semantic-enhanced hash medical image retrieval method based on mixed attention as claimed in claim 3, wherein: in step 3, for each triplet, the three medical images are simultaneously input into a weight-sharing Siamese (twin) neural network, which consists of a convolution block, a dense block, a convolution block, and a fully connected layer that outputs the hash codes; a channel attention module is added between the first convolution block and the dense block, and a spatial attention module is added between the dense block and the second convolution block, forming a mixed attention mechanism; the channel attention module and the spatial attention module are used to acquire region-of-interest information, so that inter-channel dependencies and salient spatial-domain features can be captured simultaneously and the distinctive differences of medical images attended to more effectively.
5. The semantic-enhanced hash medical image retrieval method based on mixed attention as claimed in claim 4, wherein: in step 3, a medical image first passes through the first convolution block to obtain a feature map X ∈ R^{C×H×W}, where H and W denote the height and width of the feature map, respectively, and C denotes the number of channels; the channel attention module compresses the input feature map with average-pooling and max-pooling operations; the channel attention module comprises two consecutive convolution layers: the first 1×1 convolution projects the pooled features into a hidden layer with fewer parameters and uses the ReLU function as its activation, while the second 1×1 convolution restores the number of channels and uses the sigmoid function as its activation; the average-pooled and max-pooled vectors are then added element-wise, weighted with a sigmoid function, and finally multiplied with the feature map X;
the channel attention module may be expressed as:
M_C(X) = σ( Conv_{1×1}(ReLU(Conv_{1×1}(AvgPool(X)))) + Conv_{1×1}(ReLU(Conv_{1×1}(MaxPool(X)))) )   (1)

X′ = M_C(X) ⊗ X   (2)

In the formula, M_C(X) is a one-dimensional channel attention map of size C × 1 × 1; Conv_{1×1} denotes a convolution operation with filter size 1 × 1; σ denotes the sigmoid function; AvgPool(·) is the average-pooling function; MaxPool(·) is the max-pooling function; ⊗ denotes element-wise multiplication; X′ is the channel-refined feature map.
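A minimal pure-Python sketch of this channel attention computation (the feature map is a nested list, the 1×1 convolutions are stand-in weight matrices `w1`/`w2`; all names and the toy weights are assumptions, not from the patent):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(x, w1, w2):
    """Squeeze each channel by average and max pooling, pass both descriptors
    through a shared two-layer 1x1-conv MLP (ReLU after the first layer),
    add the two results, apply a sigmoid, and rescale the input channels."""
    C = len(x)
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    mx = [max(max(row) for row in ch) for ch in x]

    def mlp(v):
        # w1: [hidden][C] projects down; w2: [C][hidden] restores the channels.
        hidden = [max(0.0, sum(w1[j][c] * v[c] for c in range(C))) for j in range(len(w1))]
        return [sum(w2[c][j] * hidden[j] for j in range(len(w1))) for c in range(C)]

    weights = [sigmoid(a + b) for a, b in zip(mlp(avg), mlp(mx))]
    return [[[weights[c] * v for v in row] for row in x[c]] for c in range(C)]

# Toy 2-channel 2x2 feature map with a single hidden unit.
x = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [0.0, 2.0]]]
out = channel_attention(x, [[1.0, 1.0]], [[1.0], [1.0]])
```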
6. The semantic-enhanced hash medical image retrieval method based on mixed attention as claimed in claim 5, wherein: in step 3, in order to fully utilize the feature maps and strengthen their propagation, the feature map is input into a dense block consisting of four dense layers, where the output of each dense layer is passed to every subsequent layer so as to create short paths from early layers to later layers; the spatial attention module complements the channel attention module and focuses on the most informative part of the sample; let Y ∈ R^{C×H×W} denote the feature map extracted from the last dense layer, where H and W denote the height and width of the feature map, respectively, and C denotes the number of channels; the spatial attention module can be expressed as:
M_S(Y) = σ( Conv_{7×7}([AvgPool(Y); MaxPool(Y)]) )   (3)

Y′ = M_S(Y) ⊗ Y   (4)

In the formula, M_S(Y) is a two-dimensional spatial attention map of size 1 × H × W; Conv_{7×7} denotes a convolution operation with filter size 7 × 7; σ denotes the sigmoid function; AvgPool(·) and MaxPool(·) are the average-pooling and max-pooling functions, here applied along the channel dimension; ⊗ denotes element-wise multiplication; Y′ is the spatially refined feature map;
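As a toy illustration of the spatial attention step (channel-wise average/max pooling, a small same-padding convolution over the concatenated two-channel map, sigmoid, then rescaling); the kernel here is a stand-in supplied by the caller rather than a learned 7×7 filter, and all names are hypothetical:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv2d_same(img2ch, kernel):
    """Convolve a 2-channel map with kernel[2][k][k]; zero padding keeps H x W."""
    H, W = len(img2ch[0]), len(img2ch[0][0])
    k = len(kernel[0])
    pad = k // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for ch in range(2):
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < H and 0 <= jj < W:
                            s += kernel[ch][di][dj] * img2ch[ch][ii][jj]
            out[i][j] = s
    return out

def spatial_attention(y, kernel):
    """Pool across channels at each position, convolve the [avg; max] map,
    apply a sigmoid, and rescale every channel by the resulting H x W map."""
    C, H, W = len(y), len(y[0]), len(y[0][0])
    avg = [[sum(y[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    mx = [[max(y[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    att = [[sigmoid(v) for v in row] for row in conv2d_same([avg, mx], kernel)]
    return [[[att[i][j] * y[c][i][j] for j in range(W)] for i in range(H)] for c in range(C)]

# Single-channel 2x2 map with a 1x1 stand-in kernel.
y = [[[2.0, 0.0], [0.0, 0.0]]]
out = spatial_attention(y, [[[1.0]], [[1.0]]])
```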
Finally, the deep embedding is mapped to the hash code generation layer, where it is constrained by the semantic enhancement loss, the regularization loss, and the triplet cross-entropy loss.
7. The semantic-enhanced hash medical image retrieval method based on mixed attention as claimed in claim 1, wherein: in step 4, the model based on the mixed attention mechanism and the Siamese neural network is trained by optimizing an overall loss function comprising a hash triplet term, a semantic enhancement term, and a quantization term;
the hash function maps a medical instance to a compact hash code while preserving the semantic information matching the medical image and its label in the original space; since the Hamming distance between discrete hash codes is inconvenient to optimize in a deep learning network, it is replaced by the Euclidean distance between the deep embeddings output by the linear layer; to capture relative relevance in the hash space, the basic triplet term for medical images can be expressed as:

L_tri = Σ_i max( ||h_i^Q − h_i^P||₂² − ||h_i^Q − h_i^N||₂² + δ, 0 )   (5)

In the formula, ||·||₂ denotes the ℓ2 norm, used to measure distance; h_i^Q, h_i^P and h_i^N denote the k-bit deep embeddings, not yet discretized, of Q_i, P_i and N_i; δ denotes the margin threshold;
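A minimal sketch of one such triplet term on plain-list embeddings (squared Euclidean distances with a hinge at the margin; function and argument names are assumptions):

```python
def triplet_loss(q, p, n, margin=0.5):
    """Hinge-style triplet term on deep embeddings: the anchor-positive
    distance must undercut the anchor-negative distance by at least `margin`."""
    d_qp = sum((a - b) ** 2 for a, b in zip(q, p))  # squared distance Q-P
    d_qn = sum((a - b) ** 2 for a, b in zip(q, n))  # squared distance Q-N
    return max(0.0, d_qp - d_qn + margin)
```

When the negative is already farther than the positive by more than the margin, the term is zero and contributes no gradient.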
class-level semantics help distinguish similar hash codes of different classes; to capture the class-level semantics of medical images, the learning of the hash codes is constrained using the matched images and their ground-truth labels; the semantic enhancement term can be expressed as:

L_se = Σ_i [ L_ce(Q_i, y_i^Q) + L_ce(P_i, y_i^P) + L_ce(N_i, y_i^N) ]   (6)

In the formula, L_ce(·,·) denotes the cross-entropy loss function, and y_i^Q, y_i^P and y_i^N denote the label information of Q_i, P_i and N_i, respectively;
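A sketch of the semantic enhancement term, assuming the network outputs a class probability distribution per image (the probability lists and names below are hypothetical):

```python
import math

def cross_entropy(pred_probs, true_label):
    """Cross-entropy of a predicted class distribution against the true label index."""
    return -math.log(pred_probs[true_label])

def semantic_loss(preds, labels):
    """Sum the classification loss over the anchor, positive and negative predictions."""
    return sum(cross_entropy(p, y) for p, y in zip(preds, labels))

# Toy predictions for (Q, P, N) over two classes, with their true labels.
preds = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
sem = semantic_loss(preds, [0, 1, 0])  # three maximally uncertain predictions
```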
since the triplet loss is computed on deep embeddings that have not been discretized, quantization errors arise; inspired by iterative quantization, a quantization term is used to reduce the quantization error between the deep embeddings and the hash codes; the quantization term can be expressed as:

L_qu = Σ_i ( ||b_i^Q − h_i^Q||₂² + ||b_i^P − h_i^P||₂² + ||b_i^N − h_i^N||₂² )   (7)

In the formula, ||·||₂ denotes the ℓ2 norm, used to measure distance; h_i^Q, h_i^P and h_i^N denote the k-bit deep embeddings not yet discretized; b_i^Q, b_i^P and b_i^N denote the k-bit hash codes of Q_i, P_i and N_i, respectively;
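A minimal sketch of the quantization term, assuming the hash code is the sign-binarized embedding (a common choice in hashing methods; the patent does not spell out the binarization, so `sign_hash` is an assumption):

```python
def sign_hash(embedding):
    """Binarize a deep embedding into a +/-1 hash code (assumed binarization)."""
    return [1.0 if v >= 0 else -1.0 for v in embedding]

def quantization_loss(embeddings):
    """Squared L2 gap between each embedding and its binarized hash code."""
    total = 0.0
    for h in embeddings:
        b = sign_hash(h)
        total += sum((bi - hi) ** 2 for bi, hi in zip(b, h))
    return total

# One 2-bit embedding: the second coordinate already saturates at -1.
qloss = quantization_loss([[0.5, -1.0]])
```

Driving this term to zero pushes the continuous embeddings toward ±1, so the final sign step loses as little information as possible.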
combining the above three parts, the overall loss function can be expressed as:

L_total = L_tri + α × L_se + β × L_qu   (8)

In the formula, α and β denote hyperparameters that control the weights of the loss terms.
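The weighted combination itself is a one-liner; a sketch with the default weights reported in the training claim (α = 1, β = 0.8):

```python
def total_loss(l_tri, l_se, l_qu, alpha=1.0, beta=0.8):
    """Overall loss: triplet term plus weighted semantic and quantization terms."""
    return l_tri + alpha * l_se + beta * l_qu

total = total_loss(0.5, 2.0, 0.25)
```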
8. The semantic-enhanced hash medical image retrieval method based on mixed attention as claimed in claim 7, wherein: when training the overall network model in step 4, the medical triplet images are resized to 256 × 256 and randomly sampled in each training epoch as the input of the network; the margin threshold δ of the triplet loss is set to 0.5, and the parameters α and β of the overall loss function are set to 1 and 0.8, respectively; the network optimizes the loss with the Adam optimizer at a learning rate of 0.001; performance is evaluated for hash code lengths of 8, 16, 32, 48 and 64 bits and for the 5, 10, 15, 20, 25 and 30 most similar returned images; the trained model is obtained after 100 training epochs or once the loss no longer decreases.
9. The semantic-enhanced hash medical image retrieval method based on mixed attention as claimed in claim 1, wherein: in step 5, the trained network is used to compute the average hit ratio, the average precision, and the mean reciprocal rank of the sample images in the test data set, and retrieval performance is evaluated with these three indicators; the hit ratio measures how many images in the returned list are similar to the query image; average precision averages the precision at the rank positions of images similar to the query image in the returned list, thereby measuring ranking quality; the reciprocal rank is the reciprocal of the position of the first similar image in the returned list.
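The three retrieval metrics described in this claim can be sketched directly (function names are illustrative):

```python
def hit_ratio(returned, relevant):
    """Fraction of images in the returned list that are similar to the query."""
    return sum(1 for r in returned if r in relevant) / len(returned)

def average_precision(returned, relevant):
    """Mean of precision@k taken at each rank k where a similar image appears."""
    hits, precisions = 0, []
    for k, r in enumerate(returned, 1):
        if r in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def reciprocal_rank(returned, relevant):
    """Reciprocal of the rank of the first similar image in the returned list."""
    for k, r in enumerate(returned, 1):
        if r in relevant:
            return 1.0 / k
    return 0.0
```

Averaging these three values over all queries in the test set yields the average hit ratio, the mean average precision, and the mean reciprocal rank.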
CN202111106128.2A 2021-09-22 2021-09-22 Semantic enhanced Hash medical image retrieval method based on mixed attention Pending CN113889228A (en)


Publications (1)

Publication Number Publication Date
CN113889228A true CN113889228A (en) 2022-01-04

Family

ID=79009681


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114792398A (en) * 2022-06-23 2022-07-26 阿里巴巴(中国)有限公司 Image classification method and target data classification model construction method
CN114863138A (en) * 2022-07-08 2022-08-05 腾讯科技(深圳)有限公司 Image processing method, image processing apparatus, storage medium, and device
CN115292532A (en) * 2022-06-24 2022-11-04 中南大学 Remote sensing image domain adaptive retrieval method based on pseudo label consistency learning
CN115329118A (en) * 2022-10-14 2022-11-11 山东省凯麟环保设备股份有限公司 Image similarity retrieval method and system for garbage image
CN116662490A (en) * 2023-08-01 2023-08-29 山东大学 Confusion-free text hash algorithm and confusion-free text hash device for fusing hierarchical label information



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination