CN117315347A - Cross-modal feature fusion-based image classification system - Google Patents

Cross-modal feature fusion-based image classification system

Info

Publication number
CN117315347A
CN117315347A (application CN202311253333.0A)
Authority
CN
China
Prior art keywords
gate
sparse
lstm
model
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311253333.0A
Other languages
Chinese (zh)
Inventor
王烤
吴钦木
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University
Priority to CN202311253333.0A
Publication of CN117315347A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0495 Quantised networks; Sparse networks; Compressed networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image classification system based on cross-modal feature fusion, comprising an extraction module, a fusion module and a model training module. The extraction module performs feature extraction on medical images and diagnostic reports to obtain image features and text features; the fusion module fuses the image and text features using a cross-modal attention module; and the model training module trains on the fused data using a sparse-gate-based bidirectional long short-term memory (Bi-LSTM) network module. Based on the cross-modal attention mechanism, the method can adaptively assign weights to data from different modalities, improving the model's attention to key information, better capturing the correlation between modalities, and improving the fusion of the multi-modal data. Applying sparse gates to the gating units of the Bi-LSTM reduces the computational complexity of the model, improves its robustness, effectively reduces the number of parameters, lowers the computational cost of the model, and improves its running efficiency.

Description

Cross-modal feature fusion-based image classification system
Technical Field
The invention belongs to the technical field of image classification, and relates to an image classification system based on cross-modal feature fusion.
Background
According to World Health Organization (WHO) data, lung cancer is one of the most common cancers worldwide, causing about 1.8 million deaths each year (WHO, 2020). Data from the Global Burden of Disease Study indicate that chronic obstructive pulmonary disease (COPD) is the third leading cause of death globally, causing about 3 million deaths annually. Timely diagnosis and treatment of lung diseases is therefore very important.
Radiographs are the most cost-effective diagnostic tool for pulmonary disease detection. Diagnosing such diseases from chest radiographs requires highly skilled radiologists, and manual detection of pulmonary disease is a time-consuming process that often involves subjective differences, which can delay diagnosis and treatment. Computer-aided diagnosis (CAD) has great clinical potential: it can accurately diagnose lung diseases in a short time, help doctors reduce the workload of diagnosing lung diseases, lower the misdiagnosis rate, and greatly improve diagnostic efficiency. Deep learning has been extensively studied for its general applicability to problems involving automatic feature extraction and classification. Convolutional neural network (CNN) based evaluation is widely used for image classification and object detection; a CNN applies spatial filters that automatically collect information about structures embedded in an image. Initially, selected radiographic images were classified into normal and abnormal classes by a SoftMax classifier using pre-trained deep learning systems such as AlexNet, the Visual Geometry Group networks (VGG16 and VGG19), and ResNet50. The literature "Anthimopoulos M, Christodoulidis S, Ebner L, Christe A and Mougiakakou S. 2016. Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Transactions on Medical Imaging, 35(5): 1207-1216" uses CNNs to analyze medical images to determine the severity of disease for different organs. The literature "Kawahara J, BenTaieb A and Hamarneh G. 2016. Deep features to classify skin lesions. In: 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI), 1397-1400" studied combinations of local and global context information and designed image analysis architectures at different scales. Setio et al. (2016), in "Setio A A A, Ciompi F, Litjens G, Gerke P, Jacobs C, van Riel S J, Wille M W, Naqibullah M, Sanchez C I and van Ginneken B. 2016. Pulmonary nodule detection in CT images: false positive reduction using multi-view convolutional networks. IEEE Transactions on Medical Imaging, 35(5): 1160-1169", used 3D CNNs to enhance classification performance. Although the above methods perform well on medical images, they cannot handle the multi-modal case.
Scholars have proposed various methods to process multi-modal medical data. For example, some studies model medical images and time-series data using convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The literature "Tan W, Tiwari P, Pandey H M, Moreira C and Jaiswal A K. 2020. Multimodal medical image fusion algorithm in the era of big data. Neural Computing and Applications, 1-21" proposes a multi-modal medical image fusion algorithm for the big-data era: by combining multiple medical images of different modalities and fusing the information with a deep learning algorithm, diagnostic accuracy is improved. The literature "Apostolopoulos I D and Mpesiana T A. 2020. Covid-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks. Physical and Engineering Sciences in Medicine, 43, 635-640" adopts a transfer learning method to alleviate the small-sample problem, evaluating the recognition capability of several state-of-the-art pre-trained convolutional neural networks (VGG, Inception, etc.) on chest X-ray images; knowledge extracted from the pre-trained models is transferred to the model to be trained, and good results are obtained in experiments. However, the traditional method of combining CNN and RNN ignores the computational speed of the network, and the speed of a diagnosis system is also important.
The literature "Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780" proposes the long short-term memory network (LSTM), which has the advantages of long-term memory capability, prevention of gradient problems, adaptation to variable-length sequences, processing of information at different time scales, and generalization capability. The literature "Ni Yan, Yang Yuanyuan, Xie Zhe, Zheng Dechong, Wang Weidong. 2022. Method for pulmonary nodule multi-feature extraction based on LSTM and attention structure. Journal of Shanghai Jiao Tong University, 56(08): 1078-1088" fuses the features shared among tasks through an attention mechanism and improves feature extraction for the current task; its LSTM-based classifier can effectively screen the features shared among tasks and improves the information transmission efficiency of the model. The literature "Kalia A. In-depth understanding of LSTM and its recent advances in lung disease diagnosis. World Journal of Advanced Research and Reviews, 2022, 14(3): 517-522" explores, through several practical examples, the recent progress of LSTM in the diagnosis of COVID-19 and other pulmonary diseases.
The literature "Dastider A G, Sadik F and Fattah S A. 2021. An integrated autoencoder-based hybrid CNN-LSTM model for COVID-19 severity prediction from lung ultrasound. Computers in Biology and Medicine, 132, 104296" predicts lung ultrasound severity by introducing a long short-term memory (LSTM) layer after the proposed CNN architecture; the results show a significant improvement in classification performance, on average 7-12%, over the traditional DenseNet architecture by about 17%. The literature "Choindex G J. 2021. Class dependency based learning using Bi-LSTM coupled with the transfer learning of VGG16 for the diagnosis of tuberculosis from chest x-rays. arXiv preprint arXiv:2108.04329" combines the transfer learning of VGG16 with a bidirectional LSTM to extract high-level discriminative features from segmented lung regions, followed by classification with fully connected layers; tuberculosis diagnosis accuracy on the Shenzhen and Montgomery datasets improved by 0.7% and 11.68%, respectively. The literature "Lv Qing, Zhao Kui, Cao Jilong, Wei Jingfeng. 2022. Text and image based pulmonary disease study and prediction. Acta Automatica Sinica, 48(02): 531-538" provides a lung cancer classification method based on combining images and text: electronic medical record information is introduced, and Multi-head Attention and Bi-LSTM are used to model the text, further improving the performance of the image classification model; however, the computational complexity is high, the number of parameters is large, and the computational load increases. Hong Xin et al. (2023), in "Hong Xin, Huang Kai, Yang Chenhui. 2023. Alzheimer's disease prediction CTIS model based on Bi-ConvLSTM temporal feature extraction. Journal of Image and Graphics, 28(04): 1146-1156", propose an Alzheimer's disease prediction CTIS model based on Bi-ConvLSTM temporal feature extraction, which extracts temporal features from stratified sections of brain images through a temporal convolutional bidirectional long short-term memory model (Bi-ConvLSTM) and an attention mechanism.
Although these methods have achieved some success, they often fail to fully exploit the correlation information between modalities, and they suffer from low computational efficiency when processing large-scale data.
Disclosure of Invention
The technical problem the invention aims to solve is to provide an image classification system based on cross-modal feature fusion that exploits the association information among modalities while addressing the low computational efficiency of existing methods when processing large-scale data.
The technical scheme adopted by the invention is as follows: an image classification system based on cross-modal feature fusion comprises,
the extraction module is used for performing feature extraction on the medical image and the diagnosis report to obtain image features and text features;
the fusion module adopts a cross-modal attention module to fuse the features of the image and the text;
and the model training module is used for training on the fused data by adopting a sparse-gate-based bidirectional long short-term memory (Bi-LSTM) network module.
Further, the extraction module adopts the pre-trained models ResNet50 and BERT to extract features from the medical image and the diagnosis report, respectively, yielding image features and text features.
Further, the fusion method of the cross-modal attention module adopts feature fusion based on a cross-modal attention mechanism, expressed as follows:

CMA(I, M) = softmax((I · M^T) / √d) · M   (1)

where CMA (cross-modal attention) denotes the cross-modal attention mechanism, I and M denote different modalities (M serves to enhance the representation of I), · denotes the matrix dot product, and √d is the scaling factor. The fusion of the cross-modal attention mechanism involves interactions between the image and the text, with the following fusion formula:

A_{i,j} = exp(S_{i,j}) / Σ_{k=1}^{N} exp(S_{i,k})   (2)

where A_{i,j} denotes the attention weight of the i-th modality to the j-th modality, S_{i,j} is the similarity score between the two modalities, and N denotes the total number of modalities. For each modality i, the attention weight of i to j is obtained by computing similarity scores between modality i and the other modalities j and converting these scores into a probability distribution.
Further, the sparse-gate-based bidirectional long short-term memory network module is a network combining Bi-LSTM with sparse gates. In the combined model, Bi-LSTM encodes the sequence, and the encoded sequence is then sparsified through the sparse gates: a sparse gate is added after each of the input gate, forget gate and output gate in the Bi-LSTM to control the information flow at each time step. Concretely, at each time step t, the input of the sparse gate is the Bi-LSTM output h_t at that step, and the output of the sparse gate is a gating vector g_t ∈ [0, 1] indicating whether to sparsify h_t. The sparse-gate output g_t is combined by element-wise product with the outputs of the input gate, forget gate and output gate of the Bi-LSTM, denoted i_t, f_t and o_t at time step t, to obtain the sparsified gate outputs:

ĩ_t = g_t ⊙ i_t,  f̃_t = g_t ⊙ f_t,  õ_t = g_t ⊙ o_t   (3)

where ⊙ is the element-wise (Hadamard) product operator, and ĩ_t, f̃_t and õ_t denote the sparsified input-gate, forget-gate and output-gate outputs, respectively.
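As an illustrative NumPy sketch of the gate sparsification above: a gating vector g_t is derived from the hidden state and applied element-wise to the input, forget and output gates as in equation (3). The linear map, bias and fixed binary mask stand in for quantities the model would learn, and the gate activations are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
H = 8                                     # hidden size (illustrative)
h_t = rng.standard_normal(H)              # Bi-LSTM output at time step t

# Sparse gate: sigmoid of a feature map of h_t, times a binary mask.
# The map (a, b) and the mask s are illustrative placeholders for learned
# quantities (cf. equation (4) below for the general form).
a = rng.standard_normal((H, H)) * 0.1
b = np.zeros(H)
s = (rng.random(H) > 0.5).astype(float)   # binarization mask
g_t = sigmoid(a @ h_t + b) * s

# Gate activations of the underlying LSTM at step t (illustrative values).
i_t = sigmoid(rng.standard_normal(H))     # input gate
f_t = sigmoid(rng.standard_normal(H))     # forget gate
o_t = sigmoid(rng.standard_normal(H))     # output gate

# Eq. (3): sparsified gate outputs via element-wise product with g_t.
i_sp, f_sp, o_sp = g_t * i_t, g_t * f_t, g_t * o_t
```

Components masked out by s contribute nothing downstream, which is how the mechanism prunes redundant information while leaving the remaining gate values intact.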
Further, the Bi-LSTM includes two LSTMs, one traversing the input sequence in the forward direction of time and the other in the reverse direction, called the forward LSTM and the backward LSTM, respectively. The hidden state h→_t of the forward LSTM considers only the input information at the current and previous times, while the hidden state h←_t of the backward LSTM considers only the input information at the current and later times. The outputs of the two LSTMs are spliced together to form the final Bi-LSTM output:

h_t = [h→_t, h←_t]

where [·, ·] denotes the vector concatenation operation.
Further, the sparse gate is implemented by introducing a sparse matrix. Assume the input data is X = [x_1, x_2, ..., x_n] ∈ R^{n×d}, where n is the sequence length and d the feature dimension of each time step. The sparse gate is then expressed as:

g_{i,j} = σ(a_j(x_i) + b_j) · s_{i,j}   (4)

where a_j is a function for extracting features of the input x_i, b_j is a bias term, σ is the sigmoid function, and s_{i,j} is a binarization matrix representing sparsity: when s_{i,j} = 1, the j-th feature of the i-th time step participates in the operation; when s_{i,j} = 0, the j-th feature of the i-th time step is ignored. Treating s_{i,j} as probability variables, the parameters are learned by maximizing the marginal likelihood of the model or minimizing the reconstruction error. Specifically, the LISTA sparse coding algorithm is used to optimize the model:
min_{W,S} ‖X − WS‖² + λ‖S‖₁   (5)

where X denotes the given input data matrix; the purpose of sparse coding is to learn a dictionary W that generates a sparse coding S of the input data, and λ denotes the regularization coefficient of the l₁ norm.
To solve equation (5), the traditional approach is to alternately optimize W and S, corresponding to two optimization processes: dictionary learning and sparse approximation. Specifically, by fixing S, equation (5) reduces to the following l₂-constrained optimization problem:

min_W ‖X − WS‖²  s.t.  ‖w_i‖₂ ≤ 1, ∀i   (6)
by fixing W, equation (5) is reduced to a sparse approximation problem that aims to represent the input x by a linear combination of W as follows:
an iterative hard threshold (ISTA) algorithm is one of the most popular solvers, and the iterative hard threshold (ISTA) algorithm is used to decompose the object of (7) into two parts: wherein the micro-portionsUpdated by gradient descent, the other part being l 1 And a regularization part, updated by the hard threshold operator, wherein the updating formula is as follows:
wherein s is (t) Sparse coding representing the t-th iteration, sh (λτ) =sign(s) (|s| - λt) is defined for the contraction function,representing the micro-segments, τ being a coefficient, the solution representing equation (8) is implemented by the following update rules:
wherein W is u =I-τW T W,W v =τW T
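A NumPy sketch of the ISTA iteration in equations (8)-(9). The step size τ = 1/‖W‖₂², the regularization weight, and the synthetic sparse signal are illustrative choices, not values prescribed by the system:

```python
import numpy as np

def soft_threshold(s, thr):
    """Shrinkage operator sh: sign(s) * max(|s| - thr, 0)."""
    return np.sign(s) * np.maximum(np.abs(s) - thr, 0.0)

def ista(x, W, lam=0.05, n_iter=200):
    """Sparse approximation of x by columns of W via the update of eq. (9)."""
    tau = 1.0 / np.linalg.norm(W, 2) ** 2     # step size from the spectral norm
    Wu = np.eye(W.shape[1]) - tau * W.T @ W   # W_u = I - tau * W^T W
    Wv = tau * W.T                            # W_v = tau * W^T
    s = np.zeros(W.shape[1])
    for _ in range(n_iter):
        s = soft_threshold(Wu @ s + Wv @ x, lam * tau)
    return s

rng = np.random.default_rng(4)
W = rng.standard_normal((16, 32))             # overcomplete dictionary
s_true = np.zeros(32)
s_true[[3, 10, 25]] = [1.5, -2.0, 1.0]        # 3-sparse ground truth
x = W @ s_true                                # noiseless observation
s_hat = ista(x, W, lam=0.05)
```

With a noiseless observation and a small λ, the recovered code drives the objective of (7) well below its value at s = 0; LISTA replaces the fixed W_u, W_v with learned matrices, but the iteration structure is the same.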
The beneficial effects of the invention are as follows: compared with the prior art, and aiming at the limitations of existing methods, the invention introduces a cross-modal attention mechanism and sparse gates to improve multi-modal medical data analysis. First, a method that dynamically learns the association between different modalities is designed based on the cross-modal attention mechanism. This mechanism adaptively assigns weights to data from different modalities, improving the model's attention to key information; by introducing the attention mechanism, the correlation between modalities is better captured, improving the fusion of the multi-modal data. Second, sparse gates are applied in the gating units of the Bi-LSTM to further optimize the performance of the model. The sparse gate filters out unimportant information, reducing the computational complexity of the model and improving its robustness; introducing the sparse-gate mechanism effectively reduces the number of parameters, lowers the computational cost of the model, and improves its running efficiency.
The innovation of the invention is that the cross-modal attention mechanism and the sparse gate are applied to multi-modal medical data analysis. By the method, the associated information among different modes can be fully utilized, and the calculation efficiency of the model in processing large-scale data is improved. Compared with the existing method, the system provided by the invention can better mine the potential characteristics of the multi-mode data, thereby improving the performance of the model.
Drawings
FIG. 1 is a block diagram of a proposed model structure;
FIG. 2 is a diagram of a Bi-LSTM network architecture;
FIG. 3 is a diagram of an SG-LSTM network framework;
FIG. 4 is a confusion matrix thermodynamic diagram;
fig. 5 is a graph of ROC curve for the model of the present invention versus other baseline models.
Detailed Description
The invention will be further described with reference to specific examples.
Example 1: as shown in fig. 1-5, an image classification system based on cross-modality feature fusion, includes,
the extraction module is used for performing feature extraction on the medical image and the diagnosis report to obtain image features and text features;
the fusion module adopts a cross-modal attention module to fuse the features of the image and the text;
and the model training module is used for training on the fused data by adopting a sparse-gate-based bidirectional long short-term memory (Bi-LSTM) network module.
The invention mainly performs feature extraction and fusion on medical images and medical reports. Building on the Bi-LSTM model, a sparse-gate-based Bi-LSTM model is proposed to improve the accuracy and efficiency of feature processing.
First, the ResNet50 model is used to perform feature extraction on medical images. ResNet (residual network) is a deep convolutional neural network structure that effectively alleviates the problems of vanishing and exploding gradients in deep network training; with ResNet, useful feature information can be extracted from medical images. Second, feature extraction is performed on the text using the BERT model. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model with powerful text feature extraction capability; with BERT, contextual information and semantic associations in the text data can be captured. Next, the features of the image and the text are fused using a cross-modal attention mechanism, which automatically learns the association between the image and the text so that the fused features better represent the combined information of the two data sources. Finally, the SG-Bi-LSTM model is proposed to train and test the fused features. The sparse gate is a sparsification method that efficiently denoises and removes redundant information in the network model, thereby improving the efficiency and accuracy of the model.
In summary, the basic principle of the proposed CMASG-Bi-LSTM model is as follows: features are extracted from the medical image by ResNet50 and from the text by the BERT model; the image and text features are then fused through a cross-modal attention mechanism; finally, the fused features are trained and tested with the SG-Bi-LSTM model. The combined application of these methods improves the feature representation of medical image and text data, and the introduction of the sparse gate further optimizes the efficiency and accuracy of the model; the model structure is shown in fig. 1.
The extraction module adopts the pre-trained models ResNet50 and BERT to extract features from the medical image and the diagnosis report, respectively, yielding image features and text features.
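As a minimal interface sketch of this extraction step (NumPy stubs stand in for the pre-trained networks; the 2048-d and 768-d output sizes correspond to the standard ResNet50 pooled features and BERT-base [CLS] embedding, while the projection matrices, dimensions and file names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

def resnet50_features(image_path):
    """Stub: a real implementation would run a pre-trained ResNet50
    and return its 2048-d global-pooled feature vector."""
    return rng.standard_normal(2048)

def bert_features(report_text):
    """Stub: a real implementation would run a pre-trained BERT encoder
    and return its 768-d [CLS] embedding."""
    return rng.standard_normal(768)

# Project both modalities to a shared dimension d so they can be fused
# by the cross-modal attention module downstream.
d = 256
W_img = rng.standard_normal((2048, d)) * 0.02
W_txt = rng.standard_normal((768, d)) * 0.02

img_feat = resnet50_features("chest_xray.png") @ W_img       # hypothetical input
txt_feat = bert_features("diagnostic report text ...") @ W_txt
```

The shared projection dimension is a design choice: the attention mechanism of the fusion module needs both modalities in a common feature space before similarity scores can be computed.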
The fusion method of the cross-modal attention module adopts feature fusion based on a cross-modal attention mechanism, expressed as follows:

CMA(I, M) = softmax((I · M^T) / √d) · M   (1)

where CMA (cross-modal attention) denotes the cross-modal attention mechanism, I and M denote different modalities (M serves to enhance the representation of I), · denotes the matrix dot product, and √d is the scaling factor. The fusion of the cross-modal attention mechanism involves interactions between the image and the text; the fusion formula is as follows:

A_{i,j} = exp(S_{i,j}) / Σ_{k=1}^{N} exp(S_{i,k})   (2)

where A_{i,j} denotes the attention weight of the i-th modality to the j-th modality, S_{i,j} is the similarity score between the two modalities, computed by dot product, and N denotes the total number of modalities. For each modality i, the attention weight of i to j is obtained by computing similarity scores between modality i and the other modalities j and converting these scores into a probability distribution. This fusion method helps the model better exploit information across different modalities and improves model performance.
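A NumPy sketch of this fusion step, reading the similarity scores as scaled dot products and the attention weights as their row-wise softmax (the token counts and feature dimension are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(I, M):
    """CMA(I, M): enhance features I with features M.
    I: (n_i, d) tokens of one modality, M: (n_m, d) tokens of the other."""
    d = I.shape[-1]
    S = I @ M.T / np.sqrt(d)     # similarity scores S_{i,j}, scaled by sqrt(d)
    A = softmax(S, axis=-1)      # attention weights A_{i,j}; each row sums to 1
    return A @ M, A              # attended representation of I, plus the weights

rng = np.random.default_rng(0)
I = rng.standard_normal((4, 8))  # e.g. image features (illustrative shapes)
M = rng.standard_normal((6, 8))  # e.g. text features
out, A = cross_modal_attention(I, M)
```

Each row of A is a probability distribution over the other modality's tokens, matching the description that similarity scores are converted into probability distributions.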
By extracting the image and text features and fusing them through the cross-modal attention mechanism, better features are obtained for training and testing the classification model, ultimately improving the classification performance.
The sparse-gate-based bidirectional long short-term memory network module is a network combining Bi-LSTM with sparse gates: Bi-LSTM encodes the sequence, and the encoded sequence is then sparsified through the sparse gates. Because Bi-LSTM can fully exploit the information in an input sequence, it models the input sequence to obtain a representation H of the sequence. The sparse gate further removes redundant information from the sequence while retaining its important information, thereby improving the efficiency of the model.
Bi-LSTM (Bidirectional Long Short-Term Memory) is a widely used recurrent neural network (RNN) model for processing sequence data, e.g. in language modeling, natural language processing and speech recognition. Compared with the conventional unidirectional LSTM (Long Short-Term Memory) model, Bi-LSTM can use both forward and backward context information, thereby improving the expressive power of the model, as shown in fig. 2.
Bi-LSTM comprises two LSTMs: one traverses the input sequence forward in time and the other backward in time, called the forward LSTM and the backward LSTM, respectively. The hidden state of the forward LSTM, h_t^f, considers only the input information at the current and previous time steps, while the hidden state of the backward LSTM, h_t^b, considers only the input information at the current and later time steps. The outputs of the two LSTMs are concatenated to form the final Bi-LSTM output:

h_t = [h_t^f, h_t^b]    (3)

wherein [·,·] denotes the vector concatenation operation.
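The concatenation of forward and backward hidden states described above can be sketched with PyTorch's built-in bidirectional LSTM; the sequence length, batch size, and feature dimensions below are toy values chosen for illustration, not the patent's settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch, d_in, d_hid = 5, 2, 8, 16   # toy sizes (illustrative)
x = torch.randn(seq_len, batch, d_in)

# bidirectional=True runs a forward and a backward LSTM over the input
# and concatenates their hidden states at each time step
bilstm = nn.LSTM(input_size=d_in, hidden_size=d_hid, bidirectional=True)
out, (h_n, c_n) = bilstm(x)

# the concatenation doubles the per-step feature dimension
assert out.shape == (seq_len, batch, 2 * d_hid)
```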
Among them, the Sparse Gate is a novel gating mechanism whose main role is to control the flow of information and to improve the representation and generalization capability of the neural network. Compared with traditional gating mechanisms (such as those of LSTM and GRU), the sparse gate filters noise information more effectively and improves the robustness and interpretability of the network. It is implemented by introducing a sparse matrix. Assume the input data is X = [x_1, x_2, ..., x_n] ∈ R^(n×d), where n denotes the sequence length and d the feature dimension of each time step; the sparse gate is then expressed as:

g_{i,j} = σ(a_j(x_i) + b_j) · s_{i,j}    (4)

wherein a_j is a function that extracts features of the input x_i, b_j is a bias term, σ is the sigmoid function, and s_{i,j} is a binary matrix representing sparsity: when s_{i,j} = 1, the j-th feature of the i-th time step participates in the operation; when s_{i,j} = 0, the j-th feature of the i-th time step is ignored. Treating s_{i,j} as a probability variable, the parameters are learned by maximizing the marginal likelihood of the model or by minimizing the reconstruction error. Specifically, the LISTA sparse coding algorithm is used to optimize the model:
wherein X ∈ R^(n×d) denotes the given input data matrix; the purpose of sparse coding is to learn a dictionary W and to generate a sparse code S of the input data:

min_{W,S} (1/2)‖X − WS‖_F^2 + λ‖S‖_1    (5)

where λ denotes the regularization coefficient of the l_1 norm.

To solve equation (5), the traditional approach is to optimize W and S alternately, which corresponds to two optimization processes: dictionary learning and sparse approximation. Specifically, by fixing S, equation (5) reduces to the following l_2-constrained optimization problem:

min_W ‖X − WS‖_F^2  s.t. ‖w_i‖_2 ≤ 1    (6)

By fixing W, equation (5) reduces to a sparse approximation problem that aims to represent the input x by a linear combination of the columns of W:

min_s (1/2)‖x − Ws‖_2^2 + λ‖s‖_1    (7)
The iterative shrinkage-thresholding algorithm (ISTA) is one of the most popular solvers; it decomposes the objective of (7) into two parts: a differentiable part f(s) = (1/2)‖x − Ws‖_2^2, updated by gradient descent, and an l_1-regularization part, updated by the shrinkage (soft-thresholding) operator. The update formula is as follows:

s^(t+1) = sh_{λτ}(s^(t) − τ∇f(s^(t)))    (8)

wherein s^(t) denotes the sparse code at the t-th iteration, the shrinkage function is defined as sh_{λτ}(s) = sign(s) · max(|s| − λτ, 0), ∇f denotes the gradient of the differentiable part, and τ is the step-size coefficient. The solution of equation (8) is implemented by the following update rule:

s^(t+1) = sh_{λτ}(W_u s^(t) + W_v x)    (9)

wherein W_u = I − τW^T W and W_v = τW^T.
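The ISTA update rule above (a gradient step followed by shrinkage, written in the W_u/W_v form) can be sketched as follows; the toy problem sizes, the step size τ = 1/‖W‖₂², and the iteration count are illustrative assumptions.

```python
import numpy as np

def soft_threshold(s, thr):
    # shrinkage operator: sign(s) * max(|s| - thr, 0)
    return np.sign(s) * np.maximum(np.abs(s) - thr, 0.0)

def ista(x, W, lam, n_iter=300):
    """ISTA for min_s 0.5*||x - W s||^2 + lam*||s||_1,
    iterating s <- sh(W_u s + W_v x) with the matrices below."""
    tau = 1.0 / np.linalg.norm(W, 2) ** 2       # step size 1/L (spectral norm)
    W_u = np.eye(W.shape[1]) - tau * (W.T @ W)  # W_u = I - tau W^T W
    W_v = tau * W.T                             # W_v = tau W^T
    s = np.zeros(W.shape[1])
    for _ in range(n_iter):
        s = soft_threshold(W_u @ s + W_v @ x, lam * tau)
    return s

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 10))       # toy dictionary
s_true = np.zeros(10)
s_true[2], s_true[7] = 1.5, -2.0        # sparse ground-truth code
x = W @ s_true
s_hat = ista(x, W, lam=0.05)            # recovers an approximately sparse code
```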
The sparse-gate-based bidirectional long short-term memory network module adds one sparse gate after each of the input gate, the forget gate, and the output gate in the Bi-LSTM, controlling the information flow at each time step in the model. The module is specifically implemented as follows: at each time step t, the input of the sparse gate is the Bi-LSTM output h_t at that time step, and the output of the sparse gate is a binary vector g_t ∈ {0,1}^d indicating whether h_t is sparsified. The sparse-gate output g_t is combined by element-wise product with the outputs of the input gate, the forget gate, and the output gate in the Bi-LSTM to obtain the sparsified gate outputs. Let the outputs of the input gate, the forget gate, and the output gate at time step t be i_t, f_t, o_t, respectively; the sparsified gate outputs are then:

ĩ_t = i_t ⊙ g_t,  f̃_t = f_t ⊙ g_t,  õ_t = o_t ⊙ g_t    (10)

wherein ⊙ is the element-wise (dot) product operator, and ĩ_t, f̃_t, and õ_t denote the outputs of the sparsified input gate, the sparsified forget gate, and the sparsified output gate, respectively.
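A hedged sketch of combining a Bi-LSTM with a binary sparse gate follows. For simplicity it masks the Bi-LSTM output h_t rather than the three internal gates described above, since gating i_t, f_t, o_t directly would require a custom LSTM cell; all class and attribute names are illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class SparseGatedBiLSTM(nn.Module):
    """Sketch: a Bi-LSTM encodes the sequence; a thresholded sigmoid
    produces a binary vector g_t that masks the output h_t element-wise.
    Masking h_t (instead of the internal gates) is a simplification."""
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.bilstm = nn.LSTM(d_in, d_hid, bidirectional=True)
        self.gate = nn.Linear(2 * d_hid, 2 * d_hid)

    def forward(self, x):
        h, _ = self.bilstm(x)                             # h_t per time step
        g = (torch.sigmoid(self.gate(h)) > 0.5).float()   # binary g_t
        return h * g                                      # sparsified output

torch.manual_seed(0)
model = SparseGatedBiLSTM(d_in=8, d_hid=16)
y = model(torch.randn(5, 2, 8))   # (seq_len=5, batch=2, features=8), toy sizes
```

Note that the hard threshold used to binarize g_t is not differentiable; training such a mask end-to-end would require, for example, a straight-through estimator or a continuous relaxation.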
To illustrate the effect of the present invention, the following simulation experiments were performed:
1) Data set and preprocessing
The NLMCXR dataset comes from the National Library of Medicine of the National Institutes of Health, specifically the NLM Open-i Indiana chest x-ray dataset, and is mainly used for studies such as automatic medical image diagnosis and disease classification. The dataset contains a series of chest radiographs and corresponding diagnostic reports: 7470 images (including frontal and lateral views) and 3955 diagnostic reports.
2) Experimental setup
The experiment is implemented in Python under GPU acceleration using the PyTorch deep learning framework, on a computer running Windows 11 with 128 GB of memory and a GeForce RTX 4090 GPU with 24 GB of video memory. The ResNet50 and BERT pre-trained models are used to preprocess the images and reports of the dataset and to obtain data features. To make full use of the feature information of the image reports, the invention proposes a feature fusion method based on a cross-modal attention mechanism to fuse the image and text information, and trains and makes decisions on the fused features. SGBi-LSTM is a gating mechanism that improves Bi-LSTM. The standard gating mechanism of Bi-LSTM includes input gates, forget gates, and output gates, which control the flow of information in the LSTM. However, the standard gating mechanism may suffer from accumulation and interference of information, degrading model performance. The sparse gate is added to reduce the gated information flow and to lessen the accumulation and interference of information, thereby improving model performance. Table 1 lists the parameter settings of this experiment.
Table 1 experimental parameter settings
3) Experimental evaluation index
The detection/diagnostic performance of CAD systems is typically measured by indices such as Recall, Accuracy, the receiver operating characteristic, and the confusion matrix. For a region of interest (ROI) in a medical image, Positive or Negative describes whether the ROI is a lesion or a non-lesion, and True or False expresses whether the judgment is correct. The detection/diagnosis result output by the CAD system may then be: (1) True Positive (TP): the diagnosis is positive and the subject's true value is positive; (2) True Negative (TN): the diagnosis is negative and the subject's true value is also negative; (3) False Positive (FP): the diagnosis is positive but the subject's true value is negative; (4) False Negative (FN): the diagnosis is negative but the subject's true value is positive. The calculation formulas of Recall and Accuracy are expressed as follows:
Recall=TPR=TP/(TP+FN)×100% (11)
Accuracy=(TP+TN)/(TP+TN+FP+FN)×100% (12)
FPR=FP/(FP+TN)×100% (13)
Recall, also known as the True Positive Rate (TPR), is the rate at which abnormal regions are correctly identified as positive and measures the system's true-positive identification performance. Accuracy is the proportion of subjects, both truly positive and truly negative, that are correctly identified. The ROC curve shows the relationship between the true positive rate (True Positive Rate, TPR) and the false positive rate (False Positive Rate, FPR) of a binary classifier at various thresholds, where the TPR is computed over the samples that are actually positive and the FPR over the samples that are actually negative.
The abscissa of the ROC curve is the FPR and the ordinate is the TPR; the closer the ROC curve is to the upper-left corner, the better the classifier performance. In addition, the area under the ROC curve, called the AUC (Area Under the Curve), can serve as an evaluation index: the larger the area, the better the classifier performance.
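For instance, the AUC can be approximated by the trapezoidal rule over a classifier's ROC operating points; the (FPR, TPR) points below are made up for illustration only.

```python
import numpy as np

# Illustrative ROC operating points (FPR ascending, with its TPR per threshold)
fpr = np.array([0.0, 0.1, 0.3, 1.0])
tpr = np.array([0.0, 0.7, 0.9, 1.0])

# AUC via the trapezoidal rule: sum of trapezoid areas between points
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0)
print(auc)  # 0.86 for these toy points
```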
The Confusion Matrix is a common method of evaluating classifier performance. The confusion matrix is a two-dimensional matrix in which rows represent actual categories and columns represent predicted categories. The four elements in the confusion matrix are: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). Using the confusion matrix, a number of metrics can be calculated, such as Accuracy, Recall, Precision, and the F1 value (F1-Score), which are also important metrics for evaluating model performance.
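The metrics derived from these four counts can be computed directly; the TP/TN/FP/FN values below are illustrative only.

```python
# Illustrative confusion-matrix counts (not experimental results)
TP, TN, FP, FN = 80, 90, 10, 20

recall    = TP / (TP + FN)                   # Eq. (11), true positive rate
accuracy  = (TP + TN) / (TP + TN + FP + FN)  # Eq. (12)
fpr       = FP / (FP + TN)                   # false positive rate, FP/(FP+TN)
precision = TP / (TP + FP)
f1        = 2 * precision * recall / (precision + recall)

print(recall, accuracy, fpr)  # 0.8 0.85 0.1
```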
4) Experimental results and analysis
After training the model provided by the invention, 20% of the samples are used for testing. Experiments show that on the public NLMCXR test set, the accuracy of the proposed network model reaches 95.98%; to observe the data more intuitively, the confusion matrix heat map is drawn as shown in figure 5.
In order to illustrate the effectiveness and superiority of the proposed algorithm, the invention performs baseline model comparisons and ablation experiment comparisons. First, the ROC curves of the proposed model against other baseline models are shown in FIG. 5, and the comparison of the relevant indices between the proposed model and other traditional baseline models is shown in Table 3. As can be seen intuitively from fig. 5, the ROC of the proposed model is significantly better than that of the other baseline and existing models. As is clear from table 3, the VGG16 model performs worst in the experiment in both Accuracy and AUC, ResNet50 performs relatively better, the model provided by the invention performs best on both indices, and second only to it is the Inceptionv3+Bi-LSTM-Attention model, which verifies the effectiveness of the proposed model.
Table 3 Algorithm of the invention on NLMCXR dataset compared to Algorithm of other baseline models
Next, ablation experiments are performed on the same dataset using Bi-LSTM, CMA+Bi-LSTM, SG+Bi-LSTM, and CMA+SG+Bi-LSTM in combination with the CNN feature extractor (ResNet50), and the experimental results are compared using the relevant evaluation indices, as shown in table 4. The comparison shows that the model of the invention is optimal in both Accuracy and AUC, while plain Bi-LSTM performs worst among the ablation models. The SG+Bi-LSTM model introduces a sparse gate on top of Bi-LSTM and reduces the average test inference time (T_refer) by 60%, fully demonstrating that the sparse gate reduces the amount of computation and increases operating speed; comparing CMA+SG+Bi-LSTM (ours) with CMA+Bi-LSTM, the sparse gate reduces the inference time by about 65%, proving the superiority of the proposed model.
Table 4 results of the model of the present invention compared to other ablation models
The CMASG-Bi-LSTM based medical image classification model provided by the invention can effectively fuse cross-modal features and improve the performance and efficiency of the model. First, the cross-modal attention mechanism allows the model to dynamically adjust the weight of each modality, thereby better utilizing the information of different modalities. Second, the sparse gate mechanism reduces the influence of redundant information and improves the computational efficiency and robustness of the model. Finally, comparisons with the traditional models VGG16 and ResNet50 and the existing model Inceptionv3+Bi-LSTM-Attention, together with ablation experiments combining the proposed model with a CNN feature extractor (ResNet50), show strong competitiveness, verifying the effectiveness and superiority of the model.
In summary, the model provided by the invention has remarkable advantages in the field of medical image processing, in particular medical image classification tasks, and provides valuable tools and methods for the field of medical image analysis.
The foregoing is merely illustrative of the present invention, and the scope of the present invention is not limited thereto, and any person skilled in the art can easily think about variations or substitutions within the scope of the present invention, and therefore, the scope of the present invention shall be defined by the scope of the appended claims.

Claims (6)

1. An image classification system based on cross-modal feature fusion, characterized by comprising:
the extraction module, which extracts the features of the medical image and of the diagnosis report, i.e., of the image and the text;
the fusion module, which fuses the image and text features using a cross-modal attention module; and
the model training module, which trains on the fused data using a sparse-gate-based bidirectional long short-term memory network module.
2. An image classification system based on cross-modality feature fusion as claimed in claim 1, wherein: the extraction module adopts pre-training models ResNet50 and BERT to respectively extract the characteristics of the medical image and the diagnosis report.
3. An image classification system based on cross-modality feature fusion as claimed in claim 1, wherein: the cross-modal attention module adopts a feature fusion method based on a cross-modal attention mechanism, the cross-modal attention mechanism being represented as follows:
CMA(I, M) = softmax(I · M^T / √d) · M    (1)

wherein CMA (cross-modal attention) denotes the cross-modal attention mechanism, I and M denote different modalities, M serving to enhance the representation of I, · denotes the matrix dot product, and √d denotes the scaling factor. The fusion of the cross-modal attention mechanism involves the interaction between the image and the text; the fusion formula of the cross-modal attention mechanism is as follows:

A_{i,j} = exp(S_{i,j}) / Σ_{k=1}^{N} exp(S_{i,k})    (2)
wherein A_{i,j} denotes the attention weight of the i-th modality to the j-th modality, S_{i,j} is the similarity score between the two modalities, and N denotes the total number of modalities; for each modality i, the attention weight of i to j is obtained by computing the similarity scores between modality i and every other modality j and then converting these scores into a probability distribution.
4. An image classification system based on cross-modality feature fusion as claimed in claim 1, wherein: the sparse-gate-based bidirectional long short-term memory network module is a network combining Bi-LSTM with sparse gates; in the model combining the sparse gate and the Bi-LSTM, the Bi-LSTM encodes the sequence, and the encoded sequence is then sparsified through the sparse gate; the module adds one sparse gate after each of the input gate, the forget gate, and the output gate in the Bi-LSTM, controlling the information flow at each time step in the model, and is specifically implemented as follows: at each time step t, the input of the sparse gate is the Bi-LSTM output h_t at that time step, and the output of the sparse gate is a binary vector g_t ∈ {0,1}^d indicating whether h_t is sparsified; the sparse-gate output g_t is combined by element-wise product with the outputs of the input gate, the forget gate, and the output gate in the Bi-LSTM to obtain the sparsified gate outputs; letting the outputs of the input gate, the forget gate, and the output gate at time step t be i_t, f_t, o_t, respectively, the sparsified gate outputs are:

ĩ_t = i_t ⊙ g_t,  f̃_t = f_t ⊙ g_t,  õ_t = o_t ⊙ g_t    (10)

wherein ⊙ is the element-wise (dot) product operator, and ĩ_t, f̃_t, and õ_t denote the outputs of the sparsified input gate, the sparsified forget gate, and the sparsified output gate, respectively.
5. An image classification system based on cross-modality feature fusion as claimed in claim 4, wherein: Bi-LSTM comprises two LSTMs, one traversing the input sequence forward in time and the other backward in time, called the forward LSTM and the backward LSTM, respectively; the hidden state of the forward LSTM, h_t^f, considers only the input information at the current and previous time steps, while the hidden state of the backward LSTM, h_t^b, considers only the input information at the current and later time steps; the outputs of the two LSTMs are concatenated to form the final Bi-LSTM output:

h_t = [h_t^f, h_t^b]    (3)

wherein [·,·] denotes the vector concatenation operation.
6. An image classification system based on cross-modality feature fusion as claimed in claim 4, wherein: the sparse gate is implemented by introducing a sparse matrix; assuming the input data is X = [x_1, x_2, ..., x_n] ∈ R^(n×d), where n denotes the sequence length and d the feature dimension of each time step, the sparse gate is expressed as:

g_{i,j} = σ(a_j(x_i) + b_j) · s_{i,j}    (4)

wherein a_j is a function that extracts features of the input x_i, b_j is a bias term, σ is the sigmoid function, and s_{i,j} is a binary matrix representing sparsity: when s_{i,j} = 1, the j-th feature of the i-th time step participates in the operation; when s_{i,j} = 0, the j-th feature of the i-th time step is ignored; treating s_{i,j} as a probability variable, the parameters are learned by maximizing the marginal likelihood of the model or by minimizing the reconstruction error; specifically, the LISTA sparse coding algorithm is used to optimize the model:
wherein X ∈ R^(n×d) denotes the given input data matrix; the purpose of sparse coding is to learn a dictionary W and to generate a sparse code S of the input data:

min_{W,S} (1/2)‖X − WS‖_F^2 + λ‖S‖_1    (5)

where λ denotes the regularization coefficient of the l_1 norm.

To solve equation (5), W and S are optimized alternately, which corresponds to two optimization processes: dictionary learning and sparse approximation; specifically, by fixing S, equation (5) reduces to the following l_2-constrained optimization problem:

min_W ‖X − WS‖_F^2  s.t. ‖w_i‖_2 ≤ 1    (6)

By fixing W, equation (5) reduces to a sparse approximation problem that aims to represent the input x by a linear combination of the columns of W:

min_s (1/2)‖x − Ws‖_2^2 + λ‖s‖_1    (7)

The objective of (7) is decomposed into two parts by the iterative shrinkage-thresholding algorithm: a differentiable part f(s) = (1/2)‖x − Ws‖_2^2, updated by gradient descent, and an l_1-regularization part, updated by the shrinkage operator; the update formula is as follows:

s^(t+1) = sh_{λτ}(s^(t) − τ∇f(s^(t)))    (8)

wherein s^(t) denotes the sparse code at the t-th iteration, the shrinkage function is defined as sh_{λτ}(s) = sign(s) · max(|s| − λτ, 0), ∇f denotes the gradient of the differentiable part, and τ is the step-size coefficient; the solution of equation (8) is implemented by the following update rule:

s^(t+1) = sh_{λτ}(W_u s^(t) + W_v x)    (9)

wherein W_u = I − τW^T W and W_v = τW^T.
CN202311253333.0A 2023-09-26 2023-09-26 Cross-modal feature fusion-based image classification system Pending CN117315347A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311253333.0A CN117315347A (en) 2023-09-26 2023-09-26 Cross-modal feature fusion-based image classification system


Publications (1)

Publication Number Publication Date
CN117315347A true CN117315347A (en) 2023-12-29

Family

ID=89287820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311253333.0A Pending CN117315347A (en) 2023-09-26 2023-09-26 Cross-modal feature fusion-based image classification system

Country Status (1)

Country Link
CN (1) CN117315347A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117522877A * 2024-01-08 2024-02-06 吉林大学 Method for constructing chest multi-disease diagnosis model based on visual self-attention
CN117522877B * 2024-01-08 2024-04-05 吉林大学 Method for constructing chest multi-disease diagnosis model based on visual self-attention
CN117558459A * 2024-01-10 2024-02-13 中国科学技术大学 Memory-driven medical multi-mode content analysis and generation method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination