CN113469962B - Feature extraction and image-text fusion method and system for cancer lesion detection - Google Patents

Feature extraction and image-text fusion method and system for cancer lesion detection

Info

Publication number
CN113469962B
CN113469962B
Authority
CN
China
Prior art keywords
image
feature extraction
text
feature
detection
Prior art date
Legal status
Active
Application number
CN202110705588.0A
Other languages
Chinese (zh)
Other versions
CN113469962A (en)
Inventor
吴晶晶
宋余庆
刘哲
Current Assignee
Jiangsu University
Original Assignee
Jiangsu University
Priority date
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202110705588.0A
Publication of CN113469962A
Application granted
Publication of CN113469962B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10072 Tomographic images
    • G06T2207/10081 Computed x-ray tomography [CT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30004 Biomedical image processing
    • G06T2207/30096 Tumor; Lesion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Apparatus For Radiation Diagnosis (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a feature extraction and image-text fusion method and system for cancer lesion detection. CT image data and text data of a patient are acquired; the CT image data are preprocessed and then subjected to a convolution operation and normalization in sequence to obtain an initial feature map F. A Darknet-53 network is adopted as the base network to extract features from the CT image data, and a BERT model is adopted to extract text features from the patient's text data. The feature map and the text features are spliced, realizing the fusion of text and image features and guiding the network in detecting cancer lesions. The fused image-text information is then up-sampled and laterally connected to form pyramid-like feature levels; corresponding detection heads are applied at the different levels, and the position and confidence value of the lesion region in the feature map are calculated to obtain the specific position of the lesion region in the original image, thereby realizing detection of the cancerous region.

Description

Feature extraction and image-text fusion method and system for cancer lesion detection
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a feature extraction and image-text fusion method and system for cancer lesion detection.
Background
In cancer lesion detection, medical images serve as an auxiliary means that effectively helps examiners identify lesion regions so that patients can receive timely treatment. Conventional medical image reading requires the medical knowledge and rich experience of a clinician to analyze the images, and is prone to problems such as missed detection and false detection. Furthermore, manually labeling lesions in medical images often requires a significant amount of time and effort. Therefore, introducing target detection methods from computer vision to detect cancer lesion regions not only effectively avoids interference from human factors but also improves the detection precision of cancer lesion regions.
Traditional machine learning methods mainly extract relevant features from information such as probability density functions or gradient histograms to improve the detector's accuracy in identifying lesion regions in medical images. However, such methods not only require prior knowledge for manual intervention in the overall detection model, but also have difficulty effectively mining deep semantic features in the image.
Deep learning-based cancer lesion detection methods mainly use a convolutional neural network (CNN) to extract feature information about cancer lesions and detect cancer according to that information. On the one hand, a conventional convolutional neural network extracts feature information only according to pixel differences, and because the various lesion regions differ greatly in size, shape and density, the feature information of the lesion region cannot be fully mined. On the other hand, a great number of deep learning-based cancer lesion detection methods focus only on detection at the medical-image level and neglect the various lesion stages, whose features differ considerably.
Disclosure of Invention
In order to solve the defects in the prior art, the invention provides a feature extraction and image-text fusion method and system for cancer lesion detection, which are used for solving the problems of insufficient image features extracted by a detection network and low detection precision in the prior art.
The technical scheme adopted by the invention is as follows:
a feature extraction and image-text fusion method for cancer lesion detection comprises the following steps:
S1, acquiring patient personal information recorded by a clinician, CT image data of a patient and a text report corresponding to the CT image data;
S2, preprocessing the CT image data so that its format can be directly fed into a neural network as input for training or prediction; performing a convolution operation and normalization on the CT image data in sequence to obtain an initial feature map F;
S3, adopting a Darknet-53 network as the base network and extracting features from the CT image data;
S4, extracting text features from the patient personal information and the text report corresponding to the CT image data by adopting a BERT model;
S5, splicing the feature map obtained in S3 with the text features to realize fusion of the text features and the image features and guide the network in detecting cancer lesions;
S6, carrying out an up-sampling operation and lateral connections on the fused image-text information to form pyramid-like feature levels, applying corresponding detection heads on the different feature levels, and calculating the position and confidence value of the lesion region in the feature map to obtain the specific position of the lesion region in the original image, thereby realizing detection of the cancerous region.
Further, the Darknet-53 network is formed by alternating a plurality of downsampling layers and Residual Blocks, wherein one downsampling layer together with one Residual Block is called a feature extraction block, and 5 feature extraction blocks form the CT image data feature extraction network; the input of each feature extraction block is the output of the previous feature extraction block, and its output serves as the input of the next feature extraction block.
Further, the numbers of times the feature extraction block operation is performed in the 5 feature extraction blocks are, in sequence, 1, 2, 8, 8 and 4.
further, the feature extraction block operation of each feature extraction block is as follows:
S3.1, performing downsampling operation on the feature map in a downsampling layer, and gradually reducing the size of the feature map;
S3.2, screening and weighting the feature map in the channel dimension, and highlighting information in the effective region by using pooling operations along the channel direction; expressed as F' = M_c(F) ⊗ F, wherein M_c(F) represents the feature extraction operation performed on F in the channel dimension;
S3.3, supplementing position information in the spatial dimension, learning a spatial weight map by utilizing the relations among different spatial positions, and modeling with the relations of features in space, thereby supplementing the positional relation information that the channel attention mechanism cannot acquire well; expressed as F'' = M_s(F') ⊗ F', wherein M_s(F') represents the feature extraction operation performed on F' in the spatial dimension and ⊗ represents element-wise multiplication;
S3.4, performing a convolution operation on the feature map F'' obtained in step S3.3, with a 3×3 convolution kernel, a stride of 1, Batch Normalization for normalization, and LeakyReLU as the activation function.
Further, the BERT model in S4 is a multi-layer bi-directional Transformer encoder; the Transformer comprises an encoder mechanism and a decoder mechanism, the encoder receiving text as input and the decoder being mainly responsible for predicting the result; the Transformer stacks 6 layers of encoders and 6 layers of decoders, for a total of 12 encoder-decoder layers.
Further, the Transformer needs no recurrence, but processes all words or symbols in the sequence in parallel, uses a self-attention mechanism to relate distant words to the context, and fully considers the context information; the calculation rule of the attention mechanism is as follows: for each word there are 3 different vectors, namely a Query vector Q, a Key vector K and a Value vector V; a score value is calculated for each vector, where score = Q × K; next, the score is passed through a softmax activation function and multiplied point-wise with the vector V to obtain the weighted score V of each vector; several attention heads are stacked to obtain a Multi-Head Attention module, the output result Z = ΣV is obtained after addition, and normalization processing is then carried out; the input of the decoder comprises the output of the encoder and the output of the previous decoder layer, and the final result is output after being activated by Linear and Softmax respectively.
Further, 3 detection boxes are set in each cell of the feature map, so the prediction vector length of each cell is 3 × (4+4+1) = 27, where 3 is the number of detection boxes per cell, the first 4 corresponds to the 4 classes of the dataset, the second 4 is the 4 position offsets of each detection box, and 1 is the confidence value that the detection box contains a target; the final image feature is thus represented as a 1×27-dimensional vector, and the acquired text feature is represented as a 1×N-dimensional vector; finally a 1×(27+N)-dimensional vector is obtained, which is the fused image-text information used by the subsequent detection network.
A feature extraction and image-text fusion system for cancer lesion detection comprises a data acquisition module, a feature extraction module for CT images and text data, an image-text fusion module and an image-text information detection module;
the data acquisition module is used for acquiring the patient's personal information, the patient's CT image and the text report corresponding to the CT image; the patient personal information includes the patient's disease information, health status and the like;
The feature extraction module of the CT image and the text data comprises a CT image format conversion unit, a CT image feature extraction unit and a text data feature extraction unit;
The feature map and the text features are spliced in the image-text fusion module;
The image-text information detection module calculates coordinates of a lesion area and a confidence value according to the guidance of the text characteristics, wherein the coordinates are used for framing the position of the lesion area in an original image, the confidence value is used for judging the type of the lesion, and a final result is output;
Further, the CT image format conversion unit is image format conversion software used for converting the CT image format acquired by the scanner so as to facilitate subsequent processing; the CT image feature extraction unit contains a CT image data feature extraction network, which adopts a Darknet-53 network as the base network; the Darknet-53 network is composed of 5 feature extraction blocks, and each feature extraction block comprises a downsampling layer and a Residual Block; the CT image after format conversion is taken as the input of the first feature extraction block, and among the five feature extraction blocks the output of the previous feature extraction block is taken as the input of the next feature extraction block; the output of the last feature extraction block is the feature map used for fusion with the text features; the text data feature extraction unit adopts a BERT model, text data are input into the BERT model, context relations between words are established, and text features are acquired at the sentence level.
The invention has the beneficial effects that:
The feature extraction and image-text fusion method and system for cancer lesion detection introduce feature description information about each lesion stage into the cancer lesion detection method, so that the identification accuracy of cancer lesion regions can be improved and the severity of the cancer lesion can be effectively graded.
In addition, the feature extraction and image-text fusion method and system for cancer lesion detection adopt an extraction method combining the channel dimension and the spatial dimension to improve the identification of lesion regions with large differences, and use a BERT network to fuse image and text features by incorporating the text description of each lesion stage, thereby achieving effective detection of cancer lesion regions and effective judgment of the cancer stage.
Drawings
FIG. 1 is an overall flow chart of the method of the present invention;
FIG. 2 is a schematic workflow diagram of a CT image feature extraction module according to the method of the present invention;
FIG. 3 is a schematic diagram of a network structure of a text feature extraction module of the method of the present invention;
FIG. 4 is a schematic diagram of an encoder-decoder structure of the method of the present invention;
FIG. 5 is an expanded schematic view of the encoder-decoder structure of FIG. 4;
FIG. 6 is a schematic diagram of the structure of a detection module according to the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The following describes in detail the feature extraction and image-text fusion method for cancer lesion detection according to the present invention with reference to FIGS. 1 to 5; the CT image data used in the present invention are all exemplified by liver CT images.
S1, the data acquisition stage, which specifically comprises the acquisition of CT image data and text data; a high-precision German Siemens 52-ring scanner (Positron Emission Tomography, PET-CT) is adopted to generate a large number of plain abdominal scan slices; in the process of generating the CT image data, a text report is generated, and the text report together with the patient's personal information recorded by the clinician (including disease information, health status, etc.) forms the required text dataset. The CT image data and text data of each patient are placed in one-to-one correspondence and stored in a database for use by the subsequent image feature extraction module and text feature extraction module.
S2, the CT image data preprocessing stage, in which the CT image data are preprocessed into a format accepted by the image feature extraction network. The scan slice format directly generated by the machine is DICOM (.dcm), which cannot be directly fed as input into the neural network for training or prediction; it is converted into the .jpg or .png image format, or into a two-dimensional array, by means of related Python libraries. During the conversion, attention should be paid to the window level and window width limits corresponding to the liver (or other organs and tissues), i.e. the limits on the Hounsfield unit (HU) value, so as to maximize the contrast with the surrounding organs (make the liver as visible as possible). For the liver, according to the experience of professional radiologists, the HU value is limited to the range [-100, 240]; -100 and 240 are the end points of the window, and the distance between the two end points is the window width. The limiting formula is shown below:
HU_{i,j} = PV_{i,j} × slope + intercept
where HU_{i,j} represents the HU value at pixel (i, j), PV_{i,j} represents the pixel value of the original scan slice at pixel (i, j), and slope and intercept are the slope and intercept, respectively, of the corresponding linear operation.
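For illustration only, the following is a minimal Python sketch of this preprocessing step, assuming the slope and intercept are read from the DICOM RescaleSlope/RescaleIntercept tags via the pydicom library (the patent only mentions related Python libraries, not a specific one) and using the liver window [-100, 240] described above:

```python
import numpy as np
import pydicom  # illustrative choice; the patent does not name a specific library

def dicom_to_windowed_image(dcm_path, hu_min=-100, hu_max=240):
    """Convert a .dcm scan slice to a windowed 8-bit image array (liver window)."""
    ds = pydicom.dcmread(dcm_path)
    pv = ds.pixel_array.astype(np.float32)            # PV_{i,j}: raw pixel values
    slope = float(getattr(ds, "RescaleSlope", 1.0))
    intercept = float(getattr(ds, "RescaleIntercept", 0.0))
    hu = pv * slope + intercept                       # HU_{i,j} = PV_{i,j} * slope + intercept
    hu = np.clip(hu, hu_min, hu_max)                  # limit HU to the [-100, 240] liver window
    img = (hu - hu_min) / (hu_max - hu_min) * 255.0   # rescale the window to 0-255 for .jpg/.png export
    return img.astype(np.uint8)
```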
S3, the CT image data feature extraction stage, in which a Darknet-53 network is adopted as the base network; the Darknet-53 network is formed by alternating a plurality of downsampling layers and Residual Blocks; specifically, one downsampling layer together with one Residual Block is called a feature extraction block, and 5 feature extraction blocks form the CT image data feature extraction network. The downsampling layer is responsible for gradually reducing the size of the feature map, and the Residual Block is responsible for extracting features from the input image, extracting high-level semantic information that is more expressive and robust for use by the subsequent detection modules; the input of each feature extraction block is the output of the previous feature extraction block, and the current output serves as the input of the next feature extraction block. The specific structure is shown in fig. 2.
In this embodiment, a convolution operation is performed on the preprocessed CT image data of size 512×512×3, with a 3×3 convolution kernel, a stride of 1, Batch Normalization for normalization, LeakyReLU as the activation function and 32 output channels, giving a feature map F of size 512×512×32. This feature map is input into the first feature extraction block, and the specific operation steps within the feature extraction block are as follows:
S3.1, a downsampling operation is performed on the feature map in the downsampling layer as follows: the convolution kernel size is 3×3, the stride is 2, the padding is 1, and the number of output channels is 64, so the feature map size is reduced to 1/2, i.e. 256×256×64; a 1×1 convolution operation is then performed.
S3.2, the feature map is screened and weighted in the channel dimension. Information in the effective region can be highlighted using pooling operations along the channel direction. First, a pooling operation is applied to the acquired feature map to attend to the main features and filter redundant information. Among pooling operations, global average pooling receives feedback from every feature value on the feature map and can therefore effectively learn the image background information, while maximum pooling collects salient feature value information. Thus, applying global average pooling and global maximum pooling simultaneously is a good aid for feature screening. The specific formula is as follows:
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))
wherein M_c(F) represents the feature extraction operation on F in the channel dimension, σ is the Sigmoid activation function, AvgPool(F) is the average pooling operation, MaxPool(F) is the maximum pooling operation, and F_avg^c and F_max^c are the one-dimensional vectors obtained by global average pooling and global maximum pooling of each channel of the feature map F, respectively.
The adopted multi-layer perceptron MLP (Multilayer Perceptron) is a three-layer perceptron network shared by the vectors F_avg^c and F_max^c; the number of neurons in the middle layer is C/r, where C is the number of channels of the feature map and r is the compression ratio, set to r = 16; W_0 ∈ R^{C/r×C} and W_1 ∈ R^{C×C/r} are the weights of the three-layer perceptron, σ is the Sigmoid activation function, and a ReLU (Rectified Linear Unit) activation function is used to increase the nonlinearity of the network.
S3.3, position information is supplemented in the spatial dimension. A spatial weight map is learned by utilizing the relations among different spatial positions, which supplements the positional relation information that the channel attention mechanism cannot acquire well and allows modeling with the relations of features in space. The location information of deep features can be further enhanced by introducing the spatial attention mechanism in the form of a residual network. Taking the feature map enhanced by the channel attention mechanism as input, 1-dimensional average pooling and 1-dimensional maximum pooling are performed over the channels to obtain two 1×H×W feature descriptions, which are spliced together; the spatial weights are then obtained through a 7×7 convolution kernel followed by a sigmoid activation, and are multiplied with the input feature map F' using the skip-connection mode of a residual network to obtain the feature map F''. The formula is as follows:
M_s(F') = σ(f^{7×7}([AvgPool(F'); MaxPool(F')])) = σ(f^{7×7}([F'_avg; F'_max]))
wherein M_s(F') represents the feature extraction operation on F' in the spatial dimension, AvgPool(F') is the average pooling operation, MaxPool(F') is the maximum pooling operation, and F'_avg, F'_max ∈ R^{1×H×W} are the two single-channel two-dimensional feature maps obtained by global average pooling and global maximum pooling of the feature map F', respectively; f^{7×7} denotes a convolution operation with a kernel size of 7×7; F' is the feature map obtained after S3.2.
According to steps 3.2 and 3.3, the two steps can be summarized as follows:
F' = M_c(F) ⊗ F
F'' = M_s(F') ⊗ F'
wherein F ∈ R^{C×H×W} is the input feature map, R is the set of real numbers, and H, W and C respectively denote the length, width and number of channels of the feature map; F' is the feature map after enhancement in the channel dimension; F'' is the feature map after enhancement in the spatial dimension; ⊗ represents element-wise multiplication; M_c(F) represents the feature extraction operation on F in the channel dimension; M_s(F') represents the feature extraction operation on F' in the spatial dimension.
S3.4, a convolution operation is performed on the feature map F'' obtained in the above step, with a 3×3 convolution kernel, a stride of 1, Batch Normalization for normalization, LeakyReLU as the activation function and 64 output channels, giving a feature map of size 256×256×64.
The above operations are repeated by the 5 feature extraction blocks, and the numbers of times the 5 feature extraction blocks are operated are, in sequence, 1, 2, 8, 8 and 4. The feature map obtained through the last feature extraction block is the feature map used for fusion with the text features.
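For illustration only, the following is a minimal PyTorch sketch of one feature extraction block as described in S3.1-S3.4, assuming CBAM-style channel and spatial attention; the class names, the LeakyReLU slope and the residual combination at the spatial-attention step are assumptions rather than values fixed by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel attention (S3.2): global average/max pooling + shared MLP + Sigmoid -> M_c(F)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(                          # shared three-layer perceptron, hidden size C/r
            nn.Conv2d(channels, channels // r, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1, bias=False),
        )

    def forward(self, f):
        avg = self.mlp(F.adaptive_avg_pool2d(f, 1))
        mx = self.mlp(F.adaptive_max_pool2d(f, 1))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Spatial attention (S3.3): channel-wise average/max maps + 7x7 conv + Sigmoid -> M_s(F')."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, f):
        avg = f.mean(dim=1, keepdim=True)
        mx, _ = f.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class FeatureExtractionBlock(nn.Module):
    """One feature extraction block: downsampling (S3.1), attention (S3.2-S3.3), 3x3 conv (S3.4)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Sequential(                         # S3.1: 3x3 conv, stride 2, padding 1, then 1x1 conv
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(out_ch, out_ch, 1, bias=False),
        )
        self.ca = ChannelAttention(out_ch)
        self.sa = SpatialAttention()
        self.out = nn.Sequential(                          # S3.4: 3x3 conv, stride 1, BN, LeakyReLU
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, f):
        f = self.down(f)
        f = self.ca(f) * f                                 # F'  = M_c(F) ⊗ F
        f = f + self.sa(f) * f                             # F'': spatial weights applied with a residual skip connection
        return self.out(f)

# Example: the 512x512x32 initial feature map entering the first feature extraction block.
x = torch.randn(1, 32, 512, 512)
print(FeatureExtractionBlock(32, 64)(x).shape)             # torch.Size([1, 64, 256, 256])
```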
S4, the text data feature extraction stage. In order to be spliced and fused with the feature map, the text information needs to be converted into corresponding text features. The text data include the disease information, health status and the examination report corresponding to the patient's CT image. The text feature extraction module consists of an input layer E1-EN, a plurality of Transformer encoder layers Trm and an output layer T1-TN. The specific structure is shown in fig. 3.
In this embodiment, the text feature extraction uses the BERT (Bidirectional Encoder Representations from Transformers) model, which is a multi-layer bi-directional Transformer encoder. The Transformer comprises an encoder mechanism and a decoder mechanism, wherein the encoder receives text as input and the decoder is mainly responsible for predicting the result. The Transformer stacks 6 layers of encoders and 6 layers of decoders, for a total of 12 encoder-decoder layers. The specific structure of the encoder-decoder is shown in fig. 4.
In the above embodiment, the Transformer needs no recurrence; instead it processes all words or symbols in the sequence in parallel and uses a self-attention mechanism to relate distant words to the context, so that the context information is fully considered. The implementation steps of one encoder-decoder pair are shown in fig. 5. The structure mainly comprises a plurality of attention modules, and the calculation rule of the attention mechanism is as follows: for each word there are 3 different vectors, namely a Query vector Q, a Key vector K and a Value vector V; a score value is calculated for each vector, where score = Q × K; next, the score is passed through a softmax activation function and multiplied point-wise with the vector V to obtain the weighted score V of each vector; several attention heads (e.g., 8) are stacked to obtain a Multi-Head Attention module, the output result Z = ΣV is obtained after addition, and normalization is then performed. The input of the decoder comprises the output of the encoder and the output of the previous decoder layer, and the final result is output after being activated by Linear and Softmax respectively. The whole procedure involves several skip connections, as indicated by the dashed arrows in fig. 5. The text data are input into the BERT model, the context relations between words are established, and text features are acquired at the sentence level, further improving the generalization capability of the text feature extraction model.
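As an illustrative sketch of this step only: the snippet below extracts a sentence-level text feature with a BERT encoder, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint (the patent only specifies "a BERT model"); the feature dimension N (768 here) depends on the chosen checkpoint.

```python
import torch
from transformers import BertTokenizer, BertModel  # assumed library; not named in the patent

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # hypothetical checkpoint choice
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

report = "..."  # placeholder: patient personal information + the CT text report

with torch.no_grad():
    inputs = tokenizer(report, return_tensors="pt", truncation=True, max_length=512)
    outputs = bert(**inputs)

# Sentence-level text feature: the [CLS] token embedding, a 1 x N vector (N = 768 for bert-base).
text_feature = outputs.last_hidden_state[:, 0, :]
print(text_feature.shape)  # torch.Size([1, 768])
```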
S5, the image-text fusion stage. The feature map obtained in S3 and the text features are simply spliced to realize fusion of the text features and the image features and to guide the network in detecting liver lesions. The specific process is as follows: 3 detection boxes are set in each cell of the feature map, so the prediction vector length of each cell is 3 × (4+4+1) = 27, where 3 is the number of detection boxes per cell, the first 4 corresponds to the 4 classes of the dataset, the second 4 is the 4 position offsets of each detection box, and 1 is the confidence value that the detection box contains a target; the final image feature is thus represented as a 1×27-dimensional vector. The acquired text feature is represented as a 1×N-dimensional vector. Finally a 1×(27+N)-dimensional vector is obtained, which is the fused image-text information used by the subsequent detection network.
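A minimal sketch of the splicing (concatenation) just described; the value N = 768 is an assumption carried over from the BERT example above, and the tensors are random placeholders:

```python
import torch

image_feature = torch.randn(1, 27)    # per-cell prediction vector: 3 boxes x (4 classes + 4 offsets + 1 confidence)
text_feature = torch.randn(1, 768)    # 1 x N text feature from BERT (N = 768 assumed)

# Simple splicing along the feature dimension gives the fused 1 x (27 + N) image-text vector.
fused = torch.cat([image_feature, text_feature], dim=1)
print(fused.shape)                    # torch.Size([1, 795])
```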
S6, detection based on the fused image-text information. After the image-text fusion stage is finished, the fused image-text information is up-sampled and laterally connected to form pyramid-like feature levels, where the up-sampling operations correspond to the down-sampling stages during feature extraction.
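A minimal PyTorch sketch of the up-sampling and lateral-connection step that builds the pyramid-like feature levels; the number of levels and the channel sizes are illustrative assumptions rather than values specified in the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidNeck(nn.Module):
    """Up-sample deeper fused features and merge them with shallower ones via lateral 1x1 convs."""
    def __init__(self, channels=(256, 512, 1024), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in channels)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)                                       # deepest level
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2)   # up-sampling + lateral connection
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2)
        return p3, p4, p5                                              # one detection head per level

# Illustrative feature maps from three backbone stages (strides 8, 16, 32 of a 512x512 input).
c3, c4, c5 = torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32), torch.randn(1, 1024, 16, 16)
for p in PyramidNeck()(c3, c4, c5):
    print(p.shape)  # 256-channel maps at 64x64, 32x32 and 16x16
```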
Corresponding detection heads are applied on the different levels, and the position and confidence value of the lesion region in the feature map are calculated to obtain the specific position of the lesion region in the original image, thereby realizing detection of the cancerous region; the aim is to adapt to cancerous regions of different sizes and to cover a wider range of suspected cancerous regions. Fig. 6 illustrates the internal operation of each pyramid level. The detection head mentioned in this application can adopt the scheme disclosed in Sections 2.1-2.2 of "YOLOv3: An Incremental Improvement", Joseph Redmon and Ali Farhadi, University of Washington.
In order to realize the above method, the application also provides a feature extraction and image-text fusion system for cancer lesion detection, which comprises a data acquisition module, a feature extraction module for CT images and text data, an image-text fusion module and an image-text information detection module;
the data acquisition module is used for acquiring the patient's personal information, the patient's CT image and the text report corresponding to the CT image; the patient personal information includes the patient's disease information, health status and the like; the CT images and text reports of patients may be acquired using a German Siemens 52-ring scanner.
The feature extraction module for the CT image and the text data comprises a CT image format conversion unit, a CT image feature extraction unit and a text data feature extraction unit. Specifically, the CT image format conversion unit is image format conversion software used for converting the CT image format acquired by the scanner so as to facilitate subsequent processing. The CT image feature extraction unit contains a CT image data feature extraction network, which adopts a Darknet-53 network as the base network; the Darknet-53 network is composed of 5 feature extraction blocks, and each feature extraction block comprises a downsampling layer and a Residual Block; the CT image after format conversion is taken as the input of the first feature extraction block, and among the five feature extraction blocks the output of the previous feature extraction block is taken as the input of the next feature extraction block; the output of the last feature extraction block is the feature map used for fusion with the text features. The text data feature extraction unit adopts a BERT model; text data are input into the BERT model, context relations between words are established, and text features are acquired at the sentence level.
The feature map and the text features are spliced in the image-text fusion module;
And the image-text information detection module calculates coordinates of the lesion area and a confidence value according to the guidance of the text characteristics, wherein the coordinates are used for framing the position of the lesion area in the original image, the confidence value is used for judging the type of the lesion area, and a final result is output.
Through the processing of the above modules, feature extraction of the CT image and text data and fusion of the two kinds of features are completed, detection of cancer lesions is finally accomplished, the detection accuracy is improved, and doctors are assisted in diagnosis.
The above embodiments are merely for illustrating the design concept and features of the present invention and are intended to enable those skilled in the art to understand the content of the present invention and implement it accordingly; the scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications made according to the principles and design ideas of the present invention fall within the scope of protection of the present invention.

Claims (7)

1. The feature extraction and image-text fusion method for cancer lesion detection is characterized by comprising the following steps of:
S1, acquiring patient personal information recorded by a clinician, CT image data of a patient and a text report corresponding to the CT image data;
S2, preprocessing the CT image data so that its format can be directly fed into a neural network as input for training or prediction; performing a convolution operation and normalization on the CT image data in sequence to obtain an initial feature map F;
S3, adopting a Darknet-53 network as the base network and extracting features from the CT image data;
S4, extracting text features from the patient personal information and the text report corresponding to the CT image data by adopting a BERT model; the BERT model in S4 is a multi-layer bidirectional Transformer encoder; the Transformer comprises an encoder mechanism and a decoder mechanism, the encoder receives text as input, and the decoder is mainly responsible for predicting results; the Transformer stacks 6 layers of encoders and 6 layers of decoders, for a total of 12 encoder-decoder layers; the Transformer needs no recurrence, but processes all words or symbols in the sequence in parallel, uses a self-attention mechanism to relate distant words to the context, and fully considers the context information; the calculation rule of the attention mechanism is as follows: for each word there are 3 different vectors, namely a Query vector Q, a Key vector K and a Value vector V; a score value is calculated for each vector, where score = Q × K; next, the score is passed through a softmax activation function and multiplied point-wise with the vector V to obtain the weighted score V of each vector; several attention heads are stacked to obtain a Multi-Head Attention module, the output result Z = ΣV is obtained after addition, and normalization processing is then carried out; the input of the decoder comprises the output of the encoder and the output of the previous decoder layer, and the final result is output after being activated by Linear and Softmax respectively;
S5, splicing the feature map obtained in S3 with the text features to realize fusion of the text features and the image features and guide the network in detecting cancer lesions;
S6, carrying out an up-sampling operation and lateral connections on the fused image-text information to form pyramid-like feature levels, applying corresponding detection heads on the different feature levels, and calculating the position and confidence value of the lesion region in the feature map to obtain the specific position of the lesion region in the original image, thereby realizing detection of the cancerous region.
2. The feature extraction and image-text fusion method for cancer lesion detection according to claim 1, wherein the Darknet-53 network is formed by alternating a plurality of downsampling layers and Residual Blocks, wherein one downsampling layer together with one Residual Block is called one feature extraction block, and 5 feature extraction blocks form the CT image data feature extraction network; the input of each feature extraction block is the output of the previous feature extraction block, and its output serves as the input of the next feature extraction block.
3. The feature extraction and image-text fusion method for cancer lesion detection according to claim 2, wherein the numbers of times the feature extraction block operation is performed in the 5 feature extraction blocks are, in sequence, 1, 2, 8, 8 and 4.
4. The feature extraction and image-text fusion method for cancer lesion detection according to claim 2 or 3, wherein the feature extraction block operation of each feature extraction block comprises the following steps:
S3.1, performing downsampling operation on the feature map in a downsampling layer, and gradually reducing the size of the feature map;
S3.2, screening and weighting the feature map in the channel dimension, and highlighting information in the effective region by using pooling operations along the channel direction; expressed as F' = M_c(F) ⊗ F, wherein M_c(F) represents the feature extraction operation performed on F in the channel dimension;
S3.3, supplementing position information in the spatial dimension, learning a spatial weight map by utilizing the relations among different spatial positions, and modeling with the relations of features in space, thereby supplementing the positional relation information that the channel attention mechanism cannot acquire well; expressed as F'' = M_s(F') ⊗ F', wherein M_s(F') represents the feature extraction operation performed on F' in the spatial dimension and ⊗ represents element-wise multiplication;
S3.4, performing a convolution operation on the feature map F'' obtained in step S3.3, with a 3×3 convolution kernel, a stride of 1, Batch Normalization for normalization, and LeakyReLU as the activation function.
5. The feature extraction and image-text fusion method for cancer lesion detection according to claim 1, wherein 3 detection boxes are set in each cell of the feature map, so the prediction vector length of each cell is 3 × (4+4+1) = 27, where 3 is the number of detection boxes per cell, the first 4 corresponds to the 4 classes of the dataset, the second 4 is the 4 position offsets of each detection box, and 1 is the confidence value that the detection box contains a target; the final image feature is represented as a 1×27-dimensional vector, and the acquired text feature is represented as a 1×N-dimensional vector; finally a 1×(27+N)-dimensional vector is obtained, which is the fused image-text information used by the subsequent detection network.
6. A feature extraction and image-text fusion system for cancer lesion detection based on the feature extraction and image-text fusion method for cancer lesion detection of claim 1, which is characterized by comprising a data acquisition module, a feature extraction module for CT images and text data, an image-text fusion module and an image-text information detection module;
the data acquisition module is used for acquiring the patient's personal information, the patient's CT image and the text report corresponding to the CT image; the patient personal information includes the patient's disease information, health status and the like;
The feature extraction module of the CT image and the text data comprises a CT image format conversion unit, a CT image feature extraction unit and a text data feature extraction unit;
The feature map and the text features are spliced in the image-text fusion module;
And the image-text information detection module calculates coordinates of the lesion area and a confidence value according to the guidance of the text characteristics, wherein the coordinates are used for framing the position of the lesion area in the original image, the confidence value is used for judging the type of the lesion area, and a final result is output.
7. The feature extraction and image-text fusion system for cancer lesion detection according to claim 6, wherein the CT image format conversion unit is image format conversion software used for converting the CT image format acquired by the scanner so as to facilitate subsequent processing; the CT image feature extraction unit contains a CT image data feature extraction network, which adopts a Darknet-53 network as the base network; the Darknet-53 network is composed of 5 feature extraction blocks, and each feature extraction block comprises a downsampling layer and a Residual Block; the CT image after format conversion is taken as the input of the first feature extraction block, and among the five feature extraction blocks the output of the previous feature extraction block is taken as the input of the next feature extraction block; the output of the last feature extraction block is the feature map used for fusion with the text features; the text data feature extraction unit adopts a BERT model, text data are input into the BERT model, context relations between words are established, and text features are acquired at the sentence level.
CN202110705588.0A 2021-06-24 2021-06-24 Feature extraction and image-text fusion method and system for cancer lesion detection Active CN113469962B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110705588.0A CN113469962B (en) 2021-06-24 2021-06-24 Feature extraction and image-text fusion method and system for cancer lesion detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110705588.0A CN113469962B (en) 2021-06-24 2021-06-24 Feature extraction and image-text fusion method and system for cancer lesion detection

Publications (2)

Publication Number Publication Date
CN113469962A CN113469962A (en) 2021-10-01
CN113469962B true CN113469962B (en) 2024-05-14

Family

ID=77872618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110705588.0A Active CN113469962B (en) 2021-06-24 2021-06-24 Feature extraction and image-text fusion method and system for cancer lesion detection

Country Status (1)

Country Link
CN (1) CN113469962B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190131206A (en) * 2018-05-16 2019-11-26 한양대학교 산학협력단 Deep learning-based object detection method and system utilizing global context feature of image
CN111667489A (en) * 2020-04-30 2020-09-15 华东师范大学 Cancer hyperspectral image segmentation method and system based on double-branch attention deep learning
CN112801168A (en) * 2021-01-25 2021-05-14 江苏大学 Tumor image focal region prediction analysis method and system and terminal equipment


Also Published As

Publication number Publication date
CN113469962A (en) 2021-10-01

Similar Documents

Publication Publication Date Title
CN110110808B (en) Method and device for performing target labeling on image and computer recording medium
CN113223005B (en) Thyroid nodule automatic segmentation and grading intelligent system
CN110648331B (en) Detection method for medical image segmentation, medical image segmentation method and device
CN112132878B (en) End-to-end brain nuclear magnetic resonance image registration method based on convolutional neural network
CN111738363A (en) Alzheimer disease classification method based on improved 3D CNN network
Liu et al. Recent progress in transformer-based medical image analysis
CN115018809A (en) Target area segmentation and identification method and system of CT image
CN112819831A (en) Segmentation model generation method and device based on convolution Lstm and multi-model fusion
CN116563533A (en) Medical image segmentation method and system based on target position priori information
US11494908B2 (en) Medical image analysis using navigation processing
CN113538363A (en) Lung medical image segmentation method and device based on improved U-Net
CN112420170B (en) Method for improving image classification accuracy of computer aided diagnosis system
CN111340807B (en) Nidus positioning core data extraction method, system, electronic equipment and storage medium
CN113469962B (en) Feature extraction and image-text fusion method and system for cancer lesion detection
CN117314823A (en) Attention mechanism-based knee joint MRI intelligent analysis method and device
CN115937590A (en) Skin disease image classification method with CNN and Transformer fused in parallel
CN115471512A (en) Medical image segmentation method based on self-supervision contrast learning
CN116258732A (en) Esophageal cancer tumor target region segmentation method based on cross-modal feature fusion of PET/CT images
CN116524178A (en) MRI image tissue segmentation method and imaging method based on semi-supervision
CN114119558B (en) Method for automatically generating nasopharyngeal carcinoma image diagnosis structured report
Li et al. Image analysis and diagnosis of skin diseases-a review
Wu et al. Swin Transformer based benign and malignant pulmonary nodule classification
Lokhande et al. Lung CT image segmentation: a convolutional neural network approach
Dai et al. A Generative Data Augmentation Trained by Low-quality Annotations for Cholangiocarcinoma Hyperspectral Image Segmentation
Zhang et al. UTSN-net: medical image semantic segmentation model based on skip non-local attention module

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant