CN113469962A - Feature extraction and image-text fusion method and system for cancer lesion detection - Google Patents

Feature extraction and image-text fusion method and system for cancer lesion detection

Info

Publication number
CN113469962A
CN113469962A (application CN202110705588.0A)
Authority
CN
China
Prior art keywords
image
feature extraction
text
feature
detection
Prior art date
Legal status
Granted
Application number
CN202110705588.0A
Other languages
Chinese (zh)
Other versions
CN113469962B (en)
Inventor
吴晶晶
宋余庆
刘哲
Current Assignee
Jiangsu University
Original Assignee
Jiangsu University
Priority date
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202110705588.0A priority Critical patent/CN113469962B/en
Publication of CN113469962A publication Critical patent/CN113469962A/en
Application granted granted Critical
Publication of CN113469962B publication Critical patent/CN113469962B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10072Tomographic images
    • G06T2207/10081Computed x-ray tomography [CT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30096Tumor; Lesion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Apparatus For Radiation Diagnosis (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a feature extraction and image-text fusion method and system for cancer lesion detection, which acquire CT image data and text data of a patient; the CT image data are preprocessed and then subjected to a convolution operation and normalization in sequence to obtain an initial feature map F; a Darknet-53 network is adopted as the base network to extract features from the CT image data; a BERT model is adopted to extract text features from the patient's text data; the feature map and the text features are spliced to realize the fusion of textual features and image features and to guide the network in detecting cancer lesions; the fused image-text information is up-sampled and laterally connected to form pyramid-like feature levels, corresponding detection heads are applied at the different feature levels, and the position and confidence value of the lesion region in the feature map are calculated to obtain the specific position of the lesion region in the original image, thereby realizing detection of the cancerous region.

Description

Feature extraction and image-text fusion method and system for cancer lesion detection
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a method and a system for feature extraction and image-text fusion for cancer lesion detection.
Background
In the process of detecting cancer lesions, medical images serve as an auxiliary means that effectively helps examining personnel identify the lesion region, so that patients can receive timely treatment. Conventional medical image examination relies on the medical knowledge and experience of clinicians to analyze the images, which is prone to problems such as missed detection and false detection. Furthermore, manually labeling lesion regions in medical images often requires a great deal of time and effort. Therefore, introducing object detection methods from computer vision to detect cancer lesion regions can effectively avoid the interference of human factors and improve the detection precision of cancer lesion regions.
Traditional machine learning methods mainly extract relevant features from information such as probability density functions or gradient histograms, thereby improving the detector's recognition precision for lesion regions in medical images. However, traditional machine learning methods not only require prior knowledge for manual intervention in the whole detection model, but also have difficulty effectively mining deep semantic features in the image.
Cancer lesion detection methods based on deep learning mainly use a convolutional neural network (CNN) to extract feature information related to cancer lesions and perform cancer detection from this feature information. On the one hand, conventional convolutional neural networks extract feature information only from differences between pixels; because lesion regions differ greatly in size, shape and density, such feature extraction methods easily fail to mine the feature information of lesion regions sufficiently. On the other hand, a large number of deep-learning-based cancer lesion detection methods focus only on detection at the medical image level and ignore the large feature differences between the various lesion stages.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a feature extraction and image-text fusion method and system for cancer lesion detection, aiming to solve the problems in the prior art that the image features extracted by the detection network are insufficient and the detection precision is low.
The technical scheme adopted by the invention is as follows:
a feature extraction and image-text fusion method for cancer lesion detection comprises the following steps:
s1, acquiring personal information of a patient, CT image data of the patient and a text report corresponding to the CT image data, which are recorded by a clinician;
s2, preprocessing the CT image data to enable the format of the CT image data to be directly fed into a neural network as input for training or prediction; carrying out convolution operation and normalization processing on the CT image data in sequence to obtain an initial characteristic diagram F;
s3, adopting a Darknet-53 network as a basic network, and extracting the characteristics of the CT image data;
s4, extracting text data features of the text report corresponding to the personal information of the patient and the CT image data by adopting a BERT model;
s5, splicing the feature graph obtained in the S3 and the text feature to realize the fusion of the character feature and the image feature and guide the network to detect the cancer lesion;
s6, performing up-sampling operation and lateral connection on the fused image-text information to form different feature levels like a pyramid, applying corresponding detection heads on the different feature levels, and calculating the position and the confidence value of the lesion area in the feature map to obtain the specific position of the lesion area in the original image so as to realize detection of the cancerous area.
Further, the Darknet-53 network is composed of alternating down-sampling layers and Residual Blocks; one down-sampling layer together with one Residual Block is called a feature extraction block, and 5 feature extraction blocks form the CT image data feature extraction network; the input of each feature extraction block is the output of the previous feature extraction block, and its output serves as the input of the next feature extraction block.
Furthermore, the numbers of times the feature extraction block operation is performed in the 5 feature extraction blocks are 1, 2, 8, 8 and 4 in sequence.
further, the steps of the feature extraction block operation of each feature extraction block each time are as follows:
s3.1, performing down-sampling operation on the feature map in a down-sampling layer, and gradually reducing the size of the feature map;
S3.2, screening and weighting the feature map in the channel dimension, and highlighting information in the effective region by using pooling operations along the channel direction; expressed as:
F′ = M_c(F) ⊗ F, where M_c(F) represents the feature extraction operation on F in the channel dimension;
s3.3, supplementing position information in spatial dimension, learning a spatial weight graph by using the relationship among different spatial positions, not only supplementing the position relationship information which cannot be well acquired by a channel attention mechanism, but also modeling by using the relationship of features in space; expressed as:
F″ = M_s(F′) ⊗ F′, where M_s(F′) represents the feature extraction operation on F′ in the spatial dimension, and ⊗ denotes element-wise multiplication;
and S3.4, performing a convolution operation on the feature map F″ obtained in S3.3, wherein the convolution kernel size is 3 × 3, the step size is 1, normalization is performed by Batch Normalization, and the activation function is LeakyReLU.
Further, the BERT model in S4 is a multi-layer bidirectional Transformer encoder; the Transformer includes an encoder mechanism and a decoder mechanism, the encoder receives text as input, and the decoder is mainly responsible for predicting the result; the Transformer stacks 6 encoder layers and 6 decoder layers, 12 layers in total.
Further, the Transformer does not need recurrence, but processes all words or symbols in the sequence in parallel, while combining context with more distant words by using a self-attention mechanism, fully taking context information into account; the calculation rule of the attention mechanism is as follows: for each word there are 3 different vectors, namely the Query vector Q, the Key vector K and the Value vector V; a score value is calculated for each vector, where score = Q · K; a softmax activation function is applied to the scores, and the softmax weights are multiplied point-wise by the vector V to obtain the weighted value v of each vector; several attention heads are stacked to obtain a Multi-Head attention module, and the weighted values are summed to obtain the output result Z = Σ v, followed by normalization. The decoder input comprises the encoder output and the output of the previous decoder layer, and the final result is output after activation by Linear and Softmax respectively.
Further, 3 detection boxes are set in each cell of the feature map, so that the prediction vector length of each cell is 3 × (4 + 4 + 1) = 27, where the first 4 corresponds to the 4 classes of the dataset, 3 represents the number of detection boxes per cell, the second 4 is the 4 position offsets of each detection box, and 1 is the confidence value that each detection box contains a target; the final image feature is represented as a 1 × 27-dimensional vector. The acquired text feature is represented as a 1 × N-dimensional vector; finally a 1 × (27 + N)-dimensional vector is obtained, i.e., the fused image-text information used by the subsequent detection network.
A feature extraction and image-text fusion system for cancer lesion detection comprises a data acquisition module, a feature extraction module for CT images and text data, an image-text fusion module and an image-text information detection module;
the data acquisition module is used for acquiring personal information of a patient, a CT image of the patient and a text report corresponding to the CT image; the personal information of the patient comprises disease information, health condition and other information of the patient;
the feature extraction module of the CT image and the text data comprises a CT image format conversion unit, a CT image feature extraction unit and a text data feature extraction unit;
splicing the feature graph and the text features in the image-text fusion module;
the image-text information detection module is used for calculating the coordinates and the confidence values of the lesion areas according to the guidance of the text characteristics, the coordinates are used for framing the positions of the lesion areas in the original image, the confidence values are used for judging the types of the lesions, and the final results are output;
further, the CT image format conversion unit is image format conversion software and converts the CT image format collected by the scanner so as to facilitate subsequent processing; the CT image feature extraction unit is internally provided with a CT image data feature extraction network, the CT image data feature extraction network adopts a Darknet-53 network as a basic network, the Darknet-53 network is composed of 5 feature extraction blocks, and each feature extraction Block comprises a down-sampling layer and a Residual Block; taking the CT image after format conversion as the input of a first feature extraction block, and taking the output of a previous feature extraction block in five feature extraction blocks as the input of a next feature extraction block; the output of the last feature extraction block is a feature graph used for being fused with the text features; the text data feature extraction unit adopts a BERT model, inputs text data into the BERT model, establishes a context relationship between words, and obtains text features based on sentence levels.
The invention has the beneficial effects that:
the invention is directed to a method and a system for feature extraction and image-text fusion of cancer lesion detection, and introduces feature description information related to each lesion stage in the cancer lesion detection method, so that the identification precision of a cancer lesion area can be improved, and the severity of cancer lesions can be effectively divided.
In addition, the feature extraction and image-text fusion method and system for cancer lesion detection of the invention adopt an extraction method combining channel dimension and space dimension to improve the identification capability of lesion regions with large difference, and combine text description information of each lesion stage to perform feature fusion of images and texts by using a BERT network, thereby performing effective detection on cancer lesion regions and effective judgment on cancer stages.
Drawings
FIG. 1 is an overall flow diagram of the method of the present invention;
FIG. 2 is a schematic view of the working flow of the CT image feature extraction module of the method of the present invention;
FIG. 3 is a schematic diagram of a network structure of a text feature extraction module of the method of the present invention;
FIG. 4 is a schematic diagram of the encoder-decoder structure of the method of the present invention;
FIG. 5 is an expanded view of the encoder-decoder structure of FIG. 4;
FIG. 6 is a schematic diagram of the structure of the detection module of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The feature extraction and image-text fusion method for cancer lesion detection provided by the present invention is described in detail below with reference to FIGS. 1 to 6; the CT image data provided by the present invention are exemplified by liver CT images.
S1, data acquisition stage, which specifically comprises the acquisition of CT image data and text data: a large number of abdominal plain-scan slices are generated by a high-precision German Siemens 52-ring scanner (PET-CT); during the generation of the CT image data, a text report is produced, which, together with the personal patient information recorded by the clinician (including disease information, health status, etc.), forms the required text data set. The CT image data of each patient correspond one-to-one with the text data and are stored in a database for use by the subsequent image feature extraction module and text feature extraction module.
And S2, CT image data preprocessing stage: the CT image data are preprocessed into a format acceptable to the image feature extraction network. The scan slices generated directly by the machine are in DICOM (.dcm) format and cannot be fed directly into a neural network as input for training or prediction; they need to be converted into the jpg or png image format, or into a two-dimensional array, by means of the relevant Python libraries. During the conversion, attention must also be paid to the window level and window width definition for the liver (or other organs and tissues), i.e., the limits on the HU value (Hounsfield units), so as to obtain maximum contrast with the surrounding organs (making the liver appear as clearly as possible). For the liver, according to the experience of professional radiologists, the HU values are limited to the interval [-100, 240]; the endpoints -100 and 240 define the window, and the distance between the two endpoints is the window width. The defining formula is shown below:
HU_{i,j} = PV_{i,j} × slope + intercept

where HU_{i,j} represents the HU value at pixel (i, j), PV_{i,j} represents the pixel value of the original scan slice at pixel (i, j), and slope and intercept are the slope and intercept of the corresponding linear rescaling, respectively.
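As a minimal illustrative sketch (not part of the original disclosure), the preprocessing described above can be approximated in Python as follows; the use of pydicom, numpy and Pillow, and the helper name window_liver_slice, are assumptions for illustration rather than the patent's implementation.

```python
# Illustrative sketch of the S2 preprocessing: DICOM -> HU values -> liver window [-100, 240] -> 8-bit image.
# pydicom, numpy and Pillow are assumed to be installed; names here are illustrative, not from the patent.
import numpy as np
import pydicom
from PIL import Image

def window_liver_slice(dcm_path, hu_min=-100, hu_max=240):
    ds = pydicom.dcmread(dcm_path)
    # Linear rescaling to Hounsfield units: HU = PV * slope + intercept
    slope = float(getattr(ds, "RescaleSlope", 1.0))
    intercept = float(getattr(ds, "RescaleIntercept", 0.0))
    hu = ds.pixel_array.astype(np.float32) * slope + intercept
    # Clip to the liver window and scale to [0, 255] for a jpg/png-style image
    hu = np.clip(hu, hu_min, hu_max)
    return ((hu - hu_min) / (hu_max - hu_min) * 255.0).astype(np.uint8)

# Example usage: convert one slice and save it as png
# Image.fromarray(window_liver_slice("slice_0001.dcm")).save("slice_0001.png")
```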
S3, CT image data feature extraction stage: a Darknet-53 network is adopted as the base network; the Darknet-53 network is composed of alternating down-sampling layers and Residual Blocks. Specifically, one down-sampling layer and one Residual Block together are called a feature extraction block, and 5 feature extraction blocks form the CT image data feature extraction network. The down-sampling layer is responsible for gradually reducing the size of the feature map, and the Residual Block is responsible for extracting features from the input image, extracting high-level semantic information with stronger expressive power and robustness for detection by the subsequent module; the input of each feature extraction block is the output of the previous feature extraction block, and the current output serves as the input of the next feature extraction block. The specific structure is shown in FIG. 2.
In this embodiment, the preprocessed CT image data with size 512 × 512 × 3 are input into the network for a convolution operation with a 3 × 3 convolution kernel, stride 1, Batch Normalization for normalization and LeakyReLU as the activation function; the number of output channels is 32, and the size of the feature map F is 512 × 512 × 32. The feature map is then input into the first feature extraction block; the specific operation steps inside the feature extraction block are as follows:
S3.1, a down-sampling operation is performed on the feature map in the down-sampling layer, as follows: the convolution kernel size is 3 × 3, the stride is 2, the padding is 1, and the number of output channels is 64; the feature map size is reduced to 1/2, i.e., 256 × 256 × 64, and then a 1 × 1 convolution operation is performed.
And S3.2, the feature map is screened and weighted in the channel dimension. Pooling along the channel direction highlights the information in the effective region. First, pooling operations are applied to the acquired feature map to focus on the main features and filter out redundant information. Among the pooling operations, global average pooling gives feedback on every feature value of the feature map and can therefore effectively learn the image background information, while max pooling collects the salient feature value information. Performing global average pooling and global max pooling simultaneously is therefore of great help for feature screening. The specific formula is as follows:

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_c^avg)) + W_1(W_0(F_c^max)))

where M_c(F) denotes the feature extraction operation on F in the channel dimension, σ is the Sigmoid activation function, AvgPool(F) is the average pooling operation, MaxPool(F) is the max pooling operation, and F_c^avg and F_c^max are the one-dimensional vectors obtained by global average pooling and global max pooling of each channel of the feature map F, respectively.

The multi-layer perceptron MLP (Multi-Layer Perceptron) adopted here is a three-layer perceptron network shared by the vectors F_c^avg and F_c^max; the number of neurons in its central layer is C/r, where C is the number of channels of the feature map and r is the compression rate; W_0 ∈ R^(C/r×C) and W_1 ∈ R^(C×C/r) are the weights of the three-layer perceptron, σ is the Sigmoid activation function, and a ReLU (Rectified Linear Unit) activation function is used to increase the degree of nonlinearity of the network.
And S3.3, position information is supplemented in the spatial dimension. A spatial weight map is learned from the relationships among different spatial positions, which not only supplements the positional relationship information that the channel attention mechanism cannot acquire well, but also models the relationships of features in space. By introducing a spatial attention mechanism in the form of a residual network, the position information of deep features can be further enhanced. The feature map enhanced by the channel attention mechanism is taken as input; 1-dimensional average pooling and 1-dimensional max pooling along the channel yield two 1 × H × W feature descriptions, which are spliced together, passed through a 7 × 7 convolution kernel and activated by a sigmoid function to obtain the spatial weight, which is then multiplied with the input feature map F′ in the skip-connection manner of a residual network to obtain the feature map F″. The formula is as follows:

M_s(F′) = σ(f^(7×7)([AvgPool(F′); MaxPool(F′)])) = σ(f^(7×7)([F′_s^avg; F′_s^max]))

where M_s(F′) denotes the feature extraction operation on F′ in the spatial dimension, AvgPool(F′) is the average pooling operation, MaxPool(F′) is the max pooling operation, and F′_s^avg, F′_s^max ∈ R^(1×H×W) are the two single-channel two-dimensional feature maps obtained by global average pooling and max pooling of the feature map F′, respectively; f^(7×7) denotes a convolution operation with a 7 × 7 convolution kernel; F′ is the feature map obtained after S3.2.
From steps S3.2 and S3.3, the two steps can be summarized as follows:

F′ = M_c(F) ⊗ F
F″ = M_s(F′) ⊗ F′

where F ∈ R^(C×H×W) is the input feature map, R is the set of real numbers, and H, W and C denote the length, width and number of channels of the feature map, respectively; F′ is the feature map after channel-dimension enhancement; F″ is the feature map after spatial-dimension enhancement; ⊗ denotes element-wise multiplication; M_c(F) denotes the feature extraction operation on F in the channel dimension; M_s(F′) denotes the feature extraction operation on F′ in the spatial dimension.
And S3.4, a convolution operation is performed on the feature map F″ obtained in the previous step: the convolution kernel size is 3 × 3, the stride is 1, normalization is performed by Batch Normalization, the activation function is LeakyReLU, the number of output channels is 64, and the feature map size is 256 × 256 × 64.
The 5 feature extraction blocks repeat the above operations, and the numbers of repetitions of the 5 feature extraction blocks are 1, 2, 8, 8 and 4 in sequence. The feature map obtained from the last feature extraction block is the feature map used for fusion with the text features.
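As a rough PyTorch sketch of steps S3.1 to S3.4, the block below combines a stride-2 down-sampling convolution, channel attention, spatial attention and a final 3 × 3 convolution; the class names, the compression rate r = 16 and the omission of the Darknet residual connections inside the Residual Block are assumptions made for brevity, not the patent's exact implementation.

```python
# Illustrative sketch of one feature extraction block (S3.1-S3.4): down-sampling,
# channel attention, spatial attention, then 3x3 conv + BN + LeakyReLU.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(                        # shared perceptron with weights W0, W1
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels))
    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))               # F_c^avg branch
        mx = self.mlp(x.amax(dim=(2, 3)))                # F_c^max branch
        w = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * w                                     # F' = Mc(F) ⊗ F

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
    def forward(self, x):
        desc = torch.cat([x.mean(dim=1, keepdim=True),   # two 1 x H x W descriptions along channels
                          x.amax(dim=1, keepdim=True)], dim=1)
        w = torch.sigmoid(self.conv(desc))               # Ms(F') = sigma(f^(7x7)([...]))
        return x * w                                     # F'' = Ms(F') ⊗ F'

class FeatureExtractionBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Sequential(                       # S3.1: 3x3 stride-2 conv, then 1x1 conv
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.Conv2d(out_ch, out_ch, 1))
        self.ca = ChannelAttention(out_ch)               # S3.2
        self.sa = SpatialAttention()                     # S3.3
        self.tail = nn.Sequential(                       # S3.4: 3x3 conv + BN + LeakyReLU
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.1))
    def forward(self, x):
        return self.tail(self.sa(self.ca(self.down(x))))

# e.g. the first block maps a 512x512x32 feature map F to 256x256x64:
# block = FeatureExtractionBlock(32, 64); out = block(torch.randn(1, 32, 512, 512))
```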
And S4, text data feature extraction. In order to perform splicing and fusion with the feature map, the text information needs to be converted into corresponding text features. The text data include the patient's disease information, health status, and the examination report of the CT image. The text feature extraction module is composed of an input layer E1-EN, several Transformer encoder layers (Trm) and an output layer T1-TN. The specific structure is shown in FIG. 3.
In this embodiment, a BERT (Bidirectional Encoder Representations from Transformers) model is used to extract the text features; the BERT model is a multi-layer bidirectional Transformer encoder. The Transformer comprises an encoder mechanism and a decoder mechanism: the encoder receives text as input, and the decoder is mainly responsible for predicting the result. The Transformer stacks 6 encoder layers and 6 decoder layers, 12 layers in total. The specific structure of the encoder-decoder is shown in FIG. 4.
In the above embodiment, the Transformer does not need recurrence; it processes all words or symbols in the sequence in parallel, while using the self-attention mechanism to combine context with more distant words, taking context information fully into account. The specific implementation of one encoder-decoder pair is shown in FIG. 5. The structure mainly comprises several attention modules, and the attention mechanism is calculated as follows: for each word there are 3 different vectors, namely the Query vector Q, the Key vector K and the Value vector V; a score value is calculated for each vector, where score = Q · K; a softmax activation function is applied to the scores, and the softmax weights are multiplied point-wise by the vector V to obtain the weighted value v of each vector; several attention heads (for example, 8) are stacked to obtain a Multi-Head attention module, and the weighted values are summed to obtain the output result Z = Σ v, followed by normalization. The decoder input comprises the encoder output and the output of the previous decoder layer, and the final result is output after activation by Linear and Softmax respectively. The whole procedure involves several skip connections, indicated by the dashed arrows in FIG. 5. The text data are input into the BERT model, the context relationships among the words are established, and sentence-level text features are obtained, which further improves the generalization ability of the text feature extraction model.
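A minimal sketch of obtaining the sentence-level text feature from a pretrained BERT encoder is given below, using the HuggingFace transformers library; the checkpoint name bert-base-chinese and the choice of the [CLS] hidden state as the sentence-level feature are assumptions, as the patent does not specify them.

```python
# Illustrative sketch: encode a report sentence into a 1 x N text feature vector with BERT.
# The checkpoint name and the use of the [CLS] token as the sentence-level feature are assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def extract_text_feature(report_text):
    inputs = tokenizer(report_text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert(**inputs)
    # Take the [CLS] hidden state as a sentence-level feature of shape 1 x N (N = 768 for bert-base)
    return outputs.last_hidden_state[:, 0, :]

# text_feat = extract_text_feature("low-density lesion in the right hepatic lobe")  # -> shape (1, 768)
```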
And S5, image-text fusion. The feature map obtained in S3 and the text features obtained in S4 are simply spliced together, realizing the fusion of textual features and image features and guiding the network to detect liver lesions. The specific process is as follows: 3 detection boxes are set in each cell of the feature map, so that the prediction vector length of each cell is 3 × (4 + 4 + 1) = 27, where the first 4 corresponds to the 4 classes of the dataset, 3 represents the number of detection boxes per cell, the second 4 is the 4 position offsets of each detection box, and 1 is the confidence value that each detection box contains a target; the final image feature is represented as a 1 × 27-dimensional vector. The acquired text feature is represented as a 1 × N-dimensional vector. Finally a 1 × (27 + N)-dimensional vector is obtained, i.e., the fused image-text information used by the subsequent detection network.
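The sketch below shows one plausible way to perform the splicing described above, appending the text feature to the prediction vector of every feature-map cell; broadcasting a single text feature over the cell grid and the concrete tensor shapes are assumptions made for illustration.

```python
# Illustrative sketch of S5: append the 1 x N text feature to the 27-dim prediction vector
# of every feature-map cell, yielding fused (27 + N)-dim image-text vectors per cell.
import torch

def fuse_image_text(image_pred, text_feat):
    # image_pred: (S, S, 27) grid of per-cell prediction vectors
    # text_feat:  (1, N) sentence-level BERT feature
    s = image_pred.shape[0]
    text_grid = text_feat.expand(s, s, -1)              # broadcast the text feature to every cell
    return torch.cat([image_pred, text_grid], dim=-1)   # (S, S, 27 + N)

# fused = fuse_image_text(torch.randn(13, 13, 27), torch.randn(1, 768))  # -> (13, 13, 795)
```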
And S6, detection based on the fused image-text information. After the image-text fusion stage, the fused image-text information is up-sampled and laterally connected to form different pyramid-like feature levels, where the up-sampling operations correspond to the down-sampling stages during feature extraction.
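The following sketch illustrates, under the assumption of an FPN-style top-down pathway with three levels and illustrative channel counts, how up-sampling and lateral connections could form the pyramid-like feature levels with a detection head per level; it is not the patent's exact network.

```python
# Illustrative sketch of S6: build pyramid-like feature levels by up-sampling the deeper (fused)
# feature map and laterally connecting it with higher-resolution features, then apply one
# detection head per level. Channel counts and the number of levels are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionPyramid(nn.Module):
    def __init__(self, channels=(256, 512, 1024), out_ch=256, preds_per_cell=27):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in channels])
        self.heads = nn.ModuleList([nn.Conv2d(out_ch, preds_per_cell, 1) for _ in channels])

    def forward(self, feats):
        # feats: feature maps from shallow (high resolution) to deep (low resolution)
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down pathway: up-sample the deeper level and add it to the lateral connection
        for i in range(len(laterals) - 2, -1, -1):
            up = F.interpolate(laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
            laterals[i] = laterals[i] + up
        # one detection head per pyramid level, predicting offsets, classes and confidence per cell
        return [head(lvl) for head, lvl in zip(self.heads, laterals)]

# pyr = FusionPyramid()
# outs = pyr([torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32), torch.randn(1, 1024, 16, 16)])
```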
Corresponding detection heads are applied at the different levels, and the position and confidence value of the lesion region in the feature map are calculated to obtain the specific position of the lesion region in the original image, thereby realizing detection of the cancerous region; the aim is to adapt to cancerous regions of different sizes and to traverse a wider range of suspected cancerous regions. FIG. 6 is a schematic diagram of the internal operation of each pyramid level. The detection head mentioned in this application can adopt the scheme disclosed in sections 2.1-2.2 of "YOLOv3: An Incremental Improvement", Joseph Redmon, Ali Farhadi, University of Washington.

In order to implement the above method, the present application further provides a feature extraction and image-text fusion system for cancer lesion detection, which comprises a data acquisition module, a feature extraction module for CT images and text data, an image-text fusion module and an image-text information detection module;
the data acquisition module is used for acquiring personal information of a patient, a CT image of the patient and a text report corresponding to the CT image; the personal information of the patient comprises disease information, health condition and other information of the patient; CT images of patients and text reports can be acquired using a german siemens 52 ring scanner.
The feature extraction module of the CT image and the text data comprises a CT image format conversion unit, a CT image feature extraction unit and a text data feature extraction unit; specifically, the CT image format conversion unit is image format conversion software, and converts a CT image format acquired by the scanner, so as to facilitate subsequent processing. The CT image feature extraction unit is internally provided with a CT image data feature extraction network, the CT image data feature extraction network adopts a Darknet-53 network as a basic network, the Darknet-53 network is composed of 5 feature extraction blocks, and each feature extraction Block comprises a down-sampling layer and a Residual Block; taking the CT image after format conversion as the input of a first feature extraction block, and taking the output of a previous feature extraction block in five feature extraction blocks as the input of a next feature extraction block; the output of the last feature extraction block is the feature graph used for being fused with the text features. The text data feature extraction unit adopts a BERT model, inputs text data into the BERT model, establishes a context relationship between words, and obtains text features based on sentence levels.
Splicing the feature graph and the text features in the image-text fusion module;
and the image-text information detection module is used for calculating the coordinates and the confidence values of the lesion areas according to the guidance of the text characteristics, wherein the coordinates are used for framing the positions of the lesion areas in the original image, and the confidence values are used for judging the types of the lesions and outputting the final results.
Through the processing of the above modules, including feature extraction from the CT images and the text data and fusion of the two kinds of features, the cancer lesion detection work is completed, the detection accuracy is improved, and doctors are assisted in diagnosis.
The above embodiments are only used for illustrating the design idea and features of the present invention, and the purpose of the present invention is to enable those skilled in the art to understand the content of the present invention and implement the present invention accordingly, and the protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes and modifications made in accordance with the principles and concepts disclosed herein are intended to be included within the scope of the present invention.

Claims (9)

1. A feature extraction and image-text fusion method for cancer lesion detection is characterized by comprising the following steps:
s1, acquiring personal information of a patient, CT image data of the patient and a text report corresponding to the CT image data, which are recorded by a clinician;
s2, preprocessing the CT image data to enable the format of the CT image data to be directly fed into a neural network as input for training or prediction; carrying out convolution operation and normalization processing on the CT image data in sequence to obtain an initial characteristic diagram F;
s3, adopting a Darknet-53 network as a basic network, and extracting the characteristics of the CT image data;
s4, extracting text data features of the text report corresponding to the personal information of the patient and the CT image data by adopting a BERT model;
s5, splicing the feature graph obtained in the S3 and the text feature to realize the fusion of the character feature and the image feature and guide the network to detect the cancer lesion;
s6, performing up-sampling operation and lateral connection on the fused image-text information to form different feature levels like a pyramid, applying corresponding detection heads on the different feature levels, and calculating the position and the confidence value of the lesion area in the feature map to obtain the specific position of the lesion area in the original image so as to realize detection of the cancerous area.
2. The method for feature extraction and image-text fusion oriented to cancer lesion detection as claimed in claim 1, wherein the Darknet-53 network is composed of alternating down-sampling layers and Residual Blocks, one down-sampling layer and one Residual Block are called a feature extraction block, and 5 feature extraction blocks form the CT image data feature extraction network; the input of each feature extraction block is the output of the previous feature extraction block, and the current output is used as the input of the next feature extraction block.
3. The method for feature extraction and image-text fusion for cancer lesion detection as claimed in claim 2, wherein the numbers of times the feature extraction block operation is performed in the 5 feature extraction blocks are 1, 2, 8, 8 and 4 in sequence.
4. The method for feature extraction and image-text fusion oriented to cancer lesion detection as claimed in claim 2 or 3, wherein the feature extraction blocks of each feature extraction block operate as follows:
s3.1, performing down-sampling operation on the feature map in a down-sampling layer, and gradually reducing the size of the feature map;
s3.2, screening and weighting the feature map on the channel dimension, and highlighting information in the effective area by using pooling operation along the channel direction; expressed as:
F′ = M_c(F) ⊗ F, where M_c(F) represents the feature extraction operation on F in the channel dimension;
s3.3, supplementing position information in spatial dimension, learning a spatial weight graph by using the relationship among different spatial positions, not only supplementing the position relationship information which cannot be well acquired by a channel attention mechanism, but also modeling by using the relationship of features in space; expressed as:
F″ = M_s(F′) ⊗ F′, where M_s(F′) represents the feature extraction operation on F′ in the spatial dimension, and ⊗ denotes element-wise multiplication;
and S3.4, performing a convolution operation on the feature map F″ obtained in S3.3, wherein the convolution kernel size is 3 × 3, the step size is 1, normalization is performed by Batch Normalization, and the activation function is LeakyReLU.
5. The method for feature extraction and image-text fusion for cancer lesion detection according to claim 1, wherein the BERT model in S4 is a multi-layer bidirectional Transformer encoder; the Transformer includes an encoder mechanism and a decoder mechanism, the encoder receives text as input, and the decoder is mainly responsible for predicting results; the Transformer stacks 6 encoder layers and 6 decoder layers, 12 layers in total.
6. The method for feature extraction and image-text fusion oriented to cancer lesion detection as claimed in claim 5, wherein the Transformer does not need recurrence, but processes all words or symbols in the sequence in parallel, and combines context with more distant words by using the self-attention mechanism, fully considering the context information; the calculation rule of the attention mechanism is as follows: for each word there are 3 different vectors, namely the Query vector Q, the Key vector K and the Value vector V; a score value is calculated for each vector, where score = Q · K; a softmax activation function is applied to the scores, and the softmax weights are multiplied point-wise by the vector V to obtain the weighted value v of each vector; several attention heads are stacked to obtain a Multi-Head attention module, and the weighted values are summed to obtain the output result Z = Σ v, followed by normalization; the decoder input comprises the encoder output and the output of the previous decoder layer, and the final result is output after activation by Linear and Softmax respectively.
7. The feature extraction and image-text fusion method for cancer lesion detection as claimed in claim 1, wherein 3 detection boxes are set for each cell of the feature map, so that the prediction vector length of each cell is 3 × (4 + 4 + 1) = 27, where the first 4 corresponds to the 4 classes of the dataset, 3 represents the number of detection boxes per cell, the second 4 is the 4 position offsets of each detection box, and 1 is the confidence value that each detection box contains a target; the final image feature is represented as a 1 × 27-dimensional vector; the acquired text feature is represented as a 1 × N-dimensional vector; finally a 1 × (27 + N)-dimensional vector is obtained, i.e., the fused image-text information used by the subsequent detection network.
8. The system for extracting the characteristics and fusing the images and texts facing the cancer lesion detection is characterized by comprising a data acquisition module, a CT image and text data characteristic extraction module, an image and text fusion module and an image and text information detection module;
the data acquisition module is used for acquiring personal information of a patient, a CT image of the patient and a text report corresponding to the CT image; the personal information of the patient comprises disease information, health condition and other information of the patient;
the feature extraction module of the CT image and the text data comprises a CT image format conversion unit, a CT image feature extraction unit and a text data feature extraction unit;
splicing the feature graph and the text features in the image-text fusion module;
and the image-text information detection module is used for calculating the coordinates and the confidence values of the lesion areas according to the guidance of the text characteristics, wherein the coordinates are used for framing the positions of the lesion areas in the original image, and the confidence values are used for judging the types of the lesions and outputting the final results.
9. The system for feature extraction and image-text fusion oriented to cancer lesion detection of claim 8, wherein the CT image format conversion unit is image format conversion software for converting the CT image format collected by the scanner for subsequent processing; the CT image feature extraction unit is internally provided with a CT image data feature extraction network, the CT image data feature extraction network adopts a Darknet-53 network as the base network, the Darknet-53 network is composed of 5 feature extraction blocks, and each feature extraction block comprises a down-sampling layer and a Residual Block; the CT image after format conversion is used as the input of the first feature extraction block, and the output of a previous feature extraction block among the five feature extraction blocks is used as the input of the next feature extraction block; the output of the last feature extraction block is the feature map used for fusion with the text features; the text data feature extraction unit adopts a BERT model, inputs the text data into the BERT model, establishes context relationships between words, and obtains sentence-level text features.
CN202110705588.0A 2021-06-24 2021-06-24 Feature extraction and image-text fusion method and system for cancer lesion detection Active CN113469962B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110705588.0A CN113469962B (en) 2021-06-24 2021-06-24 Feature extraction and image-text fusion method and system for cancer lesion detection


Publications (2)

Publication Number Publication Date
CN113469962A true CN113469962A (en) 2021-10-01
CN113469962B CN113469962B (en) 2024-05-14

Family

ID=77872618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110705588.0A Active CN113469962B (en) 2021-06-24 2021-06-24 Feature extraction and image-text fusion method and system for cancer lesion detection

Country Status (1)

Country Link
CN (1) CN113469962B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190131206A (en) * 2018-05-16 2019-11-26 한양대학교 산학협력단 Deep learning-based object detection method and system utilizing global context feature of image
CN111667489A (en) * 2020-04-30 2020-09-15 华东师范大学 Cancer hyperspectral image segmentation method and system based on double-branch attention deep learning
CN112801168A (en) * 2021-01-25 2021-05-14 江苏大学 Tumor image focal region prediction analysis method and system and terminal equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114266735A (en) * 2021-12-07 2022-04-01 河海大学 Method for detecting pathological change abnormality of chest X-ray image
CN114266735B (en) * 2021-12-07 2024-06-07 河海大学 Chest X-ray image lesion abnormality detection method

Also Published As

Publication number Publication date
CN113469962B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
EP3665703B1 (en) Computer-aided diagnostics using deep neural networks
CN110969245B (en) Target detection model training method and device for medical image
WO2020133636A1 (en) Method and system for intelligent envelope detection and warning in prostate surgery
WO2024104035A1 (en) Long short-term memory self-attention model-based three-dimensional medical image segmentation method and system
CN111563897A (en) Breast nuclear magnetic image tumor segmentation method and device based on weak supervised learning
CN112132878B (en) End-to-end brain nuclear magnetic resonance image registration method based on convolutional neural network
CN112215844A (en) MRI (magnetic resonance imaging) multi-mode image segmentation method and system based on ACU-Net
US11494908B2 (en) Medical image analysis using navigation processing
CN114266735B (en) Chest X-ray image lesion abnormality detection method
CN117422715B (en) Global information-based breast ultrasonic tumor lesion area detection method
CN115471512A (en) Medical image segmentation method based on self-supervision contrast learning
CN117764917A (en) Yolov 8-based lung nodule image detection method
Tyagi et al. An amalgamation of vision transformer with convolutional neural network for automatic lung tumor segmentation
CN113469962B (en) Feature extraction and image-text fusion method and system for cancer lesion detection
CN113538363A (en) Lung medical image segmentation method and device based on improved U-Net
CN116740041B (en) CTA scanning image analysis system and method based on machine vision
CN112215285A (en) Cross-media-characteristic-based automatic fundus image labeling method
CN115690518A (en) Enteromogenous severity classification system
Wu et al. Swin Transformer based benign and malignant pulmonary nodule classification
CN115937590A (en) Skin disease image classification method with CNN and Transformer fused in parallel
US11282193B2 (en) Systems and methods for tumor characterization
Dong et al. Segmentation of pulmonary nodules based on improved UNet++
Dai et al. A Generative Data Augmentation Trained by Low-quality Annotations for Cholangiocarcinoma Hyperspectral Image Segmentation
Liang et al. Convolutional Neural Networks for Predictive Modeling of Lung Disease
CN116843661A (en) Medical image segmentation method for pneumonia

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant