CN114758360A - Multi-modal image classification model training method and device and electronic equipment


Info

Publication number
CN114758360A
Authority
CN
China
Prior art keywords
image
ultrasonic
modal
classification
feature
Prior art date
Legal status
Granted
Application number
CN202210435881.4A
Other languages
Chinese (zh)
Other versions
CN114758360B (en)
Inventor
于昕晔
马璐
丁佳
吕晨翀
Current Assignee
Zhejiang Yizhun Intelligent Technology Co ltd
Original Assignee
Beijing Yizhun Medical AI Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yizhun Medical AI Co Ltd
Priority to CN202210435881.4A
Publication of CN114758360A
Application granted
Publication of CN114758360B
Current legal status: Active

Classifications

    • G06F 18/214 Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Pattern recognition; classification techniques
    • G06N 3/048 Neural networks; architecture; activation functions
    • G06N 3/084 Neural networks; learning methods; backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Ultrasonic Diagnosis Equipment (AREA)

Abstract

The disclosure provides a multi-modal image classification model training method and device and electronic equipment, wherein the method comprises the following steps: confirming a training image set; inputting a first ultrasonic image and a first ultrasonic contrast image in the training image set into an image serialization module and a feature extraction module which are included in a multi-modal image classification model, and obtaining a first feature coding set corresponding to the first ultrasonic image and a second feature coding set corresponding to the first ultrasonic contrast image; inputting the first feature coding set and the second feature coding set into a multi-modal aggregation module included in the multi-modal image classification model to obtain classification prediction results corresponding to the first ultrasonic image and the first ultrasonic contrast image; and adjusting parameters of the multi-modal image classification model based on the difference between the classification labeling results and the classification prediction results corresponding to the first ultrasonic image and the first ultrasonic contrast image; wherein the multi-modal aggregation module comprises a multi-head self-attention layer and a multi-layer perceptron.

Description

Multi-modal image classification model training method and device and electronic equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method and an apparatus for training a multi-modal image classification model, an electronic device, and a storage medium.
Background
In the related art, images can be classified by extracting texture features and feeding them to a classifier. The texture features can be obtained through a gray-level co-occurrence matrix, wavelet transformation or Gabor transformation, and the classifier can be implemented with a support vector machine, a random forest algorithm or a Bayesian classifier. For multi-modal images, how to improve the accuracy of image classification is a problem that urgently needs to be solved.
Disclosure of Invention
The disclosure provides a multi-modal image classification model training method, a multi-modal image classification model training device and electronic equipment, which are used for at least solving the technical problems in the prior art.
According to a first aspect of the present disclosure, there is provided a multi-modal image classification model training method, including:
confirming a training image set, wherein the training image set comprises an ultrasonic image subset and an ultrasonic contrast image subset, and images in the ultrasonic image subset correspond to images in the ultrasonic contrast image subset one by one;
inputting a first ultrasonic image and a first ultrasonic contrast image in the training image set into an image serialization module and a feature extraction module which are included in the multi-modal image classification model, and obtaining a first feature coding set corresponding to the first ultrasonic image and a second feature coding set corresponding to the first ultrasonic contrast image;
inputting the first feature coding set and the second feature coding set into a multi-modal aggregation module included in the multi-modal image classification model, and obtaining classification prediction results corresponding to the first ultrasonic image and the first ultrasonic contrast image;
adjusting parameters of the multi-modal image classification model based on the difference between the classification labeling result and the classification prediction result corresponding to the first ultrasonic image and the first ultrasonic contrast image;
the multi-modal aggregation module comprises a multi-head self-attention layer, a multi-layer perceptron and a multi-layer perceptron head.
In the above scheme, the confirming training image set includes:
acquiring a second ultrasonic image and a second ultrasonic contrast image corresponding to the second ultrasonic image;
respectively acquiring a first ultrasonic image and a first ultrasonic contrast image with the same size from the second ultrasonic image and the second ultrasonic contrast image, wherein the first ultrasonic image and the first ultrasonic contrast image are images in the training image set;
wherein the first ultrasound image obtained from the ultrasound images is an image in the ultrasound image subset; the first ultrasound contrast image acquired from the ultrasound contrast image is an image of the subset of ultrasound contrast images.
In the above scheme, the inputting a first ultrasound image and a first ultrasound contrast image in the training image set into an image serialization module and a feature extraction module included in the multi-modal image classification model to obtain a first feature code set corresponding to the first ultrasound image and a second feature code set corresponding to the first ultrasound contrast image includes:
the image serialization module is used for carrying out blocking processing on the first ultrasonic image and the first ultrasonic contrast image to obtain at least two image blocks;
the feature extraction module performs feature extraction on each image block to acquire a feature code of each image block;
and confirming that the feature codes corresponding to the image blocks of the first ultrasonic image are the first feature code set, and confirming that the feature codes corresponding to the image blocks of the first ultrasonic contrast image are the second feature code set.
In the above scheme, the structure of the feature extraction module is a ResNet50 structure;
the last layer of the feature extraction module is a linear projection layer.
In the above solution, the inputting the first feature encoding set and the second feature encoding set into a multi-modal aggregation module included in the multi-modal image classification model to obtain the classification prediction results corresponding to the first ultrasound image and the first ultrasound contrast image includes:
confirming a first classification mark of the first ultrasonic image and the first ultrasonic contrast image;
adding position information codes into feature codes in the first feature code set, feature codes in the second feature code set and the first classification marks;
the position information code is used for representing the position information of the image block corresponding to the feature code in the first ultrasonic image or the first ultrasonic contrast image; the position information code is also used to distinguish the first classification flag.
In the above solution, the inputting the first feature coding set, the second feature coding set, and the first classification flag into a multi-modal aggregation module included in the multi-modal image classification model to obtain the classification prediction result corresponding to the first ultrasound image and the first ultrasound contrast image includes:
inputting feature codes with the same position information codes in the first feature code set and the second feature code set into the multi-head self-attention layer to obtain image features corresponding to the first ultrasonic image and the first ultrasonic contrast image;
inputting the image features to the multi-layer perceptron;
inputting the first classification mark into the multi-head self-attention layer and the multi-layer perceptron to obtain classification features;
and inputting the classification features into the multi-layer perceptron head to obtain a classification prediction result.
In the foregoing solution, the multi-modal aggregation module further includes:
a layer normalization structure and a skip connection structure;
wherein the layer normalization structure is located before the multi-head self-attention layer and the multi-layer perceptron.
In the above scheme, the multi-layer perceptron includes a fully connected layer, an activation function layer and a Dropout layer.
In the above solution, after the adjusting the parameters of the multi-modal image classification model based on the difference between the classification labeling result and the classification prediction result corresponding to the first ultrasound image and the first ultrasound contrast image, the method further includes:
inputting a third ultrasonic image and a third ultrasonic contrast image in the training image set into an image serialization module included in the multi-modal image classification model after parameter adjustment, and obtaining a third feature coding set corresponding to the third ultrasonic image and a fourth feature coding set corresponding to the third ultrasonic contrast image;
inputting the third feature coding set and the fourth feature coding set into a multi-modal aggregation module included in the multi-modal image classification model after the parameters are adjusted, and obtaining classification prediction results corresponding to the third ultrasonic image and the third ultrasonic contrast image;
and adjusting parameters of the multi-modal image classification model based on the difference between the classification labeling result and the classification prediction result corresponding to the third ultrasonic image and the third ultrasonic contrast image.
In the above scheme, the method further comprises:
and determining that the multi-modal image classification model is trained in response to the difference between the classification labeling result and the classification prediction result being less than a first threshold value.
According to a second aspect of the present disclosure, there is provided an image classification method applying the multi-modal image classification model described above, the method comprising:
inputting an image to be recognized into the multi-modal image classification model;
and confirming that the output of the multi-modal image classification model is the classification result of the image to be recognized.
According to a third aspect of the present disclosure, there is provided a multi-modal image classification model training apparatus, including:
the data preparation unit is used for confirming a training image set, wherein the training image set comprises an ultrasonic image subset and an ultrasonic contrast image subset, and images in the ultrasonic image subset correspond to images in the ultrasonic contrast image subset one by one;
a first data processing unit, configured to input a first ultrasound image and a first ultrasound contrast image in the training image set into an image serialization module included in the multi-modal image classification model, and obtain a first feature coding set corresponding to the first ultrasound image and a second feature coding set corresponding to the first ultrasound contrast image;
the second data processing unit is used for inputting the first feature coding set and the second feature coding set into a multi-modal aggregation module included in the multi-modal image classification model, and obtaining classification prediction results corresponding to the first ultrasonic image and the first ultrasonic contrast image;
the adjusting unit is used for adjusting parameters of the multi-modal image classification model based on the difference between the classification labeling result and the classification prediction result corresponding to the first ultrasonic image and the first ultrasonic contrast image;
wherein, the multi-modal aggregation module comprises a multi-head self-attention layer and a multi-layer perceptron.
According to a fourth aspect of the present disclosure, there is provided an image classification apparatus including:
the input unit is used for inputting an image to be recognized to the multi-modal image classification model;
and the recognition unit is used for confirming the output of the multi-modal image classification model as the classification result of the image to be recognized.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the multi-modal image classification model training method and the image classification method of the present disclosure.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the multimodal image classification model training method and the image classification method of the present disclosure.
According to the multi-modal image classification model training method, a training image set is confirmed, wherein the training image set comprises an ultrasonic image subset and an ultrasonic contrast image subset, and images in the ultrasonic image subset correspond to images in the ultrasonic contrast image subset one to one; a first ultrasonic image and a first ultrasonic contrast image in the training image set are input into an image serialization module included in the multi-modal image classification model, and a first feature coding set corresponding to the first ultrasonic image and a second feature coding set corresponding to the first ultrasonic contrast image are obtained; the first feature coding set and the second feature coding set are input into a multi-modal aggregation module included in the multi-modal image classification model, and classification prediction results corresponding to the first ultrasonic image and the first ultrasonic contrast image are obtained; parameters of the multi-modal image classification model are adjusted based on the difference between the classification labeling result and the classification prediction result corresponding to the first ultrasonic image and the first ultrasonic contrast image; the multi-modal aggregation module comprises a multi-head self-attention layer and a multi-layer perceptron, so that the accuracy of image classification can be improved.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 is a schematic flow chart diagram illustrating an alternative method for training a multi-modal image classification model according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating an alternative structure of a multi-headed self-attention layer provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart illustrating an alternative image classification method provided by the embodiment of the present disclosure;
FIG. 4 is a schematic flow chart diagram illustrating an alternative multi-modal image classification model training method provided by the embodiment of the present disclosure;
FIG. 5 shows a data schematic diagram of a multi-modal image classification model training method provided by the embodiment of the disclosure;
fig. 6 is a schematic diagram illustrating an alternative structure of a transformer network included in the multi-modal aggregation module according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating an alternative structure of a multi-modal image classification model training apparatus provided in an embodiment of the present disclosure;
fig. 8 is a schematic diagram illustrating an alternative structure of an image classification apparatus provided in an embodiment of the present disclosure;
fig. 9 is a schematic diagram illustrating a composition structure of an electronic device according to an embodiment of the disclosure.
Detailed Description
In order to make the objects, features and advantages of the present disclosure more apparent and understandable, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
Before the embodiments of the present application are described in further detail, the terms and expressions referred to in the embodiments of the present application are explained as follows.
1) Transformer network (Transformer) technology, originally applied to Natural Language Processing (NLP), is a deep self-attention transformation network for extracting intrinsic features of text data. Recently, Transformer technology has made breakthroughs in computer vision, and many Transformer-based methods have been used for computer vision tasks, such as the Detection Transformer (DETR) for object detection, the sequence-to-sequence Segmentation Transformer (SETR) for semantic segmentation, the Vision Transformer (ViT) for image classification, and the Data-efficient image Transformer (DeiT).
2) Self-Attention mechanism (Self-Attention, SA). May also be referred to as a self-attention layer. After the vector X is input, it is first converted into three different vectors: query matrix Q, key matrix K, and value matrix V:
Q = X·Wq, K = X·Wk, V = X·Wv
where Wq, Wk and Wv are shared learnable parameter matrices. The weight assigned to each value V is then determined by the dot product of the query Q and the corresponding key K. The attention function between different input vectors is calculated as follows:
Attention(Q, K, V) = softmax(Q·K^T / sqrt(dk))·V
where dk is the dimension of the key vector K. The scaling factor 1/sqrt(dk) provides proper normalization to make the gradient more stable.
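As an illustration only (not part of the disclosure), a minimal sketch of the scaled dot-product self-attention defined above might be written as follows in PyTorch; the function and variable names are hypothetical.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for an input sequence x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                 # Q = X*Wq, K = X*Wk, V = X*Wv
    d_k = k.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # scaling stabilizes the gradient
    weights = torch.softmax(scores, dim=-1)             # weight assigned to each value V
    return weights @ v

# toy usage: a sequence of 4 tokens with dimension 8
x = torch.randn(4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                  # shape (4, 8)
print(out.shape)
```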
3) The Multi-head Self-Attention mechanism (MSA) is formed by combining a plurality of self-attention mechanisms: the input is split into h heads, self-attention is calculated for each head separately, and all attention outputs are finally concatenated to obtain the final result, specifically:
head_i = Attention(Q·Wq_i, K·Wk_i, V·Wv_i)
MultiHead = Concat(head_1, ..., head_h)·Wo
where Wq_i, Wk_i, Wv_i and Wo are learnable parameters and h is the number of heads.
The advantage of the multi-head self-attention mechanism is that differently initialized mapping matrices Wq, Wk and Wv map the input vectors to different subspaces, which allows the model to understand the input sequence from different angles. The combined effect of several self-attention mechanisms used at the same time may be better than that of a single self-attention mechanism.
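Likewise for illustration only, a self-contained multi-head self-attention sketch is given below, assuming PyTorch; it projects the input, splits it into h heads, computes attention per head, concatenates the results and applies the output projection Wo, as described above. All class and parameter names are illustrative.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention: h heads computed in parallel, then concatenated."""
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        self.w_o = nn.Linear(dim, dim)   # the output projection Wo

    def forward(self, q_in, k_in, v_in):
        b, n, _ = q_in.shape             # assumes q_in, k_in, v_in have the same length n
        def split(t):
            # (batch, tokens, dim) -> (batch, heads, tokens, head_dim)
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.w_q(q_in)), split(self.w_k(k_in)), split(self.w_v(v_in))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        out = torch.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).reshape(b, n, -1)   # concatenate the heads
        return self.w_o(out)

msa = MultiHeadSelfAttention(dim=64, num_heads=8)
tokens = torch.randn(2, 10, 64)           # batch of 2, 10 tokens, dim 64
print(msa(tokens, tokens, tokens).shape)  # plain self-attention: Q, K, V from the same input
```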
Existing image multi-modal fusion approaches generally fall into the following modes: 1. the images are fused before being input into the network; the defect is that the internal relations among images of different modalities are difficult to establish, which degrades model performance; 2. the image of each modality is trained in its own network and the extracted features are then fused; the defect is that each modality requires its own neural network, which brings a huge computational cost, especially as the number of modalities increases; 3. the images of each modality produce their own decision results, which are then fused into a final result. Therefore, there is an urgent need to combine these three fusion strategies effectively. A good multi-modal fusion strategy should achieve as many interactions between different modalities as possible with low computational complexity.
The Transformer has succeeded on natural images, but it has received little attention in medical image analysis, especially in multi-modal medical image fusion. Moreover, the Transformer needs a very large data set to surpass the CNN, and ViT only reaches its reported performance after pre-training on the private Google image dataset JFT-300M. Datasets in the medical imaging field are smaller and lack sufficient information to establish relationships between low-level semantic features.
Therefore, aiming at the defects existing in the image classification method in the related art, the present disclosure provides a multi-modal image classification model training method to solve at least some or all of the above defects.
Fig. 1 shows an alternative flow chart of the multi-modal image classification model training method provided by the embodiment of the disclosure, which will be described according to various steps.
Step S101, confirming a training image set.
In some embodiments, a multi-modal image classification model training device (hereinafter referred to as a first device) acquires a second ultrasound image and a second ultrasound contrast image corresponding to the second ultrasound image; respectively acquiring a first ultrasonic image and a first ultrasonic contrast image which have the same size from the second ultrasonic image and the second ultrasonic contrast image, wherein the first ultrasonic image and the first ultrasonic contrast image are images in the training image set; wherein the first ultrasound image obtained from the ultrasound images is an image in the ultrasound image subset; the first ultrasound contrast image acquired from the ultrasound contrast image is an image of the subset of ultrasound contrast images.
Step S102, inputting the first ultrasound image and the first ultrasound contrast image in the training image set into an image serialization module and a feature extraction module included in the multi-modal image classification model, and obtaining a first feature coding set corresponding to the first ultrasound image and a second feature coding set corresponding to the first ultrasound contrast image.
In some embodiments, the first device performs a blocking process on the first ultrasound image and the first ultrasound contrast image based on the image serialization module to obtain at least two image blocks; the feature extraction module performs feature extraction on each image block to acquire a feature code of each image block; and confirming that the feature codes corresponding to the image blocks of the first ultrasonic image are the first feature code set, and confirming that the feature codes corresponding to the image blocks of the first ultrasonic contrast image are the second feature code set.
In a specific implementation, the first apparatus may, based on the image serialization module, partition the first ultrasound image and the first ultrasound contrast image into N image blocks of size K × K each; further, assuming that the ultrasound image subset includes M ultrasound images, after passing through the image serialization module the ultrasound image subset includes M × N image blocks; similarly, the ultrasound contrast image subset also includes M × N image blocks.
In some alternative embodiments, the first ultrasound image is denoted as x and the first ultrasound contrast image is denoted as y; the image block corresponding to the first ultrasound image can be represented as x_i, and the image block corresponding to the first ultrasound contrast image can be represented as y_i, where i ∈ {1, 2, ..., N}.
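For illustration, a minimal serialization sketch under the above notation is given below (PyTorch assumed); the function name, the block size K = 32 and the image size 224 × 224 are hypothetical choices, not values from the disclosure.

```python
import torch

def serialize_image(img, k):
    """Split an image tensor of shape (C, H, W) into non-overlapping K x K blocks.

    Returns a tensor of shape (N, C, K, K), where N = (H // K) * (W // K).
    Assumes H and W are divisible by K (the disclosure resizes images to a uniform size).
    """
    c, h, w = img.shape
    blocks = img.unfold(1, k, k).unfold(2, k, k)            # (C, H//K, W//K, K, K)
    blocks = blocks.permute(1, 2, 0, 3, 4).reshape(-1, c, k, k)
    return blocks

# toy usage: a 3-channel 224 x 224 ultrasound image and its contrast counterpart, K = 32
x = torch.randn(3, 224, 224)   # first ultrasound image
y = torch.randn(3, 224, 224)   # first ultrasound contrast image
x_blocks, y_blocks = serialize_image(x, 32), serialize_image(y, 32)
print(x_blocks.shape)           # torch.Size([49, 3, 32, 32]) -> N = 49 blocks
```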
In some embodiments, the structure of the feature extraction module is a ResNet50 structure; the last layer of the feature extraction module is a linear projection layer.
In specific implementation, the feature extraction module maps each image block into a one-dimensional space to obtain a one-dimensional feature code. For example, the feature code corresponding to an image block of the first ultrasound image is denoted fx_i, and the feature code corresponding to an image block of the first ultrasound contrast image is denoted fy_i.
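A hedged sketch of such a feature extraction module is shown below, assuming PyTorch and torchvision; whether the two modalities share one encoder, and the embedding dimension of 768, are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class BlockEncoder(nn.Module):
    """Illustrative feature extraction module: a ResNet-50 backbone whose last layer is
    replaced by a linear projection, mapping each K x K image block to a 1-D feature code."""
    def __init__(self, embed_dim=768):
        super().__init__()
        backbone = resnet50()
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)  # linear projection layer
        self.backbone = backbone

    def forward(self, blocks):
        # blocks: (N, 3, K, K) image blocks -> (N, embed_dim) feature codes
        return self.backbone(blocks)

encoder = BlockEncoder(embed_dim=768)
x_blocks = torch.randn(49, 3, 32, 32)     # blocks of the first ultrasound image
y_blocks = torch.randn(49, 3, 32, 32)     # blocks of the first ultrasound contrast image
fx = encoder(x_blocks)                    # first feature code set, shape (49, 768)
fy = encoder(y_blocks)                    # second feature code set, shape (49, 768)
print(fx.shape, fy.shape)
```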
Step S103, inputting the first feature coding set and the second feature coding set into a multi-modal aggregation module included in the multi-modal image classification model, and obtaining a classification prediction result corresponding to the first ultrasound image and the first ultrasound contrast image.
In some embodiments, the first device identifies a first classification marker for the first ultrasound image and the first ultrasound contrast image; adding position information codes into feature codes in the first feature code set, feature codes in the second feature code set and the first classification marks; the position information code is used for representing the position information of the image block corresponding to the feature code in the first ultrasonic image or the first ultrasonic contrast image; the position information code is also used to distinguish the first classification flag.
In a specific implementation, the first classification flag may be a Class token (cls token), and the first device may generate the first classification flag based on a method in the related art.
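For illustration, one possible way to prepend a cls token and add learnable position information codes to the feature codes of the two modalities is sketched below (PyTorch assumed; all names and dimensions are hypothetical).

```python
import torch
import torch.nn as nn

class TokenAssembler(nn.Module):
    """Illustrative sketch: prepend a learnable class token (cls token) to the feature codes
    of both modalities and add learnable position information codes (p_cls, p_x_i, p_y_i)."""
    def __init__(self, num_blocks, embed_dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # one position code for the cls token and one per image block of each modality
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + 2 * num_blocks, embed_dim))

    def forward(self, fx, fy):
        # fx, fy: (batch, N, embed_dim) feature codes of the two modalities
        b = fx.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, fx, fy], dim=1)      # (batch, 1 + 2N, embed_dim)
        return tokens + self.pos_embed                # position codes distinguish blocks and cls

assembler = TokenAssembler(num_blocks=49, embed_dim=768)
fx = torch.randn(2, 49, 768)
fy = torch.randn(2, 49, 768)
print(assembler(fx, fy).shape)   # torch.Size([2, 99, 768])
```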
In other embodiments, the first device inputs feature codes with the same position information code in the first feature code set and the second feature code set into the multi-head self-attention layer, and obtains image features corresponding to the first ultrasound image and the first ultrasound contrast image; inputs the image features to the multi-layer perceptron; inputs the first classification mark into the multi-head self-attention layer and the multi-layer perceptron to obtain classification features; and inputs the classification features into the multi-layer perceptron head to obtain a classification prediction result. The multi-modal aggregation module comprises a multi-head self-attention layer, a multi-layer perceptron and a multi-layer perceptron head.
In some optional embodiments, the multi-modal aggregation module further comprises a layer normalization structure and a skip connection structure, wherein the layer normalization structure is located before the multi-head self-attention layer and the multi-layer perceptron; the multi-layer perceptron includes a fully connected layer, an activation function layer and a Dropout layer.
In specific implementation, the layer normalization structure improves the training speed and accuracy of the model and makes the model more stable; the skip connection structure alleviates the vanishing-gradient problem that arises when the network is deep, facilitates back-propagation of the gradient, and accelerates the training process.
Fig. 2 shows an alternative structural diagram of a multi-head self-attention layer provided by the embodiment of the disclosure.
In the related art, the inputs of the multi-head self-attention layer are generally the same vector, i.e., as shown in fig. 2, Q, K and V receive the same vector. In the present disclosure, however, the inputs of the multi-head self-attention layer are different vectors. Specifically, as shown in fig. 2, the feature code fx_i of an image block of the first ultrasound image is input to Q, and the feature code fy_i of an image block of the first ultrasound contrast image is input to K and V; optionally, fx_i may be input to Q and K, and fy_i may be input to V. It should be noted that Q, K and V process their input feature codes in exactly the same way, and the present disclosure does not limit the order in which fx_i and fy_i are input to Q, K and V of the multi-head self-attention layer: they may receive one fx_i and two fy_i in any arrangement, i.e., fy_i, fx_i, fy_i; or fy_i, fy_i, fx_i; or fx_i, fy_i, fy_i; the final output is the same. The case in which Q, K and V receive two fx_i and one fy_i is similar to the case of one fx_i and two fy_i, and the present disclosure is not particularly limited. It should also be noted that when the numbers of fx_i and fy_i input to the multi-head self-attention layer differ, the output results differ; the arrangement can be set according to requirements or according to experimental results, and the disclosure is not particularly limited.
The feature codes fx_i of image blocks of the first ultrasound image and fy_i of image blocks of the first ultrasound contrast image that are input to Q, K and V of the multi-head self-attention layer first pass through a linear transformation (Linear) and are fed into at least one self-attention layer (as shown in fig. 2, h self-attention layers are included, where h is an integer larger than 1); the outputs of the self-attention layers are then concatenated (Concat), linearly transformed, and input to the multi-layer perceptron included in the multi-modal aggregation module.
Furthermore, although the training set used in the present disclosure includes only two modalities, namely ultrasound images and ultrasound contrast images, the embodiments of the present disclosure are also applicable to scenarios with three or more modalities, in which the inputs of Q, K and V can be completely different.
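Purely as an illustration of this cross-modal arrangement of Q, K and V, the following sketch uses PyTorch's nn.MultiheadAttention (the batch_first argument requires a reasonably recent PyTorch); the arrangement fx to Q and fy to K and V is one of the orderings mentioned above, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

# One possible way to feed different modalities to Q, K and V, as described above.
# The module, dimensions, and arrangement (fx -> Q, fy -> K and V) are illustrative only.
embed_dim, num_heads = 768, 8
cross_modal_msa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

fx = torch.randn(2, 49, embed_dim)   # feature codes of the ultrasound image blocks
fy = torch.randn(2, 49, embed_dim)   # feature codes of the ultrasound contrast image blocks

# Q receives fx, K and V receive fy (one fx and two fy); other arrangements are possible.
fused, _ = cross_modal_msa(query=fx, key=fy, value=fy)
print(fused.shape)                   # torch.Size([2, 49, 768])
```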
Step S104, based on the difference between the classification labeling result and the classification prediction result corresponding to the first ultrasound image and the first ultrasound contrast image, adjusting the parameters of the multi-modal image classification model.
In some embodiments, the first apparatus inputs a third ultrasound image and a third ultrasound contrast image in the training image set into an image serialization module included in the multi-modal image classification model after parameter adjustment, and obtains a third feature coding set corresponding to the third ultrasound image and a fourth feature coding set corresponding to the third ultrasound contrast image; inputs the third feature coding set and the fourth feature coding set into a multi-modal aggregation module included in the multi-modal image classification model after the parameters are adjusted, and obtains classification prediction results corresponding to the third ultrasound image and the third ultrasound contrast image; and adjusts parameters of the multi-modal image classification model based on the difference between the classification labeling result and the classification prediction result corresponding to the third ultrasound image and the third ultrasound contrast image.
Training is repeated in this way until all the images in the ultrasound image subset and the ultrasound contrast image subset included in the training image set have been used to train the multi-modal image classification model, at which point the multi-modal image classification model is confirmed to be trained; alternatively, in response to the difference between the classification labeling result and the classification prediction result being smaller than a first threshold, it is determined that training of the multi-modal image classification model is completed. The first threshold may be determined based on actual needs and/or experimental results; the disclosure is not particularly limited.
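A minimal sketch of this parameter-adjustment loop is given below, assuming PyTorch; the cross-entropy loss, the Adam optimizer, and the data-loader format are assumptions, since this section does not fix them.

```python
import torch
import torch.nn as nn

def train_until_converged(model, loader, first_threshold=0.05, max_epochs=100):
    """Illustrative training loop: adjust model parameters until the loss (the difference
    between labeling and prediction results) falls below a first threshold, or the data
    has been cycled for max_epochs epochs. Cross-entropy is an assumed choice of loss."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(max_epochs):
        for ultrasound, contrast, label in loader:      # paired images + classification label
            prediction = model(ultrasound, contrast)    # classification prediction result
            loss = criterion(prediction, label)         # difference from the labeling result
            optimizer.zero_grad()
            loss.backward()                             # backpropagation
            optimizer.step()                            # parameter adjustment
            if loss.item() < first_threshold:
                return model                            # training is considered complete
    return model
```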
Therefore, with the multi-modal image classification model training method provided by the embodiment of the disclosure, a training image set is determined, wherein the training image set comprises an ultrasound image subset and an ultrasound contrast image subset, and images in the ultrasound image subset correspond to images in the ultrasound contrast image subset one to one; a first ultrasound image and a first ultrasound contrast image in the training image set are input into an image serialization module included in the multi-modal image classification model, and a first feature coding set corresponding to the first ultrasound image and a second feature coding set corresponding to the first ultrasound contrast image are obtained; the first feature coding set and the second feature coding set are input into a multi-modal aggregation module included in the multi-modal image classification model, and classification prediction results corresponding to the first ultrasound image and the first ultrasound contrast image are obtained; parameters of the multi-modal image classification model are adjusted based on the difference between the classification labeling result and the classification prediction result corresponding to the first ultrasound image and the first ultrasound contrast image; the multi-modal aggregation module comprises a multi-head self-attention layer and a multi-layer perceptron. Because the training image set comprises both ultrasound images and ultrasound contrast images, rich image information can be obtained. In addition, when the feature codes corresponding to the image blocks are determined, the ResNet50 structure extracts the low-level features, from whose sequence the long-term dependency relationships between sequences can then be effectively extracted, so that the multi-modal image features are fused and the multi-modal image classification model achieves good performance. Finally, the multi-head self-attention layer takes the feature codes of the ultrasound image and the ultrasound contrast image as inputs, so that multi-modal features can be fused together, the model learns the features of the same object from different angles under different modalities, and powerful support is provided for subsequent multi-modal image classification.
Fig. 3 shows an alternative flowchart of the image classification method provided by the embodiment of the present disclosure, which will be described according to various steps.
Step S301, inputting an image to be identified into a multi-modal image classification model.
In some embodiments, an image classification device (hereinafter referred to as a second device) inputs an image to be recognized into a multi-modal image classification model. The multi-modal image classification model may be a multi-modal image classification model trained in steps S101 to S104.
In specific implementation, the second device obtains a feature coding set corresponding to the image to be recognized based on the image serialization module included in the multi-modal image classification model; the multi-modal aggregation module generates a classification mark based on the feature coding set, inputs the classification mark into the multi-layer perceptron head, and obtains a classification result of the image to be recognized.
Step S302, confirming that the output of the multi-modal image classification model is the classification result of the image to be recognized.
In some embodiments, the second device confirms the output of the multi-modal image classification model as the classification result of the image to be recognized; wherein the classification result may include: the image to be recognized comprises a first object, and/or the image to be recognized does not comprise the first object. Wherein the object may be a cell, a bone, or the like.
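As an illustrative sketch of this inference step (PyTorch assumed), the following helper feeds the image to be recognized to the trained model and reads off the classification result; the class names and the model's expected input format are hypothetical.

```python
import torch

def classify(model, image_to_be_recognized,
             class_names=("does not include first object", "includes first object")):
    """Illustrative inference sketch. `image_to_be_recognized` is whatever input the trained
    multi-modal image classification model expects (e.g. a paired ultrasound/contrast tensor)."""
    model.eval()
    with torch.no_grad():
        logits = model(image_to_be_recognized)
        probs = torch.softmax(logits, dim=-1).squeeze(0)
        idx = int(probs.argmax())
    return class_names[idx], float(probs[idx])
```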
Therefore, by the image classification method provided by the embodiment of the disclosure and the multi-modal image classification model provided by the disclosure, the accuracy of image classification can be improved.
Next, taking an ultrasound image and an ultrasound contrast image of a liver as an example, the multi-modal image classification model training method provided by the embodiment of the disclosure is further explained.
Hepatocellular carcinoma is the most common liver malignancy, with approximately 70% of liver cancer cases being hepatocellular carcinoma. Hepatocellular carcinoma ranks fifth in the most common tumors worldwide, and fourth in the number of deaths worldwide associated with cancer. Hepatocellular carcinoma usually develops from cirrhosis. Today, the gold standard for liver cancer diagnosis is needle biopsy, but this is an invasive, dangerous technique, as it may lead to spread of the tumor within the body and also to infection. Ultrasound examination is a cheap, non-invasive, non-radiative medical examination method, and therefore has repeatability, and is suitable for patient disease monitoring. The ultrasonic contrast is an improved technology based on ultrasound, and utilizes a contrast agent to enhance the back scattering echo and obviously improve the resolution, sensitivity and specificity of ultrasonic diagnosis.
In ultrasound images, adipocytes, necrosis, fibrosis and actively growing tissues are interwoven, hepatocellular carcinoma occurs in the late stages of evolution, and ultrasound images are characterized by high echogenicity. In ultrasound contrast images, hepatocellular carcinoma is more prominent due to the dense and complex vascular structure characteristic of malignant tumors. However, in ultrasound and ultrasound contrast images, hepatocellular carcinoma is in many cases very difficult to distinguish substantially from its evolving cirrhosis, and therefore advanced computer methods are required to span the limitations of the human eye in a non-invasive manner.
The prior art generally uses methods for identifying hepatocellular carcinoma, such as: based on extracting the texture features and classifying by a classifier, the method for extracting the texture features mainly comprises the following steps: gray level co-occurrence matrix, wavelet and Gabor transformation, and the classifier mainly comprises: support Vector Machines (SVM), random forests, Fisher linear discriminant methods, or bayesian classifiers. Recently, deep learning techniques, such as Deep Belief Networks (DBNs), Recurrent Neural Networks (RNNs), and CNN-based classifiers, have been successfully used for automated diagnosis of medical images. The development of deep learning techniques has shown its advantages in many cases, such as fatty liver identification in ultrasound images, breast tumor identification in ultrasound images, liver lesion identification, liver tumor identification and segmentation, lung nodule detection, etc.
In the medical field, the prior art generally performs lesion detection and analysis on images of the same modality, and in practice, doctors need to combine images of multiple modalities to perform final diagnosis of lesions.
Based on the above, the present disclosure provides a multi-modal image classification model training method, which implements multi-modal fusion by using two modalities, namely an ultrasound image and an ultrasound contrast image, corresponding to the same lesion. The method combines the advantages of the CNN and the Transformer to capture low-level features and cross-modal high-level features: the multi-modal images are processed into sequences and sent to the CNN, and the Transformer is then used to learn the relationships between the sequences and make predictions.
Fig. 4 shows another alternative flow diagram of the multi-modal image classification model training method provided by the embodiment of the disclosure, and fig. 5 shows a data diagram of the multi-modal image classification model training method provided by the embodiment of the disclosure, which will be described according to various steps.
In the disclosure, the multi-modal image classification model comprises an image serialization module, a feature extraction module, a multi-modal aggregation module and a loss module. The image serialization module is a data processing module used for preprocessing an image and serializing it into image blocks; the feature extraction module is used for extracting low-level features of the images of the two modalities and preliminarily encoding these features into one-dimensional vectors (feature codes); the multi-modal aggregation module is used for fusing the encoded image features extracted by the feature extraction module and obtaining a classification result for the input data; and the loss module is used for optimizing the model parameters by minimizing the error between the model result and the labeling result.
Step S401, confirming the training image set.
In some embodiments, the training image set includes an ultrasound image subset and an ultrasound contrast image subset, and images in the ultrasound image subset correspond to images in the ultrasound contrast image subset one to one. Optionally, the images in the training image set are composed of a series of paired liver ultrasound images (first ultrasound images) and liver ultrasound contrast images (first ultrasound contrast images), and the images in the training image set include labeling information of a lesion region and classification information of the lesion region (or lesion benign and malignant labeling information).
In some embodiments, the first means performs data preprocessing: the lesion-bearing region of each pair of liver ultrasound and liver ultrasound contrast images is cut out (i.e., the lesion region is cropped as in fig. 5) and resized to a uniform size, giving M images in data set A and M images in data set B, respectively.
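For illustration, a minimal preprocessing sketch is given below (PyTorch assumed); the bounding-box annotation format, the output size of 224 × 224, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def crop_and_resize_lesion(image, bbox, out_size=224):
    """Illustrative preprocessing: cut out the annotated lesion region from an image tensor
    of shape (C, H, W) and resize it to a uniform out_size x out_size. bbox = (x0, y0, x1, y1)
    in pixel coordinates is an assumed annotation format."""
    x0, y0, x1, y1 = bbox
    lesion = image[:, y0:y1, x0:x1]
    lesion = F.interpolate(lesion.unsqueeze(0), size=(out_size, out_size),
                           mode="bilinear", align_corners=False)
    return lesion.squeeze(0)

# toy usage: the same bbox is applied to a paired ultrasound and ultrasound contrast image
ultrasound = torch.randn(3, 512, 512)
contrast = torch.randn(3, 512, 512)
bbox = (100, 120, 300, 330)
a_img = crop_and_resize_lesion(ultrasound, bbox)   # goes into data set A
b_img = crop_and_resize_lesion(contrast, bbox)     # goes into data set B
print(a_img.shape, b_img.shape)                    # torch.Size([3, 224, 224]) each
```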
In step S402, image serialization processing is performed.
In some embodiments, the first device performs a blocking process on the liver ultrasound image and the liver ultrasound contrast image based on the image serialization module to obtain at least two image blocks.
In a specific implementation, the first device may, based on the image serialization module, partition the liver ultrasound image and the liver ultrasound contrast image into N image blocks (that is, the image division in fig. 5, where the size of each image block is K × K); further, assuming that the ultrasound image subset includes M ultrasound images, after passing through the image serialization module the ultrasound image subset includes M × N image blocks; similarly, the ultrasound contrast image subset also includes M × N image blocks.
In some alternative embodiments, the liver ultrasound image is denoted as x and the liver ultrasound contrast image is denoted as y; the image block corresponding to the liver ultrasound image can be represented as x_i, and the image block corresponding to the liver ultrasound contrast image can be represented as y_i, where i ∈ {1, 2, ..., N}.
In step S403, feature extraction processing is performed.
In some embodiments, the first device performs feature extraction on each image block through the feature extraction module to obtain a feature code of each image block; the feature codes corresponding to the image blocks of the liver ultrasound image are confirmed to be the first feature code set, and the feature codes corresponding to the image blocks of the liver ultrasound contrast image are confirmed to be the second feature code set. The structure of the feature extraction module is a ResNet50 structure, and the last layer of the feature extraction module is a linear projection layer. Optionally, the feature extraction module may also adopt other network structures, such as other ResNet structures (e.g., ResNet-18), common network structures (e.g., AlexNet), and lightweight network structures (e.g., MobileNet and ShuffleNet), all of which can realize the invention.
In specific implementation, the first device respectively acquires, from data set A and data set B, the N image blocks of the liver ultrasound image and the N image blocks of the liver ultrasound contrast image corresponding to the same liver image; the 2N (namely N + N) image blocks corresponding to the liver ultrasound image and the liver ultrasound contrast image are respectively input to the feature extraction module; the feature extraction module is mainly built on a CNN algorithm and maps the image blocks into a one-dimensional space to obtain one-dimensional feature codes. For example, the feature code corresponding to the liver ultrasound image is denoted fx_i, and the feature code corresponding to the liver ultrasound contrast image is denoted fy_i. The number of image blocks is the same as the number of generated feature codes.
Step S404, multimodal aggregation processing.
In some embodiments, the apparatus confirms the first classification flag and/or the positional information encoding of the liver ultrasound image and the liver ultrasound contrast image based on a multi-modal aggregation module; adding position information codes into feature codes in the first feature code set, feature codes in the second feature code set and the first classification marks; the position information code is used for representing the position information of the image block corresponding to the feature code in the liver ultrasonic image or the liver ultrasonic contrast image; the position information code is also used to distinguish the first classification flag.
In specific implementation, the first device inputs feature codes with the same position information codes in the first feature code set and the second feature code set into the multi-head self-attention layer to obtain image features corresponding to the liver ultrasound image and the liver ultrasound contrast image; inputting the image features to the multi-layer perceptron; and inputting the first classification mark to the multi-head attention layer and the multi-layer perceptron to obtain a classification prediction result.
In specific implementation, the multi-modal aggregation module generates a first classification mark (cls token) based on the feature codes corresponding to the N image blocks of the liver ultrasound image and the feature codes corresponding to the N image blocks of the liver ultrasound contrast image; the first classification mark serves as an input to the subsequent stages of the multi-modal aggregation module and can be used for image classification. Position information codes are added to the feature codes corresponding to the N liver ultrasound image blocks, the feature codes corresponding to the N liver ultrasound contrast image blocks, and the first classification mark. The position information codes may be denoted p_cls, p_x_i and p_y_i, i ∈ {1, 2, ..., N}, and represent the relative position of each feature code on the image, preventing the position information from being lost in subsequent feature extraction.
The first device inputs the feature codes corresponding to the N liver ultrasound image blocks, the feature codes corresponding to the N liver ultrasound contrast image blocks, and the first classification mark into the multi-modal aggregation module. The multi-modal aggregation module may include a Transformer encoder and a multi-layer perceptron head; the Transformer encoder includes a multi-head self-attention layer and a multi-layer perceptron, with a layer normalization structure (LN) and a skip connection structure similar to ResNet added at its head and tail; the multi-layer perceptron consists of a fully connected layer, a GELU activation function, and a Dropout layer.
Fig. 6 shows an alternative structural diagram of a converter network included in the multi-modal aggregation module provided in the embodiment of the present disclosure.
As shown in fig. 6, the Transformer encoder (converter network) included in the multi-modal aggregation module comprises a multi-head self-attention layer (MSA) and a multi-layer perceptron (MLP), wherein a first layer normalization structure (LN) and a second layer normalization structure are respectively disposed before the multi-head self-attention layer and the multi-layer perceptron. Through the skip connection structure, the input of the multi-modal aggregation module in fig. 6 is not only fed into the first layer normalization structure and the multi-head self-attention layer, but is also added to the output of the multi-head self-attention layer and then fed into the second layer normalization structure and the multi-layer perceptron; in addition, the input of the second layer normalization structure and the multi-layer perceptron is added to the output of the multi-layer perceptron to serve as the output of the Transformer encoder.
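An illustrative sketch of such an encoder block (pre-layer-normalization with skip connections, and an MLP of fully connected, GELU and Dropout layers) is given below in PyTorch. Note that in the disclosure the multi-head self-attention layer receives feature codes of different modalities as Q, K and V, whereas this sketch simply self-attends over the token sequence; the dimensions and the dropout rate are assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Illustrative sketch of the Transformer encoder block in fig. 6: layer normalization
    before the multi-head self-attention layer and before the multi-layer perceptron,
    with skip connections around both."""
    def __init__(self, dim=768, num_heads=8, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                 # first layer normalization structure
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)                 # second layer normalization structure
        self.mlp = nn.Sequential(                      # fully connected + GELU + Dropout
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(mlp_dim, dim), nn.Dropout(dropout))

    def forward(self, tokens):
        h = self.norm1(tokens)
        attn_out, _ = self.msa(h, h, h)
        tokens = tokens + attn_out                     # skip connection around the MSA
        return tokens + self.mlp(self.norm2(tokens))   # skip connection around the MLP

block = EncoderBlock()
tokens = torch.randn(2, 99, 768)    # cls token + 2N feature codes
print(block(tokens).shape)          # torch.Size([2, 99, 768])
```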
In specific implementation, taking the case in which the input of the multi-modal aggregation module is fx_i, fy_i, fy_i as an example: the first layer normalization structure processes fx_i, fy_i, fy_i in turn; as shown in fig. 2, the normalized fx_i, fy_i, fy_i are linearly transformed and respectively input into the self-attention layers; the outputs of the several self-attention layers are concatenated and linearly transformed, added to the original inputs fx_i, fy_i, fy_i, and fed into the second layer normalization structure and the multi-layer perceptron; the result is then added to the input of the second layer normalization structure to obtain the output of the multi-modal aggregation module. Finally, the feature vector corresponding to the first classification mark is input into the multi-head self-attention layer and the multi-layer perceptron to obtain classification features, and the classification features are input into the multi-layer perceptron head (MLP Head) included in the multi-modal aggregation module; the MLP Head consists of a fully connected layer and a softmax activation function, and yields the final classification prediction result for the liver ultrasound images. The classification prediction result may indicate that the liver ultrasound contrast image includes cells of a first type (the lesion is malignant) or does not include cells of the first type (the lesion is benign). The first type of cell may be a liver cancer cell.
In step S405, the loss module adjusts parameters of the multi-modal image classification model.
In some embodiments, the first apparatus adjusts parameters of the multi-modal image classification model by a loss module based on a difference between the classification labeling result and the classification prediction result corresponding to the first ultrasound image and the first ultrasound contrast image.
In other optional embodiments, the first apparatus repeatedly performs steps S402 to S405: it inputs a third ultrasound image and a third ultrasound contrast image in the training image set into the image serialization module included in the multi-modal image classification model after parameter adjustment, and obtains a third feature coding set corresponding to the third ultrasound image and a fourth feature coding set corresponding to the third ultrasound contrast image; inputs the third feature coding set and the fourth feature coding set into the multi-modal aggregation module included in the multi-modal image classification model after the parameters are adjusted, and obtains classification prediction results corresponding to the third ultrasound image and the third ultrasound contrast image; and adjusts parameters of the multi-modal image classification model based on the difference between the classification labeling result and the classification prediction result corresponding to the third ultrasound image and the third ultrasound contrast image.
This is repeated until all the images in the ultrasound image subset and the ultrasound contrast image subset included in the training image set have been used to train the multi-modal image classification model, and completion of the training of the multi-modal image classification model is confirmed; alternatively, in response to the difference between the classification labeling result and the classification prediction result being smaller than the first threshold, it is determined that the multi-modal image classification model has been completely trained. The first threshold may be determined based on actual needs and/or experimental results; the disclosure is not particularly limited.
Therefore, the multi-modal image classification model training method provided by the embodiment of the disclosure uses two image modalities of the liver (an ultrasound image and an ultrasound contrast image), and can fully learn the rich information of the focus under the two image modalities. The method provided by the disclosure is a hybrid model containing a CNN and a Transformer. The CNN is used as a low-level feature extraction tool to generate a local feature sequence of the multi-modal images; a ResNet50 structure is used in the disclosure. The Transformer can then effectively extract the long-term dependency relationships among sequences from the low-level feature sequence, so that the multi-modal image features are fused and good performance is obtained. In the method proposed by the present disclosure, the Transformer is used in the multi-modal aggregation module and is improved at the multi-head self-attention layer: usually Q, K and V of the multi-head attention layer receive the same input, while in the present disclosure Q, K and V receive feature codes of images of different modalities, so that multi-modal features can be fused together and the model learns the features of the same lesion from different angles under different modalities.
Further, in this disclosure the structure used for the feature extraction module is ResNet-50, but other network structures can also implement the disclosure, for example other ResNet structures such as ResNet-18, common network structures such as AlexNet, and lightweight network structures such as MobileNet and ShuffleNet; the disclosure is not particularly limited in this respect.
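As a hedged illustration of this interchangeability, a torchvision-based factory for the feature extraction module might look as follows; the supported backbone names, the embedding dimension and the placement of the final linear projection layer are assumptions, not part of the disclosure.

```python
import torch.nn as nn
from torchvision import models

def build_feature_extractor(name="resnet50", out_dim=768):
    """Illustrative backbone swap; the last layer is the linear projection layer."""
    if name == "resnet50":
        backbone, feat_dim = models.resnet50(weights=None), 2048
    elif name == "resnet18":
        backbone, feat_dim = models.resnet18(weights=None), 512
    elif name == "mobilenet_v2":
        backbone, feat_dim = models.mobilenet_v2(weights=None).features, 1280
    else:
        raise ValueError(f"unsupported backbone: {name}")
    if name.startswith("resnet"):
        backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier
    return nn.Sequential(
        backbone,
        nn.AdaptiveAvgPool2d(1),       # global pooling (a no-op after the ResNet trunk)
        nn.Flatten(1),
        nn.Linear(feat_dim, out_dim),  # linear projection layer
    )
```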
Fig. 7 shows an alternative schematic structural diagram of the multi-modal image classification model training apparatus provided in the embodiments of the present disclosure, which is described below part by part.
In some embodiments, the multi-modal image classification model training apparatus 600 includes: a data preparation unit 601, a first data processing unit 602, a second data processing unit 603, and an adjustment unit 604.
The data preparation unit 601 is configured to determine a training image set, where the training image set includes an ultrasound image subset and an ultrasound contrast image subset, and images in the ultrasound image subset correspond to images in the ultrasound contrast image subset one to one;
the first data processing unit 602 is configured to input a first ultrasound image and a first ultrasound contrast image in the training image set into an image serialization module and a feature extraction module included in the multi-modal image classification model, and obtain a first feature coding set corresponding to the first ultrasound image and a second feature coding set corresponding to the first ultrasound contrast image;
the second data processing unit 603 is configured to input the first feature coding set and the second feature coding set into a multi-modal aggregation module included in the multi-modal image classification model, and obtain a classification prediction result corresponding to the first ultrasound image and the first ultrasound contrast image;
the adjusting unit 604 is configured to adjust parameters of the multi-modal image classification model based on a difference between a classification labeling result corresponding to the first ultrasound image and the first ultrasound contrast image and the classification prediction result, based on a loss module included in the multi-modal image classification model;
the multi-modal aggregation module comprises a multi-head self-attention layer, a multi-layer perceptron and a multi-layer perceptron head.
The data preparation unit 601 is specifically configured to acquire a second ultrasound image and a second ultrasound contrast image corresponding to the second ultrasound image; and to acquire, from the second ultrasound image and the second ultrasound contrast image respectively, a first ultrasound image and a first ultrasound contrast image of the same size, the first ultrasound image and the first ultrasound contrast image being images in the training image set; wherein the first ultrasound image acquired from the second ultrasound image is an image in the ultrasound image subset, and the first ultrasound contrast image acquired from the second ultrasound contrast image is an image in the ultrasound contrast image subset.
The first data processing unit 602 is specifically configured to perform block processing on the first ultrasound image and the first ultrasound contrast image based on the image serialization module to obtain at least two image blocks; extracting the features of each image block based on the feature extraction module to obtain the feature codes of each image block; and confirming that the feature codes corresponding to the image blocks of the first ultrasonic image are the first feature code set, and confirming that the feature codes corresponding to the image blocks of the first ultrasonic contrast image are the second feature code set.
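A minimal sketch of this block processing followed by per-block feature extraction is given below; the patch size, the single-image input shape and the helper name are assumptions for illustration, and `feature_extractor` stands for any module like the one sketched above.

```python
import torch

def serialize_and_encode(image, feature_extractor, patch_size=32):
    """Split one (1, C, H, W) image into patch_size x patch_size blocks and
    encode each block into one feature code (patch size assumed)."""
    _, c, _, _ = image.shape
    blocks = (
        image.unfold(2, patch_size, patch_size)   # split along height
             .unfold(3, patch_size, patch_size)   # split along width
             .permute(0, 2, 3, 1, 4, 5)           # (1, nH, nW, C, p, p)
             .reshape(-1, c, patch_size, patch_size)
    )
    return feature_extractor(blocks)              # (num_blocks, embed_dim) feature code set
```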
In some embodiments, the structure of the feature extraction module is a ResNet50 structure; the last layer of the feature extraction module is a linear projection layer.
The second data processing unit 603 is specifically configured to confirm the first classification mark of the first ultrasound image and the first ultrasound contrast image; and to add position information codes to the feature codes in the first feature code set, the feature codes in the second feature code set, and the first classification mark; the position information code represents the position information, in the first ultrasound image or the first ultrasound contrast image, of the image block corresponding to the feature code; the position information code is also used to distinguish the first classification mark.
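One hedged way to realize this is a learnable class token plus learnable position embeddings that are shared by the blocks at the same position in the two modalities, as in the sketch below; the module name, the tensor shapes and the use of learnable (rather than fixed) position codes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TokenAssembler(nn.Module):
    """Adds the first classification mark (class token) and position
    information codes to the feature code sets of both modalities."""

    def __init__(self, num_blocks, dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # one position code per block position plus one that distinguishes the class token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_blocks + 1, dim))

    def forward(self, fx, fy):
        # fx, fy: (batch, num_blocks, dim) feature code sets of the two modalities
        cls = self.cls_token.expand(fx.shape[0], -1, -1)
        fx = torch.cat([cls, fx], dim=1) + self.pos_embed  # same position code for blocks
        fy = torch.cat([cls, fy], dim=1) + self.pos_embed  # at the same image position
        return fx, fy
```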
The second data processing unit 603 is specifically configured to input feature codes in the first feature code set and the second feature code set, where the position information codes are the same, into the multi-head self-attention layer, and obtain image features corresponding to the first ultrasound image and the first ultrasound contrast image; inputting the image features to the multi-layer perceptron; inputting the first classification mark into the multi-head attention layer and the multi-layer perceptron to obtain classification features;
and inputting the classification features into the multi-layer perceptron head (MLP Head) to obtain the classification prediction result.
In some embodiments, the multi-modal aggregation module further comprises: a layer normalization structure and a skip connection structure; wherein the layer normalization structure is located before the multi-head attention layer and the multi-layer perceptron.
In some embodiments, the multi-layered perceptron includes a fully connected layer, an activation function layer, and a Dropout layer.
The first data processing unit 602, after adjusting parameters of the multi-modal image classification model based on a difference between the classification labeling result and the classification prediction result corresponding to the first ultrasound image and the first ultrasound contrast image, is further configured to input a third ultrasound image and a third ultrasound contrast image in the training image set into an image serialization module included in the multi-modal image classification model after parameter adjustment, so as to obtain a third feature coding set corresponding to the third ultrasound image and a fourth feature coding set corresponding to the third ultrasound contrast image;
the second data processing unit 603 is further configured to input the third feature coding set and the fourth feature coding set into a multi-modal aggregation module included in the multi-modal image classification model after the parameters are adjusted, so as to obtain a classification prediction result corresponding to the third ultrasound image and the third ultrasound contrast image;
the adjusting unit 604 is further configured to adjust parameters of the multi-modal image classification model based on a difference between a classification labeling result corresponding to the third ultrasound image and the third ultrasound contrast image and a classification prediction result corresponding to the third ultrasound image and the third ultrasound contrast image.
The adjusting unit 604 is specifically configured to determine that the multi-modal image classification model is trained completely in response to that a difference between the classification labeling result and the classification prediction result is smaller than a first threshold.
Fig. 8 is a schematic diagram illustrating an alternative structure of an image classification apparatus provided in an embodiment of the present disclosure, which will be described according to various parts.
In some embodiments, the image classification apparatus 700 includes an input unit 701 and an identification unit 702.
The input unit 701 is configured to input an image to be recognized to the multi-modal image classification model;
the identifying unit 702 is configured to confirm that the output of the multi-modal image classification model is a classification result of the image to be identified.
The present disclosure also provides an electronic device and a readable storage medium according to an embodiment of the present disclosure.
Fig. 9 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 9, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the electronic device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806 such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the respective methods and processes described above, such as the multi-modal image classification model training method and/or the image classification method. For example, in some embodiments, the multi-modal image classification model training method and/or the image classification method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the multi-modal image classification model training method and/or the image classification method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the multi-modal image classification model training method and/or the image classification method by any other suitable means (for example, by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, "a plurality" means two or more unless specifically limited otherwise.
The above description is only for the specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present disclosure, and all the changes or substitutions should be covered within the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (15)

1. A multi-modal image classification model training method, the method comprising:
confirming a training image set, wherein the training image set comprises an ultrasonic image subset and an ultrasonic contrast image subset, and images in the ultrasonic image subset correspond one to one to images in the ultrasonic contrast image subset;
inputting a first ultrasonic image and a first ultrasonic contrast image in the training image set into an image serialization module and a feature extraction module which are included in the multi-modal image classification model, and obtaining a first feature coding set corresponding to the first ultrasonic image and a second feature coding set corresponding to the first ultrasonic contrast image;
inputting the first feature coding set and the second feature coding set into a multi-modal aggregation module included in the multi-modal image classification model, and obtaining a classification prediction result corresponding to the first ultrasonic image and the first ultrasonic contrast image;
adjusting parameters of the multi-modal image classification model based on the difference between the classification labeling result corresponding to the first ultrasonic image and the first ultrasonic contrast image and the classification prediction result;
the multi-modal aggregation module comprises a multi-head self-attention layer, a multi-layer perceptron and a multi-layer perceptron head.
2. The method of claim 1, wherein the validating a training image set comprises:
acquiring a second ultrasonic image and a second ultrasonic contrast image corresponding to the second ultrasonic image;
respectively acquiring a first ultrasonic image and a first ultrasonic contrast image with the same size from the second ultrasonic image and the second ultrasonic contrast image, wherein the first ultrasonic image and the first ultrasonic contrast image are images in the training image set;
wherein the first ultrasound image obtained from the ultrasound images is an image in the ultrasound image subset; the first ultrasound contrast image acquired from the ultrasound contrast image is an image of the subset of ultrasound contrast images.
3. The method of claim 1, wherein the inputting the first ultrasound image and the first ultrasound contrast image in the training image set into an image serialization module and a feature extraction module included in the multi-modal image classification model to obtain a first feature code set corresponding to the first ultrasound image and a second feature code set corresponding to the first ultrasound contrast image comprises:
performing, by the image serialization module, block processing on the first ultrasonic image and the first ultrasonic contrast image to obtain at least two image blocks;
performing, by the feature extraction module, feature extraction on each image block to obtain a feature code of each image block;
and confirming that the feature codes corresponding to the image blocks of the first ultrasonic image are the first feature code set, and confirming that the feature codes corresponding to the image blocks of the first ultrasonic contrast image are the second feature code set.
4. The method according to claim 1 or 3,
the structure of the feature extraction module is a ResNet50 structure;
the last layer of the feature extraction module is a linear projection layer.
5. The method according to claim 1, wherein the inputting the first feature encoding set and the second feature encoding set into a multi-modal aggregation module included in the multi-modal image classification model to obtain the classification prediction results corresponding to the first ultrasound image and the first ultrasound contrast image comprises:
confirming a first classification mark of the first ultrasonic image and the first ultrasonic contrast image;
adding position information codes into feature codes in the first feature code set, feature codes in the second feature code set and the first classification marks;
the position information code is used for representing the position information of the image block corresponding to the feature code in the first ultrasonic image or the first ultrasonic contrast image; the position information encoding is also used to distinguish the first classification flag.
6. The method according to claim 5, wherein the inputting the first feature encoding set and the second feature encoding set into a multi-modal aggregation module included in the multi-modal image classification model to obtain the classification prediction results corresponding to the first ultrasound image and the first ultrasound contrast image comprises:
inputting feature codes with the same position information codes in the first feature code set and the second feature code set into the multi-head self-attention layer to obtain image features corresponding to the first ultrasonic image and the first ultrasonic contrast image;
inputting the image features to the multi-layer perceptron;
inputting the first classification mark into the multi-head attention layer and the multi-layer perceptron to obtain classification features;
and inputting the classification features into the multi-layer perceptron head to obtain a classification prediction result.
7. The method of claim 1 or 6, wherein the multimodal aggregation module further comprises:
a layer normalization structure and a skip connection structure;
wherein the layer normalization structure is located before the multi-head attention layer and the multi-layer perceptron.
8. The method according to claim 1 or 6,
the multi-layer perceptron comprises a fully connected layer, an activation function layer and a Dropout layer.
9. The method of claim 1, wherein after adjusting the parameters of the multi-modal image classification model based on the difference between the classification labeling result and the classification prediction result corresponding to the first ultrasound image and the first ultrasound contrast image, the method further comprises:
inputting a third ultrasonic image and a third ultrasonic contrast image in the training image set into an image serialization module included in the multi-modal image classification model after parameter adjustment, and obtaining a third feature coding set corresponding to the third ultrasonic image and a fourth feature coding set corresponding to the third ultrasonic contrast image;
inputting the third feature coding set and the fourth feature coding set into a multi-modal aggregation module included in the multi-modal image classification model after the parameters are adjusted, and obtaining a classification prediction result corresponding to the third ultrasonic image and the third ultrasonic contrast image;
and adjusting parameters of the multi-modal image classification model based on the difference between the classification labeling result corresponding to the third ultrasonic image and the third ultrasonic contrast image and the classification prediction result corresponding to the third ultrasonic image and the third ultrasonic contrast image.
10. The method according to claim 1 or 9, further comprising:
and determining that the multi-modal image classification model is trained in response to the difference between the classification labeling result and the classification prediction result being less than a first threshold value.
11. An image classification method, characterized by applying the multi-modal image classification model trained according to any one of claims 1 to 10, the method comprising:
inputting an image to be recognized into the multi-modal image classification model;
and confirming that the output of the multi-modal image classification model is the classification result of the image to be recognized.
12. A multi-modal image classification model training apparatus, the apparatus comprising:
the data preparation unit is used for confirming a training image set, wherein the training image set comprises an ultrasonic image subset and an ultrasonic contrast image subset, and images in the ultrasonic image subset correspond one to one to images in the ultrasonic contrast image subset;
a first data processing unit, configured to input a first ultrasound image and a first ultrasound contrast image in the training image set into an image serialization module and a feature extraction module included in the multi-modal image classification model, and obtain a first feature coding set corresponding to the first ultrasound image and a second feature coding set corresponding to the first ultrasound contrast image;
the second data processing unit is used for inputting the first feature coding set and the second feature coding set into a multi-modal aggregation module included in the multi-modal image classification model, and obtaining a classification prediction result corresponding to the first ultrasonic image and the first ultrasonic contrast image;
the adjusting unit is used for adjusting parameters of the multi-modal image classification model based on the difference between the classification labeling result corresponding to the first ultrasonic image and the first ultrasonic contrast image and the classification prediction result;
the multi-modal aggregation module comprises a multi-head self-attention layer, a multi-layer perceptron and a multi-layer perceptron head.
13. An image classification apparatus, characterized by applying the multi-modal image classification model trained according to any one of claims 1 to 10, the apparatus comprising:
the input unit is used for inputting an image to be recognized to the multi-modal image classification model;
and the recognition unit is used for confirming the output of the multi-modal image classification model as the classification result of the image to be recognized.
14. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10;
or to perform the method of claim 11.
15. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10;
or to perform the method of claim 11.
CN202210435881.4A 2022-04-24 2022-04-24 Multi-modal image classification model training method and device and electronic equipment Active CN114758360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210435881.4A CN114758360B (en) 2022-04-24 2022-04-24 Multi-modal image classification model training method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN114758360A true CN114758360A (en) 2022-07-15
CN114758360B CN114758360B (en) 2023-04-18

Family

ID=82332347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210435881.4A Active CN114758360B (en) 2022-04-24 2022-04-24 Multi-modal image classification model training method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114758360B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390410A1 (en) * 2020-06-12 2021-12-16 Google Llc Local self-attention computer vision neural networks
CN112712117A (en) * 2020-12-30 2021-04-27 银江股份有限公司 Full convolution attention-based multivariate time series classification method and system
CN113449085A (en) * 2021-09-02 2021-09-28 华南师范大学 Multi-mode emotion classification method and device and electronic equipment
CN113496489A (en) * 2021-09-06 2021-10-12 北京字节跳动网络技术有限公司 Training method of endoscope image classification model, image classification method and device
US20220012848A1 (en) * 2021-09-25 2022-01-13 Intel Corporation Methods and apparatus to perform dense prediction using transformer blocks
CN113837308A (en) * 2021-09-29 2021-12-24 北京百度网讯科技有限公司 Knowledge distillation-based model training method and device and electronic equipment
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer
CN113688813A (en) * 2021-10-27 2021-11-23 长沙理工大学 Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SHARIF AMIT KAMRAN et al.: "VTGAN: Semi-supervised Retinal Image Synthesis and Disease Prediction using Vision Transformers", arXiv.org *
YIN DAI et al.: "TransMed: Transformers Advance Multi-Modal Medical Image Classification", Machine Learning for Computer-Aided Diagnosis in Biomedical Imaging *
倪维健; 郭浩宇; 刘彤; 曾庆田: "Shopping basket recommendation method based on multi-head self-attention neural network" (in Chinese)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937567A (en) * 2022-09-07 2023-04-07 北京交通大学 Image classification method based on wavelet scattering network and ViT
CN116844143A (en) * 2023-09-01 2023-10-03 武汉互创联合科技有限公司 Embryo development stage prediction and quality assessment system based on edge enhancement
CN116844143B (en) * 2023-09-01 2023-12-05 武汉互创联合科技有限公司 Embryo development stage prediction and quality assessment system based on edge enhancement
CN117636074A (en) * 2024-01-25 2024-03-01 山东建筑大学 Multi-mode image classification method and system based on feature interaction fusion
CN117636074B (en) * 2024-01-25 2024-04-26 山东建筑大学 Multi-mode image classification method and system based on feature interaction fusion
CN118196495A (en) * 2024-03-18 2024-06-14 四川大学华西医院 Target classification method, device, electronic equipment and storage medium
CN118503921A (en) * 2024-07-22 2024-08-16 徐州医科大学 Intelligent infant pneumonia classification method based on multi-mode data

Also Published As

Publication number Publication date
CN114758360B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN114758360B (en) Multi-modal image classification model training method and device and electronic equipment
WO2020215984A1 (en) Medical image detection method based on deep learning, and related device
EP3665703B1 (en) Computer-aided diagnostics using deep neural networks
KR20210048523A (en) Image processing method, apparatus, electronic device and computer-readable storage medium
CN114565763B (en) Image segmentation method, device, apparatus, medium and program product
CN110276408B (en) 3D image classification method, device, equipment and storage medium
US10366488B2 (en) Image processing used to estimate abnormalities
CN112396605B (en) Network training method and device, image recognition method and electronic equipment
CN113362314B (en) Medical image recognition method, recognition model training method and device
KR102097743B1 (en) Apparatus and Method for analyzing disease based on artificial intelligence
CN113724185B (en) Model processing method, device and storage medium for image classification
US20220319673A1 (en) Methods and systems for new data storage and management scheme for medical imaging solutions
Ji et al. Lung nodule detection in medical images based on improved YOLOv5s
CN115170510B (en) Focus detection method and device, electronic equipment and readable storage medium
CN115409990A (en) Medical image segmentation method, device, equipment and storage medium
CN111968137A (en) Head CT image segmentation method and device, electronic device and storage medium
Wen et al. Short‐term and long‐term memory self‐attention network for segmentation of tumours in 3D medical images
CN112164447B (en) Image processing method, device, equipment and storage medium
CN114419375B (en) Image classification method, training device, electronic equipment and storage medium
JP2024054748A (en) Method for generating language feature extraction model, information processing device, information processing method and program
CN115482261A (en) Blood vessel registration method, device, electronic equipment and storage medium
CN111598904B (en) Image segmentation method, device, equipment and storage medium
CN115222665B (en) Plaque detection method and device, electronic equipment and readable storage medium
Wu et al. 3D U-TFA: A deep convolutional neural network for automatic segmentation of glioblastoma
CN117036788B (en) Image classification method, method and device for training image classification model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address
CP03 Change of name, title or address

Address after: Room 3011, 2nd Floor, Building A, No. 1092 Jiangnan Road, Nanmingshan Street, Liandu District, Lishui City, Zhejiang Province, 323000

Patentee after: Zhejiang Yizhun Intelligent Technology Co.,Ltd.

Address before: No. 1202-1203, 12 / F, block a, Zhizhen building, No. 7, Zhichun Road, Haidian District, Beijing 100083

Patentee before: Beijing Yizhun Intelligent Technology Co.,Ltd.