CN116129141A - Medical data processing method, apparatus, device, medium and computer program product - Google Patents
Medical data processing method, apparatus, device, medium and computer program product
- Publication number
- CN116129141A (application number CN202310083897.8A)
- Authority
- CN
- China
- Prior art keywords
- modal
- image
- medical
- momentum
- cross
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000004590 computer program Methods 0.000 title claims abstract description 31
- 238000003672 processing method Methods 0.000 title claims abstract description 24
- 238000010801 machine learning Methods 0.000 claims abstract description 244
- 238000012545 processing Methods 0.000 claims abstract description 102
- 238000000034 method Methods 0.000 claims abstract description 87
- 238000005457 optimization Methods 0.000 claims abstract description 9
- 238000000605 extraction Methods 0.000 claims description 57
- 238000011176 pooling Methods 0.000 claims description 43
- 230000000007 visual effect Effects 0.000 claims description 41
- 230000004927 fusion Effects 0.000 claims description 29
- 230000011218 segmentation Effects 0.000 claims description 17
- 238000013473 artificial intelligence Methods 0.000 abstract description 17
- 239000000523 sample Substances 0.000 description 194
- 230000000875 corresponding effect Effects 0.000 description 183
- 238000012549 training Methods 0.000 description 40
- 230000008569 process Effects 0.000 description 35
- 238000005516 engineering process Methods 0.000 description 16
- 238000004891 communication Methods 0.000 description 11
- 238000013507 mapping Methods 0.000 description 11
- 238000010586 diagram Methods 0.000 description 10
- 239000000284 extract Substances 0.000 description 9
- 230000006870 function Effects 0.000 description 8
- 238000003058 natural language processing Methods 0.000 description 6
- 210000002569 neuron Anatomy 0.000 description 6
- 238000003064 k means clustering Methods 0.000 description 5
- 238000013528 artificial neural network Methods 0.000 description 4
- 238000011160 research Methods 0.000 description 4
- 230000008859 change Effects 0.000 description 3
- 238000013145 classification model Methods 0.000 description 3
- 201000010099 disease Diseases 0.000 description 3
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 description 3
- 238000002595 magnetic resonance imaging Methods 0.000 description 3
- 239000013074 reference sample Substances 0.000 description 3
- 239000013598 vector Substances 0.000 description 3
- 230000004913 activation Effects 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 2
- 230000006399 behavior Effects 0.000 description 2
- 238000002591 computed tomography Methods 0.000 description 2
- 230000002596 correlated effect Effects 0.000 description 2
- 238000013500 data storage Methods 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 238000002059 diagnostic imaging Methods 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- OKTJSMMVPCPJKN-UHFFFAOYSA-N Carbon Chemical compound [C] OKTJSMMVPCPJKN-UHFFFAOYSA-N 0.000 description 1
- 206010028980 Neoplasm Diseases 0.000 description 1
- 230000005856 abnormality Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000003190 augmentative effect Effects 0.000 description 1
- 230000033228 biological regulation Effects 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 210000000988 bone and bone Anatomy 0.000 description 1
- 210000004556 brain Anatomy 0.000 description 1
- 230000001413 cellular effect Effects 0.000 description 1
- 238000011976 chest X-ray Methods 0.000 description 1
- 230000000052 comparative effect Effects 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000003745 diagnosis Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 229910021389 graphene Inorganic materials 0.000 description 1
- 238000003709 image segmentation Methods 0.000 description 1
- 238000003384 imaging method Methods 0.000 description 1
- 230000006698 induction Effects 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 210000000056 organ Anatomy 0.000 description 1
- 201000003144 pneumothorax Diseases 0.000 description 1
- 238000002600 positron emission tomography Methods 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 230000002787 reinforcement Effects 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 230000001360 synchronised effect Effects 0.000 description 1
- 210000001519 tissue Anatomy 0.000 description 1
- 238000013526 transfer learning Methods 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
- 238000002604 ultrasonography Methods 0.000 description 1
- 238000012285 ultrasound imaging Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
- G06V30/1801—Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
- G06V30/18019—Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections by matching or filtering
- G06V30/18038—Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters
- G06V30/18048—Biologically-inspired filters, e.g. difference of Gaussians [DoG], Gabor filters with interaction between the responses of different filters, e.g. cortical complex cells
- G06V30/18057—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19147—Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Medical Informatics (AREA)
- Biodiversity & Conservation Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Public Health (AREA)
- Primary Health Care (AREA)
- Epidemiology (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
- Image Analysis (AREA)
Abstract
The application relates to a medical data processing method, apparatus, device, medium and computer program product applied to the field of artificial intelligence. The method comprises the following steps: extracting features of a medical image and a medical report through a machine learning model to obtain conventional image features, momentum image features, conventional text features and momentum text features; determining a first cross-modal loss value by taking the conventional image features as a first cross-modal anchor point and combining the momentum text features; determining a second cross-modal loss value by taking the conventional text features as a second cross-modal anchor point and combining the momentum image features; determining a multi-modal loss value based on the conventional image features, the momentum image features, the conventional text features and the momentum text features; optimizing parameters of the model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value to obtain a target machine learning model; and performing data processing on target medical data through the target machine learning model. By adopting the method, the accuracy of medical data processing can be improved.
Description
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular, to a medical data processing method, apparatus, device, medium and computer program product.
Background
With the rapid adoption of computer and big data technology in the medical field and the gradual maturing of medical information storage standards, medical data is growing explosively. Medical data is inherently multi-modal: diagnostic reports and medical images produced by a variety of medical imaging devices, such as X-ray, computed tomography, magnetic resonance imaging, ultrasound imaging and positron emission tomography, together constitute multi-modal data. These multi-modal data tend to appear simultaneously and complement each other. In the medical field, they are mixed and co-exist, forming semantically similar and interrelated complex features.
The quality of multi-modal feature extraction directly affects the performance of multi-modal medical tasks. Machine learning models currently applied to medical data processing cannot fully exploit multi-modal data during the training stage, so the trained models have poor feature expression capability, which ultimately leads to low accuracy in medical data processing.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a medical data processing method, apparatus, device, medium and computer program product capable of improving the accuracy of medical data processing.
In a first aspect, the present application provides a medical data processing method. The method comprises the following steps:
acquiring an image report pair consisting of a medical image of a visual modality and a medical report of a text modality;
extracting features of the medical image and the medical report in the image report pair through a machine learning model to obtain conventional image features and momentum image features corresponding to the medical image, and conventional text features and momentum text features corresponding to the medical report;
taking the conventional image features as a first cross-modal anchor point, and determining a first cross-modal loss value based on the first cross-modal anchor point and the momentum text features; and taking the conventional text features as a second cross-modal anchor point, and determining a second cross-modal loss value based on the second cross-modal anchor point and the momentum image features;
determining a multimodal loss value based on the conventional image feature, the momentum image feature, the conventional text feature, and the momentum text feature;
Optimizing parameters of the machine learning model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value to obtain a target machine learning model;
performing data processing on the target medical data through the target machine learning model; the target medical data includes at least one of a target medical image or a target medical report.
In a second aspect, the present application also provides a medical data processing apparatus. The device comprises:
an image report pair acquisition module for acquiring an image report pair composed of a medical image of a visual modality and a medical report of a text modality;
the feature extraction module is used for extracting features of the medical image and the medical report in the image report pair through a machine learning model to obtain conventional image features and momentum image features corresponding to the medical image, and conventional text features and momentum text features corresponding to the medical report;
the loss value determining module is used for taking the conventional image features as a first cross-modal anchor point and determining a first cross-modal loss value based on the first cross-modal anchor point and the momentum text features; taking the conventional text features as a second cross-modal anchor point and determining a second cross-modal loss value based on the second cross-modal anchor point and the momentum image features; and determining a multi-modal loss value based on the conventional image features, the momentum image features, the conventional text features and the momentum text features;
The parameter optimization module is used for optimizing parameters of the machine learning model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value to obtain a target machine learning model;
the data processing module is used for carrying out data processing on the target medical data through the target machine learning model; the target medical data includes at least one of a target medical image or a target medical report.
In one embodiment, the feature extraction module is further configured to:
respectively extracting the characteristics of the medical image in the image report pair through a conventional image encoder and a momentum image encoder of a machine learning model to obtain conventional image characteristics and momentum image characteristics corresponding to the medical image;
and respectively extracting the characteristics of the medical report in the image report pair through a conventional text encoder and a momentum text encoder of the machine learning model to obtain conventional text characteristics and momentum text characteristics corresponding to the medical report.
In one embodiment, the feature extraction module is further configured to:
performing data enhancement processing on the medical image to obtain a first medical image and a second medical image corresponding to the medical image;
Extracting the characteristics of the first medical image through a conventional image encoder of the machine learning model to obtain conventional image characteristics corresponding to the medical image;
and extracting the characteristics of the second medical image through a momentum image encoder of the machine learning model to obtain momentum image characteristics corresponding to the medical image.
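Purely as an illustrative aid (not part of the patent disclosure), the two-view image branch described in this embodiment could be sketched as follows in PyTorch; the specific augmentations, the ResNet-50 backbone and the 128-dimensional output are assumptions.

```python
import torch
from torchvision import transforms
from torchvision.models import resnet50

# Hypothetical data-enhancement pipeline producing two views of one medical image.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

conventional_image_encoder = resnet50(num_classes=128)   # gradient-updated encoder
momentum_image_encoder = resnet50(num_classes=128)       # momentum (EMA-updated) copy
momentum_image_encoder.load_state_dict(conventional_image_encoder.state_dict())
for p in momentum_image_encoder.parameters():
    p.requires_grad = False  # updated only via the momentum rule, not by back-propagation

def encode_image_pair(pil_image):
    """Return (conventional_feature, momentum_feature) for one medical image."""
    first_view = augment(pil_image).unsqueeze(0)   # first medical image (view 1)
    second_view = augment(pil_image).unsqueeze(0)  # second medical image (view 2)
    conventional_feature = conventional_image_encoder(first_view)
    with torch.no_grad():
        momentum_feature = momentum_image_encoder(second_view)
    return conventional_feature, momentum_feature
```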
In one embodiment, the feature extraction module is further configured to:
dividing the medical report into sentence sets according to semantics;
extracting features of the sentence set through a conventional text encoder of the machine learning model to obtain conventional text features corresponding to the medical report;
and extracting features of the sentence set through a momentum text encoder of the machine learning model to obtain momentum text features corresponding to the medical report.
In one embodiment, a conventional text encoder of the machine learning model includes a feature extraction network, a first pooling layer, and a second pooling layer; the feature extraction module is further configured to:
extracting features of the sentence set through the feature extraction network to obtain word features;
carrying out pooling operation on the word characteristics through the first pooling layer to obtain sentence characteristics;
And carrying out pooling operation on the sentence characteristics through the second pooling layer to obtain the conventional text characteristics corresponding to the medical report.
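The hierarchical word-to-sentence-to-report pooling described above could, for illustration only, be sketched as follows; the Hugging Face-style backbone interface, mean pooling and dimensions are assumptions rather than details taken from the disclosure.

```python
import torch
import torch.nn as nn

class HierarchicalTextEncoder(nn.Module):
    """Sketch of a conventional text encoder: word features -> sentence features -> report feature."""
    def __init__(self, backbone, hidden_dim=768, out_dim=128):
        super().__init__()
        self.backbone = backbone            # feature extraction network (e.g. a BERT-style model)
        self.projection = nn.Linear(hidden_dim, out_dim)

    def forward(self, sentence_batches):
        # sentence_batches: list of token dicts, one per sentence of the semantically split report
        sentence_features = []
        for tokens in sentence_batches:
            word_features = self.backbone(**tokens).last_hidden_state   # (1, num_words, hidden)
            # first pooling layer: pool word features into one sentence feature
            sentence_features.append(word_features.mean(dim=1))
        sentence_features = torch.cat(sentence_features, dim=0)         # (num_sentences, hidden)
        # second pooling layer: pool sentence features into one report-level text feature
        report_feature = sentence_features.mean(dim=0, keepdim=True)    # (1, hidden)
        return self.projection(report_feature)
```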
In one embodiment, the number of image report pairs is at least two; the loss value determining module is further configured to:
for medical images of each image report pair, taking the conventional image features corresponding to the medical image in question as a first cross-modal anchor point, taking the momentum text features corresponding to medical reports belonging to the same image report pair as a first cross-modal positive sample, taking the momentum text features corresponding to medical reports belonging to different image report pairs as a first cross-modal negative sample, and determining a first cross-modal loss value based on the first cross-modal anchor point, the first cross-modal positive sample and the first cross-modal negative sample;
the loss value determining module is further configured to:
for medical reports of each of the image report pairs, taking the conventional text features corresponding to the medical report in question as a second cross-modal anchor point, taking the momentum image features corresponding to the medical image belonging to the same image report pair as a second cross-modal positive sample, taking the momentum image features corresponding to medical images belonging to different image report pairs as a second cross-modal negative sample, and determining a second cross-modal loss value based on the second cross-modal anchor point, the second cross-modal positive sample and the second cross-modal negative sample.
In one embodiment, the first cross-modal negative-sample and the second cross-modal negative-sample are maintained by first-in-first-out memory queues, respectively.
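For illustration, and under the assumption of a MoCo-style fixed-size queue (sizes below are illustrative, not taken from the disclosure), such a first-in-first-out memory queue of momentum features might look like this:

```python
import torch

class FeatureQueue:
    """FIFO memory queue maintaining momentum features used as cross-modal negative samples."""
    def __init__(self, feature_dim=128, queue_size=4096):
        self.queue = torch.nn.functional.normalize(torch.randn(queue_size, feature_dim), dim=1)
        self.ptr = 0
        self.size = queue_size

    @torch.no_grad()
    def enqueue(self, momentum_features):
        """Insert the newest momentum features; the oldest entries are overwritten (dequeued)."""
        n = momentum_features.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.size
        self.queue[idx] = momentum_features.detach()
        self.ptr = (self.ptr + n) % self.size

    def negatives(self):
        return self.queue  # all stored features act as negative samples
```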
In one embodiment, the multi-modal loss values include a first multi-modal loss value and a second multi-modal loss value; the loss value determining module is further configured to:
determining a multimodal prototype feature based on the momentum image feature and the momentum text feature;
taking the conventional image feature as a first multi-modal anchor point, and determining a first multi-modal loss value based on the first multi-modal anchor point and the multi-modal prototype feature;
and taking the conventional text feature as a second multi-modal anchor point, and determining a second multi-modal loss value based on the second multi-modal anchor point and the multi-modal prototype feature.
In one embodiment, the loss value determination module is further configured to:
fusing the momentum image features and the corresponding momentum text features to obtain fusion features;
clustering the fusion features to obtain a clustering center;
and determining the cluster center as a multi-modal prototype feature.
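As a sketch only, this prototype construction could be implemented with k-means clustering; the element-wise averaging used for fusion, the scikit-learn implementation and the number of clusters are assumptions rather than details given in the disclosure.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_multimodal_prototypes(momentum_image_feats, momentum_text_feats, num_prototypes=32):
    """Fuse per-pair momentum features and cluster them; cluster centres act as prototype features.

    momentum_image_feats, momentum_text_feats: arrays of shape (num_pairs, dim).
    The element-wise average used for fusion and num_prototypes are illustrative assumptions.
    """
    fused = (momentum_image_feats + momentum_text_feats) / 2.0          # fusion features
    kmeans = KMeans(n_clusters=num_prototypes, n_init=10).fit(fused)
    prototypes = kmeans.cluster_centers_                                 # multi-modal prototype features
    pair_to_prototype = kmeans.labels_   # prototype index associated with each image-report pair
    return prototypes, pair_to_prototype
```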
In one embodiment, the loss value determination module is further configured to:
For medical images of each of the image report pairs, taking the conventional image feature corresponding to the medical image in question as a first multi-modal anchor point, taking the multi-modal prototype feature corresponding to the image report pair containing the medical image in question as a first multi-modal positive sample, taking the multi-modal prototype feature corresponding to the image report pair not containing the medical image in question as a first multi-modal negative sample, and determining a first multi-modal loss value based on the first multi-modal anchor point, the first multi-modal positive sample and the first multi-modal negative sample;
the loss value determining module is further configured to:
for medical reports of each of the image report pairs, taking the conventional text features corresponding to the medical report in question as a second multi-modal anchor point, taking the multi-modal prototype feature corresponding to the image report pair containing the medical report in question as a second multi-modal positive sample, taking the multi-modal prototype features corresponding to image report pairs not containing the medical report in question as a second multi-modal negative sample, and determining a second multi-modal loss value based on the second multi-modal anchor point, the second multi-modal positive sample and the second multi-modal negative sample.
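The anchor/positive-prototype/negative-prototype arrangement above can be read as a prototype-level contrastive objective. The following InfoNCE-style sketch is one possible formulation, offered for illustration only; the normalization and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(anchors, prototypes, positive_idx, temperature=0.07):
    """Contrastive loss between anchors (conventional image or text features) and prototypes.

    anchors:      (batch, dim) conventional features acting as multi-modal anchor points
    prototypes:   (num_prototypes, dim) multi-modal prototype features
    positive_idx: (batch,) index of the prototype of the image-report pair each anchor belongs to;
                  all other prototypes act as negative samples.
    """
    anchors = F.normalize(anchors, dim=1)
    prototypes = F.normalize(prototypes, dim=1)
    logits = anchors @ prototypes.t() / temperature          # (batch, num_prototypes)
    return F.cross_entropy(logits, positive_idx)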
In one embodiment, the loss value determination module is further configured to:
determining a single mode loss value based on the conventional image feature, the momentum image feature, the conventional text feature, and the momentum text feature;
the parameter optimization module is further configured to:
and optimizing parameters of the machine learning model according to the single-mode loss value, the multi-mode loss value, the first cross-mode loss value and the second cross-mode loss value to obtain a target machine learning model.
In one embodiment, the loss value determination module is further configured to:
processing medical data samples of a target task through the target machine learning model to obtain a task result;
the parameter optimization module is further used for: adjusting parameters of a sub-task model in the target machine learning model based on the task result to obtain a trained target machine learning model;
the data processing module is further configured to:
and carrying out data processing on the target medical data through the trained target machine learning model.
In one embodiment, the data processing module is further configured to:
acquiring target medical data and a corresponding task type;
Selecting a target machine learning model matched with the task type from the trained target machine learning models; the matched target machine learning model comprises a feature extraction network and a subtask model, wherein the subtask model comprises one of a classification sub-model, a segmentation sub-model, a cross-modal retrieval sub-model or a visual question-answer sub-model;
and carrying out data processing on the target medical data through the matched target machine learning model to obtain a data processing result.
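Illustratively (the class, heads and dimensions below are assumptions, not part of the disclosure), matching target medical data to a subtask model by task type could be organised as follows:

```python
import torch.nn as nn

class TargetMachineLearningModel(nn.Module):
    """Illustrative wrapper: a shared feature extraction network plus task-specific subtask heads."""
    def __init__(self, feature_extractor, feature_dim=128, num_classes=10):
        super().__init__()
        self.feature_extractor = feature_extractor
        self.subtask_models = nn.ModuleDict({
            "classification": nn.Linear(feature_dim, num_classes),
            "retrieval": nn.Identity(),   # returns the embedding for cross-modal retrieval
            # segmentation / visual question-answer heads would be added analogously
        })

    def forward(self, target_medical_data, task_type):
        features = self.feature_extractor(target_medical_data)
        return self.subtask_models[task_type](features)   # data processing result
```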
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring an image report pair consisting of a medical image of a visual modality and a medical report of a text modality;
extracting features of the medical image and the medical report in the image report pair through a machine learning model to obtain conventional image features and momentum image features corresponding to the medical image, and conventional text features and momentum text features corresponding to the medical report;
taking the conventional image features as a first cross-modal anchor point, and determining a first cross-modal loss value based on the first cross-modal anchor point and the momentum text features; and taking the conventional text features as a second cross-modal anchor point, and determining a second cross-modal loss value based on the second cross-modal anchor point and the momentum image features;
Determining a multimodal loss value based on the conventional image feature, the momentum image feature, the conventional text feature, and the momentum text feature;
optimizing parameters of the machine learning model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value to obtain a target machine learning model;
performing data processing on the target medical data through the target machine learning model; the target medical data includes at least one of a target medical image or a target medical report.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring an image report pair consisting of a medical image of a visual modality and a medical report of a text modality;
extracting features of the medical image and the medical report in the image report pair through a machine learning model to obtain conventional image features and momentum image features corresponding to the medical image, and conventional text features and momentum text features corresponding to the medical report;
taking the conventional image features as a first cross-modal anchor point, and determining a first cross-modal loss value based on the first cross-modal anchor point and the momentum text features; and taking the conventional text features as a second cross-modal anchor point, and determining a second cross-modal loss value based on the second cross-modal anchor point and the momentum image features;
determining a multimodal loss value based on the conventional image feature, the momentum image feature, the conventional text feature, and the momentum text feature;
optimizing parameters of the machine learning model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value to obtain a target machine learning model;
performing data processing on the target medical data through the target machine learning model; the target medical data includes at least one of a target medical image or a target medical report.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring an image report pair consisting of a medical image of a visual modality and a medical report of a text modality;
extracting features of the medical image and the medical report in the image report pair through a machine learning model to obtain conventional image features and momentum image features corresponding to the medical image, and conventional text features and momentum text features corresponding to the medical report;
taking the conventional image features as a first cross-modal anchor point, and determining a first cross-modal loss value based on the first cross-modal anchor point and the momentum text features; and taking the conventional text features as a second cross-modal anchor point, and determining a second cross-modal loss value based on the second cross-modal anchor point and the momentum image features;
determining a multimodal loss value based on the conventional image feature, the momentum image feature, the conventional text feature, and the momentum text feature;
optimizing parameters of the machine learning model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value to obtain a target machine learning model;
performing data processing on the target medical data through the target machine learning model; the target medical data includes at least one of a target medical image or a target medical report.
The above-described medical data processing method, apparatus, device, medium and computer program product acquire an image report pair consisting of a medical image of a visual modality and a medical report of a text modality; extract features of the medical image and the medical report in the image report pair through a machine learning model to obtain conventional image features and momentum image features corresponding to the medical image, and conventional text features and momentum text features corresponding to the medical report; take the conventional image features as a first cross-modal anchor point and determine a first cross-modal loss value based on the first cross-modal anchor point and the momentum text features; take the conventional text features as a second cross-modal anchor point and determine a second cross-modal loss value based on the second cross-modal anchor point and the momentum image features; determine a multi-modal loss value based on the conventional image features, the momentum image features, the conventional text features and the momentum text features; and optimize parameters of the machine learning model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value. The resulting target machine learning model therefore has good feature expression capability, which in turn improves the accuracy of medical data processing when the target machine learning model is used to process target medical data.
Drawings
FIG. 1 is a diagram of an application environment for a medical data processing method according to one embodiment;
FIG. 2 is a flow chart of a method of processing medical data according to one embodiment;
FIG. 3 is a schematic diagram of a conventional text feature extraction process in one embodiment;
FIG. 4 is a diagram of a contrast learning process in one embodiment;
FIG. 5 is a schematic diagram of an application scenario in one embodiment;
FIG. 6 is a schematic diagram of an application scenario in another embodiment;
FIG. 7 is a flow chart of a method of processing medical data according to another embodiment;
FIG. 8 is a schematic diagram of a medical data processing procedure according to one embodiment;
FIG. 9 is a block diagram showing the structure of a medical data processing apparatus according to one embodiment;
FIG. 10 is an internal block diagram of a computer device in one embodiment;
FIG. 11 is an internal block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The medical data processing method provided by the embodiment of the application relates to the technologies of artificial intelligence such as machine learning, natural language processing, computer vision and the like, wherein:
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, spanning both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behaviour to acquire new knowledge or skills, and how it can reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques and the like.
Computer Vision (CV) is the science of studying how to make a machine "see"; more specifically, it replaces the human eye with a camera and a computer to recognize and measure targets, and performs further graphics processing so that the computer produces images better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behaviour recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
The medical data processing method provided by the embodiments of the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or located on the cloud or other servers. The medical data processing method may be performed by the terminal 102 or the server 104, or by the terminal 102 and the server 104 in cooperation. In some embodiments, the medical data processing method is performed by the terminal 102: the terminal 102 acquires an image report pair consisting of a medical image of a visual modality and a medical report of a text modality; extracts features of the medical image and the medical report in the image report pair through a machine learning model to obtain conventional image features and momentum image features corresponding to the medical image, and conventional text features and momentum text features corresponding to the medical report; takes the conventional image features as a first cross-modal anchor point and determines a first cross-modal loss value based on the first cross-modal anchor point and the momentum text features; takes the conventional text features as a second cross-modal anchor point and determines a second cross-modal loss value based on the second cross-modal anchor point and the momentum image features; determines a multi-modal loss value based on the conventional image features, the momentum image features, the conventional text features and the momentum text features; optimizes parameters of the feature extraction model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value to obtain a target feature extraction model; and performs data processing on target medical data through the target feature extraction model and the target task model, the target medical data including at least one of a target medical image or a target medical report.
The terminal 102 may be, but not limited to, various desktop computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, where the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms, and the like. The terminal 102 and the server 104 may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
It should be noted that the medical data processing method of the present application may be applied to application scenarios of medical image segmentation, medical image classification, visual questions and answers in the medical field, and multimodal retrieval in the medical field.
In one embodiment, as shown in fig. 2, a medical data processing method is provided, and the method is applied to the terminal 102 in fig. 1 for illustration, and includes the following steps:
S202, acquiring an image report pair consisting of a medical image of a visual mode and a medical report of a text mode.
It should be noted that each distinct form of existence or source of information may be referred to as a modality. The visual modality refers to data existing in the form of an image, and the text modality refers to data existing in the form of text; the medical image of the visual modality is medical data in image form, and the medical report of the text modality is medical data in text form. The medical image may specifically be an image generated by any of a variety of medical imaging devices, including X-ray films, CT scans, MRI (magnetic resonance imaging) scans, ultrasound images, etc., which provide structural information about the interior of tissue. A doctor can judge whether a patient has a disease, and determine its location and extent, by examining these images. The medical report is a written record of the doctor's diagnostic findings based on the patient's medical images; it includes the patient's medical history, physical examination, examination results, diagnostic conclusions and treatment advice.
It will be appreciated that the medical image of the visual modality and the medical report of the text modality are presented in pairs, each image report pair comprising one medical image and one medical report, for example 100 acquired medical images, 100 medical reports, one medical report for each medical image.
Specifically, the terminal acquires a medical dataset, and acquires an image report pair consisting of a medical image of a visual modality and a medical report of a text modality from the medical dataset. Wherein the medical dataset refers to an authorized dataset.
S204, extracting features of the medical image and the medical report in the image report pair through a machine learning model to obtain conventional image features and momentum image features corresponding to the medical image, and conventional text features and momentum text features corresponding to the medical report.
The machine learning model is used to extract multi-modal features from data of a plurality of different modalities (e.g., text, images, etc.) and then accomplish corresponding target tasks based on the extracted multi-modal features. The target task may be a classification task or a prediction task. The purpose of multi-modal feature extraction is to extract information related to the target task from data of different modalities. For example, in a text classification task, lexical representations may be extracted from text data; in an image classification task, pixel representations may be extracted from image data. Machine learning models are typically implemented using neural networks, which automatically learn complex nonlinear features and are an artificial intelligence (AI) technique that mimics the way the human brain solves problems. A neural network consists of a large number of neurons interconnected by weights and biases, and computes its output through activation functions.
In the embodiment of the present application, the machine learning model includes an encoder, where the encoder is used to extract features, it can be understood that in machine learning, the encoder is a model used to convert original input data into feature vectors, and these feature vectors can be used to perform classification or regression prediction.
Image features refer to certain unique properties in an image that can be used to distinguish between different images or describe certain characteristics of an image, and are typically extracted from information such as edges, color, texture, shape, etc. in an image. In the embodiment of the application, the conventional image features are obtained by extracting features of the medical image through a conventional image encoder, and the momentum image features are obtained by extracting features of the medical image through a momentum image encoder.
Text features refer to certain unique properties in text that can be used to distinguish between different text or describe certain characteristics of the text, and are typically extracted from information such as word frequency, part of speech, sentence length, lexical diversity, etc. in text. In the embodiment of the application, the conventional text feature is obtained by extracting the feature of the medical report through a conventional text encoder, and the momentum text feature is obtained by extracting the feature of the medical report through a momentum text encoder.
Specifically, after acquiring an image report pair consisting of a medical image of a visual modality and a medical report of a text modality, the terminal inputs the acquired image report pair into a machine learning model to be trained, extracts features from the medical image and the medical report through encoders of the machine learning model, respectively, thereby obtaining conventional image features and momentum image features of the medical image, and obtains conventional text features and momentum text features corresponding to the medical report.
S206, taking the conventional image features as a first cross-modal anchor point, and determining a first cross-modal loss value based on the first cross-modal anchor point and the momentum text features; and taking the conventional text features as a second cross-modal anchor point, and determining a second cross-modal loss value based on the second cross-modal anchor point and the momentum image features.
It should be noted that, in the embodiment of the present application, a training manner of contrast learning may be used to train a multi-modal machine learning model, where training a model refers to using a large amount of training data to adjust parameters of the model, so that the model can extract useful features from data of multiple different modalities.
Contrast learning refers to a method of training a model using contrast samples. A contrast sample is a combination of a sample containing useful information and a sample not containing useful information. Contrast learning is often used to train deep learning models so that the model learns to distinguish useful information from useless information. During contrast learning the model is given a target, namely to distinguish useful information from useless information, and by constantly learning and adjusting its parameters the model gradually acquires this ability.
Cross-modal learning, also called cross-modal training, refers to learning and prediction across different data types or modalities. In the embodiments of the application, cross-modal refers to crossing the two modalities of images and text. The cross-modal loss value refers to the value of a loss function used in cross-modal learning. A loss function measures the accuracy of model prediction, and a smaller loss value indicates more accurate prediction. The first cross-modal loss value is the loss value of image-to-text cross-modal training, and the second cross-modal loss value is the loss value of text-to-image cross-modal training.
The first cross-modal anchor point may also be referred to as a first cross-modal anchor sample, which refers to a reference sample used by the model for comparison with other samples.
In one embodiment, the machine learning model further includes a classification prediction network, and after the conventional image feature, the momentum image feature, the conventional text feature and the momentum text feature are extracted by the encoder of the machine learning model, the conventional image feature can be used as a first cross-modal anchor point for any one conventional image feature, the first cross-modal anchor point and the momentum text feature are input into the classification prediction network of the machine learning model, a first cross-modal prediction result corresponding to the first cross-modal anchor point is output by the classification prediction network, and a first cross-modal loss value is determined based on the first cross-modal prediction result; and aiming at any one conventional text feature, the conventional text feature can be used as a second cross-modal anchor point, the second cross-modal anchor point and the momentum image feature are input into a classification prediction network of a machine learning model, a second cross-modal prediction result corresponding to the second cross-modal anchor point is output through the classification prediction network, and a second cross-modal loss value is determined based on the second cross-modal prediction result.
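The classification-prediction-network formulation above is closely related to an InfoNCE-style contrastive objective. As an illustrative sketch only (the normalization, temperature and use of queued negatives are assumptions), the two cross-modal loss values could be computed as follows:

```python
import torch
import torch.nn.functional as F

def cross_modal_loss(anchor_feats, positive_momentum_feats, negative_momentum_feats, temperature=0.07):
    """InfoNCE-style cross-modal loss sketch.

    anchor_feats:            (batch, dim) conventional features of one modality (cross-modal anchors)
    positive_momentum_feats: (batch, dim) momentum features of the other modality from the same pairs
    negative_momentum_feats: (num_neg, dim) momentum features of the other modality from other pairs
                             (e.g. drawn from a FIFO memory queue)
    """
    anchor = F.normalize(anchor_feats, dim=1)
    positive = F.normalize(positive_momentum_feats, dim=1)
    negative = F.normalize(negative_momentum_feats, dim=1)

    pos_logits = (anchor * positive).sum(dim=1, keepdim=True)   # (batch, 1)
    neg_logits = anchor @ negative.t()                          # (batch, num_neg)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)      # the positive sits at index 0
    return F.cross_entropy(logits, labels)

# The first cross-modal loss uses conventional image features as anchors and momentum text features
# as positives/negatives; the second swaps the roles of the two modalities.
```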
S208, determining a multi-modal loss value based on the conventional image feature, the momentum image feature, the conventional text feature and the momentum text feature.
Multi-modal learning is also referred to as multi-modal training. The multi-modal loss value refers to the value of a loss function used in multi-modal learning. The multi-modal loss values include a first multi-modal loss value, which is the loss value of multi-modal training from an image to a multi-modal prototype, and a second multi-modal loss value, which is the loss value of multi-modal training from a text to a multi-modal prototype, where the multi-modal prototype is a feature prototype that incorporates both image features and text features.
In one embodiment, S208 specifically includes the steps of: determining a multimodal prototype feature based on the momentum image features and the momentum text features; taking the conventional image characteristics as a first multi-modal anchor point, and determining a first multi-modal loss value based on the first multi-modal anchor point and the multi-modal prototype characteristics; and taking the conventional text characteristic as a second multi-modal anchor point, and determining a second multi-modal loss value based on the second multi-modal anchor point and the multi-modal prototype characteristic.
The multi-mode prototype feature is a feature class to which the momentum image feature and the momentum text feature of the image report pair belong. The first multi-modal anchor point may also be referred to as a first multi-modal anchor sample, which refers to a reference sample used by the model for comparison with other samples.
Specifically, the terminal determines each multi-modal prototype based on the momentum image features and momentum text features of each image report pair, and can take any one conventional image feature as a first multi-modal anchor point, input the first multi-modal anchor point and the multi-modal prototype features into a classification prediction network of the machine learning model, output a first multi-modal prediction result corresponding to the first multi-modal anchor point through the classification prediction network, and determine a first multi-modal loss value based on the first multi-modal prediction result; and for any one conventional text feature, the conventional text feature can be used as a second multi-modal anchor point, the second multi-modal anchor point and the multi-modal prototype features are input into the classification prediction network of the machine learning model, a second multi-modal prediction result corresponding to the second multi-modal anchor point is output through the classification prediction network, and a second multi-modal loss value is determined based on the second multi-modal prediction result.
S210, optimizing parameters of the machine learning model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value to obtain a target machine learning model.
The parameters of the machine learning model may be internal parameters of the model, such as weights and bias values. The weights are the weights of the edges connecting neurons and determine how much each neuron contributes to the network output; the bias value is the threshold of a neuron and determines whether the neuron is activated.
Specifically, after obtaining the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value, the terminal determines a model loss value from them. If the model loss value is greater than a loss value threshold, the parameters of the machine learning model are adjusted and steps S202 to S208 are re-executed, until the model loss value is no longer greater than the loss value threshold, at which point the target machine learning model is obtained.
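A minimal sketch of this optimization loop, assuming a PyTorch optimizer, an averaged-loss stopping criterion and illustrative hyper-parameters (none of which are specified in the disclosure):

```python
import torch

def optimize_model(machine_learning_model, loss_fn, data_loader,
                   loss_threshold=0.1, lr=1e-4, max_epochs=100):
    """Sketch of the parameter-optimization loop; the optimizer, threshold and epoch cap are assumptions.

    loss_fn(batch, model) is assumed to return the summed multi-modal, first cross-modal and
    second cross-modal loss values for one batch of image-report pairs (steps S202 to S208).
    """
    optimizer = torch.optim.Adam(machine_learning_model.parameters(), lr=lr)
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for batch in data_loader:
            model_loss = loss_fn(batch, machine_learning_model)
            optimizer.zero_grad()
            model_loss.backward()
            optimizer.step()
            epoch_loss += model_loss.item()
        if epoch_loss / max(len(data_loader), 1) <= loss_threshold:
            break   # the optimized model is taken as the target machine learning model
    return machine_learning_model
```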
S212, performing data processing on the target medical data through the target machine learning model.
Wherein the target medical data comprises at least one of a target medical image or a target medical report.
Specifically, after obtaining the target machine learning model, the terminal may further obtain target medical data to be processed, input the target medical data into the target machine learning model, extract features of the target medical data through the target machine learning model, and perform classification prediction or other processing based on the extracted features to obtain a data processing result.
In the above embodiment, the terminal acquires an image report pair consisting of a medical image of a visual modality and a medical report of a text modality; extracts features of the medical image and the medical report in the image report pair through a machine learning model to obtain conventional image features and momentum image features corresponding to the medical image, and conventional text features and momentum text features corresponding to the medical report; takes the conventional image features as a first cross-modal anchor point and determines a first cross-modal loss value based on the first cross-modal anchor point and the momentum text features; takes the conventional text features as a second cross-modal anchor point and determines a second cross-modal loss value based on the second cross-modal anchor point and the momentum image features; determines a multi-modal loss value based on the conventional image features, the momentum image features, the conventional text features and the momentum text features; and optimizes parameters of the machine learning model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value. The target machine learning model obtained in this way has good feature expression capability, so the accuracy of medical data processing can be improved when the target machine learning model is used to process target medical data.
In one embodiment, the process of extracting the characteristics of the medical image and the medical report in the image report by the terminal through the machine learning model to obtain the conventional image characteristics and the momentum image characteristics corresponding to the medical image and the conventional text characteristics and the momentum text characteristics corresponding to the medical report specifically comprises the following steps: respectively extracting the characteristics of the medical image in the image report pair through a conventional image encoder and a momentum image encoder of the machine learning model to obtain conventional image characteristics and momentum image characteristics corresponding to the medical image; and respectively extracting the characteristics of the medical report in the image report pair through a conventional text encoder and a momentum text encoder of the machine learning model to obtain conventional text characteristics and momentum text characteristics corresponding to the medical report.
The encoders in the embodiments of the present application may specifically include a conventional encoder and a momentum encoder. The conventional encoder learns an abstract representation of the original input data and converts it into a feature vector; the momentum encoder corresponds to the conventional encoder, and its parameters may be determined based on its own parameters at the previous moment and the parameters of the corresponding conventional encoder, specifically by using the following formula:
$$\hat{\theta}^{t} = \alpha\,\hat{\theta}^{t-1} + (1-\alpha)\,\theta^{t}$$

Wherein $\theta^{t}$ denotes the model parameters of the conventional encoder at time t, $\hat{\theta}^{t}$ denotes the model parameters of the momentum encoder at time t, $\hat{\theta}^{t-1}$ denotes the model parameters of the momentum encoder at time t-1, and $\alpha$ is the weight of the momentum encoder. The momentum encoder at the current moment is determined by a weighted summation of the momentum encoder at the previous moment and the conventional encoder at the current moment; that is, the parameters of the momentum encoder at the current moment are determined based on the parameters of the momentum encoder at the previous moment and the parameters of the conventional encoder at the current moment. When the weight $\alpha$ of the momentum encoder is relatively large, the momentum encoder at the current moment remains close to the momentum encoder at the previous moment.
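By way of illustration only, a minimal Python (PyTorch-style) sketch of the momentum update described by the above formula is given below; the function name, the default value of alpha and the assumption that both encoders share the same architecture are illustrative and not part of the claimed embodiments.

```python
import torch

@torch.no_grad()
def momentum_update(regular_encoder, momentum_encoder, alpha=0.99):
    """Update the momentum encoder parameters as a weighted sum of their
    previous values and the current regular-encoder parameters:
        theta_m(t) = alpha * theta_m(t-1) + (1 - alpha) * theta(t)
    """
    for p_reg, p_mom in zip(regular_encoder.parameters(),
                            momentum_encoder.parameters()):
        p_mom.data.mul_(alpha).add_(p_reg.data, alpha=1.0 - alpha)
```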
In an embodiment of the present application, the conventional encoder includes a conventional image encoder for encoding image content and a conventional text encoder for encoding text content; for example, the conventional image encoder may be a ResNet or a Vision Transformer, and the conventional text encoder may be a BioClinicalBERT. The momentum encoder comprises a momentum image encoder for encoding image content and a momentum text encoder for encoding text content, wherein the momentum image encoder is determined based on the conventional image encoder and the momentum text encoder is determined based on the conventional text encoder.
Specifically, after the terminal acquires the image report pair, determining a conventional image encoder and a conventional text encoder at the current moment of the machine learning model, and a momentum image encoder and a momentum text encoder at the last moment, determining the momentum image encoder at the current moment according to the conventional image encoder at the current moment and the momentum image encoder at the last moment, determining the momentum text encoder at the current moment according to the conventional text encoder at the current moment and the momentum text encoder at the last moment, then extracting features of a medical image in the image report pair through the conventional image encoder at the current moment to obtain conventional image features, and extracting features of the medical image in the image report pair through the momentum image encoder at the current moment to obtain momentum image features; and extracting the characteristics of the medical report in the image report pair by a conventional text encoder at the current moment to obtain conventional text characteristics, and extracting the characteristics of the medical report in the image report pair by a momentum text encoder at the current moment to obtain momentum text characteristics.
In the above embodiment, the terminal performs feature extraction on the medical image and the medical text through the conventional encoder and the momentum encoder of the machine learning model, so that the features of the medical image and the medical report can be better obtained, and further the target machine learning model obtained through subsequent training has better feature expression capability.
In one embodiment, before extracting the features of the medical image, the terminal may further perform data enhancement processing on the medical image to obtain a first medical image and a second medical image corresponding to the medical image; the terminal performs feature extraction on the medical image in the image report pair through a conventional image encoder and a momentum image encoder of a machine learning model, and the process of obtaining conventional image features and momentum image features corresponding to the medical image comprises the following steps: extracting the characteristics of the first medical image through a conventional image encoder of the machine learning model to obtain conventional image characteristics corresponding to the medical image; and extracting features of the second medical image through a momentum image encoder of the machine learning model to obtain momentum image features corresponding to the medical image.
The data enhancement processing refers to preprocessing of the image data to increase the generalization capability and robustness of the model, and may specifically be random cropping, horizontal flipping and other processing. Random cropping randomly crops a smaller image from the image, and horizontal flipping flips the image along a vertical axis, i.e., mirrors the image left to right.
Specifically, for any one medical image, the terminal can perform data enhancement processing on the medical image to obtain a first medical image and a second medical image, for the first medical image, the characteristic extraction is performed on the first medical image through a conventional image encoder at the current moment to obtain conventional image characteristics, and for the second medical image, the characteristic extraction is performed on the second medical image through a momentum image encoder at the current moment to obtain momentum image characteristics.
It will be appreciated that the first medical image and the second medical image are derived from the same medical image by data enhancement processing, and therefore the first medical image and the second medical image are semantically related, and the respective extracted conventional image features and momentum image features are similar.
For example, suppose there are n image report pairs. For any one of the medical images $x_v$, data enhancement is carried out on $x_v$ to obtain two semantically related views, a first medical image and a second medical image; the first medical image is fed into the conventional image encoder and the second medical image is fed into the momentum image encoder; the conventional image encoder outputs the conventional image feature $v$, and the momentum image encoder outputs the corresponding momentum image feature $\hat{v}$.
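A minimal sketch of this two-view scheme in Python (PyTorch-style) is given below; the transform parameters, the assumption of a PIL input image, and the encoder module names are illustrative assumptions rather than the concrete configuration of the embodiments.

```python
import torch
from torchvision import transforms

# Two stochastic augmentations of the same medical image (random cropping
# and horizontal flipping, as described above); parameters are assumed.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def encode_image_pair(image, regular_image_encoder, momentum_image_encoder):
    x1 = augment(image).unsqueeze(0)   # first medical image (view 1)
    x2 = augment(image).unsqueeze(0)   # second medical image (view 2)
    v = regular_image_encoder(x1)      # conventional image feature
    with torch.no_grad():              # momentum branch is not back-propagated
        v_hat = momentum_image_encoder(x2)
    return v, v_hat
```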
In the above embodiment, the terminal performs the data enhancement processing on the medical image, and performs the feature extraction on the medical image after the data enhancement processing by using the conventional image encoder and the momentum image encoder, so that the features of the medical image can be better obtained, and further, the target machine learning model obtained by subsequent training has better feature expression capability.
In one embodiment, the terminal may also divide the medical report into a collection of sentences semantically before feature extraction of the medical report; the terminal performs feature extraction on the medical report in the image report pair through a conventional text encoder and a momentum text encoder of the machine learning model, and the process of obtaining the conventional text feature and the momentum text feature corresponding to the medical report comprises the following steps: extracting features of the sentence sets through a conventional text encoder of the machine learning model to obtain conventional text features corresponding to the medical report; and extracting features of the sentence set through a momentum text encoder of the machine learning model to obtain momentum text features corresponding to the medical report.
It should be noted that medical reports typically contain several semantically independent sentences, and the order between the several sentences is not correlated with the semantics expressed by the medical report.
Specifically, before extracting features of a medical report, the terminal uses a text clause tool to divide the medical report into separate sentences according to semantics to obtain a sentence set, extracts features of the sentence set through a conventional text encoder at the current moment to obtain conventional text features corresponding to the medical report, and extracts features of the sentence set through a momentum text encoder at the current moment to obtain features of momentum text corresponding to the medical report.
In the above embodiment, the terminal performs feature extraction on the sentence set obtained from the medical report using the conventional text encoder and the momentum text encoder, so that the features of the medical report can be better obtained, and the target machine learning model obtained through subsequent training therefore has better feature expression capability.
In one embodiment, the terminal may also perform data enhancement processing on the medical report before performing feature extraction on the medical report, and specifically may employ semantic-level Dropout enhancement.
In one embodiment, a conventional text encoder of a machine learning model includes a feature extraction network, a first pooling layer, and a second pooling layer; the process of extracting the characteristics of the sentence set by the terminal through the conventional text encoder of the machine learning model to obtain the conventional text characteristics corresponding to the medical report specifically comprises the following steps: extracting the characteristics of the sentence set through a characteristic extraction network to obtain word characteristics; pooling the word features through a first pooling layer to obtain sentence features; and carrying out pooling operation on the sentence characteristics through a second pooling layer to obtain the conventional text characteristics corresponding to the medical report.
It should be noted that the momentum text encoder of the machine learning model is determined based on the conventional text encoder, and thus the structure of the momentum text encoder is identical to that of the conventional text encoder, that is, the momentum text encoder includes a feature extraction network, a first pooling layer and a second pooling layer, wherein the first pooling layer may be max-pooling or average pooling, and the second pooling layer may be max-pooling or average pooling.
The feature extraction network is used for extracting features of the input sentence set, and the pooling layers are used for downsampling the input features to obtain more abstract features, reduce the amount of computation and improve the generalization capability of the model.
Specifically, the terminal can encode the sentence set through a word embedding tool to obtain an encoded sentence set, extract the characteristics of each word in the encoded sentence set through a characteristic extraction network of a machine learning model to obtain word characteristics, then downsample the word characteristics by using a first pooling layer to obtain sentence characteristics, and downsample the sentence characteristics by using a second pooling layer to obtain conventional text characteristics corresponding to the medical report. The encoded statement set may specifically be a statement set represented by a numerical value.
For example, suppose there are n image report pairs. For any one of the medical reports $x_T$, the report is first split into a sentence set, and BioClinicalBERT is used as the feature extraction network of the conventional text encoder to perform feature extraction on the sentence set; the output of BioClinicalBERT is the word features. A first max pooling layer of the conventional text encoder then downsamples the word features to obtain a sentence feature set $S=\{S_1,\ldots,S_M\}$, where M is the number of sentences, and a second max pooling layer downsamples the sentence features to obtain the conventional text feature $T$ corresponding to the medical report $x_T$. Similarly, the momentum text encoder performs feature extraction on the medical report $x_T$ to obtain the corresponding momentum text feature $\hat{T}$.
Referring to fig. 3, the medical report corresponding to the chest X-ray image consists of independent sentences such as "no focal consolidation", "the heart is normal in size" and "no bony abnormality", and the semantics of the whole report are unrelated to the order of the sentences, so the report can be decomposed into these three sentences. Feature extraction is then performed through the conventional text encoder $f_T$ to obtain word features, the word features are downsampled through a first max pooling layer to obtain sentence features, and the sentence features are downsampled through a second max pooling layer to obtain the conventional text features of the medical report.
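A minimal Python sketch of the two-stage max pooling described above is given below; the function name and the assumption that per-sentence token features are already available (e.g. from BioClinicalBERT) are illustrative assumptions.

```python
import torch

def encode_report(sentence_word_features):
    """Hierarchical pooling over a medical report.

    sentence_word_features: list of tensors, one per sentence, each of shape
    (num_words_in_sentence, feature_dim), e.g. the token outputs of the
    feature extraction network for that sentence.
    """
    # First max pooling layer: word features -> one feature per sentence.
    sentence_features = torch.stack(
        [w.max(dim=0).values for w in sentence_word_features])  # (M, D)
    # Second max pooling layer: sentence features -> report-level feature.
    report_feature = sentence_features.max(dim=0).values        # (D,)
    return report_feature
```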
In the above embodiment, the terminal performs feature extraction on the sentence set through the feature extraction network to obtain word features; pooling the word features through a first pooling layer to obtain sentence features; and carrying out pooling operation on sentence features through a second pooling layer to obtain conventional text features corresponding to the medical report, so that the features of the medical report can be better obtained, and further, a target machine learning model obtained through subsequent training has better feature expression capability.
In one embodiment, the terminal takes the conventional image feature as a first cross-modal anchor point, and the process of determining the first cross-modal loss value based on the first cross-modal anchor point and the momentum text feature specifically comprises the following steps: for medical images of each image report pair, taking conventional image features corresponding to the medical images of each image report pair as a first cross-modal anchor point, taking momentum text features corresponding to medical reports belonging to the same image report pair as a first cross-modal positive sample, taking momentum text features corresponding to medical reports belonging to different image report pairs as a first cross-modal negative sample, and determining a first cross-modal loss value based on the first cross-modal anchor point, the first cross-modal positive sample and the first cross-modal negative sample.
Specifically, after determining a first cross-modal anchor point, a first cross-modal positive sample and a first cross-modal negative sample, the terminal inputs the first cross-modal anchor point, the first cross-modal positive sample and the first cross-modal negative sample into a classification prediction network of a machine learning model, outputs a first cross-modal prediction result corresponding to the first cross-modal anchor point through the classification prediction network, and determines a first cross-modal loss value based on the first cross-modal prediction result.
In one embodiment, the terminal takes the conventional text feature as a second cross-modal anchor point, and the process of determining the second cross-modal loss value based on the second cross-modal anchor point and the momentum image feature comprises the following steps: for the medical report of each image report pair, taking the conventional text features corresponding to the medical report as the second cross-modal anchor point, taking the momentum image features corresponding to the medical image belonging to the same image report pair as a second cross-modal positive sample, taking the momentum image features corresponding to medical images belonging to different image report pairs as second cross-modal negative samples, and determining the second cross-modal loss value based on the second cross-modal anchor point, the second cross-modal positive sample and the second cross-modal negative samples.
Specifically, after determining a second cross-modal anchor point, a second cross-modal positive sample and a second cross-modal negative sample, the terminal inputs the second cross-modal anchor point, the second cross-modal positive sample and the second cross-modal negative sample into a classification prediction network of the machine learning model, outputs a second cross-modal prediction result corresponding to the second cross-modal anchor point through the classification prediction network, and determines a second cross-modal loss value based on the second cross-modal prediction result.
In the above embodiment, the terminal performs cross-modal training by adopting a contrast learning mode to determine the cross-modal loss value, so that the target machine learning model obtained by training has better feature expression capability, and the accuracy of medical data processing can be improved when the target medical data is processed through the target machine learning model.
In one embodiment, the first and second cross-modal negative samples are maintained by first-in-first-out memory queues, respectively.
It should be noted that the memory queue contains the negative samples determined at historical moments, and the number of negative samples maintained by the memory queue may be set as required; for example, the memory queue always maintains Q negative samples, that is, after a new negative sample is added at the current moment, the negative sample that was added to the memory queue earliest is cleared, so that the memory queue always holds Q negative samples. The memory queue corresponding to the first cross-modal negative samples may be referred to as the first cross-modal negative sample queue, and the memory queue corresponding to the second cross-modal negative samples may be referred to as the second cross-modal negative sample queue.
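For illustration only, a minimal Python sketch of such a first-in-first-out negative sample memory queue is given below; the class and method names are assumptions and not part of the claimed embodiments.

```python
import collections
import torch

class NegativeSampleQueue:
    """First-in-first-out memory queue holding at most Q negative samples."""

    def __init__(self, max_size):
        # deque with maxlen drops the oldest entry automatically when full.
        self.queue = collections.deque(maxlen=max_size)

    def enqueue(self, features):
        # features: (batch, dim) momentum features produced at the current step.
        for f in features.detach():
            self.queue.append(f)

    def negatives(self):
        # Stack the stored negatives into a single (num_negatives, dim) tensor.
        return torch.stack(list(self.queue))
```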
Specifically, in the process of determining the first cross-modal loss value, after determining the first cross-modal negative sample, the terminal adds the determined first cross-modal negative sample into a current first cross-modal negative sample memory queue, clears the first cross-modal negative sample added at the earliest moment in the first cross-modal memory queue from the queue, then inputs the first cross-modal anchor point, the first cross-modal positive sample and each first cross-modal negative sample in the first cross-modal memory queue into a classification prediction network of a machine learning model, outputs a first cross-modal prediction result corresponding to the first cross-modal anchor point through the classification prediction network, and determines the first cross-modal loss value based on the first cross-modal prediction result; in the process of determining the second cross-modal loss value, after determining the second cross-modal negative sample, the terminal adds the determined second cross-modal negative sample into a current second cross-modal negative sample memory queue, clears the second cross-modal negative sample added at the earliest moment in the second cross-modal memory queue from the queue, inputs the second cross-modal anchor point, the second cross-modal positive sample and each second cross-modal negative sample in the second cross-modal memory queue into a classification prediction network of a machine learning model, outputs a second cross-modal prediction result corresponding to the second cross-modal anchor point through the classification prediction network, and determines the second cross-modal loss value based on the second cross-modal prediction result.
In this embodiment, the first cross-modal loss value may be expressed as follows:
$$L_{i2r} = -\log\frac{\exp\big(\langle g_v(v),\, \hat{g}_T(\hat{T}^{+})\rangle / \tau\big)}{\exp\big(\langle g_v(v),\, \hat{g}_T(\hat{T}^{+})\rangle / \tau\big) + \sum_{\hat{T}^{-}}\exp\big(\langle g_v(v),\, \hat{g}_T(\hat{T}^{-})\rangle / \tau\big)}$$

In the above formula, $L_{i2r}$ is the first cross-modal loss value, $v$ is the conventional image feature serving as the anchor point, $\hat{T}^{+}$ is the momentum text feature serving as the positive sample, $\hat{T}^{-}$ is a momentum text feature serving as a negative sample, the negative samples being maintained by a first-in-first-out memory queue, $g_v$ is a multi-layer perceptron (MLP) for mapping conventional image features to the projection space, $\hat{g}_T$ is an MLP for mapping momentum text features to the projection space, $\langle\cdot,\cdot\rangle$ denotes the cosine similarity between the conventional image feature and a momentum text feature in the projection space, and $\tau$ is a temperature coefficient, which is generally used to control the shape of the probability distribution: the larger the value, the smoother the distribution, and the smaller the value, the sharper the distribution. In the process of performing cross-modal training, the conventional image features and momentum text features of the machine learning model are respectively mapped to the projection space, the cross-modal cosine similarity between the conventional image feature and each momentum text feature is determined in the projection space, the cosine similarity between the conventional image feature and the momentum text feature predicted as the positive sample is determined as the positive cross-modal cosine similarity, and the cosine similarity between the conventional image feature and the momentum text features predicted as negative samples is determined as the negative cross-modal cosine similarity; the ratio of the positive cross-modal cosine similarity to the temperature coefficient is calculated to obtain a positive cross-modal ratio, the ratio of each negative cross-modal cosine similarity to the temperature coefficient is calculated to obtain negative cross-modal ratios, the negative cross-modal ratios are summed to obtain a negative cross-modal total ratio, and the first cross-modal loss value is determined based on the positive cross-modal ratio and the negative cross-modal total ratio. It should be noted that the key point of the first cross-modal training is to find, in the projection space, the momentum text feature similar to the conventional image feature: the conventional image feature is similar to the momentum text feature serving as the positive sample, so when the momentum text feature predicted as the positive sample is indeed the momentum text feature serving as the positive sample, the determined first cross-modal loss value is smaller; the conventional image feature is dissimilar to the momentum text features serving as negative samples, so when the momentum text feature predicted as the positive sample is a momentum text feature serving as a negative sample, the determined first cross-modal loss value is larger. Therefore, when the first cross-modal loss value $L_{i2r}$ is small, the first cross-modal training of the model may be considered complete. Similarly, the second cross-modal loss value may be denoted $L_{r2i}$, and the cross-modal loss value is obtained as $L_{cmc} = L_{i2r} + L_{r2i}$.
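A minimal Python (PyTorch-style) sketch of this contrastive loss is given below; the function signature, the default temperature value and the projection head arguments are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cross_modal_loss(v, t_pos, t_neg, proj_v, proj_t, tau=0.07):
    """Image-to-report contrastive loss L_i2r (sketch).

    v:     conventional image feature (anchor),            shape (D,)
    t_pos: momentum text feature of the same pair,          shape (D,)
    t_neg: momentum text features from the memory queue,    shape (Q, D)
    proj_v, proj_t: MLP projection heads; tau: temperature coefficient.
    """
    a = F.normalize(proj_v(v), dim=-1)
    p = F.normalize(proj_t(t_pos), dim=-1)
    n = F.normalize(proj_t(t_neg), dim=-1)
    pos = torch.exp(torch.dot(a, p) / tau)     # positive cross-modal ratio
    neg = torch.exp(n @ a / tau).sum()         # negative cross-modal total ratio
    return -torch.log(pos / (pos + neg))
```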
In the above embodiment, the negative samples are maintained through the memory queue, so that the number of the negative samples participating in model training can be increased, and the feature expression capability of the trained target machine learning model is improved.
For example, describing the cross-modal contrast learning with reference to fig. 4 (a): for any one medical image, the conventional image features corresponding to the medical image are used as the first cross-modal anchor point, the momentum text features corresponding to the medical report belonging to the same image report pair are used as the first cross-modal positive sample, the momentum text features corresponding to medical reports belonging to different image report pairs are used as first cross-modal negative samples, and the first cross-modal loss value $L_{i2r}$ is determined based on the first cross-modal anchor point, the first cross-modal positive sample and the first cross-modal negative samples. For any one medical report, the conventional text features corresponding to the report are used as the second cross-modal anchor point, the momentum image features corresponding to the medical image in the same image report pair are used as the second cross-modal positive sample, the momentum image features corresponding to medical images in different image report pairs are used as second cross-modal negative samples, and the second cross-modal loss value $L_{r2i}$ is determined based on the second cross-modal anchor point, the second cross-modal positive sample and the second cross-modal negative samples. Through cross-modal contrast learning, image features and report features from the same image report pair are pulled closer in the projection space, while image features and report features from different image report pairs are pushed apart.
In one embodiment, the process of determining the multi-modal prototype feature by the terminal based on the momentum image feature and the momentum text feature specifically comprises the following steps: fusing the momentum image features and the corresponding momentum text features to obtain fusion features; clustering the fusion features to obtain a clustering center; the cluster center is determined to be a multi-modal prototype feature.
Specifically, after obtaining the momentum image features and momentum text features of each image report pair, the terminal calculates feature weighted averages of the momentum image features and the momentum text features according to the momentum image features and the momentum text features of the same image report pair, wherein the weighted averages are obtained fusion features, the momentum image features and the momentum text features of each image report pair are respectively calculated to obtain the fusion features of each image report pair, then clustering is carried out on each fusion feature by adopting a clustering algorithm to obtain clustering centers of each cluster, and the obtained features of each clustering center are determined to be multi-mode prototype features.
Wherein clustering refers to dividing a data set into several classes such that the data within each class is as similar as possible and the data between different classes differs as much as possible. In the embodiments of the present application, the objects grouped by the clustering algorithm are the fusion features. It can be appreciated that the similarity between fusion features belonging to the same cluster is large, and the similarity between fusion features belonging to different clusters is small.
The clustering algorithm adopted in the embodiments of the present application may specifically be a k-means clustering algorithm or a CLARANS clustering algorithm. The k-means clustering algorithm is a simple iterative clustering algorithm that uses distance as the similarity index to find K classes in a given data set; the center of each class is obtained from the mean value of all the samples in the class, and each class is described by its cluster center.
In one embodiment, the feature fusion is performed on the momentum image feature and the momentum text feature, and specifically the following formula can be adopted:
$$\hat{u} = \tfrac{1}{2}\big(\hat{g}_v(\hat{v}) + \hat{g}_T(\hat{T})\big)$$

Wherein $\hat{u}$ is the fusion feature, $\hat{g}_v$ is a multi-layer perceptron for mapping the momentum image feature $\hat{v}$ to the projection space, and $\hat{g}_T$ is a multi-layer perceptron for mapping the momentum text feature $\hat{T}$ to the projection space. In the multi-modal training process, the machine learning model maps the momentum image features and the momentum text features to the projection space, calculates the average value of each momentum image feature and the corresponding momentum text feature in the projection space, and determines the obtained average value as the fusion feature of the corresponding momentum image feature-momentum text feature pair.
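A minimal Python sketch of the fusion and clustering steps is given below; the use of scikit-learn's KMeans, the function name and the number of clusters are illustrative assumptions rather than the concrete configuration of the embodiments.

```python
import torch
from sklearn.cluster import KMeans

def multimodal_prototypes(v_hat, t_hat, proj_v_hat, proj_t_hat, num_clusters):
    """Fuse momentum image/text features and cluster them into prototypes.

    v_hat, t_hat: momentum image / text features, each of shape (N, D),
    one row per image report pair; the projection heads map them to the
    projection space, and the fusion feature is their average.
    """
    fused = 0.5 * (proj_v_hat(v_hat) + proj_t_hat(t_hat))   # (N, D')
    kmeans = KMeans(n_clusters=num_clusters, n_init=10).fit(
        fused.detach().cpu().numpy())
    # Each cluster center is one multi-modal prototype feature.
    return torch.as_tensor(kmeans.cluster_centers_, dtype=fused.dtype)
```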
In the above embodiment, the terminal obtains the fusion feature by fusing the momentum image feature and the corresponding momentum text feature; clustering the fusion features to obtain a clustering center; the clustering center is determined to be the multi-mode prototype feature, so that multi-mode training can be performed based on the multi-mode prototype feature, training data are fully utilized to train the model, a target machine learning model obtained through training has good feature expression capability, and accuracy of medical data processing can be improved when the target machine learning model is used for data processing of target medical data in the follow-up process.
In one embodiment, the terminal uses the conventional image feature as a first multi-modal anchor point, and the process of determining the first multi-modal loss value based on the first multi-modal anchor point and the multi-modal prototype feature includes the following steps: for the medical images in each image report pair, taking the conventional image features corresponding to the medical images in the pairs as a first multi-modal anchor point, taking the multi-modal prototype features corresponding to the image report pairs containing the medical images in the pairs as a first multi-modal positive sample, taking the multi-modal prototype features corresponding to the image report pairs not containing the medical images in the pairs as a first multi-modal negative sample, and determining a first multi-modal loss value based on the first multi-modal anchor point, the first multi-modal positive sample and the first multi-modal negative sample.
Specifically, after determining a first multi-modal anchor point, a first multi-modal positive sample and a first multi-modal negative sample, the terminal inputs the first multi-modal anchor point, the first multi-modal positive sample and the first multi-modal negative sample into a classification prediction network of a machine learning model, outputs a first multi-modal prediction result corresponding to the first multi-modal anchor point through the classification prediction network, and determines a first multi-modal loss value based on the first multi-modal prediction result.
In one embodiment, the terminal uses the conventional text feature as a second multi-modal anchor point, and the process of determining the second multi-modal loss value based on the second multi-modal anchor point and the multi-modal prototype features includes the following steps: for the medical report of each image report pair, taking the conventional text feature corresponding to the medical report in question as the second multi-modal anchor point, taking the multi-modal prototype feature corresponding to the image report pair containing the medical report in question as a second multi-modal positive sample, taking the multi-modal prototype features corresponding to image report pairs not containing the medical report in question as second multi-modal negative samples, and determining the second multi-modal loss value based on the second multi-modal anchor point, the second multi-modal positive sample and the second multi-modal negative samples.
Specifically, after determining a second multi-modal anchor point, a second multi-modal positive sample and a second multi-modal negative sample, the terminal inputs the second multi-modal anchor point, the second multi-modal positive sample and the second multi-modal negative sample into a classification prediction network of the machine learning model, outputs a second multi-modal prediction result corresponding to the second multi-modal anchor point through the classification prediction network, and determines a second multi-modal loss value based on the second multi-modal prediction result.
In the above embodiment, the terminal performs multi-modal training by adopting a contrast learning mode to determine the multi-modal loss value, so that the target machine learning model obtained by training has better feature expression capability, and the accuracy of medical data processing can be improved when the target medical data is processed through the target machine learning model.
In one embodiment, the first multi-modal negative-sample and the second multi-modal negative-sample are maintained by first-in-first-out memory queues, respectively.
Specifically, in the process of determining the first multi-modal loss value, after determining the first multi-modal negative sample, the terminal adds the determined first multi-modal negative sample into a current first multi-modal negative sample memory queue, clears the first multi-modal negative sample added at the earliest moment in the first multi-modal memory queue from the queue, inputs the first multi-modal anchor point, the first multi-modal positive sample and each first multi-modal negative sample in the first multi-modal memory queue into a classification prediction network of a machine learning model, outputs a first multi-modal prediction result corresponding to the first multi-modal anchor point through the classification prediction network, and determines the first multi-modal loss value based on the first multi-modal prediction result; in the process of determining the second multi-mode loss value, after determining the second multi-mode negative sample, the terminal adds the determined second multi-mode negative sample into a current second multi-mode negative sample memory queue, clears the second multi-mode negative sample added at the earliest moment in the second multi-mode memory queue from the queue, inputs the second multi-mode anchor point, the second multi-mode positive sample and each second multi-mode negative sample in the second multi-mode memory queue into a classification prediction network of a machine learning model, outputs a second multi-mode prediction result corresponding to the second multi-mode anchor point through the classification prediction network, and determines the second multi-mode loss value based on the second multi-mode prediction result.
In this embodiment, the first multi-modal loss value may be expressed as follows:
$$L_{i2c} = -\log\frac{\exp\big(\langle g_v(v),\, \hat{c}^{+}\rangle / \mu^{+}\big)}{\exp\big(\langle g_v(v),\, \hat{c}^{+}\rangle / \mu^{+}\big) + \sum_{j}\exp\big(\langle g_v(v),\, \hat{c}_{j}^{-}\rangle / \mu_{j}\big)}$$

In the above formula, $L_{i2c}$ is the first multi-modal loss value, $v$ is the conventional image feature serving as the anchor point, $\hat{c}^{+}$ is the multi-modal prototype feature serving as the positive sample, $\hat{c}_{j}^{-}$ is a multi-modal prototype feature serving as a negative sample, the negative samples being maintained by a first-in-first-out memory queue, $g_v$ is a multi-layer perceptron (MLP) for mapping conventional image features to the projection space, $\langle\cdot,\cdot\rangle$ denotes the cosine similarity between the conventional image feature and a multi-modal prototype feature in the projection space, and $\mu_{j}$ is the scaling factor associated with cluster j, which is generally used to control the shape of the probability distribution: the larger the value, the smoother the distribution, and the smaller the value, the sharper the distribution. In the process of multi-modal training, the conventional image features of the machine learning model are mapped to the projection space, the multi-modal cosine similarity between the conventional image feature and each multi-modal prototype feature is determined in the projection space, the cosine similarity between the conventional image feature and the multi-modal prototype feature predicted as the positive sample is determined as the positive multi-modal cosine similarity, and the cosine similarity between the conventional image feature and the multi-modal prototype features predicted as negative samples is determined as the negative multi-modal cosine similarity; the ratio of the positive multi-modal cosine similarity to the corresponding scaling factor is calculated to obtain a positive multi-modal ratio, the ratio of each negative multi-modal cosine similarity to the corresponding scaling factor is calculated to obtain negative multi-modal ratios, the negative multi-modal ratios are summed to obtain a negative multi-modal total ratio, and the first multi-modal loss value is determined based on the positive multi-modal ratio and the negative multi-modal total ratio. It should be noted that the key point of the first multi-modal training is to find, in the projection space, the multi-modal prototype feature similar to the conventional image feature: the conventional image feature is similar to the multi-modal prototype feature serving as the positive sample, so when the multi-modal prototype feature predicted as the positive sample is indeed the multi-modal prototype feature serving as the positive sample, the determined first multi-modal loss value is smaller; the conventional image feature is dissimilar to the multi-modal prototype features serving as negative samples, so when the multi-modal prototype feature predicted as the positive sample is a multi-modal prototype feature serving as a negative sample, the determined first multi-modal loss value is larger. Therefore, when the first multi-modal loss value $L_{i2c}$ is small, the first multi-modal training of the model may be considered complete. Similarly, the second multi-modal loss value may be denoted $L_{r2c}$, and the multi-modal loss value is obtained as $L_{mpc} = L_{i2c} + L_{r2c}$.
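For illustration, a minimal Python sketch of this prototype-level loss with per-cluster scaling factors follows; the function signature and argument shapes are assumptions, and the per-cluster scaling factors are passed in rather than estimated.

```python
import torch
import torch.nn.functional as F

def prototype_loss(v, proto_pos, proto_neg, mu_pos, mu_neg, proj_v):
    """First multi-modal loss L_i2c (sketch).

    v:         conventional image feature (anchor),            shape (D,)
    proto_pos: prototype of the cluster containing the pair,   shape (P,)
    proto_neg: prototype features of the other clusters,       shape (K-1, P)
    mu_pos, mu_neg: per-cluster scaling factors (scalar and (K-1,) tensor).
    """
    a = F.normalize(proj_v(v), dim=-1)
    pos = torch.exp(torch.dot(a, F.normalize(proto_pos, dim=-1)) / mu_pos)
    neg = torch.exp((F.normalize(proto_neg, dim=-1) @ a) / mu_neg).sum()
    return -torch.log(pos / (pos + neg))
```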
In the above embodiment, the negative samples are maintained through the memory queue, so that the number of the negative samples participating in model training can be increased, and the feature expression capability of the trained target machine learning model is improved.
Describing the multi-modal contrast learning by way of example with reference to fig. 4 (c): the momentum image features of the n image report pairs and the corresponding momentum text features are fused to obtain fusion features, the fusion features are clustered to obtain cluster centers, and the cluster centers are determined as the multi-modal prototype features. For the medical image of each image report pair, the conventional image features corresponding to the medical image are used as the first multi-modal anchor point, the multi-modal prototype feature corresponding to the image report pair including the medical image is used as the first multi-modal positive sample, the multi-modal prototype features corresponding to image report pairs not including the medical image are used as first multi-modal negative samples, and the first multi-modal loss value $L_{i2c}$ is determined based on the first multi-modal anchor point, the first multi-modal positive sample and the first multi-modal negative samples. For the medical report of each image report pair, the conventional text feature corresponding to the medical report is used as the second multi-modal anchor point, the multi-modal prototype feature corresponding to the image report pair containing the medical report is used as the second multi-modal positive sample, the multi-modal prototype features corresponding to image report pairs not containing the medical report are used as second multi-modal negative samples, and the second multi-modal loss value $L_{r2c}$ is determined based on the second multi-modal anchor point, the second multi-modal positive sample and the second multi-modal negative samples. Through multi-modal contrast learning, the image features and the report features are pulled closer to the cluster center to which they belong and pushed further away from other cluster centers in the projection space, so that each cluster is more compact and independent.
In one embodiment, the medical data processing method further comprises the steps of: determining a single-mode loss value based on the conventional image feature, the momentum image feature, the conventional text feature and the momentum text feature; the terminal optimizes parameters of the machine learning model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value, and the process of obtaining the target machine learning model comprises the following steps: and optimizing parameters of the machine learning model according to the single-mode loss value, the multi-mode loss value, the first cross-mode loss value and the second cross-mode loss value to obtain a target machine learning model.
Wherein the single-mode loss values include a first single-mode loss value that is a loss value for image-to-image single-mode training and a second single-mode loss value that is a loss value for text-to-text single-mode training.
Specifically, for the medical image of each image report pair, the terminal takes the conventional image features corresponding to the medical image as a first single-mode anchor point, takes the momentum image features corresponding to the same medical image as a first single-mode positive sample, takes the momentum image features corresponding to different medical images as first single-mode negative samples, and determines the first single-mode loss value based on the first single-mode anchor point, the first single-mode positive sample and the first single-mode negative samples; for the medical report of each image report pair, the terminal takes the conventional text feature corresponding to the medical report in question as a second single-mode anchor point, takes the momentum text feature corresponding to the same medical report as a second single-mode positive sample, takes the momentum text features corresponding to different medical reports as second single-mode negative samples, and determines the second single-mode loss value based on the second single-mode anchor point, the second single-mode positive sample and the second single-mode negative samples.
In the embodiment, the terminal determines the single-mode loss value by performing single-mode training, and fully utilizes training data, so that the target machine learning model obtained by training has better characteristic expression capability, and the accuracy of medical data processing can be improved when the target medical data is processed through the target machine learning model.
In one embodiment, the first single-mode negative sample and the second single-mode negative sample are maintained by first-in first-out memory queues, respectively.
Specifically, in the process of determining the first single-mode loss value, after determining the first single-mode negative sample, the terminal adds the determined first single-mode negative sample into the current first single-mode negative sample memory queue, clears the first single-mode negative sample added at the earliest moment in the first single-mode memory queue from the queue, inputs the first single-mode anchor point, the first single-mode positive sample and each first single-mode negative sample in the first single-mode memory queue into the classification prediction network of the machine learning model, outputs a first single-mode prediction result corresponding to the first single-mode anchor point through the classification prediction network, and determines the first single-mode loss value based on the first single-mode prediction result; in the process of determining the second single-mode loss value, after determining the second single-mode negative sample, the terminal adds the determined second single-mode negative sample into the current second single-mode negative sample memory queue, clears the second single-mode negative sample added at the earliest moment in the second single-mode memory queue from the queue, inputs the second single-mode anchor point, the second single-mode positive sample and each second single-mode negative sample in the second single-mode memory queue into the classification prediction network of the machine learning model, outputs a second single-mode prediction result corresponding to the second single-mode anchor point through the classification prediction network, and determines the second single-mode loss value based on the second single-mode prediction result.
In this embodiment, the first single mode loss value may be expressed as follows:
$$L_{i2i} = -\log\frac{\exp\big(\langle g_v(v),\, \hat{g}_v(\hat{v}^{+})\rangle / \tau\big)}{\exp\big(\langle g_v(v),\, \hat{g}_v(\hat{v}^{+})\rangle / \tau\big) + \sum_{\hat{v}^{-}}\exp\big(\langle g_v(v),\, \hat{g}_v(\hat{v}^{-})\rangle / \tau\big)}$$

In the above formula, $L_{i2i}$ is the first single-mode loss value, $v$ is the conventional image feature serving as the anchor point, $\hat{v}^{+}$ is the momentum image feature serving as the positive sample, $\hat{v}^{-}$ is a momentum image feature serving as a negative sample, the negative samples being maintained by a first-in-first-out memory queue, $g_v$ is a multi-layer perceptron (MLP) for mapping conventional image features to the projection space, $\hat{g}_v$ is an MLP for mapping momentum image features to the projection space, $\langle\cdot,\cdot\rangle$ denotes the cosine similarity between the conventional image feature and a momentum image feature in the projection space, and $\tau$ is a temperature coefficient, which is generally used to control the shape of the probability distribution: the larger the value, the smoother the distribution, and the smaller the value, the sharper the distribution. In the process of single-mode training, the conventional image features and the momentum image features of the machine learning model are respectively mapped to the projection space, the single-mode cosine similarity between the conventional image feature and each momentum image feature is determined in the projection space, the cosine similarity between the conventional image feature and the momentum image feature predicted as the positive sample is determined as the positive single-mode cosine similarity, and the cosine similarity between the conventional image feature and the momentum image features predicted as negative samples is determined as the negative single-mode cosine similarity; the ratio of the positive single-mode cosine similarity to the temperature coefficient is calculated to obtain a positive single-mode ratio, the ratio of each negative single-mode cosine similarity to the temperature coefficient is calculated to obtain negative single-mode ratios, the negative single-mode ratios are summed to obtain a negative single-mode total ratio, and the first single-mode loss value is determined based on the positive single-mode ratio and the negative single-mode total ratio. It should be noted that the key point of the first single-mode training is to find, in the projection space, the momentum image feature similar to the conventional image feature: the conventional image feature is similar to the momentum image feature serving as the positive sample, so when the momentum image feature predicted as the positive sample is indeed the momentum image feature serving as the positive sample, the determined first single-mode loss value is smaller; the conventional image feature is dissimilar to the momentum image features serving as negative samples, so when the momentum image feature predicted as the positive sample is a momentum image feature serving as a negative sample, the determined first single-mode loss value is larger. Therefore, when the first single-mode loss value $L_{i2i}$ is small, the first single-mode training of the model may be considered complete. Similarly, the second single-mode loss value may be denoted $L_{r2r}$, and the single-mode loss value is obtained as $L_{umc} = L_{i2i} + L_{r2r}$.
In the above embodiment, the negative samples are maintained through the memory queue, so that the number of the negative samples participating in model training can be increased, and the feature expression capability of the trained target machine learning model is improved.
Describing the single-mode contrast learning by way of example with reference to fig. 4 (b): for the medical image of each image report pair, the conventional image features corresponding to the medical image are used as the first single-mode anchor point, the momentum image features corresponding to the same medical image are used as the first single-mode positive sample, the momentum image features corresponding to different medical images are used as first single-mode negative samples, and the first single-mode loss value $L_{i2i}$ is determined based on the first single-mode anchor point, the first single-mode positive sample and the first single-mode negative samples; for the medical report of each image report pair, the conventional text feature corresponding to the medical report is used as the second single-mode anchor point, the momentum text feature corresponding to the same medical report is used as the second single-mode positive sample, the momentum text features corresponding to different medical reports are used as second single-mode negative samples, and the second single-mode loss value $L_{r2r}$ is determined based on the second single-mode anchor point, the second single-mode positive sample and the second single-mode negative samples. Through single-mode contrast learning, additional supervisory signals can be introduced to enhance the expression capability of the model.
In one embodiment, the terminal determines a model loss value according to the single-mode loss value, the multi-mode loss value and the cross-mode loss value, and optimizes parameters of the machine learning model according to the model loss value to obtain a target machine learning model. The model loss value is determined according to the single-mode loss value, the multi-mode loss value and the cross-mode loss value, specifically, the single-mode loss value, the multi-mode loss value and the cross-mode loss value can be weighted and summed to obtain a summation result, and the summation result is determined as the model loss value, as shown in the following formula:
$$L_{omnt} = \lambda L_{cmc} + \beta L_{umc} + \gamma L_{mpc}$$

In the above formula, $L_{omnt}$ is the model loss value, $L_{cmc}$ is the cross-modal loss value, $L_{umc}$ is the single-mode loss value, $L_{mpc}$ is the multi-modal loss value, and $\lambda$, $\beta$ and $\gamma$ are the corresponding weights, respectively.
In one embodiment, the medical data processing method further comprises the steps of: processing medical data samples of a target task through a target machine learning model to obtain a task result; adjusting parameters of a sub-task model in the target machine learning model based on the task result to obtain a trained target machine learning model; the process of the terminal for carrying out data processing on the target medical data through the target machine learning model comprises the following steps: and performing data processing on the target medical data through the trained target machine learning model.
The target task refers to a target medical task, and specifically can be at least one of a segmentation task, a classification task, a cross-modal retrieval task and a visual question-answering task. The subtask model includes at least one of a classification sub-model, a segmentation sub-model, a cross-modal retrieval sub-model, and a visual question-answer sub-model. Segmentation tasks refer to the process of separating a target region (e.g., tumor, organ, etc.) in a medical image from the background; the classification task is to classify the medical images and divide the images into different categories; the cross-modal retrieval task refers to a task of retrieving among a plurality of different medical data modalities; a visual question-answering task refers to a task that uses a computer system to automatically answer questions about medical images.
It can be understood that after obtaining the target machine learning model, the terminal may further obtain a medical data sample of the target task, so as to perform fine-tuning training on the target machine learning model based on the medical data sample of the target task, to obtain a trained target machine learning model that may be used for processing the target task.
Specifically, after obtaining a medical data sample of a target task, the terminal processes the medical data sample of the target task through a target machine learning model to obtain a task result, determines a model fine tuning loss value based on the task result, optimizes parameters of a sub-task model in the target machine learning model based on the model fine tuning loss value to obtain a trained target machine learning model, and accordingly can use the trained target machine learning model to process data of target medical data corresponding to the target task.
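A minimal Python (PyTorch-style) sketch of one such fine-tuning step is given below; the function signature, the separation into a feature extractor and a subtask head, and the choice of criterion are illustrative assumptions.

```python
import torch

def finetune_step(target_model, subtask_head, batch, labels, criterion, optimizer):
    """One fine-tuning step on a medical data sample of the target task.

    target_model: pretrained target machine learning model (feature extractor)
    subtask_head: classification / segmentation / retrieval / VQA sub-model
    criterion:    task-specific loss, e.g. torch.nn.CrossEntropyLoss()
    """
    features = target_model(batch)          # extract features of the sample
    task_result = subtask_head(features)    # task result
    loss = criterion(task_result, labels)   # model fine-tuning loss value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                        # optimize sub-task model parameters
    return loss.item()
```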
In one embodiment, the process of the terminal for data processing of the target medical data through the trained target machine learning model comprises the following steps: obtaining target medical data and corresponding task types
Selecting a target machine learning model matched with the task type from the trained target machine learning models; the matched target machine learning model comprises a feature extraction network and a subtask model, wherein the subtask model comprises one of a classification sub-model, a segmentation sub-model, a cross-modal retrieval sub-model or a visual question-answer sub-model;
And carrying out data processing on the target medical data through the matched target machine learning model to obtain a data processing result.
The task type may be at least one of a segmentation task, a classification task, a cross-modal retrieval task, and a visual question-and-answer task.
It may be appreciated that after obtaining the trained target machine learning model, the terminal may divide the trained target machine learning model according to the subtask model included; for example, the trained target machine learning model including the classification sub-model is marked as the classification model, the trained target machine learning model including the segmentation sub-model is marked as the segmentation model, the trained target machine learning model including the cross-modal retrieval sub-model is marked as the cross-modal retrieval model, and the trained target machine learning model including the visual question-answer sub-model is marked as the visual question-answer model.
As shown in fig. 5, the front end receives the target medical data input by the user and the corresponding task type, and the back end processes the target medical data through the target machine learning model matched with the task type to obtain a data processing result, which is returned to the front end. For example, if the user inputs a medical image to be classified and a classification task, the target machine learning model matched with the classification task is determined to be the classification model, and the back end classifies the medical image through the classification model and outputs the classification result; if the user inputs a medical image to be segmented and a segmentation task, the target machine learning model matched with the segmentation task is determined to be the segmentation model, and the back end segments the medical image through the segmentation model and outputs the segmentation result; if the user inputs a medical image or a medical report to be retrieved and a cross-modal retrieval task, the target machine learning model matched with the cross-modal retrieval task is determined to be the cross-modal retrieval model, and the back end performs cross-modal retrieval on the medical image or medical report through the cross-modal retrieval model to obtain the retrieval result; and if the user inputs a medical image, a related question and a visual question-answering task, the target machine learning model matched with the visual question-answering task is determined to be the visual question-answering model, and the back end processes the medical image and the related question through the visual question-answering model to obtain the answer to the related question. Fig. 6 shows the target medical data and task type input by the user and the corresponding data processing result.
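For illustration only, the back-end selection of the model matching the task type can be sketched in Python as follows; the dictionary keys and function name are assumptions and not part of the claimed embodiments.

```python
def process_medical_data(task_type, target_data, models):
    """Select the trained target machine learning model matching the task
    type and run the target medical data through it."""
    model = {
        "classification": models["classification_model"],
        "segmentation": models["segmentation_model"],
        "cross_modal_retrieval": models["cross_modal_retrieval_model"],
        "visual_question_answering": models["visual_question_answering_model"],
    }[task_type]
    return model(target_data)   # data processing result returned to the front end
```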
In the above embodiment, the terminal retrains the target machine learning model by using the medical data sample of the target task, so as to obtain a trained target machine learning model which can be used for processing the target task, and further, when the trained target machine learning model is used for processing the target medical data, the accuracy of data processing can be further improved.
In one embodiment, as shown in fig. 7, a medical data processing method is provided, and the method is applied to the terminal 102 in fig. 1 for illustration, and includes the following steps:
S702, acquiring an image report pair consisting of a medical image of a visual modality and a medical report of a text modality.
S704, respectively extracting the characteristics of the medical image in the image report pair by a conventional image encoder and a momentum image encoder of the machine learning model to obtain conventional image characteristics and momentum image characteristics corresponding to the medical image.
The conventional image encoder is used to encode the image content; for example, it may be a ResNet or a vision Transformer. The momentum image encoder also encodes the image content, and in this application the momentum image encoder is determined based on the conventional image encoder.
Specifically, after the terminal acquires the image report pair, it determines the conventional image encoder of the machine learning model at the current moment and the momentum image encoder at the previous moment, and determines the momentum image encoder at the current moment from the conventional image encoder at the current moment and the momentum image encoder at the previous moment. The terminal then performs feature extraction on the medical image in the image report pair through the conventional image encoder at the current moment to obtain the conventional image feature, and performs feature extraction on the medical image through the momentum image encoder at the current moment to obtain the momentum image feature.
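One common way to realize "determining the momentum encoder at the current moment from the conventional encoder at the current moment and the momentum encoder at the previous moment" is an exponential moving average over parameters. The sketch below assumes PyTorch modules and a momentum coefficient of 0.999; neither is specified in the patent.

```python
import torch

@torch.no_grad()
def update_momentum_encoder(conventional_encoder: torch.nn.Module,
                            momentum_encoder: torch.nn.Module,
                            m: float = 0.999) -> None:
    """Momentum encoder at the current moment = m * (momentum encoder at the
    previous moment) + (1 - m) * (conventional encoder at the current moment)."""
    for p_c, p_m in zip(conventional_encoder.parameters(),
                        momentum_encoder.parameters()):
        p_m.data.mul_(m).add_(p_c.data, alpha=1.0 - m)
```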
In one embodiment, before extracting the features of the medical image, the terminal may further perform data enhancement processing on the medical image to obtain a first medical image and a second medical image corresponding to the medical image; the terminal performs feature extraction on the medical image in the image report pair through a conventional image encoder and a momentum image encoder of a machine learning model, and the process of obtaining conventional image features and momentum image features corresponding to the medical image comprises the following steps: extracting the characteristics of the first medical image through a conventional image encoder of the machine learning model to obtain conventional image characteristics corresponding to the medical image; and extracting features of the second medical image through a momentum image encoder of the machine learning model to obtain momentum image features corresponding to the medical image.
Specifically, for any one medical image, the terminal can perform data enhancement processing on the medical image to obtain a first medical image and a second medical image, for the first medical image, the characteristic extraction is performed on the first medical image through a conventional image encoder at the current moment to obtain conventional image characteristics, and for the second medical image, the characteristic extraction is performed on the second medical image through a momentum image encoder at the current moment to obtain momentum image characteristics.
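A minimal sketch of the two-view image feature extraction described above, assuming PyTorch encoders and a caller-supplied data-enhancement transform (both assumptions). Stopping gradients on the momentum branch is a common convention for momentum encoders rather than something stated in the text.

```python
import torch

def extract_image_features(medical_image: torch.Tensor,
                           augment,                              # data-enhancement transform (assumed)
                           conventional_image_encoder: torch.nn.Module,
                           momentum_image_encoder: torch.nn.Module):
    """Apply data enhancement twice to obtain the first and second medical
    images, then encode them with the conventional and momentum image encoders."""
    first_medical_image = augment(medical_image)
    second_medical_image = augment(medical_image)
    conventional_image_feature = conventional_image_encoder(first_medical_image)
    with torch.no_grad():                                         # momentum branch is not back-propagated
        momentum_image_feature = momentum_image_encoder(second_medical_image)
    return conventional_image_feature, momentum_image_feature
```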
S706, respectively extracting the characteristics of the medical report in the image report pair by a conventional text encoder and a momentum text encoder of the machine learning model to obtain conventional text characteristics and momentum text characteristics corresponding to the medical report.
The conventional text encoder is used to encode text content; for example, it may be a clinically pre-trained BERT model such as BioClinicalBERT. The momentum text encoder also encodes text content, and in this application the momentum text encoder is determined based on the conventional text encoder.
Specifically, after the terminal acquires the image report pair, determining a conventional text encoder at the current moment of the machine learning model and a momentum text encoder at the last moment, determining the momentum text encoder at the current moment according to the conventional text encoder at the current moment and the momentum text encoder at the last moment, extracting features of a medical report in the image report pair through the conventional text encoder at the current moment to obtain conventional text features, and extracting features of a medical report in the image report pair through the momentum text encoder at the current moment to obtain momentum text features.
In one embodiment, the terminal may also divide the medical report into a collection of sentences semantically before feature extraction of the medical report; the terminal performs feature extraction on the medical report in the image report pair through a conventional text encoder and a momentum text encoder of the machine learning model, and the process of obtaining the conventional text feature and the momentum text feature corresponding to the medical report comprises the following steps: extracting features of the sentence sets through a conventional text encoder of the machine learning model to obtain conventional text features corresponding to the medical report; and extracting features of the sentence set through a momentum text encoder of the machine learning model to obtain momentum text features corresponding to the medical report.
It should be noted that medical reports typically contain several semantically independent sentences, and the order between the several sentences is not correlated with the semantics expressed by the medical report.
Specifically, before extracting features of a medical report, the terminal uses a text clause tool to divide the medical report into separate sentences according to semantics to obtain a sentence set, extracts features of the sentence set through the conventional text encoder at the current moment to obtain the conventional text features corresponding to the medical report, and extracts features of the sentence set through the momentum text encoder at the current moment to obtain the momentum text features corresponding to the medical report.
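A minimal rule-based sketch of the text clause tool mentioned above; real medical-report splitters may use more elaborate tokenizers, and the punctuation rules here are assumptions.

```python
import re
from typing import List

def split_report_into_sentences(medical_report: str) -> List[str]:
    """Split a medical report into a set of semantically independent sentences."""
    candidates = re.split(r"[.;\n]+", medical_report)
    return [sentence.strip() for sentence in candidates if sentence.strip()]

# Example:
# split_report_into_sentences("No pneumothorax. Heart size is normal.")
# -> ["No pneumothorax", "Heart size is normal"]
```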
S708, taking the conventional image feature as a first cross-modal anchor point, and determining a first cross-modal loss value based on the first cross-modal anchor point and the momentum text feature; and taking the conventional text feature as a second cross-modal anchor point, and determining a second cross-modal loss value based on the second cross-modal anchor point and the momentum image feature.
In one embodiment, for any one conventional image feature, the conventional image feature can be used as a first cross-modal anchor point, the first cross-modal anchor point and the momentum text feature are input into a classification prediction network of a machine learning model, a first cross-modal prediction result corresponding to the first cross-modal anchor point is output through the classification prediction network, and a first cross-modal loss value is determined based on the first cross-modal prediction result; and aiming at any one conventional text feature, the conventional text feature can be used as a second cross-modal anchor point, the second cross-modal anchor point and the momentum image feature are input into a classification prediction network of a machine learning model, a second cross-modal prediction result corresponding to the second cross-modal anchor point is output through the classification prediction network, and a second cross-modal loss value is determined based on the second cross-modal prediction result.
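The patent leaves the exact form of the loss to the classification prediction network; a common concrete realization of "anchor versus one positive and many negatives" contrastive learning is an InfoNCE-style loss, sketched below under that assumption (the temperature value is also an assumption). Under the same assumption, the single-mode loss values of S714 can be computed with this function by swapping in momentum image or momentum text features as positives and negatives.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor: torch.Tensor,     # (B, d) e.g. conventional image features
                     positive: torch.Tensor,   # (B, d) e.g. paired momentum text features
                     negatives: torch.Tensor,  # (K, d) momentum features from other pairs
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: each anchor should score higher with its positive
    than with any negative sample."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    logits_pos = (anchor * positive).sum(dim=-1, keepdim=True) / temperature   # (B, 1)
    logits_neg = anchor @ negatives.t() / temperature                          # (B, K)
    logits = torch.cat([logits_pos, logits_neg], dim=1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```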
S710, fusing the momentum image features and the corresponding momentum text features to obtain fusion features; clustering the fusion features to obtain a clustering center; the cluster center is determined to be a multi-modal prototype feature.
Specifically, after obtaining the momentum image features and momentum text features of each image report pair, the terminal calculates, for the momentum image feature and the momentum text feature of the same image report pair, a weighted average of the two features; the weighted average is the fusion feature of that pair. The fusion feature of each image report pair is calculated in this way, a clustering algorithm is then applied to all the fusion features to obtain the cluster center of each cluster, and the feature of each obtained cluster center is determined as a multi-modal prototype feature. The clustering algorithm adopted in the embodiments of the application may specifically be a k-means clustering algorithm or a CLARANS clustering algorithm.
In one embodiment, the feature fusion is performed on the momentum image feature and the momentum text feature, and specifically the following formula can be adopted:
z = ( g_v(v̂) + g_T(T̂) ) / 2
where z is the fusion feature, g_v is a multi-layer perceptron that maps the momentum image feature v̂ into the projection space, and g_T is a multi-layer perceptron that maps the momentum text feature T̂ into the projection space. During multi-modal training, the machine learning model maps the momentum image feature and the momentum text feature into the projection space respectively, calculates the average of the momentum image feature and the corresponding momentum text feature in the projection space, and determines the obtained average as the fusion feature of the corresponding "momentum image feature-momentum text feature" pair.
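Putting the fusion and clustering steps together, a minimal sketch might look as follows. The projection MLPs, the number of clusters, and the use of scikit-learn's k-means are assumptions introduced for illustration; the patent only requires averaging in a projection space and clustering the fusion features.

```python
import torch
from sklearn.cluster import KMeans

def compute_multimodal_prototypes(momentum_image_feats: torch.Tensor,  # (N, d)
                                  momentum_text_feats: torch.Tensor,   # (N, d)
                                  g_image: torch.nn.Module,            # MLP into projection space
                                  g_text: torch.nn.Module,             # MLP into projection space
                                  num_clusters: int = 128) -> torch.Tensor:
    """Fuse each momentum image/text feature pair by averaging in the projection
    space, cluster the fusion features, and return the cluster centers as
    multi-modal prototype features."""
    with torch.no_grad():
        fusion_feats = 0.5 * (g_image(momentum_image_feats) + g_text(momentum_text_feats))
    kmeans = KMeans(n_clusters=num_clusters, n_init=10)
    kmeans.fit(fusion_feats.cpu().numpy())
    return torch.as_tensor(kmeans.cluster_centers_, dtype=fusion_feats.dtype)
```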
S712, taking the conventional image feature as a first multi-modal anchor point, and determining a first multi-modal loss value based on the first multi-modal anchor point and the multi-modal prototype feature; and taking the conventional text characteristic as a second multi-modal anchor point, and determining a second multi-modal loss value based on the second multi-modal anchor point and the multi-modal prototype characteristic.
The multi-mode prototype feature is a feature class to which the momentum image feature and the momentum text feature of the image report pair belong. The first multi-modal anchor point may also be referred to as a first multi-modal anchor sample, which refers to a reference sample used by the model for comparison with other samples.
Specifically, the terminal determines each multi-modal prototype feature based on the momentum image features and momentum text features of each image report pair. For any one conventional image feature, the conventional image feature can be used as a first multi-modal anchor point, the first multi-modal anchor point and the multi-modal prototype features are input into a classification prediction network of the machine learning model, a first multi-modal prediction result corresponding to the first multi-modal anchor point is output through the classification prediction network, and a first multi-modal loss value is determined based on the first multi-modal prediction result; for any one conventional text feature, the conventional text feature can be used as a second multi-modal anchor point, the second multi-modal anchor point and the multi-modal prototype features are input into the classification prediction network of the machine learning model, a second multi-modal prediction result corresponding to the second multi-modal anchor point is output through the classification prediction network, and a second multi-modal loss value is determined based on the second multi-modal prediction result.
In one embodiment, the terminal uses the conventional image feature as a first multi-modal anchor point, and the process of determining the first multi-modal loss value based on the first multi-modal anchor point and the multi-modal prototype feature includes the following steps: for the medical image in each image report pair, taking the conventional image feature corresponding to the medical image in question as a first multi-modal anchor point, taking the multi-modal prototype feature corresponding to the image report pair containing the medical image in question as a first multi-modal positive sample, taking the multi-modal prototype features corresponding to the image report pairs not containing the medical image in question as a first multi-modal negative sample, and determining a first multi-modal loss value based on the first multi-modal anchor point, the first multi-modal positive sample and the first multi-modal negative sample.
In one embodiment, the terminal uses the conventional text feature as a second multi-modal anchor point, and the process of determining the second multi-modal loss value based on the second multi-modal anchor point and the multi-modal prototype feature includes the following steps: for the medical report in each image report pair, taking the conventional text feature corresponding to the medical report in question as a second multi-modal anchor point, taking the multi-modal prototype feature corresponding to the image report pair containing the medical report in question as a second multi-modal positive sample, taking the multi-modal prototype features corresponding to the image report pairs not containing the medical report in question as a second multi-modal negative sample, and determining a second multi-modal loss value based on the second multi-modal anchor point, the second multi-modal positive sample and the second multi-modal negative sample.
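Under the same InfoNCE assumption used above, the multi-modal loss can be computed by contrasting each anchor against all multi-modal prototype features, with the prototype of its own image report pair acting as the positive sample. The cluster-assignment index and the temperature below are assumptions for this sketch.

```python
import torch
import torch.nn.functional as F

def multimodal_prototype_loss(anchors: torch.Tensor,        # (B, d) conventional image or text features
                              prototypes: torch.Tensor,     # (C, d) multi-modal prototype features
                              positive_index: torch.Tensor, # (B,) prototype id of each anchor's own pair
                              temperature: float = 0.07) -> torch.Tensor:
    """The prototype of the anchor's own image report pair is the positive
    sample; all other prototypes act as negative samples."""
    anchors = F.normalize(anchors, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    logits = anchors @ prototypes.t() / temperature          # (B, C)
    return F.cross_entropy(logits, positive_index)
```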
S714, taking the conventional image feature as a first single-mode anchor point, and determining a first single-mode loss value based on the first single-mode anchor point and the momentum image feature; and taking the conventional text feature as a second single-mode anchor point, and determining a second single-mode loss value based on the second single-mode anchor point and the momentum text feature.
Specifically, for the medical image of each image report pair, the terminal takes the conventional image feature corresponding to the medical image in question as a first single-mode anchor point, takes the momentum image feature corresponding to the same medical image as a first single-mode positive sample, takes the momentum image features corresponding to different medical images as a first single-mode negative sample, and determines a first single-mode loss value based on the first single-mode anchor point, the first single-mode positive sample and the first single-mode negative sample; for the medical report of each image report pair, the terminal takes the conventional text feature corresponding to the medical report in question as a second single-mode anchor point, takes the momentum text feature corresponding to the same medical report as a second single-mode positive sample, takes the momentum text features corresponding to different medical reports as a second single-mode negative sample, and determines a second single-mode loss value based on the second single-mode anchor point, the second single-mode positive sample and the second single-mode negative sample.
S716, optimizing parameters of the machine learning model according to the first single-mode loss value, the second single-mode loss value, the first multi-mode loss value, the second multi-mode loss value, the first cross-mode loss value and the second cross-mode loss value to obtain a target machine learning model.
In one embodiment, the terminal determines a model loss value according to the single-mode loss value, the multi-mode loss value and the cross-mode loss value, and optimizes parameters of the machine learning model according to the model loss value to obtain a target machine learning model. The model loss value is determined according to the single-mode loss value, the multi-mode loss value and the cross-mode loss value, specifically, the single-mode loss value, the multi-mode loss value and the cross-mode loss value can be weighted and summed to obtain a summation result, and the summation result is determined as the model loss value, as shown in the following formula:
L_omnt = λ·L_cmc + β·L_umc + γ·L_mpc
where L_omnt is the model loss value, L_cmc is the cross-modal loss value, L_umc is the single-mode loss value, L_mpc is the multi-modal loss value, and λ, β and γ are the weights of the cross-modal, single-mode and multi-modal loss values, respectively.
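A minimal sketch of the parameter-optimization step implied by the formula above; the choice of optimizer and the default weight values are assumptions, since the patent does not fix them.

```python
import torch

def optimization_step(optimizer: torch.optim.Optimizer,
                      loss_cmc: torch.Tensor,   # cross-modal loss value
                      loss_umc: torch.Tensor,   # single-mode loss value
                      loss_mpc: torch.Tensor,   # multi-modal loss value
                      lam: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """L_omnt = lam * L_cmc + beta * L_umc + gamma * L_mpc, followed by one
    gradient step over the machine learning model's parameters."""
    model_loss = lam * loss_cmc + beta * loss_umc + gamma * loss_mpc
    optimizer.zero_grad()
    model_loss.backward()
    optimizer.step()
    return model_loss.item()
```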
S718, performing data processing on the target medical data through a target machine learning model; the target medical data includes at least one of a target medical image or a target medical report.
Specifically, after obtaining the target machine learning model, the terminal may further obtain target medical data to be processed, input the target medical data into the target machine learning model, extract features of the target medical data through the target machine learning model, and perform classification prediction or other task-specific output based on the extracted features to obtain a data processing result.
The application further provides an application scenario that applies the medical data processing method. Referring to fig. 8, the medical data processing method specifically includes the following steps: an image report pair consisting of a medical image x_v of a visual modality and a medical report x_T of a text modality is acquired. For any medical image x_v, data enhancement is performed to obtain two semantically related views, a first medical image and a second medical image; the first medical image is fed into the conventional image encoder, which outputs the conventional image feature v, and the second medical image is fed into the momentum image encoder, which outputs the momentum image feature v̂. For any medical report x_T, the report is first split into a sentence set; feature extraction is performed on the sentence set through the conventional text encoder to obtain the conventional text feature T corresponding to the medical report x_T, and similarly through the momentum text encoder to obtain the momentum text feature T̂ corresponding to the medical report x_T. Taking the conventional image feature v as a first cross-modal anchor point, a first cross-modal loss value is determined based on the first cross-modal anchor point and the momentum text feature T̂; taking the conventional text feature T as a second cross-modal anchor point, a second cross-modal loss value is determined based on the second cross-modal anchor point and the momentum image feature v̂. The momentum image feature v̂ and the corresponding momentum text feature T̂ are fused to obtain a fusion feature; the fusion features are clustered to obtain cluster centers, and the cluster centers are determined as multi-modal prototype features. Taking the conventional image feature v as a first multi-modal anchor point, a first multi-modal loss value is determined based on the first multi-modal anchor point and the multi-modal prototype features; taking the conventional text feature T as a second multi-modal anchor point, a second multi-modal loss value is determined based on the second multi-modal anchor point and the multi-modal prototype features. Taking the conventional image feature v as a first single-mode anchor point, a first single-mode loss value is determined based on the first single-mode anchor point and the momentum image feature v̂; taking the conventional text feature T as a second single-mode anchor point, a second single-mode loss value is determined based on the second single-mode anchor point and the momentum text feature T̂. Parameters of the machine learning model are optimized according to the first single-mode loss value, the second single-mode loss value, the first multi-modal loss value, the second multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value to obtain a target machine learning model, and data processing is performed on target medical data through the target machine learning model. The target medical data includes at least one of a target medical image or a target medical report; the target medical data is data corresponding to a target task, and the target task may be at least one of a classification task, a segmentation task, a cross-modal retrieval task and a visual question-answering task. For the classification task, the model can output the disease type of a picture input by the user; the segmentation task outlines a pneumothorax lesion; the cross-modal retrieval task includes two subtasks: when the input is a medical picture, the model retrieves the most matching report from the report set, and when the input is a report, the model retrieves the best matching medical picture from the picture set; the visual question-answering task answers questions posed by the user about the input picture.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited to this order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time but may be performed at different times; the order of execution of these steps or stages is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a medical data processing device for realizing the above related medical data processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitations in one or more embodiments of the medical data processing device provided below may be referred to the limitations of the medical data processing method described above, and will not be repeated here.
In one embodiment, as shown in fig. 9, there is provided a medical data processing apparatus comprising: an image report pair acquisition module 902, a feature extraction module 904, a loss value determination module 906, a parameter optimization module 908, and a data processing module 910, wherein:
an image report pair acquisition module 902 for acquiring an image report pair consisting of a medical image of a visual modality and a medical report of a text modality.
The feature extraction module 904 is configured to perform feature extraction on the medical image and the medical report in the image report pair through the machine learning model, so as to obtain a conventional image feature and a momentum image feature corresponding to the medical image, and a conventional text feature and a momentum text feature corresponding to the medical report.
A loss value determination module 906 configured to determine a first cross-modal loss value based on the first cross-modal anchor and the momentum text feature with the regular image feature as the first cross-modal anchor; and determining a second cross-modal loss value based on the second cross-modal anchor and the momentum image feature by taking the conventional text feature as the second cross-modal anchor; the multi-modal loss value is determined based on the conventional image feature, the momentum image feature, the conventional text feature, and the momentum text feature.
The parameter optimization module 908 is configured to optimize parameters of the machine learning model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value, and obtain a target machine learning model.
A data processing module 910, configured to perform data processing on the target medical data through the target machine learning model; the target medical data includes at least one of a target medical image or a target medical report.
In the above-described embodiment, by acquiring the image report pair composed of the medical image of the visual modality and the medical report of the text modality; extracting the characteristics of the medical image and the medical report in the image report through a machine learning model to obtain the conventional image characteristics and the momentum image characteristics corresponding to the medical image, and the conventional text characteristics and the momentum text characteristics corresponding to the medical report; taking the conventional image characteristic as a first cross-modal anchor point, and determining a first cross-modal loss value based on the first cross-modal anchor point and the momentum text characteristic; and determining a second cross-modal loss value based on the second cross-modal anchor and the momentum image feature by taking the conventional text feature as the second cross-modal anchor; determining a multi-modal loss value based on the conventional image features, the momentum image features, the conventional text features, and the momentum text features; according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value, parameters of the machine learning model are optimized, so that the target machine learning model has good characteristic expression capability, and further accuracy of medical data processing can be improved when the target machine learning model is used for data processing of target medical data.
In one embodiment, the feature extraction module 904 is further configured to: respectively extracting the characteristics of the medical image in the image report pair through a conventional image encoder and a momentum image encoder of the machine learning model to obtain conventional image characteristics and momentum image characteristics corresponding to the medical image; and respectively extracting the characteristics of the medical report in the image report pair through a conventional text encoder and a momentum text encoder of the machine learning model to obtain conventional text characteristics and momentum text characteristics corresponding to the medical report.
In one embodiment, the feature extraction module 904 is further configured to: performing data enhancement processing on the medical image to obtain a first medical image and a second medical image corresponding to the medical image; extracting the characteristics of the first medical image through a conventional image encoder of the machine learning model to obtain conventional image characteristics corresponding to the medical image; and extracting features of the second medical image through a momentum image encoder of the machine learning model to obtain momentum image features corresponding to the medical image.
In one embodiment, the feature extraction module 904 is further configured to: dividing the medical report into sentence sets according to semantics; extracting features of the sentence sets through a conventional text encoder of the machine learning model to obtain conventional text features corresponding to the medical report; and extracting features of the sentence set through a momentum text encoder of the machine learning model to obtain momentum text features corresponding to the medical report.
In one embodiment, a conventional text encoder of a machine learning model includes a feature extraction network, a first pooling layer, and a second pooling layer; the feature extraction module 904 is further configured to: extracting the characteristics of the sentence set through a characteristic extraction network to obtain word characteristics; pooling the word features through a first pooling layer to obtain sentence features; and carrying out pooling operation on the sentence characteristics through a second pooling layer to obtain the conventional text characteristics corresponding to the medical report.
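A minimal sketch of the two-level pooling in the conventional text encoder; mean pooling is an assumption here, since the patent does not fix the type of pooling operation.

```python
import torch
from typing import List

def pool_report_feature(word_features_per_sentence: List[torch.Tensor]) -> torch.Tensor:
    """First pooling layer: word features -> sentence features.
    Second pooling layer: sentence features -> report-level conventional text feature."""
    sentence_features = torch.stack([words.mean(dim=0)            # (num_words_i, d) -> (d,)
                                     for words in word_features_per_sentence])
    return sentence_features.mean(dim=0)                           # (S, d) -> (d,)
```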
In one embodiment, the number of image report pairs is at least two; the loss value determining module 906 is further configured to: for medical images of each image report pair, taking conventional image features corresponding to the medical images in the image report pair as a first cross-modal anchor point, taking momentum text features corresponding to medical reports in the same image report pair as a first cross-modal positive sample, taking momentum text features corresponding to medical reports in different image report pairs as a first cross-modal negative sample, and determining a first cross-modal loss value based on the first cross-modal anchor point, the first cross-modal positive sample and the first cross-modal negative sample;
the loss value determining module 906 is further configured to: for medical reports of each image report pair, taking the conventional text features corresponding to the medical report as a second cross-modal anchor point, taking the momentum image features corresponding to the medical images belonging to the same image report pair as a second cross-modal positive sample, taking the momentum text features corresponding to the medical images belonging to different image report pairs as a second cross-modal negative sample, and determining a second cross-modal loss value based on the second cross-modal anchor point, the second cross-modal positive sample and the second cross-modal negative sample.
In one embodiment, the first and second cross-modal negative samples are maintained by first-in-first-out memory queues, respectively.
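A minimal sketch of a first-in-first-out memory queue for negative samples (a MoCo-style design); the queue size, random initialization and CPU storage are assumptions for this sketch.

```python
import torch

class NegativeQueue:
    """Maintains momentum features used as cross-modal negative samples."""
    def __init__(self, feature_dim: int, queue_size: int = 8192):
        self.queue = torch.randn(queue_size, feature_dim)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, momentum_features: torch.Tensor) -> None:
        """Oldest entries are overwritten first (first in, first out)."""
        n = momentum_features.size(0)
        indices = torch.arange(self.ptr, self.ptr + n) % self.queue.size(0)
        self.queue[indices] = momentum_features.detach().cpu()
        self.ptr = int((self.ptr + n) % self.queue.size(0))

    def negatives(self) -> torch.Tensor:
        return self.queue
```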
In one embodiment, the multi-modal loss values include a first multi-modal loss value and a second multi-modal loss value; the loss value determining module 906 is further configured to: determining a multimodal prototype feature based on the momentum image features and the momentum text features; taking the conventional image characteristics as a first multi-modal anchor point, and determining a first multi-modal loss value based on the first multi-modal anchor point and the multi-modal prototype characteristics; and taking the conventional text characteristic as a second multi-modal anchor point, and determining a second multi-modal loss value based on the second multi-modal anchor point and the multi-modal prototype characteristic.
In one embodiment, the loss value determination module 906 is further configured to: fusing the momentum image features and the corresponding momentum text features to obtain fusion features; clustering the fusion features to obtain a clustering center; the cluster center is determined to be a multi-modal prototype feature.
In one embodiment, the loss value determination module 906 is further configured to: for medical images in each image report pair, taking conventional image features corresponding to the medical image in question as a first multi-modal anchor point, taking multi-modal prototype features corresponding to the image report pair containing the medical image in question as a first multi-modal positive sample, taking multi-modal prototype features corresponding to the image report pair not containing the medical image in question as a first multi-modal negative sample, and determining a first multi-modal loss value based on the first multi-modal anchor point, the first multi-modal positive sample and the first multi-modal negative sample;
The loss value determining module 906 is further configured to: for the medical report of each image report pair, taking the conventional text feature corresponding to the medical report in question as a second multi-modal anchor point, taking the multi-modal prototype feature corresponding to the image report pair containing the medical report in question as a second multi-modal positive sample, taking the multi-modal prototype feature corresponding to the image report pair not containing the medical report in question as a second multi-modal negative sample, and determining a second multi-modal loss value based on the second multi-modal anchor point, the second multi-modal positive sample and the second multi-modal negative sample.
In one embodiment, the loss value determination module 906 is further configured to: determining a single-mode loss value based on the conventional image feature, the momentum image feature, the conventional text feature and the momentum text feature;
the parameter optimization module 908 is further configured to: and optimizing parameters of the machine learning model according to the single-mode loss value, the multi-mode loss value, the first cross-mode loss value and the second cross-mode loss value to obtain a target machine learning model.
In one embodiment, the loss value determination module 906 is further configured to: processing medical data samples of a target task through a target machine learning model to obtain a task result; the parameter optimization module 908 is further configured to: adjusting parameters of a sub-task model in the target machine learning model based on the task result to obtain a trained target machine learning model;
The data processing module 910 is further configured to: and performing data processing on the target medical data through the trained target machine learning model.
In one embodiment, the data processing module 910 is further configured to: acquiring target medical data and a corresponding task type; selecting a target machine learning model matched with the task type from the trained target machine learning models; the matched target machine learning model comprises a feature extraction network and a subtask model, wherein the subtask model comprises one of a classification sub-model, a segmentation sub-model, a cross-modal retrieval sub-model or a visual question-answer sub-model; and carrying out data processing on the target medical data through the matched target machine learning model to obtain a data processing result.
The various modules in the medical data processing device described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in the form of hardware, or may be stored in a memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 10. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing medical data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a medical data processing method.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 11. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a medical data processing method. The display unit of the computer equipment is used for forming a visual picture, and can be a display screen, a projection device or a virtual reality imaging device, wherein the display screen can be a liquid crystal display screen or an electronic ink display screen, the input device of the computer equipment can be a touch layer covered on the display screen, can also be a key, a track ball or a touch pad arranged on a shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structures shown in fig. 10 or 11 are merely block diagrams of partial structures related to the aspects of the present application and do not constitute a limitation of the computer device to which the aspects of the present application apply; in particular, the computer device may include more or fewer components than those shown in the drawings, or combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.
Claims (17)
1. A medical data processing method, the method comprising:
acquiring an image report pair consisting of a medical image of a visual modality and a medical report of a text modality;
extracting the characteristics of the medical image and the medical report in the image report pair by a machine learning model to obtain the conventional image characteristics and the momentum image characteristics corresponding to the medical image and the conventional text characteristics and the momentum text characteristics corresponding to the medical report;
Taking the conventional image feature as a first cross-modal anchor point, and determining a first cross-modal loss value based on the first cross-modal anchor point and the momentum text feature; and determining a second cross-modal loss value based on the second cross-modal anchor and the momentum image feature with the regular text feature as a second cross-modal anchor;
determining a multimodal loss value based on the conventional image feature, the momentum image feature, the conventional text feature, and the momentum text feature;
optimizing parameters of the machine learning model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value to obtain a target machine learning model;
performing data processing on the target medical data through the target machine learning model; the target medical data includes at least one of a target medical image or a target medical report.
2. The method of claim 1, wherein the feature extraction of the medical image and the medical report in the image report pair by the machine learning model to obtain the regular image feature and the momentum image feature corresponding to the medical image, and the regular text feature and the momentum text feature corresponding to the medical report comprises:
Respectively extracting the characteristics of the medical image in the image report pair through a conventional image encoder and a momentum image encoder of a machine learning model to obtain conventional image characteristics and momentum image characteristics corresponding to the medical image;
and respectively extracting the characteristics of the medical report in the image report pair through a conventional text encoder and a momentum text encoder of the machine learning model to obtain conventional text characteristics and momentum text characteristics corresponding to the medical report.
3. The method according to claim 2, wherein the method further comprises:
performing data enhancement processing on the medical image to obtain a first medical image and a second medical image corresponding to the medical image;
the conventional image encoder and the momentum image encoder for the machine learning model perform feature extraction on the medical image in the image report pair to obtain conventional image features and momentum image features corresponding to the medical image, and the method comprises the following steps:
extracting the characteristics of the first medical image through a conventional image encoder of the machine learning model to obtain conventional image characteristics corresponding to the medical image;
And extracting the characteristics of the second medical image through a momentum image encoder of the machine learning model to obtain momentum image characteristics corresponding to the medical image.
4. The method according to claim 2, wherein the method further comprises:
dividing the medical report into sentence sets according to semantics;
the conventional text encoder and the momentum text encoder for the machine learning model perform feature extraction on the medical report in the image report pair to obtain conventional text features and momentum text features corresponding to the medical report, and the method comprises the following steps:
extracting features of the sentence set through a conventional text encoder of the machine learning model to obtain conventional text features corresponding to the medical report;
and extracting features of the sentence set through a momentum text encoder of the machine learning model to obtain momentum text features corresponding to the medical report.
5. The method of claim 4, wherein a conventional text encoder of the machine learning model comprises a feature extraction network, a first pooling layer, and a second pooling layer; the feature extraction is performed on the sentence set by the conventional text encoder of the machine learning model, so as to obtain conventional text features corresponding to the medical report, including:
Extracting features of the sentence set through the feature extraction network to obtain word features;
carrying out pooling operation on the word characteristics through the first pooling layer to obtain sentence characteristics;
and carrying out pooling operation on the sentence characteristics through the second pooling layer to obtain the conventional text characteristics corresponding to the medical report.
6. The method of claim 1, wherein the number of image report pairs is at least two; the determining a first cross-modal loss value based on the first cross-modal anchor and the momentum text feature with the regular image feature as a first cross-modal anchor comprises:
for medical images of each image report pair, taking the conventional image features corresponding to the medical image in question as a first cross-modal anchor point, taking the momentum text features corresponding to medical reports belonging to the same image report pair as a first cross-modal positive sample, taking the momentum text features corresponding to medical reports belonging to different image report pairs as a first cross-modal negative sample, and determining a first cross-modal loss value based on the first cross-modal anchor point, the first cross-modal positive sample and the first cross-modal negative sample;
The determining a second cross-modal loss value based on the second cross-modal anchor and the momentum image feature with the regular text feature as a second cross-modal anchor comprises:
for medical reports of each of the image report pairs, taking the conventional text features corresponding to the medical report in question as a second cross-modal anchor point, taking the momentum image features corresponding to medical images belonging to the same image report pair as a second cross-modal positive sample, taking the momentum image features corresponding to medical images belonging to different image report pairs as a second cross-modal negative sample, and determining a second cross-modal loss value based on the second cross-modal anchor point, the second cross-modal positive sample and the second cross-modal negative sample.
7. The method of claim 6, wherein the first cross-modal negative-example and the second cross-modal negative-example are maintained by a first-in, first-out memory queue, respectively.
8. The method of claim 1, wherein the multi-modal loss values comprise a first multi-modal loss value and a second multi-modal loss value; the determining a multimodal loss value based on the conventional image feature, the momentum image feature, the conventional text feature, and the momentum text feature, comprising:
Determining a multimodal prototype feature based on the momentum image feature and the momentum text feature;
taking the conventional image feature as a first multi-modal anchor point, and determining a first multi-modal loss value based on the first multi-modal anchor point and the multi-modal prototype feature;
and taking the conventional text feature as a second multi-modal anchor point, and determining a second multi-modal loss value based on the second multi-modal anchor point and the multi-modal prototype feature.
9. The method of claim 8, wherein the determining a multimodal prototype feature based on the momentum image feature and the momentum text feature comprises:
fusing the momentum image features and the corresponding momentum text features to obtain fusion features;
clustering the fusion features to obtain a clustering center;
and determining the cluster center as a multi-modal prototype feature.
10. The method of claim 8, wherein the determining a first multi-modal loss value based on the first multi-modal anchor and the multi-modal prototype feature with the regular image feature as a first multi-modal anchor comprises:
for medical images of each of the image report pairs, taking the conventional image feature corresponding to the medical image in question as a first multi-modal anchor point, taking the multi-modal prototype feature corresponding to the image report pair containing the medical image in question as a first multi-modal positive sample, taking the multi-modal prototype feature corresponding to the image report pair not containing the medical image in question as a first multi-modal negative sample, and determining a first multi-modal loss value based on the first multi-modal anchor point, the first multi-modal positive sample and the first multi-modal negative sample;
The determining a second multi-modal loss value based on the second multi-modal anchor point and the multi-modal prototype feature with the regular text feature as a second multi-modal anchor point includes:
for medical reports of each of the image report pairs, taking the regular text feature corresponding to the medical report in question as a second multi-modal anchor point, taking the multi-modal prototype feature corresponding to the image report pair containing the medical report in question as a second multi-modal positive sample, taking the multi-modal prototype feature corresponding to the image report pair not containing the medical report in question as a second multi-modal negative sample, and determining a second multi-modal loss value based on the second multi-modal anchor point, the second multi-modal positive sample and the second multi-modal negative sample.
11. The method according to claim 1, wherein the method further comprises:
determining a single mode loss value based on the conventional image feature, the momentum image feature, the conventional text feature, and the momentum text feature;
the optimizing parameters of the machine learning model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value to obtain a target machine learning model includes:
And optimizing parameters of the machine learning model according to the single-mode loss value, the multi-mode loss value, the first cross-mode loss value and the second cross-mode loss value to obtain a target machine learning model.
12. The method according to any one of claims 1 to 11, further comprising:
processing medical data samples of a target task through the target machine learning model to obtain a task result;
adjusting parameters of a sub-task model in the target machine learning model based on the task result to obtain a trained target machine learning model;
the data processing of the target medical data through the target machine learning model comprises the following steps:
and carrying out data processing on the target medical data through the trained target machine learning model.
13. The method of claim 12, wherein the data processing of the target medical data by the trained target machine learning model comprises:
acquiring target medical data and a corresponding task type;
selecting a target machine learning model matched with the task type from the trained target machine learning models; the matched target machine learning model comprises a feature extraction network and a subtask model, wherein the subtask model comprises one of a classification sub-model, a segmentation sub-model, a cross-modal retrieval sub-model or a visual question-answer sub-model;
And carrying out data processing on the target medical data through the matched target machine learning model to obtain a data processing result.
14. A medical data processing apparatus, the apparatus comprising:
an image report pair acquisition module for acquiring an image report pair composed of a medical image of a visual modality and a medical report of a text modality;
the feature extraction module is used for extracting features of the medical image and the medical report in the image report pair through a machine learning model to obtain conventional image features and momentum image features corresponding to the medical image and conventional text features and momentum text features corresponding to the medical report;
the loss value determining module is used for taking the conventional image feature as a first cross-modal anchor point and determining a first cross-modal loss value based on the first cross-modal anchor point and the momentum text feature; and determining a second cross-modal loss value based on the second cross-modal anchor and the momentum image feature with the regular text feature as a second cross-modal anchor; determining a multimodal loss value based on the conventional image feature, the momentum image feature, the conventional text feature, and the momentum text feature;
The parameter optimization module is used for optimizing parameters of the machine learning model according to the multi-modal loss value, the first cross-modal loss value and the second cross-modal loss value to obtain a target machine learning model;
the data processing module is used for carrying out data processing on the target medical data through the target machine learning model; the target medical data includes at least one of a target medical image or a target medical report.
15. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 13 when the computer program is executed.
16. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 13.
17. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 13.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310083897.8A CN116129141B (en) | 2023-01-13 | 2023-01-13 | Medical data processing method, apparatus, device, medium and computer program product |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310083897.8A CN116129141B (en) | 2023-01-13 | 2023-01-13 | Medical data processing method, apparatus, device, medium and computer program product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116129141A true CN116129141A (en) | 2023-05-16 |
CN116129141B CN116129141B (en) | 2024-08-09 |
Family
ID=86306117
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310083897.8A (granted as CN116129141B, Active) | Medical data processing method, apparatus, device, medium and computer program product | 2023-01-13 | 2023-01-13 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116129141B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220284321A1 (en) * | 2021-03-03 | 2022-09-08 | Adobe Inc. | Visual-semantic representation learning via multi-modal contrastive training |
CN114220034A (en) * | 2021-12-10 | 2022-03-22 | 广东明创软件科技有限公司 | Image processing method, device, terminal and storage medium |
CN114461836A (en) * | 2022-02-10 | 2022-05-10 | 中南大学 | Cross-modal retrieval method for image-text |
CN114757890A (en) * | 2022-03-23 | 2022-07-15 | 上海联影医疗科技股份有限公司 | Medical image processing method, apparatus, device and storage medium |
CN114972929A (en) * | 2022-07-29 | 2022-08-30 | 中国医学科学院医学信息研究所 | Pre-training method and device for medical multi-modal model |
Non-Patent Citations (2)
Title |
---|
KAIMING HE et al.: "Momentum Contrast for Unsupervised Visual Representation Learning", arXiv:1911.05722v3, 23 March 2020, pages 1-12 *
ZHANG Hehe et al.: "Transformer-based weakly supervised multi-modal pre-training model for medical images", Proceedings of the 2022 China Automation Congress, 25 November 2022, pages 444-449 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116631566A (en) * | 2023-05-23 | 2023-08-22 | 重庆邮电大学 | Medical image report intelligent generation method based on big data |
CN116631566B (en) * | 2023-05-23 | 2024-05-24 | 广州合昊医疗科技有限公司 | Medical image report intelligent generation method based on big data |
CN117036355A (en) * | 2023-10-10 | 2023-11-10 | 湖南大学 | Encoder and model training method, fault detection method and related equipment |
CN117036355B (en) * | 2023-10-10 | 2023-12-15 | 湖南大学 | Encoder and model training method, fault detection method and related equipment |
CN117112829A (en) * | 2023-10-24 | 2023-11-24 | 吉林大学 | Medical data cross-modal retrieval method and device and related equipment |
CN117112829B (en) * | 2023-10-24 | 2024-02-02 | 吉林大学 | Medical data cross-modal retrieval method and device and related equipment |
Also Published As
Publication number | Publication date |
---|---|
CN116129141B (en) | 2024-08-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Guibas et al. | Synthetic medical images from dual generative adversarial networks | |
WO2020215984A1 (en) | Medical image detection method based on deep learning, and related device | |
US11093560B2 (en) | Stacked cross-modal matching | |
CN116129141B (en) | Medical data processing method, apparatus, device, medium and computer program product | |
CN110309856A (en) | Image classification method, the training method of neural network and device | |
CN111932529B (en) | Image classification and segmentation method, device and system | |
CN109559300A (en) | Image processing method, electronic equipment and computer readable storage medium | |
CN114419351B (en) | Image-text pre-training model training and image-text prediction model training method and device | |
Guizilini et al. | Learning to reconstruct 3D structures for occupancy mapping from depth and color information | |
Cheng et al. | Augmented reality dynamic image recognition technology based on deep learning algorithm | |
CN109447096B (en) | Glance path prediction method and device based on machine learning | |
WO2021098534A1 (en) | Similarity determining method and device, network training method and device, search method and device, and electronic device and storage medium | |
CN114298997B (en) | Fake picture detection method, fake picture detection device and storage medium | |
Feng et al. | Supervoxel based weakly-supervised multi-level 3D CNNs for lung nodule detection and segmentation | |
CN114764865A (en) | Data classification model training method, data classification method and device | |
CN113569891A (en) | Training data processing device, electronic equipment and storage medium of neural network model | |
CN115994558A (en) | Pre-training method, device, equipment and storage medium of medical image coding network | |
RU2720363C2 (en) | Method for generating mathematical models of a patient using artificial intelligence techniques | |
CN113380360A (en) | Similar medical record retrieval method and system based on multi-mode medical record map | |
CN117974693A (en) | Image segmentation method, device, computer equipment and storage medium | |
CN113408721A (en) | Neural network structure searching method, apparatus, computer device and storage medium | |
CN117011569A (en) | Image processing method and related device | |
CN114648631A (en) | Image description generation method and device, electronic equipment and storage medium | |
CN117556275B (en) | Correlation model data processing method, device, computer equipment and storage medium | |
CN118093840B (en) | Visual question-answering method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40086470; Country of ref document: HK |
GR01 | Patent grant | ||