CN117976198A - Medical cross-domain auxiliary diagnosis method and device based on data screening and countermeasure network

Info

Publication number: CN117976198A
Application number: CN202410366312.8A
Authority: CN
Inventors: 马鹏程; 白焜太; 刘莉; 杨雅婷; 宋佳祥; 刘硕; 许娟; 史文钊
Original assignee: Digital Health China Technologies Co Ltd
Current assignee: Digital Health China Technologies Co Ltd
Priority date: 2024-03-28
Filing date: 2024-03-28
Publication date: 2024-05-03
Anticipated expiration: 2044-03-28

Abstract

The invention relates to the technical field of medical cross-domain data processing, and in particular to a medical cross-domain auxiliary diagnosis method and device based on data screening and a countermeasure (adversarial) network. Given a known number of labeled electronic medical record text data in the source domain, the method calculates the required number of unlabeled electronic medical record text data in the target domain through a relation formula between the two numbers, and screens out that number of unlabeled target-domain records by a weighted random sampling method. The labeled source-domain electronic medical record text data and the screened unlabeled target-domain electronic medical record text data are then combined for training with the optimized countermeasure network, so that the inference speed of the model is increased while the model effect is maintained, and the accuracy of the disease diagnosis prediction result is improved.

Description

Medical cross-domain auxiliary diagnosis method and device based on data screening and countermeasure network

Technical Field

The invention relates to the technical field of medical cross-domain data processing, in particular to a medical cross-domain auxiliary diagnosis method and device based on data screening and countermeasure network.

Background

The data in the medical scenario has the following features:

(1) Privacy: electronic medical record data of patients can be used for model training only after a desensitization (de-identification) preprocessing operation;

(2) High professionalism: the data is difficult to label, and labeling places high demands on the skill level and professional expertise of the labeling personnel;

(3) Particularity: the writing specifications of electronic medical records differ between hospitals, and the writing styles of different doctors also differ.

In view of this, obtaining a large amount of high-quality labeled medical data requires considerable manpower and material resources. Moreover, in an actual scenario where a diagnosis model has already been trained for one or several departments, applying it across departments or across hospitals still requires the new electronic medical record data to be labeled again, which is costly in both time and money.

In recent years, some pseudo-sample generation and transfer strategies for the medical field have appeared, mainly comprising the following two mainstream strategies:

(1) Adopting a GAN applied to the field of medical images, in which a generator and a discriminator are set up to generate new sample pictures and thereby expand the medical image training data;

(2) Adopting model transfer, that is, performing fine-tuning training on the target-domain data with a deep pre-trained model so as to migrate the model to the target domain.

However, both strategies have disadvantages. For the first strategy, a GAN is applicable only to continuous real-valued data and not to discrete text data; in addition, no clear technical scheme is currently available for generating pseudo samples in the medical text field. For the second strategy, the conventional transfer method can cause catastrophic forgetting: after the model is trained on the target-domain data, its performance in the target domain improves greatly, but the knowledge previously learned in the source domain is disturbed, so that its performance in the source domain drops sharply.

The invention provides a medical cross-domain auxiliary diagnosis method and device based on data screening and a countermeasure network. The required number of unlabeled target-domain electronic medical record text data is screened out by an efficient data screening method; the labeled source-domain electronic medical record text data is combined with the screened unlabeled target-domain electronic medical record text data and input into the optimized countermeasure network; and deep common features of the data of different departments and/or hospitals are extracted and preserved, so as to reduce manual labeling and solve the cross-domain problem of current diagnosis models.

Disclosure of Invention

Based on the above, it is necessary to provide a method and a device for medical cross-domain auxiliary diagnosis based on data screening and countermeasure network.

According to a first aspect of the present invention there is provided a method of medical cross-domain assisted diagnosis based on data screening and countermeasure networks, the method comprising:

Acquiring a plurality of source domain marked electronic medical record text data and original target domain unmarked electronic medical record text data;

Calculating the selected number of the electronic medical record text data which is not marked in the target domain based on a relation formula between the number of the electronic medical record text data which is marked in the source domain and the number of the electronic medical record text data which is not marked in the target domain;

Screening the unlabeled electronic medical record text data of the target domain corresponding to the selected number from the unlabeled electronic medical record text data of the original target domain based on a weight random sampling method;

the electronic medical record text data marked in the source domain and the electronic medical record text data not marked in the screened target domain are used as texts to be trained, and a data source label is added at the tail end of each text to be trained to construct first-class training data;

constructing an original countermeasure network based on the feature extraction model, the domain discrimination model and the diagnosis classification model;

Inputting the first type training data into a preset feature extraction model for vectorization processing to obtain second type training data; inputting the second training data into a preset domain discrimination model for training, and generating a prediction probability value of the classification result belonging to the input data source; calculating a cross entropy loss function value of the data source category based on the predicted probability value of the classification result belonging to the input data source and the actual probability value of the classification result belonging to the input data source;

Selecting second-class training data with a data source as a source domain to construct third-class training data, inputting the third-class training data into a preset diagnosis classification model for training, and generating a prediction probability value of a disease category belonging to a corresponding disease category; calculating a cross entropy loss function value of the disease diagnosis class based on the predicted probability value of the disease class belonging to the corresponding disease class and the actual probability value of the disease class belonging to the corresponding disease class;

Determining a total training loss value based on the cross entropy loss function value of the data source category and the cross entropy loss function value of the disease diagnosis category, and performing back-propagation (reverse gradient propagation) training on the original countermeasure network according to the total training loss value; training is stopped when the target number of training rounds is reached or when the total loss value converges to its minimum within the preset training rounds, so as to obtain the target countermeasure network;

Inputting the text data of the target domain unlabeled electronic medical record to be migrated into a target countermeasure network, and outputting a disease diagnosis prediction result of the text data of the target domain unlabeled electronic medical record so as to finish auxiliary diagnosis of medical cross-domain.

In some optional implementations of some embodiments, calculating the selected number of unlabeled target-domain electronic medical record text data based on a relation formula between the number of labeled source-domain electronic medical record text data and the number of unlabeled target-domain electronic medical record text data specifically includes:

If the number of labeled source-domain electronic medical record text data is Y, the relation formula between the number of labeled source-domain electronic medical record text data and the number of unlabeled target-domain electronic medical record text data is:

[formula not reproduced in this text]

wherein: X represents the selected number of unlabeled target-domain electronic medical record text data, and the two remaining symbols of the formula represent the target-domain average perplexity (confusion degree) and the source-domain average perplexity, respectively.

In some optional implementations of some embodiments, the screening, based on the weight random sampling method, the target-domain unlabeled electronic medical record text data corresponding to the selected number from the original target-domain unlabeled electronic medical record text data specifically includes:

Inputting all obtained original target-domain unlabeled electronic medical record text data into a trained source-domain model for prediction; during prediction, the source-domain model outputs an entropy value for each character in each original target-domain unlabeled electronic medical record text data, and the maximum of the entropy values of all characters in a record is taken as the entropy value of that record;

Arranging all the obtained entropy values in ascending order to construct an entropy value score interval;

Equally dividing the entropy value score interval into N parts to obtain N entropy value score subintervals;

Assigning weights to the N entropy value score subintervals as an increasing arithmetic progression from 1 to N, randomly sampling in the corresponding entropy value score subintervals according to the assigned weights, and screening out the selected number of unlabeled target-domain electronic medical record text data.
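The per-record entropy score used in the first screening step can be sketched as follows. This is a minimal illustration, assuming that the source-domain model yields one probability distribution per character and that the entropy is the Shannon entropy of that distribution; the text does not specify either point, and the function names and sample distributions are hypothetical.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def record_entropy(char_distributions: Sequence[Sequence[float]]) -> float:
    """Entropy score of one record: the maximum entropy over its characters."""
    return max(shannon_entropy(d) for d in char_distributions)


# Hypothetical per-character output distributions for two records.
record_a = [[0.9, 0.1], [0.5, 0.5]]
record_b = [[0.99, 0.01], [0.8, 0.2]]

scores: List[float] = [record_entropy(record_a), record_entropy(record_b)]
ranked = sorted(scores)  # ascending order, as in the second step
```

Taking the maximum rather than the mean means that a single highly uncertain character is enough to give the whole record a high score.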

In some optional implementations of some embodiments, the inputting the first type of training data into a preset feature extraction model to perform vectorization processing to obtain the second type of training data specifically includes:

Inputting first-class training data comprising the texts to be trained and the corresponding data source labels into the feature extraction model, wherein the data source is the source domain or the target domain and the feature extraction model adopts a Multi-Query Attention head structure; generating, through the BERT structure of the feature extraction model, the text vectors corresponding to each data source, namely source-domain text vectors and target-domain text vectors; and constructing the second-class training data from the source-domain text vectors and the target-domain text vectors.

In some optional implementations of some embodiments, the inputting the second class of training data into a preset domain discriminant model for training, generating a predicted probability value that the classification result belongs to the input data source specifically includes:

Inputting the second-class training data into a preset domain discrimination model, wherein the domain discrimination model is a fully connected network (FCN) model; performing format processing on the input second-class training data through the input layer of the domain discrimination model; performing, through the hidden layer of the domain discrimination model, a weighted summation of the feature vectors output by the input layer based on weights and biases, followed by a nonlinear transformation through an activation function, to obtain the data source category output of the second-class training data; inputting this output into the output layer; and finally outputting, through a softmax function, the predicted probability value that the classification result belongs to the input data source.

In some optional implementations of some embodiments, the calculating the cross entropy loss function value of the data source category based on the predicted probability value that the classification result belongs to the input data source and the actual probability value that the classification result belongs to the input data source specifically includes:

The cross entropy loss function value of the data source category is calculated by the following formula (standard cross entropy form; the symbols $L_{d}$, $p_{i}$ and $y_{i}$ are notation adopted here):

$$L_{d} = -\sum_{i} y_{i} \log\left(p_{i}\right)$$

wherein: $L_{d}$ represents the cross entropy loss function value of the data source category, $p_{i}$ represents the predicted probability value that the classification result belongs to input data source i, and $y_{i}$ represents the actual probability value that the classification result belongs to input data source i.

In some optional implementations of some embodiments, selecting the second-class training data whose data source is the source domain to construct third-class training data, inputting the third-class training data into a preset diagnosis classification model for training, and generating the predicted probability value that the disease category belongs to the corresponding disease category specifically includes:

Selecting the second-class training data whose data source is the source domain to construct third-class training data, and inputting the third-class training data into a preset diagnosis classification model, wherein the preset diagnosis classification model is a fully connected network (FCN) model; performing format processing on the input third-class training data through the input layer of the diagnosis classification model; performing, through the hidden layer of the diagnosis classification model, a weighted summation of the feature vectors output by the input layer based on weights and biases, followed by a nonlinear transformation through an activation function, to obtain the disease category output of the third-class training data; inputting this output into the output layer; and finally outputting, through a softmax function, the predicted probability value that the disease category belongs to the corresponding disease category.

In some optional implementations of some embodiments, calculating the cross entropy loss function value of the disease diagnosis class based on the predicted probability value that the disease class belongs to the corresponding disease class and the actual probability value that the disease class belongs to the corresponding disease class specifically includes:

The cross entropy loss function value of the disease diagnosis category is calculated by the following formula (standard cross entropy form; the symbols $L_{c}$, $p_{i}$ and $y_{i}$ are notation adopted here):

$$L_{c} = -\sum_{i=1}^{n} y_{i} \log\left(p_{i}\right)$$

wherein: $L_{c}$ represents the cross entropy loss function value of the disease diagnosis category, $p_{i}$ represents the predicted probability value that the disease category belongs to the corresponding disease category i, $y_{i}$ represents the actual probability value that the disease category belongs to the corresponding disease category i, and n represents the number of actual disease categories.

In some optional implementations of some embodiments, the determining the trained total loss value based on the cross-entropy loss function value of the data source class and the cross-entropy loss function value of the disease diagnosis class specifically includes:

summing the cross entropy loss function value of the disease diagnosis category and the cross entropy loss function value of the data source category to obtain a trained total loss value.
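As a worked illustration of this summation, the following minimal sketch computes the two cross entropy values and adds them. It assumes the standard cross entropy form with one-hot actual probabilities; the probability values, the `eps` clamp and all names are hypothetical and serve only as an example.

```python
import math
from typing import Sequence


def cross_entropy(actual: Sequence[float], predicted: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Standard cross entropy; eps guards against log(0)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(actual, predicted))


# Data source category: [source domain, target domain]; the sample is from the source domain.
domain_loss = cross_entropy([1.0, 0.0], [0.7, 0.3])

# Disease diagnosis category: three hypothetical diseases; the second is the true label.
diagnosis_loss = cross_entropy([0.0, 1.0, 0.0], [0.2, 0.5, 0.3])

# Total training loss: the plain sum of the two values.
total_loss = domain_loss + diagnosis_loss
```

With one-hot actual probabilities, each loss reduces to the negative log of the probability assigned to the true category, so here the total is -log(0.7) - log(0.5).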

According to a second aspect of the present invention there is provided a data screening and countermeasure network based medical cross-domain auxiliary diagnostic apparatus, the apparatus comprising:

the data acquisition module is used for acquiring a plurality of source domain marked electronic medical record text data and original target domain unmarked electronic medical record text data;

The quantity calculation module is used for calculating the selected quantity of the unlabeled electronic medical record text data of the target domain based on a relation formula between the quantity of the labeled electronic medical record text data of the source domain and the quantity of the unlabeled electronic medical record text data of the target domain;

The screening module is used for screening the unlabeled electronic medical record text data of the target domain corresponding to the selected number from the unlabeled electronic medical record text data of the original target domain based on a weight random sampling method;

The first training data construction module is used for taking the electronic medical record text data marked in the source domain and the electronic medical record text data unmarked in the screened target domain as texts to be trained, and adding a data source label at the tail of each text to be trained to construct first training data;

the framework construction module is used for constructing an original countermeasure network based on the feature extraction model, the domain discrimination model and the diagnosis classification model;

The feature extraction module is used for inputting the first type of training data into a preset feature extraction model to carry out vectorization processing to obtain the second type of training data;

The domain judging module is used for inputting the second-class training data into a preset domain discrimination model for training and generating a predicted probability value that the classification result belongs to the input data source, and for calculating a cross entropy loss function value of the data source category based on the predicted probability value that the classification result belongs to the input data source and the actual probability value that the classification result belongs to the input data source;

the diagnosis classification module is used for selecting second-class training data with a data source as a source domain to construct third-class training data, inputting the third-class training data into a preset diagnosis classification model for training, and generating a prediction probability value that the disease class belongs to the corresponding disease class; calculating a cross entropy loss function value of the disease diagnosis class based on the predicted probability value of the disease class belonging to the corresponding disease class and the actual probability value of the disease class belonging to the corresponding disease class;

The framework training module is used for determining a total training loss value based on the cross entropy loss function value of the data source category and the cross entropy loss function value of the disease diagnosis category, performing back-propagation (reverse gradient propagation) training on the original countermeasure network according to the total training loss value, and stopping training when the target number of training rounds is reached or when the total loss value converges to its minimum within the preset training rounds, so as to obtain the target countermeasure network;

the prediction module is used for inputting the target domain unlabeled electronic medical record text data to be migrated into the target countermeasure network, and outputting a disease diagnosis prediction result corresponding to the target domain unlabeled electronic medical record text data so as to complete medical cross-domain auxiliary diagnosis.

The invention has the following advantages: in the medical cross-domain auxiliary diagnosis method and device based on data screening and a countermeasure network, given a known number of labeled source-domain electronic medical record text data, the required number of unlabeled target-domain electronic medical record text data can be calculated through a relation formula between the number of labeled source-domain electronic medical record text data and the number of unlabeled target-domain electronic medical record text data, and that number of unlabeled target-domain records is screened out by a weighted random sampling method. The labeled source-domain electronic medical record text data and the screened unlabeled target-domain electronic medical record text data are then combined for training with the optimized countermeasure network, so that the inference speed of the model is increased while the model effect is maintained, and the accuracy of the disease diagnosis prediction result is improved. Meanwhile, the optimized countermeasure network retains the original model's performance on disease diagnosis classification and can mine deep common features between the source domain and the target domain, so that the model can be migrated effectively without labeled target-domain data, and the problem of catastrophic forgetting is alleviated.

Drawings

FIG. 1 is a flow chart of a medical cross-domain auxiliary diagnosis method based on data screening and a countermeasure network;

FIG. 2 is a schematic structural diagram of a medical cross-domain auxiliary diagnosis device based on data screening and a countermeasure network.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by the following detailed description with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Example 1

Referring to FIG. 1, a medical cross-domain auxiliary diagnosis method based on data screening and a countermeasure network comprises:

S1, acquiring a plurality of source-domain labeled electronic medical record text data and original target-domain unlabeled electronic medical record text data.

In this embodiment, in the application field of medical cross-domain auxiliary diagnosis, the source domain refers to a department on which the model has already been trained, such as the respiratory department, and the target domain refers to the department to which the model is to be applied, such as the psychiatric department.

In this embodiment, the source-domain labeled electronic medical record text data is labeled in the format [electronic medical record text] [diagnosis result], for example:

[Electronic medical record text] 2 years and 8 months old; poor language expression, excessive language expression, concentrated attention, naive expression, able to join groups, active, normal motor coordination, normal eating.

[Diagnosis result] Language development disorder.

S2, calculating the selected number of the electronic medical record text data which is not marked in the target domain based on a relation formula between the number of the electronic medical record text data which is marked in the source domain and the number of the electronic medical record text data which is not marked in the target domain.

In this embodiment, the selected number of unlabeled target-domain electronic medical record text data is calculated based on the relation formula between the number of labeled source-domain electronic medical record text data and the number of unlabeled target-domain electronic medical record text data. The quantities entering the formula are obtained as follows, for example: 1000 labeled source-domain electronic medical record text data are randomly acquired, the perplexity (confusion degree) of the trained source-domain model on these 1000 records is calculated, and the procedure is repeated 10 times and averaged to obtain the source-domain average perplexity. The trained source-domain model is obtained by training the publicly available UIE model on the labeled source-domain electronic medical record text data. The target-domain average perplexity is obtained in the same way on the target domain.

It should be understood that the method adopts the perplexity (confusion degree) index to measure the performance of the model. Perplexity is widely used in the field of natural language processing and can be used to verify how well a trained model understands unseen data; in general, the lower the perplexity, the better the model predicts the text.
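The repeated-sampling estimate of the average perplexity (the "confusion degree" above) can be sketched as follows. This is a minimal illustration under stated assumptions: perplexity is taken as the exponential of the negative mean token log-probability, pooled over all tokens of a sampled batch, and each record is represented by a hypothetical list of per-token log-probabilities that a trained model would supply. The text does not specify these details.

```python
import math
import random
from typing import List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean log-probability over all tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def average_perplexity(records: List[List[float]], sample_size: int,
                       repeats: int, seed: int = 0) -> float:
    """Sample `sample_size` records `repeats` times and average the perplexities."""
    rng = random.Random(seed)
    values = []
    for _ in range(repeats):
        batch = rng.sample(records, sample_size)
        pooled = [lp for record in batch for lp in record]
        values.append(perplexity(pooled))
    return sum(values) / len(values)


# Hypothetical data: 50 records of 4 tokens, each token predicted with probability 0.5.
records = [[math.log(0.5)] * 4 for _ in range(50)]
avg = average_perplexity(records, sample_size=10, repeats=3)
```

In the embodiment the corresponding settings would be `sample_size=1000` and `repeats=10`, applied once to the source domain and once to the target domain.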

And S3, screening the unlabeled electronic medical record text data of the target fields corresponding to the selected number from the unlabeled electronic medical record text data of the original target fields based on a weight random sampling method.

In this embodiment, screening the selected number of unlabeled target-domain electronic medical record text data from the original target-domain unlabeled electronic medical record text data based on the weighted random sampling method, that is, selecting the X target-domain texts, specifically includes:

(1) Sending all obtained original target-domain unlabeled electronic medical record text data into the trained source-domain model for prediction; during prediction, the source-domain model outputs an entropy value for each character in each record, and the maximum of the entropy values of all characters in a record is taken as the entropy value of that record;

(2) Arranging all the text data of the unlabeled electronic medical record in the original target domain in an ascending order according to all the entropy values obtained in the previous step, and constructing an entropy value score interval;

(3) Equally dividing the entropy value score interval into 10 equal parts to obtain 10 entropy value score subintervals;

(4) Assigning weights to the 10 entropy value score subintervals as an increasing arithmetic progression from 1 to 10 (namely 1, 2, 3, ..., 10), randomly sampling in the corresponding entropy value score subintervals according to the assigned weights, and screening out the selected number of unlabeled target-domain electronic medical record text data.

It should be appreciated that a higher entropy value represents higher uncertainty in the data; therefore, during random sampling, higher weights are assigned to the entropy value score subintervals with higher entropy values, so that such data is more likely to be drawn.
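Steps (2) to (4) can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes equal-width subintervals over the range of entropy scores, gives each record the weight (1 to 10) of its subinterval, and draws records without replacement using Efraimidis-Spirakis random keys, a technique substituted here because the text does not name a specific sampling procedure. All names and the sample scores are hypothetical.

```python
import random
from typing import List, Sequence


def bin_weights(scores: Sequence[float], n_bins: int = 10) -> List[int]:
    """Map each entropy score to the weight (1..n_bins) of its equal-width subinterval."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins
    if width == 0.0:
        return [1] * len(scores)
    # The maximum score would fall just past the last bin, so clamp it into bin n_bins.
    return [min(int((s - lo) / width), n_bins - 1) + 1 for s in scores]


def weighted_sample(scores: Sequence[float], k: int,
                    n_bins: int = 10, seed: int = 0) -> List[int]:
    """Return indices of k records drawn without replacement, favoring high-entropy bins."""
    rng = random.Random(seed)
    weights = bin_weights(scores, n_bins)
    # Efraimidis-Spirakis keys: a larger weight tends to produce a larger key.
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:k])


entropy_scores = [i / 99 for i in range(100)]  # hypothetical scores in [0, 1]
weights = bin_weights(entropy_scores)
picked = weighted_sample(entropy_scores, k=20)  # k plays the role of X
```

Records in the highest-entropy subinterval carry ten times the weight of those in the lowest, so they are more likely to be selected while low-entropy records can still appear.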

S4, taking the electronic medical record text data marked in the source domain and the electronic medical record text data not marked in the screened target domain as texts to be trained, and adding a data source label at the tail end of each text to be trained to construct first-class training data.

In this embodiment, the labeled source-domain electronic medical record text data and the unlabeled target-domain electronic medical record text data obtained in step S3 are merged to construct the texts to be trained. A data source label, specifically "source domain" or "target domain", is appended to the end of each text to be trained; this is repeated until every text to be trained carries a data source label, yielding the first-class training data used to train the subsequent domain discrimination model.
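The construction of the first-class training data can be sketched as follows. This is a minimal illustration; the tag strings, the separator and the function name are hypothetical, since the text only states that "source domain" or "target domain" is added at the end of each text.

```python
from typing import List, Sequence

# Hypothetical tag strings; the text does not fix an exact label format.
SOURCE_TAG = "[source domain]"
TARGET_TAG = "[target domain]"


def build_first_class_data(source_texts: Sequence[str],
                           target_texts: Sequence[str]) -> List[str]:
    """Merge both text sets and append a data source label to the end of each text."""
    tagged = [f"{text} {SOURCE_TAG}" for text in source_texts]
    tagged += [f"{text} {TARGET_TAG}" for text in target_texts]
    return tagged


first_class = build_first_class_data(
    ["cough for three days"],          # hypothetical source-domain text
    ["poor sleep, low mood"],          # hypothetical screened target-domain text
)
```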

S5, constructing an original countermeasure network based on the feature extraction model, the domain discrimination model and the diagnosis classification model.

In this embodiment, the construction of the original countermeasure network is completed according to the feature extraction model, the domain discrimination model, and the diagnostic classification model.

S6, inputting the first type training data into a preset feature extraction model for vectorization processing to obtain second type training data; inputting the second training data into a preset domain discrimination model for training, and generating a prediction probability value of the classification result belonging to the input data source; and calculating a cross entropy loss function value of the data source category based on the predicted probability value of the classification result belonging to the input data source and the actual probability value of the classification result belonging to the input data source.

In this embodiment, the first training data is input into a preset feature extraction model for vectorization processing, so as to obtain second training data, which specifically includes:

In this embodiment, the feature extraction part of the countermeasure network adopts the Multi-Query Attention structure in place of the traditional Multi-Head Attention structure, which maintains the stability of the model while improving its inference speed.

Further, Multi-Head Attention consists of several attention heads, each implemented jointly by three matrices: query (Q), key (K) and value (V). The Multi-Query Attention structure is identical to the Multi-Head Attention structure except that the different attention heads share a single set of key and value weights. Adopting the Multi-Query Attention structure therefore reduces the multi-head computation and greatly improves decoding speed without reducing accuracy.
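The sharing of key and value weights across heads can be sketched as follows. This is a minimal pure-Python illustration, assuming scaled dot-product attention and omitting masking, the output projection and batching; the matrix helpers, names and toy weights are hypothetical and are not taken from the invention.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_query_attention(x: Matrix, wq_heads: List[Matrix],
                          wk: Matrix, wv: Matrix) -> Matrix:
    """Each head has its own query projection; all heads share wk and wv."""
    k = matmul(x, wk)  # computed once and reused by every head
    v = matmul(x, wv)
    k_t = [list(col) for col in zip(*k)]
    scale = math.sqrt(len(wk[0]))
    head_outputs = []
    for wq in wq_heads:
        q = matmul(x, wq)
        scores = matmul(q, k_t)
        attn = [softmax([s / scale for s in row]) for row in scores]
        head_outputs.append(matmul(attn, v))
    # Concatenate the head outputs along the feature dimension.
    return [sum((h[i] for h in head_outputs), []) for i in range(len(x))]


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]                        # two tokens, model width 2
wq_heads = [identity, [[2.0, 0.0], [0.0, 2.0]]]     # two heads, separate queries
out = multi_query_attention(x, wq_heads, identity, identity)
```

In standard multi-head attention `k` and `v` would be recomputed inside the loop with per-head weights; computing them once is what reduces the key and value computation and storage during decoding.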

In the embodiment, in the feature extraction stage, text data is vectorized, and the input text is subjected to digital vector conversion through a BERT structure so as to carry out mathematical matrix operation of a subsequent model; and inputting texts to be trained from the source domain and the target domain in the feature extraction model, outputting a source domain text vector and a target domain text vector through the BERT structure of the feature extraction model, and constructing second-class training data.

In this embodiment, the second class training data is input into a preset domain discrimination model for training, and the generation of the prediction probability value that the classification result belongs to the input data source specifically includes:

Furthermore, the domain discrimination model and the diagnosis classification model both adopt fully connected network (FCN) models. Both the domain discrimination process and the diagnosis classification process perform class discrimination on the input data; the difference between them is that the domain discrimination model determines which data source the input data belongs to, whereas the diagnosis classification model determines which disease category the input data belongs to.

Furthermore, the FCN model used in the present invention is a common neural network structure, also called a multi-layer perceptron, whose structure comprises an input layer, hidden layers and an output layer. The input layer receives input data, such as images or texts, and converts it into a format that the FCN model can process. The FCN model typically comprises a plurality of hidden layers, each made up of a plurality of neurons; each neuron receives the output of the previous layer, performs a weighted summation using its weights and bias, and then applies a nonlinear activation function to obtain its output. The output of the last hidden layer is fed into the output layer, which typically contains as many neurons as there are classes, one neuron per class, and the output of each neuron indicates the probability that the sample belongs to the corresponding class. In addition, the invention uses a loss function to measure the difference between the model output and the real label, and updates the weights and biases in the network by gradient descent through the back-propagation algorithm so as to minimize the loss function, thereby enabling the model to better perform the classification task. The FCN model thus maps input data to output classes through the nonlinear transformations and weight adjustments of multiple hidden layers, thereby implementing the classification task.
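The layer structure described above can be sketched as a forward pass in plain Python. This is an illustrative sketch only: the layer sizes, the ReLU activation, the random initialization and the names `dense`, `init_layer` and `forward` are assumptions, and the 16-dimensional input stands in for a text vector produced by the feature extraction stage.

```python
import math
import random

def relu(values):
    """Nonlinear activation applied to every neuron of a hidden layer."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Weighted sum plus bias; `weights` holds one row per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Hypothetical small random weights and zero biases."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(features, layers):
    """`layers` is a list of (weights, biases); the last entry is the output layer."""
    hidden = features
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    out_w, out_b = layers[-1]
    return softmax(dense(hidden, out_w, out_b))

rng = random.Random(0)
text_vector = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # stand-in for a text vector

# Same kind of network, different output sizes:
# 2 data sources for domain discrimination, 5 (assumed) diseases for diagnosis.
domain_head = [init_layer(16, 8, rng), init_layer(8, 2, rng)]
diagnosis_head = [init_layer(16, 8, rng), init_layer(8, 5, rng)]

domain_probs = forward(text_vector, domain_head)
disease_probs = forward(text_vector, diagnosis_head)
```

The two heads differ only in the number of output neurons, which mirrors the statement that the domain discrimination model and the diagnosis classification model share the same model type but classify into different category sets.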

In this embodiment, the cross entropy loss function value of the data source class is calculated based on the predicted probability value that the classification result belongs to the input data source and the actual probability value that the classification result belongs to the input data source, which specifically includes:

$$L_{domain} = -\sum_{i=1}^{2} y_i \log(p_i)$$

wherein: $L_{domain}$ represents the cross-entropy loss function value of the data source class, $p_i$ represents the predicted probability value that the classification result belongs to input data source $i$, and $y_i$ represents the actual probability value that the classification result belongs to input data source $i$; the number of input data sources is 2, namely the source domain and the target domain.

It should be noted that, in the FCN model, the probability distributions of the predicted result and the actual result are obtained as follows. For the predicted result, in classification tasks the output layer of the model typically uses a softmax function to convert the raw output of the model into a probability distribution in which the output probabilities of all classes sum to 1; the prediction of the model can therefore be expressed as a probability for each class. For the actual result, one-hot coding is usually used: the position corresponding to the real class of the sample is 1 and all other positions are 0, so the probability distribution of the actual result is a vector in which the sample belongs to the real class with probability 1 and to every other class with probability 0. In this way, the probability distributions of the predicted and actual results can be compared, and the loss function can then be calculated for model training. The closer the predicted probability distribution is to the actual probability distribution, the smaller the value of the loss function and the more accurate the prediction of the model.
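As a small worked illustration of the comparison between a softmax output and a one-hot label, the following Python sketch computes the cross-entropy for three hypothetical predictions of the data source. The probabilities, the constants `SOURCE` and `TARGET`, and the clipping value `eps` are assumptions made for the example.

```python
import math

def one_hot(index, num_classes):
    """Actual distribution: 1 at the real class, 0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def cross_entropy(predicted, actual, eps=1e-12):
    """Cross-entropy between a predicted and an actual distribution.
    `eps` guards against log(0); it is a numerical convenience, not part of the formula."""
    return -sum(y * math.log(max(p, eps)) for p, y in zip(predicted, actual))

SOURCE, TARGET = 0, 1          # the two data sources
actual = one_hot(SOURCE, 2)    # the sample really comes from the source domain

confident = cross_entropy([0.9, 0.1], actual)  # close to the actual distribution
unsure = cross_entropy([0.6, 0.4], actual)     # further away
wrong = cross_entropy([0.1, 0.9], actual)      # predicts the wrong data source
```

The three loss values increase as the predicted distribution moves away from the one-hot label, which is the behaviour the paragraph above relies on.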

S7, selecting the second-class training data whose data source is the source domain to construct third-class training data, inputting the third-class training data into a preset diagnosis classification model for training, and generating a predicted probability value that the disease category belongs to the corresponding disease category; and calculating the cross-entropy loss function value of the disease diagnosis class based on the predicted probability value and the actual probability value that the disease category belongs to the corresponding disease category.

In this embodiment, selecting the second-class training data whose data source is the source domain to construct the third-class training data, inputting the third-class training data into the preset diagnosis classification model for training, and generating the predicted probability value that the disease category belongs to the corresponding disease category specifically includes:

In this embodiment, the input of the diagnosis classification model is the third-class training data from the source domain S, which is labeled data in the format ([electronic medical record text] [diagnosis result]), where [electronic medical record text] serves as the training text and [diagnosis result] serves as the training target (label). The numerical vector representation obtained by vectorizing the electronic medical record text (namely the source-domain text vector output by the feature extraction stage) and the training target (label) are used for fitting in the diagnosis classification model, so that each input electronic medical record text has a corresponding diagnosis result.

It should be further noted that, since the diagnosis classification model performs numerical operations, the diagnosis result it outputs is also a number that is mapped to the corresponding disease. That is, before the diagnosis classification model is trained, each disease in the labeled data (the [diagnosis result]) is replaced in advance by a numeric id, with ids assigned in ascending order from 0 so that each disease has a unique corresponding id; training is then performed, and the output result is the id corresponding to a certain disease.
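The replacement of diseases by numeric ids can be sketched as follows. The records are invented toy data, the function name `build_label_ids` is hypothetical, and assigning ids in order of first appearance is an assumption: the text only requires that ids ascend from 0 and be unique per disease.

```python
def build_label_ids(diagnoses):
    """Assign ids 0, 1, 2, ... to diseases in order of first appearance."""
    label_to_id = {}
    for diagnosis in diagnoses:
        if diagnosis not in label_to_id:
            label_to_id[diagnosis] = len(label_to_id)
    id_to_label = {i: name for name, i in label_to_id.items()}
    return label_to_id, id_to_label

# Toy labeled source-domain records: (electronic medical record text, diagnosis result).
records = [
    ("cough and fever for three days", "pneumonia"),
    ("upper abdominal pain after meals", "gastritis"),
    ("productive cough, chest X-ray infiltrate", "pneumonia"),
    ("persistently elevated blood pressure", "hypertension"),
]

label_to_id, id_to_label = build_label_ids(d for _, d in records)
encoded = [label_to_id[d] for _, d in records]   # training targets as ids

# After prediction, an output id is mapped back to its disease name.
predicted_id = 1
predicted_disease = id_to_label[predicted_id]
```

Keeping the inverse mapping `id_to_label` alongside the forward mapping is what allows the numeric output of the model to be reported as a disease name.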

In this embodiment, the cross entropy loss function value of the disease diagnosis class is calculated based on the predicted probability value that the disease class belongs to the corresponding disease class and the actual probability value that the disease class belongs to the corresponding disease class, and specifically includes:

S8, determining a total training loss value based on the cross-entropy loss function value of the data source class and the cross-entropy loss function value of the disease diagnosis class, and performing back-propagation training on the original countermeasure network according to the total training loss value; training is stopped when the target training round is reached or when the total loss value has converged to its minimum within the preset training rounds, so as to obtain the target countermeasure network.

In this embodiment, determining the total loss value for training based on the cross entropy loss function value of the data source class and the cross entropy loss function value of the disease diagnosis class specifically includes:

Further, the total loss value is calculated from the cross-entropy loss function value of the data source class and the cross-entropy loss function value of the disease diagnosis class according to the following formula:

wherein the symbols in the formula represent, in order, the cross-entropy loss function value of the data source class, the cross-entropy loss function value of the disease diagnosis class, and the total loss value.

Further, the target training round denotes the final number of training rounds to be reached (for example, 100 rounds), and the preset training round denotes a designated number of training rounds (for example, 3 rounds). Training is stopped when the target training round is reached or when the total loss value has converged to its minimum within the preset training rounds, thereby obtaining the target countermeasure network.
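One possible reading of this stopping rule is sketched below in Python: training ends at the target round, or earlier if the total loss has not reached a new minimum for the preset number of consecutive rounds. This interpretation, the function name `train_with_stopping` and the callback `run_round` (standing in for one round of back-propagation training that returns the total loss) are assumptions, not details stated in the text.

```python
def train_with_stopping(run_round, target_rounds=100, preset_rounds=3):
    """Run up to `target_rounds` rounds; stop early when the total loss has not
    improved on its best value for `preset_rounds` consecutive rounds."""
    best_loss = float("inf")
    rounds_without_improvement = 0
    history = []
    for round_index in range(1, target_rounds + 1):
        total_loss = run_round(round_index)   # hypothetical: trains one round
        history.append(total_loss)
        if total_loss < best_loss:
            best_loss = total_loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
        if rounds_without_improvement >= preset_rounds:
            break
    return best_loss, history

# Made-up loss curve that flattens out after the third round.
fake_losses = [1.0, 0.7, 0.5, 0.52, 0.51, 0.50, 0.49]
best, history = train_with_stopping(lambda r: fake_losses[r - 1], target_rounds=7)

# A curve that keeps improving runs until the target round is reached.
best_full, history_full = train_with_stopping(lambda r: 1.0 / r, target_rounds=5)
```

With the made-up curve, training stops after the sixth round because rounds four to six fail to beat the minimum reached in round three; the second call runs all five rounds because every round improves the loss.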

S9, inputting the target domain unlabeled electronic medical record text data to be migrated into a target countermeasure network, and outputting a disease diagnosis prediction result corresponding to the target domain unlabeled electronic medical record text data so as to complete medical cross-domain auxiliary diagnosis.

In this embodiment, the target-domain unlabeled electronic medical record text data to be migrated is input into the target countermeasure network, which outputs the corresponding disease diagnosis prediction result.

Example two

On the basis of the first embodiment, this embodiment provides a medical cross-domain auxiliary diagnostic apparatus 200 based on data screening and a countermeasure network (see fig. 2), which implements the steps of the medical cross-domain auxiliary diagnosis method based on data screening and a countermeasure network described in the first embodiment. The apparatus 200 mainly includes: a data acquisition module 210, a quantity calculation module 220, a screening module 230, a first class training data construction module 240, a frame construction module 250, a feature extraction module 260, a domain discrimination module 270, a diagnostic classification module 280, a frame training module 290, and a prediction module 300, wherein:

The data acquisition module 210 is configured to obtain a number of source domain labeled electronic medical record text data and original target domain unlabeled electronic medical record text data;

The quantity calculation module 220 is configured to calculate the selected number of target-domain unlabeled electronic medical record text data based on a relation formula between the number of source-domain labeled electronic medical record text data and the number of target-domain unlabeled electronic medical record text data;

The screening module 230 is configured to screen, based on a weighted random sampling method, the selected number of target-domain unlabeled electronic medical record text data from the original target-domain unlabeled electronic medical record text data;

The first class training data construction module 240 is configured to take the source-domain labeled electronic medical record text data and the screened target-domain unlabeled electronic medical record text data as texts to be trained, and to add a data source tag at the end of each text to be trained so as to construct first class training data;

The frame construction module 250 is configured to construct an original countermeasure network based on the feature extraction model, the domain discrimination model and the diagnosis classification model;

The feature extraction module 260 is configured to input the first training data into a preset feature extraction model for vectorization processing, so as to obtain second training data;

The domain discrimination module 270 is configured to input the second class training data into a preset domain discrimination model for training, and generate a predicted probability value that the classification result belongs to the input data source; and to calculate the cross-entropy loss function value of the data source class based on the predicted probability value and the actual probability value that the classification result belongs to the input data source;

The diagnostic classification module 280 is configured to select the second class training data whose data source is the source domain to construct third class training data, input the third class training data into a preset diagnosis classification model for training, and generate a predicted probability value that the disease category belongs to the corresponding disease category; and to calculate the cross-entropy loss function value of the disease diagnosis class based on the predicted probability value and the actual probability value that the disease category belongs to the corresponding disease category;

The frame training module 290 is configured to determine a total training loss value based on the cross-entropy loss function value of the data source class and the cross-entropy loss function value of the disease diagnosis class, and to perform back-propagation training on the original countermeasure network according to the total training loss value, stopping when the target training round is reached or when the total loss value has converged to its minimum within the preset training rounds, so as to obtain the target countermeasure network;

The prediction module 300 is configured to input the target domain unlabeled electronic medical record text data to be migrated into the target countermeasure network, and output a disease diagnosis prediction result corresponding to the target domain unlabeled electronic medical record text data, so as to complete medical cross-domain auxiliary diagnosis.
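As an illustration of the screening performed by the screening module 230, the following Python sketch draws a fixed number of items by weighted random sampling without replacement. It uses the Efraimidis-Spirakis key method as one possible technique, which the text does not name; the function name, the toy record identifiers and the weights are hypothetical, and how the weights are actually derived is not covered by this sketch.

```python
import random

def weighted_sample_without_replacement(items, weights, k, rng):
    """Pick k distinct items; items with larger weights are more likely to be kept.
    Each item gets the key u ** (1 / weight) with u uniform in [0, 1),
    and the k items with the largest keys are returned."""
    if k > len(items):
        raise ValueError("k cannot exceed the number of items")
    keyed = []
    for item, weight in zip(items, weights):
        if weight <= 0:
            raise ValueError("weights must be positive")
        keyed.append((rng.random() ** (1.0 / weight), item))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in keyed[:k]]

rng = random.Random(0)
target_texts = ["record_%d" % i for i in range(10)]   # toy unlabeled target-domain records
weights = [float(i + 1) for i in range(10)]           # hypothetical per-record weights
selected_number = 4                                   # value supplied by the quantity calculation step
screened = weighted_sample_without_replacement(target_texts, weights, selected_number, rng)
```

Sampling without replacement guarantees that the screened set contains exactly the selected number of distinct records, which is what the subsequent training data construction expects.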

It will be apparent to those skilled in the art that the various step embodiments of the invention described above may be performed in ways other than those described herein, including but not limited to simulation methods and experimental apparatus described above. The steps of the invention described above may in some cases be performed in a different order than that shown or described above, and may be performed separately. Therefore, the present invention is not limited to any specific combination of hardware and software.

The foregoing is a further detailed description of the invention in connection with specific embodiments, and is not intended to limit the practice of the invention to such descriptions. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims

1. The medical cross-domain auxiliary diagnosis method based on the data screening and the countermeasure network is characterized by comprising the following steps:

2. The data screening and countermeasure network-based medical cross-domain auxiliary diagnosis method according to claim 1, wherein the calculating of the selected number of target-domain unlabeled electronic medical record text data based on a relational formula between the number of source-domain labeled electronic medical record text data and the number of target-domain unlabeled electronic medical record text data specifically includes:

wherein: x represents the selected number of target-domain unlabeled electronic medical record text data, and the other two symbols in the formula represent the target-domain average confusion (perplexity) and the source-domain average confusion (perplexity), respectively.

3. The data screening and countermeasure network-based medical cross-domain auxiliary diagnosis method according to claim 2, wherein the screening of the target-domain unlabeled electronic medical record text data corresponding to the selected number from the original target-domain unlabeled electronic medical record text data based on the weight random sampling method specifically comprises:

4. The data screening and countermeasure network-based medical cross-domain auxiliary diagnosis method according to claim 1, wherein the step of inputting the first type of training data into a preset feature extraction model for vectorization processing to obtain the second type of training data specifically comprises:

5. The method for medical cross-domain auxiliary diagnosis based on data screening and countermeasure network according to claim 4, wherein the step of inputting the second class training data into a preset domain discrimination model for training, and generating a predicted probability value that the classification result belongs to the input data source, specifically comprises:

6. The data screening and countermeasure network-based medical cross-domain auxiliary diagnostic method according to claim 5, wherein the calculating the cross entropy loss function value of the data source class based on the predicted probability value that the classification result belongs to the input data source and the actual probability value that the classification result belongs to the input data source specifically comprises:

$$L_{domain} = -\sum_{i} y_i \log(p_i)$$

wherein: $L_{domain}$ represents the cross-entropy loss function value of the data source class, $p_i$ represents the predicted probability value that the classification result belongs to input data source $i$, and $y_i$ represents the actual probability value that the classification result belongs to input data source $i$.

7. The data screening and countermeasure network-based medical cross-domain auxiliary diagnosis method according to claim 5, wherein the selecting of the second class training data whose data source is the source domain to construct third class training data, inputting the third class training data into a preset diagnosis classification model for training, and generating a predicted probability value that the disease category belongs to the corresponding disease category specifically includes:

8. The data screening and countermeasure network-based medical cross-domain auxiliary diagnosis method according to claim 7, wherein the calculating of the cross entropy loss function value of the disease diagnosis class based on the predicted probability value of the disease class belonging to the corresponding disease class and the actual probability value of the disease class belonging to the corresponding disease class specifically comprises:

$$L_{diag} = -\sum_{i=1}^{n} y_i \log(p_i)$$

wherein: $L_{diag}$ represents the cross-entropy loss function value of the disease diagnosis class, $p_i$ represents the predicted probability value that the disease category belongs to the corresponding disease category $i$, $y_i$ represents the actual probability value that the disease category belongs to the corresponding disease category $i$, and $n$ represents the number of actual disease categories.

9. The data screening and countermeasure network-based medical cross-domain auxiliary diagnostic method of claim 1, wherein the determining of the trained total loss value based on the cross-entropy loss function value of the data source class and the cross-entropy loss function value of the disease diagnosis class specifically comprises:

10. A data screening and countermeasure network-based medical cross-domain auxiliary diagnostic apparatus, comprising: