CN113436698A - Automatic medical term standardization system and method integrating self-supervision and active learning - Google Patents

Automatic medical term standardization system and method integrating self-supervision and active learning

Info

Publication number
CN113436698A
CN113436698A
Authority
CN
China
Prior art keywords
term
model
standard
training
candidate set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110994475.7A
Other languages
Chinese (zh)
Other versions
CN113436698B (en)
Inventor
李劲松
杨宗峰
辛然
李玉格
史黎鑫
田雨
周天舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202110994475.7A
Publication of CN113436698A
Application granted
Publication of CN113436698B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Epidemiology (AREA)
  • Machine Translation (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses an automatic medical term standardization system and method integrating self-supervision and active learning. The system comprises basic modules including a candidate set generation module, a self-supervised learning module for training the term standardization model, an active learning module, and a precise ranking module that comprehensively evaluates the predictions of the term standardization model from text and semantic dimensions; it further comprises preferred modules including a semi-supervised learning module and a direct superior term retrieval module. The invention can build an automatic medical term standardization model with little labeled data, and the model remains capable of fast updating and upgrading, greatly reducing the workload of manual intervention while ensuring the accuracy of the output results. New clinical concepts can be matched to their direct superior terms and given the correct position in the standard glossary, ensuring the completeness and consistency of the standardization results.

Description

Automatic medical term standardization system and method integrating self-supervision and active learning
Technical Field
The invention belongs to the technical field of Chinese medical term standardization and multi-center medical information platforms, and particularly relates to a medical term automatic standardization system and method integrating self-supervision and active learning.
Background
With the popularization of electronic medical record systems, a large amount of important medical information is stored electronically in various medical information systems. These data create great value for clinical decision support, drug research and development, public health monitoring and evaluation, early warning of infectious disease epidemics, personalized precision medicine, and more. Medical data standardization is a key step in promoting the integration of domestic medical systems and realizing collaborative research and large-scale analysis of medical data, and standardizing medical terms is the first difficult problem to solve in that process. Internationally, different types of medical terms have corresponding standard terminology systems, including the disease terminology ICD-10, the surgical procedure codes ICD-9-CM-3, and the laboratory terminology LOINC. However, hospitals and other medical institutions do not make good use of these internationally accepted standard term sets in actual operation, mainly because: (1) different hospitals often adopt different medical information systems whose data standards differ, so the generated medical terms vary considerably in data dimension and data format; (2) different operators understand standard terminology and its granularity inconsistently. Medical information systems usually require the operator to select the disease name, operation name and other information according to the patient's condition, and where the meanings of superior and subordinate terms overlap (for example, the two ICD-10 codes "D00.2" and "D00.200" for "gastric carcinoma in situ"), different operators, or even the same operator at different times, may understand them differently; (3) operators personalize the terms they enter. Most information systems provide manual input for the convenience of entering new concepts, so operators may create irregular terms based on past experience and personal habits. These factors mean that original clinical concepts cannot be directly related to the accepted standard terms, and data unification and information exchange between different institutions are not easy.
The ultimate goal of medical term standardization is to establish a mapping between original clinical concepts and standard terms. Past term standardization schemes generally follow one of two approaches. (1) Manual mapping: professional clinicians are invited to map and proofread the terms one by one. However, each medical information system contains terms on the order of tens of thousands, so the proofreading time required of clinicians is very long, which makes rapid domestic popularization difficult and further hinders the rapid implementation of domestic medical data standardization. In addition, because doctors differ in work experience, the mapping to standard surgical terms lacks a uniform criterion, so consistency between different doctors is hard to guarantee; mapping results also contain manual errors, so consistency of the same doctor at different times is likewise hard to guarantee. (2) Training a medical concept semantic matching model with a machine learning algorithm: manual data annotation is difficult and time-consuming, so training data are insufficient, the resulting model generalizes poorly, and more manpower must be spent verifying the output to ensure the accuracy of the term standardization results actually used. On the other hand, many standard term sets have superior-subordinate relationships; for example, the subordinate terms of "corneal surgery (11)" in the surgical procedure codes ICD-9-CM-3 include "suture of corneal laceration (11.51)", "corneal transplant NOS (11.6)", and so on. When a concept generated in actual clinical practice has no same-meaning peer term in the standard term set, its direct superior standard term needs to be located accurately; existing methods cannot solve this well, so newly added clinical concepts cannot be fused into the accepted standard terminology system. The invention aims to establish a medical term standardization system with good accuracy and generalization capability without a large amount of labeled data, to realize fast automatic iterative updating of the system with as little manual intervention as possible, and at the same time to locate the correct peer or superior standard term for each original clinical concept.
Disclosure of Invention
Medical data standardization is a key step in promoting the integration of domestic medical systems and realizing collaborative research and large-scale analysis of medical data. However, existing clinical term standardization methods and systems generally require considerable manual review and labeling work, and their accuracy and generalization capability are difficult to guarantee, so clinical data standardization is difficult to popularize quickly at home.
The invention aims to provide an automatic medical term standardization system and method, based on a deep learning model and integrating self-supervision and active learning, to address the difficulties of current medical term standardization work.
The purpose of the invention is achieved by the following technical scheme. A medical term standardization model is constructed on the basis of a deep learning language model and trained with a self-supervised learning method; negative samples are drawn based on a text relevance model and the hierarchical structure of the standard glossary, obtaining negative samples that carry more information and are harder for the model to discriminate, which acts as data augmentation, so that the model can make full use of the semantic relations it contains even when only a small number of labeled samples are available. An active learning function is implemented based on principles such as maximum entropy, low confidence and high frequency; from the model's predictions on a large number of unknown samples, a group of samples that can improve model performance the most is screened out, so that the model can be upgraded quickly and markedly with minimal manual intervention. A precise ranking model is designed that integrates textual, semantic and other information to output the correct standard term. Precisely ranked samples automatically update the training data in a semi-supervised self-training manner, further improving the accuracy and generalization capability of the model while continuously reducing the workload of manual intervention. An upward retrieval method is constructed that locates newly added original clinical concepts to their direct superior terms, guaranteeing the completeness and consistency of the medical term standardization results; newly added clinical concepts can thus find their correct position in the standard glossary, which facilitates comprehensive standardization of clinical data.
The invention discloses an automatic medical term standardization system integrating self-supervision and active learning, which comprises the following components:
(1) a candidate set generation module: sampling negative samples based on a text relevance model and the hierarchical structure of the standard glossary to generate a training candidate set, and sampling possible positive samples based on the text relevance model to generate a prediction candidate set;
(2) a self-supervised learning module for training the term standardization model, comprising:
training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms;
computing, through a semantic matching model, the semantic similarity between a labeled original clinical concept and its label, and between the concept and each negative sample in the training candidate set;
computing the loss function of the term standardization model from these semantic similarities in a self-supervised learning manner;
(3) an active learning module: computing semantic similarity scores between unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out, according to the active learning criteria, a group of samples about which the current term standardization model is most uncertain, determining their labels, and merging them into the training candidate set;
(4) a precise ranking module: taking the semantic similarity scores between the original clinical concept and the standard terms output by the self-supervised learning module as semantic features, computing text features, and training a regression-decision-tree-based precise ranking model on the semantic and text features to compute confidence scores for the medical term standardization results; the trained precise ranking model computes a confidence score for each standard-term positive sample in the prediction candidate set, and the standard term with the largest confidence score is obtained.
Further, the automatic medical term standardization system further comprises a semi-supervised learning module that merges into the training candidate set those samples whose confidence scores for the medical term standardization results, output by the precise ranking module, satisfy the condition.
Further, the automatic medical term standardization system further comprises a direct superior term retrieval module, which: obtains the group of standard terms with the highest confidence scores predicted by the precise ranking model for the original clinical concept, and generates the paths tracing back to upper levels in the hierarchical structure of the standard glossary; and determines the direct superior term corresponding to the original clinical concept based on majority voting.
In another aspect, the invention discloses an automatic medical term standardization method integrating self-supervision and active learning, which comprises the following steps:
(1) generating negative samples and positive samples and constructing a training candidate set and a prediction candidate set respectively: sampling negative samples based on a text relevance model and the hierarchical structure of the standard glossary to generate the training candidate set, and sampling possible positive samples based on the text relevance model to generate the prediction candidate set;
(2) training the term standardization model by self-supervised learning: training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms; computing, through a semantic matching model, the semantic similarity between a labeled original clinical concept and its label, and between the concept and each negative sample in the training candidate set; computing the loss function of the term standardization model from these semantic similarities in a self-supervised learning manner;
(3) rapidly upgrading the term standardization model through active learning: computing semantic similarity scores between unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out, according to the active learning criteria, a group of samples about which the current term standardization model is most uncertain, determining their labels, and merging them into the training candidate set;
(4) training a precise ranking model and comprehensively evaluating the predictions of the term standardization model from text and semantic dimensions: taking the semantic similarity scores between the original clinical concept and the standard terms output by the self-supervised learning of step (2) as semantic features, and computing text features; training a regression-decision-tree-based precise ranking model on the semantic and text features to compute confidence scores for the medical term standardization results;
(5) predicting the final term standardization result: computing a confidence score for each standard-term positive sample in the prediction candidate set with the trained precise ranking model, and taking the standard term with the largest confidence score as the term standardization result.
Further, the step (1) includes:
(1.1) Training candidate set: the training candidate set is composed of original clinical concepts x and their corresponding standard terms y. If y has a direct superior term one level up (a parent term), all terms one level below that parent are taken as the set M; if y has no parent term but has a direct superior term two levels up, all terms one and two levels below that term are taken as the set M; otherwise, the whole standard glossary is taken as the set M. The text relevance score between x and each standard term m in M is computed, the scores are sorted to select a negative sample set, and the negative sample set together with y forms the training candidate set.
(1.2) Prediction candidate set: when the term standardization model makes predictions, for an unlabeled original clinical concept x the whole standard glossary is taken as the set M, and text relevance scores are used to select a positive sample set from M, which forms the prediction candidate set.
Further, in step (2), the Chinese medical language model is a bidirectional autoregressive language model, and specifically: the original clinical concept x and any standard term are concatenated character by character, a separator token [SEP] is inserted at the junction, and a start token [S] is added at the leftmost position; the concatenated result is input into the bidirectional autoregressive language model as a single sentence, and the output of the last layer at the position of the start token [S] is the semantic vector of x and that standard term.
Further, in step (3), an unlabeled original clinical concept x obtains its prediction candidate set through step (1). The semantic matching model computes a semantic similarity score between x and each standard term in the prediction candidate set, and the scores are normalized into a probability distribution p. The uncertainty of the term standardization model about x is then computed as a weighted combination of the following features, each multiplied by its own weight:
the information entropy of the term standardization model's prediction for x, H(x) = -Σ_i p_i log p_i;
the margin probability, i.e. the difference between the largest and the second-largest probabilities in p;
the confidence, i.e. the largest probability in p;
the frequency with which the original clinical text x occurs.
Further, in step (4), the gradient boosting model XGBoost is adopted as the precise ranking model, and specifically: a number of regression decision trees are trained, the learning target of each tree is the error of the preceding trees, and the sum of the outputs of all trees is the final confidence score. Suppose a gradient boosting model is built over u samples; the loss function L_t of the t-th decision tree is

L_t = Σ_{i=1}^{u} l(g_i, s_i^(t-1) + f_t(i)) + Ω(f_t),

where l is the squared loss function, g_i is the label of the i-th sample, f_t(i) is the prediction of the t-th decision tree for the i-th sample, s_i^(t-1) is the prediction of the first t-1 decision trees for the i-th sample, and Ω(f_t) = γ·J_t + (λ/2)·Σ_{k=1}^{J_t} w_k^2 is a regularization term representing the complexity of the decision tree, in which J_t is the number of leaf nodes of the t-th decision tree, w_k is the prediction value of the k-th leaf node, and γ and λ are weight parameters;
in the process of training the precise ranking model, the input training data is a data set consisting of the original clinical concepts and the standard-term positive samples in the prediction candidate sets of the term standardization model; suppose the trained precise ranking model contains T decision trees, then the confidence score of the medical term standardization result for a sample is computed as the sum Σ_{t=1}^{T} f_t of the predictions of all T decision trees for that sample.
and further, fusing the sample with the confidence score meeting the condition of the medical term standardization result output by the accurate sequencing module into a training candidate set, and updating the term standardization model and the accurate sequencing model parameters.
Further, the method also comprises a direct superior term retrieval function: obtaining the group of standard terms with the highest confidence scores predicted by the precise ranking model for the original clinical concept, and generating the paths tracing back to upper levels in the hierarchical structure of the standard glossary; and determining the direct superior term corresponding to the original clinical concept based on majority voting.
The invention has the following beneficial effects: an automatic medical term standardization model can be realized with little labeled data, and the model remains capable of fast updating and upgrading, greatly reducing the workload of manual intervention while ensuring the accuracy of the output results; new clinical concepts can be matched to their direct superior terms and given the correct position in the standard glossary, ensuring the completeness and consistency of the standardization results.
Drawings
FIG. 1 is a block diagram of an automatic standardization system for medical terms fusing self-supervision and active learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an implementation of a candidate set generation module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an implementation of the self-supervised learning module and the active learning module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an implementation of a direct superior term retrieval module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a training process of a decision tree-based precision ranking model according to an embodiment of the present invention;
fig. 6 is a schematic diagram of direct superior term retrieval according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention; however, as will be readily apparent to those of ordinary skill in the art, the present invention may be practiced in ways other than those specifically described without departing from its spirit, and the present invention is therefore not limited to the specific embodiments disclosed below.
In the present invention, self-supervised learning means: mining supervision information from large-scale unlabeled data by means of auxiliary tasks, and training the network with this constructed supervision so as to learn representations valuable for downstream tasks. There are three main forms of self-supervised learning: context-based learning, time-series-based learning, and contrastive learning, where contrastive learning builds representations by learning to encode the similarity or dissimilarity of two things.
Active learning means: its main goal is to reduce the cost of human data annotation. Samples that the model finds hard to classify or classifies ambiguously are obtained through machine learning; such data are generally considered to lie near the boundaries between classes and can therefore give the model greater help in accurately learning the features of the different classes. By having humans re-confirm and audit these samples, the model can be improved more markedly for the same amount of labeled data.
Semi-supervised learning means: the learner automatically exploits unlabeled samples, without external interaction, to improve learning performance. Self-training is a particular implementation of semi-supervised learning: assuming that similar samples have similar outputs, an initial model is first trained with labeled samples, the model then predicts and classifies the unlabeled samples, samples whose predictions have a high confidence are screened out according to some criterion, and the predicted soft or hard labels are used as new labeled data to expand the training set.
Medical term standardization means: the process of applying standardization principles and methods to unify medical terms within a certain scope by establishing medical term standards, so as to obtain the best order and social benefit. Establishing unified medical term standards and term sets helps resolve problems such as duplicated terms and inconsistent connotation, semantic expression and understanding, and is of great significance for effectively promoting the dissemination, sharing and use of medical information at a wider range and a deeper level.
The embodiment of the invention provides an automatic medical term standardization system integrating self-supervision and active learning, which comprises the following modules as shown in figure 1:
a candidate set generation module: sampling negative samples based on a text relevance model and the hierarchical structure of the standard glossary to generate a training candidate set, and sampling possible positive samples based on the text relevance model to generate a prediction candidate set;
a self-supervised learning module for training the term standardized model;
an active learning module, implemented based on principles such as maximum entropy and minimum confidence;
a precise ranking module for comprehensively evaluating the predictions of the term standardization model from text and semantic dimensions.
Preferably, the system further comprises a semi-supervised learning module that merges into the training candidate set those samples whose confidence scores for the medical term standardization results, output by the precise ranking module, satisfy the condition.
Preferably, the system further comprises a direct superior term retrieval module.
Specifically, the candidate set generation module consists of two parts: during training of the term standardization model, sampling based on the text relevance BM25 model and the hierarchical structure of the standard glossary obtains standard terms that are as close as possible to, but not identical with, the original clinical concept as negative-sample standard terms; during prediction of the term standardization model, possible positive-sample standard terms are generated based on the text relevance BM25 model. The detailed flow is shown in FIG. 2.
Specifically, the self-supervision learning module mainly comprises the following three steps:
1. training a Chinese medical language model, preferably a bidirectional autoregressive language model (BERT), by a self-adaptive method, and further acquiring semantic vectors of original clinical concepts and standard terms;
2. computing, through a semantic matching model, the semantic similarity between the labeled original clinical concept and its label, and between the concept and each negative sample in the training candidate set;
3. computing the loss function of the term standardization model from the semantic similarities using a self-supervised learning approach (preferably self-supervised contrastive learning), as shown in the left part of FIG. 3.
Specifically, the active learning module mainly includes the following two steps:
1. calculating a semantic similarity score by using the unlabeled original clinical concepts and the standard terms of the prediction candidate set;
2. a group of samples about which the current term standardization model is most uncertain is screened out according to the active learning criteria, their labels are determined, and they are merged into the training candidate set, as shown in the right part of FIG. 3.
Specifically, the precise sorting module mainly comprises the following two steps:
1. first, the semantic similarity scores between the original clinical concept and the standard terms output by the self-supervised learning module are obtained as semantic features, and text features are computed, including the literal similarity between the original clinical concept and the standard term, the word co-occurrence frequency, the difference in the number of contained words, and so on;
2. a regression-decision-tree-based precise ranking model is then trained on these features to compute the confidence score of the medical term standardization result.
Specifically, the main function of the semi-supervised learning module is to screen out, based on the confidence scores output by the precise ranking module, a group of samples about which the current term standardization model is most certain, and to expand the training candidate set with them.
Specifically, the direct superior term retrieval module mainly includes the following two steps:
1. firstly, acquiring a group of standard terms with the highest confidence scores predicted by a precise ordering model for an original clinical concept, and generating a path traced back to the upper level in the hierarchical structure of a standard term table;
2. and then determining the direct superior terms corresponding to the original clinical concept based on the principle of majority voting, as shown in fig. 4.
The embodiment of the invention provides a medical term automatic standardization method integrating self-supervision and active learning, which comprises the following specific implementation steps:
Firstly, generating negative samples and positive samples, and constructing the training candidate set and the prediction candidate set respectively, specifically: sampling negative samples based on a text relevance model and the hierarchical structure of the standard glossary to generate the training candidate set, and sampling possible positive samples based on the text relevance model to generate the prediction candidate set; more specifically, with reference to FIG. 2, this step comprises the following sub-steps:
1) The training candidate set of the term standardization model consists of a large number of original clinical concepts x and their corresponding standard terms y. When training the term standardization model, a set of negative samples is first drawn from the standard terms whose meaning differs from x. For the term standardization model to learn as much as possible from the negative samples, the sampling procedure must obtain standard terms whose meaning is as close as possible to, but not exactly the same as, that of the original clinical concept. Some standard terminologies are organized hierarchically; for example, the disease terminology ICD-10 encodes "oral, esophageal and gastric carcinoma in situ" as "D00", whose next-level terms include "carcinoma in situ of the lip, oral cavity and pharynx (D00.0)" and "esophageal carcinoma in situ (D00.2)", and whose next-next-level terms include "carcinoma in situ of the tonsil (D00.001)" and "carcinoma in situ of the lip (D00.002)". The sampling is performed in the following order:
① if y has a parent (direct superior) term one level up, all terms one level below that parent are taken as the set M;
② if y has no parent term one level up but has a direct superior term two levels up, all terms one and two levels below that term are taken as the set M;
③ otherwise, the whole standard glossary is taken as the set M.
Then the text relevance score between x and each standard term m in M is computed with the text relevance BM25 model:

score(x, m) = Σ_i IDF(c_i) · tf(c_i, m)·(k1 + 1) / (tf(c_i, m) + k1·(1 - b + b·len/avglen)),

where IDF(c_i) is the IDF value of the i-th character c_i of x and serves as the word weight, tf(c_i, m) is the frequency with which c_i occurs in m, len is the length of m, avglen is the average length of all standard terms in M, and k1 and b are empirically specified parameters set to fixed values in this embodiment. The negative sample set is selected by sorting the text relevance scores, and together with y it forms the training candidate set.
2) When the term standardization model makes predictions, for an unlabeled original clinical concept x a group of the most likely standard terms is first screened out of the whole standard glossary as the prediction candidate set; the term standardization model then only needs to compute semantic similarity scores between the original clinical concept and the standard terms in the prediction candidate set rather than over the whole standard glossary, which avoids a large amount of useless computation by the term standardization model and improves prediction efficiency. In this case the whole standard glossary is taken as the set M, and the text relevance scores are used to select a positive sample set from M, which forms the prediction candidate set.
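As a non-limiting illustration of the candidate set generation described above, the following Python sketch builds training and prediction candidate sets with a character-level BM25 scorer. The function names, the parent_of/children_of glossary structures and the default values of k1, b, n_neg and n_pos are assumptions for illustration only, and the two-levels-up fallback of sub-step ② is omitted for brevity.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Character-level BM25 relevance of `query` against every candidate term in `docs`.
    k1 and b are the empirically specified parameters mentioned above; the values here
    are common defaults, not necessarily those used in the patent."""
    n = len(docs)
    avglen = sum(len(d) for d in docs) / max(n, 1)
    df = Counter(c for d in docs for c in set(d))   # document frequency per character, for the IDF weight
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for c in set(query):
            idf = math.log((n - df[c] + 0.5) / (df[c] + 0.5) + 1.0)
            f = tf[c]
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avglen))
        scores.append(s)
    return scores

def training_candidates(x, y, glossary, parent_of, children_of, n_neg=10):
    """Negative sampling restricted to the siblings of the gold term y (sub-step ①);
    falls back to the whole glossary when y has no parent (sub-step ③)."""
    parent = parent_of.get(y)
    pool = [t for t in (children_of[parent] if parent is not None else glossary) if t != y]
    ranked = [t for _, t in sorted(zip(bm25_scores(x, pool), pool), reverse=True)]
    return [y] + ranked[:n_neg]                      # gold term plus hard negatives

def prediction_candidates(x, glossary, n_pos=20):
    """Positive candidate screening over the whole glossary (sub-step 2)."""
    ranked = [t for _, t in sorted(zip(bm25_scores(x, glossary), glossary), reverse=True)]
    return ranked[:n_pos]
```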
Secondly, training the term standardization model through self-supervised learning, specifically: training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms; computing, through a semantic matching model, the semantic similarity between the labeled original clinical concept and its label, and between the concept and each negative sample in the training candidate set; and computing the loss function of the term standardization model from these semantic similarities in a self-supervised learning manner. More specifically, this step comprises the following sub-steps:
1) The term standardization model consists of a bidirectional autoregressive language model and a semantic matching model. The bidirectional autoregressive language model performs autoregressive training of semantic units on both the forward and the backward context, and learns efficient semantic vector representations while modelling natural language. In the multi-layer bidirectional autoregressive language model, the input Z of the next layer is derived from a self-attention mechanism over the hidden state h of the previous layer:

Z = softmax(Q·K^T / sqrt(d_h)) · V · W,

where Q, K and V are the vectors obtained by matrix transformations of the previous-layer hidden state h, d_h is the dimension of h, Z is the input of the next layer, and W is a matrix obtained by training. The hidden state h' of the next layer is then obtained through a nonlinear transformation of the form

h' = σ(Z·W_1 + b_1)·W_2 + b_2,

where W_1 and W_2 are matrices obtained by training, b_1 and b_2 are vectors obtained by training, and σ is the nonlinear activation.
When term standardization is performed, the semantic vectors of the original clinical concepts and the standard terms can be derived from the bidirectional autoregressive language model. Specifically: the original clinical concept x and any standard term (either a positive or a negative sample) are concatenated character by character, a separator token [SEP] is added at the junction, and a start token [S] is added at the leftmost position. For example, if the original clinical operation concept "fallopian tube resection" corresponds to the positive-sample standard term "bilateral fallopian tube resection (66.51)" in the ICD-9-CM-3 glossary, the concatenation is "[S] fallopian tube resection [SEP] bilateral fallopian tube resection". The concatenated result is input into the bidirectional autoregressive language model as a single sentence, and the output of the last layer at the position of the start token [S] is the semantic vector of x and that standard term.
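The following sketch illustrates, under the assumption of a Hugging Face BERT checkpoint for Chinese, how such a pair can be encoded and the start-position vector read out; here the [CLS] token plays the role of the [S] token of the description, and the checkpoint name and the example strings are illustrative rather than the ones used in the patent.

```python
import torch
from transformers import BertTokenizer, BertModel

# Checkpoint name is an assumption; the patent only specifies "a Chinese medical language model".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

def pair_vector(concept: str, term: str) -> torch.Tensor:
    """Encode '[CLS] concept [SEP] term [SEP]' and return the last-layer vector at the
    start-token position, which the text above uses as the semantic vector of the pair.
    [CLS] stands in for the [S] token of the description."""
    enc = tokenizer(concept, term, return_tensors="pt")   # builds [CLS] x [SEP] term [SEP]
    with torch.no_grad():
        out = encoder(**enc)
    return out.last_hidden_state[:, 0, :]                 # vector at the [CLS]/[S] position

# Illustrative strings for the fallopian-tube example mentioned above.
vec = pair_vector("输卵管切除术", "双侧输卵管切除术")
```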
The semantic vector is then input into the semantic matching model to compute the semantic similarity. The multi-layer semantic matching model is computed as

h_l = σ(h_{l-1}·W_l + b_l),

where h_l is the hidden state (the output value) of the l-th layer of the semantic matching model and W_l and b_l are parameters obtained by training. The output dimension of the last layer of the semantic matching model is set to 2, and a nonlinear (softmax) transformation of the two outputs yields the semantic similarity score of x and the standard term together with their dissimilarity score.
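A minimal sketch of a semantic matching head consistent with the description above: a small feed-forward network whose two softmax outputs are read as the similarity and dissimilarity scores. The layer sizes and depth are assumptions.

```python
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    """Feed-forward semantic matching head over the pair vector from the encoder.
    The last layer has 2 outputs; after softmax the first is read as the similarity
    score and the second as the dissimilarity score."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 2),
        )

    def forward(self, pair_vec: torch.Tensor) -> torch.Tensor:
        sim, _dissim = torch.softmax(self.layers(pair_vec), dim=-1).unbind(-1)
        return sim     # semantic similarity score of the (concept, term) pair
```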
2) The semantic matching model is trained in a self-supervised learning manner, so that the model autonomously learns the common features of identical data from a large amount of data, which alleviates the lack of labeled training data. In the concrete implementation a contrastive learning scheme is adopted: the model focuses on learning the common features of synonymous terms while learning to distinguish terms with different meanings. Let the label of the original clinical concept x be the standard term y, and let the training candidate set obtained through step one consist of y and the sampled negative samples. A global loss function L is constructed by contrasting the semantic similarity of the positive pair (x, y) against the semantic similarity of each negative pair, where E denotes the expectation function and γ is an empirically specified parameter given a fixed value in this embodiment. The term standardization model parameters are updated by back-propagating the gradient of this loss function.
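Since the exact loss formula appears only as an image in the original, the following sketch shows one common contrastive form consistent with the description, pushing the positive-pair similarity above each negative-pair similarity by a margin γ; the hinge form and the margin value are assumptions.

```python
import torch

def contrastive_loss(sim_pos: torch.Tensor, sim_neg: torch.Tensor, gamma: float = 0.3) -> torch.Tensor:
    """Hinge-style contrastive loss: push the similarity of the positive pair (x, y) above
    the similarity of every negative pair (x, m_i) by at least the margin gamma.

    sim_pos: shape (batch,)         similarity of x with its gold standard term y
    sim_neg: shape (batch, n_neg)   similarities of x with the sampled negatives
    gamma:   empirically specified margin; the value here is illustrative."""
    return torch.clamp(gamma - sim_pos.unsqueeze(1) + sim_neg, min=0.0).mean()
```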
Thirdly, rapidly upgrading the term standardization model through active learning, specifically: computing semantic similarity scores between the unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out, according to the active learning criteria, a group of samples about which the current term standardization model is most uncertain, determining their labels, and merging them into the training candidate set. The more specific principle and process are as follows:
Obtaining a better model with as little labeled data as possible is a problem faced by many machine learning algorithms. The idea of active learning is that the samples the current model classifies most ambiguously carry the most information; by screening such samples and generating labels for them on this principle, model performance can be improved the most for the same amount of data. An unlabeled original clinical concept x obtains its prediction candidate set through step one. The semantic matching model computes a semantic similarity score between x and each standard term in the prediction candidate set, and the scores are normalized into a probability distribution p. The uncertainty of the term standardization model about x is then computed as a weighted combination of the following features, each multiplied by its own weight (the weights are set to fixed empirical values in this embodiment):
the information entropy of the term standardization model's prediction for x, H(x) = -Σ_i p_i log p_i;
the margin probability, i.e. the difference between the largest and the second-largest probabilities in p;
the confidence, i.e. the largest probability in p;
the frequency with which the original clinical text x occurs; generating correct labels for high-frequency original clinical concepts helps the term standardization model better learn the distribution of the whole set of original clinical concepts.
The active learning process screens out the original clinical concepts with the highest uncertainty values; their labels are determined manually and they are merged into the training candidate set to retrain the term standardization model. Repeating this procedure yields the best-performing term standardization model with the least labeled data.
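An illustrative way to combine the four signals above into a single acquisition score is sketched below. The signs and the weights (high entropy, small margin, low confidence and high frequency all raise the score) are assumptions, since the embodiment's weight values appear only as images in the original.

```python
import numpy as np

def uncertainty(probs: np.ndarray, freq: float, w=(1.0, 1.0, 1.0, 1.0)) -> float:
    """Weighted acquisition score for active learning over one original clinical concept.
    probs: unnormalized similarity scores against the prediction candidate set.
    freq:  occurrence frequency of the original clinical text."""
    p = probs / probs.sum()                        # normalized probability distribution
    entropy = -(p * np.log(p + 1e-12)).sum()       # prediction entropy
    top2 = np.sort(p)[-2:]
    margin = top2[1] - top2[0]                     # largest minus second-largest probability
    confidence = top2[1]                           # largest probability
    w1, w2, w3, w4 = w
    return w1 * entropy + w2 * (1 - margin) + w3 * (1 - confidence) + w4 * freq

# Concepts are ranked by this score; the top ones are sent for manual labeling and
# then merged back into the training candidate set for retraining.
```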
Fourthly, training the precise ranking model and comprehensively evaluating the predictions of the term standardization model from text and semantic dimensions, specifically: taking the semantic similarity scores between the original clinical concept and the standard terms output by the self-supervised learning of step two as semantic features, and computing text features; training a regression-decision-tree-based precise ranking model on these features to compute the confidence score of the medical term standardization result. The more specific principle and process are as follows:
In practical applications the reliability of the term standardization result must be considered, especially in the cold-start period of the whole system when training data are relatively scarce. The traditional approach is to manually re-verify the predictions of the term standardization model, which usually consumes considerable manpower. The precise ranking model designed here, which jointly considers semantic, textual and other features, helps the correctly predicted standard terms obtain a higher rank and thus addresses this problem effectively; as the term standardization model is iteratively upgraded through self-supervision and active learning, the weight of the semantic features in the precise ranking model can be gradually increased. Preferably, the gradient boosting model XGBoost is used as the precise ranking model. Its basic idea is to train a number of regression decision trees, the learning target of each tree being the error of the preceding trees, with the sum of the outputs of all trees giving the final confidence score, as shown in FIG. 5. Suppose a gradient boosting model is built over u samples; the loss function L_t of the t-th decision tree is

L_t = Σ_{i=1}^{u} l(g_i, s_i^(t-1) + f_t(i)) + Ω(f_t),

where l is the squared loss function, g_i is the label of the i-th sample, f_t(i) is the prediction of the t-th decision tree for the i-th sample, s_i^(t-1) is the prediction of the first t-1 decision trees for the i-th sample, and Ω(f_t) = γ·J_t + (λ/2)·Σ_{k=1}^{J_t} w_k^2 is a regularization term representing the complexity of the decision tree, in which J_t is the number of leaf nodes of the t-th decision tree, w_k is the prediction value of the k-th leaf node, and γ and λ are weight parameters given fixed values in this embodiment.
In the process of training the precise ranking model, the input training data is a data set consisting of the original clinical concepts and the standard-term positive samples in the prediction candidate sets of the term standardization model. The label of each sample is either 0 (wrong standard term) or 1 (correct standard term), and the features used in training are shown in Table 1. Suppose the trained precise ranking model contains T decision trees; then the confidence score of the medical term standardization result for a sample is computed as the sum Σ_{t=1}^{T} f_t of the predictions of all T decision trees for that sample.
Table 1 Features adopted by the precise ranking model
Semantic feature: the semantic similarity score between the original clinical concept and the standard term output by the term standardization model.
Text features: the literal similarity between the original clinical concept and the standard term, the word co-occurrence frequency, and the difference in the number of contained words.
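A sketch of the precise ranking model using the xgboost package is given below. The feature layout follows Table 1, while the toy data and the gamma/reg_lambda values are illustrative assumptions rather than the embodiment's settings.

```python
import numpy as np
import xgboost as xgb

# One row per (original concept, candidate standard term) pair, columns following Table 1:
# [semantic similarity, literal similarity, word co-occurrence, word-count difference].
X_train = np.array([[0.92, 0.66, 3, 1],
                    [0.31, 0.10, 0, 4]], dtype=float)   # toy rows, for illustration only
y_train = np.array([1, 0])                              # 1 = correct standard term, 0 = wrong

# gamma / reg_lambda correspond to the complexity weights in the objective above;
# the concrete values here are assumptions, not the patent's settings.
ranker = xgb.XGBRegressor(n_estimators=100, max_depth=4,
                          objective="reg:squarederror",
                          gamma=1.0, reg_lambda=1.0)
ranker.fit(X_train, y_train)

# At prediction time, score every candidate in the prediction candidate set and
# keep the standard term with the largest confidence score.
scores = ranker.predict(X_train)
best = int(np.argmax(scores))
```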
Fifthly, predicting the final term standardization result. The trained precise ranking model computes a confidence score for each standard-term positive sample in the prediction candidate set, and the standard term with the largest confidence score is taken; it can be regarded as the standard term with the same meaning as the original clinical concept x.
Sixthly, screening the samples whose prediction results have high confidence scores for semi-supervised self-training. Specifically: a strict threshold is set for the confidence scores predicted by the precise ranking model. If the confidence score output for the standard term that the precise ranking model predicts for an original clinical concept x exceeds this threshold, then x together with that predicted standard term is added to the original training candidate set, and the parameters of the term standardization model and the precise ranking model are updated.
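A sketch of this semi-supervised self-training step, under the assumption of a predict_best(x) helper that returns the top-ranked standard term and its confidence score; the threshold value is illustrative.

```python
def self_training_update(unlabeled_concepts, predict_best, threshold=0.95):
    """Keep only predictions whose confidence score exceeds a strict threshold and
    return them as new pseudo-labeled training pairs."""
    new_pairs = []
    for x in unlabeled_concepts:
        term, score = predict_best(x)          # top-ranked standard term and its confidence
        if score >= threshold:
            new_pairs.append((x, term))        # (original concept, pseudo-labeled standard term)
    return new_pairs                           # merged into the training candidate set before retraining
```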
Seventhly, retrieving the direct superior term, specifically: obtaining the group of standard terms with the highest confidence scores predicted by the precise ranking model for the original clinical concept, and generating the paths tracing back to upper levels in the hierarchical structure of the standard glossary; then determining the direct superior term corresponding to the original clinical concept based on majority voting. The more specific implementation process is as follows:
An original clinical concept x whose output confidence scores are all low may have no standard term of the same meaning; the standard glossary is then searched upwards to locate the direct superior term of x. The precise ranking model computes the confidence scores of x against the standard terms in its prediction candidate set, the scores are sorted, and the k standard terms with the highest confidence scores are selected; the codes of these terms are traced back to upper levels in the standard glossary. For example, for the original disease concept "right synovitis", the k = 5 standard terms with the highest confidence scores are shown in Table 2. Starting from the codes of the terms in the table, the backtracking paths to upper levels and every intermediate node passed through are marked in the standard glossary, and each standard term node on a backtracking path is annotated with the number of times it has been traversed; this is shown in FIG. 6, where the number in each node is its traversal count. Then, searching along the backtracking paths from lower levels to upper levels, the first standard term node encountered whose traversal count satisfies the majority-vote condition can be regarded as the direct superior standard term of x. For example, the first node satisfying the condition encountered during the lower-to-upper search in FIG. 6 is "synovitis and tenosynovitis (M65.9)", which indicates that the original clinical concept "right synovitis" should be fused into the standard glossary as a direct subordinate term of that term.
Table 2 The standard terms with the highest confidence scores for the original concept "right synovitis"
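A sketch of the majority-vote retrieval of the direct superior term, assuming the glossary hierarchy is available as a parent_of mapping from each code to its direct superior code; the vote threshold (a strict majority of k by default) and all names are illustrative.

```python
from collections import Counter

def glossary_depth(code, parent_of):
    """Depth of a code in the glossary hierarchy (root terms have depth 0)."""
    d = 0
    while parent_of.get(code) is not None:
        code = parent_of[code]
        d += 1
    return d

def direct_superior_term(top_codes, parent_of, vote_threshold=None):
    """Majority-vote retrieval of the direct superior term, as illustrated in FIG. 6.
    top_codes:  codes of the k standard terms with the highest confidence scores (cf. Table 2).
    parent_of:  mapping from a term code to its direct superior code (None for root terms)."""
    threshold = vote_threshold if vote_threshold is not None else len(top_codes) // 2 + 1
    visits = Counter()
    for code in top_codes:
        node = parent_of.get(code)
        while node is not None:                 # trace the path back to the top level
            visits[node] += 1
            node = parent_of.get(node)
    # scan from the deepest (lowest) levels toward the upper levels
    for node in sorted(visits, key=lambda n: -glossary_depth(n, parent_of)):
        if visits[node] >= threshold:
            return node                         # first qualifying node from lower to upper levels
    return None
```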
The invention designs a self-supervised learning method for medical term standardization and realizes a high-accuracy medical term standardization model with little labeled data; it implements an active learning function on top of the term standardization process so that the model can be upgraded quickly and automatically; it designs a candidate sample generation function that exploits the characteristics of the standard glossary so that the candidate samples carry enough information; it designs a precise ranking function for the predictions of the medical term standardization model that integrates semantic and text features, further reducing manual intervention; and it designs a direct superior term retrieval function for original clinical concepts on the basis of the precise ranking results, ensuring the completeness and consistency of the medical term standardization results.
The foregoing is only a preferred embodiment of the present invention. Although the present invention has been disclosed in terms of preferred embodiments, they are not intended to limit it. Those skilled in the art can make numerous possible variations and modifications to the technical solution of the present invention, or modify it into equivalent embodiments, using the methods and technical content disclosed above, without departing from the scope of the technical solution of the present invention. Therefore, any simple modification, equivalent change or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the protection scope of the technical solution of the present invention.

Claims (10)

1. An automatic medical term standardization system combining self-supervision and active learning, comprising:
(1) a candidate set generation module: sampling negative samples based on a text relevance model and the hierarchical structure of the standard glossary to generate a training candidate set, and sampling possible positive samples based on the text relevance model to generate a prediction candidate set;
(2) a self-supervised learning module for training the term standardization model, comprising:
training a Chinese medical language model by a self-adaptive method to obtain semantic vectors of original clinical concepts and standard terms;
computing, through a semantic matching model, the semantic similarity between a labeled original clinical concept and its label, and between the concept and each negative sample in the training candidate set;
computing the loss function of the term standardization model from these semantic similarities in a self-supervised learning manner;
(3) an active learning module: computing semantic similarity scores between unlabeled original clinical concepts and the standard terms of the prediction candidate set; screening out, according to the active learning criteria, a group of samples about which the current term standardization model is most uncertain, determining their labels, and merging them into the training candidate set;
(4) a precise ranking module: taking the semantic similarity scores between the original clinical concept and the standard terms output by the self-supervised learning module as semantic features, computing text features, and training a regression-decision-tree-based precise ranking model on the semantic and text features to compute confidence scores for the medical term standardization results; the trained precise ranking model computes a confidence score for each standard-term positive sample in the prediction candidate set to obtain the standard term with the largest confidence score.
2. The system of claim 1, further comprising a semi-supervised learning module that merges into the training candidate set those samples whose confidence scores for the medical term standardization results, output by the precise ranking module, satisfy the condition.
3. The system of claim 1, further comprising a direct superior term retrieval module, which: obtains the group of standard terms with the highest confidence scores predicted by the precise ranking model for the original clinical concept, and generates the paths tracing back to upper levels in the hierarchical structure of the standard glossary; and determines the direct superior term corresponding to the original clinical concept based on majority voting.
4. A method for automatic medical term standardization fusing self-supervision and active learning, the method comprising:
(1) generating negative and positive samples and constructing a training candidate set and a prediction candidate set, respectively: sampling negative samples based on a text relevance model and the hierarchical structure of a standard terminology table to generate the training candidate set, and sampling probable positive samples based on the text relevance model to generate the prediction candidate set;
(2) training a term standardization model by self-supervised learning: training a Chinese medical language model by an adaptive method to obtain semantic vectors of original clinical concepts and standard terms; computing, through a semantic matching model, the semantic similarity between a labeled original clinical concept and its label, and between the concept and the negative samples in the training candidate set; computing the loss function of the standardization model from these semantic similarities in a self-supervised learning manner;
(3) rapidly upgrading the term standardization model by active learning: computing semantic similarity scores between unlabeled original clinical concepts and the standard terms in the prediction candidate set; selecting, according to an active learning criterion, the group of samples about which the current term standardization model is most uncertain, and fusing them into the training candidate set after their labels have been determined;
(4) training a precise ranking model to comprehensively evaluate the prediction results of the term standardization model from the text and semantic dimensions: taking the semantic similarity scores between original clinical concepts and standard terms output by the self-supervised learning of step (2) as semantic features, and computing text features; training a precise ranking model based on regression decision trees over the semantic and text features, used to compute confidence scores of medical term standardization results;
(5) predicting the final term standardization result: computing, with the trained precise ranking model, confidence scores for the standard term positive samples in the prediction candidate set, and taking the standard term with the highest confidence score as the term standardization result.
5. The method for automatic medical term standardization fusing self-supervision and active learning according to claim 4, wherein step (1) comprises:
(1.1) Training candidate set: the training candidate set is built from original clinical concepts $x$ and their corresponding standard terms $y$. If $y$ has a first-level direct superior term $y^{(1)}$, all next-level terms under $y^{(1)}$ are taken as the set $M$; if $y$ has no first-level direct superior term but has a second-level superior term $y^{(2)}$, all terms one and two levels below $y^{(2)}$ are taken as the set $M$; otherwise, the whole standard terminology table is taken as the set $M$. The text relevance score between $x$ and each standard term $m$ in $M$ is computed, the terms are ranked by this score, and a negative sample set $N(x)=\{m_1, m_2, \dots, m_k\}$ is selected, yielding the training candidate set $C_{\mathrm{train}}(x)=\{y\}\cup N(x)$.
(1.2) Prediction candidate set: at prediction time of the term standardization model, for an unlabeled original clinical concept $x$, the whole standard terminology table is taken as the set $M$; the text relevance scores are used to select from $M$ a set of probable positive samples $P(x)=\{m_1, m_2, \dots, m_n\}$, yielding the prediction candidate set $C_{\mathrm{pred}}(x)=P(x)$.
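Purely as an illustration of step (1.1), and not as part of the claims, the Python sketch below builds a training candidate set under simplifying assumptions: a character-level TF-IDF cosine similarity stands in for the text relevance model, the term hierarchy is a flat child-to-parent dictionary, and only the first-level superior term case is handled; the function names and the toy data are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_relevance_scores(concept, terms):
    """Character n-gram TF-IDF cosine similarity as a stand-in text relevance model."""
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
    mat = vec.fit_transform([concept] + list(terms))
    return cosine_similarity(mat[0], mat[1:]).ravel()

def training_candidates(concept, gold_term, parent_of, vocabulary, k=10):
    """Build the candidate pool M from the hierarchy, then keep the top-k negatives."""
    parent = parent_of.get(gold_term)            # first-level superior term, if any
    if parent is not None:
        pool = [t for t, p in parent_of.items() if p == parent and t != gold_term]
    else:
        pool = [t for t in vocabulary if t != gold_term]
    if not pool:                                 # fall back to the whole term table
        pool = [t for t in vocabulary if t != gold_term]
    scores = text_relevance_scores(concept, pool)
    ranked = [t for _, t in sorted(zip(scores, pool), reverse=True)]
    return [gold_term] + ranked[:k]              # training candidate set {y} ∪ N(x)

# Toy usage with hypothetical terms
vocab = ["急性上呼吸道感染", "慢性上呼吸道感染", "上呼吸道感染", "肺炎", "支气管炎"]
hier = {"急性上呼吸道感染": "上呼吸道感染", "慢性上呼吸道感染": "上呼吸道感染"}
print(training_candidates("急性上感", "急性上呼吸道感染", hier, vocab, k=2))
```

The prediction candidate set of step (1.2) can be obtained with the same relevance scores by keeping the top-n terms of the whole terminology table instead of the hierarchy-restricted pool.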
6. The method for automatic medical term standardization fusing self-supervision and active learning according to claim 4, wherein the Chinese medical language model of step (2) adopts a bidirectional autoregressive language model, specifically: the original clinical concept $x$ and any standard term $m$ are concatenated character by character, a separator token [SEP] is inserted at the junction and a start token [S] is added at the leftmost position; the concatenation is fed into the bidirectional autoregressive language model as a single sentence, and the output of its last layer at the position of the start token [S] is taken as the semantic vector of $x$ and $m$.
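As an illustrative stand-in for the paired encoding of claim 6, the sketch below uses the Hugging Face transformers library with a generic Chinese BERT checkpoint; the checkpoint name is an assumption, the [CLS] token plays the role of the start token [S], and the adaptive pre-training of the claimed bidirectional autoregressive Chinese medical language model is not reproduced here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Stand-in encoder; the embodiment itself uses an adaptively pre-trained
# Chinese medical language model rather than this generic checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def pair_vector(concept: str, term: str) -> torch.Tensor:
    """Concatenate concept and term with a separator; return the last-layer
    hidden state at the start-token position as the joint semantic vector."""
    inputs = tokenizer(concept, term, return_tensors="pt")  # [CLS] concept [SEP] term [SEP]
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[0, 0]                  # vector at the start position

v = pair_vector("急性上感", "急性上呼吸道感染")
print(v.shape)  # e.g. torch.Size([768])
```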
7. The method for automatic medical term standardization fusing self-supervision and active learning according to claim 5, wherein in step (3), an unlabeled original clinical concept $x$ obtains through step (1) the prediction candidate set

$C_{\mathrm{pred}}(x)=\{m_1, m_2, \dots, m_n\}$

The semantic matching model computes the semantic similarity scores

$s_i=\mathrm{sim}(x, m_i),\quad i=1,\dots,n$

which are normalized into a probability distribution:

$p_i=\dfrac{\exp(s_i)}{\sum_{j=1}^{n}\exp(s_j)}$

The uncertainty $U(x)$ of the term standardization model about $x$ is computed as a weighted combination of the following features:

$U(x)=\lambda_1 H(x)+\lambda_2 M(x)+\lambda_3 C(x)+\lambda_4\,\mathrm{freq}(x)$

wherein $H(x)$ is the information entropy of the term standardization model's prediction for $x$:

$H(x)=-\sum_{i=1}^{n} p_i \log p_i$

$M(x)$ is the margin probability:

$M(x)=p_{(1)}-p_{(2)}$

where $p_{(1)}$ and $p_{(2)}$ are respectively the maximum and the second maximum among all $p_i$; $C(x)$ is the confidence:

$C(x)=\max_{i} p_i$

$\mathrm{freq}(x)$ is the frequency of occurrence of $x$ in the raw clinical text data; and $\lambda_1,\dots,\lambda_4$ are the weights of the respective features.
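An illustrative numeric sketch of the uncertainty features of claim 7 (softmax normalization, entropy, margin, confidence, frequency) follows; the example weights are arbitrary assumptions, and in practice the weights of the margin and confidence terms would be chosen so that larger values reduce the uncertainty.

```python
import numpy as np

def uncertainty(scores, freq, weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted combination of the active-learning features of claim 7.
    scores:  semantic similarity scores of one concept against its candidates.
    freq:    occurrence frequency of the concept in the raw clinical text.
    weights: lambda_1..lambda_4 (illustrative values only; signs/scaling of the
             margin and confidence terms are a design choice of the embodiment)."""
    p = np.exp(scores - np.max(scores))
    p = p / p.sum()                               # softmax normalization p_i
    entropy = -np.sum(p * np.log(p + 1e-12))      # H(x)
    top2 = np.sort(p)[-2:]
    margin = top2[1] - top2[0]                    # M(x): max minus second max
    confidence = p.max()                          # C(x)
    l1, l2, l3, l4 = weights
    return l1 * entropy + l2 * margin + l3 * confidence + l4 * freq

print(uncertainty(np.array([2.1, 1.9, 0.3]), freq=5))
```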
8. The method for automatic medical term standardization fusing self-supervision and active learning according to claim 5, wherein in step (4) the gradient boosting model XGBoost is adopted as the precise ranking model, specifically: a plurality of regression decision trees are trained, the learning target of each tree being the error left by the preceding trees, and the accumulated outputs of all trees giving the final confidence score. Suppose the gradient boosting model is constructed over $u$ samples; the loss function $L^{(t)}$ of the $t$-th decision tree is:

$L^{(t)}=\sum_{i=1}^{u} l\big(y_i,\; \hat{y}_i^{(t-1)}+f_t(x_i)\big)+\Omega(f_t)$

wherein $l(\cdot,\cdot)$ is the square loss function, $y_i$ is the label of sample $x_i$, $f_t(x_i)$ is the prediction of the $t$-th decision tree for $x_i$, $\hat{y}_i^{(t-1)}$ is the prediction of the first $t-1$ decision trees for $x_i$, and

$\Omega(f_t)=\gamma K_t+\tfrac{1}{2}\lambda\sum_{k=1}^{K_t} w_k^{2}$

is the regularization term representing the complexity of the decision tree, wherein $K_t$ is the number of leaf nodes of the $t$-th decision tree, $w_k$ is the predicted value of the $k$-th leaf node, and $\gamma$ and $\lambda$ are weight parameters.

When training the precise ranking model, the input training data is the data set consisting of the original clinical concepts and the standard term positive samples in their prediction candidate sets produced by the term standardization model:

$D=\{(x_i, m_i, y_i)\}_{i=1}^{u}$

Let the trained precise ranking model comprise $T$ decision trees; then for a sample $(x, m)$ the confidence score of the medical term standardization result is computed as:

$\mathrm{score}(x, m)=\sum_{t=1}^{T} f_t(x, m)$
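A minimal sketch of training the precise ranking model of claim 8 with the xgboost library follows; the toy feature vectors (semantic similarity plus simple text features), the labels, and all hyperparameter values are assumptions, and the squared-error objective corresponds to the square loss named in the claim.

```python
import numpy as np
import xgboost as xgb

# Toy feature matrix: each row holds (semantic similarity, character overlap,
# length ratio) for one (original concept, candidate standard term) pair;
# labels are 1 for the correct standard term and 0 otherwise.
X = np.array([
    [0.92, 0.80, 0.95],
    [0.35, 0.40, 0.70],
    [0.88, 0.75, 0.90],
    [0.20, 0.10, 0.50],
])
y = np.array([1.0, 0.0, 1.0, 0.0])

ranker = xgb.XGBRegressor(
    n_estimators=50,               # T regression trees
    max_depth=3,
    reg_lambda=1.0,                # leaf-weight L2 penalty (lambda in the claim)
    gamma=0.0,                     # per-leaf complexity penalty (gamma in the claim)
    objective="reg:squarederror",  # square loss
)
ranker.fit(X, y)

# Confidence score of a new (concept, candidate term) pair: accumulated tree outputs
print(ranker.predict(np.array([[0.90, 0.78, 0.93]])))
```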
9. The method for automatic medical term standardization fusing self-supervision and active learning according to any one of claims 4-8, wherein samples whose confidence scores, as output by the precise ranking model for the medical term standardization results, satisfy a given condition are fused into the training candidate set, and the parameters of the term standardization model and of the precise ranking model are updated.
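A minimal sketch of the semi-supervised fusion of claim 9, assuming a simple confidence threshold as the condition; the data structures and the threshold value are assumptions.

```python
def fuse_confident_predictions(predictions, training_set, threshold=0.9):
    """predictions: list of (concept, predicted_standard_term, confidence_score).
    Pairs whose confidence exceeds the threshold are appended to the training
    candidate set as pseudo-labeled samples before retraining."""
    for concept, term, score in predictions:
        if score >= threshold:
            training_set.append((concept, term))
    return training_set

train = [("急性上感", "急性上呼吸道感染")]
preds = [("上感", "上呼吸道感染", 0.95), ("咳嗽伴发热", "肺炎", 0.42)]
print(fuse_confident_predictions(preds, train))
```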
10. The method for automatic medical term standardization fusing self-supervision and active learning according to any one of claims 4-8, further comprising a direct superior term retrieval step: obtaining the group of standard terms with the highest confidence scores predicted by the precise ranking model for an original clinical concept; generating, for each of them, a path traced back to its superior levels in the hierarchical structure of the standard terminology table; and determining the direct superior term corresponding to the original clinical concept by majority voting.
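A minimal sketch of the direct superior term retrieval of claim 10; for brevity it votes only over the immediate parents of the top-ranked terms rather than over full ancestor paths, and the dictionary-based hierarchy is an assumed representation.

```python
from collections import Counter

def direct_superior(top_terms, parent_of):
    """top_terms: standard terms with the highest confidence scores for one concept.
    parent_of:  child -> parent mapping of the standard terminology table.
    Returns the superior term chosen by majority vote, or None if no parent exists."""
    votes = Counter(parent_of[t] for t in top_terms if parent_of.get(t) is not None)
    return votes.most_common(1)[0][0] if votes else None

hierarchy = {"急性上呼吸道感染": "上呼吸道感染",
             "慢性上呼吸道感染": "上呼吸道感染",
             "上呼吸道感染": "呼吸道感染"}
print(direct_superior(["急性上呼吸道感染", "慢性上呼吸道感染", "肺炎"], hierarchy))
```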
CN202110994475.7A 2021-08-27 2021-08-27 Automatic medical term standardization system and method integrating self-supervision and active learning Active CN113436698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110994475.7A CN113436698B (en) 2021-08-27 2021-08-27 Automatic medical term standardization system and method integrating self-supervision and active learning

Publications (2)

Publication Number Publication Date
CN113436698A true CN113436698A (en) 2021-09-24
CN113436698B (en) 2021-12-07

Family

ID=77798234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110994475.7A Active CN113436698B (en) 2021-08-27 2021-08-27 Automatic medical term standardization system and method integrating self-supervision and active learning

Country Status (1)

Country Link
CN (1) CN113436698B (en)

Citations (4)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150309987A1 (en) * 2014-04-29 2015-10-29 Google Inc. Classification of Offensive Words
CN108520038A (en) * 2018-03-31 2018-09-11 大连理工大学 A kind of Biomedical literature search method based on Ranking Algorithm
CN111881334A (en) * 2020-07-15 2020-11-03 浙江大胜达包装股份有限公司 Keyword-to-enterprise retrieval method based on semi-supervised learning
CN112364174A (en) * 2020-10-21 2021-02-12 山东大学 Patient medical record similarity evaluation method and system based on knowledge graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘龙航: "Research on a Multi-Resource-Based Method for Constructing a Chinese Medical Knowledge Graph", China Master's Theses Full-text Database (Medicine and Health Sciences) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7432802B2 (en) 2021-10-19 2024-02-16 之江実験室 Medical terminology normalization system and method based on heterogeneous graph neural network
CN114691826B (en) * 2022-03-10 2022-12-09 南京云设智能科技有限公司 Medical data information retrieval method based on co-occurrence analysis and spectral clustering
CN114691826A (en) * 2022-03-10 2022-07-01 南京云设智能科技有限公司 Medical data information retrieval method based on co-occurrence analysis and spectral clustering
CN114330370B (en) * 2022-03-17 2022-05-20 天津思睿信息技术有限公司 Natural language processing system and method based on artificial intelligence
CN114330370A (en) * 2022-03-17 2022-04-12 天津思睿信息技术有限公司 Natural language processing system and method based on artificial intelligence
WO2023092961A1 (en) * 2022-04-27 2023-06-01 之江实验室 Semi-supervised method and apparatus for public opinion text analysis
CN115270780A (en) * 2022-07-20 2022-11-01 北京新纽科技有限公司 Method for recognizing terms
CN115270780B (en) * 2022-07-20 2023-04-07 北京新纽科技有限公司 Method for recognizing terms
CN115080751A (en) * 2022-08-16 2022-09-20 之江实验室 Medical standard term management system and method based on general model
CN115080751B (en) * 2022-08-16 2022-11-11 之江实验室 Medical standard term management system and method based on general model
CN115062602B (en) * 2022-08-17 2022-11-11 杭州火石数智科技有限公司 Sample construction method and device for contrast learning and computer equipment
CN115062602A (en) * 2022-08-17 2022-09-16 杭州火石数智科技有限公司 Sample construction method and device for contrast learning, computer equipment and storage medium
CN115688779A (en) * 2022-10-11 2023-02-03 杭州瑞成信息技术股份有限公司 Address recognition method based on self-supervision deep learning
CN115688779B (en) * 2022-10-11 2023-05-09 杭州瑞成信息技术股份有限公司 Address recognition method based on self-supervision deep learning
CN115994227A (en) * 2023-03-23 2023-04-21 北京左医科技有限公司 Medical term standardization model construction method, device, terminal equipment and medium
CN115994227B (en) * 2023-03-23 2023-06-06 北京左医科技有限公司 Medical term standardization model construction method, device, terminal equipment and medium
CN117540734A (en) * 2024-01-10 2024-02-09 中南大学 Chinese medical entity standardization method, device and equipment
CN117540734B (en) * 2024-01-10 2024-04-09 中南大学 Chinese medical entity standardization method, device and equipment

Also Published As

Publication number Publication date
CN113436698B (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN113436698B (en) Automatic medical term standardization system and method integrating self-supervision and active learning
CN109378053B (en) Knowledge graph construction method for medical image
CN112214995B (en) Hierarchical multitasking term embedded learning for synonym prediction
CN111540468B (en) ICD automatic coding method and system for visualizing diagnostic reasons
JP4774073B2 (en) Methods for document clustering or categorization
CN102799579B (en) Statistical machine translation method with error self-diagnosis and self-correction functions
CN112002411A (en) Cardiovascular and cerebrovascular disease knowledge map question-answering method based on electronic medical record
CN109871538A (en) A kind of Chinese electronic health record name entity recognition method
CN111078875B (en) Method for extracting question-answer pairs from semi-structured document based on machine learning
US20050027664A1 (en) Interactive machine learning system for automated annotation of information in text
CN112364174A (en) Patient medical record similarity evaluation method and system based on knowledge graph
CN110298033A (en) Keyword corpus labeling trains extracting tool
CN106682397A (en) Knowledge-based electronic medical record quality control method
CN108875809A (en) The biomedical entity relationship classification method of joint attention mechanism and neural network
CN111554360A (en) Drug relocation prediction method based on biomedical literature and domain knowledge data
CN113707339B (en) Method and system for concept alignment and content inter-translation among multi-source heterogeneous databases
CN104699741A (en) Analyzing natural language questions to determine missing information in order to improve accuracy of answers
US20210042344A1 (en) Generating or modifying an ontology representing relationships within input data
CN117151220B (en) Entity link and relationship based extraction industry knowledge base system and method
US20190317986A1 (en) Annotated text data expanding method, annotated text data expanding computer-readable storage medium, annotated text data expanding device, and text classification model training method
CN113779211A (en) Intelligent question-answer reasoning method and system based on natural language entity relationship
CN106407183A (en) Method and device for generating medical named entity recognition system
CN116245107B (en) Electric power audit text entity identification method, device, equipment and storage medium
CN112420148A (en) Medical image report quality control system, method and medium based on artificial intelligence
CN114912435A (en) Power text knowledge discovery method and device based on frequent itemset algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant