CN113593661A - Clinical term standardization method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113593661A
CN113593661A
Authority
CN
China
Prior art keywords
statement
sentence
vector
sample
bert model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110767577.5A
Other languages
Chinese (zh)
Inventor
尹珊珊
舒正
朱波
张骁雅
赵明
刘英杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Guoxin Health Industry Technology Co ltd
Original Assignee
Qingdao Guoxin Health Industry Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Guoxin Health Industry Technology Co ltd filed Critical Qingdao Guoxin Health Industry Technology Co ltd
Priority to CN202110767577.5A
Publication of CN113593661A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H 10/60: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor, of unstructured textual data
    • G06F 16/36: Creation of semantic tools, e.g. ontology or thesauri

Abstract

The invention provides a clinical term standardization method, apparatus, electronic device and storage medium. The method comprises: acquiring a first sentence and a plurality of second sentences, wherein the first sentence is a sentence to be identified and each second sentence is a standard clinical term; inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors; and performing similarity calculation between the first sentence vector and each second sentence vector, and taking the second sentence corresponding to the highest similarity value as the standard clinical term corresponding to the first sentence. The method provided by the invention can improve the accuracy and efficiency of clinical term standardization.

Description

Clinical term standardization method, device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a clinical term standardization method, a clinical term standardization device, electronic equipment and a storage medium.
Background
With the rapid development of medical informatization, both the variety and the volume of medical data are growing rapidly. Performing data analysis and mining on the medical data of multiple medical data centers ("multi-center" data for short) to support clinical decision-making, medical management services, scientific research and the like is clearly an inevitable trend.
At present, the relevant standards for medical terminology in China are deficient and the standard system is incomplete. Because medical information systems come from many different vendors, term names and codes are severely heterogeneous between medical data centers, and even within a single center, and large amounts of semi-structured, unstructured and distributed heterogeneous data, information, instruments and systems exist. This creates many obstacles to the expression, storage, exchange and sharing of medical information and to collaborative work between systems. To digitize and informatize medical care, and to achieve efficient sharing of social medical resources as well as cross-regional and cross-system medical treatment, a set of standard clinical medical terms is undoubtedly needed. However, because of language barriers, the mapping relationships among existing international standard clinical term sets are difficult to apply to domestic medical terminology standards, and the standardization and sharing of medical data among multiple medical data centers are difficult to realize.
In the prior art, data standardization mostly relies on empirical rules. Even after this simple processing, a great deal of manual re-checking is still required, so the workload of the mapping personnel is heavy, the efficiency is low, and the accuracy of the resulting mapping relationships is not high.
Disclosure of Invention
The invention provides a clinical term standardization method, apparatus, electronic device and storage medium, which are used to solve the prior-art technical problems of low mapping efficiency and low accuracy caused by standardization based on empirical rules and manual re-checking, so as to improve the accuracy and efficiency of clinical term standardization.
In a first aspect, the present invention provides a method of standardizing clinical terms, comprising:
acquiring a first sentence and a plurality of second sentences; wherein the first sentence is a sentence to be identified, and the second sentence is a standard clinical term;
inputting the first statement and the plurality of second statements into a pre-trained S-Bert model to obtain a first statement vector and a plurality of second statement vectors;
respectively carrying out similarity calculation on the first statement vector and the plurality of second statement vectors, and taking the second statement corresponding to the highest similarity value as a standard clinical term corresponding to the first statement;
the S-Bert model is obtained by training based on a sample statement pair and a sample class label; wherein the sample sentence pair comprises a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample statement to be identified and a plurality of standard clinical terms;
and the S-Bert model is used for determining and pooling semantic symbol sequences of the first statement and the second statement.
According to the clinical term standardization method provided by the invention, the S-Bert model comprises a twin neural network structure and a pooling layer, wherein,
correspondingly, the inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors includes:
constructing sentence pairs by the first sentence and any one of the second sentences to obtain a plurality of sentence pairs;
respectively inputting the sentence pairs into the twin neural network structure to determine semantic unit symbols and perform symbol processing to obtain respective semantic symbol sequences of the sentence pairs;
and respectively inputting the semantic symbol sequences of the sentences into the pooling layer to perform average pooling, maximum pooling or initial symbol position pooling to obtain a first sentence vector and a plurality of second sentence vectors.
According to the clinical term standardization method provided by the invention, the symbol processing comprises the following steps:
adding a preset starting symbol in front of the sentence pair;
and/or,
adding a preset separation symbol between two adjacent sentences in the sentence pair;
and/or,
adding special symbols for the special semantic unit symbols.
According to the method for standardizing clinical terms provided by the invention, before the first sentence and the second sentences are obtained, the S-Bert model is trained according to sample sentence pairs and class labels, comprising the following steps:
step S1, determining semantic unit symbols and performing symbol processing on a third sentence and a fourth sentence in the sample sentence pair by using an S-Bert model to be trained to obtain a semantic symbol sequence; wherein the third sentence is the sample sentence to be identified, and the fourth sentence is any one of the standard clinical terms;
step S2, carrying out average pooling, maximum pooling or initial symbol position pooling on the semantic symbol sequence to obtain a third statement vector and a fourth statement vector of the sample statement pair;
step S3, performing score evaluation based on the third statement vector and the fourth statement vector of the sample statement pair, and selecting the model with the highest score as the S-Bert model to be trained in the next round;
step S4, the S-Bert model to be trained is trained again, when the model training termination condition is not met, the S-Bert model to be trained is adjusted, and the step S1 is executed again by utilizing the adjusted S-Bert model; and when the model training termination condition is met, obtaining the trained S-Bert model.
According to the clinical term standardization method provided by the invention, the adjusting of the S-Bert model to be trained comprises the following steps:
obtaining a difference vector according to the third statement vector and the fourth statement vector of the sample statement pair;
splicing the third statement vector, the fourth statement vector and the difference vector of the sample statement pair to obtain a fifth statement vector;
optimizing the S-Bert model to be trained according to the fifth statement vector and the training weight value;
or,
calculating cosine similarity of a third statement vector and a fourth statement vector of the sample statement pair to obtain a cosine similarity value;
optimizing the S-Bert model to be trained according to the cosine similarity value and the score in the class label of the sample statement pair;
or,
and optimizing the S-Bert model to be trained according to the distance between the given statement vector and the third statement vector and the fourth statement vector of the sample statement pair.
According to the method for standardizing clinical terms provided by the invention, the calculating the similarity between the first sentence vector and the plurality of second sentence vectors comprises the following steps:
and respectively carrying out similarity calculation on the first statement vector and the plurality of second statement vectors based on a cosine similarity algorithm, and taking the second statement corresponding to the highest similarity value as a standard clinical term of the first statement.
In a second aspect, the present invention provides a clinical term normalization apparatus comprising:
the acquisition module is used for acquiring a first statement and a plurality of second statements; wherein the first sentence is a sentence to be identified, and the second sentence is a standard clinical term;
the input module is used for inputting the first statement and the plurality of second statements into a pre-trained S-Bert model to obtain a first statement vector and a plurality of second statement vectors;
the calculation module is used for carrying out similarity calculation on the first statement vector and the plurality of second statement vectors respectively, and taking the second statement corresponding to the highest similarity value as a standard clinical term corresponding to the first statement;
the S-Bert model is obtained by training based on a sample statement pair and a sample class label; wherein the sample sentence pair comprises a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample statement to be identified and a plurality of standard clinical terms;
and the S-Bert model is used for determining and pooling semantic symbol sequences of the first statement and the second statement.
According to the clinical term normalization apparatus provided by the invention, the input module is further configured to:
constructing sentence pairs by the first sentence and any one of the second sentences to obtain a plurality of sentence pairs;
respectively inputting the sentence pairs into the twin neural network structure to determine semantic unit symbols and perform symbol processing to obtain respective semantic symbol sequences of the sentence pairs;
and respectively inputting the semantic symbol sequences of the sentences into the pooling layer to perform average pooling, maximum pooling or initial symbol position pooling to obtain a first sentence vector and a plurality of second sentence vectors.
In a third aspect, the present invention also provides an electronic device, including:
a processor, a memory, and a bus, wherein,
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the clinical term normalization method as described in any of the above.
In a fourth aspect, the invention also provides a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform a method of clinical term normalization as described above.
The invention provides a method, a device, electronic equipment and a storage medium for standardizing clinical terms, wherein the method comprises the following steps: acquiring a first sentence and a plurality of second sentences; wherein the first sentence is a sentence to be identified, and the second sentence is a standard clinical term; inputting the first statement and the plurality of second statements into a pre-trained S-Bert model to obtain a first statement vector and a plurality of second statement vectors; and respectively carrying out similarity calculation on the first statement vector and the plurality of second statement vectors, and taking the second statement corresponding to the highest similarity value as a standard clinical term corresponding to the first statement. The method provided by the invention can improve the accuracy and efficiency of clinical term standardization.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow diagram of a method for standardizing clinical terms provided by the present invention;
FIG. 2 is a schematic diagram of the structure of a clinical term standardizing apparatus provided by the present invention;
fig. 3 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method of the invention is implemented based on an S-Bert model obtained by fine-tuning the Bert language model. The Bert model, released in October 2018, caused a huge stir in the NLP industry and is considered a milestone in the NLP field. Bert is short for Bidirectional Encoder Representations from Transformers, that is, a Transformer-based bidirectional encoder representation. Unlike other language representation models, Bert aims to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.
Fig. 1 is a schematic flow diagram of the clinical term standardization method provided by the present invention. As shown in Fig. 1, the method comprises the following steps:
step 101: acquiring a first sentence and a plurality of second sentences; wherein the first sentence is a sentence to be identified, and the second sentence is a standard clinical term;
step 102: inputting the first statement and the plurality of second statements into a pre-trained S-Bert model to obtain a first statement vector and a plurality of second statement vectors;
step 103: respectively carrying out similarity calculation on the first statement vector and the plurality of second statement vectors, and taking the second statement corresponding to the highest similarity value as a standard clinical term corresponding to the first statement;
the S-Bert model is obtained by training based on a sample statement pair and a sample class label; wherein the sample sentence pair comprises a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample statement to be identified and a plurality of standard clinical terms;
and the S-Bert model is used for determining and pooling semantic symbol sequences of the first statement and the second statement.
Specifically, a semantic symbol sequence is a sequence in which symbols endowed with linguistic meaning are arranged in a certain order; standardization refers to the process of unifying medical terminology within a certain scope, by formulating standards for clinical medical terms, so as to obtain the best order and social benefit.
In step 101, the first sentence is the sentence to be recognized and each second sentence is a standard clinical term; for example, the sentence to be recognized is "ICD", and the standard clinical terms are "lung cancer" and "post-lung cancer infection". It should be noted that the second sentences can be set according to actual needs and are not specifically limited herein.
In step 102, the obtained first sentence and the plurality of second sentences are input into a pre-trained S-Bert model. It should be noted that the S-Bert model is trained based on sample sentence pairs and sample class labels, where a sample class label describes whether a mapping relationship exists between the sample sentence to be identified and each of the standard clinical terms. The existence of a mapping relationship may be judged by a mapping-relationship score, which in this embodiment is set between 0 and 1: 0 indicates no mapping relationship at all, and 1 indicates a definite mapping relationship. For example, if "ICD" is mapped to "post-lung cancer infection" with a score of 1 and to other standard medical terms with scores less than 1, this means that "ICD" has a mapping relationship with "post-lung cancer infection" and may not have one with the other standard medical terms. The sample class labels may be set according to actual needs and are not particularly limited herein.
In step 103, similarity calculation is performed between the first sentence vector output by the S-Bert model and each of the second sentence vectors, and the standard clinical term corresponding to the highest similarity value is taken as the standard clinical term of the first sentence. For example, let the first sentence U be "ICD", and among the second sentences let V be "lung cancer" and W be "post-lung cancer infection". After the semantic-symbol-sequence and pooling processing of the S-Bert model, the first sentence vector u and the second sentence vectors v and w are obtained. Suppose the similarity of u and v computed by the cosine similarity method is 0.5, and the similarity of u and w is 1; then, by comparative analysis, the sentence W corresponding to the highest similarity value, namely "post-lung cancer infection", is taken as the standard clinical term of the first sentence U. The method for calculating the similarity may be selected according to actual needs and is not particularly limited herein.
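As an illustrative sketch of step 103 (the toy vectors and the helper names stand in for real S-Bert outputs and are not from the patent), the highest-similarity lookup can be reproduced with a plain cosine-similarity computation:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_standard_term(first_vec, second_vecs, terms):
    """Return the standard term whose vector is most similar to first_vec."""
    scores = [cosine_similarity(first_vec, v) for v in second_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return terms[best], scores[best]

# Toy vectors standing in for S-Bert outputs (hypothetical values).
u = [1.0, 0.0, 1.0]                       # vector for "ICD"
candidates = [[0.0, 1.0, 0.5],            # vector for "lung cancer"
              [1.0, 0.0, 1.0]]            # vector for "post-lung cancer infection"
term, score = best_standard_term(u, candidates,
                                 ["lung cancer", "post-lung cancer infection"])
# term == "post-lung cancer infection", score == 1.0
```

An identical candidate vector yields the maximal cosine similarity of 1, matching the W example above.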
In the embodiment of the invention, the obtained first sentence and the plurality of second sentences are input into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors, the first sentence vector and the plurality of second sentence vectors are subjected to similarity calculation, and the second sentence corresponding to the highest similarity value is used as the standard clinical term corresponding to the first sentence. The method provided by the invention can improve the accuracy and recognition efficiency of clinical term standardization.
In another embodiment of the present invention, the S-Bert model includes a twin neural network structure and a pooling layer, wherein,
correspondingly, the inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors includes:
constructing sentence pairs by the first sentence and any one of the second sentences to obtain a plurality of sentence pairs;
respectively inputting the sentence pairs into the twin neural network structure to determine semantic unit symbols and perform symbol processing to obtain respective semantic symbol sequences of the sentence pairs;
and respectively inputting the semantic symbol sequences of the sentences into the pooling layer to perform average pooling, maximum pooling or initial symbol position pooling to obtain a first sentence vector and a plurality of second sentence vectors.
In particular, a twin neural network (also known as a Siamese network) is a coupled framework built from two artificial neural networks.
Pooling is used to remove extraneous information, retain key information, and optimize the data. Average pooling (mean pooling) takes the average of the feature points in the neighborhood; it preserves the overall characteristics of the data and highlights background information well. Maximum pooling (max pooling) takes the maximum of the feature points in the neighborhood and better preserves texture features. Initial-symbol-position pooling pools the feature points at the position of the starting symbol.
In the present embodiment, the semantic unit symbols are all the word vectors of each sentence. Sentence pairs are constructed from the first sentence and each of the second sentences to obtain a plurality of sentence pairs; according to the twin neural network structure, each sentence pair is input into two parameter-sharing Bert models to obtain all the word vectors of the sentence pair; symbol processing is performed to obtain the semantic symbol sequence; and the semantic symbol sequence output by Bert is input into a pooling layer for pooling, obtaining the first sentence vector and the plurality of second sentence vectors. It should be noted that the semantic unit symbols may be all the character vectors contained in each sentence, or may be the word vectors obtained after word segmentation of the sentence; this may be set according to actual needs and is not particularly limited herein.
The Pooling strategy of the Pooling layer can be average Pooling treatment, namely averaging all word vectors in the length dimension of the sentence, and taking the average result as the integral semantic vector of the sentence; the method can be maximum value pooling processing, namely, taking the maximum value of all word vectors on the length dimension of the sentence as the integral semantic vector of the sentence; it is also possible to pool the starting symbol positions, i.e. to use the vector of the starting symbol positions as the overall semantic vector of the sentence.
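A minimal sketch of the three pooling strategies, assuming the Bert token vectors are plain Python lists (an illustration, not the patent's implementation):

```python
def mean_pooling(token_vecs):
    """Average pooling: mean of each dimension over the sentence length."""
    n = len(token_vecs)
    return [sum(dim) / n for dim in zip(*token_vecs)]

def max_pooling(token_vecs):
    """Max pooling: per-dimension maximum over the sentence length."""
    return [max(dim) for dim in zip(*token_vecs)]

def start_symbol_pooling(token_vecs):
    """Initial-symbol-position pooling: the vector at the [CLS] position."""
    return token_vecs[0]

# Three token vectors standing in for Bert outputs (hypothetical values);
# token_vecs[0] is the vector at the starting-symbol position.
token_vecs = [[1.0, 4.0], [2.0, 5.0], [3.0, 0.0]]
mean_vec = mean_pooling(token_vecs)          # [2.0, 3.0]
max_vec = max_pooling(token_vecs)            # [3.0, 5.0]
cls_vec = start_symbol_pooling(token_vecs)   # [1.0, 4.0]
```

Whichever strategy is chosen, the result is a single vector used as the overall semantic vector of the sentence.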
In the embodiment of the invention, the S-Bert model is a model based on Bert sentence vectorization processing, a semantic symbol sequence of a sentence is determined through a twin neural network structure, the input semantic symbol sequence is subjected to pooling processing through a pooling layer to obtain a first sentence vector and a plurality of second sentence vectors, the model keeps the completeness of the Bert layer for sentence understanding, meanwhile, the generalization capability of the vectors on a migration task is considered, the semantic understanding can be deeply carried out, and the identification efficiency and the accuracy of the model are improved.
In another embodiment of the present invention, the symbol processing includes:
adding a preset starting symbol in front of the sentence pair;
and/or,
adding a preset separation symbol between two adjacent sentences in the sentence pair;
and/or,
adding special symbols for the special semantic unit symbols.
Specifically, the start symbol is a flag for indicating the start of a sentence; the separation symbol is used for separating sentence pairs; the special symbol is used for the representation of a special character.
In the embodiment of the present invention, the symbol processing on the sentence may add a preset starting symbol in front of the sentence pair, and/or add a preset separation symbol between two adjacent sentences in each sentence pair, and/or add a special symbol to the special semantic unit symbol. It should be noted that the semantic symbol marked with the special symbol is used as a semantic unit symbol, and the segmentation processing is not performed.
In this embodiment, the start symbol can be defined by the user, such as [CLS] or [START]; the preset separation symbol can be defined by the user, such as [SEP] or [S]; and the special symbol can be defined by the user, such as [NOT].
For example, suppose sentence a is "gastric cancer" and sentence b is "gastric cancer, unspecified", with "unspecified" marked by the special symbol [NOT]. Taking each single character of a sentence as a semantic unit symbol, e.g. denoting the character "gastric" as token_1, the semantic symbol sequence of the sentence pair is obtained as: [CLS] token_1 token_2 [SEP] token_3 token_4 [NOT]. Note that there may be two or more [SEP] symbols.
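The symbol processing above can be sketched as follows; the helper function and token names are hypothetical, and only the symbol conventions ([CLS], [SEP], [NOT]) follow the description:

```python
def build_symbol_sequence(sent_a_tokens, sent_b_tokens, special_tokens=()):
    """Build the semantic symbol sequence for a sentence pair:
    [CLS] at the start, [SEP] between the two sentences, and [NOT]
    appended after any token marked as special."""
    seq = ["[CLS]"]
    seq += sent_a_tokens
    seq.append("[SEP]")
    for tok in sent_b_tokens:
        seq.append(tok)
        if tok in special_tokens:
            seq.append("[NOT]")
    return seq

# Tokens of sentence a ("gastric cancer") and b ("gastric cancer, unspecified"),
# where token_4 stands for the "unspecified" part marked as special.
seq = build_symbol_sequence(["token_1", "token_2"],
                            ["token_3", "token_4"],
                            special_tokens={"token_4"})
# seq == ["[CLS]", "token_1", "token_2", "[SEP]", "token_3", "token_4", "[NOT]"]
```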
In the embodiment of the invention, the semantic symbol sequence of the sentence pair is obtained by carrying out symbol processing on the sentence pair, so that the semantic understanding can be realized more deeply, and the accuracy and efficiency of model identification are improved.
In an embodiment of the present invention, before the first sentence and the second sentences are obtained, training the S-Bert model according to the sample sentence pairs and class labels includes:
step S1, determining semantic unit symbols and performing symbol processing on a third sentence and a fourth sentence in the sample sentence pair by using an S-Bert model to be trained to obtain a semantic symbol sequence; wherein the third sentence is the sample sentence to be identified, and the fourth sentence is any one of the standard clinical terms;
step S2, carrying out average pooling, maximum pooling or initial symbol position pooling on the semantic symbol sequence to obtain a third statement vector and a fourth statement vector of the sample statement pair;
step S3, performing score evaluation on the third statement vector and the fourth statement vector of the sample statement, and selecting the model with the highest score as the S-Bert model to be trained in the next round;
step S4, the S-Bert model to be trained is trained again, when the model training termination condition is not met, the S-Bert model to be trained is adjusted, and the step S1 is executed again by utilizing the adjusted S-Bert model; and when the model training termination condition is met, obtaining the trained S-Bert model.
Specifically, the model training termination condition may be that when the score of the training model is smaller than a preset threshold, the training is terminated.
In the embodiment of the invention, the training termination instruction is a preset threshold for the model score. Based on the obtained sample sentence vectors and standard clinical term vectors, and in combination with the specific business objective, a suitable evaluator is selected to evaluate the overall semantic vectors, and the model with the highest score is taken as the model to be trained in the next round. The evaluation indexes commonly used for machine learning or deep learning models, such as accuracy, precision, recall and F1 value, are used for evaluation; the user can select different indexes according to the ratio of positive and negative samples and the training task. Intensive training is then performed on the false-positive data in the validation set until the training termination instruction is triggered, i.e. training stops when the score of the model to be trained falls below the preset threshold; for example, if the preset threshold is 90, training can stop when the model training score is 80. False-positive data refers to sentence pairs that originally have no mapping relationship but are wrongly predicted by the training model as having one. In the embodiment of the invention, the sample sentence to be recognized and the standard clinical terms need to undergo semantic-symbol-sequence determination and pooling to obtain the sample sentence vectors and the standard clinical term vectors, wherein the semantic symbol sequence may consist of the vectors of all characters in the sentence, or of the vectors of all words after word segmentation. For the specific implementation, see the above examples; details are omitted here.
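The evaluation indexes mentioned above can be sketched for the binary mapping task as follows (a minimal stdlib example with made-up labels, not the patent's evaluator):

```python
def prf1(y_true, y_pred):
    """Precision, recall and F1 for binary mapping labels (1 = has mapping)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels: the second pair is missed, the third is a false positive
# (a pair with no mapping relationship predicted as having one).
p, r, f1 = prf1([1, 1, 0, 0], [1, 0, 1, 0])
# p == 0.5, r == 0.5, f1 == 0.5
```

The false positives counted by `fp` correspond to the false-positive data on which intensive training is performed.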
In the embodiment of the invention, the model to be trained is evaluated and trained twice according to the training set data and the verification set data, so that the accuracy of the identification result of the S-Bert model can be improved.
In another embodiment of the present invention, the adjusting the S-Bert model to be trained includes:
obtaining a difference vector according to the third statement vector and the fourth statement vector of the sample statement pair;
splicing the third statement vector, the fourth statement vector and the difference vector of the sample statement pair to obtain a fifth statement vector;
optimizing the S-Bert model to be trained according to the fifth statement vector and the training weight value;
or, alternatively,
calculating cosine similarity of a third statement vector and a fourth statement vector of the sample statement pair to obtain a cosine similarity value;
optimizing the S-Bert model to be trained according to the cosine similarity value and the score in the class label of the sample statement pair;
or, alternatively,
and optimizing the S-Bert model to be trained according to the distance between the given statement vector and the third statement vector and the fourth statement vector of the sample statement pair.
Specifically, the optimization process may be implemented by adjusting parameters of the model, wherein the objective function applied in this embodiment may be a classification objective function, a regression objective function, or a triple objective function.
In the embodiment of the invention, the third sentence is the sample sentence to be identified, and the third sentence vector is the overall semantic vector of the sample sentence to be identified; the fourth sentence is any one of the plurality of standard clinical terms, and the fourth sentence vector is the overall semantic vector of that standard clinical term. A difference vector u−v is obtained from the overall semantic vector u of the third sentence and the overall semantic vector v of the fourth sentence of the sample sentence pair; the overall semantic vector u of the third sentence, the overall semantic vector v of the fourth sentence and the difference vector u−v are spliced to obtain a fifth sentence vector M; the spliced fifth sentence vector M is then multiplied by a trainable weight W_t ∈ R^{3n×k}, and finally sentence-pair classification is trained with a softmax classifier, the optimization of the S-Bert model to be trained being realized by adjusting the parameters. The classification objective function is specifically expressed as: o = softmax(W_t(u, v, |u−v|)). The loss function adopts the cross-entropy loss function, which is specifically expressed as:
loss(o, class) = −log( exp(o[class]) / Σ_{j∈k} exp(o[j]) ) = −o[class] + log( Σ_{j∈k} exp(o[j]) )
where loss(o, class) represents the loss value of the sentence pair's semantic symbol sequence; class represents the category to which the sentence pair belongs, and k is the number of categories; o is the vector output by the softmax classifier for the sentence pair; and j ∈ k indexes a possible category of the sentence pair.
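The classification objective above can be traced with a small numerical sketch. The dimensions and random values below are made up for illustration; this is not the patent's implementation.

```python
import numpy as np

# Toy sketch of the classification objective: concatenate u, v and
# |u - v| into the fifth sentence vector M, multiply by a trainable
# weight W_t in R^{3n x k}, apply softmax, and score with cross-entropy.

rng = np.random.default_rng(0)
n, k = 4, 2                            # embedding size, number of classes (illustrative)
u = rng.standard_normal(n)             # overall semantic vector of the third sentence
v = rng.standard_normal(n)             # overall semantic vector of the fourth sentence
W_t = rng.standard_normal((3 * n, k))  # trainable weight W_t

m = np.concatenate([u, v, np.abs(u - v)])  # fifth sentence vector M, length 3n

logits = m @ W_t                       # (3n,) @ (3n, k) -> (k,)
o = np.exp(logits - logits.max())
o /= o.sum()                           # softmax output o

target = 1                             # class label of the sentence pair (illustrative)
loss = -np.log(o[target])              # cross-entropy loss(o, class)
```

In training, `loss` would be backpropagated to adjust `W_t` and the encoder parameters, which is the "optimization by adjusting parameters" the text describes.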
In the embodiment of the invention, the cosine similarity is calculated between the overall semantic vector u of the third sentence and the overall semantic vector v of the fourth sentence of the sample sentence pair; the parameters of the S-Bert model to be trained are adjusted according to the cosine similarity and the score in the class label of the sample sentence pair, so as to realize the optimization processing. The loss function is the mean square error, calculated by the following expression:
loss(sim(u, v), cosine) = (cosine − sim(u, v))²
and sim (u, v) represents cosine similarity of the third statement vector and the fourth statement vector, and cosine represents scores in category labels of the third statement and the fourth statement in the sample statement.
In the embodiment of the invention, the S-Bert model to be trained is optimized according to the distance between a given sentence vector and the third and fourth sentence vectors of the sample sentence pair. Here the objective function of the trainer is the triplet objective function: three sentences are input, namely a given sentence a, a positive example sentence p and a negative example sentence n, and the model parameters are adjusted so that the distance between sentence a and sentence p is smaller than the distance between sentence a and sentence n, i.e., the overall semantic vector of sentence a is closer to the overall semantic vector of sentence p, thereby realizing the optimization of the model. The loss function is specifically characterized by the following formula:
loss = max(‖S_a − S_p‖ − ‖S_a − S_n‖ + ε, 0)
where S_a, S_p and S_n respectively represent the overall semantic vectors of sentences a, p and n, and ε represents a preset distance margin value.
It should be noted that the purpose of the ε parameter is to ensure that the distance from S_p to S_a is smaller than the distance from S_n to S_a by at least ε, where ε may be set to 1.
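The triplet objective above can be sketched as follows; the vectors are made up, and the per-sample hinge form (minimized during training) is shown.

```python
import numpy as np

# Illustrative triplet objective: push the anchor a closer to the
# positive p than to the negative n by at least the margin epsilon.
# Per-sample loss = max(||S_a - S_p|| - ||S_a - S_n|| + eps, 0).

def triplet_loss(s_a, s_p, s_n, eps=1.0):
    d_pos = np.linalg.norm(s_a - s_p)   # distance anchor -> positive
    d_neg = np.linalg.norm(s_a - s_n)   # distance anchor -> negative
    return max(d_pos - d_neg + eps, 0.0)

s_a = np.array([0.0, 0.0])   # given (anchor) sentence vector
s_p = np.array([0.5, 0.0])   # positive example: close to the anchor
s_n = np.array([3.0, 0.0])   # negative example: far from the anchor

loss = triplet_loss(s_a, s_p, s_n)   # 0.5 - 3.0 + 1.0 < 0, so loss is 0
```

When the positive is already more than ε closer than the negative, the loss is zero and no parameter update is needed for that triplet.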
The embodiment of the invention provides three optimization processing modes, and the objective functions are respectively a classification objective function, a regression objective function and a triple objective function, so that the optimization processing of the trained model parameters is realized, and the accuracy of the identification result of the model is improved.
In another embodiment of the present invention, the performing similarity calculation between the first sentence vector and the plurality of second sentence vectors includes:
and respectively carrying out similarity calculation on the first statement vector and the plurality of second statement vectors based on a cosine similarity algorithm, and taking the second statement corresponding to the highest similarity value as a standard clinical term of the first statement.
Specifically, cosine similarity measures the difference between two individuals by calculating the cosine value of the angle between their two vectors.
In the embodiment of the invention, similarity calculation is performed between the first sentence vector and each of the plurality of second sentence vectors based on the cosine similarity algorithm, and the second sentence corresponding to the highest similarity value is determined as the standard clinical term of the first sentence. For example, when the cosine similarity of vectors U and V is 1 and the cosine similarity of vectors U and W is 0, the second sentence corresponding to the second sentence vector V, whose cosine similarity value is 1, is taken as the standard clinical term of the first sentence corresponding to U. It should be noted that the range of the cosine similarity value is [−1, 1]; the closer the cosine value is to 1, the closer the angle between the two vectors is to 0 degrees, i.e., the more similar the two vectors are.
In the embodiment of the invention, the standard clinical term of the first statement is determined based on a cosine similarity algorithm, the calculation method is simple, and the efficiency of model identification is improved.
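The retrieval step described above can be sketched as follows; the vectors and candidate list are made up for illustration.

```python
import numpy as np

# Illustrative retrieval: compare the first sentence vector with every
# second sentence vector by cosine similarity and return the index of
# the standard clinical term with the highest value.

def best_match(first_vec, second_vecs):
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [cos(first_vec, v) for v in second_vecs]
    return int(np.argmax(sims)), max(sims)

first = np.array([1.0, 1.0])
terms = [np.array([1.0, 1.0]),    # same direction  -> cosine similarity 1
         np.array([-1.0, 1.0])]   # orthogonal      -> cosine similarity 0

idx, sim = best_match(first, terms)   # the first candidate wins
```

The winning index selects the second sentence that is then reported as the standard clinical term of the first sentence.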
In another embodiment of the present invention, the construction of the S-Bert model is described in detail as follows:
the pre-trained Bert model can be subjected to fine-tuning (fine-tuning) processing through an additional output layer, and is suitable for building the most advanced model of a wide range of tasks without modifying a large-scale architecture for specific tasks. In the invention, two new unsupervised prediction tasks are used for pre-training the Bert model, wherein the task one is to randomly mask certain parts of an input sentence and then predict the masked parts to realize the training of the model, in this case, the final hidden vector corresponding to the masked parts is input into a softmax function and predicts the probability of all words as in a standard Language model so as to deepen the semantic understanding, and the step is called MLM (masked Language model). Task two is next sentence prediction, and many important downstream tasks, such as Question Answering (QA) and Natural Language Inference (NLI), are based on understanding the relationship between two text sentences and are not directly obtained through language modeling. In order to train a model for understanding sentence relations, a binarization next sentence prediction task is set, which can be easily generated from any monolingual corpus, such as selecting sentences A and B as pre-training samples, wherein B is 50% likely to be the next sentence of A and 50% likely to be random sentences from the corpus, and the task enables the model to capture semantic relations between two sentences.
The S-Bert (Sentence-Bert) model is mainly proposed to solve the problems that Bert-based semantic similarity retrieval incurs a huge time overhead, and that Bert sentence representations are unsuitable for unsupervised tasks such as clustering and sentence similarity calculation. The S-Bert model obtains vector representations of sentence pairs by using a twin (Siamese) neural network structure and a pooling layer. The S-Bert pre-training process mainly comprises two steps. First, the twin neural network structure is used to obtain the overall semantic vector representation of each sentence, with two fine-tuned Bert models used to pre-train the model: the sentence pair is input into two parameter-sharing Bert models, and all word vectors output by Bert for each sentence are input into a pooling layer for pooling processing, yielding the overall semantic vector representation of each sentence. Second, the model is optimized with an objective function. See the examples for details.
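The pooling step of the twin structure can be sketched as follows. The encoder here is a stand-in that returns per-token vectors, not a real Bert model; everything about it is illustrative.

```python
import numpy as np

# Illustrative pooling step of the twin (Siamese) structure: each
# sentence passes through the same encoder (a stand-in below), and
# the per-token vectors are pooled into one overall semantic vector.
# Mean pooling is the default; max pooling and initial-symbol ([CLS])
# position pooling are the other options the text mentions.

def fake_encoder(tokens, dim=4):
    # Stand-in for Bert: deterministic per-token vectors (illustrative only).
    rng = np.random.default_rng(len(tokens))
    return rng.standard_normal((len(tokens), dim))

def sentence_vector(tokens, pooling="mean"):
    token_vecs = fake_encoder(tokens)
    if pooling == "mean":
        return token_vecs.mean(axis=0)      # average pooling
    if pooling == "max":
        return token_vecs.max(axis=0)       # maximum pooling
    return token_vecs[0]                    # initial-symbol position pooling

u = sentence_vector(["gastric", "cancer"])
v = sentence_vector(["malignant", "tumor", "of", "stomach"])
```

Because both sentences go through the same (parameter-sharing) encoder, their pooled vectors live in one space and can be compared directly by cosine similarity.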
It should be noted that, regarding the acquisition of the Bert/S-Bert model, the basic Bert/S-Bert model may be fine-tuned with training corpora of the corresponding scene according to the specific application scenario, so as to obtain a model suitable for the specific business scenario; alternatively, a relevant model may be trained from scratch.
In another embodiment of the invention, the method further comprises:
calculating the first statement vector and the second statement vector according to a mapping strategy to obtain a standard clinical term of the first statement;
the mapping strategy is a mode of classifying the characteristics of clinical terms and limiting the mapping range by using axis words and axes.
Specifically, an axis word is a feature word or keyword that is contained in a clinical term and distinguishes it from terms with other attributes; an axis is the standard or scale of classification.
In the embodiment of the invention, some mapping strategies are added according to business needs and data quality, and the standard clinical term of the first sentence is obtained according to the mapping strategy. For example, the similarity may be calculated after splicing the vector obtained by the S-Bert model with the vector of the axis word, or the n clinical standard terms with the highest similarity obtained by the S-Bert model may be screened through medical rules and the like. For example, if axis 1 is set to contain the axis word "digestive system", axis 1.1 contains the axis words "stomach" and "stomach body", and axis 2 contains the axis word "respiratory system", then performing axis-word extraction on the clinical diagnosis result "stomach malignant tumor" extracts axis 1.1, to which the axis word "stomach" belongs, and the mapping range for the clinical term can then be restricted to axis 1 and axis 1.1. It should be noted that the axis words may be extracted with certain rules, or by machine learning or deep learning, which can be selected according to actual needs and is not specifically limited here.
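The axis-word narrowing in the example above can be sketched with a simple rule-based extractor. The substring matching below is a stand-in for the rule-based or learned extraction the text mentions; the axis table and function names are illustrative.

```python
# Illustrative axis-word mapping strategy: extract axis words from a
# sentence by substring matching (a stand-in for rule / ML extraction)
# and keep only candidate standard terms whose axes overlap, thereby
# limiting the mapping range.

AXES = {
    "axis 1":   ["digestive system"],
    "axis 1.1": ["stomach", "stomach body"],
    "axis 2":   ["respiratory system"],
}

def matching_axes(sentence):
    """Return the axes whose axis words occur in the sentence."""
    return {axis for axis, words in AXES.items()
            if any(w in sentence for w in words)}

def narrow_candidates(sentence, candidates):
    """Keep candidates sharing an axis with the sentence; fall back to all."""
    axes = matching_axes(sentence)
    kept = [c for c in candidates if matching_axes(c) & axes]
    return kept or candidates

cands = ["malignant tumor of stomach", "malignant tumor of lung"]
kept = narrow_candidates("stomach malignant tumor", cands)
```

Narrowing the candidate set this way means the S-Bert similarity computation only runs over terms in the relevant axes, which is the efficiency gain the embodiment claims.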
According to the embodiment of the invention, the standard processing of the statement to be recognized is realized by adding the mapping strategy, the model recognition efficiency is improved, and the user experience is improved.
Fig. 2 shows a device for standardizing clinical terms provided by the present invention. As shown in Fig. 2, the device for standardizing clinical terms provided by the present invention comprises:
an obtaining module 201, configured to obtain a first statement and a plurality of second statements; wherein the first sentence is a sentence to be identified, and the second sentence is a standard clinical term;
the input module 202 is configured to input the first sentence and the plurality of second sentences into a pre-trained S-Bert model, and obtain a first sentence vector and a plurality of second sentence vectors;
a calculating module 203, configured to perform similarity calculation on the first statement vector and the plurality of second statement vectors respectively, and use the second statement corresponding to the highest similarity value as a standard clinical term corresponding to the first statement;
the S-Bert model is obtained by training based on a sample statement pair and a sample class label; wherein the sample sentence pair comprises a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample statement to be identified and a plurality of standard clinical terms;
and the S-Bert model is used for determining and pooling semantic symbol sequences of the first statement and the second statement.
The device for standardizing clinical terms provided in the embodiment of the invention comprises an acquisition module for acquiring a first sentence and a plurality of second sentences, wherein the first sentence is a sentence to be identified and the second sentences are standard clinical terms; an input module for inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors; and a calculation module for performing similarity calculation between the first sentence vector and the plurality of second sentence vectors, and taking the second sentence corresponding to the highest similarity value as the standard clinical term corresponding to the first sentence. The device provided by the invention can realize accurate focusing of clinical terms and, through deep understanding of semantics in both depth and breadth, improves the accuracy and efficiency of clinical term standardization as well as the user experience.
Since the principle of the apparatus according to the embodiment of the present invention is the same as that of the method according to the above embodiment, it is not described in further detail here.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 3, the present invention provides an electronic device, including: a processor (processor)301, a memory (memory)302, and a bus 303;
wherein, the processor 301 and the memory 302 complete the communication with each other through the bus 303;
processor 301 is configured to call program instructions in memory 302 to perform the methods provided by the various method embodiments described above, including, for example: acquiring a first sentence and a plurality of second sentences; wherein the first sentence is a sentence to be identified, and the second sentence is a standard clinical term; inputting the first statement and the plurality of second statements into a pre-trained S-Bert model to obtain a first statement vector and a plurality of second statement vectors; respectively carrying out similarity calculation on the first statement vector and the plurality of second statement vectors, and taking the second statement corresponding to the highest similarity value as a standard clinical term corresponding to the first statement; the S-Bert model is obtained by training based on a sample statement pair and a sample class label; wherein the sample sentence pair comprises a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample statement to be identified and a plurality of standard clinical terms; and the S-Bert model is used for determining and pooling semantic symbol sequences of the first statement and the second statement.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including: acquiring a first sentence and a plurality of second sentences; wherein the first sentence is a sentence to be identified, and the second sentence is a standard clinical term; inputting the first statement and the plurality of second statements into a pre-trained S-Bert model to obtain a first statement vector and a plurality of second statement vectors; respectively carrying out similarity calculation on the first statement vector and the plurality of second statement vectors, and taking the second statement corresponding to the highest similarity value as a standard clinical term corresponding to the first statement; the S-Bert model is obtained by training based on a sample statement pair and a sample class label; wherein the sample sentence pair comprises a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample statement to be identified and a plurality of standard clinical terms; and the S-Bert model is used for determining and pooling semantic symbol sequences of the first statement and the second statement.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of normalizing clinical terms, comprising:
acquiring a first sentence and a plurality of second sentences; wherein the first sentence is a sentence to be identified, and the second sentence is a standard clinical term;
inputting the first statement and the plurality of second statements into a pre-trained S-Bert model to obtain a first statement vector and a plurality of second statement vectors;
respectively carrying out similarity calculation on the first statement vector and the plurality of second statement vectors, and taking the second statement corresponding to the highest similarity value as a standard clinical term corresponding to the first statement;
the S-Bert model is obtained by training based on a sample statement pair and a sample class label; wherein the sample sentence pair comprises a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample statement to be identified and a plurality of standard clinical terms;
and the S-Bert model is used for determining and pooling semantic symbol sequences of the first statement and the second statement.
2. The clinical term normalization method according to claim 1, wherein the S-Bert model includes a twin neural network structure and a pooling layer, wherein,
correspondingly, the inputting the first sentence and the plurality of second sentences into a pre-trained S-Bert model to obtain a first sentence vector and a plurality of second sentence vectors includes:
constructing sentence pairs by the first sentence and any one of the second sentences to obtain a plurality of sentence pairs;
respectively inputting the sentence pairs into the twin neural network structure to determine semantic unit symbols and perform symbol processing to obtain respective semantic symbol sequences of the sentence pairs;
and respectively inputting the semantic symbol sequences of the sentences into the pooling layer to perform average pooling, maximum pooling or initial symbol position pooling to obtain a first sentence vector and a plurality of second sentence vectors.
3. The method of clinical term normalization according to claim 2, wherein the symbol processing includes:
adding a preset starting symbol in front of the sentence pair;
and/or,
adding a preset separation symbol between two adjacent sentences in the sentence pair;
and/or,
adding special symbols for the special semantic unit symbols.
5. The method of claim 1, wherein, before the obtaining of the first sentence and the plurality of second sentences, training the S-Bert model according to the sample sentence pair and the sample class label comprises:
step S1, determining semantic unit symbols and performing symbol processing on a third sentence and a fourth sentence in the sample sentence pair by using an S-Bert model to be trained to obtain a semantic symbol sequence; wherein the third sentence is the sample sentence to be identified, and the fourth sentence is any one of the standard clinical terms;
step S2, carrying out average pooling, maximum pooling or initial symbol position pooling on the semantic symbol sequence to obtain a third statement vector and a fourth statement vector of the sample statement pair;
step S3, performing score evaluation on the third statement vector and the fourth statement vector of the sample statement, and selecting the model with the highest score as the S-Bert model to be trained in the next round;
step S4, the S-Bert model to be trained is trained again, when the model training termination condition is not met, the S-Bert model to be trained is adjusted, and the step S1 is executed again by utilizing the adjusted S-Bert model; and when the model training termination condition is met, obtaining the trained S-Bert model.
5. The method of claim 4, wherein the adjusting the S-Bert model to be trained comprises:
obtaining a difference vector according to the third statement vector and the fourth statement vector of the sample statement pair;
splicing the third statement vector, the fourth statement vector and the difference vector of the sample statement pair to obtain a fifth statement vector;
optimizing the S-Bert model to be trained according to the fifth statement vector and the training weight value;
or, alternatively,
calculating cosine similarity of a third statement vector and a fourth statement vector of the sample statement pair to obtain a cosine similarity value;
optimizing the S-Bert model to be trained according to the cosine similarity value and the score in the class label of the sample statement pair;
or, alternatively,
and optimizing the S-Bert model to be trained according to the distance between the given statement vector and the third statement vector and the fourth statement vector of the sample statement pair.
6. The method of claim 1, wherein the calculating the similarity between the first term vector and the plurality of second term vectors comprises:
and respectively carrying out similarity calculation on the first statement vector and the plurality of second statement vectors based on a cosine similarity algorithm, and taking the second statement corresponding to the highest similarity value as a standard clinical term of the first statement.
7. A clinical term normalization apparatus, comprising:
the acquisition module is used for acquiring a first statement and a plurality of second statements; wherein the first sentence is a sentence to be identified, and the second sentence is a standard clinical term;
the input module is used for inputting the first statement and the plurality of second statements into a pre-trained S-Bert model to obtain a first statement vector and a plurality of second statement vectors;
the calculation module is used for carrying out similarity calculation on the first statement vector and the plurality of second statement vectors respectively, and taking the second statement corresponding to the highest similarity value as a standard clinical term corresponding to the first statement;
the S-Bert model is obtained by training based on a sample statement pair and a sample class label; wherein the sample sentence pair comprises a sample sentence to be identified and a plurality of standard clinical terms; the sample category label is used for describing whether a mapping relation exists between a sample statement to be identified and a plurality of standard clinical terms;
and the S-Bert model is used for determining and pooling semantic symbol sequences of the first statement and the second statement.
8. The clinical term normalization apparatus of claim 7, wherein the input module is further configured to:
constructing sentence pairs by the first sentence and any one of the second sentences to obtain a plurality of sentence pairs;
respectively inputting the sentence pairs into the twin neural network structure to determine semantic unit symbols and perform symbol processing to obtain respective semantic symbol sequences of the sentence pairs;
and respectively inputting the semantic symbol sequences of the sentences into the pooling layer to perform average pooling, maximum pooling or initial symbol position pooling to obtain a first sentence vector and a plurality of second sentence vectors.
9. An electronic device, comprising:
a processor, a memory, and a bus, wherein,
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1to 7.
10. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1-7.
CN202110767577.5A 2021-07-07 2021-07-07 Clinical term standardization method, device, electronic equipment and storage medium Pending CN113593661A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110767577.5A CN113593661A (en) 2021-07-07 2021-07-07 Clinical term standardization method, device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113593661A true CN113593661A (en) 2021-11-02

Family

ID=78246599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110767577.5A Pending CN113593661A (en) 2021-07-07 2021-07-07 Clinical term standardization method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113593661A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373700A1 (en) * 2015-11-25 2018-12-27 Koninklijke Philips N.V. Reader-driven paraphrasing of electronic clinical free text
CN109697286A (en) * 2018-12-18 2019-04-30 众安信息技术服务有限公司 A kind of diagnostic standardization method and device based on term vector
CN109992648A (en) * 2019-04-10 2019-07-09 北京神州泰岳软件股份有限公司 The word-based depth text matching technique and device for migrating study
CN111353302A (en) * 2020-03-03 2020-06-30 平安医疗健康管理股份有限公司 Medical word sense recognition method and device, computer equipment and storage medium
CN112464662A (en) * 2020-12-02 2021-03-09 平安医疗健康管理股份有限公司 Medical phrase matching method, device, equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xuan Yungan: "Research on Semantic Retrieval of Social Tags", Southeast University Press, page 37 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114974602A (en) * 2022-05-26 2022-08-30 山东大学 Diagnostic coding method and system based on contrast learning
CN114996466A (en) * 2022-08-01 2022-09-02 神州医疗科技股份有限公司 Method and system for establishing medical standard mapping model and using method
CN114996466B (en) * 2022-08-01 2022-11-01 神州医疗科技股份有限公司 Method and system for establishing medical standard mapping model and using method
CN116167354A (en) * 2023-04-19 2023-05-26 北京亚信数据有限公司 Medical term feature extraction model training and standardization method and device
CN116186271A (en) * 2023-04-19 2023-05-30 北京亚信数据有限公司 Medical term classification model training method, classification method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination