CN114582449A - Electronic medical record named entity standardization method and system based on XLNet-BiGRU-CRF model - Google Patents
- Publication number
- CN114582449A (application CN202210049938.7A)
- Authority
- CN
- China
- Prior art keywords
- entity
- electronic medical
- medical record
- data
- bigru
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides an electronic medical record named entity standardization method, system, storage medium and electronic equipment based on an XLNet-BiGRU-CRF model, and relates to the technical field of data processing. The method comprises: comparing, by cosine similarity, a first Embedding word vector with the second Embedding word vectors corresponding to a plurality of related triples, and taking the standard entity corresponding to the word with the highest similarity score as a target mapping entity result; and mapping the target mapping entity result to a reference table to obtain a final electronic medical record standard entity. As a result, retrieving the diagnoses of any doctor no longer returns incomplete results caused by differing writing habits: doctors can keep entering data quickly and in their habitual way during clinical work, while all different written forms that carry the same medical meaning are recognized as equivalent in data presentation and statistics.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to an electronic medical record named entity standardization method, system, storage medium and electronic equipment based on an XLNet-BiGRU-CRF model.
Background
An electronic medical record is a medical record stored, managed and transmitted by a computer information system. It comprises the digitized information recorded by medical staff while diagnosing and treating patients, such as medical history, clinical manifestations and treatment methods. Because most electronic medical records are semi-structured or unstructured data, their analysis and data mining are severely restricted. Named entity recognition, an important branch of natural language processing, discovers and recognizes proper nouns and meaningful words in natural text and classifies them into predefined categories. Applying named entity recognition technology to electronic medical record text aims to automatically recognize, classify and standardize the medical named entities in it.
Traditional research on electronic medical record named entity recognition falls into three groups: dictionary-based and rule-based methods, statistical machine learning methods, and deep learning methods. Dictionary-based and rule-based methods require entity extraction rules to be constructed manually from phrase collocation patterns and vocabulary characteristics; although they can work well in specific fields, they need a large amount of expert knowledge and have a low recall rate. Statistical machine learning methods include the hidden Markov model, the support vector machine, the conditional random field, the maximum entropy model and the like. They define a feature set from the labeled training set and train a statistical model with a traditional machine learning algorithm, so their recognition performance depends closely on the designed features. Deep learning methods have been widely used and have made breakthroughs in recent years; they include recurrent neural networks (RNNs), convolutional neural networks (CNNs), gated recurrent units (GRUs) and the like. Compared with traditional machine learning models, deep learning methods can learn high-dimensional, deep feature representations, which helps improve the generalization ability of entity recognition.
However, medical named entity recognition is named entity recognition in a specific field and aims to identify important concepts in electronic medical records, including symptoms, disease names and the like. Named entity recognition and standardization for electronic medical records still face difficulties and challenges. Compared with general-domain text, electronic medical record text has the following characteristics: (1) the text is long; (2) rare words are numerous; (3) named entities are nested within each other; and so on. Recognizing named entities in electronic medical records is therefore a challenging task, and recognition performance needs to be further improved.
Disclosure of Invention
(I) Technical problem to be solved
In view of the defects of the prior art, the invention provides an electronic medical record named entity standardization method, system, storage medium and electronic equipment based on an XLNet-BiGRU-CRF model, and solves the technical problem that medical named entity recognition performance needs to be improved.
(II) Technical solution
To achieve the above object, the invention is realized by the following technical solution:
An electronic medical record named entity standardization method based on an XLNet-BiGRU-CRF model comprises the following steps:
S1, acquiring and preprocessing an electronic medical record corpus to be recognized;
S2, inputting the preprocessed electronic medical record corpus to be recognized into an XLNet sub-model to obtain a first Embedding word vector, wherein the XLNet model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component;
S3, inputting the first Embedding word vector into a BiGRU-CRF sub-model to obtain an entity recognition result corresponding to the electronic medical record corpus to be recognized;
S4, extracting, according to the entity recognition result, a plurality of related triples for the corresponding entity from a preset Neo4j database, wherein each triple comprises an original entity, an entity category and a standard entity;
S5, comparing, by cosine similarity, the first Embedding word vector with the second Embedding word vectors corresponding to the plurality of related triples, and taking the standard entity corresponding to the word with the highest similarity score as a target mapping entity result;
S6, taking a preset standard table as a reference table and mapping the target mapping entity result to the reference table to obtain a final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual labeling.
Preferably, the preprocessing in S1 includes desensitizing the electronic medical record corpus to be recognized and manually sequence-labeling it.
Preferably, the permutation language model in S2 includes:
$$\max_{\theta}\ \mathbb{E}_{\alpha}\left[\sum_{t=1}^{T} \log p_{\theta}\left(x_{\alpha_t} \mid x_{\alpha_{<t}}\right)\right]$$
wherein $\mathbb{E}_{\alpha}$ denotes the expectation over all permutations $\alpha$ of a sequence of length $T$, $p_{\theta}$ is the conditional probability, $x_{\alpha_t}$ is the t-th token in the factorization order, and $x_{\alpha_{<t}}$ denotes all tokens before the t-th token; this is the objective function of permutation language modeling, in which the t-th token is predicted with the preceding t-1 tokens as context;
the two-stream attention mechanism includes a query representation unit and a content representation unit:
[formula not preserved in the source text] wherein one term is the additionally input position information of the prediction target word, and another term represents the correlation between the positions in the text sequence;
the Transformer-XL core component comprises:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{dim}}\right)V$$
wherein Q, K and V are the input word vector matrices, and dim is the input vector dimension.
Preferably, the construction process of the Neo4j database in S4 includes:
classifying and labeling the training corpus data and a pre-acquired National Health Commission standard data set to form triple data, and storing the triple data in the Neo4j database.
Preferably, the BiGRU-CRF sub-model in S3 includes:
$$z_t = \sigma(w_z \cdot [h_{t-1}, x_t])$$
$$r_t = \sigma(w_r \cdot [h_{t-1}, x_t])$$
$$\tilde{h}_t = \tanh(w_{\tilde{h}} \cdot [r_t * h_{t-1}, x_t])$$
$$h_t = z_t * h_{t-1} + (1 - z_t) * \tilde{h}_t$$
wherein $x_t$ represents the input vector at the current time t, i.e. the feature vector of the t-th word in the electronic medical record corpus to be recognized; $h_t$ and $h_{t-1}$ represent the hidden layer state vectors at the current time t and at the previous time, respectively; $\tilde{h}_t$ represents the candidate hidden layer state at the current time t, i.e. the new memory at the current time; $z_t$ represents the update gate, which controls the extent to which the state information of the previous time is carried into the current state, a larger value of $z_t$ meaning that more state information of the previous time is retained; $r_t$ represents the reset gate, which controls the extent to which the state information of the previous time is ignored, a smaller value of $r_t$ meaning that more is discarded; $w_z$, $w_r$ and $w_{\tilde{h}}$ represent the weight matrices of the update gate, the reset gate and the candidate hidden state, respectively; $\sigma$ represents the sigmoid nonlinear activation function, tanh represents the tanh activation function, and $*$ represents the element-wise product of vectors;
the output vector of the BiGRU network coding unit is Z, which is normalized by softmax and then input to the CRF layer.
Preferably, the BiGRU-CRF sub-model in S3 further comprises:
for a given input sequence X, the score of predicting the output tag sequence y is defined as S(X, y), where $y = (y_1, y_2, \ldots, y_n)$ denotes the tag sequence of a sentence containing n words, and S(X, y) is calculated as follows:
$$S(X, y) = \sum_{t=1}^{n}\left(Z_{t, y_t} + A_{y_{t-1}, y_t}\right)$$
wherein $Z_{t, y_t}$ is an element of the output vector Z of the BiGRU network coding unit; $A_{y_{t-1}, y_t}$ is an element of the probability transition matrix output by the CRF layer and represents the transition probability from label $y_{t-1}$ to label $y_t$, so that the dependencies between labels are used to obtain more reasonable label sequences. It can be seen that the score of the whole tag sequence y is the sum of the scores at each position, and the score at each position consists of two parts: one from the output probability matrix of the BiGRU network coding unit, and the other from the transition probability matrix output by the CRF layer. After the above formula is normalized, the final prediction probability of the label sequence y is obtained as follows:
$$p(y \mid X) = \frac{\exp(S(X, y))}{\sum_{y' \in Y} \exp(S(X, y'))}$$
where Y represents all possible tag sequences.
Preferably, S5 further includes:
extracting the standard entities corresponding to the related triples ranked second and third by similarity score, and using them as similar entities for reference when determining the electronic medical record standard entity.
An electronic medical record named entity standardization system based on an XLNet-BiGRU-CRF model comprises:
a preprocessing module, used for acquiring and preprocessing an electronic medical record corpus to be recognized;
an acquisition module, used for inputting the preprocessed electronic medical record corpus to be recognized into an XLNet sub-model to obtain a first Embedding word vector, wherein the XLNet model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component;
a recognition module, used for inputting the first Embedding word vector into a BiGRU-CRF sub-model to obtain an entity recognition result corresponding to the electronic medical record corpus to be recognized;
an extraction module, used for extracting, according to the entity recognition result, a plurality of related triples for the corresponding entity from a preset Neo4j database, wherein each triple comprises an original entity, an entity category and a standard entity;
a comparison module, used for comparing, by cosine similarity, the first Embedding word vector with the second Embedding word vectors corresponding to the plurality of related triples, and taking the standard entity corresponding to the word with the highest similarity score as a target mapping entity result;
and a mapping module, used for taking a preset standard table as a reference table and mapping the target mapping entity result to the reference table to obtain a final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual labeling.
A storage medium storing a computer program for electronic medical record named entity standardization based on an XLNet-BiGRU-CRF model, wherein the computer program causes a computer to execute the electronic medical record named entity standardization method as described above.
An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the electronic medical record named entity standardization method as described above.
(III) Advantageous effects
The invention provides an electronic medical record named entity standardization method, system, storage medium and electronic equipment based on an XLNet-BiGRU-CRF model. Compared with the prior art, the method has the following beneficial effects:
The preprocessed electronic medical record corpus to be recognized is input into an XLNet sub-model to obtain a first Embedding word vector; the first Embedding word vector is input into a BiGRU-CRF sub-model to obtain an entity recognition result corresponding to the corpus; a plurality of related triples for the corresponding entity are extracted from a preset Neo4j database; the first Embedding word vector is compared, by cosine similarity, with the second Embedding word vectors corresponding to the related triples, and the standard entity corresponding to the word with the highest similarity score is taken as a target mapping entity result; and the target mapping entity result is mapped to a reference table to obtain a final electronic medical record standard entity. As a result, retrieving the diagnoses of any doctor no longer returns incomplete results caused by differing writing habits: doctors can keep entering data quickly and in their habitual way during clinical work, while all different written forms that carry the same medical meaning are recognized as equivalent in data presentation and statistics.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of an electronic medical record named entity standardization method based on an XLNet-BiGRU-CRF model according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of an electronic medical record named entity standardization system based on an XLNet-BiGRU-CRF model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the application provides an electronic medical record named entity standardization method, system, storage medium and electronic equipment based on an XLNet-BiGRU-CRF model, and solves the technical problem that the medical named entity identification performance needs to be improved.
In order to solve the technical problems, the general idea of the embodiment of the application is as follows:
The preprocessed electronic medical record corpus to be recognized is input into an XLNet sub-model to obtain a first Embedding word vector; the first Embedding word vector is input into a BiGRU-CRF sub-model to obtain an entity recognition result corresponding to the corpus; a plurality of related triples for the corresponding entity are extracted from a preset Neo4j database; the first Embedding word vector is compared, by cosine similarity, with the second Embedding word vectors corresponding to the related triples, and the standard entity corresponding to the word with the highest similarity score is taken as a target mapping entity result; and the target mapping entity result is mapped to a reference table to obtain a final electronic medical record standard entity. As a result, retrieving the diagnoses of any doctor no longer returns incomplete results caused by differing writing habits: doctors can keep entering data quickly and in their habitual way during clinical work, while all different written forms that carry the same medical meaning are recognized as equivalent in data presentation and statistics.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Embodiment:
In a first aspect, as shown in FIG. 1, an embodiment of the present invention provides an electronic medical record named entity standardization method based on an XLNet-BiGRU-CRF model, including:
S1, acquiring and preprocessing an electronic medical record corpus to be recognized;
S2, inputting the preprocessed electronic medical record corpus to be recognized into an XLNet sub-model to obtain a first Embedding word vector, wherein the XLNet model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component;
S3, inputting the first Embedding word vector into a BiGRU-CRF sub-model to obtain an entity recognition result corresponding to the electronic medical record corpus to be recognized;
S4, extracting, according to the entity recognition result, a plurality of related triples for the corresponding entity from a preset Neo4j database, wherein each triple comprises an original entity, an entity category and a standard entity;
S5, comparing, by cosine similarity, the first Embedding word vector with the second Embedding word vectors corresponding to the plurality of related triples, and taking the standard entity corresponding to the word with the highest similarity score as a target mapping entity result;
S6, taking a preset standard table as a reference table and mapping the target mapping entity result to the reference table to obtain a final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual labeling.
With the embodiment of the invention, retrieving the diagnoses of any doctor no longer returns incomplete results caused by differing writing habits: doctors can keep entering data quickly and in their habitual way during clinical work, while all different written forms that carry the same medical meaning are recognized as equivalent in data presentation and statistics.
The following will describe each step of the above technical solution in detail with reference to specific contents:
S1, acquiring and preprocessing the electronic medical record corpus to be recognized.
The preprocessing comprises desensitizing the electronic medical record corpus to be recognized and manually sequence-labeling it.
Desensitization refers to pruning the content of the electronic medical record so as to reduce interference from entities unrelated to clinical information, without changing the semantics of the record and while preserving its authenticity. Because the electronic medical record contains private information such as the patient's name, age and address, the patient information must be desensitized to protect patient privacy, yielding a real clinical medical record corpus with private information removed.
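The desensitization step can be illustrated with a minimal Python sketch. The record layout (semicolon-separated `Name:`, `Age:` and `Address:` fields), the regular expressions and the placeholder tokens are all hypothetical and chosen only for illustration; the source does not specify how desensitization is implemented.

```python
import re

# Hypothetical field patterns: each replaces the private value with a placeholder
# while leaving the clinical content of the record untouched.
PATTERNS = [
    (re.compile(r"(Name:\s*)[^;]+"), r"\1[NAME]"),
    (re.compile(r"(Age:\s*)\d+"), r"\1[AGE]"),
    (re.compile(r"(Address:\s*)[^;]+"), r"\1[ADDRESS]"),
]

def desensitize(record: str) -> str:
    """Mask name, age and address fields in a semicolon-separated record."""
    for pattern, replacement in PATTERNS:
        record = pattern.sub(replacement, record)
    return record

record = ("Name: Zhang San; Age: 57; Address: 12 Example Road; "
          "Complaint: headache for three days")
clean = desensitize(record)
print(clean)
```

Only the three private fields are rewritten; the complaint text is passed through unchanged, which matches the stated goal of keeping the clinical semantics intact.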
Manual sequence labeling refers to the manual entity annotation of unstructured electronic medical record data during early-stage data preparation. In the labeling process, entities related to clinical medicine are taken as the objects and are classified, according to the entity label format, into categories such as diseases, symptoms, treatments, examinations and body parts.
The embodiment of the invention greatly expands the scale of the pre-training data and filters it for quality. For the pre-training data used by the subsequent XLNet, data from Giga5, ClueWeb and Common Crawl are introduced in addition to the BooksCorpus and English Wikipedia data, and some low-quality data are removed; the resulting sizes are 16G, 19G and 78G respectively.
S2, inputting the preprocessed electronic medical record corpus to be recognized into an XLNet sub-model to obtain a first Embedding word vector, wherein the XLNet model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component.
An XLNet Chinese model is trained on a large-scale electronic medical record corpus. Because XLNet is pre-trained on large-scale corpora, it can generate dynamic word vectors according to context, unlike a traditional static word vector model; its semantic encoding is therefore more accurate, which greatly improves the accuracy of the named entity recognition task.
The permutation language model is introduced to randomly shuffle the order of the Chinese characters in a sentence, so that, by changing the positions of the words, bidirectional context is arranged into a single direction. The permutation language model includes:
$$\max_{\theta}\ \mathbb{E}_{\alpha}\left[\sum_{t=1}^{T} \log p_{\theta}\left(x_{\alpha_t} \mid x_{\alpha_{<t}}\right)\right]$$
wherein $\mathbb{E}_{\alpha}$ denotes the expectation over all permutations $\alpha$ of a sequence of length $T$, $p_{\theta}$ is the conditional probability, $x_{\alpha_t}$ is the t-th token in the factorization order, and $x_{\alpha_{<t}}$ denotes all tokens before the t-th token; this is the objective function of permutation language modeling, in which the t-th token is predicted with the preceding t-1 tokens as context.
Introducing the permutation language model overcomes the drawback that a traditional autoregressive model cannot learn from the preceding and following context at the same time, but it also brings a new problem: text position information is lost.
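To make the factorization order concrete, the following Python sketch samples one permutation of the positions and lists which positions serve as context for each prediction. It is only an illustration of the idea under simple assumptions: the function name `permutation_contexts`, the English sample tokens and the fixed random seed are hypothetical, and no model is trained.

```python
import random

def permutation_contexts(tokens, seed=0):
    """Sample one factorization order and list (position, token, context) triples.

    The tokens keep their original positions; only the order in which they are
    predicted is shuffled. The context of the t-th prediction is the set of
    positions that come earlier in the sampled order.
    """
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # one sampled permutation of the positions
    pairs = []
    for t, pos in enumerate(order):
        context_positions = sorted(order[:t])  # positions before step t
        pairs.append((pos, tokens[pos], context_positions))
    return order, pairs

tokens = ["patient", "reports", "severe", "headache"]
order, pairs = permutation_contexts(tokens)
for pos, tok, ctx in pairs:
    print(f"predict position {pos} ({tok}) from positions {ctx}")
```

Across many sampled orders, each position is predicted from contexts that lie on both sides of it, which is how a left-to-right objective can still expose bidirectional context.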
So that the model can learn the position information of the sequence, a two-stream attention mechanism is introduced in XLNet.
The two-stream self-attention mechanism of XLNet uses two feature representation units: a content representation unit and a query representation unit. The content representation unit represents the preceding context and includes the current word. The query representation unit represents the preceding context excluding the current word: it contains the position information of the current word but cannot access the content of the current word. The content representation unit and the query representation unit form two information streams that are passed upward layer by layer, and finally the information of the query unit is output. The two-stream attention is calculated as follows:
[formula not preserved in the source text] wherein one term is the additionally input position information of the prediction target word, and another term represents the correlation between the positions in the text sequence.
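The difference between the two streams can be sketched as a pair of attention masks. The Python sketch below assumes the usual reading of two-stream attention: for a given factorization order, the content stream of a position may attend to itself and to earlier positions in the order, while the query stream may attend only to strictly earlier positions. The function name and the sample order are hypothetical.

```python
def two_stream_masks(order):
    """Build (content_mask, query_mask) for one factorization order.

    mask[i][j] is True when position i may attend to position j.
    """
    n = len(order)
    rank = [0] * n
    for step, pos in enumerate(order):
        rank[pos] = step  # step at which each position is predicted
    # Content stream: sees earlier positions and its own content.
    content = [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]
    # Query stream: sees earlier positions only, never its own content.
    query = [[rank[j] < rank[i] for j in range(n)] for i in range(n)]
    return content, query

order = [2, 0, 3, 1]  # position 2 is predicted first, position 1 last
content_mask, query_mask = two_stream_masks(order)
for i in range(len(order)):
    visible = [j for j, ok in enumerate(query_mask[i]) if ok]
    print(f"query stream at position {i} attends to {visible}")
```

The diagonal is what distinguishes the streams: it is always set in the content mask and never set in the query mask, so the query stream knows where the target is without seeing what it is.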
The XLNet Chinese model takes the Transformer-XL framework as its core and introduces a recurrence mechanism and relative position encoding, so that it can make better use of contextual semantic information and mine the latent relations in the text vectors. The XLNet Chinese model is trained on large-scale unlabeled data to obtain the corresponding model parameters, and the feature vector representation of an input sequence is then obtained by inference.
The Transformer-XL core component comprises:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{dim}}\right)V$$
wherein Q, K and V are the input word vector matrices, and dim is the input vector dimension.
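As an illustration of the attention computation over Q, K and V, here is a minimal pure-Python sketch of scaled dot-product attention, assuming the standard form softmax(QK^T / sqrt(dim)) V. It omits the recurrence mechanism and relative position encoding of Transformer-XL, and the small matrices are made-up values.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    dim = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
print(result)
```

Each output row is a convex combination of the rows of V, weighted toward the value whose key best matches the query.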
Further, the feature vectors output by the XLNet Chinese model are input to the BiGRU network, which controls the passing and blocking of information through gates.
S3, inputting the first Embedding word vector into a BiGRU-CRF sub-model to obtain an entity recognition result corresponding to the electronic medical record corpus to be recognized.
The BiGRU-CRF sub-model comprises:
$$z_t = \sigma(w_z \cdot [h_{t-1}, x_t])$$
$$r_t = \sigma(w_r \cdot [h_{t-1}, x_t])$$
$$\tilde{h}_t = \tanh(w_{\tilde{h}} \cdot [r_t * h_{t-1}, x_t])$$
$$h_t = z_t * h_{t-1} + (1 - z_t) * \tilde{h}_t$$
wherein $x_t$ represents the input vector at the current time t, i.e. the feature vector of the t-th word in the electronic medical record corpus to be recognized; $h_t$ and $h_{t-1}$ represent the hidden layer state vectors at the current time t and at the previous time, respectively; $\tilde{h}_t$ represents the candidate hidden layer state at the current time t, i.e. the new memory at the current time; $z_t$ represents the update gate, which controls the extent to which the state information of the previous time is carried into the current state, a larger value of $z_t$ meaning that more state information of the previous time is retained; $r_t$ represents the reset gate, which controls the extent to which the state information of the previous time is ignored, a smaller value of $r_t$ meaning that more is discarded; $w_z$, $w_r$ and $w_{\tilde{h}}$ represent the weight matrices of the update gate, the reset gate and the candidate hidden state, respectively; $\sigma$ represents the sigmoid nonlinear activation function, tanh represents the tanh activation function, and $*$ represents the element-wise product of vectors.
The output vector of the BiGRU network coding unit is Z, which is normalized by softmax and then input to the CRF layer.
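The gate equations can be traced with a small pure-Python sketch of a GRU step and a bidirectional pass. It assumes the common candidate-state form tanh(w · [r * h_prev, x]) and an interpolation in which a larger update gate keeps more of the previous state, as the text describes; bias terms are omitted, and the helper names, dimensions and random weights are hypothetical.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU step following the update-gate and reset-gate equations."""
    concat = h_prev + x                               # [h_{t-1}, x_t]
    z = [sigmoid(v) for v in matvec(Wz, concat)]      # update gate
    r = [sigmoid(v) for v in matvec(Wr, concat)]      # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x
    h_cand = [math.tanh(v) for v in matvec(Wh, gated)]  # candidate state
    # Assumed interpolation: larger z keeps more of the previous state.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

def bigru(xs, hidden, params_fwd, params_bwd):
    """Run a forward and a backward GRU and concatenate their states."""
    def run(seq, params):
        h, states = [0.0] * hidden, []
        for x in seq:
            h = gru_step(x, h, *params)
            states.append(h)
        return states
    fwd = run(xs, params_fwd)
    bwd = run(xs[::-1], params_bwd)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

rng = random.Random(0)
hidden, input_dim = 2, 3

def init():
    return [[rng.uniform(-0.5, 0.5) for _ in range(hidden + input_dim)]
            for _ in range(hidden)]

params_fwd = (init(), init(), init())
params_bwd = (init(), init(), init())
xs = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
Z = bigru(xs, hidden, params_fwd, params_bwd)
print(len(Z), len(Z[0]))
```

Each row of `Z` concatenates the forward and backward hidden states for one token, so its length is twice the hidden size; in the described model a further projection to tag scores would precede the CRF layer.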
The BiGRU-CRF sub-model further comprises:
for a given input sequence X, the score of predicting the output tag sequence y is defined as S(X, y), where $y = (y_1, y_2, \ldots, y_n)$ denotes the tag sequence of a sentence containing n words, and S(X, y) is calculated as follows:
$$S(X, y) = \sum_{t=1}^{n}\left(Z_{t, y_t} + A_{y_{t-1}, y_t}\right)$$
wherein $Z_{t, y_t}$ is an element of the output vector Z of the BiGRU network coding unit; $A_{y_{t-1}, y_t}$ is an element of the probability transition matrix output by the CRF layer and represents the transition probability from label $y_{t-1}$ to label $y_t$, so that the dependencies between labels are used to obtain more reasonable label sequences. It can be seen that the score of the whole tag sequence y is the sum of the scores at each position, and the score at each position consists of two parts: one from the output probability matrix of the BiGRU network coding unit, and the other from the transition probability matrix output by the CRF layer. After the above formula is normalized, the final prediction probability of the label sequence y is obtained as follows:
$$p(y \mid X) = \frac{\exp(S(X, y))}{\sum_{y' \in Y} \exp(S(X, y'))}$$
where Y represents all possible tag sequences.
The loss function of the CRF layer adopts a negative log-likelihood function, and the formula is as follows:
$$Loss = -\log p(y \mid X) = -S(X, y) + \log \sum_{y' \in Y} \exp(S(X, y'))$$
The Adam algorithm is used with the loss function of the CRF layer to update the parameters of the whole named entity recognition model, namely the parameters of the BiGRU neural network model and the CRF layer, while the parameters of the XLNet Chinese model are kept unchanged. Training terminates when the loss value produced by the model meets a set requirement or a set maximum number of iterations is reached.
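The sequence score, its normalization and the negative log-likelihood can be checked numerically with a small pure-Python sketch. The emission and transition values below are made up, the `start` vector stands in for the transition into the first tag and is an assumption not stated in the source, and the normalizer is computed with the standard forward algorithm rather than by any method the source prescribes.

```python
import math
from itertools import product

def sequence_score(emissions, transitions, start, tags):
    """S(X, y): emission score plus transition score summed over positions."""
    score = start[tags[0]] + emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score

def logsumexp(values):
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def log_partition(emissions, transitions, start):
    """log of the sum of exp(S(X, y')) over all tag sequences (forward algorithm)."""
    k = len(start)
    alpha = [start[j] + emissions[0][j] for j in range(k)]
    for t in range(1, len(emissions)):
        alpha = [logsumexp([alpha[i] + transitions[i][j] for i in range(k)])
                 + emissions[t][j] for j in range(k)]
    return logsumexp(alpha)

def negative_log_likelihood(emissions, transitions, start, tags):
    return (log_partition(emissions, transitions, start)
            - sequence_score(emissions, transitions, start, tags))

# Made-up scores for three tags (0 = O, 1 = B, 2 = I) over three tokens.
emissions = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 0.4, 1.8]]
transitions = [[0.5, 0.2, -2.0], [-0.5, -1.0, 1.0], [0.3, -0.2, 0.6]]
start = [0.0, 0.0, -2.0]
gold = [0, 1, 2]

log_z = log_partition(emissions, transitions, start)
loss = negative_log_likelihood(emissions, transitions, start, gold)
# Brute-force check: the probabilities of all 27 tag sequences sum to one.
total = sum(math.exp(sequence_score(emissions, transitions, start, list(y)) - log_z)
            for y in product(range(3), repeat=3))
print(round(loss, 4), round(total, 6))
```

The forward algorithm gives the same normalizer as enumerating every tag sequence, but in time linear in the sentence length, which is why it is the usual choice for training.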
The GRU has a simpler structure than the LSTM (one gate fewer) and therefore requires fewer matrix multiplications, so it can save considerable time when the training data are large.
S4, extracting, according to the entity recognition result, a plurality of related triples for the corresponding entity from a preset Neo4j database, wherein each triple comprises an original entity, an entity category and a standard entity.
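The triple structure in S4 can be pictured with a small in-memory Python sketch. It deliberately does not use Neo4j: the `Triple` type, the sample rows and the `related_triples` helper are hypothetical stand-ins for whatever graph schema and query the database actually uses, and selecting candidates by entity category is an assumption.

```python
from typing import List, NamedTuple

class Triple(NamedTuple):
    original: str   # entity as written in the record
    category: str   # entity category, e.g. disease or symptom
    standard: str   # standardized entity name

# Illustrative rows only; a real store would be loaded from the database.
TRIPLES = [
    Triple("T2DM", "disease", "type 2 diabetes mellitus"),
    Triple("type II diabetes", "disease", "type 2 diabetes mellitus"),
    Triple("high blood pressure", "disease", "essential hypertension"),
    Triple("head pain", "symptom", "headache"),
]

def related_triples(category: str, store: List[Triple] = TRIPLES) -> List[Triple]:
    """Return the candidate triples whose category matches the recognized entity."""
    return [t for t in store if t.category == category]

candidates = related_triples("disease")
for triple in candidates:
    print(triple.original, "->", triple.standard)
```

The candidates returned here are what the similarity comparison in S5 would then rank.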
The construction process of the Neo4j database comprises the following steps:
The training corpus data and a pre-acquired National Health Commission standard data set are classified and labeled to form triple data, which are stored in the Neo4j database. A labeled corpus for the named entity recognition model is also constructed. Because only entities need to be recognized, the labels ['O', 'B-LOC', 'I-LOC'] are used, where O represents a non-entity character, B-LOC represents the first character of an entity, and I-LOC represents a non-initial character of an entity.
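The labeling scheme can be shown with a short Python sketch that turns entity spans into 'O', 'B-LOC' and 'I-LOC' tags. The function name, the English sample tokens and the span format (start inclusive, end exclusive) are assumptions made for illustration; for Chinese text the same logic would typically be applied per character.

```python
def bio_tags(tokens, spans):
    """Assign O / B-LOC / I-LOC tags given entity spans over the token list.

    Each span is (start, end) with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B-LOC"           # first unit of the entity
        for i in range(start + 1, end):
            tags[i] = "I-LOC"           # remaining units of the entity
    return tags

tokens = ["patient", "reports", "chest", "pain", "today"]
tags = bio_tags(tokens, [(2, 4)])       # "chest pain" is the entity
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```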
S5, performing cosine similarity comparison between the first Embedding word vector and each of the second Embedding word vectors corresponding to the plurality of related triples, and taking the standard entity corresponding to the word with the highest similarity score as the target mapping entity result.
The calculation of the cosine similarity of the word vector comprises the following steps:
The first Embedding word vector is extracted using the XLNet Chinese model and compared by cosine similarity with each of the second Embedding word vectors of the related questions for the corresponding entity stored in the Neo4j database. The answer to the question with the highest similarity is taken as the target result, and the entities ranked second and third are kept as similar entities for reference during labeling and normalization.
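The cosine similarity is calculated as follows:
$$\mathrm{score} = \frac{V_{query} \cdot V_{corpus}}{\lVert V_{query} \rVert \, \lVert V_{corpus} \rVert}$$
where score is the similarity value, $V_{query}$ is the first Embedding word vector, and $V_{corpus}$ is the second Embedding word vector.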
In addition, this step also extracts the standard entities corresponding to the related triples ranked second and third by similarity score; these serve as similar entities for reference when determining the electronic medical record standard entity.
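Steps S4 and S5 can be sketched together as a ranking over candidate triples. The example below is a minimal illustration that replaces the Neo4j lookup with an in-memory list and uses tiny hand-made vectors instead of XLNet embeddings; `cosine_score`, `rank_candidates` and the entity names are hypothetical.

```python
import numpy as np

def cosine_score(v_query, v_corpus):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    denom = np.linalg.norm(v_query) * np.linalg.norm(v_corpus)
    return float(np.dot(v_query, v_corpus) / denom) if denom else 0.0

def rank_candidates(v_query, candidates, top_k=3):
    """candidates: list of (standard_entity, vector). Returns top_k (entity, score), best first."""
    scored = [(entity, cosine_score(v_query, vec)) for entity, vec in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Stand-in for the second Embedding word vectors of the related triples.
v_query = np.array([1.0, 0.0, 1.0])
candidates = [
    ("entity_a", np.array([1.0, 0.0, 1.0])),  # same direction as the query
    ("entity_b", np.array([1.0, 1.0, 1.0])),
    ("entity_c", np.array([0.0, 1.0, 0.0])),  # orthogonal to the query
    ("entity_d", np.array([1.0, 0.0, 0.0])),
]
ranked = rank_candidates(v_query, candidates)
target = ranked[0][0]                        # target mapping entity result
similar = [entity for entity, _ in ranked[1:]]  # second and third, kept for reference
```

Keeping the second- and third-ranked entities alongside the top match gives the manual reviewer nearby alternatives when the best score is not decisive.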
S6, taking a preset standard table as a reference table, mapping the target mapping entity result to the reference table, and obtaining the final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual labeling.
A standard table is established using national and international standard data sets. Specifically, the labeled and recognized entities in the collected data are standardized according to national, international and medical-industry standards, international disease guidelines and the like, for example ICD-10/ICD-9, HL7 CDA, Medical Subject Headings (MeSH), Logical Observation Identifiers Names and Codes (LOINC), the CFDA drug dictionary specification, the ATC classification, the National Health Commission's directory of diagnosis and treatment subjects for medical institutions, international tumor database structures, and international oncology diagnosis and treatment guidelines, so as to meet the requirements of subsequent service scenarios.
The entity-recognized data are then mapped against the reference table; the mapping process comprises machine algorithm processing and manual labeling.
When the standard table does not correspond to the actual data, a professional doctor decides whether to expand the standard table. In the standardization process, different doctors may write the same diagnosis in different words: for example, alternative names for leukemia, or AIDS and acquired immunodeficiency syndrome, are synonymous diagnostic words, while the diagnostic word "severe fatty liver" is a type of fatty liver and stands in an inclusion relationship to it. Through standardization, each clinical diagnosis is eventually mapped one-to-one to ICD11. For example, the diagnosis "hypothyroidism" is normalized to the ICD11 disease "hypothyroidism", with code E03.901.
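The interplay of machine mapping and manual review can be sketched as a table lookup with a fallback queue. This is a simplified illustration: the reference table is an in-memory dictionary, the row layout and the names `map_to_standard`, `reference_table` and `review_queue` are hypothetical, and only the entries mentioned in the text are included.

```python
def map_to_standard(entity, reference_table, review_queue):
    """Return the reference-table row for an entity, or queue it for manual review."""
    row = reference_table.get(entity)
    if row is None:
        # No match: a professional doctor decides whether to expand the table.
        review_queue.append(entity)
    return row

reference_table = {
    # Code as given in the text.
    "hypothyroidism": {"standard_entity": "hypothyroidism", "code": "E03.901"},
    # Synonym pair from the text; no code is given there, so none is filled in.
    "AIDS": {"standard_entity": "acquired immunodeficiency syndrome", "code": None},
}
review_queue = []

hit = map_to_standard("hypothyroidism", reference_table, review_queue)
synonym = map_to_standard("AIDS", reference_table, review_queue)
miss = map_to_standard("unlisted diagnosis", reference_table, review_queue)
```

Entities that fall through to `review_queue` correspond to the cases where the standard table does not cover the actual data and expert judgment is required.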
In particular, for a new named entity recognition task, the algorithm model described above can be used to process the fields to be processed. After the algorithm model has labeled the data, the labels are rechecked manually and the recheck results are fed back to the algorithm model, which reduces the model's errors and improves its accuracy.
In a second aspect, as shown in FIG. 2, an embodiment of the present invention provides an electronic medical record named entity standardization system based on an XLNet-BiGRU-CRF model, including:
a preprocessing module, used for acquiring and preprocessing the electronic medical record corpus to be recognized;
an acquisition module, used for inputting the preprocessed electronic medical record corpus to be recognized into an XLNet sub-model to obtain a first Embedding word vector, wherein the XLNet model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component;
a recognition module, used for inputting the first Embedding word vector into a BiGRU-CRF sub-model and obtaining the entity recognition result corresponding to the electronic medical record corpus to be recognized;
an extraction module, used for extracting, according to the entity recognition result, a plurality of related triples for the corresponding entity from a preset Neo4j database, wherein each triple comprises an original entity, an entity category and a standard entity;
a comparison module, used for performing cosine similarity comparison between the first Embedding word vector and each of the second Embedding word vectors corresponding to the plurality of related triples, and taking the standard entity corresponding to the word with the highest similarity score as the target mapping entity result;
and a mapping module, used for taking a preset standard table as a reference table and mapping the target mapping entity result to the reference table to obtain the final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual labeling.
In a third aspect, an embodiment of the present invention provides a storage medium storing a computer program for electronic medical record named entity standardization based on an XLNet-BiGRU-CRF model, where the computer program causes a computer to execute the electronic medical record named entity standardization method described above.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the electronic medical record named entity standardization method as described above.
In summary, compared with the prior art, the method has the following beneficial effects:
According to the embodiments of the invention, incomplete retrieval results caused by differing writing habits are avoided when any doctor's diagnoses are retrieved. The method therefore preserves the speed of clinical data entry and accommodates doctors' habits, while ensuring that all different written forms with the same medical meaning are recognized as equivalent in data display and statistics.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. An electronic medical record named entity standardization method based on an XLNet-BiGRU-CRF model, characterized by comprising the following steps:
S1, acquiring and preprocessing the electronic medical record corpus to be recognized;
S2, inputting the preprocessed electronic medical record corpus to be recognized into an XLNet sub-model to obtain a first Embedding word vector, wherein the XLNet model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component;
S3, inputting the first Embedding word vector into a BiGRU-CRF sub-model, and obtaining the entity recognition result corresponding to the electronic medical record corpus to be recognized;
S4, according to the entity recognition result, extracting from a preset Neo4j database a plurality of related triples for the corresponding entity, wherein each triple comprises an original entity, an entity category and a standard entity;
S5, performing cosine similarity comparison between the first Embedding word vector and each of the second Embedding word vectors corresponding to the plurality of related triples, and taking the standard entity corresponding to the word with the highest similarity score as the target mapping entity result;
and S6, taking a preset standard table as a reference table, mapping the target mapping entity result to the reference table, and obtaining the final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual labeling.
2. The method as claimed in claim 1, wherein the preprocessing in S1 includes desensitizing the electronic medical record corpus to be recognized and manually labeling it as sequences.
3. The method for standardizing named entities of electronic medical records according to claim 1, wherein the permutation language model in S2 comprises:
wherein the expectation is taken over all permutations (factorization orders), $p_\theta$ is the conditional probability, the predicted term is the t-th token in the factorization order, and $x_{\alpha_{<t}}$ denotes all tokens before the t-th token; this is the objective function of permutation language modeling, in which the t-th token is predicted using the preceding t-1 tokens as context;
the two-stream attention mechanism comprises a query representation unit and a content representation unit:
wherein one term is the additionally input position information of the prediction target word, and the other represents the correlation between positions in the text sequence;
the Transformer-XL core component comprises:
wherein Q, K and V are the input word vector matrices, and dim is the input vector dimension.
4. The method for standardizing named entities of electronic medical records as claimed in claim 3, wherein the construction process of Neo4j database in S4 comprises:
classifying and labeling the data of the training corpus and a pre-acquired National Health Commission standard data set to form triple data, and storing the triple data in the Neo4j database.
5. The method for standardizing named entities of electronic medical records as claimed in claim 3, wherein the BiGRU-CRF submodel in S5 comprises:
$z_t = \sigma(w_z \cdot [h_{t-1}, x_t])$
$r_t = \sigma(w_r \cdot [h_{t-1}, x_t])$
wherein $x_t$ represents the input vector at the current time t, namely the feature vector of the t-th word in the electronic medical record corpus to be recognized; $h_t$ and $h_{t-1}$ represent the hidden layer state vectors at the current time t and at the previous time, respectively; $\tilde{h}_t$ represents the candidate hidden layer state at the current time t, which is also the new memory at the current time; $z_t$ represents the update gate, which controls the degree to which the state information of the previous time is carried into the current state; $r_t$ represents the reset gate, which controls the degree to which the state information of the previous time is ignored; $w_z$, $w_r$ and $w_{\tilde{h}}$ represent the weight matrices of the update gate, the reset gate and the candidate hidden state, respectively; σ represents the sigmoid nonlinear activation function, tanh represents the tanh activation function, and * represents the element-wise product of vectors;
the output of the BiGRU network coding unit is Z, and the output Z is normalized by softmax and then input to the CRF layer.
6. The method for standardizing named entities in electronic medical records according to claim 5, wherein the BiGRU-CRF submodel in S5 further comprises:
for a given input sequence X, the score of predicting the output tag sequence y is defined as S(X, y), where $y = (y_1, y_2, \ldots, y_n)$ denotes the tag sequence for a sentence of n words, and S(X, y) is calculated as follows:
wherein the first term is an element of the output Z of the BiGRU network coding unit, and the second term is an element of the transition probability matrix output by the CRF layer, representing the transition probability from label $y_{t-1}$ to label $y_t$; after normalization of the above formula, the final prediction probability of the tag sequence y is obtained as follows:
where Y represents all possible tag sequences.
7. The method for standardizing named entities in electronic medical records according to any one of claims 1-6, wherein the step of S5 further comprises:
extracting the standard entities corresponding to the related triples ranked second and third by similarity score, and using them as similar entities for reference when determining the electronic medical record standard entity.
8. An electronic medical record named entity standardization system based on an XLNet-BiGRU-CRF model, characterized by comprising:
a preprocessing module, used for acquiring and preprocessing the electronic medical record corpus to be recognized;
an acquisition module, used for inputting the preprocessed electronic medical record corpus to be recognized into an XLNet sub-model to obtain a first Embedding word vector, wherein the XLNet model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component;
a recognition module, used for inputting the first Embedding word vector into a BiGRU-CRF sub-model and obtaining the entity recognition result corresponding to the electronic medical record corpus to be recognized;
an extraction module, used for extracting, according to the entity recognition result, a plurality of related triples for the corresponding entity from a preset Neo4j database, wherein each triple comprises an original entity, an entity category and a standard entity;
a comparison module, used for performing cosine similarity comparison between the first Embedding word vector and each of the second Embedding word vectors corresponding to the plurality of related triples, and taking the standard entity corresponding to the word with the highest similarity score as the target mapping entity result;
and a mapping module, used for taking a preset standard table as a reference table and mapping the target mapping entity result to the reference table to obtain the final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual labeling.
9. A storage medium storing a computer program for electronic medical record named entity standardization based on XLNet-BiGRU-CRF model, wherein the computer program causes a computer to execute the electronic medical record named entity standardization method according to any one of claims 1 to 7.
10. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the electronic medical record named entity standardization method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210049938.7A CN114582449A (en) | 2022-01-17 | 2022-01-17 | Electronic medical record named entity standardization method and system based on XLNet-BiGRU-CRF model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114582449A (en) | 2022-06-03 |
Family
ID=81768800
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||