CN114582449A - Electronic medical record named entity standardization method and system based on XLNet-BiGRU-CRF model - Google Patents


Info

Publication number
CN114582449A
Authority
CN
China
Prior art keywords
entity
electronic medical
medical record
data
bigru
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210049938.7A
Other languages
Chinese (zh)
Inventor
杨雨
张培龙
李华
王显荣
刘玉林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University
Original Assignee
Inner Mongolia University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University
Priority to CN202210049938.7A
Publication of CN114582449A

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The invention provides a method, system, storage medium and electronic device for standardizing named entities in electronic medical records based on an XLNet-BiGRU-CRF model, and relates to the technical field of data processing. The method computes the cosine similarity between a first embedding word vector and the second embedding word vectors corresponding to a plurality of related triples, and takes the standard entity corresponding to the word with the highest similarity score as the target mapping entity result. The target mapping entity result is then mapped to a reference table to obtain the final electronic medical record standard entity. As a result, retrieving any doctor's diagnoses no longer returns incomplete results caused by differing writing habits: doctors can keep entering records quickly and in their accustomed way, while all variant spellings that carry the same medical meaning are recognized as equivalent in data display and statistics.

Description

Electronic medical record named entity standardization method and system based on XLNet-BiGRU-CRF model
Technical Field
The invention relates to the technical field of data processing, and in particular to an electronic medical record named entity standardization method, system, storage medium and electronic device based on an XLNet-BiGRU-CRF model.
Background
An electronic medical record is a medical record stored, managed and transmitted by a computer information system. It contains the digitized information recorded by medical staff while diagnosing and treating patients, such as medical history, clinical manifestations and treatment methods. Because most electronic medical records are semi-structured or unstructured, their analysis and data mining are severely restricted. Named entity recognition, an important branch of natural language processing, discovers proper nouns and meaningful terms in natural text and classifies them into predefined categories. Applying named entity recognition to electronic medical record text aims to automatically recognize, classify and standardize the medical named entities it contains.
Traditional research on named entity recognition for electronic medical records falls into three groups: dictionary- and rule-based methods, statistical machine learning methods, and deep learning methods. Dictionary- and rule-based methods require entity extraction rules to be constructed manually from phrase collocation patterns and lexical features; they can work well in specific fields, but they need a large amount of expert knowledge and have low recall. Statistical machine learning methods include hidden Markov models, support vector machines, conditional random fields, maximum entropy models and the like. They define a feature set from an annotated training set and train a statistical model with a traditional machine learning algorithm, so recognition performance depends closely on the designed features. Deep learning methods have been widely applied and have made breakthroughs in recent years; they include recurrent neural networks (RNNs), convolutional neural networks (CNNs), gated recurrent units (GRUs) and the like. Compared with traditional machine learning models, deep learning methods can learn high-dimensional, deep feature representations, which helps improve the generalization ability of entity recognition.
However, medical named entity recognition is domain-specific: it aims to identify important concepts in electronic medical records, such as symptoms and disease names. Recognizing and standardizing named entities in electronic medical records still faces difficulties. Compared with general-domain text, medical named entities (1) are long; (2) contain many rare words; and (3) are frequently nested within one another. Recognizing named entities in electronic medical records is therefore a challenging task, and recognition performance needs further improvement.
Disclosure of Invention
(I) Technical problem to be solved
In view of the deficiencies of the prior art, the invention provides an electronic medical record named entity standardization method, system, storage medium and electronic device based on an XLNet-BiGRU-CRF model, which address the technical problem that medical named entity recognition performance needs to be improved.
(II) Technical solution
To achieve the above purpose, the invention is realized by the following technical solution:
An electronic medical record named entity standardization method based on an XLNet-BiGRU-CRF model comprises the following steps:
S1, acquiring and preprocessing the electronic medical record corpus to be recognized;
S2, inputting the preprocessed corpus into an XLNet sub-model to obtain a first embedding word vector, wherein the XLNet sub-model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component;
S3, inputting the first embedding word vector into a BiGRU-CRF sub-model to obtain the entity recognition result for the corpus;
S4, according to the entity recognition result, extracting from a preset Neo4j database a plurality of related triples containing the corresponding entities, wherein each triple comprises an original entity, an entity category and a standard entity;
S5, computing the cosine similarity between the first embedding word vector and the second embedding word vector of each related triple, and taking the standard entity corresponding to the word with the highest similarity score as the target mapping entity result;
S6, taking a preset standard table as a reference table and mapping the target mapping entity result to the reference table to obtain the final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual annotation.
Preferably, the preprocessing in S1 comprises desensitizing the electronic medical record corpus to be recognized and manually annotating it with sequence labels.
Preferably, the permutation language model in S2 includes:
$$\max_{\theta}\; \mathbb{E}_{\alpha \sim \mathcal{Z}_T}\left[\sum_{t=1}^{T} \log p_{\theta}\left(x_{\alpha_t} \mid x_{\alpha_{<t}}\right)\right]$$
where $\mathbb{E}_{\alpha \sim \mathcal{Z}_T}$ denotes the expectation over all permutations $\mathcal{Z}_T$ of a sequence of length $T$, $p_{\theta}$ is the conditional probability, $x_{\alpha_t}$ is the t-th token in the factorization order, and $x_{\alpha_{<t}}$ denotes all tokens before the t-th token. This is the objective function of permutation language modeling: the t-th token is predicted using the preceding t-1 tokens as context;
the two-stream attention mechanism includes a query representation unit and a content representation unit:
$$g_{\alpha_t}^{(m)} \leftarrow \mathrm{Attention}\left(Q = g_{\alpha_t}^{(m-1)},\; KV = h_{\alpha_{<t}}^{(m-1)};\; \theta\right)$$
$$h_{\alpha_t}^{(m)} \leftarrow \mathrm{Attention}\left(Q = h_{\alpha_t}^{(m-1)},\; KV = h_{\alpha_{\le t}}^{(m-1)};\; \theta\right)$$
where $g$ is the additionally input position information of the prediction target word, $h$ represents the correlation between the positions in the text sequence, and $m$ indexes the attention layer;
the Transformer-XL core component includes:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{dim}}\right)V$$
where Q, K and V are the input word vector matrices, and dim is the dimension of the input vectors.
Preferably, the construction process of the Neo4j database in S4 includes:
classifying and annotating the training corpus data together with a pre-acquired standard dataset from the National Health Commission (Weijianwei) to form triple data, and storing the triple data in the Neo4j database.
Preferably, the BiGRU-CRF sub-model in S3 includes:
$$z_t = \sigma\left(w_z \cdot [h_{t-1}, x_t]\right)$$
$$r_t = \sigma\left(w_r \cdot [h_{t-1}, x_t]\right)$$
$$\tilde{h}_t = \tanh\left(w_{\tilde{h}} \cdot [r_t \odot h_{t-1}, x_t]\right)$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$
where $x_t$ is the input vector at the current time t, i.e. the feature vector of the t-th word in the electronic medical record corpus to be recognized; $h_t$ and $h_{t-1}$ are the hidden state vectors at the current time t and the previous time, respectively; $\tilde{h}_t$ is the candidate hidden state at the current time t, i.e. the new memory at the current time; $z_t$ is the update gate, which controls how much state information from the previous time is carried into the current state: the larger $z_t$, the more previous state information is retained; $r_t$ is the reset gate, which controls how much state information from the previous time is ignored: the smaller $r_t$, the more is discarded; $w_z$, $w_r$ and $w_{\tilde{h}}$ are the weight matrices of the update gate, the reset gate and the candidate hidden state, respectively; $\sigma$ denotes the sigmoid activation function, $\tanh$ denotes the tanh activation function, and $\odot$ denotes the element-wise product of vectors;
the output of the BiGRU network coding unit is denoted Z; Z is normalized with softmax and then input to the CRF layer.
Preferably, the BiGRU-CRF sub-model in S3 further includes:
for a given input sequence X, the score of predicting the output tag sequence y is defined as S(X, y), where $y = (y_1, y_2, \ldots, y_n)$ is the tag sequence of a sentence with n words. S(X, y) is calculated as:
$$S(X, y) = \sum_{t=1}^{n} Z_{t, y_t} + \sum_{t=2}^{n} A_{y_{t-1}, y_t}$$
where $Z_{t, y_t}$ is an element of the output Z of the BiGRU network coding unit, and $A_{y_{t-1}, y_t}$ is an element of the transition matrix output by the CRF layer, representing the transition score from tag $y_{t-1}$ to tag $y_t$, so that the dependencies between tags are used to obtain more reasonable tag sequences. The score of the whole tag sequence y is thus the sum of the scores at all positions, and the score at each position consists of two parts: one from the output probability matrix of the BiGRU network coding unit, the other from the transition matrix of the CRF layer. Normalizing this score gives the final prediction probability of the tag sequence y:
$$p(y \mid X) = \frac{\exp\left(S(X, y)\right)}{\sum_{y' \in Y} \exp\left(S(X, y')\right)}$$
where Y represents all possible tag sequences.
Preferably, S5 further includes:
extracting the standard entities of the related triples ranked second and third by similarity score, and providing them as similar entities for reference alongside the electronic medical record standard entity.
An electronic medical record named entity standardization system based on an XLNet-BiGRU-CRF model comprises:
a preprocessing module, used for acquiring and preprocessing the electronic medical record corpus to be recognized;
an acquisition module, used for inputting the preprocessed corpus into an XLNet sub-model to obtain a first embedding word vector, wherein the XLNet sub-model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component;
a recognition module, used for inputting the first embedding word vector into a BiGRU-CRF sub-model to obtain the entity recognition result for the corpus;
an extraction module, used for extracting, according to the entity recognition result, a plurality of related triples containing the corresponding entities from a preset Neo4j database, wherein each triple comprises an original entity, an entity category and a standard entity;
a comparison module, used for computing the cosine similarity between the first embedding word vector and the second embedding word vector of each related triple, and taking the standard entity corresponding to the word with the highest similarity score as the target mapping entity result;
a mapping module, used for taking a preset standard table as a reference table and mapping the target mapping entity result to the reference table to obtain the final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual annotation.
A storage medium storing a computer program for electronic medical record named entity standardization based on an XLNet-BiGRU-CRF model, wherein the computer program causes a computer to execute the electronic medical record named entity standardization method as described above.
An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the electronic medical record named entity standardization method as described above.
(III) Advantageous effects
The invention provides an electronic medical record named entity standardization method, system, storage medium and electronic device based on an XLNet-BiGRU-CRF model. Compared with the prior art, it has the following beneficial effects:
The preprocessed electronic medical record corpus to be recognized is input into an XLNet sub-model to obtain a first embedding word vector; the first embedding word vector is input into a BiGRU-CRF sub-model to obtain the entity recognition result for the corpus; a plurality of related triples containing the corresponding entities are extracted from a preset Neo4j database; the cosine similarity between the first embedding word vector and the second embedding word vector of each related triple is computed, and the standard entity corresponding to the word with the highest similarity score is taken as the target mapping entity result; and the target mapping entity result is mapped to a reference table to obtain the final electronic medical record standard entity. As a result, retrieving any doctor's diagnoses no longer returns incomplete results caused by differing writing habits: doctors can keep entering records quickly and in their accustomed way, while all variant spellings that carry the same medical meaning are recognized as equivalent in data display and statistics.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of an electronic medical record named entity standardization method based on an XLNet-BiGRU-CRF model according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of an electronic medical record named entity standardization system based on an XLNet-BiGRU-CRF model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiments of the application provide an electronic medical record named entity standardization method, system, storage medium and electronic device based on an XLNet-BiGRU-CRF model, which address the technical problem that medical named entity recognition performance needs to be improved.
To solve this technical problem, the general idea of the embodiments of the application is as follows:
The preprocessed electronic medical record corpus to be recognized is input into an XLNet sub-model to obtain a first embedding word vector; the first embedding word vector is input into a BiGRU-CRF sub-model to obtain the entity recognition result for the corpus; a plurality of related triples containing the corresponding entities are extracted from a preset Neo4j database; the cosine similarity between the first embedding word vector and the second embedding word vector of each related triple is computed, and the standard entity corresponding to the word with the highest similarity score is taken as the target mapping entity result; and the target mapping entity result is mapped to a reference table to obtain the final electronic medical record standard entity. As a result, retrieving any doctor's diagnoses no longer returns incomplete results caused by differing writing habits: doctors can keep entering records quickly and in their accustomed way, while all variant spellings that carry the same medical meaning are recognized as equivalent in data display and statistics.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Embodiment:
In a first aspect, as shown in FIG. 1, an embodiment of the present invention provides an electronic medical record named entity standardization method based on an XLNet-BiGRU-CRF model, including:
S1, acquiring and preprocessing the electronic medical record corpus to be recognized;
S2, inputting the preprocessed corpus into an XLNet sub-model to obtain a first embedding word vector, wherein the XLNet sub-model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component;
S3, inputting the first embedding word vector into a BiGRU-CRF sub-model to obtain the entity recognition result for the corpus;
S4, according to the entity recognition result, extracting from a preset Neo4j database a plurality of related triples containing the corresponding entities, wherein each triple comprises an original entity, an entity category and a standard entity;
S5, computing the cosine similarity between the first embedding word vector and the second embedding word vector of each related triple, and taking the standard entity corresponding to the word with the highest similarity score as the target mapping entity result;
S6, taking a preset standard table as a reference table and mapping the target mapping entity result to the reference table to obtain the final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual annotation.
With the embodiment of the invention, retrieving any doctor's diagnoses no longer returns incomplete results caused by differing writing habits: doctors can keep entering records quickly and in their accustomed way, while all variant spellings that carry the same medical meaning are recognized as equivalent in data display and statistics.
Each step of the above technical solution is described in detail below:
S1, acquiring and preprocessing the electronic medical record corpus to be recognized.
The preprocessing comprises desensitizing the corpus and manually annotating it with sequence labels.
Desensitization means trimming the content of the electronic medical record so as to reduce interference from entities unrelated to clinical information, without changing the semantics of the record and while preserving its authenticity. Because electronic medical records contain private information such as the patient's name, age and address, the patient information must be desensitized to protect patient privacy, yielding a real clinical record corpus with private information removed.
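The source does not specify how desensitization is implemented. As a minimal sketch only, assuming the record stores private fields as "label：value" pairs, a regular expression can mask the values. The field labels, the mask string and the function name `desensitize` are hypothetical and chosen for illustration.

```python
import re

# Hypothetical field labels (name, age, address); a real system would need
# a list that matches the layout of its own records.
SENSITIVE_FIELDS = ("姓名", "年龄", "住址")
_PATTERN = re.compile(
    r"(" + "|".join(SENSITIVE_FIELDS) + r")[:：]\s*[^\s，,。；;]+"
)


def desensitize(text, mask="***"):
    """Replace the value that follows each sensitive field label with a mask."""
    return _PATTERN.sub(lambda m: f"{m.group(1)}：{mask}", text)


record = "姓名：张三，年龄：45岁，主诉：腹痛3天"
masked = desensitize(record)
print(masked)
```

Only the values after the listed labels are masked, so clinical content such as the chief complaint is left unchanged.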
Manual sequence annotation means manually annotating entities in the unstructured electronic medical record data during early-stage data preparation. The annotation targets clinically relevant entities and, following the entity label format, classifies them into categories such as disease, symptom, treatment, examination and body part.
The embodiment of the invention greatly expands the data scale in the pre-training stage and filters the data for quality. For the pre-training data used by the subsequent XLNet, Giga5, ClueWeb and Common Crawl data are introduced in addition to BooksCorpus and English Wikipedia, and low-quality data are filtered out, leaving 16 GB, 19 GB and 78 GB of text respectively.
S2, inputting the preprocessed corpus into an XLNet sub-model to obtain a first embedding word vector, wherein the XLNet sub-model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component.
An XLNet Chinese model is trained on large-scale electronic medical record corpora. Because XLNet is pre-trained on large-scale corpora, it can generate dynamic word vectors according to context, unlike traditional static word vector models; its semantic encoding is therefore more accurate, which improves the accuracy of the named entity recognition task.
The permutation language model randomly permutes the order of the Chinese characters in a sentence. By changing the positions of the words, context from both directions is arranged into a single direction, so that an autoregressive predictor can use it. The permutation language model includes:
$$\max_{\theta}\; \mathbb{E}_{\alpha \sim \mathcal{Z}_T}\left[\sum_{t=1}^{T} \log p_{\theta}\left(x_{\alpha_t} \mid x_{\alpha_{<t}}\right)\right]$$
where $\mathbb{E}_{\alpha \sim \mathcal{Z}_T}$ denotes the expectation over all permutations $\mathcal{Z}_T$ of a sequence of length $T$, $p_{\theta}$ is the conditional probability, $x_{\alpha_t}$ is the t-th token in the factorization order, and $x_{\alpha_{<t}}$ denotes all tokens before the t-th token. This is the objective function of permutation language modeling: the t-th token is predicted using the preceding t-1 tokens as context.
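The following sketch illustrates the idea of a factorization order: it samples one permutation and lists, for each step, the target token and the tokens available as context. It is a toy illustration, not the XLNet training code; the function name `permutation_contexts` and the example phrase are assumptions.

```python
import random


def permutation_contexts(tokens, seed=0):
    """Sample one factorization order and list (target, context) pairs.

    At step t the target is the token at the t-th position of the sampled
    order, and the context is every token that appears earlier in the order.
    """
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)  # one sampled factorization order
    pairs = []
    for t, idx in enumerate(order):
        # Context is reported in original text order for readability.
        context_idx = sorted(order[:t])
        pairs.append((tokens[idx], [tokens[i] for i in context_idx]))
    return order, pairs


tokens = ["急", "性", "胃", "炎"]  # characters of a hypothetical diagnosis
order, pairs = permutation_contexts(tokens)
for target, context in pairs:
    print(f"predict {target!r} given {context}")
```

Depending on the sampled order, a token can be predicted from characters that lie to its right in the original text, which is how the objective exposes both directions of context.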
Introducing the permutation language model overcomes the limitation that a traditional autoregressive model cannot learn the preceding and following context at the same time, but it also creates a problem: the position information of the text is lost.
To let the model learn the position information of the sequence, XLNet introduces a two-stream attention mechanism.
The two-stream self-attention mechanism of XLNet uses two representation units: a content representation unit and a query representation unit. The content representation encodes the preceding context and includes the current word. The query representation encodes the preceding context excluding the current word: it has access to the position of the current word but not to its content. The two representations form two information streams that are propagated upward layer by layer, and the query stream is output at the top. Two-stream attention is calculated as follows:
$$g_{\alpha_t}^{(m)} \leftarrow \mathrm{Attention}\left(Q = g_{\alpha_t}^{(m-1)},\; KV = h_{\alpha_{<t}}^{(m-1)};\; \theta\right)$$
$$h_{\alpha_t}^{(m)} \leftarrow \mathrm{Attention}\left(Q = h_{\alpha_t}^{(m-1)},\; KV = h_{\alpha_{\le t}}^{(m-1)};\; \theta\right)$$
where $g$ is the additionally input position information of the prediction target word, $h$ represents the correlation between the positions in the text sequence, and $m$ indexes the attention layer.
The XLNet Chinese model is built around the Transformer-XL framework, which introduces a recurrence mechanism and relative position encoding; this lets it make better use of contextual semantic information and mine latent relations in the text vectors. The XLNet Chinese model is trained on large-scale unlabeled data to obtain its parameters, and the feature vector representation of an input sequence is then obtained by inference.
The Transformer-XL core component includes:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{dim}}\right)V$$
where Q, K and V are the input word vector matrices, and dim is the dimension of the input vectors.
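As a small worked illustration, scaled dot-product attention over Q, K and V can be sketched in plain Python as below. This is a toy version, assuming the standard softmax(QKᵀ/√dim)V form; it omits the recurrence and relative position encoding of Transformer-XL, and the helper names are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors.

    Returns the attended outputs and the attention weights per query row.
    """
    dim = len(K[0])  # dimension of the input vectors
    outputs, all_weights = [], []
    for q in Q:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in K
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs, all_weights


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
print(weights[0], out[0])
```

The query is more similar to the first key, so the first value row receives the larger weight.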
Further, the feature vectors output by the XLNet Chinese model are input to the BiGRU network, which uses its gates to control whether information is passed on or blocked.
S3, inputting the first embedding word vector into a BiGRU-CRF sub-model to obtain the entity recognition result for the corpus.
The BiGRU-CRF sub-model includes:
$$z_t = \sigma\left(w_z \cdot [h_{t-1}, x_t]\right)$$
$$r_t = \sigma\left(w_r \cdot [h_{t-1}, x_t]\right)$$
$$\tilde{h}_t = \tanh\left(w_{\tilde{h}} \cdot [r_t \odot h_{t-1}, x_t]\right)$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$
where $x_t$ is the input vector at the current time t, i.e. the feature vector of the t-th word in the electronic medical record corpus to be recognized; $h_t$ and $h_{t-1}$ are the hidden state vectors at the current time t and the previous time, respectively; $\tilde{h}_t$ is the candidate hidden state at the current time t, i.e. the new memory at the current time; $z_t$ is the update gate, which controls how much state information from the previous time is carried into the current state: the larger $z_t$, the more previous state information is retained; $r_t$ is the reset gate, which controls how much state information from the previous time is ignored: the smaller $r_t$, the more is discarded; $w_z$, $w_r$ and $w_{\tilde{h}}$ are the weight matrices of the update gate, the reset gate and the candidate hidden state, respectively; $\sigma$ denotes the sigmoid activation function, $\tanh$ denotes the tanh activation function, and $\odot$ denotes the element-wise product of vectors.
The output of the BiGRU network coding unit is denoted Z; Z is normalized with softmax and then input to the CRF layer.
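The gate computations can be sketched as a single GRU step plus a bidirectional pass, written here in plain Python without biases. The sketch assumes the convention in which a larger update gate keeps more of the previous state, matching the description above; the weight shapes, random initialization and function names are assumptions for illustration.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(x_t, h_prev, w_z, w_r, w_h):
    """One GRU step; each weight matrix has shape hidden x (hidden + input)."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(w_z, concat)]  # update gate
    r = [sigmoid(a) for a in matvec(w_r, concat)]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t
    h_cand = [math.tanh(a) for a in matvec(w_h, gated)]  # candidate state
    # Larger z keeps more of the previous hidden state.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


def bigru(xs, fwd, bwd, hidden):
    """Run a forward and a backward GRU and concatenate their states."""
    h, forward = [0.0] * hidden, []
    for x in xs:
        h = gru_step(x, h, *fwd)
        forward.append(h)
    h, backward = [0.0] * hidden, []
    for x in reversed(xs):
        h = gru_step(x, h, *bwd)
        backward.append(h)
    backward.reverse()
    return [f + b for f, b in zip(forward, backward)]


def init(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
input_dim, hidden = 3, 2
xs = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(4)]
fwd = tuple(init(hidden, hidden + input_dim, rng) for _ in range(3))
bwd = tuple(init(hidden, hidden + input_dim, rng) for _ in range(3))
outputs = bigru(xs, fwd, bwd, hidden)
print(len(outputs), len(outputs[0]))
```

Each position receives a vector of size 2 × hidden, combining context from the left (forward pass) and from the right (backward pass).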
The BiGRU-CRF sub-model further includes:
For a given input sequence X, the score of predicting the output tag sequence y is defined as S(X, y), where $y = (y_1, y_2, \ldots, y_n)$ is the tag sequence of a sentence with n words. S(X, y) is calculated as:
$$S(X, y) = \sum_{t=1}^{n} Z_{t, y_t} + \sum_{t=2}^{n} A_{y_{t-1}, y_t}$$
where $Z_{t, y_t}$ is an element of the output Z of the BiGRU network coding unit, and $A_{y_{t-1}, y_t}$ is an element of the transition matrix output by the CRF layer, representing the transition score from tag $y_{t-1}$ to tag $y_t$, so that the dependencies between tags are used to obtain more reasonable tag sequences. The score of the whole tag sequence y is thus the sum of the scores at all positions, and the score at each position consists of two parts: one from the output probability matrix of the BiGRU network coding unit, the other from the transition matrix of the CRF layer. Normalizing this score gives the final prediction probability of the tag sequence y:
$$p(y \mid X) = \frac{\exp\left(S(X, y)\right)}{\sum_{y' \in Y} \exp\left(S(X, y')\right)}$$
where Y represents all possible tag sequences.
The loss function of the CRF layer adopts a negative log-likelihood function, and the formula is as follows:
Figure BDA0003474092420000145
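As a hedged illustration of the sequence score S(X, y), its normalization over all tag sequences, and the negative log-likelihood loss, the following sketch enumerates every candidate sequence by brute force. The function names, the toy emission and transition values, and the omission of start and end transitions are assumptions; brute-force enumeration is exponential in sentence length and is used here only because it makes the normalization explicit.

```python
import math
from itertools import product
from typing import List, Sequence

Matrix = List[List[float]]


def sequence_score(emissions: Matrix, transitions: Matrix, tags: Sequence[int]) -> float:
    """S(X, y): emission score of each position plus transition score between neighbours."""
    score = sum(emissions[t][tag] for t, tag in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def sequence_probability(emissions: Matrix, transitions: Matrix, tags: Sequence[int]) -> float:
    """Normalize exp(S(X, y)) over every possible tag sequence of the same length."""
    n_tags = len(transitions)
    all_scores = [
        sequence_score(emissions, transitions, cand)
        for cand in product(range(n_tags), repeat=len(emissions))
    ]
    top = max(all_scores)  # subtract the maximum for numerical stability
    log_partition = top + math.log(sum(math.exp(s - top) for s in all_scores))
    return math.exp(sequence_score(emissions, transitions, tags) - log_partition)


def crf_loss(emissions: Matrix, transitions: Matrix, tags: Sequence[int]) -> float:
    """Negative log-likelihood of the gold tag sequence."""
    return -math.log(sequence_probability(emissions, transitions, tags))


# Toy values: 3 positions, 3 tags indexed as 0 = O, 1 = B-LOC, 2 = I-LOC.
emissions = [[1.0, 0.2, 0.1], [0.3, 1.5, 0.2], [0.1, 0.4, 1.2]]
transitions = [[0.5, 0.5, -2.0], [-0.5, -1.0, 1.0], [0.2, -0.5, 0.8]]
gold = [0, 1, 2]
loss = crf_loss(emissions, transitions, gold)
```

In practice the denominator is usually computed with the forward algorithm rather than by enumeration; the quantity being computed is the same.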
Using the loss function of the CRF layer, the parameters of the whole named entity recognition model are updated with the Adam algorithm. The updated parameters are those of the BiGRU neural network model and the CRF layer, while the parameters of the XLNet Chinese model are kept unchanged. Training terminates when the loss value produced by the model meets a set requirement or the set maximum number of iterations is reached.
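The training rule described above (Adam updates, a frozen XLNet part, and termination on a loss threshold or an iteration limit) can be sketched on a toy problem. Everything specific here is an assumption for illustration: the scalar "parameters" named `xlnet`, `bigru` and `crf`, the quadratic toy loss, and the hyperparameter values. Only the standard Adam update rule itself is taken as given.

```python
import math
from typing import Callable, Dict, Set, Tuple

Params = Dict[str, float]


def train(
    params: Params,
    grad_fn: Callable[[Params], Params],
    loss_fn: Callable[[Params], float],
    frozen: Set[str],
    lr: float = 0.05,
    tol: float = 1e-6,
    max_iters: int = 1000,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> Tuple[Params, int]:
    """Adam on scalar parameters; names in `frozen` are never updated."""
    params = dict(params)  # work on a copy
    m = {k: 0.0 for k in params}
    v = {k: 0.0 for k in params}
    for step in range(1, max_iters + 1):
        if loss_fn(params) <= tol:      # loss meets the set requirement
            return params, step - 1
        grads = grad_fn(params)
        for k in params:
            if k in frozen:             # e.g. the pretrained encoder
                continue
            m[k] = beta1 * m[k] + (1 - beta1) * grads[k]
            v[k] = beta2 * v[k] + (1 - beta2) * grads[k] ** 2
            m_hat = m[k] / (1 - beta1 ** step)   # bias-corrected first moment
            v_hat = v[k] / (1 - beta2 ** step)   # bias-corrected second moment
            params[k] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params, max_iters            # maximum iteration count reached


def loss_fn(p: Params) -> float:
    return (p["xlnet"] * p["bigru"] - 3.0) ** 2 + (p["crf"] + 1.0) ** 2


def grad_fn(p: Params) -> Params:
    r = p["xlnet"] * p["bigru"] - 3.0
    return {"xlnet": 2 * r * p["bigru"], "bigru": 2 * r * p["xlnet"], "crf": 2 * (p["crf"] + 1.0)}


start = {"xlnet": 2.0, "bigru": 0.0, "crf": 0.0}
initial_loss = loss_fn(start)
trained, steps = train(start, grad_fn, loss_fn, frozen={"xlnet"})
```

The frozen entry plays the role of the XLNet Chinese model: its gradient is computed but never applied, so only the downstream parameters move.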
The structure of the GRU is simpler than that of the LSTM (it has one gate fewer) and therefore requires fewer matrix multiplications. With large amounts of training data, the GRU can save considerable training time.
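A rough parameter count makes the comparison concrete. The sketch below assumes that every gate or candidate block is a weight matrix acting on the concatenation of the previous hidden state and the current input, plus an optional bias; the dimensions 768 and 128 are hypothetical and are not taken from the patent.

```python
def recurrent_params(n_blocks: int, input_dim: int, hidden_dim: int, bias: bool = True) -> int:
    """Parameters of a recurrent cell built from n_blocks weight blocks on [h_{t-1}, x_t]."""
    per_block = hidden_dim * (hidden_dim + input_dim) + (hidden_dim if bias else 0)
    return n_blocks * per_block


# GRU: update gate, reset gate, candidate state -> 3 blocks.
# LSTM: input, forget and output gates plus candidate cell state -> 4 blocks.
gru_params = recurrent_params(3, input_dim=768, hidden_dim=128)
lstm_params = recurrent_params(4, input_dim=768, hidden_dim=128)
ratio = gru_params / lstm_params
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size, with a corresponding reduction in matrix multiplications per time step.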
S4, extracting, according to the entity recognition result, a plurality of related triples for the corresponding entity from a preset Neo4j database, wherein each triple comprises an original entity, an entity category and a standard entity.
The construction process of the Neo4j database comprises the following steps:
classifying and labeling the data of the training corpus and a pre-acquired National Health Commission standard data set to form triple data, and storing the triple data in the Neo4j database; and constructing the labeled corpus of the named entity recognition model. Because only entities need to be recognized, the tag set ['O', 'B-LOC', 'I-LOC'] is used, where O marks non-entity characters, B-LOC marks the first character of an entity, and I-LOC marks a non-initial character of an entity.
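The labeling scheme and the triple structure can be sketched as follows. The `Triple` type, the function names, the in-memory list standing in for the Neo4j database, the English sample string and the category-based lookup are all assumptions made for illustration; an actual corpus would consist of Chinese medical record text and the triples would be stored in Neo4j.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    original_entity: str
    entity_category: str
    standard_entity: str


def bio_tags(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Tag each character: B-LOC at an entity start, I-LOC inside it, O elsewhere.

    Each span is a half-open character range [start, end).
    """
    tags = ["O"] * len(text)
    for start, end in spans:
        tags[start] = "B-LOC"
        for i in range(start + 1, end):
            tags[i] = "I-LOC"
    return tags


def related_triples(category: str, store: List[Triple]) -> List[Triple]:
    """Hypothetical lookup: all stored triples that share the entity category."""
    return [t for t in store if t.entity_category == category]


# Hypothetical triple store; which synonym counts as the standard entity is an assumption.
store = [
    Triple("AIDS", "disease", "acquired immunodeficiency syndrome"),
    Triple("aspirin tablet", "drug", "aspirin"),
]
tags = bio_tags("Dx: AIDS", [(4, 8)])
candidates = related_triples("disease", store)
```

Here `tags` marks the four characters of the entity as B-LOC followed by three I-LOC labels, and `candidates` holds the triples that would be compared against the recognized entity in the next step.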
S5, comparing the first Embedding word vector by cosine similarity with each of the second Embedding word vectors corresponding to the plurality of related triples, and taking the standard entity corresponding to the word with the highest similarity score as the target mapping entity result.
The calculation of the cosine similarity of the word vector comprises the following steps:
extracting the first Embedding word vector with the XLNet Chinese model; comparing it by cosine similarity with each second Embedding word vector stored in the Neo4j database for the questions related to the corresponding entity; taking the answer to the question with the highest similarity as the target result; and marking the entities ranked second and third as similar entities, for reference during standardization.
The corresponding cosine similarity is calculated as follows:

$$\mathrm{score} = \frac{V_{query} \cdot V_{corpus}}{\lVert V_{query} \rVert \, \lVert V_{corpus} \rVert}$$

where score is the similarity value, $V_{query}$ is the first Embedding word vector, and $V_{corpus}$ is the second Embedding word vector.
In addition, this step also extracts the standard entities of the related triples ranked second and third by similarity score, which serve as similar entities for reference when determining the electronic medical record standard entity.
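The comparison and ranking in this step can be sketched in a few lines. The function names and the two-dimensional toy vectors are assumptions; real Embedding word vectors from XLNet would have far more dimensions, but the computation is the same.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(
    query_vec: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k standard entities sorted by descending similarity."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in corpus.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical second Embedding word vectors keyed by standard entity.
corpus = {"A": [1.0, 0.0], "B": [1.0, 1.0], "C": [0.0, 1.0], "D": [-1.0, 0.0]}
ranking = rank_candidates([1.0, 0.1], corpus)
target, similar = ranking[0], ranking[1:]
```

The first entry of `ranking` corresponds to the target mapping entity result, and the remaining two entries correspond to the similar entities kept for reference.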
S6, taking a preset standard table as a reference table, mapping the target mapping entity result to the reference table, and obtaining the final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual labeling.
The standard table is established from national and international standard data sets. Specifically, the labeled and recognized entities in the collected data are standardized based on national, international and medical-industry standards, international disease guidelines and the like, such as ICD10/ICD9, HL7 CDA, Medical Subject Headings (MeSH), Logical Observation Identifiers Names and Codes (LOINC), the CFDA drug dictionary specification, the ATC classification, the National Health Commission directory of diagnosis and treatment subjects for medical institutions, international tumor database structures, and international oncology diagnosis and treatment guidelines, so as to meet the requirements of subsequent service scenarios.
The entity-recognized data are then mapped to the reference table; the mapping comprises two processes, machine algorithm processing and manual labeling.
When the standard table does not correspond to the actual data, a professional doctor decides whether to extend the standard table. During standardization, different diagnostic terms may stand in different relationships: for example, two different written forms of leukemia, or AIDS and acquired immunodeficiency syndrome, are synonymous diagnostic terms, whereas the diagnostic term "severe fatty liver" is a type of fatty liver and stands in an inclusion relationship with it. Through standardization, clinical diagnoses are eventually mapped one-to-one to ICD11. For example, the diagnosis "hypothyroidism" as written by the doctor is normalized to the ICD11 disease "hypothyroidism", with code E03.901.
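The combination of machine processing and manual labeling can be sketched as a lookup with a review queue. The function name, the synonym dictionary and the queue are hypothetical; only the "hypothyroidism" entry with code E03.901 and the "severe fatty liver" example are taken from the text above.

```python
from typing import Dict, List, Optional, Tuple


def map_to_reference(
    entity: str,
    reference_table: Dict[str, str],
    synonyms: Dict[str, str],
    review_queue: List[str],
) -> Optional[Tuple[str, str]]:
    """Map an entity to (standard name, code); queue it for a doctor if no entry matches."""
    candidate = synonyms.get(entity, entity)  # machine step: resolve known synonyms
    code = reference_table.get(candidate)
    if code is None:
        review_queue.append(entity)           # manual step: a doctor decides on extension
        return None
    return candidate, code


reference_table = {"hypothyroidism": "E03.901"}
synonyms = {"hypothyroid": "hypothyroidism"}  # hypothetical synonym entry
review_queue: List[str] = []

mapped = map_to_reference("hypothyroid", reference_table, synonyms, review_queue)
unmapped = map_to_reference("severe fatty liver", reference_table, synonyms, review_queue)
```

Entities left in `review_queue` are the cases in which the standard table does not yet correspond to the actual data, so they are the ones a professional doctor would inspect before deciding whether to extend the table.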
In particular, for a new named entity recognition task, the above algorithm model can be used to process the fields to be processed. After the algorithm model has labeled the data, the labels are rechecked manually and the recheck results are fed back to the algorithm model, which reduces the model's error and improves its accuracy.
In a second aspect, as shown in fig. 2, an embodiment of the present invention provides an electronic medical record named entity standardization system based on an XLNet-BiGRU-CRF model, including:
a preprocessing module, configured to acquire and preprocess the electronic medical record corpus to be recognized;
an acquisition module, configured to input the preprocessed electronic medical record corpus to be recognized into an XLNet sub-model to obtain a first Embedding word vector, wherein the XLNet model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component;
a recognition module, configured to input the first Embedding word vector into a BiGRU-CRF sub-model and obtain the entity recognition result corresponding to the electronic medical record corpus to be recognized;
an extraction module, configured to extract, according to the entity recognition result, a plurality of related triples for the corresponding entity from a preset Neo4j database, wherein each triple comprises an original entity, an entity category and a standard entity;
a comparison module, configured to compare the first Embedding word vector by cosine similarity with each of the second Embedding word vectors corresponding to the plurality of related triples, and to take the standard entity corresponding to the word with the highest similarity score as the target mapping entity result;
and a mapping module, configured to take a preset standard table as a reference table and map the target mapping entity result to the reference table to obtain the final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual labeling.
In a third aspect, an embodiment of the present invention provides a storage medium storing a computer program for electronic medical record named entity standardization based on an XLNet-BiGRU-CRF model, where the computer program causes a computer to execute the electronic medical record named entity standardization method described above.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the electronic medical record named entity standardization method as described above.
In summary, compared with the prior art, the method has the following beneficial effects:
according to the embodiment of the invention, when the diagnoses of any doctor are retrieved, incomplete retrieval results caused by differing writing habits are avoided; the method therefore preserves the speed of clinical data entry and accommodates doctors' habits, while ensuring that all different written forms with the same medical meaning are recognized as having the same medical meaning in data display and statistics.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An electronic medical record named entity standardization method based on an XLNet-BiGRU-CRF model, characterized by comprising the following steps:
S1, acquiring and preprocessing the electronic medical record corpus to be recognized;
S2, inputting the preprocessed electronic medical record corpus to be recognized into an XLNet sub-model to obtain a first Embedding word vector, wherein the XLNet model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component;
S3, inputting the first Embedding word vector into a BiGRU-CRF sub-model to obtain the entity recognition result corresponding to the electronic medical record corpus to be recognized;
S4, extracting, according to the entity recognition result, a plurality of related triples for the corresponding entity from a preset Neo4j database, wherein each triple comprises an original entity, an entity category and a standard entity;
S5, comparing the first Embedding word vector by cosine similarity with each of the second Embedding word vectors corresponding to the plurality of related triples, and taking the standard entity corresponding to the word with the highest similarity score as the target mapping entity result;
and S6, taking a preset standard table as a reference table, mapping the target mapping entity result to the reference table, and obtaining the final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual labeling.
2. The method as claimed in claim 1, wherein the preprocessing in S1 comprises desensitizing the electronic medical record corpus to be recognized and manually labeling it with sequence tags.
3. The method for standardizing named entities of electronic medical records according to claim 1, wherein the permutation language model in S2 comprises:
Figure FDA0003474092410000021
wherein
Figure FDA0003474092410000028
denotes the expectation over all permutations, $p_\theta$ is the conditional probability,
Figure FDA0003474092410000022
is the t-th token in the factorization order, and $x_{\alpha<t}$ denotes all tokens before the t-th token; this is the objective function of permutation language modeling, in which the t-th token is predicted using the preceding t-1 tokens as context;
the two-stream attention mechanism comprises a query representation unit and a content representation unit:
Figure FDA0003474092410000023
Figure FDA0003474092410000024
wherein
Figure FDA0003474092410000025
is the additionally input position information of the target word to be predicted, and
Figure FDA0003474092410000026
represents the correlation between positions in the text sequence;
the Transformer-XL core component comprises:
Figure FDA0003474092410000027
wherein Q, K and V are the input word vector matrices, and dim is the dimension of the input vectors.
4. The method for standardizing named entities of electronic medical records as claimed in claim 3, wherein the construction process of Neo4j database in S4 comprises:
classifying and labeling the data of the training corpus and a pre-acquired National Health Commission standard data set to form triple data, and storing the triple data in the Neo4j database.
5. The method for standardizing named entities of electronic medical records as claimed in claim 3, wherein the BiGRU-CRF submodel in S5 comprises:
$z_t = \sigma(w_z \cdot [h_{t-1}, x_t])$
$r_t = \sigma(w_r \cdot [h_{t-1}, x_t])$
Figure FDA0003474092410000031
Figure FDA0003474092410000032
wherein $x_t$ represents the input vector at the current time t, i.e. the feature vector of the t-th word in the electronic medical record corpus to be recognized; $h_t$ and $h_{t-1}$ represent the hidden-layer state vectors at the current time t and at the previous time, respectively; $\tilde{h}_t$ represents the candidate hidden-layer state at the current time t, which is also the new memory at the current time; $z_t$ denotes the update gate, which controls the extent to which the state information of the previous time is carried into the current state; $r_t$ denotes the reset gate, which controls the extent to which the state information of the previous time is ignored; $w_z$, $w_r$ and $w_{\tilde{h}}$ denote the weight matrices of the update gate, the reset gate and the candidate hidden state, respectively; σ denotes the sigmoid nonlinear activation function, tanh denotes the tanh activation function, and * denotes the element-wise product of vectors;
the output of the BiGRU network coding unit is denoted Z, and after softmax probability normalization Z is input to the CRF layer.
6. The method for standardizing named entities in electronic medical records according to claim 5, wherein the BiGRU-CRF submodel in S5 further comprises:
for a given input sequence X, the score (unnormalized probability) of predicting the output tag sequence y is defined as S(X, y), where $y = (y_1, y_2, \ldots, y_n)$ is the tag sequence of a sentence of n words, and S(X, y) is calculated as:

$$S(X, y) = \sum_{t=1}^{n} Z_{t, y_t} + \sum_{t=2}^{n} A_{y_{t-1}, y_t}$$

wherein $Z_{t, y_t}$ is an element of the output Z of the BiGRU network coding unit, and $A_{y_{t-1}, y_t}$ is an element of the transition matrix of the CRF layer, representing the transition probability from tag $y_{t-1}$ to tag $y_t$; normalizing this score gives the final prediction probability of the tag sequence y:

$$p(y \mid X) = \frac{\exp(S(X, y))}{\sum_{y' \in Y} \exp(S(X, y'))}$$

where Y represents all possible tag sequences.
7. The method for standardizing named entities in electronic medical records according to any one of claims 1-6, wherein the step of S5 further comprises:
extracting the standard entities of the related triples ranked second and third by similarity score, as similar entities for reference when determining the electronic medical record standard entity.
8. An electronic medical record named entity standardization system based on an XLNet-BiGRU-CRF model is characterized by comprising the following components:
a preprocessing module, configured to acquire and preprocess the electronic medical record corpus to be recognized;
an acquisition module, configured to input the preprocessed electronic medical record corpus to be recognized into an XLNet sub-model to obtain a first Embedding word vector, wherein the XLNet model comprises a permutation language model, a two-stream attention mechanism and a Transformer-XL core component;
a recognition module, configured to input the first Embedding word vector into a BiGRU-CRF sub-model and obtain the entity recognition result corresponding to the electronic medical record corpus to be recognized;
an extraction module, configured to extract, according to the entity recognition result, a plurality of related triples for the corresponding entity from a preset Neo4j database, wherein each triple comprises an original entity, an entity category and a standard entity;
a comparison module, configured to compare the first Embedding word vector by cosine similarity with each of the second Embedding word vectors corresponding to the plurality of related triples, and to take the standard entity corresponding to the word with the highest similarity score as the target mapping entity result;
and a mapping module, configured to take a preset standard table as a reference table and map the target mapping entity result to the reference table to obtain the final electronic medical record standard entity, wherein the mapping process comprises machine processing and manual labeling.
9. A storage medium storing a computer program for electronic medical record named entity standardization based on XLNet-BiGRU-CRF model, wherein the computer program causes a computer to execute the electronic medical record named entity standardization method according to any one of claims 1 to 7.
10. An electronic device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the electronic medical record named entity standardization method of any of claims 1-7.
CN202210049938.7A 2022-01-17 2022-01-17 Electronic medical record named entity standardization method and system based on XLNet-BiGRU-CRF model Pending CN114582449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210049938.7A CN114582449A (en) 2022-01-17 2022-01-17 Electronic medical record named entity standardization method and system based on XLNet-BiGRU-CRF model

Publications (1)

Publication Number Publication Date
CN114582449A true CN114582449A (en) 2022-06-03

Family

ID=81768800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210049938.7A Pending CN114582449A (en) 2022-01-17 2022-01-17 Electronic medical record named entity standardization method and system based on XLNet-BiGRU-CRF model

Country Status (1)

Country Link
CN (1) CN114582449A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116842021A (en) * 2023-07-14 2023-10-03 恩核(北京)信息技术有限公司 Data dictionary standardization method, equipment and medium based on AI generation technology
CN116842021B (en) * 2023-07-14 2024-04-26 恩核(北京)信息技术有限公司 Data dictionary standardization method, equipment and medium based on AI generation technology

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359293A (en) * 2018-09-13 2019-02-19 内蒙古大学 Mongolian name entity recognition method neural network based and its identifying system
CN109471895A (en) * 2018-10-29 2019-03-15 清华大学 The extraction of electronic health record phenotype, phenotype name authority method and system
CN110009599A (en) * 2019-02-01 2019-07-12 腾讯科技(深圳)有限公司 Liver masses detection method, device, equipment and storage medium
CN112001177A (en) * 2020-08-24 2020-11-27 浪潮云信息技术股份公司 Electronic medical record named entity identification method and system integrating deep learning and rules
CN113516659A (en) * 2021-09-15 2021-10-19 浙江大学 Medical image automatic segmentation method based on deep learning
CN113641809A (en) * 2021-08-10 2021-11-12 中电鸿信信息科技有限公司 XLNET-BiGRU-CRF-based intelligent question answering method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination