CN113688255A

CN113688255A - Knowledge graph construction method based on Chinese electronic medical record

Info

Publication number: CN113688255A
Application number: CN202111026407.8A
Authority: CN
Inventors: 李丽双; 袁光辉; 唐婧尧
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2021-09-02
Filing date: 2021-09-02
Publication date: 2021-11-23

Abstract

The invention belongs to the field of natural language processing and provides a knowledge graph construction method based on Chinese electronic medical records. Most existing medical knowledge graphs are built from a small amount of medical record corpus, are small in scale, and are often suitable only for a single department or disease, so their generality is poor. Some completed medical record knowledge graphs also required a large amount of manual work, which is time-consuming and labor-intensive and limits extensibility. Because the electronic medical records of different departments and diseases describe different disease types, the language context of the corresponding examinations, treatments and so on differs, and doctors' habitual expressions also differ across disease types. These characteristics reduce the effectiveness of some deep learning methods and make a knowledge graph construction framework hard to extend. To address these problems, a knowledge graph data analysis and processing method, a corpus annotation workflow specification, and an entity relation extraction scheme based on Chinese electronic medical records are established.

Description

Knowledge graph construction method based on Chinese electronic medical record

Technical Field

The invention belongs to the field of natural language processing and relates to a method for constructing a knowledge graph from Chinese Electronic Medical Record (EMR) text, in particular to a knowledge graph construction method based on Chinese electronic medical records.

Background

The knowledge graph was first proposed by Google in 2012 and applied in its search engine (SINGHAL A. Official Google Blog: Introducing the Knowledge Graph: things, not strings [R]. Official Google Blog, 2012). In the general domain, knowledge graphs are usually stored as triples <entity, relation, entity> (WEIKUM G, THEOBALD M. From information to knowledge: harvesting entities and relationships from web sources [C]. Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Indianapolis, 2010: 65-76). An entity may be any specific thing in the real world, such as the name of a person, a place or an organization; a relation may be an attribute of an entity or may express a semantic relationship between entities. A number of relatively mature knowledge graphs have been built in the general domain, such as Freebase from abroad (BOLLACKER K, EVANS C, PARITOSH P, et al. Freebase: a collaboratively created graph database for structuring human knowledge [C]. Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, 2008: 1247-). A general-domain knowledge graph emphasizes breadth of information: it has to cover enough fields, and the knowledge needed to build it generally comes from semi-structured websites such as online encyclopedias and Wikipedia, or from community crowdsourcing. By contrast, a vertical-domain knowledge graph emphasizes depth of knowledge: the corpus used to build it usually depends on data from a professional field, and it usually has a definite construction purpose and is highly specialized and targeted.
At present, knowledge graphs are applied in many fields such as agriculture, finance and education, and the medical field is one of the fields with the greatest potential and the widest application.

Experts in China and abroad have done much work on constructing knowledge graphs in the medical field. Abroad, medical terminology and resources have long been collected and organized, for example i2b2, ICD-10 and MeSH. i2b2 is a clinical evaluation task based on electronic medical records that comprises three subtasks: concept extraction, assertion classification and relation classification. It clearly defines the types of concepts, assertions and clinical relations, and laid the foundation for building corpus annotation systems for electronic medical records. ICD-10 is the international classification of diseases produced by the World Health Organization. It classifies diseases into an ordered system according to certain attributes of the diseases and expresses that system with codes; it contains nearly 26000 disease records in total, and its content is comprehensive and accurate. MeSH is a medical subject heading vocabulary compiled by the U.S. National Library of Medicine. It contains 18000 medical subject headings belonging to 15 major classes and is mainly used to assist PubMed in indexing and retrieving medical literature. In China, Ruan Tong et al. (Construction and application of a traditional Chinese medicine knowledge graph [J]. Journal of Medical Informatics, 2016, 37(4): 8-13) analyzed and studied the automatic construction scheme and standardized process of a traditional Chinese medicine knowledge graph and applied it in the field of intelligent healthcare. Yuan Kaiqi et al. (Medical knowledge graph construction technology and research progress [J]. Application Research of Computers, 2018, 35(7): 1-11) explored in 2017 the problems that remained in the construction of medical knowledge graphs and described the state of their application.
Oudema et al. (A preliminary exploration of the construction of the Chinese medical knowledge graph CMeKG [J]. Journal of Chinese Information Processing, 2019, 33(10): 1-7) addressed the engineering construction of a modern-medicine Chinese knowledge graph: they formulated a structured, hierarchical and clear medical knowledge description system, developed a high-performance medical knowledge graph construction framework and techniques, built the knowledge graph construction platform CMeKG based on natural language processing technology, and formed an automated, standardized engineering mode for knowledge graph construction. In addition, some knowledge graphs have been constructed for specific disease areas. For example, Xiu Xiaolei (Research on tumor knowledge graph construction based on Chinese electronic medical records [D]. Beijing: Peking Union Medical College, 2019) studied the structural and semantic characteristics of electronic medical records for tumor diseases, proposed a complete framework for constructing a tumor knowledge graph from Chinese electronic medical records, built a digestive system tumor knowledge graph, and evaluated the quality of the resulting graph by combining quantitative evaluation with expert evaluation.

At present, research on medical knowledge graphs in China is developing rapidly, but there is still no standard, general construction process, and some studies leave room for improvement. Most existing knowledge graphs are built from a small amount of medical record corpus, are small in scale, and are often suitable only for a single department or disease, so their generality is poor. Some completed medical record knowledge graphs also required a large amount of manual work, which is time-consuming and labor-intensive and limits extensibility. Because the electronic medical records of different departments and diseases describe different disease types, the language context of the corresponding examinations, treatments and so on differs, and doctors' habitual expressions also differ across disease types. These characteristics reduce the effectiveness of some deep learning methods and make a knowledge graph construction framework hard to extend.

Disclosure of Invention

The invention provides a knowledge graph construction framework based on Chinese electronic medical records. Drawing on earlier annotation and construction experience, a large number of electronic medical record corpora from different departments and diseases were studied, and a knowledge graph data analysis and processing method, a corpus annotation workflow specification, and an entity relation extraction scheme based on Chinese electronic medical records were formulated according to the characteristics of the corpora.

A knowledge graph construction method based on Chinese electronic medical records comprises the following steps:

Step 1: Preprocessing the Chinese electronic medical record corpus

(1) Corpus splitting: the medical records are split according to the tags in the electronic medical records, where each tag corresponds to medical knowledge described in natural language.

(2) Tag classification: the tags are classified manually, and tags that contain the same kind of medical knowledge are placed in one set.

(3) Tag counting and screening: the number of tags contained in each set is counted, and the tag sets are sorted by that number. According to these counts, the several sets containing the most tags are then selected as the corpus for constructing the knowledge graph of the present application.
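The three preprocessing sub-steps can be sketched as follows. This is only an illustration under stated assumptions: each record is assumed to be a mapping from tag name to free text, the tag names and the `TAG_TO_SET` mapping are hypothetical stand-ins for the manual classification, and "number of tags" is read here as the number of tagged passages that fall into each set.

```python
from collections import Counter, defaultdict

# Hypothetical result of the manual tag classification (tag -> tag set).
TAG_TO_SET = {
    "chief complaint": "history",
    "present illness": "history",
    "past history": "history",
    "physical examination": "examination",
    "auxiliary examination": "examination",
    "discharge advice": "advice",
}


def split_record(record):
    """Split one record (assumed dict of tag -> text) into (tag, text) pairs."""
    return [(tag, text) for tag, text in record.items() if text.strip()]


def select_tag_sets(records, top_k):
    """Count tagged passages per tag set and keep the top_k largest sets."""
    counts = Counter()
    passages = defaultdict(list)
    for record in records:
        for tag, text in split_record(record):
            tag_set = TAG_TO_SET.get(tag)
            if tag_set is None:  # tag was not classified by the annotators
                continue
            counts[tag_set] += 1
            passages[tag_set].append(text)
    kept = [name for name, _ in counts.most_common(top_k)]
    return {name: passages[name] for name in kept}, counts
```

Calling `select_tag_sets(records, top_k)` returns the passages of the retained sets together with the per-set counts, so the cut-off `top_k` can be chosen after inspecting the counts.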

Step 2: Formulating the data annotation rules and the annotation process

(1) Entity annotation specification: entity types are divided into five categories: disease, body part, symptom, examination and treatment. The five entity types are defined as follows:

Disease: generally refers to an abnormal phenomenon occurring in the patient's body or mind, or to a diagnosis made by a doctor based on the patient's condition. Diseases can be divided into two categories, infectious and non-infectious, and generally have an adverse effect on a person's normal life.

Body part: generally refers to a part of the human body, external or internal. In medical pathology, a body part is usually associated with a disease or a symptom.

Symptom: generally refers to discomfort or an abnormal sensation caused by a disease or another emergency condition, or, in a hospital, to an abnormal diagnostic result given by a doctor, an abnormal result from examination equipment, and the like.

Examination: generally refers to the examination items, physical examinations and examination equipment used to confirm whether a disease is present or to learn more details about a disease.

Treatment: generally refers to the medication, surgery, equipment or other therapeutic means used to treat a disease or symptom.

(2) Relation annotation specification

Based on the entity types defined above, the relations between entities are divided into seven major classes: disease and disease, disease and body part, disease and symptom, treatment and disease, treatment and symptom, examination and disease, and examination and symptom. Some major classes are further divided into subclasses. The details are as follows:

First major class, disease and disease: the present application groups all relations between diseases into one broad class, such as a disease and its complications, one disease indicating another disease, or an alias of a disease.

Second major class, disease and body part: the body part in which the disease manifests, generally the site of onset, but also a site of metastasis.

Third major class, disease and symptom: a manifestation of a disease, generally a symptom caused by the disease.

Fourth major class, treatment and disease: according to the outcome, the relation between a treatment and a disease is subdivided into the following four types:

Treatment improves disease: the treatment is applied to the disease and the disease is improved or cured.

Treatment worsens disease: the treatment applied to the disease causes the disease to worsen.

Treatment causes disease: the disease arises as a result of the treatment.

Treatment applied to disease: the treatment is applied to the disease, and the therapeutic effect is not mentioned.

Fifth major class, treatment and symptom: the present application divides the relation between a treatment and a symptom into two types:

Treatment applied to symptom: the treatment taken for a symptom; the outcome of the treatment is not subdivided here.

Treatment causes symptom: the symptom arises as a result of the treatment.

Sixth major class, examination and disease: a disease is confirmed by equipment or other examination methods. The relation is divided into two types according to whether an examination result is given:

Examination confirms disease: the examination confirms the occurrence of the disease.

Examination performed to confirm disease: an examination is carried out in order to confirm the disease, and the result is unknown.

Seventh major class, examination and symptom: the examination reveals a symptom, which may be normal or abnormal, or the examination is carried out to confirm whether a symptom is present.
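The entity types and relation classes above can be written down as a small schema, which makes the count of 7 major classes and 12 fine-grained relation types easy to check. The sketch below is illustrative only: the identifier names, the English label strings and the head/tail direction of each relation are assumptions made for the example, not part of the specification.

```python
from enum import Enum


class EntityType(Enum):
    DISEASE = "disease"
    BODY_PART = "body part"
    SYMPTOM = "symptom"
    EXAMINATION = "examination"
    TREATMENT = "treatment"


# major class -> (head type, tail type, fine-grained labels).
# Label strings are illustrative names chosen for this sketch.
RELATION_SCHEMA = {
    "disease-disease": (EntityType.DISEASE, EntityType.DISEASE,
                        ["disease-disease"]),
    "disease-body part": (EntityType.DISEASE, EntityType.BODY_PART,
                          ["disease-body part"]),
    "disease-symptom": (EntityType.DISEASE, EntityType.SYMPTOM,
                        ["disease-symptom"]),
    "treatment-disease": (EntityType.TREATMENT, EntityType.DISEASE,
                          ["treatment improves disease",
                           "treatment worsens disease",
                           "treatment causes disease",
                           "treatment applied to disease"]),
    "treatment-symptom": (EntityType.TREATMENT, EntityType.SYMPTOM,
                          ["treatment applied to symptom",
                           "treatment causes symptom"]),
    "examination-disease": (EntityType.EXAMINATION, EntityType.DISEASE,
                            ["examination confirms disease",
                             "examination performed to confirm disease"]),
    "examination-symptom": (EntityType.EXAMINATION, EntityType.SYMPTOM,
                            ["examination-symptom"]),
}


def fine_grained_labels():
    """Flatten the schema into the list of fine-grained relation labels."""
    return [label for _, _, labels in RELATION_SCHEMA.values() for label in labels]


def candidate_labels(head, tail):
    """Return the labels that are allowed for a (head type, tail type) pair."""
    for head_type, tail_type, labels in RELATION_SCHEMA.values():
        if (head, tail) == (head_type, tail_type):
            return list(labels)
    return []
```

A typed lookup such as `candidate_labels(EntityType.TREATMENT, EntityType.SYMPTOM)` restricts an annotator or a classifier to the labels that are valid for that pair of entity types.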

(3) Corpus annotation process

Entities are first extracted by rule matching, and the relations between the extracted entities are then annotated. The main flow of entity relation annotation is as follows:

Preparing the annotation data: first, entity pairing is performed. Several entities can be matched in each medical record text, and the entities are paired according to their entity types and the distance between them. The distance between entities is measured not in characters but in sentences, and the invention specifies that if two entities are separated by more than three sentences, no relation between them is considered. Then, entity pair screening is performed: a deep learning model predicts whether each entity pair has a relation. The training set comes from pre-annotated corpus, the model is an LSTM, and it is evaluated on a test set. The LSTM model is given by the following formulas:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$

$$h_t = o_t \odot \tanh(C_t)$$

where x_t is the input vector at the t-th position of the LSTM network, h_t is the hidden-layer output at the t-th position, f_t, i_t and o_t are the forget, input and output gates, C_t is the cell state in its standard form, W_(·) and b_(·) denote trainable weights and biases, tanh and σ are activation functions, and ⊙ denotes element-wise multiplication. This step only screens out the entity pairs that are likely to have a relation; which relation holds between the entities of a pair still has to be determined by manual annotation.
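The entity pairing rule described above (pair by entity type, discard pairs more than three sentences apart) can be sketched as follows. The sketch makes several assumptions that are not in the source: each entity carries the index of the sentence it occurs in, the distance is taken as the difference of sentence indices, and the allowed type pairs follow the seven major relation classes. The LSTM screening step itself is not shown.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Entity:
    text: str
    etype: str
    sentence: int  # index of the sentence that contains the entity


# (head type, tail type) pairs for which a relation class is defined.
ALLOWED_TYPE_PAIRS = {
    ("disease", "disease"),
    ("disease", "body part"),
    ("disease", "symptom"),
    ("treatment", "disease"),
    ("treatment", "symptom"),
    ("examination", "disease"),
    ("examination", "symptom"),
}

MAX_SENTENCE_GAP = 3  # pairs further apart than this are not considered


def candidate_pairs(entities):
    """Return (head, tail) pairs that pass the type and distance filters."""
    pairs = []
    for a, b in combinations(entities, 2):
        if abs(a.sentence - b.sentence) > MAX_SENTENCE_GAP:
            continue
        if (a.etype, b.etype) in ALLOWED_TYPE_PAIRS:
            pairs.append((a, b))
        elif (b.etype, a.etype) in ALLOWED_TYPE_PAIRS:
            pairs.append((b, a))  # reorder so the head type comes first
    return pairs
```

The pairs returned by `candidate_pairs` would then be passed to the screening model, and only the pairs it judges likely to be related would go on to manual annotation.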

Manual annotation process: First, an annotation specification is drawn up and annotation examples are provided for the annotators to study. Second, after the group A annotators have annotated 1000 corpus items, the accuracy is checked and the problems encountered are summarized. Third, the problems raised by the group A annotators are resolved, and it is then decided whether to continue annotating. Fourth, after half of the corpus has been annotated, the group B annotators randomly draw 10% of the group A annotation results for acceptance checking. Fifth, the consistency between the annotations of the two groups is discussed, and it is decided whether the annotation must be redone. Sixth, after annotation is complete, an expert randomly draws 5% of the corpus for acceptance; if acceptance fails, the corpus must be annotated again.

Annotation acceptance: acceptance is based mainly on the accuracy found in the group B acceptance check and the accuracy found in the expert acceptance check. When both accuracies reach a specified value, annotation is complete.
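The sampling-based acceptance checks can be illustrated with a short sketch. The source does not state the accuracy that must be reached, so `threshold` below is a hypothetical parameter; the representation of an annotation as an `(item_id, label)` pair and the fixed random seed are also assumptions made for the example.

```python
import random


def sample_for_review(annotations, fraction, seed=0):
    """Randomly draw a fraction of the annotated items for acceptance review."""
    if not annotations:
        return []
    k = max(1, round(len(annotations) * fraction))
    return random.Random(seed).sample(annotations, k)


def acceptance_accuracy(sampled, reviewer_labels):
    """Share of sampled (item_id, label) pairs that match the reviewer's label."""
    agree = sum(1 for item_id, label in sampled
                if reviewer_labels.get(item_id) == label)
    return agree / len(sampled)


def annotation_accepted(group_b_accuracy, expert_accuracy, threshold):
    """Annotation is complete only when both checks reach the threshold."""
    return group_b_accuracy >= threshold and expert_accuracy >= threshold
```

With this layout, the group B check would use `fraction=0.10` on group A's results and the expert check would use `fraction=0.05` on the finished corpus.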

Step 3: Extracting entities and relations from Chinese electronic medical records

(1) Entity extraction

The invention extracts entities from Chinese electronic medical records by rule matching. The specific process is as follows:

Constructing the entity library: the entities in the entity library come from public resources such as ICD-10 and the Common Clinical Medical Terms (2019 edition) issued by the national health administration. The required entities are extracted according to the entity types defined in the annotation specification and are then stored in 5 corresponding entity tables.

Rule-based entity matching: the invention uses the entity library to match entities by rules and implements two matching methods, word segmentation search and string search. The two methods work as follows.

Word segmentation search: the first method segments the text into words and then checks whether each word is in the entity library; if it is, the word is taken to be an entity whose type is that of the corresponding entity table. Because long entities may be split apart during word segmentation, the lookup step searches the entity library for three forms: the current word, the current word spliced with the next word, and the current word spliced with the next two words. The result of word segmentation search depends on the quality of the segmentation: if the segmentation error is large, the matching is poor, and segmentation can split an entity and cause it to be missed. For example, "subarachnoid hematoma" may be segmented into "arachnoid", "subarachnoid" and "luminal hematoma". Such an entity can still be identified by the spliced search, but it is missed if the segmentation is slightly finer.
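A minimal sketch of the spliced lookup described above is given below. It assumes the text has already been segmented by some external tool (the segmenter is not shown), and the Chinese example tokens and the two-entry entity library are hypothetical illustrations rather than data from the source.

```python
def segment_search(tokens, library, max_span=3):
    """Look up each token, and its splice with the next one or two tokens.

    tokens  : list of words produced by a word segmenter (assumed given)
    library : dict mapping an entity surface form to its entity type
    Returns (surface, type, first_token_index, end_token_index) tuples.
    """
    found = []
    for i in range(len(tokens)):
        for span in range(1, max_span + 1):
            if i + span > len(tokens):
                break
            # Chinese text has no spaces, so tokens are joined directly.
            candidate = "".join(tokens[i:i + span])
            if candidate in library:
                found.append((candidate, library[candidate], i, i + span))
    return found


# Hypothetical entity library for the example.
LIBRARY = {"蛛网膜下腔血肿": "disease", "血肿": "symptom"}
coarse = segment_search(["患者", "蛛网膜", "下腔", "血肿"], LIBRARY)
fine = segment_search(["蛛网膜", "下", "腔", "血肿"], LIBRARY)
```

In this example the coarse segmentation still recovers the long entity through the three-token splice, while the finer four-token segmentation does not, which is the missed-entity case the text describes.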

String search: this method needs no word segmentation. Instead, every word in the entity library is matched against the text as a character string, and each match is taken to be an entity. String search does not miss entities, because any library entity that appears in the text is matched, but the matched string may not actually be an entity in its context. For example, in "hospital service center" the word "heart" would be matched as a body part entity (in Chinese, the word for "center" contains the character for "heart"), which is in fact wrong.
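String search can be sketched as a plain substring scan over the text for every library entry. The example sentence and library below are hypothetical; they are chosen to reproduce the false positive described above, where the character for "heart" (心) is matched inside the word for "center" (中心).

```python
def string_search(text, library):
    """Find every occurrence of every library entry in the text.

    Returns (surface, type, start, end) tuples sorted by start offset,
    with longer matches first when several start at the same offset.
    """
    matches = []
    for surface, etype in library.items():
        if not surface:  # skip empty entries, which would match everywhere
            continue
        start = text.find(surface)
        while start != -1:
            matches.append((surface, etype, start, start + len(surface)))
            start = text.find(surface, start + 1)
    return sorted(matches, key=lambda m: (m[2], -(m[3] - m[2])))


# Hypothetical library and sentence ("the patient has a fever and
# visits the hospital service center").
LIBRARY = {"心": "body part", "发热": "disease"}
hits = string_search("患者发热，到医院服务中心就诊", LIBRARY)
```

The scan returns both the correct "发热" match and the spurious "心" match, so a later filtering or post-processing step is still needed to remove matches that are not entities in context.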

Post-processing: after the rule matching above, most entities can be identified, but rule matching inevitably has two problems. One is entity overlap, where two or more entities share part of their content; the other is that the same entity has multiple entity types, for example an entity that is both a symptom and a disease. For entity overlap, the invention applies merging rules: if one entity contains the other, the longer entity is kept; if the two entities partially overlap, they are merged. An entity resulting from containment keeps the type of the longer entity. For an entity produced by merging two partially overlapping entities, if one of them is a body part entity, the type of the other is used as the merged type; if neither is a body part entity, the type of one of the two is used. For an entity with multiple entity types, the invention predicts the type with a model based on contextual semantics. The training and test sets consist of entities that have only a single type, and the entities in the training set and the test set do not overlap; the model, LSTM + ATTN, then predicts the type of each multi-type entity. ATTN denotes the attention mechanism, which is computed as follows:

$$s_t = F(x_t, q)$$

$$\alpha_t = \mathrm{softmax}(s_t) = \frac{\exp(s_t)}{\sum_{j} \exp(s_j)}$$

where x_t is an input vector, q is a query vector, F(·) is a scoring function, s_t is the relevance score, and α_t is the normalized score. The scoring function can be computed in the following ways:

Additive model:

$$s(x, q) = v^{T} \tanh(W x + U q)$$

Dot-product model:

$$s(x, q) = x^{T} q$$

Scaled dot-product model:

$$s(x, q) = \frac{x^{T} q}{\sqrt{D}}$$

Bilinear model:

$$s(x, q) = x^{T} W q$$

where v, W and U are learnable parameters of the model and D is the dimension of the input vector. In the specific implementation, if the predicted entity type is in the set of types originally matched for the entity, that type is used as the entity type; if it is not, one type is selected at random from the matched type set.
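The post-processing rules described above (keep the longer entity on containment, merge on partial overlap, and fall back to a random matched type when the model's prediction is outside the matched set) can be sketched as follows. The `Span` representation with character offsets is an assumption made for the example, and where neither overlapping entity is a body part the sketch arbitrarily keeps the type of the earlier entity.

```python
import random
from dataclasses import dataclass

BODY_PART = "body part"


@dataclass(frozen=True)
class Span:
    start: int
    end: int  # exclusive character offset
    etype: str


def merge_pair(a, b):
    """Merge two overlapping spans, or return None if they do not overlap."""
    if a.end <= b.start or b.end <= a.start:
        return None
    if a.start <= b.start and b.end <= a.end:  # a contains b: keep a
        return a
    if b.start <= a.start and a.end <= b.end:  # b contains a: keep b
        return b
    # Partial overlap: a body part entity yields its type to the other one.
    etype = b.etype if a.etype == BODY_PART else a.etype
    return Span(min(a.start, b.start), max(a.end, b.end), etype)


def merge_spans(spans):
    """Apply the merge rules to all spans, scanning left to right."""
    merged = []
    for span in sorted(spans, key=lambda s: (s.start, -s.end)):
        if merged:
            combined = merge_pair(merged[-1], span)
            if combined is not None:
                merged[-1] = combined
                continue
        merged.append(span)
    return merged


def resolve_type(predicted, matched_types, rng=random):
    """Use the predicted type if it was matched, else pick a matched type."""
    if predicted in matched_types:
        return predicted
    return rng.choice(sorted(matched_types))
```

Because the spans are processed in order of start offset, each new span only needs to be compared with the most recently merged one.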

(2) Relationship extraction

Entity relation extraction determines whether the matched entities have a relation. The annotation specification defines 12 relation types in total. The invention builds a model in a supervised manner to extract the triples: data are first annotated manually, the model is then trained on the annotated data, and finally the trained model predicts the remaining unannotated data to obtain entity relation triples.

Drawings

FIG. 1 electronic medical record knowledge graph construction framework

FIG. 2 is a conceptual diagram of entity relationship

FIG. 3 corpus processing flow chart

FIG. 4 electronic medical record data presentation

Detailed Description

1. Electronic medical record corpus preprocessing

First, de-identified electronic medical record data are obtained from a hospital and then processed according to the corpus preprocessing flow shown in FIG. 3. Taking the electronic medical record data shown in FIG. 4 as an example, a medical record text can be extracted, for example: "the patient developed limb weakness and numbness after a fever in the year before admission, and still had a fever after leaving the hospital."

2. Entity rule matching

Entity matching is performed on the medical record text using the constructed entity library. Entities such as fever (disease entity), limb weakness (symptom entity), numbness (symptom entity) and potassium supplementation (treatment entity) can be matched.

3. Entity relationship labeling

A portion of the texts and their corresponding entities is selected from all medical record texts and annotated manually to mark the relation between each entity pair. For example, the relation between the entity "fever" and the entity "limb weakness" is annotated as "disease-symptom".

4. Entity relationship extraction

The previous step yields electronic medical record texts and their corresponding entity relation pairs. These are used as the training corpus, and the trained model is used to predict entity relations. The present application adopts an electronic medical record entity relation extraction model based on position noise reduction and rich semantics.

5. Quality assessment

To evaluate the accuracy of the entities matched by the present application, 2000 entities were randomly sampled, without deduplication, from the matched entities for manual evaluation. The accuracy for each entity type is shown in Table 1.

Table 1. Entity extraction quality assessment

As Table 1 shows, the average accuracy of entity extraction by the rule matching method of the present application reaches 92.95%.

The quality of a knowledge graph determines whether it can be applied effectively to downstream tasks. To measure the effectiveness of the proposed relation extraction method on Chinese electronic medical record corpus, the present application uses accuracy to evaluate the quality of each type of extracted relation.

Table 2. Quality assessment of major relation classes

As shown in Table 2, in multi-class classification the prediction quality for a class is generally expressed by the accuracy of that class, and the results show that every major class achieves good accuracy. The "disease-body part" relation gives the best result, because in medical record texts this relation follows an obvious pattern when it occurs, for example: "The main routes of spread of anal canal cancer are direct invasion of the surrounding soft tissue and metastasis along the lymphatic vessels." Such sentences have a simple structure whose features are easy for the model to learn, and the relation usually occurs within a single sentence and rarely across sentences, so the model predicts it well. The "treatment-symptom" relation gives the worst result. The span between the two entities is generally large, this relation occurs across sentences about as often as within a single sentence, and the places in a medical record where it appears often contain a large amount of other medical knowledge, so the context is complex and the prediction is poor.

Besides the accuracy of each major relation class, the results for the subclasses are also important. Each subclass is a finer division of a major class, so the classification is more precise in meaning, but the classification accuracy inevitably drops. The accuracy of each subclass relation is shown in Table 3.

Table 3. Quality assessment of subclass relations

As the results in Table 3 show, the accuracy of the subclass relations is considerably lower than that of the major classes. One reason is that the "treatment-disease" relation has more subdivided types and the conditions that distinguish them are complex; for example, "treatment improves disease" and "treatment applied to disease" are semantically close, which makes them harder for the model to tell apart. In addition, the overall accuracy of the "treatment-disease" major class is not high, so the accuracy of its subclasses is naturally lower. In the "examination-disease" and "treatment-symptom" major classes, the accuracy of every subclass drops, and the drop is larger for "examination performed to confirm disease" and "treatment causes symptom". One possible reason is that these two subclasses have fewer samples than the other relation classes, so the model sees them less often during training and has difficulty learning the features needed to identify them, which leads to poorer results.

In conclusion, the method defines 7 major relation classes comprising 12 relation types in total, and the overall relation accuracy reaches 84.05%, which verifies the effectiveness of the method.
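The accuracy figures discussed above can be computed from manual judgements of sampled results. The sketch below assumes each judgement is a `(label, is_correct)` pair produced by a human reviewer; whether the reported averages are micro or macro averages is not stated in the source, so both are returned here.

```python
from collections import defaultdict


def accuracy_by_label(judgements):
    """Compute per-label, micro-average and macro-average accuracy.

    judgements: iterable of (label, is_correct) pairs from manual review.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for label, is_correct in judgements:
        totals[label] += 1
        if is_correct:
            correct[label] += 1
    per_label = {label: correct[label] / totals[label] for label in totals}
    micro = sum(correct.values()) / sum(totals.values())
    macro = sum(per_label.values()) / len(per_label)
    return per_label, micro, macro
```

The same function applies to entity types and to relation labels, at either the major-class or the subclass level, depending on which labels are passed in.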

Claims

1. A knowledge graph construction method based on Chinese electronic medical records is characterized by comprising the following steps:

Step 1: preprocessing the Chinese electronic medical record corpus

(1) corpus splitting: splitting the medical records according to the tags in the electronic medical records, wherein each tag corresponds to medical knowledge described in natural language;

(2) tag classification: manually classifying the tags, and placing tags that contain the same kind of medical knowledge into one set;

(3) tag counting and screening: counting the number of tags contained in each set, and sorting the tag sets by that number; then, according to these counts, selecting the several sets containing the most tags as the corpus for constructing the knowledge graph of the present application;

Step 2: formulating the data annotation rules and the annotation process

(1) entity annotation specification: entity types are divided into five categories: disease, body part, symptom, examination and treatment; the five entity types are defined as follows:

disease: generally refers to an abnormal phenomenon occurring in the patient's body or mind, or to a diagnosis made by a doctor based on the patient's condition; diseases can be divided into two categories, infectious and non-infectious, and generally have an adverse effect on a person's normal life;

body part: generally refers to a part of the human body, external or internal; in medical pathology, a body part is generally associated with a disease or a symptom;

symptom: generally refers to discomfort or an abnormal sensation caused by a disease or another emergency condition, or, in a hospital, to an abnormal diagnostic result given by a doctor, an abnormal result from examination equipment, and the like;

examination: generally refers to the examination items, physical examinations and examination equipment used to confirm whether a disease is present or to learn more details about a disease;

treatment: generally refers to the medication, surgery, equipment or other therapeutic means used to treat a disease or symptom;

(2) specification of relation labels

According to the determined entity types, the relationship types between the entities are further divided into seven major categories: the relationship between disease and disease, the relationship between disease and location, the relationship between disease and symptoms, the relationship between treatment and disease, the relationship between treatment and symptoms, the relationship between examination and disease, and the relationship between examination and symptoms, some of which have subdivided subclasses; the method comprises the following specific steps:

first major class, diseases and diseases: the relationship between diseases and related complications, the disease indicating the disease or the alias of the disease, etc. are unified into a broad category;

second major class, disease and site: the disease is manifested in the site, generally the diseased site, and also in the metastatic site;

third, disease and symptoms: an embodiment of a disease, generally refers to a condition caused by a disease;

treating and improving diseases: indicates that the treatment is directed to the disease and that the disease is ameliorated or cured;

treatment of exacerbation disorders: indicates that treatment for the disease results in worsening of the condition;

treatment leads to disease: indicates the disease arising from the treatment;

treatment and management of diseases: treatment is applied to the disease, no mention is made of the effect of the treatment;

symptomatic treatment: treatment regimens taken for certain symptoms, where the treatment outcome is not subdivided;

treatment results in symptoms: symptoms resulting from such treatment;

sixth major class, examination and disease, divided into two subclasses:

examination confirms disease: the occurrence of the disease is confirmed by the examination;

examination performed to confirm disease: an examination is carried out in order to confirm the disease, and the result is unknown;

seventh major class, examination and symptom: the examination reveals a symptom, which may be normal or abnormal, or the examination is performed to confirm whether a symptom exists;
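The seven major categories and their subclasses can be written down as a small lookup table, which is convenient when checking which labels an annotator may assign to a given entity pair. The following is a minimal sketch only: the label identifiers, the entity type names, and the helper `allowed_relations` are illustrative assumptions and do not appear in the specification.

```python
# Hypothetical identifiers: the specification fixes the categories,
# not these label names or entity type names.
RELATION_SCHEMA = {
    ("disease", "disease"): ["disease_disease"],
    ("disease", "site"): ["disease_site"],
    ("disease", "symptom"): ["disease_symptom"],
    ("treatment", "disease"): [
        "treatment_improves_disease",
        "treatment_worsens_disease",
        "treatment_causes_disease",
        "treatment_applied_to_disease",
    ],
    ("treatment", "symptom"): [
        "treatment_applied_to_symptom",
        "treatment_causes_symptom",
    ],
    ("examination", "disease"): [
        "examination_confirms_disease",
        "examination_performed_to_confirm_disease",
    ],
    ("examination", "symptom"): ["examination_symptom"],
}


def allowed_relations(type_a, type_b):
    """Return the relation labels defined for two entity types, in either order."""
    return (
        RELATION_SCHEMA.get((type_a, type_b))
        or RELATION_SCHEMA.get((type_b, type_a))
        or []
    )
```

A pair of types that does not appear in the table, such as site and symptom, yields an empty list, meaning no relationship is annotated for it.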

(3) corpus annotation process

Entities are first extracted by rule matching, and the relationships between the extracted entities are then annotated; the main flow of entity relationship annotation is as follows:

preparing annotation data: first, entity pairing is performed: several entities can be matched in each medical record text, and they are paired according to entity type and the distance between entities; the distance is measured in sentences rather than in characters, and the invention stipulates that if two entities are more than three sentences apart, no relationship between them is considered; next, entity-pair screening is performed: a deep learning model predicts whether each entity pair is related; the training set comes from pre-annotated corpora, the model is an LSTM, and its final performance is reported on a test set; the LSTM formulas are as follows:

f_t = σ(W_f [h_{t-1}, x_t] + b_f)

i_t = σ(W_i [h_{t-1}, x_t] + b_i)

o_t = σ(W_o [h_{t-1}, x_t] + b_o)

C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C)

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t

h_t = o_t ⊙ tanh(C_t)

where x_t is the input vector at position t of the LSTM network, h_t is the hidden-layer output at position t, C_t is the cell state at position t, W_(·) and b_(·) are trainable weights and biases, tanh and σ are activation functions, and ⊙ denotes element-wise multiplication; this step only screens out the entity pairs that are likely to be related; which relationship holds between each pair must still be determined by manual annotation;
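The entity pairing rule from the data preparation step (pair by entity type, and discard pairs that are more than three sentences apart) can be sketched as follows. This is a sketch under stated assumptions, not the patented implementation: the sentence delimiters, the entity tuple layout `(surface, type, start_offset)`, the type names, and the reading of "more than three sentences apart" as a difference in sentence index greater than 3 are all assumptions made for illustration.

```python
import re
from itertools import combinations

# Assumed sentence delimiters; the specification does not list them.
SENTENCE_END = re.compile(r"[。！？；]")

# From the text: entities more than three sentences apart are not paired.
MAX_SENTENCE_GAP = 3

# Type pairs that have a relation category, each stored in alphabetical order.
# The type names are hypothetical.
ALLOWED_TYPE_PAIRS = {
    ("disease", "disease"),
    ("disease", "site"),
    ("disease", "symptom"),
    ("disease", "treatment"),
    ("symptom", "treatment"),
    ("disease", "examination"),
    ("examination", "symptom"),
}


def sentence_index(text, offset):
    """Index of the sentence containing `offset`: the number of sentence
    delimiters that occur before that position."""
    return len(SENTENCE_END.findall(text, 0, offset))


def candidate_pairs(text, entities):
    """Pair entities by type and sentence distance.

    entities: list of (surface, entity_type, start_offset) tuples.
    """
    indexed = [(e, sentence_index(text, e[2])) for e in entities]
    pairs = []
    for (a, idx_a), (b, idx_b) in combinations(indexed, 2):
        type_pair = tuple(sorted((a[1], b[1])))
        if type_pair in ALLOWED_TYPE_PAIRS and abs(idx_a - idx_b) <= MAX_SENTENCE_GAP:
            pairs.append((a, b))
    return pairs
```

The resulting candidate pairs would then be passed to the screening model, and only the pairs it judges likely to be related go on to manual annotation.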

manual annotation process: first, an annotation standard is established and annotation examples are provided for the annotators to study; second, after annotator group A has annotated 1000 corpus items, accuracy is verified and the problems encountered are summarized; once the questions raised by group A have been resolved, it is decided whether to continue annotating; after half of the corpus has been annotated, annotator group B randomly samples 10% of group A's results for an acceptance check; the two groups discuss the consistency of their annotations and decide whether the annotations must be returned for rework; after annotation is finished, an expert randomly samples 5% of the corpus for acceptance, and if acceptance fails, the corpus must be re-annotated;

annotation acceptance: acceptance mainly refers to the accuracy measured in group B's acceptance check and the accuracy measured in the expert's acceptance check; annotation is complete when both accuracies reach a set value;
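The sampling-based acceptance check described above can be sketched as follows. The specification gives the sampling rates (10% for group B, 5% for the expert) but not the accuracy value required for acceptance, so `threshold` is left as a caller-supplied parameter; the function names and the `(item_id, label)` layout are assumptions made for illustration.

```python
import random


def sample_for_review(items, fraction, seed=None):
    """Randomly draw a fraction of the annotated items (at least one)."""
    rng = random.Random(seed)
    k = max(1, round(len(items) * fraction))
    return rng.sample(items, k)


def accuracy(sampled, reviewer_labels):
    """Share of sampled (item_id, label) pairs that match the reviewer's label."""
    agree = sum(1 for item_id, label in sampled if reviewer_labels[item_id] == label)
    return agree / len(sampled)


def accepted(group_b_accuracy, expert_accuracy, threshold):
    """Annotation is finished only when both acceptance accuracies reach the threshold."""
    return group_b_accuracy >= threshold and expert_accuracy >= threshold
```

For example, `sample_for_review(annotations, 0.10)` would give group B its sample and `sample_for_review(annotations, 0.05)` the expert's sample, where `annotations` is a list of annotated items.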

(1) Entity extraction

The invention extracts entities from Chinese electronic medical records by rule matching; the specific process is as follows:

constructing the entity library: the entities in the library are drawn from public resources such as ICD-10 and the common medical nouns (2019) issued by the national health administration; the required entities are extracted according to the entity types defined in the annotation specification and then stored in 5 corresponding entity tables;

rule-based entity matching: the invention matches entities against the entity library by rules and implements two matching methods, word segmentation search and string search; after a comprehensive comparison of the performance of the two, the result of the first method is selected; the two methods are implemented as follows:

word segmentation search: the text is first segmented into words, and each word is then checked against the entity library; if it is present, it is taken as an entity whose type is that of the entity table containing it; because long entities may be split during segmentation, the lookup in the second step tries three forms: the current word, the current word spliced with the next word, and the current word spliced with the next two words; the result depends on the quality of the segmentation: if segmentation errors are large, matching is poor, and segmentation can split an entity, causing missed labels;

string search: this method needs no word segmentation; instead, each entry in the entity library is matched against the text as a string, and a match is taken as an entity; string search produces no missed labels, since any library entity present in the text is matched, but a matched string is not necessarily an entity in its context;
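The two matching methods can be sketched side by side as below. This is an illustrative sketch rather than the invention's code: the entity library is assumed to be a dictionary from surface form to entity type, the word segmentation is assumed to have been done beforehand by some segmenter, and the function names are hypothetical.

```python
def segmented_search(tokens, entity_library):
    """Word segmentation search over pre-segmented tokens.

    For each position, look up the current word, the current word spliced
    with the next word, and the current word spliced with the next two words.
    Returns (surface, entity_type, first_token, end_token) tuples.
    """
    found = []
    for i in range(len(tokens)):
        for span in (1, 2, 3):
            if i + span > len(tokens):
                break
            candidate = "".join(tokens[i:i + span])
            if candidate in entity_library:
                found.append((candidate, entity_library[candidate], i, i + span))
    return found


def string_search(text, entity_library):
    """String search: match every library entry against the raw text.

    Returns (surface, entity_type, start_char, end_char) tuples.
    """
    found = []
    for surface, entity_type in entity_library.items():
        if not surface:
            continue
        start = text.find(surface)
        while start != -1:
            found.append((surface, entity_type, start, start + len(surface)))
            start = text.find(surface, start + 1)
    return found


# Illustrative data: the segmenter has split "高血压" into "高" and "血压".
library = {"高血压": "disease", "头痛": "symptom"}
tokens = ["患者", "高", "血压", "，", "伴", "头痛"]
by_tokens = segmented_search(tokens, library)
by_string = string_search("".join(tokens), library)
```

In this example the splicing step recovers the entity that segmentation split in two. An entity split into more than three words would still be missed by the segmented search, which is the missed-label risk noted above.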

post-processing: rule matching identifies most entities, but it has two unavoidable problems; the first is entity intersection, where two or more entities overlap in part; the second is that the same entity may have several entity types, for example being both a symptom and a disease; entity intersection is handled by rule-based merging: if one entity contains the other, the longer entity is kept, and if the two entities partially overlap, they are merged; when one entity contains the other, the result takes the type of the longer entity; when two partially overlapping entities are merged and one of them is a body-part (site) entity, the result takes the type of the other entity, and if neither is a body-part entity, the result takes the type of one of the two; for entities with several types, the invention predicts the type with a model based on context semantics; the training and test sets consist of entities that have only one type, with no entity shared between the two sets, and an LSTM + ATTN model predicts the type of multi-type entities; ATTN denotes the attention mechanism, which is computed as follows:

s_t = F(x_t, q)

α_t = softmax(s_t) = exp(s_t) / Σ_j exp(s_j)

where x_t is the input vector, q is the query vector, F(·) is the scoring function, s_t is the relevance score, and α_t is the normalized score; the scoring function can be computed in the following ways:

additive model:

s(x,q) = v^T tanh(Wx + Uq)

dot product model:

s(x,q) = x^T q

scaled dot-product model:

s(x,q) = x^T q / √D

bilinear model:

s(x,q) = x^T W q

where W, U and v are learnable parameters of the model and D is the dimension of the input vector; in the specific implementation, if the predicted entity type is in the set of types originally matched for the entity, it is used as the entity type; otherwise, one type is selected at random from the matched set;
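The two post-processing rules (merging overlapping entities and settling the type of a multi-type entity) can be sketched as follows. This is a minimal sketch under assumptions: entities are represented as `(start, end, type)` character spans with `end` exclusive, the body-part type is named `"site"`, the model prediction is passed in as a plain string, and "the type of one of the two" is taken to be the type of the earlier span. None of these details are fixed by the specification.

```python
import random


def merge_overlaps(entities):
    """Merge overlapping entity spans.

    entities: list of (start, end, entity_type) with distinct spans.
    Containment keeps the longer span and its type; partial overlap merges
    the spans, preferring the type that is not the body-part ("site") type.
    """
    merged = []
    # Sort by start, longer span first, so a containing span precedes its contents.
    for start, end, etype in sorted(entities, key=lambda e: (e[0], -e[1])):
        if merged and start < merged[-1][1]:
            p_start, p_end, p_type = merged[-1]
            if end <= p_end:
                continue  # contained in the previous span: keep the longer one
            # Partial overlap: merge; if the earlier span is a site, use the other type.
            new_type = etype if p_type == "site" else p_type
            merged[-1] = (p_start, end, new_type)
        else:
            merged.append((start, end, etype))
    return merged


def resolve_type(predicted_type, matched_types, rng=random):
    """Use the model's prediction if it is one of the matched types,
    otherwise pick one of the matched types at random."""
    if predicted_type in matched_types:
        return predicted_type
    return rng.choice(sorted(matched_types))
```

For instance, `merge_overlaps([(0, 3, "site"), (2, 6, "disease")])` merges the two spans into `(0, 6, "disease")`, following the rule that a merged entity takes the type of the non-site entity.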

(2) relationship extraction