CN113688632B

CN113688632B - Method and system for extracting structured data of disease prognosis covariates

Info

Publication number: CN113688632B
Application number: CN202110941747.7A
Authority: CN
Inventors: 贺佳; 吴骋; 林振; 秦宇辰; 秦婴逸; 李冬冬; 王志勇; 何倩; 陈琪; 郭威; 郭轶斌
Original assignee: Second Military Medical University SMMU
Current assignee: Second Military Medical University SMMU
Priority date: 2021-08-17
Filing date: 2021-08-17
Publication date: 2022-10-04
Anticipated expiration: 2041-08-17
Also published as: CN113688632A

Abstract

The invention provides a method, a system, an intelligent terminal and a computer-readable storage medium for extracting structured data of disease prognosis covariates from unstructured medical texts. At each stage of data processing, the method uses the best-performing model for data extraction, which improves the accuracy of database construction. With this technical scheme, the extraction of structured data can be completed simply by inputting the name of a covariate. A structured database suitable for statistical analysis is extracted from Chinese medical text that cannot be used directly for statistical analysis, helping clinicians find potential prognostic factors in medical history text. The method avoids manual covariate extraction, offers good compatibility and portability, and can be conveniently embedded, developed and maintained on various platforms.

Description

Method and system for extracting structured data of disease prognosis covariates

Technical Field

The invention relates to the technical field of natural language processing, in particular to a method, a system, an intelligent terminal and a computer readable storage medium for extracting structured data of disease prognosis covariates based on unstructured medical texts.

Background

The electronic medical record is a high-quality component of real-world big data. Electronic medical records emerged at the beginning of the 21st century; their usage rate was only 9% in 2008 and had risen to 96% by 2015. Because the electronic medical record has replaced the traditional handwritten medical record, it accounts for a growing proportion of real-world data, and its data quality is higher than that of real-world data from media such as the Internet. By 2018, the health record data platform of the Shanghai Municipal Health Commission held more than ten million inpatient records and one billion emergency records. The electronic medical record mainly comprises the medical record front page, admission records, the discharge summary, various imaging pictures and the like. Much important clinical information is recorded in unstructured text, such as the history of present illness, physical examination and past medical history, which clinicians spend a great deal of time recording. According to expert estimates, this information accounts for more than 80% of the total, yet its utilization rate is low and it cannot be used directly for statistical analysis.

At present, natural language processing (NLP) is widely used to extract information from unstructured electronic medical records. Converting unstructured text into structured data with NLP technology effectively reduces the time spent reading text manually to extract data, improves the usability of unstructured data, and makes automatic processing of large-scale text possible. Because the electronic medical record is composed of different parts, each with its own content structure, the data extraction method differs from part to part. At present, both in China and abroad, there is little research on or application of methods that convert medical texts directly into structured databases for statistical analysis; most information extraction research on Chinese medical texts concerns named entity recognition, for which related patents exist. The prior art also lacks a method for constructing a structured database of prognostic factors whose data can be used directly for data analysis, so as to support application scenarios such as the analysis of clinical prognostic factors and the construction of prognostic models. In this application scenario, existing named entity recognition methods cannot be applied directly.

Disclosure of Invention

In order to overcome the technical defects, the first aspect of the invention provides a method for extracting structured data of disease prognosis covariates based on unstructured medical texts, which comprises the following steps:

step S1: preprocessing unstructured medical text: acquiring an unstructured medical text, removing, by means of a regular expression, the text containing negation words and/or negative-finding words from the unstructured medical text, and labeling the unstructured medical text using the BIO labeling scheme;

step S2: identifying medical entities by an NER model: the NER model is a medical entity recognition model based on an ERNIE pre-training model, a dilated (expanded) convolutional neural network and a conditional random field; first, the labeled medical text is converted into word vectors by the ERNIE pre-training model; the word vectors are then input into the dilated convolutional neural network to obtain the label scores of each word; finally, the label scores of each word (namely the output of the dilated convolutional neural network) are input into the conditional random field to obtain the medical entity category of each word;

in the past, research on NER models focused on person names, place names, organization names and the like, and medical entities received less attention. Medical entities have features specific to their field: there are many categories, the same medical entity has numerous expressions, and no dictionary can list them exhaustively. Specific entities therefore have to be found by mining the relationships within the context, and deep learning can recognize named entities by learning the deep hidden features of medical texts. The prior-art Word2Vec model cannot be fine-tuned for downstream tasks, and its word vectors do not change with context, so using Word2Vec as the word embedding layer degrades performance for certain classes of entities. Through fine-tuning, ERNIE adjusts the word vector according to the context, expresses the meaning of a word in a specific context more accurately, resolves word ambiguity, and improves the performance of the NER model. With the same neural network, ERNIE performs better than the prior-art BERT model, because ERNIE uses more high-quality Chinese corpora during pre-training; with the same pre-training model, IDCNN performs better than the prior-art BiLSTM model, is better suited to parallel training, and is markedly faster than the BiLSTM model;

step S3: constructing a semi-structured database: constructing a semi-structured database from the identified medical entity categories and entity names, wherein the semi-structured database comprises a patient number, a medical entity category and an entity name;

step S4: determining the presence of a target medical entity: an ERNIE deep learning model is trained using the semi-structured database to construct a covariate extractor; the standard name of a target medical entity is input into the covariate extractor; the ERNIE deep learning model compares the standard name of the target medical entity with the entity names in the semi-structured database, and a logistic regression function judges whether the standard name and the entity name are similar; if they are similar, they match, indicating that the target covariate is present in the unstructured medical text, and the output is "1" (taking a disease entity as an example, "1" indicates that the patient has the disease corresponding to the medical entity name); if they are not similar, they do not match, indicating that the target covariate is absent from the unstructured medical text, and the output is "0" (taking a disease entity as an example, "0" indicates that the patient does not have the disease corresponding to the medical entity name); the ERNIE deep learning model is an integrated whole, of which the logistic regression is one step;

in a traditional text similarity recognition model, the similarity is calculated first, and whether the texts match is then decided by setting a threshold or by ranking; this approach is often affected by human factors, and the choice of threshold has a great influence on the result. In the present application, text similarity matching is used to unify entities through supervised learning, and the required covariates can be extracted accurately after comparing the performance of several deep learning models. In addition, ERNIE in the present application adopts a twin (Siamese) network in which the network parameters for the two entities are shared, so that overfitting is less likely, the amount of computation is small, the time consumption is short and the demands on computer performance are low; the resulting performance is therefore superior to that of the prior-art BERT model;

step S5: constructing a structured database: after the name of the target medical entity is sequentially input into the covariate extractor, the covariate extractor constructs a structured database, and the structured database comprises the patient number, the standard name of the target medical entity and the corresponding output result.

The "standard name" in the present application mainly refers to a name taken from an internationally recognized standard nomenclature or coding dictionary, for example the International Classification of Diseases, 10th Revision (ICD-10).

The standard name for the target medical entity refers to the target medical entity that needs to be structured, for example, if a doctor wants to know which people have myocardial infarction, the standard name of the target entity is myocardial infarction.
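Steps S3 to S5 can be illustrated with a minimal sketch. The function names, the row layout and the substring-based `toy_match` are hypothetical stand-ins introduced for illustration only; in the application, the match decision is made by the trained ERNIE covariate extractor, not by a string test.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# One row of the semi-structured database (step S3):
# (patient number, medical entity category, entity name)
SemiRow = Tuple[str, str, str]


def build_structured_db(
    semi_rows: Iterable[SemiRow],
    standard_names: List[str],
    is_match: Callable[[str, str], bool],
) -> Dict[str, Dict[str, int]]:
    """Return {patient number: {standard name: 0 or 1}} (step S5)."""
    rows = list(semi_rows)
    patients = sorted({pid for pid, _, _ in rows})
    # Every covariate starts at 0 (absent) for every patient.
    table = {pid: {name: 0 for name in standard_names} for pid in patients}
    for pid, _category, entity in rows:
        for name in standard_names:
            # Step S4: a match means the covariate is present.
            if is_match(name, entity):
                table[pid][name] = 1
    return table


def toy_match(standard_name: str, entity_name: str) -> bool:
    # Hypothetical stand-in for the ERNIE-based matcher.
    return standard_name in entity_name


rows = [
    ("P001", "disease", "acute myocardial infarction"),
    ("P001", "drug", "aspirin"),
    ("P002", "disease", "type 2 diabetes"),
]
db = build_structured_db(rows, ["myocardial infarction", "diabetes"], toy_match)
```

Each patient ends up with one 0/1 indicator per requested covariate, which is the shape a statistical analysis would consume directly.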

Further, in step S4, the ERNIE deep learning model uses a 12-layer Transformer with a hidden layer size of 768 and a multi-head attention mechanism with 12 heads; the optimizer is Adam, the learning rate is set to 2e-05, the number of samples selected for one training step (batch size) is 32, and training runs for 10 iterations.

Further, in step S4, the similarity comparison comprises the following steps: using a twin (Siamese) network structure, the two entities, namely the standard name of the target medical entity and the entity name, are first fed separately into ERNIE, whose parameters are shared between the two, to obtain the sentence vectors of the two entities; the sentence vectors are then fed into a pooling layer, where feature extraction and compression are performed by mean pooling to obtain u and v; finally, u, v and |u-v| are concatenated and fed into a fully connected layer to compare the similarity of the two entities, and a logistic regression function judges whether the two entities are similar; if they are similar, they match, indicating that the target covariate is present in the original unstructured medical text; if they are not similar, they do not match, indicating that the target covariate is absent from the original unstructured medical text.
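The pooling and classification head described above can be sketched in plain Python. This is an illustrative sketch only: the token vectors stand in for ERNIE outputs, the 2-dimensional vectors, the weights and the bias are made-up toy values, and the 0.5 decision boundary is the usual logistic-regression convention assumed here.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> Vector:
    """Average the token vectors of one entity into a single vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def match_probability(
    tokens_a: Sequence[Sequence[float]],
    tokens_b: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Concatenate u, v and |u - v|, then apply a logistic output unit."""
    u = mean_pool(tokens_a)
    v = mean_pool(tokens_b)
    features = u + v + [abs(a - b) for a, b in zip(u, v)]
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy inputs (assumed values, not ERNIE outputs).
tokens_a = [[1.0, 0.0], [3.0, 2.0]]           # pools to u = [2.0, 1.0]
tokens_same = [[2.0, 1.0]]                     # pools to v = [2.0, 1.0]
tokens_other = [[0.0, 5.0]]                    # pools to v = [0.0, 5.0]
weights = [0.0, 0.0, 0.0, 0.0, -4.0, -4.0]     # penalize large |u - v|
bias = 2.0

prob_same = match_probability(tokens_a, tokens_same, weights, bias)
prob_other = match_probability(tokens_a, tokens_other, weights, bias)
label_same = int(prob_same >= 0.5)    # 1 means "match"
label_other = int(prob_other >= 0.5)  # 0 means "no match"
```

Because the same encoder produces both u and v, the classifier only has to learn how to read the concatenated features, which is what keeps the parameter count low.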

Further, the categories of medical entities include disease entities, drug entities, surgical entities, imaging examination entities, and symptom entities.

Further, the unstructured medical text is a discharge summary.

A second aspect of the present application provides a system for extracting structured data of disease prognosis covariates based on unstructured medical texts, comprising a preprocessing module, an identification module, a semi-structured database construction module, a comparison module and a structured database construction module;

the preprocessing module is used for preprocessing unstructured medical text: acquiring an unstructured medical text, removing, by means of a regular expression, the text containing negation words and/or negative-finding words from the unstructured medical text, and labeling the unstructured medical text using the BIO labeling scheme;

the identification module is configured to identify medical entities by an NER model: the NER model is a medical entity recognition model based on an ERNIE pre-training model, a dilated (expanded) convolutional neural network and a conditional random field; first, the labeled medical text is converted into word vectors by the ERNIE pre-training model; the word vectors are then input into the dilated convolutional neural network to obtain the label scores of each word; finally, the label scores of each word are input into the conditional random field to obtain the medical entity category of each word;

the semi-structured database construction module is used for constructing a semi-structured database: constructing a semi-structured database according to the identified medical entity category and entity name, wherein the semi-structured database comprises a patient number, a medical entity category and an entity name;

the comparison module is used for judging whether the target medical entity is present in the unstructured medical text: an ERNIE deep learning model is trained using the semi-structured database to construct a covariate extractor; the standard name of a target medical entity is input into the covariate extractor; the ERNIE deep learning model compares the standard name of the target medical entity with the entity names in the semi-structured database, and a logistic regression function judges whether the standard name and the entity name are similar; if they are similar, they match, indicating that the target covariate is present in the unstructured medical text, and the covariate extractor outputs "1" (taking a disease entity as an example, "1" indicates that the patient has the disease corresponding to the medical entity name); if they are not similar, they do not match, indicating that the target covariate is absent from the unstructured medical text, and the covariate extractor outputs "0" (taking a disease entity as an example, "0" indicates that the patient does not have the disease corresponding to the medical entity name);

the structured database construction module is used for constructing a structured database: after the name of the target medical entity is sequentially input into the covariate extractor, the covariate extractor constructs a structured database, and the structured database comprises the patient number, the standard name of the target medical entity and the corresponding output result.

A third aspect of the present application provides an intelligent terminal, including:

a memory for storing executable program code; and

a processor for reading executable program code stored in the memory to perform the above method of extracting structured data of disease prognosis covariates based on unstructured medical text. The intelligent terminal includes but is not limited to a PC, a portable computer, a mobile terminal and other devices having display and processing functions.

A fourth aspect of the present application provides a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method for extracting structured data of disease prognosis covariates based on unstructured medical text. The computer-readable storage medium includes, but is not limited to, various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

After the technical scheme is adopted, compared with the prior art, the method has the following beneficial effects:

the invention extracts a structured database suitable for statistical analysis from Chinese medical texts that cannot be used directly for statistical analysis, and provides a covariate extraction method and system that help clinicians extract potential prognostic factors from medical history texts; it can be applied to scenarios such as the analysis of disease prognostic factors or the construction of prediction models. The process of manually extracting covariates is no longer needed. The basic principle of the invention is simple and its operation is easy and effective; it has no special hardware or software requirements, offers good compatibility and portability, and can be conveniently embedded, developed and maintained on various platforms. The method is suitable for all kinds of medical personnel, and the extraction of structured data can be completed simply by inputting the name of the covariate.

At each stage of data processing, the method uses the best-performing model for data extraction, which improves the accuracy of database construction. In addition, compared with reading medical records and extracting their information manually, the method greatly improves the efficiency of database construction. At present, large structured medical databases are being constructed worldwide, and some public databases can be used directly for analysis; however, their construction is laborious, and many of them require professionals to read the records and enter the relevant information manually. In the present method, the extraction of structured data is automated by computer, and specific covariates can be extracted by the covariate extractor according to subsequent needs.

Drawings

FIG. 1 is a flow chart of a method of extracting structured data of disease prognosis covariates from unstructured medical text;

FIG. 2 is an example of labeling medical entities in a discharge summary using the BIO labeling scheme;

FIG. 3 is a framework of the NER model;

FIG. 4 is a schematic diagram of the masking strategies of BERT and ERNIE;

FIG. 5 is an IDCNN model for text;

FIG. 6 shows the prediction results of the NER model without CRF layer;

FIG. 7 shows the predicted results of a NER model with a CRF layer;

FIG. 8 is a labeling result of the correct medical entity category;

FIG. 9 is a schematic diagram illustrating a text similarity matching process of the ERNIE deep learning model;

FIG. 10 is a heat map of the correlations among prognostic covariates in patients with ischemic stroke.

Detailed Description

The advantages of the invention are further illustrated by the following detailed description of the preferred embodiments in conjunction with the drawings. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.

In this embodiment, a structured database of prognostic factors for ischemic stroke is constructed from the discharge summaries in the electronic medical record.

The data used in this embodiment come mainly from the electronic medical record texts of 6053 patients with ischemic stroke treated at Changhai Hospital, Shanghai, from 2009 to 2019.

The research platform is provided by the Department of Military Health Statistics of the Naval Medical University and the Information Department of Changhai Hospital, Shanghai. The server runs the 64-bit Windows Server 2008 R2 operating system with an 8-core Intel(R) Xeon(R) CPU; the personal workstation runs 64-bit Windows 10 with an 8-core Intel(R) Core(TM) i9-9900K CPU and an NVIDIA GeForce RTX 2080 SUPER GPU. The software environment comprises Python 3.7, TensorFlow 1.10 and C#.

The present embodiment aims to extract, from unstructured medical texts, a structured database of the factors (i.e., covariates) that influence the prognosis (e.g., length of stay) of ischemic stroke patients. The covariates fall into categories such as diseases, drugs, operations, imaging examinations and symptoms, and the medical entities are classified into the same categories accordingly. The definitions of the different medical entity categories are shown in Table 1:

TABLE 1 definition of medical entity classes

As shown in fig. 1, the method for extracting structured data of covariates affecting prognosis from medical texts provided by the present application comprises the following steps:

step 1: medical text preprocessing (Main body: client)

Discharge summaries of ischemic stroke patients at Changhai Hospital, Shanghai, from 2009 to 2019 were collected, 6053 cases in total. All information related to patient privacy, such as names and home addresses, had been deleted by the data provider when the data were obtained. The original unstructured medical text (namely the discharge summary), without any further processing, is input into the system. Text containing negation words and negative-finding words is removed from the discharge summary using the regular-expression function re.sub, with a pattern that matches any punctuation-delimited clause containing a term such as "without", "deny", "normal", "(-)" or "negative", so as to retain as much text with positive findings as possible; the discharge-order information is likewise deleted using a regular expression.
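The clause-removal step can be sketched as follows. This is an assumed reconstruction, not the exact expression used in the application: the Chinese term list, the clause delimiters and the sample sentence are all hypothetical choices made for illustration.

```python
import re

# Assumed negation / negative-finding terms:
# 无 (without), 否认 (deny), 未见 (not seen), 正常 (normal), (-), 阴性 (negative)
NEGATION_TERMS = ["无", "否认", "未见", "正常", r"\(-\)", "阴性"]

# Assumed clause delimiters: Chinese comma, full stop and semicolon.
CLAUSE_END = "，。；"

# A clause is a run of non-delimiter characters that contains one of the
# terms, together with its closing delimiter if there is one.
NEGATED_CLAUSE = re.compile(
    r"[^{end}]*(?:{terms})[^{end}]*[{end}]?".format(
        end=CLAUSE_END, terms="|".join(NEGATION_TERMS)
    )
)


def remove_negated_clauses(text: str) -> str:
    """Delete every clause that contains a negation term."""
    return NEGATED_CLAUSE.sub("", text)


# "Dizziness for 3 hours, no nausea or vomiting, denies diabetes,
#  head CT shows lacunar cerebral infarction."
sample = "患者头晕3小时，无恶心呕吐，否认糖尿病史，头颅CT示腔隙性脑梗死。"
cleaned = remove_negated_clauses(sample)
```

Removing whole clauses rather than single words keeps a negated finding such as "no nausea" from being read later as a positive symptom entity.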

The data are then labeled manually: the discharge summaries of 1000 ischemic stroke patients are randomly selected for labeling and, after labeling, checked by an expert from the medical records office of the Information Department of Changhai Hospital, Shanghai. In the present application, the BIO labeling scheme is used: B marks the beginning of an entity, I marks the remainder of the entity after its beginning, and O marks non-entity text. B and I are followed by the category of the entity to which they belong. Taking the sentence "Xiao Ming suddenly had a stroke and felt nausea and vomiting; after an MRI examination, thrombectomy was performed and the patient started taking aspirin" as an example, the labeling is shown in FIG. 2.
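The BIO scheme can be illustrated with a short sketch that turns annotated entity spans into per-character tags. The function name, the span format (start, end-exclusive, category) and the sample phrase are assumptions made for illustration.

```python
from typing import List, Tuple

# (start index, end index exclusive, entity category)
Span = Tuple[int, int, str]


def bio_tags(text: str, entities: List[Span]) -> List[str]:
    """Return one BIO tag per character of `text`."""
    tags = ["O"] * len(text)
    for start, end, category in entities:
        tags[start] = f"B-{category}"        # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"        # remaining characters
    return tags


text = "服用阿司匹林"              # "takes aspirin"
entities = [(2, 6, "drug")]        # 阿司匹林 (aspirin) spans characters 2..5
tags = bio_tags(text, entities)
```

Here the two characters of "takes" are tagged O, the first character of "aspirin" is tagged B-drug and the other three are tagged I-drug.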

After the data annotation is completed, the data set is randomly divided in a 3:1:1 ratio into three parts, namely a training set, a verification set and a test set; the entity distribution of each data set is shown in Table 2. The training set is used to train and fit the NER model (i.e., the ERNIE + IDCNN + CRF model), the verification set is used to tune the hyper-parameters of the model, and the test set is used to evaluate the performance of the model.
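A random 3:1:1 split can be sketched as below. The function name and the fixed seed are illustrative assumptions, and the sketch splits at the level of whole documents, which the text does not state explicitly.

```python
import random
from typing import List, Sequence, Tuple


def split_3_1_1(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle `items` and split them into 3/5, 1/5 and 1/5 parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_train = len(shuffled) * 3 // 5
    n_verify = len(shuffled) // 5
    train = shuffled[:n_train]
    verify = shuffled[n_train:n_train + n_verify]
    test = shuffled[n_train + n_verify:]
    return train, verify, test


# 1000 labeled discharge summaries, identified here by index.
train, verify, test = split_3_1_1(range(1000))
```

With 1000 labeled summaries this yields 600 for training and 200 each for verification and testing.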

TABLE 2 medical entity Category distribution of datasets

Data set	Disease and disorder	Medicine	Surgery	Imaging examination	Symptoms and signs
Training set	3975	2301	371	2144	1647
Verification set	1517	993	112	839	661
Test set	1437	1059	114	821	679

Take the discharge summary of a certain ischemic stroke patient as an example:

"Discharge diagnosis: 1. posterior circulation ischemia; 2. hypertension grade 3 (very high risk group); 3. type 2 diabetes. Length of stay: 10 days. Condition on admission: admitted on 2015-08-04 from the outpatient clinic with a provisional diagnosis of 'dizziness', because of 'dizziness with object rotation for 3 hours'. Head CT on 2015-8-4: multiple lacunar cerebral infarctions in the brainstem, the subthalamic and thalamic regions on both sides and the centrum semiovale, with partial softening. Diagnosis and treatment: after admission, examinations were completed, including head MRI and cerebral angiography. Antiplatelet therapy was given, together with atorvastatin for lipid regulation, Shuxuetong for promoting blood circulation, ginkgo dipyridamole for improving circulation, and vinpocetine for nourishing the nerves. Condition at discharge: the patient's condition is currently stable and the general condition is good, improved compared with admission."

wherein "posterior circulation ischemia", "hypertension grade 3 (very high risk group)", "type 2 diabetes" and "lacunar cerebral infarction" are disease entities;

wherein "atorvastatin", "Shuxuetong", "ginkgo dipyridamole" and "vinpocetine" are drug entities;

wherein "cerebral angiography" is a surgical entity;

wherein "CT" is an imaging examination entity;

wherein "dizziness" and "object rotation" are symptom entities.

In addition to classifying the named entities into five categories, namely disease, drug, surgery, imaging examination and symptom, it is also necessary to determine the boundaries of the medical entities. In "posterior circulation ischemia" (后循环缺血), the character "后" ("posterior") is the beginning of the disease entity and the character "血" ("blood") is its end.

Step 2: identifying medical entity (Main body: client)

After a certain number of medical entities have been labeled, an NER model (namely a medical entity recognition model that combines pre-trained word embeddings, a dilated convolutional neural network and a conditional random field) is constructed and trained on the labeled ischemic stroke texts. The basic framework of the NER model consists of the knowledge-enhanced semantic representation pre-training model (ERNIE), the dilated convolutional neural network (IDCNN) and the conditional random field (CRF). First, the text is converted into word vectors by the ERNIE pre-training model; the word vectors are then input into the dilated convolutional neural network, and the output of the dilated convolutional neural network is input into the conditional random field to obtain the entity category of each word. Disease entities, drug entities, surgical entities, imaging examination entities, symptom entities and the like are thereby identified. A neural network cannot recognize characters directly; the role of the pre-training model is to convert the characters in the text into character vectors that the neural network can process, which serve as the initial input of the model. The purpose of this step is to enable the computer, using the trained model, to acquire automatically, and as comprehensively and accurately as possible, the entity names of potential practical significance contained in the original medical texts.

The framework of the NER model is shown in FIG. 3. The training data are first divided into batches in units of sentences, each batch containing 64 sentences and each sentence containing at most 128 words. For each batch: first, the batch of data is input into the pre-training model to generate word vectors; second, the word vectors are input into the dilated convolutional neural network layer to obtain the scores of all labels for each word; third, the label scores of each word are input into the CRF layer to compute the network output; fourth, the error is back-propagated and the network parameters are updated.
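The batching described above can be sketched as follows. The function name is hypothetical, and truncating over-long sentences to the maximum length is an assumption, since the text does not say how longer sentences are handled.

```python
from typing import List


def make_batches(
    sentences: List[str], batch_size: int = 64, max_len: int = 128
) -> List[List[str]]:
    """Group sentences into batches, limiting each sentence to max_len."""
    # Assumption: over-long sentences are simply truncated.
    truncated = [sentence[:max_len] for sentence in sentences]
    return [
        truncated[i:i + batch_size]
        for i in range(0, len(truncated), batch_size)
    ]


# 150 toy sentences, one of which is longer than the limit.
sentences = ["头晕伴视物旋转3小时"] * 149 + ["头" * 200]
batches = make_batches(sentences)
batch_sizes = [len(batch) for batch in batches]
```

The last batch is smaller than 64 whenever the number of sentences is not a multiple of the batch size.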

The parameters of the NER model (i.e., ERNIE + IDCNN + CRF model) are set as follows: the learning rate is 1e-5, the dropout value is 0.5, the gradient cutoff value is 5, and the iteration number is 100.

By learning the semantic and grammatical information of the language from a massive corpus, the pre-training model represents the characteristics of each word better than a method that generates a vector for each word by random initialization. ERNIE, the knowledge-enhanced semantic representation model, learns the true semantic relationships within text by using prior semantic knowledge. The ERNIE model structure is composed of an input layer, an encoding layer based on a bidirectional Transformer, and a task-specific output layer. ERNIE uses three different masking modes during masked language model (MLM) training: in the first mode, 15% of the characters are randomly selected and masked; in the second mode, Chinese phrases are obtained by word segmentation, and some phrases are randomly selected and masked; in the third mode, entities such as person names and place names in the corpus are selected according to prior knowledge and randomly masked. The training corpus consists of high-quality Chinese corpora such as Baidu Baike and Chinese Wikipedia.

As shown in FIG. 4, for the phrase "administering anti-inflammatory, anti-cough and sputum-reducing symptomatic treatment", BERT is trained by masking individual characters, for example "administering anti-inflammatory and anti-cough [mask] symptomatic [mask] treatment". BERT thus learns the representations of "sputum" and "treatment" through local co-occurrence, but cannot learn the deeper semantics of "sputum reduction" and "treatment". ERNIE is trained by masking whole words, for example "administering anti-inflammatory and anti-cough [mask] [mask] symptomatic [mask] [mask]", so that the model can capture the relationship between "sputum reduction" and "treatment" and learn that "sputum reduction" is a means of "treatment". In addition, for sentence-level text relation training, ERNIE uses a dialogue language model, repeatedly learning the relevance of context sentences by randomly selecting sentences to replace question or answer sentences. For these reasons, ERNIE outperforms BERT.

CNN is derived from the receptive field mechanism: a given neuron of the human visual system, for example, becomes excited only when a specific signal is present. A general CNN is composed of convolutional layers, pooling layers and fully connected layers. The convolutional layer is used for feature extraction, and various features can be extracted by setting different types of convolution kernels. The pooling layer is used for feature selection: after the convolutional layer extracts features, only the number of network connections is reduced, not the number of neurons, so the pooling layer is used to effectively reduce the number of features. For NER, however, each character must be given a specific class label and the context is highly correlated, whereas a general CNN may capture only a small part of the information in the original data after the convolution operation; adding convolution kernels makes the number of parameters too large, which makes the model difficult to train, and the pooling layer causes information loss. In view of these shortcomings of CNN in handling the NER task, the field of view can be enlarged by dilating the convolution kernels, without adding convolution kernels and with the pooling layer removed, so that each convolution kernel captures a larger range of information. In the present application, as shown in fig. 5, a is a conventional 2-layer 3*3 convolution kernel whose field of view is 5, i.e. the i-th layer perceives a context distance of $2i+1$; b is a 2-layer convolution kernel with a dilation coefficient of 2 whose field of view is 7, i.e. the i-th layer perceives a context distance of $2^{i+1}-1$. The context distance of the conventional convolution kernel thus grows linearly with the number of layers, while that of the dilated convolution kernel grows exponentially, so that the neural network can capture long-distance text relations.
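The two field-of-view figures above can be checked with a short calculation. The following is a minimal sketch, assuming 1-D convolutions of width 3 stacked without pooling and, for the dilated case, a dilation that doubles at each layer (1, 2, 4, ...); the function name `receptive_field` is illustrative and does not come from the original.

```python
def receptive_field(kernel_size, dilations):
    """Number of input positions seen by one output position.

    Each stacked layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


# Conventional 2-layer stack (dilation 1 in every layer): 2i + 1 with i = 2
conventional = receptive_field(3, [1, 1])

# Dilated 2-layer stack (dilations 1 and 2): 2^(i+1) - 1 with i = 2
dilated = receptive_field(3, [1, 2])

print(conventional, dilated)
```

Under these assumptions the conventional stack gains 2 positions per layer, while the dilated stack roughly doubles its field with each added layer.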

Fig. 6 shows a prediction result that may arise if only the neural network layers described above are used. The result in fig. 6 is clearly erroneous: "si" should be labeled "I-drug". That is, only the first character of each entity should be labeled "B" and the remaining characters of the entity should be labeled "I"; "I" labels of different classes cannot be adjacent, and two adjacent "I" labels must belong to the same class. Because the preceding neural network layers cannot exploit the relationships between entity labels, they cannot exclude unreasonable combinations such as "B-drug/B-drug". The present application solves this problem with a CRF, which constrains the relationships within the label sequence; the result is shown in fig. 7.
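The hard labeling rules stated above can be written as a small check. This is only an illustrative sketch with hypothetical function names; it covers the structural rule that an "I" label must continue an entity of the same class, whereas preferences such as avoiding "B-drug/B-drug" inside one entity are learned by the CRF from data rather than hard-coded.

```python
def is_valid_transition(prev_label, curr_label):
    """An I-X label is only allowed directly after B-X or I-X of the same class X."""
    if curr_label.startswith("I-"):
        entity_class = curr_label[2:]
        return prev_label in ("B-" + entity_class, "I-" + entity_class)
    return True  # "O" and any "B-X" may follow any label


def is_valid_sequence(labels):
    """Check a whole label sequence, treating the position before the start as "O"."""
    prev = "O"
    for curr in labels:
        if not is_valid_transition(prev, curr):
            return False
        prev = curr
    return True
```

For example, `["B-drug", "I-drug", "O"]` passes the check, while `["B-drug", "I-disease"]` does not.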

CRF is a type of undirected graph model; the form most commonly used in NER is the linear-chain structure, which is used for sequence labeling. For a given character sequence $x = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the feature vector of the i-th character, and the corresponding label sequence $y = \{y_1, y_2, \ldots, y_n\}$, let $Y(x)$ denote the possible label sequences of x, $S$ the potential function and $\theta$ the parameters of the model; we then have:
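The formula that followed this sentence is not reproduced in the text. As a hedged reconstruction, the standard linear-chain CRF form that is consistent with the symbols defined above would be the following; the split of $S$ into an emission term $E_\theta$ and a transition term $T_\theta$ is an assumption made for illustration and is not taken from the original.

```latex
P(y \mid x; \theta)
  = \frac{\exp S(x, y; \theta)}{\sum_{y' \in Y(x)} \exp S(x, y'; \theta)},
\qquad
S(x, y; \theta)
  = \sum_{i=1}^{n} E_{\theta}(x_i, y_i) + \sum_{i=2}^{n} T_{\theta}(y_{i-1}, y_i)
```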

By using the CRF, the relationships between context labels can be learned, and the globally optimal label sequence for the input character sequence $x = \{x_1, x_2, \ldots, x_n\}$ is solved for with the Viterbi algorithm.
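The decoding step can be illustrated as follows. This is a minimal sketch of Viterbi decoding over per-character label scores and label-to-label transition scores; all numbers, the label set and the names `emissions` and `transitions` are hypothetical values chosen for illustration, not parameters from the original model.

```python
def viterbi(emissions, transitions):
    """Return (best_score, best_path) for a linear-chain model.

    emissions[t][b]   : score of label b at position t
    transitions[a][b] : score of label a being followed by label b
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_score, pointers = [], []
        for b in range(n_labels):
            best_a = max(range(n_labels), key=lambda a: score[a] + transitions[a][b])
            pointers.append(best_a)
            new_score.append(score[best_a] + transitions[best_a][b] + emission[b])
        score = new_score
        backpointers.append(pointers)
    last = max(range(n_labels), key=lambda b: score[b])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


labels = ["O", "B-drug", "I-drug"]
# Taken position by position, the highest emission score is "B-drug" twice.
emissions = [[0.1, 2.0, 0.5],
             [0.2, 1.2, 1.0]]
FORBIDDEN = -1e4  # "I-drug" may not directly follow "O"
transitions = [[0.0, 0.0, FORBIDDEN],   # from O
               [0.0, -1.0, 0.5],        # from B-drug
               [0.0, 0.0, 0.3]]         # from I-drug

best_score, best_path = viterbi(emissions, transitions)
print([labels[i] for i in best_path])
```

With these illustrative scores, decoding the whole sequence jointly selects "B-drug" followed by "I-drug", whereas choosing the best label per character would have produced "B-drug/B-drug".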

After the text passes through the IDCNN network, the candidate labels of each character are scored, and the highest-scoring label type is taken as the predicted medical entity; the correct medical entity labeling result should be as shown in fig. 8.

Step 3: Extracting the entity recognition results and constructing a stroke-related semi-structured database

The ERNIE + IDCNN + CRF model serves two functions: 1. identifying medical entities; 2. extracting all medical entities to construct a semi-structured database. Following the entity recognition in the previous step, the model with the best recognition performance, namely ERNIE + IDCNN + CRF, is selected to extract all named entities and construct a semi-structured data set, as shown in Table 4.

For example, a semi-structured database (Table 4) is constructed for each patient based on the medical entities identified in step 2. In the semi-structured database, column 1 is the patient number; columns 2 to 21 are 20 reserved disease columns with the headers "disease 1", "disease 2", ... "disease 20"; columns 22 to 41 are 20 reserved drug columns with the headers "drug 1", "drug 2", ... "drug 20"; columns 42 to 61 are 20 reserved surgery columns with the headers "surgery 1", "surgery 2", ... "surgery 20"; columns 62 to 81 are 20 reserved imaging examination columns with the headers "imaging examination 1", "imaging examination 2", ... "imaging examination 20"; and columns 82 to 101 are 20 reserved symptom columns with the headers "symptom 1", "symptom 2", ... "symptom 20". The number of columns reserved for each entity category can be adjusted manually as needed.
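The column layout described above can be sketched as follows. The function names and the handling of empty or surplus entities (padding with empty strings, truncating beyond the reserved slots) are assumptions for illustration; the original only specifies the column order and the adjustable number of reserved columns.

```python
CATEGORIES = ["disease", "drug", "surgery", "imaging examination", "symptom"]


def build_header(slots_per_category=20):
    """Column headers: patient number, then the reserved slots for each category."""
    header = ["patient number"]
    for category in CATEGORIES:
        header += [f"{category} {k}" for k in range(1, slots_per_category + 1)]
    return header


def build_row(patient_id, entities, slots_per_category=20):
    """One row per patient; `entities` maps a category to its recognized entity names.

    Assumption: unused slots are left empty and surplus entities are dropped.
    """
    row = [patient_id]
    for category in CATEGORIES:
        found = list(entities.get(category, []))[:slots_per_category]
        row += found + [""] * (slots_per_category - len(found))
    return row


header = build_header()
row = build_row("P001", {"disease": ["Acute cerebral infarction"],
                         "drug": ["Aspirin enteric-coated tablet"]})
```

With the default of 20 slots per category, the header has 101 columns, and "drug 1" falls in column 22, matching the layout in the text.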

TABLE 4 semi-structured database

Step 4: Constructing a covariate extractor to determine the presence of, and extract, target covariates (executed by: client)

Because the unstructured part of the electronic medical record is written by doctors or nursing staff, and each doctor has an individual style and terminology for describing a patient's diseases, symptoms and so on, electronic medical records lack standardization and uniformity, and the same entity may be referred to by several names. Taking the disease category as an example, a disease may have several names (myocardial infarction, for instance, is written in more than one way); some names use Arabic numerals (type 2 diabetes) and others Roman numerals (type II diabetes); some use the full name of the disease (transient ischemic attack) and others the abbreviation (TIA). Similar situations arise in the other entity categories, such as surgery, drugs and imaging examinations. The semi-structured data constructed in the previous step therefore still cannot meet the requirements of conventional statistical analysis, and a covariate extractor needs to be constructed. The covariate extractor is an ERNIE-based integrated model comprising an ERNIE model, deep supervised learning and a logistic regression function. Taking myocardial infarction as an example, when "myocardial infarction" is input into the covariate extractor, the system automatically matches it against the disease-category entities in the semi-structured database to judge whether the target covariate is present in the unstructured medical text: if a match is found, the target covariate is present in the unstructured medical text; if not, it is not present.

A text similarity matching model is trained using the semi-structured database to develop the covariate extractor. The method constructs the covariate extractor by training a supervised deep text matching model (namely an ERNIE deep learning model), which gave the best results. The ERNIE deep learning model uses a 12-layer Transformer with a hidden layer size of 768 and 12 attention heads; the optimizer is Adam, the learning rate is set to 2e-05, the batch size is set to 32, and the training is iterated 10 times. The text similarity matching process of the ERNIE deep learning model is shown in fig. 9:

Using a twin (Siamese) network structure, two entities (entity A and entity B) are fed into ERNIE, which shares its parameters between them, to obtain the sentence vectors of the two entities. The sentence vectors are then fed into a pooling layer, where average pooling extracts and compresses their features to obtain u and v. Finally, u, v and |u-v| are concatenated and fed into a fully connected layer, and the similarity between the input entity (assumed to be "entity A" in fig. 9) and a medical entity of the disease category in the semi-structured database constructed in step 3 (assumed to be "entity B" in fig. 9) is compared; a logistic regression (Logistic) function (i.e. the classifier in fig. 9) judges whether the two entities are similar. If they are not similar, the entities do not match, which means that the target covariate is not present in the unstructured original medical text, and the covariate extractor outputs "0"; taking a disease entity as an example, "0" indicates that the patient does not have the disease corresponding to the medical entity name. If they are similar, the entities match, which means that the target covariate is present in the unstructured original medical text, and the covariate extractor outputs "1"; taking a disease entity as an example, "1" indicates that the patient has the disease corresponding to the medical entity name.
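The matching head described above (average pooling, concatenation of u, v and |u-v|, then a logistic classifier) can be sketched in plain Python. The token vectors below stand in for ERNIE outputs, and the weights, bias and the 0.5 decision threshold are hypothetical values for illustration only; a trained model would learn the weights.

```python
import math


def mean_pool(token_vectors):
    """Average pooling over the token vectors of one entity."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def match_features(u, v):
    """Concatenate u, v and the element-wise absolute difference |u - v|."""
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]


def match_probability(features, weights, bias):
    """Fully connected layer followed by the logistic (Sigmoid) function."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def extractor_output(tokens_a, tokens_b, weights, bias, threshold=0.5):
    """Return 1 (match) or 0 (no match) for two entities given as token vectors."""
    u, v = mean_pool(tokens_a), mean_pool(tokens_b)
    probability = match_probability(match_features(u, v), weights, bias)
    return 1 if probability >= threshold else 0


# Hypothetical 2-dimensional token vectors and weights that penalize |u - v|
weights, bias = [0.0, 0.0, 0.0, 0.0, -4.0, -4.0], 2.0
same = extractor_output([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0]], weights, bias)
different = extractor_output([[1.0, 0.0]], [[0.0, 1.0]], weights, bias)
```

Because both entities pass through the same encoder and pooling function, the comparison is symmetric in how each side is represented, which is the point of the shared-parameter design.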

The covariate extractor model is trained by supervised learning. Named entities of 1000 patients are randomly sampled from the semi-structured data set obtained from the entity recognition in the previous step and labeled. Labeling examples are shown in Table 5: the first column is the data number, the second column is the entity category, the third column is the medical entity extracted from the medical record, the fourth column is the standard medical entity labeled by the researcher, and the fifth column indicates whether entity 1 and entity 2 match.

TABLE 5 Text similarity matching: positive samples

No.	Category	Entity 1	Entity 2	Match
1	Disease	Cerebral infarction (atherosclerosis type)	Cerebral infarction	Yes
2	Disease	Acute cerebral infarction	Cerebral infarction	Yes
3	Disease	Grade 1 hypertension, very high risk	Hypertension	Yes
4	Disease	Cerebral hemorrhage (left temporal cavernous hemangioma)	Cerebral hemorrhage	Yes
5	Disease	Multiple cerebral blood supply arteriosclerosis stenosis	Stenosis of cerebral artery	Yes
6	Surgery	Interventional embolization of right vertebral artery dissection aneurysm	Interventional embolization of aneurysms	Yes
7	Surgery	Left middle cerebral aneurysm stent assist coil embolization	Stent assisted spring coil embolization	Yes
8	Surgery	Balloon dilatation stent forming method for severe stenosis of litholateral internal carotid artery bed outburst section	Angioplasty of arterial stents	Yes
9	Surgery	Basilar artery drug thrombolysis	Arterial thrombolysis	Yes
10	Surgery	Cerebral arteriovenous fistula embolism operation in right side lateral sinus region	Arteriovenous fistula embolization	Yes
11	Drug	Glutathione tablet	Glutathione	Yes
12	Drug	Aspirin enteric-coated tablet	Aspirin	Yes
13	Drug	Levofloxacin sodium chloride	Levofloxacin	Yes
14	Drug	Adenosine cobalt tablet	Adenosine cobalt tablet	Yes
15	Drug	Fang Xinnuo	Compound sulfamethoxazole	Yes
16	Imaging examination	Head MRI enhancement	MRI	Yes
17	Imaging examination	Brain CT enhancement	CT	Yes
18	Imaging examination	Skull CT flat scan + enhancement	CT	Yes
19	Imaging examination	Brain MRI	MRI	Yes
20	Imaging examination	Carotid CT augmentation	CT	Yes
21	Symptoms and signs	Disadvantaged limb movement	Dysfunction of limbs	Yes
22	Symptoms and signs	Unclear speech structure sound	Dysarthria	Yes
23	Symptoms and signs	Paralysis patient	Paralysis patient	Yes
24	Symptoms and signs	Visual deterioration	Diminution of vision	Yes
25	Symptoms and signs	Weakness of the stomach	Weakness of the stomach	Yes

Since this part of the study uses supervised learning, positive samples alone are not sufficient, and negative samples need to be constructed (as shown in Table 6). The study adopts data augmentation to construct negative samples and to expand the positive samples. Because the standard name corresponding to an extracted entity is unique, the "entity 2" column is randomly shuffled and re-paired; whenever the re-paired "entity 2" is not the original one, the pair can be marked as "no match". The negative samples are shown in Table 6. The positive samples are expanded through the transitivity of similarity: if entity A matches entity B and entity C matches entity B, then entity A matches entity C. Finally, the labeled data set, the constructed negative sample data set and the expanded positive sample data set are merged and randomly divided into a training set, a verification set and a test set in a ratio of 3: 1.
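The two augmentation rules above can be sketched as follows, assuming that positive samples are stored as (extracted entity, standard name) pairs. The function names and the fixed random seed are illustrative choices, not part of the original method.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_negatives(positives, seed=0):
    """Shuffle the standard-name column; keep re-pairings that differ from the original."""
    rng = random.Random(seed)
    standards = [standard for _, standard in positives]
    rng.shuffle(standards)
    return [(entity, shuffled, 0)
            for (entity, original), shuffled in zip(positives, standards)
            if shuffled != original]


def expand_positives(positives):
    """If A matches B and C matches B, add the pair (A, C) as a new positive sample."""
    groups = defaultdict(list)
    for entity, standard in positives:
        groups[standard].append(entity)
    extra = []
    for entities in groups.values():
        extra += [(a, c, 1) for a, c in combinations(entities, 2)]
    return extra


positives = [
    ("Acute cerebral infarction", "Cerebral infarction"),
    ("Cerebral infarction (atherosclerosis type)", "Cerebral infarction"),
    ("Aspirin enteric-coated tablet", "Aspirin"),
    ("Head MRI enhancement", "MRI"),
]
negatives = make_negatives(positives)
extra_positives = expand_positives(positives)
```

Pairs whose shuffled standard name happens to equal the original are simply discarded, so the number of negative samples can be smaller than the number of positive samples.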

TABLE 6 text similarity match negative examples

Step 5: Outputting the structured database (executed by: client)

By entering the names of the target medical entities into the covariate extractor one after another, the covariate extractor automatically outputs a structured database containing the patient ID number, the entered standard names of the target medical entities and the output results ("0" or "1"). Illustratively, as shown in Table 7, the first column is the patient number, and each subsequent column is headed by the entered standard name of a target medical entity and contains its matching result. For example, if the output in the column for myocardial infarction is "1", the patient has myocardial infarction; if it is "0", the patient does not have myocardial infarction. In the same way, the multiple comorbid diseases of a patient can be identified. Drugs, surgery, imaging examinations, symptoms and so on can likewise be identified through steps 2 to 5 above. The output structured database may also include multiple entity recognition results. For example, after entity recognition on unstructured discharge summary text, the covariate extractor was used to extract covariates including hypertension, hyperlipidemia, diabetes, cardiac insufficiency, atrial fibrillation, atherosclerosis, angioplasty, embolectomy, tracheal intubation, angiography, venous thrombolysis, arterial thrombolysis, extracranial revascularization, decompressive craniectomy, SWI, DWI, CTA, CTP, aphasia, blurred vision, speech impairment, facioplegia, cognitive dysfunction, limb dysfunction, dyskinesia, hemiplegia, coma, low molecular dextran, heparin, dredgeon, oxiracetam, warfarin, butylphthalide, edaravone, cilostazol, aspirin, thromboxane, clopidogrel, and statin.
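The assembly of the structured database from the per-patient entities can be sketched as follows. The function `build_structured_table`, the dictionary layout of the semi-structured data and the toy substring matcher are all assumptions for illustration; in the described method the matching decision comes from the trained covariate extractor, not from string comparison.

```python
def build_structured_table(semi_structured, targets, is_match):
    """Build a 0/1 table with one row per patient and one column per target covariate.

    semi_structured : {patient_id: {category: [entity names]}}
    targets         : list of (category, standard_name) pairs
    is_match        : callable(standard_name, entity_name) -> bool
    """
    header = ["patient number"] + [name for _, name in targets]
    rows = []
    for patient_id, entities in semi_structured.items():
        row = [patient_id]
        for category, name in targets:
            hit = any(is_match(name, entity) for entity in entities.get(category, []))
            row.append(1 if hit else 0)
        rows.append(row)
    return header, rows


def toy_match(standard_name, entity_name):
    """Stand-in for the trained extractor: case-insensitive substring test."""
    return standard_name.lower() in entity_name.lower()


semi_structured = {
    "P001": {"disease": ["Acute cerebral infarction", "Grade 1 hypertension"],
             "drug": ["Aspirin enteric-coated tablet"]},
    "P002": {"disease": ["Atrial fibrillation"]},
}
targets = [("disease", "hypertension"), ("drug", "aspirin"),
           ("disease", "myocardial infarction")]
header, rows = build_structured_table(semi_structured, targets, toy_match)
```

Each target name is compared only against entities of its own category, mirroring the category-wise matching described in step 4.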

TABLE 7 structured database

The method for extracting the structured data of disease prognosis covariates from unstructured medical text can be implemented as software whose program code comprises instructions for executing the method; the specific steps of the method are described above and are not repeated here.

For convenience and simplicity of description, a specific working process of the system for extracting structured data of disease prognosis covariates from unstructured medical texts in the present application can refer to a corresponding process of the above method in this embodiment, which is not described herein again.

It should be appreciated by those skilled in the art that embodiments of the invention may be provided as a computer program product, a system, a smart terminal, or a computer-readable storage medium. Accordingly, the present invention may take the form of an entirely software embodiment, or an embodiment combining software and hardware aspects. The functions implemented by the respective functional modules included in the system may be stored in a computer-readable storage medium if implemented in the form of software functional units and sold or used as independent products. Based on such understanding, the technical solution of the present invention or portions thereof contributing to the prior art may be essentially embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (e.g., a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present invention. The computer readable storage medium includes, but is not limited to, various media that can store computer program code, such as a usb disk, a removable hard disk, a read-only memory, a random access memory, a magnetic or optical disk, and the like.

Effect example: constructing a prediction model of the length of hospital stay of ischemic stroke patients based on the structured database obtained in the above example

A Logistic regression model was constructed from the structured data extracted from the electronic medical records in the above example to predict whether the length of hospital stay exceeded 7 days, and this prediction model was compared with a comparative prediction model (as a comparative example) constructed from data extracted only from the first page of the medical record. The data extracted directly from the first page of the medical record include sex, age, year of admission, condition at admission, discharge diagnosis, operation code and the like. In the comparative example, structured data were first extracted by retrieving the disease diagnosis and operation codes (ICD-10 for discharge diagnosis and ICD-9 for surgery), and a comparative prediction model was constructed from these first-page data and compared with the prediction model based on the structured database obtained in the example. The two prediction models were evaluated by comparing their AUC values, sensitivity and specificity on the test set.
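The three evaluation measures named above can be computed as in the following sketch. The AUC is calculated here as the proportion of positive/negative pairs that the model ranks correctly (ties count one half), and the 0.5 classification threshold is an assumption; the labels and scores are made-up illustrative values, not results from the study.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def sensitivity_specificity(y_true, scores, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative data: 1 = length of stay > 7 days, scores = predicted probabilities
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
model_auc = auc(y_true, scores)
sensitivity, specificity = sensitivity_specificity(y_true, scores)
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation would be preferable for large test sets.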

Constructing a predictive model of prognosis (length of stay) includes the following steps:

1. Data set

All the prognosis covariates and structured data of the ischemic stroke patients obtained in the example are incorporated into a data set, which is randomly divided into a training set and a test set in a ratio of 4: 1.

By plotting a heat map of the correlations between covariates (fig. 10), we find that some variables are highly correlated. To address multicollinearity, LASSO regression is used to screen the covariates, and two variables, "hemiplegia" and "Shuxuetong", are finally eliminated. The outcome is dichotomized as a length of hospital stay of ≤7 days or >7 days.

2. Training a predictive model

In this study, a Logistic regression model was constructed to predict whether the hospital stay is longer than 7 days. After the data set is divided into a training set and a test set, 5-fold cross-validation is performed on the training set: the training set is divided into 5 parts, each part in turn serves as the validation set for evaluating the model, and the remaining 4 parts are used for model training.
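The 4: 1 split and the 5-fold cross-validation described above can be sketched with index lists. The function names, the fixed random seed and the interleaved fold assignment are illustrative assumptions; the original does not specify how the folds were drawn.

```python
import random


def train_test_split(n_samples, test_ratio=0.2, seed=0):
    """Randomly split sample indices into training and test indices (4: 1 by default)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists; each fold serves once as validation set."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


train_idx, test_idx = train_test_split(6053)
folds = list(k_fold(train_idx, k=5))
```

The test indices are set aside before cross-validation, so they play no part in model selection.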

Binary Logistic regression predicts the probability that an event occurs and performs the 0/1 classification by mapping the result of a linear regression through the Sigmoid function. The linear regression is $z_i = w \cdot x_i + b$, where $x_i$ is the N-dimensional feature vector of the i-th sample, i.e.

$$x_i = \left(x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(N)}\right)^{T},$$

w is the weight vector and b is the bias constant. The Sigmoid function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The conditional probability distribution of the Logistic regression model is then:

$$P(y_i = 1 \mid x_i) = \sigma(w \cdot x_i + b), \qquad P(y_i = 0 \mid x_i) = 1 - \sigma(w \cdot x_i + b)$$

For a given training set $T = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)\}$, the likelihood function is:

$$L(w, b) = \prod_{i=1}^{n} \sigma(z_i)^{y_i} \left(1 - \sigma(z_i)\right)^{1 - y_i}$$

Taking the logarithm gives the log-likelihood function:

$$\ln L(w, b) = \sum_{i=1}^{n} \left[ y_i \ln \sigma(z_i) + (1 - y_i) \ln\left(1 - \sigma(z_i)\right) \right]$$

The gradient descent method is applied to the negative of the above expression to find its maximum, giving the estimate of w, which determines the prediction of "0" or "1".
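The estimation procedure can be illustrated with a small, self-contained sketch that maximizes the log-likelihood by following its gradient (equivalently, gradient descent on the negative log-likelihood). The learning rate, step count and toy data are assumptions chosen for illustration; they are not the settings used in the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(w, b, x):
    """z = w . x + b"""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def log_likelihood(w, b, X, y):
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(linear(w, b, xi))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total


def fit(X, y, learning_rate=0.1, steps=500):
    """Gradient ascent on the mean log-likelihood."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            error = yi - sigmoid(linear(w, b, xi))  # d lnL / d z_i
            grad_w = [g + error * xj for g, xj in zip(grad_w, xi)]
            grad_b += error
        w = [wj + learning_rate * g / n for wj, g in zip(w, grad_w)]
        b += learning_rate * grad_b / n
    return w, b


# Toy data: the outcome equals the first (binary) covariate
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
w, b = fit(X, y)
predictions = [1 if sigmoid(linear(w, b, xi)) >= 0.5 else 0 for xi in X]
```

The per-sample gradient term `error` is the difference between the observed label and the predicted probability, which follows directly from differentiating the log-likelihood above with respect to $z_i$.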

The study results are as follows:

(I) Baseline characteristics of the ischemic stroke patients

A total of 6053 patients with ischemic stroke were finally included; their baseline characteristics are shown in Table 8. 77.07% of the patients had a length of stay >7 days, and they differed statistically from patients with a length of stay of 7 days or less in age, condition at admission, smoking history, drinking history, hypertension, hyperlipidemia, diabetes, cardiac insufficiency, atrial fibrillation, arterial stenting, arterial embolization, angiography, extracranial revascularization, balloon dilatation, SWI, CTA, CTP, aphasia, blurred vision, speech impairment, facial-lingual paralysis, limb dysfunction, movement impairment, coma, low molecular dextran, heparin, oxiracetam, warfarin, butylphthalide, edaravone, cilostazol, aspirin, thromboxane, clopidogrel, and statin.

TABLE 8 basic conditions of ischemic stroke patients

(II) Factors associated with a length of hospital stay >7 days

Multi-factor Logistic regression (Table 9) shows that an emergent or critical condition at admission, diabetes, arterial angioplasty, extracranial revascularization, CTP examination, speech disorder, facial paralysis, and the use of cilostazol, clopidogrel and statins are associated with a longer hospital stay.

TABLE 9 Factors (covariates) influencing a length of hospital stay >7 days

(III) Comparison of the prediction model constructed from the structured database with the prediction model constructed from the first page of the medical record

The number of covariates extracted from the ICD codes on the first page of the medical record was 15, while the number of covariates extracted using the method for extracting structured data of disease prognosis covariates from unstructured medical texts (hereinafter referred to as the "NLP technique") of the present application reached 43. Table 10 compares the numbers of covariate cases extracted from the ICD codes on the first page of the medical record and by the method of the present application. Except for atherosclerosis, the numbers of covariate cases extracted by the NLP technique are all greater than those extracted from the first page of the medical record by ICD codes, indicating that the first page of the medical record may record the diseases and surgeries of some patients incompletely.

TABLE 10 Numbers of covariate cases extracted from the ICD codes on the first page of the medical record and by the NLP technique

The OR values of the Logistic regression prediction models constructed from the ICD codes on the first page of the medical record and by the method of the present application are shown in Table 11. In the model based on the first-page ICD codes, balloon dilatation, condition at admission and hyperlipidemia are the top three predictors; in the model constructed by the method of the present application, cilostazol, extracranial revascularization and condition at admission are the top three predictors. The Logistic regression prediction model constructed with the NLP technique incorporates more meaningful predictors.

TABLE 11 Comparison of the Logistic regression prediction models constructed from the first page of the medical record and with the NLP technique

In addition, the AUC value of the length-of-stay prediction model constructed from covariates extracted by the NLP technique is significantly higher than that of the prediction model constructed only from the first page of the medical record (Table 12), and the difference is statistically significant. This shows that covariates extracted with the NLP technique provide more prognostic prediction information.

TABLE 12 AUC values of the prediction models based on the first page of the medical record and on the NLP technique

Model	First page of medical record, AUC (95% CI)	NLP, AUC (95% CI)	P
Logistic regression	0.684 (0.657-0.710)	0.776 (0.751-0.799)	<0.001

The comparison of the length-of-stay prediction models for stroke patients is summarized as follows:

The Logistic regression prediction model based on the ICD codes on the first page of the medical record includes 8 predictors, which are selected from the screened covariates and through which the length of hospital stay can be predicted, whereas the Logistic regression prediction model constructed with the NLP technique includes 16 predictors. In addition, 15 covariates were extracted from the ICD codes on the first page of the medical record, while 43 were extracted with the NLP technique. Moreover, the AUC value of the length-of-stay prediction model constructed from covariates extracted by the NLP technique is significantly higher than that of the prediction model constructed only from the first page of the medical record, and the difference is statistically significant. The predictive performance of the length-of-stay model built on NLP-extracted covariates is therefore significantly better than that of the model built only on the first page of the medical record, which demonstrates the effectiveness and practical value of extracting covariates by NLP.

Existing research shows that, for stroke patients, acute stroke care, identification of the cause of the stroke and prevention of stroke complications are generally completed within the first week of hospitalization. Previous studies have therefore generally defined a stay of more than 6 to 8 days as prolonged hospitalization for stroke patients, and prolonged hospitalization is an independent factor that increases hospitalization cost. Studying the factors that influence prolonged hospitalization (more than 7 days) and predicting whether a stay will be prolonged helps to allocate medical resources rationally and to make bed use more flexible, thereby reducing management and medical care costs; it also allows a personalized diagnosis and treatment pathway and discharge plan to be drawn up for the patient according to these factors, shortening the patient's stay and improving the satisfaction of patients and their families. Predicting patient-related outcomes (e.g. whether the hospital stay exceeds 7 days) from clinical data can help optimize clinical decisions and improve personalized care. Unlike the various existing regression analysis models, the clinical prediction model has better generalization ability, i.e. it predicts new data outside the training data better. Extracting useful information from the massive structured data derived from electronic medical record text allows the patient's length of stay to be estimated accurately, which facilitates the overall management of hospital supplies (beds, medicines and instruments) and the allocation of medical staff.

For patients with ischemic stroke, an NER model is constructed as the optimal entity recognition model to identify 5 types of medical entities (diseases, drugs, surgery, imaging examinations and symptoms), and a semi-structured database is constructed. To further extract structured data from the semi-structured database, an ERNIE model with the best matching performance is constructed to perform text similarity matching. The extracted structured data increase the amount of available information; on this basis, a prediction model of whether the length of stay exceeds 7 days is constructed, providing richer information for clinical decisions and resource allocation.

It should be noted that the embodiments of the present invention have been described in terms of preferred embodiments, and not by way of limitation, and that those skilled in the art can make modifications and variations of the embodiments described above without departing from the spirit of the invention.

Claims

1. A method for extracting structured data of disease prognosis covariates based on unstructured medical text, characterized by the steps of:

step S2: identifying medical entities with an NER model: the NER model is a medical entity recognition model based on an ERNIE pre-training model, a dilated convolutional neural network and a conditional random field; first, the labeled medical text is converted into word vectors by the ERNIE pre-training model; the word vectors are then input into the dilated convolutional neural network to obtain the label scores of each word; finally, the label scores of each word are input into the conditional random field to obtain all entity names contained in the text and the medical entity category of each word;

step S4: judging whether the target medical entity exists: an ERNIE deep learning model is trained with the semi-structured database to construct a covariate extractor; the standard name of the target medical entity is input into the covariate extractor; the ERNIE deep learning model compares the standard name of the target medical entity with the entity names in the semi-structured database, and a logistic regression function judges whether the standard name of the target medical entity is similar to the entity name; if they are similar, this indicates a match, meaning that the target covariate exists in the unstructured medical text, and the output result is "1" (taking a disease entity as an example, "1" indicates that the patient has the disease corresponding to the medical entity name); if they are not similar, this indicates a mismatch, meaning that the target covariate does not exist in the unstructured medical text, and the output result is "0" (taking a disease entity as an example, "0" indicates that the patient does not have the disease corresponding to the medical entity name);

2. The method for extracting structured data of disease prognosis covariates based on unstructured medical text according to claim 1, wherein in step S4, the ERNIE deep learning model uses 12 layers of transformers, the hidden layer size is 768, the multi-head attention mechanism is 12 heads, the optimizer is Adam, the learning rate is set to be 2e-05, the number of samples selected in one training is 32, and the training is iterated 10 times.

3. The method for extracting structured data of disease prognosis covariates based on unstructured medical text according to claim 1, wherein in step S4, the similarity matching method comprises the following steps: using a twin network structure, the two entities, namely the standard name of the target medical entity and the entity name, are first fed separately into ERNIE, which shares its parameters between the two entities, to obtain the sentence vectors of the two entities; the sentence vectors are then fed into a pooling layer, where average pooling extracts and compresses the features of the sentence vectors to obtain u and v; finally, u, v and |u-v| are concatenated and fed into a fully connected layer, the similarity of the two entities is compared, and a logistic regression function judges whether the two entities are similar; if they are similar, this indicates a match, meaning that the target covariate exists in the unstructured original medical text; if they are not similar, this indicates a mismatch, meaning that the target covariate does not exist in the unstructured original medical text.

4. The method for extracting structured data of disease prognosis covariates based on unstructured medical text according to claim 1, wherein the medical entity categories include disease entities, drug entities, surgical entities, imaging examination entities and symptom entities.

5. The method for extracting structured data of disease prognosis covariates based on unstructured medical text according to any of claims 1 to 4, wherein the unstructured medical text is a discharge summary.

6. A system for extracting structured data of disease prognosis covariates based on unstructured medical texts is characterized by comprising a preprocessing module, an identification module, a semi-structured database construction module, a comparison module and a structured database construction module;

7. An intelligent terminal, comprising:

a memory for storing executable program code; and

a processor for reading executable program code stored in the memory to perform the method of extracting structured data of disease prognosis covariates based on unstructured medical text according to any of claims 1 to 5.

8. A computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method for extracting structured data of disease prognosis covariates based on unstructured medical text according to any of claims 1-5.