CN110428907A - A text mining method and system based on unstructured electronic medical records - Google Patents

A text mining method and system based on unstructured electronic medical records

Info

Publication number
CN110428907A
Authority
CN
China
Prior art keywords
text
feature
electronic health
health record
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910701406.5A
Other languages
Chinese (zh)
Inventor
杨波
王芮
彭立志
李宝生
朱健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Jinan
Shandong Institute of Cancer Prevention and Treatment
Original Assignee
University of Jinan
Shandong Institute of Cancer Prevention and Treatment
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Jinan and Shandong Institute of Cancer Prevention and Treatment
Priority to CN201910701406.5A
Publication of CN110428907A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80: Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81: Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure proposes a text mining method and system based on unstructured electronic medical records. The method includes: exporting multiple present-illness-history records of a given disease from a hospital database as the raw experimental data and arranging each sample as a time series, in which the words that express time are identified first and the long text is then cut at these time nodes into several short texts, so that the present illness history is divided into one record per hospitalization; determining the present-illness-history features and extraction rules and saving them in an XML file; extracting the medical-history information with the rule base and storing it in structured form, in which the defined rules are rewritten as regular expressions to extract features from the unstructured text; and quantizing the features, in which each feature value is converted to a number according to the data type of the extracted value. The quantized feature values are unified into the feature representation of a single hospitalization, X = (x1, x2, x3, ..., x57), which is then used as the text-feature input of an unsupervised clustering algorithm to perform text clustering.

Description

A text mining method and system based on unstructured electronic medical records
Technical field
The present disclosure relates to the technical field of natural language processing and machine learning, and in particular to a text mining method and system based on unstructured electronic medical records.
Background
In Chinese electronic medical record (EMR) data, the main sections, such as the chief complaint, present illness history, past medical history, imaging reports and operation records, are written in natural language and are collected and stored mainly in structured and unstructured data formats. They are a concrete record of the clinician's actual diagnosis and treatment details, contain a comprehensive, professional and accurate description of the patient's health information, and are a valuable medical knowledge resource. Structured processing and text mining of Chinese EMRs are therefore of great significance for assisted diagnosis and treatment in clinical medicine.
At present, the structured data in an EMR are stored as two-dimensional tables, whereas the unstructured data consist mainly of free text. Although unstructured text is rich in valuable information, it is difficult to process and use; extracting useful information from unstructured data is usually done by hand, which is time-consuming and laborious.
In natural language processing (NLP) and machine learning, most algorithms handle structured, vectorized data far more easily, so unstructured text must first be converted into structured data and then vectorized before it is fed to an algorithm for analysis and mining. Inspection of the data shows that, for the records of one specific disease, such as esophageal cancer EMRs, the domain terminology and the basic grammar are essentially fixed. Features can therefore be extracted and represented in structured form with a rule-based method, which avoids a large expenditure of labor and resources and improves accuracy. Meanwhile, the methods now generally used for text mining of EMRs are supervised; they require a large number of manual labels, which burdens medical staff and is likely to cause error propagation.
Summary of the invention
The purpose of the embodiments of this specification is to provide a text mining method based on unstructured electronic medical records, namely a rule-based method for information extraction and text mining of unstructured Chinese text, which ultimately achieves the structuring of esophageal squamous cell carcinoma EMRs and the prediction of chemoradiotherapy sensitivity.
An embodiment of this specification provides a text mining method based on unstructured electronic medical records, realized by the following technical solution:
The method includes:
Exporting multiple present-illness-history records of a given disease from a hospital database as the raw experimental data and arranging each sample as a time series: the words that express time are identified first, and the long text is then cut at these time nodes into several short texts, so that the present illness history is divided into one record per hospitalization;
Determining the present-illness-history features and extraction rules and saving them in an XML file;
Extracting the medical-history information with the rule base and storing it in structured form: the defined rules are rewritten as regular expressions to extract features from the unstructured text;
Quantizing the features: the data type of each extracted feature value is analyzed and the value is converted to a number accordingly;
Unifying the quantized feature values into the feature representation of a single hospitalization, X = (x1, x2, x3, ..., x57), which is then used as the text-feature input of an unsupervised clustering algorithm to perform text clustering.
In a further technical solution, after the text clusters are obtained, text classification is performed with a TextCNN model, including:
Text preprocessing: a Tokenizer converts the present-illness-history text into numerical features. The words of the Chinese text are first separated by spaces; a dictionary of fixed size is then built, and long samples are truncated and short ones padded so that all samples have the same length; finally each word is assigned an index, so that every present-illness-history record is converted into a sequence of codes;
Converting each word code into a word vector: based on the word codes obtained, each word is one-hot encoded, so that each word is represented by a vector of dimension vocabulary_size; a suitable weight matrix is then obtained through the training iterations of the neural network, with vocabulary_size rows and as many columns as the word-vector dimension, and this matrix maps the one-hot word vectors into a low-dimensional space to give low-dimensional word vectors;
Extracting features with a convolutional neural network: convolution is performed with convolution kernels of three different sizes;
The feature maps obtained by convolution kernels of different sizes differ in size, so a pooling function is applied to each feature map to make their dimensions identical, giving the final feature vector;
Outputting the probability of each class.
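The preprocessing step of this scheme (space-separated words, a dictionary of fixed size, truncation or padding to a fixed length, one index per word) can be sketched in plain Python. This is a minimal illustration, not the Tokenizer used by the applicants: the function names, the sample sentences and the convention of reserving index 0 for padding and unknown words are all assumptions.

```python
from collections import Counter


def build_vocab(texts, vocabulary_size):
    """Map the most frequent words to indices 1..vocabulary_size-1."""
    counts = Counter(word for text in texts for word in text.split())
    # Assumption: index 0 is reserved for padding and out-of-vocabulary words.
    kept = counts.most_common(vocabulary_size - 1)
    return {word: index + 1 for index, (word, _) in enumerate(kept)}


def encode(text, vocab, max_len):
    """Turn one space-separated record into a fixed-length list of indices."""
    ids = [vocab.get(word, 0) for word in text.split()][:max_len]  # truncate
    return ids + [0] * (max_len - len(ids))                        # pad


# Hypothetical, already word-segmented records.
texts = ["胸闷 半年 食管癌", "胸闷 疼痛 1月"]
vocab = build_vocab(texts, vocabulary_size=10)
coded = [encode(t, vocab, max_len=5) for t in texts]
print(coded)
```

Every record ends up as a list of the same length, which is the form the embedding step expects.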
In a further technical solution, the defined rules are rewritten as regular expressions to extract features from the unstructured text, as follows:
The segmented per-hospitalization records are read in turn, and regular expressions are used to find and locate every keyword that occurs;
With each occurring keyword as a separation point, the short text is segmented again into multiple clauses;
Determining the feature values: since the value corresponding to a keyword generally appears in the short clauses on either side of that keyword, regular expressions are applied again to the text on both sides of the keyword to find its value; one keyword may have several values, so that complex text is converted into key-value form;
Structured storage of the results: the extraction results are stored in an XML file in time order.
In a further technical solution, the extracted features fall mainly into four categories: symptoms, examination findings, chemoradiotherapy regimen and efficacy evaluation.
Further technical solution realizes text cluster using DBSCAN clustering algorithm:
(1) it selects the object p not yet checked from data set at random first, if p is not processed, checks its neighbour Domain, if comprising number of objects be not less than MinPts, establish new cluster C, Candidate Set N be added in all the points therein, object is existing disease History text feature;
(2) to not yet processed object q all in Candidate Set N, its neighborhood is checked, if it is a right to include at least MinPts As N then is added in these objects;If q is not included into any one cluster, C is added in q;
(3) step (2) are repeated, continues checking untreated object in N, current candidate collection N is sky;
(4) step (1)~(3) are repeated, until all objects have all been included into some cluster or labeled as noise.
Further technical solution, the above method are suitable for the text mining of the electronic health record of esophageal squamous cell carcinoma, but and unlimited In the text mining of such electronic health record.
An embodiment of this specification provides a text mining system based on unstructured electronic medical records, realized by the following technical solution:
The system includes:
A text preprocessing module: multiple present-illness-history records of a given disease exported from a hospital database are used as the raw experimental data, and each sample is arranged as a time series; the words that express time are identified first, and the long text is then cut at these time nodes into several short texts, so that the present illness history is divided into one record per hospitalization;
A feature engineering module: the present-illness-history features and extraction rules are determined and saved in an XML file;
Feature extraction: the medical-history information is extracted with the rule base and stored in structured form; the defined rules are rewritten as regular expressions to extract features from the unstructured text;
Feature quantization: the data type of each extracted feature value is analyzed and the value is converted to a number accordingly;
An analysis and prediction module: text classification is performed on the basis of text clustering;
Text clustering: the quantized feature values are unified into the feature representation of a single hospitalization, X = (x1, x2, x3, ..., x57), which is then used as the text-feature input of an unsupervised clustering algorithm to perform text clustering.
An embodiment of this specification provides a computer device, including a memory, a processor and a computer program stored in the memory and executable on the processor; when the processor executes the program, the steps of the text mining method based on unstructured electronic medical records are carried out.
An embodiment of this specification provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the text mining method based on unstructured electronic medical records are carried out.
Compared with the prior art, the beneficial effects of the present disclosure are:
Starting from an unsupervised perspective, the disclosure first groups together patients with similar courses of diagnosis and treatment, i.e. text clustering, which indirectly enables the recommendation of treatment plans; it then takes the "efficacy evaluation" feature of each sample as the sample label, i.e. text classification, which enables the prediction of chemoradiotherapy sensitivity. Through these steps, the present-illness-history data of esophageal squamous cell carcinoma are fully used to predict chemoradiotherapy sensitivity.
Description of the drawings
The accompanying drawings, which form part of this disclosure, are provided for a further understanding of the disclosure; the illustrative embodiments of the disclosure and their description serve to explain the disclosure and do not unduly limit it.
Fig. 1 is a flow chart of the method of a specific embodiment of the disclosure;
Fig. 2 is a schematic diagram of the neural network model used for text classification in a specific embodiment of the disclosure;
Table 1 gives examples of some of the defined features and extraction rules.
Detailed description
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the meanings commonly understood by a person of ordinary skill in the technical field to which this disclosure belongs.
It should be noted that the terminology used here is only for describing specific embodiments and is not intended to limit the exemplary embodiments of the disclosure. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well; it should further be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
Embodiment 1
This embodiment discloses a text mining method based on unstructured electronic medical records which, with reference to Fig. 1, includes:
Step (1): customizing the corpus. The present-illness-history records of 2000 esophageal squamous cell carcinoma cases are exported from a hospital database as the raw experimental data, and each sample is arranged as a time series. To make full use of the diagnosis and treatment record of each of the patient's hospitalizations, the words that express time must be identified first, for example: September 2009, 4 years ago, 7th admission, 2014-10-16, 5 months ago. The long text is then cut at these time nodes into several short texts, so that the present illness history is divided into one record per hospitalization.
In this step, "customizing" means applying some preprocessing to the raw data, namely cutting each long present-illness-history passage into several short passages at the time nodes.
The corpus is the complete set of text data, after preprocessing, on which the subsequent steps operate.
In this step, because the EMR content for a specified disease follows a fixed grammar or template, the extraction rules can also be customized accordingly once the corpus has been determined. The present illness history takes the time of each hospitalization as a time node and describes the whole course of the disease after onset, that is, its occurrence, development, evolution and the course of diagnosis and treatment. Processing and analyzing present-illness-history data is therefore bound to help personalized assisted treatment in clinical medicine.
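The time-node segmentation of step (1) can be sketched with Python's `re` module. This is a minimal sketch under stated assumptions: the patterns below cover only the kinds of time expression listed in step (1), and both the patterns and the sample sentence are hypothetical Chinese renderings, not taken from the patent.

```python
import re

# Hypothetical patterns for time expressions in Chinese present-illness histories.
TIME_NODE = re.compile(
    r"\d{4}-\d{1,2}-\d{1,2}"             # e.g. 2014-10-16
    r"|\d{4}年\d{1,2}月(?:\d{1,2}日)?"    # e.g. 2009年9月 (September 2009)
    r"|第\d+次入院"                       # e.g. 第7次入院 (7th admission)
    r"|\d+(?:年|个月|月|周|天)前"          # e.g. 4年前, 5月前 (4 years / 5 months ago)
)


def split_by_time_nodes(history: str) -> list[str]:
    """Cut a long history into short texts, each starting at a time node."""
    starts = [m.start() for m in TIME_NODE.finditer(history)]
    if not starts:
        return [history.strip()] if history.strip() else []
    starts[0] = 0  # keep any text before the first time node in the first segment
    ends = starts[1:] + [len(history)]
    return [history[a:b].strip() for a, b in zip(starts, ends)]


# Hypothetical sample record.
sample = "患者2009年9月因胸闷半年就诊。2014-10-16行TP方案化疗。5月前出现进食哽噎。"
segments = split_by_time_nodes(sample)
for segment in segments:
    print(segment)
```

A sketch this simple cuts at every time expression, so a record that mentions two dates within one hospitalization would be over-segmented; a real rule set would need to distinguish admission dates from other dates.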
Step (2): feature extraction, which mainly consists of defining the feature values and the extraction rules. There are many methods for information extraction from natural language; the most common at present are based on deep learning and on rule matching. Deep learning requires a large amount of labeled data, whereas analysis shows that the present-illness-history data have a uniform format, a small volume, and fixed grammar and terminology. The disclosure therefore reviews and interprets the EMR data and combines simple rule discovery with verification, finally determining a total of 56 esophageal squamous cell carcinoma present-illness-history features and extraction rules, some of which are shown in Table 1. To simplify later modeling and data reading, they are saved in an XML file.
Table 1
Step (3): extraction of the medical-history information with the rule base, and structured storage. The defined rules are rewritten as regular expressions to extract features from the unstructured text, with the aim of being accurate, semi-automatic, intelligent, efficient and simple. The details are as follows:
3-1) The per-hospitalization records segmented in step (1) are read in turn, and regular expressions are used to find and locate every keyword that occurs. For example: "In September 2009 the patient was diagnosed with 'esophageal cancer' by esophagoscopy because of 'chest tightness for half a year and precordial pain for one month'." Keywords: chest tightness, pain, esophageal cancer;
3-2) With each occurring keyword as a separation point, the short text is segmented again into multiple clauses. The result is: (1) chest tightness for half a year; (2) precordial pain for one month; (3) diagnosed with "esophageal cancer" by esophagoscopy;
3-3) Determining the feature (keyword) values. Since the value corresponding to a keyword generally appears in the short clauses on either side of that keyword, regular expressions are applied again to the text on both sides of the keyword to find its value; one keyword may have several values, so that complex text is converted into key-value form. The result is: chest tightness: half a year; pain: precordial region; esophageal cancer: yes;
3-4) Structured storage of the results. Because Extensible Markup Language (XML) makes it easy to transfer data between programs, the extraction results are stored in an XML file in time order.
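Steps 3-1) to 3-4) can be sketched as follows. This is an illustrative simplification rather than the applicants' rule base: the three rules, the Chinese rendering of the example sentence, the element names in the XML output, and the choice to split clauses at punctuation (instead of at keyword positions as described in 3-2) are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules: keyword -> (value pattern, fixed value or None).
RULES = {
    "胸闷": (re.compile(r"胸闷(半年|\d+(?:年|个月|月|天))"), None),    # duration
    "疼痛": (re.compile(r"(心前区|胸骨后|上腹部)疼痛"), None),          # location
    "食管癌": (re.compile(r"食管癌"), "是"),                           # presence: yes
}
# Simplification: clauses are split at punctuation, not at keyword positions.
CLAUSE_SPLIT = re.compile(r'[,,、。;;“”"]')


def extract_features(record):
    """Return {keyword: [values]} for one hospitalization record."""
    features = {}
    for clause in CLAUSE_SPLIT.split(record):
        for keyword, (pattern, fixed) in RULES.items():
            if keyword not in clause:
                continue
            match = pattern.search(clause)
            if match:
                value = fixed if fixed is not None else match.group(1)
                features.setdefault(keyword, []).append(value)
    return features


def to_xml(admissions):
    """Store the extraction results of (time node, record) pairs in time order."""
    root = ET.Element("patient")
    for time_node, record in admissions:
        admission = ET.SubElement(root, "admission", time=time_node)
        for keyword, values in extract_features(record).items():
            for value in values:
                ET.SubElement(admission, "feature", name=keyword).text = value
    return ET.tostring(root, encoding="unicode")


record = "患者2009年9月因“胸闷半年、心前区疼痛1月”经食管镜检查诊为“食管癌”。"
features = extract_features(record)
xml_text = to_xml([("2009年9月", record)])
print(features)
print(xml_text)
```

Keeping the values in a list per keyword reflects the statement in 3-3) that one keyword may have several values.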
In a specific embodiment, the defined rules are rewritten as regular expressions to extract feature information from unstructured text. Present-illness-history text: "The TP regimen, cycle 1, was started on 7 December 2016."
Regular expression: "GP regimen|TPF regimen|TP regimen|\d.*cycle|\d.*days/cycle";
The extraction result is "regimen: TP".
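The example above can be reproduced with a short Python sketch. The Chinese sample sentence and the two patterns are assumed renderings of the treatment scheme (regimen) rule, and the dictionary keys are hypothetical names. Listing `TPF` before `TP` in the alternation is a design choice that makes the ordering of the alternatives explicit.

```python
import re

# Assumed patterns; the applicants' actual expressions may differ.
REGIMEN = re.compile(r"(GP|TPF|TP)方案")   # regimen name followed by "方案"
CYCLE = re.compile(r"第(\d+)周期")         # "cycle N"


def extract_regimen(text):
    """Return the regimen name and cycle number found in one clause."""
    result = {}
    regimen = REGIMEN.search(text)
    if regimen:
        result["regimen"] = regimen.group(1)
    cycle = CYCLE.search(text)
    if cycle:
        result["cycle"] = int(cycle.group(1))
    return result


# Hypothetical Chinese rendering of the example sentence.
text = "2016年12月7日开始给予第1周期TP方案"
print(extract_regimen(text))  # {'regimen': 'TP', 'cycle': 1}
```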
Step (4): quantization of the features. The data type of each extracted feature value is analyzed and, with reference to the SEER database (the authoritative cancer statistics database of the United States, in which tumor information has been unified and standardized), the feature values are converted to numbers so that the model in the next step can read them. For example, binary features are quantized to 0 or 1, continuous numerical data are left unchanged, and discrete data are quantized by enumeration.
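The three quantization cases of step (4) can be sketched as a small lookup-driven function. The feature names, the enumerated levels and the rule that missing values map to 0 are all hypothetical; the patent does not list them here.

```python
# Hypothetical schema: feature name -> (type, enumerated levels for discrete features).
SCHEMA = {
    "chest_tightness": ("binary", None),
    "tumor_length_cm": ("continuous", None),
    "regimen": ("discrete", ["GP", "TP", "TPF"]),
}


def quantize(name, raw):
    """Convert one extracted value to a number according to its data type."""
    kind, levels = SCHEMA[name]
    if raw is None:
        return 0.0  # assumption: missing values are encoded as 0
    if kind == "binary":
        return 1.0 if raw in ("yes", "是") else 0.0
    if kind == "continuous":
        return float(raw)  # continuous values are kept unchanged
    # Discrete: 1-based position in the enumerated list, 0 if not listed.
    return float(levels.index(raw) + 1) if raw in levels else 0.0


def to_vector(record):
    """Build the feature vector of one hospitalization in schema order."""
    return [quantize(name, record.get(name)) for name in SCHEMA]


x = to_vector({"chest_tightness": "yes", "tumor_length_cm": "4.5", "regimen": "TP"})
print(x)  # [1.0, 4.5, 2.0]
```

Enumerating a discrete feature as consecutive integers imposes an order on its levels, which matters for distance-based clustering; one-hot encoding would be an alternative where no order is intended.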
Step (5): text mining and analysis for prediction. After the information extraction and quantization of the preceding steps, machine learning methods can be applied to mine the EMR text, discover implicit knowledge in it, and thereby assist diagnosis. The present invention attempts to predict the chemoradiotherapy sensitivity of esophageal squamous cell carcinoma by combining supervised and unsupervised methods.
In a specific embodiment, for text clustering, an unsupervised machine learning model is built to discover behavior patterns in the clinical EMR text data. Text clustering is widely applicable to different aspects of text mining and information retrieval, and is of considerable value for organizing and browsing large text collections and for automatically generating hierarchical categories of text collections. The features extracted in the present invention fall mainly into four categories: symptoms, examination findings, chemoradiotherapy regimen and efficacy evaluation. The quantized feature values are first unified into the feature representation of one hospitalization of a patient, X = (x1, x2, x3, ..., x57), which is then used as the text-feature input of an unsupervised clustering algorithm; the output is the text clustering result, which is used to find commonalities or implicit knowledge among patients with esophageal squamous cell carcinoma and ultimately to assist clinical diagnosis and treatment.
Among the many unsupervised clustering algorithms, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based algorithm. Unlike K-means, DBSCAN does not need to know in advance the number of clusters to be formed; it can find clusters of arbitrary shape and can also identify noise points, and it is therefore widely used in clustering tasks of all kinds.
In a specific embodiment, the disclosure performs text clustering with the DBSCAN clustering algorithm, as follows:
Input: the data set S, the scan radius Eps and the minimum number of points MinPts; the data set here is the quantized feature values of each text record;
Output: all generated clusters that meet the density requirement. A cluster is a category; the clustering algorithm automatically divides the data set S into different categories. In this application, the clusters correspond to different chemoradiotherapy sensitivities. By analyzing the samples gathered in the same cluster, implicit domain knowledge shared by samples with the same chemoradiotherapy sensitivity can be discovered and used to guide downstream tasks.
(1) An object p that has not yet been examined is first selected at random from the data set; if p has not been processed (that is, neither assigned to a cluster nor labeled as noise), its neighborhood is examined, and if the neighborhood contains no fewer than MinPts objects, a new cluster C is created and all points in the neighborhood are added to the candidate set N. An object here is a sample in the data set S, i.e. the text feature of one present illness history.
(2) For every object q in the candidate set N that has not yet been processed, its neighborhood is examined; if the neighborhood contains at least MinPts objects, these objects are added to N; if q has not been assigned to any cluster, q is added to C;
(3) Step (2) is repeated, continuing to examine the unprocessed objects in N, until the candidate set N is empty;
(4) Steps (1) to (3) are repeated until every object has been assigned to a cluster or labeled as noise.
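Steps (1) to (4) can be sketched in plain Python as below. This is a minimal illustration of the standard DBSCAN procedure, not the applicants' implementation: it uses Euclidean distance, visits objects in index order rather than at random, and runs on a toy two-dimensional data set instead of the 57-dimensional feature vectors. The names `eps` and `min_pts` correspond to Eps and MinPts.

```python
import math

NOISE = -1


def region_query(points, index, eps):
    """Indices of all points within eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster number, or NOISE."""
    labels = [None] * len(points)              # None means not yet processed
    cluster = -1
    for p in range(len(points)):               # step (1): pick an unprocessed object
        if labels[p] is not None:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:
            labels[p] = NOISE
            continue
        cluster += 1                           # p is a core object: new cluster C
        labels[p] = cluster
        candidates = list(neighbours)          # candidate set N
        k = 0
        while k < len(candidates):             # steps (2)-(3): until N is exhausted
            q = candidates[k]
            k += 1
            if labels[q] == NOISE:
                labels[q] = cluster            # former noise becomes a border point
                continue
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) >= min_pts:
                candidates.extend(q_neighbours)
    return labels                              # step (4): every object is labelled


# Toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The brute-force neighborhood search is quadratic in the number of samples, which is acceptable for a corpus of a few thousand records; larger data sets would normally use a spatial index.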
Chemoradiotherapy sensitivity prediction, i.e. text classification. Text classification is an important foundation of information retrieval and text mining; its main task is to determine the category of a text from its content, given a predefined set of category labels. Sensitivity to chemoradiotherapy differs greatly between individual tumor patients; during treatment, drugs should be selected and radiotherapy plans customized according to the specific diagnosis, the individual's constitution and so on, so as to maximize the patient's chemoradiotherapy sensitivity.
Among supervised text classification algorithms, and in contrast to traditional machine learning text classifiers, applying a convolutional neural network (CNN) to text classification uses several kernels of different sizes to extract the key information in a sentence (similar to n-grams with multiple window sizes). Features are extracted from the text automatically, feature engineering is eliminated and training is end to end, so that local correlations are captured better. Thanks to the strong feature representation ability of deep learning, text classification with deep learning often outperforms traditional methods.
The present invention takes all the esophageal squamous cell carcinoma present-illness-history texts as the raw data set and uses the efficacy evaluation of each complete course of treatment, from admission to discharge, as the label of chemoradiotherapy sensitivity. A TextCNN model is used to build a chemoradiotherapy sensitivity classification and prediction model based on esophageal squamous cell carcinoma EMRs. Finally, the present-illness-history record of a new patient can be used directly as input to the model to predict the chemoradiotherapy sensitivity of the new sample, in the hope that the prediction can help the doctor adjust the chemoradiotherapy regimen and ultimately maximize sensitivity.
In a specific embodiment, the disclosure performs text classification with a TextCNN model, as follows:
(1) Text preprocessing: a Tokenizer converts the present-illness-history text into numerical features. The words of the Chinese text are first separated by spaces; a dictionary of fixed size (vocabulary_size) is then built, and long samples are truncated and short ones padded so that all samples have the same length. Finally each word is assigned an index, so that every present-illness-history record is converted into a sequence of codes;
(2) Embedding layer: each word code is converted into a word vector. Based on the word codes obtained above, the embedding layer one-hot encodes each word, so that each word is represented by a vector of dimension vocabulary_size; a suitable weight matrix is then obtained through the training iterations of the neural network, with vocabulary_size rows and as many columns as the word-vector dimension, and this matrix maps the one-hot word vectors into a low-dimensional space to give low-dimensional word vectors, as shown in Fig. 2.
(3) Convolution layer: a convolutional neural network is used to extract features. Because adjacent words in a sentence are usually strongly related, one-dimensional convolution can be used; that is, text convolution differs from image convolution in that it is performed along only one direction (vertical) of the text sequence, with the width of the convolution kernel fixed to the word-vector dimension d. The height is a hyperparameter that can be set. The present invention performs convolution with kernels of three different sizes (3, 4, 5).
(4) Pooling layer (max-pooling layer): the feature maps obtained by convolution kernels of different sizes differ in size, so a pooling function is applied to each feature map to make their dimensions identical. The present invention uses 1-max pooling, which extracts the maximum value of each feature map; selecting the maximum of each feature map captures its most important feature. Each convolution kernel thus yields a single value; 1-max pooling is applied to all kernels and the results are concatenated to give the final feature vector, which is then fed to the softmax layer for classification.
(5) Fully connected layer (full connection and softmax): the last layer is a fully connected softmax layer that outputs the probability of each class. The classes in the present invention are the chemoradiotherapy sensitivity states (disease aggravated, disease improved, no obvious change, completely cured, etc.).
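The forward pass of layers (2) to (5) can be sketched without any deep learning library. This is a toy illustration under stated assumptions: the weights are random rather than trained, there is a single filter per kernel size (a practical model would use many), a ReLU activation is added after the convolution although the text does not name one, and the sizes and token indices are arbitrary.

```python
import math
import random


def embed(token_ids, table):
    # Multiplying a one-hot vector by the weight matrix selects one row,
    # so the embedding layer reduces to a row lookup.
    return [table[i] for i in token_ids]


def conv1d_relu(seq, kernel, bias):
    # Slide a (height x d) kernel down the sequence; its width is fixed to d.
    height = len(kernel)
    feature_map = []
    for start in range(len(seq) - height + 1):
        total = bias
        for row in range(height):
            total += sum(x * w for x, w in zip(seq[start + row], kernel[row]))
        feature_map.append(max(0.0, total))  # ReLU (assumed activation)
    return feature_map


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def textcnn_forward(token_ids, table, kernels, dense_w, dense_b):
    seq = embed(token_ids, table)
    # 1-max pooling: one value per kernel, concatenated into the feature vector.
    pooled = [max(conv1d_relu(seq, kernel, bias)) for kernel, bias in kernels]
    logits = [sum(p * w for p, w in zip(pooled, row)) + b
              for row, b in zip(dense_w, dense_b)]
    return softmax(logits)


rng = random.Random(0)
VOCAB_SIZE, DIM, N_CLASSES = 20, 8, 4  # illustrative sizes only


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


table = rand_matrix(VOCAB_SIZE, DIM)                        # vocabulary_size x d
kernels = [(rand_matrix(h, DIM), 0.0) for h in (3, 4, 5)]   # kernel heights 3, 4, 5
dense_w = rand_matrix(N_CLASSES, len(kernels))
dense_b = [0.0] * N_CLASSES

token_ids = [1, 5, 6, 7, 2, 9, 0, 0, 0, 0]                  # a padded, encoded record
probs = textcnn_forward(token_ids, table, kernels, dense_w, dense_b)
print(probs)
```

Because each kernel contributes exactly one pooled value, the length of the final feature vector equals the number of kernels regardless of the length of the input record.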
The technical solution of the embodiments of the present application addresses the waste of the large volume of valuable structured and unstructured electronic health record data. Through information extraction from, and structured design of, unstructured data, and by further combining supervised and unsupervised models to mine the text of electronic health records, it aims to use electronic health record data to assist the diagnosis and treatment of esophageal squamous cell carcinoma.
Examples of implementation two
This embodiment of the specification provides a text mining system based on unstructured electronic health records, implemented by the following technical solution.
The system includes:
Text preprocessing module: takes multiple present illness history records exported from a hospital database as raw experimental data and unfolds each sample as a time series; words denoting time are identified first, and the long text is then cut into several short texts at the time nodes, i.e. the present illness history is split into individual hospitalization records;
Feature engineering module: determines the features of the present illness history and the extraction rules and saves them in an XML file;
Feature extraction: extracts history information on the basis of the rule base and stores it in structured form; the defined rules are rewritten into regular expressions to extract features from the unstructured text;
Feature quantification: analyses the data type of each extracted feature value and quantifies the feature value numerically;
Analysis and prediction module: performs text classification on the basis of text clustering;
Text clustering: the quantified feature values are unified into a feature representation X = (x_1, x_2, x_3, ..., x_57) for one hospitalization, which is then used as the text-feature input of an unsupervised clustering algorithm to perform text clustering.
For the specific implementation of the modules in this example, refer to the detailed steps in examples of implementation one.
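The preprocessing, feature extraction, and quantification modules listed above can be sketched as follows. Everything concrete here is a hypothetical stand-in chosen for illustration: the sample record, the date format used as the time node, the three rules, and the numeric codes are assumptions, whereas the real rule base is stored in an XML file and yields 57 features per hospitalization.

```python
import re

# Hypothetical record; real present illness history text comes from the hospital database.
record = ("2018-03-01 admitted, dysphagia present for 2 months, weight loss 5 kg. "
          "2018-04-10 admitted, dysphagia relieved, radiotherapy dose 60 Gy.")

# Assumed time-node pattern: an ISO-style date opens each hospitalization.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def split_by_time(text):
    """Cut a long record into one short text per hospitalization, using dates as boundaries."""
    bounds = [m.start() for m in DATE.finditer(text)] + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]

# Hypothetical rule base: feature name -> regular expression with one capture group.
RULES = {
    "weight_loss_kg": re.compile(r"weight loss (\d+(?:\.\d+)?) kg"),
    "radiotherapy_gy": re.compile(r"radiotherapy dose (\d+(?:\.\d+)?) Gy"),
    "dysphagia": re.compile(r"dysphagia (present|relieved|aggravated)"),
}

# Hypothetical numeric coding of a categorical value; 0 is reserved for "not mentioned".
DYSPHAGIA_CODE = {"relieved": 1.0, "present": 2.0, "aggravated": 3.0}

def extract(short_text):
    """Return the quantified key-value features of one hospitalization record."""
    feats = {}
    for name, rx in RULES.items():
        m = rx.search(short_text)
        if m is None:
            feats[name] = 0.0                       # assumed default when the keyword is absent
        elif name == "dysphagia":
            feats[name] = DYSPHAGIA_CODE[m.group(1)]
        else:
            feats[name] = float(m.group(1))
    return feats

visits = split_by_time(record)
vectors = [list(extract(v).values()) for v in visits]   # one feature vector per hospitalization
```

Each row of `vectors` plays the role of the feature representation X for one hospitalization and could be passed to a clustering algorithm as is.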
Examples of implementation three
This embodiment of the specification provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the text mining method based on unstructured electronic health records described in examples of implementation one.
Examples of implementation four
This embodiment of the specification provides a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the text mining method based on unstructured electronic health records described in examples of implementation one.
It should be understood that, in this specification, descriptions referring to the terms "an embodiment", "another embodiment", "other embodiments", or "first embodiment to N-th embodiment" mean that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, such schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing describes only preferred embodiments of the present disclosure and is not intended to limit it; those skilled in the art may make various modifications and variations to the disclosure. Any modification, equivalent replacement, or improvement made within the spirit and principles of the disclosure shall fall within its scope of protection.

Claims (10)

1. A text mining method based on unstructured electronic health records, characterized by comprising:
taking multiple present illness history records exported from a hospital database as raw experimental data and unfolding each sample as a time series; first identifying the words that denote time, then cutting the long text into several short texts at the time nodes, i.e. splitting the present illness history into individual hospitalization records;
determining the features of the present illness history and the extraction rules and saving them in an XML file;
extracting history information on the basis of the rule base and storing it in structured form, the defined rules being rewritten into regular expressions to extract features from the unstructured text;
quantified representation of features: analysing the data type of each extracted feature value and quantifying the feature value numerically;
unifying the quantified feature values into a feature representation X = (x_1, x_2, x_3, ..., x_57) for one hospitalization, and using it as the text-feature input of an unsupervised clustering algorithm to perform text clustering.
2. The text mining method based on unstructured electronic health records as claimed in claim 1, characterized in that, after the text clusters are obtained, text classification is performed with a TextCNN model, comprising:
text preprocessing: using a Tokenizer to convert the present illness history text into numerical features; first, the words of the Chinese text are separated from one another by spaces; then a dictionary of fixed length is built, and long samples are truncated and short samples padded so that all samples have the same length; finally, each word is assigned an index, so that each present illness history record is converted into a code sequence;
converting each word code into a word vector;
extracting features with a convolutional neural network: performing convolution with kernels of three different sizes;
since kernels of different sizes produce feature maps of different sizes, applying a pooling function to each feature map to give them the same dimension and obtain the final feature vector;
outputting the probability of each class.
3. The text mining method based on unstructured electronic health records as claimed in claim 2, characterized in that, when each word code is converted into a word vector, each word is one-hot encoded on the basis of the obtained word codes, so that each word is represented by a vector of dimension vocabulary_size; iterative training of the neural network then yields a suitable weight matrix whose row count is vocabulary_size and whose column count is the word-vector dimension, and the one-hot encoded vectors are mapped to a lower-dimensional space to obtain low-dimensional word vectors.
4. The text mining method based on unstructured electronic health records as claimed in claim 1, characterized in that rewriting the defined rules into regular expressions to extract features from the unstructured text specifically comprises:
reading the segmented hospitalization records one by one and using regular expressions to find and locate the keywords that occur;
using each occurring keyword as a split point, segmenting the short text again into multiple clauses;
determining feature values: since the value corresponding to a keyword generally appears in the short clauses on either side of that keyword, regular expressions are applied again to the text segments on both sides of the keyword to find its value; one keyword may have multiple values, so that complex text is converted into key-value form;
structured storage of results: the extraction results are stored in an XML file in time-series order.
5. The text mining method based on unstructured electronic health records as claimed in claim 1, characterized in that the extracted features fall mainly into four categories: symptoms, examination findings, chemoradiotherapy regimen, and efficacy evaluation.
6. The text mining method based on unstructured electronic health records as claimed in claim 1, characterized in that text clustering is performed with the DBSCAN clustering algorithm:
(1) first, an object p that has not yet been checked is selected at random from the data set; if p has not been processed, its neighborhood is checked; if the neighborhood contains no fewer than MinPts objects, a new cluster C is created and all points in the neighborhood are added to a candidate set N;
(2) for every object q in the candidate set N that has not yet been processed, its neighborhood is checked; if the neighborhood contains at least MinPts objects, these objects are added to N; if q has not been assigned to any cluster, q is added to C;
(3) step (2) is repeated, continuing to check the unprocessed objects in N, until the candidate set N is empty;
(4) steps (1) to (3) are repeated until every object has been assigned to a cluster or marked as noise.
7. The text mining method based on unstructured electronic health records as claimed in any one of claims 1-6, characterized in that the method is applicable to, but not limited to, text mining of electronic health records of esophageal squamous cell carcinoma.
8. A text mining system based on unstructured electronic health records, characterized by comprising:
a text preprocessing module: takes multiple present illness history records exported from a hospital database as raw experimental data and unfolds each sample as a time series; words denoting time are identified first, and the long text is then cut into several short texts at the time nodes, i.e. the present illness history is split into individual hospitalization records;
a feature engineering module: determines the features of the present illness history and the extraction rules and saves them in an XML file;
feature extraction: extracts history information on the basis of the rule base and stores it in structured form, the defined rules being rewritten into regular expressions to extract features from the unstructured text;
feature quantification: analyses the data type of each extracted feature value and quantifies the feature value numerically;
an analysis and prediction module: performs text classification on the basis of text clustering;
text clustering: the quantified feature values are unified into a feature representation X = (x_1, x_2, x_3, ..., x_57) for one hospitalization, which is then used as the text-feature input of an unsupervised clustering algorithm to perform text clustering.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the text mining method based on unstructured electronic health records as claimed in any one of claims 1-7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the text mining method based on unstructured electronic health records as claimed in any one of claims 1-7.
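For illustration only, the clustering procedure recited in claim 6 can be sketched as below. This is a minimal sketch, not the applicants' implementation: the Euclidean distance, the neighborhood radius eps, the index-order (rather than random) selection of the next object, and the function names are all assumptions.

```python
import numpy as np

NOISE, UNSEEN = -1, -2

def region_query(X, i, eps):
    """Indices of all points within distance eps of point i (including i itself)."""
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def dbscan(X, eps, min_pts):
    labels = np.full(len(X), UNSEEN)
    cluster = -1
    for p in range(len(X)):                    # step (1): pick an unchecked object p
        if labels[p] != UNSEEN:
            continue
        neighbors = region_query(X, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE                  # may later be absorbed as a border point
            continue
        cluster += 1                           # new cluster C
        labels[p] = cluster
        candidates = list(neighbors)           # candidate set N
        k = 0
        while k < len(candidates):             # steps (2)-(3): drain the candidate set
            q = candidates[k]
            k += 1
            if labels[q] == NOISE:
                labels[q] = cluster            # border point joins C
            if labels[q] != UNSEEN:
                continue                       # already processed
            labels[q] = cluster
            q_neighbors = region_query(X, q, eps)
            if len(q_neighbors) >= min_pts:    # q is a core point: grow N
                candidates.extend(q_neighbors)
    return labels                              # step (4): every object is clustered or noise

# Two tight groups and one isolated point (synthetic data, not patient data).
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, 50]], dtype=float)
labels = dbscan(X, eps=1.5, min_pts=3)
```

On this synthetic input the two tight groups form two clusters and the isolated point is labeled as noise (-1).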

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910701406.5A CN110428907A (en) 2019-07-31 2019-07-31 A kind of text mining method and system based on unstructured electronic health record

Publications (1)

Publication Number Publication Date
CN110428907A true CN110428907A (en) 2019-11-08

Family

ID=68411762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910701406.5A Pending CN110428907A (en) 2019-07-31 2019-07-31 A kind of text mining method and system based on unstructured electronic health record

Country Status (1)

Country Link
CN (1) CN110428907A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909783A (en) * 2017-02-24 2017-06-30 Beijing Jiaotong University A kind of case history textual medical Methods of Knowledge Discovering Based based on timeline
CN108831559A (en) * 2018-06-20 2018-11-16 Tsinghua University A kind of Chinese electronic health record text analyzing method and system
CN109918507A (en) * 2019-03-08 2019-06-21 Beijing University of Technology One kind being based on the improved file classification method of TextCNN

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
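Bao Xiaoyuan et al.: "A customized method for information extraction from unstructured electronic medical records", Journal of Peking University (Health Sciences) *
Gu Bo et al.: "Analysis and comparison of text clustering algorithms", Computer Development & Applications *
Zhao Dongxiao et al.: "A review of text semantic mining methods for intelligence research", New Technology of Library and Information Service *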

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112687367A (en) * 2020-12-29 2021-04-20 Chinese PLA General Hospital Medical record grouping method, device and equipment based on dynamic disease condition and storage medium
CN113010643A (en) * 2021-03-22 2021-06-22 Ping An Technology (Shenzhen) Co., Ltd. Method, device and equipment for processing vocabulary in field of Buddhism and storage medium
CN113010643B (en) * 2021-03-22 2023-07-21 Ping An Technology (Shenzhen) Co., Ltd. Method, device, equipment and storage medium for processing vocabulary in Buddha field
CN112927814A (en) * 2021-03-30 2021-06-08 Shanzhen (Shanghai) Information Technology Co., Ltd. Physical examination recommendation method, device, equipment and storage medium for placeholder lesions
CN113094477A (en) * 2021-06-09 2021-07-09 Tencent Technology (Shenzhen) Co., Ltd. Data structuring method and device, computer equipment and storage medium
CN113094477B (en) * 2021-06-09 2021-08-31 Tencent Technology (Shenzhen) Co., Ltd. Data structuring method and device, computer equipment and storage medium
CN116401532A (en) * 2023-06-07 2023-07-07 Shandong University Method and system for recognizing frequency instability of power system after disturbance
CN116401532B (en) * 2023-06-07 2024-02-23 Shandong University Method and system for recognizing frequency instability of power system after disturbance
CN117789907A (en) * 2024-02-28 2024-03-29 Shandong Jinwei Software Technology Co., Ltd. Intelligent medical data intelligent management method based on multi-source data fusion
CN117789907B (en) * 2024-02-28 2024-05-10 Shandong Jinwei Software Technology Co., Ltd. Intelligent medical data intelligent management method based on multi-source data fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191108

RJ01 Rejection of invention patent application after publication