CN110428907A - Text mining method and system based on unstructured electronic medical records - Google Patents
Text mining method and system based on unstructured electronic medical records
- Publication number
- CN110428907A CN201910701406.5A
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- electronic health
- health record
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/81—Indexing, e.g. XML tags; Data structures therefor; Storage structures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Public Health (AREA)
- Medical Informatics (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Biomedical Technology (AREA)
- Pathology (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present disclosure proposes a text mining method and system based on unstructured electronic medical records, comprising: exporting a plurality of history-of-present-illness records for a given disease from a hospital database as the original experimental data, and arranging each sample as a time series, by first identifying the words that express time and then, taking the time nodes as boundaries, segmenting the long text into several short texts, that is, segmenting the history of present illness into one record per hospitalization; determining the features of the history of present illness and the extraction rules, and saving them to an XML file; extracting the medical history information on the basis of the rule base and storing it in structured form, where the defined rules are rewritten into regular expressions to extract features from the unstructured text; and quantifying the features, that is, analyzing the data type of each extracted feature value and converting the value into a number. The quantified feature values are unified into the feature representation $X = (x_1, x_2, x_3, \ldots, x_{57})$ of one hospitalization, which is then used as the text feature input of an unsupervised clustering algorithm to cluster the texts.
Description
Technical field
The disclosure relates to the technical field of natural language processing and machine learning, and in particular to a text mining method and system based on unstructured electronic medical records.
Background
In Chinese electronic medical record (Electronic Medical Record, EMR) data, the chief complaint, history of present illness, past history, imaging reports, operation records and other main sections are written in natural language and are collected and stored in both structured and unstructured formats. They are the concrete record of the details of a clinician's actual diagnosis and treatment, contain a comprehensive, professional and accurate description of the patient's health, and are a valuable medical knowledge resource. Structuring and text mining of Chinese electronic medical records is therefore of great significance for assisting clinical diagnosis and treatment.
At present, structured data are stored in electronic medical records in two-dimensional (tabular) form, whereas unstructured data consist mainly of free text. Although unstructured text is rich in information, it is difficult to process and use; extracting useful information from unstructured data is usually done by hand, which is time-consuming and laborious.
In natural language processing (NLP) and machine learning, algorithms process and analyze structured, vectorized data far more easily, so unstructured text must first be converted into structured data, then vectorized and fed to an algorithm for analysis and mining. Inspection of the data shows that, for the records of one specific disease, such as esophageal cancer electronic medical records, the domain terminology and basic grammar are largely fixed. A rule-based method can therefore be used for feature extraction and structured representation of the feature data, which avoids a great waste of manpower and material resources and improves accuracy. Meanwhile, text mining of electronic medical records is at present generally done by supervised learning, which requires a large number of manual labels, burdens medical staff, and is likely to cause error propagation.
Summary of the invention
The purpose of the embodiments of this specification is to provide a text mining method based on unstructured electronic medical records: a rule-based method for information extraction from, and text mining of, unstructured Chinese text, which ultimately achieves the structuring of esophageal squamous cell carcinoma electronic medical records and the prediction of chemoradiotherapy sensitivity.
The embodiments of this specification provide a text mining method based on unstructured electronic medical records, implemented through the following technical solution.
The method comprises:
exporting a plurality of history-of-present-illness records for a given disease from a hospital database as the original experimental data, and arranging each sample as a time series, by first identifying the words that express time and then, taking the time nodes as boundaries, segmenting the long text into several short texts, that is, segmenting the history of present illness into one record per hospitalization;
determining the features of the history of present illness and the extraction rules, and saving them to an XML file;
extracting the medical history information on the basis of the rule base and storing it in structured form, where the defined rules are rewritten into regular expressions to extract features from the unstructured text;
quantifying the features: analyzing the data type of each extracted feature value and converting the value into a number;
unifying the quantified feature values into the feature representation $X = (x_1, x_2, x_3, \ldots, x_{57})$ of one hospitalization, which is then used as the text feature input of an unsupervised clustering algorithm to cluster the texts.
In a further technical solution, after the text clusters are obtained, text classification is performed with a TextCNN model, comprising:
text preprocessing: a Tokenizer converts the history-of-present-illness text into numerical features; first, the words of the Chinese text are separated by spaces; then a dictionary of fixed size is built, and long samples are truncated and short ones padded so that all samples have the same length; finally, each word is assigned an index, so that every history-of-present-illness record becomes a sequence of codes;
converting each word code into a word vector: based on the word codes obtained, each word is one-hot encoded, so that every word is represented by a vector of dimension vocabulary_size; training the neural network then iteratively updates a weight matrix whose number of rows is vocabulary_size and whose number of columns is the word-vector dimension, and this matrix maps the one-hot word vectors into a low-dimensional space, giving low-dimensional word vectors;
extracting features with a convolutional neural network: convolution is performed with convolution kernels of three different sizes;
the feature maps obtained by kernels of different sizes differ in size, so a pooling function is applied to each feature map to give them the same dimension, yielding the final feature vector;
outputting the probability of each class.
In a further technical solution, the defined rules are rewritten into regular expressions to extract features from the unstructured text, specifically as follows:
reading the segmented hospitalization records one by one, and using regular expressions to find and locate every keyword that occurs;
taking each keyword occurrence as a split point and segmenting the short text again into several clauses;
determining the feature values: since the value corresponding to a keyword usually lies in the short clauses on either side of that keyword, regular expressions are applied again to the text on both sides of the keyword to find its value; one keyword may have several values, so that complex text is converted into key-value form;
storing the results in structured form: the extraction results are stored in an XML file in time order.
In a further technical solution, the extracted features fall mainly into four categories: symptoms, examination findings, chemoradiotherapy regimen, and efficacy evaluation.
In a further technical solution, text clustering is performed with the DBSCAN clustering algorithm:
(1) first, randomly select from the data set an object p that has not yet been checked; if p has not been processed, check its neighborhood, and if the neighborhood contains no fewer than MinPts objects, create a new cluster C and add all points in the neighborhood to the candidate set N; an object here is the text feature vector of one history of present illness;
(2) for every object q in the candidate set N that has not yet been processed, check its neighborhood, and if it contains at least MinPts objects, add those objects to N; if q has not been assigned to any cluster, add q to C;
(3) repeat step (2), continuing to check the unprocessed objects in N, until the candidate set N is empty;
(4) repeat steps (1) to (3) until every object has been assigned to a cluster or marked as noise.
In a further technical solution, the above method is applicable to text mining of esophageal squamous cell carcinoma electronic medical records, but is not limited to text mining of such records.
The embodiments of this specification provide a text mining system based on unstructured electronic medical records, implemented through the following technical solution.
The system comprises:
a text preprocessing module, which takes a plurality of history-of-present-illness records for a given disease exported from a hospital database as the original experimental data and arranges each sample as a time series, by first identifying the words that express time and then, taking the time nodes as boundaries, segmenting the long text into several short texts, that is, segmenting the history of present illness into one record per hospitalization;
a feature engineering module, which determines the features of the history of present illness and the extraction rules and saves them to an XML file;
feature extraction: the medical history information is extracted on the basis of the rule base and stored in structured form, where the defined rules are rewritten into regular expressions to extract features from the unstructured text;
feature quantification: the data type of each extracted feature value is analyzed and the value is converted into a number;
an analysis and prediction module, which performs text classification on the basis of text clustering;
text clustering: the quantified feature values are unified into the feature representation $X = (x_1, x_2, x_3, \ldots, x_{57})$ of one hospitalization, which is then used as the text feature input of an unsupervised clustering algorithm to cluster the texts.
The embodiments of this specification provide a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the text mining method based on unstructured electronic medical records.
The embodiments of this specification provide a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the text mining method based on unstructured electronic medical records.
Compared with the prior art, the beneficial effects of the disclosure are as follows:
The disclosure first, from an unsupervised perspective, clusters together patients with similar courses of diagnosis and treatment (text clustering), which can indirectly support the recommendation of treatment plans; it then takes the "efficacy evaluation" feature of each sample as the sample label (text classification), which allows chemoradiotherapy sensitivity to be predicted. Through these steps, the esophageal squamous cell carcinoma history-of-present-illness data are fully used to predict chemoradiotherapy sensitivity.
Brief description of the drawings
The accompanying drawings, which form a part of this disclosure, are provided for further understanding of the disclosure; the illustrative embodiments of the disclosure and their description serve to explain the disclosure and do not unduly limit it.
Fig. 1 is a flow chart of the method of a specific embodiment of the disclosure;
Fig. 2 is a schematic diagram of the neural network model used for text classification in a specific embodiment of the disclosure;
Table 1 shows examples of some of the defined features and extraction rules.
Detailed description
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the disclosure belongs.
It should also be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the illustrative embodiments of the disclosure. As used herein, unless the context clearly indicates otherwise, the singular form is intended to include the plural form as well. In addition, it should be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
Embodiment one
This embodiment discloses a text mining method based on unstructured electronic medical records. Referring to Fig. 1, the method comprises:
Step (1): customizing the corpus. The history-of-present-illness records of 2000 esophageal squamous cell carcinoma cases are exported from the hospital database as the original experimental data, and each sample is arranged as a time series. To make full use of the diagnosis and treatment information recorded for each of the patient's hospitalizations, the words that express time must first be identified, for example: September 2009, 4 years ago, the 7th admission, 2014-10-16, 5 months ago. Then, taking the time nodes as boundaries, the long text is segmented into several short texts, that is, the history of present illness is segmented into one record per hospitalization.
In this step, "customizing" means applying preprocessing to the original data, that is, segmenting the long sentences of the history of present illness into several short sentences according to the time nodes.
The corpus is the complete set of text data on which the subsequent steps operate after preprocessing.
In this step, because the electronic medical records of a given disease follow a fixed grammar or template, the extraction rules can also be customized once the corpus has been determined. The history of present illness takes each hospitalization as one time node and describes the whole course of the disease after onset, that is, its occurrence, development, evolution, and the course of diagnosis and treatment. Processing and analyzing history-of-present-illness data will therefore help with personalized clinical adjuvant treatment.
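The segmentation in step (1) can be sketched as follows. This is a minimal illustration, not the patented implementation: the pattern `TIME_PATTERN` and the function `split_by_time_nodes` are hypothetical names, and the patterns target English renderings of the example expressions, whereas real rules would match the Chinese time expressions found in the records.

```python
import re

# Hypothetical patterns for time expressions (assumption: English renderings
# of the examples in the text; a real rule set would match Chinese dates).
TIME_PATTERN = re.compile(
    r"\d{4}-\d{1,2}-\d{1,2}"                          # 2014-10-16
    r"|\b[Ii]n [A-Z][a-z]+ \d{4}"                     # In September 2009
    r"|\b\d+ (?:years?|months?|days?) ago"            # 4 years ago, 5 months ago
    r"|\b[Oo]n the \d+(?:st|nd|rd|th) admission"      # On the 7th admission
)


def split_by_time_nodes(history: str) -> list:
    """Cut a long history text into short texts, each starting at a time node."""
    starts = [m.start() for m in TIME_PATTERN.finditer(history)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first time node
    bounds = starts + [len(history)]
    segments = [history[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in segments if s]


example = (
    "In September 2009 the patient was diagnosed with esophageal cancer. "
    "2014-10-16 admitted again for chemotherapy. "
    "5 months ago the symptoms recurred."
)
segments = split_by_time_nodes(example)
```

Each element of `segments` then corresponds to one hospitalization record and can be passed on to feature extraction.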
Step (2): feature definition, which mainly comprises defining the feature values and the extraction rules. There are many methods for information extraction from natural language; the most commonly used at present are based on deep learning and on rule matching. Deep learning methods need a large amount of labeled data, whereas analysis shows that the history-of-present-illness data have a uniform format, a small data volume, and fixed grammar and terminology. On the basis of checking and interpreting the electronic medical record data, the disclosure therefore combines simple rule discovery with verification, and finally determines a total of 56 esophageal squamous cell carcinoma history-of-present-illness features and their extraction rules, part of which is shown in Table 1. To make later modeling and data reading easier, they are saved to an XML file.
Table 1
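The contents of Table 1 are not reproduced in this text. As a purely illustrative sketch of how feature definitions and extraction rules might be saved to and read back from an XML file, the example below uses Python's standard `xml.etree.ElementTree`. The feature names, categories, value types and patterns in `RULES`, and the element names `rules`, `feature` and `pattern`, are all assumptions and are not taken from the patent.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule entries (assumption): feature name, category, value type, regex.
RULES = [
    {"name": "chest_tightness", "category": "symptom", "type": "binary",
     "pattern": r"chest tightness"},
    {"name": "regimen", "category": "chemoradiotherapy", "type": "enum",
     "pattern": r"\b(GP|TPF|TP) regimen"},
]


def rules_to_xml(rules):
    """Serialize the rule list to an XML string."""
    root = ET.Element("rules")
    for rule in rules:
        attrs = {"name": rule["name"], "category": rule["category"], "type": rule["type"]}
        node = ET.SubElement(root, "feature", attrs)
        ET.SubElement(node, "pattern").text = rule["pattern"]
    return ET.tostring(root, encoding="unicode")


def rules_from_xml(xml_text):
    """Read the rule list back from an XML string."""
    root = ET.fromstring(xml_text)
    return [
        {"name": f.get("name"), "category": f.get("category"),
         "type": f.get("type"), "pattern": f.findtext("pattern")}
        for f in root.iter("feature")
    ]


xml_text = rules_to_xml(RULES)
restored = rules_from_xml(xml_text)
```

Keeping the rules in a file rather than in code means the rule base can be edited without touching the extraction program.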
Step (3): rule-base-driven extraction of medical history information and structured storage. The defined rules are rewritten into regular expressions to extract features from the unstructured text, with the aim of being accurate, semi-automatic, intelligent, efficient and simple. The details are as follows:
3-1) Read the hospitalization records segmented in step (1) one by one, and use regular expressions to find and locate every keyword that occurs. For example, for the record 'In September 2009 the patient, because of "chest tightness for half a year and precordial pain for one month", was diagnosed with "esophageal cancer" by esophagoscopy.', the keywords are: chest tightness, pain, esophageal cancer;
3-2) Taking each keyword occurrence as a split point, segment the short text again into several clauses. The result is: (1) chest tightness for half a year; (2) precordial pain for one month; (3) diagnosed with "esophageal cancer" by esophagoscopy;
3-3) Determine the value of each feature (keyword). Since the value corresponding to a keyword usually lies in the short clauses on either side of that keyword, regular expressions are applied again to the text on both sides of the keyword to find its value. One keyword may have several values, so that complex text is converted into key-value form. The result is: chest tightness: half a year; pain: precordial region; esophageal cancer: yes;
3-4) Store the results in structured form. Because Extensible Markup Language (XML) makes it easy to pass data between programs, the extraction results are stored in an XML file in time order.
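Steps 3-1) to 3-4) can be sketched as below. This is a simplified illustration under stated assumptions: the keyword list and value patterns in `VALUE_RULES` are hypothetical, the text is an English rendering of the example record, and the "text on both sides" of a keyword is approximated by the span between the previous and the next keyword.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical keyword -> value pattern rules (assumption, not from Table 1).
# A pattern without a capture group is treated as a yes/no feature.
VALUE_RULES = {
    "chest tightness": r"for ((?:half a|one|\d+) (?:day|week|month|year)s?)",
    "pain": r"(precordial|retrosternal|epigastric)",
    "esophageal cancer": r"diagnosed",
}
KEYWORD_RE = re.compile("|".join(re.escape(k) for k in VALUE_RULES))


def extract_key_values(record):
    """Locate keywords, then search the text around each one for its value."""
    hits = list(KEYWORD_RE.finditer(record))              # step 3-1
    pairs = []
    for i, hit in enumerate(hits):
        # Steps 3-2 / 3-3: the window runs from the previous keyword to the next.
        left = hits[i - 1].end() if i > 0 else 0
        right = hits[i + 1].start() if i + 1 < len(hits) else len(record)
        window = record[left:right]
        for m in re.finditer(VALUE_RULES[hit.group()], window):
            pairs.append((hit.group(), m.group(1) if m.re.groups else "yes"))
    return pairs


def to_xml(admissions):
    """Step 3-4: store (time, key-value pairs) per admission in time order."""
    root = ET.Element("history")
    for time, pairs in admissions:
        node = ET.SubElement(root, "admission", {"time": time})
        for key, value in pairs:
            ET.SubElement(node, "feature", {"name": key}).text = value
    return ET.tostring(root, encoding="unicode")


record = ("chest tightness for half a year, precordial pain for one month, "
          "diagnosed with esophageal cancer by esophagoscopy")
pairs = extract_key_values(record)
xml_text = to_xml([("2009-09", pairs)])
```

Because a keyword may match its value pattern more than once inside the window, the result is a list of pairs rather than a dictionary, which mirrors the statement that one keyword can have several values.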
In a specific embodiment, the defined rules are rewritten into regular expressions to extract feature information from the unstructured text. History-of-present-illness text: "Cycle 1 of the TP regimen was started on December 7, 2016".
Regular expression: "GP regimen|TPF regimen|TP regimen|\d.*cycle|\d.*days/cycle";
The extraction result is "regimen: TP".
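A small runnable version of this example is given below. It is a sketch on an English rendering of the sentence: the names `REGIMEN_RE`, `CYCLE_RE` and `extract_regimen` are hypothetical, and the patterns are simplified stand-ins for the rule above rather than the original Chinese-language expression.

```python
import re

# Assumption: English stand-ins for the regimen and cycle rules.
REGIMEN_RE = re.compile(r"\b(GP|TPF|TP) regimen\b")
CYCLE_RE = re.compile(r"\bcycle (\d+)\b", re.IGNORECASE)


def extract_regimen(sentence):
    """Return the regimen name and cycle number found in one sentence."""
    result = {}
    regimen = REGIMEN_RE.search(sentence)
    if regimen:
        result["regimen"] = regimen.group(1)
    cycle = CYCLE_RE.search(sentence)
    if cycle:
        result["cycle"] = int(cycle.group(1))
    return result


found = extract_regimen("Cycle 1 of the TP regimen was started on December 7, 2016.")
```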
Step (4): quantification of the features. The data type of each extracted feature value is analyzed and, with reference to the SEER database (an authoritative United States cancer statistics database in which tumor information has been unified and standardized), the feature values are converted into numbers so that the model in the next step can read them. For example, binary values are quantified as (0, 1), continuous numeric data are left unchanged, and discrete data are quantified by enumeration.
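The three quantification cases can be sketched as follows. The function `quantize`, the type labels, the level list and the choice of 0 for a value that is not in the enumeration are all assumptions made for illustration; the actual coding scheme derived from SEER is not given in the text.

```python
def quantize(value, kind, levels=None):
    """Map one extracted feature value to a number according to its data type."""
    if kind == "binary":
        return 1.0 if value in ("yes", True) else 0.0
    if kind == "continuous":
        return float(value)                      # numeric data are kept as they are
    if kind == "enum":
        # Enumerated data: position in the level list, 0 if unknown (assumption).
        return float(levels.index(value) + 1) if value in levels else 0.0
    raise ValueError("unknown feature type: " + kind)


REGIMENS = ["GP", "TP", "TPF"]                   # hypothetical enumeration
vector = [
    quantize("yes", "binary"),
    quantize("62.5", "continuous"),
    quantize("TP", "enum", REGIMENS),
]
```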
Step (5): text mining and analysis prediction. After the information extraction and quantification of the preceding steps, machine learning methods can be applied to mine the electronic medical record text and discover implicit knowledge in it, thereby assisting diagnosis. The present invention attempts to predict the chemoradiotherapy sensitivity of esophageal squamous cell carcinoma by combining supervised and unsupervised methods.
In a specific embodiment, for text clustering, an unsupervised machine learning model is built to discover behavior patterns in clinical electronic medical record text data. Text clustering is widely applicable to many aspects of text mining and information retrieval, and is of great value for organizing and browsing large text collections and for automatically generating hierarchical categories of text collections. The features extracted in the present invention fall mainly into four categories: symptoms, examination findings, chemoradiotherapy regimen, and efficacy evaluation. The quantified feature values are first unified into the feature representation $X = (x_1, x_2, x_3, \ldots, x_{57})$ of one hospitalization of a patient and are then used as the text feature input of an unsupervised clustering algorithm, whose output is the text clustering result. In this way, commonalities or tacit knowledge among patients with esophageal squamous cell carcinoma can be found, ultimately assisting clinical diagnosis and treatment.
Among the many unsupervised clustering algorithms, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Compared with K-means, DBSCAN does not need to know in advance the number of clusters to be formed; it can find clusters of arbitrary shape and can also identify noise points, and is therefore widely used in clustering tasks of all kinds.
In a specific embodiment, the disclosure performs text clustering with the DBSCAN clustering algorithm, as follows:
Input: the data set S, the scan radius Eps, and the minimum number of points MinPts; the data set here is the quantified feature values of each sample record;
Output: all generated clusters that meet the density requirement. A cluster is a category: the clustering algorithm automatically divides the data set S into different categories. In this application, the clusters refer specifically to different chemoradiotherapy sensitivities. By analyzing the samples gathered in the same cluster, implicit domain knowledge shared by samples with the same chemoradiotherapy sensitivity can be discovered and used to guide downstream tasks.
(1) First, randomly select from the data set an object p that has not yet been checked. If p has not been processed (neither assigned to a cluster nor marked as noise), check its neighborhood; if the neighborhood contains no fewer than MinPts objects, create a new cluster C and add all points in the neighborhood to the candidate set N. Here an object is a sample in the data set S, that is, the text feature vector of one history of present illness.
(2) For every object q in the candidate set N that has not yet been processed, check its neighborhood; if it contains at least MinPts objects, add those objects to N. If q has not been assigned to any cluster, add q to C;
(3) Repeat step (2), continuing to check the unprocessed objects in N, until the candidate set N is empty;
(4) Repeat steps (1) to (3) until every object has been assigned to a cluster or marked as noise.
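Steps (1) to (4) can be sketched in plain Python as below. This is a minimal illustration, not the disclosed implementation. It makes three assumptions that the text does not state: distances are Euclidean, objects are visited in index order rather than at random, and an object whose neighborhood is too small is provisionally marked as noise (standard DBSCAN behavior), to be absorbed later if it turns out to lie on the border of a cluster.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)          # None means "not yet processed"
    cluster = -1
    for p in range(len(points)):           # step (1): pick an unprocessed object
        if labels[p] is not None:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE              # assumption: provisional noise mark
            continue
        cluster += 1                       # new cluster C
        labels[p] = cluster
        candidates = [q for q in neighbors if q != p]   # candidate set N
        while candidates:                  # steps (2)-(3): until N is empty
            q = candidates.pop()
            if labels[q] == NOISE:
                labels[q] = cluster        # border point reached from a core point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts:
                candidates.extend(q_neighbors)
    return labels                          # step (4): every object is labeled


samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(samples, eps=1.5, min_pts=3)
```

In practice each sample would be the 57-dimensional quantified feature vector of one hospitalization, and Eps and MinPts would have to be tuned on the data.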
Chemoradiotherapy sensitivity prediction, that is, text classification. Text classification is an important foundation of information retrieval and text mining; its main task is to determine the category of a text from its content, given a predefined set of class labels. Chemoradiotherapy sensitivity differs greatly between individual tumor patients, so drugs should be selected and radiotherapy plans customized according to the specific diagnosis, the individual's constitution and so on, so as ultimately to maximize the patient's chemoradiotherapy sensitivity.
Among supervised text classification algorithms, and in contrast to conventional machine learning text classifiers, a convolutional neural network (CNN) applied to text classification uses kernels of several different sizes to extract the key information in a sentence (similar to n-grams with several window sizes). Features are extracted from the text automatically, which removes the need for feature engineering and allows end-to-end training, so that local correlations are captured better. Thanks to the strong feature representation ability of deep learning, text classification with deep learning often outperforms traditional methods.
The present invention takes all esophageal squamous cell carcinoma history-of-present-illness texts as the original data set and uses the efficacy evaluation of one complete course of treatment, from admission to discharge, as the chemoradiotherapy sensitivity label. A TextCNN model is used to build a chemoradiotherapy sensitivity classification and prediction model based on esophageal squamous cell carcinoma electronic medical records. Finally, the history-of-present-illness record of a new patient can be fed directly into the model to predict the chemoradiotherapy sensitivity of the new sample, in the hope that the prediction can help doctors adjust the chemoradiotherapy regimen and ultimately maximize sensitivity.
In a specific embodiment, the disclosure performs text classification with a TextCNN model, as follows:
(1) Text preprocessing: a Tokenizer converts the history-of-present-illness text into numerical features. First, the words of the Chinese text are separated by spaces. Then a dictionary of fixed size (vocabulary_size) is built, and long samples are truncated and short ones padded so that all samples have the same length. Finally, each word is assigned an index, so that every history-of-present-illness record becomes a sequence of codes;
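The preprocessing in (1) can be sketched without any deep learning library as follows. This is an illustration only: it does not reproduce the API of the Tokenizer the disclosure refers to, the helper names `build_vocab` and `encode` are hypothetical, and reserving index 0 for padding and index 1 for out-of-vocabulary words is an assumption.

```python
from collections import Counter

PAD, UNK = 0, 1   # assumption: reserved indices for padding and unknown words


def build_vocab(texts, vocabulary_size):
    """Keep the most frequent words so that the dictionary has a fixed size."""
    counts = Counter(word for text in texts for word in text.split())
    kept = [word for word, _ in counts.most_common(vocabulary_size - 2)]
    return {word: index + 2 for index, word in enumerate(kept)}


def encode(text, vocab, max_len):
    """Turn a space-separated text into a fixed-length list of word indices."""
    ids = [vocab.get(word, UNK) for word in text.split()][:max_len]   # truncate
    return ids + [PAD] * (max_len - len(ids))                         # pad


corpus = ["chest pain chest", "pain relief"]
vocab = build_vocab(corpus, vocabulary_size=4)
long_sample = encode("chest pain relief today extra", vocab, max_len=4)
short_sample = encode("pain", vocab, max_len=4)
```

For Chinese text, the space-separated input would come from a word segmenter, which is outside the scope of this sketch.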
(2) Embedding layer: converts each word code into a word vector. Based on the word codes obtained above, the embedding layer one-hot encodes each word, so that every word is represented by a vector of dimension vocabulary_size. Training the neural network then iteratively updates a weight matrix whose number of rows is vocabulary_size and whose number of columns is the word-vector dimension. This matrix maps the one-hot word vectors into a low-dimensional space, giving low-dimensional word vectors, as shown schematically in Fig. 2.
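The mapping in (2) amounts to multiplying a one-hot row vector by the weight matrix, which is the same as selecting one row of that matrix. The sketch below demonstrates this with a small hand-written matrix `W`; its values are arbitrary placeholders (in the model they would be learned during training), and the helper names are hypothetical.

```python
def one_hot(index, vocabulary_size):
    """One-hot row vector of dimension vocabulary_size."""
    return [1.0 if i == index else 0.0 for i in range(vocabulary_size)]


def vec_times_matrix(vec, matrix):
    """Row vector (length V) times matrix (V x d) gives a vector of length d."""
    columns = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(columns)]


# Placeholder weight matrix: vocabulary_size = 4 rows, word-vector dimension d = 2.
W = [
    [0.0, 0.0],
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

word_vector = vec_times_matrix(one_hot(2, 4), W)   # same as looking up W[2]
```

This equivalence is why embedding layers are usually implemented as a table lookup rather than an explicit matrix product.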
(3) Convolution layer: features are extracted with a convolutional neural network. Because adjacent words in a sentence are always highly correlated, a one-dimensional convolution can be used; that is, text convolution differs from image convolution in that the convolution runs along only one direction (vertical) of the text sequence. The width of the convolution kernel is fixed to the word-vector dimension d, and its height is a hyperparameter that can be set. The present invention performs convolution with kernels of three different sizes (3, 4, 5).
(4) Pooling layer (max pooling): the feature maps produced by convolution kernels of different sizes also differ in size, so a pooling function is applied to each feature map to give them the same dimension. The present invention uses 1-max pooling, which extracts the maximum value of each feature map; selecting the maximum of each feature map captures its most important feature. Each convolution kernel thus yields a single value. Applying 1-max pooling to all convolution kernels and concatenating the results gives the final feature vector, which is then fed to the softmax layer for classification.
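Steps (3) and (4) can be sketched together as follows. This is a simplified illustration with one kernel per height, all-ones kernel weights and no bias or activation function, all of which are assumptions; a trained TextCNN would have many learned kernels of each height.

```python
def conv1d(sentence, kernel):
    """Slide a kernel of shape (h, d) down a sentence of shape (n, d)."""
    height = len(kernel)
    feature_map = []
    for start in range(len(sentence) - height + 1):
        window = sentence[start:start + height]
        feature_map.append(
            sum(x * w for row, k_row in zip(window, kernel) for x, w in zip(row, k_row))
        )
    return feature_map


def text_cnn_features(sentence, kernels):
    """1-max pooling: keep the maximum of each feature map and concatenate."""
    return [max(conv1d(sentence, kernel)) for kernel in kernels]


# A sentence of 6 words with word-vector dimension d = 2 (placeholder values).
sentence = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2], [1, 1]]
# One kernel per height 3, 4 and 5; kernel width equals d.
kernels = [[[1.0, 1.0] for _ in range(h)] for h in (3, 4, 5)]

map_sizes = [len(conv1d(sentence, k)) for k in kernels]
features = text_cnn_features(sentence, kernels)
```

The feature maps have different lengths (n - h + 1 for kernel height h), but after 1-max pooling every kernel contributes exactly one number, so the concatenated vector has a fixed size regardless of sentence length.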
(5) Fully connected layer (full connection and softmax): the last layer is a fully connected softmax layer that outputs the probability of each class. In the present invention the classes are the chemoradiotherapy sensitivity states (condition worsened, condition improved, no obvious change, completely cured, etc.).
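The output layer in (5) can be sketched as a linear map followed by softmax. The weights, biases and input features below are arbitrary placeholders rather than trained values, and the label strings are paraphrases of the classes listed above.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, bias, labels):
    """Fully connected layer followed by softmax; returns label -> probability."""
    logits = [sum(f * w for f, w in zip(features, row)) + b
              for row, b in zip(weights, bias)]
    return dict(zip(labels, softmax(logits)))


LABELS = ["worsened", "improved", "no obvious change", "cured"]
weights = [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.05], [0.0, 0.0, 0.0]]
bias = [0.0, 0.0, 0.0, 0.0]

probabilities = classify([1.0, 2.0, 3.0], weights, bias, LABELS)
predicted = max(probabilities, key=probabilities.get)
```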
The technical solution of the embodiments of this application addresses the waste of a large amount of valuable structured and unstructured electronic medical record data. Through information extraction from, and structured design for, the unstructured data, and by further combining supervised and unsupervised models for text mining of electronic medical records, it aims to use electronic medical record data to assist the diagnosis and treatment of esophageal squamous cell carcinoma.
Embodiment Two
This embodiment provides a text mining system based on unstructured electronic health records, implemented by the following technical solution.
The system includes:
Text preprocessing module: multiple history-of-present-illness records for a given disease, exported from a hospital database, serve as the initial experimental data. Each sample is unfolded as a time series: words expressing time are identified first, and the long text is then split at these time nodes into several short texts, i.e., the history of present illness is split into one record per hospitalization;
Feature engineering module: the features of the history of present illness and the extraction rules are determined and saved in an xml document;
Feature extraction: history information is extracted and stored in structured form based on the rule base; the defined rules are rewritten into regular expressions to extract features from the unstructured text;
Feature quantization: the data type of each extracted feature value is analyzed, and the feature value is quantized numerically;
Analysis and prediction module: text classification is realized on the basis of text clustering;
Text clustering: the quantized feature values are unified into the feature representation of one hospitalization, X = (x_1, x_2, x_3, ..., x_57), which is then fed to an unsupervised clustering algorithm as the text feature to perform text clustering.
For the specific implementation of the modules in this embodiment, see the detailed steps in Embodiment One.
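The splitting performed by the text preprocessing module (cutting a long history at time expressions) might look like the following sketch. The date pattern, the function name `split_by_time_nodes`, and the sample record are all assumptions made for illustration; the patent does not specify which time expressions are recognized.

```python
import re

# Assumed time expressions: "2018-03-05", "2018年3月5日" or "2018年3月".
TIME_NODE = re.compile(r"\d{4}(?:-\d{1,2}-\d{1,2}|年\d{1,2}月(?:\d{1,2}日)?)")

def split_by_time_nodes(text):
    """Split a long history text into short texts, one per time node."""
    starts = [m.start() for m in TIME_NODE.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    segments = []
    # Keep any text that precedes the first time expression.
    if text[:starts[0]].strip():
        segments.append(text[:starts[0]].strip())
    # Each segment runs from one time node to the next (or to the end).
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        segments.append(text[begin:end].strip())
    return segments

# Hypothetical record covering two hospitalizations.
record = ("2018-03-05 admitted with dysphagia, gastroscopy performed. "
          "2018-05-10 readmitted, concurrent chemoradiotherapy started.")
segments = split_by_time_nodes(record)
```

Each element of `segments` then corresponds to one hospitalization record, which is the unit the later feature extraction works on.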
Embodiment Three
This embodiment provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the text mining method based on unstructured electronic health records of Embodiment One.
Embodiment Four
This embodiment provides a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the text mining method based on unstructured electronic health records of Embodiment One.
It should be understood that, in this specification, descriptions referring to the terms "an embodiment", "another embodiment", "other embodiments", or "first embodiment to N-th embodiment" mean that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing are merely preferred embodiments of the present disclosure and are not intended to limit it. For those skilled in the art, the disclosure may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the disclosure shall fall within its scope of protection.
Claims (10)
1. A text mining method based on unstructured electronic health records, characterized by comprising:
Taking multiple history-of-present-illness records for a given disease, exported from a hospital database, as raw experimental data, and unfolding each sample as a time series: words expressing time are identified first, and the long text is then split at these time nodes into several short texts, i.e., the history of present illness is split into one record per hospitalization;
Determining the features of the history of present illness and the extraction rules, and saving them in an xml document;
Extracting history information based on the rule base and storing it in structured form, the defined rules being rewritten into regular expressions to extract features from the unstructured text;
Quantized representation of features: analyzing the data type of each extracted feature value and quantizing the feature value numerically;
Unifying the quantized feature values into the feature representation of one hospitalization, X = (x_1, x_2, x_3, ..., x_57), and feeding it to an unsupervised clustering algorithm as the text feature to perform text clustering.
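The quantization step of claim 1 is stated only in general terms. One possible reading, sketched below, maps each extracted value to a number according to its data type. The mapping rules, the feature names, and the ordinal coding of categories are all assumptions made for illustration, and only four of the 57 components are shown.

```python
def quantize(value, categories=None):
    """Map an extracted feature value to a number based on its data type."""
    if value is None:
        return 0.0                                  # assumed: missing feature -> 0
    if isinstance(value, bool):                     # check bool before int (bool is an int subclass)
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if categories is not None and value in categories:
        return float(categories.index(value) + 1)   # 1-based ordinal code
    try:
        return float(value)                         # numeric string such as "60"
    except ValueError:
        return 0.0                                  # assumed: unrecognised text -> 0

# Hypothetical category order and extracted values for one hospitalization.
EFFICACY = ["aggravated", "no significant change", "improved", "completely cured"]
extracted = {"dysphagia": True, "radiation_dose_gy": "60",
             "efficacy": "improved", "fever": None}

# Fixed feature order so that every hospitalization yields a comparable vector.
x = [
    quantize(extracted["dysphagia"]),
    quantize(extracted["radiation_dose_gy"]),
    quantize(extracted["efficacy"], EFFICACY),
    quantize(extracted["fever"]),
]
```

Keeping the feature order fixed across records is what allows the resulting vectors to be compared by a distance-based clustering algorithm.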
2. The text mining method based on unstructured electronic health records according to claim 1, characterized in that, after the text clusters are obtained, text classification is realized with a TextCNN model, comprising:
Text preprocessing: converting the history-of-present-illness text into numerical features with a Tokenizer: first, the words of the Chinese text are separated by spaces; then a fixed-length dictionary is established, and samples are truncated or padded so that all samples have the same length; finally, each word is assigned an index, so that every history-of-present-illness record becomes an encoded sequence;
Converting each word code into a word vector;
Extracting features with a convolutional neural network: the convolution is performed with convolution kernels of three different sizes;
The feature maps produced by convolution kernels of different sizes differ in size; a pooling function is applied to each feature map so that their dimensions match, giving the final feature vector;
Outputting the probability of each class.
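The text preprocessing of claim 2 (index assignment plus truncation and padding to a common length) can be sketched in plain Python. This is not the Tokenizer the claim refers to; the helper names `build_index` and `encode`, the reserved padding index 0, and the English sample texts are assumptions for illustration.

```python
def build_index(tokenized_texts):
    """Assign each distinct word an index starting at 1 (0 is reserved for padding)."""
    index = {}
    for tokens in tokenized_texts:
        for tok in tokens:
            index.setdefault(tok, len(index) + 1)
    return index

def encode(tokens, index, max_len):
    """Turn a token list into a fixed-length list of word indices."""
    ids = [index[t] for t in tokens if t in index][:max_len]  # truncate long samples
    return ids + [0] * (max_len - len(ids))                   # pad short samples

# Hypothetical, already space-separated sample texts.
texts = ["dysphagia for two months",
         "chest pain after radiotherapy and chemotherapy"]
tokenized = [t.split() for t in texts]
word_index = build_index(tokenized)
encoded = [encode(toks, word_index, max_len=5) for toks in tokenized]
```

The first sample is padded with a trailing 0 and the second is truncated to five indices, so both have the same length.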
3. The text mining method based on unstructured electronic health records according to claim 2, characterized in that, when converting each word code into a word vector, each word is first one-hot encoded on the basis of its word code, so that each word is represented by a vector of dimension vocabulary_size; a suitable weight matrix is then obtained through the training iterations of the neural network, whose number of rows is vocabulary_size and whose number of columns is the dimension of the word vector; the one-hot encoded word vectors are thereby mapped into a low-dimensional space to obtain low-dimensional word vectors.
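The mapping in claim 3 amounts to multiplying a one-hot vector by the weight matrix, which selects a single row of that matrix. The sketch below demonstrates this with a hypothetical 4-word vocabulary and 2-dimensional word vectors; the weight values stand in for trained parameters and are invented for illustration.

```python
def one_hot(index, vocabulary_size):
    """One-hot vector of length vocabulary_size with a 1 at position index."""
    return [1.0 if i == index else 0.0 for i in range(vocabulary_size)]

def matvec(vec, matrix):
    """Multiply a row vector of length V by a V x D matrix."""
    dims = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(dims)]

vocabulary_size, embedding_dim = 4, 2
# Hypothetical trained weight matrix: vocabulary_size rows, embedding_dim columns.
W = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6],
     [0.7, 0.8]]

word_id = 2
embedded = matvec(one_hot(word_id, vocabulary_size), W)  # equals row 2 of W
```

Because the product simply picks out row `word_id`, practical implementations usually perform a table lookup instead of an explicit matrix multiplication.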
4. The text mining method based on unstructured electronic health records according to claim 1, characterized in that the defined rules are rewritten into regular expressions to extract features from the unstructured text, specifically as follows:
Reading in turn each segmented hospitalization record, and using regular expressions to find and locate the keywords that occur;
Using each occurring keyword as a split point, segmenting the short text again into multiple clauses;
Determining feature values: since the value corresponding to a keyword generally appears in the short clauses on either side of that keyword, regular expressions are applied again to the text segments on both sides of the keyword to find its value; one keyword may have multiple values, so the complex text is converted into key-value form;
Structured storage of results: storing the extraction results in an xml document in time-series order.
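The keyword-then-value search of claim 4 could be approximated as follows. The rule table `RULES`, its keywords and value patterns, and the sample record are hypothetical; the patent stores its rules in an xml document whose schema is not given here, so a plain dictionary is used instead.

```python
import re

# Hypothetical rule base: keyword -> regular expression for its value.
RULES = {
    "dysphagia": r"(aggravated|relieved|no change)",
    "radiotherapy": r"(\d+(?:\.\d+)?)\s*Gy",
}

def extract(record):
    """Locate keywords, then search the clauses on both sides for their values."""
    keyword_re = re.compile("|".join(re.escape(k) for k in RULES))
    hits = list(keyword_re.finditer(record))
    result = {}
    for i, hit in enumerate(hits):
        # Search window: from the end of the previous keyword
        # to the start of the next keyword.
        left = hits[i - 1].end() if i > 0 else 0
        right = hits[i + 1].start() if i + 1 < len(hits) else len(record)
        values = re.findall(RULES[hit.group()], record[left:right])
        # A keyword may carry several values, so values are collected in a list.
        result.setdefault(hit.group(), []).extend(values)
    return result

record = "dysphagia relieved after treatment; radiotherapy 60 Gy in 30 fractions"
features = extract(record)
```

The result is a key-value mapping per hospitalization record, which can then be serialized in time order.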
5. The text mining method based on unstructured electronic health records according to claim 1, characterized in that the extracted features mainly fall into four categories: symptoms, examination findings, chemoradiotherapy regimen, and efficacy evaluation.
6. The text mining method based on unstructured electronic health records according to claim 1, characterized in that text clustering is realized with the DBSCAN clustering algorithm:
(1) First, randomly select from the data set an object p that has not yet been checked; if p has not been processed, check its neighborhood, and if the number of objects it contains is not less than MinPts, create a new cluster C and add all points in the neighborhood to the candidate set N;
(2) For every object q in the candidate set N that has not yet been processed, check its neighborhood; if it contains at least MinPts objects, add these objects to N; if q has not been assigned to any cluster, add q to C;
(3) Repeat step (2), continuing to check the unprocessed objects in N, until the candidate set N is empty;
(4) Repeat steps (1) to (3) until all objects have been assigned to a cluster or marked as noise.
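Steps (1) to (4) can be sketched as a small pure-Python DBSCAN. This sketch makes several assumptions not stated in the claim: Euclidean distance defines the neighborhood, the radius parameter is called `eps`, objects are visited in index order rather than at random, and noise is labeled -1. The sample points are invented.

```python
def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

def dbscan(points, eps, min_pts):
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)
    cluster = 0
    for p in range(len(points)):                 # step (1): pick an unchecked object
        if labels[p] is not UNSEEN:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:
            labels[p] = NOISE                    # may later become a border point
            continue
        labels[p] = cluster                      # new cluster C
        candidates = [q for q in neighbours if q != p]   # candidate set N
        while candidates:                        # steps (2)-(3): until N is empty
            q = candidates.pop()
            if labels[q] == NOISE:
                labels[q] = cluster              # border point joins C, not expanded
                continue
            if labels[q] is not UNSEEN:
                continue
            labels[q] = cluster
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) >= min_pts:     # q is a core object: grow N
                candidates.extend(q_neighbours)
        cluster += 1                             # step (4): continue with the next object
    return labels

# Hypothetical 2-D feature vectors: two dense groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

In the patent's setting each point would be a 57-dimensional quantized feature vector for one hospitalization rather than a 2-D coordinate.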
7. The text mining method based on unstructured electronic health records according to any one of claims 1-6, characterized in that the method is suitable for text mining of electronic health records of esophageal squamous cell carcinoma, but is not limited to text mining of such electronic health records.
8. A text mining system based on unstructured electronic health records, characterized by comprising:
Text preprocessing module: multiple history-of-present-illness records for a given disease, exported from a hospital database, serve as raw experimental data; each sample is unfolded as a time series: words expressing time are identified first, and the long text is then split at these time nodes into several short texts, i.e., the history of present illness is split into one record per hospitalization;
Feature engineering module: the features of the history of present illness and the extraction rules are determined and saved in an xml document;
Feature extraction: history information is extracted and stored in structured form based on the rule base; the defined rules are rewritten into regular expressions to extract features from the unstructured text;
Feature quantization: the data type of each extracted feature value is analyzed, and the feature value is quantized numerically;
Analysis and prediction module: text classification is realized on the basis of text clustering;
Text clustering: the quantized feature values are unified into the feature representation of one hospitalization, X = (x_1, x_2, x_3, ..., x_57), which is then fed to an unsupervised clustering algorithm as the text feature to perform text clustering.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the text mining method based on unstructured electronic health records according to any one of claims 1-7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the text mining method based on unstructured electronic health records according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910701406.5A CN110428907A (en) | 2019-07-31 | 2019-07-31 | A kind of text mining method and system based on unstructured electronic health record |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910701406.5A CN110428907A (en) | 2019-07-31 | 2019-07-31 | A kind of text mining method and system based on unstructured electronic health record |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110428907A true CN110428907A (en) | 2019-11-08 |
Family
ID=68411762
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910701406.5A Pending CN110428907A (en) | 2019-07-31 | 2019-07-31 | A kind of text mining method and system based on unstructured electronic health record |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110428907A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112687367A (en) * | 2020-12-29 | 2021-04-20 | 中国人民解放军总医院 | Medical record grouping method, device and equipment based on dynamic disease condition and storage medium |
CN112927814A (en) * | 2021-03-30 | 2021-06-08 | 善诊(上海)信息技术有限公司 | Physical examination recommendation method, device, equipment and storage medium for placeholder lesions |
CN113010643A (en) * | 2021-03-22 | 2021-06-22 | 平安科技(深圳)有限公司 | Method, device and equipment for processing vocabulary in field of Buddhism and storage medium |
CN113094477A (en) * | 2021-06-09 | 2021-07-09 | 腾讯科技(深圳)有限公司 | Data structuring method and device, computer equipment and storage medium |
CN116401532A (en) * | 2023-06-07 | 2023-07-07 | 山东大学 | Method and system for recognizing frequency instability of power system after disturbance |
CN117789907A (en) * | 2024-02-28 | 2024-03-29 | 山东金卫软件技术有限公司 | Intelligent medical data intelligent management method based on multi-source data fusion |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106909783A (en) * | 2017-02-24 | 2017-06-30 | 北京交通大学 | A kind of case history textual medical Methods of Knowledge Discovering Based based on timeline |
CN108831559A (en) * | 2018-06-20 | 2018-11-16 | 清华大学 | A kind of Chinese electronic health record text analyzing method and system |
CN109918507A (en) * | 2019-03-08 | 2019-06-21 | 北京工业大学 | One kind being based on the improved file classification method of TextCNN |
-
2019
- 2019-07-31 CN CN201910701406.5A patent/CN110428907A/en active Pending
Non-Patent Citations (3)
Title |
---|
包小源等: "非结构化电子病历中信息抽取的定制化方法", 《北京大学学报(医学版)》 * |
谷波等: "文本聚类算法的分析与比较", 《电脑开发与应用》 * |
赵冬晓等: "面向情报研究的文本语义挖掘方法述评", 《现代图书情报技术》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112687367A (en) * | 2020-12-29 | 2021-04-20 | 中国人民解放军总医院 | Medical record grouping method, device and equipment based on dynamic disease condition and storage medium |
CN113010643A (en) * | 2021-03-22 | 2021-06-22 | 平安科技(深圳)有限公司 | Method, device and equipment for processing vocabulary in field of Buddhism and storage medium |
CN113010643B (en) * | 2021-03-22 | 2023-07-21 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for processing vocabulary in Buddha field |
CN112927814A (en) * | 2021-03-30 | 2021-06-08 | 善诊(上海)信息技术有限公司 | Physical examination recommendation method, device, equipment and storage medium for placeholder lesions |
CN113094477A (en) * | 2021-06-09 | 2021-07-09 | 腾讯科技(深圳)有限公司 | Data structuring method and device, computer equipment and storage medium |
CN113094477B (en) * | 2021-06-09 | 2021-08-31 | 腾讯科技(深圳)有限公司 | Data structuring method and device, computer equipment and storage medium |
CN116401532A (en) * | 2023-06-07 | 2023-07-07 | 山东大学 | Method and system for recognizing frequency instability of power system after disturbance |
CN116401532B (en) * | 2023-06-07 | 2024-02-23 | 山东大学 | Method and system for recognizing frequency instability of power system after disturbance |
CN117789907A (en) * | 2024-02-28 | 2024-03-29 | 山东金卫软件技术有限公司 | Intelligent medical data intelligent management method based on multi-source data fusion |
CN117789907B (en) * | 2024-02-28 | 2024-05-10 | 山东金卫软件技术有限公司 | Intelligent medical data intelligent management method based on multi-source data fusion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Qayyum et al. | Medical image retrieval using deep convolutional neural network | |
CN110428907A (en) | A kind of text mining method and system based on unstructured electronic health record | |
CN109446338B (en) | Neural network-based drug disease relation classification method | |
CN104516942B (en) | The automatic merogenesis mark of Concept-driven test | |
CN105760507B (en) | Cross-module state topic relativity modeling method based on deep learning | |
US20200303072A1 (en) | Method and system for supporting medical decision making | |
CN109670179A (en) | Case history text based on iteration expansion convolutional neural networks names entity recognition method | |
CN109670177A (en) | One kind realizing the semantic normalized control method of medicine and control device based on LSTM | |
Sun et al. | Intelligent analysis of medical big data based on deep learning | |
WO2017151759A1 (en) | Category discovery and image auto-annotation via looped pseudo-task optimization | |
CN110032739A (en) | Chinese electronic health record name entity abstracting method and system | |
CN109710670A (en) | A method of case history text is converted into structural metadata from natural language | |
Wang et al. | Attention-based multi-instance neural network for medical diagnosis from incomplete and low quality data | |
CN116580849B (en) | Medical data acquisition and analysis system and method thereof | |
CN114003734A (en) | Breast cancer risk factor knowledge system model, knowledge map system and construction method | |
Liu et al. | Can a convolutional neural network support auditing of nci thesaurus neoplasm concepts? | |
CN114399634B (en) | Three-dimensional image classification method, system, equipment and medium based on weak supervision learning | |
CN104537280B (en) | Protein interactive relation recognition methods based on text relation similitude | |
CN117574898A (en) | Domain knowledge graph updating method and system based on power grid equipment | |
CN110060749B (en) | Intelligent electronic medical record diagnosis method based on SEV-SDG-CNN | |
CN116881336A (en) | Efficient multi-mode contrast depth hash retrieval method for medical big data | |
CN116313141A (en) | Knowledge-graph-based intelligent inquiry method for unknown cause fever | |
Shah et al. | Exploring diseases based biomedical document clustering and visualization using self-organizing maps | |
Chen et al. | Entity relation extraction from electronic medical records based on improved annotation rules and BiLSTM-CRF | |
Zhao et al. | Protein function prediction with functional and topological knowledge of gene ontology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191108 |