CN106682411A

CN106682411A - Method for converting physical examination diagnostic data into disease label

Info

Publication number: CN106682411A
Application number: CN201611198280.7A
Authority: CN
Inventors: 吴健; 周立水; 顾盼; 邱奇波; 邓水光; 李莹; 尹建伟; 吴朝晖
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2016-12-22
Filing date: 2016-12-22
Publication date: 2017-05-17
Anticipated expiration: 2036-12-22
Also published as: CN106682411B

Abstract

The invention discloses a method for converting physical examination diagnostic data into disease labels. The method comprises the following steps: (1) text word segmentation and new word discovery are performed on the physical examination diagnostic data to obtain a word sequence; (2) the required disease vocabulary is extracted from the word sequence to obtain disease names; (3) synonyms among the disease names are merged to obtain merged disease names; (4) the disease names are clustered and a disease category tree is built; (5) disease labeling is performed according to the merged disease names and the disease category tree to obtain the disease labels. The method applies several natural language processing techniques to mine disease results from physical examination diagnostic data, extracts the disease classification structure and encodes it numerically, and provides disease name labels for physical examination records, so that examination results are described more directly and can serve other big data medical applications.

Description

A method for converting physical examination diagnostic data into disease labels

Technical field

The invention belongs to the field of big data medicine, and in particular relates to a method for converting physical examination diagnostic data into disease labels.

Background art

With rising living standards and growing health awareness, health examinations have become increasingly common. People have moved away from the traditional notion of going to hospital only when ill, and regular physical examinations are now accepted by most people. As a result, hospitals have accumulated massive amounts of electronic physical examination data. However, raw electronic physical examination data contains a large amount of unstructured text. This text is often disordered and redundant, written by doctors of different levels and with different habits, which severely limits how well it can be used.

Big data medicine is a current focus of attention. It refers to analyzing data from the medical field with big data techniques and mining the knowledge it contains, so as to substantially improve medical services. Over the past few decades, big data has deeply influenced every industry, including health care. Today, large volumes of data can make health care more efficient and more personalized. Natural language processing is closely tied to big data: supported by large data sets, it has developed further, and new techniques such as word2vec and LDA have emerged.

At present, the diagnostic data obtained after each physical examination is diagnostic text in complex and varied forms, and the diagnostic results are not converted into summarized disease names. This makes them hard for users to understand quickly and hard for downstream big data medical applications to use when analyzing medical data. Therefore, given access to complete raw physical examination diagnostic data, a method that analyzes doctors' diagnostic results, classifies them to generate a disease category tree, encodes the tree numerically, and then feeds disease labels back onto the actual diagnostic records has practical research significance and application prospects.

Summary of the invention

In view of the above, the present invention proposes a method for converting physical examination diagnostic data into disease labels. It mainly uses several natural language processing techniques to mine disease results from physical examination diagnostic data, extracts the disease classification structure and encodes it numerically, and provides standard disease name labels for physical examination records, so that examination results are described more directly and can serve other big data medical applications.

The purpose of the present invention is to analyze doctors' diagnostic results on patients' physical examination data, find the disease names they contain, cluster them, and build a disease category tree, thereby providing a method for converting raw diagnostic results into disease labels.

A method for converting physical examination diagnostic data into disease labels comprises the following steps:

(1) perform text word segmentation and new word discovery on the physical examination diagnostic data to obtain a word sequence;

(2) extract the required disease vocabulary from the word sequence to obtain disease names;

(3) merge synonyms among the disease names to obtain merged disease names;

(4) cluster the diseases and build a disease category tree;

(5) label diseases according to the merged disease names and the disease category tree to obtain disease labels.

Step (1) specifically comprises:

(1-1) segment the raw diagnostic results in the physical examination diagnostic data to obtain a text word sequence;

(1-2) merge words according to the relations between words in the text word sequence to obtain new terms;

(1-3) perform word segmentation again on the new terms to obtain the word sequence.

In step (1-1), the raw diagnostic results in the physical examination diagnostic data are first matched against a Trie tree built from a dictionary; the matching results are then indexed and counted to obtain the maximum-probability segmentation, referred to as the text word sequence.

When matching with a Trie tree, the same character string can often be segmented in several ways. Preferably, dynamic programming is used to find the maximum-probability segmentation of the string.
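The Trie matching and dynamic programming described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the toy dictionary and its counts, the unspaced Latin string standing in for unsegmented Chinese text, and the smoothing value for unknown characters are all invented for the example, and the names `Trie` and `segment` are hypothetical.

```python
import math


class Trie:
    """Prefix tree over dictionary words; each word end stores its count."""

    def __init__(self):
        self.root = {}

    def insert(self, word, freq):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$freq"] = freq  # multi-character key, cannot clash with a single character

    def matches_from(self, text, start):
        """Yield (end_index, freq) for every dictionary word starting at `start`."""
        node = self.root
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                return
            if "$freq" in node:
                yield i + 1, node["$freq"]


def segment(text, trie, total):
    """Return the maximum-probability segmentation of `text` (unigram model)."""
    n = len(text)
    # best[i] = (log-probability of the best segmentation of text[i:], end of its first word)
    best = [None] * (n + 1)
    best[n] = (0.0, n)
    unknown = math.log(0.5 / total)  # assumed smoothing for out-of-dictionary characters
    for i in range(n - 1, -1, -1):
        candidates = [(unknown + best[i + 1][0], i + 1)]
        for end, freq in trie.matches_from(text, i):
            candidates.append((math.log(freq / total) + best[end][0], end))
        best[i] = max(candidates)
    words, i = [], 0
    while i < n:
        end = best[i][1]
        words.append(text[i:end])
        i = end
    return words


# Hypothetical toy dictionary with counts.
counts = {"thyroid": 50, "nodule": 40, "thy": 3, "no": 5, "dule": 2}
trie = Trie()
for word, count in counts.items():
    trie.insert(word, count)
print(segment("thyroidnodule", trie, sum(counts.values())))
```

The table is filled from the end of the string backwards, so each position reuses the best score already computed for the remaining suffix.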

In step (1-2), because of the particular nature of physical examination diagnostic text, a first segmentation pass rarely gives the best result, and many disease-related terms may be split by mistake. Words are therefore merged according to the relations between words in the segmentation result to obtain new terms for later use. Preferably, the relation between words is judged by computing the pointwise mutual information index. Pointwise mutual information (PMI) is a measure of the correlation between two things (such as two words) and is widely used in machine learning. It is defined as follows:

$$\mathrm{PMI}(x; y) = \log \frac{p(x, y)}{p(x)\,p(y)} = \log \frac{p(x \mid y)}{p(x)}$$

In probability theory, if x and y are independent, then p(x, y) = p(x) p(y). The more strongly the two are correlated, the larger p(x, y) is relative to p(x) p(y). Equivalently, the conditional probability p(x | y) of x occurring given that y occurs, divided by the probability p(x) of x occurring on its own, also expresses the degree of correlation between x and y. For adjacent words, for example, when P(thyroid nodule) is far greater than P(thyroid) P(nodule), "thyroid nodule" is considered a new word in the corpus.
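As a rough sketch of PMI-based new word discovery over adjacent words, the following computes log(p(x, y) / (p(x) p(y))) from unigram and adjacent-pair counts. The sample sentences, the `min_count` filter, and the `threshold` value are assumptions made for illustration; the patent does not specify them, and the function names are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Return (PMI per adjacent word pair, pair counts) for segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores, bigrams


def discover_new_words(sentences, min_count=2, threshold=1.5):
    """Keep adjacent pairs that are frequent enough and strongly associated."""
    scores, bigrams = pmi_scores(sentences)
    return {
        pair
        for pair, score in scores.items()
        if bigrams[pair] >= min_count and score > threshold
    }


# Hypothetical segmented diagnostic texts.
sentences = [
    ["thyroid", "nodule", "found"],
    ["thyroid", "nodule", "seen"],
    ["liver", "cyst", "found"],
    ["thyroid", "nodule", "and", "liver", "cyst"],
]
print(discover_new_words(sentences))
```

The minimum-count filter is included because PMI alone tends to score pairs of rare words very highly, even when they co-occur only once.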

In step (2), the keywords (disease names) in the word sequence are extracted using the statistical method TF-IDF combined with a stop-word dictionary, which yields the basic labels of the diagnostic result text, namely the disease names. TF-IDF (term frequency-inverse document frequency) assesses how important a word is to one text within a text collection or corpus. The importance of a word increases in proportion to the number of times it appears in the text, but decreases with the frequency at which it appears across the corpus. Term frequency (TF) is the number of times a given word appears in the text. Inverse document frequency (IDF) is obtained by dividing the total number of texts by the number of texts containing the word and taking the logarithm of the quotient.
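The TF and IDF definitions above can be sketched in a few lines. This is an illustrative example only: the sample texts and the stop-word set are invented, and `tf_idf` is a hypothetical helper rather than anything named in the patent.

```python
import math
from collections import Counter


def tf_idf(docs, stop_words=frozenset()):
    """Score each non-stop word of each document as TF * log(N / DF)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    results = []
    for doc in docs:
        term_freq = Counter(w for w in doc if w not in stop_words)
        results.append(
            {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in term_freq.items()}
        )
    return results


# Hypothetical word sequences from three diagnostic texts.
docs = [
    ["no", "abnormality", "thyroid nodule", "thyroid nodule"],
    ["no", "abnormality", "fatty liver"],
    ["no", "hypertension"],
]
scores = tf_idf(docs, stop_words={"no"})
for doc_scores in scores:
    print(max(doc_scores, key=doc_scores.get))
```

Words that appear in every text get an IDF of log(1) = 0, so common filler drops out even without the stop-word list; the list mainly removes words that are frequent but not universal.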

In step (3), the extracted disease vocabulary contains a large number of synonyms. For example, after-cataract is also known by the aliases "aftercataract" and "secondary cataract", and all of these need to be merged into one disease. To find the synonyms of each disease term, the word2vec algorithm is used to represent each disease name as a numeric vector, and the distances between vectors are then used to identify synonyms.
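Once each disease name has a vector, synonym merging by vector distance might look like the sketch below. The three-dimensional vectors are hand-written stand-ins for trained word2vec embeddings, and the cosine threshold of 0.95 and the choice of the alphabetically first name as the canonical form are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def merge_synonyms(vectors, threshold=0.95):
    """Map every name to a canonical name, merging names whose vectors are close."""
    names = sorted(vectors)
    parent = {name: name for name in names}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                root_a, root_b = find(a), find(b)
                if root_a != root_b:
                    # Assumed convention: the alphabetically first name is canonical.
                    parent[max(root_a, root_b)] = min(root_a, root_b)
    return {name: find(name) for name in names}


# Hypothetical embeddings; real ones would come from a trained word2vec model.
vectors = {
    "after-cataract": [0.90, 0.10, 0.20],
    "secondary cataract": [0.88, 0.12, 0.22],
    "fatty liver": [0.10, 0.90, 0.30],
}
print(merge_synonyms(vectors))
```

Union-find is used so that synonym relations chain: if A is close to B and B is close to C, all three end up under one canonical name.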

In step (4), to better understand the relations between diseases, the disease names are clustered once all of them have been extracted, yielding the classification relations between diseases. The present invention uses Latent Dirichlet Allocation (LDA) for disease clustering. LDA is a generative topic model for documents, also described as a three-level Bayesian probability model, with a three-layer structure of words, topics, and documents.

The method mainly analyzes doctors' diagnostic results on patients' physical examination data, finds the disease names they contain, clusters them, and builds a disease category tree, thereby converting raw diagnostic results into disease labels. The advantages of the invention are:

(1) New word discovery is performed on the diagnostic results using PMI, which can find many previously undiscovered disease terms.

(2) For the large disease vocabulary, synonyms are found using the word2vec algorithm, which standardizes disease names while reducing redundant words.

(3) Diseases are clustered using LDA, so the relations between diseases in each category can be obtained easily.

Brief description of the drawings

Fig. 1 is a structural diagram of the disease label system based on physical examination diagnostic data;

Fig. 2 is a schematic diagram of the synonym discovery algorithm word2vec;

Fig. 3 is an example of part of the disease label category tree.

Detailed description of the embodiments

To describe the present invention more specifically, the technical solution is described in detail below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, the method of the present invention for converting physical examination diagnostic data into disease labels comprises the following steps:

Step 1, text word segmentation: segment the raw diagnostic results in the physical examination diagnostic data to obtain a text word sequence.

Word segmentation needs a dictionary to achieve high accuracy. The dictionary chosen here is a basic dictionary collected from an external corpus, combined with related medical dictionaries. The raw diagnostic results in the physical examination data are matched against the Trie tree built from the dictionary, and the matching results are then indexed and counted to obtain the maximum-probability segmentation, referred to as the text word sequence.

A Trie tree, also known as a word lookup tree or prefix tree, is a tree structure for storing a large number of strings. Its advantage is that it saves storage space by sharing the common prefixes of strings. Segmenting with a Trie tree can greatly increase segmentation speed. Since the same string can often be segmented in several ways, dynamic programming is used here to find the maximum-probability segmentation of the string.

Step 2, new word discovery: merge words according to the relations between words in the text word sequence to obtain new terms.

In this step, adjacent words are analyzed using the statistics collected at the end of segmentation in step 1. The PMI index is used to judge the relation between words. For adjacent words, when for example P(thyroid nodule) is far greater than P(thyroid) P(nodule), "thyroid nodule" is considered a new word in the corpus.

Step 3, text word segmentation: perform word segmentation again on the new terms to obtain the word sequence.

The segmentation method in this step is the same as in step 1: matching uses the Trie tree built from the dictionary, and dynamic programming finds the maximum-probability segmentation of the string.

Step 4, disease extraction: extract the required disease vocabulary from the word sequence to obtain disease names.

Here the disease names in the word sequence are extracted using the statistical method TF-IDF combined with a stop-word dictionary, which yields the basic labels of the diagnostic result text, namely the disease names. For each diagnostic text, TF-IDF finds the words that occur frequently in it but rarely in the other diagnostic texts.

Step 5, synonym merging: merge synonyms among the disease names to obtain merged disease names.

Here, the word2vec algorithm is used to represent each disease name as a numeric vector, and the distances between vectors are used to identify synonyms. As shown in Fig. 2, CBOW and Skip-gram are two implementations of the word2vec algorithm, where w(t) denotes the word at an arbitrary position, w(t-1) is the word before it, and w(t+1) is the word after it. CBOW infers the current word from the words before and after it, while Skip-gram infers the likely context from a given word. The basic idea of word2vec is to map each word, through training, to a K-dimensional real-valued vector (K is a hyperparameter of the model), and to judge the semantic similarity between words by the distance between their vectors (such as cosine similarity or Euclidean distance). It uses a three-layer neural network (input layer, hidden layer, output layer). The network itself models a language model, but in doing so it also yields a representation of words in a vector space, and this side effect is the real goal of word2vec. Once the word vector of each word is obtained, the synonyms of a word can be found from the distances between word vectors.
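To make the difference between CBOW and Skip-gram concrete, the sketch below only builds the training examples each variant would see from a word sequence; it does not train a network. The `window` parameter and both function names are assumptions introduced for illustration.

```python
def skipgram_pairs(words, window=1):
    """Skip-gram: each center word w(t) predicts each context word around it."""
    pairs = []
    for t, center in enumerate(words):
        lo = max(0, t - window)
        hi = min(len(words), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, words[j]))
    return pairs


def cbow_examples(words, window=1):
    """CBOW: the context words around position t jointly predict w(t)."""
    examples = []
    for t, target in enumerate(words):
        context = words[max(0, t - window):t] + words[t + 1:t + 1 + window]
        if context:
            examples.append((context, target))
    return examples


# Hypothetical word sequence from one diagnostic text.
words = ["mild", "fatty", "liver"]
print(skipgram_pairs(words))
print(cbow_examples(words))
```

With `window=1`, the context of w(t) is exactly w(t-1) and w(t+1), matching the description of Fig. 2.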

Step 6, disease clustering: cluster the diseases and build a disease category tree.

In this step, LDA is used for disease clustering to obtain the classification relations between diseases and build the category tree; Fig. 3 shows part of the resulting tree. For each document in the corpus, LDA defines the following generative process:

(1) for each document, draw a topic from the topic distribution;

(2) draw a word from the word distribution corresponding to the drawn topic;

(3) repeat the above process until every word in the document has been traversed.

Assuming documents are generated in this way, the parameters of the model are estimated to obtain each topic and the word distribution within it. For the physical examination diagnostic database, the words within each topic are disease names, and each topic is one category of disease.
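The generative process in steps (1) to (3) can be simulated directly. The sketch below follows the standard LDA formulation, in which each document first draws its own topic distribution from a Dirichlet prior and then draws a topic and a word for every word position. It only generates documents and does not estimate the model parameters; the topic names, word probabilities, and `alpha` values are invented for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_words, alpha, length, rng):
    """Generate `length` (topic, word) pairs for one document."""
    topics = list(topic_words)
    theta = sample_dirichlet(alpha, rng)  # this document's topic distribution
    document = []
    for _ in range(length):
        # (1) draw a topic from the document's topic distribution
        topic = rng.choices(topics, weights=theta)[0]
        # (2) draw a word from that topic's word distribution
        words = list(topic_words[topic])
        weights = [topic_words[topic][w] for w in words]
        word = rng.choices(words, weights=weights)[0]
        # (3) repeat for every word position
        document.append((topic, word))
    return document


# Hypothetical topics (disease categories) and their word distributions.
topic_words = {
    "endocrine": {"thyroid nodule": 0.7, "goiter": 0.3},
    "eye": {"after-cataract": 0.6, "myopia": 0.4},
}
rng = random.Random(0)
print(generate_document(topic_words, alpha=[0.5, 0.5], length=6, rng=rng))
```

Fitting LDA runs this story in reverse: given only the observed words, inference (for example Gibbs sampling or variational methods) recovers the topic-word distributions, which the patent interprets as disease categories.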

Step 7, label extraction: label diseases according to the merged disease names and the disease category tree to obtain the unique disease labels of each physical examination text.

The specific embodiments described above explain the technical solution and beneficial effects of the present invention in detail. It should be understood that the foregoing is only the most preferred embodiment of the present invention and is not intended to limit it. Any modification, supplement, or equivalent substitution made within the scope of the principles of the present invention shall fall within the scope of protection of the present invention.
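One way to picture the labeling step is sketched below, under heavy assumptions: the category tree is taken to be two levels deep (category, then disease), the numeric code format `CC.DD` is invented, and the synonym map and tree contents are toy data. None of these details are specified in the patent, and both function names are hypothetical.

```python
def build_codes(tree):
    """Assign a numeric code 'CC.DD' to every disease in a two-level tree."""
    codes = {}
    for c_idx, (category, diseases) in enumerate(sorted(tree.items()), start=1):
        for d_idx, disease in enumerate(sorted(diseases), start=1):
            codes[disease] = f"{c_idx:02d}.{d_idx:02d}"
    return codes


def label_record(words, canonical, codes):
    """Return the distinct (disease name, code) labels found in one word sequence."""
    labels = []
    for word in words:
        name = canonical.get(word, word)  # replace a synonym by its merged name
        code = codes.get(name)
        if code is not None and (name, code) not in labels:
            labels.append((name, code))
    return labels


# Hypothetical category tree and synonym map.
tree = {
    "endocrine": ["thyroid nodule", "goiter"],
    "eye": ["after-cataract"],
}
canonical = {"secondary cataract": "after-cataract"}
codes = build_codes(tree)
record = ["secondary cataract", "found", "thyroid nodule", "after-cataract"]
print(label_record(record, canonical, codes))
```

Because synonyms are mapped to their merged name before the lookup, "secondary cataract" and "after-cataract" in the same record produce a single label.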

Claims

1. A method for converting physical examination diagnostic data into disease labels, comprising the following steps:

(4) clustering the disease names and building a disease category tree;

2. The method for converting physical examination diagnostic data into disease labels according to claim 1, characterized in that step (1) specifically comprises:

(1-3) performing word segmentation again on the new terms to obtain the word sequence.

3. The method for converting physical examination diagnostic data into disease labels according to claim 2, characterized in that: in step (1-1), the raw diagnostic results in the physical examination diagnostic data are first matched using a Trie tree together with a dictionary; the matching results are then indexed and counted to obtain the maximum-probability segmentation, referred to as the text word sequence.

4. The method for converting physical examination diagnostic data into disease labels according to claim 3, characterized in that dynamic programming is used to find the maximum-probability segmentation of the string.

5. The method for converting physical examination diagnostic data into disease labels according to claim 2, characterized in that: in step (1-2), the relation between words is judged by computing the pointwise mutual information index.

6. The method for converting physical examination diagnostic data into disease labels according to claim 1, characterized in that: in step (2), the disease names in the word sequence are extracted using the statistical method TF-IDF combined with a stop-word dictionary to obtain the disease names of the diagnostic result text.

7. The method for converting physical examination diagnostic data into disease labels according to claim 1, characterized in that: in step (3), the word2vec algorithm is used to represent each disease name as a numeric vector, and the distances between vectors are used to identify synonyms.

8. The method for converting physical examination diagnostic data into disease labels according to claim 1, characterized in that: in step (4), disease clustering is performed using Latent Dirichlet Allocation.