CN106682411B - Method for converting physical examination diagnostic data into disease labels

Publication number
CN106682411B
Authority
CN
China
Prior art keywords
disease
physical examination
word
obtains
disease name
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611198280.7A
Other languages
Chinese (zh)
Other versions
CN106682411A (en)
Inventor
吴健
周立水
顾盼
邱奇波
邓水光
李莹
尹建伟
吴朝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-12-22
Filing date
2016-12-22
Publication date
2019-04-16
Application filed by Zhejiang University ZJU
Priority to CN201611198280.7A
Publication of CN106682411A
Application granted
Publication of CN106682411B

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics, for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for converting physical examination diagnostic data into disease labels, comprising: (1) performing text word segmentation and new word discovery on the physical examination diagnostic data to obtain a word sequence; (2) extracting the required disease vocabulary from the word sequence to obtain disease names; (3) merging the synonyms among the disease names to obtain merged disease names; (4) clustering the disease names to establish a disease classification tree; (5) performing disease labeling according to the merged disease names and the disease classification tree to obtain disease labels. The method uses several natural language processing techniques to mine disease results from physical examination diagnostic data, extracts the disease classification structure and encodes it numerically, and provides standard disease name labels for physical examination records, so that examination results are described more directly and can serve other big data medical applications.

Description

Method for converting physical examination diagnostic data into disease labels
Technical field
The invention belongs to the field of big data medicine, and in particular relates to a method for converting physical examination diagnostic data into disease labels.
Background
With the improvement of living standards and growing health awareness, health examinations are increasingly becoming a social norm. People have moved away from the traditional idea of going to hospital only when ill, and regular physical examinations are accepted by most people. As a result, hospitals have accumulated massive amounts of electronic physical examination data. However, the raw electronic examination data contain a large amount of unstructured text. This text is often disordered, messy, and redundant, and is written by doctors of varying levels and habits, so it has great limitations and cannot be put to good use.
Big data medicine is a current hot topic. It refers to analyzing data from the medical field with big data technologies and mining the knowledge in it in order to substantially improve medical services. Over the past few decades, big data has deeply affected every industry, including health care. Today, large volumes of data can make health care more efficient and more personalized. Natural language processing is closely related to big data. With the support of big data, natural language processing has developed further, and new techniques such as word2vec and LDA have emerged.
At present, the physical examination diagnostic data obtained after each examination are diagnostic text in complex forms, and the diagnostic results are not converted into summarized disease names. They are hard for users to understand quickly and hard for subsequent big data medical applications to use when analyzing medical data. Therefore, given that complete original physical examination diagnostic data can be obtained, a method that analyzes doctors' diagnostic results, classifies them to generate a disease classification tree, encodes the tree numerically, and then returns disease labels for the diagnostic results in actual data has practical research significance and application prospects.
Summary of the invention
In view of the above, the invention proposes a method for converting physical examination diagnostic data into disease labels. It mainly uses several natural language processing techniques to mine disease results from physical examination diagnostic data, extracts the disease classification structure and encodes it numerically, and provides standard disease name labels for physical examination records, so that examination results are described more directly and can serve other big data medical applications.
The purpose of the invention is to analyze doctors' diagnostic results for patients in physical examination data, find the disease names in them, cluster them, and establish a disease classification tree, thereby providing a method for converting original diagnostic results into disease labels.
A method for converting physical examination diagnostic data into disease labels comprises the following steps:
(1) performing text word segmentation and new word discovery on the physical examination diagnostic data to obtain a word sequence;
(2) extracting the required disease vocabulary from the word sequence to obtain disease names;
(3) merging the synonyms among the disease names to obtain merged disease names;
(4) clustering the diseases to establish a disease classification tree;
(5) performing disease labeling according to the merged disease names and the disease classification tree to obtain disease labels.
The specific sub-steps of step (1) are as follows:
(1-1) segmenting the original diagnostic results in the physical examination diagnostic data to obtain a text word sequence;
(1-2) merging words according to the relationships between words in the text word sequence to obtain new terms;
(1-3) segmenting again with the new terms to obtain the word sequence.
In step (1-1), for the original diagnostic results in the physical examination diagnostic data, a Trie tree generated from the dictionary is first used for segmentation matching. The matched results are then further indexed and counted to obtain the maximum-probability segmentation result, called the text word sequence.
When a Trie tree is used for segmentation matching, the same character string can often be segmented in several ways. Preferably, dynamic programming is used to find the maximum-probability segmentation of the string.
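The maximum-probability segmentation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the dictionary `WORD_COUNTS`, its counts, and the function name `segment` are hypothetical, a plain `dict` stands in for the Trie lookup, and spaceless English text stands in for Chinese.

```python
import math

# Hypothetical toy dictionary (word -> count). A real system would load a
# general lexicon combined with a medical lexicon.
WORD_COUNTS = {"fat": 20, "fatty": 30, "liver": 60, "live": 10, "r": 1, "ty": 1}
TOTAL = sum(WORD_COUNTS.values())


def segment(text):
    """Return the segmentation of `text` whose word probabilities have the largest product."""
    n = len(text)
    # best[i] = (best log-probability for text[i:], end index of the first word on that path)
    best = [(-math.inf, n)] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):          # fill the table from right to left
        for j in range(i + 1, n + 1):
            word = text[i:j]
            if word in WORD_COUNTS:
                score = math.log(WORD_COUNTS[word] / TOTAL) + best[j][0]
                if score > best[i][0]:
                    best[i] = (score, j)
    if best[0][0] == -math.inf:
        raise ValueError("text cannot be segmented with this dictionary")
    words, i = [], 0
    while i < n:                            # follow the stored split points
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words
```

With this toy dictionary, `segment("fattyliver")` prefers "fatty" + "liver" over "fat" + "ty" + "liver", because the two-word path has the higher probability product.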
In step (1-2), because of the particular nature of physical examination diagnostic text, the first segmentation pass rarely achieves the best result, and many disease-related terms may be split by mistake. Words are therefore merged according to the relationships between words in the segmentation result to obtain new terms for subsequent use. Preferably, the relationship between words is judged by computing the pointwise mutual information statistic. Pointwise mutual information (PMI) is a measure of the association between two things (for example two words) and is widely used in machine learning. It is defined as follows:
$$\mathrm{PMI}(x; y) = \log \frac{p(x, y)}{p(x)\,p(y)} = \log \frac{p(x \mid y)}{p(x)}$$
In probability theory, if x and y are independent, then p(x, y) = p(x) p(y). The stronger the association between the two, the larger p(x, y) is relative to p(x) p(y). Equivalently, the ratio of the conditional probability p(x | y) of x given y to the probability p(x) of x alone also expresses the degree of association between x and y. For adjacent words, for example, when P(thyroid nodule) is much greater than P(thyroid) P(nodule), "thyroid nodule" is considered a new word in the corpus.
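As a rough illustration of the PMI test on adjacent words, the sketch below estimates the probabilities from raw counts over already-segmented sentences. The function name `adjacent_pmi` and the toy sentences in the usage note are assumptions made for the example, not part of the patent.

```python
import math
from collections import Counter


def adjacent_pmi(sentences):
    """Map each adjacent word pair (x, y) to log(p(x, y) / (p(x) * p(y)))."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))   # adjacent pairs only
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (x, y): math.log((count / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), count in bigrams.items()
    }
```

For sentences such as `["thyroid", "nodule", "found"]` repeated alongside other findings, the pair ("thyroid", "nodule") scores higher than ("nodule", "found"), so it would be the stronger candidate for merging into one term. PMI tends to overrate pairs of very rare words, so a practical system would probably also require a minimum pair count before merging.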
In step (2), the keywords (disease names) in the word sequence are extracted with the statistical method TF-IDF, combined with a stop-word dictionary, to obtain the basic labels of the physical examination diagnostic text, which serve as the disease names. TF-IDF (term frequency-inverse document frequency) assesses how important a word is to one text in a text collection or corpus. The importance of a word increases in proportion to the number of times it appears in the text, but decreases as its frequency across the corpus increases. Term frequency (TF) is the number of times a given word appears in the text. Inverse document frequency (IDF) is obtained by dividing the total number of texts by the number of texts containing the word and taking the logarithm of the quotient.
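The TF and IDF definitions above translate into a few lines of code. The following is a minimal sketch under stated assumptions: the stop-word set `STOP_WORDS`, the function names, and the choice to keep the single top-scoring word per text are illustrative only.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the patent's stop-word dictionary is not given.
STOP_WORDS = {"found", "no", "obvious"}


def tf_idf(docs):
    """Score each non-stop word of each text as count * log(N / texts containing the word)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))             # count each word once per text
    scores = []
    for words in docs:
        tf = Counter(w for w in words if w not in STOP_WORDS)
        scores.append({w: c * math.log(n_docs / doc_freq[w]) for w, c in tf.items()})
    return scores


def top_keywords(docs, k=1):
    """Return the k highest-scoring words of each text (ties broken alphabetically)."""
    return [sorted(s, key=lambda w: (-s[w], w))[:k] for s in tf_idf(docs)]
```

A word that occurs often in one diagnostic text but in few others gets the highest score, which matches the behaviour the description attributes to TF-IDF.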
In step (3), the disease vocabulary obtained contains a large number of synonyms. For example, after-cataract also appears under the alternative names aftercataract and secondary cataract, and these all need to be merged into one disease. To find the synonyms of each disease term, the word2vec algorithm is used to represent each disease name as a numeric vector, and synonyms are then identified from the distances between the vectors.
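Once each disease name has a vector, the merging step can be sketched with cosine similarity. The vectors in the usage note, the 0.95 threshold, and the greedy "first representative wins" strategy are all assumptions for illustration; in practice the vectors would come from a trained word2vec model, and the patent does not specify a threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def merge_synonyms(vectors, threshold=0.95):
    """Map each name to the first earlier name whose vector is at least `threshold` similar."""
    canonical, representatives = {}, []
    for name, vec in vectors.items():
        for rep in representatives:
            if cosine(vec, vectors[rep]) >= threshold:
                canonical[name] = rep           # treat `name` as a synonym of `rep`
                break
        else:
            canonical[name] = name              # no close match: start a new group
            representatives.append(name)
    return canonical
```

For example, with hypothetical vectors `{"after-cataract": [0.9, 0.1, 0.2], "secondary cataract": [0.88, 0.12, 0.21], "fatty liver": [0.1, 0.9, 0.3]}`, the two cataract names fall into one group and "fatty liver" stays separate.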
In step (4), to better understand the relationships among diseases, the disease names are clustered once they have all been extracted, which yields the classification relationships among diseases. The invention uses latent Dirichlet allocation (LDA) for disease clustering. LDA is a generative model of document topics, also called a three-layer Bayesian probability model, comprising the three layers of words, topics, and documents.
The method of the invention mainly analyzes doctors' diagnostic results for patients in physical examination data, finds the disease names in them, clusters them, and establishes a disease classification tree, so that original diagnostic results are converted into disease labels. The method has the following advantages:
(1) New word discovery with PMI on the examination diagnostic results can find many disease terms that were not previously known.
(2) For the large disease vocabulary, synonyms are found with the word2vec algorithm, which reduces redundant words while standardizing disease names.
(3) Clustering diseases with LDA makes it convenient to obtain the connections among the diseases in each category.
Brief description of the drawings
Fig. 1 is a structural diagram of the disease label system based on physical examination diagnostic data;
Fig. 2 is a schematic diagram of the word2vec algorithm used for synonym discovery;
Fig. 3 is an example of part of the disease label classification tree.
Detailed description of the embodiment
To describe the invention more specifically, the technical solution is described in detail below with reference to the drawings and a specific embodiment.
As shown in Fig. 1, the method of the invention for converting physical examination diagnostic data into disease labels comprises the following steps:
Step 1, text segmentation: segment the original diagnostic results in the physical examination diagnostic data to obtain a text word sequence.
Text segmentation needs a dictionary to reach high accuracy. The dictionary chosen here is a basic dictionary collected from an external corpus, combined with a relevant medical dictionary. For the original diagnostic results in the physical examination data, a Trie tree generated from the dictionary is used for segmentation matching, and the matched results are further indexed and counted to obtain the maximum-probability segmentation result, called the text word sequence.
A Trie tree, also called a word lookup tree, is a tree structure for storing a large number of character strings. Its advantage is that it saves storage space by sharing the common prefixes of strings. Segmenting with a Trie tree can greatly increase segmentation speed. Since the same character string can often be segmented in several ways, dynamic programming is used here to find the maximum-probability segmentation of the string.
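The prefix sharing that makes a Trie compact can be seen in a small sketch. The class below is an illustrative implementation with hypothetical names (`Trie`, `words_starting_at`, `END`), and spaceless English words again stand in for Chinese dictionary entries.

```python
END = None  # sentinel key marking "a dictionary word ends at this node"


class Trie:
    """Prefix tree whose nodes are plain dicts keyed by character."""

    def __init__(self, words=()):
        self.root = {}
        self.node_count = 0                     # nodes created, excluding the root
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node:                  # only unseen suffixes allocate nodes
                node[ch] = {}
                self.node_count += 1
            node = node[ch]
        node[END] = True

    def words_starting_at(self, text, start=0):
        """Return every dictionary word that begins at text[start]."""
        found, node = [], self.root
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:                    # no dictionary word continues this way
                break
            if END in node:
                found.append(text[start:i + 1])
        return found
```

Inserting "fat", "fatty", "liver", and "live" creates 10 nodes for 17 characters, because "fat" and "live" are stored as prefixes of the longer words. A single left-to-right walk from each position returns all candidate words at once, which is what the dynamic-programming step needs.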
Step 2, new word discovery: merge words according to the relationships between words in the text word sequence to obtain new terms.
In this step, the statistics gathered after the segmentation in step 1 are used, with the analysis focused on adjacent words. The relationship between words is judged with the PMI statistic. For adjacent words, for example, when P(thyroid nodule) is much greater than P(thyroid) P(nodule), "thyroid nodule" is considered a new word in the corpus.
Step 3, text segmentation: segment again with the new terms to obtain the word sequence.
The segmentation method in this step is the same as in step 1: a Trie tree generated from the dictionary is used for segmentation matching, and dynamic programming is used to find the maximum-probability segmentation of the string.
Step 4, disease extraction: extract the required disease vocabulary from the word sequence to obtain disease names.
Here the statistical method TF-IDF is used to extract the disease names in the word sequence, combined with a stop-word dictionary, to obtain the basic labels of the physical examination diagnostic text, which serve as the disease names. For each diagnostic text, TF-IDF finds the words that occur frequently in that text but rarely in the other diagnostic texts.
Step 5, synonym merging: merge the synonyms among the disease names to obtain merged disease names.
Here, the word2vec algorithm is used to represent each disease name as a numeric vector, and synonyms are then identified from the distances between the vectors. As shown in Fig. 2, CBOW and Skip-gram are two implementations of the word2vec algorithm, where w(t) denotes the word at any given position, w(t-1) is the word immediately before it, and w(t+1) is the word immediately after it. The CBOW algorithm infers the current word from the words before and after it, while the Skip-gram algorithm infers the likely context from a single word. The basic idea of word2vec is to map each word, through training, to a K-dimensional real vector (K is a hyperparameter of the model) and to judge the semantic similarity between words by the distance between their vectors (for example cosine similarity or Euclidean distance). It uses a three-layer neural network consisting of an input layer, a hidden layer, and an output layer. The network itself models a language model, but it also yields a representation of words in a vector space, and this by-product is the real goal of word2vec. Once the word vector of each word has been obtained, the synonyms of a word can be found from the distances between word vectors.
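The difference between CBOW and Skip-gram is easiest to see in the training examples each one builds from a word sequence. The sketch below only generates those examples and does not train a network; the function names and the default window of one word on each side (matching w(t-1) and w(t+1)) are assumptions for illustration.

```python
def cbow_pairs(words, window=1):
    """(context words, target word): CBOW predicts w(t) from its neighbours."""
    pairs = []
    for t, target in enumerate(words):
        context = words[max(0, t - window):t] + words[t + 1:t + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(words, window=1):
    """(centre word, context word): Skip-gram predicts each neighbour from w(t)."""
    return [
        (target, neighbour)
        for context, target in cbow_pairs(words, window)
        for neighbour in context
    ]
```

For the sequence `["mild", "fatty", "liver"]`, CBOW yields one example per position, such as (`["mild", "liver"]`, `"fatty"`), while Skip-gram yields one example per (centre, neighbour) pair, such as (`"fatty"`, `"mild"`) and (`"fatty"`, `"liver"`).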
Step 6, disease clustering: cluster the diseases and establish a disease classification tree.
In this step, LDA is used to cluster the diseases and build the classification tree. Fig. 3 shows part of the classification tree built from the resulting classification relationships among diseases. For each document in the corpus, LDA defines the following generative process:
(1) for each document, draw a topic from the topic distribution;
(2) draw a word from the word distribution corresponding to the drawn topic;
(3) repeat the above process until every word in the document has been traversed.
Assuming that documents are generated in this way, the parameters of the model are estimated, which yields each topic and the word distribution within each topic. For the physical examination diagnostic database, the words within each topic are disease names, and each topic is one category of disease.
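The generative process in steps (1) to (3) can be simulated directly. The sketch below follows standard LDA, in which each document first draws its own topic mixture from a Dirichlet prior; the topic names, word probabilities, and function names are hypothetical, and the code only generates documents. Fitting the model, which is what the clustering step actually requires, runs this process in reverse and would normally rely on an existing LDA implementation.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) using gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Generate `length` (topic, word) pairs following the LDA generative process."""
    topics = list(topic_word)
    theta = sample_dirichlet(alpha, rng)        # this document's topic mixture
    doc = []
    for _ in range(length):
        topic = rng.choices(topics, weights=theta)[0]            # (1) draw a topic
        vocab = topic_word[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]  # (2) draw a word
        doc.append((topic, word))               # (3) repeat for every word position
    return theta, doc
```

A call such as `generate_document({"eye": {"cataract": 0.7, "glaucoma": 0.3}, "liver": {"fatty liver": 0.6, "liver cyst": 0.4}}, [0.5, 0.5], 5, random.Random(0))` returns a topic mixture and five words, each drawn from the vocabulary of its sampled topic.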
Step 7, label extraction: perform disease labeling according to the merged disease names and the disease classification tree to obtain the unique disease labels of each physical examination text.
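As a rough sketch of this final step, the merged names and the classification tree can be treated as two lookup tables. The function name `label_record`, the mapping contents, and the dotted code format in the usage note are hypothetical; the patent states that the classification tree is encoded numerically but does not give the coding scheme.

```python
def label_record(disease_names, canonical, category_code):
    """Return the sorted, de-duplicated (code, merged name) labels for one record."""
    labels = set()
    for name in disease_names:
        merged = canonical.get(name, name)      # synonym -> merged disease name
        code = category_code.get(merged)        # merged name -> classification tree code
        if code is not None:                    # skip names missing from the tree
            labels.add((code, merged))
    return sorted(labels)
```

With `canonical = {"secondary cataract": "after-cataract"}` and `category_code = {"after-cataract": "01.02", "fatty liver": "03.01"}`, a record that mentions both "secondary cataract" and "after-cataract" receives a single cataract label alongside its fatty liver label.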
The specific embodiment described above explains the technical solution and beneficial effects of the invention in detail. It should be understood that the foregoing is only the most preferred embodiment of the invention and is not intended to limit the invention. Any modification, supplement, or equivalent replacement made within the scope of the principles of the invention shall fall within the protection scope of the invention.

Claims (1)

1. A method for converting physical examination diagnostic data into disease labels, comprising the following steps:
(1) segmenting the original diagnostic results in the physical examination diagnostic data with a Trie tree and a dictionary, further indexing and counting the matched segmentation results, and solving the segmentation of the character string by dynamic programming to obtain a text word sequence; judging the relationships between words by computing the pointwise mutual information statistic, and merging words according to the relationships between words in the text word sequence to obtain new terms; segmenting again with the new terms to obtain a word sequence;
(2) extracting the disease names in the word sequence with the statistical method TF-IDF, combined with a stop-word dictionary, to obtain the disease names of the physical examination diagnostic text;
(3) representing each disease name as a numeric vector with the word2vec algorithm, and identifying synonyms from the distances between the vectors, thereby merging the synonyms among the disease names to obtain merged disease names;
(4) clustering the disease names with latent Dirichlet allocation to establish a disease classification tree;
(5) performing disease labeling according to the merged disease names and the disease classification tree to obtain disease labels.
CN201611198280.7A 2016-12-22 2016-12-22 Method for converting physical examination diagnostic data into disease labels Active CN106682411B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611198280.7A CN106682411B (en) 2016-12-22 2016-12-22 Method for converting physical examination diagnostic data into disease labels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611198280.7A CN106682411B (en) 2016-12-22 2016-12-22 Method for converting physical examination diagnostic data into disease labels

Publications (2)

Publication Number Publication Date
CN106682411A CN106682411A (en) 2017-05-17
CN106682411B CN106682411B (en) 2019-04-16

Family

ID=58871285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611198280.7A Active CN106682411B (en) Method for converting physical examination diagnostic data into disease labels 2016-12-22 2016-12-22

Country Status (1)

Country Link
CN (1) CN106682411B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662976A (en) * 2012-03-12 2012-09-12 浙江工业大学 Text feature weighting method based on supervision
CN102760134A (en) * 2011-04-28 2012-10-31 北京百度网讯科技有限公司 Method and device for mining synonyms
CN102915493A (en) * 2011-08-05 2013-02-06 株式会社东芝 Information processing apparatus and method
CN102930009A (en) * 2012-10-30 2013-02-13 北京奇虎科技有限公司 Individual website navigation system
CN103020212A (en) * 2012-12-07 2013-04-03 合一网络技术(北京)有限公司 Method and device for finding hot videos based on user query logs in real time
CN103559178A (en) * 2013-05-31 2014-02-05 武汉中文百科网络有限公司 System and method for switching between simplified Chinese characters and traditional Chinese characters on Internet
CN103902570A (en) * 2012-12-27 2014-07-02 腾讯科技(深圳)有限公司 Text classification feature extraction method, classification method and device
CN104424177A (en) * 2013-08-26 2015-03-18 高德软件有限公司 Method and device for extracting core words

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598642A (en) * 2015-02-13 2015-05-06 杜雨阳 Standard disease name checking method and system

Also Published As

Publication number Publication date
CN106682411A (en) 2017-05-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant