CN104572624A - Method for discovering treatment relations between single medicines and diseases based on word vectors
- Publication number
- CN104572624A (application number CN201510027487.7A)
- Authority
- CN
- China
- Prior art keywords
- disease
- word vector
- vector
- treatment
- relation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a method for discovering treatment relations between single medicines and diseases based on word vectors. A training set is selected first: the 8,980 medicines in the book Chinese Materia Medica serve as the subjects of the treatment relation, disease concepts extracted from each medicine's description of major functions serve as the objects, and together they form triples of the form "medicine, treatment, disease". Word2Vec, the tool published by Google, is used to train the word vectors, with Baidu Baike as the training corpus. The trained word vectors are then used to train the required model with an SVM. Given a single medicine and a disease as input, the model judges whether a treatment relation exists between them.
Description
Technical field
The present invention relates to the discovery of treatment relations between single medicines and diseases in traditional Chinese medicine (TCM). It lies at the intersection of TCM and computer science, and in particular relates to a method for discovering treatment relations based on word vectors and a support vector machine (SVM).
Background art
In the TCM field, the treatment relations between single medicines and diseases are well founded and can be looked up in standard works and textbooks, but there has never been an effective method for discovering further treatment relations. With the rapid development of computer science, continuing advances in machine learning offer new ideas for solving and improving on problems in the TCM field. In particular, word vectors place each word in a vector space, which greatly enriches the representation of word meaning; the difference between two word vectors also carries meaning, which lays a foundation for relation discovery.
Summary of the invention
The object of the invention is to address the deficiencies of the prior art by providing a method for discovering treatment relations between single medicines and diseases based on word vectors, using machine learning to find treatment relations between Chinese medicines and diseases in the TCM field.
The object of the invention is achieved through the following technical solution: a method for discovering treatment relations between single medicines and diseases based on word vectors, comprising the following steps:
(1) Perform OCR on the Chinese Materia Medica and extract the indications attribute of each medicine;
(2) Preprocess the indications attribute in three passes. The first pass splits the text on punctuation to obtain the first candidate set. The second pass uses every term in the first candidate set as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia; if any one of the three has a page for the keyword, the keyword is taken to be a disease and is added to the disease set; otherwise it is added to the second candidate set. The third pass first parses the terms of the second candidate set with a syntax analyzer and picks out those of the form adjective+noun; the noun part is used as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia; if any one of the three has a page for it, the adjective+noun term is taken to be a specific form of a disease and is likewise added to the disease set; the remaining terms are discarded. After the three passes, the treatment-relation triples of medicines and diseases are constructed;
(3) Segment the encyclopedia data into words using a CRF model combined with longest-word matching, filtering out useless tokens such as stop words, prepositions, and numeral-classifier compounds, to build the training set for the word vectors; use Word2Vec (Google's open-source tool) to construct the word-vector matrix, so that every word is represented by a vector;
(4) For the triples obtained in step 3, look up the word vectors of the medicine and of the disease, and construct the word vector of the treatment relation as the medicine vector minus the disease vector;
(5) Use the treatment-relation vectors constructed in step 4 as training tuples, with their vector dimensions as the feature space of the SVM, and train the SVM to obtain the trained model;
(6) Input a single medicine and a disease; look up their word vectors in the word-vector matrix constructed in step 3; subtract the disease vector from the medicine vector to obtain the relation vector, which is fed to the model trained in step 5; judge from the model's output whether a treatment relation exists between the two.
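The three preprocessing passes of step (2) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `KNOWN_PAGES` is a hypothetical stand-in for "any of the three encyclopedias has a page for this keyword", `head_noun` is a hypothetical stand-in for the syntax analyzer, and the medicine name and indications text are invented toy data.

```python
import re

# Hypothetical stand-in for querying Baidu Baike, Hudong Baike and Wikipedia.
# A real system would issue web requests; a fixed set keeps the sketch runnable.
KNOWN_PAGES = {"oedema", "enterobiasis", "cold", "diarrhoea"}


def has_encyclopedia_page(term):
    """Return True if at least one encyclopedia is assumed to have a page for term."""
    return term in KNOWN_PAGES


def first_pass(indications):
    """Pass 1: split the indications text on Western and Chinese punctuation."""
    parts = re.split(r"[,;.、，；。]", indications)
    return [p.strip() for p in parts if p.strip()]


def second_pass(candidates):
    """Pass 2: terms with an encyclopedia page are diseases; the rest move on."""
    diseases, remaining = [], []
    for term in candidates:
        (diseases if has_encyclopedia_page(term) else remaining).append(term)
    return diseases, remaining


def third_pass(remaining, head_noun):
    """Pass 3: keep a phrase if its noun part has an encyclopedia page.

    head_noun maps a phrase to its noun part, or None if the phrase does not
    have the expected grammatical form (assumed parser interface).
    """
    kept = []
    for phrase in remaining:
        noun = head_noun(phrase)
        if noun is not None and has_encyclopedia_page(noun):
            kept.append(phrase)
    return kept  # everything else is discarded


def build_triples(medicine, indications, head_noun):
    """Run the three passes and emit (medicine, "treats", disease) triples."""
    diseases, remaining = second_pass(first_pass(indications))
    diseases += third_pass(remaining, head_noun)
    return [(medicine, "treats", d) for d in diseases]


# Toy usage: a dictionary lookup plays the role of the syntax analyzer.
toy_parser = {"summer cold": "cold", "severe diarrhoea": "diarrhoea"}.get
triples = build_triples(
    "herb_x", "summer cold, oedema, severe diarrhoea, unknown phrase", toy_parser
)
print(triples)
```

In this toy run, "oedema" is accepted in the second pass, "summer cold" and "severe diarrhoea" are accepted in the third pass through their noun parts, and "unknown phrase" is discarded.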
Beneficial effects of the invention: the invention obtains standard "single medicine, treatment, disease" triples from standard works through three preprocessing passes, trains word vectors from Baidu Baike data with Google's open-source tool Word2Vec, and combines the triples with the word vectors to train an SVM classifier. The resulting model judges accurately whether a treatment relation exists between a single medicine and a disease and can effectively reveal some hidden treatment relations, which is of considerable reference value for traditional Chinese medicine. The method is also general: provided that a suitable training data set is prepared, it can be applied to mining relations of other kinds.
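Steps (4) to (6) can be sketched as below. Several points are assumptions rather than statements of the patent: the 2-D word vectors are invented toy values, the negative examples are hand-picked (the text does not say how non-treatment pairs are obtained), and a linear SVM trained by sub-gradient descent on the hinge loss stands in for whatever SVM implementation and kernel were actually used.

```python
def relation_vector(medicine_vec, disease_vec):
    """Step 4: treatment-relation vector = medicine vector - disease vector."""
    return [m - d for m, d in zip(medicine_vec, disease_vec)]


def decision_score(w, b, x):
    """Signed score w.x + b; positive means 'treatment relation'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(samples, labels, epochs=100, lr=0.1, lam=0.01):
    """Step 5 (assumed variant): linear SVM via sub-gradient descent.

    Minimizes lam/2 * ||w||^2 + hinge loss; labels must be +1 or -1.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * decision_score(w, b, x)
            # Regularization shrinks w on every step.
            w = [(1.0 - lr * lam) * wi for wi in w]
            if margin < 1.0:
                # Sample violates the margin: move the separator toward it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def has_treatment_relation(w, b, medicine_vec, disease_vec):
    """Step 6: classify a (medicine, disease) pair with the trained model."""
    return decision_score(w, b, relation_vector(medicine_vec, disease_vec)) > 0


# Toy 2-D word vectors, invented purely for illustration.
VEC = {
    "medicine_a": [2.0, 1.5], "medicine_b": [1.75, 2.0],
    "disease_p": [1.0, 0.5], "disease_q": [0.75, 1.0],
    "disease_r": [3.0, 2.5], "disease_s": [2.75, 3.0],
}
positive = [("medicine_a", "disease_p"), ("medicine_b", "disease_q")]
negative = [("medicine_a", "disease_r"), ("medicine_b", "disease_s")]  # assumed

samples = [relation_vector(VEC[m], VEC[d]) for m, d in positive + negative]
labels = [1] * len(positive) + [-1] * len(negative)
w, b = train_linear_svm(samples, labels)

print(has_treatment_relation(w, b, VEC["medicine_a"], VEC["disease_q"]))
print(has_treatment_relation(w, b, VEC["medicine_b"], VEC["disease_r"]))
```

The feature space of the classifier has the same dimensionality as the word vectors, as step (5) describes; with real Word2Vec vectors that would typically be tens to hundreds of dimensions rather than two.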
Brief description of the drawings
Fig. 1 is the overall flow chart of the method of the invention;
Fig. 2 shows the first preprocessing pass for "purple wild Soviet Union";
Fig. 3 shows the second preprocessing pass for "purple wild Soviet Union";
Fig. 4 shows the third preprocessing pass for "purple wild Soviet Union".
Detailed description of the embodiments
The invention is described in further detail below with reference to the drawings and a specific embodiment.
As shown in Fig. 1, the method of the invention for discovering treatment relations between single medicines and diseases based on word vectors comprises the following steps:
(1) Perform OCR on the Chinese Materia Medica and extract the indications attribute of each medicine;
(2) Preprocess the indications attribute in three passes.
First pass: split the text on punctuation to obtain the first candidate set. As shown in Fig. 2, the indications attribute of "purple leaf Soviet Union" reads "mainly treats cold on hot summer days, headache and body heaviness, husband's sweat aversion to cold, stomachache with vomiting and diarrhoea, oedema, sore and toxic, enterobiasis, Trichomonas vaginalis"; the candidate set obtained after the first pass is "cold on hot summer days", "headache and body heaviness", "husband's sweat aversion to cold", "stomachache with vomiting and diarrhoea", "oedema", "sore and toxic", "enterobiasis", "Trichomonas vaginalis".
Second pass: every term in the first candidate set is first looked up in a local disease database; if it is found, the term is taken to be a disease concept. Otherwise the term is used as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia; if any one of the three has a page for the keyword, the keyword is taken to be a disease and is added to the disease set, and the definition of the entry is crawled and added to the local database; otherwise the term is added to the second candidate set. As shown in Fig. 3, the terms covered by the encyclopedias are "oedema", "sore and toxic", and "enterobiasis"; these are added to the disease set, and the remaining terms go into the second candidate set.
Third pass: the terms of the second candidate set are first parsed with a syntax analyzer, and those of the form noun+verb or noun+noun are picked out. For a term of the form noun+verb, the noun part is used as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia; if any one of the three has a page for it, the noun+verb term is taken to be a specific form of a disease and is added to the disease set, and the definition of the entry is crawled and added to the local database. For a term of the form noun+noun, each noun part is used as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia; if any one of the three has a page for it, the noun+noun term is taken to be a coordinate form of disease concepts and is also added to the disease set, with the definition of the entry crawled and added to the local database. The remaining terms are discarded. As shown in Fig. 4, "stomachache with vomiting and diarrhoea", "cold on hot summer days", "headache and body heaviness", and "Trichomonas vaginalis" are parsed as verb+noun forms; the parts covered by the encyclopedias are "vomiting and diarrhoea", "cold", "body heaviness", and "trichomonad", so these four terms are also added to the disease candidate set, and the remaining "husband's sweat aversion to cold" is discarded.
After the three passes, the treatment-relation triples of medicines and diseases are constructed;
(3) Segment the encyclopedia data into words using a CRF model combined with longest-word matching, filtering out useless tokens such as stop words, prepositions, and numeral-classifier compounds, to build the training set for the word vectors; use Word2Vec (Google's open-source tool) to construct the word-vector matrix, so that every word is represented by a vector;
(4) For the triples obtained in step 3, look up the word vectors of the medicine and of the disease, and construct the word vector of the treatment relation as the medicine vector minus the disease vector;
(5) Use the treatment-relation vectors constructed in step 4 as training tuples, with their vector dimensions as the feature space of the SVM, and train the SVM to obtain the trained model;
(6) Input a single medicine and a disease; look up their word vectors in the word-vector matrix constructed in step 3; subtract the disease vector from the medicine vector to obtain the relation vector, which is fed to the model trained in step 5; judge from the model's output whether a treatment relation exists between the two. Alternatively, only a single medicine may be input, and the model outputs the diseases it may treat, as in the following table:
Medicine | Diseases with a treatment relation already in the standard set | Diseases newly discovered by the model |
---|---|---|
Ginger | Wind-cold common cold, fever with aversion to cold, headache, nasal obstruction, vomiting, phlegm and retained fluid, cough and dyspnoea, diarrhoea | Cough, asthma, stomachache |
Bighead atractylodes rhizome | Difficult urination, oedema, phlegm and retained fluid, spontaneous perspiration due to deficiency of vital energy | Abdominal distension with diarrhoea, spleen deficiency |
Levisticum | Wind-cold-dampness arthralgia, lumbocrural pain, headache | Chill |
Rice sprout | Distension, diarrhoea, spleen deficiency, beriberi | Indigestion |
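The single-medicine query illustrated by the table can be sketched as a loop over candidate diseases. The weights `w` and bias `b` below are hypothetical placeholders for a trained linear model, the vectors are invented toy values, and ranking by decision score is an assumption; the text only states that the model outputs diseases the medicine may treat.

```python
def candidate_diseases(medicine, vectors, known, w, b, disease_names):
    """List diseases not yet recorded for `medicine` that the model accepts.

    known maps a medicine to the set of diseases already in the standard set;
    results are sorted by decreasing decision score (assumed ranking rule).
    """
    m_vec = vectors[medicine]
    already = known.get(medicine, set())
    scored = []
    for disease in disease_names:
        if disease in already:
            continue  # relation already documented, nothing new to report
        # Relation vector: medicine vector minus disease vector.
        rel = [mi - di for mi, di in zip(m_vec, vectors[disease])]
        score = sum(wi * xi for wi, xi in zip(w, rel)) + b
        if score > 0:
            scored.append((score, disease))
    scored.sort(reverse=True)
    return [disease for _, disease in scored]


# Toy data, invented purely for illustration.
vectors = {
    "medicine_a": [2.0, 1.5],
    "disease_p": [1.0, 0.5],
    "disease_q": [0.75, 1.0],
    "disease_r": [3.0, 2.5],
    "disease_t": [1.5, 1.0],
}
known = {"medicine_a": {"disease_p"}}
w, b = [0.5, 0.5], 0.0  # hypothetical trained parameters

new_diseases = candidate_diseases(
    "medicine_a", vectors, known, w, b,
    ["disease_p", "disease_q", "disease_r", "disease_t"],
)
print(new_diseases)
```

Excluding the diseases already present in the standard set mirrors the two columns of the table: the known relations come from the source book, and only the remaining positive predictions are reported as newly discovered.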
Claims (1)
1. A method for discovering treatment relations between single medicines and diseases based on word vectors, characterized in that it comprises the following steps:
(1) Perform OCR on the Chinese Materia Medica and extract the indications attribute of each medicine;
(2) Preprocess the indications attribute in three passes. The first pass splits the text on punctuation to obtain the first candidate set. The second pass uses every term in the first candidate set as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia; if any one of the three has a page for the keyword, the keyword is taken to be a disease and is added to the disease set; otherwise it is added to the second candidate set. The third pass first parses the terms of the second candidate set with a syntax analyzer and picks out those of the form adjective+noun; the noun part is used as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia; if any one of the three has a page for it, the adjective+noun term is taken to be a specific form of a disease and is likewise added to the disease set; the remaining terms are discarded. After the three passes, the treatment-relation triples of medicines and diseases are constructed;
(3) Segment the encyclopedia data into words using a CRF model combined with longest-word matching, filtering out useless tokens such as stop words, prepositions, and numeral-classifier compounds, to build the training set for the word vectors; use Word2Vec (Google's open-source tool) to construct the word-vector matrix, so that every word is represented by a vector;
(4) For the triples obtained in step 3, look up the word vectors of the medicine and of the disease, and construct the word vector of the treatment relation as the medicine vector minus the disease vector;
(5) Use the treatment-relation vectors constructed in step 4 as training tuples, with their vector dimensions as the feature space of the SVM, and train the SVM to obtain the trained model;
(6) Input a single medicine and a disease; look up their word vectors in the word-vector matrix constructed in step 3; subtract the disease vector from the medicine vector to obtain the relation vector, which is fed to the model trained in step 5; judge from the model's output whether a treatment relation exists between the two.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510027487.7A CN104572624B (en) | 2015-01-20 | 2015-01-20 | Method for discovering treatment relations between single medicines and diseases based on word vectors |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510027487.7A CN104572624B (en) | 2015-01-20 | 2015-01-20 | Method for discovering treatment relations between single medicines and diseases based on word vectors |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104572624A (en) | 2015-04-29 |
CN104572624B (en) | 2017-12-29 |
Family
ID=53088728
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510027487.7A Active CN104572624B (en) | 2015-01-20 | 2015-01-20 | Method for discovering treatment relations between single medicines and diseases based on word vectors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104572624B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105824904A (en) * | 2016-03-15 | 2016-08-03 | 浙江大学 | Chinese herbal medicine plant picture capturing method based on professional term vector of traditional Chinese medicine and pharmacy field |
CN106874643A (en) * | 2016-12-27 | 2017-06-20 | 中国科学院自动化研究所 | Build the method and system that knowledge base realizes assisting in diagnosis and treatment automatically based on term vector |
CN110929511A (en) * | 2018-09-04 | 2020-03-27 | 清华大学 | Intelligent matching method for personalized traditional Chinese medicine diagnosis and treatment information and traditional Chinese medicine information based on semantic similarity |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101059806A (en) * | 2007-06-06 | 2007-10-24 | 华东师范大学 | Word sense based local file searching method |
CN101118556A (en) * | 2007-09-17 | 2008-02-06 | 中国科学院计算技术研究所 | New word of short-text discovering method and system |
US20130110497A1 (en) * | 2011-10-27 | 2013-05-02 | Microsoft Corporation | Functionality for Normalizing Linguistic Items |
CN103279543A (en) * | 2013-05-13 | 2013-09-04 | 清华大学 | Path mode inquiring system for massive image data |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105824904A (en) * | 2016-03-15 | 2016-08-03 | 浙江大学 | Chinese herbal medicine plant picture capturing method based on professional term vector of traditional Chinese medicine and pharmacy field |
CN105824904B (en) * | 2016-03-15 | 2018-12-25 | 浙江大学 | Chinese herbal medicine picture crawling method based on tcm field profession term vector |
CN106874643A (en) * | 2016-12-27 | 2017-06-20 | 中国科学院自动化研究所 | Build the method and system that knowledge base realizes assisting in diagnosis and treatment automatically based on term vector |
CN106874643B (en) * | 2016-12-27 | 2020-02-28 | 中国科学院自动化研究所 | Method and system for automatically constructing knowledge base to realize auxiliary diagnosis and treatment based on word vectors |
CN110929511A (en) * | 2018-09-04 | 2020-03-27 | 清华大学 | Intelligent matching method for personalized traditional Chinese medicine diagnosis and treatment information and traditional Chinese medicine information based on semantic similarity |
CN110929511B (en) * | 2018-09-04 | 2021-12-17 | 清华大学 | Intelligent matching method for personalized traditional Chinese medicine diagnosis and treatment information and traditional Chinese medicine information based on semantic similarity |
Also Published As
Publication number | Publication date |
---|---|
CN104572624B (en) | 2017-12-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |