CN104572624A - Method for discovering treatment relation between single medicine and disease based on term vector - Google Patents

Method for discovering treatment relation between single medicine and disease based on term vector

Info

Publication number
CN104572624A
Authority
CN
China
Prior art keywords
disease
term vector
vector
treatment
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510027487.7A
Other languages
Chinese (zh)
Other versions
CN104572624B (en)
Inventor
张引
魏宝刚
庄越挺
黎磊
姚亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201510027487.7A priority Critical patent/CN104572624B/en
Publication of CN104572624A publication Critical patent/CN104572624A/en
Application granted granted Critical
Publication of CN104572624B publication Critical patent/CN104572624B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method for discovering the treatment relation between a single medicine and a disease based on word vectors. A training set is selected first: the 8,980 medicines in the book Chinese Materia Medica serve as the subjects of the treatment relation, disease concepts extracted from the descriptions of each medicine's indications serve as the objects, and together they form triples of the form "medicine, treatment, disease". Word2Vec, the tool published by Google, is used to train the word vectors, with Baidu Baike as the training corpus. Finally, the trained word vectors are used to train an SVM, which yields the required model. Given a single medicine and a disease as input, the model judges whether a treatment relation holds between them.

Description

A method for discovering the treatment relation between a single medicine and a disease based on word vectors
Technical field
The present invention relates to the field of discovering treatment relations between single medicines and diseases in traditional Chinese medicine (TCM). It is a product of the intersection of traditional Chinese medicine and computer science, and in particular relates to a treatment-relation discovery method based on word vectors and SVM.
Background art
In the TCM field, treatment relations between single medicines and diseases are well documented and can be looked up in standard reference works and textbooks, but there has never been an effective method for discovering further treatment relations. With the rapid development of computer science, continuing advances in machine learning offer new ideas for solving and improving on problems in the TCM field. In particular, word vectors represent each word as a vector in a vector space, which greatly enriches the representation of word meaning. The difference between two word vectors also carries meaning, which lays the foundation for relation discovery.
Summary of the invention
The object of the present invention is to address the deficiencies of the prior art by providing a method for discovering the treatment relation between a single medicine and a disease based on word vectors, using machine learning to discover treatment relations between medicines and diseases in the TCM field.
The object of the present invention is achieved through the following technical solution: a method for discovering the treatment relation between a single medicine and a disease based on word vectors, comprising the following steps:
(1) OCR is applied to Chinese Materia Medica, and the indications attribute of each medicine is extracted;
(2) The indications attribute is preprocessed in three passes. The first pass splits the text on punctuation to obtain the first candidate set. The second pass uses every term in the first candidate set as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia; if any one of the three has a page for the keyword, the keyword is taken to be a disease and is added to the disease set, otherwise it is added to the second candidate set. The third pass first parses the terms of the second candidate set with a syntax analyzer and finds those whose parse has the form adjective+noun; the noun part is used as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia, and if any one of the three has a page for the keyword, the adjective+noun term is taken to be a concrete form of a disease and is likewise added to the disease set; the remaining terms are discarded. After the three passes, the treatment-relation triples of medicine and disease are constructed;
(3) The encyclopedia data is segmented into words using a CRF model combined with longest-word matching, and useless terms such as stop words, prepositions, and numeral-classifier compounds are filtered out, to build the training set for the word vectors. Word2Vec (an open-source tool from Google) is used to construct the word-vector matrix, that is, each word is represented by a vector;
(4) For the triples obtained in step 3, the word vectors corresponding to the medicine and to the disease are looked up, and the word vector of the treatment relation is constructed by subtracting the disease vector from the single-medicine vector;
(5) The treatment-relation vectors constructed in step 4 serve as training tuples, with their vector dimensions as the feature space of the SVM; training the SVM yields the trained model;
(6) A single medicine and a disease are input, and their corresponding word vectors are looked up in the word-vector matrix constructed in step 3. The disease vector is subtracted from the single-medicine vector to obtain the relation vector, which is fed to the model trained in step 5; the model output determines whether a treatment relation holds between the two.
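The three preprocessing passes of step (2) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `KNOWN_PAGES`, `has_encyclopedia_page`, and `head_noun` are hypothetical stand-ins for the encyclopedia queries and the syntax analyzer, and short English sample text stands in for the Chinese source text.

```python
import re

# Hypothetical stand-in for "Baidu Baike, Hudong Baike, or Wikipedia has a
# page for this keyword". A real system would query the three sites.
KNOWN_PAGES = {"edema", "enterobiasis", "cold", "diarrhea"}


def has_encyclopedia_page(keyword):
    return keyword in KNOWN_PAGES


def split_on_punctuation(indications):
    """Pass 1: split the indications text on punctuation marks."""
    parts = re.split(r"[,;.，；。、]", indications)
    return [p.strip() for p in parts if p.strip()]


def head_noun(phrase):
    """Assumed parser stub: treat a two-word phrase as adjective+noun and
    return its last word; return None for any other shape."""
    words = phrase.split()
    return words[-1] if len(words) == 2 else None


def extract_diseases(indications):
    diseases, second_candidates = [], []
    # Pass 2: look up each candidate directly.
    for cand in split_on_punctuation(indications):
        if has_encyclopedia_page(cand):
            diseases.append(cand)
        else:
            second_candidates.append(cand)
    # Pass 3: look up only the noun part of adjective+noun candidates.
    for cand in second_candidates:
        noun = head_noun(cand)
        if noun is not None and has_encyclopedia_page(noun):
            diseases.append(cand)  # keep the whole phrase as the disease
        # anything else is discarded
    return diseases


def build_triples(medicine, indications):
    return [(medicine, "treats", d) for d in extract_diseases(indications)]


triples = build_triples("perilla", "summer cold, edema, enterobiasis, no page here")
```

In this toy run, "edema" and "enterobiasis" are accepted in the second pass, "summer cold" is accepted in the third pass through its noun part, and "no page here" is discarded.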
Beneficial effects of the present invention: the invention obtains standard "single medicine, treatment, disease" triples from a standard reference work through three preprocessing passes, trains word vectors using Baidu Baike data and Google's open-source tool word2vec, and combines the triples with the word vectors to train an SVM classifier. The resulting model judges accurately whether a treatment relation exists between a single medicine and a disease, and can effectively reveal hidden treatment relations, which is of considerable reference value for traditional Chinese medicine. The method is also general: provided a suitable training data set is prepared, it can be applied to mining relations in general.
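As a rough illustration of how triples and word vectors might be combined for SVM training, the sketch below builds relation vectors by subtraction and fits a linear SVM with plain subgradient descent on the hinge loss. The patent does not specify the kernel, the SVM library, or how negative examples are obtained, so the linear model, the hyperparameters, the toy two-dimensional vectors, and the negative pairs are all assumptions made for this example.

```python
def relation_vector(vectors, medicine, disease):
    """Medicine vector minus disease vector, element by element."""
    return [m - d for m, d in zip(vectors[medicine], vectors[disease])]


def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Minimise hinge loss plus an L2 penalty by subgradient descent.
    Labels are +1 (treatment relation) or -1 (no relation)."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - lr * reg) for wi in w]  # L2 shrinkage
            if margin < 1:  # sample violates the margin: move toward it
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy 2-d "word vectors", made up for illustration only.
vectors = {
    "ginger": [1.0, 0.2],
    "atractylodes": [1.2, 0.9],
    "cough": [0.1, 0.1],
    "edema": [0.2, 1.0],
    "headache": [0.0, 0.3],
    "fracture": [2.0, 0.0],
}
positive = [("ginger", "cough"), ("atractylodes", "edema")]
negative = [("ginger", "fracture"), ("atractylodes", "fracture")]  # assumed

samples = [relation_vector(vectors, m, d) for m, d in positive + negative]
labels = [1] * len(positive) + [-1] * len(negative)
model = train_linear_svm(samples, labels)

# Classify a (medicine, disease) pair that was not in the training set.
has_relation = predict(model, relation_vector(vectors, "ginger", "headache")) == 1
```

With real data, the vectors would come from the trained Word2Vec matrix and would have far more dimensions; an off-the-shelf SVM implementation would normally replace the hand-written trainer.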
Brief description of the drawings
Fig. 1 is the overall flow chart of the method of the present invention;
Fig. 2 shows the first preprocessing pass, taking "purple wild perilla" as an example;
Fig. 3 shows the second preprocessing pass, taking "purple wild perilla" as an example;
Fig. 4 shows the third preprocessing pass, taking "purple wild perilla" as an example.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the method of the present invention for discovering the treatment relation between a single medicine and a disease based on word vectors comprises the following steps:
(1) OCR is applied to Chinese Materia Medica, and the indications attribute of each medicine is extracted;
(2) The indications attribute is preprocessed in three passes.
First pass: the text is split on punctuation to obtain the first candidate set. As shown in Fig. 2, the indications attribute of purple-leaf perilla is "summer cold, headache with heaviness of the body, husband's sweat aversion to cold, abdominal pain with vomiting and diarrhea, edema, sore toxin, enterobiasis, vaginal trichomonas"; the candidate set obtained after the first pass is "summer cold", "headache with heaviness of the body", "husband's sweat aversion to cold", "abdominal pain with vomiting and diarrhea", "edema", "sore toxin", "enterobiasis", "vaginal trichomonas".
Second pass: every term in the first candidate set is first looked up in a local disease database. If it is present, the keyword is taken to be a disease concept. Otherwise it is used as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia; if any one of the three has a page for the keyword, the keyword is taken to be a disease and is added to the disease set, and the crawled entry definition is added to the local database; otherwise the term is added to the second candidate set. As shown in Fig. 3, the terms found in the encyclopedias are "edema", "sore toxin", and "enterobiasis"; these are added to the disease set, and the remaining terms are added to the second candidate set.
Third pass: the terms of the second candidate set are first parsed with a syntax analyzer to find those whose parse has the form noun+verb or noun+noun. For a term of the form noun+verb, the noun part is used as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia; if any one of the three has a page for the keyword, the noun+verb term is taken to be a concrete form of a disease and is added to the disease set, and the crawled entry definition is added to the local database. For a term of the form noun+noun, each noun part is used as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia; if any one of the three has a page for the keyword, the noun+noun term is taken to be a coordinate form of disease concepts and is also added to the disease set, and the crawled entry definition is added to the local database. The remaining terms are discarded. As shown in Fig. 4, "abdominal pain with vomiting and diarrhea", "summer cold", "headache with heaviness of the body", and "vaginal trichomonas" are parsed into the form verb+noun; the encyclopedias contain "vomiting and diarrhea", "cold", "heaviness of the body", and "trichomonas", so these four terms are also added to the disease candidate set, and the remaining "husband's sweat aversion to cold" is discarded.
After the three passes, the treatment-relation triples of medicine and disease are constructed;
(3) The encyclopedia data is segmented into words using a CRF model combined with longest-word matching, and useless terms such as stop words, prepositions, and numeral-classifier compounds are filtered out, to build the training set for the word vectors. Word2Vec (an open-source tool from Google) is used to construct the word-vector matrix, that is, each word is represented by a vector;
(4) For the triples obtained in step 3, the word vectors corresponding to the medicine and to the disease are looked up, and the word vector of the treatment relation is constructed by subtracting the disease vector from the single-medicine vector;
(5) The treatment-relation vectors constructed in step 4 serve as training tuples, with their vector dimensions as the feature space of the SVM; training the SVM yields the trained model;
(6) A single medicine and a disease are input, and their corresponding word vectors are looked up in the word-vector matrix constructed in step 3. The disease vector is subtracted from the single-medicine vector to obtain the relation vector, which is fed to the model trained in step 5; the model output determines whether a treatment relation holds between the two. A single medicine alone can also be input, in which case the model outputs the diseases it may treat, as exemplified in the following table:
Medicine | Diseases with a treatment relation already in the standard set | Diseases newly discovered by the model
Ginger | Wind-cold common cold, fever with aversion to cold, headache, nasal obstruction, vomiting, phlegm and retained fluid, cough and asthma, diarrhea | Cough, asthma, stomachache
Bighead atractylodes rhizome | Difficult urination, edema, phlegm and retained fluid, spontaneous sweating due to qi deficiency | Abdominal distension and diarrhea, spleen deficiency
Levisticum | Wind-cold-dampness arthralgia, lumbocrural pain, headache | Chill
Rice sprout | Distension and diarrhea, spleen deficiency, beriberi | Indigestion
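The single-medicine query behind a table like this can be sketched as a loop over candidate diseases: each disease not already recorded for the medicine is scored with the trained model, and the positively classified ones are returned. The sketch assumes a linear decision function with weights `w` and bias `b`; the vectors, the weights, and the function name `candidate_diseases` are made up for illustration and are not taken from the patent.

```python
def candidate_diseases(medicine, vectors, known, diseases, w, b):
    """Return diseases absent from the standard set whose relation vector
    (medicine minus disease) lies on the positive side of the hyperplane,
    ordered from highest to lowest score."""
    scored = []
    for disease in diseases:
        if disease in known.get(medicine, set()):
            continue  # relation already recorded in the standard set
        rel = [m - d for m, d in zip(vectors[medicine], vectors[disease])]
        score = sum(wi * ri for wi, ri in zip(w, rel)) + b
        if score >= 0:
            scored.append((score, disease))
    return [disease for _, disease in sorted(scored, reverse=True)]


# Toy 2-d vectors and a made-up hyperplane, for illustration only.
vectors = {
    "ginger": [1.0, 0.2],
    "headache": [0.0, 0.3],
    "cough": [0.1, 0.1],
    "asthma": [0.3, 0.0],
    "fracture": [2.0, 0.0],
}
known = {"ginger": {"headache"}}
diseases = ["headache", "cough", "asthma", "fracture"]

new_for_ginger = candidate_diseases("ginger", vectors, known, diseases,
                                    w=[1.0, 0.0], b=0.0)
```

Here "headache" is skipped because it is already known, "fracture" falls on the negative side, and "cough" and "asthma" are returned as new candidates.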

Claims (1)

1. A method for discovering the treatment relation between a single medicine and a disease based on word vectors, characterized by comprising the following steps:
(1) OCR is applied to Chinese Materia Medica, and the indications attribute of each medicine is extracted;
(2) The indications attribute is preprocessed in three passes. The first pass splits the text on punctuation to obtain the first candidate set. The second pass uses every term in the first candidate set as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia; if any one of the three has a page for the keyword, the keyword is taken to be a disease and is added to the disease set, otherwise it is added to the second candidate set. The third pass first parses the terms of the second candidate set with a syntax analyzer and finds those whose parse has the form adjective+noun; the noun part is used as a keyword to query Baidu Baike, Hudong Baike, and Wikipedia, and if any one of the three has a page for the keyword, the adjective+noun term is taken to be a concrete form of a disease and is likewise added to the disease set; the remaining terms are discarded. After the three passes, the treatment-relation triples of medicine and disease are constructed;
(3) The encyclopedia data is segmented into words using a CRF model combined with longest-word matching, and useless terms such as stop words, prepositions, and numeral-classifier compounds are filtered out, to build the training set for the word vectors. Word2Vec (an open-source tool from Google) is used to construct the word-vector matrix, that is, each word is represented by a vector;
(4) For the triples obtained in step 3, the word vectors corresponding to the medicine and to the disease are looked up, and the word vector of the treatment relation is constructed by subtracting the disease vector from the single-medicine vector;
(5) The treatment-relation vectors constructed in step 4 serve as training tuples, with their vector dimensions as the feature space of the SVM; training the SVM yields the trained model;
(6) A single medicine and a disease are input, and their corresponding word vectors are looked up in the word-vector matrix constructed in step 3. The disease vector is subtracted from the single-medicine vector to obtain the relation vector, which is fed to the model trained in step 5; the model output determines whether a treatment relation holds between the two.
CN201510027487.7A 2015-01-20 2015-01-20 A method for discovering the treatment relation between a single medicine and a disease based on word vectors Active CN104572624B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510027487.7A CN104572624B (en) 2015-01-20 2015-01-20 A method for discovering the treatment relation between a single medicine and a disease based on word vectors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510027487.7A CN104572624B (en) 2015-01-20 2015-01-20 A method for discovering the treatment relation between a single medicine and a disease based on word vectors

Publications (2)

Publication Number Publication Date
CN104572624A true CN104572624A (en) 2015-04-29
CN104572624B CN104572624B (en) 2017-12-29

Family

ID=53088728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510027487.7A Active CN104572624B (en) 2015-01-20 2015-01-20 A method for discovering the treatment relation between a single medicine and a disease based on word vectors

Country Status (1)

Country Link
CN (1) CN104572624B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824904A (en) * 2016-03-15 2016-08-03 浙江大学 Chinese herbal medicine plant picture capturing method based on professional term vector of traditional Chinese medicine and pharmacy field
CN106874643A (en) * 2016-12-27 2017-06-20 中国科学院自动化研究所 Build the method and system that knowledge base realizes assisting in diagnosis and treatment automatically based on term vector
CN110929511A (en) * 2018-09-04 2020-03-27 清华大学 Intelligent matching method for personalized traditional Chinese medicine diagnosis and treatment information and traditional Chinese medicine information based on semantic similarity

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101059806A (en) * 2007-06-06 2007-10-24 华东师范大学 Word sense based local file searching method
CN101118556A (en) * 2007-09-17 2008-02-06 中国科学院计算技术研究所 New word of short-text discovering method and system
US20130110497A1 (en) * 2011-10-27 2013-05-02 Microsoft Corporation Functionality for Normalizing Linguistic Items
CN103279543A (en) * 2013-05-13 2013-09-04 清华大学 Path mode inquiring system for massive image data

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105824904A (en) * 2016-03-15 2016-08-03 浙江大学 Chinese herbal medicine plant picture capturing method based on professional term vector of traditional Chinese medicine and pharmacy field
CN105824904B (en) * 2016-03-15 2018-12-25 浙江大学 Chinese herbal medicine picture crawling method based on tcm field profession term vector
CN106874643A (en) * 2016-12-27 2017-06-20 中国科学院自动化研究所 Build the method and system that knowledge base realizes assisting in diagnosis and treatment automatically based on term vector
CN106874643B (en) * 2016-12-27 2020-02-28 中国科学院自动化研究所 Method and system for automatically constructing knowledge base to realize auxiliary diagnosis and treatment based on word vectors
CN110929511A (en) * 2018-09-04 2020-03-27 清华大学 Intelligent matching method for personalized traditional Chinese medicine diagnosis and treatment information and traditional Chinese medicine information based on semantic similarity
CN110929511B (en) * 2018-09-04 2021-12-17 清华大学 Intelligent matching method for personalized traditional Chinese medicine diagnosis and treatment information and traditional Chinese medicine information based on semantic similarity

Also Published As

Publication number Publication date
CN104572624B (en) 2017-12-29

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant