CN108509419A - Word segmentation and part-of-speech tagging method and system for ancient traditional Chinese medicine (TCM) literature - Google Patents

Word segmentation and part-of-speech tagging method and system for ancient TCM literature

Info

Publication number
CN108509419A
Authority
CN
China
Prior art keywords
chinese medicine
speech
traditional chinese
dictionary
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810233868.4A
Other languages
Chinese (zh)
Other versions
CN108509419B (en
Inventor
付先军
李学博
王振国
陈晓康
桑晓明
鞠芳凝
周扬
陈聪
邵欣欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Traditional Chinese Medicine
Original Assignee
Shandong University of Traditional Chinese Medicine
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Traditional Chinese Medicine
Priority to CN201810233868.4A
Publication of CN108509419A
Application granted
Publication of CN108509419B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • G16H70/40 ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Primary Health Care (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Epidemiology (AREA)
  • Toxicology (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Medicinal Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word segmentation and part-of-speech tagging method and system for ancient TCM literature. The method comprises: Step (1): build a TCM word segmentation dictionary; Step (2): use the TCM word segmentation dictionary to segment the text to be segmented and tag its parts of speech; Step (3): determine whether the whole text was segmented successfully, and output directly the results that were segmented successfully; Step (4): for text that failed segmentation, segment it again using the Ansj dictionary to obtain the final segmentation result.

Description

Word segmentation and part-of-speech tagging method and system for ancient TCM literature
Technical field
The present invention relates to a word segmentation and part-of-speech tagging method and system for ancient TCM literature.
Background
Literature is vital to human civilization and social progress and is the basis of all scientific research. TCM literature is an important part of ancient Chinese literature and an important foundation for studying the clinical experience of ancient physicians. It not only brings together TCM knowledge of principles, methods, prescriptions and medicinals, but also contains the academic thought and clinical medication experience accumulated over thousands of years of TCM development. Mining this valuable cultural heritage is an important prerequisite and basis for the academic inheritance and innovation of TCM. The modern interpretation of TCM theory and modern research on TCM diseases, therapies and prescriptions all depend on the classical medical texts; for example, the discovery of artemisinin (qinghaosu) drew on inspiration from classical TCM literature such as 《Handbook of Prescriptions for Emergencies》.
The collation and analysis of TCM literature rests on word segmentation and part-of-speech tagging. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a given specification. At present, most research at home and abroad on the theory, methods and techniques of Chinese word segmentation remains at the theoretical or experimental stage and leans toward natural language processing and information retrieval, and few mature, usable Chinese word segmentation programs exist. No software or method dedicated to TCM word segmentation and part-of-speech tagging has been reported. Because TCM terminology is so specialized, general-purpose Chinese word segmentation software achieves low accuracy and recall on TCM literature: the best reported result, for the Pangu segmenter, is an accuracy of only 0.735 and a recall of only 0.663. For other Chinese segmentation systems the accuracy, recall and combined F1 score are even below 0.5; PHP Analysis, for example, reaches an accuracy of only 0.312 and a recall of only 0.369. None of these systems can assign specific part-of-speech tags to TCM-specific terms. This greatly restricts the use and mining of TCM literature. Moreover, most of this software requires environment configuration, places particular requirements on the system, has poor portability and is not easy to operate.
Therefore, building a TCM literature word segmentation and part-of-speech tagging system and method that suits the characteristics of TCM literature, achieves high accuracy and recall, and can assign part-of-speech tags matching the characteristics of TCM terminology would break through a major technical bottleneck that currently restricts TCM literature mining and knowledge discovery. It is of great significance for the inheritance and innovation of TCM and for bringing its original advantages into play.
Summary of the invention
The object of the present invention is to provide a word segmentation and part-of-speech tagging method and system for ancient TCM literature that improve the accuracy and recall of segmenting ancient TCM texts and can assign part-of-speech tags suited to the characteristics of TCM terminology. This addresses the low accuracy and recall of current Chinese automatic segmentation on TCM literature and its inability to perform TCM-specific part-of-speech tagging. Applying the method to segment and tag the text of 《Treatise on Febrile Diseases》, we found that this segmentation system achieves higher accuracy and recall than general-purpose Chinese automatic segmentation, and that its part-of-speech tagging of 《Treatise on Febrile Diseases》 comes very close to the level of professionals.
A first aspect of the present invention provides a word segmentation and part-of-speech tagging method for ancient TCM literature.
The word segmentation and part-of-speech tagging method for ancient TCM literature comprises:
Step (1): Build a TCM word segmentation dictionary;
Step (2): Use the TCM word segmentation dictionary to segment the text to be segmented and tag its parts of speech;
Step (3): Determine whether the whole text to be segmented was segmented successfully; output directly the results that were segmented successfully;
Step (4): For text that failed segmentation, segment it again using the Ansj dictionary; obtain the final segmentation result.
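The four steps above can be sketched as a two-pass routine. This is a minimal illustration under stated assumptions, not the patented implementation: forward maximum matching stands in for the dictionary segmenter, `fallback_segment` is a hypothetical callable standing in for the Ansj pass, and a token counts as "segmented successfully" when it carries part-of-speech letters, following the criterion given for step (3). The sample words and the FCzz tag are illustrative.

```python
def segment_with_fallback(text, tcm_dict, fallback_segment, max_len=8):
    """Two-pass segmentation: TCM dictionary first, fallback for the rest.

    tcm_dict maps a word to its part-of-speech letters (for example "FCzz").
    fallback_segment is any callable str -> list[str]; it is a hypothetical
    stand-in for the general-purpose segmenter of step (4).
    """
    result = []    # list of (word, tag) pairs; tag is None after fallback
    pending = ""   # run of characters the TCM dictionary did not cover

    def flush():
        nonlocal pending
        if pending:
            result.extend((w, None) for w in fallback_segment(pending))
            pending = ""

    i = 0
    while i < len(text):
        match = None
        # Forward maximum matching: try the longest candidate first.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in tcm_dict:
                match = text[i:j]
                break
        if match:
            flush()
            result.append((match, tcm_dict[match]))
            i += len(match)
        else:
            pending += text[i]
            i += 1
    flush()
    return result


# Illustrative dictionary; the fallback here simply splits into characters.
tcm_dict = {"发热": "FCzz", "汗出": "FCzz"}
tokens = segment_with_fallback("发热而汗出", tcm_dict, list)
```

Tokens whose tag is `None` are exactly the ones that the first pass failed on and that were handed to the second pass.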
Further, step (1), building the TCM word segmentation dictionary, comprises:
Step (101): Build a TCM professional term dictionary;
Step (102): Classify and label the parts of speech of the words in the TCM professional term dictionary;
Step (103): Build the TCM word segmentation dictionary using a three-column dictionary construction method.
Further, step (101), building the TCM professional term dictionary, comprises:
Extracting TCM professional terms from ancient TCM literature and TCM dictionaries;
The TCM professional terms include: Chinese medicinal names, prescription names, TCM book titles, physician names, TCM disease and symptom names, TCM efficacy names, acupoint names, TCM dosage names, classical Chinese vocabulary, and specialized vocabulary from modern medicine.
Further, step (102), classifying the parts of speech of the words in the TCM professional term dictionary, comprises:
With reference to the disease part, syndrome part or therapy part of 《National Standard of the People's Republic of China: TCM Clinical Diagnosis and Treatment Terminology》, and in combination with the characteristics of TCM terminology, TCM nouns are divided into several part-of-speech classes and a 14-class part-of-speech table is built. The 14 classes are: 1. TCM theoretical basis, 2. TCM diagnostics, 3. Chinese medicinal nouns, 4. prescription nouns, 5. cold damage (shanghan) and warm disease, 6. TCM therapeutic principles, 7. TCM treatment methods, 8. TCM and related disciplines, 9. TCM books, 10. TCM institutions, equipment or medical and health personnel, 11. personal reference words, 12. geographic names, 13. season and time words, 14. other words. According to the level of the part of speech, the TCM nouns in the dictionary are classified and labeled in order from low to high.
Each class is divided into several levels of subclasses. For example, TCM diagnostics includes the four examinations subclass; the four examinations comprise inspection, listening and smelling, inquiry, and palpation; inspection includes tongue diagnosis; tongue diagnosis includes the tongue manifestation; the tongue manifestation includes the tongue coating and the tongue body; and the tongue coating includes coating color and coating texture. There are at most 7 levels of subclasses.
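The multi-level classification can be pictured as a tree in which every subclass points to its parent. The sketch below is only an illustration: the English labels are informal translations of the example branch above, and counting the top-level class as level 1 is an assumption, since the text does not say where counting starts.

```python
# Child -> parent links for the single branch used as an example in the text.
# Labels are illustrative English renderings, not the patent's own tag set.
PARENT = {
    "four examinations": "TCM diagnostics",
    "inspection": "four examinations",
    "tongue diagnosis": "inspection",
    "tongue manifestation": "tongue diagnosis",
    "tongue coating": "tongue manifestation",
    "tongue body": "tongue manifestation",
    "coating color": "tongue coating",
    "coating texture": "tongue coating",
}


def path_to_root(category):
    """Return the chain from the top-level class down to `category`."""
    chain = [category]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain[::-1]


def level(category):
    """Depth of a category, counting the top-level class as level 1 (assumed)."""
    return len(path_to_root(category))


deepest = max(list(PARENT) + ["TCM diagnostics"], key=level)
```

Under that counting assumption, the branch ending in coating color or coating texture reaches level 7, which is consistent with the stated maximum of 7 levels.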
Further, in step (103) the TCM word segmentation dictionary is built using a three-column dictionary construction method. The dictionary has three columns:
Column 1 holds the TCM professional word, such as poor thieves, zhusha anshen pills, etc.;
Column 2 holds the part-of-speech classification letters; for example, zhusha anshen pills belong to the heavy sedative tranquilizing prescriptions within the prescription class, so their part-of-speech classification letters are FCzzasj;
Column 3 holds the part-of-speech level mark; for example, the heavy sedative tranquilizing prescriptions are at level 4 of the prescription classification and are marked 4.
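A three-column entry of this kind is straightforward to load. The sketch below is a hedged illustration: the tab separator, the `Entry` type and the `parse_dictionary` helper are assumptions made for the example, and the Chinese spelling 朱砂安神丸 is our rendering of "zhusha anshen pills"; only the letters FCzzasj and the level 4 come from the text.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    word: str    # column 1: TCM professional word
    tag: str     # column 2: part-of-speech classification letters
    level: int   # column 3: part-of-speech level mark


def parse_dictionary(lines):
    """Parse tab-separated three-column lines into a word -> Entry mapping."""
    entries = {}
    for number, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(f"line {number}: expected 3 tab-separated columns")
        word, tag, level = parts
        entries[word] = Entry(word, tag, int(level))
    return entries


# One sample row; the file layout itself is an assumption.
sample = ["朱砂安神丸\tFCzzasj\t4"]
dictionary = parse_dictionary(sample)
```

Keeping the level in its own column lets a consumer choose how fine-grained a tag to report without re-deriving the depth from the letters.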
Further, step (2) comprises:
Step (201): Extract keywords from the text to be segmented using a bag-of-words model;
Step (202): Train a conditional random field (CRF) model on the existing words in the TCM word segmentation dictionary, use the CRF model to discover new words, and add the new words to the TCM word segmentation dictionary;
Step (203): Build a double-array trie from all existing words in the word segmentation dictionary;
Step (204): Match the keywords extracted from the text to be segmented against the double-array trie by single-string pattern matching, and use the double-array trie to segment the currently extracted keywords, obtaining the segmentation result;
Step (205): Train a hidden Markov model (HMM): take the existing words in the word segmentation dictionary as the observation sequence and the part of speech of each word as the hidden-state sequence, and train the HMM to obtain a trained hidden Markov model;
Step (206): Tag parts of speech with the trained hidden Markov model: input the word sequence of the segmentation result obtained in step (204) into the trained HMM as the observation sequence, and use the Viterbi algorithm to generate the hidden-state sequence for the current observation sequence. The hidden states obtained are the parts of speech of the text to be segmented, which completes the part-of-speech tagging.
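Step (206) relies on Viterbi decoding over a first-order HMM whose observations are words and whose hidden states are part-of-speech tags. The following is a minimal log-space sketch; every probability in it is a made-up toy value, the probability floor for unseen events is an assumption, and only the tag names FCbm and FCzz are taken from the examples in this document.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely hidden-state (tag) sequence for `obs`."""
    def lg(p):
        return math.log(max(p, floor))  # floor avoids log(0) for unseen events

    if not obs:
        return []
    score = {s: lg(start_p.get(s, 0)) + lg(emit_p[s].get(obs[0], 0))
             for s in states}
    back = []  # one dict of back-pointers per position after the first
    for word in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + lg(trans_p[p].get(s, 0)))
            score[s] = (prev[best] + lg(trans_p[best].get(s, 0))
                        + lg(emit_p[s].get(word, 0)))
            pointers[s] = best
        back.append(pointers)
    tags = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        tags.append(pointers[tags[-1]])
    return tags[::-1]


# Toy model: all numbers below are illustrative, not trained values.
states = ["FCbm", "FCzz"]
start_p = {"FCbm": 0.6, "FCzz": 0.4}
trans_p = {"FCbm": {"FCbm": 0.2, "FCzz": 0.8},
           "FCzz": {"FCbm": 0.3, "FCzz": 0.7}}
emit_p = {"FCbm": {"中风": 1.0},
          "FCzz": {"发热": 0.5, "汗出": 0.5}}
tags = viterbi(["中风", "发热", "汗出"], states, start_p, trans_p, emit_p)
```

Working in log space keeps the scores numerically stable on long sentences, which matters when the tag set is as large as the one described here.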
Further, in step (3) the criterion for determining whether the whole text to be segmented was segmented successfully is:
If every segmentation result carries part-of-speech tag letters, segmentation succeeded; otherwise, segmentation failed.
A second aspect of the present invention provides a word segmentation and part-of-speech tagging system for ancient TCM literature.
The word segmentation and part-of-speech tagging system for ancient TCM literature comprises: a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of any of the above methods are completed.
A third aspect of the present invention provides a computer-readable storage medium.
A computer-readable storage medium carries computer instructions; when the computer instructions are run by a processor, the steps of any of the above methods are completed.
Compared with the prior art, the beneficial effects of the present invention are:
The recall and accuracy of the present invention in segmenting ancient TCM literature are far higher than those of the prior art.
The present invention realizes TCM-specific part-of-speech tagging for the first time, providing a basis for TCM literature mining and knowledge discovery.
The two-pass segmentation of the present invention ensures the completeness and accuracy of the segmentation result.
Description of the drawings
The accompanying drawings, which form part of this application, provide a further understanding of the application. The illustrative embodiments of the application and their descriptions serve to explain the application and do not unduly limit it.
Fig. 1 is a flow chart of the method of the present invention.
Detailed description
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which this application belongs.
As shown in Fig. 1, the word segmentation and part-of-speech tagging method for ancient TCM literature comprises:
Step (1): Build a TCM word segmentation dictionary;
Step (2): Use the TCM word segmentation dictionary to segment the text to be segmented and tag its parts of speech;
Step (3): Determine whether the whole text to be segmented was segmented successfully; output directly the results that were segmented successfully;
Step (4): For text that failed segmentation, segment it again using the Ansj dictionary; obtain the final segmentation result.
Further, step (1), building the TCM word segmentation dictionary, comprises:
Step (101): Build a TCM professional term dictionary;
Step (102): Classify and label the parts of speech of the words in the TCM professional term dictionary;
Step (103): Build the TCM word segmentation dictionary using a three-column dictionary construction method.
1. Construction of the TCM word segmentation dictionary
1.1 Construction of the TCM professional term dictionary
One of the main reasons why current general-purpose Chinese word segmentation software segments TCM text inaccurately is its poor ability to recognize terms such as TCM syndromes, meridians and collaterals, and acupoints. This system therefore first builds a comprehensive TCM term dictionary. Using web crawlers, artificial neural networks, and manual proofreading, extraction and standardization, a dedicated dictionary covering TCM professional terms such as Chinese medicinal names and prescription names was extracted and built from ancient TCM literature and various TCM dictionaries. It contains 155,343 TCM-related terms and is currently the TCM professional term dictionary with the largest number of entries.
Table 1. Composition of the TCM word segmentation dictionary
1.2 TCM-specific part-of-speech tagging method
Part-of-speech tagging (POS tagging), also called word-class tagging or simply tagging, is the procedure of assigning a correct part of speech to each word in a segmentation result. In general, current part-of-speech tagging determines whether each word is a noun, a verb, an adjective or some other part of speech. Tagging of this kind is of limited value for text mining and analysis of TCM literature. For this reason, we combine the characteristics of the TCM discipline and, following the classification of the TCM theoretical system, divide TCM nouns into 14 classes and 818 parts of speech: TCM theoretical basis, TCM diagnostics, Chinese medicinals, prescription-related nouns, cold damage (shanghan) and warm disease, therapeutic principles, therapies, TCM-related disciplines, TCM books, TCM institutions, TCM instruments and equipment, medical and health personnel titles, geographic names, and other.
A first-order hidden Markov model is used. In this model the hidden states are the 818 parts of speech, represented by 818 letter abbreviations; to distinguish them from general part-of-speech tags, each abbreviation is prefixed with FC.
In addition, words are tagged according to the level of the part of speech, as far as possible in priority order from low to high.
Table 2. Composition of the TCM professional parts of speech (partial)
1.3 Construction and extension of the TCM word segmentation dictionary
The word segmentation dictionary is the core of this system and strongly affects both the accuracy and the speed of segmentation. Based on the TCM professional term dictionary and the part-of-speech tagging method described above, this system uses a three-column dictionary construction method: column 1 holds the TCM professional term, column 2 holds the part-of-speech tag letters, and column 3 holds the level description.
1.4 Trie (dictionary tree) construction process
(1) Create the root node root and set base[root] = 1.
(2) Find the child node set {root.children_i} (i = 1...n) of root, such that check[root.children_i] = base[root] = 1.
(3) For each element in root.children:
1) Find {element.children_i} (i = 1...n). Note that if a character is at the end of a string, its child nodes include an empty node whose code value is set to 0. Find a value begin_i such that every check[begin_i + element.children_i.code] = 0.
2) Set base[element.children_i] = begin_i.
3) Apply step (3) recursively to element.children_i. If the traversal reaches an element that has no children, that is, a leaf node, set base[element] to a negative value.
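The construction can be made concrete with a small double-array trie. The sketch below is an assumption-laden illustration rather than the procedure above verbatim: it uses the classic formulation in which check[t] stores the index of the parent state (the text instead stores the parent's base value), marks free slots with -1, and marks leaves with base = -1. The class name, method names and sample words are all hypothetical.

```python
class DoubleArrayTrie:
    """Toy double-array trie; check[t] holds the parent state of slot t."""

    def __init__(self, words):
        chars = sorted({ch for w in words for ch in w})
        self.code = {ch: i + 1 for i, ch in enumerate(chars)}  # 0 = end marker
        self.base = [1]    # base[root] = 1, as in step (1)
        self.check = [0]   # slot 0 is the root and is never a transition target
        tree = {}
        for w in words:    # ordinary nested-dict trie keyed by character code
            node = tree
            for ch in w:
                node = node.setdefault(self.code[ch], {})
            node[0] = {}   # empty child marks the end of a word
        if tree:
            self._place(0, tree)

    def _grow(self, size):
        while len(self.base) < size:
            self.base.append(0)
            self.check.append(-1)  # -1 means the slot is free

    def _place(self, state, children):
        codes = sorted(children)
        begin = 1
        while True:  # find the first offset where all child slots are free
            self._grow(begin + codes[-1] + 1)
            if all(self.check[begin + c] == -1 for c in codes):
                break
            begin += 1
        self.base[state] = begin
        for c in codes:            # reserve every slot before recursing
            self.check[begin + c] = state
        for c in codes:
            if c == 0:
                self.base[begin] = -1  # leaf: negative base
            else:
                self._place(begin + c, children[c])

    def _step(self, state, c):
        t = self.base[state] + c
        return t if 0 < t < len(self.check) and self.check[t] == state else -1

    def contains(self, word):
        return self.longest_prefix(word) == len(word) and bool(word)

    def longest_prefix(self, text, start=0):
        """Length of the longest dictionary word starting at text[start]."""
        state, best = 0, 0
        for i in range(start, len(text)):
            c = self.code.get(text[i])
            state = self._step(state, c) if c is not None else -1
            if state < 0:
                break
            if self._step(state, 0) > 0:   # an end marker follows this state
                best = i + 1 - start
        return best


trie = DoubleArrayTrie(["发热", "发汗", "汗出", "恶风", "脉缓"])
```

Calling `longest_prefix` repeatedly while advancing through the text gives forward maximum matching, one simple way to turn the trie into a segmenter.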
2. TCM literature segmentation algorithm and part-of-speech tagging
The core algorithm of this segmentation system is the open-source code of Ansj, a Java Chinese word segmentation tool based on the ICTCLAS Chinese word segmentation algorithm of the Chinese Academy of Sciences. Its segmentation accuracy is higher than that of other common open-source segmentation tools (such as mmseg4j).
On this basis, the TCM professional dictionary that we built replaces the default dictionary, the Ansj dictionary is used as a supplement, and part-of-speech tagging is performed based on the HMM.
3. Structure and use of the TCM literature segmentation and part-of-speech tagging service system
The TCM literature segmentation system is developed in Java and comprises a segmentation framework and a user interface. The user interface is presented to users as web pages, through which users log in and register. Users who are not logged in can only browse the website and cannot use the segmentation function. Logged-in users can submit the text to be segmented by copying and pasting it, or by uploading it as a txt file. The segmentation result can likewise be obtained in two ways: by copying it or by downloading a txt file.
4. Implementation results
4.1 Improved segmentation accuracy and recall
The full text of a clean edition of the ancient text of 《Treatise on Febrile Diseases》 was used as the test text, and a segmentation test was run against the original Ansj program for comparison. The results show that the recall and accuracy of the ancient TCM literature segmentation system are far higher than those of the original Ansj program with its system dictionary. TCM proper nouns in the test text, such as Taiyin disease, sweating, aversion to wind and slow pulse, could not be recognized or correctly segmented by the original Ansj program and system dictionary, whereas the ancient TCM literature segmentation system recognized and segmented them accurately.
Table 3. Comparison of segmentation results
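A comparison of this kind is usually scored by matching predicted word boundaries against a hand-segmented reference. The sketch below shows one common way to do that; it assumes that "accuracy" in this document denotes precision in the precision/recall sense, and the sample segmentations are invented for illustration, not taken from Table 3.

```python
def spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def evaluate(predicted, gold):
    """Return (precision, recall, F1) of a predicted segmentation."""
    p, g = spans(predicted), spans(gold)
    correct = len(p & g)  # words whose both boundaries match the reference
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: the prediction wrongly splits one reference word in two.
gold = ["发热", "汗出", "恶风"]
predicted = ["发热", "汗", "出", "恶风"]
precision, recall, f1 = evaluate(predicted, gold)
```

Comparing spans rather than word strings ensures that a word only counts as correct when it appears at the same position in both segmentations.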
4.2 TCM-specific part-of-speech tagging
On the basis of accurate segmentation, accurate domain-specific part-of-speech tagging is achieved. As shown in Table 3, "Taiyin disease" and "apoplexy" are correctly tagged FCbm, indicating that these words are TCM disease names; "fever", "sweating" and "slow pulse" are tagged FCzz, indicating that these words are TCM symptom names. This is of great significance for the statistical analysis and knowledge discovery carried out in later text mining.
4.3 Simple operation and good portability
The TCM literature segmentation system is developed in Java; the code is highly readable, easy to extend and easy to modify. The system includes user login, registration and permission control; users who are not logged in can only browse the website and cannot use the segmentation function. The interface is friendly and easy to use, with user-friendly prompts.
The foregoing describes only preferred embodiments of the application and is not intended to limit it. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within its scope of protection.

Claims (10)

1. A word segmentation and part-of-speech tagging method for ancient TCM literature, characterized by comprising:
Step (1): Build a TCM word segmentation dictionary;
Step (2): Use the TCM word segmentation dictionary to segment the text to be segmented and tag its parts of speech;
Step (3): Determine whether the whole text to be segmented was segmented successfully; output directly the results that were segmented successfully;
Step (4): For text that failed segmentation, segment it again using the Ansj dictionary; obtain the final segmentation result.
2. The word segmentation and part-of-speech tagging method for ancient TCM literature according to claim 1, characterized in that step (1), building the TCM word segmentation dictionary, comprises:
Step (101): Build a TCM professional term dictionary;
Step (102): Classify and label the parts of speech of the words in the TCM professional term dictionary;
Step (103): Build the TCM word segmentation dictionary using a three-column dictionary construction method.
3. The word segmentation and part-of-speech tagging method for ancient TCM literature according to claim 2, characterized in that step (101), building the TCM professional term dictionary, comprises:
Extracting TCM professional terms from ancient TCM literature and TCM dictionaries.
4. The word segmentation and part-of-speech tagging method for ancient TCM literature according to claim 3, characterized in that the TCM professional terms include: Chinese medicinal names, prescription names, TCM book titles, physician names, TCM disease and symptom names, TCM efficacy names, acupoint names, TCM dosage names, classical Chinese vocabulary, and specialized vocabulary from modern medicine.
5. The word segmentation and part-of-speech tagging method for ancient TCM literature according to claim 2, characterized in that step (102), classifying the parts of speech of the words in the TCM professional term dictionary, comprises:
With reference to the disease part, syndrome part or therapy part of 《National Standard of the People's Republic of China: TCM Clinical Diagnosis and Treatment Terminology》, and in combination with the characteristics of TCM terminology, dividing TCM nouns into several part-of-speech classes and building a 14-class part-of-speech table, the 14 classes being: 1. TCM theoretical basis, 2. TCM diagnostics, 3. Chinese medicinal nouns, 4. prescription nouns, 5. cold damage (shanghan) and warm disease, 6. TCM therapeutic principles, 7. TCM treatment methods, 8. TCM and related disciplines, 9. TCM books, 10. TCM institutions, equipment or medical and health personnel, 11. personal reference words, 12. geographic names, 13. season and time words, 14. other words;
Each class is divided into several levels of subclasses, and according to the level of the part of speech, the TCM nouns in the dictionary are classified and labeled in order from low to high.
6. The word segmentation and part-of-speech tagging method for ancient TCM literature according to claim 2, characterized in that
in step (103) the TCM word segmentation dictionary is built using a three-column dictionary construction method, the dictionary having three columns: column 1 holds the TCM professional word; column 2 holds the part-of-speech classification letters; column 3 holds the part-of-speech level mark.
7. The word segmentation and part-of-speech indexing method for ancient TCM book documents as claimed in claim 1, characterized in that step (2) comprises:
Step (201): extracting keywords from the text to be segmented using a bag-of-words model;
Step (202): training a conditional random field (CRF) model on the existing words in the TCM word segmentation dictionary, using the CRF model to discover new words, and adding the new words to the TCM word segmentation dictionary;
Step (203): building a double-array Trie from all existing words in the word segmentation dictionary;
Step (204): performing single-string pattern matching between the keywords extracted from the text to be segmented and the double-array Trie, and segmenting the currently extracted keywords with the double-array Trie to obtain the word segmentation result;
Step (205): training a hidden Markov model (HMM): taking each existing word in the word segmentation dictionary as the observation state sequence and the part of speech of each word as the hidden state sequence, training the hidden Markov model to obtain a trained hidden Markov model;
Step (206): performing part-of-speech tagging with the trained hidden Markov model: the word sequence in the word segmentation result obtained in step (204) is input to the trained hidden Markov model as the observation state sequence, and the Viterbi algorithm generates the hidden state sequence for the current observation state sequence; the resulting hidden states are the parts of speech of the text to be segmented, which completes the part-of-speech tagging.
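Steps (203) to (206) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: a plain nested-dict trie stands in for the double-array Trie, matching is forward maximum matching, the HMM is estimated from a tiny hand-tagged sample with add-one smoothing, and the tag letters `c` and `t` are hypothetical. The bag-of-words and CRF steps (201) and (202) are omitted.

```python
import math
from collections import defaultdict


def build_trie(words):
    """Plain nested-dict trie (stand-in for the double-array Trie of step 203)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = True  # marks a complete dictionary word
    return root


def segment(text, trie):
    """Forward maximum matching; unknown characters become one-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        node, longest = trie, i + 1
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if "$end" in node:
                longest = j + 1
        tokens.append(text[i:longest])
        i = longest
    return tokens


def train_hmm(tagged_sentences):
    """Count start, transition and emission events from (word, tag) sequences."""
    start = defaultdict(int)
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        prev = None
        for word, tag in sentence:
            emit[tag][word] += 1
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            prev = tag
    return start, trans, emit


def log_prob(count, total, n_outcomes):
    """Add-one smoothed log probability (a smoothing choice assumed here)."""
    return math.log((count + 1) / (total + n_outcomes))


def viterbi(words, start, trans, emit):
    """Return the most probable tag sequence for a non-empty word list."""
    tags = sorted(emit)
    vocab_size = len({w for t in emit for w in emit[t]}) + 1  # +1 for unseen words
    n_start = sum(start.values())

    def e(tag, word):
        return log_prob(emit[tag].get(word, 0), sum(emit[tag].values()), vocab_size)

    def a(prev, tag):
        return log_prob(trans[prev].get(tag, 0), sum(trans[prev].values()), len(tags))

    score = {t: log_prob(start.get(t, 0), n_start, len(tags)) + e(t, words[0]) for t in tags}
    back = []
    for word in words[1:]:
        new_score, pointers = {}, {}
        for t in tags:
            best = max(tags, key=lambda p: score[p] + a(p, t))
            new_score[t] = score[best] + a(best, t) + e(t, word)
            pointers[t] = best
        score = new_score
        back.append(pointers)
    path = [max(tags, key=lambda t: score[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical training sample; tags "c" and "t" are placeholders.
tagged = [[("麻黄", "c"), ("发汗", "t")], [("桂枝", "c"), ("解肌", "t")]]
trie = build_trie(w for sentence in tagged for w, _ in sentence)
words = segment("麻黄发汗", trie)
tags = viterbi(words, *train_hmm(tagged))
```

A double-array Trie stores the same transitions in two flat integer arrays for compactness and speed; the nested-dict version above only reproduces the matching behavior.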
8. The word segmentation and part-of-speech indexing method for ancient TCM book documents as claimed in claim 1, characterized in that
step (3) judges whether the text to be segmented has been segmented successfully in full, the criterion being:
if every word segmentation result carries a part-of-speech tagging letter, the segmentation has succeeded; otherwise, the segmentation has failed.
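The success criterion amounts to a simple check over the tagged output. The sketch below assumes the results are available as `(word, pos_letter)` pairs and treats an empty result as a failure; both are assumptions, and the function name is hypothetical.

```python
def segmentation_succeeded(tagged_tokens):
    """Return True only if every (word, pos_letter) pair has a non-empty letter."""
    return bool(tagged_tokens) and all(letter for _, letter in tagged_tokens)


# Hypothetical results; "c" and "t" are placeholder part-of-speech letters.
ok = segmentation_succeeded([("麻黄", "c"), ("发汗", "t")])
failed = segmentation_succeeded([("麻黄", "c"), ("某词", "")])
```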
9. A word segmentation and part-of-speech indexing system for ancient TCM book documents, characterized by comprising: a memory, a processor, and computer instructions stored in the memory and run on the processor, wherein the computer instructions, when run by the processor, complete the steps of any one of claims 1-8.
10. A computer-readable storage medium, characterized in that computer instructions are stored thereon which, when run by a processor, complete the steps of any one of claims 1-8.
CN201810233868.4A 2018-03-21 2018-03-21 Chinese medicine ancient book document word segmentation and part of speech indexing method and system Active CN108509419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810233868.4A CN108509419B (en) 2018-03-21 2018-03-21 Chinese medicine ancient book document word segmentation and part of speech indexing method and system

Publications (2)

Publication Number Publication Date
CN108509419A true CN108509419A (en) 2018-09-07
CN108509419B CN108509419B (en) 2022-02-22

Family

ID=63377776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810233868.4A Active CN108509419B (en) 2018-03-21 2018-03-21 Chinese medicine ancient book document word segmentation and part of speech indexing method and system

Country Status (1)

Country Link
CN (1) CN108509419B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1731395A (en) * 2005-08-18 2006-02-08 山东中医药大学 Chinese medicine ancient document database
CN101295295A (en) * 2008-06-13 2008-10-29 中国科学院计算技术研究所 Chinese language lexical analysis method based on linear model
CN101539907A (en) * 2008-03-19 2009-09-23 日电(中国)有限公司 Part-of-speech tagging model training device and part-of-speech tagging system and method thereof
CN101950309A (en) * 2010-10-08 2011-01-19 华中师范大学 Subject area-oriented method for recognizing new specialized vocabulary
CN102314507A (en) * 2011-09-08 2012-01-11 北京航空航天大学 Recognition ambiguity resolution method of Chinese named entity
CN102541865A (en) * 2010-12-15 2012-07-04 盛乐信息技术(上海)有限公司 Method for improving word segmentation property by using new words identified in word segmentation process
CN103365992A (en) * 2013-07-03 2013-10-23 深圳市华傲数据技术有限公司 Method for realizing dictionary search of Trie tree based on one-dimensional linear space
US9053089B2 (en) * 2007-10-02 2015-06-09 Apple Inc. Part-of-speech tagging using latent analogy
CN104933152A (en) * 2015-06-24 2015-09-23 北京京东尚科信息技术有限公司 Named entity recognition method and device
CN105426358A (en) * 2015-11-09 2016-03-23 中国农业大学 Automatic disease noun identification method
CN106682220A (en) * 2017-01-04 2017-05-17 华南理工大学 Online traditional Chinese medicine text named entity identifying method based on deep learning
CN107092674A (en) * 2017-04-14 2017-08-25 福建工程学院 Method and system for automatic extraction of event trigger words in the TCM acupuncture field
CN107169078A (en) * 2017-05-10 2017-09-15 京东方科技集团股份有限公司 TCM knowledge graph, method for building it, and computer system
CN107179085A (en) * 2016-03-10 2017-09-19 中国科学院地理科学与资源研究所 Conditional random field map-matching method for sparse floating car data
CN107562834A (en) * 2017-08-23 2018-01-09 四川长虹电器股份有限公司 Method for standardized extraction of geographic locations

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
YANG HAIFENG 等: "Applicability of commonly used Chinese word segmentation software in the field of TCM text and literature research", 《WORLD SCIENCE AND TECHNOLOGY-TCM MODERNIZATION》 *
ZHOU X. 等: "Text mining for traditional Chinese medical knowledge discovery: A survey", 《JOURNAL OF BIOMEDICAL INFORMATICS》 *
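刘凯: "Research on named entity extraction from TCM medical records based on conditional random fields", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
蒋建洪 等: "Research and application of a Chinese word segmentation model combining dictionary and statistical methods", 《计算机工程与设计》 *
韩雅丽 等: "Discussion of the current state of research on TCM literature informatization from a bibliometric perspective", 《世界科学技术-中医药现代化》 *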

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488497B (en) * 2019-01-25 2023-05-12 北京沃东天骏信息技术有限公司 Similarity determination method and device for character string set, terminal and readable medium
CN111488497A (en) * 2019-01-25 2020-08-04 北京沃东天骏信息技术有限公司 Similarity determination method and device for character string set, terminal and readable medium
CN109829159B (en) * 2019-01-29 2020-02-18 南京师范大学 Integrated automatic lexical analysis method and system for ancient Chinese text
CN109829159A (en) * 2019-01-29 2019-05-31 南京师范大学 Integrated automatic lexical analysis method and system for ancient Chinese text
CN110134766A (en) * 2019-05-09 2019-08-16 北京科技大学 Word segmentation method and device for traditional Chinese medical ancient book documents
CN110134766B (en) * 2019-05-09 2021-06-25 北京科技大学 Word segmentation method and device for traditional Chinese medical ancient book documents
CN110675962A (en) * 2019-09-10 2020-01-10 电子科技大学 Traditional Chinese medicine pharmacological action identification method and system based on machine learning and text rules
CN111104801A (en) * 2019-12-26 2020-05-05 济南大学 Text word segmentation method, system, device and medium based on website domain name
CN111104801B (en) * 2019-12-26 2023-09-26 济南大学 Text word segmentation method, system, equipment and medium based on website domain name
CN111814464A (en) * 2020-05-25 2020-10-23 清华大学 Part-of-speech tagging method based on hidden Markov model
CN113033193A (en) * 2021-01-20 2021-06-25 山谷网安科技股份有限公司 Mixed Chinese text word segmentation method based on C++ language
CN113033193B (en) * 2021-01-20 2024-04-16 山谷网安科技股份有限公司 Mixed Chinese text word segmentation method based on C++ language
CN113377965A (en) * 2021-06-30 2021-09-10 中国农业银行股份有限公司 Method and related device for sensing text keywords
CN113377965B (en) * 2021-06-30 2024-02-23 中国农业银行股份有限公司 Method and related device for sensing text keywords

Also Published As

Publication number Publication date
CN108509419B (en) 2022-02-22

Similar Documents

Publication Publication Date Title
CN108509419A (en) Ancient TCM books document participle and part of speech indexing method and system
CN105894088B (en) Based on deep learning and distributed semantic feature medical information extraction system and method
CN109670179B (en) Medical record text named entity identification method based on iterative expansion convolutional neural network
Névéol et al. CLEF eHealth 2018 Multilingual Information Extraction Task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian.
CN110838368B (en) Active inquiry robot based on traditional Chinese medicine clinical knowledge map
CN109190113B (en) Knowledge graph construction method of traditional Chinese medicine theory book
CN108549639A (en) Method and system for name recognition in TCM medical cases based on multi-feature template correction
CN108628824A (en) Entity recognition method based on Chinese electronic health records
CN107368547A (en) Intelligent medical automatic question-answering method based on deep learning
CN107391906A (en) Health diet knowledge network construction method based on neural network and graph structure
CN106980609A (en) Named entity recognition method using conditional random fields based on word vector representation
CN104484845B (en) Disease autoanalysis platform based on medical information ontology database
Siddharth et al. Evaluating the impact of Idea-Inspire 4.0 on analogical transfer of concepts
Barhoom et al. Sarcasm Detection in Headline News using Machine and Deep Learning Algorithms
Ji et al. A BILSTM-CRF method to Chinese electronic medical record named entity recognition
Steinert Assyrian and Babylonian scholarly text catalogues: medicine, magic and divination
CN109215798B (en) Knowledge base construction method for traditional Chinese medicine ancient languages
CN112949308A (en) Method and system for identifying named entities of Chinese electronic medical record based on functional structure
Oleynik et al. HPI-DHC at TREC 2018 Precision Medicine Track.
CN106886565A (en) A kind of basic house type auto-polymerization method
CN106933802B (en) Multi-data-source-oriented social security entity identification method and device
Wang et al. Research on named entity recognition of doctor-patient question answering community based on bilstm-crf model
CN110543630B (en) Method and device for generating text structured representation and computer storage medium
CN106354715A (en) Method and device for medical word processing
Cao The common prescription patterns based on the hierarchical clustering of herb-pairs efficacies

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant