CN108509419A - Word segmentation and part-of-speech tagging method and system for ancient Traditional Chinese Medicine (TCM) literature - Google Patents
- Publication number
- CN108509419A (application number CN201810233868.4A)
- Authority
- CN
- China
- Prior art keywords
- chinese medicine
- speech
- traditional chinese
- dictionary
- word segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H70/00—ICT specially adapted for the handling or processing of medical references
- G16H70/40—ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Public Health (AREA)
- Medical Informatics (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Primary Health Care (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Epidemiology (AREA)
- Toxicology (AREA)
- Pharmacology & Pharmacy (AREA)
- Medicinal Chemistry (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Chemical & Material Sciences (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Pathology (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a word segmentation and part-of-speech (POS) tagging method and system for ancient TCM literature. The method comprises: Step (1): building a TCM word segmentation dictionary; Step (2): using the TCM word segmentation dictionary to segment the text to be segmented and to tag its parts of speech; Step (3): determining whether all of the text to be segmented was segmented successfully, and directly outputting the results that were segmented successfully; Step (4): for text whose segmentation failed, performing word segmentation again using the ansj dictionary, to obtain the final word segmentation result.
Description
Technical field
The present invention relates to a word segmentation and part-of-speech tagging method and system for ancient TCM literature.
Background art
Literature is vital to human civilization and social progress and is the foundation of all scientific research. TCM literature is an important component of ancient Chinese literature and an important basis for studying the clinical medication experience of ancient physicians. It contains not only TCM knowledge of theory, methods, prescriptions and medicines, but also the academic thought and clinical medication experience accumulated over thousands of years of TCM development. Mining this valuable cultural heritage is an important prerequisite and foundation for the academic inheritance and innovation of TCM. The modern interpretation of TCM theory and modern research on TCM diseases, therapies and prescriptions all depend on the classical medical texts; the discovery of artemisinin ("qinghaosu"), for example, drew on inspiration from TCM classics such as 《Handbook of Prescriptions for Emergencies》.
The organization and analysis of TCM literature is based on word segmentation and part-of-speech tagging. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a given specification. At present, most research in China and abroad on the theory, methods and techniques of Chinese word segmentation is still at the theoretical or experimental stage and is oriented toward natural language processing and information retrieval; mature, usable Chinese word segmentation software is scarce. Software and methods specifically for TCM word segmentation and part-of-speech tagging have not been reported. Because of the particularity of TCM terminology, general-purpose Chinese word segmentation software achieves low accuracy and recall on TCM literature. The best reported result, Pan Gu segmentation applied to TCM literature, reaches an accuracy of only 0.735 and a recall of only 0.663. For other Chinese word segmentation systems the accuracy, recall and combined F1 score even fall below 0.5; PHP Analysis, for example, has an accuracy of only 0.312 and a recall of only 0.369. None of these tools can apply specialized part-of-speech tags to TCM terms. This greatly restricts the use and mining of TCM literature. In addition, most of this software requires environment configuration, places particular requirements on the system, is poorly portable and is not easy to operate.
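The F1 figures implied by the numbers above can be checked with a few lines of arithmetic. The sketch below assumes that the "accuracy" quoted for each tool is what is usually called precision, and that F1 is the standard harmonic mean of precision and recall; the function name `f1_score` is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the usual F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the background section (accuracy treated as precision).
pangu_f1 = f1_score(0.735, 0.663)
php_analysis_f1 = f1_score(0.312, 0.369)

print(round(pangu_f1, 3), round(php_analysis_f1, 3))  # prints: 0.697 0.338
```

Under these assumptions the Pan Gu result corresponds to an F1 of roughly 0.70, while the PHP Analysis result stays well below 0.5, which is consistent with the statement above.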
Therefore, building a word segmentation and part-of-speech tagging system and method that suits the characteristics of TCM literature, achieves high accuracy and recall, and can apply part-of-speech tags that match the characteristics of TCM terminology would break through the major technical bottleneck that currently restricts TCM literature mining and knowledge discovery. It is of great significance for the inheritance and innovation of TCM and for bringing the original advantages of TCM into play.
Summary of the invention
The object of the present invention is to provide a word segmentation and part-of-speech tagging method and system for ancient TCM literature that improves the accuracy and recall of word segmentation of ancient TCM literature and can apply part-of-speech tags that match the characteristics of TCM terminology. It solves the problem that current Chinese word segmentation systems have low accuracy and recall on TCM literature and cannot perform TCM-specific part-of-speech tagging. Applying the system to segment and tag the text of 《The Treatise on Febrile Diseases》 showed that it achieves higher accuracy and recall than general-purpose Chinese word segmentation systems, and that its part-of-speech tagging of that text comes very close to the level of a professional.
The first aspect of the present invention provides a word segmentation and part-of-speech tagging method for ancient TCM literature.
The method comprises:
Step (1): Build a TCM word segmentation dictionary;
Step (2): Use the TCM word segmentation dictionary to segment the text to be segmented and to tag its parts of speech;
Step (3): Determine whether all of the text to be segmented was segmented successfully; directly output the results that were segmented successfully;
Step (4): For text whose segmentation failed, perform word segmentation again using the ansj dictionary, to obtain the final word segmentation result.
Further, building the TCM word segmentation dictionary in step (1) comprises:
Step (101): Build a TCM terminology dictionary;
Step (102): Classify the words in the TCM terminology dictionary by part of speech and label them;
Step (103): Build the TCM word segmentation dictionary using a three-column dictionary construction method.
Further, building the TCM terminology dictionary in step (101) comprises:
Extracting TCM terms from ancient TCM literature and TCM dictionaries.
The TCM terms include: Chinese medicine names, prescription names, TCM book titles, physicians' names, TCM disease and symptom names, TCM efficacy names, acupoint names, TCM dosage names, classical Chinese vocabulary, and specialized vocabulary from modern medicine.
Further, classifying the words in the TCM terminology dictionary by part of speech in step (102) comprises:
With reference to the disease part, syndrome part or therapy part of 《National Standard of the People's Republic of China: TCM Clinical Diagnosis and Treatment Terminology》, and in combination with the characteristics of TCM terminology, TCM terms are divided into several classes of parts of speech and a 14-class part-of-speech table is built. The 14 classes are: 1. TCM basic theory, 2. TCM diagnostics, 3. Chinese medicine terms, 4. prescription terms, 5. shanghan (cold damage) and warm diseases, 6. TCM treatment principles, 7. TCM treatment methods, 8. TCM and related disciplines, 9. TCM books, 10. TCM institutions, equipment or medical and health personnel, 11. words referring to persons, 12. geographic names, 13. season and time words, 14. other words. According to the level of each part of speech, the TCM terms in the dictionary are classified and labeled in order from low level to high.
Each class is divided into several levels of subclasses. For example, TCM diagnostics includes the subclass of the four diagnostic methods; the four diagnostic methods include inspection, listening and smelling, inquiry, and palpation; inspection includes tongue diagnosis; tongue diagnosis includes tongue manifestation; tongue manifestation includes tongue coating and tongue body; and tongue coating includes coating colour and coating texture. There are at most 7 levels of subclasses.
Further, in step (103) the TCM word segmentation dictionary is built using a three-column dictionary construction method. The dictionary has three columns:
Column 1 holds the TCM term, such as poor thieves or zhusha anshen pills;
Column 2 holds the part-of-speech class letter code; for example, zhusha anshen pills belong to the tranquilizing prescriptions with heavy material within the prescription class, so their part-of-speech class letter code is FCzzasj;
Column 3 holds the level of the part-of-speech class; for example, tranquilizing prescriptions with heavy material sit at level 4 of the prescription classification and are marked 4.
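The three-column layout can be illustrated with a small parser. This is a minimal sketch only: the tab-separated file format, the `DictEntry` type and the `parse_dictionary` function are assumptions made for illustration, since the description does not specify how the dictionary is stored.

```python
from typing import Dict, Iterable, NamedTuple


class DictEntry(NamedTuple):
    term: str       # column 1: TCM term
    pos_code: str   # column 2: part-of-speech class letter code, e.g. "FCzzasj"
    level: int      # column 3: level of that class in the classification


def parse_dictionary(lines: Iterable[str]) -> Dict[str, DictEntry]:
    """Parse tab-separated three-column lines into a term -> entry mapping."""
    entries: Dict[str, DictEntry] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        term, pos_code, level = line.split("\t")
        entries[term] = DictEntry(term, pos_code, int(level))
    return entries


# The single example row given in the description (assumed tab-separated).
sample_lines = ["zhusha anshen pills\tFCzzasj\t4"]
entries = parse_dictionary(sample_lines)
print(entries["zhusha anshen pills"])
```

Keeping the level as an integer in its own column makes it straightforward to sort or filter entries by classification depth when labels are assigned from low level to high.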
Further, step (2) comprises:
Step (201): Extract keywords from the text to be segmented using a bag-of-words model;
Step (202): Train a conditional random field (CRF) model on the existing words in the TCM word segmentation dictionary, use the CRF model to discover new words, and add the new words to the TCM word segmentation dictionary;
Step (203): Build a double-array Trie from all existing words in the word segmentation dictionary;
Step (204): Match the keywords extracted from the text to be segmented against the double-array Trie by single-pattern string matching, and segment the currently extracted keywords with the double-array Trie to obtain the word segmentation result;
Step (205): Train a hidden Markov model (HMM): take each existing word in the word segmentation dictionary as the observation sequence and the part of speech of each word as the hidden state sequence, and train the HMM to obtain a trained HMM;
Step (206): Tag parts of speech with the trained HMM: feed the word sequence of the segmentation result obtained in step (204) into the trained HMM as the observation sequence, and use the Viterbi algorithm to generate the hidden state sequence for that observation sequence. The resulting hidden states are the parts of speech of the text to be segmented, which completes the part-of-speech tagging.
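Step (206) can be sketched as a standard Viterbi decoder over a first-order HMM. The code below is an illustration under stated assumptions, not the patented implementation: the tags FCbm (TCM disease name) and FCzz (TCM symptom name) come from the implementation results reported later in this description, while the start, transition and emission probabilities are invented toy values, and the `floor` smoothing for unseen events is an assumption.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely hidden-state (POS tag) sequence for the words."""
    def logp(p):
        return math.log(max(p, floor))  # floor avoids log(0) for unseen events

    if not observations:
        return []
    # best[t][s] = (log-probability of the best path ending in s, previous state)
    first = observations[0]
    best = [{s: (logp(start_p.get(s, 0)) + logp(emit_p[s].get(first, 0)), None)
             for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + logp(trans_p[p].get(s, 0))) for p in states),
                key=lambda item: item[1],
            )
            row[s] = (score + logp(emit_p[s].get(obs, 0)), prev)
        best.append(row)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    path.reverse()
    return path


# Toy parameters (hypothetical values, for illustration only).
states = ["FCbm", "FCzz"]
start_p = {"FCbm": 0.6, "FCzz": 0.4}
trans_p = {"FCbm": {"FCbm": 0.2, "FCzz": 0.8},
           "FCzz": {"FCbm": 0.3, "FCzz": 0.7}}
emit_p = {"FCbm": {"Taiyin disease": 0.9, "fever": 0.1},
          "FCzz": {"fever": 0.5, "sweating": 0.5}}

tags = viterbi(["Taiyin disease", "fever", "sweating"],
               states, start_p, trans_p, emit_p)
print(tags)  # prints: ['FCbm', 'FCzz', 'FCzz']
```

Working in log space keeps the scores numerically stable for long word sequences, which matters when the tag set is as large as the 818 parts of speech described here.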
Further, in step (3) the criterion for judging whether all of the text to be segmented was segmented successfully is:
If every word in the segmentation result carries a part-of-speech tag letter code, segmentation succeeded; otherwise, segmentation failed.
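The success check of step (3) and the second pass of step (4) can be sketched as follows. This is one possible reading, in which each untagged fragment is handed individually to a fallback segmenter; `toy_primary` and `toy_fallback` are hypothetical stand-ins and do not call the real Ansj library.

```python
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, Optional[str]]  # (word, POS letter code or None)


def segment_two_pass(text: str,
                     primary: Callable[[str], List[Token]],
                     fallback: Callable[[str], List[Token]]) -> List[Token]:
    """Keep tagged tokens from the first pass; re-segment untagged fragments."""
    result: List[Token] = []
    for word, tag in primary(text):
        if tag:                            # success: token carries a POS code
            result.append((word, tag))
        else:                              # failure: second pass on this fragment
            result.extend(fallback(word))
    return result


# Hypothetical stand-ins for the TCM-dictionary pass and the general-purpose pass.
TOY_TCM_DICT = {"sweating": "FCzz"}


def toy_primary(text: str) -> List[Token]:
    return [(w, TOY_TCM_DICT.get(w)) for w in text.split()]


def toy_fallback(fragment: str) -> List[Token]:
    return [(fragment, "n")]               # pretend general-purpose tag


tokens = segment_two_pass("sweating today", toy_primary, toy_fallback)
print(tokens)  # prints: [('sweating', 'FCzz'), ('today', 'n')]
```

Passing the two segmenters in as callables keeps the success criterion independent of any particular segmentation library.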
The second aspect of the present invention provides a word segmentation and part-of-speech tagging system for ancient TCM literature.
The system comprises a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of any of the above methods are completed.
The third aspect of the present invention provides a computer-readable storage medium.
A computer-readable storage medium stores computer instructions; when the computer instructions are run by a processor, the steps of any of the above methods are completed.
Compared with the prior art, the beneficial effects of the present invention are:
The recall and accuracy of the present invention in segmenting ancient TCM literature are far higher than those of the prior art.
The present invention realizes TCM-specific part-of-speech tagging for the first time, providing a basis for TCM literature mining and knowledge discovery.
The two-pass word segmentation of the present invention ensures the completeness and accuracy of the word segmentation result.
Description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the application. The illustrative embodiments of the application and their descriptions are used to explain the application and do not constitute an improper limitation on it.
Fig. 1 is a flow chart of the method of the present invention.
Detailed description
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the technical field to which this application belongs.
As shown in Fig. 1, the word segmentation and part-of-speech tagging method for ancient TCM literature comprises:
Step (1): Build a TCM word segmentation dictionary;
Step (2): Use the TCM word segmentation dictionary to segment the text to be segmented and to tag its parts of speech;
Step (3): Determine whether all of the text to be segmented was segmented successfully; directly output the results that were segmented successfully;
Step (4): For text whose segmentation failed, perform word segmentation again using the ansj dictionary, to obtain the final word segmentation result.
Further, building the TCM word segmentation dictionary in step (1) comprises:
Step (101): Build a TCM terminology dictionary;
Step (102): Classify the words in the TCM terminology dictionary by part of speech and label them;
Step (103): Build the TCM word segmentation dictionary using a three-column dictionary construction method.
1. Construction of the TCM word segmentation dictionary
1.1 Construction of the TCM terminology dictionary
One of the main reasons current general-purpose Chinese word segmentation software segments TCM text poorly is its weak ability to recognize terms such as TCM syndromes, meridians and collaterals, and acupoints. This system therefore first builds a comprehensive TCM terminology dictionary. Using web crawlers, an artificial neural network, and manual correction, extraction and standardization, a specialized dictionary covering TCM terms such as Chinese medicine names and prescription names was extracted and compiled from ancient TCM literature and various TCM dictionaries. It contains 155,343 TCM-related terms and is currently the TCM terminology dictionary with the largest number of entries.
Table 1. Composition of the TCM word segmentation dictionary
1.2 TCM-specific part-of-speech tagging method
Part-of-speech tagging (POS tagging), also called tagging for short, is the procedure of assigning the correct part of speech to each word in a word segmentation result. In general, current part-of-speech tagging mostly determines whether each word is a noun, verb, adjective or other part of speech. Such tags are of limited value for text mining and analysis of TCM literature. We therefore combined the characteristics of the TCM field with the classification method of the TCM theoretical system and divided TCM terms into 14 classes and 818 parts of speech: TCM basic theory, TCM diagnostics, Chinese medicines, prescription-related terms, shanghan (cold damage) and warm diseases, treatment principles, treatment methods, TCM-related disciplines, TCM books, TCM institutions, TCM instruments and equipment, medical and health personnel titles, geographic names, and others.
A first-order hidden Markov model is used. In this model the hidden states are the 818 parts of speech and the displayed states are 818 letter abbreviations, each prefixed with FC to distinguish it from general part-of-speech tags.
In addition, according to the level of each part of speech, words are labeled as far as possible in priority order from low level to high.
Table 2. Composition of TCM-specific parts of speech (partial)
1.3 Construction and extension of the TCM word segmentation dictionary
The word segmentation dictionary is the core of this system and strongly affects both the accuracy and the speed of word segmentation. Based on the TCM terminology dictionary and part-of-speech tagging method above, this system uses a three-column dictionary construction method: column 1 is the TCM term, column 2 is the part-of-speech tag letter code, and column 3 is the classification level.
1.4 Trie (dictionary tree) construction process
(1) Create the root node root and set base[root] = 1.
(2) Find the set of child nodes {root.children_i} (i = 1...n) of root, such that check[root.children_i] = base[root] = 1.
(3) For each element in root.children:
1) Find {element.children_i} (i = 1...n). Note that if a character is at the end of a string, its child nodes include an empty node whose code value is set to 0. Then find a value begin_i such that every check[begin_i + element.children_i.code] = 0.
2) Set base[element.children_i] = begin_i.
3) Recursively execute step (3) on element.children_i. If an element reached during traversal has no children, that is, it is a leaf node, set base[element] to a negative value.
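The construction steps above can be sketched as a compact double-array Trie. This is a simplified illustration under several assumptions: character codes are taken as the Unicode code point plus 1 so that code 0 is reserved for the empty end-of-word node, free base values are found by a plain linear search, and the forward maximum matching in `segment` is one possible way to use the Trie for segmentation. The class and function names are hypothetical and are not taken from the Ansj source.

```python
class DoubleArrayTrie:
    """Double-array Trie in which check[slot] stores the parent's base value."""

    def __init__(self, words):
        self.keys = sorted({w for w in words if w})
        self.base = [0]     # slot 0 is the root
        self.check = [0]    # 0 marks a free slot
        self.used = set()   # base values already handed out (must stay unique)
        if self.keys:
            children = self._fetch(0, 0, len(self.keys))
            self.base[0] = self._insert(0, children)  # the first search yields 1

    def _fetch(self, depth, left, right):
        """Group keys[left:right] by character code at depth (0 = end of word)."""
        siblings = []
        for i in range(left, right):
            key = self.keys[i]
            code = ord(key[depth]) + 1 if depth < len(key) else 0
            if siblings and siblings[-1][0] == code:
                siblings[-1][2] = i + 1
            else:
                siblings.append([code, i, i + 1])
        return siblings

    def _is_free(self, slot):
        return slot >= len(self.check) or self.check[slot] == 0

    def _insert(self, depth, siblings):
        """Place one group of sibling nodes, then recurse into their children."""
        begin = 1
        while begin in self.used or not all(
                self._is_free(begin + code) for code, _, _ in siblings):
            begin += 1
        self.used.add(begin)
        size = begin + siblings[-1][0] + 1
        if size > len(self.check):
            grow = size - len(self.check)
            self.base.extend([0] * grow)
            self.check.extend([0] * grow)
        for code, _, _ in siblings:        # claim every slot before recursing
            self.check[begin + code] = begin
        for code, left, right in siblings:
            if code == 0:                  # empty end-of-word node: negative base
                self.base[begin] = -left - 1
            else:
                children = self._fetch(depth + 1, left, right)
                self.base[begin + code] = self._insert(depth + 1, children)
        return begin

    def longest_prefix(self, text, start=0):
        """Length of the longest dictionary word starting at text[start], or 0."""
        b, best = self.base[0], 0
        for i in range(start, len(text)):
            p = b + ord(text[i]) + 1
            if p >= len(self.check) or self.check[p] != b:
                break
            b = self.base[p]
            # A word ends here if the end-of-word child (code 0) exists at slot b.
            if b < len(self.check) and self.check[b] == b and self.base[b] < 0:
                best = i - start + 1
        return best

    def __contains__(self, word):
        return bool(word) and self.longest_prefix(word) == len(word)


def segment(text, trie):
    """Forward maximum matching; unmatched characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        n = trie.longest_prefix(text, i) or 1
        tokens.append(text[i:i + n])
        i += n
    return tokens


trie = DoubleArrayTrie(["sweat", "sweating", "fever"])
print(segment("sweatingfever!", trie))  # prints: ['sweating', 'fever', '!']
```

Each lookup step costs one addition and one comparison against the check array, which is why a double-array layout is attractive when the dictionary holds well over a hundred thousand terms.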
2. Word segmentation algorithm and part-of-speech tagging for TCM literature
The core algorithm of this word segmentation system is the open-source code of Ansj, a Java Chinese word segmentation tool based on the ICTCLAS Chinese word segmentation algorithm of the Chinese Academy of Sciences. Its segmentation accuracy is higher than that of other commonly used open-source segmentation tools such as mmseg4j.
On this basis, the TCM dictionary that we built replaces the default dictionary, the Ansj dictionary is used as a supplement, and part-of-speech tagging is performed with an HMM.
3. Construction and use of the TCM literature word segmentation and part-of-speech tagging service system
The TCM literature word segmentation system is developed in Java and comprises a segmentation framework and a user interface. The user interface is presented as web pages through which users log in and register. Users who are not logged in can only browse the website and cannot use the segmentation function. Logged-in users can submit text for segmentation either by copying and pasting it or by uploading a txt file. Segmentation results can likewise be retrieved in two ways: by copying or by downloading a txt file.
4. Implementation results
4.1 Improved segmentation accuracy and recall
The full text of a clean edition of 《The Treatise on Febrile Diseases》 was used as the test text, and segmentation tests were run with the original Ansj program as the comparison. The results show that the recall and accuracy of the ancient TCM literature word segmentation system are far higher than those of the Ansj source program with its system dictionary. TCM proper nouns in the test text such as Taiyin disease, sweating, aversion to wind and slow pulse cannot be recognized or correctly segmented by the Ansj source program and system dictionary, whereas the ancient TCM literature word segmentation system recognizes and segments them accurately.
Table 3. Comparison of segmentation results
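The description does not state how segmentation accuracy and recall were computed. A common convention, assumed here, compares the character spans of predicted words with those of a reference segmentation. The following sketch uses made-up toy data and hypothetical function names.

```python
def to_spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall(reference_words, predicted_words):
    """Span-based precision and recall of a predicted segmentation."""
    reference = to_spans(reference_words)
    predicted = to_spans(predicted_words)
    correct = len(reference & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Toy data: the reference has 3 words, the prediction splits the last one in two.
precision, recall = precision_recall(["ab", "c", "de"], ["ab", "c", "d", "e"])
print(precision, round(recall, 3))  # prints: 0.5 0.667
```

A predicted word counts as correct only when both of its boundaries agree with the reference, so over-splitting a term lowers precision and recall at the same time.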
4.2 TCM-specific part-of-speech tagging
On the basis of accurate segmentation, accurate domain-specific part-of-speech tagging is achieved. As shown in Table 3, "Taiyin disease" and "apoplexy" are accurately tagged FCbm, indicating that the word is a TCM disease name; "fever", "sweating" and "slow pulse" are tagged FCzz, indicating that these words are TCM symptom names. This is of great significance for later statistical analysis and knowledge discovery in text mining.
4.3 Simple operation and good portability
The TCM literature word segmentation system is developed in Java; the code is readable and easy to extend and modify. The system includes user login, registration and permission control; users who are not logged in can only browse the website and cannot use the segmentation function. The interface is friendly and easy to use, with user-friendly prompts.
The foregoing are merely preferred embodiments of the application and are not intended to limit it. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the application shall fall within its scope of protection.
Claims (10)
1. A word segmentation and part-of-speech tagging method for ancient TCM literature, characterized by comprising:
Step (1): Building a TCM word segmentation dictionary;
Step (2): Using the TCM word segmentation dictionary to segment the text to be segmented and to tag its parts of speech;
Step (3): Determining whether all of the text to be segmented was segmented successfully; directly outputting the results that were segmented successfully;
Step (4): For text whose segmentation failed, performing word segmentation again using the ansj dictionary, to obtain the final word segmentation result.
2. The word segmentation and part-of-speech tagging method for ancient TCM literature according to claim 1, characterized in that building the TCM word segmentation dictionary in step (1) comprises:
Step (101): Building a TCM terminology dictionary;
Step (102): Classifying the words in the TCM terminology dictionary by part of speech and labeling them;
Step (103): Building the TCM word segmentation dictionary using a three-column dictionary construction method.
3. The word segmentation and part-of-speech tagging method for ancient TCM literature according to claim 2, characterized in that building the TCM terminology dictionary in step (101) comprises:
Extracting TCM terms from ancient TCM literature and TCM dictionaries.
4. The word segmentation and part-of-speech tagging method for ancient TCM literature according to claim 3, characterized in that the TCM terms include: Chinese medicine names, prescription names, TCM book titles, physicians' names, TCM disease and symptom names, TCM efficacy names, acupoint names, TCM dosage names, classical Chinese vocabulary, and specialized vocabulary from modern medicine.
5. The word segmentation and part-of-speech tagging method for ancient TCM literature according to claim 2, characterized in that classifying the words in the TCM terminology dictionary by part of speech in step (102) comprises:
With reference to the disease part, syndrome part or therapy part of 《National Standard of the People's Republic of China: TCM Clinical Diagnosis and Treatment Terminology》, and in combination with the characteristics of TCM terminology, dividing TCM terms into several classes of parts of speech and building a 14-class part-of-speech table, the 14 classes being: 1. TCM basic theory, 2. TCM diagnostics, 3. Chinese medicine terms, 4. prescription terms, 5. shanghan (cold damage) and warm diseases, 6. TCM treatment principles, 7. TCM treatment methods, 8. TCM and related disciplines, 9. TCM books, 10. TCM institutions, equipment or medical and health personnel, 11. words referring to persons, 12. geographic names, 13. season and time words, 14. other words;
Each class is divided into several levels of subclasses, and according to the level of each part of speech, the TCM terms in the dictionary are classified and labeled in order from low level to high.
6. The word segmentation and part-of-speech tagging method for ancient TCM literature according to claim 2, characterized in that
in step (103) the TCM word segmentation dictionary is built using a three-column dictionary construction method, and the TCM word segmentation dictionary has three columns: column 1 holds the TCM term; column 2 holds the part-of-speech class letter code; column 3 holds the level of the part-of-speech class.
7. The word segmentation and part-of-speech tagging method for ancient TCM literature according to claim 1, characterized in that step (2) comprises:
Step (201): Extracting keywords from the text to be segmented using a bag-of-words model;
Step (202): Training a conditional random field (CRF) model on the existing words in the TCM word segmentation dictionary, using the CRF model to discover new words, and adding the new words to the TCM word segmentation dictionary;
Step (203): Building a double-array Trie from all existing words in the word segmentation dictionary;
Step (204): Matching the keywords extracted from the text to be segmented against the double-array Trie by single-pattern string matching, and segmenting the currently extracted keywords with the double-array Trie to obtain the word segmentation result;
Step (205): Training a hidden Markov model (HMM): taking each existing word in the word segmentation dictionary as the observation sequence and the part of speech of each word as the hidden state sequence, and training the HMM to obtain a trained HMM;
Step (206): Tagging parts of speech with the trained HMM: feeding the word sequence of the segmentation result obtained in step (204) into the trained HMM as the observation sequence, and using the Viterbi algorithm to generate the hidden state sequence for that observation sequence, the resulting hidden states being the parts of speech of the text to be segmented, thereby completing the part-of-speech tagging.
8. The word segmentation and part-of-speech tagging method for ancient TCM literature according to claim 1, characterized in that
in step (3) the criterion for judging whether all of the text to be segmented was segmented successfully is:
If every word in the segmentation result carries a part-of-speech tag letter code, segmentation succeeded; otherwise, segmentation failed.
9. A word segmentation and part-of-speech tagging system for ancient TCM literature, characterized by comprising: a memory, a processor, and computer instructions stored in the memory and run on the processor, wherein when the computer instructions are run by the processor, the steps of any one of claims 1-8 are completed.
10. A computer-readable storage medium, characterized in that computer instructions are stored thereon, and when the computer instructions are run by a processor, the steps of any one of claims 1-8 are completed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810233868.4A CN108509419B (en) | 2018-03-21 | 2018-03-21 | Chinese medicine ancient book document word segmentation and part of speech indexing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108509419A (en) | 2018-09-07 |
CN108509419B (en) | 2022-02-22 |
Family
ID=63377776
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810233868.4A Active CN108509419B (en) | 2018-03-21 | 2018-03-21 | Chinese medicine ancient book document word segmentation and part of speech indexing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108509419B (en) |
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1731395A (en) * | 2005-08-18 | 2006-02-08 | 山东中医药大学 | Chinese medicine ancient document database |
US9053089B2 (en) * | 2007-10-02 | 2015-06-09 | Apple Inc. | Part-of-speech tagging using latent analogy |
CN101539907A (en) * | 2008-03-19 | 2009-09-23 | 日电(中国)有限公司 | Part-of-speech tagging model training device and part-of-speech tagging system and method thereof |
CN101295295A (en) * | 2008-06-13 | 2008-10-29 | 中国科学院计算技术研究所 | Chinese language lexical analysis method based on linear model |
CN101950309A (en) * | 2010-10-08 | 2011-01-19 | 华中师范大学 | Subject area-oriented method for recognizing new specialized vocabulary |
CN102541865A (en) * | 2010-12-15 | 2012-07-04 | 盛乐信息技术(上海)有限公司 | Method for improving word segmentation property by using new words identified in word segmentation process |
CN102314507A (en) * | 2011-09-08 | 2012-01-11 | 北京航空航天大学 | Recognition ambiguity resolution method of Chinese named entity |
CN103365992A (en) * | 2013-07-03 | 2013-10-23 | 深圳市华傲数据技术有限公司 | Method for realizing dictionary search of Trie tree based on one-dimensional linear space |
CN104933152A (en) * | 2015-06-24 | 2015-09-23 | 北京京东尚科信息技术有限公司 | Named entity recognition method and device |
CN105426358A (en) * | 2015-11-09 | 2016-03-23 | 中国农业大学 | Automatic disease noun identification method |
CN107179085A (en) * | 2016-03-10 | 2017-09-19 | 中国科学院地理科学与资源研究所 | A kind of condition random field map-matching method towards sparse floating car data |
CN106682220A (en) * | 2017-01-04 | 2017-05-17 | 华南理工大学 | Online traditional Chinese medicine text named entity identifying method based on deep learning |
CN107092674A (en) * | 2017-04-14 | 2017-08-25 | 福建工程学院 | The automatic abstracting method and system of a kind of Chinese medicine acupuncture field event trigger word |
CN107169078A (en) * | 2017-05-10 | 2017-09-15 | 京东方科技集团股份有限公司 | Knowledge of TCM collection of illustrative plates and its method for building up and computer system |
CN107562834A (en) * | 2017-08-23 | 2018-01-09 | 四川长虹电器股份有限公司 | The method of geographic location criteriaization extraction |
Non-Patent Citations (5)
Title |
---|
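YANG HAIFENG et al.: "Applicability of commonly used Chinese word segmentation software in the field of TCM text and literature research", 《WORLD SCIENCE AND TECHNOLOGY-TCM MODERNIZATION》 * |
ZHOU X. et al.: "Text mining for traditional Chinese medical knowledge discovery: A survey", 《JOURNAL OF BIOMEDICAL INFORMATICS》 * |
LIU KAI: "Research on a named entity extraction method for TCM medical records based on conditional random fields", 《China Master's Theses Full-text Database, Information Science and Technology》 * |
JIANG JIANHONG et al.: "Research and application of a Chinese word segmentation model combining dictionary and statistical methods", 《Computer Engineering and Design》 * |
HAN YALI et al.: "Discussion on the current state of TCM literature informatization research from a bibliometric perspective", 《WORLD SCIENCE AND TECHNOLOGY-TCM MODERNIZATION》 * |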
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111488497B (en) * | 2019-01-25 | 2023-05-12 | 北京沃东天骏信息技术有限公司 | Similarity determination method and device for character string set, terminal and readable medium |
CN111488497A (en) * | 2019-01-25 | 2020-08-04 | 北京沃东天骏信息技术有限公司 | Similarity determination method and device for character string set, terminal and readable medium |
CN109829159B (en) * | 2019-01-29 | 2020-02-18 | 南京师范大学 | Integrated automatic lexical analysis method and system for ancient Chinese text |
CN109829159A (en) * | 2019-01-29 | 2019-05-31 | 南京师范大学 | A kind of integrated automatic morphology analysis methods and system of archaic Chinese text |
CN110134766A (en) * | 2019-05-09 | 2019-08-16 | 北京科技大学 | A kind of segmenting method and device towards Chinese medical book document |
CN110134766B (en) * | 2019-05-09 | 2021-06-25 | 北京科技大学 | Word segmentation method and device for traditional Chinese medical ancient book documents |
CN110675962A (en) * | 2019-09-10 | 2020-01-10 | 电子科技大学 | Traditional Chinese medicine pharmacological action identification method and system based on machine learning and text rules |
CN111104801A (en) * | 2019-12-26 | 2020-05-05 | 济南大学 | Text word segmentation method, system, device and medium based on website domain name |
CN111104801B (en) * | 2019-12-26 | 2023-09-26 | 济南大学 | Text word segmentation method, system, equipment and medium based on website domain name |
CN111814464A (en) * | 2020-05-25 | 2020-10-23 | 清华大学 | Part-of-speech tagging method based on hidden Markov model |
CN113033193A (en) * | 2021-01-20 | 2021-06-25 | 山谷网安科技股份有限公司 | C++ language-based mixed Chinese text word segmentation method |
CN113033193B (en) * | 2021-01-20 | 2024-04-16 | 山谷网安科技股份有限公司 | Mixed Chinese text word segmentation method based on C++ language |
CN113377965A (en) * | 2021-06-30 | 2021-09-10 | 中国农业银行股份有限公司 | Method and related device for perceiving text keywords |
CN113377965B (en) * | 2021-06-30 | 2024-02-23 | 中国农业银行股份有限公司 | Method and related device for sensing text keywords |
Also Published As
Publication number | Publication date |
---|---|
CN108509419B (en) | 2022-02-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108509419A (en) | Ancient TCM books document participle and part of speech indexing method and system | |
CN105894088B (en) | Based on deep learning and distributed semantic feature medical information extraction system and method | |
CN109670179B (en) | Medical record text named entity identification method based on iterative expansion convolutional neural network | |
Névéol et al. | CLEF eHealth 2018 Multilingual Information Extraction Task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian. | |
CN110838368B (en) | Active inquiry robot based on traditional Chinese medicine clinical knowledge map | |
CN109190113B (en) | Knowledge graph construction method of traditional Chinese medicine theory book | |
CN109741806B (en) | Auxiliary generation method and device for medical image diagnosis report | |
CN108549639A (en) | Based on the modified Chinese medicine case name recognition methods of multiple features template and system | |
CN108628824A (en) | A kind of entity recognition method based on Chinese electronic health record | |
CN107368547A (en) | A kind of intelligent medical automatic question-answering method based on deep learning | |
CN111048167B (en) | Hierarchical case structuring method and system | |
JP7464800B2 (en) | METHOD AND SYSTEM FOR RECOGNITION OF MEDICAL EVENTS UNDER SMALL SAMPLE WEAKLY LABELING CONDITIONS - Patent application | |
Siddharth et al. | Evaluating the impact of Idea-Inspire 4.0 on analogical transfer of concepts | |
CN107391906A (en) | Health diet knowledge network construction method based on neutral net and collection of illustrative plates structure | |
Barhoom et al. | Sarcasm detection in headline news using machine and deep learning algorithms | |
CN106934220A (en) | Towards the disease class entity recognition method and device of multi-data source | |
Ji et al. | A BILSTM-CRF method to Chinese electronic medical record named entity recognition | |
CN106844351A (en) | A kind of medical institutions towards multi-data source organize class entity recognition method and device | |
Steinert | Assyrian and Babylonian scholarly text catalogues: medicine, magic and divination | |
CN105389470A (en) | Method for automatically extracting Traditional Chinese Medicine acupuncture entity relationship | |
CN112949308A (en) | Method and system for identifying named entities of Chinese electronic medical record based on functional structure | |
CN109215798B (en) | Knowledge base construction method for traditional Chinese medicine ancient languages | |
Oleynik et al. | HPI-DHC at TREC 2018 Precision Medicine Track. | |
CN106886565A (en) | A kind of basic house type auto-polymerization method | |
CN106933802B (en) | Multi-data-source-oriented social security entity identification method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||