CN108509419A

CN108509419A - Method and system for word segmentation and part-of-speech tagging of ancient TCM literature

Info

Publication number: CN108509419A
Application number: CN201810233868.4A
Authority: CN
Inventors: 付先军; 李学博; 王振国; 陈晓康; 桑晓明; 鞠芳凝; 周扬; 陈聪; 邵欣欣
Original assignee: Shandong University of Traditional Chinese Medicine
Current assignee: Shandong University of Traditional Chinese Medicine
Priority date: 2018-03-21
Filing date: 2018-03-21
Publication date: 2018-09-07
Anticipated expiration: 2038-03-21
Also published as: CN108509419B

Abstract

The invention discloses a method and system for word segmentation and part-of-speech tagging of ancient traditional Chinese medicine (TCM) literature. The method includes: Step (1): building a TCM word segmentation dictionary; Step (2): using the TCM word segmentation dictionary to perform word segmentation and part-of-speech tagging on the text to be segmented; Step (3): judging whether all of the text to be segmented has been segmented successfully, and directly outputting the results that were segmented successfully; Step (4): for text whose segmentation failed, performing word segmentation again using the ansj dictionary to obtain the final word segmentation result.

Description

Method and system for word segmentation and part-of-speech tagging of ancient TCM literature

Technical field

The present invention relates to a method and system for word segmentation and part-of-speech tagging of ancient TCM literature.

Background art

Literature is vital to human civilization and social progress and is the basis of all scientific research. TCM literature is an important component of ancient Chinese literature and an important foundation for studying the clinical experience of ancient physicians. It not only integrates TCM knowledge of principles, methods, prescriptions, and medicinals, but also contains the academic thought and clinical medication experience accumulated over thousands of years of TCM development. Mining this valuable cultural heritage is an important prerequisite and basis for the academic inheritance and innovation of TCM. The modern interpretation of TCM theory and modern research on TCM diseases, therapies, and prescriptions all depend on the classical medical literature; for example, the discovery of "qinghaosu" (artemisinin) drew on inspiration from classical TCM literature such as the 《Handbook of Prescriptions for Emergencies》.

The collation and analysis of TCM literature is based on word segmentation and part-of-speech tagging. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification. At present, most research in China and abroad on the theory, methods, and techniques of Chinese word segmentation is still at the theoretical or experimental stage and is oriented toward natural language processing and information retrieval; mature, usable Chinese word segmentation software is scarce. No software or method specifically for TCM word segmentation and part-of-speech tagging has been reported. Because of the particularity of TCM terminology, general-purpose Chinese word segmentation software achieves low accuracy and recall on TCM literature. The best reported result, for the Pangu segmenter on TCM literature, is an accuracy of only 0.735 and a recall of only 0.663. For other Chinese word segmentation systems the accuracy, recall, and combined F1 score are even below 0.5 (PHP Analysis, for example, has an accuracy of only 0.312 and a recall of only 0.369), and none of them can apply specific part-of-speech tags to the specialized terms of TCM. This greatly restricts the use and mining of TCM literature. Moreover, most of this software requires environment configuration and has particular system requirements, so it is poorly portable and not easy to operate.

Therefore, building a TCM literature word segmentation and part-of-speech tagging system and method that suits the characteristics of TCM literature, achieves high accuracy and recall, and can perform part-of-speech tagging that matches the characteristics of TCM terminology would break through the major technical bottleneck currently restricting the mining of TCM literature and knowledge discovery. It is of great significance for the inheritance and innovation of TCM and for bringing its original advantages into play.

Summary of the invention

The object of the present invention is to provide a method and system for word segmentation and part-of-speech tagging of ancient TCM literature that improve the accuracy and recall of word segmentation of ancient TCM literature and can perform part-of-speech tagging that matches the characteristics of TCM terminology. This solves the problems that current automatic Chinese word segmentation has low accuracy and recall on TCM literature and cannot perform TCM-specific part-of-speech tagging. Applying the system to segment and tag the text of the 《Treatise on Febrile Diseases》 showed that it achieves higher accuracy and recall than general-purpose automatic Chinese word segmentation, and that its part-of-speech tagging of the 《Treatise on Febrile Diseases》 is very close to the level of a professional.

The first aspect of the present invention provides a method for word segmentation and part-of-speech tagging of ancient TCM literature.

The method for word segmentation and part-of-speech tagging of ancient TCM literature includes:

Step (1): building a TCM word segmentation dictionary;

Step (2): using the TCM word segmentation dictionary to perform word segmentation and part-of-speech tagging on the text to be segmented;

Step (3): judging whether all of the text to be segmented has been segmented successfully, and directly outputting the results that were segmented successfully;

Step (4): for text whose segmentation failed, performing word segmentation again using the ansj dictionary to obtain the final word segmentation result.

Further, step (1), building the TCM word segmentation dictionary, comprises:

Step (101): building a TCM professional term dictionary;

Step (102): classifying and labeling the parts of speech of the words in the TCM professional term dictionary;

Step (103): building the TCM word segmentation dictionary using a three-column dictionary construction method.

Further, step (101), building the TCM professional term dictionary, comprises:

Extracting TCM professional terms from ancient TCM literature and TCM dictionaries.

The TCM professional terms include: names of Chinese medicinals, prescription names, titles of TCM books, physicians' names, names of TCM diseases and symptoms, names of TCM effects, acupoint names, TCM dosage names, classical Chinese vocabulary, and specialized vocabulary from modern medicine.

Further, step (102), classifying the parts of speech of the words in the TCM professional term dictionary, comprises:

With reference to the disease part, syndrome part, or therapy part of the 《National Standard of the People's Republic of China: Clinical Terminology of TCM Diagnosis and Treatment》, and in combination with the characteristics of TCM terminology, TCM nouns are divided into several classes of parts of speech and a 14-class part-of-speech table is built. The 14 classes are: 1. basic theory of TCM, 2. TCM diagnostic methods, 3. Chinese medicinal nouns, 4. prescription nouns, 5. cold damage (shanghan) and warm disease, 6. TCM treatment principles, 7. TCM treatment methods, 8. TCM and related disciplines, 9. TCM books, 10. TCM institutions, equipment, or medical and health personnel, 11. personal-reference words, 12. geographic names, 13. season and time words, 14. other words. Each class is divided into several levels of subclasses, and, according to the level of the part of speech, the TCM nouns in the dictionary are classified and labeled in order from low to high.

For example, TCM diagnostic methods include the four examinations as a subclass; the four examinations include inspection, listening and smelling, inquiry, and palpation; inspection includes tongue diagnosis; tongue diagnosis includes the tongue manifestation; the tongue manifestation includes the tongue coating and the tongue body; and the tongue coating includes coating color and coating texture. There are up to 7 levels of subclasses.
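As an illustration of the multi-level subclass structure, the following minimal sketch stores the example chain above as a child-to-parent map and computes the level of a category by walking up to its top-level class. The English category names and the identifiers `PARENT` and `level` are hypothetical, and counting the top-level class as level 1 is an assumption, since the text does not say where counting starts.

```python
# Hypothetical child -> parent map for the example chain in the text.
PARENT = {
    "four examinations": "TCM diagnostic methods",
    "inspection": "four examinations",
    "listening and smelling": "four examinations",
    "inquiry": "four examinations",
    "palpation": "four examinations",
    "tongue diagnosis": "inspection",
    "tongue manifestation": "tongue diagnosis",
    "tongue coating": "tongue manifestation",
    "tongue body": "tongue manifestation",
    "coating color": "tongue coating",
    "coating texture": "tongue coating",
}


def level(category: str) -> int:
    """Return the depth of a category, counting the top-level class as 1 (assumption)."""
    depth = 1
    while category in PARENT:
        category = PARENT[category]
        depth += 1
    return depth


print(level("TCM diagnostic methods"))  # 1
print(level("coating color"))           # 7
```

Under this counting convention the deepest category in the example sits at level 7, which matches the stated maximum depth.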

Further, step (103) builds the TCM word segmentation dictionary using the three-column dictionary construction method. The TCM word segmentation dictionary has three columns:

Column 1 is the TCM professional word, such as poor thieves, zhusha anshen pills, etc.;

Column 2 is the part-of-speech classification letters. For example, zhusha anshen pills belongs to the "tranquilizing prescription with heavy material" category within the prescription class, so its part-of-speech classification letters are FCzzasj;

Column 3 is the part-of-speech level label. For example, "tranquilizing prescription with heavy material" within the prescription class is at level 4 of the classification, so it is labeled 4.
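A minimal sketch of reading such a three-column dictionary is given below. The tab-separated, one-entry-per-line file layout and the names `Entry` and `parse_dictionary` are assumptions made for illustration; the source only specifies what the three columns contain.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    word: str   # column 1: TCM professional word
    tag: str    # column 2: part-of-speech classification letters
    level: int  # column 3: part-of-speech level label


def parse_dictionary(text: str) -> dict[str, Entry]:
    """Parse tab-separated dictionary lines (assumed layout) into entries keyed by word."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, tag, level = line.split("\t")
        entries[word] = Entry(word, tag, int(level))
    return entries


# The only row taken from the text; real dictionaries would hold many rows.
SAMPLE = "zhusha anshen pills\tFCzzasj\t4\n"
entries = parse_dictionary(SAMPLE)
print(entries["zhusha anshen pills"].tag)    # FCzzasj
print(entries["zhusha anshen pills"].level)  # 4
```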

Further, step (2) comprises:

Step (201): extracting keywords from the text to be segmented using a bag-of-words model;

Step (202): training a conditional random field (CRF) model on the existing words in the TCM word segmentation dictionary, using the CRF model to discover new words, and adding the new words to the TCM word segmentation dictionary;

Step (203): building a double-array Trie tree from all existing words in the word segmentation dictionary;

Step (204): matching the keywords extracted from the text to be segmented against the double-array Trie tree by single-string pattern matching, and segmenting the currently extracted keywords with the double-array Trie tree to obtain the word segmentation result;

Step (205): training a hidden Markov model: taking each existing word in the word segmentation dictionary as the observation state sequence and the part of speech of each word as the hidden state sequence, training the hidden Markov model to obtain a trained hidden Markov model;

Step (206): performing part-of-speech tagging with the trained hidden Markov model: the word sequence in the word segmentation result obtained in step (204) is input to the trained hidden Markov model as the observation state sequence, and the Viterbi algorithm produces the hidden state sequence for the current observation state sequence. The resulting hidden states are the parts of speech of the text to be segmented, which completes the part-of-speech tagging.

Further, in step (3), the criterion for judging whether all of the text to be segmented has been segmented successfully is:

If every word segmentation result carries part-of-speech tag letters, segmentation has succeeded; otherwise, segmentation has failed.
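The success criterion and the second pass of step (4) can be sketched as follows. The token representation, the names `is_success`, `two_pass`, and `toy_fallback`, and the whitespace-splitting fallback are all hypothetical stand-ins; in the described system the second pass uses the ansj dictionary, which is not reproduced here.

```python
from typing import Callable, Optional

Token = tuple[str, Optional[str]]  # (text, part-of-speech tag letters or None)


def is_success(token: Token) -> bool:
    """A fragment counts as successfully segmented only if it carries a tag."""
    return token[1] is not None


def two_pass(first_pass: list[Token],
             fallback: Callable[[str], list[Token]]) -> list[Token]:
    """Keep tagged tokens as they are; re-segment untagged fragments with `fallback`."""
    result = []
    for token in first_pass:
        if is_success(token):
            result.append(token)
        else:
            result.extend(fallback(token[0]))
    return result


def toy_fallback(text: str) -> list[Token]:
    """Hypothetical stand-in for the second pass: split on spaces, tag as 'n'."""
    return [(piece, "n") for piece in text.split()]


first = [("Taiyin disease", "FCbm"), ("is called", None), ("fever", "FCzz")]
print(two_pass(first, toy_fallback))
# [('Taiyin disease', 'FCbm'), ('is', 'n'), ('called', 'n'), ('fever', 'FCzz')]
```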

The second aspect of the present invention provides a system for word segmentation and part-of-speech tagging of ancient TCM literature.

The system for word segmentation and part-of-speech tagging of ancient TCM literature includes: a memory, a processor, and computer instructions stored in the memory and run on the processor. When the computer instructions are run by the processor, the steps of any of the above methods are completed.

The third aspect of the present invention provides a computer-readable storage medium.

A computer-readable storage medium on which computer instructions run; when the computer instructions are run by a processor, the steps of any of the above methods are completed.

Compared with the prior art, the beneficial effects of the present invention are:

The recall and accuracy of the present invention for word segmentation of ancient TCM literature are far higher than those of the prior art.

The present invention realizes TCM-specific part-of-speech tagging for the first time, providing a basis for the mining of TCM literature and knowledge discovery.

The two rounds of word segmentation in the present invention ensure the completeness and accuracy of the word segmentation result.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the application. The illustrative embodiments of the application and their descriptions are used to explain the application and do not constitute an improper limitation of it.

Fig. 1 is a flow chart of the method of the present invention.

Detailed description

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which this application belongs.

As shown in Figure 1, the method for word segmentation and part-of-speech tagging of ancient TCM literature includes:

Step (101): building a TCM professional term dictionary;

1. Construction of the TCM word segmentation dictionary

1.1 Construction of the TCM professional term dictionary

One of the main reasons that current general-purpose Chinese word segmentation software segments TCM text poorly is its weak ability to recognize terms such as TCM syndromes, channels and collaterals, and acupoints, so this system first builds a comprehensive TCM term dictionary. Using web crawlers, artificial neural networks, and manual proofreading, extraction, and standardization, a special dictionary covering TCM professional terms such as names of Chinese medicinals and prescription names was extracted and built from ancient TCM literature and various TCM dictionaries. It contains 155,343 TCM-related words and is currently the TCM professional term dictionary with the largest number of entries.

Table 1. Composition of the TCM word segmentation dictionary

1.2 TCM-specific part-of-speech tagging method

Part-of-speech tagging (POS tagging), also called word-class tagging or simply tagging, is the procedure of assigning a correct part of speech to each word in the word segmentation result. In general, current part-of-speech tagging is mostly the process of determining whether each word is a noun, verb, adjective, or other part of speech. Such tags are of little significance for the text mining and analysis of TCM literature. For this reason, combining the characteristics of the TCM field and following the classification of the TCM theoretical system, we divide TCM nouns into 14 classes and 818 parts of speech: basic theory of TCM, TCM diagnostic methods, Chinese medicinals, prescription-related nouns, cold damage (shanghan) and warm disease, treatment principles, treatment methods, TCM-related disciplines, TCM books, TCM institutions, TCM instruments and equipment, medical and health personnel titles, geographic names, and others.

A first-order hidden Markov model is used. In this model the hidden states are the 818 parts of speech and the displayed states are the 818 letter abbreviations; to distinguish them from general part-of-speech tags, each abbreviation is prefixed with FC.

At the same time, according to the level of the part of speech, words are tagged as far as possible in priority order from low to high.

Table 2. Composition of TCM professional parts of speech (partial)

1.3 Construction and extension of the TCM word segmentation dictionary

The word segmentation dictionary is the core of this system and has an important influence on both the accuracy and the speed of word segmentation. Based on the TCM professional term dictionary and the part-of-speech tagging method above, this system uses a 3-column dictionary construction method: column 1 is the TCM professional term, column 2 is the part-of-speech tag letters, and column 3 is the level label.

1.4 Trie (dictionary tree) construction process

(1) Create the root node root and set base[root] = 1.

(2) Find the child node set {root.children_i} (i = 1...n) of root and set check[root.children_i] = base[root] = 1.

(3) For each element in root.children:

1) Find {element.children_i} (i = 1...n). Note that if a character is at the end of a string, its child nodes include an empty node whose code value is set to 0. Then find a value begin_i such that every check[begin_i + element.children_i.code] = 0.

2) Set base[element.children_i] = begin_i.

3) Recursively execute step (3) on element.children_i. If the traversal reaches an element that has no children, i.e. a leaf node, set base[element] to a negative value.
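The construction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by the system: Python dicts stand in for the `base` and `check` arrays, the character code is assumed to be `ord(ch) + 1` so that 0 stays reserved for the empty end-of-word node, and each `begin` value is assumed to be used by only one node so that `check` identifies the parent. The names `build`, `contains`, `code`, and `END` are hypothetical.

```python
END = 0  # code of the empty node that marks the end of a word


def code(ch: str) -> int:
    """Map a character to a positive code; 0 is reserved for the end marker."""
    return ord(ch) + 1


def build(words):
    """Build sparse base/check tables for a double-array trie over `words`."""
    root = {}  # ordinary nested trie keyed by code, used only during construction
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(code(ch), {})
        node[END] = {}
    base, check = {0: 1}, {}  # the root is state 0 and base[root] = 1
    used = {1}                # begin values already taken
    for c in root:            # claim the slots of the root's children
        check[1 + c] = 1

    def place(node, state):
        for c, child in node.items():
            t = base[state] + c          # slot of this child
            if not child:                # end marker: leaf, negative base
                base[t] = -1
                continue
            begin = 1                    # find a begin whose child slots are all free
            while begin in used or any((begin + cc) in check for cc in child):
                begin += 1
            used.add(begin)
            base[t] = begin
            for cc in child:
                check[begin + cc] = begin
            place(child, t)

    place(root, 0)
    return base, check


def contains(base, check, word):
    """Return True if `word` was inserted as a complete word."""
    state = 0
    for ch in word:
        t = base[state] + code(ch)
        if check.get(t) != base[state]:
            return False
        state = t
    return check.get(base[state] + END) == base[state]


base, check = build(["tongue", "tongue coating", "pulse"])
print(contains(base, check, "tongue coating"))  # True
print(contains(base, check, "tong"))            # False
```

The linear search for `begin` keeps the sketch short; practical implementations track the next free slot to avoid rescanning.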

2. TCM literature word segmentation algorithm and part-of-speech tagging

The core algorithm of this word segmentation system is the open-source code of Ansj, a Java Chinese word segmentation tool based on the ICTCLAS Chinese word segmentation algorithm of the Chinese Academy of Sciences. Its segmentation accuracy is higher than that of other commonly used open-source segmentation tools (such as mmseg4j).

On this basis, the TCM professional dictionary that we built ourselves replaces the default dictionary, the Ansj dictionary is used as a supplement, and part-of-speech tagging is performed based on an HMM.

3. Construction and use of the TCM literature word segmentation and part-of-speech tagging service system

The word segmentation system for ancient TCM literature is developed in Java and comprises a segmentation framework and a user interface. The user interface is presented as web pages, through which users log in and register. Users who are not logged in can only browse the website and cannot use the segmentation function. Logged-in users can submit text to be segmented either by copying and pasting it or by uploading a txt file, and can obtain the segmentation result in two ways: by copying it or by downloading a txt file.

4. Implementation results

4.1 Improved segmentation accuracy and recall

Using the full text of the 《Treatise on Febrile Diseases》 as the test text and the original Ansj program as the comparison, a segmentation test was carried out. The results show that the recall and accuracy of the ancient TCM literature word segmentation system are far higher than those of the original Ansj program with its system dictionary. TCM proper nouns in the test text, such as Taiyin disease, sweating, aversion to wind, and slow pulse, could not be recognized or correctly segmented by the original Ansj program and system dictionary, whereas the ancient TCM literature word segmentation system recognized and segmented them accurately.

Table 3. Comparison of segmentation results
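The source does not state how accuracy and recall were computed. A common way to score a segmenter, shown here as an assumption, is to compare the character spans of predicted words against those of a reference segmentation, treating the "accuracy rate" as precision. The names `spans` and `evaluate` and the toy strings are hypothetical.

```python
def spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    result, start = set(), 0
    for w in words:
        result.add((start, start + len(w)))
        start += len(w)
    return result


def evaluate(gold, predicted):
    """Return (precision, recall, F1) of predicted word spans against gold spans."""
    g, p = spans(gold), spans(predicted)
    correct = len(g & p)
    precision = correct / len(p)
    recall = correct / len(g)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy example: the reference has 3 words; the prediction wrongly splits the middle one.
gold = ["ab", "cd", "e"]
predicted = ["ab", "c", "d", "e"]
precision, recall, f1 = evaluate(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.5 0.667 0.571
```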

4.2 Realization of TCM-specific part-of-speech tagging

On the basis of accurate segmentation, accurate domain-specific part-of-speech tagging is achieved. As shown in Table 3, "Taiyin disease" and "apoplexy" are accurately tagged FCbm, indicating that these words are TCM disease names; "fever", "sweating", and "slow pulse" are tagged FCzz, indicating that these words are symptom names in TCM. This is of great significance for statistical analysis and knowledge discovery in later text mining.

4.3 Simple operation and good portability

The word segmentation system for ancient TCM literature is developed in Java, which makes it readable, easy to extend, and easy to modify. The system includes user login, registration, and permission control; users who are not logged in can only browse the website and cannot use the segmentation function. The system interface is friendly and easy to use, with user-friendly prompts.

The foregoing describes only preferred embodiments of the application and is not intended to limit it. For those skilled in the art, the application may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within its scope of protection.

Claims

1. A method for word segmentation and part-of-speech tagging of ancient TCM literature, characterized by including:

Step (3): judging whether all of the text to be segmented has been segmented successfully, and directly outputting the results that were segmented successfully;

Step (4): for text whose segmentation failed, performing word segmentation again using the ansj dictionary to obtain the final word segmentation result.

2. The method for word segmentation and part-of-speech tagging of ancient TCM literature as described in claim 1, characterized in that step (1), building the TCM word segmentation dictionary, comprises:

Step (101): building a TCM professional term dictionary;

3. The method for word segmentation and part-of-speech tagging of ancient TCM literature as claimed in claim 2, characterized in that step (101), building the TCM professional term dictionary, comprises:

Extracting TCM professional terms from ancient TCM literature and TCM dictionaries.

4. The method for word segmentation and part-of-speech tagging of ancient TCM literature as claimed in claim 3, characterized in that the TCM professional terms include: names of Chinese medicinals, prescription names, titles of TCM books, physicians' names, names of TCM diseases and symptoms, names of TCM effects, acupoint names, TCM dosage names, classical Chinese vocabulary, and specialized vocabulary from modern medicine.

5. The method for word segmentation and part-of-speech tagging of ancient TCM literature as claimed in claim 2, characterized in that step (102), classifying the parts of speech of the words in the TCM professional term dictionary, comprises:

With reference to the disease part, syndrome part, or therapy part of the 《National Standard of the People's Republic of China: Clinical Terminology of TCM Diagnosis and Treatment》, and in combination with the characteristics of TCM terminology, TCM nouns are divided into several classes of parts of speech and a 14-class part-of-speech table is built. The 14 classes are: 1. basic theory of TCM, 2. TCM diagnostic methods, 3. Chinese medicinal nouns, 4. prescription nouns, 5. cold damage (shanghan) and warm disease, 6. TCM treatment principles, 7. TCM treatment methods, 8. TCM and related disciplines, 9. TCM books, 10. TCM institutions, equipment, or medical and health personnel, 11. personal-reference words, 12. geographic names, 13. season and time words, 14. other words;

Each class is divided into several levels of subclasses, and, according to the level of the part of speech, the TCM nouns in the dictionary are classified and labeled in order from low to high.

6. The method for word segmentation and part-of-speech tagging of ancient TCM literature as claimed in claim 2, characterized in that

Step (103) builds the TCM word segmentation dictionary using a three-column dictionary construction method. The TCM word segmentation dictionary has three columns: column 1 is the TCM professional word; column 2 is the part-of-speech classification letters; column 3 is the part-of-speech level label.

7. The method for word segmentation and part-of-speech tagging of ancient TCM literature as described in claim 1, characterized in that step (2) comprises:

Step (202): training a conditional random field (CRF) model on the existing words in the TCM word segmentation dictionary, using the CRF model to discover new words, and adding the new words to the TCM word segmentation dictionary;

Step (204): matching the keywords extracted from the text to be segmented against the double-array Trie tree by single-string pattern matching, and segmenting the currently extracted keywords with the double-array Trie tree to obtain the word segmentation result;

Step (205): training a hidden Markov model: taking each existing word in the word segmentation dictionary as the observation state sequence and the part of speech of each word as the hidden state sequence, training the hidden Markov model to obtain a trained hidden Markov model;

Step (206): performing part-of-speech tagging with the trained hidden Markov model: the word sequence in the word segmentation result obtained in step (204) is input to the trained hidden Markov model as the observation state sequence, and the Viterbi algorithm produces the hidden state sequence for the current observation state sequence. The resulting hidden states are the parts of speech of the text to be segmented, which completes the part-of-speech tagging.

8. The method for word segmentation and part-of-speech tagging of ancient TCM literature as described in claim 1, characterized in that

In step (3), the criterion for judging whether all of the text to be segmented has been segmented successfully is:

9. A system for word segmentation and part-of-speech tagging of ancient TCM literature, characterized by including: a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of any one of claims 1-8 are completed.

10. A computer-readable storage medium, characterized in that computer instructions run on it; when the computer instructions are run by a processor, the steps of any one of claims 1-8 are completed.