CN110134766B - Word segmentation method and device for traditional Chinese medical ancient book documents - Google Patents


Publication number
CN110134766B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910384880.XA
Other languages
Chinese (zh)
Other versions
CN110134766A (en)
Inventor
谢永红
周越
张德政
阿孜古丽
栗辉
贾麒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Application filed by University of Science and Technology Beijing USTB
Priority to CN201910384880.XA
Publication of CN110134766A
Application granted
Publication of CN110134766B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The embodiment of the invention discloses a word segmentation method and device for ancient book documents of traditional Chinese medicine. The method comprises: preprocessing ancient book documents in the field of traditional Chinese medicine to generate a corpus for training a language model; training a language model on the corpus; performing unsupervised word segmentation on the ancient book documents by using the language model to generate a preliminary word segmentation result; summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, and compiling segmentation rules into a rule file; and performing a first correction on the preliminary word segmentation result according to the rules in the rule file to generate a first correction result.

Description

Word segmentation method and device for traditional Chinese medical ancient book documents
Technical Field
The invention relates to word segmentation of medical literature in the field of natural language processing, and in particular to a word segmentation method and device for ancient book documents of traditional Chinese medicine.
Background
Chinese word segmentation is a basic step in Chinese text processing. Unlike English and similar languages, Chinese sentences do not use spaces to separate words, so word segmentation is a key first step in Chinese information processing tasks such as text classification, information retrieval, information filtering, automatic document indexing and automatic abstract generation. The correctness of the word segmentation result directly affects the correctness of the subsequent task.
Traditional Chinese medicine originated in primitive society and has developed continuously ever since, accumulating a great number of ancient medical documents. These documents are large in number, complex in content and varied in type, covering the theory of essence and qi, the theory of yin and yang, the theory of the five elements, qi, blood and body fluids, visceral manifestation, meridians, constitution, etiology, pathogenesis, therapeutic principles, health preservation and so on. Most of them are written in classical or ancient Chinese, often as rhymed mnemonic verses, and their style and period of writing differ from those of modern Chinese. They also contain many proper nouns and technical terms of the traditional Chinese medicine field. Reasonable word segmentation of these ancient documents is the basis for structuring traditional Chinese medical knowledge, but at present no word segmenter is designed specifically for the field of traditional Chinese medicine, and general-purpose word segmenters do not handle ancient traditional Chinese medical documents well.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a word segmentation method and device for ancient Chinese medical literature, which can improve the accuracy of word segmentation of the literature in the field of traditional Chinese medicine.
A word segmentation method for ancient book documents of traditional Chinese medicine comprises the following steps:
preprocessing ancient book documents in the field of traditional Chinese medicine to generate a corpus for training a language model;
training a language model on the corpus;
performing unsupervised word segmentation on the ancient book documents by using the language model to generate a preliminary word segmentation result;
summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, and compiling segmentation rules into a rule file;
and performing a first correction on the preliminary word segmentation result according to the rules in the rule file to generate a first correction result.
The method further comprises the following steps:
acquiring terms of the traditional Chinese medicine field compiled from the ancient book documents as a word list;
and correcting the first correction result by using the word list to obtain a final word segmentation result.
The step of pre-processing the ancient book documents comprises:
acquiring the original text of the ancient book document, deleting the table of contents of the ancient book document from the original text, and deleting every sentence containing characters that cannot be represented in UTF-8, to generate a cleaned text;
and adding a space after each character in the cleaned text, the result being used as the corpus for training the language model.
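The preprocessing steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names are hypothetical, the table of contents is assumed to have been removed already, sentences are assumed to end at `。`, `！` or `？`, and a "character that cannot be represented in UTF-8" is modelled as anything for which `str.encode("utf-8")` fails (for example a lone surrogate left by a faulty decode).

```python
import re

# Assumption: a sentence ends at one of these marks.
SENTENCE_PATTERN = re.compile(r"[^。！？]*[。！？]|[^。！？]+")


def is_utf8_encodable(sentence):
    """Return True if the sentence can be encoded as UTF-8."""
    try:
        sentence.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def preprocess(text):
    """Drop sentences that are not UTF-8 encodable, then space out characters.

    Assumes the table of contents has already been deleted from `text`.
    """
    sentences = SENTENCE_PATTERN.findall(text)
    cleaned = "".join(s for s in sentences if is_utf8_encodable(s))
    # Joining with spaces separates every character into its own token.
    return " ".join(ch for ch in cleaned if not ch.isspace())


print(preprocess("气血。阴阳？"))
```

Joining with single spaces leaves no trailing space after the last character; for a tool that tokenizes on whitespace this should be equivalent to adding a space after every character.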
The step of performing unsupervised word segmentation on the ancient book documents by using the language model comprises:
dividing the states of a character into four types: the first, marked a, is a single-character word or the first character of a multi-character word; the second, marked b, is the second character of a multi-character word; the third, marked c, is the third character of a multi-character word; the fourth, marked d, is any remaining character of a multi-character word;
wherein a single-character word can only be followed by a single-character word or by the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of that word; the second character of a multi-character word can only be followed by the third character of that word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by a remaining character of that word, or by a single-character word or the first character of a multi-character word; a remaining character of a multi-character word can only be followed by another remaining character of that word, or by a single-character word or the first character of a multi-character word; all other transition probabilities are zero;
in summary, there are 8 non-zero state transition probabilities, namely:
p(a|a),p(b|a),p(c|b),p(a|b),p(a|c),p(d|c),p(a|d),p(d|d);
wherein p(x|y) denotes the conditional probability that the following character is in state x given that the preceding character is in state y: p(a|a) is the probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is a single-character word; p(b|a) is the probability that the following character is the second character of a multi-character word given that the preceding character is the first character of that word; p(c|b) is the probability that the following character is the third character given that the preceding character is the second character; p(a|b) is the probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is the second character of a multi-character word; p(a|c) is the same probability given that the preceding character is the third character of a multi-character word; p(d|c) is the probability that the following character is a remaining character of the multi-character word given that the preceding character is its third character; p(a|d) is the probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is a remaining character of a multi-character word; and p(d|d) is the probability that the following character is a remaining character given that the preceding character is also a remaining character;
setting different transition probabilities and comparing them experimentally to obtain the following transition probabilities:
p(a|a)=0.96,p(b|a)=0.2,p(c|b)=0.009,p(a|b)=0.9,p(a|c)=1,p(d|c)=0.005,
p(a|d)=1,p(d|d)=0.0001;
calculating conditional probabilities using the language model;
and finding the optimal path, namely the path with the maximum probability, by dynamic programming, and taking the optimal path as the segmentation to obtain the preliminary word segmentation result.
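The dynamic-programming search described above can be illustrated with a small Viterbi sketch in Python. The transition values are the ones listed in the text and are used as multiplicative weights exactly as given (they are not renormalized). Everything else is an assumption made for illustration: the function names are hypothetical, the per-character state scores (`emissions`) are taken as an input rather than computed from a real language model, and the first character is assumed to be in state a.

```python
import math

# Transition weights from the text, keyed as (previous state, next state).
TRANSITIONS = {
    ("a", "a"): 0.96, ("a", "b"): 0.2,
    ("b", "c"): 0.009, ("b", "a"): 0.9,
    ("c", "a"): 1.0, ("c", "d"): 0.005,
    ("d", "a"): 1.0, ("d", "d"): 0.0001,
}
LOG_TRANS = {pair: math.log(p) for pair, p in TRANSITIONS.items()}


def viterbi(emissions):
    """Return the highest-scoring state sequence.

    emissions: one dict per character mapping each of the states
    a, b, c, d to a log score (hypothetical input; in the method these
    scores come from the language model).
    """
    # Assumption: the first character always opens a word (state a).
    best = {"a": emissions[0]["a"]}
    backpointers = []
    for em in emissions[1:]:
        new_best, pointers = {}, {}
        for (prev, nxt), log_t in LOG_TRANS.items():
            if prev not in best:
                continue
            score = best[prev] + log_t + em[nxt]
            if nxt not in new_best or score > new_best[nxt]:
                new_best[nxt], pointers[nxt] = score, prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(best, key=best.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def states_to_words(text, states):
    """Start a new word at every character tagged a."""
    words = []
    for ch, st in zip(text, states):
        if st == "a" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words
```

Working in log space avoids numerical underflow on long texts; a word boundary is recovered wherever the decoded state is a.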
The step of calculating conditional probabilities using the language model comprises:
$$p(w) = p(z_1)\,p(z_2 \mid z_1)\,p(z_3 \mid z_2 z_1) \cdots p(z_k \mid z_{k-1} \cdots z_2 z_1) \tag{1}$$
where w is a word consisting of k characters and z_1, z_2, …, z_k are the 1st, 2nd, …, k-th characters of w. The likelihood that w is a word, i.e. the existence probability p(w) of w, is thereby converted into the probabilities of its constituent characters. The probability of each character is calculated by dividing the number of times the character occurs in the text used to train the language model by the total number of characters.
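Formula (1) can be made concrete with a small count-based sketch in Python. The single-character probability follows the text (occurrences divided by total characters). The conditional factors are estimated here by plain relative frequency of character sequences, which is an assumption for illustration only; a trained n-gram language model would normally apply smoothing. All function names are hypothetical.

```python
from collections import Counter


def char_probabilities(text):
    """Relative frequency of each character in the training text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def count_occurrences(text, piece):
    """Count (possibly overlapping) occurrences of piece in text."""
    n = len(piece)
    return sum(1 for i in range(len(text) - n + 1) if text[i:i + n] == piece)


def word_probability(text, word):
    """Chain-rule product of formula (1) with count-based conditionals."""
    prob = char_probabilities(text).get(word[0], 0.0)
    for k in range(2, len(word) + 1):
        context = count_occurrences(text, word[:k - 1])
        if context == 0:
            return 0.0
        # Estimate p(z_k | preceding characters) as count(prefix + z_k) / count(prefix).
        prob *= count_occurrences(text, word[:k]) / context
    return prob


print(word_probability("气血气虚", "气血"))
```

In the toy text `气血气虚`, the character `气` has probability 2/4, and `血` follows `气` in one of its two occurrences, so the product for `气血` is 0.5 × 0.5.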
The step of summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, and compiling segmentation rules into a rule file comprises:
sorting the preliminary word segmentation results in pinyin order;
processing the sorted preliminary word segmentation results in sequence, the processing specifically comprising:
for a result in which a Chinese character and a punctuation mark are segmented together as one word, writing a rule into the rule file according to the punctuation mark;
for a result in which Chinese characters are segmented together as one word, collecting the words that share the same first character and judging them according to part of speech and linguistic knowledge; for example, when a word has the form verb + noun, splitting it into a verb and a noun, and when it has the form adjective + noun, splitting it into an adjective and a noun; and writing a rule into the rule file according to the verb;
for a result in which Chinese characters are segmented together as one word, collecting the words that share the same first two characters and judging them according to part of speech and linguistic knowledge; for example, when a word has the form verb + noun, splitting it into a verb and a noun, and when it has the form adjective + noun, splitting it into an adjective and a noun; and writing a rule into the rule file according to the shared first two characters;
for a result in which Chinese characters are segmented together as one word, collecting the words that share the same final character and judging them according to part of speech and linguistic knowledge; for example, when a word has the form verb + noun, splitting it into a verb and a noun, and when it has the form adjective + noun, splitting it into an adjective and a noun; and writing a rule into the rule file according to the final character.
The step of summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, and compiling segmentation rules into a rule file further comprises:
counting all words in the preliminary word segmentation result and the number of occurrences of each word;
sorting the words by number of occurrences from high to low and taking a preset number of the top-ranked words;
judging whether each of the preset number of words is in a preset word list;
if so, writing a rule into the rule file according to that word.
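The frequency-based selection above can be sketched in Python as follows. The function name and parameters are hypothetical, and the sketch only selects the candidate words; what rule is then written for each word is left to the rule author, since the text does not fix a format here.

```python
from collections import Counter


def frequent_known_words(segmented_words, preset_words, top_n):
    """Return the top_n most frequent words that also appear in preset_words."""
    counts = Counter(segmented_words)
    top = [word for word, _ in counts.most_common(top_n)]
    return [word for word in top if word in preset_words]


words = ["气", "血", "气", "阴阳", "气", "血"]
print(frequent_known_words(words, {"气", "阴阳"}, top_n=2))
```

Here `气` (3 occurrences) and `血` (2 occurrences) are the two most frequent words, and only `气` is in the preset word list, so only `气` is returned.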
The step of correcting the first correction result by using the word list to obtain a final word segmentation result comprises:
searching the original text for a word in the word list, the word serving as a word to be corrected;
recording the position of the word to be corrected in the original text;
finding, according to the recorded position, the segmentation of the word to be corrected in the first correction result;
judging whether that segmentation is consistent with the word in the word list;
if not, modifying the segmentation of the word to be corrected according to the word list; if so, retaining the segmentation;
and applying the word list correction to the first correction result in this way, word by word, to obtain the final word segmentation result.
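One way to sketch the word list correction above in Python is to treat a segmentation as a set of cut positions in the original text. The function names are hypothetical, and the sketch assumes that concatenating the tokens reproduces the original text exactly and that the word list contains no empty strings. For every occurrence of a word list entry, cuts inside the entry are removed and cuts at its two ends are enforced.

```python
def boundaries(tokens):
    """Return the set of cut positions implied by a token list."""
    cuts, pos = {0}, 0
    for token in tokens:
        pos += len(token)
        cuts.add(pos)
    return cuts


def correct_with_word_list(text, tokens, word_list):
    """Force every occurrence of a word list entry to be a single token."""
    cuts = boundaries(tokens)
    for word in word_list:
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            cuts -= set(range(start + 1, end))  # no cuts inside the term
            cuts |= {start, end}                # cuts at both ends
            start = text.find(word, start + 1)
    ordered = sorted(cuts)
    return [text[i:j] for i, j in zip(ordered, ordered[1:])]


print(correct_with_word_list("气血两虚证", ["气", "血两", "虚证"], ["气血"]))
```

If an occurrence is already segmented as a single token, the cut set does not change, which corresponds to retaining the existing segmentation. When word list entries overlap, the result of this sketch depends on the order in which they are processed.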
The step of correcting the first correction result by using the word list to obtain a final word segmentation result further comprises:
performing long-word and short-word disambiguation when a word is both a word in the word list and a sub-word of another word in the word list.
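The passage above does not say how the long-word and short-word conflict is resolved. As an illustration only, the following Python sketch assumes a longest-match-first policy: longer word list entries claim their positions first, and a shorter entry is skipped wherever it falls inside a span that is already claimed. The function name and the policy itself are assumptions, not the patented procedure.

```python
def find_word_spans(text, word_list):
    """Return non-overlapping (start, end) spans, preferring longer words."""
    claimed = [False] * len(text)
    spans = []
    # Assumption: longer entries take priority over their sub-words.
    for word in sorted(word_list, key=len, reverse=True):
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if not any(claimed[start:end]):
                spans.append((start, end))
                for i in range(start, end):
                    claimed[i] = True
            start = text.find(word, start + 1)
    return sorted(spans)


print(find_word_spans("肝气郁结则气滞", ["气", "肝气郁结", "气滞"]))
```

In this example the short entry `气` is never selected, because both of its occurrences lie inside the longer entries `肝气郁结` and `气滞`.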
A word segmentation device for ancient Chinese medical book documents is characterized by comprising:
the preprocessing module is used for preprocessing ancient book documents in the field of traditional Chinese medicine to generate a corpus for training a language model;
the training module is used for training a language model on the corpus;
the word segmentation module is used for performing unsupervised word segmentation on the ancient book documents by using the language model to generate a preliminary word segmentation result;
the rule establishing module is used for summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, and compiling segmentation rules into a rule file;
and the first correction module is used for performing a first correction on the preliminary word segmentation result according to the rules in the rule file to generate a first correction result.
The device further comprises:
the acquisition module is used for acquiring terms of the traditional Chinese medicine field compiled from the ancient book documents as a word list;
and the second correction module is used for correcting the first correction result by using the word list to obtain a final word segmentation result.
In the above embodiments, a word segmentation method designed specifically for the field of traditional Chinese medicine improves word segmentation accuracy for documents in that field.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a word segmentation method for ancient Chinese medical book literature according to an embodiment of the present invention;
FIG. 2 is a schematic connection diagram of the word segmentation apparatus for ancient Chinese medical book literature according to the present invention;
FIG. 3 is a flow chart of the method for word segmentation of ancient book literature in traditional Chinese medicine;
FIG. 4 is a corpus pre-processing result according to the present invention;
FIG. 5 is a rule file of the present invention;
FIG. 6 shows the word segmentation result of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the word segmentation method for ancient book documents of traditional Chinese medicine according to the present invention includes:
Step 101, preprocessing ancient book documents in the field of traditional Chinese medicine to generate a corpus for training a language model; wherein the step of preprocessing the ancient book documents comprises: acquiring the original text of the ancient book document, deleting the table of contents of the ancient book document from the original text, and deleting every sentence containing characters that cannot be represented in UTF-8, to generate a cleaned text; and adding a space after each character in the cleaned text, the result being used as the corpus for training the language model.
Step 102, training a language model on the corpus;
Step 103, performing unsupervised word segmentation on the ancient book documents by using the language model to generate a preliminary word segmentation result;
the step of performing unsupervised word segmentation on the ancient book documents by using the language model comprises:
dividing the states of a character into four types: the first, marked a, is a single-character word or the first character of a multi-character word; the second, marked b, is the second character of a multi-character word; the third, marked c, is the third character of a multi-character word; the fourth, marked d, is any remaining character of a multi-character word;
wherein a single-character word can only be followed by a single-character word or by the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of that word; the second character of a multi-character word can only be followed by the third character of that word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by a remaining character of that word, or by a single-character word or the first character of a multi-character word; a remaining character of a multi-character word can only be followed by another remaining character of that word, or by a single-character word or the first character of a multi-character word; all other transition probabilities are zero;
in summary, there are 8 non-zero state transition probabilities, namely:
p(a|a),p(b|a),p(c|b),p(a|b),p(a|c),p(d|c),p(a|d),p(d|d);
wherein p(x|y) denotes the conditional probability that the following character is in state x given that the preceding character is in state y: p(a|a) is the probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is a single-character word; p(b|a) is the probability that the following character is the second character of a multi-character word given that the preceding character is the first character of that word; p(c|b) is the probability that the following character is the third character given that the preceding character is the second character; p(a|b) is the probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is the second character of a multi-character word; p(a|c) is the same probability given that the preceding character is the third character of a multi-character word; p(d|c) is the probability that the following character is a remaining character of the multi-character word given that the preceding character is its third character; p(a|d) is the probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is a remaining character of a multi-character word; and p(d|d) is the probability that the following character is a remaining character given that the preceding character is also a remaining character.
Different transition probabilities are set and compared experimentally to obtain the following transition probabilities:
p(a|a)=0.96,p(b|a)=0.2,p(c|b)=0.009,p(a|b)=0.9,p(a|c)=1,p(d|c)=0.005,
p(a|d)=1,p(d|d)=0.0001;
calculating conditional probabilities using the language model;
and finding the optimal path, namely the path with the maximum probability, by dynamic programming, and taking the optimal path as the segmentation to obtain the preliminary word segmentation result.
The step of calculating conditional probabilities using the language model comprises:
$$p(w) = p(z_1)\,p(z_2 \mid z_1)\,p(z_3 \mid z_2 z_1) \cdots p(z_k \mid z_{k-1} \cdots z_2 z_1) \tag{1}$$
where w is a word consisting of k characters and z_1, z_2, …, z_k are the 1st, 2nd, …, k-th characters of w. The likelihood that w is a word, i.e. the existence probability p(w) of w, is thereby converted into the probabilities of its constituent characters. The probability of each character is calculated by dividing the number of times the character occurs in the text used to train the language model by the total number of characters.
Step 104, summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, and compiling segmentation rules into a rule file;
the step of summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, and compiling segmentation rules into a rule file comprises:
sorting the preliminary word segmentation results in pinyin order;
processing the sorted preliminary word segmentation results in sequence, the processing specifically comprising:
for a result in which a Chinese character and a punctuation mark are segmented together as one word, writing a rule into the rule file according to the punctuation mark;
for a result in which Chinese characters are segmented together as one word, collecting the words that share the same first character and judging them according to part of speech and linguistic knowledge; for example, when a word has the form verb + noun, splitting it into a verb and a noun, and when it has the form adjective + noun, splitting it into an adjective and a noun; and writing a rule into the rule file according to the verb;
for a result in which Chinese characters are segmented together as one word, collecting the words that share the same first two characters and judging them according to part of speech and linguistic knowledge; for example, when a word has the form verb + noun, splitting it into a verb and a noun, and when it has the form adjective + noun, splitting it into an adjective and a noun; and writing a rule into the rule file according to the shared first two characters;
for a result in which Chinese characters are segmented together as one word, collecting the words that share the same final character and judging them according to part of speech and linguistic knowledge; for example, when a word has the form verb + noun, splitting it into a verb and a noun, and when it has the form adjective + noun, splitting it into an adjective and a noun; and writing a rule into the rule file according to the final character.
The step of summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, and compiling segmentation rules into a rule file further comprises:
counting all words in the preliminary word segmentation result and the number of occurrences of each word;
sorting the words by number of occurrences from high to low and taking a preset number of the top-ranked words;
judging whether each of the preset number of words is in a preset word list;
if so, writing a rule into the rule file according to that word.
Step 105, performing a first correction on the preliminary word segmentation result according to the rules in the rule file to generate a first correction result.
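As an illustration of how a rule might be applied in the first correction, the Python sketch below handles only the punctuation case: any token in which Chinese characters and a punctuation mark were fused is split at the mark. The function name is hypothetical, the rule format (a non-empty list of punctuation marks, for instance one per line of the rule file) is an assumption, and the part-of-speech rules are not covered.

```python
import re


def apply_punctuation_rules(tokens, marks):
    """Split every token at the punctuation marks listed in the rules.

    Assumes `marks` is a non-empty list of punctuation strings.
    """
    pattern = re.compile("(" + "|".join(re.escape(m) for m in marks) + ")")
    corrected = []
    for token in tokens:
        # The capturing group keeps each mark as its own token.
        corrected.extend(piece for piece in pattern.split(token) if piece)
    return corrected


print(apply_punctuation_rules(["气血，", "阴阳"], ["，", "。"]))
```

Tokens that contain none of the listed marks pass through unchanged.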
The method further comprises the following steps:
Step 106, acquiring terms of the traditional Chinese medicine field compiled from the ancient book documents as a word list;
Step 107, correcting the first correction result by using the word list to obtain a final word segmentation result. This step comprises:
searching the original text for a word in the word list, the word serving as a word to be corrected;
recording the position of the word to be corrected in the original text;
finding, according to the recorded position, the segmentation of the word to be corrected in the first correction result;
judging whether that segmentation is consistent with the word in the word list;
if not, modifying the segmentation of the word to be corrected according to the word list; if so, retaining the segmentation;
and applying the word list correction to the first correction result in this way, word by word, to obtain the final word segmentation result.
The step may further include:
performing long-word and short-word disambiguation when a word is both a word in the word list and a sub-word of another word in the word list.
As shown in FIG. 2, the word segmentation device for ancient book documents of traditional Chinese medicine according to the present invention includes:
a preprocessing module 21, used for preprocessing ancient book documents in the field of traditional Chinese medicine to generate a corpus for training a language model;
a training module 22, used for training a language model on the corpus;
a word segmentation module 23, used for performing unsupervised word segmentation on the ancient book documents by using the language model to generate a preliminary word segmentation result;
a rule establishing module 24, used for summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, and compiling segmentation rules into a rule file;
and a first correction module 25, used for performing a first correction on the preliminary word segmentation result according to the rules in the rule file to generate a first correction result.
The device further comprises:
an acquisition module 26, used for acquiring terms of the traditional Chinese medicine field compiled from the ancient book documents as a word list;
and a second correction module 27, used for correcting the first correction result by using the word list to obtain a final word segmentation result.
The following describes an application scenario of the present invention. To solve the word segmentation problem for ancient book documents in the field of traditional Chinese medicine, the invention provides a word segmentation method based on those documents. The specific implementation steps are as follows:
Step one: ancient book documents related to the field of traditional Chinese medicine are obtained as the corpus for training the language model, and technical terms of the traditional Chinese medicine field are compiled into a word list. The word list is stored in TXT format, one word per line.
Step two: the documents are preprocessed and the language model is trained using the KenLM tool.
The preprocessing comprises:
(1) deleting the table of contents, and deleting every sentence that contains special characters that cannot be represented in UTF-8 (sentences are delimited by periods, exclamation marks and question marks);
(2) adding a space after each character in the cleaned text, the result being used as the corpus for training the character-level language model.
Step three: the language model is used for carrying out primary unsupervised word segmentation on ancient book documents in the field of traditional Chinese medicine.
In this context, words having a length greater than four are relatively few, so in this patent, the state of a word is divided into four types: the first character of a single word or a multi-word is marked as a, the second character of the multi-word, the third character of the multi-word is marked as b, the third character of the multi-word is marked as c, and the rest of the multi-word is marked as d. The single word can be followed by the first word of the single word or the multiple word, the first word of the single word can be followed by the second word of the multiple word, the second word of the multiple word can be followed by the third word of the multiple word or the first word of the single word or the multiple word, the third word of the multiple word can be followed by the rest of the multiple word or the first word of the single word or the multiple word, the rest of the multiple word can be followed by the first word of the single word or the multiple word or the rest of the multiple word, except the transition state, the rest of the transition probability is zero. In summary, there are 8 non-zero transition probabilities, namely:
p(a|a),p(b|a),p(c|b),p(a|b),p(a|c),p(d|c),p(a|d),p(d|d)。
according to the condition that long words in ancient languages are few and single words with complete semantics are many, different transition probabilities are set for experimental comparison to obtain the transition probabilities
p(a|a)=0.96,p(b|a)=0.2,p(c|b)=0.009,p(a|b)=0.9,p(a|c)=1,p(d|c)=0.005,
p(a|d)=1,p(d|d)=0.0001
And (5) calculating the conditional probability by using the language model obtained in the step two, as shown in formula (1). Where a word w is a word composed of k words, the probability of the word w may be converted to the probability of the composed word. And finding the optimal path, namely the path with the maximum probability by using a dynamic programming method, and taking the optimal path as a segmentation result to obtain an initial word segmentation result.
$$p(w) = p(z_1)\, p(z_2 \mid z_1)\, p(z_3 \mid z_2 z_1) \cdots p(z_k \mid z_{k-1} \cdots z_2 z_1) \tag{1}$$
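The dynamic-programming search over the four states can be sketched as a Viterbi pass. This is a minimal sketch, not the patented implementation: the transition values are those listed in the text, but the way the language model is attached to the states (a character in state a, b, c or d is scored given 0, 1, 2 or 3 preceding characters) is an assumption, and `char_logprob` and `toy_logprob` are hypothetical stand-ins for a trained 4-gram character model.

```python
import math
from typing import Callable, Dict, List, Tuple

# Non-zero transition probabilities from the text, keyed as (previous state, next state).
TRANS: Dict[Tuple[str, str], float] = {
    ("a", "a"): 0.96, ("a", "b"): 0.2,
    ("b", "c"): 0.009, ("b", "a"): 0.9,
    ("c", "a"): 1.0, ("c", "d"): 0.005,
    ("d", "a"): 1.0, ("d", "d"): 0.0001,
}
LOG_TRANS = {pair: math.log(p) for pair, p in TRANS.items()}

# Assumption: a character in state a/b/c/d is scored given 0/1/2/3 preceding characters.
CONTEXT_LEN = {"a": 0, "b": 1, "c": 2, "d": 3}


def viterbi_segment(text: str, char_logprob: Callable[[str, str], float]) -> List[str]:
    """Return the most probable segmentation of `text` as a list of words."""
    if not text:
        return []
    # best[state] = (log score, state path) of the best path ending in `state`.
    best = {"a": (char_logprob("", text[0]), "a")}  # the first character starts a word
    for i in range(1, len(text)):
        new_best = {}
        for state, n in CONTEXT_LEN.items():
            emit = char_logprob(text[max(0, i - n):i], text[i])
            candidates = [
                (score + LOG_TRANS[(prev, state)] + emit, path + state)
                for prev, (score, path) in best.items()
                if (prev, state) in LOG_TRANS  # zero-probability transitions are skipped
            ]
            if candidates:
                new_best[state] = max(candidates)
        best = new_best
    _, path = max(best.values())
    words: List[str] = []
    for ch, state in zip(text, path):
        if state == "a":
            words.append(ch)   # state a opens a new word
        else:
            words[-1] += ch    # states b, c, d extend the current word
    return words


def toy_logprob(context: str, ch: str) -> float:
    """Toy stand-in for the language model: only 'y' after 'x' is likely."""
    if context == "":
        return math.log(0.1)
    return math.log(0.9) if (context, ch) == ("x", "y") else math.log(0.001)


print(viterbi_segment("xyz", toy_logprob))  # ['xy', 'z']
```

Working in log space avoids underflow on long passages; in practice `char_logprob` would wrap queries to the trained n-gram model.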
Step four: summarize the preliminary word segmentation result of step three according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge, and sort out segmentation rules to form a rule file. The rule file is formed as follows:
(1) count all words in the word segmentation result obtained in step three and the number of occurrences of each word;
(2) sort the statistics in pinyin order;
(3) for results in which a Chinese character and a punctuation mark are segmented as one word, write the punctuation mark into the rule file as a rule;
(4) for results in which several Chinese characters are segmented as one word, first sort out the words that share the same first character, then judge them according to part of speech and linguistic knowledge; for example, a "verb + noun" form should generally be split into two words, a verb and a noun, so the verb is added to the rule file;
(5) sort out the words that share the same first two characters, judge them according to part of speech and linguistic knowledge, and add the corresponding words to the rule file;
(6) sort out the words that share the same final character, judge them according to part of speech and linguistic knowledge, and add the corresponding words to the rule file;
(7) finally, sort the words by frequency from high to low and judge the high-frequency words.
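The bookkeeping part of these steps (counting, sorting and grouping candidates for manual review) can be sketched as follows. This is only an illustration under stated assumptions: the function name `summarize_tokens`, the punctuation set and the sample tokens are hypothetical, and plain code-point order stands in for pinyin order, which would need a pinyin lookup table. The linguistic judgment itself (for example deciding that a word is "verb + noun") remains a manual step.

```python
from collections import Counter, defaultdict
from typing import Dict, List

PUNCTUATION = set("，。、；：？！")  # assumed set of punctuation marks


def summarize_tokens(tokens: List[str]) -> Dict[str, object]:
    """Collect the statistics a reviewer needs to draft segmentation rules."""
    freq = Counter(tokens)                      # (1) occurrences of each word
    ordered = sorted(freq)                      # (2) code-point order as a pinyin stand-in
    punct_fused = [w for w in ordered           # (3) punctuation fused with other characters
                   if len(w) > 1 and any(c in PUNCTUATION for c in w)]
    by_first = defaultdict(list)                # (4) same first character
    by_first_two = defaultdict(list)            # (5) same first two characters
    by_last = defaultdict(list)                 # (6) same final character
    for w in ordered:
        if len(w) < 2 or w in punct_fused:
            continue
        by_first[w[0]].append(w)
        if len(w) > 2:
            by_first_two[w[:2]].append(w)
        by_last[w[-1]].append(w)
    high_freq = [w for w, _ in freq.most_common()]  # (7) high to low frequency
    return {"freq": freq, "punct_fused": punct_fused, "by_first": dict(by_first),
            "by_first_two": dict(by_first_two), "by_last": dict(by_last),
            "high_freq": high_freq}


# Made-up tokens for illustration only.
stats = summarize_tokens(["食盐", "食温酒", "，茯苓", "食盐", "之"])
print(stats["by_first"]["食"])
```

Groups with many members sharing a first character are the natural place for a reviewer to look for "verb + noun" candidates.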
The rule file is stored in TXT format, with one rule per line. The details are shown in Table 1 below:
Table 1: Rule file
[Table provided as images in the original publication: BDA0002054502430000121, BDA0002054502430000131]
Step five: perform a first correction of the preliminary word segmentation result of step three according to the rules in the rule file.
Step six: correct the first correction result of step five using the word list. The detailed correction method is as follows:
(1) search the original text for words that appear in the word list, and record the positions where they appear;
(2) if a word is both a word in the word list and a sub-word of another word in the word list, perform long/short word disambiguation to determine which word in the word list should be used;
(3) according to the positions recorded in (1), find the corresponding word segmentation result obtained from the first correction in step five;
(4) if the word segmentation result is inconsistent with the word list, i.e. a word in the word list has been split into several words or the segmentation boundary is incorrect, merge and modify the word segmentation result according to the word list; if the two are consistent, keep the result;
(5) the final word segmentation result is obtained after the word list correction.
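The word list correction can be sketched as a boundary-repair pass. This is a hedged illustration: the function name `correct_with_vocab` is hypothetical, the long/short word disambiguation is approximated here by a leftmost-longest match (the patent does not specify its disambiguation procedure), and the Chinese strings are illustrative stand-ins rather than text quoted from the patent.

```python
from typing import Iterable, List


def correct_with_vocab(tokens: List[str], vocab: Iterable[str]) -> List[str]:
    """Re-segment `tokens` so that every matched word-list entry is one token."""
    text = "".join(tokens)
    # Current segmentation expressed as the set of cut positions (end of each token).
    cuts, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        cuts.add(pos)
    # Steps (1)-(2): locate word-list entries in the text, preferring the longest
    # entry at each position (assumed disambiguation strategy).
    words = sorted((w for w in vocab if w), key=len, reverse=True)
    spans, i = [], 0
    while i < len(text):
        match = next((w for w in words if text.startswith(w, i)), None)
        if match:
            spans.append((i, i + len(match)))
            i += len(match)
        else:
            i += 1
    # Steps (3)-(4): drop cuts inside each matched span and force cuts at its edges.
    for start, end in spans:
        cuts -= set(range(start + 1, end))
        cuts.update({start, end})
    cuts.discard(0)
    # Step (5): rebuild the token list from the repaired cut positions.
    result, prev = [], 0
    for cut in sorted(cuts):
        result.append(text[prev:cut])
        prev = cut
    return result


print(correct_with_vocab(["加减", "六味", "汤", "治", "遗尿"], {"加减六味汤", "六味汤"}))
```

Representing the segmentation as cut positions handles both failure modes named in step (4): an entry split into several tokens, and an entry whose boundary falls inside a neighbouring token.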
The invention has the advantages that:
None of the existing word segmentation tools is designed for segmenting ancient Chinese texts, and none is dedicated to the field of traditional Chinese medicine. In terms of expression, ancient Chinese differs greatly from modern Chinese. For example, modern Chinese generally uses modal particles such as "呢", "吗" and "呀", whereas the modal particles of ancient Chinese are generally "之", "乎", "者" and "也"; and compared with modern Chinese, more single characters in ancient texts carry complete semantics. As a result, existing segmenters frequently produce erroneous results on the upstream task of segmenting ancient texts. In addition, traditional Chinese medicine ancient books contain a large number of medicine names and prescriptions, and such domain-specific expressions are rarely seen in the general domain, so current general-domain segmenters cannot handle the word segmentation of traditional Chinese medicine ancient books well. The invention uses an unsupervised method to train a word segmentation model specifically for traditional Chinese medicine ancient book documents, and can handle this word segmentation task well. Moreover, because the method is unsupervised, the labor and time cost of manual annotation is saved. The method can easily be extended to ancient documents of other fields: the segmenter of the invention can be applied to another field by changing the data set used to train the model and the word list of the related field.
The word segmentation method is illustrated below using a traditional Chinese medicine ancient book document as an example, as shown in fig. 3.
Firstly, traditional Chinese medicine ancient book documents are obtained as the corpus for training the language model. Terms specific to the field of traditional Chinese medicine are collated to serve as the word list, as shown in Table 1.
Table 1: Word list of terms specific to the field of traditional Chinese medicine
[Table provided as images in the original publication: BDA0002054502430000141, BDA0002054502430000151]
Secondly, the corpus is preprocessed; the preprocessing result is shown in fig. 4. The language model is trained using the KenLM tool with the n-gram order set to 4.
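The character-level preprocessing can be sketched in a few lines. This is a minimal sketch under assumptions: the function name `to_char_corpus` is hypothetical, catalogue removal is not covered, and "cannot be expressed in UTF-8" is approximated by Python's encoding check. The shell command in the comment shows the usual way of estimating a 4-gram model with KenLM's `lmplz`; the file names are placeholders.

```python
from typing import Iterable, List


def to_char_corpus(lines: Iterable[str]) -> List[str]:
    """Turn raw text lines into space-separated characters for n-gram training."""
    corpus = []
    for line in lines:
        chars = [c for c in line if not c.isspace()]
        if not chars:
            continue                      # skip empty lines
        try:
            "".join(chars).encode("utf-8")
        except UnicodeEncodeError:
            continue                      # drop sentences that cannot be encoded as UTF-8
        corpus.append(" ".join(chars))    # one token per character
    return corpus


print(to_char_corpus(["人参 茯苓", "", "bad\ud800"]))

# The resulting lines, written to a file, can then be passed to KenLM, e.g.:
#   lmplz -o 4 < corpus.txt > model.arpa
```

Treating each character as a token is what lets a word-level n-gram toolkit estimate character-level conditional probabilities.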
Thirdly, unsupervised word segmentation is carried out on the ancient book documents in the field of traditional Chinese medicine using the language model trained in step two. The words are separated by spaces.
Fourthly, the preliminary word segmentation result is sorted and summarized. First, the result is sorted by pinyin, and the cases in which a punctuation mark and Chinese characters are segmented as one word are picked out. For example, if a "," fused with "poria" appears as a single word in the word segmentation result, a rule for "," is added to the rule file, in which a wildcard stands for any character: whenever "," is joined to any character, the two are split. Next, all words with the same first character are found; according to general linguistic knowledge, a "verb + noun" combination should usually be split into two words. For example, "eat warm wine", "eat salt" and "eat hot substance" all begin with "eat", so a rule for "eat" is added to the rule file. Similarly, the words with the same first two characters and the words with the same final character are sorted out, and the corresponding rules are added to the rule file. Finally, the preliminary word segmentation results are sorted by word frequency from high to low, the high-frequency words are checked, and the rules for those that need to be split are added to the rule file.
Fifthly, the rules in the rule file are used to correct the preliminary word segmentation result. The rule file is shown in fig. 5.
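Applying such rules can be sketched as below. The rule syntax is an assumption made for illustration, since the rule table and fig. 5 are not reproduced in this text: a line "X*" is taken to mean "split X from whatever follows it", and a line "*X" to mean "split X from whatever precedes it". The function names and sample tokens are hypothetical.

```python
from typing import Iterable, List, Tuple


def load_rules(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Parse one rule per line into prefix rules ('X*') and suffix rules ('*X')."""
    prefixes, suffixes = [], []
    for raw in lines:
        rule = raw.strip()
        if len(rule) > 1 and rule.endswith("*"):
            prefixes.append(rule[:-1])
        elif len(rule) > 1 and rule.startswith("*"):
            suffixes.append(rule[1:])
    return prefixes, suffixes


def apply_rules(tokens: List[str], prefixes: List[str], suffixes: List[str]) -> List[str]:
    """Split every token that starts with a prefix rule or ends with a suffix rule."""
    out: List[str] = []
    for tok in tokens:
        head = next((p for p in prefixes if tok.startswith(p) and len(tok) > len(p)), None)
        if head is not None:
            out.append(head)              # cut the prefix off as its own word
            tok = tok[len(head):]
        tail = next((s for s in suffixes if tok.endswith(s) and len(tok) > len(s)), None)
        if tail is not None:
            out.extend([tok[:-len(tail)], tail])  # cut the suffix off as its own word
        else:
            out.append(tok)
    return out


prefixes, suffixes = load_rules(["，*", "食*", "*者"])
print(apply_rules(["，茯苓", "食盐", "虚者", "之"], prefixes, suffixes))
```

Keeping the rules in a plain text file, one per line, means reviewers can extend them without touching the code.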
Sixthly, the result obtained in the fifth step is corrected using the word list.
First, the words in the word list are located in the original, unsegmented traditional Chinese medicine text, and their positions are recorded.
Then the words of the word list are located in the rule-corrected word segmentation result.
If the word segmentation result is inconsistent with a word in the word list, the result is corrected according to the word list. For example, the original text contains the sentence "plus-minus six-ingredient decoction treats enuresis; this syndrome is mostly seen in deficient patients". Suppose the rule-corrected word segmentation result splits "plus-minus six-ingredient decoction" into several words. Because "plus-minus six-ingredient decoction" is a word in the word list (the name of a traditional Chinese medicine prescription), it is merged into a single word in the final word segmentation result, while the rest of the sentence keeps its rule-corrected segmentation. The form of the word segmentation result is shown in fig. 6.
For convenience of description, the above device is described in terms of functionally separate units/modules. Of course, when implementing the invention, the functions of the units/modules may be implemented in one or more pieces of software and/or hardware.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A word segmentation method for traditional Chinese medical ancient book documents, characterized by comprising the following steps:
preprocessing ancient book documents in the field of traditional Chinese medicine to generate a corpus for training a language model;
training on the corpus to generate a language model;
performing unsupervised word segmentation on the ancient book documents using the language model to generate a preliminary word segmentation result;
summarizing the preliminary word segmentation result according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge, and sorting out segmentation rules to form a rule file;
performing a first correction on the preliminary word segmentation result according to the rules in the rule file to generate a first correction result;
wherein the step of summarizing the preliminary word segmentation result according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge, and sorting out segmentation rules to form a rule file comprises:
sorting the preliminary word segmentation results in pinyin order;
processing the sorted preliminary word segmentation results in sequence, the processing specifically comprising:
for results in which a Chinese character and a punctuation mark are segmented as one word, writing a rule into the rule file according to the punctuation mark;
for results in which several Chinese characters are segmented as one word, judging the words that share the same first character according to part of speech and linguistic knowledge; when a word is in verb + noun form, splitting it into two words, a verb and a noun; when a word is in adjective + noun form, splitting it into two words, an adjective and a noun; and writing a rule into the rule file according to the shared first character;
for results in which several Chinese characters are segmented as one word, judging the words that share the same first two characters according to part of speech and linguistic knowledge; when a word is in verb + noun form, splitting it into two words, a verb and a noun; when a word is in adjective + noun form, splitting it into two words, an adjective and a noun; and writing a rule into the rule file according to the shared first two characters;
for results in which several Chinese characters are segmented as one word, sorting out the words that share the same final character and judging them according to part of speech and linguistic knowledge; when a word is in verb + noun form, splitting it into two words, a verb and a noun; when a word is in adjective + noun form, splitting it into two words, an adjective and a noun; and writing a rule into the rule file according to the shared final character.
2. The method of claim 1, further comprising:
acquiring, as a word list, traditional Chinese medicine terms collated from the ancient book documents;
and correcting the first correction result using the word list to obtain a final word segmentation result.
3. The method of claim 1, wherein the step of pre-processing the ancient book documents comprises:
acquiring the original text of the ancient book document, deleting the catalogue of the ancient book document from the original text, and deleting any sentence containing characters that cannot be expressed in UTF-8, to generate a cleaned text;
and adding a space after each character in the cleaned text, the result serving as the corpus for training the language model.
4. The method of claim 2, wherein said step of unsupervised word segmentation of the ancient book documents using the language model comprises:
dividing the state of a character into four types: the first, marked a, is a character that forms a single-character word or is the first character of a multi-character word; the second, marked b, is the second character of a multi-character word; the third, marked c, is the third character of a multi-character word; the fourth, marked d, is any remaining character of a multi-character word;
wherein a character in state a can only be followed by a character in state a or b; a character in state b can only be followed by a character in state c or a; a character in state c can only be followed by a character in state d or a; a character in state d can only be followed by a character in state a or d; and all other transition probabilities are zero;
in summary, there are 8 non-zero state transition probabilities, namely:
p(a|a), p(b|a), p(c|b), p(a|b), p(a|c), p(d|c), p(a|d), p(d|d);
wherein p(a|a) represents the conditional probability that the next character is a single-character word or the first character of a multi-character word, given that the previous character is a single-character word; p(b|a) represents the conditional probability that the next character is the second character of a multi-character word, given that the previous character is the first character of that word; p(c|b) represents the conditional probability that the next character is the third character of a multi-character word, given that the previous character is the second character of that word; p(a|b) represents the conditional probability that the next character is a single-character word or the first character of a multi-character word, given that the previous character is the second character of a multi-character word; p(a|c) represents the conditional probability that the next character is a single-character word or the first character of a multi-character word, given that the previous character is the third character of a multi-character word; p(d|c) represents the conditional probability that the next character belongs to the remainder of a multi-character word, given that the previous character is the third character of that word; p(a|d) represents the conditional probability that the next character is a single-character word or the first character of a multi-character word, given that the previous character belongs to the remainder of a multi-character word; p(d|d) represents the conditional probability that the next character belongs to the remainder of a multi-character word, given that the previous character also belongs to the remainder of that word;
setting different transition probabilities and comparing them experimentally to obtain the transition probabilities:
p(a|a)=0.96, p(b|a)=0.2, p(c|b)=0.009, p(a|b)=0.9, p(a|c)=1, p(d|c)=0.005, p(a|d)=1, p(d|d)=0.0001;
calculating a conditional probability using the language model;
and finding the optimal path, i.e. the path with the maximum conditional probability, by dynamic programming, and taking the optimal path as the segmentation to obtain the preliminary word segmentation result.
5. The method of claim 4, wherein the step of calculating the conditional probability using the language model comprises:
$$p(w) = p(z_1)\, p(z_2 \mid z_1)\, p(z_3 \mid z_2 z_1) \cdots p(z_k \mid z_{k-1} \cdots z_2 z_1) \tag{1}$$
where the word w consists of k characters and z_1, z_2, …, z_k are the 1st, 2nd, …, k-th characters of w; the likelihood that w is a word, i.e. the existence probability p(w) of w, is converted into the existence probabilities of its constituent characters, wherein the existence probability of each character is calculated by dividing the number of times the character appears in the text used to train the language model by the total number of characters.
6. The method according to claim 1, wherein the step of summarizing the preliminary word segmentation result according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge, and sorting out segmentation rules to form a rule file further comprises:
counting all words in the preliminary word segmentation result and the number of occurrences of each word;
sorting the words by number of occurrences from high to low to obtain a preset number of words;
judging whether the preset number of words are in a preset word list;
if so, writing rules into the rule file according to the words.
7. The method according to claim 2, wherein the step of correcting the first correction result using the word list to obtain a final word segmentation result comprises:
searching the original text for a word in the word list, as a word to be corrected;
recording the position of the word to be corrected in the original text;
finding, according to the recorded position, the word segmentation result of the word to be corrected after the first correction;
judging whether the word segmentation result is consistent with the word in the word list;
if not, modifying the word segmentation result of the word to be corrected according to the word list; if so, keeping the word segmentation result;
performing the word list correction on the first correction result in sequence to obtain the final word segmentation result; or
the step of correcting the first correction result using the word list to obtain a final word segmentation result further comprises:
performing long/short word disambiguation when there is a first word that is both a word in the word list and a sub-word of another word in the word list.
8. A word segmentation device for traditional Chinese medical ancient book documents, characterized by comprising:
a preprocessing module, used for preprocessing ancient book documents in the field of traditional Chinese medicine to generate a corpus for training a language model;
a training module, used for training on the corpus to generate a language model;
a word segmentation module, used for performing unsupervised word segmentation on the ancient book documents using the language model to generate a preliminary word segmentation result;
a rule establishing module, used for summarizing the preliminary word segmentation result according to part-of-speech relations, fixed sentence-pattern collocations and linguistic knowledge, and sorting out segmentation rules to form a rule file;
a first correction module, used for performing a first correction on the preliminary word segmentation result according to the rules in the rule file to generate a first correction result;
wherein the rule establishing module is used for: for results in which a Chinese character and a punctuation mark are segmented as one word, writing a rule into the rule file according to the punctuation mark; for results in which several Chinese characters are segmented as one word, judging the words that share the same first character according to part of speech and linguistic knowledge, splitting a word in verb + noun form into two words, a verb and a noun, splitting a word in adjective + noun form into two words, an adjective and a noun, and writing a rule into the rule file according to the shared first character; for results in which several Chinese characters are segmented as one word, judging the words that share the same first two characters according to part of speech and linguistic knowledge, splitting a word in verb + noun form into two words, a verb and a noun, splitting a word in adjective + noun form into two words, an adjective and a noun, and writing a rule into the rule file according to the shared first two characters; for results in which several Chinese characters are segmented as one word, sorting out the words that share the same final character, judging them according to part of speech and linguistic knowledge, splitting a word in verb + noun form into two words, a verb and a noun, splitting a word in adjective + noun form into two words, an adjective and a noun, and writing a rule into the rule file according to the shared final character.
9. The apparatus of claim 8, further comprising:
an acquisition module, used for acquiring, as a word list, traditional Chinese medicine terms collated from the ancient book documents;
and a second correction module, used for correcting the first correction result using the word list to obtain a final word segmentation result.
CN201910384880.XA 2019-05-09 2019-05-09 Word segmentation method and device for traditional Chinese medical ancient book documents Active CN110134766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910384880.XA CN110134766B (en) 2019-05-09 2019-05-09 Word segmentation method and device for traditional Chinese medical ancient book documents

Publications (2)

Publication Number Publication Date
CN110134766A (en) 2019-08-16
CN110134766B (en) 2021-06-25

Family

ID=67576958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910384880.XA Active CN110134766B (en) 2019-05-09 2019-05-09 Word segmentation method and device for traditional Chinese medical ancient book documents

Country Status (1)

Country Link
CN (1) CN110134766B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant