CN110134766B

CN110134766B - Word segmentation method and device for traditional Chinese medical ancient book documents

Info

Publication number: CN110134766B
Application number: CN201910384880.XA
Authority: CN
Inventors: 谢永红; 周越; 张德政; 阿孜古丽; 栗辉; 贾麒
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2019-05-09
Filing date: 2019-05-09
Publication date: 2021-06-25
Anticipated expiration: 2039-05-09
Also published as: CN110134766A

Abstract

The embodiment of the invention discloses a word segmentation method and a word segmentation device for ancient books of traditional Chinese medicine, wherein the method comprises the following steps: preprocessing ancient book documents in the field of traditional Chinese medicine to generate a corpus for training a language model; training on the corpus to generate the language model; performing unsupervised word segmentation on the ancient book documents by using the language model to generate a preliminary word segmentation result; summarizing the preliminary word segmentation result according to part-of-speech relationships, fixed sentence-pattern collocations and linguistic knowledge, and sorting out segmentation rules to form a rule file; and performing a first correction on the preliminary word segmentation result according to the rules in the rule file to generate a first correction result.

Description

Word segmentation method and device for traditional Chinese medical ancient book documents

Technical Field

The invention relates to word segmentation of medical literature in the field of natural language processing, and in particular to a word segmentation method and a word segmentation device for ancient books of traditional Chinese medicine.

Background

Chinese word segmentation is a basic step in Chinese text processing. Unlike English and similar languages, Chinese sentences do not use spaces to separate words, so Chinese word segmentation is a key first step in Chinese information processing tasks such as text classification, information retrieval, information filtering, automatic indexing of documents and automatic generation of abstracts. The correctness of the Chinese word segmentation result directly influences the correctness of the subsequent task.

Traditional Chinese medicine originated in primitive society and has developed and changed continuously ever since, accumulating a great number of ancient medical books. These documents are large in number, complicated in content and various in type, covering the theory of essence and qi, the theory of yin and yang, the theory of five elements, qi, blood and body fluids, visceral manifestation, meridians, constitutions, etiology, pathogenesis, therapeutic principles, health preservation and so on. Most of them are written in classical or ancient Chinese and in mnemonic verses, and both their style and their period of writing differ from modern Chinese. They also contain many proper nouns and technical terms of the traditional Chinese medicine field. Reasonable word segmentation of ancient books of traditional Chinese medicine is the basis for structuring traditional Chinese medical knowledge, but at present no word segmentation device exists specifically for the field of traditional Chinese medicine, and general-domain word segmentation devices cannot handle the word segmentation task on these documents well.

Disclosure of Invention

In view of this, the embodiments of the present invention provide a word segmentation method and device for ancient Chinese medical literature, which can improve the accuracy of word segmentation of the literature in the field of traditional Chinese medicine.

A word segmentation method for ancient Chinese medical book documents comprises the following steps:

preprocessing ancient book documents in the field of traditional Chinese medicine to generate a corpus for training a language model;

training on the corpus to generate a language model;

performing unsupervised word segmentation on the ancient book documents by using the language model to generate a preliminary word segmentation result;

summarizing the preliminary word segmentation result according to the part of speech relationship, the fixed collocation of the sentence patterns and the linguistic knowledge, and sorting out segmentation rules to form rule files;

and according to the rule in the rule file, performing first correction on the preliminary word segmentation result to generate a first correction result.

The method further comprises the following steps:

acquiring traditional Chinese medicine field terms sorted according to the ancient book literature as a word list;

and correcting the first correction result by using the word list to obtain a final word segmentation result.

The step of pre-processing the ancient book documents comprises:

acquiring an original text of the ancient book document, deleting a catalogue of the ancient book document from the original text, deleting a sentence containing characters which cannot be expressed by utf-8, and generating a cleaned text;

and adding a space after each character in the cleaned text to be used as a corpus of the training language model.
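The preprocessing step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names and the `catalogue_lines` parameter are hypothetical, the catalogue is assumed to be known in advance as a set of lines, and "characters which cannot be expressed by utf-8" is interpreted as the Unicode replacement character or any character that fails UTF-8 encoding.

```python
import re

# A sentence is a run of text optionally ending in a full-width period,
# exclamation mark, or question mark.
SENTENCE = re.compile(r"[^。！？]+[。！？]?")


def is_utf8_clean(sentence):
    """Return True if the sentence has no replacement character and encodes to UTF-8."""
    if "\ufffd" in sentence:
        return False
    try:
        sentence.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def preprocess(raw_text, catalogue_lines):
    """Drop catalogue lines and unclean sentences, then space-separate the characters.

    catalogue_lines is a hypothetical set of lines known to belong to the catalogue.
    """
    corpus_lines = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or line in catalogue_lines:
            continue
        kept = [s for s in SENTENCE.findall(line) if is_utf8_clean(s)]
        chars = [ch for s in kept for ch in s if not ch.isspace()]
        if chars:
            corpus_lines.append(" ".join(chars))
    return "\n".join(corpus_lines)
```

Each output line is a space-separated character sequence, which is the form a character-level n-gram language model toolkit would typically expect as training input.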

The step of unsupervised word segmentation of the ancient book documents by using the language model comprises the following steps:

the state of each character is divided into four types: the first is a single-character word or the first character of a multi-character word, labeled a; the second is the second character of a multi-character word, labeled b; the third is the third character of a multi-character word, labeled c; the fourth is the remaining part of a multi-character word, labeled d;

wherein a single-character word can only be followed by a single-character word or the first character of a multi-character word; the first character of a multi-character word can only be followed by the second character of the multi-character word; the second character of a multi-character word can only be followed by the third character of the multi-character word, or by a single-character word or the first character of a multi-character word; the third character of a multi-character word can only be followed by the remaining part of the multi-character word, or by a single-character word or the first character of a multi-character word; the remaining part of a multi-character word can only be followed by a single-character word or the first character of a multi-character word, or by a further remaining part of the multi-character word; all transitions other than these have probability zero;

in summary, there are 8 non-zero state transition probabilities, namely:

p(a|a),p(b|a),p(c|b),p(a|b),p(a|c),p(d|c),p(a|d),p(d|d)；

wherein p(a|a) represents the conditional probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is a single-character word; p(b|a) represents the conditional probability that the following character is the second character of a multi-character word given that the preceding character is the first character of the multi-character word; p(c|b) represents the conditional probability that the following character is the third character of a multi-character word given that the preceding character is the second character of the multi-character word; p(a|b) represents the conditional probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is the second character of a multi-character word; p(a|c) represents the conditional probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is the third character of a multi-character word; p(d|c) represents the conditional probability that the following character is the remaining part of a multi-character word given that the preceding character is the third character of the multi-character word; p(a|d) represents the conditional probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is the remaining part of a multi-character word; and p(d|d) represents the conditional probability that the following character is the remaining part of a multi-character word given that the preceding character is the remaining part of the multi-character word;

comparing different settings of the transition probabilities experimentally to obtain the following values:

p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001;

calculating a conditional probability using the language model;

and finding the optimal path, namely the path with the maximum conditional probability, by using a dynamic programming method, and taking the optimal path as the segmentation result to obtain the preliminary word segmentation result.
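The dynamic programming search over the four states can be sketched as a Viterbi-style decoder. The transition values are those stated above, used as given. Everything else is an assumption made for illustration: `cond_logprob(history, ch)` is a hypothetical callable standing in for the trained language model, and the mapping of states to context lengths (a uses no history, b one preceding character, c two, d three) is one plausible reading of how the model's conditional probabilities are attached to the states, not something the text spells out.

```python
import math

# Non-zero transition probabilities, keyed as (previous state, next state).
# For example p(b|a) = 0.2 is stored under ("a", "b").
TRANS = {
    ("a", "a"): 0.96, ("a", "b"): 0.2,
    ("b", "c"): 0.009, ("b", "a"): 0.9,
    ("c", "a"): 1.0, ("c", "d"): 0.005,
    ("d", "a"): 1.0, ("d", "d"): 0.0001,
}

# Assumed number of preceding characters used as history for each state.
CONTEXT = {"a": 0, "b": 1, "c": 2, "d": 3}


def segment(text, cond_logprob):
    """Return the words on the highest-scoring state path for text."""
    if not text:
        return []
    # best[state] = (log score, state path) of the best path ending in that state.
    # The first character must start a word, so only state "a" is allowed.
    best = {"a": (cond_logprob("", text[0]), ["a"])}
    for i in range(1, len(text)):
        new_best = {}
        for (prev, nxt), p in TRANS.items():
            if prev not in best:
                continue
            history = text[i - CONTEXT[nxt]:i]
            score = best[prev][0] + math.log(p) + cond_logprob(history, text[i])
            if nxt not in new_best or score > new_best[nxt][0]:
                new_best[nxt] = (score, best[prev][1] + [nxt])
        best = new_best
    tags = max(best.values(), key=lambda entry: entry[0])[1]
    # Every "a" opens a new word; b, c and d extend the current word.
    words = []
    for ch, tag in zip(text, tags):
        if tag == "a":
            words.append(ch)
        else:
            words[-1] += ch
    return words
```

Storing the full state path per state keeps the sketch short; a production decoder would normally keep back-pointers instead.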

The step of calculating conditional probabilities using the language model comprises:

p(w) = p(z_1) p(z_2 | z_1) p(z_3 | z_2 z_1) ... p(z_k | z_{k-1} ... z_2 z_1)    (1)

where w is a word consisting of k characters, and z_1, z_2, ..., z_k are respectively the 1st, 2nd, ..., k-th characters of w. The likelihood that w is a word, i.e. the existence probability p(w) of w, can thus be converted into the probabilities of its constituent characters, wherein the probability of each character is calculated by dividing the number of times the character occurs in the text used to train the language model by the total number of characters.
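Formula (1) can be illustrated with plain n-gram counts. The sketch below is an assumption-laden simplification: it uses unsmoothed maximum-likelihood estimates, whereas a real language model toolkit would apply smoothing, and the function names are hypothetical. The single-character case matches the count-over-total definition given above.

```python
from collections import Counter


def ngram_counts(corpus_lines, max_order=4):
    """Count every character n-gram of length 1..max_order in the corpus."""
    counts = Counter()
    for line in corpus_lines:
        for n in range(1, max_order + 1):
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return counts


def word_probability(word, counts, total_chars):
    """Chain-rule probability of word, as in formula (1), from raw counts."""
    prob = 1.0
    for i, ch in enumerate(word):
        history = word[:i]
        if i == 0:
            # Unigram: occurrences of the character over the total character count.
            prob *= counts[ch] / total_chars
        else:
            # Conditional: count(history + ch) / count(history), zero if unseen.
            denom = counts[history]
            prob *= counts[history + ch] / denom if denom else 0.0
    return prob
```

Words longer than `max_order` characters get probability zero in this sketch, because their longer n-grams are never counted.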

The step of summarizing the preliminary word segmentation result according to the part of speech relationship, the fixed collocation of sentence patterns and the linguistic knowledge, sorting out segmentation rules and forming rule files comprises the following steps:

sequencing the preliminary word segmentation results according to the pinyin sequence;

sequentially processing the sorted preliminary word segmentation results; the treatment specifically comprises the following steps:

for a result in which a Chinese character and a punctuation mark are segmented into one word, writing the punctuation mark into the rule file as a rule;

for a result in which several Chinese characters are segmented into one word, collecting the words with the same first character and judging them according to part of speech and linguistic knowledge, for example, when a word is in verb + noun form, splitting it into two words, namely a verb and a noun, and when a word is in adjective + noun form, splitting it into two words, namely an adjective and a noun; and writing a rule into the rule file according to the verb;

for a result in which several Chinese characters are segmented into one word, collecting the words with the same first two characters and judging them according to part of speech and linguistic knowledge in the same way; and writing a rule into the rule file according to the shared first two characters;

for a result in which several Chinese characters are segmented into one word, collecting the words with the same final character and judging them according to part of speech and linguistic knowledge in the same way; and writing a rule into the rule file according to the final character.
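The grouping that precedes the manual judgment can be sketched as below. This only prepares candidates for a human reviewer; the part-of-speech judgment itself is not automated here. The function name and group labels are hypothetical, and Python's built-in `sorted` orders by code point rather than pinyin, so a true pinyin ordering would need an additional transliteration library.

```python
from collections import defaultdict


def group_for_review(words):
    """Group multi-character words by first character, first two characters, and final character."""
    groups = {
        "first_char": defaultdict(list),
        "first_two": defaultdict(list),
        "last_char": defaultdict(list),
    }
    for word in sorted(set(words)):
        if len(word) < 2:
            continue  # single characters cannot be split further
        groups["first_char"][word[0]].append(word)
        groups["last_char"][word[-1]].append(word)
        if len(word) >= 3:
            groups["first_two"][word[:2]].append(word)
    return groups
```

A reviewer could then scan, for example, every word in `groups["first_char"]` that shares a leading verb and decide whether a splitting rule for that verb belongs in the rule file.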

The step of summarizing the preliminary word segmentation result according to the part of speech relationship, the fixed collocation of sentence patterns and the linguistic knowledge, sorting out segmentation rules and forming rule files further comprises the following steps:

counting all words in the preliminary word segmentation result and the occurrence frequency of each word;

sorting the words by occurrence count from high to low to obtain a preset number of the most frequent words;

judging whether each of the preset number of words is in a preset word list;

if so, writing a rule into the rule file according to the word.
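The frequency-based selection can be sketched with the standard library. The function and parameter names are hypothetical, and the contents of the preset word list are left to the caller, since the text does not specify them.

```python
from collections import Counter


def frequent_listed_words(segmented_words, preset_words, top_n):
    """Return (word, count) pairs for the top_n most frequent words that are in preset_words."""
    counts = Counter(segmented_words)
    return [(word, n) for word, n in counts.most_common(top_n) if word in preset_words]
```

Each returned word is a candidate for a rule in the rule file; how the rule is phrased remains a manual decision.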

The step of using the word list to correct the first correction result to obtain a final word segmentation result comprises the following steps:

searching a word in the word list in the original text to serve as a word to be corrected;

recording the position of the word to be corrected in the original text;

finding the word segmentation result of the word to be corrected after the first correction according to the recorded position;

judging whether the word segmentation result is consistent with the words in the word list or not;

if not, modifying the word segmentation result of the word to be corrected according to a word list; if the word segmentation results are consistent, the word segmentation results are reserved;

and sequentially carrying out word list correction on the first correction result to obtain a final word segmentation result.

The step of using the vocabulary to correct the first correction result to obtain a final word segmentation result further comprises:

performing long and short word disambiguation when a word to be corrected is both a word in the word list and a sub-word of another word in the word list.
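The word list correction can be sketched in terms of cut positions in the original text. Two assumptions are made here and are not taken from the text: long and short word disambiguation is resolved by preferring the longest word list entry at each position, and the first-corrected tokens concatenate back to the original text. All names are hypothetical.

```python
def vocab_spans(text, vocab):
    """Find non-overlapping word list matches, preferring the longest match at each position."""
    spans = []
    max_len = max((len(w) for w in vocab), default=0)
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in vocab:
                spans.append((i, i + n))
                i += n
                break
        else:
            i += 1
    return spans


def correct_with_vocab(text, tokens, vocab):
    """Re-cut tokens so that every matched word list entry is exactly one token."""
    if "".join(tokens) != text:
        raise ValueError("tokens must concatenate to the original text")
    # Record the end position of every token as a cut point.
    cuts = set()
    pos = 0
    for tok in tokens:
        pos += len(tok)
        cuts.add(pos)
    # Remove cuts inside a matched word and force cuts at its two boundaries.
    for start, end in vocab_spans(text, vocab):
        cuts = {c for c in cuts if c <= start or c >= end}
        cuts.update((start, end))
    cuts.discard(0)
    result, prev = [], 0
    for c in sorted(cuts):
        result.append(text[prev:c])
        prev = c
    return result
```

Working with cut positions handles both cases the method names: a word list entry split into several tokens, and a token boundary that falls in the wrong place.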

A word segmentation device for ancient Chinese medical book documents is characterized by comprising:

the preprocessing module is used for preprocessing ancient book documents in the field of traditional Chinese medicine to generate a corpus for training a language model;

the training module is used for training on the corpus to generate a language model;

the word segmentation module is used for carrying out unsupervised word segmentation on the ancient book documents by using the language model to generate a preliminary word segmentation result;

the rule establishing module is used for summarizing the preliminary word segmentation result according to the part of speech relationship, the fixed collocation of the sentence patterns and the linguistic knowledge, and sorting out segmentation rules to form a rule file;

and the first correction module is used for correcting the preliminary word segmentation result for the first time according to the rule in the rule file to generate a first correction result.

The device further comprises:

the acquisition module is used for acquiring traditional Chinese medicine field terms sorted according to the ancient book documents as a word list;

and the second correction module corrects the first correction result by using the word list to obtain a final word segmentation result.

In the above embodiment, the word segmentation accuracy in the field of traditional Chinese medicine is improved by designing the word segmentation method in the field of traditional Chinese medicine.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flow chart of a word segmentation method for ancient Chinese medical book literature according to an embodiment of the present invention;

FIG. 2 is a schematic connection diagram of the word segmentation apparatus for ancient Chinese medical book literature according to the present invention;

FIG. 3 is a flow chart of the method for word segmentation of ancient book literature in traditional Chinese medicine;

FIG. 4 is a corpus pre-processing result according to the present invention;

FIG. 5 is a rule file of the present invention;

FIG. 6 shows the word segmentation result of the present invention.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the word segmentation method for ancient books of traditional Chinese medicine according to the present invention includes:

Step 101, preprocessing ancient book documents in the field of traditional Chinese medicine to generate a corpus for training a language model; wherein the step of preprocessing the ancient book documents comprises: acquiring an original text of the ancient book document, deleting the catalogue of the ancient book document from the original text, deleting sentences containing characters which cannot be expressed in UTF-8, and generating a cleaned text; and adding a space after each character in the cleaned text, the result being used as the corpus for training the language model.

Step 102, training on the corpus to generate a language model;

Step 103, performing unsupervised word segmentation on the ancient book documents by using the language model to generate a preliminary word segmentation result;

with the four character states a, b, c and d, there are 8 non-zero state transition probabilities, namely:

p(a|a)，p(b|a)，p(c|b)，p(a|b)，p(a|c)，p(d|c)，p(a|d)，p(d|d)；

wherein p(a|a) represents the conditional probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is a single-character word; p(b|a) represents the conditional probability that the following character is the second character of a multi-character word given that the preceding character is the first character of the multi-character word; p(c|b) represents the conditional probability that the following character is the third character of a multi-character word given that the preceding character is the second character of the multi-character word; p(a|b) represents the conditional probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is the second character of a multi-character word; p(a|c) represents the conditional probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is the third character of a multi-character word; p(d|c) represents the conditional probability that the following character is the remaining part of a multi-character word given that the preceding character is the third character of the multi-character word; p(a|d) represents the conditional probability that the following character is a single-character word or the first character of a multi-character word given that the preceding character is the remaining part of a multi-character word; and p(d|d) represents the conditional probability that the following character is the remaining part of a multi-character word given that the preceding character is the remaining part of the multi-character word.

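the transition probabilities obtained by experimental comparison are: p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001;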

calculating a conditional probability using the language model;

p(w) = p(z_1) p(z_2 | z_1) p(z_3 | z_2 z_1) ... p(z_k | z_{k-1} ... z_2 z_1)    (1)

where w is a word consisting of k characters, and z_1, z_2, ..., z_k are respectively the 1st, 2nd, ..., k-th characters of w. The likelihood that w is a word, i.e. the existence probability p(w) of w, can thus be converted into the probabilities of its constituent characters, wherein the probability of each character is calculated by dividing the number of times the character occurs in the text used to train the language model by the total number of characters.

Step 104, summarizing the preliminary word segmentation result according to the part of speech relationship, the fixed collocation of the sentence patterns and the linguistic knowledge, and sorting out segmentation rules to form rule files;

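in step 104, when a high-frequency word of the preliminary word segmentation result is in the preset word list, a rule is written into the rule file according to that word.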

Step 105, performing a first correction on the preliminary word segmentation result according to the rules in the rule file to generate a first correction result.

The method further comprises the following steps:

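Step 106, acquiring traditional Chinese medicine field terms sorted according to the ancient book documents as a word list;

Step 107, correcting the first correction result by using the word list to obtain a final word segmentation result. This step includes searching the original text for words in the word list as words to be corrected, recording the position of each word to be corrected in the original text, and checking the first-corrected word segmentation result at that position against the word list.

The step may further include performing long and short word disambiguation when a word to be corrected is both a word in the word list and a sub-word of another word in the word list.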

As shown in fig. 2, the word segmentation device for ancient books of traditional Chinese medicine according to the present invention includes:

a preprocessing module 21, used for preprocessing ancient book documents in the field of traditional Chinese medicine to generate a corpus for training a language model;

a training module 22, used for training on the corpus to generate a language model;

the word segmentation module 23 is used for performing unsupervised word segmentation on the ancient book documents by using the language model to generate a preliminary word segmentation result;

the rule establishing module 24 summarizes the preliminary word segmentation result according to the part of speech relationship, the fixed collocation of the sentence patterns and the linguistic knowledge, and arranges segmentation rules to form a rule file;

and the first correction module 25 is used for performing first correction on the preliminary word segmentation result according to the rule in the rule file to generate a first correction result.

The device further comprises:

an obtaining module 26, for obtaining Chinese medicine field terms sorted according to the ancient book documents as a word list;

and a second correction module 27, which corrects the first correction result by using the word list to obtain a final word segmentation result.

The following describes an application scenario of the present invention. The invention provides a word segmentation method based on ancient book literature in the field of traditional Chinese medicine, aiming at solving the word segmentation problem of the ancient book literature in the field of traditional Chinese medicine. The specific implementation steps are as follows:

Step one: ancient book documents related to the field of traditional Chinese medicine are obtained as the corpus for training the language model, and terms specific to the field of traditional Chinese medicine are collected as the word list. The word list is stored in TXT format, one word per line.

Step two: the documents are preprocessed, and the language model is trained using the KenLM tool.

The preprocessing comprises:

(1) deleting the catalogue, and deleting sentences that contain special characters which cannot be expressed in UTF-8 (sentences are delimited by periods, exclamation marks and question marks);

(2) adding a space after each character in the cleaned text, the result being used as the corpus for training the character-level language model.

Step three: the language model is used for carrying out primary unsupervised word segmentation on ancient book documents in the field of traditional Chinese medicine.

In ancient texts, words longer than four characters are relatively rare, so in this patent the state of a character is divided into four types: a single-character word or the first character of a multi-character word is marked as a, the second character of a multi-character word is marked as b, the third character of a multi-character word is marked as c, and the remaining part of a multi-character word is marked as d. A single-character word can be followed by a single-character word or the first character of a multi-character word; the first character of a multi-character word can be followed by the second character of the multi-character word; the second character can be followed by the third character of the multi-character word, or by a single-character word or the first character of a multi-character word; the third character can be followed by the remaining part of the multi-character word, or by a single-character word or the first character of a multi-character word; the remaining part can be followed by a single-character word or the first character of a multi-character word, or by a further remaining part of the multi-character word. All other transition probabilities are zero. In summary, there are 8 non-zero transition probabilities, namely:

p(a|a)，p(b|a)，p(c|b)，p(a|b)，p(a|c)，p(d|c)，p(a|d)，p(d|d)。

Since long words are rare in ancient Chinese and many single characters carry complete semantics, different transition probabilities were compared experimentally, yielding:

p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001

The conditional probabilities are calculated using the language model obtained in step two, as shown in formula (1), where w is a word composed of k characters, so that the probability of w can be converted into the probabilities of its constituent characters. The optimal path, namely the path with the maximum probability, is found by a dynamic programming method and taken as the segmentation result, giving the preliminary word segmentation result.

p(w) = p(z_1) p(z_2 | z_1) p(z_3 | z_2 z_1) ... p(z_k | z_{k-1} ... z_2 z_1)    (1)

Step four: summarizing the preliminary word segmentation result in the third step according to the part of speech relation, the fixed collocation of the sentence patterns and the linguistic knowledge, and sorting out segmentation rules to form rule files. The method for forming the rule file is summarized as follows:

(1) counting all words in the word segmentation result obtained in step three and the occurrence frequency of each word;

(2) sorting the statistical results in pinyin order;

(3) for a result in which a Chinese character and a punctuation mark are segmented into one word, writing the punctuation mark into the rule file as a rule;

(4) for a result in which several Chinese characters are segmented into one word, first collecting the words with the same first character and then judging them according to part of speech and linguistic knowledge; for example, a word in verb + noun form is generally split into two words, a verb and a noun, and the verb is added to the rule file;

(5) collecting the words with the same first two characters, judging them according to part of speech and linguistic knowledge, and adding the corresponding words to the rule file;

(6) collecting the words with the same final character, judging them according to part of speech and linguistic knowledge, and adding the corresponding words to the rule file;

(7) sorting the words by frequency from high to low and judging the high-frequency words.

The rule file is stored in TXT format, with one rule per line. The rules are described in Table 1 below:

TABLE 1 rules File

Step five: performing a first correction on the preliminary word segmentation result of step three according to the rules in the rule file.

Step six: correcting the first correction result of step five by using the word list. The detailed correction method is as follows:

(1) searching the original text for words that appear in the word list, and recording the positions where they appear;

(2) if a word is both a word in the word list and a sub-word of another word in the word list, performing long and short word disambiguation to determine which word in the word list should be used;

(3) finding, according to the positions recorded in (1), the corresponding word segmentation result after the first correction in step five;

(4) if the word segmentation result is inconsistent with the word list, that is, a word in the word list has been split into several words or a segmentation boundary is incorrect, merging and modifying the word segmentation result according to the word list; if they are consistent, retaining the result;

(5) obtaining the final word segmentation result after the word list correction.

The invention has the advantages that:

Among existing word segmentation tools, there is neither a word segmentation device for ancient Chinese text nor one dedicated to the field of traditional Chinese medicine. In terms of expression, ancient Chinese differs greatly from modern Chinese; for example, modern Chinese generally uses "wool", "do", "ya" and the like as particles, whereas the particles of ancient Chinese are generally "he", "do", "he" and "also"; and compared with modern Chinese, more single characters in ancient Chinese carry complete semantics. As a result, existing word segmentation devices frequently produce erroneous results on ancient text segmentation tasks. In addition, ancient books of traditional Chinese medicine describe a large number of medicine names and prescriptions, and these expressions specific to the field of traditional Chinese medicine are rarely seen in the general domain, so general-domain word segmentation devices currently cannot handle the word segmentation task on ancient books of traditional Chinese medicine well. The invention uses an unsupervised method to train a word segmentation model specifically for ancient books of traditional Chinese medicine, and can handle this word segmentation task well. Moreover, because the method is unsupervised, the labor and time cost of manual labeling is saved. The method is also easy to extend to ancient documents of other fields: the word segmentation device of the invention can be applied to another field by changing the data set used to train the model and the word list of the relevant field.

The word segmentation method is illustrated below with an ancient book of traditional Chinese medicine as an example, as shown in fig. 3.

Firstly, ancient books of traditional Chinese medicine are obtained as the corpus for training the language model. Terms specific to the field of traditional Chinese medicine are collected as the word list, as shown in Table 1.

Table 1: word list of special terms in traditional Chinese medicine field

Secondly, the corpus is preprocessed; the preprocessing result is shown in fig. 4. The language model is trained using the KenLM tool with the n-gram order set to 4.

Thirdly, unsupervised word segmentation is carried out on ancient book documents in the field of traditional Chinese medicine by using the language model trained in the step two. The words are separated by spaces.

Fourthly, the preliminary word segmentation result is sorted and summarized. First, the results are sorted by pinyin, and the results in which a punctuation mark and Chinese characters are segmented into one word are handled. For example, if "poria," (the herb name followed by a comma) appears as one word in the word segmentation result, a rule for "," is added to the rule file, where "x" in a rule stands for any character; that is, wherever "," appears, it is split off from whatever character it is connected to. Next, the words with the same first character are all found. According to common linguistic knowledge, a "verb + noun" combination should generally be split into two words; for example, "eat warm wine", "eat salt" and "eat hot substance" all begin with the character "eat", so a rule for "eat" is added to the rule file. Similarly, the words with the same first two characters and the words with the same final character are collected, and the corresponding rules are added to the rule file. Finally, the preliminary word segmentation results are sorted by word frequency from high to low, the high-frequency words are checked, and the corresponding segmentation rules are added to the rule file.

Fifthly, the rule in the rule file is used for correcting the primary word segmentation result. The rule file is shown in fig. 5.
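How rules of the two kinds described above might be applied can be sketched as follows. The actual rule file syntax is not reproduced in this text, so the representation here is an assumption: `split_chars` is a hypothetical set of characters (such as punctuation marks) that are always isolated, and `split_prefixes` is a hypothetical list of leading words (such as a verb) that are split off from whatever follows them.

```python
def apply_rules(tokens, split_chars, split_prefixes):
    """Apply character-isolation rules, then prefix-splitting rules, to each token."""
    corrected = []
    for token in tokens:
        # Rule type 1: isolate every listed character as its own token.
        pieces, buffer = [], ""
        for ch in token:
            if ch in split_chars:
                if buffer:
                    pieces.append(buffer)
                    buffer = ""
                pieces.append(ch)
            else:
                buffer += ch
        if buffer:
            pieces.append(buffer)
        # Rule type 2: split a listed prefix from the rest of the piece.
        for piece in pieces:
            for prefix in split_prefixes:
                if piece.startswith(prefix) and len(piece) > len(prefix):
                    corrected.extend([prefix, piece[len(prefix):]])
                    break
            else:
                corrected.append(piece)
    return corrected
```

With the sample strings used for illustration, `apply_rules(["茯苓，", "食温酒"], {"，"}, ["食"])` separates the comma from the herb name and the verb from its object. Rules keyed on the final character could be added in the same style.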

Sixthly, correcting the result obtained in the fifth step by using a word list.

First, the words in the vocabulary are found in the original non-participled chinese medical literature and the position is recorded.

Then, the words of the word list are found in the word segmentation result corrected by the rule.

If the word segmentation result is inconsistent with the words in the word list, the result is corrected according to the word list. For example, the original traditional Chinese medicine literature contains the sentence "add or subtract six ingredients decoction to treat enuresis, and many of the deficient people have the syndrome". Suppose the word segmentation result after the rule correction splits "add or subtract six ingredients decoction" into more than one word. However, "add or subtract six ingredients decoction" is a word in the word list, namely the name of a traditional Chinese medicine prescription; in this case, it is combined into one word in the final word segmentation result, while the rest of the sentence keeps its segmentation. The form of the word segmentation result is shown in fig. 6.
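The word-list correction above can be sketched as follows. The function names are hypothetical, the sample strings in the usage note are illustrative only, and the choice to prefer the longer word when two word-list entries overlap is an assumption rather than something this passage specifies.

```python
from typing import Iterable, List, Tuple


def find_spans(text: str, vocab: Iterable[str]) -> List[Tuple[int, int]]:
    """Record (start, end) positions of word-list entries in the raw text.

    Assumption: longer entries are matched first and overlapping shorter
    matches are skipped.
    """
    spans: List[Tuple[int, int]] = []
    for word in sorted({w for w in vocab if w}, key=len, reverse=True):
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if all(end <= s or start >= e for s, e in spans):
                spans.append((start, end))
            start = text.find(word, start + 1)
    return sorted(spans)


def correct_with_vocab(tokens: List[str], vocab: Iterable[str]) -> List[str]:
    """Merge or split tokens so every word-list entry is exactly one token."""
    text = "".join(tokens)
    # Cut positions implied by the current segmentation.
    cuts = set()
    pos = 0
    for token in tokens:
        pos += len(token)
        cuts.add(pos)
    for start, end in find_spans(text, vocab):
        cuts -= set(range(start + 1, end))  # no cut inside a listed word
        cuts.update({start, end})           # cuts at both of its edges
    cuts.discard(0)
    result: List[str] = []
    prev = 0
    for cut in sorted(cuts):
        result.append(text[prev:cut])
        prev = cut
    return result
```

With the illustrative tokens `["加减", "六味汤", "治", "遗尿"]` and a word list containing `"加减六味汤"`, the first two tokens are merged and the remaining tokens are kept unchanged.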

For convenience of description, the above device is described with its functions divided into various units/modules. Of course, when implementing the invention, the functions of the units/modules may be implemented in one or more pieces of software and/or hardware.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A word segmentation method for ancient Chinese medical book documents is characterized by comprising the following steps:

preprocessing ancient book documents in the field of traditional Chinese medicine to generate a corpus of a training language model;

training the corpus to generate a language model;

according to the rule in the rule file, performing first correction on the preliminary word segmentation result to generate a first correction result;

for results in which Chinese characters and Chinese characters are divided into one word, judging words with the same first character according to part of speech and linguistic knowledge: when a word is in verb + noun form, splitting it into two words, a verb and a noun; when a word is in adjective + noun form, splitting it into two words; and writing a rule file according to the same first character;

for results in which Chinese characters and Chinese characters are divided into one word, judging words with the same first two characters according to part of speech and linguistic knowledge: when a word is in verb + noun form, splitting it into two words, a verb and a noun; when a word is in adjective + noun form, splitting it into two words; and writing a rule file according to the same first two characters;

for results in which Chinese characters and Chinese characters are divided into one word, sorting out words with the same final character and judging them according to part of speech and linguistic knowledge: when a word is in verb + noun form, splitting it into two words, a verb and a noun; when a word is in adjective + noun form, splitting it into two words; and writing a rule file according to the final character.

2. The method of claim 1, further comprising:

3. The method of claim 1, wherein the step of pre-processing the ancient book documents comprises:

4. The method of claim 2, wherein said step of unsupervised word segmentation of the ancient book documents using the language model comprises:

in summary, there are 8 non-zero state transition probabilities, namely:

p(a|a), p(b|a), p(c|b), p(a|b), p(a|c), p(d|c), p(a|d), p(d|d);

wherein p(a|a) represents the conditional probability that the following character is a single-character word or the first character of a multi-character word, given that the preceding character is a single-character word; p(b|a) represents the conditional probability that the following character is the second character of a multi-character word, given that the preceding character is the first character of the multi-character word; p(c|b) represents the conditional probability that the following character is the third character of a multi-character word, given that the preceding character is the second character of the multi-character word; p(a|b) represents the conditional probability that the following character is a single-character word or the first character of a multi-character word, given that the preceding character is the second character of a multi-character word; p(a|c) represents the conditional probability that the following character is a single-character word or the first character of a multi-character word, given that the preceding character is the third character of a multi-character word; p(d|c) represents the conditional probability that the following character belongs to the remaining part of a multi-character word, given that the preceding character is the third character of the multi-character word; p(a|d) represents the conditional probability that the following character is a single-character word or the first character of a multi-character word, given that the preceding character belongs to the remaining part of a multi-character word; p(d|d) represents the conditional probability that the following character belongs to the remaining part of a multi-character word, given that the preceding character belongs to the remaining part of the multi-character word;

p(a|a) = 0.96, p(b|a) = 0.2, p(c|b) = 0.009, p(a|b) = 0.9, p(a|c) = 1, p(d|c) = 0.005, p(a|d) = 1, p(d|d) = 0.0001;

calculating a conditional probability using the language model;
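For illustration only, the eight transition values listed above can drive a Viterbi-style search over the states a, b, c and d, as sketched below. How the per-character scores are derived from the language model is not reproduced in this text, so `emissions` is a hypothetical input (one dictionary of log scores per character); the listed values are used as given, as weights, even though they are not normalised per state; and the requirement that the first character be in state a is an assumption.

```python
import math
from typing import Dict, List, Tuple

# Keys are (previous state, next state); values are the listed p(next|previous).
TRANS: Dict[Tuple[str, str], float] = {
    ("a", "a"): 0.96, ("a", "b"): 0.2,
    ("b", "c"): 0.009, ("b", "a"): 0.9,
    ("c", "a"): 1.0, ("c", "d"): 0.005,
    ("d", "a"): 1.0, ("d", "d"): 0.0001,
}


def viterbi(emissions: List[Dict[str, float]]) -> str:
    """Return the highest-scoring state string, one state per character."""
    if not emissions:
        return ""
    # Assumption: the first character always starts a word (state a).
    best = {"a": (emissions[0].get("a", -math.inf), "a")}
    for emit in emissions[1:]:
        nxt: Dict[str, Tuple[float, str]] = {}
        for (prev, cur), p in TRANS.items():
            if prev not in best:
                continue  # transitions not listed are treated as impossible
            score = best[prev][0] + math.log(p) + emit.get(cur, -math.inf)
            if cur not in nxt or score > nxt[cur][0]:
                nxt[cur] = (score, best[prev][1] + cur)
        best = nxt
    return max(best.values())[1]


def tags_to_words(text: str, tags: str) -> List[str]:
    """State a opens a new word; b, c and d extend the current word."""
    words: List[str] = []
    for ch, tag in zip(text, tags):
        if tag == "a" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words
```

Joining the returned words with spaces gives output in the space-separated form used for the preliminary word segmentation result.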

5. The method of claim 4, wherein the step of calculating the conditional probability using the language model comprises:

p(w) = p(z_1) p(z_2 | z_1) p(z_3 | z_2 z_1) ... p(z_k | z_{k-1} ... z_2 z_1)    (1)

where w is a word consisting of k characters, and z_1, z_2, ..., z_k are respectively the 1st, 2nd, ..., k-th characters of the word w. The possibility that w is one word, i.e., the existence probability p(w) of the word w, can be converted into the existence probabilities of its constituent characters, wherein the existence probability of each character is calculated by dividing the number of times the character appears in the text used to train the language model by the total number of characters.
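Formula (1) and the character existence probability can be sketched as below. The callback `cond_prob(ch, history)` is a hypothetical stand-in for a query to the trained language model and is not an API of any particular library; `history` holds the preceding characters of the word in reading order and is empty for the first character. Logarithms are summed to avoid underflow on long words.

```python
import math
from collections import Counter
from typing import Callable, Dict


def char_probabilities(text: str) -> Dict[str, float]:
    """Existence probability of each character: count / total characters."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def word_log_prob(word: str, cond_prob: Callable[[str, str], float]) -> float:
    """log p(w) as the sum of log p(z_i | z_{i-1} ... z_1), per formula (1)."""
    return sum(
        math.log(cond_prob(word[i], word[:i])) for i in range(len(word))
    )
```

In practice `cond_prob` would be backed by the 4-gram model, with `char_probabilities` covering the first factor p(z_1).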

6. The method according to claim 1, wherein the step of summarizing the preliminary word segmentation result according to part-of-speech relations, fixed collocation of sentence patterns and linguistic knowledge, sorting segmentation rules, and forming a rule file further comprises:

if so, a rule file is compiled according to the words.

7. The method according to claim 2, wherein the step of modifying the first modification result by using the vocabulary to obtain a final segmentation result comprises:

searching a word in the word list in the original text as a word to be corrected;

recording the position of the word to be corrected in the original text;

if not, modifying the word segmentation result of the word to be corrected according to the word list; if the word segmentation results are consistent, the word segmentation results are reserved;

sequentially carrying out word list correction on the first correction result to obtain a final word segmentation result; or

performing long and short word disambiguation when there is a first word that is both a word in the vocabulary and a subword of a word in the vocabulary.

8. A word segmentation device for ancient Chinese medical book documents is characterized by comprising:

the first correction module is used for correcting the preliminary word segmentation result for the first time according to the rule in the rule file to generate a first correction result;

wherein the rule establishing module is used for: for results in which Chinese characters and punctuation marks are divided into one word, writing a rule file according to the punctuation marks;

for results in which Chinese characters and Chinese characters are divided into one word, judging words with the same first character according to part of speech and linguistic knowledge: when a word is in verb + noun form, splitting it into two words, a verb and a noun; when a word is in adjective + noun form, splitting it into two words; and writing a rule file according to the same first character;

for results in which Chinese characters and Chinese characters are divided into one word, judging words with the same first two characters according to part of speech and linguistic knowledge: when a word is in verb + noun form, splitting it into two words, a verb and a noun; when a word is in adjective + noun form, splitting it into two words; and writing a rule file according to the same first two characters;

for results in which Chinese characters and Chinese characters are divided into one word, sorting out words with the same final character and judging them according to part of speech and linguistic knowledge: when a word is in verb + noun form, splitting it into two words, a verb and a noun; when a word is in adjective + noun form, splitting it into two words; and writing a rule file according to the final character.

9. The apparatus of claim 8, further comprising: