CN108845982A - Chinese word segmentation method based on word association characteristics - Google Patents
Chinese word segmentation method based on word association characteristics
- Publication number
- CN108845982A (application CN201711293044.8A, granted as CN108845982B)
- Authority
- CN
- China
- Prior art keywords
- word
- candidate
- words
- corpus
- quaternary
- Prior art date
- 2017-12-08
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The present invention relates to a Chinese word segmentation method based on word association characteristics, belonging to the technical field of information processing. A text to be processed is selected from a text library, and the text library is preprocessed: symbols are removed and sentences are formed, and a corpus is constructed from the sentences after symbol removal. The corpus is then segmented with a front-and-back word splicing segmentation method to form segmentation fragments. Binary, ternary and quaternary candidate lexicons are built with the binary-, ternary- and quaternary-segmentation front-and-back word splicing methods. A word frequency threshold is set for the counted candidate words and each candidate is judged against it; candidates passing the judgement are retained to form a new corpus.
Description
Technical Field
The invention relates to a Chinese word segmentation method based on word association characteristics, and belongs to the technical field of information processing.
Background
Chinese word segmentation technology belongs to the field of natural language processing. For a given sentence, a person can tell from prior knowledge which character strings are words and which are not, but how can a computer do the same? The process that accomplishes this is the word segmentation algorithm.
Existing word segmentation algorithms fall into three major categories: understanding-based methods, character-string-matching-based methods, and traditional statistics-based methods.
The understanding-based word segmentation method recognizes words by having a computer simulate human understanding of a sentence. The basic idea is to analyze syntax and semantics while segmenting, using syntactic and semantic information to resolve ambiguity. Such a system generally comprises three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the word segmentation subsystem obtains syntactic and semantic information about the words and sentences involved to judge segmentation ambiguity; that is, it simulates how a person understands a sentence. This approach requires a large amount of linguistic knowledge and information. Because of the generality and complexity of Chinese, it is difficult to organize the various kinds of linguistic information into a form a machine can read directly, so existing understanding-based segmentation systems remain at the experimental stage.
The word segmentation method based on character string matching, also called the mechanical word segmentation method, matches the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary according to some strategy; if a string is found in the dictionary, the match succeeds (a word is identified). By scanning direction, string-matching segmentation divides into forward matching and reverse matching; by length preference, into maximum (longest) matching and minimum (shortest) matching; and by whether it is combined with part-of-speech tagging, into pure segmentation methods and integrated methods that combine segmentation and tagging. Commonly used mechanical methods include: (1) forward maximum matching (left to right); (2) reverse maximum matching (right to left); (3) least segmentation (minimizing the number of words cut from each sentence). These can also be combined with each other; for example, forward and reverse maximum matching can be combined into a bidirectional matching method. Because single Chinese characters can themselves be words, forward minimum matching and reverse minimum matching are rarely used. In general, reverse matching is slightly more precise than forward matching and encounters less ambiguity. In practical segmentation systems, mechanical segmentation serves as an initial pass, and other linguistic information is used to further improve accuracy. One improvement is to refine the scanning scheme, called feature scanning or mark segmentation: words with obvious features are identified and cut out of the string to be analyzed first, and are used as break points that split the original string into smaller strings for mechanical segmentation, reducing the matching error rate. Another improvement combines segmentation with part-of-speech tagging, using rich part-of-speech information to aid segmentation decisions and checking and adjusting the segmentation result during tagging, which greatly improves precision.
In the string-matching (mechanical) methods above, whether forward maximum matching, reverse maximum matching, or least segmentation, the maximum matching approach tries to maximize the match length between each fragment and the dictionary entries. Its advantages are a simple principle and easy implementation; its defect is that the maximum match length is hard to choose: if it is too large, the time complexity increases, and if it is too small, words exceeding the maximum match length cannot be matched, reducing segmentation accuracy. The guiding principle of maximum matching is "long word first". However, existing maximum matching methods, whether forward or reverse, add or remove characters and perform maximum matching only within a local window of the first or last i characters each time, which does not fully embody the "long word first" principle.
The principle of the traditional statistical word segmentation method is that a word is a stable combination of characters, so the more often adjacent characters co-occur in context, the more likely they are to form a word. The frequency or probability of co-occurrence between a character and its neighbours therefore reflects the credibility of a word. The frequencies of adjacent co-occurring character pairs in the corpus can be counted to compute their co-occurrence information: the co-occurrence information of two characters is defined by the adjacent co-occurrence probability of two Chinese characters X and Y. This mutual information embodies the tightness of the combination between the characters; when it exceeds a certain threshold, the character group is considered likely to form a word. Since this method only counts character-group frequencies in the corpus and needs no dictionary, it is called dictionary-free segmentation or statistical word extraction. Its limitation is that frequently co-occurring character groups that are not words, such as "this", "one", "some", "my" and "many", are often extracted, recognition accuracy for common words is poor, and the space-time overhead is large.
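By way of illustration only, the following sketch computes the adjacent-character co-occurrence statistic described above as pointwise mutual information; the function name, threshold value and corpus handling are assumptions of the example, not part of the patent.

```python
# Illustrative sketch (not from the patent): pointwise mutual information
# (PMI) of adjacent character pairs. Pairs whose PMI exceeds a threshold
# are tight combinations that may form words. The threshold is an assumption.
from collections import Counter
from math import log2

def pmi_pairs(corpus: str, threshold: float = 3.0) -> dict:
    """Return adjacent character pairs whose PMI exceeds `threshold`."""
    chars = Counter(corpus)                    # unigram counts
    pairs = Counter(zip(corpus, corpus[1:]))   # adjacent bigram counts
    n = len(corpus)
    result = {}
    for (x, y), nxy in pairs.items():
        # PMI(X, Y) = log2( P(XY) / (P(X) * P(Y)) )
        pmi = log2((nxy / (n - 1)) / ((chars[x] / n) * (chars[y] / n)))
        if pmi > threshold:
            result[x + y] = pmi
    return result
```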
Disclosure of Invention
The invention aims to provide a Chinese word segmentation method based on word association characteristics, which is used for overcoming the defect that words cannot be effectively identified and extracted from a large-scale corpus in the prior art and realizing the effective identification and extraction of the words in the large-scale corpus by a computer system.
The technical scheme of the invention is as follows: a Chinese word segmentation method based on word association characteristics comprises the following steps:
a. selecting a text to be processed from a text library, preprocessing the text library, including removing symbols and forming sentences, and constructing a corpus by using the sentences after symbol removal;
b. performing word segmentation on the corpus in step a by adopting a word segmentation method of splicing words in front and back, to form word segmentation fragments;
c. forming a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank by adopting a binary segmentation pre-post word splicing method, a ternary segmentation pre-post word splicing method and a quaternary segmentation pre-post word splicing method;
d. performing word frequency statistics on binary candidate words, ternary candidate words and quaternary candidate words in the binary candidate word library, the ternary candidate word library and the quaternary candidate word library;
e. setting a word frequency threshold for the candidate words whose frequencies have been counted, judging each candidate word against the threshold, reserving the candidate words that meet the threshold to form a new corpus, and deleting the candidate words that do not;
f. calculating the degree of freedom and the degree of condensation of the candidate words in the corpus processed in step e, giving a uniform threshold of the degree of freedom and the degree of condensation for all the candidate words, judging against it, reserving the candidate words that meet the judgement, and deleting the candidate words that do not;
g. and further filtering the screened ternary candidate words and quaternary candidate words by adopting a word segmentation filtering method to form a new word bank.
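By way of illustration, steps a to g can be arranged as the pipeline sketched below. This is a reconstruction under stated assumptions, not the patent's reference implementation: the helper functions (ngram_segments, freedom, condensation, filter_ternary, filter_quaternary) are sketched later in this description, and the threshold values are placeholders.

```python
# Illustrative pipeline for steps a-g; a sketch under stated assumptions.
import re

def segment_pipeline(raw_text: str, freq_min=5, dof_min=1.0, coh_min=50.0):
    # a. preprocess: keep only Chinese characters, split into sentences
    sentences = [s for s in re.split(r"[^\u4e00-\u9fff]+", raw_text) if s]
    corpus = "".join(sentences)

    # b/c. front-and-back splicing into binary/ternary/quaternary candidates
    candidates = {n: ngram_segments(sentences, n) for n in (2, 3, 4)}

    # d/e. word-frequency statistics and frequency-threshold filtering
    candidates = {n: {w: f for w, f in c.items() if f >= freq_min}
                  for n, c in candidates.items()}

    # f. degree-of-freedom and degree-of-condensation filtering
    kept = {n: [w for w in c
                if freedom(w, corpus) >= dof_min
                and condensation(w, corpus) >= coh_min]
            for n, c in candidates.items()}

    # g. segmentation filtering of ternary and quaternary candidates
    lexicon = set(kept[2])
    lexicon |= {w for w in kept[3] if filter_ternary(w, kept[2], corpus)}
    lexicon |= {w for w in kept[4] if filter_quaternary(w, kept[2])}
    return lexicon
```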
The front-and-back word splicing method continuously cuts fragments starting from the first character of a Chinese sentence, so that every possible word-forming fragment is cut out, specifically as follows:
For the text content contained in a Chinese text, assume the character sequence:
{a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a character in the text and n ∈ N;
Performing binary segmentation and splicing on the character sequence with the binary-segmentation front-and-back word splicing method yields the binary text fragment set: {(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
Performing ternary segmentation and splicing with the ternary-segmentation front-and-back word splicing method yields the ternary text fragment set: {(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
Performing quaternary segmentation and splicing with the quaternary-segmentation front-and-back word splicing method yields the quaternary text fragment set: {(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
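A minimal sketch of this splicing step, assuming a simple sliding window over each sentence; the function name ngram_segments and the use of a Counter are illustrative choices:

```python
# Sketch of front-and-back splicing: slide a window of width n over each
# sentence and keep every n-character fragment (n = 2, 3, 4).
from collections import Counter

def ngram_segments(sentences, n: int) -> Counter:
    """Count all n-character fragments across the corpus."""
    counts = Counter()
    for s in sentences:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1   # fragment (a_i ... a_{i+n-1})
    return counts

# e.g. ngram_segments(["中华人民共和国"], 2) yields
# {'中华': 1, '华人': 1, '人民': 1, '民共': 1, '共和': 1, '和国': 1}
```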
The degree of freedom is defined as follows: when a text fragment appears in many different contexts, it has a left-neighbour set (the set of characters adjacent to the fragment on the left) and a right-neighbour set (the set of characters adjacent to the fragment on the right). The information entropy of each set is computed, and the smaller of the two entropies is taken as the fragment's degree of freedom.
In the obtained text fragment set, when a text fragment can appear in many different contexts with both a left-neighbour set and a right-neighbour set, the information entropy H of the fragment is obtained from the entropies of the left- and right-neighbour sets, i.e. H = min{s′, s″}, where H represents the degree of freedom of the candidate word, s′ represents the right entropy of the candidate word, and s″ is the left entropy of the candidate word; the smaller of the two entropies is taken as the degree of freedom.
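The degree-of-freedom computation can be sketched as below, assuming the corpus is one long string that is scanned for each candidate; all names are illustrative:

```python
# Sketch of the degree of freedom: collect the left- and right-neighbour
# sets of a candidate, compute each set's entropy, take the smaller one,
# i.e. H = min{s', s''}.
from collections import Counter
from math import log2

def _entropy(neighbours: Counter) -> float:
    total = sum(neighbours.values())
    return -sum((c / total) * log2(c / total) for c in neighbours.values())

def freedom(word: str, corpus: str) -> float:
    left, right = Counter(), Counter()
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1     # character adjacent on the left
        end = start + len(word)
        if end < len(corpus):
            right[corpus[end]] += 1          # character adjacent on the right
        start = corpus.find(word, start + 1)
    if not left or not right:
        return 0.0
    return min(_entropy(left), _entropy(right))   # H = min{s', s''}
```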
The degree of condensation reflects the fact that, in a text, a new word appears on its own with a higher probability than the product of the probabilities of its component parts, i.e. P(AB) > P(A)·P(B). Let M = P(AB) / (P(A)·P(B)) and take the minimum M over the possible splits as the degree of condensation, where AB represents a new word, P(AB) represents the probability of the new word appearing in the text, A and B respectively represent the component strings, and P(A) and P(B) respectively represent the probabilities of the component strings appearing in the text.
The condensation degree of a candidate word is obtained statistically by calculating the ratio of the candidate word's joint probability to the product of the independent probabilities of its components in the corpus, in the following specific steps:
(1) The condensation degree M_2 of a binary candidate word is obtained as the ratio of the candidate word's probability to the product of its component probabilities:
M_2 = p(i, i+1) / (s_i · s_{i+1}), with s_i = N_i / N, s_{i+1} = N_{i+1} / N, p(i, i+1) = N_{i,i+1} / N,
where M_2 represents the condensation degree of the candidate word, s_i the probability that the first character of the binary candidate appears in the corpus, N_i the number of occurrences of the first character of the binary candidate in the corpus, N_{i+1} the number of occurrences of the second character, N_{i,i+1} the number of occurrences of the binary candidate itself, N the total number of characters in the corpus, s_{i+1} the probability that the second character appears in the corpus, and p(i, i+1) the probability that the binary candidate appears in the corpus;
(2) The condensation degree M_3 of a ternary candidate word is obtained from the ratio of the candidate word's probability to the combined probabilities of its two possible splits, taking the smaller value:
M_3 = min{ P(i, i+1, i+2) / (s_i · s_{i+1,i+2}), P(i, i+1, i+2) / (s_{i,i+1} · s_{i+2}) },
where M_3 represents the condensation degree of the candidate word, s_i the probability that the first character of the ternary candidate appears in the corpus, s_{i+1,i+2} the probability that the last two characters appear together in the corpus, s_{i,i+1} the probability that the first two characters appear together, s_{i+2} the probability that the last character appears in the corpus, N_i the occurrences of the first character, N_{i+2} the occurrences of the third character, N_{i+1,i+2} the occurrences of the last two characters, N_{i,i+1} the occurrences of the first two characters, N_{i,i+1,i+2} the occurrences of the ternary candidate, and P(i, i+1, i+2) the probability that the ternary candidate appears in the corpus;
(3) The condensation degree M_4 of a quaternary candidate word is obtained from the ratio of the candidate word's probability to the combined probabilities of its three possible splits, taking the smallest value:
M_4 = min{ P(i, i+1, i+2, i+3) / (s_i · s_{i+1,i+2,i+3}), P(i, i+1, i+2, i+3) / (s_{i,i+1} · s_{i+2,i+3}), P(i, i+1, i+2, i+3) / (s_{i,i+1,i+2} · s_{i+3}) },
where M_4 represents the condensation degree of the candidate word, s_i the probability that the first character of the quaternary candidate appears in the corpus, s_{i+1,i+2,i+3} the probability that the last three characters appear together in the corpus, s_{i,i+1,i+2} the probability that the first three characters appear together, s_{i,i+1} the probability that the first two characters appear together, s_{i+2,i+3} the probability that the last two characters appear together, N_i the occurrences of the first character, N_{i+3} the occurrences of the fourth character, N_{i,i+1} the occurrences of the first two characters, N_{i+2,i+3} the occurrences of the last two characters, N_{i+1,i+2,i+3} the occurrences of the last three characters, N_{i,i+1,i+2} the occurrences of the first three characters, and P(i, i+1, i+2, i+3) the probability that the quaternary candidate appears in the corpus.
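Since the degree of condensation is the minimum probability ratio over the splits of a candidate, M_2, M_3 and M_4 can be realized by one function that iterates over every binary split; counting by substring search is a simplifying assumption of this sketch:

```python
# Sketch of the condensation degree: for every split A|B of the candidate,
# take P(AB) / (P(A) * P(B)) and keep the minimum ratio over all splits.
def condensation(word: str, corpus: str) -> float:
    n = len(corpus)
    p_word = corpus.count(word) / n            # P(candidate)
    ratios = []
    for k in range(1, len(word)):              # every split A|B
        p_a = corpus.count(word[:k]) / n       # P(A), front part
        p_b = corpus.count(word[k:]) / n       # P(B), back part
        ratios.append(p_word / (p_a * p_b))    # P(AB) / (P(A)·P(B))
    return min(ratios)                         # smallest M is the degree
```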
The filtering method of the ternary candidate words is as follows: for a ternary candidate word, if its last two characters exist in the binary candidate lexicon, judge whether its first character forms a word with the character adjacent on the left; if its first two characters exist in the binary candidate lexicon, judge whether its last character forms a word with the character adjacent on the right; this determines whether the ternary candidate is a genuine word;
If (A_i, A_{i+1}) belongs to the binary candidate word set and (A_{i-2}, A_{i-1}) does not, or (A_{i-1}, A_i) belongs to the binary candidate word set and (A_{i+1}, A_{i+2}) does not, then (A_{i-1} A_i A_{i+1}) is taken to be a ternary word;
wherein (A_{i-1} A_i A_{i+1}) is the ternary candidate word, A_{i-2} is the left-neighbour character of the ternary candidate word, A_{i+2} is the right-neighbour character of the ternary candidate word, {A_0 ... A_i ... A_N} is the corpus character set, and {(A_0, A_1) ... (A_i, A_{i+1}) ... (A_{i-2}, A_{i-1})} is the binary candidate word set.
The filtering method of the quaternary candidate words is as follows: a quaternary candidate that may form a word is first split into two fragments, the first two characters and the last two characters, and each fragment is matched against the segmented binary lexicon; candidates whose fragments match are kept as pre-selected words. The middle two characters of the quaternary candidate are then matched against the binary lexicon, and only candidates whose middle pair does not match are kept. A candidate satisfying both conditions is taken as a segmentation result:
{(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} ⊂ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} and (A_{i-1} A_i) ∉ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})};
wherein (A_{i-2} A_{i-1} A_i A_{i+1}) denotes the quaternary candidate word, {(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} denotes the front and back character pairs into which the candidate is split, {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} denotes the binary lexicon, and (A_{i-1} A_i) denotes the middle character pair of the quaternary candidate word.
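The two filters can be sketched as follows. The membership tests are reconstructed from the textual conditions above; where the original formulas were rendered as images and lost, the exact tests are assumptions:

```python
# Sketch of the ternary and quaternary segmentation filters.
def filter_ternary(word, binary_lexicon, corpus) -> bool:
    """Keep a 3-char candidate when one end pair is a binary word and the
    outer character does not bind the remaining character into a word."""
    first2, last2 = word[:2], word[1:]
    keep = False
    for pos in _occurrences(word, corpus):
        left = corpus[pos - 1] if pos > 0 else ""            # left neighbour
        right = corpus[pos + 3] if pos + 3 < len(corpus) else ""  # right neighbour
        if last2 in binary_lexicon and (left + word[0]) not in binary_lexicon:
            keep = True
        if first2 in binary_lexicon and (word[2] + right) not in binary_lexicon:
            keep = True
    return keep

def filter_quaternary(word, binary_lexicon) -> bool:
    """Keep a 4-char candidate when its front and back halves are binary
    words and its middle pair is not."""
    return (word[:2] in binary_lexicon
            and word[2:] in binary_lexicon
            and word[1:3] not in binary_lexicon)

def _occurrences(word, corpus):
    pos = corpus.find(word)
    while pos != -1:
        yield pos
        pos = corpus.find(word, pos + 1)
```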
The invention has the following beneficial effects: the method is highly accurate and effective, the system segments words efficiently, and the ternary and quaternary segmentation method with condensation degree and degree of freedom designed by the invention overcomes the problems of traditional statistics-based word segmentation methods.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
Example 1: as shown in fig. 1, a Chinese word segmentation method based on word association features:
a. selecting a text to be processed from a text library, preprocessing the text library, including removing symbols and forming sentences, and constructing a corpus by using the sentences after symbol removal;
b. performing word segmentation on the corpus in step a by adopting a word segmentation method of splicing words in front and back, to form word segmentation fragments;
c. forming a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank by adopting a binary segmentation pre-post word splicing method, a ternary segmentation pre-post word splicing method and a quaternary segmentation pre-post word splicing method;
d. performing word frequency statistics on binary candidate words, ternary candidate words and quaternary candidate words in the binary candidate word library, the ternary candidate word library and the quaternary candidate word library;
e. setting a word frequency threshold for the candidate words whose frequencies have been counted, judging each candidate word against the threshold, reserving the candidate words that meet the threshold to form a new corpus, and deleting the candidate words that do not;
f. calculating the degree of freedom and the degree of condensation of the candidate words in the corpus processed in step e, giving a uniform threshold of the degree of freedom and the degree of condensation for all the candidate words, judging against it, reserving the candidate words that meet the judgement, and deleting the candidate words that do not;
g. and further filtering the screened ternary candidate words and quaternary candidate words by adopting a word segmentation filtering method to form a new word bank.
The front-and-back word splicing method continuously cuts fragments starting from the first character of a Chinese sentence, so that every possible word-forming fragment is cut out, specifically as follows:
For the text content contained in a Chinese text, assume the character sequence:
{a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a character in the text and n ∈ N;
Performing binary segmentation and splicing on the character sequence with the binary-segmentation front-and-back word splicing method yields the binary text fragment set: {(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
Performing ternary segmentation and splicing with the ternary-segmentation front-and-back word splicing method yields the ternary text fragment set: {(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
Performing quaternary segmentation and splicing with the quaternary-segmentation front-and-back word splicing method yields the quaternary text fragment set: {(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
The degree of freedom is defined as follows: when a text fragment appears in many different contexts, it has a left-neighbour set (the set of characters adjacent to the fragment on the left) and a right-neighbour set (the set of characters adjacent to the fragment on the right). The information entropy of each set is computed, and the smaller of the two entropies is taken as the fragment's degree of freedom.
In the obtained text fragment set, the left-neighbour set of a text fragment is the set of characters that appear immediately to its left; for example, the fragment (a_{i+1} a_{i+2}) in the character sequence {a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}} has the left-neighbour set {a_i}. The right-neighbour set of a text fragment is the set of characters that appear immediately to its right; the same fragment (a_{i+1} a_{i+2}) has the right-neighbour set {a_{i+3}}.
In the obtained text fragment set, when a text fragment can appear in many different contexts with both a left-neighbour set and a right-neighbour set, the information entropy H of the fragment is obtained from the entropies of the left- and right-neighbour sets, i.e. H = min{s′, s″}, where H represents the degree of freedom of the candidate word, s′ represents the right entropy of the candidate word, and s″ is the left entropy of the candidate word; the smaller of the two entropies is taken as the degree of freedom.
The degree of freedom of the candidate word is obtained by computing the information entropies of the candidate's left- and right-neighbour sets and selecting the smaller entropy:
H = min{s′, s″}
H represents the degree of freedom of the candidate word, and s′ represents the right entropy of the candidate word:
s′ = −Σ_{i=1..k} (n_{b_i} / Σ_{j=1..k} n_{b_j}) · log(n_{b_i} / Σ_{j=1..k} n_{b_j}),
where b_i belongs to the right-neighbour set of the candidate word, n_{b_i} denotes the frequency with which b_i appears on the right side of the candidate word, and k represents the number of distinct characters in the right-neighbour set; s″ is the left entropy of the candidate word:
s″ = −Σ_{i=1..M} (n_{m_i} / Σ_{j=1..M} n_{m_j}) · log(n_{m_i} / Σ_{j=1..M} n_{m_j}),
where m_i belongs to the left-neighbour set of the candidate word, n_{m_i} denotes the frequency with which m_i appears on the left side of the candidate word, and M represents the number of distinct characters in the left-neighbour set.
The degree of condensation reflects the fact that, in a text, a new word appears on its own with a higher probability than the product of the probabilities of its component parts, i.e. P(AB) > P(A)·P(B). Let M = P(AB) / (P(A)·P(B)) and take the minimum M over the possible splits as the degree of condensation, where AB represents a new word, P(AB) represents the probability of the new word appearing in the text, A and B respectively represent the component strings, and P(A) and P(B) respectively represent the probabilities of the component strings appearing in the text.
The condensation degree of a candidate word is obtained statistically by calculating the ratio of the candidate word's joint probability to the product of the independent probabilities of its components in the corpus, in the following specific steps:
(1) The condensation degree M_2 of a binary candidate word is obtained as the ratio of the candidate word's probability to the product of its component probabilities:
M_2 = p(i, i+1) / (s_i · s_{i+1}), with s_i = N_i / N, s_{i+1} = N_{i+1} / N, p(i, i+1) = N_{i,i+1} / N,
where M_2 represents the condensation degree of the candidate word, s_i the probability that the first character of the binary candidate appears in the corpus, N_i the number of occurrences of the first character of the binary candidate in the corpus, N_{i+1} the number of occurrences of the second character, N_{i,i+1} the number of occurrences of the binary candidate itself, N the total number of characters in the corpus, s_{i+1} the probability that the second character appears in the corpus, and p(i, i+1) the probability that the binary candidate appears in the corpus;
(2) The condensation degree M_3 of a ternary candidate word is obtained from the ratio of the candidate word's probability to the combined probabilities of its two possible splits, taking the smaller value:
M_3 = min{ P(i, i+1, i+2) / (s_i · s_{i+1,i+2}), P(i, i+1, i+2) / (s_{i,i+1} · s_{i+2}) },
where M_3 represents the condensation degree of the candidate word, s_i the probability that the first character of the ternary candidate appears in the corpus, s_{i+1,i+2} the probability that the last two characters appear together in the corpus, s_{i,i+1} the probability that the first two characters appear together, s_{i+2} the probability that the last character appears in the corpus, N_i the occurrences of the first character, N_{i+2} the occurrences of the third character, N_{i+1,i+2} the occurrences of the last two characters, N_{i,i+1} the occurrences of the first two characters, N_{i,i+1,i+2} the occurrences of the ternary candidate, and P(i, i+1, i+2) the probability that the ternary candidate appears in the corpus;
(3) The condensation degree M_4 of a quaternary candidate word is obtained from the ratio of the candidate word's probability to the combined probabilities of its three possible splits, taking the smallest value:
M_4 = min{ P(i, i+1, i+2, i+3) / (s_i · s_{i+1,i+2,i+3}), P(i, i+1, i+2, i+3) / (s_{i,i+1} · s_{i+2,i+3}), P(i, i+1, i+2, i+3) / (s_{i,i+1,i+2} · s_{i+3}) },
where M_4 represents the condensation degree of the candidate word, s_i the probability that the first character of the quaternary candidate appears in the corpus, s_{i+1,i+2,i+3} the probability that the last three characters appear together in the corpus, s_{i,i+1,i+2} the probability that the first three characters appear together, s_{i,i+1} the probability that the first two characters appear together, s_{i+2,i+3} the probability that the last two characters appear together, N_i the occurrences of the first character, N_{i+3} the occurrences of the fourth character, N_{i,i+1} the occurrences of the first two characters, N_{i+2,i+3} the occurrences of the last two characters, N_{i+1,i+2,i+3} the occurrences of the last three characters, N_{i,i+1,i+2} the occurrences of the first three characters, and P(i, i+1, i+2, i+3) the probability that the quaternary candidate appears in the corpus.
The filtering method of the ternary candidate words is as follows: for a ternary candidate word, if its last two characters exist in the binary candidate lexicon, judge whether its first character forms a word with the character adjacent on the left; if its first two characters exist in the binary candidate lexicon, judge whether its last character forms a word with the character adjacent on the right; this determines whether the ternary candidate is a genuine word;
If (A_i, A_{i+1}) belongs to the binary candidate word set and (A_{i-2}, A_{i-1}) does not, or (A_{i-1}, A_i) belongs to the binary candidate word set and (A_{i+1}, A_{i+2}) does not, then (A_{i-1} A_i A_{i+1}) is taken to be a ternary word;
wherein (A_{i-1} A_i A_{i+1}) is the ternary candidate word, A_{i-2} is the left-neighbour character of the ternary candidate word, A_{i+2} is the right-neighbour character of the ternary candidate word, {A_0 ... A_i ... A_N} is the corpus character set, and {(A_0, A_1) ... (A_i, A_{i+1}) ... (A_{i-2}, A_{i-1})} is the binary candidate word set.
The filtering method of the quaternary candidate words is as follows: a quaternary candidate that may form a word is first split into two fragments, the first two characters and the last two characters, and each fragment is matched against the segmented binary lexicon; candidates whose fragments match are kept as pre-selected words. The middle two characters of the quaternary candidate are then matched against the binary lexicon, and only candidates whose middle pair does not match are kept. A candidate satisfying both conditions is taken as a segmentation result:
{(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} ⊂ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} and (A_{i-1} A_i) ∉ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})};
wherein (A_{i-2} A_{i-1} A_i A_{i+1}) denotes the quaternary candidate word, {(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} denotes the front and back character pairs into which the candidate is split, {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} denotes the binary lexicon, and (A_{i-1} A_i) denotes the middle character pair of the quaternary candidate word.
Example 2: as shown in fig. 1, a text to be processed is selected from a text library, and the text library is preprocessed, including removing symbols and forming sentences; a corpus is constructed from the sentences after symbol removal.
The corpus from step a is then segmented using the front-and-back word splicing segmentation method to form segmentation fragments.
And a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank are formed by adopting a binary segmentation front-and-back word splicing method, a ternary segmentation front-and-back word splicing method and a quaternary segmentation front-and-back word splicing method.
A word frequency threshold is set for the candidate words whose frequencies have been counted; each candidate is judged against it, and those passing the judgement are retained to form a new corpus.
The degree of condensation and the degree of freedom of the candidate words are then counted. In this embodiment, the condensation degree of a candidate word may be obtained by calculating the ratio of its joint probability to the product of its independent component probabilities in the corpus; the degree of freedom may be obtained by computing the information entropies of the candidate's left- and right-neighbour sets and selecting the smaller entropy as the degree of freedom.
And comparing the degree of condensation of the candidate words and the degree of freedom of the candidate words with a set threshold value.
And extracting candidate words larger than a threshold value to serve as a candidate word library.
In this example, the text of "Journey to the West", one of the four great classical Chinese novels, was collected. In the counted lexicon, if the word-forming text fragments are sufficiently distributed, their degree of condensation is higher and their degree of freedom larger than those of non-word-forming fragments. If the left and right neighbours of a word are regarded as random variables, the information entropies of the word's left- and right-neighbour sets reflect the randomness of its neighbours: the larger the entropy, the richer the left- or right-neighbour set of the word, and the smaller of the two entropies is taken as the degree of freedom.
In this embodiment, word-forming fragments have a higher degree of condensation, indicating a tighter relationship between their characters; when calculating the degree of condensation, the smaller value over the splits is taken as the final degree of condensation.
The lexicon screened by the degree of freedom and the degree of condensation serves as the candidate lexicon; the ternary and quaternary candidate lexicons are then processed with the ternary and quaternary word segmentation filtering methods, and the final lexicon is obtained.
The ternary and quaternary word segmentation filtering methods remove candidates that are statistically plausible but are not actual words, thereby improving the validity of the ternary and quaternary lexicons.
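Under the same assumptions as the pipeline sketch in the disclosure above, the whole procedure of this embodiment could be invoked as follows (the file name and threshold values are illustrative, not the patent's):

```python
# Illustrative usage of segment_pipeline from the sketch above;
# file name and thresholds are assumptions.
with open("xiyouji.txt", encoding="utf-8") as f:
    lexicon = segment_pipeline(f.read(), freq_min=10, dof_min=1.5, coh_min=100.0)
print(len(lexicon), "entries in the extracted lexicon")
```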
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.
Claims (8)
1. A Chinese word segmentation method based on word association characteristics is characterized in that:
a. selecting a text to be processed from a text library, preprocessing the text library, including removing symbols and forming sentences, and constructing a corpus by using the sentences after symbol removal;
b. performing word segmentation on the corpus in step a by adopting a word segmentation method of splicing words in front and back, to form word segmentation fragments;
c. forming a binary candidate word bank, a ternary candidate word bank and a quaternary candidate word bank by adopting a binary segmentation pre-post word splicing method, a ternary segmentation pre-post word splicing method and a quaternary segmentation pre-post word splicing method;
d. performing word frequency statistics on binary candidate words, ternary candidate words and quaternary candidate words in the binary candidate word library, the ternary candidate word library and the quaternary candidate word library;
e. setting a word frequency threshold for the candidate words whose frequencies have been counted, judging each candidate word against the threshold, reserving the candidate words that meet the threshold to form a new corpus, and deleting the candidate words that do not;
f. calculating the degree of freedom and the degree of condensation of the candidate words in the corpus processed in step e, giving a uniform threshold of the degree of freedom and the degree of condensation for all the candidate words, judging against it, reserving the candidate words that meet the judgement, and deleting the candidate words that do not;
g. and further filtering the screened ternary candidate words and quaternary candidate words by adopting a word segmentation filtering method to form a new word bank.
2. The method of claim 1, wherein the method comprises: the front-and-back word splicing method continuously cuts fragments starting from the first character of a Chinese sentence, cutting out every possible word-forming fragment, specifically as follows:
For the text content contained in a Chinese text, assume the character sequence:
{a_i, a_{i+1}, a_{i+2}, a_{i+3}, a_{i+4}, a_{i+5}, ..., a_{i-1+n}, a_{i+n}}, where a_i denotes a character in the text and n ∈ N;
Performing binary segmentation and splicing on the character sequence with the binary-segmentation front-and-back word splicing method yields the binary text fragment set: {(a_i a_{i+1}), (a_{i+1} a_{i+2}), (a_{i+2} a_{i+3}), (a_{i+3} a_{i+4}), ..., (a_{i-1+n} a_{i+n})};
Performing ternary segmentation and splicing with the ternary-segmentation front-and-back word splicing method yields the ternary text fragment set: {(a_i a_{i+1} a_{i+2}), (a_{i+1} a_{i+2} a_{i+3}), (a_{i+2} a_{i+3} a_{i+4}), ..., (a_{i-2+n} a_{i-1+n} a_{i+n})};
Performing quaternary segmentation and splicing with the quaternary-segmentation front-and-back word splicing method yields the quaternary text fragment set: {(a_i a_{i+1} a_{i+2} a_{i+3}), (a_{i+1} a_{i+2} a_{i+3} a_{i+4}), (a_{i+2} a_{i+3} a_{i+4} a_{i+5}), ..., (a_{i-3+n} a_{i-2+n} a_{i-1+n} a_{i+n})}.
3. The method of claim 1, wherein the method comprises: the degree of freedom is defined as follows: when a text fragment appears in many different contexts, it has a left-neighbour set (the set of characters adjacent to the fragment on the left) and a right-neighbour set (the set of characters adjacent to the fragment on the right); the information entropy of each set is computed, and the smaller of the two entropies is taken as the fragment's degree of freedom.
4. The method of claim 3, wherein the method comprises: in the obtained text fragment set, when a text fragment can appear in many different contexts with both a left-neighbour set and a right-neighbour set, the information entropy H of the fragment is obtained from the entropies of the left- and right-neighbour sets, i.e. H = min{s′, s″}, where H represents the degree of freedom of the candidate word, s′ represents the right entropy of the candidate word, and s″ is the left entropy of the candidate word; the smaller of the two entropies is taken as the degree of freedom.
5. The method of claim 1, wherein the method comprises: the degree of condensation reflects that, in a text, a new word appears on its own with a higher probability than the product of the probabilities of its component parts, i.e. P(AB) > P(A)·P(B); let M = P(AB) / (P(A)·P(B)) and take the minimum M over the possible splits as the degree of condensation, where AB represents a new word, P(AB) represents the probability of the new word appearing in the text, A and B respectively represent the component strings, and P(A) and P(B) respectively represent the probabilities of the component strings appearing in the text.
6. The method of claim 1, wherein the method comprises: the condensation degree of a candidate word is obtained statistically by calculating the ratio of the candidate word's joint probability to the product of the independent probabilities of its components in the corpus, in the following specific steps:
(1) obtaining the condensation degree M_2 of the binary candidate words as the ratio of the candidate word's probability to the product of its component probabilities:
M_2 = p(i, i+1) / (s_i · s_{i+1}), with s_i = N_i / N, s_{i+1} = N_{i+1} / N, p(i, i+1) = N_{i,i+1} / N,
wherein M_2 represents the condensation degree of the candidate word, s_i the probability that the first character of the binary candidate appears in the corpus, N_i the number of occurrences of the first character of the binary candidate in the corpus, N_{i+1} the number of occurrences of the second character, N_{i,i+1} the number of occurrences of the binary candidate itself, N the total number of characters in the corpus, s_{i+1} the probability that the second character appears in the corpus, and p(i, i+1) the probability that the binary candidate appears in the corpus;
(2) obtaining the condensation degree M_3 of the ternary candidate words from the ratio of the candidate word's probability to the combined probabilities of its two possible splits, taking the smaller value:
M_3 = min{ P(i, i+1, i+2) / (s_i · s_{i+1,i+2}), P(i, i+1, i+2) / (s_{i,i+1} · s_{i+2}) },
wherein M_3 represents the condensation degree of the candidate word, s_i the probability that the first character of the ternary candidate appears in the corpus, s_{i+1,i+2} the probability that the last two characters appear together in the corpus, s_{i,i+1} the probability that the first two characters appear together, s_{i+2} the probability that the last character appears in the corpus, N_i the occurrences of the first character, N_{i+2} the occurrences of the third character, N_{i+1,i+2} the occurrences of the last two characters, N_{i,i+1} the occurrences of the first two characters, N_{i,i+1,i+2} the occurrences of the ternary candidate, and P(i, i+1, i+2) the probability that the ternary candidate appears in the corpus;
(3) obtaining the condensation degree M_4 of the quaternary candidate words from the ratio of the candidate word's probability to the combined probabilities of its three possible splits, taking the smallest value:
M_4 = min{ P(i, i+1, i+2, i+3) / (s_i · s_{i+1,i+2,i+3}), P(i, i+1, i+2, i+3) / (s_{i,i+1} · s_{i+2,i+3}), P(i, i+1, i+2, i+3) / (s_{i,i+1,i+2} · s_{i+3}) },
wherein M_4 represents the condensation degree of the candidate word, s_i the probability that the first character of the quaternary candidate appears in the corpus, s_{i+1,i+2,i+3} the probability that the last three characters appear together in the corpus, s_{i,i+1,i+2} the probability that the first three characters appear together, s_{i,i+1} the probability that the first two characters appear together, s_{i+2,i+3} the probability that the last two characters appear together, N_i the occurrences of the first character, N_{i+3} the occurrences of the fourth character, N_{i,i+1} the occurrences of the first two characters, N_{i+2,i+3} the occurrences of the last two characters, N_{i+1,i+2,i+3} the occurrences of the last three characters, N_{i,i+1,i+2} the occurrences of the first three characters, and P(i, i+1, i+2, i+3) the probability that the quaternary candidate appears in the corpus.
7. The method of claim 1, wherein the method comprises: the filtering method of the ternary candidate words is as follows: for a ternary candidate word, if its last two characters exist in the binary candidate lexicon, judge whether its first character forms a word with the character adjacent on the left; if its first two characters exist in the binary candidate lexicon, judge whether its last character forms a word with the character adjacent on the right; this determines whether the ternary candidate is a genuine word;
If (A_i, A_{i+1}) belongs to the binary candidate word set and (A_{i-2}, A_{i-1}) does not, or (A_{i-1}, A_i) belongs to the binary candidate word set and (A_{i+1}, A_{i+2}) does not, then (A_{i-1} A_i A_{i+1}) is taken to be a ternary word;
wherein (A_{i-1} A_i A_{i+1}) is the ternary candidate word, A_{i-2} is the left-neighbour character of the ternary candidate word, A_{i+2} is the right-neighbour character of the ternary candidate word, {A_0 ... A_i ... A_N} is the corpus character set, and {(A_0, A_1) ... (A_i, A_{i+1}) ... (A_{i-2}, A_{i-1})} is the binary candidate word set.
8. The method of claim 1, wherein the method comprises: the filtering method of the quaternary candidate words is as follows: a quaternary candidate that may form a word is first split into two fragments, the first two characters and the last two characters, and each fragment is matched against the segmented binary lexicon; candidates whose fragments match are kept as pre-selected words; the middle two characters of the quaternary candidate are then matched against the binary lexicon, and only candidates whose middle pair does not match are kept; a candidate satisfying both conditions is taken as a segmentation result:
{(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} ⊂ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} and (A_{i-1} A_i) ∉ {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})};
wherein (A_{i-2} A_{i-1} A_i A_{i+1}) denotes the quaternary candidate word, {(A_{i-2}, A_{i-1}), (A_i, A_{i+1})} denotes the front and back character pairs into which the candidate is split, {(A_0 A_1) ... (A_i A_{i+1}) ... (A_N A_{N+1})} denotes the binary lexicon, and (A_{i-1} A_i) denotes the middle character pair of the quaternary candidate word.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711293044.8A (granted as CN108845982B) | 2017-12-08 | 2017-12-08 | Chinese word segmentation method based on word association characteristics
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711293044.8A (granted as CN108845982B) | 2017-12-08 | 2017-12-08 | Chinese word segmentation method based on word association characteristics
Publications (2)
Publication Number | Publication Date |
---|---|
CN108845982A (en) | 2018-11-20
CN108845982B (en) | 2021-08-20
Family
ID=64211732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711293044.8A | Chinese word segmentation method based on word association characteristics (granted as CN108845982B, active) | 2017-12-08 | 2017-12-08
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108845982B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858011A (en) * | 2018-11-30 | 2019-06-07 | 平安科技(深圳)有限公司 | Standard dictionary segmenting method, device, equipment and computer readable storage medium |
CN110287493A (en) * | 2019-06-28 | 2019-09-27 | 中国科学技术信息研究所 | Risk phrase chunking method, apparatus, electronic equipment and storage medium |
CN110287488A (en) * | 2019-06-18 | 2019-09-27 | 上海晏鼠计算机技术股份有限公司 | A kind of Chinese text segmenting method based on big data and Chinese feature |
CN110334345A (en) * | 2019-06-17 | 2019-10-15 | 首都师范大学 | New word discovery method |
CN110442861A (en) * | 2019-07-08 | 2019-11-12 | 万达信息股份有限公司 | A method of Chinese technical term and new word discovery based on real world statistics |
CN111125329A (en) * | 2019-12-18 | 2020-05-08 | 东软集团股份有限公司 | Text information screening method, device and equipment |
CN112711944A (en) * | 2021-01-13 | 2021-04-27 | 深圳前瞻资讯股份有限公司 | Word segmentation method and system and word segmentation device generation method and system |
CN116431930A (en) * | 2023-06-13 | 2023-07-14 | 天津联创科技发展有限公司 | Technological achievement conversion data query method, system, terminal and storage medium |
CN116541527A (en) * | 2023-07-05 | 2023-08-04 | 国网北京市电力公司 | Document classification method based on model integration and data expansion |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622341A (en) * | 2012-04-20 | 2012-08-01 | 北京邮电大学 | Domain ontology concept automatic-acquisition method based on Bootstrapping technology |
CN105488098A (en) * | 2015-10-28 | 2016-04-13 | 北京理工大学 | Field difference based new word extraction method |
CN105955950A (en) * | 2016-04-29 | 2016-09-21 | 乐视控股(北京)有限公司 | New word discovery method and device |
CN106126495A (en) * | 2016-06-16 | 2016-11-16 | 北京捷通华声科技股份有限公司 | A kind of based on large-scale corpus prompter method and apparatus |
CN107180025A (en) * | 2017-03-31 | 2017-09-19 | 北京奇艺世纪科技有限公司 | A kind of recognition methods of neologisms and device |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622341A (en) * | 2012-04-20 | 2012-08-01 | 北京邮电大学 | Domain ontology concept automatic-acquisition method based on Bootstrapping technology |
CN105488098A (en) * | 2015-10-28 | 2016-04-13 | 北京理工大学 | Field difference based new word extraction method |
CN105955950A (en) * | 2016-04-29 | 2016-09-21 | 乐视控股(北京)有限公司 | New word discovery method and device |
CN106126495A (en) * | 2016-06-16 | 2016-11-16 | 北京捷通华声科技股份有限公司 | A kind of based on large-scale corpus prompter method and apparatus |
CN107180025A (en) * | 2017-03-31 | 2017-09-19 | 北京奇艺世纪科技有限公司 | A kind of recognition methods of neologisms and device |
Non-Patent Citations (1)
Title |
---|
王惠仙 (Wang Huixian) et al.: "Research on a Chinese word segmentation algorithm based on improved forward maximum matching", Journal of Guizhou University (Natural Science Edition) *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858011A (en) * | 2018-11-30 | 2019-06-07 | 平安科技(深圳)有限公司 | Standard dictionary segmenting method, device, equipment and computer readable storage medium |
CN110334345A (en) * | 2019-06-17 | 2019-10-15 | 首都师范大学 | New word discovery method |
CN110287488A (en) * | 2019-06-18 | 2019-09-27 | 上海晏鼠计算机技术股份有限公司 | A kind of Chinese text segmenting method based on big data and Chinese feature |
CN110287493A (en) * | 2019-06-28 | 2019-09-27 | 中国科学技术信息研究所 | Risk phrase chunking method, apparatus, electronic equipment and storage medium |
CN110442861B (en) * | 2019-07-08 | 2023-04-07 | 万达信息股份有限公司 | Chinese professional term and new word discovery method based on real world statistics |
CN110442861A (en) * | 2019-07-08 | 2019-11-12 | 万达信息股份有限公司 | A method of Chinese technical term and new word discovery based on real world statistics |
CN111125329A (en) * | 2019-12-18 | 2020-05-08 | 东软集团股份有限公司 | Text information screening method, device and equipment |
CN111125329B (en) * | 2019-12-18 | 2023-07-21 | 东软集团股份有限公司 | Text information screening method, device and equipment |
CN112711944A (en) * | 2021-01-13 | 2021-04-27 | 深圳前瞻资讯股份有限公司 | Word segmentation method and system and word segmentation device generation method and system |
CN112711944B (en) * | 2021-01-13 | 2023-03-10 | 深圳前瞻资讯股份有限公司 | Word segmentation method and system, and word segmentation device generation method and system |
CN116431930A (en) * | 2023-06-13 | 2023-07-14 | 天津联创科技发展有限公司 | Technological achievement conversion data query method, system, terminal and storage medium |
CN116541527A (en) * | 2023-07-05 | 2023-08-04 | 国网北京市电力公司 | Document classification method based on model integration and data expansion |
CN116541527B (en) * | 2023-07-05 | 2023-09-29 | 国网北京市电力公司 | Document classification method based on model integration and data expansion |
Also Published As
Publication number | Publication date |
---|---|
CN108845982B (en) | 2021-08-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |