CN106610937A

CN106610937A - Information theory-based Chinese automatic word segmentation method

Info

Publication number: CN106610937A
Application number: CN201610831711.2A
Authority: CN
Inventors: 金平艳; 胡成华
Original assignee: Sichuan Yonglian Information Technology Co Ltd
Current assignee: Sichuan Yonglian Information Technology Co Ltd
Priority date: 2016-09-19
Filing date: 2016-09-19
Publication date: 2017-05-03

Abstract

An information theory-based Chinese automatic word segmentation method comprises the steps of: matching the sentence to be segmented against the words of a successfully initialized corpus; splitting the sentence to be segmented into a network (mesh) structure according to a probability-statistics method; computing a weight for each edge of the network structure by an information-theoretic method, the path with the maximum weight being the segmentation result for the sentence; and evaluating the segmentation by precision rate and recall rate. Chinese pre-processing is faster and more precise than with a method based on a segmentation dictionary, and more accurate than with a method based on statistics; the method is more practical, agrees better with empirical values, and is of great application value for subsequent natural language processing technology.

Description

An information theory-based Chinese automatic word segmentation algorithm

Technical field

The present invention relates to the technical field of Chinese semantic networks, and in particular to an information theory-based Chinese automatic word segmentation algorithm.

Background art

At present, Chinese automatic word segmentation based on understanding is still at the experimental stage, and methods based on a segmentation dictionary and methods based on probability statistics are the mainstream of current Chinese automatic word segmentation technology. Dictionary-based methods are simple to port and do not need to consider adaptivity when moving between domains; however, they are comparatively weak at resolving the ambiguities that arise during automatic segmentation and at problems such as named entity recognition. Statistics-based methods rely on powerful mathematical statistical models and greatly improve segmentation performance, but they perform poorly across domains and depend heavily on the corpus: a different corpus must be prepared for each domain in order to train a statistical segmentation model for that domain. Consequently, after a change of domain, a segmentation corpus for the new domain must be provided. However, building and maintaining the annotated corpus required for segmentation training consumes substantial manpower and material resources. By contrast, dictionary-based methods have an advantage in domain adaptation: when the target domain changes, a dictionary-based method only needs to add a dictionary for the new domain, and a domain dictionary is much easier to obtain than a corpus. Combining a segmentation dictionary with probability-statistics methods has therefore become the mainstream of current word segmentation. In order to realize Chinese automatic word segmentation and improve the accuracy of the segmentation result, the present invention proposes an information theory-based Chinese automatic word segmentation algorithm.

Summary of the invention

To realize Chinese automatic word segmentation and to address the low accuracy of segmentation results, the invention provides an information theory-based Chinese automatic word segmentation algorithm.

In order to solve the above problems, the present invention is achieved by the following technical solution:

Step 1: Initialize the training model, which may be a segmentation dictionary, a corpus of the relevant domain, or a model combining both.

Step 2: Using the segmentation dictionary, find the words in the sentence to be segmented that match words in the dictionary.

Step 3: According to probability statistics, split the sentence to be segmented into a network structure, which yields n possible sentence structures; the nodes of this structure are labeled in order $S, M_1, M_2, M_3, M_4, M_5, E$.

Step 4: Using an information-theoretic method, assign a weight to each edge of the network structure.

Step 5: Find the path with the maximum weight; this path is the segmentation result for the sentence.

Step 6: Verify the accuracy rate and recall rate of the segmentation result.

The present invention has the following advantages:

1. Chinese pre-processing is faster than with methods based on a segmentation dictionary.

2. The method is more precise than methods based on a segmentation dictionary.

3. The method is more accurate than methods based on statistics.

4. The method is more practical and agrees better with empirical values.

5. The method is of great application value for subsequent natural language processing techniques.

Brief description of the drawings

Fig. 1 is a flow chart of the information theory-based Chinese automatic word segmentation algorithm.

Fig. 2 is an illustration of the n-gram segmentation method.

Detailed description of the embodiments

In order to improve the accuracy of Chinese automatic word segmentation, the present invention is described in detail below with reference to Figs. 1 and 2. The specific implementation steps are as follows:

Step 2: Using the segmentation dictionary, find the words in the sentence to be segmented that match words in the dictionary, as follows:

The Chinese character string to be segmented is scanned completely once and looked up in the system dictionary. Whenever a word that exists in the dictionary is encountered, it is identified as a word; if there is no match in the dictionary, a single character is simply split off as a word. This continues until the character string is empty.
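The scan described above can be sketched as follows. This is only an illustrative sketch: the text does not say which candidate is tried first, so the longest-first order (forward maximum matching), the function name `match_dictionary`, the `max_len` parameter, and the sample dictionary are all assumptions.

```python
def match_dictionary(sentence, dictionary, max_len=4):
    """Scan `sentence` once, left to right, emitting dictionary words where
    they match and single characters where nothing matches."""
    tokens = []
    pos = 0
    while pos < len(sentence):  # stop once the remaining string is empty
        # Assumption: try the longest candidate first (forward maximum matching).
        for size in range(min(max_len, len(sentence) - pos), 1, -1):
            candidate = sentence[pos:pos + size]
            if candidate in dictionary:
                tokens.append(candidate)
                pos += size
                break
        else:
            # No multi-character dictionary word starts here:
            # split off a single character as a word.
            tokens.append(sentence[pos])
            pos += 1
    return tokens


# Hypothetical dictionary, for illustration only.
sample_dictionary = {"研究", "研究生", "生命", "起源"}
tokens = match_dictionary("研究生命起源", sample_dictionary)
```

With this sample dictionary the greedy scan picks "研究生" first and leaves "命" as a single character, which is the kind of ambiguity that the later steps are meant to resolve.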

Step 3: According to probability statistics, split the sentence to be segmented into a network structure, which yields n possible sentence structures; the nodes of this structure are labeled in order $S, M_1, M_2, M_3, M_4, M_5, E$. The structure is shown in Fig. 2.
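One way to picture the network structure of Step 3 is as a graph over character positions, where every candidate word is an edge and every start-to-end path is one possible segmentation. The sketch below is an assumption-laden illustration: it builds the edges from a hypothetical dictionary rather than from corpus statistics, and the names `build_lattice` and `enumerate_paths` are not from the source.

```python
def build_lattice(sentence, dictionary, max_len=4):
    """Map each start position to the end positions of candidate words.
    Single characters are always allowed, so the end node is always reachable."""
    n = len(sentence)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        edges[i].append(i + 1)  # single-character edge
        for j in range(i + 2, min(n, i + max_len) + 1):
            if sentence[i:j] in dictionary:
                edges[i].append(j)  # dictionary-word edge
    return edges


def enumerate_paths(sentence, edges, start=0):
    """Yield every path from `start` to the end of the sentence as a word list."""
    if start == len(sentence):
        yield []
        return
    for end in edges[start]:
        for rest in enumerate_paths(sentence, edges, end):
            yield [sentence[start:end]] + rest


# Hypothetical dictionary, for illustration only.
sample_dictionary = {"研究", "研究生", "生命"}
sample_sentence = "研究生命"
lattice = build_lattice(sample_sentence, sample_dictionary)
paths = list(enumerate_paths(sample_sentence, lattice))
```

Each element of `paths` corresponds to one of the n candidate sentence structures; exhaustive enumeration is only practical for short sentences.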

Step 4: Using an information-theoretic method, assign a weight to each edge of the network structure. The calculation proceeds as follows:

Counting both the dictionary words matched by the segmentation dictionary and the unmatched single characters, the i-th path contains $n_i$ words, so the set of word counts of the n paths is $(n_1, n_2, \ldots, n_n)$.

Obtain $\min(n_1, n_2, \ldots, n_n)$.

For the (n-m) paths that remain (m being the number of paths eliminated by the minimum-word-count operation, see Step 5), compute the weight of each edge between adjacent words on the path.

In the statistical corpus, compute the information content $X(C_i)$ of each word, and then the co-occurrence information content $X(C_i, C_{i+1})$ of adjacent words on the path, using the following formulas:

$X(C_i) = |x(C_i)_1 - x(C_i)_2|$

where $x(C_i)_1$ is the information content of word $C_i$ in the text corpus, and $x(C_i)_2$ is the information content of the texts containing word $C_i$.

$x(C_i)_1 = -p(C_i)_1 \ln p(C_i)_1$

where $p(C_i)_1$ is the probability of $C_i$ in the text corpus, and n is the number of texts in the corpus that contain word $C_i$.

$x(C_i)_2 = -p(C_i)_2 \ln p(C_i)_2$

where $p(C_i)_2$ is the probability of the number of texts containing word $C_i$, and N is the total number of texts in the statistical corpus.
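The single-word quantities above can be evaluated numerically as in the following sketch. The two probabilities are taken as given inputs, because the text does not reproduce how they are estimated; the function names and the convention that the term is 0 at p = 0 are assumptions.

```python
import math


def entropy_term(p):
    """Return -p * ln(p); by convention (an assumption here) the term is 0 at p = 0."""
    return 0.0 if p <= 0.0 else -p * math.log(p)


def information_content(p1, p2):
    """X = |x_1 - x_2| with x_k = -p_k * ln(p_k), following the formulas above.

    p1: probability of the word in the text corpus (given, not estimated here)
    p2: probability of the texts containing the word (given, not estimated here)
    """
    return abs(entropy_term(p1) - entropy_term(p2))


# Hypothetical probabilities, for illustration only.
x_example = information_content(1.0, 0.5)
```

The same two functions apply unchanged to a word pair: passing the two co-occurrence probabilities yields the co-occurrence information content.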

Similarly, $X(C_i, C_{i+1}) = |x(C_i, C_{i+1})_1 - x(C_i, C_{i+1})_2|$

where $x(C_i, C_{i+1})_1$ is the co-occurrence information content of the word pair $(C_i, C_{i+1})$ in the text corpus, and $x(C_i, C_{i+1})_2$ is the information content of the texts in which the adjacent words $(C_i, C_{i+1})$ co-occur.

Similarly, $x(C_i, C_{i+1})_1 = -p(C_i, C_{i+1})_1 \ln p(C_i, C_{i+1})_1$

where $p(C_i, C_{i+1})_1$ is the co-occurrence probability of the word pair $(C_i, C_{i+1})$ in the text corpus, and m is the number of texts in the text library in which the words $(C_i, C_{i+1})$ co-occur.

$x(C_i, C_{i+1})_2 = -p(C_i, C_{i+1})_2 \ln p(C_i, C_{i+1})_2$

where $p(C_i, C_{i+1})_2$ is the probability of the number of texts in the text library in which the adjacent words $(C_i, C_{i+1})$ co-occur.

In summary, the weight of each edge between adjacent words on a path is

$w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) - 2X(C_i, C_{i+1})$
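As a small worked illustration of the edge-weight formula, the sketch below applies it to every pair of adjacent words on one path. The information-content values are hypothetical placeholders, and `edge_weights`, `X_single`, and `X_pair` are names introduced only for this example.

```python
def edge_weights(path, X_single, X_pair):
    """Return w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) - 2 * X(C_i, C_{i+1})
    for each pair of adjacent words on `path`."""
    return [
        X_single[a] + X_single[b] - 2.0 * X_pair[(a, b)]
        for a, b in zip(path, path[1:])
    ]


# Hypothetical information-content values, for illustration only.
X_single = {"研究": 0.30, "生命": 0.20, "起源": 0.10}
X_pair = {("研究", "生命"): 0.05, ("生命", "起源"): 0.02}

weights = edge_weights(["研究", "生命", "起源"], X_single, X_pair)
```

For these placeholder values the first edge evaluates to 0.30 + 0.20 - 2 × 0.05 = 0.40 and the second to 0.20 + 0.10 - 2 × 0.02 = 0.26.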

Step 5: Find the path with the maximum weight; this path is the segmentation result for the sentence. The calculation proceeds as follows:

There are n paths, each of a different length; let the set of paths be $(L_1, L_2, \ldots, L_n)$.

Suppose that the operation of taking the minimum number of words per path eliminates m paths, with m < n. Then (n-m) paths remain; let the set of their path lengths be […]

The weight of each path is then: […]

where […] are the weights of the 1st, 2nd, …, […]-th edges of the path, which can be computed one by one as in Step 4, and […] is the length of the $S_j$-th path among the remaining (n-m) paths.

The path with the maximum weight is: […]
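The selection in Step 5 can be sketched as below. The path-weight formula itself is not reproduced in the text, so this sketch assumes that a path's weight is the plain sum of its edge weights; that assumption, the helper names, and the sample weights are all hypothetical.

```python
def path_weight(path, edge_weight):
    """Assumption: the weight of a path is the sum of the weights of the
    edges between adjacent words on that path."""
    return sum(edge_weight(a, b) for a, b in zip(path, path[1:]))


def best_path(paths, edge_weight):
    """Return the candidate path with the largest total weight."""
    return max(paths, key=lambda p: path_weight(p, edge_weight))


# Hypothetical edge weights, for illustration only.
WEIGHTS = {
    ("研究", "生命"): 0.40,
    ("生命", "起源"): 0.26,
    ("研究生", "命"): 0.05,
    ("命", "起源"): 0.03,
}


def lookup(a, b):
    """Look up a precomputed edge weight; unseen pairs default to 0."""
    return WEIGHTS.get((a, b), 0.0)


candidates = [["研究", "生命", "起源"], ["研究生", "命", "起源"]]
result = best_path(candidates, lookup)
```

Enumerating candidates is used here only for clarity; on a longer sentence the same maximum could be found with dynamic programming over the network structure instead.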

Step 6: Verify the accuracy rate and recall rate of the segmentation result.

Accuracy rate: […]

where $n_{\mathrm{known}}$ is the number of dictionary words that the segmentation dictionary identifies in the sentence to be segmented, and $n_z$ is the number of words correctly segmented by the method.

Recall rate: […]

where $n_{\mathrm{total}}$ is the total number of words in the sentence to be segmented.

Finally, both factors are considered together to judge the correctness of the system's segmentation result, namely

$d = |\mathit{zhaorate} - \mathit{rate}| \le \varepsilon$

where $\varepsilon$ is a very small threshold given by an expert. When d satisfies this condition, the segmentation result is considered satisfactory.
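The final check can be sketched as a comparison of the two rates against the threshold. The sketch takes `rate` and `zhaorate` as already computed inputs, since their formulas are not reproduced in the text; the default `epsilon` of 0.05 is a hypothetical value standing in for the expert-supplied threshold.

```python
def segmentation_acceptable(rate, zhaorate, epsilon=0.05):
    """Return (d, ok) where d = |zhaorate - rate| and ok is True when d <= epsilon.

    epsilon is a placeholder; the text says the threshold is given by an expert.
    """
    d = abs(zhaorate - rate)
    return d, d <= epsilon


# Hypothetical rates, for illustration only.
d_close, ok_close = segmentation_acceptable(rate=0.92, zhaorate=0.90)
d_far, ok_far = segmentation_acceptable(rate=0.95, zhaorate=0.80)
```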

Claims

1. An information theory-based Chinese automatic word segmentation algorithm, relating to the technical field of Chinese semantic networks, characterized in that it comprises the following steps:

Step 1: Initialize the training model, which may be a segmentation dictionary, a corpus of the relevant domain, or a model combining both.

Step 2: Using the segmentation dictionary, find the words in the sentence to be segmented that match words in the dictionary.

Step 3: According to probability statistics, split the sentence to be segmented into a network structure, which yields n possible sentence structures; the nodes of this structure are labeled in order $S, M_1, M_2, M_3, M_4, M_5, E$.

Step 4: Using an information-theoretic method, assign a weight to each edge of the network structure.

Step 5: Find the path with the maximum weight; this path is the segmentation result for the sentence.

Step 6: Verify the accuracy rate and recall rate of the segmentation result.

Accuracy rate: […]

where $n_{\mathrm{known}}$ is the number of dictionary words that the segmentation dictionary identifies in the sentence to be segmented, and $n_z$ is the number of words correctly segmented by the method.

Recall rate: […]

where $n_{\mathrm{total}}$ is the total number of words in the sentence to be segmented.

Finally, both factors are considered together to judge the correctness of the system's segmentation result, namely

$d = |\mathit{zhaorate} - \mathit{rate}| \le \varepsilon$

where $\varepsilon$ is a very small threshold given by an expert; when d satisfies this condition, the segmentation result is considered satisfactory.

2. The information theory-based Chinese automatic word segmentation algorithm according to claim 1, characterized in that the calculation in Step 4 proceeds as follows:

Step 4: Using an information-theoretic method, assign a weight to each edge of the network structure, as follows:

Counting both the dictionary words matched by the segmentation dictionary and the unmatched single characters, the i-th path contains $n_i$ words, so the set of word counts of the n paths is $(n_1, n_2, \ldots, n_n)$.

Obtain $\min(n_1, n_2, \ldots, n_n)$.

For the (n-m) paths that remain, compute the weight of each edge between adjacent words on the path.

In the statistical corpus, compute the information content of each word, and then the co-occurrence information content of adjacent words on the path, using the following formulas:

$X(C_i) = |x(C_i)_1 - x(C_i)_2|$

where $x(C_i)_1$ is the information content of word $C_i$ in the text corpus, and $x(C_i)_2$ is the information content of the texts containing word $C_i$.

$x(C_i)_1 = -p(C_i)_1 \ln p(C_i)_1$

where $p(C_i)_1$ is the probability of $C_i$ in the text corpus, and n is the number of texts in the corpus that contain word $C_i$.

$x(C_i)_2 = -p(C_i)_2 \ln p(C_i)_2$

where $p(C_i)_2$ is the probability of the number of texts containing word $C_i$, and N is the total number of texts in the statistical corpus.

Similarly, $X(C_i, C_{i+1}) = |x(C_i, C_{i+1})_1 - x(C_i, C_{i+1})_2|$

where $x(C_i, C_{i+1})_1$ is the co-occurrence information content of the word pair $(C_i, C_{i+1})$ in the text corpus, and $x(C_i, C_{i+1})_2$ is the information content of the texts in which the adjacent words $(C_i, C_{i+1})$ co-occur.

Similarly, $x(C_i, C_{i+1})_1 = -p(C_i, C_{i+1})_1 \ln p(C_i, C_{i+1})_1$

where $p(C_i, C_{i+1})_1$ is the co-occurrence probability of the word pair $(C_i, C_{i+1})$ in the text corpus, and m is the number of texts in the text library in which the words $(C_i, C_{i+1})$ co-occur.

$x(C_i, C_{i+1})_2 = -p(C_i, C_{i+1})_2 \ln p(C_i, C_{i+1})_2$

where $p(C_i, C_{i+1})_2$ is the probability of the number of texts in the text library in which the adjacent words $(C_i, C_{i+1})$ co-occur.

In summary, the weight of each edge between adjacent words on a path is

$w(C_i, C_{i+1}) = X(C_i) + X(C_{i+1}) - 2X(C_i, C_{i+1})$.

3. The information theory-based Chinese automatic word segmentation algorithm according to claim 1, characterized in that the calculation in Step 5 proceeds as follows:

There are n paths, each of a different length; let the set of paths be $(L_1, L_2, \ldots, L_n)$.

Suppose that the operation of taking the minimum number of words per path eliminates m paths, with m < n, so that (n-m) paths remain; let the set of their path lengths be […]

The weight of each path is then: […]

where […] are the weights of the 1st, 2nd, …, […]-th edges of the path, which can be computed one by one as in Step 4, and […] is the length of the $S_j$-th path among the remaining (n-m) paths.

The path with the maximum weight is: […]