Background Art
With the continuous improvement of computer hardware performance and the growing intelligence of software, people increasingly expect computers to provide more natural modes of human-computer interaction, which is reflected in: (1) providing more intelligent Chinese character input methods; (2) providing continuous speech input; and (3) providing continuous handwriting input. The realization of all three interaction modes requires the support of language modeling technology at the bottom layer, and the performance of the language model directly determines the intelligence and ease of use of such interactive software.
Statistical language modeling is the mainstream technology of present-day language modeling, and the Ngram language model is the most successful statistical language model. An Ngram is a sequence of N words occurring consecutively in a corpus; the bigram (a sequence of 2 words) and the trigram (a sequence of 3 words) are the most commonly used, and an Ngram language model is composed of a large number of Ngrams. The Ngram language model computes the probability of each candidate Chinese sentence from the conditional probabilities between words, and selects the candidate Chinese sentence of maximum probability as the output of the interactive software. According to the Ngram language model, for a Chinese sentence S = W_1 W_2 ... W_m comprising m words, its probability is:

$$P(S) = \prod_{i=1}^{m} P(W_i \mid W_{i-n+1} \cdots W_{i-1})$$

and each conditional probability is estimated from corpus counts by maximum likelihood:

$$P(W_i \mid W_{i-n+1} \cdots W_{i-1}) = \frac{C(W_{i-n+1} \cdots W_{i-1} W_i)}{C(W_{i-n+1} \cdots W_{i-1})}$$
Wherein:
P(W_i | W_{i-n+1} ... W_{i-1}) denotes the conditional probability that the word W_i appears under the condition that the word sequence W_{i-n+1} ... W_{i-1} has appeared;
C(W_{i-n+1} ... W_{i-1} W_i) denotes the number of times that the word sequence W_{i-n+1} ... W_{i-1} W_i occurs in the corpus;
C(W_{i-n+1} ... W_{i-1}) denotes the number of times that the word sequence W_{i-n+1} ... W_{i-1} occurs in the corpus;
n is a predefined integer.
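To make the maximum-likelihood estimate concrete, the following Python sketch counts Ngrams in a small tokenized corpus and computes C(h, w) / C(h). The corpus, function names, and the toy bigram example are illustrative assumptions, not part of the invention.

```python
from collections import Counter

def ngram_counts(corpus, n):
    """Count all n-grams and their (n-1)-word histories in a tokenized corpus."""
    ngrams, histories = Counter(), Counter()
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            ngrams[tuple(sentence[i:i + n])] += 1
            histories[tuple(sentence[i:i + n - 1])] += 1
    return ngrams, histories

def mle_prob(ngrams, histories, history, word):
    """Maximum-likelihood estimate P(w | h) = C(h, w) / C(h)."""
    h = tuple(history)
    if histories[h] == 0:
        return 0.0  # unseen history: the "zero probability" problem discussed below
    return ngrams[h + (word,)] / histories[h]

# Toy bigram example (n = 2).
corpus = [["we", "like", "language", "models"], ["we", "like", "ngrams"]]
ngrams, histories = ngram_counts(corpus, 2)
print(mle_prob(ngrams, histories, ["we"], "like"))    # 2/2 = 1.0
print(mle_prob(ngrams, histories, ["like"], "ngrams"))  # 1/2 = 0.5
```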
An Ngram language model obtained by maximum likelihood estimation cannot be applied directly in an input-method engine, because the original Ngram language model faces the "zero probability" problem: when some word combination in the test material does not occur in the Ngram language model, the sentence probability calculated by the original Ngram language model is zero, which causes serious problems for most applications. To solve the zero-probability problem, the probabilities in the original Ngram language model need to be adjusted so that the calculated probability is non-zero when an unknown Ngram is encountered; the concrete probability adjustment methods are called smoothing algorithms of the Ngram model. Smoothing algorithms are divided into two large classes. One class is the interpolation smoothing algorithms, which adopt the idea of model merging and combine the lower-order model with the higher-order model by linear interpolation, with the concrete formula as follows:

$$P_{interp}(w \mid h) = \lambda\, P(w \mid h) + (1 - \lambda)\, P(w \mid h')$$
Wherein:
h denotes the historical words in the Ngram;
w denotes the current word in the Ngram;
P_interp(w|h) denotes the smoothed Ngram probability;
P(w|h) denotes the original Ngram probability;
P(w|h') denotes the lower-order Ngram probability;
h' denotes the historical words in the lower-order Ngram;
λ is the interpolation coefficient, whose value usually lies in [0, 1].
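A minimal sketch of the interpolation formula above, assuming p_high and p_low are callables returning the higher- and lower-order probabilities; the default λ value is an arbitrary illustrative choice (in practice λ is tuned, for example on held-out data).

```python
def interpolated_prob(p_high, p_low, history, word, lam=0.7):
    """Interpolation smoothing: P_interp(w|h) = lam*P(w|h) + (1-lam)*P(w|h').

    p_high(history, word) and p_low(history, word) return the higher- and
    lower-order Ngram probabilities; lam is the interpolation coefficient
    in [0, 1] (0.7 here is only an illustrative default). The lower-order
    history h' drops the oldest word of h.
    """
    h_prime = history[1:]
    return lam * p_high(history, word) + (1.0 - lam) * p_low(h_prime, word)
```

Because the lower-order term contributes even when the high-order Ngram is unseen, the interpolated probability never collapses to zero for any word the lower-order model covers.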
The other class is the rollback (backoff) smoothing algorithms: when the higher-order model suffers from the zero-probability problem, the more reliable lower-order model is adopted instead, with the concrete formula as follows:

$$P_{bo}(w \mid h) = \begin{cases} P_d(w \mid h), & C(h, w) > 0 \\ \alpha(h)\, P(w \mid h'), & C(h, w) = 0 \end{cases}$$
Wherein:
h denotes the historical words in the Ngram;
w denotes the current word in the Ngram;
P_bo(w|h) denotes the smoothed Ngram probability;
P_d(w|h) denotes the probability value after Good-Turing smoothing;
C(h, w) denotes the number of times that w and h occur together in the corpus;
α is an adjustment coefficient and is a function of h;
P(w|h') denotes the lower-order Ngram probability;
h' denotes the historical words in the lower-order Ngram.
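A sketch of the backoff scheme above, assuming a counts table, a Good-Turing-discounted probability function, and a precomputed α(h); all names are illustrative, not taken from the patent.

```python
def backoff_prob(counts, p_discounted, alpha, p_low, history, word):
    """Backoff smoothing:
        P_bo(w|h) = P_d(w|h)            if C(h, w) > 0
                  = alpha(h) * P(w|h')  if C(h, w) = 0

    counts maps word tuples to corpus frequencies; p_discounted returns the
    Good-Turing-discounted probability for seen Ngrams; alpha(h) redistributes
    the probability mass released by discounting onto the lower-order model.
    """
    if counts.get(tuple(history) + (word,), 0) > 0:
        return p_discounted(history, word)
    return alpha(tuple(history)) * p_low(history[1:], word)
```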
When the vocabulary contains K words, the parameter space of the Ngram language model is in theory O(K^n); for example, with K = 10^5 and n = 3 there are up to 10^15 possible trigrams. In commercial input software, the value of K is usually on the order of 100,000 to 1,000,000. In actual use, because the memory of the computer is limited, a complete Ngram model cannot be loaded, and the Ngram model can usually be used only after pruning. The quality of the pruning strategy directly affects the practical usability of the Ngram model, and choosing the strategy by which to reduce the model is a key step of language modeling.
At present, the standard way of reducing the model is to prune according to the frequency of the Ngram parameters, namely to remove the low-frequency Ngram parameters and reserve the high-frequency Ngram parameters. The shortcoming of this method is that it does not consider the effect of the pruning process on the performance of the Ngram language model: many low-frequency Ngram parameters that are helpful for improving the performance of the Ngram language model are cropped by mistake, so that the performance of the pruned language model is comparatively low.
Summary of the invention
The present invention provides a method and an apparatus for pruning a language model, which can reduce the effect of the pruning process on the performance of an Ngram language model.
The technical solution of the present invention is achieved as follows:
A method for pruning a language model comprises:
performing Ngram statistics on corpus data to form the Ngram list of the original Ngram language model, the Ngram list comprising all Ngrams in the original Ngram language model;
for each Ngram in the Ngram list, calculating the relative entropy between the probability distribution of the Ngram language model with this Ngram pruned and that of the original Ngram language model;
according to actual requirements, deleting at least one Ngram with small relative entropy from the Ngram list, to obtain the pruned Ngram language model.
An apparatus for pruning a language model comprises:
a statistics module, configured to perform Ngram statistics on corpus data and form the Ngram list of the original Ngram language model, the Ngram list comprising all Ngrams in the original Ngram language model;
a calculation module, configured to calculate, for each Ngram in the Ngram list, the relative entropy between the probability distribution of the Ngram language model with this Ngram pruned and that of the original Ngram language model;
a pruning module, configured to delete, according to actual requirements, at least one Ngram with small relative entropy from the Ngram list, to obtain the pruned Ngram language model.
As can be seen, for every Ngram in the Ngram language model, the method and apparatus for pruning a language model proposed by the present invention calculate the relative entropy between the probability distribution of the Ngram language model with this Ngram pruned and that of the original Ngram language model, and prune the Ngrams with small relative entropy. Because a smaller relative entropy indicates a smaller difference between the two language-model probability distributions, the present invention can reduce the effect of the pruning process on the performance of the Ngram language model.
Embodiment
The present invention proposes a method for pruning a language model. As shown in the flowchart of Fig. 1, the method proposed by the present invention comprises:
Step 101: perform Ngram statistics on the corpus data to form the Ngram list of the original Ngram language model, the Ngram list comprising all Ngrams in the original Ngram language model;
Step 102: for each Ngram in the Ngram list, calculate the relative entropy between the probability distribution of the Ngram language model with this Ngram pruned and that of the original Ngram language model;
Step 103: delete at least one Ngram with small relative entropy from the Ngram list, to obtain the pruned Ngram language model.
Relative entropy is a measure of the difference between two probability distributions. For the Ngram language model, when a certain Ngram is cropped, the probability distribution of the Ngram language model changes between before and after the pruning, and the relative entropy between these two probability distributions is calculated by the following formula:

$$D_{KL}(P \,\|\, P') = \sum_{h, w} P(h, w) \log \frac{P(w \mid h)}{P'(w \mid h)}$$
Wherein:
D_KL denotes the relative entropy;
h denotes the historical words in the Ngram;
w denotes the current word in the Ngram;
P(h, w) denotes the joint probability that h and w appear;
P(w|h) denotes the conditional probability, given by the Ngram language model before this Ngram is pruned, that w appears under the condition that h has appeared;
P'(w|h) denotes the conditional probability, given through the smoothing algorithm by the Ngram language model after this Ngram is pruned, that w appears under the condition that h has appeared.
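As a sketch of this definition, the relative entropy can be accumulated directly over all (h, w) events with non-zero joint probability; the callables here are assumed interfaces for illustration, not the patent's implementation.

```python
import math

def relative_entropy(events, p_joint, p_orig, p_pruned):
    """D_KL(P || P') = sum over (h, w) of P(h, w) * log(P(w|h) / P'(w|h)).

    events: iterable of (history, word) pairs with non-zero joint probability.
    p_joint(h, w):  joint probability P(h, w) under the original model.
    p_orig(h, w):   conditional probability P(w|h) before pruning.
    p_pruned(h, w): smoothed conditional probability P'(w|h) after pruning.
    """
    return sum(p_joint(h, w) * math.log(p_orig(h, w) / p_pruned(h, w))
               for h, w in events)
```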
As can be seen from the above formula, when the language model adopts different smoothing algorithms, the probability P'(w|h) is computed in different ways, and the method of computing the relative entropy differs accordingly.
According to the computing formulas of the two smoothing algorithms described above, when the backoff smoothing algorithm is adopted, the above relative entropy is calculated by the following formula:

$$D_{KL} = \frac{C(w, h)}{N} \log \frac{P(w \mid h)}{\alpha(h)\, P(w \mid h')}$$
Wherein:
C(w, h) denotes the number of times that w and h occur together in the corpus;
N denotes the sum of the occurrence counts of all Ngrams;
α denotes an adjustment coefficient and is a function of h;
P(w|h') denotes the conditional probability, given by the lower-order Ngram language model, that w appears under the condition that h' has appeared;
h' denotes the historical words in the lower-order Ngram language model.
When the interpolation smoothing algorithm is adopted, the above relative entropy is calculated by the following formula:

$$D_{KL} = \frac{C(w, h)}{N} \log \frac{P(w \mid h)}{(1 - \lambda)\, P(w \mid h')}$$
Wherein:
C(w, h) denotes the number of times that w and h occur together in the corpus;
N denotes the sum of the occurrence counts of all Ngrams;
λ denotes the interpolation coefficient;
P(w|h') denotes the conditional probability, given by the lower-order Ngram language model, that w appears under the condition that h' has appeared;
h' denotes the historical words in the lower-order Ngram language model.
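Under the two closed-form expressions above, the per-Ngram relative entropy reduces to a single term, since only the pruned Ngram's own probability changes; the following sketch implements both variants and, like the formulas as reconstructed here, ignores the second-order change in α(h) caused by the pruning itself. All names are illustrative.

```python
import math

def re_backoff(c_hw, n_total, p_orig, alpha_h, p_low):
    """Backoff variant: (C(w,h)/N) * log( P(w|h) / (alpha(h) * P(w|h')) )."""
    return (c_hw / n_total) * math.log(p_orig / (alpha_h * p_low))

def re_interpolation(c_hw, n_total, p_orig, lam, p_low):
    """Interpolation variant: (C(w,h)/N) * log( P(w|h) / ((1-lam) * P(w|h')) )."""
    return (c_hw / n_total) * math.log(p_orig / ((1.0 - lam) * p_low))
```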
In the Ngram language model, each Ngram is evaluated by the above formula, all Ngrams can be sorted according to their relative entropy, and, according to this ordering, the Ngrams with smaller relative entropy are cropped, thereby obtaining a language model close to the original Ngram model.
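A sketch of this ranking step, assuming each Ngram can be scored by a per-Ngram relative-entropy function such as the ones above, and that the target model size is derived from the available memory; the names are illustrative.

```python
def prune_model(ngrams, re_of, target_size):
    """Sort all Ngrams by the relative entropy their removal would cause,
    then delete them from smallest impact upward until only target_size
    Ngrams remain, yielding a model close to the original one."""
    ranked = sorted(ngrams, key=re_of)            # least-damaging first
    n_to_drop = max(0, len(ngrams) - target_size)
    return ranked[n_to_drop:]                     # surviving Ngrams
```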
The present invention also proposes an apparatus for pruning a language model, comprising:
a statistics module, configured to perform Ngram statistics on corpus data and form the Ngram list of the original Ngram language model, the Ngram list comprising all Ngrams in the original Ngram language model;
a calculation module, configured to calculate, for each Ngram in the Ngram list, the relative entropy between the probability distribution of the Ngram language model with this Ngram pruned and that of the original Ngram language model;
a pruning module, configured to delete at least one Ngram with small relative entropy from the Ngram list, to obtain the pruned Ngram language model.
The above calculation module can adopt the following formula to calculate the relative entropy:

$$D_{KL}(P \,\|\, P') = \sum_{h, w} P(h, w) \log \frac{P(w \mid h)}{P'(w \mid h)}$$
Wherein:
D_KL denotes the relative entropy;
h denotes the historical words in the Ngram;
w denotes the current word in the Ngram;
P(h, w) denotes the joint probability that w and h appear;
P(w|h) denotes the conditional probability, given by the Ngram language model before this Ngram is pruned, that w appears under the condition that h has appeared;
P'(w|h) denotes the conditional probability, given through the smoothing algorithm by the Ngram language model after this Ngram is pruned, that w appears under the condition that h has appeared.
When the backoff smoothing algorithm is adopted, the above calculation module can adopt the following formula to calculate the relative entropy:

$$D_{KL} = \frac{C(w, h)}{N} \log \frac{P(w \mid h)}{\alpha(h)\, P(w \mid h')}$$
Wherein:
C(w, h) denotes the number of times that w and h occur together in the corpus;
N denotes the sum of the occurrence counts of all Ngrams;
α denotes an adjustment coefficient and is a function of h;
P(w|h') denotes the conditional probability, given by the lower-order Ngram language model, that w appears under the condition that h' has appeared;
h' denotes the historical words in the lower-order Ngram language model.
When the interpolation smoothing algorithm is adopted, the above calculation module can adopt the following formula to calculate the relative entropy:

$$D_{KL} = \frac{C(w, h)}{N} \log \frac{P(w \mid h)}{(1 - \lambda)\, P(w \mid h')}$$
Wherein:
C(w, h) denotes the number of times that w and h occur together in the corpus;
N denotes the sum of the occurrence counts of all Ngrams;
λ denotes the interpolation coefficient;
P(w|h') denotes the conditional probability, given by the lower-order Ngram language model, that w appears under the condition that h' has appeared;
h' denotes the historical words in the lower-order Ngram language model.
In summary, the method and apparatus for pruning a language model proposed by the present invention calculate, for each Ngram, the relative entropy between the probability distribution of the Ngram language model with this Ngram pruned and that of the original Ngram language model, and prune the Ngrams with smaller relative entropy according to actual requirements, thereby obtaining a language model close to the original Ngram model and reducing the effect of the pruning process on the performance of the Ngram language model. The actual requirement may be the size of the actual computer memory: the number of Ngrams to be pruned is determined according to the memory size, and generally the Ngrams are deleted in order of relative entropy from small to large until the requirement is met. For the same scale of Ngram model parameters, the Ngram language model pruning method proposed by the present invention can obtain a higher-quality Ngram language model. The present invention can be applied to related fields such as speech recognition, handwriting recognition, and optical character recognition. On the basis of the present invention, an information retrieval system based on a language model can be established, improving the performance of the information retrieval system.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.