CN105373529A - Intelligent word segmentation method based on hidden Markov model - Google Patents

Intelligent word segmentation method based on hidden Markov model

Info

Publication number: CN105373529A; granted version CN105373529B
Application number: CN201510708169.7A
Authority: CN (China)
Prior art keywords: matrix, state, observed value, probability, word segmentation
Legal status: Granted; currently Active (the legal status is an assumption and not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 邓剑波, 马润宇, 刘毓智
Original and current assignee: Gansu Zhicheng Network Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • G06F18/295Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an intelligent word segmentation method based on a hidden Markov model. The method comprises the following steps: (1) build the hidden Markov model parameter λ0 = (N, M, L, π, A, B1, B2); (2) determine the state set Θ of an article; (3) after N, M and L are determined, abbreviate λ0 = (N, M, L, π, A, B1, B2) as λ = (π, A, B1, B2); (4) segment a large number of articles with a mechanical word segmentation method implemented in a computer language, then have the computer mark the states of the articles to form the initial π, A, B1 and B2 matrices; (5) train the formed initial A, B1 and B2 matrices on articles with the Baum-Welch (BW) algorithm, re-estimating via the BW re-estimation formulas to obtain new π, A, B1 and B2 matrices; and (6) using the new hidden Markov model parameter λ = (π, A, B1, B2), perform Chinese word segmentation with the Viterbi algorithm: divide the article into sentences at punctuation marks and segment each sentence, obtaining the segmented article. The method can segment large volumes of Chinese text accurately and efficiently.

Description

An intelligent word segmentation method based on a hidden Markov model (HMM)
Technical field
The present invention relates to a Chinese word segmentation method, and in particular to an intelligent word segmentation method based on a hidden Markov model (HMM).
Background technology
With the development of Internet technology, people's requirements on computer text processing keep rising. Software needs functions such as inputting, displaying, editing and outputting articles, and the basis of these functions is the recognition of words in text. Unlike English, however, Chinese words have no natural boundaries, so improving the text-processing capability of Chinese software requires Chinese word segmentation.
At present, the main approaches to Chinese word segmentation are the mechanical (dictionary-based) method, the understanding-based method and the statistical method. The mechanical method segments by matching character strings against an existing dictionary, but it needs a large amount of data and is helpless with newly coined words. The understanding-based method segments by having the computer analyse the meaning and grammar of each sentence; its drawback is that, owing to the complexity of Chinese, the algorithm is very difficult to realise. The statistical method counts, over a large training corpus, the probabilities between characters and words, and thereby realises Chinese word segmentation.
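As a concrete illustration of the mechanical (dictionary) method described above, and of the weakness the paragraph mentions, the following is a minimal sketch of forward maximum matching (this is not the patent's code; the tiny dictionary and example sentence are invented for illustration):

```python
# Forward maximum matching: greedily take the longest dictionary word starting
# at each position; unmatched single characters fall out alone. The example
# shows the classic greedy failure on 研究生命起源 ("study the origin of life").
def forward_max_match(text, dictionary, max_len=4):
    words, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + j]
            if j == 1 or cand in dictionary:  # single characters always match
                words.append(cand)
                i += j
                break
    return words

DICT = {"研究", "研究生", "生命", "起源"}
cut = forward_max_match("研究生命起源", DICT)  # greedy cut: 研究生 / 命 / 起源
```

The greedy match wrongly prefers 研究生 ("graduate student") over 研究 + 生命, illustrating why purely dictionary-driven segmentation needs the statistical refinement the invention proposes.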
The hidden Markov model (Hidden Markov Model, HMM), as a statistical analysis model, has been successfully applied to fields such as speech recognition, activity recognition, text recognition and fault diagnosis. "Research on Chinese word segmentation based on hidden Markov models" (Wei Xiaoning, Computer Knowledge and Technology (Academic Exchange), No. 21, 2007) adopts an HMM-based algorithm that segments with a cascaded hidden Markov model (CHMM) and adds a layered structure, which both increases the accuracy of segmentation and preserves its efficiency. However, the hidden Markov model lacks analysis of the language environment, and its handling of strings whose frequency is low but which are valid words, or whose frequency is high but which do not form words, is easily inaccurate.
Asahara M, Goh C L, Wang X, et al. Combining segmenter and chunker for Chinese word segmentation. Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, Volume 17. Association for Computational Linguistics, 2003: 144-147.
Xue N. Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, 2003, 8(1): 29-48.
These two documents describe a character-tagging hidden Markov model for Chinese word segmentation. This model inherits the advantage of the character-tagging model in that it can treat the recognition of in-vocabulary words and unregistered words uniformly, but it lacks analysis of the language environment.
Summary of the invention
The technical problem to be solved by this invention is to provide an intelligent word segmentation method based on a hidden Markov model that segments large volumes of Chinese text accurately and efficiently.
To solve the above problem, the intelligent word segmentation method based on a hidden Markov model of the present invention comprises the following steps:
(1) Establish the hidden Markov model parameter λ0 = (N, M, L, π, A, B1, B2), where:
N is the number of Markov states in the model; denote the N states θ_1, ..., θ_N, and the state of the Markov chain at time t as q_t, with q_t ∈ {θ_1, ..., θ_N};
M is the number of possible single-character observed values for each state; denote the M observed values V_1, ..., V_M, and the observed value at time t as o_t, with o_t ∈ {V_1, ..., V_M};
L is the number of possible multi-character observed values for each state; denote the L extended observed values E_1, ..., E_L, and the extended observed value at time t as e_t, with e_t ∈ {E_1, ..., E_L};
π gives the probability of each state at the start of the sequence: π = (π_1, ..., π_N), where π_i = P(q_1 = θ_i), 1 ≤ i ≤ N;
A is the transition probability matrix from the current state to the next state: A = (a_ij)N×N, where a_ij = P(q_{t+1} = θ_j | q_t = θ_i), 1 ≤ i, j ≤ N;
B1 is the probability matrix of observed value V_k occurring in the j-th state: B1 = (b1_jk)N×M, where b1_jk = P(o_t = V_k | q_t = θ_j), 1 ≤ j ≤ N, 1 ≤ k ≤ M;
B2 is the probability matrix of two observed values occurring consecutively in the j-th state, i.e. the extended observed value probability matrix: B2 = (b2_jk)N×L, where b2_jk = P(e_t = E_k | q_t = θ_j), 1 ≤ j ≤ N, 1 ≤ k ≤ L.
(2) Determine the state set Θ of the article. Combining the rules of the Chinese language, the state set of Chinese characters is chosen as the four states word head H, word middle Z, word tail E, and single-character word S.
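The four-state tagging of step (2) can be sketched as follows (an illustrative sketch, not the patent's code; the example words are invented): a word of one character gets state S, and a longer word gets H for its first character, Z for each middle character and E for its last character.

```python
# Map each word of a pre-segmented sentence to the H/Z/E/S state set:
# H = word head, Z = word middle, E = word tail, S = single-character word.
def word_states(word):
    if len(word) == 1:
        return ["S"]
    return ["H"] + ["Z"] * (len(word) - 2) + ["E"]

def tag_sentence(words):
    tags = []
    for w in words:
        tags.extend(word_states(w))
    return tags

tags = tag_sentence(["智能", "分词", "方法"])  # three two-character words
```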
(3) After N, M and L are determined, abbreviate λ0 = (N, M, L, π, A, B1, B2) as λ = (π, A, B1, B2).
(4) Using a computer language, first segment a large number of articles with the mechanical word segmentation method; then have the computer mark the states, count the probability of each character appearing in each state, and form the initial π, A, B1 and B2 matrices.
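The counting in step (4) amounts to relative-frequency estimation over the state-tagged corpus. The following is a minimal sketch under that reading (not the patent's code; the one-sentence toy corpus is invented):

```python
from collections import Counter, defaultdict

# Estimate initial pi, A, B1, B2 by counting over (characters, states) pairs.
def estimate_initial(tagged):
    pi = Counter()                 # initial-state counts
    trans = defaultdict(Counter)   # A:  state -> next-state counts
    emit1 = defaultdict(Counter)   # B1: state -> single-character counts
    emit2 = defaultdict(Counter)   # B2: state -> (previous, current) pair counts
    for chars, states in tagged:
        pi[states[0]] += 1
        for t, (c, s) in enumerate(zip(chars, states)):
            emit1[s][c] += 1
            if t > 0:
                trans[states[t - 1]][s] += 1
                emit2[s][(chars[t - 1], c)] += 1
    norm = lambda cnt: {k: v / sum(cnt.values()) for k, v in cnt.items()}
    return (norm(pi), {s: norm(c) for s, c in trans.items()},
            {s: norm(c) for s, c in emit1.items()},
            {s: norm(c) for s, c in emit2.items()})

# Toy corpus: 中国/人 tagged H E / S (the word 中国 plus single-character word 人).
corpus = [(["中", "国", "人"], ["H", "E", "S"])]
pi0, A0, B1_0, B2_0 = estimate_initial(corpus)
```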
(5) Train the formed initial A, B1 and B2 matrices on articles with the BW algorithm. Call the article's observed value sequence O and its extended observed value sequence EO. Compute the expectation values and the conditional probability P(EO | λ) of the sequence under the current parameter, re-estimate the observed value probability of each observation element by the BW re-estimation formulas, and compute the new hidden Markov model parameter λ̄ = (π̄, Ā, B̄1, B̄2), making P(EO | λ) converge to a maximum, thereby obtaining the new π, A, B1 and B2 matrices.
Where the re-estimation formulas are: π̄_i = γ_1(i); ā_ij = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i); b̄_j(k) = Σ_{t: e_t = E_k} γ_t(j) / Σ_{t=1..T} γ_t(j), with ξ_t(i, j) and γ_t(i) as defined in step (5).
(6) Using the new hidden Markov model parameter λ = (π, A, B1, B2), perform Chinese word segmentation with the Viterbi algorithm: divide the article into sentences at punctuation marks and segment each sentence, obtaining the segmented article.
The BW algorithm in step (5) means: given an observed value sequence O = o_1, o_2, ..., o_T and its extension EO = e_1, e_2, ..., e_T, determine a λ̄ = (π̄, Ā, B̄1, B̄2) such that the probability P(EO | λ̄) of the extended observation sequence EO is maximal under that condition.
Define the observed value probability function: b_j(e_1) = b1_j(o_1), and b_j(e_t) = b2_j(o_{t-1}, o_t) for t > 1.
The forward algorithm computes α_t(i) = P(e_1, ..., e_t, q_t = θ_i | λ):
Initialisation: for 1 ≤ i ≤ N, α_1(i) = π_i b_i(e_1);
Recursion: for 1 ≤ t ≤ T-1 and 1 ≤ j ≤ N, α_{t+1}(j) = [Σ_{i=1..N} α_t(i) a_ij] b_j(e_{t+1});
Termination: P(EO | λ) = Σ_{i=1..N} α_T(i).
The backward algorithm computes β_t(i) = P(e_{t+1}, ..., e_T | q_t = θ_i, λ):
Initialisation: for 1 ≤ i ≤ N, β_T(i) = 1;
Recursion: for t = T-1, T-2, ..., 1 and 1 ≤ i ≤ N, β_t(i) = Σ_{j=1..N} a_ij b_j(e_{t+1}) β_{t+1}(j);
Termination: P(EO | λ) = Σ_{i=1..N} π_i b_i(e_1) β_1(i).
From the forward and backward variables so defined, the BW algorithm has P(EO | λ) = Σ_{i=1..N} α_t(i) β_t(i), 1 ≤ t ≤ T-1.
Define ξ_t(i, j) as the probability, given the training sequence O (with extension EO) and the model λ, of being in state θ_i at time t and in state θ_j at time t+1, namely ξ_t(i, j) = α_t(i) a_ij b_j(e_{t+1}) β_{t+1}(j) / P(EO | λ); the probability of being in state θ_i at time t is γ_t(i) = Σ_{j=1..N} ξ_t(i, j) = α_t(i) β_t(i) / P(EO | λ).
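The forward and backward recursions described above can be sketched as follows (an illustrative sketch, not the patent's code; the two-state toy model and its numbers are invented). The observation probability uses B1 for the first character and B2 for the character pair (o_{t-1}, o_t) afterwards, as the text defines:

```python
import numpy as np

def b(j, t, obs, B1, B2):
    """Extended observation probability b_j(e_t)."""
    return B1[j][obs[0]] if t == 0 else B2[j][(obs[t - 1], obs[t])]

def forward_backward(obs, pi, A, B1, B2):
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))  # alpha_t(i) = P(e_1..e_t, q_t = theta_i | lambda)
    beta = np.zeros((T, N))   # beta_t(i)  = P(e_{t+1}..e_T | q_t = theta_i, lambda)
    for i in range(N):
        alpha[0, i] = pi[i] * b(i, 0, obs, B1, B2)
    for t in range(T - 1):
        for j in range(N):
            alpha[t + 1, j] = alpha[t] @ A[:, j] * b(j, t + 1, obs, B1, B2)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t, i] = sum(A[i, j] * b(j, t + 1, obs, B1, B2) * beta[t + 1, j]
                             for j in range(N))
    return alpha, beta, float(alpha[T - 1].sum())  # P(EO | lambda) from forward pass

# Toy two-state model over the alphabet {"a", "b"}; all numbers are invented.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B1 = [{"a": 0.5, "b": 0.5}, {"a": 0.1, "b": 0.9}]
B2 = [{("a", "b"): 0.4, ("b", "a"): 0.6}, {("a", "b"): 0.7, ("b", "a"): 0.3}]
obs = ["a", "b", "a"]
alpha, beta, P = forward_backward(obs, pi, A, B1, B2)
```

The forward termination, the backward termination and the identity P(EO | λ) = Σ_i α_t(i) β_t(i) all give the same probability, which is a convenient correctness check.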
The Viterbi algorithm in step (6) means: define δ_t(i) as the maximal probability at time t, along a single path q_1, q_2, ..., q_t with q_t = θ_i, of producing e_1, e_2, ..., e_t, namely δ_t(i) = max_{q_1, ..., q_{t-1}} P(q_1, ..., q_{t-1}, q_t = θ_i, e_1, ..., e_t | λ). The procedure for finding the optimal state sequence Q* is then:
Initialisation: for 1 ≤ i ≤ N, δ_1(i) = π_i b_i(e_1); ψ_1(i) = 0.
Recursion: for 2 ≤ t ≤ T and 1 ≤ j ≤ N, δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) a_ij] b_j(e_t); ψ_t(j) = argmax_{1≤i≤N} [δ_{t-1}(i) a_ij].
Termination: P* = max_{1≤i≤N} δ_T(i); q*_T = argmax_{1≤i≤N} δ_T(i).
Path backtracking determines the optimal state sequence: q*_t = ψ_{t+1}(q*_{t+1}), t = T-1, T-2, ..., 1.
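The initialisation, recursion, termination and backtracking steps above can be sketched as follows (an illustrative sketch, not the patent's implementation; the deterministic two-state toy model is invented so the optimal path is known in advance):

```python
import numpy as np

def viterbi(T, pi, A, b):
    """Most probable state path of length T; b(j, t) is the observation
    probability b_j(e_t) supplied as a callback."""
    N = len(pi)
    delta = np.zeros((T, N))           # delta_t(i): best path probability
    psi = np.zeros((T, N), dtype=int)  # psi_t(j): argmax backpointers
    delta[0] = [pi[i] * b(i, 0) for i in range(N)]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[psi[t, j]] * b(j, t)
    path = [int(np.argmax(delta[T - 1]))]   # q*_T
    for t in range(T - 1, 0, -1):           # backtrack q*_t = psi_{t+1}(q*_{t+1})
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), float(delta[T - 1].max())

# Deterministic toy: state 0 must go to state 1 and back, so for T = 3 the
# best path is 0 -> 1 -> 0 with probability 1.
pi2 = np.array([1.0, 0.0])
A2 = np.array([[0.0, 1.0], [1.0, 0.0]])
path, prob = viterbi(3, pi2, A2, lambda j, t: 1.0)
```

In the patent's setting the decoded states would be mapped to H/Z/E/S and word boundaries placed after every E or S state.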
Compared with the prior art, the present invention has the following advantages:
1. The present invention first trains the existing observed value probability matrices and state probability matrix with the Baum-Welch algorithm (BW algorithm for short) to obtain new observed value probability matrices and a new state probability matrix, and then, based on the new matrices, applies the Viterbi algorithm to segment the article into Chinese words. Unlike the traditional hidden Markov model, the present invention adopts a novel observed value probability matrix, namely the extended observed value probability matrix. This matrix covers not only the information of the single Chinese character itself but also the information of its context, which effectively reduces the errors of statistical Chinese word segmentation and greatly improves its accuracy.
2. The present invention can segment large volumes of Chinese text accurately and efficiently, as a prerequisite for a series of other text-processing techniques.
Description of the drawings
The specific embodiments of the present invention are described in further detail below with reference to the drawings.
Fig. 1 is a schematic diagram of an extended observation state in an example of the present invention.
Fig. 2 is a schematic diagram of the initial values of the A matrix in an example of the present invention.
Embodiment
An intelligent word segmentation method based on a hidden Markov model comprises the following steps:
(1) Establish the hidden Markov model parameter λ0 = (N, M, L, π, A, B1, B2), where:
N is the number of Markov states in the model; denote the N states θ_1, ..., θ_N, and the state of the Markov chain at time t as q_t, with q_t ∈ {θ_1, ..., θ_N};
M is the number of possible single-character observed values for each state; denote the M observed values V_1, ..., V_M, and the observed value at time t as o_t, with o_t ∈ {V_1, ..., V_M};
L is the number of possible multi-character observed values for each state; denote the L extended observed values E_1, ..., E_L, and the extended observed value at time t as e_t, with e_t ∈ {E_1, ..., E_L};
π gives the probability of each state at the start of the sequence: π = (π_1, ..., π_N), where π_i = P(q_1 = θ_i), 1 ≤ i ≤ N;
A is the transition probability matrix from the current state to the next state: A = (a_ij)N×N, where a_ij = P(q_{t+1} = θ_j | q_t = θ_i), 1 ≤ i, j ≤ N;
B1 is the probability matrix of observed value V_k occurring in the j-th state: B1 = (b1_jk)N×M, where b1_jk = P(o_t = V_k | q_t = θ_j), 1 ≤ j ≤ N, 1 ≤ k ≤ M;
B2 is the probability matrix of two observed values occurring consecutively in the j-th state, i.e. the extended observed value probability matrix: B2 = (b2_jk)N×L, where b2_jk = P(e_t = E_k | q_t = θ_j), 1 ≤ j ≤ N, 1 ≤ k ≤ L.
(2) Determine the state set Θ of the article. Combining the rules of the Chinese language, the state set of Chinese characters is chosen as the four states word head H, word middle Z, word tail E, and single-character word S.
(3) After N, M and L are determined, abbreviate λ0 = (N, M, L, π, A, B1, B2) as λ = (π, A, B1, B2).
(4) Using a computer language, first segment a large number of articles with the mechanical word segmentation method; then have the computer mark the states, count the probability of each character appearing in each state, and form the initial π, A, B1 and B2 matrices.
For example: the observation at time t and the observation at time t-1 jointly form one element of the observation sequence. Applied to word segmentation, each element is expanded to two Chinese characters by adding the character of the previous moment in the sequence, so that each element becomes an extended observation state (as shown in Fig. 1). In the state sequence, the state at each moment is determined by the observed value o_t at that moment in the character sequence; the observed value is extended to two Chinese characters (the current character and the previous one), so the observed value at time t (t ≠ 1) is e_t = (o_{t-1}, o_t). The initial values of the A matrix can be obtained by statistics; owing to the logical rules of Chinese, some of its entries must be 0, as shown in Fig. 2 and Table 1.
Table 1
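The extension of the observation sequence described in the example above can be sketched as follows (an illustrative sketch, not the patent's code; the example phrase is invented):

```python
# Build the extended observation sequence EO from a character sequence:
# e_1 is the first character alone; for t > 1, e_t = (o_{t-1}, o_t).
def extend_observations(chars):
    return [chars[0]] + [(chars[t - 1], chars[t]) for t in range(1, len(chars))]

eo = extend_observations(list("分词方法"))  # four characters -> one char + three pairs
```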
(5) Train the formed initial A, B1 and B2 matrices on articles with the BW algorithm. Call the article's observed value sequence O and its extended observed value sequence EO. Compute the expectation values and the conditional probability P(EO | λ) of the sequence under the current parameter, re-estimate the observed value probability of each observation element by the BW re-estimation formulas, and compute the new hidden Markov model parameter λ̄ = (π̄, Ā, B̄1, B̄2), making P(EO | λ) converge to a maximum, thereby obtaining the new π, A, B1 and B2 matrices.
Where the re-estimation formulas are: π̄_i = γ_1(i); ā_ij = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i); b̄_j(k) = Σ_{t: e_t = E_k} γ_t(j) / Σ_{t=1..T} γ_t(j), with ξ_t(i, j) and γ_t(i) as defined below.
Here the BW algorithm means: given an observed value sequence O = o_1, o_2, ..., o_T and its extension EO = e_1, e_2, ..., e_T, determine a λ̄ = (π̄, Ā, B̄1, B̄2) such that the probability P(EO | λ̄) of the extended observation sequence EO is maximal under that condition.
Define the observed value probability function: b_j(e_1) = b1_j(o_1), and b_j(e_t) = b2_j(o_{t-1}, o_t) for t > 1.
The forward algorithm computes α_t(i) = P(e_1, ..., e_t, q_t = θ_i | λ):
Initialisation: for 1 ≤ i ≤ N, α_1(i) = π_i b_i(e_1);
Recursion: for 1 ≤ t ≤ T-1 and 1 ≤ j ≤ N, α_{t+1}(j) = [Σ_{i=1..N} α_t(i) a_ij] b_j(e_{t+1});
Termination: P(EO | λ) = Σ_{i=1..N} α_T(i).
The backward algorithm computes β_t(i) = P(e_{t+1}, ..., e_T | q_t = θ_i, λ):
Initialisation: for 1 ≤ i ≤ N, β_T(i) = 1;
Recursion: for t = T-1, T-2, ..., 1 and 1 ≤ i ≤ N, β_t(i) = Σ_{j=1..N} a_ij b_j(e_{t+1}) β_{t+1}(j);
Termination: P(EO | λ) = Σ_{i=1..N} π_i b_i(e_1) β_1(i).
From the forward and backward variables so defined, the BW algorithm has P(EO | λ) = Σ_{i=1..N} α_t(i) β_t(i), 1 ≤ t ≤ T-1.
Define ξ_t(i, j) as the probability, given the training sequence O (with extension EO) and the model λ, of being in state θ_i at time t and in state θ_j at time t+1, namely ξ_t(i, j) = α_t(i) a_ij b_j(e_{t+1}) β_{t+1}(j) / P(EO | λ); the probability of being in state θ_i at time t is γ_t(i) = Σ_{j=1..N} ξ_t(i, j) = α_t(i) β_t(i) / P(EO | λ).
(6) Using the new hidden Markov model parameter λ = (π, A, B1, B2), perform Chinese word segmentation with the Viterbi algorithm: divide the article into sentences at punctuation marks and segment each sentence, obtaining the segmented article.
Here the Viterbi algorithm means: define δ_t(i) as the maximal probability at time t, along a single path q_1, q_2, ..., q_t with q_t = θ_i, of producing e_1, e_2, ..., e_t, namely δ_t(i) = max_{q_1, ..., q_{t-1}} P(q_1, ..., q_{t-1}, q_t = θ_i, e_1, ..., e_t | λ). The procedure for finding the optimal state sequence Q* is then:
Initialisation: for 1 ≤ i ≤ N, δ_1(i) = π_i b_i(e_1); ψ_1(i) = 0.
Recursion: for 2 ≤ t ≤ T and 1 ≤ j ≤ N, δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) a_ij] b_j(e_t); ψ_t(j) = argmax_{1≤i≤N} [δ_{t-1}(i) a_ij].
Termination: P* = max_{1≤i≤N} δ_T(i); q*_T = argmax_{1≤i≤N} δ_T(i).
Path backtracking determines the optimal state sequence: q*_t = ψ_{t+1}(q*_{t+1}), t = T-1, T-2, ..., 1.

Claims (3)

1. An intelligent word segmentation method based on a hidden Markov model (HMM), comprising the following steps:
(1) establish the hidden Markov model parameter λ0 = (N, M, L, π, A, B1, B2), where:
N is the number of Markov states in the model; denote the N states θ_1, ..., θ_N, and the state of the Markov chain at time t as q_t, with q_t ∈ {θ_1, ..., θ_N};
M is the number of possible single-character observed values for each state; denote the M observed values V_1, ..., V_M, and the observed value at time t as o_t, with o_t ∈ {V_1, ..., V_M};
L is the number of possible multi-character observed values for each state; denote the L extended observed values E_1, ..., E_L, and the extended observed value at time t as e_t, with e_t ∈ {E_1, ..., E_L};
π gives the probability of each state at the start of the sequence: π = (π_1, ..., π_N), where π_i = P(q_1 = θ_i), 1 ≤ i ≤ N;
A is the transition probability matrix from the current state to the next state: A = (a_ij)N×N, where a_ij = P(q_{t+1} = θ_j | q_t = θ_i), 1 ≤ i, j ≤ N;
B1 is the probability matrix of observed value V_k occurring in the j-th state: B1 = (b1_jk)N×M, where b1_jk = P(o_t = V_k | q_t = θ_j), 1 ≤ j ≤ N, 1 ≤ k ≤ M;
B2 is the probability matrix of two observed values occurring consecutively in the j-th state, i.e. the extended observed value probability matrix: B2 = (b2_jk)N×L, where b2_jk = P(e_t = E_k | q_t = θ_j), 1 ≤ j ≤ N, 1 ≤ k ≤ L;
(2) determine the state set Θ of the article: combining the rules of the Chinese language, choose the state set of Chinese characters as the four states word head H, word middle Z, word tail E, and single-character word S;
(3) after N, M and L are determined, abbreviate λ0 = (N, M, L, π, A, B1, B2) as λ = (π, A, B1, B2);
(4) using a computer language, first segment a large number of articles with the mechanical word segmentation method; then have the computer mark the states, count the probability of each character appearing in each state, and form the initial π, A, B1 and B2 matrices;
(5) train the formed initial A, B1 and B2 matrices on articles with the BW algorithm: call the article's observed value sequence O and its extended observed value sequence EO, compute the expectation values and the conditional probability P(EO | λ) of the sequence under the current parameter, re-estimate the observed value probability of each observation element by the BW re-estimation formulas, and compute the new hidden Markov model parameter λ̄ = (π̄, Ā, B̄1, B̄2), making P(EO | λ) converge to a maximum, thereby obtaining the new π, A, B1 and B2 matrices;
(6) using the new hidden Markov model parameter λ = (π, A, B1, B2), perform Chinese word segmentation with the Viterbi algorithm: divide the article into sentences at punctuation marks and segment each sentence, obtaining the segmented article.
2. The intelligent word segmentation method based on a hidden Markov model as claimed in claim 1, characterised in that the BW algorithm in step (5) means: given an observed value sequence O = o_1, o_2, ..., o_T and its extension EO = e_1, e_2, ..., e_T, determine a λ̄ = (π̄, Ā, B̄1, B̄2) such that the probability P(EO | λ̄) of the extended observation sequence EO is maximal under that condition;
define the observed value probability function: b_j(e_1) = b1_j(o_1), and b_j(e_t) = b2_j(o_{t-1}, o_t) for t > 1;
the forward algorithm computes α_t(i) = P(e_1, ..., e_t, q_t = θ_i | λ):
initialisation: for 1 ≤ i ≤ N, α_1(i) = π_i b_i(e_1);
recursion: for 1 ≤ t ≤ T-1 and 1 ≤ j ≤ N, α_{t+1}(j) = [Σ_{i=1..N} α_t(i) a_ij] b_j(e_{t+1});
termination: P(EO | λ) = Σ_{i=1..N} α_T(i);
the backward algorithm computes β_t(i) = P(e_{t+1}, ..., e_T | q_t = θ_i, λ):
initialisation: for 1 ≤ i ≤ N, β_T(i) = 1;
recursion: for t = T-1, T-2, ..., 1 and 1 ≤ i ≤ N, β_t(i) = Σ_{j=1..N} a_ij b_j(e_{t+1}) β_{t+1}(j);
termination: P(EO | λ) = Σ_{i=1..N} π_i b_i(e_1) β_1(i);
from the forward and backward variables so defined, the BW algorithm has P(EO | λ) = Σ_{i=1..N} α_t(i) β_t(i), 1 ≤ t ≤ T-1;
ξ_t(i, j) is defined as the probability, given the training sequence O (with extension EO) and the model λ, of being in state θ_i at time t and in state θ_j at time t+1, namely ξ_t(i, j) = α_t(i) a_ij b_j(e_{t+1}) β_{t+1}(j) / P(EO | λ); the probability of being in state θ_i at time t is γ_t(i) = Σ_{j=1..N} ξ_t(i, j) = α_t(i) β_t(i) / P(EO | λ).
3. The intelligent word segmentation method based on a hidden Markov model as claimed in claim 1, characterised in that the Viterbi algorithm in step (6) means: define δ_t(i) as the maximal probability at time t, along a single path q_1, q_2, ..., q_t with q_t = θ_i, of producing e_1, e_2, ..., e_t, namely δ_t(i) = max_{q_1, ..., q_{t-1}} P(q_1, ..., q_{t-1}, q_t = θ_i, e_1, ..., e_t | λ); the procedure for finding the optimal state sequence Q* is then:
initialisation: for 1 ≤ i ≤ N, δ_1(i) = π_i b_i(e_1) and ψ_1(i) = 0;
recursion: for 2 ≤ t ≤ T and 1 ≤ j ≤ N, δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) a_ij] b_j(e_t) and ψ_t(j) = argmax_{1≤i≤N} [δ_{t-1}(i) a_ij];
termination: P* = max_{1≤i≤N} δ_T(i) and q*_T = argmax_{1≤i≤N} δ_T(i);
path backtracking determines the optimal state sequence: q*_t = ψ_{t+1}(q*_{t+1}), t = T-1, T-2, ..., 1.
Priority Applications (1)

CN201510708169.7A, priority date and filing date 2015-10-28: An intelligent word segmentation method based on a hidden Markov model. Status: Active.

Publications (2)

CN105373529A (application), published 2016-03-02.
CN105373529B (grant), published 2018-04-20.

Family ID: 55375737
Country: CN (China)



Non-Patent Citations (2)

Masayuki Asahara et al.: "Combining Segmenter and Chunker for Chinese Word Segmentation", Proceedings of the Second SIGHAN Workshop on Chinese Language Processing (cited by examiner).
刘善峰 et al.: "An HMM Chinese word segmentation algorithm based on word-position information", Proceedings of the 12th National Conference on Man-Machine Speech Communication (cited by examiner).




Legal Events

Code: PB01 - Publication
Code: SE01 - Entry into force of request for substantive examination
Code: GR01 - Patent grant