CN105373529B - Intelligent Chinese word segmentation method based on a hidden Markov model - Google Patents
Intelligent Chinese word segmentation method based on a hidden Markov model Download PDF Info
- Publication number: CN105373529B
- Application number: CN201510708169.7A
- Authority
- CN
- China
- Prior art keywords
- state
- matrix
- probability
- observed value
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
- G06F18/295—Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
Abstract
The present invention relates to an intelligent Chinese word segmentation method based on a hidden Markov model. The method comprises the following steps: (1) establish the hidden Markov model parameters; (2) determine the state set Θ of the text; (3) after N, M and L are determined, abbreviate the model λ = (N, M, L, π, A, B1, B2) as λ = (π, A, B1, B2); (4) using a computer language, first segment a large volume of text with a mechanical (dictionary-based) segmentation method, then label the states by computer, thereby forming the initial π matrix, A matrix, B1 matrix and B2 matrix; (5) train the initial A, B1 and B2 matrices on text with the BW algorithm, re-estimate them with the BW re-estimation formulas, and obtain the new π matrix, A matrix and B1, B2 matrices; (6) use the parameters of the new hidden Markov model to segment the text.
Description
Technical field
The present invention relates to a Chinese word segmentation method, and more particularly to an intelligent Chinese word segmentation method based on a hidden Markov model.
Background technology
With the development of Internet technology, people's demands on computer text processing grow ever higher. Software must support inputting, displaying, editing and outputting articles, and the basis of all these functions is the recognition of the words in the text. Unlike English, however, Chinese words have no natural boundaries, so improving the ability of Chinese software to process text requires Chinese word segmentation.

At present, the main approaches to Chinese word segmentation are the mechanical (dictionary-based) method, the comprehension method and the statistical method. The mechanical method segments text by matching character strings that already exist in a dictionary; it requires a large amount of data and is helpless against newly coined words. The comprehension method has the computer segment by analysing the meaning and grammar of sentences; its drawback is that, owing to the complexity of Chinese, its algorithms are very difficult to realise. The statistical method performs segmentation by estimating, through large-scale training, the probabilities between characters.
The hidden Markov model (HMM), as a statistical analysis model, has been successfully applied in fields such as speech recognition, activity recognition, character recognition and fault diagnosis. "Research on Chinese Word Segmentation Based on the Hidden Markov Model" (Wei Xiaoning, Computer Knowledge and Technology (Academic Exchange), No. 21, 2007) uses an HMM-based algorithm that segments with a cascaded HMM (CHMM) and then applies layering, which both increases segmentation accuracy and preserves segmentation efficiency. However, the hidden Markov model lacks any analysis of the linguistic context, and handles poorly both low-frequency words and frequently co-occurring character sequences that do not in fact form words.

Asahara M, Goh C L, Wang X, et al. Combining segmenter and chunker for Chinese word segmentation[C]//Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, Volume 17. Association for Computational Linguistics, 2003: 144-147.

Xue N. Chinese word segmentation as character tagging[J]. Computational Linguistics and Chinese Language Processing, 2003, 8(1): 29-48.

These two documents describe a character-tagging hidden Markov model for Chinese word segmentation. The model inherits the advantages of character-tagging models and can treat the recognition of in-vocabulary words and out-of-vocabulary words uniformly, but it likewise lacks an analysis of the linguistic context.
Summary of the invention
The technical problem to be solved by the present invention is to provide an intelligent Chinese word segmentation method based on a hidden Markov model that can segment large volumes of Chinese text accurately and efficiently.

To solve the above problem, the intelligent Chinese word segmentation method based on a hidden Markov model according to the present invention comprises the following steps:
(1) Establish the hidden Markov model parameters λ = (N, M, L, π, A, B1, B2), wherein
N is the number of Markov chain states in the model; the N states are denoted θ_1, ..., θ_N, and the state of the Markov chain at time t is denoted q_t, with q_t ∈ {θ_1, ..., θ_N};

M is the number of possible single-character observations per state; the M observations are denoted V_1, ..., V_M, and the observation at time t is denoted o_t, with o_t ∈ {V_1, ..., V_M};

L is the number of possible multi-character observations per state; the L extended observations are denoted E_1, ..., E_L, and the extended observation at time t is denoted e_t, with e_t ∈ {E_1, ..., E_L};

π denotes the probability of choosing each state at the start of the sequence, π = (π_1, ..., π_N) with π_i = P(q_1 = θ_i), 1 ≤ i ≤ N;

A denotes the transition probability matrix for choosing the next state given the current state, A = (a_ij)_{N×N} with a_ij = P(q_{t+1} = θ_j | q_t = θ_i), 1 ≤ i, j ≤ N;

B1 denotes the probability matrix of observation V_k appearing in state j, B1 = (b1_j(k))_{N×M}, where 1 ≤ j ≤ N, 1 ≤ k ≤ M;

B2 denotes the probability matrix of two observations appearing consecutively in state j, i.e. the extended observation probability matrix, B2 = (b2_j(k))_{N×L}, where 1 ≤ j ≤ N, 1 ≤ k ≤ L;
(2) Determine the state set Θ of the text: following the rules of the Chinese language, the Chinese character state set is chosen as the four states word-initial character H, word-internal character Z, word-final character E and single-character word S;
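As an illustrative sketch (not part of the patent text), the four-state labelling of step (2) can be expressed directly: each segmented word is mapped to a state string over {H, Z, E, S}.

```python
def label_states(words):
    """Map each segmented word to its H/Z/E/S state sequence:
    H = word-initial character, Z = word-internal character,
    E = word-final character, S = single-character word."""
    states = []
    for w in words:
        if len(w) == 1:
            states.append("S")
        else:
            states.append("H" + "Z" * (len(w) - 2) + "E")
    return "".join(states)

# A two-character word yields "HE", a three-character word "HZE".
print(label_states(["中国", "人民", "银行"]))  # -> HEHEHE
```

A corpus pre-segmented by the mechanical method can be pushed through such a labeller to produce the state-annotated training data used in step (4).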
(3) After N, M and L are determined, the model λ = (N, M, L, π, A, B1, B2) is abbreviated as λ = (π, A, B1, B2);
(4) Using a computer language, first segment a large volume of text with the mechanical segmentation method; then label the states by computer and count the probability with which each character appears in each state, thereby forming the initial π matrix, A matrix, B1 matrix and B2 matrix;
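A minimal sketch (names are illustrative, not taken from the patent) of how step (4) can form the initial matrices by counting state and character frequencies over a pre-segmented, state-labelled corpus:

```python
from collections import Counter, defaultdict

def initial_matrices(tagged_sentences):
    """tagged_sentences: list of [(char, state), ...] sequences.
    Returns the initial pi vector, A matrix and B1 matrix as
    relative frequencies; the B2 matrix would be counted the same
    way over consecutive character pairs."""
    pi = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for sent in tagged_sentences:
        pi[sent[0][1]] += 1                      # first state of each sentence
        for (c, s) in sent:
            emit[s][c] += 1                      # character emissions per state
        for (_, s1), (_, s2) in zip(sent, sent[1:]):
            trans[s1][s2] += 1                   # state-to-state transitions
    def norm(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}
    return (norm(pi),
            {s: norm(trans[s]) for s in trans},
            {s: norm(emit[s]) for s in emit})
```

In practice these raw relative frequencies would be smoothed before being handed to the Baum-Welch training of step (5).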
(5) Train the initial A matrix and the initial B1 and B2 matrices on text with the BW algorithm. The observation sequence of the text is denoted O and the extended observation sequence is denoted EO. Compute the expected values and the conditional probability of the sequence under the current parameters, re-estimate the observation probability of each observation element with the BW re-estimation formulas, and compute the parameters of the new hidden Markov model λ̄ = (π̄, Ā, B̄1, B̄2); iterate until P(EO|λ) converges to a maximum, thereby obtaining the new π matrix, A matrix and B1, B2 matrices;

wherein:

π̄_i = γ_1(i);

ā_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i);

b̄_j(k) = Σ_{t: e_t = E_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j);
(6) Using the parameters λ̄ = (π̄, Ā, B̄1, B̄2) of the new hidden Markov model, carry out Chinese word segmentation with the Viterbi algorithm: the text is divided into sentences at punctuation marks and each sentence is segmented, yielding the segmented text.
The BW algorithm in step (5) is as follows: given the observation sequence O = o_1, o_2, ..., o_T and its extension EO = e_1, e_2, ..., e_T, determine a model λ̄ that maximizes the probability P(EO|λ̄) of the extended observation sequence EO;

Define the observation probability function b_j(e_t) as the probability of the extended observation e_t in state j;
The forward algorithm is:

Initialization: for 1 ≤ i ≤ N, α_1(i) = π_i b_i(e_1);

Recursion: for 1 ≤ t ≤ T-1 and 1 ≤ j ≤ N, α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(e_{t+1});

Termination: P(EO|λ) = Σ_{i=1}^{N} α_T(i);
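The forward recursion just described can be sketched in Python (an illustrative, generic implementation, not the patent's own code; the emission probability b(j, o) is assumed supplied by the caller):

```python
def forward(pi, a, b, obs):
    """Forward algorithm: alpha[t][i] = P(e_1..e_t, q_t = i | lambda).
    pi: initial state probabilities, a: N x N transition matrix,
    b(j, o): emission probability of observation o in state j."""
    n = len(pi)
    alpha = [[pi[i] * b(i, obs[0]) for i in range(n)]]    # initialization
    for o in obs[1:]:                                     # recursion
        prev = alpha[-1]
        alpha.append([sum(prev[i] * a[i][j] for i in range(n)) * b(j, o)
                      for j in range(n)])
    return alpha, sum(alpha[-1])                          # termination: P(EO | lambda)
```

For long sequences a real implementation would work in log space or rescale alpha at each step to avoid underflow.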
The backward algorithm is:

Initialization: for 1 ≤ i ≤ N, β_T(i) = 1;

Recursion: for t = T-1, T-2, ..., 1 and 1 ≤ i ≤ N, β_t(i) = Σ_{j=1}^{N} a_ij b_j(e_{t+1}) β_{t+1}(j);

Termination: P(EO|λ) = Σ_{i=1}^{N} π_i b_i(e_1) β_1(i);
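A matching sketch of the backward pass (again illustrative Python under the same conventions, with the emission function supplied by the caller); it must yield the same P(EO|λ) as the forward pass:

```python
def backward(pi, a, b, obs):
    """Backward algorithm: beta[t][i] = P(e_{t+1}..e_T | q_t = i, lambda)."""
    n = len(pi)
    beta = [[1.0] * n]                                   # initialization: beta_T = 1
    for o in reversed(obs[1:]):                          # recursion, t = T-1 .. 1
        nxt = beta[0]
        beta.insert(0, [sum(a[i][j] * b(j, o) * nxt[j] for j in range(n))
                        for i in range(n)])
    p = sum(pi[i] * b(i, obs[0]) * beta[0][i] for i in range(n))
    return beta, p                                       # same P(EO | lambda) as forward
```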
From the forward and backward variables thus defined, the BW algorithm gives

P(EO|λ) = Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(e_{t+1}) β_{t+1}(j), 1 ≤ t ≤ T-1;

Define ξ_t(i, j) as the probability that, given the training sequence O and the model λ, the chain is in state θ_i at time t and in state θ_j at time t+1, i.e.

ξ_t(i, j) = α_t(i) a_ij b_j(e_{t+1}) β_{t+1}(j) / P(EO|λ);

and the probability of being in state θ_i at time t is γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j).
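The ξ and γ quantities can be computed directly from the forward and backward variables; the following is an illustrative sketch (function and argument names are assumptions, not the patent's):

```python
def xi_gamma(alpha, beta, a, b, obs, p):
    """xi[t][i][j] and gamma[t][i] per the Baum-Welch definitions;
    alpha, beta are the forward/backward tables and p = P(EO | lambda)."""
    n = len(a)
    T = len(obs)
    xi = [[[alpha[t][i] * a[i][j] * b(j, obs[t + 1]) * beta[t + 1][j] / p
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    gamma = [[sum(xi[t][i]) for i in range(n)] for t in range(T - 1)]
    return xi, gamma
```

Summing ξ and γ over time then yields the re-estimated ā_ij and b̄_j(k) of the formulas above, and each row of gamma sums to 1.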
The Viterbi algorithm in step (6) is as follows: define δ_t(i) as the maximum probability of producing e_1, e_2, ..., e_t along a single path q_1, q_2, ..., q_t with q_t = θ_i, i.e.

δ_t(i) = max_{q_1, ..., q_{t-1}} P(q_1, ..., q_{t-1}, q_t = θ_i, e_1, ..., e_t | λ).

The optimal state sequence Q* is then obtained as follows:

Initialization: for 1 ≤ i ≤ N, δ_1(i) = π_i b_i(e_1); ψ_1(i) = 0;

Recursion: for 2 ≤ t ≤ T and 1 ≤ j ≤ N, δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) a_ij] b_j(e_t); ψ_t(j) = argmax_{1≤i≤N} [δ_{t-1}(i) a_ij];

Termination: P* = max_{1≤i≤N} δ_T(i); q*_T = argmax_{1≤i≤N} δ_T(i);

Backtracking: q*_t = ψ_{t+1}(q*_{t+1}) for t = T-1, T-2, ..., 1, which determines the optimal state sequence.
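The Viterbi recursion of step (6) admits a direct sketch (illustrative Python, not the patent's code; the emission function b is assumed supplied by the caller):

```python
def viterbi(pi, a, b, obs):
    """Return the most probable state path for obs and its probability."""
    n = len(pi)
    delta = [pi[i] * b(i, obs[0]) for i in range(n)]      # initialization
    psi = []
    for o in obs[1:]:                                     # recursion
        scores = [[delta[i] * a[i][j] for i in range(n)] for j in range(n)]
        psi.append([max(range(n), key=lambda i: scores[j][i]) for j in range(n)])
        delta = [max(scores[j]) * b(j, o) for j in range(n)]
    q = max(range(n), key=lambda i: delta[i])             # termination
    path = [q]
    for back in reversed(psi):                            # backtracking q*_t = psi_{t+1}(q*_{t+1})
        q = back[q]
        path.insert(0, q)
    return path, max(delta)
```

With the four-state H/Z/E/S set, the decoded state path translates directly into word boundaries (a boundary falls after every E or S).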
Compared with the prior art, the present invention has the following advantages:

1. The present invention first trains the existing observation probability matrices and state probability matrix with the Baum-Welch algorithm (BW algorithm for short) to obtain new observation and state probability matrices, and then, based on the new matrices, segments the text with the Viterbi algorithm. Unlike the traditional hidden Markov model, the present invention adopts a new observation probability matrix, the extended observation probability matrix. This matrix covers not only the information of each individual Chinese character but also the information of its context, which effectively reduces the errors of statistical Chinese word segmentation and greatly improves its accuracy.

2. The present invention can segment large volumes of Chinese text accurately and efficiently, which is the premise of a series of other text processing techniques.
Brief description of the drawings
The embodiment of the present invention is described in further detail below in conjunction with the accompanying drawings.
Fig. 1 is a schematic diagram of an extended observation state in an example of the present invention.
Fig. 2 is a schematic diagram of the initial values of the A matrix in an example of the present invention.
Embodiment
An intelligent Chinese word segmentation method based on a hidden Markov model comprises the following steps:
(1) Establish the hidden Markov model parameters λ = (N, M, L, π, A, B1, B2), wherein

N is the number of Markov chain states in the model; the N states are denoted θ_1, ..., θ_N, and the state of the Markov chain at time t is denoted q_t, with q_t ∈ {θ_1, ..., θ_N};

M is the number of possible single-character observations per state; the M observations are denoted V_1, ..., V_M, and the observation at time t is denoted o_t, with o_t ∈ {V_1, ..., V_M};

L is the number of possible multi-character observations per state; the L extended observations are denoted E_1, ..., E_L, and the extended observation at time t is denoted e_t, with e_t ∈ {E_1, ..., E_L};

π denotes the probability of choosing each state at the start of the sequence, π = (π_1, ..., π_N) with π_i = P(q_1 = θ_i), 1 ≤ i ≤ N;

A denotes the transition probability matrix for choosing the next state given the current state, A = (a_ij)_{N×N} with a_ij = P(q_{t+1} = θ_j | q_t = θ_i), 1 ≤ i, j ≤ N;

B1 denotes the probability matrix of observation V_k appearing in state j, B1 = (b1_j(k))_{N×M}, where 1 ≤ j ≤ N, 1 ≤ k ≤ M;

B2 denotes the probability matrix of two observations appearing consecutively in state j, i.e. the extended observation probability matrix, B2 = (b2_j(k))_{N×L}, where 1 ≤ j ≤ N, 1 ≤ k ≤ L.
(2) Determine the state set Θ of the text: following the rules of the Chinese language, the Chinese character state set is chosen as the four states word-initial character H, word-internal character Z, word-final character E and single-character word S.
(3) After N, M and L are determined, the model λ = (N, M, L, π, A, B1, B2) is abbreviated as λ = (π, A, B1, B2).
(4) Using a computer language, first segment a large volume of text with the mechanical segmentation method; then label the states by computer and count the probability with which each character appears in each state, thereby forming the initial π matrix, A matrix, B1 matrix and B2 matrix.
For example: the observation at time t and the observation at time t-1 together form one element of the observation sequence. Specifically for word segmentation, each observation is extended to two Chinese characters, i.e. the character at the current position plus the character at the previous position of the sequence, forming one extended observation state (as shown in Fig. 1). The state at each moment of the state sequence is determined by the observation (o_t) at each moment of the character sequence; after extension, each observation consists of two Chinese characters (the current character and the previous one), and this observation at time t (t ≠ 1) is e_t. The initial values of the A matrix can be obtained by counting; owing to the logical rules of Chinese, some of its entries must be 0, as shown in Fig. 2 and Table 1.
Table 1
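As an illustration of the structural zeros just mentioned (this encoding is a sketch, not the patent's Fig. 2 or Table 1): a word-initial character H can only be followed by an internal character Z or a final character E, Z only by Z or E, while E and S must be followed by the start of a new word (H or S); every other transition probability in the A matrix is 0.

```python
STATES = ["H", "Z", "E", "S"]

# Allowed successor states under the H/Z/E/S scheme; all other
# transitions are impossible in well-formed text, so their
# entries in the A matrix must be 0.
ALLOWED = {"H": {"Z", "E"}, "Z": {"Z", "E"}, "E": {"H", "S"}, "S": {"H", "S"}}

def zero_mask():
    """Return the N x N 0/1 mask of permitted transitions."""
    return [[1 if t in ALLOWED[s] else 0 for t in STATES] for s in STATES]

for row in zero_mask():
    print(row)
```

Fixing these entries at 0 before training keeps the Baum-Welch re-estimation from ever assigning them probability mass, since a zero entry stays zero under the re-estimation formulas.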
(5) Train the initial A matrix and the initial B1 and B2 matrices on text with the BW algorithm. The observation sequence of the text is denoted O and the extended observation sequence is denoted EO. Compute the expected values and the conditional probability of the sequence under the current parameters, re-estimate the observation probability of each observation element with the BW re-estimation formulas, and compute the parameters of the new hidden Markov model λ̄ = (π̄, Ā, B̄1, B̄2); iterate until P(EO|λ) converges to a maximum, thereby obtaining the new π matrix, A matrix and B1, B2 matrices;

wherein:

π̄_i = γ_1(i);

ā_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i);

b̄_j(k) = Σ_{t: e_t = E_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j).
Here the BW algorithm is as follows: given the observation sequence O = o_1, o_2, ..., o_T and its extension EO = e_1, e_2, ..., e_T, determine a model λ̄ that maximizes the probability P(EO|λ̄) of the extended observation sequence EO.

Define the observation probability function b_j(e_t) as the probability of the extended observation e_t in state j.

The forward algorithm is:

Initialization: for 1 ≤ i ≤ N, α_1(i) = π_i b_i(e_1);

Recursion: for 1 ≤ t ≤ T-1 and 1 ≤ j ≤ N, α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(e_{t+1});

Termination: P(EO|λ) = Σ_{i=1}^{N} α_T(i).

The backward algorithm is:

Initialization: for 1 ≤ i ≤ N, β_T(i) = 1;

Recursion: for t = T-1, T-2, ..., 1 and 1 ≤ i ≤ N, β_t(i) = Σ_{j=1}^{N} a_ij b_j(e_{t+1}) β_{t+1}(j);

Termination: P(EO|λ) = Σ_{i=1}^{N} π_i b_i(e_1) β_1(i).

From the forward and backward variables thus defined, the BW algorithm gives

P(EO|λ) = Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(e_{t+1}) β_{t+1}(j), 1 ≤ t ≤ T-1.

Define ξ_t(i, j) as the probability that, given the training sequence O and the model λ, the chain is in state θ_i at time t and in state θ_j at time t+1, i.e. ξ_t(i, j) = α_t(i) a_ij b_j(e_{t+1}) β_{t+1}(j) / P(EO|λ); and the probability of being in state θ_i at time t is γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j).
(6) Using the parameters λ̄ = (π̄, Ā, B̄1, B̄2) of the new hidden Markov model, carry out Chinese word segmentation with the Viterbi algorithm: the text is divided into sentences at punctuation marks and each sentence is segmented, yielding the segmented text.

Here the Viterbi algorithm is as follows: define δ_t(i) as the maximum probability of producing e_1, e_2, ..., e_t along a single path q_1, q_2, ..., q_t with q_t = θ_i. The optimal state sequence Q* is then obtained as follows:

Initialization: for 1 ≤ i ≤ N, δ_1(i) = π_i b_i(e_1); ψ_1(i) = 0;

Recursion: for 2 ≤ t ≤ T and 1 ≤ j ≤ N, δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) a_ij] b_j(e_t); ψ_t(j) = argmax_{1≤i≤N} [δ_{t-1}(i) a_ij];

Termination: P* = max_{1≤i≤N} δ_T(i); q*_T = argmax_{1≤i≤N} δ_T(i);

Backtracking: q*_t = ψ_{t+1}(q*_{t+1}) for t = T-1, T-2, ..., 1, which determines the optimal state sequence.
Claims (2)
1. An intelligent Chinese word segmentation method based on a hidden Markov model, comprising the following steps:

(1) Establish the hidden Markov model parameters λ = (N, M, L, π, A, B1, B2), wherein

N is the number of Markov chain states in the model; the N states are θ_1, ..., θ_N, and the state of the Markov chain at time t is q_t, with q_t ∈ {θ_1, ..., θ_N};

M is the number of possible single-character observations per state; the M observations are V_1, ..., V_M, and the observation at time t is o_t, with o_t ∈ {V_1, ..., V_M};

L is the number of possible multi-character observations per state; the L extended observations are E_1, ..., E_L, and the extended observation at time t is e_t, with e_t ∈ {E_1, ..., E_L};

π denotes the probability of choosing each state at the start of the sequence, π = (π_1, ..., π_N), where 1 ≤ i ≤ N;

A denotes the transition probability matrix for choosing the next state given the current state, A = (a_ij)_{N×N}, where 1 ≤ i, j ≤ N;

B1 denotes the probability matrix of the occurrence of the k-th of the M observations in the j-th state, B1 = (b1_j(k))_{N×M}, where 1 ≤ j ≤ N, 1 ≤ k ≤ M;

B2 denotes the probability matrix of the occurrence of the k-th of the L extended observations in the j-th state, i.e. the extended observation probability matrix, B2 = (b2_j(k))_{N×L}, where 1 ≤ j ≤ N, 1 ≤ k ≤ L;
(2) Determine the state set Θ of the text: following the rules of the Chinese language, the Chinese character state set is chosen as the four states word-initial character H, word-internal character Z, word-final character E and single-character word S;

(3) After N, M and L are determined, the model λ = (N, M, L, π, A, B1, B2) is abbreviated as λ = (π, A, B1, B2);
(4) Using a computer language, first segment a large volume of text with the mechanical segmentation method; then label its states by computer and count the probability with which each character appears in each state, thereby forming the initial π matrix, A matrix, B1 matrix and B2 matrix;
(5) Train the initial A matrix and the initial B1 and B2 matrices on text with the BW algorithm. The observation sequence of the text is denoted O and the extended observation sequence is denoted EO. Compute the expected values and the conditional probability of the sequence under the current parameters, re-estimate the observation probability of each observation element with the BW re-estimation formulas, and compute the parameters of the new hidden Markov model λ̄ = (π̄, Ā, B̄1, B̄2); iterate until P(EO|λ) converges to a maximum, thereby obtaining the new π matrix, A matrix and B1, B2 matrices;

wherein:

π̄_i = γ_1(i);

ā_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i);

b̄_j(k) = Σ_{t: e_t = E_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j);

T refers to the total length of the sequence;
The BW algorithm is as follows: given the observation sequence O = o_1, o_2, ..., o_T and its extension EO = e_1, e_2, ..., e_T, determine a model λ̄ that maximizes the probability P(EO|λ̄) of the extended observation sequence EO;

Define the observation probability function b_j(e_t) as the probability of the extended observation e_t in state j;

The forward algorithm is:

Initialization: for 1 ≤ i ≤ N, α_1(i) = π_i b_i(e_1);

Recursion: for 1 ≤ t ≤ T-1 and 1 ≤ j ≤ N, α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(e_{t+1});

Termination: P(EO|λ) = Σ_{i=1}^{N} α_T(i);

The backward algorithm is:

Initialization: for 1 ≤ i ≤ N, β_T(i) = 1;

Recursion: for t = T-1, T-2, ..., 1 and 1 ≤ i ≤ N, β_t(i) = Σ_{j=1}^{N} a_ij b_j(e_{t+1}) β_{t+1}(j);

Termination: P(EO|λ) = Σ_{i=1}^{N} π_i b_i(e_1) β_1(i);

From the forward and backward variables thus defined, the BW algorithm gives

P(EO|λ) = Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(e_{t+1}) β_{t+1}(j), 1 ≤ t ≤ T-1;

Define ξ_t(i, j) as the probability that, given the training sequence O and the model λ, the chain is in state θ_i at time t and in state θ_j at time t+1, i.e. ξ_t(i, j) = α_t(i) a_ij b_j(e_{t+1}) β_{t+1}(j) / P(EO|λ); and the probability of being in state θ_i at time t is γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j);
(6) Using the parameters λ̄ = (π̄, Ā, B̄1, B̄2) of the new hidden Markov model, carry out Chinese word segmentation with the Viterbi algorithm: the text is divided into sentences at punctuation marks and each sentence is segmented, yielding the segmented text.
2. The intelligent Chinese word segmentation method based on a hidden Markov model as claimed in claim 1, characterized in that the Viterbi algorithm in step (6) is as follows: define δ_t(i) as the maximum probability of producing e_1, e_2, ..., e_t along a single path q_1, q_2, ..., q_t with q_t = θ_i, i.e. δ_t(i) = max_{q_1, ..., q_{t-1}} P(q_1, ..., q_{t-1}, q_t = θ_i, e_1, ..., e_t | λ). The optimal state sequence Q* is then obtained as follows:

Initialization: for 1 ≤ i ≤ N, δ_1(i) = π_i b_i(e_1); ψ_1(i) = 0;

Recursion: for 2 ≤ t ≤ T and 1 ≤ j ≤ N, δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) a_ij] b_j(e_t); ψ_t(j) = argmax_{1≤i≤N} [δ_{t-1}(i) a_ij];

Termination: P* = max_{1≤i≤N} δ_T(i); q*_T = argmax_{1≤i≤N} δ_T(i);

Backtracking: q*_t = ψ_{t+1}(q*_{t+1}) for t = T-1, T-2, ..., 1, which determines the optimal state sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510708169.7A CN105373529B (en) | 2015-10-28 | 2015-10-28 | A kind of Word Intelligent Segmentation method based on Hidden Markov Model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105373529A CN105373529A (en) | 2016-03-02 |
CN105373529B true CN105373529B (en) | 2018-04-20 |
Family
ID=55375737
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510708169.7A Expired - Fee Related CN105373529B (en) | 2015-10-28 | 2015-10-28 | A kind of Word Intelligent Segmentation method based on Hidden Markov Model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105373529B (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105912570B (en) * | 2016-03-29 | 2019-11-15 | 北京工业大学 | Resume critical field abstracting method based on hidden Markov model |
CN106059829B (en) * | 2016-07-15 | 2019-04-12 | 北京邮电大学 | A kind of network utilization cognitive method based on hidden Markov |
CN106569997B (en) * | 2016-10-19 | 2019-12-10 | 中国科学院信息工程研究所 | Science and technology compound phrase identification method based on hidden Markov model |
CN107194176B (en) * | 2017-05-23 | 2020-07-28 | 复旦大学 | Method for filling data and predicting behaviors of intelligent operation of disabled person |
CN107273356B (en) | 2017-06-14 | 2020-08-11 | 北京百度网讯科技有限公司 | Artificial intelligence based word segmentation method, device, server and storage medium |
CN107273360A (en) * | 2017-06-21 | 2017-10-20 | 成都布林特信息技术有限公司 | Chinese notional word extraction algorithm based on semantic understanding |
CN107832307B (en) * | 2017-11-28 | 2021-02-23 | 南京理工大学 | Chinese word segmentation method based on undirected graph and single-layer neural network |
CN109933778B (en) * | 2017-12-18 | 2024-03-05 | 北京京东尚科信息技术有限公司 | Word segmentation method, word segmentation device and computer readable storage medium |
CN108170680A (en) * | 2017-12-29 | 2018-06-15 | 厦门市美亚柏科信息股份有限公司 | Keyword recognition method, terminal device and storage medium based on Hidden Markov Model |
CN108647208A (en) * | 2018-05-09 | 2018-10-12 | 上海应用技术大学 | A kind of novel segmenting method based on Chinese |
CN109408801A (en) * | 2018-08-28 | 2019-03-01 | 昆明理工大学 | A kind of Chinese word cutting method based on NB Algorithm |
CN109284358B (en) * | 2018-09-05 | 2020-08-28 | 普信恒业科技发展(北京)有限公司 | Chinese address noun hierarchical method and device |
CN109711121B (en) * | 2018-12-27 | 2021-03-12 | 清华大学 | Text steganography method and device based on Markov model and Huffman coding |
CN110162794A (en) * | 2019-05-29 | 2019-08-23 | 腾讯科技(深圳)有限公司 | A kind of method and server of participle |
CN110562653B (en) * | 2019-07-30 | 2021-02-09 | 国网浙江省电力有限公司嘉兴供电公司 | Power transformation operation detection intelligent decision system and maintenance system based on ubiquitous power Internet of things |
CN111489030B (en) * | 2020-04-09 | 2021-10-15 | 河北利至人力资源服务有限公司 | Text word segmentation based job leaving prediction method and system |
CN111767734A (en) * | 2020-06-11 | 2020-10-13 | 安徽旅贲科技有限公司 | Word segmentation method and system based on multilayer hidden horse model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101082908A (en) * | 2007-06-26 | 2007-12-05 | 腾讯科技(深圳)有限公司 | Method and system for dividing Chinese sentences |
CN101201818A (en) * | 2006-12-13 | 2008-06-18 | 李萍 | Method for calculating language structure, executing participle, machine translation and speech recognition using HMM |
CN104408034A (en) * | 2014-11-28 | 2015-03-11 | 武汉数为科技有限公司 | Text big data-oriented Chinese word segmentation method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6633819B2 (en) * | 1999-04-15 | 2003-10-14 | The Trustees Of Columbia University In The City Of New York | Gene discovery through comparisons of networks of structural and functional relationships among known genes and proteins |
- 2015-10-28: application CN201510708169.7A filed; patent CN105373529B granted; status: not active, Expired - Fee Related
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101201818A (en) * | 2006-12-13 | 2008-06-18 | 李萍 | Method for calculating language structure, executing participle, machine translation and speech recognition using HMM |
CN101082908A (en) * | 2007-06-26 | 2007-12-05 | 腾讯科技(深圳)有限公司 | Method and system for dividing Chinese sentences |
CN104408034A (en) * | 2014-11-28 | 2015-03-11 | 武汉数为科技有限公司 | Text big data-oriented Chinese word segmentation method |
Non-Patent Citations (2)
Title |
---|
Combining Segmenter and Chunker for Chinese Word Segmentation; Masayuki Asahara et al.; Proceedings of the Second SIGHAN Workshop on Chinese Language Processing; 20030712; 144-147 *
HMM Chinese word segmentation algorithm based on character-position information (基于词位信息的 HMM 中文分词算法); Liu Shanfeng et al.; Proceedings of the 12th National Conference on Man-Machine Speech Communication; 20130805; 205-208 *
Also Published As
Publication number | Publication date |
---|---|
CN105373529A (en) | 2016-03-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105373529B (en) | A kind of Word Intelligent Segmentation method based on Hidden Markov Model | |
CN107145483B (en) | A kind of adaptive Chinese word cutting method based on embedded expression | |
EP3767516A1 (en) | Named entity recognition method, apparatus, and computer-readable recording medium | |
CN106598939B (en) | A kind of text error correction method and device, server, storage medium | |
CN111046946B (en) | Burma language image text recognition method based on CRNN | |
CN107944559B (en) | Method and system for automatically identifying entity relationship | |
CN109325112B (en) | A kind of across language sentiment analysis method and apparatus based on emoji | |
CN108984530A (en) | A kind of detection method and detection system of network sensitive content | |
CN107346340A (en) | A kind of user view recognition methods and system | |
CN106570456A (en) | Handwritten Chinese character recognition method based on full-convolution recursive network | |
CN107203511A (en) | A kind of network text name entity recognition method based on neutral net probability disambiguation | |
CN105068997B (en) | The construction method and device of parallel corpora | |
CN109376242A (en) | Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks | |
CN107168957A (en) | A kind of Chinese word cutting method | |
CN111680488B (en) | Cross-language entity alignment method based on knowledge graph multi-view information | |
WO2017177809A1 (en) | Word segmentation method and system for language text | |
CN110222329B (en) | Chinese word segmentation method and device based on deep learning | |
CN110909549B (en) | Method, device and storage medium for punctuating ancient Chinese | |
CN110222328B (en) | Method, device and equipment for labeling participles and parts of speech based on neural network and storage medium | |
CN105261358A (en) | N-gram grammar model constructing method for voice identification and voice identification system | |
CN106610937A (en) | Information theory-based Chinese automatic word segmentation method | |
CN110826298B (en) | Statement coding method used in intelligent auxiliary password-fixing system | |
CN107273426A (en) | A kind of short text clustering method based on deep semantic route searching | |
CN108647191A (en) | It is a kind of based on have supervision emotion text and term vector sentiment dictionary construction method | |
CN104050255A (en) | Joint graph model-based error correction method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20180420 |