CN102622339A

CN102622339A - Intersection-type pseudo-ambiguity recognition method based on an improved maximum matching algorithm

Info

Publication number: CN102622339A
Application number: CN2012100501542A
Authority: CN
Inventors: 周俊; 郑中华; 张炜
Original assignee: ANHUI BORYOU INFORMATION TECHNOLOGY CO LTD
Current assignee: ANHUI BORYOU INFORMATION TECHNOLOGY CO LTD
Priority date: 2012-02-24
Filing date: 2012-02-24
Publication date: 2012-08-01

Abstract

The invention discloses an intersection-type pseudo-ambiguity recognition method based on an improved maximum matching algorithm. The method comprises three core functions: intersection-type ambiguity detection, true/pseudo judgment of intersection-type ambiguities, and pseudo-ambiguity resolution. The detection algorithm guarantees 100% detection of intersection-type ambiguities, has low overhead and high execution speed, has complexity of only O(N), and needs no ambiguity word list or related statistical data, so it is simple and efficient. The pseudo-ambiguity resolution method has strong recognition capability for pseudo-ambiguities, can tell true intersection-type ambiguities from pseudo ones, and avoids wrongly resolving true ambiguities. In addition, the required data are simple and easy to acquire.

Description

Intersection-type pseudo-ambiguity recognition method based on an improved maximum matching algorithm

[technical field]

The present invention relates to automatic Chinese word segmentation technology, and in particular to an intersection-type pseudo-ambiguity recognition method based on an improved maximum matching algorithm.

[background technology]

Artificial intelligence (AI) is no longer an obscure term. Since it was first proposed, more than fifty years of research and development have led to its wide use in fields such as machine manufacturing, information control, aerospace, and bionics. Natural language processing (NLP), also called natural language understanding, is an important branch of artificial intelligence and an important foundation for other branches of the field: knowledge learning in expert systems, voice control in the control field, and intelligent search in search engines all rely on NLP for analysis. NLP is therefore a technical subject of great research significance in artificial intelligence.

Natural language understanding divides into several research directions according to the language concerned, the most important being English and Chinese natural language understanding. English is considerably easier than Chinese in this respect, because an English sentence is already composed of minimal units with complete meaning (English words), whereas a Chinese sentence is a continuous string of Chinese characters, and a single character cannot express complete meaning. The minimal unit with complete meaning in Chinese is the entry (word). Therefore, before a Chinese sentence can be understood semantically, the continuous character string must be segmented into a set of entries that serves as the data basis for Chinese natural language understanding. This process is called Chinese word segmentation. It is the basic step of Chinese natural language understanding, and also a critical one.

One of the main difficulties in Chinese word segmentation is ambiguity recognition: the process of detecting and resolving all ambiguities in an input Chinese sentence during segmentation. It comprises two key techniques, ambiguity detection and ambiguity resolution. Ambiguity detection locates the ambiguities in the input sentence, if any exist; ambiguity resolution resolves each located ambiguity and outputs the result, i.e. the correct segmentation path of the ambiguity.

Because the Chinese language is flexible, ambiguities are diverse, and different classes of ambiguity require different detection and resolution methods. According to whether the ambiguous string is itself an entry, ambiguities are divided into combination-type ambiguities and intersection-type ambiguities. A combination-type ambiguity is one whose string is itself a Chinese entry. For example, in the sentence "only people who understand high technology can solve this problem", the string "talent" is a segmentation ambiguity: it can be split into the two entries "people" and " ", or taken as the single entry "talent". The ambiguous string is itself an entry, so "talent" is a combination-type ambiguity. An intersection-type ambiguity is one whose string is not itself an entry. For example, in the sentence "the technology and service of this factory are both first-class", the ambiguous string "and service" is not an entry, so it is an intersection-type ambiguity.

According to the number of correct segmentations, ambiguities are also divided into true ambiguities and pseudo-ambiguities. A true ambiguity has two or more segmentations that may each be correct, depending on context. For example, the ambiguous string "Chinese household" is segmented into the two entries "China" and "household" in the sentence "developing china household cause", but into the three entries "in", "country" and "residence" in the sentence "developing china household world medium level". A pseudo-ambiguity has only one correct segmentation in any context. For example, the string "spot" is always segmented into the two entries "crime" and "scene", and never into the three entries "case", "discovery" and "field". Obviously, all combination-type ambiguities are true ambiguities.

Statistics on large-scale corpora show that intersection-type ambiguities account for more than 90% of all ambiguities, and that intersection-type pseudo-ambiguities (hereinafter simply pseudo-ambiguities) account for more than half of all intersection-type ambiguities. Pseudo-ambiguity is therefore the most common kind of Chinese word segmentation ambiguity.

Technical solution of prior art one

Memory-based recognition is widely used for pseudo-ambiguities. It exploits the fact that a pseudo-ambiguity has a unique correct segmentation path. All intersection-type ambiguities are collected by statistics over a large-scale corpus, the true intersection-type ambiguities are filtered out to leave a set of pseudo-ambiguities, and all of these are recorded in a pseudo-ambiguity word list. During Chinese word segmentation, ambiguities are detected by matching the Chinese sentence against the pseudo-ambiguity word list, and the resolution of each detected pseudo-ambiguity is obtained directly by looking it up in the list. The method has high recognition accuracy, a simple principle, and is easy to operate, but both detection and resolution depend entirely on the pseudo-ambiguity word list and are strongly affected by its size, so recall is low. The brief execution flow of this technical solution is shown in Fig. 3.

Shortcomings of prior art one:

1. The corpus used for statistics cannot contain every intersection-type pseudo-ambiguity, so the pseudo-ambiguity word list cannot record all pseudo-ambiguities. The technique therefore cannot guarantee that all pseudo-ambiguities are detected; some are missed, which lowers the recall of pseudo-ambiguity recognition;

2. A true intersection-type ambiguity may appear in the corpus with only one of its segmentation paths, so it is easily mistaken for a pseudo-ambiguity and recorded in the pseudo-ambiguity word list, which eventually causes recognition errors. This is the root cause limiting the accuracy of this solution, and it is difficult to eliminate completely;

3. Recognition is rather mechanical: a pseudo-ambiguity not recorded in the word list cannot be recognized at all, and recall depends heavily on the size of the list. The list must therefore be continually updated and maintained to record more pseudo-ambiguities and raise recall as far as possible.

Technical solution of prior art two

Ambiguity recognition based on lexical analysis is arguably the most widely used ambiguity recognition approach at present, but it essentially targets only pseudo-ambiguities; in what follows, ambiguity recognition based on lexical analysis always refers to pseudo-ambiguity recognition. The method first detects the ambiguities in a Chinese sentence by pseudo-ambiguity word list matching, ambiguity marking [1], or another intersection-type ambiguity detection algorithm [2]. It then builds a mathematical model of each ambiguity from a combination of word-formation and statistical ideas, computes the selection possibility of each segmentation path of the ambiguity, and takes the path with the highest selection possibility as the resolution result. For resolution, this solution does not depend on a pseudo-ambiguity word list; it uses only statistics about characters and words, such as character mutual information and word frequency, so recognition is flexible and has some capability for all ambiguities.

Fig. 4 shows the brief execution flow of prior art two. Its ambiguity detection currently relies mainly on pseudo-ambiguity word list matching, pseudo-ambiguity marking, or other intersection-type ambiguity detection algorithms. The mathematical model (i.e. the selection possibility computation model) is the core of prior art two, and its quality directly determines the recognition performance of the whole solution. Sun Maosong et al. build their model from the mutual information and t-test difference of adjacent characters to describe the likelihood that adjacent characters form a word. Building on that work, Wang Sili et al. propose the notion of two-character coupling degree and combine it with the t-test difference to model pseudo-ambiguity recognition.

Shortcomings of prior art two:

Many researchers have proposed their own concrete implementations of pseudo-ambiguity recognition based on lexical analysis, but depending on the mathematical model used, each has one or more of the following deficiencies.

1. Missed pseudo-ambiguities. Detection by pseudo-ambiguity word list matching or ambiguity marking cannot guarantee that 100% of pseudo-ambiguities are found, which limits recall.

2. No true/pseudo judgment. Some current intersection-type ambiguity detection algorithms can detect all intersection-type ambiguities, but cannot distinguish true from pseudo ones. Because prior art two is effective only for pseudo-ambiguities and does not check whether an ambiguity is true or pseudo, true ambiguities may be wrongly resolved, which lowers recognition accuracy.

3. Incomplete mathematical model. The mathematical model, i.e. the selection possibility computation model for ambiguity segmentation paths, is the core of prior art two. In many concrete implementations the model considers too few factors or its parameters are poorly chosen, so neither the accuracy nor the recall of pseudo-ambiguity recognition is satisfactory.

[summary of the invention]

The technical problem to be solved by the present invention is to provide an intersection-type pseudo-ambiguity recognition method based on an improved maximum matching algorithm. The method guarantees 100% detection of intersection-type ambiguities, has strong pseudo-ambiguity resolution capability, executes quickly, has low overhead, and requires only simple data.

To solve the above technical problem, the present invention adopts the following technical solution: an intersection-type pseudo-ambiguity recognition method based on an improved maximum matching algorithm, comprising the following steps:

(1) Input a Chinese sentence, detect the intersection-type ambiguities in the sentence with the improved maximum matching algorithm, and put them into the intersection-type ambiguity set. If the set is empty, the input sentence contains no intersection-type ambiguity; no processing is performed and the method returns directly. Otherwise, traverse all ambiguities in the set and process each according to step (2);

(2) Perform a full segmentation of the ambiguity with a recursive method based on depth-first search to obtain the set of all segmentation paths, then traverse the path set and process each path according to step (3);

(3) Model the segmentation path of the ambiguity according to the given selection possibility computation model, and compute and record the selection possibility value of the path. Compute the difference between the two largest selection possibility values in the path set of the ambiguity. If the difference is within a given threshold, the ambiguity is deemed a true ambiguity: resolution stops and the ambiguity is submitted to the true ambiguity resolution module. Otherwise, the ambiguity is deemed a pseudo-ambiguity, and the path with the largest selection possibility value is taken as its resolution result.
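As an illustration of step (2), the following is a minimal Python sketch of full segmentation by recursive depth-first search. The function name `all_segmentations` and the toy dictionary of Latin-letter strings are assumptions made for the example; the method itself only specifies that a recursive depth-first search enumerates every segmentation path.

```python
def all_segmentations(text, dictionary):
    """Enumerate every way to split `text` into dictionary entries (DFS)."""
    paths = []

    def dfs(start, path):
        # Reaching the end of the string means `path` is a complete segmentation.
        if start == len(text):
            paths.append(list(path))
            return
        # Try every entry that begins at `start`, shortest first.
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in dictionary:
                path.append(word)
                dfs(end, path)
                path.pop()  # backtrack before trying a longer entry

    dfs(0, [])
    return paths


# Toy data (hypothetical): "abc" stands in for an intersection-type ambiguity
# in which "ab" and "bc" overlap on the middle character.
toy_dictionary = {"a", "ab", "bc", "c"}
print(all_segmentations("abc", toy_dictionary))
```

The number of paths can grow quickly with string length, but the search is applied only to the detected ambiguous string, not to the whole sentence.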

The invention has the beneficial effects as follows:

1. During intersection-type ambiguity detection, the technical solution of the present invention uses no ambiguity word list and no statistics on ambiguities or words, yet guarantees 100% detection of intersection-type ambiguities, avoiding the low recall caused by missed detections. The detection algorithm has complexity of only O(N), low overhead, and high detection speed;

2. The path selection possibility computation model requires only data that are simple and easy to obtain, and considers a broader set of factors: the minimum word frequency of the ambiguity path, the relative maximum word frequency spacing, and the mean square deviation of word frequency fluctuation, suitably combined with the long-word-first principle. It therefore has strong pseudo-ambiguity resolution capability;

3. The technical solution has strong capability to distinguish true from pseudo intersection-type ambiguities. It avoids the recognition errors that most current lexical-analysis-based intersection-type ambiguity recognition techniques make by misclassifying ambiguities as true or pseudo, and thus effectively improves recognition accuracy.

[description of drawings]

The present invention is described in further detail below with reference to the accompanying drawings and an embodiment.

Fig. 1 is the overall execution flow chart of the embodiment of the present invention.

Fig. 2 is a flow chart of the intersection-type ambiguity detection algorithm based on the improved maximum matching algorithm in the embodiment of the present invention.

Fig. 3 is the concise and to the point execution flow process of prior art one.

Fig. 4 is the concise and to the point execution flow process of prior art two.

[embodiment]

This embodiment comprises three core parts: intersection-type ambiguity detection, true/pseudo judgment of intersection-type ambiguities, and pseudo-ambiguity resolution. Intersection-type ambiguity detection finds all intersection-type ambiguities in a given Chinese sentence and puts them into the intersection-type ambiguity set; it is implemented by the improved maximum matching algorithm. True/pseudo judgment identifies all true ambiguities in the set and removes them. Because this technical solution cannot resolve true ambiguities, they must be removed from the set and submitted to a true ambiguity resolution module (true ambiguity resolution is beyond the scope of this technical solution and is not described). Pseudo-ambiguity resolution then resolves, one by one, the ambiguities remaining in the set after the true ambiguities are removed. Resolution comprises three steps: full segmentation of the ambiguity into paths, computation of the selection possibility of each segmentation path, and selection of the correct segmentation path. The selection possibility computation model is the core of pseudo-ambiguity resolution. It should be noted in particular that true/pseudo judgment and pseudo-ambiguity resolution are not performed sequentially but simultaneously: whether an ambiguity is true or pseudo is determined while it is being resolved. If it is a pseudo-ambiguity, resolution continues and the result is returned; if it is a true ambiguity, it is submitted to the true ambiguity resolution part.

As shown in Fig. 1, the execution flow of the overall technical solution comprises the following three steps:

The specific procedure for these steps is as follows:

One. Intersection-type ambiguity detection

Intersection-type ambiguity detection finds all intersection-type ambiguities in a given input Chinese sentence. This technical solution uses an improved maximum matching algorithm for the detection. The algorithm needs no ambiguity word list and no statistics on ambiguities, achieves 100% detection of intersection-type ambiguities, and has complexity O(N), with low overhead and fast execution. The algorithm itself cannot distinguish true from pseudo intersection-type ambiguities, so it detects all of them, both true and pseudo.

The algorithm is called improved maximum matching because it evolved from the maximum matching algorithm and retains its basic long-word-first idea; the difference is that it extends the definition of a Chinese entry to the generalized entry. A continuous character string is a generalized entry if it satisfies one of the following two conditions:

1. The string is itself a conventional entry, i.e. an entry recorded in the dictionary;

2. The string is an intersection-type ambiguity; for example, "Chinese people" and "being combined into" are both generalized entries.

Fig. 2 is the flow chart of the intersection-type ambiguity detection algorithm based on the improved maximum matching algorithm. The detailed execution flow of the algorithm is as follows:

1. Given an input Chinese sentence S, let N be the number of Chinese characters in S and W_i the i-th character of S. Let L(x) denote the number of characters in the longest dictionary word that begins with character x. Let Index denote the current position in S, initialized to 1 so that it points to the first character. Let SI and EI be the start and end positions in S of the longest generalized entry containing the character at position Index, initialized to 1 and 2 respectively. Let the ambiguity set A hold all detected ambiguities, initialized to the empty set;

2. If Index > N, go to step 5. Otherwise, check whether Index < EI: if not, go to step 4; if so, take the Index-th character W_Index of S, and if L(W_Index) + Index > N + 1, set L(W_Index) = N + 1 - Index;

3. Take the Chinese character string of S between positions Index and Index + L(W_Index) (including the character at Index, excluding the character at Index + L(W_Index)). If this string is not a recorded Chinese entry and Index + L(W_Index) > EI, decrement L(W_Index) by 1 and repeat step 3. Otherwise, set EI = Index + L(W_Index), increment Index by 1, and return to step 2;

4. Extract the Chinese character string of S between positions SI and EI (including the character at SI, excluding the character at EI), then set SI = Index and EI = Index + 1. If the extracted string is not a word recorded in the dictionary, it is an intersection-type ambiguity: put it into set A and return to step 2. Otherwise, it is not an intersection-type ambiguity; return to step 2 directly;

5. The improved maximum matching algorithm terminates and returns the intersection-type ambiguity set A. If A is empty, the input sentence contains no intersection-type ambiguity.
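The detection procedure can be sketched in Python as follows. This is one reading of the steps, not a definitive implementation, and it rests on several assumptions that the text does not state: positions are 0-based with half-open spans [SI, EI); EI is only ever extended (the sketch uses `max`, so a short word inside a span cannot shrink it); a span of a single character is never reported; and the span still open when the sentence ends is checked before returning. The function name and the toy dictionary of Latin-letter strings are hypothetical.

```python
def detect_intersection_ambiguities(sentence, dictionary):
    """Return maximal chains of overlapping dictionary words that are not words."""
    # L(x): length of the longest dictionary word starting with character x.
    max_len = {}
    for word in dictionary:
        max_len[word[0]] = max(max_len.get(word[0], 1), len(word))

    n = len(sentence)
    found = []
    si, ei = 0, 1  # current span [si, ei), grown while words keep overlapping

    def close_span():
        span = sentence[si:ei]
        # Assumption: only multi-character spans can be ambiguities.
        if len(span) > 1 and span not in dictionary:
            found.append(span)

    index = 0
    while index < n:
        if index >= ei:
            # The span cannot grow further: check it and open a new one.
            close_span()
            si, ei = index, index + 1
        # Longest candidate starting here, clamped to the sentence end.
        length = min(max_len.get(sentence[index], 1), n - index)
        # Shrink the candidate until it is a word or no longer crosses ei.
        while (index + length > ei
               and sentence[index:index + length] not in dictionary):
            length -= 1
        ei = max(ei, index + length)  # assumption: never shrink the span
        index += 1

    close_span()  # assumption: flush the span left open at the sentence end
    return found


toy_dictionary = {"ab", "bc", "x", "y"}
print(detect_intersection_ambiguities("xabcy", toy_dictionary))
```

In the toy run, "ab" and "bc" overlap on "b", so the chain "abc" is reported; adding "abc" itself to the dictionary makes the chain a conventional entry and nothing is reported. Each character is visited once and the inner loop is bounded by the longest dictionary word, which is where the linear behaviour comes from.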

Two. Selection possibility computation model

The selection possibility computation model characterizes how likely each segmentation path of an ambiguity is to be the segmentation result. The model of this technical solution takes four aspects into account: the minimum word frequency of the segmentation path, the relative maximum word frequency spacing, the mean square deviation of word frequency fluctuation, and the long-word-first principle.

Let S be an ambiguity and W = {W_i}, i = 1, 2, ..., N, one of its segmentation paths, where W_i is the i-th entry of the path and N is the number of entries in the path, i.e. the path length. Let P(W_i) denote the word frequency of the i-th entry of the path and p(W_i) the corresponding relative word frequency, defined as:

$$p(W_i) = \frac{P(W_i) - \mathrm{MIN}\{P(W)\}}{\mathrm{MAX}\{P(W)\} - \mathrm{MIN}\{P(W)\}} \tag{2.1}$$

In formula (2.1), MAX{P(W)} and MIN{P(W)} denote the maximum and minimum word frequency, respectively, among the entries of this segmentation path of ambiguity S.

1. Minimum word frequency

The lower the word frequency of a word, the less commonly it is used. The minimum word frequency in a segmentation path therefore reflects, from one angle, the selection possibility of the path: the larger the minimum word frequency of a segmentation path W, the larger the selection possibility Φ(W) of that path, that is:

$$\Phi(W) \propto \mathrm{MIN}\{P(W)\} \tag{2.2}$$

For example, the intersection-type ambiguity "because what" has two segmentation paths, "because of (20) why (310)" and "because of (318) what (75)", where the number in parentheses is the word frequency of the corresponding entry. The minimum word frequency of the first path is 20, which is smaller than that of the second path (75), so the second path has the larger selection possibility. Indeed, the second path is clearly the correct segmentation of the ambiguity.

2. Relative maximum word frequency spacing

The maximum word frequency spacing F(W) is the difference between the maximum and minimum word frequency of segmentation path W, that is:

$$F(W) = \mathrm{MAX}\{P(W)\} - \mathrm{MIN}\{P(W)\} \tag{2.3}$$

The maximum word frequency spacing characterizes, from one angle, the amplitude of word frequency fluctuation along path W. In general, the larger the fluctuation amplitude, the smaller the selection possibility of the path. However, because the word frequencies of different entries differ greatly, raw spacings are not directly comparable. To make fluctuation amplitudes comparable, the relative maximum word frequency spacing f(W) is used instead, computed as follows:

$$f(W) = \frac{F(W)}{\mathrm{MAX}\{P(W)\}} = \frac{\mathrm{MAX}\{P(W)\} - \mathrm{MIN}\{P(W)\}}{\mathrm{MAX}\{P(W)\}} \tag{2.4}$$
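As a worked illustration, formula (2.4) can be applied to the two segmentation paths of the "because what" example given under minimum word frequency, whose entry frequencies are (20, 310) and (318, 75). The labels W^{(1)} and W^{(2)} for the two paths are introduced here only for this example.

```latex
f\bigl(W^{(1)}\bigr) = \frac{310 - 20}{310} = \frac{290}{310} \approx 0.935,
\qquad
f\bigl(W^{(2)}\bigr) = \frac{318 - 75}{318} = \frac{243}{318} \approx 0.764
```

The second path has the smaller relative spacing, i.e. the smaller fluctuation amplitude, which agrees with the minimum word frequency criterion in favouring the second path.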

3. Mean square deviation of word frequency fluctuation

The relative maximum word frequency spacing characterizes the amplitude of word frequency fluctuation but cannot describe its trend; the mean square deviation of word frequency fluctuation remedies this. Although both quantities characterize word frequency fluctuation, they do so from different angles: the former measures the amplitude of the fluctuation and the latter its stability. The two differ in essence and cannot replace each other. The mean square deviation of word frequency fluctuation μ(W) is computed from the relative word frequencies of all entries in segmentation path W, as shown in formula (2.5).

$$\mu(W) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( p(W_i) - \overline{p(W)} \right)^{2}} \tag{2.5}$$

4. Long word first

Long word first is the segmentation principle generally adopted by current mechanical Chinese word segmentation methods, and one that most researchers in the field accept. Although it does not yield the correct segmentation in every case, on the whole it effectively improves segmentation accuracy. Concretely, long word first means making the number of entries in a segmentation as small as possible. This computation model also applies the long-word-first principle when resolving pseudo-ambiguities, and uses the number of entries in the segmentation path, i.e. the path length, to represent the "length" of the long words: the fewer entries path W contains, the greater the "length" of its words and the greater its selection possibility.

Taking together the three indicators above (minimum word frequency, relative maximum word frequency spacing, and mean square deviation of word frequency fluctuation) and the long-word-first principle, the selection possibility computation model for ambiguity segmentation paths is established as shown in formula (2.6):

$$\Phi(W) = \left\{ \mu(W) \times N^{\alpha} \times f(W)^{\beta} \times \mathrm{MIN}\{P(W)\}^{-\gamma} \right\}^{-1} \tag{2.6}$$

In formula (2.6), α, β and γ are the influence factors of path length, relative word frequency spacing, and minimum word frequency, respectively, on the path selection possibility, and can be determined by experiment. Substituting formulas (2.4) and (2.5) into formula (2.6) gives the selection possibility computation model for pseudo-ambiguity segmentation paths, formula (2.7).

$$\Phi(W) = \left\{ \sqrt{\sum_{i=1}^{N} \left( p(W_i) - \overline{p(W)} \right)^{2}} \times N^{\alpha - \frac{1}{2}} \times \left[ \frac{\mathrm{MAX}\{P(W)\} - \mathrm{MIN}\{P(W)\}}{\mathrm{MAX}\{P(W)\} \times \mathrm{MIN}\{P(W)\}^{\frac{\gamma}{\beta}}} \right]^{\beta} \right\}^{-1} \tag{2.7}$$

In the formula (2.7):

P(W_i) - the word frequency of the i-th entry of path W;

p(W_i) - the relative word frequency of the i-th entry of path W, see formula (2.1);

$\overline{p(W)}$ - the average relative word frequency of all entries of path W;

N - the length of path W, i.e. the number of entries;
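The model of formulas (2.1), (2.4), (2.5) and (2.6) can be sketched in Python as below. The function name `selection_possibility` is hypothetical, the default values α = β = γ = 1 are placeholders (the text says only that the factors are determined by experiment), frequencies are assumed to be positive, and returning infinity when all entries share one frequency is an assumption, since formula (2.1) is undefined in that case.

```python
import math


def selection_possibility(freqs, alpha=1.0, beta=1.0, gamma=1.0):
    """Phi(W) of formula (2.6) for a path given the word frequencies of its entries."""
    n = len(freqs)                      # N: path length (number of entries)
    hi, lo = max(freqs), min(freqs)     # MAX{P(W)}, MIN{P(W)}
    if hi == lo:
        # Assumption: no fluctuation at all; (2.1) would divide by zero.
        return math.inf
    # (2.1) relative word frequency of each entry
    rel = [(p - lo) / (hi - lo) for p in freqs]
    mean_rel = sum(rel) / n
    # (2.5) mean square deviation of word frequency fluctuation
    mu = math.sqrt(sum((r - mean_rel) ** 2 for r in rel) / n)
    # (2.4) relative maximum word frequency spacing
    f = (hi - lo) / hi
    # (2.6) selection possibility
    return 1.0 / (mu * n ** alpha * f ** beta * lo ** (-gamma))


# Frequencies from the "because what" example in the text.
phi_first = selection_possibility([20, 310])
phi_second = selection_possibility([318, 75])
print(phi_first, phi_second)
```

With the placeholder factors, the second path scores higher than the first, matching the conclusion the text draws from the minimum word frequency alone.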

Three. True/pseudo discrimination of intersection-type ambiguities

True/pseudo discrimination of an intersection-type ambiguity is done by computing the difference between the selection possibilities of the two paths with the highest selection possibility among all segmentation paths of the ambiguity. When this difference is less than a certain threshold, the ambiguity is judged to be a true ambiguity and can be submitted to the true ambiguity resolution module for resolution. Otherwise, it is judged to be a pseudo-ambiguity, and the path with the highest selection possibility is output as the resolution result.
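The decision rule can be sketched as follows. The function name `classify_ambiguity`, the string labels, and the treatment of an ambiguity with a single path as pseudo are assumptions for illustration; the threshold value is not given in the text and must be chosen by the user.

```python
def classify_ambiguity(scored_paths, threshold):
    """Decide true vs. pseudo from (path, selection possibility) pairs.

    Returns ("true", None) when the two best scores differ by less than
    `threshold`, otherwise ("pseudo", best_path).
    """
    ranked = sorted(scored_paths, key=lambda item: item[1], reverse=True)
    if len(ranked) >= 2 and ranked[0][1] - ranked[1][1] < threshold:
        # Too close to call: hand over to a true ambiguity resolution module.
        return ("true", None)
    # Clear winner (or only one path, by assumption): resolve as pseudo-ambiguity.
    return ("pseudo", ranked[0][0])


# Hypothetical scores for two segmentation paths of a toy ambiguity "abc".
scored = [(["a", "bc"], 21.4), (["ab", "c"], 98.1)]
print(classify_ambiguity(scored, threshold=10.0))
```

Because the rule compares an absolute difference of scores, a suitable threshold depends on the scale of the word frequencies and on the chosen influence factors.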

Claims

1. An intersection-type pseudo-ambiguity recognition method based on an improved maximum matching algorithm, characterized in that it comprises the following steps:

(1) Input a Chinese sentence, detect the intersection-type ambiguities in the sentence with the improved maximum matching algorithm, and put them into the intersection-type ambiguity set. If the set is empty, the input sentence contains no intersection-type ambiguity; no processing is performed and the method returns directly. Otherwise, traverse all ambiguities in the set and process each according to step (2);