CN105138514B

CN105138514B - A dictionary-based forward add-one-character maximum matching Chinese word segmentation method

Info

Publication number: CN105138514B
Application number: CN201510522091.XA
Authority: CN
Inventors: 彭艺; 苏黎韡; 邵玉斌; 龙华; 宋浩
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2015-08-24
Filing date: 2015-08-24
Publication date: 2018-11-09
Anticipated expiration: 2035-08-24
Also published as: CN105138514A

Abstract

The present invention relates to a dictionary-based forward add-one-character maximum matching Chinese word segmentation method, and belongs to the technical field of computer Chinese text processing. The method comprises the following steps: first, the text to be segmented is read in and coarsely cut at obvious separators such as punctuation, digits, Western-language text, and charts, dividing it into individual short texts; each coarsely cut short text is then taken as the object of further segmentation, and a search length for that further segmentation is set; each short text is then segmented by matching it against the dictionary in the forward direction, extending the candidate one character at a time, until all short texts have been segmented. The invention avoids the difficulty that conventional forward maximum matching has in balancing segmentation speed against accuracy, and improves on the conventional forward and reverse maximum matching segmentation algorithms in both segmentation speed and segmentation accuracy.

Description

A dictionary-based forward add-one-character maximum matching Chinese word segmentation method

Technical field

The present invention relates to a dictionary-based forward add-one-character maximum matching Chinese word segmentation method, and belongs to the technical field of computer Chinese text processing.

Background art

With the development of science and technology, human society has entered the information age. Letting computers "understand" human natural language and achieving free human-computer interaction has become an appealing vision. In human language, the word is the smallest meaningful linguistic unit that can be used independently. Chinese differs greatly from Western languages such as English and French: in Western languages, words are separated by explicit spaces, so a computer can readily identify words from those spaces, whereas in a Chinese sentence the words run together without separators, which makes the text much harder for a computer to process. Chinese word segmentation is the key to, and the prerequisite for, Chinese information processing: only when segmentation is handled well can a computer understand Chinese, carry out subsequent Chinese information processing, extract useful information from massive amounts of data to serve people, and achieve computer intelligence. With the development of Chinese information processing, Chinese word segmentation technology is widely used and plays a key role mainly in the following three fields. 1) Computing and artificial intelligence: segmentation results are used in natural language understanding and processing research, such as semantic analysis, automatic summarization, knowledge engineering, machine translation, expert systems, and intelligent computers. 2) Information processing: research combining Chinese word segmentation with automatic indexing, information retrieval, and search engines has produced many encouraging results. 3) Chinese linguistics research: segmentation is used to advance the study of the Chinese language, such as the characteristics of Chinese, comparison with other languages, and the standardization of Chinese.

Chinese word segmentation is a basic step of Chinese information processing and also a serious "bottleneck" restricting its development. In recent years, Chinese word segmentation technology has attracted attention and research from many quarters, especially companies and universities, and a variety of segmentation methods have appeared: bidirectional maximum matching, character-by-character traversal, segmentation-mark methods, word-frequency methods, augmented transition networks, bidirectional Markov chains, fuzzy clustering, expert-system methods, minimum-segmentation methods, neural-network methods, and others. Different methods model different aspects of human segmentation behavior and serve Chinese information processing systems with different purposes. In general, all of these methods are extensions and improvements of three basic approaches: dictionary-based segmentation, statistics-based segmentation, and understanding-based segmentation, which represent the three main directions of development of current segmentation methods.

In the forward maximum matching method (Forward Maximum Matching Method, generally abbreviated FMM), "maximum" means that the algorithm always treats the longest possible string starting at a given Chinese character as a word, which embodies the "long word first" principle. When the string cannot be found in the dictionary (that is, the match fails), the last character is removed and the search continues. The idea of the algorithm is as follows. Let D be the dictionary, L the length of the longest word in D, and S the string to be segmented. Each time, a substring M of length L is taken from S and matched against the words in D. If the match succeeds, M is cut out as a word and the pointer is advanced past M to continue matching; otherwise the last character of M is removed and matching is repeated in the same way, until all words have been cut out. The conventional forward and reverse maximum matching algorithms need a maximum matching length to be set in advance, and generally use the length of the longest word in the segmentation dictionary. Because they emphasize "long word first", every match starts from a candidate of that maximum length. If the maximum matching length is too long, many lookups are needed before one word is cut out, which wastes time and lowers the segmentation speed. If it is too short, words longer than the maximum matching length cannot be cut out correctly, and segmentation accuracy cannot be guaranteed.
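The conventional procedure described above can be sketched as follows. This is a minimal Python illustration, not code from the patent: the function name `fmm_segment`, the parameter `max_len`, and the toy dictionary are assumptions chosen for the example, and the dictionary is modeled as a plain set of strings.

```python
def fmm_segment(text, dictionary, max_len):
    """Conventional forward maximum matching (illustrative sketch).

    `dictionary` is assumed to be a set of words and `max_len` the preset
    maximum matching length; both names are hypothetical.
    """
    words = []
    pos = 0
    while pos < len(text):
        # Start from the longest candidate that still fits in the text.
        length = min(max_len, len(text) - pos)
        # Drop the last character until the candidate is a dictionary word
        # or only one character is left.
        while length > 1 and text[pos:pos + length] not in dictionary:
            length -= 1
        words.append(text[pos:pos + length])
        pos += length  # advance past the word that was just cut out
    return words


# Toy dictionary (assumed): research, graduate student, life, origin.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(fmm_segment("研究生命起源", toy_dictionary, max_len=3))
# ['研究生', '命', '起源']
```

In this toy run the "long word first" rule takes 研究生 before 研究 is ever considered, and every failed candidate costs one extra dictionary lookup, which is the overhead the paragraph above describes.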

To remedy these shortcomings of the conventional forward matching algorithm, this document proposes, on the basis of forward matching, a forward add-one-character maximum matching algorithm that better addresses the deficiencies of the conventional algorithm.

Summary of the invention

The present invention provides a dictionary-based forward add-one-character maximum matching Chinese word segmentation method to solve the problems of slow segmentation and inaccurate results caused by conventional forward maximum matching segmentation. The method does not need a preset maximum matching word length. It thereby avoids two failure modes of the conventional maximum matching method: when the preset maximum matching length is too long, many useless matches are performed and segmentation is slow; when it is too short, words cannot be cut correctly.

The technical solution of the present invention is a dictionary-based forward add-one-character maximum matching Chinese word segmentation method with the following specific steps:

Step1. Read in the text to be segmented and coarsely cut it, using punctuation, digits, Western-language text, and charts as separators, into individual short texts.

Step2. Take each coarsely cut short text as the object of further segmentation and set a search length L for that segmentation, where L is less than the length of the longest word in the dictionary.

Step3. Take the first two characters of a coarsely cut short text and look them up in the dictionary.

If the two characters are not in the dictionary, the first character is a single-character word and is cut out; the read pointer is then advanced and the next two characters are taken for a new round of lookup and matching.

If the two characters are in the dictionary, the length pointer of the search text is extended by one character, giving three characters, and matching continues in the dictionary.

If the three characters are not in the dictionary, the first two characters form a word, which is cut out as one segmentation result; the search pointer is then advanced and the next two characters are taken for a new round of lookup and matching.

If the three characters are in the dictionary, one more character is added to form four characters, which are looked up in the dictionary, and so on; lookup and matching continue in this way to perform the segmentation.

Step4. When the search length reaches L, start again from the character following those L characters and carry out lookup, matching, and segmentation by the method of Step3, and so on, until all short texts have been segmented.
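Steps Step3 and Step4 can be sketched for a single short text as follows. This is a minimal Python illustration under stated assumptions, not the patent's implementation: the names `add_one_char_segment` and `search_len` are hypothetical, the dictionary is modeled as a set of strings, `search_len` is assumed to be at least 2, and a lone trailing character is treated as a single-character word because the steps do not spell out that case.

```python
def add_one_char_segment(short_text, dictionary, search_len):
    """Forward add-one-character matching for one short text (sketch).

    `search_len` plays the role of the search length L; it is assumed
    to be at least 2. All names here are illustrative.
    """
    words = []
    pos = 0
    n = len(short_text)
    while pos < n:
        # Fewer than two characters left, or the two-character candidate
        # is not a dictionary word: cut out a single character.
        if n - pos < 2 or short_text[pos:pos + 2] not in dictionary:
            words.append(short_text[pos])
            pos += 1
            continue
        # The two-character candidate is a word: keep adding one character
        # while the longer candidate is still in the dictionary and the
        # search length L has not been reached.
        length = 2
        while (length < search_len
               and pos + length < n
               and short_text[pos:pos + length + 1] in dictionary):
            length += 1
        words.append(short_text[pos:pos + length])
        pos += length  # restart from the next character
    return words


# Toy dictionary (assumed): today, weather, especially.
toy_dictionary = {"今天", "天气", "特别"}
print(add_one_char_segment("今天天气特别的好", toy_dictionary, search_len=7))
# ['今天', '天气', '特别', '的', '好']
```

One property of the steps as written, visible in the sketch, is that a candidate only grows while every intermediate string is itself found in the dictionary, so a longer word is reached only through its shorter prefixes.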

The beneficial effects of the invention are as follows：

1. The method looks up and matches the input text against a dictionary to determine the segmentation result. No maximum matching word length is preset; instead, a search length L slightly less than the longest entry in the dictionary is set according to that longest entry. This avoids both the many useless matches and slow segmentation that occur when the preset maximum matching length of the conventional method is too long, and the incorrect cuts that occur when it is too short.

2. The method improves both segmentation response time and segmentation accuracy. On the test texts, the forward add-one-character matching method of the present invention was compared with the conventional dictionary-based forward maximum matching and reverse maximum matching methods, and it showed an advantage in both accuracy and segmentation time.

Description of the drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is a flow chart of the forward add-one-character matching segmentation method of Embodiment 1 of the present invention;

Fig. 3 compares the accuracy of the dictionary-based forward add-one-character matching segmentation method of the present invention with that of conventional dictionary-based segmentation methods.

Detailed description of the embodiments

Embodiment 1: As shown in Figs. 1-3, a dictionary-based forward add-one-character maximum matching Chinese word segmentation method comprises the following steps:

Step 1: Coarse cutting. Punctuation marks, spaces, dates, digits, English letters, and similar marks are removed from the text to be segmented. Let the text to be processed be A; it is divided into a set of N short text sequences S_i (0 < i ≤ N), that is, A = {S_1, S_2, S_3, ..., S_N}.

Step 2: As shown in Fig. 2, the coarsely cut short texts are read in one by one in order, each denoted S_i. Let each sentence sequence S_i consist of m characters W_ij (0 < j ≤ m), that is, S_i = <W_i1 W_i2 W_i3 ... W_im>.

Step 3: The coarsely cut text S_i is segmented, as shown in Fig. 2.

1) Set a segmentation search length L that is slightly less than the length of the longest word in the dictionary.

2) In the short text S_i, take the first two adjacent characters W_ij W_i(j+1) in order (initially W_i1 W_i2) and look them up in the dictionary. If the two characters W_ij W_i(j+1) are not a word in the dictionary, go to 3); otherwise, go to 4).

3) If the two characters W_ij W_i(j+1) are not in the dictionary, the first of the two characters is a single-character word, and W_ij is cut out of the sentence S_i. Check whether the end of S_i has been reached. If so, segmentation of S_i is finished; otherwise set j = j+1 and go to 2).

4) If the two characters W_ij W_i(j+1) are in the dictionary, extend the length pointer of the search text by one character, that is, add one character after W_ij W_i(j+1) to obtain the three characters S_k = W_ij W_i(j+1) W_ik (0 < k ≤ L), and continue matching in the dictionary to determine whether the newly read string is in the dictionary. If it is, go to 5); otherwise, go to 6).

5) If the three characters S_k = W_ij W_i(j+1) W_ik are in the dictionary, add one more character after S_k to form the four characters S_k+1 = W_ij W_i(j+1) ... W_ik W_i(k+1), and look up whether S_k+1 is in the dictionary. If it is, continue adding one character at a time and go to 7); if it is not, cut S_k out and put it into the segmentation result.

6) If the three characters S_k = W_ij W_i(j+1) W_ik are not in the dictionary, the first two characters W_ij W_i(j+1) form a word, and W_ij W_i(j+1) is cut out of S_i. The search pointer is then advanced so that j = j+2, and the next two characters are taken for a new round of lookup and matching. If j ≤ m, the current short text has not yet been completely segmented, so go to 2); if the pointer j = m, segmentation of the short text S_i is finished.

7) Continuing in the same way, each time the segmentation pointer is moved, check whether the number of characters read so far satisfies k ≤ L. If so, continue adding one character after S_k+1 = W_ij W_i(j+1) ... W_ik W_i(k+1) and checking the result; otherwise, start from W_i(k+1) and take two characters for the next round of lookup and matching.

Step 4: Check whether the number of texts read satisfies i ≤ N. If so, the current text has not yet been completely segmented: increase the pointer by one (i = i+1), read in the next sentence, and repeat the lookup, matching, and segmentation procedure above until the whole input text has been segmented. Otherwise, segmentation of the whole text is finished.
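Step 1 and the outer loop of Step 4 can be sketched as follows. This is an illustrative Python sketch, not the patent's code: the names `coarse_cut` and `segment_text` are hypothetical, and treating every character outside the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) as a separator is a simplifying assumption standing in for "punctuation marks, spaces, dates, digits, English letters, and similar marks".

```python
import re

# Assumption: a short text is a maximal run of basic CJK ideographs;
# everything else (punctuation, spaces, digits, Latin letters) separates.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def coarse_cut(text):
    """Split the input text A into short texts S_1 ... S_N (sketch)."""
    return CJK_RUN.findall(text)


def segment_text(text, segment_short_text):
    """Apply a per-short-text segmenter to each S_i in order.

    `segment_short_text` is any function that maps one short text to a
    list of words; it is a parameter so this sketch stays self-contained.
    """
    result = []
    for short_text in coarse_cut(text):
        result.extend(segment_short_text(short_text))
    return result


print(coarse_cut("今天天气特别的好，ABC 2015年8月!"))
# ['今天天气特别的好', '年', '月']
```

Passing the segmenter in as a function keeps the coarse cut independent of the matching strategy, so the same outer loop can drive either a conventional or an add-one-character matcher.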

Embodiment 2: As shown in Figs. 1-3, a dictionary-based forward add-one-character maximum matching Chinese word segmentation method comprises the following steps:

Set a segmentation search length L slightly less than the length of the longest word in the dictionary. Let the string to be segmented be S = s₁s₂s₃s₄...s_i. Starting from the head of the sentence, take the first two characters s₁s₂ and determine whether s₁s₂ is a word in the dictionary. If it is not, s₁ is a single-character word and is cut out; the pointer is then advanced by one character to the third character, and s₂s₃ is looked up in the dictionary in a new round of matching. If s₁s₂ is a word in the dictionary, one more character is added and s₁s₂s₃ is checked. If s₁s₂s₃ is not a word in the dictionary, s₁s₂ is a word and is cut out. If s₁s₂s₃ is a word in the dictionary, one more character is added and s₁s₂s₃s₄ is looked up; if it is not a word, s₁s₂s₃ is cut out as a word, and if it is, another character is added and matching continues. This proceeds in the same way until the whole sentence S = s₁s₂s₃s₄...s_i has been segmented.

Embodiment 3: As shown in Figs. 1-3, a dictionary-based forward add-one-character maximum matching Chinese word segmentation method comprises the following steps:

Step1. Read in the text to be segmented and coarsely cut it, using punctuation, digits, Western-language text, and charts as separators, into individual short texts. For example, one resulting short text is 今天天气特别的好 ("the weather today is especially good").

Step2. Take the coarsely cut short text as the object of further segmentation and set the search length for further segmentation to L = 7, where L is less than the length of the longest word in the dictionary, which is 12.

Step3. Take the first two characters of the coarsely cut short text, 今天 ("today"), and look them up in the dictionary. 今天 is found, so the length pointer of the search text is extended by one character to the three characters 今天天, which are looked up in turn. 今天天 is not found, so 今天 is a word and is cut out as one segmentation result. The search pointer is then advanced and the next two characters, 天气 ("weather"), are taken for a new round of lookup and matching. 天气 is found, so the candidate is extended by one character to the three characters 天气特, which are not found; 天气 is therefore a word and is cut out as one segmentation result. Lookup and matching continue in the same way, and the segmentation result is /今天/天气/特别/的/好/. The detailed segmentation process is shown in Table 1.

Table 1. Segmentation process of forward add-one-character maximum matching for 今天天气特别的好

Matching field	Matching process	Matching result
今天	Exists in dictionary	今天
天气	Exists in dictionary	天气
特别	Exists in dictionary	特别
的好	Not in dictionary
的	Single-character word	的
好	Single-character word	好

To verify the beneficial effect of the method, it was compared with the conventional forward maximum matching and reverse maximum matching segmentation methods (with an initial maximum matching length of 4 characters). The segmentation processes of the conventional forward and reverse maximum matching methods are shown in Table 2 and Table 3.

1) Forward maximum matching segmentation method:

Table 2. Segmentation process of forward maximum matching for 今天天气特别的好

Matching field	Matching process	Matching result
今天天气	Not in dictionary
今天天	Not in dictionary
今天	Exists in dictionary	今天
天气特别	Not in dictionary
天气特	Not in dictionary
天气	Exists in dictionary	天气
特别的好	Not in dictionary
特别的	Not in dictionary
特别	Exists in dictionary	特别
的好	Not in dictionary
的	Single-character word	的
好	Single-character word	好

The result of forward maximum matching is: /今天/天气/特别/的/好/

2) Reverse maximum matching segmentation method: substrings are taken from the string to be segmented from right to left and matched.

Table 3. Segmentation process of reverse maximum matching for 今天天气特别的好

Matching field	Matching process	Matching result
特别的好	Not in dictionary
别的好	Not in dictionary
的好	Not in dictionary
好	Single-character word	好
气特别的	Not in dictionary
特别的	Not in dictionary
别的	Not in dictionary
的	Single-character word	的
天气特别	Not in dictionary
气特别	Not in dictionary
今天天气	Not in dictionary
天天气	Not in dictionary
天气	Exists in dictionary	天气
今天	Exists in dictionary	今天

The result of reverse maximum matching is: /今天/天气/特别/的/好/
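The right-to-left procedure can be sketched in the same style. This is an illustrative Python sketch rather than the implementation used in the experiments: the names `rmm_segment` and `max_len` are hypothetical, the dictionary is modeled as a set of strings, and the example assumes the sentence reads 今天天气特别的好 in the original Chinese.

```python
def rmm_segment(text, dictionary, max_len):
    """Conventional reverse maximum matching (illustrative sketch)."""
    words = []
    end = len(text)
    while end > 0:
        # Take the longest candidate that ends at `end`.
        length = min(max_len, end)
        # Drop the first character until the candidate is a dictionary
        # word or only one character is left.
        while length > 1 and text[end - length:end] not in dictionary:
            length -= 1
        words.append(text[end - length:end])
        end -= length
    words.reverse()  # words were collected from right to left
    return words


# Toy dictionary (assumed): today, weather, especially.
toy_dictionary = {"今天", "天气", "特别"}
print(rmm_segment("今天天气特别的好", toy_dictionary, max_len=4))
# ['今天', '天气', '特别', '的', '好']
```

With a maximum matching length of 4, the sketch probes the four-, three-, and two-character candidates ending at each position before it accepts a word, which mirrors the repeated unsuccessful rows in Table 3.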

As the segmentation processes of the three methods show, the final results are identical and correct. However, the conventional dictionary-based forward and reverse maximum matching methods both repeatedly look up strings that are not in the dictionary, which wastes segmentation time and adds to the workload of dictionary matching and of ambiguity resolution after segmentation. With the forward add-one-character maximum matching method proposed by the present invention, almost every two-character word is segmented quickly and accurately in a single step, so overall segmentation efficiency is greatly improved. The test results confirm this, as shown in Table 4 below.
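The difference in wasted lookups can be made concrete by counting dictionary probes on the example sentence. The Python sketch below is illustrative only: the `CountingDictionary` wrapper, the function names, and the three-word toy dictionary are assumptions, the sentence is assumed to read 今天天气特别的好 in the original Chinese, and a count on one short sentence says nothing about the measured speeds in Table 4.

```python
class CountingDictionary:
    """Set-like wrapper that counts membership tests (hypothetical helper)."""

    def __init__(self, words):
        self.words = set(words)
        self.lookups = 0

    def __contains__(self, candidate):
        self.lookups += 1
        return candidate in self.words


def fmm_segment(text, dictionary, max_len):
    # Conventional forward maximum matching: shrink from max_len.
    words, pos = [], 0
    while pos < len(text):
        length = min(max_len, len(text) - pos)
        while length > 1 and text[pos:pos + length] not in dictionary:
            length -= 1
        words.append(text[pos:pos + length])
        pos += length
    return words


def add_one_char_segment(text, dictionary, search_len):
    # Forward add-one-character matching: grow from two characters.
    words, pos, n = [], 0, len(text)
    while pos < n:
        if n - pos < 2 or text[pos:pos + 2] not in dictionary:
            words.append(text[pos])
            pos += 1
            continue
        length = 2
        while (length < search_len and pos + length < n
               and text[pos:pos + length + 1] in dictionary):
            length += 1
        words.append(text[pos:pos + length])
        pos += length
    return words


def count_lookups(segment, text, words, limit):
    """Run `segment` and return (result, number of dictionary lookups)."""
    counting = CountingDictionary(words)
    return segment(text, counting, limit), counting.lookups


toy_words = {"今天", "天气", "特别"}
sentence = "今天天气特别的好"
print(count_lookups(fmm_segment, sentence, toy_words, 4)[1])           # 10
print(count_lookups(add_one_char_segment, sentence, toy_words, 7)[1])  # 7
```

Both sketches return the same five words here; the add-one-character version simply reaches them with fewer probes because it never starts from a four-character candidate.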

Table 4. Comparison of the average segmentation speed of the three segmentation methods

Segmentation method	Average segmentation speed (words/s)
Conventional forward maximum matching method	52000
Conventional reverse maximum matching method	103000
Forward add-one-character matching method	113000

The three methods were run in the experimental environment of the present invention, using a complete dictionary of 270,000 entries as the segmentation dictionary. The simulation experiments were carried out on hardware with at least 1G of memory, under Windows 7, in the Java development language, using the MyEclipse 8.5 development tool. Articles of about 0.02M each were selected from four fields (economy, science and technology, social news, and military affairs) and segmented with the three algorithms. The results are shown in Fig. 3, where the vertical axis is segmentation accuracy and the horizontal axis is the field of the segmented text. Among the three methods, the forward add-one-character matching method proposed here achieves higher accuracy than the conventional forward and reverse maximum matching methods.

The experimental results of the above embodiment in Table 4 and Fig. 3 show that the dictionary-based forward add-one-character maximum matching segmentation method of the present invention is a clear improvement over conventional dictionary-based segmentation methods in both segmentation speed and segmentation accuracy.

The specific embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the present invention is not limited to these embodiments. Within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the concept of the present invention.

Claims

1. A dictionary-based forward add-one-character maximum matching Chinese word segmentation method, characterized in that the method comprises the following specific steps:

Step1. Read in the text to be segmented and coarsely cut it, using punctuation, digits, Western-language text, and charts as separators, into individual short texts;

Step2. Take each coarsely cut short text as the object of further segmentation and set a search length L for that segmentation, where L is less than the length of the longest word in the dictionary;

Step3. Take the first two characters of a coarsely cut short text and look them up in the dictionary;

If the two characters are not in the dictionary, the first character is a single-character word and is cut out; the read pointer is then advanced and the next two characters are taken for a new round of lookup and matching;

If the two characters are in the dictionary, the read pointer is extended by one character, giving three characters, and matching continues in the dictionary;

If the three characters are not in the dictionary, the first two characters form a word, which is cut out as one segmentation result; the read pointer is then advanced and the next two characters are taken for a new round of lookup and matching;

If the three characters are in the dictionary, one more character is added to form four characters, which are looked up in the dictionary, and so on; lookup and matching continue in this way to perform the segmentation;

Step4. When the search length reaches L, start again from the character following those L characters and carry out lookup, matching, and segmentation by the method of Step3, and so on, until all short texts have been segmented.