CN104252542A - Dictionary-based dynamic programming method for Chinese word segmentation

Info

Publication number: CN104252542A
Application number: CN201410507974.9A
Authority: CN
Inventors: 孙珂; 田冰川; 张道强
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2014-09-29
Filing date: 2014-09-29
Publication date: 2014-12-31

Abstract

Disclosed is a dictionary-based dynamic programming method for Chinese word segmentation. The method comprises the steps of: (1) loading a common Chinese lexicon; (2) loading an uncommon Chinese lexicon; (3) reading in Chinese text and obtaining the current text content; (4) splitting the Chinese text into short clauses; (5) performing automatic Chinese word segmentation by dynamic programming; (6) scanning backward from the last character to obtain the segmentation result, performing part-of-speech analysis and tagging, and outputting the result; (7) storing unregistered words in the uncommon Chinese lexicon; (8) determining whether the text has ended and, if not, returning to step (4) to continue the loop. The method is accurate and efficient: segmentation accuracy approaches that of manual segmentation, and segmentation speed exceeds 2 MB per second.

Description

A dictionary-based dynamic programming method for Chinese word segmentation

Technical field

The present invention relates to the field of automatic Chinese information processing, and in particular to a dictionary-based dynamic programming method for Chinese word segmentation.

Background technology

With the arrival of the information age, Chinese-language information resources keep growing, and finding the data one needs in the vast world of Chinese information has become an important problem. Because data volumes have grown sharply, manual processing is no longer realistic. Automatic processing methods help people retrieve and manage information, and address the present situation in which society is rich in information but poor in knowledge. Many automated language processing tools have appeared, such as automatic summarization and automatic document retrieval, and a key element of these technologies is the subject term (keyword). Extracting subject terms simplifies this kind of work, and finding subject terms requires word segmentation.

Chinese word segmentation is the most critical preprocessing step in Chinese text processing and is the basis of text mining. It also underlies other Chinese information processing tasks, such as Chinese search engines, machine translation, speech synthesis, automatic classification, automatic summarization, and automatic proofreading, all of which rely on segmentation. Research on Chinese word segmentation is therefore vital to the development of automatic Chinese information processing in China.

Summary of the invention

The technical problem to be solved by the present invention is to provide a dictionary-based dynamic programming method for Chinese word segmentation that is both accurate and fast.

To solve the above technical problem, the invention provides a dictionary-based dynamic programming method for Chinese word segmentation, comprising the following steps: (1) load the common Chinese lexicon; (2) load the uncommon Chinese lexicon; (3) read in the Chinese text and obtain the current text content; (4) split the Chinese text into short clauses; (5) perform automatic Chinese word segmentation by dynamic programming: find the possible positions of the first word, and let F[i] denote the current minimum number of words into which the text up to the i-th character can be split; for each character, start from the current character and look for a transition by finding the word that precedes it; when F[i] > F[j] + 1, take the transition, so that the longest current word is stored as the result; (6) scan backward from the last character to obtain the segmentation result, perform part-of-speech analysis, add part-of-speech tags, and output the result; (7) store unregistered words in the uncommon Chinese lexicon; (8) determine whether the text has ended; if not, return to step (4) and continue the loop.

A double-array trie is built for each of the common Chinese lexicon and the uncommon lexicon. The data structure of a double-array trie consists of two integer arrays: base[], which holds the address of a word, and check[], which holds the hash value of a word. For an array index i, if base[i] and check[i] are both 0, the position is empty; if base[i] is negative, the state is a word; check[i] gives the previous state of this state.

To build the double-array trie, all words are placed into the double array through four traversals, and the word list is then traversed once more to amend the base values. A negative base value indicates that the position is a word: if state i corresponds to a word and base[i] = 0, set base[i] = (-1) * i; if base[i] is not 0, set base[i] = (-1) * base[i].

The beneficial effects of the invention are high accuracy and high efficiency: segmentation precision can reach a level close to that of humans, and segmentation speed can exceed 2 MB per second.

Brief description of the drawings

Fig. 1 is the workflow diagram of the dictionary-based dynamic programming method for Chinese word segmentation of the present invention.

Fig. 2 illustrates the double-array trie data structure of the present invention.

Fig. 3 illustrates the double-array trie Chinese lexicon of the present invention.

Detailed description

The invention provides a dictionary-based dynamic programming method for Chinese word segmentation, comprising the following steps: (1) load the common Chinese lexicon; (2) load the uncommon Chinese lexicon; (3) read in the Chinese text; (4) split the Chinese text into short clauses; (5) perform automatic Chinese word segmentation by dynamic programming; (6) take the result, perform part-of-speech analysis, and add part-of-speech tags; (7) store unregistered words in the uncommon Chinese lexicon; (8) return to step (4) and continue the loop.

Chinese word segmentation: cutting a sequence of Chinese characters into individual words; segmentation is the process of recombining a continuous character sequence into a word sequence according to a given specification. Chinese segmentation lexicon: a dictionary made up of common Chinese words; it should occupy little memory and support fast lookup. Unregistered word: a word that is not included in the segmentation word list but must still be cut out, including all kinds of proper nouns (person names, place names, company names, and so on), abbreviations, and newly coined words. Overlapping ambiguity: ambiguity that arises because candidate words overlap; for example, the phrase meaning "improving the quality of products" contains the overlapping candidates "raise", "high yield", "product", "quality" (as a trait), and "quality" (as a grade). Combination ambiguity: the same string can either be kept together as one word or be split; for example, the string meaning "individual" is a single word in "personal grudge" but must be split in "this person", and the string meaning "handle" is a single word in "the handle of this door" but must be split in "raise your hand".

As shown in Fig. 1, the system first reads the stored common Chinese lexicon and the uncommon Chinese lexicon obtained through learning. To save memory and keep automatic segmentation efficient, a double-array trie is built for each of the two lexicons. The system then reads the text to be segmented and splits it into clauses at punctuation marks. After clause splitting, the segmentation stage begins: the possible positions of the first word are found, and F[i] denotes the minimum number of words into which the sentence up to the i-th character can be split. For each character, starting from the current character, a transition is sought by finding the word that precedes it. When F[i] > F[j] + 1, the transition is taken, so that the longest current word is stored as the result. The sentence is then scanned backward from the last character to obtain the segmentation result, which is part-of-speech tagged and output. Finally, the system checks whether the text has ended; if not, it continues splitting the text at punctuation and repeats the segmentation process.
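The clause-splitting step of this workflow can be sketched as follows. This is a minimal illustration, not the patented implementation: the source only says that text is split at punctuation marks, so the delimiter set `DELIMITERS` and the function name `split_clauses` are assumptions made for the example.

```python
import re

# Assumed delimiter set (common full-width and ASCII punctuation);
# the source does not list which punctuation marks are used.
DELIMITERS = "。!?;,、:.!?;,:"
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]+")


def split_clauses(text):
    """Split text into short clauses at punctuation, dropping empty pieces."""
    return [piece.strip() for piece in _SPLIT_RE.split(text) if piece.strip()]


if __name__ == "__main__":
    for clause in split_clauses("今天天气很好,我们去公园。你去吗?"):
        print(clause)
```

Each returned clause would then be passed to the dynamic programming segmenter in turn, which keeps the cost of segmentation bounded by the clause length.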

Let F[i] be the minimum number of words into which the text up to the i-th character can be split. Then F[i] = min(F[i], F[j] + 1), where j < i ranges over the positions allowed by the maximum length of a word ending at character i, and a successful transition sets F[i] = F[j] + 1. The candidates j are enumerated from large to small; if the update F[i] = F[j] + 1 is applied, the current j is the smallest one that can update F[i], so the last word can be regarded as being as long as possible, which matches the intuition behind the maximum matching method for Chinese word segmentation. Every state transition must satisfy F[i] > F[j] + 1, so the algorithm is guaranteed to split the whole sentence into the minimum number of words. The dictionary-based dynamic programming method thus combines these two ideas, minimum word count and maximum matching, which effectively improves the efficiency of Chinese word segmentation.
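The recurrence above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the lexicon is a plain `set` rather than a double-array trie, single characters are always accepted so that every sentence can be split, `max_word_len` is a hypothetical bound, and ties are resolved with `<=` so that the smallest j (the longest last word) wins. The sample sentence and lexicon are also assumptions chosen to mimic the overlapping-ambiguity example.

```python
def segment(sentence, lexicon, max_word_len=8):
    """Split `sentence` into the fewest dictionary words (sketch)."""
    n = len(sentence)
    INF = float("inf")
    # F[i]: minimum number of words covering the first i characters.
    F = [0] + [INF] * n
    # prev[i]: start index of the last word in the best split of the first i characters.
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        # j runs from large to small, i.e. from the shortest to the longest last word.
        for j in range(i - 1, max(0, i - max_word_len) - 1, -1):
            word = sentence[j:i]
            # Assumption: a single character is always a valid fallback word.
            if len(word) > 1 and word not in lexicon:
                continue
            # Assumption: ties overwrite, so the longest last word is kept.
            if F[j] + 1 <= F[i]:
                F[i] = F[j] + 1
                prev[i] = j
    # Scan backward from the last character to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(sentence[prev[i]:i])
        i = prev[i]
    words.reverse()
    return words


if __name__ == "__main__":
    lexicon = {"提高", "高产", "产品", "品质", "质量"}
    print(segment("提高产品质量", lexicon))  # ['提高', '产品', '质量']
```

Although "高产" and "品质" are both in the sample lexicon, choosing either would force extra single-character words, so the minimum word count criterion resolves the overlap in favour of the three-word split.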

Fig. 2 shows the double-array trie. A double-array trie is a special kind of trie: compared with an ordinary trie, it uses space more efficiently and consumes less memory, while its query efficiency is the same as that of an ordinary trie. Its data structure consists of two integer arrays: base[], which holds the address of a word, and check[], which holds the hash value of a word. For an array index i, if base[i] and check[i] are both 0, the position is empty. If base[i] is negative, the state is a word; check[i] gives the previous state of this state.

To build the segmentation dictionary as a double-array trie, suppose the word list contains only the words 啊 (an interjection), 阿根廷 (Argentina), 阿胶 (donkey-hide gelatin), 阿拉伯 (Arabia), 阿拉伯人 (Arab), and 埃及 (Egypt). First, the 10 Chinese characters that occur in this list are encoded: 啊-1, 阿-2, 埃-3, 根-4, 胶-5, 拉-6, 及-7, 廷-8, 伯-9, 人-10. For each character, a base value must be determined such that all words beginning with that character can be placed into the double array. For example, to determine the base value of 阿, suppose that the codes of the second characters of the words beginning with 阿 are, in order, a1, a2, a3, ..., an; a value i must then be found such that base[i+a1], check[i+a1], base[i+a2], check[i+a2], ..., base[i+an], check[i+an] are all 0. Once such an i is found, the base value of 阿 is set to i.

As shown in Fig. 3, a double-array trie is built for the above example: all words are placed into the double array through four traversals, and the word list is then traversed once more to amend the base values. A negative base value indicates that the position is a word: if state i corresponds to a word and base[i] = 0, set base[i] = (-1) * i; if base[i] is not 0, set base[i] = (-1) * base[i].
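The construction and the sign convention for base[] can be sketched as follows. This is a simplified illustration, not the patented implementation: it builds an ordinary nested-dict trie first and fills the double array level by level, places the root at index 1 (`ROOT`) so that 0 can mean "empty", assigns character codes in order of first appearance, and uses check[t] to store the parent state. The names `build_double_array`, `walk`, and `contains`, and the sample word list, are assumptions made for the example.

```python
from collections import deque

ROOT = 1  # assumed root index; index 0 is left unused so that 0 means "empty"


def walk(base, check, codes, text):
    """Follow `text` from the root; return the final state or None."""
    state = ROOT
    for ch in text:
        code = codes.get(ch)
        if code is None:
            return None
        t = abs(base[state]) + code
        if t >= len(check) or check[t] != state:
            return None
        state = t
    return state


def build_double_array(words):
    # Encode every character with a positive integer code.
    codes = {}
    for w in words:
        for ch in w:
            codes.setdefault(ch, len(codes) + 1)
    # Intermediate ordinary trie (nested dicts keyed by character).
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
    base, check = [0, 0], [0, ROOT]
    queue = deque([(ROOT, trie)])
    while queue:
        state, node = queue.popleft()
        children = [(codes[ch], child) for ch, child in node.items()]
        if not children:
            continue
        # Find the smallest b such that every slot b + code is empty.
        b = 1
        while True:
            top = b + max(code for code, _ in children)
            if top >= len(check):
                grow = top + 1 - len(check)
                base.extend([0] * grow)
                check.extend([0] * grow)
            if all(base[b + c] == 0 and check[b + c] == 0 for c, _ in children):
                break
            b += 1
        base[state] = b
        for c, child in children:
            check[b + c] = state  # check records the previous (parent) state
            queue.append((b + c, child))
    # Final pass over the word list: a negative base marks a word.
    for w in words:
        s = walk(base, check, codes, w)
        if base[s] == 0:
            base[s] = -s
        elif base[s] > 0:
            base[s] = -base[s]
    return base, check, codes


def contains(base, check, codes, word):
    s = walk(base, check, codes, word)
    return s is not None and base[s] < 0


if __name__ == "__main__":
    words = ["啊", "阿根廷", "阿胶", "阿拉伯", "阿拉伯人", "埃及"]
    base, check, codes = build_double_array(words)
    print(contains(base, check, codes, "阿拉伯"))  # True
    print(contains(base, check, codes, "阿拉"))    # False
```

Taking the absolute value of base during lookup is what lets a word state such as 阿拉伯 also act as a prefix of the longer word 阿拉伯人.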

In addition, some special word lists need to be maintained, such as a list of function words that stand alone as words, a place-name list, a person-name list, and other lists of unregistered words, to further improve the accuracy of the software.

The time complexity of this dictionary-based dynamic programming segmentation method remains at the linear level: for each character, only its predecessors need to be found, so the time cost is very low.

On the People's Daily corpus of January 1998, compared against the manually produced correct segmentation, accuracy reaches 98.8904% and segmentation speed is 2504 KB/s.

Although the invention has been illustrated and described with respect to preferred embodiments, those skilled in the art will understand that various changes and modifications may be made to the invention without departing from the scope defined by its claims.

Claims

1. A dictionary-based dynamic programming method for Chinese word segmentation, characterized by comprising the following steps: (1) loading a common Chinese lexicon; (2) loading an uncommon Chinese lexicon; (3) reading in Chinese text and obtaining the current text content; (4) splitting the Chinese text into short clauses; (5) performing automatic Chinese word segmentation by dynamic programming: finding the possible positions of the first word, and letting F[i] denote the current minimum number of words into which the text up to the i-th character can be split; for each character, starting from the current character and looking for a transition by finding the word that precedes it; when F[i] > F[j] + 1, taking the transition, so that the longest current word is stored as the result; (6) scanning backward from the last character to obtain the segmentation result, performing part-of-speech analysis, adding part-of-speech tags, and outputting the result; (7) storing unregistered words in the uncommon Chinese lexicon; (8) determining whether the text has ended; if not, returning to step (4) and continuing the loop.

2. The Chinese word segmentation method as claimed in claim 1, characterized in that a double-array trie is built for each of the common Chinese lexicon and the uncommon lexicon; the data structure of the double-array trie consists of two integer arrays: base[], which holds the address of a word, and check[], which holds the hash value of a word; for an array index i, if base[i] and check[i] are both 0, the position is empty; if base[i] is negative, the state is a word; check[i] gives the previous state of this state.

3. The Chinese word segmentation method as claimed in claim 2, characterized in that, to build the double-array trie, all words are placed into the double array through four traversals, and the word list is then traversed once more to amend the base values; a negative base value indicates that the position is a word: if state i corresponds to a word and base[i] = 0, base[i] is set to (-1) * i; if base[i] is not 0, base[i] is set to (-1) * base[i].