CN104252542A - Dynamic-planning Chinese words segmentation method based on lexicons - Google Patents
- Publication number
- CN104252542A (application CN201410507974.9A)
- Authority
- CN
- China
- Prior art keywords
- word
- chinese
- base
- text
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Document Processing Apparatus (AREA)
Abstract
Disclosed is a lexicon-based dynamic-programming method for Chinese word segmentation. The method comprises the steps of: (1) loading the common Chinese lexicon; (2) loading the uncommon Chinese lexicon; (3) reading in the Chinese text and obtaining the current text content; (4) splitting the Chinese text into short clauses; (5) performing automatic Chinese word segmentation by dynamic programming; (6) scanning back from the last character to obtain the segmentation result, then analyzing and tagging parts of speech and outputting the result; (7) storing unregistered words in the uncommon Chinese lexicon; and (8) judging whether the text has ended and, if not, returning to step (4) to repeat the process. The method is accurate and efficient: segmentation accuracy approaches that of manual segmentation, and segmentation speed exceeds 2 MB per second.
Description
Technical field
The present invention relates to the field of automatic Chinese information processing, and in particular to a lexicon-based dynamic-programming method for Chinese word segmentation.
Background technology
With the arrival of the information age, Chinese-language information resources keep growing, and finding the data one needs in this vast body of Chinese information has become an important problem. Because data volumes have increased sharply, manual processing is no longer practical. Automatic processing methods help people retrieve and manage information, addressing the present situation in which information is abundant but knowledge is scarce. Many automated language-processing tools have appeared, such as automatic summarization and automatic document retrieval, and topic words (keywords) are key to these technologies. Extracting topic words simplifies this kind of work, and finding topic words requires word segmentation.
Chinese word segmentation is the most critical preprocessing step in Chinese text processing and the foundation of text mining. It also underlies other Chinese information processing tasks, such as Chinese search engines, machine translation, speech synthesis, automatic classification, automatic summarization, and automatic proofreading, all of which rely on segmentation. Research on Chinese word segmentation therefore plays a vital role in the development of automatic Chinese information processing in China.
Summary of the invention
The technical problem to be solved by the present invention is to provide a lexicon-based dynamic-programming Chinese word segmentation method that is both accurate and fast.
To solve the above technical problem, the invention provides a lexicon-based dynamic-programming Chinese word segmentation method comprising the following steps: (1) load the common Chinese lexicon; (2) load the uncommon Chinese lexicon; (3) read in the Chinese text and obtain the current text content; (4) split the Chinese text into short clauses; (5) perform automatic Chinese word segmentation by dynamic programming: find the possible positions of the first word, and let F[i] denote the minimum number of words into which the text up to the i-th character can currently be segmented; for each character, start from the current character and look for a transition by finding its predecessor position j; when F[j] + 1 ≤ F[i], take the transition, so that the longest current word is stored as the result; (6) scan back from the last character to obtain the segmentation result, perform part-of-speech analysis, add part-of-speech tags, and output the result; (7) store unregistered words in the uncommon Chinese lexicon; (8) judge whether the text has ended and, if not, return to step (4) and repeat.
A double-array trie is built for both the common Chinese lexicon and the uncommon lexicon. The data structure of the double-array trie consists of two integer arrays: base[], which holds the address of a character, and check[], which holds the hash value of a character. For array index i, if base[i] and check[i] are both 0, the position is empty; if base[i] is negative, the state is a word; check[i] gives the previous state of this state.
To build the double-array trie, all words are placed into the double array over four traversals, and the vocabulary is then traversed once more to amend the base values. A negative base value indicates that a position is a word: if state i corresponds to a word and base[i] = 0, set base[i] = (-1) * i; if base[i] is not 0, set base[i] = (-1) * base[i].
The beneficial effects of the invention are high accuracy and high efficiency: segmentation accuracy can reach a level comparable to that of humans, and segmentation speed can exceed 2 MB per second.
Brief description of the drawings
Fig. 1 is a workflow diagram of the lexicon-based dynamic-programming Chinese word segmentation method of the present invention.
Fig. 2 illustrates the double-array trie data structure of the present invention.
Fig. 3 illustrates the double-array trie Chinese lexicon of the present invention.
Embodiment
To solve the above technical problem, the invention provides a lexicon-based dynamic-programming Chinese word segmentation method comprising the following steps: (1) load the common Chinese lexicon; (2) load the uncommon Chinese lexicon; (3) read in the Chinese text; (4) split the Chinese text into short clauses; (5) perform automatic Chinese word segmentation by dynamic programming; (6) take the result, perform part-of-speech analysis, and add part-of-speech tags; (7) store unregistered words in the uncommon Chinese lexicon; (8) return to step (4) and repeat.
Chinese word segmentation: cutting a sequence of Chinese characters into individual words; segmentation is the process of recombining a continuous character sequence into a word sequence according to a given standard.
Chinese segmentation lexicon: a dictionary made up of common Chinese words; it should occupy little memory and support fast lookup.
Unregistered words: words that are not included in the segmentation vocabulary but must still be segmented out, including all kinds of proper nouns (personal names, place names, company names, etc.), abbreviations, and newly coined words.
Overlapping ambiguity: ambiguity that arises because candidate words overlap, as in the Chinese phrase for "improve product quality", whose adjacent characters can be read as "improve", "high yield", "product", and two different words for "quality".
Combination ambiguity: the same character string can either be kept together or split. For example, the string for "individual" is a single word in "personal grudge" but must be split in "this person"; the string for "handle" is a single word in "the handle of this door" but must be split in "raise your hand".
As shown in Fig. 1, the system first reads the stored common Chinese lexicon and the uncommon Chinese lexicon obtained through learning. To save memory and keep automatic segmentation efficient, a double-array trie is built for each of the two lexicons. The system reads the text to be segmented and splits it into clauses at punctuation marks. After clause splitting comes the segmentation stage: find the possible positions of the first word, and let F[i] denote the minimum number of words into which the sentence up to the i-th character can be segmented. For each character, start from the current character and look for a transition by finding its predecessor position j. When F[j] + 1 ≤ F[i], take the transition, so that the longest current word is stored as the result. Scan back from the last character to obtain the segmentation result, tag parts of speech, and output. Then judge whether the text has ended; if it has not, continue splitting the text at punctuation and repeat the segmentation process.
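The clause-splitting step described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `split_clauses` and the exact set of punctuation marks treated as boundaries are assumptions, since the text only says that the text is split at punctuation.

```python
import re

# Punctuation and whitespace treated as clause boundaries.
# The exact set is an assumption; the source only says "punctuation marks".
_BOUNDARY = re.compile(r"[，。！？；：、,.!?;:\s]+")


def split_clauses(text):
    """Split text into short clauses at punctuation, dropping empty pieces."""
    return [part for part in _BOUNDARY.split(text) if part]


if __name__ == "__main__":
    for clause in split_clauses("今天天气很好，我们去公园。"):
        print(clause)
```

Each clause returned here would then be handed to the dynamic-programming segmentation stage on its own, which keeps every segmentation problem short.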
Let F[i] be the minimum number of words into which the first i characters can be segmented. Then F[i] = min(F[i], F[j] + 1) over every j < i for which the characters after position j up to position i form a word, with i - j bounded by the maximum word length; taking the transition sets F[i] = F[j] + 1. The candidates j are enumerated from large to small, so whenever the update F[i] = F[j] + 1 is applied, the j that remains is the smallest one able to update F[i]. The last word is therefore as long as possible, which matches the intuition behind the maximum matching method of Chinese word segmentation. Every state transition requires F[j] + 1 ≤ F[i], so the algorithm guarantees that the whole sentence is split into the minimum number of words. By combining these two approaches, the lexicon-based dynamic-programming method effectively improves the efficiency of Chinese word segmentation.
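The recurrence can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patented implementation: the function name `segment`, the `max_word_len` parameter, the use of a plain Python set in place of the trie, and the fallback that treats any single character as a word (so that unregistered characters never block the search) are all hypothetical. The sketch enumerates j from large to small and applies an update whenever F[j] + 1 ≤ F[i]; the non-strict comparison is an assumption chosen so that ties go to the smallest j, that is, to the longest last word.

```python
def segment(sentence, dictionary, max_word_len=4):
    """Split `sentence` into the fewest words, preferring a longer last word on ties."""
    n = len(sentence)
    INF = float("inf")
    F = [INF] * (n + 1)   # F[i]: minimum word count for the first i characters
    prev = [0] * (n + 1)  # prev[i]: start position of the last word ending at i
    F[0] = 0
    for i in range(1, n + 1):
        # Enumerate j from large to small, bounded by the maximum word length.
        for j in range(i - 1, max(0, i - max_word_len) - 1, -1):
            piece = sentence[j:i]
            # Assumption: a single character always counts as a word.
            if piece in dictionary or i - j == 1:
                if F[j] + 1 <= F[i]:
                    F[i] = F[j] + 1
                    prev[i] = j
    # Scan back from the last character to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(sentence[prev[i]:i])
        i = prev[i]
    words.reverse()
    return words


if __name__ == "__main__":
    # Illustrative dictionary with overlapping candidate words.
    lexicon = {"提高", "高产", "产品", "品质", "质量"}
    print(segment("提高产品质量", lexicon))
```

With this illustrative lexicon the six-character input is split into three two-character words, even though other dictionary words overlap them. The inner loop runs at most `max_word_len` times per character, which is why the running time stays linear in the sentence length.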
Fig. 2 shows the double-array trie. A double-array trie is a special kind of trie: compared with an ordinary trie, it uses space more efficiently and consumes less memory, while its lookup efficiency is the same. The data structure of the double-array trie consists of two integer arrays: base[], which holds the address of a character, and check[], which holds the hash value of a character. For array index i, if base[i] and check[i] are both 0, the position is empty. If base[i] is negative, the state is a word; check[i] gives the previous state of this state.
The segmentation lexicon is built as a double-array trie. Suppose the vocabulary contains only the words 啊 (ah), 阿根廷 (Argentina), 阿胶 (donkey-hide gelatin), 阿拉伯 (Arabia), 阿拉伯人 (Arab), and 埃及 (Egypt). First, the 10 Chinese characters that occur in this list are encoded: 啊-1, 阿-2, 埃-3, 根-4, 胶-5, 拉-6, 及-7, 廷-8, 伯-9, 人-10. For each character, a base value must be determined such that all words beginning with that character can be placed in the double array. For example, to determine the base value of 阿, suppose the codes of the second characters of the words beginning with 阿 are a1, a2, a3, ..., an. A value i must be found such that base[i+a1], check[i+a1], base[i+a2], check[i+a2], ..., base[i+an], check[i+an] are all 0. Once such an i is found, the base value of 阿 is set to i.
Fig. 3 shows the double-array trie built from the above example. All words are placed into the double array over four traversals, and the vocabulary is then traversed once more to amend the base values. A negative base value indicates that a position is a word: if state i corresponds to a word and base[i] = 0, set base[i] = (-1) * i; if base[i] is not 0, set base[i] = (-1) * base[i].
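The construction and the negative-base marking can be sketched as follows. This is a hedged illustration, not the layout shown in Fig. 3: the function names `build_double_array` and `contains`, the choice of index 1 as the root state, the use of check[ROOT] = -1 to mark the root slot as occupied, and the intermediate nested-dict trie are all assumptions. The six-word list in the usage example is an assumed reconstruction of the example vocabulary, and the slot numbers this sketch produces need not match the figure.

```python
from collections import deque

ROOT = 1  # assumption: index 0 is unused and the root state sits at index 1


def build_double_array(words):
    """Return (base, check, codes) for a small word list."""
    # Encode characters position by position: all first characters, then all
    # second characters, and so on (one pass per character position).
    codes = {}
    for pos in range(max((len(w) for w in words), default=0)):
        for w in words:
            if pos < len(w):
                codes.setdefault(w[pos], len(codes) + 1)

    # An ordinary nested-dict trie, used only as scaffolding.
    END = object()
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True

    size = 2 * (len(codes) + 2)
    base, check = [0] * size, [0] * size
    check[ROOT] = -1  # keeps the root slot from looking empty

    def is_empty(t):
        # A slot is empty when base and check are both 0.
        return t >= len(base) or (base[t] == 0 and check[t] == 0)

    word_states = []
    queue = deque([(ROOT, trie)])
    while queue:
        state, node = queue.popleft()
        if END in node:
            word_states.append(state)
        children = [(codes[ch], sub) for ch, sub in node.items() if ch is not END]
        if not children:
            continue
        b = 1  # find a base value whose child slots are all empty
        while not all(is_empty(b + c) for c, _ in children):
            b += 1
        top = b + max(c for c, _ in children)
        if top >= len(base):
            base.extend([0] * (top + 1 - len(base)))
            check.extend([0] * (top + 1 - len(check)))
        base[state] = b
        for c, sub in children:
            check[b + c] = state  # check records the previous state
            queue.append((b + c, sub))

    # Final pass: a negative base marks a state that ends a word.
    for s in word_states:
        base[s] = -s if base[s] == 0 else -base[s]
    return base, check, codes


def contains(base, check, codes, word):
    """Return True if `word` is stored in the double array."""
    state = ROOT
    for ch in word:
        code = codes.get(ch)
        if code is None:
            return False
        t = abs(base[state]) + code
        if t >= len(check) or check[t] != state:
            return False
        state = t
    return base[state] < 0


if __name__ == "__main__":
    vocab = ["啊", "阿根廷", "阿胶", "阿拉伯", "阿拉伯人", "埃及"]
    base, check, codes = build_double_array(vocab)
    print(contains(base, check, codes, "阿拉伯人"), contains(base, check, codes, "阿拉"))
```

Because a word-ending state keeps the magnitude of its base value, a lookup can continue through a word such as 阿拉伯 to reach 阿拉伯人 by taking the absolute value of base at each step.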
In addition, some special vocabularies must be maintained, such as a list of function words that stand alone as words, a place-name list, a personal-name list, and other unregistered-word lists, to further improve the accuracy of the software.
The time complexity of this lexicon-based dynamic-programming method remains linear: only the predecessor of each character needs to be found, so the time cost is very small.
On the People's Daily corpus of January 1998, comparison with the correct manual segmentation gave an accuracy of 98.8904% and a segmentation speed of 2504 KB/s.
Although the present invention has been illustrated and described with respect to preferred embodiments, those skilled in the art will understand that various changes and modifications may be made to the invention without departing from the scope defined by its claims.
Claims (3)
1. A lexicon-based dynamic-programming Chinese word segmentation method, characterized in that it comprises the following steps: (1) load the common Chinese lexicon; (2) load the uncommon Chinese lexicon; (3) read in the Chinese text and obtain the current text content; (4) split the Chinese text into short clauses; (5) perform automatic Chinese word segmentation by dynamic programming: find the possible positions of the first word, and let F[i] denote the minimum number of words into which the text up to the i-th character can currently be segmented; for each character, start from the current character and look for a transition by finding its predecessor position j; when F[j] + 1 ≤ F[i], take the transition, so that the longest current word is stored as the result; (6) scan back from the last character to obtain the segmentation result, perform part-of-speech analysis, add part-of-speech tags, and output the result; (7) store unregistered words in the uncommon Chinese lexicon; (8) judge whether the text has ended and, if not, return to step (4) and repeat.
2. The Chinese word segmentation method as claimed in claim 1, characterized in that a double-array trie is built for both the common Chinese lexicon and the uncommon lexicon; the data structure of the double-array trie consists of two integer arrays: base[], which holds the address of a character, and check[], which holds the hash value of a character; for array index i, if base[i] and check[i] are both 0, the position is empty; if base[i] is negative, the state is a word; check[i] gives the previous state of this state.
3. The Chinese word segmentation method as claimed in claim 2, characterized in that, to build the double-array trie, all words are placed into the double array over four traversals, and the vocabulary is then traversed once more to amend the base values; a negative base value indicates that a position is a word: if state i corresponds to a word and base[i] = 0, set base[i] = (-1) * i; if base[i] is not 0, set base[i] = (-1) * base[i].
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410507974.9A CN104252542A (en) | 2014-09-29 | 2014-09-29 | Dynamic-planning Chinese words segmentation method based on lexicons |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104252542A (en) | 2014-12-31 |
Family
ID=52187432
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201410507974.9A | Pending | CN104252542A (en) | 2014-09-29 | 2014-09-29 | Dynamic-planning Chinese words segmentation method based on lexicons |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104252542A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1193779A (en) * | 1997-03-13 | 1998-09-23 | 国际商业机器公司 | Method for dividing sentences in Chinese language into words and its use in error checking system for texts in Chinese language |
CN101071421A (en) * | 2007-05-14 | 2007-11-14 | 腾讯科技(深圳)有限公司 | Chinese word cutting method and device |
CN102063424A (en) * | 2010-12-24 | 2011-05-18 | 上海电机学院 | Method for Chinese word segmentation |
Non-Patent Citations (2)
Title |
---|
赵欢等: "基于双数组Trie树中文分词研究", 《湖南大学学报(自然科学版)》 * |
金凌: "语音识别中语言模型的研究", 《万方数据库清华大学硕士论文》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105279150A (en) * | 2015-10-27 | 2016-01-27 | 江苏电力信息技术有限公司 | Lucene full-text retrieval based Chinese word segmentation method |
CN105468584A (en) * | 2015-12-31 | 2016-04-06 | 武汉鸿瑞达信息技术有限公司 | Filtering method and system for bad literal information in text |
CN108073566A (en) * | 2016-11-16 | 2018-05-25 | 北京搜狗科技发展有限公司 | Segmenting method and device, the device for participle |
CN108073566B (en) * | 2016-11-16 | 2022-01-18 | 北京搜狗科技发展有限公司 | Word segmentation method and device and word segmentation device |
CN110309400A (en) * | 2018-02-07 | 2019-10-08 | 鼎复数据科技(北京)有限公司 | A kind of method and system that intelligent Understanding user query are intended to |
CN109918665A (en) * | 2019-03-05 | 2019-06-21 | 湖北亿咖通科技有限公司 | Segmenting method, device and the electronic equipment of text |
CN109918665B (en) * | 2019-03-05 | 2021-11-02 | 湖北亿咖通科技有限公司 | Word segmentation method and device for text and electronic equipment |
CN112364605A (en) * | 2020-11-27 | 2021-02-12 | 智业软件股份有限公司 | Text labeling method based on double-array Trie, terminal equipment and storage medium |
CN113033193A (en) * | 2021-01-20 | 2021-06-25 | 山谷网安科技股份有限公司 | C + + language-based mixed Chinese text word segmentation method |
CN113033193B (en) * | 2021-01-20 | 2024-04-16 | 山谷网安科技股份有限公司 | Mixed Chinese text word segmentation method based on C++ language |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20141231 |