CN104252542A - Dynamic-planning Chinese words segmentation method based on lexicons - Google Patents

Dynamic-planning Chinese words segmentation method based on lexicons

Info

Publication number
CN104252542A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410507974.9A
Other languages
Chinese (zh)
Inventor
孙珂
田冰川
张道强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-09-29
Filing date
2014-09-29
Publication date
2014-12-31
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN201410507974.9A priority Critical patent/CN104252542A/en
Publication of CN104252542A publication Critical patent/CN104252542A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2246Trees, e.g. B+trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Document Processing Apparatus (AREA)

Abstract

Disclosed is a lexicon-based dynamic-programming method for Chinese word segmentation. The method includes the steps of: (1) loading a common Chinese lexicon; (2) loading an uncommon Chinese lexicon; (3) reading in Chinese text and acquiring the current text content; (4) splitting the Chinese text into short sentences; (5) performing automatic Chinese word segmentation by dynamic programming; (6) scanning backward from the last character to obtain the segmentation result, performing part-of-speech analysis and tagging, and outputting the result; (7) storing unregistered words in the uncommon Chinese lexicon; and (8) judging whether the text has ended and, if not, returning to step (4) to repeat the process. The method offers high accuracy and efficiency: segmentation accuracy approaches that of manual segmentation, and segmentation speed exceeds 2 MB per second.

Description

A dictionary-based dynamic programming method for Chinese word segmentation
Technical field
The present invention relates to the field of automatic Chinese information processing, and in particular to a dictionary-based dynamic programming method for Chinese word segmentation.
Background art
With the arrival of the information age, Chinese-language information resources keep growing, and finding the data one needs in the vast world of Chinese information has become an important problem. Because data volumes have increased sharply, manual processing is no longer realistic. Automatic processing methods help people retrieve and manage information, addressing the current situation in which information is abundant but knowledge is scarce. Many automated language processing tools have appeared, such as automatic summarization and automatic document retrieval, and a key element of these technologies is the keyword. Extracting keywords simplifies this kind of work, and finding keywords requires word segmentation.
Chinese word segmentation is the most critical preprocessing step in Chinese text processing and the foundation of text mining. It also underlies other Chinese information processing tasks, such as Chinese search engines, machine translation, speech synthesis, automatic classification, automatic summarization, and automatic proofreading, all of which rely on segmentation. Research on Chinese word segmentation is therefore vital to the development of automatic Chinese information processing.
Summary of the invention
The technical problem to be solved by the present invention is to provide a dictionary-based dynamic programming method for Chinese word segmentation that is both accurate and fast.
To solve the above technical problem, the invention provides a dictionary-based dynamic programming method for Chinese word segmentation, comprising the following steps: (1) load the common Chinese lexicon; (2) load the uncommon Chinese lexicon; (3) read in Chinese text and obtain the current text content; (4) split the Chinese text into short sentences one by one; (5) perform automatic Chinese word segmentation by dynamic programming: find the possible positions of the first word, and let F[i] denote the minimum number of words into which the text up to the i-th character can currently be segmented; attempt a transition for each character by starting from the current character and looking for the word that precedes it; when F[i] >= F[j] + 1, make the transition, so that the longest current word is stored as the result; (6) scan backward from the last character to obtain the segmentation result, perform part-of-speech analysis, add part-of-speech tags, and output the result; (7) store unregistered words in the uncommon Chinese lexicon; (8) judge whether the text has ended and, if not, return to step (4) and repeat.
A double-array trie is built for both the common Chinese lexicon and the uncommon lexicon. The data structure of the double-array trie consists of two integer arrays: base[], which holds the address of a character, and check[], which holds the hash value of a character. For an array index i, if base[i] and check[i] are both 0, the position is empty; if base[i] is negative, the state is a word; check[i] records the previous state of this state.
To build the double-array trie, four passes place all words into the double array, and the vocabulary is then traversed once more to modify the base values. A negative base value indicates that the position is a word: if state i corresponds to a word and base[i] = 0, then set base[i] = (-1) * i; if base[i] is not 0, then set base[i] = (-1) * base[i].
The beneficial effects of the present invention are high accuracy and high efficiency: segmentation precision reaches a level comparable to that of humans, and segmentation speed exceeds 2 MB per second.
Brief description of the drawings
Fig. 1 is the workflow diagram of the dictionary-based dynamic programming Chinese word segmentation method of the present invention.
Fig. 2 illustrates the double-array trie data structure of the present invention.
Fig. 3 illustrates the double-array trie Chinese lexicon of the present invention.
Detailed description of the embodiments
To solve the above technical problem, the invention provides a dictionary-based dynamic programming method for Chinese word segmentation, comprising the following steps: (1) load the common Chinese lexicon; (2) load the uncommon Chinese lexicon; (3) read in Chinese text; (4) split the Chinese text into short sentences one by one; (5) perform automatic Chinese word segmentation by dynamic programming; (6) perform part-of-speech analysis on the result and add part-of-speech tags; (7) store unregistered words in the uncommon Chinese lexicon; (8) return to step (4) and repeat.
Chinese word segmentation: cutting a sequence of Chinese characters into individual words; segmentation is the process of recombining a continuous character sequence into a word sequence according to a given specification.
Chinese segmentation dictionary: a dictionary made up of common Chinese words; it should occupy little memory and support fast lookup.
Unregistered word: a word that is not included in the segmentation vocabulary but must still be cut out, including all kinds of proper nouns (personal names, place names, company names, and so on), abbreviations, and newly coined words.
Overlapping (crossing) ambiguity: ambiguity that arises because candidate words share characters. For example, in 提高产品质量 ("improve product quality"), the candidates 提高, 高产, 产品, 品质, and 质量 each overlap with their neighbours.
Combinational ambiguity: the same string can either stand as one word or be split. For example, 个人 is a single word in a phrase meaning "personal grudges" but must be split in a phrase meaning "this person"; 把手 ("handle") is a single word in "the handle of this door" but must be split in a phrase meaning "raise your hand".
As shown in Figure 1, the system first reads the stored common Chinese lexicon and the uncommon Chinese lexicon obtained through learning. To save memory and keep automatic segmentation efficient, a double-array trie is built for each of the two lexicons. The system then reads the text to be segmented and splits it into clauses at punctuation marks. After splitting, the segmentation stage begins: the possible positions of the first word are found, and F[i] denotes the minimum number of words into which the sentence up to the i-th character can be segmented. A transition is attempted for each character: starting from the current character, the method looks for the word that precedes it. When F[i] >= F[j] + 1, the transition is made, so that the longest current word is stored as the result. The sentence is then scanned backward from the last character to obtain the segmentation result, which is part-of-speech tagged and output. Finally, the system checks whether the text has ended; if not, it continues splitting the text at punctuation and repeats the segmentation process.
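The clause-splitting step described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the punctuation set `PUNCTUATION` and the function name `split_into_clauses` are assumptions, since the source says only that the text is split at punctuation marks.

```python
import re

# Assumed set of Chinese and ASCII punctuation marks used as clause boundaries.
PUNCTUATION = "。！？；，、：,.!?;:"

def split_into_clauses(text: str) -> list[str]:
    """Split text at punctuation or whitespace and drop empty pieces."""
    pattern = "[" + re.escape(PUNCTUATION) + r"\s]+"
    return [part for part in re.split(pattern, text) if part]

clauses = split_into_clauses("今天天气好，我们去公园。你去吗？")
```

Each returned clause would then be passed to the segmentation stage on its own, which keeps the dynamic programming table short.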
Let F[i] denote the minimum number of words into which the first i characters can be segmented. Then F[i] = min(F[i], F[j] + 1), taken over every j < i for which characters j+1 through i form a word no longer than the maximum word length. The index j is enumerated from large to small, and the update F[i] = F[j] + 1 is applied whenever F[i] >= F[j] + 1, so the j that remains after the last update is the smallest one that achieves the minimum. The last word is therefore as long as possible, which intuitively matches the maximum matching method of Chinese word segmentation. Because every state transition requires F[i] >= F[j] + 1, F[i] never increases, and the algorithm guarantees that the whole sentence is divided into the minimum number of words. The dictionary-based dynamic programming method thus combines the two ideas, dynamic programming and maximum matching, and effectively improves the efficiency of Chinese word segmentation.
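The recurrence above can be sketched in a few lines of Python. Several details here are assumptions made for illustration and are not stated in the source: the lexicon is a plain Python set rather than a double-array trie, `max_word_len` defaults to 4, single characters are always accepted as fallback words so that every sentence can be segmented, and the names `segment`, `f`, and `prev` are hypothetical.

```python
def segment(sentence: str, lexicon: set[str], max_word_len: int = 4) -> list[str]:
    """Segment a sentence into the minimum number of words."""
    n = len(sentence)
    f = [float("inf")] * (n + 1)  # f[i]: minimum word count for the first i characters
    prev = [0] * (n + 1)          # prev[i]: start index of the last word ending at i
    f[0] = 0
    for i in range(1, n + 1):
        lowest = max(0, i - max_word_len)
        for j in range(i - 1, lowest - 1, -1):  # enumerate j from large to small
            word = sentence[j:i]
            # Assumption: a single character always counts as a word.
            if len(word) == 1 or word in lexicon:
                if f[i] >= f[j] + 1:  # ties also update, so the smallest j wins
                    f[i] = f[j] + 1
                    prev[i] = j
    # Scan backward from the last character to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(sentence[prev[i]:i])
        i = prev[i]
    words.reverse()
    return words

result = segment("阿拉伯人在埃及", {"阿拉伯", "阿拉伯人", "埃及"})
```

With the non-strict comparison, two segmentations with the same word count are resolved in favour of the longer final word, which is the behaviour the paragraph above describes.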
Figure 2 shows the double-array trie. A double-array trie is a special kind of trie: compared with an ordinary trie, it uses space more efficiently and consumes less memory, while its lookup efficiency is the same as that of an ordinary trie. The data structure consists of two integer arrays: base[], which holds the address of a character, and check[], which holds the hash value of a character. For an array index i, if base[i] and check[i] are both 0, the position is empty. If base[i] is negative, the state is a word; check[i] records the previous state of this state.
To build the segmentation dictionary as a double-array trie, suppose the vocabulary contains only the words 啊, 阿根廷 (Argentina), 阿胶 (donkey-hide gelatin), 阿拉伯 (Arabia), 阿拉伯人 (Arab), and 埃及 (Egypt). First, the 10 Chinese characters that occur in this list are encoded: 啊-1, 阿-2, 埃-3, 根-4, 胶-5, 拉-6, 及-7, 廷-8, 伯-9, 人-10. For each character, a base value must be determined such that all words starting with that character can be placed in the double array. For example, to determine the base value of 阿, suppose the codes of the second characters of the words that start with 阿 are, in order, a1, a2, a3, ..., an. A value i must be found such that base[i+a1], check[i+a1], base[i+a2], check[i+a2], ..., base[i+an], check[i+an] are all 0. Once such an i is found, the base value of 阿 is set to i.
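The search for a suitable base value can be sketched as follows. The source does not say how i is searched for, so the linear scan from a starting value, the automatic growth of the arrays, and the name `find_base_value` are all assumptions made for illustration.

```python
def find_base_value(base: list[int], check: list[int],
                    child_codes: list[int], start: int = 1) -> int:
    """Return the smallest i >= start such that every slot i + a is empty."""
    i = start
    while True:
        top = i + max(child_codes)
        if top >= len(base):  # grow both arrays with empty (zero) slots
            grow = top + 1 - len(base)
            base.extend([0] * grow)
            check.extend([0] * grow)
        # A slot is empty when both base and check are 0.
        if all(base[i + a] == 0 and check[i + a] == 0 for a in child_codes):
            return i
        i += 1

# Codes 4, 5, 6 stand for the second characters 根, 胶, 拉 of words starting with 阿.
base = [0] * 12
check = [0] * 12
check[5] = 1  # pretend slot 5 is already occupied
base[6] = 3   # pretend slot 6 is already occupied
i = find_base_value(base, check, [4, 5, 6])
```

In this toy setup the candidates i = 1 and i = 2 are rejected because they would collide with the occupied slots 5 and 6, so the search settles on i = 3.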
Figure 3 shows the double-array trie built from the above example. Four passes place all words into the double array, and the vocabulary is then traversed once more to modify the base values. A negative base value indicates that the position is a word: if state i corresponds to a word and base[i] = 0, then set base[i] = (-1) * i; if base[i] is not 0, then set base[i] = (-1) * base[i].
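The construction and the negative-base marking can be sketched end to end as below. This is one possible reading under stated assumptions, not the patent's exact layout: first characters occupy the reserved states 1..K equal to their codes, a transition from state s on character c goes to abs(base[s]) + code[c] and is validated with check, and the names `build_double_array`, `contains`, and `find_free_base` are hypothetical.

```python
def find_free_base(base, check, codes, start):
    """Smallest i >= start whose slots i + a are all empty; grows the arrays."""
    i = start
    while True:
        top = i + max(codes)
        if top >= len(base):
            grow = top + 1 - len(base)
            base.extend([0] * grow)
            check.extend([0] * grow)
        if all(base[i + a] == 0 and check[i + a] == 0 for a in codes):
            return i
        i += 1

def build_double_array(words, code):
    k = len(code)
    base, check = [0] * (k + 1), [0] * (k + 1)     # slots 1..k reserved (assumption)
    state_of = {w[0]: code[w[0]] for w in words}   # prefix -> state index
    for depth in range(1, max(len(w) for w in words)):  # one pass per character position
        children = {}
        for w in words:
            if len(w) > depth:
                children.setdefault(w[:depth], set()).add(w[depth])
        for prefix in sorted(children):
            b = find_free_base(base, check, [code[c] for c in children[prefix]], k)
            s = state_of[prefix]
            base[s] = b
            for c in children[prefix]:
                check[b + code[c]] = s             # check stores the previous state
                state_of[prefix + c] = b + code[c]
    for w in words:                                # final pass: mark word states
        s = state_of[w]
        base[s] = -s if base[s] == 0 else -base[s]
    return base, check

def contains(word, base, check, code):
    if not word or any(c not in code for c in word):
        return False
    s = code[word[0]]
    for c in word[1:]:
        t = abs(base[s]) + code[c]
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return base[s] < 0                             # negative base marks a word

CODE = {ch: n for n, ch in enumerate("啊阿埃根胶拉及廷伯人", start=1)}
WORDS = ["啊", "阿根廷", "阿胶", "阿拉伯", "阿拉伯人", "埃及"]
BASE, CHECK = build_double_array(WORDS, CODE)
```

Taking the absolute value of base during lookup lets a state be both a word and a prefix of a longer word, as with 阿拉伯 and 阿拉伯人.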
In addition, some special word lists need to be maintained, for example a list of function words that stand alone as words, a place-name list, a personal-name list, and other lists of unregistered words, to further improve the accuracy of the software.
The time complexity of this dictionary-based dynamic programming method remains linear: for each character, only its predecessor needs to be found, so the running time is small.
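One way to make the linear-time claim concrete is the bound below. It rests on two assumptions that the source does not state: the maximum word length L is a fixed constant, and looking up a candidate of k characters costs at most c·k operations for some constant c. Here n is the sentence length in characters.

```latex
T(n) \;\le\; \sum_{i=1}^{n} \sum_{k=1}^{L} c\,k
     \;=\; c\,n\,\frac{L(L+1)}{2}
     \;=\; O(n) \quad \text{for constant } L
```

Under these assumptions, doubling the length of the text roughly doubles the segmentation time.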
On the People's Daily corpus of January 1998, compared against the correct manual segmentation, accuracy reaches 98.8904% and segmentation speed is 2504 KB/s.
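The source does not say how accuracy was computed. A common way to compare a segmentation with a manual reference is to count the words whose character spans match exactly, as in the hedged sketch below; the function names `to_spans` and `precision_recall` are hypothetical, and this is not necessarily the measure behind the reported figure.

```python
def to_spans(words: list[str]) -> set[tuple[int, int]]:
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def precision_recall(predicted: list[str], gold: list[str]) -> tuple[float, float]:
    """Word-level precision and recall of a segmentation against a reference."""
    p, g = to_spans(predicted), to_spans(gold)
    correct = len(p & g)  # words with identical boundaries in both segmentations
    return correct / len(p), correct / len(g)

precision, recall = precision_recall(
    ["阿拉伯", "人", "在", "埃及"],  # system output
    ["阿拉伯人", "在", "埃及"],      # manual reference
)
```

In this example two of the four predicted words match the reference, so precision is 0.5, while two of the three reference words are recovered.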
Although the present invention has been illustrated and described with respect to preferred embodiments, those skilled in the art will understand that various changes and modifications may be made to it without departing from the scope defined by the claims.

Claims (3)

1. A dictionary-based dynamic programming method for Chinese word segmentation, characterized by comprising the following steps: (1) load the common Chinese lexicon; (2) load the uncommon Chinese lexicon; (3) read in Chinese text and obtain the current text content; (4) split the Chinese text into short sentences one by one; (5) perform automatic Chinese word segmentation by dynamic programming: find the possible positions of the first word, and let F[i] denote the minimum number of words into which the text up to the i-th character can currently be segmented; attempt a transition for each character by starting from the current character and looking for the word that precedes it; when F[i] >= F[j] + 1, make the transition, so that the longest current word is stored as the result; (6) scan backward from the last character to obtain the segmentation result, perform part-of-speech analysis, add part-of-speech tags, and output the result; (7) store unregistered words in the uncommon Chinese lexicon; (8) judge whether the text has ended and, if not, return to step (4) and repeat.
2. The Chinese word segmentation method as claimed in claim 1, characterized in that a double-array trie is built for both the common Chinese lexicon and the uncommon lexicon; the data structure of the double-array trie consists of two integer arrays: base[], which holds the address of a character, and check[], which holds the hash value of a character; for an array index i, if base[i] and check[i] are both 0, the position is empty; if base[i] is negative, the state is a word; check[i] records the previous state of this state.
3. The Chinese word segmentation method as claimed in claim 2, characterized in that, to build the double-array trie, four passes place all words into the double array, and the vocabulary is then traversed once more to modify the base values; a negative base value indicates that the position is a word: if state i corresponds to a word and base[i] = 0, then set base[i] = (-1) * i; if base[i] is not 0, then set base[i] = (-1) * base[i].
CN201410507974.9A 2014-09-29 2014-09-29 Dynamic-planning Chinese words segmentation method based on lexicons Pending CN104252542A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410507974.9A CN104252542A (en) 2014-09-29 2014-09-29 Dynamic-planning Chinese words segmentation method based on lexicons

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410507974.9A CN104252542A (en) 2014-09-29 2014-09-29 Dynamic-planning Chinese words segmentation method based on lexicons

Publications (1)

Publication Number Publication Date
CN104252542A true CN104252542A (en) 2014-12-31

Family

ID=52187432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410507974.9A Pending CN104252542A (en) 2014-09-29 2014-09-29 Dynamic-planning Chinese words segmentation method based on lexicons

Country Status (1)

Country Link
CN (1) CN104252542A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1193779A (en) * 1997-03-13 1998-09-23 国际商业机器公司 Method for dividing sentences in Chinese language into words and its use in error checking system for texts in Chinese language
CN101071421A (en) * 2007-05-14 2007-11-14 腾讯科技(深圳)有限公司 Chinese word cutting method and device
CN102063424A (en) * 2010-12-24 2011-05-18 上海电机学院 Method for Chinese word segmentation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
赵欢等: "基于双数组Trie树中文分词研究", 《湖南大学学报(自然科学版)》 *
金凌: "语音识别中语言模型的研究", 《万方数据库清华大学硕士论文》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279150A (en) * 2015-10-27 2016-01-27 江苏电力信息技术有限公司 Lucene full-text retrieval based Chinese word segmentation method
CN105468584A (en) * 2015-12-31 2016-04-06 武汉鸿瑞达信息技术有限公司 Filtering method and system for bad literal information in text
CN108073566A (en) * 2016-11-16 2018-05-25 北京搜狗科技发展有限公司 Segmenting method and device, the device for participle
CN108073566B (en) * 2016-11-16 2022-01-18 北京搜狗科技发展有限公司 Word segmentation method and device and word segmentation device
CN110309400A (en) * 2018-02-07 2019-10-08 鼎复数据科技(北京)有限公司 A kind of method and system that intelligent Understanding user query are intended to
CN109918665A (en) * 2019-03-05 2019-06-21 湖北亿咖通科技有限公司 Segmenting method, device and the electronic equipment of text
CN109918665B (en) * 2019-03-05 2021-11-02 湖北亿咖通科技有限公司 Word segmentation method and device for text and electronic equipment
CN112364605A (en) * 2020-11-27 2021-02-12 智业软件股份有限公司 Text labeling method based on double-array Trie, terminal equipment and storage medium
CN113033193A (en) * 2021-01-20 2021-06-25 山谷网安科技股份有限公司 C + + language-based mixed Chinese text word segmentation method
CN113033193B (en) * 2021-01-20 2024-04-16 山谷网安科技股份有限公司 Mixed Chinese text word segmentation method based on C++ language

Similar Documents

Publication Publication Date Title
CN104252542A (en) Dynamic-planning Chinese words segmentation method based on lexicons
CN102479191B (en) Method and device for providing multi-granularity word segmentation result
CN105718586B (en) The method and device of participle
CN102122298B (en) Method for matching Chinese similarity
CN103970798B (en) The search and matching of data
CN105930362B (en) Search for target identification method, device and terminal
CN104778256B (en) A kind of the quick of field question answering system consulting can increment clustering method
KR20100054587A (en) System for extracting ralation between technical terms in large collection using a verb-based pattern
CN104199965A (en) Semantic information retrieval method
CN112883165B (en) Intelligent full-text retrieval method and system based on semantic understanding
CN115080694A (en) Power industry information analysis method and equipment based on knowledge graph
CN102486787A (en) Method and device for extracting document structure
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
CN109885641B (en) Method and system for searching Chinese full text in database
JP2015088064A (en) Text summarization device, text summarization method, and program
CN109086285B (en) Intelligent Chinese processing method, system and device based on morphemes
CN112949293A (en) Similar text generation method, similar text generation device and intelligent equipment
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN112765977A (en) Word segmentation method and device based on cross-language data enhancement
CN104239294B (en) Hide the how tactful Tibetan language long sentence cutting method of Chinese translation system
Havrashenko et al. Analysis of text augmentation algorithms in artificial language machine translation systems
CN106569997B (en) Science and technology compound phrase identification method based on hidden Markov model
CN115617965A (en) Rapid retrieval method for language structure big data
CN109727591B (en) Voice search method and device
Al-Zyoud et al. Arabic stemming techniques: comparisons and new vision

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20141231