TW201035783A - Chinese word segmentation system and method thereof - Google Patents

Chinese word segmentation system and method thereof

Info

Publication number
TW201035783A
Authority
TW
Taiwan
Prior art keywords
vocabulary
word
combined
combination
cumulative
Prior art date
Application number
TW98110770A
Other languages
Chinese (zh)
Inventor
Chau-Cer Chiu
Ling Chen
Original Assignee
Inventec Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inventec Corp
Priority to TW98110770A
Publication of TW201035783A

Landscapes

  • Document Processing Apparatus (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A Chinese word segmentation system and method are provided. When a character combination taken from the Chinese text is confirmed to exist in the word database, the combination and its next character are taken as an accumulated combination and the search continues; otherwise, the last character of the combination and its next character are taken as another combination and the search continues. This addresses the prior-art problem that a large number of combinations are produced when chopping Chinese text, and achieves the technical effect of reducing system load while balancing chopping speed against fidelity to the original meaning.

Description

VI. Description of the Invention:

[Technical Field]

A word segmentation system and method for Chinese character strings, and in particular a word segmentation system and method for Chinese character strings that produce valid segmented tokens.

[Prior Art]

Chinese word segmentation (CWS) splits a Chinese sentence into several segmented tokens. It is used in information retrieval, human-computer interaction, information extraction, text mining, translation between Chinese and foreign languages, Chinese proofreading, automatic summarization and similar applications. Taking information retrieval as an example, a search engine splits the retrieval data into segmented tokens in advance and builds an index from them. On receiving a keyword phrase entered by a user, the search engine likewise splits the phrase into one or more segmented tokens and compares them with the index to find the retrieval data related to the keyword phrase.

Unlike English, however, Chinese has no spaces between characters, so strings cannot be extracted and indexed by whitespace. Word segmentation is therefore a key technology for effective information retrieval. Taking "我正面臨一項考驗" ("I am facing a test") as an example, the overlapping bigram segmentation currently in use produces seven segmented tokens: "我正", "正面", "面臨", "臨一", "一項", "項考" and "考驗". Overlapping bigram segmentation is simple and fast, but it generates a large number of meaningless tokens, which enlarges the index storage and burdens the system. In addition, although such mechanical segmentation raises the hit rate for keyword phrases, it ignores the original meaning of the sentence and so lowers the relevance of the retrieval results.

In summary, the prior art has long had the problem of producing a large number of invalid segmented tokens when segmenting Chinese character strings, and an improved technical means is needed to solve it.

[Summary of the Invention]

In view of the prior-art problem of producing a large number of invalid segmented tokens when segmenting Chinese character strings, the present invention discloses a word segmentation system and method for Chinese character strings.
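As a point of reference, the overlapping bigram segmentation described in the prior-art discussion can be sketched in a few lines of Python. This is a minimal illustration rather than code from the patent: the function name `overlapping_bigrams` is an assumption, and the sample sentence is reconstructed from the seven bigrams listed in that discussion.

```python
def overlapping_bigrams(text):
    """Return every pair of adjacent characters, sliding one character at a time.

    Hypothetical helper for illustration; the patent names no such function.
    """
    return [text[i:i + 2] for i in range(len(text) - 1)]


if __name__ == "__main__":
    # Sample sentence from the prior-art discussion (eight characters).
    for token in overlapping_bigrams("我正面臨一項考驗"):
        print(token)
```

A string of n characters always yields n - 1 bigrams, whether or not they are real words, which is the index-size problem the invention sets out to reduce.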

The word segmentation system for Chinese character strings disclosed by the present invention comprises a vocabulary database, an extraction module and a lookup module. The vocabulary database contains a plurality of words. The extraction module extracts the first and second characters of a received Chinese string as a first combination and then, depending on whether the subsequent lookup of the first combination succeeds, either extracts the first combination together with its next character as a first accumulated combination, or extracts the last character of the first combination together with its next character as a second combination. The lookup module performs, in sequence, a matching lookup of the first combination in the vocabulary database and then a matching lookup of the first accumulated combination or the second combination. When the lookup module confirms that the vocabulary database contains the first accumulated combination, the extraction module extracts the first accumulated combination together with its next character as a next first accumulated combination; otherwise, it extracts the last character of the first accumulated combination together with its next character as the second combination. The lookup module then looks up the next first accumulated combination or the second combination, and so on until a combination or accumulated combination contains the final character of the Chinese string.

The word segmentation method for Chinese character strings disclosed by the present invention relies on a pre-built vocabulary database containing a plurality of words. The method first extracts the first and second characters of a Chinese string as a first combination. It then looks up the first combination in the vocabulary database. If the database contains the first combination, the first combination and its next character are extracted as a first accumulated combination; otherwise, the last character of the first combination and its next character are extracted as a second combination, and the second combination is looked up in the vocabulary database in turn. The first accumulated combination is likewise looked up in the vocabulary database. If the database contains it, the first accumulated combination and its next character are extracted as a next first accumulated combination; otherwise, the last character of the first accumulated combination and its next character are extracted as the second combination, which is again looked up in the vocabulary database. This continues until a combination or accumulated combination contains the final character of the Chinese string.

The system and method disclosed by the present invention are as described above. They differ from the prior art in that, when the vocabulary database is confirmed to contain a combination extracted from the Chinese string, the combination and its next character are extracted as an accumulated combination and the lookup continues; otherwise, the last character of the combination and its next character are looked up. By this technical means the invention produces valid combinations or accumulated combinations and reduces the load on the system while balancing segmentation speed against the original meaning of the sentence.

[Embodiments]

The embodiments of the present invention are described in detail below with reference to the drawings and examples, so that the way the invention applies technical means to solve the technical problem and achieve the technical effect can be fully understood and put into practice.

Fig. 1 is a block diagram of the Chinese string segmentation system of the present invention. Referring to Fig. 1, the Chinese string segmentation system 100 comprises a vocabulary database 110, an extraction module 120 and a lookup module 130.
The vocabulary database 110 contains a plurality of words, which serve as the basis for subsequent comparison. The extraction module 120 extracts the first and second characters of a received Chinese string 101 as a first combination and then, depending on whether the subsequent lookup of the first combination succeeds, either extracts the first combination and its next character as a first accumulated combination, or extracts the last character of the first combination and its next character as a second combination.

For example, suppose the Chinese string 101 is "我最喜歡你的笑容" ("I like your smile best"), where "我" is the first character and "容" is the final character of the Chinese string 101. The extraction module 120 extracts the first and second characters, "我最", as the first combination. Depending on whether the subsequent lookup of "我最" succeeds, it either extracts "我最" and the next character in the Chinese string 101, "喜", as the first accumulated combination "我最喜", or extracts the last character of "我最", namely "最", and its next character "喜" as the second combination "最喜". As another example, for the Chinese string 101 "星期一天氣晴" ("Monday, fine weather"), the extraction module 120 extracts "星期" as the first combination and, depending on whether the subsequent lookup of "星期" succeeds, extracts either the first accumulated combination "星期一" or the second combination "期一".
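The choice the extraction module 120 faces after each lookup can be shown as a small function that returns both possible next candidates. This is only a sketch under stated assumptions: the name `candidate_options` and its parameters are hypothetical, and the combination just looked up is taken to be `text[start:start + length]`.

```python
def candidate_options(text, start, length):
    """Return (accumulated, shifted), the two candidates that can follow a lookup.

    Hypothetical helper; text[start:start + length] is the combination
    that has just been looked up in the vocabulary database.
    """
    end = start + length
    accumulated = text[start:end + 1]   # combination plus its next character (lookup succeeded)
    shifted = text[end - 1:end + 1]     # last character plus the next character (lookup failed)
    return accumulated, shifted


if __name__ == "__main__":
    print(candidate_options("我最喜歡你的笑容", 0, 2))
    print(candidate_options("星期一天氣晴", 0, 2))
```

For the two sample strings this gives the pairs named in the text: "我最喜" or "最喜", and "星期一" or "期一".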
The lookup module 130 performs, in sequence, a matching lookup of the first combination in the vocabulary database 110, followed by a matching lookup of the first accumulated combination or the second combination. When the lookup module 130 confirms that the vocabulary database 110 contains the first accumulated combination, the extraction module 120 goes on to extract the first accumulated combination and its next character as a next first accumulated combination; otherwise, it extracts the last character of the first accumulated combination and its next character as the second combination. The lookup module 130 then looks up the next first accumulated combination or the second combination, and so on until a combination or accumulated combination contains the final character of the Chinese string.

Continuing the first example, the lookup module 130 searches the vocabulary database 110 for a word matching the first combination "我最" and determines that the database has no "我最". The extraction module 120 therefore extracts the last character of "我最" and its next character as the second combination "最喜", and the lookup module 130 looks up "最喜" in the vocabulary database 110. When the lookup module 130 confirms that the database has no "最喜", the extraction module 120 extracts the last character of "最喜" and its next character as the third combination, "喜歡", and the lookup module 130 looks it up in the vocabulary database 110. The process continues in the same way: after finding a word matching the third combination "喜歡", the lookup module 130 looks up in turn the third accumulated combination "喜歡你", the fourth combination "你的", the fifth combination "的笑" and the sixth combination "笑容" generated by the extraction module 120. The lookup stops because the sixth combination "笑容" contains the final character "容", giving the segmentation result "我", "最", "喜歡", "你", "的" and "笑容".

In the other example, the lookup module 130 finds in the vocabulary database 110 a word matching the first combination "星期". Based on this result, the extraction module 120 extracts the first accumulated combination "星期一", and the lookup module 130 looks up "星期一" in the vocabulary database 110. When "星期一" is found, the extraction module 120 extracts the next first accumulated combination "星期一天", which the lookup module 130 looks up in the vocabulary database 110. When the lookup module 130 determines that the database has no word "星期一天", it looks up in turn the second combination "天氣", the second accumulated combination "天氣晴" and the third combination "晴" generated by the extraction module 120. The lookup stops because the third combination "晴" is the final character, giving the segmentation result "星期", "星期一", "天氣" and "晴".

In addition, the Chinese string segmentation system 100 may further comprise a recognition module 140 (see Fig. 3). When the lookup module 130 confirms that the vocabulary database 110 does not contain the first combination (or the second combination), the recognition module 140 recognizes the first character of the first combination (or the second combination) as a first segmented character (or a second segmented character). In producing "我", "最", "喜歡", "你", "的" and "笑容" in the example above, the recognition module 140 recognizes the first character "我" of the first combination "我最" as the first segmented character, because the lookup of "我最" found nothing. Likewise, "最" is the second segmented character, "你" is the fourth segmented character and "的" is the fifth segmented character, while "喜歡" and "笑容" are the third combination and the sixth combination described above.

Fig. 2 is a flowchart of the steps of the Chinese string segmentation method of the present invention, for which a vocabulary database 110 containing a plurality of words is pre-built. Referring to Fig. 2, the first and second characters of a Chinese string 101 are first extracted as a first combination (step 210). The first combination is looked up in the vocabulary database 110. If the database contains it, the first combination and its next character are extracted as a first accumulated combination; otherwise, the last character of the first combination and its next character are extracted as a second combination, and the second combination is looked up in the vocabulary database 110 in turn (step 220). The first accumulated combination is looked up in the vocabulary database 110. If the database contains it, the first accumulated combination and its next character are extracted as a next first accumulated combination; otherwise, the last character of the first accumulated combination and its next character are extracted as the second combination, which is again looked up in the vocabulary database 110 (step 230). This continues until a combination or accumulated combination contains the final character of the Chinese string 101 (step 240).

Further, when the second combination is looked up in the vocabulary database 110 (step 220 or step 230) and the database is confirmed to contain it, the second combination and its next character are extracted as a second accumulated combination, which is looked up in the vocabulary database 110. Otherwise, the last character of the second combination and its next character are extracted as a third combination, which is looked up in the vocabulary database 110.

When a combination (the first combination, the second combination, or possibly a third or later combination) is looked up in the vocabulary database 110 and the database is confirmed not to contain it, the method includes a step of recognizing the first character of that combination as a segmented character (the first, second, third or later segmented character).

For ease of understanding, the system and method are further described below by way of an embodiment that keeps an extraction character count and a vocabulary record.

Fig. 3 is a block diagram of the Chinese string segmentation system of the present invention with an added recognition module and setting module.
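The basic flow of Fig. 2 (steps 210 to 240) can be sketched as a loop that records each candidate in the order it is looked up. This is a hedged sketch, not the patent's implementation: `candidate_trace` is a hypothetical name, the vocabulary database is modelled as a Python set, and the set contents are assumed from the lookups the first example reports as successful.

```python
def candidate_trace(text, lexicon):
    """Return (candidate, found) pairs in lookup order for steps 210 to 240.

    Hypothetical helper; `lexicon` stands in for the vocabulary database 110.
    """
    if not text:
        return []
    trace = []
    start, end = 0, min(2, len(text))        # step 210: first two characters
    while True:
        candidate = text[start:end]
        found = candidate in lexicon          # steps 220/230: matching lookup
        trace.append((candidate, found))
        if end >= len(text):                  # step 240: candidate holds the final character
            return trace
        if found:
            end += 1                          # extend into an accumulated combination
        else:
            start, end = end - 1, end + 1     # last character plus the next character


if __name__ == "__main__":
    # Assumed database contents: only "喜歡" and "笑容" are reported as found.
    for candidate, found in candidate_trace("我最喜歡你的笑容", {"喜歡", "笑容"}):
        print(candidate, found)
```

Only seven candidates are looked up for this eight-character string, and the two hits are the meaningful words; the single characters in the final result come from the recognition step, which this sketch leaves out.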
As shown in Fig. 3, the Chinese string segmentation system 100 further comprises, compared with the foregoing, a recognition module 140 and a setting module 150. The recognition module 140 is as described above. The setting module 150 holds an extraction start point, an extraction character count and a vocabulary record. Depending on whether the lookup by the lookup module 130 succeeds, it either increments the extraction character count and marks the vocabulary record, or advances the extraction start point and resets the extraction character count and the vocabulary record.

Taking the earlier example "星期一天氣晴", refer to Fig. 3, Fig. 4 and Fig. 5 together. The setting module 150 presets the extraction start point 510 to 0 and the extraction character count 520 to 2, and leaves the vocabulary record unmarked. First, the extraction module 120, taking the preset extraction start point 510 as the base, extracts from the Chinese string 101 a string matching the extraction character count 520 as the first combination, namely "星期" (step 410). When the lookup module 130 confirms that the vocabulary database 110 contains the first combination "星期" (step 420), it further confirms that "星期" does not contain the final character "晴" of the Chinese string 101 (step 430). Based on the lookup result for "星期", the setting module 150 increments the extraction character count 520 to 3 and marks the vocabulary record (step 440). Using the extraction start point 510 and the extraction character count 520, the extraction module 120 extracts the string between positions 0 and 3, "星期一", as the first accumulated combination (step 410).

Next, when the lookup module 130 confirms that the vocabulary database 110 contains the first accumulated combination "星期一" (step 420), it confirms that "星期一" does not contain the final character "晴" (step 430). Based on the lookup result for "星期一", the setting module 150 increments the extraction character count 520 to 4 and keeps the vocabulary record marked (step 440). The extraction module 120 extracts the string between positions 0 and 4, "星期一天", as the next first accumulated combination (step 410). The lookup module 130 confirms that the vocabulary database 110 does not contain "星期一天" (step 420). Having confirmed that the vocabulary record is marked, the recognition module 140 takes no action (step 450). Because the lookup found nothing, the setting module 150 sets the extraction start point 510 to 3 (0 + 4 - 1), resets the extraction character count 520 to 2 and clears the mark on the vocabulary record (step 460). The extraction module 120 extracts the string between positions 3 and 5, "天氣", as the second combination (step 410).

When the lookup module 130 confirms that the vocabulary database 110 contains the second combination "天氣" (step 420), it confirms that "天氣" does not contain the final character "晴" (step 430). The setting module 150 increments the extraction character count 520 to 3 and marks the vocabulary record (step 440). The extraction module 120 extracts the string between positions 3 and 6, "天氣晴", as the second accumulated combination (step 410). When the lookup module 130 confirms that the vocabulary database 110 does not contain "天氣晴" (step 420), the recognition module 140, having confirmed that the vocabulary record is marked, takes no action (step 450). Because the lookup of "天氣晴" found nothing, the setting module 150 sets the extraction start point 510 to 5 (3 + 3 - 1), resets the extraction character count 520 to 2 and clears the mark on the vocabulary record (step 460). The extraction module 120 extracts the string between positions 5 and 7, "晴", as the third combination (step 410). When the lookup module 130 finds no "晴" in the vocabulary database 110 (step 420), the recognition module 140, because the vocabulary record is unmarked, recognizes "晴" as the third segmented character (step 470). The lookup module 130 confirms that "晴" contains the final character "晴" (step 480) and ends the lookup, giving the segmentation result "星期", "星期一", "天氣" and "晴". If the lookup module 130 instead confirms that a segmented character does not contain the final character (step 480), the setting module 150 increments the extraction start point 510 and resets the extraction character count 520 to 2 (step 490), and step 410 is repeated. In Fig. 5, "星期一", "天氣" and "晴" are the long-word-first segmentation result for this example, where "星期一" 530 is the first accumulated combination, "天氣" 540 is the second combination and "晴" 550 is the third segmented character. This long-word-first result is suited to translation between Chinese and foreign languages.

In summary, the present invention differs from the prior art in that, when the vocabulary database is confirmed to contain a combination extracted from the Chinese string, the combination and its next character are extracted as an accumulated combination and the lookup continues; otherwise, the last character of the combination and its next character are looked up. This technical means solves the problem of the prior art and reduces the load on the system while balancing segmentation speed against the original meaning of the sentence.

Although the embodiments of the present invention are disclosed above, the content described is not intended to directly limit the scope of patent protection of the invention. Any person of ordinary skill in the technical field of the invention may make minor changes in the form and details of implementation without departing from the spirit and scope disclosed by the invention. The scope of patent protection of the invention shall be as defined by the appended claims.

[Brief Description of the Drawings]

Fig. 1 is a block diagram of the Chinese string segmentation system of the present invention.
Fig. 2 is a flowchart of the steps of the Chinese string segmentation method of the present invention.
Fig. 3 is a block diagram of the Chinese string segmentation system of the present invention with an added recognition module and setting module.
Fig. 4 is a flowchart of the steps of the Chinese string segmentation method of the present invention with an added recognition step and setting step.
Fig. 5 is a schematic diagram of the long-word-first segmentation result for a Chinese string of the present invention.

[Description of Main Reference Numerals]

100 Chinese string segmentation system
101 Chinese string
110 Vocabulary database
120 Extraction module
130 Lookup module
140 Recognition module
150 Setting module
510 Extraction start point
520 Extraction character count
530 "星期一"
540 "天氣"
550 "晴"
Step 210 Extract the first and second characters of a Chinese string as a first combination.
Step 220 Look up the first combination in the vocabulary database. If the database contains it, extract the first combination and its next character as a first accumulated combination; otherwise, extract the last character of the first combination and its next character as a second combination, and look up the second combination in the vocabulary database in turn.
Step 230 Look up the first accumulated combination in the vocabulary database. If the database contains it, extract the first accumulated combination and its next character as a next first accumulated combination; otherwise, extract the last character of the first accumulated combination and its next character as the second combination, and look up the second combination in the vocabulary database again.
Step 240 Continue in the same way until a combination or accumulated combination contains the final character of the Chinese string.
Step 410 Taking the extraction start point as the base, extract from the Chinese string one or more strings matching the extraction character count.
Step 420 Confirm that the vocabulary database contains the strings.
Step 470 Recognize the first character of the strings as a segmented character.
Step 480 Confirm that the segmented character contains the final character.
Step 490 Increment the extraction start point and set the extraction character count to 2.
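The Fig. 4 flow (steps 410 to 490, as walked through in the description) can be sketched with the three values the setting module 150 keeps: an extraction start point, an extraction character count and a vocabulary record flag. This is a minimal sketch under stated assumptions, not the patent's implementation. The names `segment` and `longest_first` are hypothetical, the vocabulary database is modelled as a Python set whose contents are inferred from the lookups the examples report as successful, and taking the latest match per start point as the long-word-first result is one reading of Fig. 5.

```python
def segment(text, lexicon):
    """Return (start, token) pairs in discovery order, following steps 410 to 490.

    Hypothetical helper; `lexicon` stands in for the vocabulary database 110.
    """
    start, count, flagged = 0, 2, False           # presets: start 0, count 2, record unmarked
    found = []
    while start < len(text):
        candidate = text[start:start + count]     # step 410
        reaches_end = start + len(candidate) >= len(text)
        if candidate in lexicon:                  # step 420: lookup succeeded
            found.append((start, candidate))
            if reaches_end:                       # step 430: candidate holds the final character
                break
            count, flagged = count + 1, True      # step 440: lengthen and mark the record
        elif flagged:                             # step 450: an extension failed, no recognition
            start, count, flagged = start + count - 1, 2, False   # step 460
        else:
            found.append((start, candidate[0]))   # step 470: first character becomes a token
            if start == len(text) - 1:            # step 480: it is the final character
                break
            start, count = start + 1, 2           # step 490
    return found


def longest_first(found):
    """Keep the last (longest) token recorded for each start point (assumed reading of Fig. 5)."""
    best = {}
    for start, token in found:
        best[start] = token
    return [best[s] for s in sorted(best)]


if __name__ == "__main__":
    # Assumed database contents for the "星期一天氣晴" walkthrough.
    found = segment("星期一天氣晴", {"星期", "星期一", "天氣"})
    print([token for _, token in found])
    print(longest_first(found))
```

Each pass through the loop either lengthens the candidate by one character or moves the start point forward, so the number of lookups stays proportional to the length of the string.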

Claims (1)

VII. Claims:

1. A word segmentation system for a Chinese character string, comprising:
a word database containing a plurality of words;
an extraction module that extracts the first through second characters of a Chinese character string as a first combination and then, depending on whether the subsequent lookup of the first combination finds a match, either extracts the first combination together with its next character as a first cumulative combination, or extracts the last character of the first combination together with its next character as a second combination; and
a lookup module that performs, in order, a matching lookup of the first combination in the word database, and a matching lookup of the first cumulative combination or the second combination;
wherein, when the lookup module confirms that the word database contains the first cumulative combination, the extraction module extracts the first cumulative combination together with its next character as a next first cumulative combination; otherwise, it extracts the last character of the first cumulative combination together with its next character as the second combination; the lookup module then performs the matching lookup of the next first cumulative combination or of the second combination, and so on, until a combination or cumulative combination contains the final character of the Chinese character string.

2. The word segmentation system for a Chinese character string according to claim 1, further comprising an identification module that, when the lookup module confirms that the word database does not contain the first combination/second combination, identifies the first character of the first combination/second combination as a first segmented character/second segmented character.

3. A word segmentation method for a Chinese character string, in which a word database containing a plurality of words is built in advance, the method comprising the following steps:
extracting the first through second characters of a Chinese character string as a first combination;
performing a matching lookup of the first combination in the word database and, when the word database is confirmed to contain the first combination, extracting the first combination together with its next character as a first cumulative combination; otherwise, extracting the last character of the first combination together with its next character as a second combination, and performing a matching lookup of the second combination in the word database;
performing a matching lookup of the first cumulative combination in the word database and, when the word database is confirmed to contain the first cumulative combination, extracting the first cumulative combination together with its next character as a next first cumulative combination; otherwise, extracting the last character of the first cumulative combination together with its next character as the second combination, and performing a matching lookup of the second combination in the word database; and
repeating in the same manner until a combination or cumulative combination contains the final character of the Chinese character string.

4. The word segmentation method for a Chinese character string according to claim 3, further comprising, when the word database is confirmed not to contain the first combination/second combination, a step of identifying the first character of the first combination/second combination as a first segmented character/second segmented character.
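
The lookup loop recited in claims 1 through 4 can be illustrated with a minimal Python sketch. It is not the patentee's implementation. The names `segment` and `lexicon` are hypothetical, and two behaviors are assumptions that the claims do not spell out: a combination that was matched before a failed extension is emitted as a segmented word, and a final character left with no following character is emitted on its own.

```python
def segment(text, lexicon):
    """Illustrative sketch of the claimed combination / cumulative-combination loop.

    text    -- the Chinese character string to segment
    lexicon -- a set standing in for the word database (hypothetical name)
    """
    if len(text) < 2:
        return [text] if text else []

    segments = []
    start, end = 0, 2  # the first combination is the first through second characters

    while True:
        combo = text[start:end]
        if combo in lexicon:
            if end == len(text):
                # The combination contains the final character: stop here.
                segments.append(combo)
                break
            # Found: extend by the next character (cumulative combination).
            end += 1
            continue

        if end - start > 2:
            # A cumulative combination failed. Assumption: the shorter
            # combination that matched just before it is emitted as a word.
            segments.append(text[start:end - 1])
        else:
            # A two-character combination failed: its first character is
            # identified as a segmented character (claims 2 and 4).
            segments.append(text[start])

        # Restart from the last character of the failed combination.
        start = end - 1
        if end == len(text):
            # Assumption: no next character remains, so emit it alone.
            segments.append(text[start])
            break
        end = start + 2

    return segments


print(segment("中文分詞系統", {"中文", "分詞", "系統"}))
# → ['中文', '分詞', '系統']
```

In this sketch each restart begins at the last character of the failed combination, so the number of lookups grows linearly with the length of the string. One consequence of extending a combination one character at a time is that a word of three or more characters is reached only if each of its shorter prefixes (from two characters up) is also present in `lexicon`.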
TW98110770A 2009-03-31 2009-03-31 Chinese word segmentation syatem and method thereof TW201035783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW98110770A TW201035783A (en) 2009-03-31 2009-03-31 Chinese word segmentation syatem and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW98110770A TW201035783A (en) 2009-03-31 2009-03-31 Chinese word segmentation syatem and method thereof

Publications (1)

Publication Number Publication Date
TW201035783A true TW201035783A (en) 2010-10-01

Family

ID=44855985

Family Applications (1)

Application Number Title Priority Date Filing Date
TW98110770A TW201035783A (en) 2009-03-31 2009-03-31 Chinese word segmentation syatem and method thereof

Country Status (1)

Country Link
TW (1) TW201035783A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI385545B (en) * 2011-03-04 2013-02-11 Rakuten Inc A collective expansion processing apparatus, a collective expansion processing method, a program, and a non-temporary recording medium
CN104077275A (en) * 2014-06-27 2014-10-01 北京奇虎科技有限公司 Method and device for performing word segmentation based on context

