TW517195B

TW517195B - Learning method and system for new vocabularies in computer

Info

Publication number: TW517195B
Application number: TW89123412A
Authority: TW
Inventors: Li-Wei Yang
Original assignee: Eland Technologies Co Ltd
Priority date: 2000-11-06
Filing date: 2000-11-06
Publication date: 2003-01-11

Abstract

The invention provides a method and system by which a computer learns new words. The method includes a subword recording procedure, a first deletion procedure, and a second deletion procedure. The subword recording procedure decomposes at least one wordless part recorded in a wordless set into at least one subword and records the resulting subwords in a subword set; a wordless part is a portion of a computer-readable document in which no adjacent single characters form a word the computer can recognize. The first deletion procedure counts the occurrences of each subword and deletes from the subword set every subword whose occurrence count is less than a predetermined value. The second deletion procedure sequentially selects a first subword and a different second subword from the subword set; when the first subword is contained in the second subword and the occurrence count of the first subword is not greater than that of the second subword, the first subword is deleted from the subword set. New words recognizable by the computer are then generated from the subword set. The invention also discloses a computer new word learning system that implements the method.

Description

V. Description of the Invention

[Field of the Invention] The present invention relates to a computer new word learning method and system, and more particularly to a method and system that analyze the parts of a document in which no adjacent single characters form a computer-recognizable word, thereby obtaining new words and increasing the number of words the computer can recognize.

[Prior Art] In documents written in alphabetic languages such as English, French, or German, words are separated by spaces, so a sentence does not have to be segmented before its meaning can be understood. In documents written in Chinese, Japanese, or Korean, however, no spaces separate the characters. If the content of such a document cannot be segmented, its real meaning cannot be determined and it will be misread. "Word segmentation" (斷詞) means dividing a sentence made up of a string of Chinese, Japanese, or Korean characters into meaningful words. Many language processing applications, such as proofreading, translation, or speech recognition, must segment a document before any further processing can take place.

With the development of computer technology, methods and systems for segmenting words by computer have appeared. Compared with manual segmentation, computer segmentation greatly reduces the time required. Its difficulty, however, is that when the computer meets a word it cannot recognize, it cannot handle that word properly unless new words are entered manually.

[Summary of the Invention] In view of the above problem, one object of the invention is to provide a computer new word learning method and system that can automatically learn the new words in a document. Another object of the invention is to provide a computer new word learning method and system that can serve as a basis for updating and maintaining a computer vocabulary database.

To achieve these objects, the computer new word learning method according to the invention includes a subword recording procedure, a first deletion procedure, and a second deletion procedure. The subword recording procedure decomposes at least one wordless part recorded in a wordless set into at least one subword and records the resulting subwords in a subword set, where a wordless part is a part of a computer-readable document in which no adjacent single characters form a computer-recognizable word. The first deletion procedure counts the occurrences of each subword and deletes from the subword set every subword whose occurrence count is less than a predetermined value. The second deletion procedure sequentially selects a first subword and a different second subword from the subword set; when the first subword is contained in the second subword and the occurrence count of the first subword is not greater than that of the second subword, the first subword is deleted from the subword set, so that computer-recognizable new words are generated from the subword set.

The invention also provides a computer new word learning system that includes a subword recording module, a first deletion module, and a second deletion module. The subword recording module decomposes at least one wordless part recorded in a wordless set into at least one subword and records the subword in a subword set, where a wordless part is a part of a computer-readable document in which no adjacent single characters form a computer-recognizable word. The first deletion module counts the occurrences of each subword and deletes from the subword set every subword whose occurrence count is less than a predetermined value. The second deletion module sequentially selects a first subword and a different second subword from the subword set; when the first subword is contained in the second subword and the occurrence count of the first subword is not greater than that of the second subword, it deletes the first subword from the subword set, so that computer-recognizable new words are generated from the subword set.

[Detailed Description of the Preferred Embodiment] The computer new word learning method and system according to a preferred embodiment of the invention are described below with reference to the drawings, in which the same elements are denoted by the same reference numerals.

Referring to Fig. 1, the computer new word learning method 1 according to the preferred embodiment first performs a word recognition procedure 11, which segments a computer-readable document 51. As above, "word segmentation" means dividing a sentence made up of a string of Chinese, Japanese, or Korean characters into meaningful words. In this embodiment, a conventional dictionary-based segmentation method is used; that is, the candidate segmentations of the document are evaluated with lexical criteria such as word length to obtain the best result. It should be noted, however, that those skilled in the art may segment the document with other methods without departing from the invention.

Next, a wordless part recording procedure 12 records the wordless parts of document 51 in a wordless set 52. In the invention, a "wordless part" is a part of document 51 in which no adjacent single characters form a computer-recognizable word. For example, suppose document 51 contains the sentence "王明昨天拜訪李小華" ("Wang Ming visited Li Xiaohua yesterday"). In the two parts "王明" and "李小華", none of the character combinations (for "李小華", the three combinations "李小", "小華", and "李小華") can be recognized by the system. The segmentation result for this sentence therefore leaves "王明" and "李小華" as two wordless parts.

Then, in the subword recording procedure 13, each wordless part in wordless set 52 is decomposed into at least one subword, and the resulting subwords are recorded in a subword set 53.
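The word recognition procedure 11 and the wordless part recording procedure 12 can be sketched roughly as follows. This is only an illustration under stated assumptions: the patent does not name its dictionary-based segmenter, so forward maximum matching is swapped in here, and the names `segment`, `wordless_parts`, `lexicon`, and `max_len` are hypothetical. The sketch also assumes that a wordless part is a run of two or more consecutive single characters, and it ignores punctuation.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching (an assumed stand-in for procedure 11).

    At each position, take the longest multi-character lexicon word;
    otherwise emit the single character as an unrecognized token.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in lexicon:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            tokens.append(text[i])  # no lexicon word starts here
            i += 1
    return tokens


def wordless_parts(tokens):
    """Procedure 12 (sketch): collect maximal runs of single characters.

    Runs shorter than two characters are dropped, since they cannot
    yield any subword (an assumption of this sketch).
    """
    parts, run = [], ""
    for tok in tokens + [""]:  # the empty sentinel flushes the last run
        if len(tok) == 1:
            run += tok
        else:
            if len(run) >= 2:
                parts.append(run)
            run = ""
    return parts


lexicon = {"昨天", "拜訪"}  # toy lexicon for the example sentence
tokens = segment("王明昨天拜訪李小華", lexicon)
print(wordless_parts(tokens))  # ['王明', '李小華']
```

With the toy lexicon above, the two unrecognized names come out as the two wordless parts described in the example sentence.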

Taking the earlier sentence "王明昨天拜訪李小華" as an example, the subword recording procedure 13 decomposes the wordless part "李小華" into three subwords, "李小", "小華", and "李小華", and records them in subword set 53. In other words, the subword recording procedure 13 extracts every possible subword from each wordless part.

Next, the first deletion procedure 14 counts the occurrences of each subword and deletes from the subword set every subword whose occurrence count is less than a predetermined value. In this procedure, the "occurrence count" of a subword is the number of times it occurs in the original wordless set 52, that is, the number of times it occurs in subword set 53. Those skilled in the art may choose any counting method that suits the actual situation.

When a subword occurs too few times in the whole of document 51, its appearance in the document is only incidental, so it can be deleted. For example, if "王明" occurs only once in the whole of document 51, it is evidently not a new word worth recording but merely an incidental subword. By contrast, if "李小華" occurs ten times, it occurs often relative to document 51 and is therefore a new word worth recording.

The predetermined value can be set according to the actual situation. For example, the default can be changed by manual input, or the value can be adjusted dynamically according to the number of characters in document 51. In this way, different new word learning criteria can be set for different documents.
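A minimal sketch of the subword recording procedure 13 and the first deletion procedure 14 follows. It assumes that "every possible subword" means every contiguous substring of two or more characters, which matches the "李小華" example, and that counts are taken over the wordless set. The function names and the default `threshold=2` (the value given in claim 7) are choices made for this illustration.

```python
from collections import Counter


def subwords(part):
    """All contiguous substrings of length >= 2 of one wordless part."""
    n = len(part)
    return [part[i:j] for i in range(n) for j in range(i + 2, n + 1)]


def count_subwords(wordless_set):
    """Procedure 13 (sketch): build the subword set with occurrence counts."""
    counts = Counter()
    for part in wordless_set:
        counts.update(subwords(part))
    return counts


def first_deletion(counts, threshold=2):
    """Procedure 14 (sketch): drop subwords seen fewer than `threshold` times."""
    return {w: c for w, c in counts.items() if c >= threshold}


counts = count_subwords(["王明", "李小華", "李小華"])
print(first_deletion(counts))  # {'李小': 2, '李小華': 2, '小華': 2}
```

Here "王明" occurs once and is discarded, while the three subwords of "李小華" survive with a count of two each.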

The second deletion procedure 15 first selects, in sequence, a first subword and a different second subword from subword set 53. Then, when the first subword is contained in the second subword and the occurrence count of the first subword is not greater than that of the second subword, the first subword is deleted from the subword set. For example, when "李小" is selected as the first subword and "李小華" as the second, "李小" is contained in "李小華", and in this example the occurrence count of "李小" equals that of "李小華". In this case "李小" is deleted from the subword set and "李小華" is kept. In the same way, "小華" is also contained in "李小華", so it too is deleted from the subword set. Deleting the shorter subwords that are contained in longer ones in this way retains the longer subwords.

Next, in the judgment procedure 16, if subword set 53 is an empty set, that is, if the first deletion procedure 14 and the second deletion procedure 15 have deleted every subword, the whole flow of the computer new word learning method 1 according to the preferred embodiment ends.
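The second deletion procedure 15 can be sketched as below. The function name and the dictionary representation of the subword set are assumptions of this illustration. The sketch compares every subword against a snapshot of the set rather than deleting while iterating; because substring containment is transitive and the count condition chains, this should give the same survivors as deleting one pair at a time.

```python
def second_deletion(counts):
    """Procedure 15 (sketch): drop a subword when a different, longer
    subword contains it and occurs at least as often."""
    survivors = dict(counts)
    for first, first_count in counts.items():
        for second, second_count in counts.items():
            if first != second and first in second and first_count <= second_count:
                survivors.pop(first, None)
                break
    return survivors


print(second_deletion({"李小": 10, "小華": 10, "李小華": 10}))  # {'李小華': 10}
```

If the shorter subword occurs more often than the longer one (for instance "李小" twelve times against "李小華" ten times), both are kept, since the shorter one evidently also appears outside the longer one.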
If subword set 53 still contains subwords, the third deletion procedure 17 keeps only the subword with the highest occurrence count and deletes all the other subwords, which occur less often. As a result, each pass produces only one new word.

After the new word is generated, the wordless part splitting procedure 18 removes from wordless set 52 each wordless part that contains the new word and turns the portions of that part other than the new word into new wordless parts. That is, when two or more characters precede the new word in the wordless part, the portion before the new word is treated as another wordless part and added to wordless set 52; when two or more characters follow the new word, the portion after the new word is likewise treated as another wordless part and added to wordless set 52.

For example, suppose document 51 contains another sentence in which no word is recognizable by the computer, so that the whole sentence forms a single wordless part, and that this wordless part includes the newly generated word "李小華". In the wordless part splitting procedure 18, this wordless part is split around the new word "李小華" into two new wordless parts: the portion before "李小華" and the portion after it.

After that, the subword set emptying procedure 19 clears subword set 53, and the flow returns to the subword recording procedure 13 to decompose subwords again.

With this flow, all possible new words in document 51 can be found without affecting the words the computer could already recognize in document 51. The method can therefore segment computer-readable documents more appropriately.
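Putting procedures 13 through 19 together, one possible reading of the whole loop is sketched below. All names are hypothetical. Two details are assumptions that the text does not settle: ties in the third deletion are broken in favour of the longer subword, and a wordless part containing the new word more than once is split at every occurrence.

```python
from collections import Counter


def subwords(part):
    """All contiguous substrings of length >= 2."""
    n = len(part)
    return [part[i:j] for i in range(n) for j in range(i + 2, n + 1)]


def pick_new_word(wordless_set, threshold=2):
    """Procedures 13 to 17 (sketch): return this round's new word, or None."""
    counts = Counter(s for part in wordless_set for s in subwords(part))
    # First deletion: drop rare subwords.
    frequent = {w: c for w, c in counts.items() if c >= threshold}
    # Second deletion: drop subwords covered by a longer, equally frequent one.
    kept = {
        w: c for w, c in frequent.items()
        if not any(w != o and w in o and c <= oc for o, oc in frequent.items())
    }
    if not kept:  # judgment procedure: nothing left, stop
        return None
    # Third deletion: keep the most frequent subword (tie-break is assumed).
    return max(kept, key=lambda w: (kept[w], len(w), w))


def split_parts(wordless_set, new_word):
    """Procedure 18 (sketch): remove parts containing the new word and keep
    the leftover pieces that are at least two characters long."""
    result = []
    for part in wordless_set:
        if new_word not in part:
            result.append(part)
        else:
            result.extend(p for p in part.split(new_word) if len(p) >= 2)
    return result


def learn_new_words(wordless_set, threshold=2):
    """Repeat until the subword set comes out empty (procedures 16 and 19)."""
    parts, learned = list(wordless_set), []
    while True:
        word = pick_new_word(parts, threshold)
        if word is None:
            return learned
        learned.append(word)
        parts = split_parts(parts, word)


learned = learn_new_words(["王明", "李小華", "甲乙李小華丙丁", "甲乙", "丙丁"])
print(learned[0])  # 李小華
```

In this toy input, "李小華" is learned first; splitting "甲乙李小華丙丁" then releases "甲乙" and "丙丁" as new wordless parts, which are learned in later rounds, while "王明" never reaches the threshold. The loop terminates because every round removes at least the characters of the new word from the wordless set.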


As for the generated new word, it can be made a new computer-recognizable word depending on the actual situation. For example, if the generated new word is "電子商務" ("e-commerce"), a term that has come into use only in recent years, one may consider adding it to the computer vocabulary database so that it becomes a new computer-recognizable word. This helps keep the computer vocabulary database updated and maintained.

Referring to Fig. 2, the computer new word learning system 2 according to the preferred embodiment of the invention includes a word recognition module 21, a wordless part recording module 22, a subword recording module 23, a first deletion module 24, a second deletion module 25, a third deletion module 26, and a wordless part splitting module 27. In this embodiment, each module is software stored on a hard disk or similar storage; a central processing unit reads and executes it to find the new words in document 51. It should be noted, however, that those skilled in the art may make equivalent modifications, such as implementing the modules in an electronic device, to carry out word segmentation and new word generation without going beyond the scope of the invention.

The computer new word learning system 2 may retrieve document 51 from another network server over the Internet, or read document 51 from a server by other means. The vocabulary database 30 required for word recognition may be stored in a computer-readable memory device or storage medium so that the computer new word learning system 2 can access it. New words generated by the computer new word learning system 2 may also be added to vocabulary database 30 to maintain and update it.

The computer new word learning method and system according to the invention use computer technology to segment computer-readable documents so that the sentences in a document are correctly divided into meaningful words. This benefits many language processing applications and the further development of technologies such as proofreading, translation, and speech recognition. The method and system can automatically learn the new words in a document so that the document is segmented properly, and they can serve as a basis for updating and maintaining a computer vocabulary database.

The above description is illustrative only and not limiting. Any equivalent modification or change made without departing from the spirit and scope of the invention shall be covered by the appended claims.

[Brief Description of the Drawings]
Fig. 1 is a flowchart showing the flow of the computer new word learning method according to the preferred embodiment of the invention.
Fig. 2 is a schematic diagram showing the architecture of the computer new word learning system according to the preferred embodiment of the invention.

[Description of Reference Numerals]
1 computer new word learning method
11 word recognition procedure
12 wordless part recording procedure
13 subword recording procedure
14 first deletion procedure
15 second deletion procedure
16 judgment procedure
17 third deletion procedure
18 wordless part splitting procedure
19 subword set emptying procedure
2 computer new word learning system
21 word recognition module
22 wordless part recording module
23 subword recording module
24 first deletion module
25 second deletion module
26 third deletion module
27 wordless part splitting module
30 vocabulary database
51 document
52 wordless set
53 subword set

Claims

VI. Scope of Patent Application

1. A computer new word learning method, comprising the following procedures:
a subword recording procedure, which decomposes at least one wordless part recorded in a wordless set into at least one subword and records the subword in a subword set, wherein a wordless part is a part of a computer-readable document in which no adjacent single characters form a computer-recognizable word;
a first deletion procedure, which counts the occurrences of each subword and deletes from the subword set each subword whose occurrence count is less than a predetermined value; and
a second deletion procedure, which sequentially selects a first subword and a different second subword from the subword set and, when the first subword is contained in the second subword and the occurrence count of the first subword is not greater than that of the second subword, deletes the first subword from the subword set, whereby computer-recognizable new words are generated from the subword set.

2. The computer new word learning method according to claim 1, further comprising:
a word recognition procedure, which performs word recognition processing on the document; and
a wordless part recording procedure, which, when the document has at least one wordless part, records the wordless part in the wordless set.

3. The computer new word learning method according to claim 1, further comprising:
a judgment procedure, which judges whether the subword set is an empty set and ends the flow of the computer new word learning method when the subword set is an empty set.

4. The computer new word learning method according to claim 1, further comprising:
a third deletion procedure, which, after the second deletion procedure, deletes from the subword set every subword other than the one with the highest occurrence count.

5. The computer new word learning method according to claim 1, further comprising a wordless part splitting procedure, which includes:
removing the wordless part containing the new word from the wordless set;
when the number of characters before the new word in the wordless part containing the new word is two or more, treating the portion of the wordless part before the new word as another wordless part and adding it to the wordless set; and
when the number of characters after the new word in the wordless part containing the new word is two or more, treating the portion of the wordless part after the new word as another wordless part and adding it to the wordless set.

6. The computer new word learning method according to claim 1, further comprising:
a subword set emptying procedure, which clears the subword set and returns to the subword recording procedure.

7. The computer new word learning method according to claim 1, wherein the predetermined value is 2.

8. A computer new word learning system, comprising:
a subword recording module, which decomposes at least one wordless part recorded in a wordless set into at least one subword and records the subword in a subword set, wherein a wordless part is a part of a computer-readable document in which no adjacent single characters form a computer-recognizable word;
a first deletion module, which counts the occurrences of each subword and deletes from the subword set each subword whose occurrence count is less than a predetermined value; and
a second deletion module, which sequentially selects a first subword and a different second subword from the subword set and, when the first subword is contained in the second subword and the occurrence count of the first subword is not greater than that of the second subword, deletes the first subword from the subword set, whereby computer-recognizable new words are generated from the subword set.

9. The computer new word learning system according to claim 8, further comprising:
a word recognition module, which performs word recognition processing on the document; and
a wordless part recording module, which, when the document has at least one wordless part, records the wordless part in the wordless set.

10. The computer new word learning system according to claim 8, further comprising:
a third deletion module, which deletes from the subword set every subword other than the one with the highest occurrence count.

11. The computer new word learning system according to claim 8, further comprising:
a wordless part splitting module, which removes the wordless part containing the new word from the wordless set; when the number of characters before the new word in the wordless part containing the new word is two or more, treats the portion of the wordless part before the new word as another wordless part and adds it to the wordless set; and when the number of characters after the new word in the wordless part containing the new word is two or more, treats the portion of the wordless part after the new word as another wordless part and adds it to the wordless set.

12. The computer new word learning system according to claim 8, wherein the predetermined value is 2.

13. A computer new word learning system, comprising:
a central processing unit; and
a storage device storing at least one program code, such that the central processing unit, after reading the program code, executes the following procedures:
a subword recording procedure, which decomposes at least one wordless part recorded in a wordless set into at least one subword and records the subword in a subword set, wherein a wordless part is a part of a computer-readable document in which no adjacent single characters form a computer-recognizable word;
a first deletion procedure, which counts the occurrences of each subword and deletes from the subword set each subword whose occurrence count is less than a predetermined value; and
a second deletion procedure, which sequentially selects a first subword and a different second subword from the subword set and, when the first subword is contained in the second subword and the occurrence count of the first subword is not greater than that of the second subword, deletes the first subword from the subword set, whereby computer-recognizable new words are generated from the subword set.

14. The computer new word learning system according to claim 13, wherein the central processing unit, after reading the program code, further executes:
a word recognition procedure, which performs word recognition processing on the document; and
a wordless part recording procedure, which, when the document has at least one wordless part, records the wordless part in the wordless set.

15. The computer new word learning system according to claim 13, wherein the central processing unit, after reading the program code, further executes:
a judgment procedure, which judges whether the subword set is an empty set and ends the flow of the computer new word learning method when the subword set is an empty set.

16. The computer new word learning system according to claim 13, wherein the central processing unit, after reading the program code, further executes:
a third deletion procedure, which deletes from the subword set every subword other than the one with the highest occurrence count.

17. The computer new word learning system according to claim 13, wherein the central processing unit, after reading the program code, further executes a wordless part splitting procedure, which includes:
removing the wordless part containing the new word from the wordless set;
when the number of characters before the new word in the wordless part containing the new word is two or more, treating the portion of the wordless part before the new word as another wordless part and adding it to the wordless set; and
when the number of characters after the new word in the wordless part containing the new word is two or more, treating the portion of the wordless part after the new word as another wordless part and adding it to the wordless set.

18. The computer new word learning system according to claim 13, wherein the central processing unit, after reading the program code, further executes:
a subword set emptying procedure, which clears the subword set and returns to the subword recording procedure.

19. The computer new word learning system according to claim 13, wherein the predetermined value is 2.