TW517195B - Learning method and system for new vocabularies in computer - Google Patents

Publication number
TW517195B
Authority
TW
Taiwan
Prior art keywords
word
wordless
subword
sub
computer
Application number
TW89123412A
Other languages
Chinese (zh)
Inventor
Li-Wei Yang
Original Assignee
Eland Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2000-11-06
Filing date
2000-11-06
Publication date
2003-01-11
Application filed by Eland Technologies Co Ltd
Priority to TW89123412A
Application granted
Publication of TW517195B

Abstract

The invention provides a method and system by which a computer learns new words. The method includes a subword recording procedure, a first deletion procedure, and a second deletion procedure. The subword recording procedure decomposes at least one wordless part recorded in a wordless set into at least one subword and records the resulting subwords in a subword set, where a wordless part is a part of a computer-readable document in which no combination of adjacent characters forms a word recognizable by the computer. The first deletion procedure counts the occurrences of each subword and deletes from the subword set every subword whose count is less than a predetermined value. The second deletion procedure sequentially selects a first subword and a different second subword from the subword set; when the first subword is contained in the second subword and the count of the first subword is not greater than that of the second subword, the first subword is deleted from the subword set. New words recognizable by the computer are then generated from the subword set. The invention also discloses a system that implements the method.

Description

[Field of the Invention]

The present invention relates to a computer new-word learning method and system, and more particularly to a method and system that analyze those parts of a document in which no combination of adjacent characters forms a computer-recognizable word, thereby obtaining new words and increasing the number of words the computer can recognize.

[Prior Art]

In documents written in alphabetic languages such as English, French, or German, words are separated by spaces, so a sentence does not have to be segmented before its meaning can be understood. In documents written in Chinese, Japanese, or Korean, however, no spaces separate the words. If the content of such a document cannot be segmented, its real meaning cannot be determined and it may be misread.

"Word segmentation" (斷詞) means dividing a sentence made up of a string of Chinese, Japanese, or Korean characters into meaningful words. Many language-processing applications, such as proofreading, translation, and speech recognition, require a document to be segmented before any further processing can take place.

With the development of computer technology, methods and systems that segment words by computer have appeared. Compared with manual segmentation, computer segmentation greatly reduces the time required.

However, one difficulty of computer word segmentation is that when the computer encounters a word it cannot recognize, it cannot handle that word properly unless the new word is entered manually.

[Summary of the Invention]

In view of the above problem, an object of the invention is to provide a computer new-word learning method and system that can automatically learn the new words in a document.

Another object of the invention is to provide a computer new-word learning method and system that can serve as a basis for updating and maintaining a computer vocabulary database.

To achieve these objects, the computer new-word learning method according to the invention includes a subword recording procedure, a first deletion procedure, and a second deletion procedure. The subword recording procedure decomposes at least one wordless part recorded in a wordless set into at least one subword and records the resulting subwords in a subword set, where a wordless part is a part of a computer-readable document in which no combination of adjacent characters forms a computer-recognizable word. The first deletion procedure counts the occurrences of each subword and deletes from the subword set every subword whose number of occurrences is less than a predetermined value. The second deletion procedure sequentially selects a first subword and a different second subword from the subword set; when the first subword is contained in the second subword and the number of occurrences of the first subword is not greater than that of the second subword, the first subword is deleted from the subword set, so that computer-recognizable new words are generated from the subword set.

The invention also provides a computer new-word learning system that includes a subword recording module, a first deletion module, and a second deletion module. The subword recording module decomposes at least one wordless part recorded in a wordless set into at least one subword and records the subwords in a subword set, where a wordless part is a part of a computer-readable document in which no combination of adjacent characters forms a computer-recognizable word. The first deletion module counts the occurrences of each subword and deletes from the subword set every subword whose number of occurrences is less than a predetermined value. The second deletion module sequentially selects a first subword and a different second subword from the subword set; when the first subword is contained in the second subword and the number of occurrences of the first subword is not greater than that of the second subword, the first subword is deleted from the subword set, so that computer-recognizable new words are generated from the subword set.

[Detailed Description of the Preferred Embodiment]

A computer new-word learning method and system according to a preferred embodiment of the invention are described below with reference to the drawings, in which like elements are denoted by like reference numerals.

Referring to FIG. 1, the computer new-word learning method 1 according to the preferred embodiment first performs a word recognition procedure 11 to segment a computer-readable document 51. Word segmentation means dividing a sentence made up of a string of Chinese, Japanese, or Korean characters into meaningful words. In this embodiment, a conventional segmentation method is used that evaluates the segmented document by word frequency, word length, and similar measures to obtain the best segmentation result. Note, however, that those skilled in the art may use other segmentation methods without departing from the spirit and scope of the invention.

Next, a wordless-part recording procedure 12 records the wordless parts of document 51 in a wordless set 52. In the invention, a "wordless part" is a part of document 51 in which no combination of adjacent characters forms a computer-recognizable word. For example, suppose document 51 contains the sentence 「王明昨天拜訪李小華」 ("Wang Ming visited Li Xiaohua yesterday"). In the two parts 「王明」 and 「李小華」, none of the character combinations (for 「李小華」, the three combinations 「李小」, 「小華」, and 「李小華」) can be recognized by the system, so 「王明」 and 「李小華」 are two wordless parts. The segmentation result of this sentence is therefore 「王 明 昨天 拜訪 李 小 華」, in which the two wordless parts 「王明」 and 「李小華」 appear as runs of single characters.
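The wordless-part recording procedure 12 can be pictured with a minimal sketch. It is not the patent's implementation: it assumes the segmenter returns every unrecognized character as a single-character token, and it treats each run of two or more consecutive single-character tokens as one wordless part. The function name `wordless_parts` is hypothetical.

```python
def wordless_parts(tokens):
    """Collect runs of two or more consecutive single-character tokens.

    Assumption: a single-character token means the segmenter found no
    recognizable word at that position.
    """
    parts, run = [], []
    for tok in tokens:
        if len(tok) == 1:
            run.append(tok)          # extend the current run of lone characters
        else:
            if len(run) >= 2:
                parts.append("".join(run))
            run = []                 # a recognized word ends the run
    if len(run) >= 2:                # flush a run that reaches the end
        parts.append("".join(run))
    return parts


# Segmentation result of the example sentence in the description
tokens = ["王", "明", "昨天", "拜訪", "李", "小", "華"]
print(wordless_parts(tokens))  # ['王明', '李小華']
```

The returned list plays the role of wordless set 52 for a single sentence; a real system would accumulate the parts over the whole document.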

Next, in the subword recording procedure 13, each wordless part in wordless set 52 is decomposed into at least one subword, and the resulting subwords are recorded in a subword set 53. Taking the sentence 「王明昨天拜訪李小華」 above as an example, the wordless part 「李小華」 is decomposed into the three subwords 「李小」, 「小華」, and 「李小華」. In other words, the subword recording procedure 13 extracts every possible subword of each wordless part.

Next, the first deletion procedure 14 counts the occurrences of each subword and deletes from the subword set every subword whose number of occurrences is less than a predetermined value. In this procedure, the "number of occurrences" of a subword may be its number of occurrences in the original wordless set 52 or its number of occurrences in subword set 53. Those skilled in the art may choose either counting method as circumstances require.

When a subword occurs too few times in the whole of document 51, it appears in the document only occasionally and can be deleted. For example, if 「王明」 appears only once in document 51 while 「李小華」 appears ten times, then 「王明」 is clearly not a new word worth recording but merely an occasional subword, whereas 「李小華」, because of its many occurrences in document 51, is a new word worth recording.

The predetermined value can be set as circumstances require. For example, the default can be changed by manual input, or the value can be adjusted dynamically according to the number of characters in document 51. Different new-word learning criteria can thus be set for different documents.
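The subword recording procedure 13 and the first deletion procedure 14 can be sketched as follows. This is an illustrative sketch under stated assumptions: subwords are taken to be contiguous substrings of at least two characters (inferred from the 「李小華」 example, which lists no single characters), occurrences are counted over the wordless set, and the threshold of 2 follows the value named in claims 7, 12, and 19. The function names are hypothetical.

```python
from collections import Counter


def subwords(part, min_len=2):
    """Every contiguous substring of `part` with at least `min_len` characters."""
    return [part[i:j]
            for i in range(len(part))
            for j in range(i + min_len, len(part) + 1)]


def count_subwords(wordless_set):
    """Procedure 13: tally each subword over all wordless parts."""
    return Counter(s for part in wordless_set for s in subwords(part))


def first_deletion(counts, threshold=2):
    """Procedure 14: drop subwords that occur fewer than `threshold` times."""
    return {w: c for w, c in counts.items() if c >= threshold}


# 「王明」 appears once and 「李小華」 ten times, as in the description
wordless_set = ["王明"] + ["李小華"] * 10
counts = count_subwords(wordless_set)
print(first_deletion(counts))  # {'李小': 10, '李小華': 10, '小華': 10}
```

With these inputs 「王明」 is removed as an occasional subword, while all three subwords of 「李小華」 survive to the next procedure.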

In the second deletion procedure 15, a first subword and a different second subword are first selected in sequence from subword set 53. Then, when the first subword is contained in the second subword and the number of occurrences of the first subword is not greater than that of the second subword, the first subword is deleted from subword set 53. For example, when 「李小」 is selected as the first subword and 「李小華」 as the second subword, 「李小」 is contained in 「李小華」, and here the number of occurrences of 「李小」 equals that of 「李小華」. In this case 「李小」 is deleted from subword set 53 and 「李小華」 is kept. Likewise, the subword 「小華」 is contained in 「李小華」, so it too is deleted from subword set 53. In this way, shorter subwords that are contained in longer subwords are deleted and the longer subwords are retained.

Next, in the judgment procedure 16, if subword set 53 is an empty set, that is, if all subwords have been deleted in the first deletion procedure 14 and the second deletion procedure 15, the flow of the computer new-word learning method 1 ends. If subword set 53 still contains subwords, a third deletion procedure 17 is performed, which keeps only the subword with the most occurrences and deletes all other subwords. Each pass therefore generates exactly one new word.
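The second and third deletion procedures can be sketched as follows. This is a minimal sketch, not the patent's code: the counts are hypothetical, containment is tested as plain substring containment, and ties in the third deletion are broken by dictionary order, which the source does not specify.

```python
def second_deletion(counts):
    """Procedure 15: delete a subword that is contained in a different
    subword whose occurrence count is at least as large."""
    kept = dict(counts)
    for first, first_count in counts.items():
        for second, second_count in counts.items():
            if first != second and first in second and first_count <= second_count:
                kept.pop(first, None)
                break
    return kept


def third_deletion(counts):
    """Procedure 17: keep only the most frequent subword (ties: first seen)."""
    if not counts:
        return {}
    best = max(counts, key=counts.get)
    return {best: counts[best]}


# Hypothetical counts left over after the first deletion
counts = {"李小": 10, "小華": 10, "李小華": 10, "電子": 7}
after_second = second_deletion(counts)
print(after_second)                  # {'李小華': 10, '電子': 7}
print(third_deletion(after_second))  # {'李小華': 10}
```

Note that a shorter subword survives when it occurs more often than the longer one that contains it: `second_deletion({"李小": 12, "李小華": 10})` keeps both entries, because 「李小」 then also appears outside 「李小華」.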
After a new word is generated, a wordless-part splitting procedure 18 is performed: each wordless part that contains the new word is deleted from wordless set 52, and the portions of that wordless part other than the new word are separated out to form new wordless parts. That is, when two or more characters precede the new word in such a wordless part, the portion before the new word is treated as another wordless part and added to wordless set 52; likewise, when two or more characters follow the new word, the portion after the new word is treated as another wordless part and added to wordless set 52.

For example, suppose document 51 contains another sentence in which no word is computer-recognizable, so that the whole sentence forms one wordless part, and that this wordless part includes the newly generated word 「李小華」. In the wordless-part splitting procedure 18, this wordless part is deleted and split into two new wordless parts: the characters before the new word 「李小華」 and the characters after it.

A subword set clearing procedure 19 then empties subword set 53, and the flow returns to the subword recording procedure 13 to decompose subwords again.

Through the above flow, all possible new words in document 51 can be found without affecting the words the computer could already recognize or the existing segmentation results in document 51. Computer-readable documents can therefore be segmented more appropriately.
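Putting procedures 13 to 19 together, the overall loop might look like the sketch below. It rests on several assumptions that the source does not state: subwords have at least two characters, the threshold defaults to 2 (the value in the dependent claims), a wordless part is split at every occurrence of the new word rather than only one, and ties for the most frequent subword are broken by dictionary order. The sample strings containing 甲乙 and 丙丁 are made up for illustration.

```python
from collections import Counter


def subwords(part, min_len=2):
    """Contiguous substrings of `part` with at least `min_len` characters."""
    return [part[i:j]
            for i in range(len(part))
            for j in range(i + min_len, len(part) + 1)]


def learn_new_words(wordless_set, threshold=2):
    """Repeat procedures 13 to 19 until no subword survives the deletions."""
    parts = list(wordless_set)
    learned = []
    while True:
        # 13: record subwords; rebuilding the tally each pass also covers
        # the clearing of the subword set (procedure 19)
        counts = Counter(s for p in parts for s in subwords(p))
        # 14: first deletion, drop rare subwords
        frequent = {w: c for w, c in counts.items() if c >= threshold}
        # 15: second deletion, drop subwords contained in another subword
        # that occurs at least as often
        survivors = {w: c for w, c in frequent.items()
                     if not any(w != o and w in o and c <= oc
                                for o, oc in frequent.items())}
        # 16: stop when the subword set is empty
        if not survivors:
            return learned
        # 17: third deletion, keep only the most frequent subword
        new_word = max(survivors, key=survivors.get)
        learned.append(new_word)
        # 18: split wordless parts around the new word, keeping only
        # leftover pieces of two or more characters
        parts = [piece for p in parts
                 for piece in p.split(new_word) if len(piece) >= 2]


sample = ["王明"] + ["李小華"] * 3 + ["甲乙李小華丙丁"]
print(learn_new_words(sample))  # ['李小華']
```

Each pass removes at least one occurrence of the learned word from the wordless parts, so the total amount of text shrinks and the loop terminates.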


As for the new words generated, they can be made into new computer-recognizable words as circumstances require. For example, if the new word generated is 「電子商務」 ("e-commerce"), one may consider adding this word, which has appeared only in recent years, to the computer's vocabulary database so that it becomes a new computer-recognizable word. This helps in updating and maintaining the computer vocabulary database.

Referring to FIG. 2, the computer new-word learning system 2 according to the preferred embodiment of the invention includes a word recognition module 21, a wordless-part recording module 22, a subword recording module 23, a first deletion module 24, a second deletion module 25, a third deletion module 26, and a wordless-part splitting module 27. In this embodiment, each module is program code stored in a computer-readable storage device such as a hard disk drive; after a central processing unit reads the code, the procedures described above are executed to find the new words in document 51. Note, however, that those skilled in the art may make equivalent modifications and further applications, for example in an electronic device that segments text and generates new words, without departing from the scope of the invention.

The computer new-word learning system 2 may retrieve document 51 over the Internet, for example from a server. The vocabulary database 30 needed for word recognition may be stored in a computer-readable memory device or recording medium so that the computer new-word learning system 2 can access it.
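The effect of adding a learned word to vocabulary database 30 can be illustrated with a small sketch. It deliberately swaps in forward maximum matching as a simple stand-in segmenter; the patent's embodiment uses a statistical method based on word frequency and word length, which is not reproduced here. The function `segment`, the `max_len` parameter, and the lexicon contents are all hypothetical.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon
    word, falling back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                tokens.append(piece)
                i += n
                break
    return tokens


lexicon = {"昨天", "拜訪"}          # stand-in for vocabulary database 30
sentence = "王明昨天拜訪李小華"

before = segment(sentence, lexicon)
lexicon.add("李小華")               # add the newly learned word
after = segment(sentence, lexicon)

print(before)  # ['王', '明', '昨天', '拜訪', '李', '小', '華']
print(after)   # ['王', '明', '昨天', '拜訪', '李小華']
```

Once 「李小華」 is in the lexicon, it is segmented as one word, while 「王明」 remains a run of single characters because it was never learned.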

New words generated by the computer new-word learning system 2 can also be added to vocabulary database 30 in order to maintain and update it.

The computer new-word learning method and system according to the invention use computer technology to segment computer-readable documents so that the sentences in a document are correctly divided into meaningful words. This supports the further development of many language-processing applications, such as proofreading, translation, and speech recognition.

The computer new-word learning method and system according to the invention can automatically learn the new words in a document so that the document can be segmented appropriately.

The computer new-word learning method and system according to the invention can serve as a basis for updating and maintaining a computer vocabulary database.

The foregoing is illustrative only and not restrictive. Any equivalent modification or change that does not depart from the spirit and scope of the invention shall be covered by the appended claims.

[Brief Description of the Drawings]

FIG. 1 is a flowchart showing the flow of the computer new-word learning method according to the preferred embodiment of the invention.

FIG. 2 is a schematic diagram showing the architecture of the computer new-word learning system according to the preferred embodiment of the invention.

[Description of Reference Numerals]

1 Computer new-word learning method
11 Word recognition procedure
12 Wordless-part recording procedure
13 Subword recording procedure
14 First deletion procedure
15 Second deletion procedure
16 Judgment procedure
17 Third deletion procedure
18 Wordless-part splitting procedure
19 Subword set clearing procedure
2 Computer new-word learning system
21 Word recognition module
22 Wordless-part recording module
23 Subword recording module

24 First deletion module
25 Second deletion module
26 Third deletion module
27 Wordless-part splitting module
30 Vocabulary database
51 Document
52 Wordless set
53 Subword set

Claims (19)

517195 六、申請專利範圍 1. 一種電腦新詞學習方法,包含以下程序: 一子詞記錄程序,係將記錄於一無詞集合之至少一無 詞部份分解成至少一子詞,並將該子詞記錄於一子詞集合 中,其中該無詞部份係指於一電腦可讀取之文件中,任相 鄰單字均無法構成電腦可辨識之詞彙的部份; 一第一刪除程序,係分別計算各該等子詞的出現次 數,並將出現次數小於一預定值之子詞自該子詞集合中刪 除;以及 一第二刪除程序,係自該子詞集合中依序選取不相同 之一第一子詞與一第二子詞,當該第一子詞被包含於該第 二子詞中,且該第一子詞之出現次數不大於該第二子詞 時,將該第一子詞自該子詞集合刪除,藉以由該子詞集合 中產生電腦可辨識之新詞。 2. 如申請專利範圍第1項所述之電腦新詞學習方法,更包 含: 一詞辨識程'序,係對一文件進行詞辨識處理;以及 一無詞部份記錄程序,係當該文件中具有至少一無詞 部份時,將該無詞部份記錄於該無詞集合中。 3. 如申請專利範圍第1項所述之電腦新詞學習方法,更包 含: 一判斷程序,係判斷該子詞集合是否為空集合,並當 該子詞集合為空集合時,結束該電腦新詞學習方法之流517195 6. Scope of patent application 1. A computer new word learning method, including the following procedures: A subword recording program is to decompose at least one nonword part recorded in a nonword set into at least one subword, and The subwords are recorded in a set of subwords, where the non-word part refers to a computer-readable file, and any adjacent word cannot form a part of the computer-recognizable vocabulary; a first deletion process, respectively Count the number of occurrences of each of these subwords, and delete the subwords whose occurrences are less than a predetermined value from the set of subwords; and a second deletion procedure, which sequentially selects a different one from the subword set A sub-word and a second sub-word, when the first sub-word is included in the second sub-word, and the number of occurrences of the first sub-word is not greater than the second sub-word, the first sub-word Delete from the sub-word set, thereby generating computer-recognizable new words from the sub-word set. 2. The computer new word learning method described in item 1 of the scope of patent application, further comprising: a word recognition process, which performs word recognition processing on a document; and a non-word part recording program, which serves as the document When there is at least one wordless part in the record, the wordless part is recorded in the wordless set. 3. 
The computer new word learning method described in item 1 of the scope of patent application, further comprising: a judging program for judging whether the subword set is an empty set, and ending the computer when the subword set is an empty set. The flow of new word learning methods 第15頁 517195 六、申請專利範圍 程。 4. 如申請專利範圍第1項所述之電腦新詞學習方法,更包 含: 一第三刪除程序,係於該第二刪除程序之後,更將該 子詞集合中出現次數最多之子詞以外的子詞刪除。 5. 如申請專利範圍第1項所述之電腦新詞學習方法,更包 含·· % 一無詞部份分割程序,包括 胃 將包含該新詞之無詞部份自該無詞集合移除; 當包含該新詞之無詞部份中,位於該新詞之前的單字 數量為兩個以上時,將該無詞部份中位於該新詞之前的部 份視為另一無詞部份,並將其加入至該無詞集合中;且 當包含該新詞之無詞部份中,位於該新詞之後的單字 數量為兩個以上時,將該無詞部份中位於該新詞之後的部 份視為另一無詞部份,並將其加入至該無詞集合中。 6. 如申請專利範圍第1項所述之電腦新詞學習方法,更包 含: 一子詞集合清空程序,係清空該子詞集合並回到該子 詞記錄程序。 7.如申請專利範圍第1項所述之電腦新詞學習方法,其中Page 15 517195 VI. Scope of patent application. 4. The computer new word learning method described in item 1 of the scope of patent application, further comprising: a third deletion procedure, which is after the second deletion procedure, and includes the ones other than the most frequently occurring ones in the subword set. Subword deletion. 5. The computer new word learning method described in item 1 of the scope of patent application, further includes a% wordless segmentation procedure, which includes removing the wordless part containing the new word from the wordless set. ; When the number of words before the new word in the wordless part containing the new word is two or more, the part of the wordless part before the new word is regarded as another wordless part , And add it to the wordless set; and when the number of words after the new word in the wordless portion containing the new word is two or more, the wordless portion is located in the new word The following part is regarded as another wordless part and added to the wordless set. 6. The computer new word learning method described in item 1 of the scope of patent application, further comprising: a subword set emptying program, which is to clear the subword set and return to the subword recording program. 7. 
The computer new word learning method described in item 1 of the scope of patent application, wherein 第16頁 517195 六、申請專利範圍 該預定值為2。 8. 一種電腦新詞學習系統,包含: 一子詞記錄模組,係將記錄於一無詞集合之至少一無 詞部份分解成至少一子詞,並將該子詞記錄於一子詞集合 中,其中該無詞部份係指於一電腦可讀取之文件中,任相 鄰單字均無法構成電腦可辨識之詞彙的部份; 一第一刪除模組,係分別計算各該等子詞的出現次 數,並將出現次數小於一預定值之子詞自該子詞集合中刪_ 除;以及 一第二刪除模組,係自該子詞集合中依序選取不相同 之一第一子詞與一第二子詞,當該第一子詞被包含於該第 二子詞中,且該第一子詞之出現次數不大於該第二子詞 時,將該第一子詞自該子詞集合刪除,藉以由該子詞集合 中產生電腦可辨識之新詞。 9. 如申請專利範圍第8項所述之電腦新詞學習系統,更包 含: 一詞辨識模組,係對一文件進行詞辨識處理;以及 · 一無詞部份記錄模組,係當該文件中具有至少一無詞 部份時,將該無詞部份記錄於該無詞集合中。 10.如申請專利範圍第8項所述之電腦新詞學習系統,更 包含:Page 16 517195 6. Scope of patent application The predetermined value is 2. 8. A computer new word learning system comprising: a subword recording module, which decomposes at least one nonword part recorded in a nonword set into at least one subword, and records the subword in a subword In the set, the non-word part refers to a computer-readable document, and any adjacent word cannot form a part of the computer-recognizable vocabulary. A first deletion module is to calculate each of these sub-words separately. And delete sub-words whose occurrences are less than a predetermined value from the sub-word set; and a second deletion module, which sequentially selects a different first sub-word from the sub-word set in order And a second subword, when the first subword is included in the second subword, and the number of occurrences of the first subword is not greater than the second subword, the first subword is removed from the subword The word set is deleted, so that computer-recognizable new words are generated from the sub-word set. 9. The computer new word learning system described in item 8 of the scope of patent application, further comprising: a word recognition module, which performs word recognition processing on a document; and a wordless partial recording module, which should be When the document has at least one wordless part, the wordless part is recorded in the wordless set. 10. 
The computer new word learning system described in item 8 of the scope of patent application, further comprising: 第17頁 517195 六、申請專利範圍 一第三删除模組,係將該子詞集合中出現次數最多之 子詞以外的子詞刪除。 11·如申請專利範圍第8項所述之電腦新詞學習系統,更 包含: 一無詞部份分割模組,其係 將包含該新詞之無詞部份自該無詞集合移除; 當包含該新詞之無詞部份中,位於該新詞之前的單字 數量為兩個以上時,將該無詞部份中位於該新詞之前的部 份視為另一無詞部份,並將其加入至該無詞集合中;且 當包含該新詞之無詞部份中,位於該新詞之後的單字 數量為兩個以上時,將該無詞部份中位於該新詞之後的部 份視為另一無詞部份,並將其加入至該無詞集合中。 12. 如申請專利範圍第8項所述之電腦新詞學習系統,其 中該預定值為2。 13. —種電腦新詞學習系統,包含: 一中央處理單元;以及 一儲存裝置,其係儲存至少一程式碼,俾使該中央處 理單元於讀取該程式碼後,可執行以下程序: 一子詞記錄程序,係將記錄於一無詞集合之至少一無 詞部份分解成至少一子詞,並將該子詞記錄於一子詞集合 中,其中該無詞部份係指於一電腦可讀取之文件中,任相Page 17 517195 VI. Scope of patent application A third deletion module deletes the subwords other than the most frequently occurring one in the subword set. 11. The computer new word learning system described in item 8 of the scope of patent application, further comprising: a wordless part segmentation module, which removes the wordless part containing the new word from the wordless set; When the number of words before the new word in the wordless part containing the new word is two or more, the part of the wordless part before the new word is regarded as another wordless part, And add it to the wordless set; and when the number of words after the new word in the wordless portion containing the new word is two or more, the wordless portion is located after the new word The part of is regarded as another wordless part and added to the wordless set. 12. The computer new word learning system described in item 8 of the scope of patent application, wherein the predetermined value is 2. 13. 
A computer new-word learning system, comprising: a central processing unit; and a storage device storing at least one program code, such that the central processing unit, after reading the program code, executes the following procedures:
a subword recording procedure, which decomposes at least one wordless part recorded in a wordless set into at least one subword and records the subword in a subword set, wherein a wordless part is a portion of a computer-readable document in which no adjacent single characters form a word recognizable by the computer;
a first deletion procedure, which counts the number of occurrences of each subword and deletes from the subword set every subword whose number of occurrences is less than a predetermined value; and
a second deletion procedure, which sequentially selects a first subword and a different second subword from the subword set and, when the first subword is contained in the second subword and the first subword occurs fewer times than the second subword, deletes the first subword from the subword set, whereby a new word recognizable by the computer is generated from the subword set.
14. The computer new-word learning system as described in item 13 of the scope of patent application, wherein the central processing unit, after reading the program code, further executes:
a word recognition procedure, which performs word recognition processing on a document; and
a wordless-part recording module, which records a wordless part in the wordless set when the document contains at least one wordless part.
15. The computer new-word learning system as described in item 13 of the scope of patent application, wherein the central processing unit, after reading the program code, further executes:
a judging procedure, which determines whether the subword set is an empty set and, when the subword set is an empty set, ends the flow of the computer new-word learning method.
16. The computer new-word learning system as described in item 13 of the scope of patent application, wherein the central processing unit, after reading the program code, further executes:
a third deletion procedure, which deletes from the subword set every subword other than the subword with the most occurrences.
17. The computer new-word learning system as described in item 13 of the scope of patent application, wherein the central processing unit, after reading the program code, further executes:
a wordless-part segmentation procedure, including:
removing the wordless part that contains the new word from the wordless set;
when, in the wordless part that contains the new word, the number of single characters preceding the new word is two or more, treating the portion preceding the new word as another wordless part and adding it to the wordless set; and
when, in the wordless part that contains the new word, the number of single characters following the new word is two or more, treating the portion following the new word as another wordless part and adding it to the wordless set.
18. The computer new-word learning system as described in item 13 of the scope of patent application, wherein the central processing unit, after reading the program code, further executes:
a subword set clearing procedure, which clears the subword set and returns to the subword recording procedure.
19. The computer new-word learning system as described in item 13 of the scope of patent application, wherein the predetermined value is 2.
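The procedures recited in items 13 and 15 to 19 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`enumerate_subwords`, `learn_one`, `split_wordless_parts`, `learn_new_words`) are hypothetical, and several details are assumptions that the claims do not fix. Subwords are taken to be all contiguous substrings of at least two characters, the wordless set is held as a list so that repeated parts are counted, and ties in the third deletion are broken by length and then lexical order. The second deletion uses the abstract's "not larger than" comparison, because under exhaustive substring counting a contained subword can never occur strictly fewer times than the subword containing it, so a strict comparison would never delete anything.

```python
from collections import Counter


def enumerate_subwords(part, min_len=2):
    """Assumed decomposition: every contiguous substring of at least min_len characters."""
    n = len(part)
    return [part[i:j] for i in range(n) for j in range(i + min_len, n + 1)]


def learn_one(parts, threshold=2):
    """Run the recording and the three deletion procedures once; return a new word or None."""
    # Subword recording procedure: count each positional occurrence of each subword.
    counts = Counter()
    for part in parts:
        counts.update(enumerate_subwords(part))
    # First deletion: drop subwords that occur fewer than `threshold` times.
    frequent = {s: c for s, c in counts.items() if c >= threshold}
    # Second deletion: drop a subword contained in a different subword
    # whose count is at least as large (abstract's "not larger than" wording).
    survivors = {
        s: c
        for s, c in frequent.items()
        if not any(s != t and s in t and c <= ct for t, ct in frequent.items())
    }
    if not survivors:
        return None  # empty subword set (item 15)
    # Third deletion (item 16): keep only the most frequent subword.
    # Tie-break by length, then lexical order, is an assumption of this sketch.
    return max(survivors, key=lambda s: (survivors[s], len(s), s))


def split_wordless_parts(parts, new_word):
    """Item 17: remove parts containing new_word, re-adding sides of two or more characters."""
    result = []
    pending = list(parts)
    while pending:
        part = pending.pop()
        if new_word not in part:
            result.append(part)
            continue
        before, _, after = part.partition(new_word)
        pending.extend(p for p in (before, after) if len(p) >= 2)
    return result


def learn_new_words(wordless_parts, threshold=2):
    """Repeat until the subword set comes out empty (items 15 and 18)."""
    parts = list(wordless_parts)
    new_words = []
    while True:
        word = learn_one(parts, threshold)  # the subword set is rebuilt each round
        if word is None:
            return new_words
        new_words.append(word)
        parts = split_wordless_parts(parts, word)


if __name__ == "__main__":
    # Latin letters stand in for single characters of the document.
    print(learn_new_words(["uvabc", "abcuv"]))
```

In this sketch each round removes at least one wordless part and replaces it with strictly shorter pieces, so the loop always terminates. The default `threshold=2` matches the predetermined value recited in item 19.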
TW89123412A 2000-11-06 2000-11-06 Learning method and system for new vocabularies in computer TW517195B (en)

Priority Applications (1)

Application Number Publication Priority Date Filing Date Title
TW89123412A TW517195B (en) 2000-11-06 2000-11-06 Learning method and system for new vocabularies in computer

Publications (1)

Publication Number Publication Date
TW517195B (en) 2003-01-11

Family

ID=27801202

Family Applications (1)

Application Number Publication Priority Date Filing Date Title
TW89123412A TW517195B (en) 2000-11-06 2000-11-06 Learning method and system for new vocabularies in computer

Country Status (1)

Country Link
TW (1) TW517195B (en)

Legal Events

Date Code Title Description
GD4A Issue of patent certificate for granted invention patent
MM4A Annulment or lapse of patent due to non-payment of fees