TWI290684B - Incremental thesaurus construction method - Google Patents

Incremental thesaurus construction method

Info

Publication number
TWI290684B
Authority
TW
Taiwan
Prior art keywords
vocabulary
words
file
progressive
related words
Prior art date
Application number
TW092112651A
Other languages
Chinese (zh)
Other versions
TW200424874A (en)
Inventor
Yuen-Hsien Tseng
Original Assignee
Webgenie Information Ltd
Yuen-Hsien Tseng
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Webgenie Information Ltd and Yuen-Hsien Tseng
Priority to TW092112651A priority Critical patent/TWI290684B/en
Publication of TW200424874A publication Critical patent/TW200424874A/en
Application granted granted Critical
Publication of TWI290684B publication Critical patent/TWI290684B/en


Abstract

An incremental thesaurus construction method is provided. The method first collects keywords in a document set, divides each of the documents in the document set into a plurality of logical segments, and performs related-term analysis based on the logical segments.

Description

[Technical Field of the Invention]

The present invention relates to a method for constructing an incremental thesaurus, and more particularly to a method for constructing an incremental co-occurrence thesaurus.

[Prior Art]

"Vocabulary mismatch" has long been one of the main causes of failed searches by users of information retrieval systems. The vocabulary mismatch problem arises when the terms a user submits in a query differ from the terms the system used to index the documents. For example, different documents may use inconsistent wording for the same concept, such as the three Chinese renderings of "notebook computer" (「筆記型電腦」, 「筆記本電腦」, 「筆記本型電腦」). If the system builds its index directly from the vocabulary of the original documents (the purpose of indexing being to speed up query matching), then when a user submits one of these variants, documents that carry the same meaning under a different string may fail to match, and relevant documents are missed.

Library science noticed this phenomenon long ago and proposed tools such as the "authority file" and the "thesaurus" to solve the problem. An authority file records variant forms with the same meaning, so that during indexing and retrieval, terms that are identical in meaning but different in form can be mapped to one another and treated as the same term. Term pairs such as the two renderings of "notebook computer" above, or "Premier of the Executive Yuan" (「行政院長」) and its informal synonym 「閣揆」, or the two common Chinese renderings of "senile dementia" (「老人癡呆症」 and 「老人失智症」), can all be treated as the same term during indexing and retrieval through an authority file. A thesaurus records still more relationships between terms: besides synonyms there are antonyms, broader terms, narrower terms, and related terms, used to expand or narrow the topical scope of a search term.
For example, the "notebook computer" and the "handheld computer" are very close to each other and can be regarded as the narrow word of "portable computer". In contrast, "portable computer" can be regarded as the broad word of these two words. Through the extension of the generalized word, the query word "notebook" can be used to find the file containing "handheld computer". "Associative lexicon" lists the relationship between vocabulary and is used to query each other for the purpose of querying words to expand or narrow the scope of the query, or to prompt different linguistic terms for mourning, so that the search can be upgraded from the original string alignment level to I am doing the level of comparison. In order to construct the semantic relationship between such vocabulary, manual analysis and collation are often required. The advantages of artificially making a related lexicon are high accuracy, and the disadvantages are high cost, slow construction, difficult maintenance, and pre-selected vocabulary may be independent of subsequent or other new files. In the past, research on the information retrieval experiment pointed out that the general-purposed lexicon is used in the retrieval of documents in a specific field, and there is a situation in which the search performance cannot be improved. Although the related vocabulary captures the semantic gap between vocabulary, the vocabulary topics covered by the related lexicon may be inferior to the theme of the document, and the purpose of improving the retrieval effectiveness by the associated lexicon is not achieved. An extreme example is the application of the related lexicon of the humanities to the retrieval of engineering science literature, and its retrieval effect is of course difficult to manifest. However, creating a related vocabulary for each of the literature fields is time consuming and labor intensive. Therefore, according to the subject of the literature itself, the method of automatically and immediately generating the related lexicon is the subject of discussion. Automated methods, mostly relying on relevant vocabulary in the document often appear 1290684 11219twf3.doc/006 96-9-4 together to create a related lexicon. The related vocabulary constructed in this way can be called "co-occurrence thesaurus" or "co-occurrence related vocabulary". The relationship between vocabulary and vocabulary in the co-occurring vocabulary is not as accurate as the artificially-produced lexicon, such as generalized words and narrow words, but it has statistical correlation. This correlation often reveals knowledge that is implicit in the document and that is not easily detected, detected, or maintained. Therefore, the method of automatically constructing such a related vocabulary can also be regarded as a method of knowledge discovery from text (or text mining).

Gerard Salton, in Automatic Text Processing: The

Transformation, Analysis, and Retrieval of Information by Computer (Addison-Wesley, 1989), proposed a basic method for constructing a co-occurrence thesaurus: first compute the similarity between every pair of terms (the similarity depending on how the terms co-occur across documents), then cluster the terms by that similarity. In this approach, if document $D_i$ is represented by the document vector $D_i = (d_{i1}, d_{i2}, \ldots, d_{it})$, where $t$ is the number of index terms and $d_{ij}$ is the weight of term $T_j$ in document $D_i$, then the weights of term $T_j$ across the documents likewise form the term vector $T_j = (d_{1j}, d_{2j}, \ldots, d_{nj})$, where $n$ is the number of documents. The similarity between terms $T_j$ and $T_k$ can then be defined as

$$sim(T_j, T_k) = \sum_{i=1}^{n} d_{ij} \cdot d_{ik}$$

Once the similarity between every pair of terms has been computed, various clustering techniques can automatically group highly similar terms into the same class. The drawback of this construction is that computing the similarity of every pair of terms and clustering over them is extremely expensive; applied to a large collection it requires enormous memory and long computation time.
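The cost of this all-pairs construction is easiest to see in a direct implementation. The following is a minimal sketch (Python); the weight matrix and its values are invented for illustration, and the inner-product form of the similarity is a natural reading of the definitions above rather than a quotation of the original text:

```python
import numpy as np

# Term-document weight matrix: row i is document D_i, column j is term T_j,
# so D[i, j] plays the role of the weight d_ij. The weights are made up.
D = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [3.0, 1.0, 0.0],
    [1.0, 2.0, 1.0],
])

def term_similarity(D: np.ndarray, j: int, k: int) -> float:
    """sim(T_j, T_k) = sum_i d_ij * d_ik, the inner product of the
    column (term) vectors T_j and T_k."""
    return float(D[:, j] @ D[:, k])

# All-pairs similarity is the t x t matrix D^T D; building it, and then
# clustering over it, is the memory- and time-hungry step criticized above.
sim_matrix = D.T @ D
print(term_similarity(D, 0, 2), sim_matrix[0, 2])   # both print 3.0
```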

Hsinchun Chen et al., in "Automatic Thesaurus

Generation for an Electronic Community System," Journal of the American Society for Information Science, 46(3): 175-193, April 1995, built a thesaurus from the statistical features of term co-occurrence within documents. They define the weight of term $T_j$ in document $D_i$ as

$$d_{ij} = tf_{ij} \times \log\!\left(\frac{n}{df_j} \times w_j\right)$$

where $n$ is the total number of documents, $df_j$ is the number of documents in which term $T_j$ appears, $tf_{ij}$ is the term frequency (number of occurrences) of $T_j$ in document $D_i$, and $w_j$ is the length of term $T_j$; for example, "Artificial Intelligence" contains two English words, so its length is defined as 2, while 「數位音樂」 ("digital music") contains four Chinese characters, so its length is defined as 4. The joint weight of terms $T_j$ and $T_k$ in document $D_i$ is defined as

$$d_{ijk} = tf_{ijk} \times \log\!\left(\frac{n}{df_{jk}} \times w_j\right)$$

where $df_{jk}$ is the number of documents in which the two terms co-occur, and $tf_{ijk}$ is the number of occurrences of the two terms in document $D_i$, taking the smaller of the two counts. Chen et al. argued that symmetric co-occurrence measures tend to surface the more frequent terms, which contribute little to retrieval effectiveness, so they defined an asymmetric clustering scheme; that is, the weight for clustering term $T_j$ toward term $T_k$ is:

$$Cluster\_Weight(T_j, T_k) = \frac{\sum_{i=1}^{n} d_{ijk}}{\sum_{i=1}^{n} d_{ij}} \times weighting\_factor(T_k)$$

and the weight for clustering term $T_k$ toward term $T_j$ is

$$Cluster\_Weight(T_k, T_j) = \frac{\sum_{i=1}^{n} d_{ijk}}{\sum_{i=1}^{n} d_{ik}} \times weighting\_factor(T_j)$$

where the weighting factor is

$$weighting\_factor(T_j) = \frac{\log(df_j)}{\log(n)}$$

Using these clustering formulas, they generated 1,708,551 co-occurrence pairs from 4,714 documents (occupying 8 MB of disk space in total), which amounts to associating each term with several thousand others. Because so many associated pairs would easily overwhelm a user browsing query suggestions, each term was limited to at most 100 related terms. This removed 60% of the pairs, leaving 709,659 pairs composed of 7,829 distinct terms. The formulas above are computationally heavy: generating these pairs took 9.2 CPU hours on a workstation, and the pairs occupied 12.3 MB of disk space, more than the original document collection.
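To make the formulas concrete, here is a small hedged sketch of Chen et al.'s asymmetric cluster weight over a toy corpus (Python); the corpus, the tokenization, and the use of the word count as the term length $w_j$ are assumptions made for illustration only:

```python
import math
from collections import Counter

docs = [
    ["thesaurus", "information retrieval", "thesaurus"],
    ["information retrieval", "indexing"],
    ["thesaurus", "indexing", "indexing"],
]
n = len(docs)
df = Counter()
for d in docs:
    df.update(set(d))

def w(term):
    return len(term.split())            # term length w_j (word count)

def d_ij(i, j):
    tf = docs[i].count(j)
    return tf * math.log(n / df[j] * w(j)) if tf else 0.0

def d_ijk(i, j, k):
    tf_jk = min(docs[i].count(j), docs[i].count(k))    # smaller tf of the two
    df_jk = sum(1 for d in docs if j in d and k in d)  # co-occurrence df
    return tf_jk * math.log(n / df_jk * w(j)) if tf_jk else 0.0

def cluster_weight(j, k):
    """Asymmetric weight for clustering T_j toward T_k."""
    num = sum(d_ijk(i, j, k) for i in range(n))
    den = sum(d_ij(i, j) for i in range(n))
    factor = math.log(df[k]) / math.log(n)             # weighting_factor(T_k)
    return num / den * factor if den else 0.0

# The two directions generally differ -- the asymmetry Chen et al. wanted.
print(cluster_weight("thesaurus", "indexing"))
print(cluster_weight("indexing", "thesaurus"))
```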

Mark Sanderson et al., in "Deriving Concept Hierarchies from Text," Proceedings of the 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), Aug. 15-19, 1999, Berkeley, U.S.A., pp. 206-213, constructed a hierarchical co-occurrence thesaurus in an entirely different way. Their goal was to automatically generate concept hierarchies from retrieved documents, so that users could grasp the general content of the results. The method has two steps. The first step selects the vocabulary, deciding which terms are to be listed in the concept hierarchy for the user to browse. Terms are selected mainly from the top-ranked documents of the search results, picking, within the best-matching passages, the terms that often appear together.
Another selection criterion takes, from the passage of each retrieved document most relevant to the query, the terms satisfying the following condition: the number of retrieved documents containing the term, divided by the number of documents containing it in the whole collection, is at least 0.1, i.e.:

(df in retrieved set) / (df in the collection) >= 0.1

Using these two selections, they extracted on average 2,430 terms from the top 500 documents of each query over the TREC collection. With these important query-related terms in hand, the second step analyzes the associations between them. Every pair of terms is compared for a subsumption relationship: if $T_j$ subsumes $T_k$, then the conditional probabilities $P(T_j \mid T_k) = 1$ and $P(T_k \mid T_j) < 1$ should hold. That is, whenever term $T_k$ appears in a document, term $T_j$ also appears, but when $T_j$ appears, $T_k$ does not necessarily appear; in that case $T_j$ is said to subsume $T_k$. Broadly, a broader term subsumes its narrower terms, though some term pairs without a broader/narrower relationship also satisfy the condition. Although this condition is the direct mathematical statement of subsumption, it is too strict; term pairs that fail it narrowly may still stand in a semantic inclusion relationship. Sanderson therefore relaxed the condition to: if $T_j$ subsumes $T_k$, then $P(T_j \mid T_k) \geq 0.8$ and $P(T_k \mid T_j) < 1$ should hold. This test extracted on average 200 subsumption pairs from each query topic of the TREC collection. From these pairs, a concept hierarchy can be presented through hierarchical-menu window widgets, the subsuming term as a parent node and the subsumed term as its child. In testing, 67% of the pairs were judged relevant (interesting for further exploring), "relevant" here meaning that subjects found the automatically extracted pair interesting enough to want to explore why it arose. Experiments showed, however, that random pairing alone reached 51% relevance, probably because the candidate terms all came from the same search result and were thus very likely to share a topic, so that even randomly matched terms exhibit some degree of relatedness. Overall, this method runs only at query time, so query response time is affected, and the suggested terms are limited to the top N retrieved documents; it is therefore not a method for constructing a global thesaurus.
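A compact sketch of the relaxed subsumption test follows (Python), assuming each term is represented by the set of retrieved documents it occurs in; the sample data are invented:

```python
from itertools import combinations

def subsumption_pairs(term_docs, p=0.8):
    """Yield (broader, narrower) pairs using Sanderson's relaxed test:
    T_j subsumes T_k when P(T_j | T_k) >= p and P(T_k | T_j) < 1."""
    for tj, tk in combinations(term_docs, 2):
        for a, b in ((tj, tk), (tk, tj)):
            docs_a, docs_b = term_docs[a], term_docs[b]
            p_a_given_b = len(docs_a & docs_b) / len(docs_b)
            p_b_given_a = len(docs_a & docs_b) / len(docs_a)
            if p_a_given_b >= p and p_b_given_a < 1:
                yield a, b  # a subsumes (is broader than) b

term_docs = {
    "portable computer": {1, 2, 3, 4, 5},
    "notebook computer": {1, 2, 3, 4},
    "handheld computer": {4, 5},
}
# "portable computer" subsumes both of the narrower terms:
print(list(subsumption_pairs(term_docs)))
```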
[Summary of the Invention]

It is therefore an object of the present invention to provide a method for constructing an incremental thesaurus with a fast construction speed. Another object of the present invention is to provide a method for constructing an incremental thesaurus that can extract related terms from a single document.

To achieve these and other objects, the present invention provides a method for constructing an incremental thesaurus. The method first extracts a plurality of keywords from a document set, then divides each document in the set into a plurality of logical segments, and performs related-term analysis on the basis of these logical segments to build the thesaurus.

In a preferred embodiment of the invention, the step of analyzing related terms on the basis of logical segments first computes an association weight between two keywords, and accumulates the association weight whenever it is greater than or equal to a threshold. The accumulated association weight is finally multiplied by a normalized inverse document frequency to obtain a similarity. In addition, when the document lengths in the document set differ by more than a preset amount, a relatively long document may be cut into several documents.

In another preferred embodiment, the association weight is computed by the following equation:

$$wg_i(T_{ij}, T_{ik}) = \frac{2 \times S(T_{ij} \cap T_{ik})}{S(T_{ij}) + S(T_{ik})} \times \ln(1.72 + S_i)$$

where $S_i$ is the number of logical segments into which document $i$ is divided, $S(T_{ij})$ is the number of logical segments of document $i$ in which term $j$ appears, and $S(T_{ij} \cap T_{ik})$ is the number of logical segments in which terms $j$ and $k$ appear together.

In yet another preferred embodiment, the similarity is computed by the following equation:

$$sim(T_j, T_k) = \frac{\log(w_k \times n / df_k)}{\log(n)} \times \sum_{i=1}^{n} f(wg_i(T_{ij}, T_{ik}))$$

where $n$ is the total number of documents, $df_k$ is the number of documents in which term $k$ appears, and $w_k$ is the length of keyword $k$; when $wg_i(T_{ij}, T_{ik})$ is greater than or equal to a threshold, $f(wg_i(T_{ij}, T_{ik})) = wg_i(T_{ij}, T_{ik})$, and otherwise $f(wg_i(T_{ij}, T_{ik})) = 0$.

When displaying a related term, a preferred embodiment may show alongside it the term frequency and the similarity associated with that related term. The display position of a related term may also be decided according to its similarity, and related terms may be sorted by term frequency, by time of appearance, or by similarity.

Because the invention analyzes related terms in units of the logical segments of a text, such as sentences or paragraphs, the corresponding related terms and the associations between them can be obtained from a single document in a short time. When a new document is added, only the new document needs to be analyzed against the existing related terms; the related-term analysis need not be redone over all documents. The thesaurus can therefore be built much faster than with prior techniques.

To make the above and other objects, features, and advantages of the present invention more comprehensible, a preferred embodiment is described in detail below with reference to the accompanying drawings.

[Description of the Embodiments]

Referring to FIG. 1, a flow chart of the steps of a preferred embodiment of the method for automatically constructing a co-occurrence thesaurus according to the invention: in this embodiment, keywords are first extracted from the documents (S100); each document is then divided into a plurality of logical segments, and related-term analysis is carried out on the basis of those segments (S102); finally, all the related terms are accumulated into a thesaurus (S104), which constitutes the co-occurrence thesaurus.

Because keywords are the meaningful and representative terms of a document, and the smallest units that convey the document's subject, all kinds of automated document processing must begin with a step of automatic keyword extraction. Judging what counts as a keyword, however, is a subjective decision ill-suited to automatic processing by computer, so this method adopts an assumption: if a document discusses some topic, it should mention certain particular strings several times. This tendency of topical terms to repeat is a rule a computer can follow, and a basis on which keywords can be extracted.
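The patented extraction algorithm itself is not reproduced in this text; the sketch below (Python) is only a naive stand-in for the repetition hypothesis, collecting substrings that recur in the text, with the n-gram sizes, the containment filter, and the sample string all chosen for illustration:

```python
from collections import Counter

def repeated_ngrams(text, max_len=6, min_count=2):
    """Collect character n-grams occurring at least min_count times,
    then drop any n-gram that only ever occurs inside a longer,
    equally frequent n-gram (it would just be a fragment of it)."""
    counts = Counter()
    for size in range(2, max_len + 1):
        for i in range(len(text) - size + 1):
            counts[text[i:i + size]] += 1
    frequent = {g: c for g, c in counts.items() if c >= min_count}
    return {
        g: c for g, c in frequent.items()
        if not any(g != h and g in h and frequent[h] == c for h in frequent)
    }

# Repeated fragments of "notebook computer"/"portable computer" variants:
print(repeated_ngrams("筆記型電腦與筆記本電腦都是攜帶型電腦"))
```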
Following this repetition hypothesis, the inventor previously developed an automatic keyword extraction method, for which an invention patent of the Republic of China was granted (ROC Invention Patent No. 1^789). The method can extract newly coined terms, proper nouns, personal names, place names, and institution names; it does not require complete documents, so it works in noisy environments such as OCR output and speech-recognition transcripts; it needs no extra resources such as dictionaries, thesauri, grammar parsers, or corpora that require great manual effort to build and maintain in advance; the extracted keywords have no length limit; extraction is fast and consumes little memory; keywords with very weak statistical evidence (occurring only twice) can still be extracted; and the extraction precision is high, about 86%.

Because this automatic keyword extraction method rests on a single simple assumption, documents from all kinds of fields, such as engineering, humanities, medicine, and patents, and in different languages, Chinese, English, even melodic strings of music, can immediately be processed with it to extract keywords, key melodies, or key passages of a document.

As for documents that appear not to satisfy the assumption, such as bibliographic records or other short documents whose concise text makes term repetition unlikely, a few techniques can make them satisfy the repetition assumption anyway. One technique is to gather related short documents into longer ones, giving the important terms more chance to repeat, before performing keyword extraction. For example, when processing bibliographic data, 350,000 titles can be handled by treating every 50,000 titles as one document for keyword extraction; the extracted keywords are then merged and matched back to the titles they came from, building a thesaurus for query suggestion and bibliographic retrieval. Alternatively, using a retrieval function or any other reliable means, documents related to the content of the short document can be located as background documents; treating the short document and the background documents as a single document lets the important terms repeat, keyword extraction is performed, and finally only the terms that occur in the short document are kept as its automatically extracted keywords.

A method that can extract new terms without relying on a lexicon or dictionary is very important for certain document types and for cross-domain collections. Chinese news documents, for example, constantly coin new vocabulary. The inventor once compared the vocabulary of news documents against a lexicon of 120,000 Chinese words and against the inventor's own method: the method extracted on average 33 keywords per news document (terms occurring at least twice), of which 11 (33%) were meaningful terms not recorded in that lexicon, showing that extracting the vocabulary of news documents with a pre-built lexicon alone could miss a third of the important terms.

This automatic keyword extraction method can also be combined with dictionary-based word segmentation to raise the precision of keyword extraction.
A plain dictionary-based segmentation such as the longest-match-first method reads the document character by character and, at each character, looks in the lexicon for the longest word beginning with that character; if the lexicon contains such a word, it is cut from the text, and matching resumes at the character following the word. If the lexicon has no word beginning with that character, the character is cut out as a single-character token (or treated as part of an unknown word), and matching continues from the next character, repeating until the end of the document. The words segmented this way are all lexicon words, so the precision is high; but the document may contain terms absent from the lexicon, so the recall is low. Since the keyword extraction method described above can capture new terms, it can be applied after dictionary-based segmentation, letting the strengths of the two methods compensate each other and yielding both high precision and high recall. Experiments show that if a document is first segmented with a 120,000-word lexicon and the automatic keyword extraction method is then applied, the keyword precision rises from the original 86% to 96%.
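To illustrate the longest-match-first procedure just described, a minimal sketch follows (Python); the lexicon and the maximum word length are invented values:

```python
def longest_match_segment(text, lexicon, max_word_len=8):
    """Longest-match-first segmentation: at each position take the
    longest lexicon word starting there; otherwise emit one character
    (possibly a fragment of an unknown word)."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

lexicon = {"漸進式", "關聯詞庫", "建構", "方法"}
print(longest_match_segment("漸進式關聯詞庫之建構方法", lexicon))
# -> ['漸進式', '關聯詞庫', '之', '建構', '方法']
```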

After the keywords of the documents have been extracted by the foregoing method or any other method, association analysis can proceed. Past analysis methods were developed around whether terms often co-occur in the same document. In a full-text environment, however, computing co-occurrence with a whole document as the unit easily loses precision, because the farther apart two terms lie within a document, the less likely they are to be closely related. The unit of co-occurrence should therefore be narrowed to a logical paragraph, a paragraph of the text, a sentence, or any other unit into which a document is divided. One can then count, within a single document, the number of units in which any two terms occur individually and occur together, and apply a similarity measure commonly used in information retrieval, such as Dice, Cosine, Jaccard, or mutual information, to obtain the degree of association between the two terms. After the degrees of association among the important terms of every document are computed, the associated pairs whose accumulated strength exceeds some threshold complete the thesaurus of the whole document database.

In the present invention, the association weight of two terms within the same document is computed with the following formula:

$$wg_i(T_{ij}, T_{ik}) = \frac{2 \times S(T_{ij} \cap T_{ik})}{S(T_{ij}) + S(T_{ik})} \times \ln(1.72 + S_i)$$

where $S_i$ is the number of units into which document $i$ is divided; the unit is usually a sentence, a paragraph, or any meaningful division, and for convenience of exposition we refer to it as a "sentence". $S(T_{ij})$ is the number of sentences of document $i$ in which term $j$ appears, and $S(T_{ij} \cap T_{ik})$ is the number of sentences in which terms $j$ and $k$ appear together. The first factor of the formula is thus exactly the Dice coefficient.

When the Dice coefficient is used alone, however, the association weights between terms in a long document tend to be numerically lower than those in a short document. We therefore compensate the weights of long-document terms with the second factor, $\ln(1.72 + S_i)$, so that the term association weights of documents of different lengths fall within roughly the same range. It must be noted that the constant 1.72 used here is not the only usable value; in fact, any value greater than or equal to the base of the natural logarithm can be used here.

After the above computation, the invention accumulates those association weights that are greater than or equal to a certain threshold, and multiplies the result by the normalized inverse document frequency (IDF) to obtain the final similarity. The exact formula is:

$$sim(T_j, T_k) = \frac{\log(w_k \times n / df_k)}{\log(n)} \times \sum_{i=1}^{n} f(wg_i(T_{ij}, T_{ik}))$$

where $n$ is the total number of documents, $df_k$ is the number of documents in which term $k$ appears, and $w_k$ is the length of term $k$; when $a$ is greater than or equal to a threshold, $f(a) = a$, and otherwise $f(a) = 0$.
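A minimal sketch of this computation follows (Python). The substring test for segment membership, the default threshold of 1.0, and all names are illustrative assumptions; the formulas themselves are the ones given above:

```python
import math
from itertools import combinations

def doc_association_weights(segments, keywords, threshold=1.0):
    """Association weights within one document split into S_i segments:
    wg_i(T_ij, T_ik) = [2 * S(T_ij & T_ik) / (S(T_ij) + S(T_ik))]
                       * ln(1.72 + S_i),
    keeping only the weights that reach the threshold (the f() cutoff)."""
    s_i = len(segments)
    occurs = {t: {i for i, seg in enumerate(segments) if t in seg}
              for t in keywords}
    weights = {}
    for tj, tk in combinations(sorted(set(keywords)), 2):
        both = len(occurs[tj] & occurs[tk])
        if both == 0:
            continue
        dice = 2 * both / (len(occurs[tj]) + len(occurs[tk]))
        wg = dice * math.log(1.72 + s_i)   # math.log is the natural log
        if wg >= threshold:
            weights[(tj, tk)] = wg
    return weights

def finalize_similarity(accumulated, df, n, w):
    """sim(T_j, T_k) = log(w_k * n / df_k) / log(n) * sum_i f(wg_i);
    here the IDF factor uses the second term of each pair, and it can
    be applied once, after every document has been processed."""
    return {(tj, tk): math.log(w[tk] * n / df[tk]) / math.log(n) * total
            for (tj, tk), total in accumulated.items()}
```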
In this formula, the inverse document frequency (IDF) factor can be computed after all documents have been processed, without affecting the extraction of related terms from a single document. In most cases a threshold of 1.0 is applicable to most document types. In a bibliographic database, for example, one bibliographic record can be regarded as one logical segment, and a pair of terms that appears together once in a record attains an association weight of $2 \times 1/(1+1) \times \ln(1.72+1) = \ln(2.72) > 1.0$, which passes the threshold. For document databases whose document lengths are distributed very unevenly, it is best to cut the extremely long documents (long relative to the other documents in the database) into several shorter documents before applying the formula to obtain their related terms and similarities.

Changing the co-occurrence unit from the whole document to the smaller scope of a paragraph or sentence has the immediate effect that related terms can be extracted from a single document, instead of becoming visible only after the entire collection has been processed. This effect has three advantages. First, incremental extraction of related terms becomes easy: for a database to which documents are added daily or periodically, documents already processed need not be processed again; it suffices to extract the related terms of the new documents and add them to the thesaurus. Second, it is easy to track which related terms appear in which documents, so sorting related terms by association strength, by the number of documents in which a related term appears (df), or by document date all become trivially feasible. Third, extraction is fast. Computing related terms with the whole document as the co-occurrence unit requires comparing all pairs of important terms over all documents to learn whether they often co-occur, about $O(m^2 n)$ operations, where $m$ is the number of important terms and $n$ the number of documents; in a document database $m$ usually runs to tens or even hundreds of thousands, and $n$ from thousands to hundreds of thousands or even millions. The modified approach costs about $O(n K^2 s)$, where $K$ is the number of terms per document used in the association analysis and $s$ is the average number of sentences per document; $K$ and $s$ can generally be set in the range 10 to 100, so $O(n K^2 s)$ is much smaller than $O(m^2 n)$ and the computation time is much shorter.

Experiments show that, from 330,000 Chinese news documents (about 323 MB of text), a desktop Pentium II machine took about 5.5 hours to produce 2.5 million related-term pairs (with 250,000 distinct terms in all), an average of about 10 related terms per term. By comparison, Chen et al. spent 9.2 hours on a Sun Sparc workstation to find 1,708,551 related-term pairs from 4,714 English documents (8 MB of text in total), and then removed 60% of the pairs as the final result because there were too many; another of their experiments, on a 2 GB database of English abstracts, produced 4,000,000 related-term pairs from 270,000 terms at a cost of 24.5 CPU hours on a supercomputer.
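The incremental bookkeeping described above might be organized as in the following sketch, which reuses the hypothetical doc_association_weights() and finalize_similarity() from the previous example; the per-pair document counter is an added illustrative detail supporting the df-based ordering of related terms:

```python
from collections import Counter, defaultdict

class IncrementalThesaurus:
    """Accumulates association weights document by document. Adding a
    document costs about O(K^2 * s) and never reprocesses earlier
    documents; the IDF factor is applied only when reading out."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.acc = defaultdict(float)  # accumulated wg per term pair
        self.pair_df = Counter()       # documents in which a pair co-occurred
        self.term_df = Counter()       # document frequency of each term
        self.n_docs = 0

    def add_document(self, segments, keywords):
        self.n_docs += 1
        self.term_df.update(set(keywords))
        pairs = doc_association_weights(segments, keywords, self.threshold)
        for pair, wg in pairs.items():
            self.acc[pair] += wg
            self.pair_df[pair] += 1

    def similarities(self, term_length):
        # term_length: mapping from each term to its length w_k
        return finalize_similarity(self.acc, self.term_df,
                                   self.n_docs, term_length)
```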
FIG. 2 shows the content of one document together with an example of the keywords and related terms automatically extracted by the method described above. The parenthesis after each keyword gives the number of times the term occurs in the text, while the related terms are shown in a two-dimensional diagram of the associations among the terms. Once the related terms of each document have been extracted, they can be accumulated into a thesaurus for query suggestion; a user can thereby grasp the rough content of the documents, pick useful related terms, and explore the information recorded in the database.

FIG. 3 shows a result of another preferred embodiment of the invention. After the user enters DoCoMo, the system suggests in the leftmost column three terms approximating the string DoCoMo: one is the query term itself, and two are terms subsumed by the query term (both may be viewed as narrower terms). Each term is followed in parentheses by the number of documents in which it appears. Reading the term as a category label, the user can see clearly that 62 documents belong to the DoCoMo category, 19 documents to the NTT DoCoMo category, and so on, and can click a smaller category to narrow the scope quickly. These counts also let the user know, without issuing further queries, how many documents each term would retrieve, so that a single query reveals the outcome of several. The effect resembles a classified directory, except that this directory changes with the query term and with the document database; it is therefore called a "dynamic classified directory".

The middle column of FIG. 3, labeled "Related Terms", also consists of suggestions built from the thesaurus. For each related term of the query DoCoMo, the parentheses show two numbers: the first is the term's document frequency (df, the number of documents it appears in) and the second is its accumulated association strength. In this example the related terms of DoCoMo are sorted by association strength in descending order. From these suggestions one can roughly tell that DoCoMo is connected with the Japanese telecommunications industry, and, since 「三G」 (3G) and i-Mode are terms peculiar to mobile phones, more precisely with the mobile-phone sector of that industry. For a user who wants to explore DoCoMo, such suggestions serve to a considerable degree as a summary; for further detail, the user can click an appropriate term, bring up the related documents, and learn more from their descriptions.

The related terms above are displayed as text in one dimension; related terms can also be applied in a two-dimensional setting, which offers another kind of usefulness. Moreover, if the nature of a term can be identified, the display can volunteer more related information according to that nature. As shown in FIG. 4, after querying MP3 the user finds many terms related to MP3, among which 「中環」 looks like a company name; clicking it retrieves the documents relating MP3 to that company so the user can study the relationship in detail. The system has meanwhile also matched the name in a company database, so the company's details are displayed as well. In this way, associations between things can be obtained from unstructured documents.
These associations are like a record of knowledge waiting to be explored: some of the knowledge is revealed by the free text of the documents, and some is supplemented by structured data prepared in advance. Exploring step by step in such a two-dimensional space is like discovering the needed information or knowledge step by step on a map; such a system may therefore be called, for short, a "knowledge map".

As noted above, when the related terms of each document are accumulated, we learn the association strength (i.e., similarity) between terms, the documents and document counts in which they occur, and their dates and times. When displaying the related terms of a query term, we can therefore sort by any of this information. Whatever the ordering, the corresponding set of related terms is the same, only the sequence of the terms differs; but different orderings give the user a different impression of how good the suggested terms are, and one ordering may beat another, that is, put a higher proportion of relevant terms up front. FIG. 5 shows the related terms of the query 「古蹟」 ("historic sites") sorted in four ways: by term frequency, by time, by similarity without IDF, and by similarity. "Term frequency" is the number of documents in which the suggested term occurs. "Time" takes, among the documents in which the suggested term co-occurs with the query term, the most recent document date as the date of that suggestion, and sorts the suggestions from newest to oldest; the intent is to account for "related pairs" that appear only in recent reports and have not yet accumulated enough strength or frequency to be ranked high, a property that news collections in particular exhibit. "Similarity without IDF" is the similarity obtained from the formula

$$sim_{noIDF}(T_j, T_k) = \sum_{i=1}^{n} f(wg_i(T_{ij}, T_{ik}))$$

where the function $f()$ is defined as before; compared with the original similarity formula it lacks the factor $\log(w_k \times n/df_k)/\log(n)$, the purpose being to contrast the effect of removing the inverse document frequency (IDF). Finally, "similarity" is the similarity obtained from the original formula. Among the suggestions in the figure, the terms judged relevant to the query 「古蹟」 are check-marked; in this example, sorting by similarity looks best.

In the experiments, a set of query terms was prepared and, for each query, the top N (here N = 50) suggested terms were judged for relevance to the query; averaging the results over all 30 query terms gives the results shown in FIG. 6. When the suggestions are sorted by similarity, 69% are judged relevant, followed by similarity without IDF at 62%, then time at 59%, with document-frequency ordering worst at 48%. The numbers confirm once more that inverse document frequency often plays an important role in information retrieval; and although time ordering performs relatively poorly here, it may still matter in document databases that emphasize chronological order.

Judged by the proportion of relevant terms alone, sorting by similarity is evidently best.
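The four orderings compared in FIG. 5 differ only in the sort key; a trivial sketch follows (Python), with invented record fields:

```python
def rank_suggestions(suggestions, key):
    """suggestions: dicts with the invented fields 'term', 'df',
    'latest_date' (ISO string), 'sim_no_idf', and 'sim'.
    Larger (or later) values of the chosen key rank first."""
    return sorted(suggestions, key=lambda s: s[key], reverse=True)

suggestions = [
    {"term": "temple", "df": 12, "latest_date": "2003-04-01",
     "sim_no_idf": 3.1, "sim": 2.4},
    {"term": "restoration", "df": 5, "latest_date": "2003-05-02",
     "sim_no_idf": 2.2, "sim": 2.9},
]
for key in ("df", "latest_date", "sim_no_idf", "sim"):
    print(key, [s["term"] for s in rank_suggestions(suggestions, key)])
```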
When the proportions of relevant terms are equal, computing precision and recall can further expose the differences between methods. For example, when two orderings both reach a relevance proportion of 60%, that is, 30 of the top 50 suggestions are judged relevant, having those 30 relevant suggestions all at the front versus all at the back (the first 20 all irrelevant) clearly still differs greatly in effect. The precision and recall program trec_eval, commonly used in the TREC retrieval competitions, can further distinguish the quality of two such orderings. In addition, the four orderings above may not present the same relevant suggestions within their top 50; pooling the relevant suggestions found by all four methods and computing recall through trec_eval gives a rough picture of whether a method often finds relevant suggestions the other methods cannot, or whether the relevant suggestions found by the other methods are usually all found by some single method. FIG. 7 shows the trec_eval output for the query 「古蹟」 sorted by similarity. The table shows that the more the terms judged relevant are concentrated at the front, the higher the average precision; if just as many relevant suggestions are found but they are not concentrated at the front, the average precision is lower. The final mean average precision, obtained by averaging the average precision over all 30 query terms, again shows sorting by similarity to be the most effective, at 0.5284, followed by similarity without IDF at 0.4346, then time at 0.4028, and finally document frequency at 0.3020.

In summary, the method for constructing an incremental thesaurus proposed by the invention is fast to build and effective. Compared with past studies, although the evaluation environments differ, the numbers put the results at roughly the same level. Moreover, related terms can be extracted from a single document, and incremental extraction is easy: for databases to which documents are added daily or periodically, documents already processed need no reprocessing, and only the related terms of the new documents need to be extracted and added to the thesaurus. Sorting related terms is also easy, whether by association strength, by the document frequency (df) of the related term, or by document date.

Although the invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention is defined by the appended claims.

[Brief Description of the Drawings]

FIG. 1 is a flow chart of the steps performed in a preferred embodiment of the invention;

FIG. 2 shows the content of a document together with an example of the keywords and related terms automatically extracted by a preferred embodiment of the invention;
FIG. 3 is a schematic view of a result of another preferred embodiment of the invention;

FIG. 4 is a schematic view of another result according to a preferred embodiment of the invention;

FIG. 5 is a schematic view of the results of querying the related terms of 「古蹟」 ("historic sites") in a preferred embodiment, sorted in the four ways of term frequency, time, similarity without IDF, and similarity;

FIG. 6 is a schematic view of results obtained according to a preferred embodiment of the invention; and

FIG. 7 shows the trec_eval output for the query 「古蹟」 sorted by similarity.

[Description of Reference Numerals]

S100-S104: steps performed in a preferred embodiment of the invention

Claims (1)

1. A method for constructing an incremental thesaurus, comprising:
extracting a plurality of keywords from a document set;
dividing each document in the document set into a plurality of logical segments;
computing an association weight between two keywords according to the logical segments;
accumulating the association weight when the association weight is greater than or equal to a threshold;
multiplying the accumulated association weight by a normalized inverse document frequency to obtain a similarity, and building the thesaurus according to the similarity; and
displaying related terms in the thesaurus.

2. The method for constructing an incremental thesaurus of claim 1, wherein the association weight is computed by the equation

$$wg_i(T_{ij}, T_{ik}) = \frac{2 \times S(T_{ij} \cap T_{ik})}{S(T_{ij}) + S(T_{ik})} \times \ln(1.72 + S_i)$$

where $S_i$ is the number of logical segments into which document $i$ is divided, $S(T_{ij})$ is the number of logical segments of document $i$ in which term $j$ appears, and $S(T_{ij} \cap T_{ik})$ is the number of logical segments in which terms $j$ and $k$ appear together.

3. The method for constructing an incremental thesaurus of claim 1, wherein the similarity is computed by the equation

$$sim(T_j, T_k) = \frac{\log(w_k \times n / df_k)}{\log(n)} \times \sum_{i=1}^{n} f(wg_i(T_{ij}, T_{ik}))$$

where $n$ is the total number of documents, $df_k$ is the number of documents in which term $k$ appears, and $w_k$ is the length of keyword $k$; when $wg_i(T_{ij}, T_{ik})$ is greater than or equal to a threshold, $f(wg_i(T_{ij}, T_{ik})) = wg_i(T_{ij}, T_{ik})$, and otherwise $f(wg_i(T_{ij}, T_{ik})) = 0$.

4. The method for constructing an incremental thesaurus of claim 1, wherein the step of displaying the related terms in the thesaurus comprises:
displaying, together with a related term, the term frequency and a similarity associated with the related term.
5. The incremental thesaurus construction method of claim 1, wherein the step of displaying the related terms in the thesaurus comprises:
determining the display position of each related term according to its similarity.

6. The incremental thesaurus construction method of claim 5, wherein related terms with lower similarity are displayed farther from a center point.

7. The incremental thesaurus construction method of claim 1, wherein the step of displaying the related terms in the thesaurus comprises:
sorting the related terms by term frequency.

8. The incremental thesaurus construction method of claim 1, wherein the step of displaying the related terms in the thesaurus comprises:
sorting the related terms by time of occurrence.

9. The incremental thesaurus construction method of claim 1, wherein the step of displaying the related terms in the thesaurus comprises:
sorting the related terms by a similarity computed without the inverse document count.

10. The incremental thesaurus construction method of claim 1, further comprising:
splitting relatively long documents into multiple documents when the difference in document lengths within the document set exceeds a predetermined value.

Designated representative figure: Figure (1). Reference symbols: S100~S104 denote the steps performed according to a preferred embodiment of the invention.
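Claims 1 and 3 accumulate the per-document weights that clear a threshold and scale the sum by a normalized inverse document count; claims 5 through 9 then order the related terms for display. A hedged sketch continuing the function above: the log(n/df_k)/log(n) normalization and the default threshold are assumptions of this sketch, one plausible reading of "normalized inverse document count", not confirmed by the claim text:

```python
import math

def similarity(docs, term_j, term_k, threshold):
    """Claim 3 (a sketch): sum of thresholded per-document weights,
    scaled by an assumed normalized IDF factor for term_k."""
    n = len(docs)                                         # total number of documents
    df_k = sum(1 for segments in docs
               if any(term_k in seg for seg in segments)) # documents containing term_k
    if df_k == 0 or n <= 1:
        return 0.0
    total = 0.0
    for segments in docs:
        w = association_weight(segments, term_j, term_k)
        if w >= threshold:                                # f(w) = w if w >= threshold else 0
            total += w
    return total * math.log(n / df_k) / math.log(n)       # assumed normalization

def related_terms(docs, query_term, vocabulary, threshold=0.2):
    """Claims 5-9 (a sketch): rank candidate related terms by similarity.
    Other sort keys (term frequency, time of occurrence, similarity
    without IDF) would replace the sort key analogously."""
    scored = [(t, similarity(docs, query_term, t, threshold))
              for t in vocabulary if t != query_term]
    return sorted(((t, s) for t, s in scored if s > 0),
                  key=lambda ts: ts[1], reverse=True)
```

As in standard IDF weighting, a term that appears in every document gets log(n/df_k) = 0 and is filtered out, which suppresses overly common candidates.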
TW092112651A 2003-05-09 2003-05-09 Incremental thesaurus construction method TWI290684B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW092112651A TWI290684B (en) 2003-05-09 2003-05-09 Incremental thesaurus construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW092112651A TWI290684B (en) 2003-05-09 2003-05-09 Incremental thesaurus construction method

Publications (2)

Publication Number Publication Date
TW200424874A TW200424874A (en) 2004-11-16
TWI290684B true TWI290684B (en) 2007-12-01

Family

ID=39327562

Family Applications (1)

Application Number Title Priority Date Filing Date
TW092112651A TWI290684B (en) 2003-05-09 2003-05-09 Incremental thesaurus construction method

Country Status (1)

Country Link
TW (1) TWI290684B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI399655B (en) * 2008-09-15 2013-06-21 Learningtech Corp Patent technical terms system, method for constructing the patent technical terms system, and computer program product
TWI393018B (en) * 2009-02-06 2013-04-11 Inst Information Industry Method and system for instantly expanding keyterm and computer readable and writable recording medium for storing program for instantly expanding keyterm
TWI456412B (en) * 2011-10-11 2014-10-11 Univ Ming Chuan Method for generating a knowledge map
CN107291783B (en) * 2016-04-12 2021-04-30 芋头科技(杭州)有限公司 Semantic matching method and intelligent equipment
CN110276079B (en) * 2019-06-27 2023-05-26 谷晓佳 Word stock establishment method, information retrieval method and corresponding system

Also Published As

Publication number Publication date
TW200424874A (en) 2004-11-16

Similar Documents

Publication Publication Date Title
US9864808B2 (en) Knowledge-based entity detection and disambiguation
Pu et al. Subject categorization of query terms for exploring Web users' search interests
Lee et al. News keyword extraction for topic tracking
US7185001B1 (en) Systems and methods for document searching and organizing
US7836083B2 (en) Intelligent search and retrieval system and method
US6665661B1 (en) System and method for use in text analysis of documents and records
US10387469B1 (en) System and methods for discovering, presenting, and accessing information in a collection of text contents
US6286000B1 (en) Light weight document matcher
US20110225155A1 (en) System and method for guiding entity-based searching
Manjari et al. Extractive Text Summarization from Web pages using Selenium and TF-IDF algorithm
Chuang et al. Taxonomy generation for text segments: A practical web-based approach
WO2002048921A1 (en) Method and apparatus for searching a database and providing relevance feedback
Lin et al. ACIRD: intelligent Internet document organization and retrieval
Tseng et al. Patent surrogate extraction and evaluation in the context of patent mapping
Pramana et al. Systematic literature review of stemming and lemmatization performance for sentence similarity
Kerremans et al. Using data-mining to identify and study patterns in lexical innovation on the web: The NeoCrawler
TWI290684B (en) Incremental thesaurus construction method
JP2001184358A (en) Device and method for retrieving information with category factor and program recording medium therefor
US7970752B2 (en) Data processing system and method
Chen Building a web‐snippet clustering system based on a mixed clustering method
Zheng et al. A concept-driven automatic ontology generation approach for conceptualization of document corpora
Molková Indexing very large text data
Ayele Text Mining Technique for Driving Potentially Valuable Information from Text
Paramartha et al. The Development of search engine service for official academic documents
Yee Retrieving semantically relevant documents using Latent Semantic Indexing

Legal Events

Date Code Title Description
MK4A Expiration of patent term of an invention patent