201126359

VI. Description of the Invention:

[Technical Field]

The present invention relates to an automated keyword evaluation system and method that derive a keyword weight threshold and use that threshold to decide whether a document word in a document is a keyword.

[Prior Art]

As computer applications have become widespread, electronic storage media have replaced traditional physical media such as paper as the primary and most important means of storing data for individuals, organizations, and enterprises alike. The effective management and use of data has therefore long been a concern of users, content managers, and developers.

Among the different types of data, unstructured data has always been the hardest part of data management. Unstructured data refers broadly to any data that lacks a predefined content format (metadata), such as plain-text documents. The main difficulty is that, without a descriptive format for its content, the data is not self-describing, so agents or data-processing programs cannot automatically process it and interpret the meaning of its content. To an automated program, for example, an unprocessed unstructured plain-text file can only be treated as a long string or a long byte stream; since its meaning cannot be interpreted, the data can only be stored and cannot be managed or applied according to user needs.

The unstructured data management problem described above is generally addressed by research in two fields: information retrieval and text mining.
In the field of information retrieval (IR) in particular, the aim is precisely to solve the problem that unstructured data cannot be managed and applied. Its most basic idea is to build, by reverse engineering from the content of the unstructured data, a logical overview of each document, and to store that overview in a machine-interpretable form and format, so that the system can manage unstructured documents optimally and users can operate on the data according to their own needs, for example by querying or maintaining it. Building an information retrieval system can therefore be seen as the construction of a document library (textbase). A key technical step in this construction is how to build a meaningful logical overview of the document collection in the file system, so that users can apply the data through this logical-layer description. In library science this process is called automatic indexing. Building a "meaningful" logical overview means building a logical description of the documents and the document collection according to the original semantic content of the documents, rather than by an arbitrary indexing scheme such as coding.

An effective way to construct a logical description of a document is therefore to describe it by its keywords and key phrases. Academically, the related research is called automatic summarization of documents: from an unstructured document containing a large amount of text, a small set of words or phrases is extracted that covers the meaning the document is supposed to express. The summary may take the form of a set of keywords, key phrases, or even sentences (hereinafter the term "keyword" is used for both keywords and key phrases).
In quantitative terms, finding keywords in a document can generally be viewed as finding the words that carry the greatest weight in expressing the document. That is, if the ability of a word to express the document is used as a quantitative index, the words with the largest index values form the keyword set, which serves as a summary of the document's characteristics.

Other research introduces concepts from knowledge discovery and evolutionary computation to assist keyword extraction with the support of a large collection of training documents. Knowledge-discovery methods use classification to pick out, from a large body of text, the words that have a high probability of being keywords. Evolutionary-computation methods build a membership-function model and try to compute the probability that a word is a keyword.

Taken together, this research focuses on establishing weight calculation rules, that is, on which method or process computes more accurately the quantitative weight of a word in a document or document collection. What is missing is a systematic mechanism for deciding the keyword list of a document or document collection. At present, the expert method and the fixed-count threshold method remain the main ways of deciding that list. In the expert method, an expert with domain knowledge chooses a specific weight threshold, and the keyword extraction system extracts the words whose weights exceed it as keywords. In the fixed-count threshold method, a fixed number of the highest-weighted words in the word-weight list (or some other preset quantity of them) is selected as keywords.
Both methods, however, have drawbacks. The expert method requires an expert to decide the keyword set of every document or document collection, which does not meet the needs of practical applications. The fixed-count threshold method lowers keyword quality: every document has its own textual characteristics, so a fixed quantity may be too many for some documents and too few for others. Selecting too many keywords lowers precision, and selecting too few lowers completeness; either reduces the ability of the keyword set to represent the document or document collection.
A systematic method of dynamically determining the keyword weight threshold is therefore urgently needed, one that provides good keyword extraction quality and improves the efficiency of keyword extraction systems.

[Summary of the Invention]

One embodiment of the present invention provides a keyword evaluation system comprising a word segmentation module, a term-frequency statistics and filtering module, a weight threshold determination module, and a keyword determination module. The word segmentation module selects at least one reference document from a document library and performs a word segmentation procedure on it to produce a document-word set, where the reference document includes a plurality of known keywords. The term-frequency statistics and filtering module generates term-frequency rank-ratio information from the document-word set and the known keywords, and uses that information to filter out the known keywords that are outliers. The weight threshold determination module uses a weight index method to compute the weight values corresponding to the filtered known keywords, and sets the smallest of these weight values as a weight threshold. The keyword determination module judges whether the weight value of each document word in a target document (the document to be compared) is greater than the weight threshold and, if so, determines that the document word is a keyword.
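The decision rule of the keyword determination module can be illustrated with a short sketch. The function name `decide_keywords` and the sample weights are hypothetical and do not come from the specification; the sketch only assumes that a weight has already been computed for every document word of the target document.

```python
def decide_keywords(word_weights, weight_threshold):
    """Return the document words whose weight is strictly greater than the threshold.

    word_weights: dict mapping each document word of the target document
    to its weight (illustrative values, computed elsewhere).
    """
    return [word for word, weight in word_weights.items()
            if weight > weight_threshold]


# Illustrative weights only; they are not values from the specification.
sample_weights = {"this": 0.00, "is": 0.00, "a": 0.00, "cat": 0.17}
print(decide_keywords(sample_weights, weight_threshold=0.10))  # ['cat']
```

The comparison is strict, following the wording "greater than the weight threshold", so a word whose weight equals the threshold is not selected in this sketch.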
Another embodiment of the present invention provides a keyword evaluation method comprising the following steps: performing a word segmentation procedure on at least one reference document in a document library to produce a document-word set, where the reference document includes a plurality of known keywords; generating term-frequency rank-ratio information from the document-word set and the known keywords; filtering out, according to the rank-ratio information, the known keywords that are outliers; computing, by a weight index method, the weight values corresponding to the filtered known keywords; setting the smallest of these weight values as a weight threshold; and outputting the weight threshold to an information retrieval system or an automatic summarization system, which uses the weight threshold to judge whether each document word in a target document is a keyword.

As to other features and advantages of the invention, those skilled in the art may, without departing from the spirit and scope of the invention, make minor changes and refinements to the keyword evaluation system and method disclosed herein.

[Embodiments]

This section describes the best mode of carrying out the invention. The scope of protection of the invention is defined by the appended claims.

The invention provides, for unstructured documents, a keyword evaluation method that dynamically determines the weight threshold for document keywords without requiring prior domain knowledge. Figure 1 shows a keyword evaluation system according to an embodiment of the invention. In the keyword evaluation system 100, the word segmentation module 110 first selects at least one reference document from an external document library and performs a word segmentation procedure on it. The library stores a plurality of reference documents from various domains, and the keywords of every reference document are known; the keyword evaluation system 100 uses these reference documents with known keywords as training samples.
The word segmentation procedure semantically processes each document into a collection data structure made up of a group of strings; the strings may repeat, and each is a part of the document. Usable segmentation procedures include the dictionary method, the statistical method, the hybrid method, or simply splitting on whitespace. Because word segmentation is not the focus of the invention, the segmentation method used is not limited to these. When the segmentation procedure performed by the word segmentation module 110 finishes, it produces a document-word set containing all the document words of every reference document. The term-frequency statistics and filtering module 120 then computes, from the document-word set and the known keywords, the statistical distribution of the term-frequency rank ratios of the known keywords, and uses this distribution to filter out the outliers among the rank ratios.

Next, the weight threshold determination module 130 computes the weight value of each filtered known keyword by a weight index method and sets the smallest value in this sample of weight values as the weight threshold. The weight index may be a quantitative evaluation formula built on theories including, but not limited to, term frequency-inverse document frequency (TF-IDF), the Z-score, the log-likelihood ratio test, or mutual information (MI) from information theory. The inputs (independent variables) of the weight index method may include term frequency; characteristics of the word itself (for example, words consisting of a single character, such as "我" ("I"), or consisting entirely of digits, such as "2009", typically have low weight values); part of speech (for example, keywords are usually nouns); or position of occurrence (for example, a word that appears early in a paragraph may be more important than one that appears late in it).
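As a minimal sketch of how the weight threshold determination module 130 could derive the threshold, the following assumes a plain TF-IDF weight (relative term frequency times the natural logarithm of N/df). This is only one of the indices named above, and the exact formula is an assumption, not the specification's. The function names and the two sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(word, doc_words, all_docs):
    """Plain TF-IDF: relative term frequency times ln(N / document frequency)."""
    tf = Counter(doc_words)[word] / len(doc_words)
    df = sum(1 for words in all_docs if word in words)
    return tf * math.log(len(all_docs) / df) if df else 0.0


def weight_threshold(reference_docs, filtered_keywords):
    """Smallest weight among the (already outlier-filtered) known keywords.

    reference_docs: list of word lists, one per reference document.
    filtered_keywords: list of keyword lists, parallel to reference_docs.
    Keywords that do not occur in their document are skipped (an assumption
    of this sketch).
    """
    weights = [
        tf_idf(keyword, words, reference_docs)
        for words, keywords in zip(reference_docs, filtered_keywords)
        for keyword in keywords
        if keyword in words
    ]
    return min(weights)  # raises ValueError if no keyword occurs at all


docs = [["this", "is", "a", "cat"], ["this", "is", "a", "dog"]]
known = [["cat"], ["dog"]]
print(weight_threshold(docs, known))  # 0.25 * ln(2), about 0.173
```

With this toy data, words that occur in both documents receive a weight of zero, so they fall below the threshold derived from the known keywords.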
Because the study of weight index methods is not the focus of the invention, the weight index method is not limited to those listed; any method that produces a quantitative weight index may be used. Once the weight threshold has been obtained, the keyword determination module 140 takes a target document for which keywords are to be established, computes the weight value of each document word in it by the same weight index method, and judges whether that weight value is greater than the weight threshold. If it is, the document word is determined to be a keyword.

The word segmentation module may additionally perform tense processing on the selected reference documents to filter out in advance document words that are not keywords, which improves the execution efficiency of the keyword evaluation system.

The statistical distribution of term-frequency rank ratios is obtained as follows. For each reference document, the ranks of the known keywords in the term-frequency ranking are collected and divided by the total number of document words in that reference document, giving the rank-ratio distribution of the known keywords in that document. For example, a known keyword whose term frequency ranks third in a reference document has a rank ratio of 3 divided by the total number of document words in that document. Some known keywords of a reference document may not actually occur in the document, which can happen when keywords are defined manually; such keywords are ignored. Since a reference document usually has more than one known keyword, its rank-ratio distribution is a set of floating-point numbers, each between 0 and 1. And because the content and term-frequency characteristics of the reference documents differ, the values in the rank-ratio set differ from one reference document to another, as shown in Figure 2.
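The rank-ratio computation for a single reference document can be sketched as follows. Two points are assumptions of this sketch, since the text does not settle them: the "total number of document words" is taken to be the number of distinct words, and ties in term frequency are broken alphabetically. The name `rank_ratios` and the sample sentence are hypothetical.

```python
from collections import Counter


def rank_ratios(doc_words, known_keywords):
    """Map each known keyword that occurs in the document to rank / word count.

    Rank 1 is the most frequent word. Keywords absent from the document
    are ignored, as the description states.
    """
    counts = Counter(doc_words)
    # Sort by descending frequency; alphabetical tie-break is an assumption.
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    rank_of = {word: position + 1 for position, word in enumerate(ranked)}
    return {keyword: rank_of[keyword] / len(ranked)
            for keyword in known_keywords if keyword in rank_of}


words = "the cat sat on the mat the cat".split()
# "the" ranks 1st and "cat" 2nd among 5 distinct words; "dog" is absent.
print(rank_ratios(words, ["cat", "dog"]))  # {'cat': 0.4}
```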
The rank ratios of the known keywords of the individual reference documents are distributed at different positions. Finally, combining the keyword rank ratios of all reference documents gives the keyword rank-ratio distribution of the whole document collection. In general this distribution is a right-skewed probability density function, as shown in Figure 3. The range of high occurrence frequency in this density can be regarded as the range where keywords are more likely to fall, in other words the interval in which the rank ratio of a keyword most likely lies. As described above, this most likely interval is determined, and samples that fall outside it, that is, in the less likely range, are treated as outlier samples and filtered out.

The outlier filtering step follows from the determination of this interval. In one embodiment, defining the interval is handled as a matter of defining outliers, where an outlier is a sample whose rank ratio is too high within the sample distribution (that is, a sample approaching 1). The definition of the interval is not limited to this concept and method, however. Statistically, data items that depart from the general behavior pattern of the data are called outliers. One rule of thumb is to transform the data to a normal distribution and treat any item whose Z-score lies more than three standard deviations above the mean as an abnormal outlier to be trimmed. The rank-ratio interval that remains after trimming is the non-outlier interval, marked as the "possible keyword interval" in Figure 3.
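The Z-score rule of thumb can be sketched as follows. This is a simplified illustration under stated assumptions: it skips the transformation to a normal distribution that the text mentions, uses the population standard deviation, and trims only the high side, since outliers here are rank ratios that are too high. The name `filter_by_z_score` and the sample values are hypothetical.

```python
import statistics


def filter_by_z_score(ratios, limit=3.0):
    """Keep the rank ratios whose Z-score does not exceed `limit`."""
    mean = statistics.mean(ratios)
    std = statistics.pstdev(ratios)  # population standard deviation
    if std == 0:
        return list(ratios)  # all values identical, nothing to trim
    return [ratio for ratio in ratios if (ratio - mean) / std <= limit]


# Twenty typical rank ratios and one that approaches 1.
sample = [0.1] * 20 + [0.9]
print(len(filter_by_z_score(sample)))  # 20
```

Note that with very few samples no value can reach a Z-score of 3, so a rule of this kind only trims anything when the sample is reasonably large.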
Another rule of thumb for defining outliers is the quartile method. From the rank ratios of all known keywords, the first quartile Q1, the third quartile Q3, and the interquartile range IQR are computed. The right-tail limit of the non-outlier interval (the right boundary of the "possible keyword interval" in Figure 3) is then defined by either the mild outlier threshold L1 = Q3 + (1.5 × IQR) or the extreme outlier threshold L2 = Q3 + (3 × IQR), and the left-tail limit (the left boundary of the "possible keyword interval" in Figure 3) is the smallest rank ratio among all known keywords. Known keywords whose rank ratio is greater than L1 or L2 are accordingly treated as outliers and removed.

In another embodiment, after computing the weight value of each filtered known keyword, the weight threshold determination module 130 goes on to compile the distribution of the computed weight values, uses the Z-score method or the quartile method described above to find the probable interval of the weight values, and filters out the known keywords that fall outside that interval. The smallest weight value among the remaining known keywords is then taken as the weight threshold, which can improve accuracy. In this embodiment the samples treated as abnormal are those whose weight values are close to zero, and this case can be handled as an outlier case without designing a separate outlier definition.
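The quartile method for rank ratios can be sketched with the standard library. The quartile convention is an assumption: `statistics.quantiles` with its default "exclusive" method is used here, whereas the text does not say how Q1 and Q3 are to be interpolated. The function names and sample values are hypothetical.

```python
import statistics


def non_outlier_interval(ratios, extreme=False):
    """Return (left, right) limits of the non-outlier interval.

    left  = smallest observed rank ratio
    right = Q3 + 1.5 * IQR (mild) or Q3 + 3 * IQR (extreme)
    """
    q1, _median, q3 = statistics.quantiles(ratios, n=4)
    iqr = q3 - q1
    factor = 3.0 if extreme else 1.5
    return min(ratios), q3 + factor * iqr


def filter_by_quartiles(ratios, extreme=False):
    """Drop the rank ratios that lie outside the non-outlier interval."""
    left, right = non_outlier_interval(ratios, extreme)
    return [ratio for ratio in ratios if left <= ratio <= right]


sample = [0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.25, 0.95]
print(filter_by_quartiles(sample))  # 0.95 is dropped
```

Choosing the extreme threshold widens the interval, so fewer known keywords are discarded before the weight threshold is derived.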
In another embodiment, the keyword evaluation system 100 may further include a loop sampling module (not shown). When the weight threshold is not accepted by the user, the loop sampling module re-samples reference documents so that the word segmentation module 110, the term-frequency statistics and filtering module 120, and the weight threshold determination module 130 run again and produce another reasonable weight threshold; this loop continues until a threshold is accepted. The keyword evaluation method may thus be set up as a procedure that is repeated in a loop, and the weight threshold value may be obtained by a mathematical operation that takes the weight thresholds obtained from the procedure performed in each unit as independent variables.

In addition, the document library from which reference documents are sampled may be a collection of documents on the network, for example Wikipedia or blog sites (such as 無名 or blogspot). The keyword evaluation system 100 may further include a communication module (not shown) that provides network communication for connecting to the document library on the network and obtaining the reference documents stored there in order to carry out the keyword evaluation method.

As to applications of the keyword evaluation system 100, in one embodiment it can help an information retrieval system or search engine build its index table. Conventional information retrieval systems and search engines use an inverted index table as the retrieval index. An inverted index table maps arbitrary words to the documents in the library: when the user issues any text query, the index maps the query to the set of documents containing that text, and the documents are ranked and returned to the user. The biggest problem of the inverted index table is that, as the document library grows, the number of words to be mapped grows with it, so the number of rows in the table and the storage it occupies keep increasing; this lengthens the expected time to search the table and slows queries. Most users, however, tend toward "concept" searching when querying an information retrieval system or search engine: the search conditions they issue are usually keywords that reflect some concept. In this embodiment, therefore, as shown in Figure 4, the information retrieval system 400 is combined with the keyword evaluation system 100 of the invention. The keyword evaluation system 100 performs weight-threshold determination on the text of the documents in the library, in batch or in real time, and extracts keywords accordingly.
The inverted index table then stores only the mapping between documents and the "potential keywords" determined by the keyword evaluation system 100, which reduces the size of the table and speeds up user queries. For example, suppose there are two documents to be summarized, D1 with the content "This is a cat" and D2 with the content "This is a dog". A conventional information retrieval system considers every word in the documents, so the inverted index table it builds is {This → [D1, D2]; is → [D1, D2]; a → [D1, D2]; cat → [D1]; dog → [D2]}. Yet "This", "is", and "a" are not important or indicative enough to serve as keywords. If the keyword evaluation method of the invention is used and the computed weight threshold determines that the only possible keywords are "cat" and "dog", discarding "This", "is", and "a", then the information retrieval system need only build the inverted index table from the keywords determined by the method. The table shrinks to {cat → [D1]; dog → [D2]}.

In another embodiment, the keyword evaluation system 100 of the invention can be applied in an automatic summarization system. Automatic summarization systems generally include automatic keyword definition for documents, automatic tagging of Web 2.0 user-generated content, and automatic summarization functions embedded in other information systems.
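Returning to the inverted-index embodiment, the D1/D2 example can be sketched as follows. The function `build_inverted_index` is hypothetical, and the `is_keyword` predicate merely stands in for the weight-threshold test; here it is a hard-coded set rather than a computed threshold.

```python
def build_inverted_index(docs, is_keyword=None):
    """Map each word to the list of document ids that contain it.

    docs: dict of document id -> text (split on whitespace).
    is_keyword: optional predicate; when given, only words it accepts
    are indexed.
    """
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            if is_keyword is not None and not is_keyword(word):
                continue
            postings = index.setdefault(word, [])
            if doc_id not in postings:
                postings.append(doc_id)
    return index


docs = {"D1": "This is a cat", "D2": "This is a dog"}
full = build_inverted_index(docs)
# Stand-in for "weight greater than the threshold": a fixed keyword set.
reduced = build_inverted_index(docs, is_keyword=lambda w: w in {"cat", "dog"})
print(len(full), len(reduced))  # 5 2
print(reduced)                  # {'cat': ['D1'], 'dog': ['D2']}
```

In this toy case the index shrinks from five rows to two; how much a real index shrinks depends on the threshold and the corpus.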
As shown in Figure 5, there are two ways of implementing this. In the first, the weight threshold produced by the keyword evaluation system 100 is applied directly, in real time or in batch, to the keyword-threshold decision step of automatic summarization. For any document to be summarized that is taken from the document library, the weight value of each word in the document is computed, and the weight threshold is then used directly to judge whether the word is a keyword and to extract the keywords. The second way works together with other existing keyword extraction schemes. In the automatic summarization system 520, the keyword evaluation system 100 first preprocesses the document to be summarized, filtering out the document words that cannot be keywords. The words that remain are passed to the subsequent keyword extraction scheme, which then only has to consider the words left after preprocessing instead of every candidate in the original document, reducing the time needed for keyword extraction. Note that in the second way the keyword evaluation system 100 only has to produce the weight threshold, so the keyword determination module 140 can be switched off to save system resources.

Figure 6 is a flowchart of a keyword evaluation method according to an embodiment of the invention. First, reference documents are selected from the document library and segmented, producing the document-word set of the reference documents (step S610). As described above, the reference documents stored in the library all have known keywords and serve as the training samples the invention requires. Next, term-frequency rank-ratio information for the reference documents as a whole is generated from the document-word set and the known keywords of the reference documents (step S620). The rank-ratio information is a sample set; each value is the ratio of a known keyword's term-frequency rank in its reference document to the total number of document words in that document. The known keywords that are outliers are then filtered out according to the rank-ratio information (step S630): the distribution of the rank ratios is found statistically, any quantitative rule for defining outliers is used to find the non-outlier interval of that distribution, and all data outside the non-outlier interval are removed. After the outliers have been filtered out, the weight values corresponding to the filtered known keywords are computed by the weight index method (step S640). The weight index method may be a quantitative evaluation formula built on theories including, but not limited to, TF-IDF, the Z-score, the log-likelihood ratio test, or mutual information from information theory. The smallest of the resulting weight values is set as the weight threshold (step S650), and the weight threshold is output to an information retrieval system or an automatic summarization system (step S660), so that the system can use it to judge whether each document word of a target document is a keyword.

In another embodiment, the keyword evaluation method may be set up as a procedure that is repeated in a loop. The weight threshold that is produced is shown to the user through a user interface, and the user judges from experience whether the threshold is reasonable.
If the user judges the threshold unreasonable, the user can issue a command through the user interface to run the keyword evaluation method again, which re-samples reference documents from the document library and produces a new weight threshold. This loop can continue until the weight threshold produced is accepted by the user. In one embodiment, the weight threshold value may be obtained by a mathematical operation that takes the weight thresholds obtained from the procedure performed in each unit as independent variables.

Although the invention has been disclosed above by way of various embodiments, they are given only as examples and are not intended to limit the scope of the invention. Those skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention. The embodiments above therefore do not limit the scope of the invention, and the scope of protection of the invention is defined by the appended claims.
[Brief Description of the Drawings]

Figure 1 shows a keyword evaluation system according to an embodiment of the invention.
Figure 2 is a statistical distribution chart of the keyword term-frequency rank ratios of different reference documents according to an embodiment of the invention.
Figure 3 is a cumulative statistical distribution chart of the number of occurrences of keyword term-frequency rank ratios over all reference documents according to an embodiment of the invention.
Figure 4 is a schematic diagram of the joint operation of the keyword evaluation system and an information retrieval system according to an embodiment of the invention.
Figure 5 is a schematic diagram of the joint operation of the keyword evaluation system and automatic summarization systems according to an embodiment of the invention.
Figure 6 is a flowchart of a keyword evaluation method according to an embodiment of the invention.

[Description of Main Reference Numerals]

100 ~ keyword evaluation system;
110 ~ word segmentation module;
120 ~ term-frequency statistics and filtering module;
130 ~ weight threshold determination module;
140 ~ keyword determination module;
400 ~ information retrieval system;
510, 520 ~ automatic summarization systems.