TW201118603A

TW201118603A - A computer system of template-based term entity-relation mining algorithm

Info

Publication number: TW201118603A
Application number: TW98140809A
Authority: TW
Inventors: Yu-Chieh Wu; Yang-Cheng Lu
Original assignee: Yu-Chieh Wu; Yang-Cheng Lu
Priority date: 2009-11-30
Filing date: 2009-11-30
Publication date: 2011-06-01
Also published as: SG171508A1

Abstract

The invention presents a framework that integrates natural language processing and data mining techniques to extract user-defined patterns of interest. A pattern is defined as an entity-relation-term template. All parameters of the framework are configurable and can be defined in advance, so patterns can be mined and retrieved according to their importance and strength. Term extraction, part-of-speech tagging, and the proposed mining techniques eliminate a portion of the irrelevant terms before the mining process.

Description

Designated representative figure: Fig. 1. Reference numerals 100 to 190 denote the steps of the process flow.
[Technical Field]
The present invention relates to a method for mining associated terms and associated named entities from document text. Using predefined mining rules, term segmentation rules, named-entity definitions, and statistical tests, it outputs the results in a document that satisfy these conditions, ranked by confidence and degree of association. At the user's discretion, the associated terms and named entities can also be extended to association mining over more than two terms.
[Prior Art]
The invention targets large document collections: it takes the output of natural-language part-of-speech tagging, applies predefined named-entity rules, and then performs association mining with large-scale data mining techniques. It draws on several fields: (1) natural language processing: term extraction, automatic part-of-speech tagging, post term processing, and named entity recognition; (2) data mining: sequential pattern mining and association mining; (3) correlation and independence testing. In its overall design, the invention is a novel, cross-disciplinary combination of these fields.
Although existing techniques, domestic and foreign, have each reached a certain level of maturity, none has combined natural language processing, sequential pattern mining, and the related techniques above into one systematic framework. Earlier literature and inventions focus on mining general patterns rather than associated terms in documents. They also concentrate on computing term-to-term similarity rather than on template-based mining driven by part of speech. In natural language processing, similar term extraction techniques exist, but a multilingual method built on association, part-of-speech templates, and named-entity definitions has not been seen. The following reviews related techniques and research from an international perspective.
With the rise of the Internet, the volume of documents, news articles, forum posts, and reports has grown many times over. Data mining, however, was developed mainly for large volumes of numerical data, in which it finds highly correlated patterns or behaviors. A convenience store, for example, can apply association mining to customer purchases to identify items that are often bought together, such as coffee and bagels, and display them accordingly. Sequential and temporal data call for different methods; well-known algorithms such as PrefixSpan (Jiawei Han et al., USA) and BIDE were designed for sequence data. Document data differs from ordinary numerical and categorical data: documents are unstructured, and processing them depends on the target language. Documents also carry meaning.
In most languages, such as Chinese, Japanese, and English, text is a form of sequence data, whereas Arabic and Czech are free-word-order languages. These differences between languages make mining more difficult.
In Taiwan, the research team of Dr. Keh-Jiann Chen at Academia Sinica has developed methods for building term structures from large document collections. The method uses Chinese word segmentation and unknown-word handling to identify similar words or synonyms, computes pairwise mutual information, keeps the terms that reach a given level of similarity, and builds a table of associated terms. Users can then query this table with a search term to retrieve terms that are similar or nearly synonymous to it. Strictly speaking, the method considers only synonym and near-synonym structure. It does not consider entity definitions based on part of speech and shallow syntactic templates, does not attempt to improve segmentation results, and cannot be extended to multi-term co-occurrence or to the level of named entities and phrases. Its main limitation is therefore that it amounts to term clustering without filtering out noise terms.
In document mining, the main advances remain in classification and clustering methods and have not been extended to association mining over terms, phrases, and named entities. The first problem in document mining is usually how to assign a large number of documents to category names that the user has defined in advance (surveyed by Dr. Yiming Yang in a 1998 SIGIR paper). For example, given newspaper and electronic media articles as input, the machine assigns each a suitable category label such as politics, society, or entertainment, and organizes the documents by category.
A further application is document clustering (for example hierarchical clustering or the well-known K-means), which combines term analysis and document similarity computation to group seemingly unrelated documents. The groups need not be specified in advance; the machine computes them automatically. Although both approaches are mainstream topics in document mining, they operate at the level of whole articles and do not go down to sentences or phrases. Terms, phrases, and named entities differ considerably from whole articles, so document mining techniques are clearly different from the method of this patent.
Over the past ten years the inventors' team has worked in all three of the fields above. In 2001-2003 it proposed combining Chinese word clustering with text categorization to improve categorization accuracy and reduce the effect of unknown words on classification algorithms. In 2003-2006 it concentrated on English phrase recognition and named entity recognition, publishing in international journals (such as Knowledge-Based Systems and Expert Systems with Applications). In 2007 it studied and improved the computation of sequential pattern mining and published a paper at ACL 2007 showing that the method effectively reduces the training time of classification models. In the most recent three years it has focused on Chinese word segmentation, part-of-speech tagging, and syntactic parsing, and took part in international word segmentation and part-of-speech tagging competitions and multilingual syntactic parsing competitions, with very good results.
The team's Chinese word segmentation and part-of-speech tagging methods achieve the best automatic tagging accuracy on recognized corpora and process tens of thousands of words per second. The patent builds on this foundation in language processing and document mining.
The earlier methods described above are clearly different from this patent. Term similarity computation works at an entirely different level: it only determines whether terms are the same or related, and it is essentially limited to Chinese and dependent on segmentation results. This patent is not restricted to one language. It lets the user first define the part-of-speech structure of interest, then mines the patterns that are statistically highly correlated and that satisfy the minimum threshold conditions. Because it relies on part-of-speech analysis and named-entity definitions, it applies to multiple languages.
[Summary of the Invention]
The object of the invention is to provide a named-entity association mining framework applicable to multiple languages. The invention uses user-defined part-of-speech templates, stop words, and named-entity definitions. It mines patterns by sequential pattern mining according to their association and combines this with statistical tests of independence or correlation to discover and present them. All fine-grained adjustments of the process are parameterized and can be defined by the user, and other document information such as time, date, and source can be added. The user can thus obtain, within a limited time, the highly associated named-entity or term relation patterns in a specified document set.
To achieve this object, the computer system of the invention includes at least one software program comprising the following modules:
• term extraction, which feeds documents into a language-specific term segmentation program, such as a Chinese word segmenter, and retrieves the result;
• part-of-speech tagging, which feeds the segmented documents into a language-specific part-of-speech tagger, such as the Brill tagger for English, and retrieves the result;
• template definition, which defines the set of parts of speech to extract, together with named-entity definitions that turn multiple terms matching a definition into a single entity;
• a stop-word list, which stores the common, irrelevant terms to remove;
• a pattern projection database, which records the sentences and document information in which each named entity to be mined occurs;
• pattern mining, which mines the named-entity sequences that meet the conditions;
• association computation, which computes the degree of association among the terms within each mined pattern.
The framework emphasizes multilingual applicability, so it must be paired with language-specific term extraction and part-of-speech tagging methods that label the terms and their parts of speech before patterns matching the named-entity rules are mined from a large document collection. In the embodiments, these components can therefore be replaced with ones for a different language or a different tagging method.
In an embodiment, named-entity rules are defined by the user as part-of-speech combinations: a run of consecutive terms whose parts of speech match a rule is merged into one named entity, which becomes the basic unit of mining. Rules can be added, modified, and deleted freely.
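The modular design described above, in which the segmenter and tagger are language-specific and the remaining steps are language-independent, can be sketched as a pipeline of pluggable functions. This is a minimal illustration only: `run_pipeline`, `whitespace_segment`, `lexicon_tag`, `TOY_LEXICON`, and `drop_det` are hypothetical names, and the toy segmenter and tagger stand in for real tools such as a Chinese word segmenter or the Brill tagger.

```python
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]  # (term, part-of-speech tag)


def run_pipeline(
    text: str,
    segment: Callable[[str], List[str]],
    tag: Callable[[List[str]], List[Token]],
    steps: List[Callable[[List[Token]], List[Token]]],
) -> List[Token]:
    """Segment and tag the text, then apply each language-independent step in order."""
    tokens = tag(segment(text))
    for step in steps:
        tokens = step(tokens)
    return tokens


# Toy stand-ins (assumptions, not part of the patent).
def whitespace_segment(text: str) -> List[str]:
    return text.split()


TOY_LEXICON: Dict[str, str] = {"TSMC": "Nb", "leads": "VJ", "the": "DET"}


def lexicon_tag(terms: List[str]) -> List[Token]:
    # Unknown words default to the common-noun tag "Na" in this toy tagger.
    return [(term, TOY_LEXICON.get(term, "Na")) for term in terms]


def drop_det(tokens: List[Token]) -> List[Token]:
    return [(term, pos) for term, pos in tokens if pos != "DET"]


result = run_pipeline("TSMC leads the industry", whitespace_segment, lexicon_tag, [drop_det])
print(result)
```

Swapping in another language would, under this sketch, mean passing a different `segment` and `tag` pair while leaving `steps` unchanged.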
In an embodiment, pattern mining takes all named-entity units in the order in which they occur in the text and uses sequential pattern mining to find the patterns that satisfy the user's definitions. In an embodiment, association computation scores all terms within a pattern with the user-defined association measure and lists them in order of association.
[Embodiments]
The embodiments are described with reference to Fig. 1 and Fig. 2, which show the overall operation of the invention and its operating flow, respectively. Fig. 3 shows the algorithm of the invention. Fig. 1 shows the core modules of the invention together with the user inputs and system outputs. The operating flow of the embodiment is as follows.
From the system perspective, (100), (130), and (150) are the modules that involve the user: (100) is the set of documents input by the user; (130) is the module in which the user specifies the named-entity definition rules; (150) is the predefined list of stop words to remove. In Fig. 1, (110) is a web crawler, which receives the user's input articles over a network link or fetches data automatically; the user may also bypass it and input documents into the system directly. Module (120) contains term extraction (121) and part-of-speech tagging (122): term extraction separates the terms in a document with whitespace or another defined delimiter, and part-of-speech tagging labels each term with its part of speech. Template definition (130) supplies the definitions used by the named-entity definition rules (140) and stop-word removal (150).
After part-of-speech tagging, module (140) extracts the named entities that match the definitions, and module (150) then removes the named entities and terms that appear on the stop-word list. Module (160) then builds the named-entity projection table. Association mining (170) uses this table and a sequential mining method, comprising pattern extraction (171) and pattern mining (172), to mine the named entities that meet the minimum occurrence count, in the order in which they appear in the documents. Module (180) then applies the selected independence test to the mined patterns to compute the association between terms, and (190) outputs the named-entity terms sorted by association. Fig. 3 shows the detailed algorithm proposed by the invention, which covers the flow above: its input is the articles to analyze and the user-defined template modules, and its output is the list of mined term associations. The embodiment takes a Chinese input document as an example.
The user defines the following parameters:
• Chinese documents (processed by Chinese word segmentation and Chinese part-of-speech tagging)
• C = 5 (term distance: 5 words)
• M = 10% (keep the top 10% most significant terms)
• θ = 70% (entity association must exceed 70%)
• MinS (minimum number of occurrences of an entity)
• MaxLen (maximum mined pattern length)
• chi-square test as the independence test
The documents after Chinese word segmentation and part-of-speech tagging are shown in Example 1.
Example 1: Original documents after Chinese word segmentation and part-of-speech tagging
中時 - 2009/10/22
2008_((1)) 集團_((Na)) 稅_((Na)) 後_((Ng)) 純益_((Na)) 排行_((VG)) 台積_((Nb)) 電_((Na)) ：_((COLONCATEGORY)) 999.33億_((Neu)) 元_((Nf)) 登_((D)) 獲利_((VH)) 王_((QUESTIONCATEGORY)) 中華_((Nc)) 徵信社_((Nc)) 最_((Dfa)) 新_((VH)) 調查_((Nv)) 指出_((VE)) ，_((COMMACATEGORY)) 不論_((Cbb)) 是_((SHI)) 稅_((Na)) 後_((Ng)) 純益_((Na)) 、_((PAUSECATEGORY)) 純益率_((Na)) ，_((COMMACATEGORY)) 台積_((Nb)) 電_((Na)) 均_((D)) 拔得_((VC)) 頭籌_((Na)) 。_((PERIODCATEGORY))
自由時報 - 2009/10/22
台積_((Nb)) 電_((Na)) 企業_((Na)) 集團_((Na)) 獲利_((VH)) 王_((QUESTIONCATEGORY)) 在_((P)) 「_((PARENTHESISCATEGORY)) 獲利_((VH)) 」_((PARENTHESISCATEGORY)) 表現_((Na)) 方面_((Na)) ，_((COMMACATEGORY)) 根據_((P)) 中華_((Nc)) 徵信所_((Nc)) 調查_((VE)) 指出_((VE)) ，_((COMMACATEGORY)) 台積_((Nb)) 電_((Na)) 在_((P)) 2008年_((Na)) 榮登_((VC)) 「_((PARENTHESISCATEGORY)) 獲利_((VH)) 王_((Na)) 」_((PARENTHESISCATEGORY)) 寶座_((Na)) 。_((PERIODCATEGORY))
工商時報 - 2009/10/22
美林_((Nb)) 證券_((Na)) 看好_((VJ)) 台積_((Nb)) 電_((Na)) 第四_((Neu)) 季_((Nd)) 與_((Caa)) 明年_((Nd)) 獲利_((VH)) ，_((COMMACATEGORY)) 認為_((VE)) 股價_((Na)) 還_((D)) 有_((V_2)) 補_((VC)) 漲_((Nv)) 空間_((Na))
NOWnews - 2009/10/22
2008年_((Na)) 賺_((VC)) 最_((Dfa)) 大_((VH)) 台積_((Nb)) 電_((Na)) 集團_((Na)) 999.33億_((Neu)) 元_((Nf)) 榮登_((VC)) 獲利_((VH)) 王_((QUESTIONCATEGORY))
中央社 - 2009/10/20
天下_((Nc)) 雜誌_((Na)) 調查_((VE)) ：_((FW)) 台積_((Nb)) 電_((VC)) 蟬連_((VJ)) 標竿_((Na)) 企業_((Na)) 龍頭_((Na))
中廣 - 2009/10/20
標竿_((Na)) 企業_((Na)) 最佳_((A)) 聲望_((Na)) 台積_((Nb)) 電_((VC)) 蟬連_((VJ)) 龍頭_((Na))
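Tagged text in the form used by Example 1 can be turned into (term, tag) pairs before any further processing. The following is a minimal sketch, assuming each token is written as the term, then `_((`, the tag, then `))`, with tokens separated by whitespace; `TOKEN_RE` and `parse_tagged` are hypothetical names.

```python
import re

# Assumed token format, as in Example 1: term + "_((" + tag + "))".
TOKEN_RE = re.compile(r"(\S+?)_\(\((\w+)\)\)")


def parse_tagged(line):
    """Return a list of (term, pos_tag) pairs found in one tagged line."""
    return TOKEN_RE.findall(line)


sample = "台積_((Nb)) 電_((Na)) 蟬連_((VJ)) 龍頭_((Na))"
tokens = parse_tagged(sample)
print(tokens)
```

Each pair keeps the surface term and its part-of-speech tag, which is the input the later rule, stop-word, and filtering steps operate on.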

After the named-entity rules are applied, the named-entity terms are regrouped into the following form (the entity definition rules used in this embodiment are listed after the example):

Example 2: Documents regrouped by the named-entity rules
Runs of tokens matched by a rule are shown in square brackets.
中時 - 2009/10/22
2008_((1)) 集團_((Na)) [稅_((Na)) 後_((Ng)) 純益_((Na))] 排行_((VG)) [台積_((Nb)) 電_((Na))] ：_((COLONCATEGORY)) [999.33億_((Neu)) 元_((Nf))] 登_((D)) [獲利_((VH)) 王_((QUESTIONCATEGORY))] [中華_((Nc)) 徵信社_((Nc))] [最_((Dfa)) 新_((VH))] 調查_((Nv)) 指出_((VE)) ，_((COMMACATEGORY)) 不論_((Cbb)) 是_((SHI)) [稅_((Na)) 後_((Ng)) 純益_((Na))] 、_((PAUSECATEGORY)) 純益率_((Na)) ，_((COMMACATEGORY)) [台積_((Nb)) 電_((Na))] 均_((D)) 拔得_((VC)) 頭籌_((Na)) 。_((PERIODCATEGORY))
自由時報 - 2009/10/22
[台積_((Nb)) 電_((Na))] 企業_((Na)) 集團_((Na)) [獲利_((VH)) 王_((QUESTIONCATEGORY))] 在_((P)) 「_((PARENTHESISCATEGORY)) 獲利_((VH)) 」_((PARENTHESISCATEGORY)) 表現_((Na)) 方面_((Na)) ，_((COMMACATEGORY)) 根據_((P)) [中華_((Nc)) 徵信所_((Nc))] 調查_((VE)) 指出_((VE)) ，_((COMMACATEGORY)) [台積_((Nb)) 電_((Na))] 在_((P)) 2008年_((Na)) 榮登_((VC)) 「_((PARENTHESISCATEGORY)) [獲利_((VH)) 王_((Na))] 」_((PARENTHESISCATEGORY)) 寶座_((Na)) 。_((PERIODCATEGORY))
工商時報 - 2009/10/22
[美林_((Nb)) 證券_((Na))] 看好_((VJ)) [台積_((Nb)) 電_((Na))] [第四_((Neu)) 季_((Nd))] 與_((Caa)) 明年_((Nd)) 獲利_((VH)) ，_((COMMACATEGORY)) 認為_((VE)) 股價_((Na)) 還_((D)) 有_((V_2)) [補_((VC)) 漲_((Nv))] 空間_((Na))
NOWnews - 2009/10/22
2008年_((Na)) [賺_((VC)) 最_((Dfa)) 大_((VH))] [台積_((Nb)) 電_((Na))] 集團_((Na)) [999.33億_((Neu)) 元_((Nf))] 榮登_((VC)) [獲利_((VH)) 王_((QUESTIONCATEGORY))]
中央社 - 2009/10/20
[天下_((Nc)) 雜誌_((Na))] 調查_((VE)) ：_((FW)) [台積_((Nb)) 電_((VC))] 蟬連_((VJ)) 標竿_((Na)) 企業_((Na)) 龍頭_((Na))
中廣 - 2009/10/20
標竿_((Na)) 企業_((Na)) 最佳_((A)) 聲望_((Na)) [台積_((Nb)) 電_((VC))] 蟬連_((VJ)) 龍頭_((Na))
Entity definition rules
• Nb = Nb + Na
• Na = VH + Na
• Neu = Neu + Nf
• Nb = VH + SPCHANGECATEGORY
• Neu = PARENTHESISCATEGORY + Neu + Nf
• Dfa = Dfa + VH
• Na = Na + Ng
• Nc = Nc + Nc
• Nb = Nc + Na
• Neu = Neu + Nd
• Nb = Nb + VC
• Na = Na + Ng + Na
• Na = VH + QUESTIONCATEGORY
• Dfa = VC + Dfa + VH
• VC = VC + Nv
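Applying the entity definition rules above amounts to scanning the tagged tokens and merging any run whose tags match the right-hand side of a rule. The sketch below is one possible reading, not the patent's stated procedure: it scans left to right in a single pass and tries longer rules first, a precedence the source does not specify. `RULES` and `merge_entities` are hypothetical names.

```python
# Each rule is (sequence of POS tags to match, tag of the merged entity).
RULES = [
    (("Nb", "Na"), "Nb"),
    (("VH", "Na"), "Na"),
    (("Neu", "Nf"), "Neu"),
    (("VH", "SPCHANGECATEGORY"), "Nb"),
    (("PARENTHESISCATEGORY", "Neu", "Nf"), "Neu"),
    (("Dfa", "VH"), "Dfa"),
    (("Na", "Ng"), "Na"),
    (("Nc", "Nc"), "Nc"),
    (("Nc", "Na"), "Nb"),
    (("Neu", "Nd"), "Neu"),
    (("Nb", "VC"), "Nb"),
    (("Na", "Ng", "Na"), "Na"),
    (("VH", "QUESTIONCATEGORY"), "Na"),
    (("VC", "Dfa", "VH"), "Dfa"),
    (("VC", "Nv"), "VC"),
]


def merge_entities(tokens, rules=RULES):
    """Merge runs of (term, tag) tokens that match a rule into one entity."""
    # Assumption: longer patterns take precedence (Na+Ng+Na before Na+Ng).
    ordered = sorted(rules, key=lambda rule: -len(rule[0]))
    merged, i = [], 0
    while i < len(tokens):
        for pattern, merged_tag in ordered:
            window = tokens[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                merged.append(("".join(term for term, _ in window), merged_tag))
                i += len(pattern)
                break
        else:
            # No rule matched at this position: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


tagged = [("稅", "Na"), ("後", "Ng"), ("純益", "Na"), ("排行", "VG"), ("台積", "Nb"), ("電", "Na")]
print(merge_entities(tagged))
```

Keeping the rules as plain data makes it straightforward to add, modify, or delete them, as the description requires.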

Next, the stop words are removed from the documents above. Example 3 shows the documents after stop-word removal; the stop words used in this embodiment are listed after the example.
Example 3: Documents after stop-word removal
中時 - 2009/10/22
2008(1) 集團(Na) 稅後純益(Na) 排行(VG) 台積電(Nb) 999.33億元(Neu) 登(D) 獲利王(Nb) 中華徵信社(Nc) 最新(Dfa) 稅後純益(Na) 純益率(Na) 台積電(Nb) 拔得(VC) 頭籌(Na)
自由時報 - 2009/10/22
台積電(Nb) 企業(Na) 集團(Na) 獲利王(Na) 獲利(VH) 表現(Na) 中華徵信所(Nc) 台積電(Nb) 2008年(Na) 榮登(VC) 獲利王(Na) 寶座(Na)
工商時報 - 2009/10/22
美林證券(Nb) 看好(VJ) 台積電(Nb) 第四季(Neu) 與(Caa) 明年(Nd) 獲利(VH) 股價(Na) 補漲(VC)
NOWnews - 2009/10/22
2008年(Na) 賺最大(Dfa) 台積電(Nb) 集團(Na) 999.33億元(Neu) 榮登(VC) 獲利王(Na)
中央社 - 2009/10/20
天下雜誌(Nb) 台積電(Nb) 蟬連(VJ) 標竿(Na) 企業(Na) 龍頭(Na)
中廣 - 2009/10/20
標竿(Na) 企業(Na) 最佳(A) 聲望(Na) 台積電(Nb) 蟬連(VJ) 龍頭(Na)
Note: the stop words include, for example, punctuation marks and 在, 是, 最, 還, 有, 大, 後, 認為, 調查, 指出, 根據, 不論, 方面, 均, 空間.
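Stop-word removal as illustrated in Example 3 can be sketched as a simple filter over (term, tag) pairs. The word list below copies the examples given in the note; treating every tag that ends in `CATEGORY` as punctuation is an assumption made for this sketch, and `STOP_WORDS` and `remove_stop_words` are hypothetical names.

```python
# Example stop words from the note above; a real list would be user-supplied.
STOP_WORDS = {
    "在", "是", "最", "還", "有", "大", "後", "認為",
    "調查", "指出", "根據", "不論", "方面", "均", "空間",
}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words and tokens whose tag marks punctuation."""
    return [
        (term, tag)
        for term, tag in tokens
        # Assumption: punctuation tags end with "CATEGORY" (e.g. COMMACATEGORY).
        if term not in stop_words and not tag.endswith("CATEGORY")
    ]


merged = [
    ("台積電", "Nb"), ("在", "P"), ("「", "PARENTHESISCATEGORY"),
    ("獲利", "VH"), ("」", "PARENTHESISCATEGORY"), ("表現", "Na"), ("方面", "Na"),
]
print(remove_stop_words(merged))
```

The filter runs after entity merging, so a merged entity such as 最新 survives even though 最 alone is a stop word.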

Next, terms whose part of speech is not in the user-defined set of retained parts of speech are removed from the documents. The following example shows the documents after this removal; the retained part-of-speech set used in this embodiment is listed in the note that follows the example.

Example 4: Retained part-of-speech table

中時 (China Times) - 2009/10/22
2008 集團(Na) 稅後純益(Na) 排行(VG) 台積電(Nb) ：(COLONCATEGORY) 999.33億元(Neu) 獲利王(Nb)
中華徵信社(Nc) 最新(Dfa) 調查(Nv) 指出 ，(COMMACATEGORY) 不論(Cbb) 是(SHI) 稅後純益(Na) 、(PAUSECATEGORY) 純益率(Na) ，(COMMACATEGORY) 台積電(Nb) 均 拔得(VC) 頭籌(Na) 。(PERIODCATEGORY)
(Gloss: 2008 ranking of groups by after-tax net profit: TSMC is the profit king at 99.933 billion yuan. The latest survey by China Credit Information Service indicates that TSMC came first in both after-tax net profit and net profit margin.)

自由時報 (Liberty Times) - 2009/10/22
台積電(Nb) 企業(Na) 集團(Na) 獲利王(Na)
在 「(PARENTHESISCATEGORY) 獲利(VH) 」(PARENTHESISCATEGORY) 表現(Na) 方面(Na) ，(COMMACATEGORY) 根據(P) 中華徵信所(Nc) 調查(VE) 指出(VE) ，(COMMACATEGORY) 台積電(Nb) 在 2008年(Na) 榮登(VC) 「(PARENTHESISCATEGORY) 獲利王(Na) 」(PARENTHESISCATEGORY) 寶座(Na) 。(PERIODCATEGORY)
(Gloss: TSMC is the profit king among corporate groups. On profit performance, a China Credit Information Service survey indicates that TSMC took the "profit king" title in 2008.)

工商時報 (Commercial Times) - 2009/10/22
美林證券(Nb) 看好(VJ) 台積電(Nb) 第四季(Neu) 與(Caa) 明年(Nd) 獲利(VH) ，(COMMACATEGORY) 認為(VE) 股價(Na) 還(B) 有(V_2) 補漲(VC) 空間(Na)
(Gloss: Merrill Lynch is bullish on TSMC's profit for the fourth quarter and next year, and believes the share price still has room to catch up.)

NOWnews - 2009/10/22
2008年(Na) 賺最大(Dfa) 台積電(Nb) 集團(Na) 999.33億元(Neu) 榮登(VC) 獲利王(Na)

中央社 (Central News Agency) - 2009/10/20
天下雜誌(Nb) 調查(VE) :(FW) 台積電(Nb) 蟬連(VJ) 標竿(Na) 企業(Na) 龍頭(Na)

中廣 (Broadcasting Corporation of China) - 2009/10/20
標竿(Na) 企業(Na) 最佳(A) 聲望(Na) 台積電(Nb) 蟬連(VJ) 龍頭(Na)

Note: the retained part-of-speech set is A, Dfa, I, Na, Nb, Nc, Ncd, Nd, Neu, Nf, Ng, VA, VB, VC, VE, VG, VH, VJ.

Finally, a sequential pattern mining method is applied to the remaining terms (named entities). Under the parameter settings below, the following table of related entities is mined. Each entry carries its chi-square value, and the output is sorted in ascending order.

Example 5: Table of related term entities

Related entities found:
台積電 (TSMC) + 蟬連 (remains)          χ² = 2.721
台積電 (TSMC) + 龍頭 (leader)           χ² = 2.721
台積電 (TSMC) + 集團 (group)            χ² = 2.7199
台積電 (TSMC) + 企業 (enterprise)       χ² = 2.7199
台積電 (TSMC) + 獲利王 (profit king)    χ² = 2.4551
集團 (group) + 獲利王 (profit king)     χ² = 6.8935
標竿 (benchmark) + 企業 (enterprise)    χ² = 8.5239
蟬連 (remains) + 龍頭 (leader)          χ² = 8.5253

Mining parameters: MaxLen=2, C=5, M=10%, 9=10%, MinS=1, correlation test: χ² statistics.

[Brief Description of the Drawings]
Fig. 1 is a schematic diagram of the system architecture of the invention.
Fig. 2 is a schematic diagram of the processing flow of the invention.
Fig. 3 is a diagram of the mining algorithm of the invention.

[Description of Main Reference Numerals]
100~190 System step flow
100 Input link or article
110 Web crawler (network data collector)
120 Natural language processing
121 Term segmentation
122 Part-of-speech tagging
130 Template definition
140 Named entity definition rules
150 Stop-word removal
160 Build named entity projection table
170 Related-term mining
171 Pattern extraction
172 Pattern mining
180 Correlation calculation
190 Named entity relation set
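The filtering of Example 4 and the chi-square ranking of Example 5 can be sketched as follows. This is a sketch under stated assumptions, not the patent's algorithm: all function names are hypothetical, the pattern length is fixed at pairs (one reading of MaxLen=2), `max_gap` and `min_support` are assumed interpretations of C and MinS, and the chi-square is the standard Pearson statistic on a 2x2 table of document-level co-occurrence. It is not expected to reproduce the values printed in Example 5.

```python
from itertools import combinations

# Retained tags from the note under Example 4. "Ncd" is an assumed reading
# of the fourth N tag in that note.
RETAINED_POS = {"A", "Dfa", "I", "Na", "Nb", "Nc", "Ncd", "Nd", "Neu", "Nf",
                "Ng", "VA", "VB", "VC", "VE", "VG", "VH", "VJ"}


def filter_by_pos(tagged_tokens, retained=RETAINED_POS):
    """Keep only the words whose tag is in the retained set (order preserved)."""
    return [word for (word, tag) in tagged_tokens if tag in retained]


def mine_pairs(docs, max_gap=5, min_support=1):
    """Count ordered pairs (a, b) where b follows a within max_gap positions.

    Returns (pairs, support): the number of documents containing each pair,
    and the number of documents containing each single term.
    """
    pair_docs, term_docs = {}, {}
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            term_docs.setdefault(term, set()).add(doc_id)
        for i, j in combinations(range(len(terms)), 2):
            if j - i <= max_gap and terms[i] != terms[j]:
                pair_docs.setdefault((terms[i], terms[j]), set()).add(doc_id)
    pairs = {p: len(d) for p, d in pair_docs.items() if len(d) >= min_support}
    support = {t: len(d) for t, d in term_docs.items()}
    return pairs, support


def chi_square(n_ab, n_a, n_b, n):
    """Pearson chi-square for a 2x2 contingency table (0.0 if degenerate)."""
    o11, o12 = n_ab, n_a - n_ab
    o21, o22 = n_b - n_ab, n - n_a - n_b + n_ab
    denom = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    if denom == 0:
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


def rank_pairs(docs, max_gap=5, min_support=1):
    """Score every mined pair and sort ascending by chi-square."""
    pairs, support = mine_pairs(docs, max_gap, min_support)
    n = len(docs)
    scored = [(a, b, chi_square(n_ab, support[a], support[b], n))
              for (a, b), n_ab in pairs.items()]
    return sorted(scored, key=lambda row: row[2])


# Usage on two toy documents built from tokens in the examples.
tagged = [
    [("台積電", "Nb"), ("蟬連", "VJ"), ("龍頭", "Na")],
    [("台積電", "Nb"), ("與", "Caa"), ("集團", "Na")],
]
docs = [filter_by_pos(doc) for doc in tagged]
ranked = rank_pairs(docs)
```

Counting support per document rather than per occurrence keeps the 2x2 table well defined; a term that appears in every document yields a degenerate table, which the sketch scores as 0.0 instead of dividing by zero.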

Claims

VII. Scope of the patent application:

1. A computer system for a template-based term entity-relation mining method, the computer system comprising at least one software program that includes the following program modules: a module that performs term segmentation and part-of-speech tagging; a template definition module, used to define the set of parts of speech to be retrieved and the named entities; a module that removes stop words and irrelevant terms not covered by the named entity definitions, and establishes the corresponding patterns of the named entities to be mined; a pattern mining module, used to mine the sequences of named entities that satisfy the conditions; and a correlation calculation module, used to calculate the degree of association among the terms within the mined patterns.

2. The computer system according to claim 1, wherein the system uses part-of-speech tags in the definition template to define the named entities.

3. The computer system according to claim 1, wherein the system uses the part-of-speech tagged file, or the file in which named entity definition has been completed, to remove stop words.

4. The computer system according to claim 1, wherein the system uses the named entity definition template and the part-of-speech tagged file to remove terms whose part of speech is not retained.

5. The computer system according to claim 1, wherein the system uses the remaining terms of retained parts of speech and the named entities for sequential relation mining.

6. The computer system according to claim 1, wherein the system uses the remaining terms of retained parts of speech and the named entities for non-sequential relation mining.

7. The computer system according to claim 1, wherein the system has the following feature: the named entity definition template defines entities by part-of-speech combination rules, by vocabulary definitions, or by a combination of the two.

8. The computer system according to claim 1, wherein the system has the following feature: it mines related sequences of terms and named entities, and constrains the mining according to parameters such as term distance, number of occurrences, maximum pattern mining length, single-record mining or [illegible], and the correlation test method, and outputs the results.