TWI225997B

TWI225997B - Chinese ontology auto-establishment system and method, and storage media

Info

Publication number: TWI225997B
Application number: TW92122079A
Authority: TW
Inventors: Yuan-Fang Kao; Chang-Shing Lee; Yau-Hwang Kuo; I-Heng Meng
Original assignee: Inst Information Industry
Priority date: 2003-08-12
Filing date: 2003-08-12
Publication date: 2005-01-01
Also published as: TW200506655A

Abstract

The present invention provides a system and method for automatically constructing a Chinese ontology. The system includes an episode processing unit and an ontology generating unit. The episode processing unit receives a Chinese word stream containing a plurality of Chinese words and their parts of speech, and obtains from those words a plurality of strong two-word sequence combinations. Each strong two-word sequence combination includes a first Chinese word and a second Chinese word that occur in close succession among the Chinese words, and the number of times the combination occurs in the Chinese word stream is greater than a first minimum support. The ontology generating unit is coupled to the episode processing unit and receives the strong two-word sequence combinations. Based on a first concept corresponding to the first Chinese word and the part of speech of the second Chinese word, it obtains an attribute or an operation of the first concept.

Description

Technical field

This invention relates to a system and method for automatically constructing an ontology, and in particular to a system and method for automatically constructing a Chinese ontology.

Prior art

An ontology is a conceptual structure that describes the relationships between things. FIG. 1 is a schematic diagram of a conventional ontology architecture. The architecture contains several main elements: domain 11, category 12, concept 13, attribute 132, operation 133, and relations R1, R2 and R3. Domain 11 represents the specific domain the ontology describes, and each domain 11 can be divided into multiple categories 12. A concept 13 stored in the ontology contains a concept name 131, attributes 132 and operations 133. Relations are of three kinds: association R1, generalization R2 and aggregation R3. Association R1 expresses a general semantic relation between concepts 13. Generalization R2 is a hierarchical relation between different levels of abstraction: the higher the level, the more abstract the concept. Aggregation R3 is a group relation that expresses set membership among concepts. A well-constructed ontology can be used by applications such as search engines, knowledge management and e-commerce to improve search efficiency or document-processing capability. Several common English ontologies, such as WordNet and Cyc, and Chinese ontologies, such as HowNet, can be downloaded and used.
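The ontology architecture described above (domain, category, concept with name, attributes and operations, plus three kinds of relation) can be pictured as a small data model. The following is only a minimal sketch: the class names, the `add_concept` helper and the example values are assumptions made for illustration and do not come from the patent.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set


class RelationKind(Enum):
    ASSOCIATION = "R1"      # general semantic relation between concepts
    GENERALIZATION = "R2"   # hierarchy, more abstract towards the top
    AGGREGATION = "R3"      # group relation, set membership of concepts


@dataclass
class Concept:
    name: str                                            # concept name 131
    attributes: Set[str] = field(default_factory=set)    # attributes 132
    operations: Set[str] = field(default_factory=set)    # operations 133


@dataclass(frozen=True)
class Relation:
    kind: RelationKind
    source: str   # name of the first concept
    target: str   # name of the second concept
    label: str = ""


@dataclass
class Category:
    name: str                                            # category 12
    concepts: Dict[str, Concept] = field(default_factory=dict)


@dataclass
class Domain:
    name: str                                            # domain 11
    categories: Dict[str, Category] = field(default_factory=dict)
    relations: List[Relation] = field(default_factory=list)

    def add_concept(self, category: str, concept: Concept) -> None:
        """Store a concept under a category, creating the category if needed."""
        cat = self.categories.setdefault(category, Category(category))
        cat.concepts[concept.name] = concept


# Hypothetical example values, chosen only to exercise the structure.
domain = Domain("sports")
domain.add_concept("football", Concept("team", attributes={"coach"}, operations={"train"}))
domain.add_concept("football", Concept("player"))
domain.relations.append(Relation(RelationKind.AGGREGATION, "team", "player"))
```

Keeping relations at the domain level rather than inside each concept is a design choice of this sketch; it makes it easy to list every relation of one kind.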

At present, several ready-made ontologies such as WordNet, Cyc and HowNet are available. Because of the domains they were built for, however, the concepts 13, attributes 132, operations 133 and relations R1, R2 and R3 they provide do not necessarily satisfy a user's needs completely. The ontology an application system requires mostly varies with its domain, so each application system has to develop its own ontology.

At present, ontologies are constructed manually: a large amount of manpower is used to build all the concepts 13, attributes 132, operations 133 and relations of a domain. This approach has several drawbacks. First, it requires a great deal of manpower. Second, when two or more people construct an ontology, their differing viewpoints mean that much time must be spent on discussion to resolve disagreements. Finally, because knowledge evolves rapidly, a manually constructed ontology is often updated too slowly to meet the current needs of the application system.

To avoid these drawbacks, another feasible approach is to construct a prototype ontology from a large number of documents (automatic ontology construction for short) and then have professionals revise it into a usable ontology, which reduces manpower and allows the ontology to be updated more efficiently. Automatic construction techniques for English ontologies mostly use a grammar parser to extract concepts, attributes, operations and relations R1, R2 and R3 from a large number of documents. Chinese grammar rules, however, are complex, and a highly accurate Chinese grammar parser is lacking, so a grammar parser cannot be applied directly to construct a Chinese ontology automatically. A system and method for automatically constructing a Chinese ontology is therefore needed.

Summary of the invention

In view of this, an object of the present invention is to provide a system and method for automatically constructing a Chinese ontology which, besides constructing the ontology automatically, reduces the manpower required and allows the ontology to be updated more efficiently.

To this end, the system and method of the invention first provide a document, a Chinese dictionary, a document processing unit, a concept processing unit, an episode processing unit, an ontology generating unit and an ontology. The document processing unit takes at least one document as input and finds the meaningful nouns and verbs in it, producing a Chinese word stream that contains a plurality of sequential Chinese words and their parts of speech. The concept processing unit takes the nouns obtained by the document processing unit, analyzes the strength of the relation between any two nouns, and clusters instances belonging to the same concept together. The episode processing unit takes the Chinese word stream obtained by the document processing unit and generates a plurality of episodes, an episode being an ordered combination of words within a window size. After receiving the set of episodes generated by the episode processing unit, the ontology generating unit compares each episode with the concepts generated by the concept processing unit; if a word in an episode is an instance of a concept, the concept name is annotated after it. After concept annotation, the ontology generating unit uses sentence-pattern rules to extract attributes, operations and associations from the annotated episodes, and constructs a domain ontology from them.

Embodiments

FIG. 2 is a system diagram of the automatic Chinese ontology construction system according to an embodiment of the invention. The system 2 includes a document 21, a Chinese dictionary 22, a document processing unit 23, a concept processing unit 24, an episode processing unit 25, an ontology generating unit 26 and an ontology 27.

The document 21 is the input data of the system. It is an electronic Chinese document whose format may be Word, HTML, PowerPoint or any other electronic format capable of storing Chinese documents. The Chinese dictionary 22 is an electronic Chinese dictionary containing a plurality of Chinese words, each of which contains at least one Chinese character.

The document processing unit 23 takes the document 21 as input and finds its meaningful nouns and verbs. It first uses a Chinese word segmentation system to segment the Chinese sentences and tag parts of speech, and then uses a stop-word filter to find the meaningful nouns and verbs. The processing is illustrated below with real data. Suppose a passage of text reports that a "Hand of God" by the Argentine "god of war" Maradona let Argentina defeat England. After the document processing unit 23 segments the sentence and tags parts of speech, as shown in FIG. 3a, the sentence is split into 14 words such as "Argentina", "god of war" and "Maradona". Each word is followed by its part of speech in parentheses: tags beginning with N denote nouns, tags beginning with V denote verbs, P denotes a preposition, PARENTHESISCATEGORY denotes a parenthesis and PERIODCATEGORY denotes a full stop. Based on these words and tags, the document processing unit 23 then applies the stop-word filter and keeps only nouns and verbs of restricted types, for example words tagged Na, Nb, Nc or Vc, which form a Chinese word stream as shown in FIG. 3b.
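The stop-word filtering step of the document processing unit can be sketched as a filter on part-of-speech tags. This is a minimal sketch, not the patented implementation: the function name `filter_tagged_words` is hypothetical, the tagger output is hand-written rather than produced by a real segmenter, and the kept tag prefixes are simply the four examples named in the text.

```python
# Part-of-speech prefixes kept by the filter. Na, Nb, Nc and Vc are the
# examples named in the text; the full set is a configuration choice.
KEPT_TAGS = ("Na", "Nb", "Nc", "Vc")


def filter_tagged_words(tagged_words, kept_tags=KEPT_TAGS):
    """Keep (word, tag) pairs whose tag starts with one of kept_tags.

    The order of the surviving words is preserved, so the result can be
    used directly as a word stream.
    """
    return [(word, tag) for word, tag in tagged_words
            if tag.startswith(kept_tags)]


# Hypothetical tagger output (English glosses stand in for Chinese words;
# the tags on the dropped words are assumed for illustration).
tagged = [
    ("Argentina", "Nc"),
    ("god of war", "Na"),
    ("Maradona", "Nb"),
    ("(", "PARENTHESISCATEGORY"),
    ("by", "P"),
    ("defeat", "Vc"),
    ("England", "Nc"),
    (".", "PERIODCATEGORY"),
]

word_stream = filter_tagged_words(tagged)
```

Matching on tag prefixes means a finer-grained tag such as Nca is kept by the Nc entry without listing every subtype.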

The concept processing unit 24 takes the nouns obtained by the document processing unit 23 and first selects the nouns with a high product of term frequency and inverse document frequency. It then uses the self-organizing map (SOM), an unsupervised learning model from neural network technology, to analyze the strength of the relation between any two nouns and to group instances belonging to the same concept together.

The episode processing unit 25 takes the Chinese word stream obtained by the document processing unit 23 and derives a plurality of episodes. An episode is an ordered combination of words within a window size. FIG. 3c shows two episodes with a window size of 3: "Argentina (Nc) - god of war (Na) - Maradona (Nb)" and "Argentina (Nc) - defeat (Vc) - England (Nc)".

FIG. 4 is a schematic diagram of the episode processing algorithm according to an embodiment of the invention; it contains pseudocode lines 400 to 420. The variables, parameters and data structures used by the algorithm are as follows:

(1) WindowSize, the window size, is an input parameter of the algorithm that limits the number of words in each episode;
(2) minimum_Support, the minimum support, is an input parameter of the algorithm that sets the minimum number of occurrences of each episode;
(3) y<t1, t2, ..., tk> is a data structure that records the sentences in which the term sequence t1, t2, ..., tk appears;
(4) y<t1, t2, ..., tk>.cardinality is a variable that records how many times the term sequence t1, t2, ..., tk appears in total;
(5) ti.position is a variable that records the position at which ti appears in a sentence.
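Items (3) to (5) above can be pictured as one record per term sequence that remembers where the sequence occurred. The sketch below is an assumption about how such a record might look, not the structure used in FIG. 4: the class name `SequenceRecord`, the `index` dictionary and the choice to count every occurrence (rather than every sentence) in `cardinality` are all hypothetical.

```python
from collections import defaultdict


class SequenceRecord:
    """Occurrences of one term sequence t1..tk (the y<t1, ..., tk> above)."""

    def __init__(self):
        # sentence number -> list of position tuples, one tuple per occurrence
        self.occurrences = defaultdict(list)

    def add(self, sentence_num, positions):
        """Record one occurrence at the given word positions of a sentence."""
        self.occurrences[sentence_num].append(tuple(positions))

    @property
    def cardinality(self):
        """Total number of recorded occurrences over all sentences."""
        return sum(len(found) for found in self.occurrences.values())

    @property
    def sentences(self):
        """Sorted numbers of the sentences that contain the sequence."""
        return sorted(self.occurrences)


# Build the single-word records in one pass over a toy word stream.
sentences = [
    ["Argentina", "defeat", "England"],
    ["Argentina", "god of war", "Maradona"],
]
index = defaultdict(SequenceRecord)   # tuple of terms -> SequenceRecord
for sentence_num, sentence in enumerate(sentences):
    for position, term in enumerate(sentence):
        index[(term,)].add(sentence_num, [position])
```

Keying the index by tuples lets the same structure hold two-word and three-word sequences later without any change.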

Because the time complexity of this computation is high, the algorithm speeds up execution as follows. On the first pass over the data it records in y<t> the numbers of the sentences (sentence_num) in which each word has appeared, as shown at 401, so that later computations need not rescan all the sentences. If y<t>.cardinality is greater than the minimum support, the word is a strong single word and is recorded in the strong single-word set, as shown at 402.

Next, all words in the strong single-word set are combined pairwise, as shown at 403. A candidate two-word sequence y<ta, tb> must satisfy two conditions: tb appears after ta, and the distance between ta and tb does not exceed the window size (WindowSize). If the cardinality of a candidate two-word sequence is greater than the minimum support, the candidate is recorded in the strong two-word sequence (large-2-sequence) set, as shown at 406.

Once the large-2-sequence set has been found, the strong k-word sequence (large-k-sequence) sets are found next. Each large-k-sequence set is found on the basis of the large-2-sequence set, with the support of each y<t1, t2, ..., tk> checked in the same manner. The algorithm continues until no new strong sequence is found. Finally, any strong k-word sequence contained in another sequence is deleted, and the remaining strong k-word sequences are the desired episodes. For constructing an ontology it is enough to find the large-3-sequence set, because strong sequences of two or three words already contain the information needed to build the ontology.
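The level-wise search described above can be sketched as follows. This is an illustrative reading, not the pseudocode of FIG. 4. The function names are hypothetical. A sequence is assumed to "fit" when all of its words lie within `window_size` consecutive words of one sentence, and support must be strictly greater than `minimum_support`. For brevity the sketch recounts from the sentences at each level instead of using the sentence-number index of step 401, and it prunes a candidate unless its prefix and suffix were strong at the previous level.

```python
from collections import Counter
from itertools import combinations


def count_sequences(sentences, k, window_size):
    """Count ordered k-word sequences that fit inside one window."""
    counts = Counter()
    for sentence in sentences:
        for start in range(len(sentence)):
            # Positions that share a window with the word at `start`.
            rest = range(start + 1, min(start + window_size, len(sentence)))
            for tail in combinations(rest, k - 1):
                seq = (sentence[start],) + tuple(sentence[i] for i in tail)
                counts[seq] += 1
    return counts


def is_subsequence(short, long):
    """True if `short` appears in `long` in order (gaps allowed)."""
    remaining = iter(long)
    return all(term in remaining for term in short)


def mine_episodes(sentences, window_size, minimum_support, max_k=3):
    """Return {episode: support} for strong sequences of 2..max_k words."""
    strong = {}
    previous = None
    for k in range(1, max_k + 1):
        counts = count_sequences(sentences, k, window_size)
        level = {
            seq: n for seq, n in counts.items()
            if n > minimum_support
            and (previous is None
                 or (seq[:-1] in previous and seq[1:] in previous))
        }
        if not level:
            break          # no new strong sequence: stop searching
        strong.update(level)
        previous = level
    # Delete sequences contained in a longer strong sequence.
    return {
        seq: n for seq, n in strong.items()
        if len(seq) >= 2 and not any(
            len(other) > len(seq) and is_subsequence(seq, other)
            for other in strong)
    }


# Toy word stream (English glosses of filtered words, invented for the demo).
example_sentences = [
    ["Brazil", "win", "champion"],
    ["Brazil", "win", "champion"],
    ["Brazil", "playing style"],
    ["Brazil", "playing style"],
]
episodes = mine_episodes(example_sentences, window_size=3, minimum_support=1)
```

On this toy input the two-word sequences inside "Brazil, win, champion" are strong but are dropped at the end because the three-word sequence contains them, which mirrors the final deletion step in the text.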

After receiving the set of episodes generated by the episode processing unit 25, the ontology generating unit 26 compares each episode with the concepts generated by the concept processing unit 24. If a word in an episode is an instance of a concept, the concept name is annotated after it. For example, from the concepts clustered by the concept processing unit 24 it is known that "South Korea", "Italy" and "Brazil" are instances of the concept "team", that "champion" is an instance of the concept "award", and that "Beckham" and "Rivaldo" are instances of the concept "player". The final annotation results are therefore: South Korea (Nca|team), Italy (Nca|team), Brazil (Nca|team), champion (Nad|award), England (Nca|team), Beckham (Nba|player), Rivaldo (Nba|player), South Korea team (Nba|team).
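The annotation step can be sketched as a dictionary lookup from instance to concept name. The names `CONCEPT_OF`, `annotate` and `render` are hypothetical, and the lookup table is hand-written here, whereas in the described system it would come from the clustering performed by the concept processing unit.

```python
# instance -> concept name, taken from the example in the text
CONCEPT_OF = {
    "South Korea": "team",
    "Italy": "team",
    "Brazil": "team",
    "England": "team",
    "champion": "award",
    "Beckham": "player",
    "Rivaldo": "player",
}


def annotate(episode, concept_of=CONCEPT_OF):
    """Attach a concept name (or None) to every (word, tag) pair."""
    return [(word, tag, concept_of.get(word)) for word, tag in episode]


def render(annotated):
    """Format an annotated episode in the 'word (tag|concept)' style."""
    parts = []
    for word, tag, concept in annotated:
        parts.append(f"{word} ({tag}|{concept})" if concept else f"{word} ({tag})")
    return ", ".join(parts)


annotated = annotate([("Brazil", "Nca"), ("win", "VJ3"), ("champion", "Nad")])
text = render(annotated)
```

Words that are not instances of any concept keep `None` as their concept, which later steps can use to tell instances from other words.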

In general, words that frequently appear together are semantically related. In simple Chinese grammar, for example, simple sentence-pattern relations such as "subject + verb + object" or "subject + verb + complement" can be found in sentences. The present invention, however, does not rely on grammatical sentence patterns to construct the ontology automatically. Instead, it uses patterns of the forms "instance-attribute-value", "concept-association-concept" or "instance-operation", which generally hold across a large number of documents, together with the order relations of the episodes obtained above, to find the attributes, operations and associations of the ontology.

After concept annotation, the ontology generating unit 26 uses the following sentence-pattern rules to extract attributes, operations and associations from the annotated episodes. FIG. 5a is a schematic diagram of the Chinese verb parts of speech according to an embodiment of the invention, comprising verb classes 511 to 515. FIG. 5b is a schematic diagram of the Chinese noun parts of speech according to an embodiment of the invention, comprising noun classes 521 to 531.

There are three extraction rules for an attribute 132: (1) the window size of the episode is 2; (2) the first word of the episode is an instance; (3) the second word of the episode is tagged as an individual noun 522, countable abstract noun 523, abstract noun 524, collective noun 525, common place noun 528 or stative intransitive predicate 514. For example, from the episode "Brazil (Nca|team), playing style (Nad)", "playing style" is extracted as an attribute of "Brazil".

There are three extraction rules for an operation 133: (1) the window size of the episode is 2; (2) the first word of the episode is an instance; (3) the second word of the episode is tagged as an active intransitive predicate 511. For example, from the episode "Brazil (Nca|team), win the title (VA)", "win the title" is extracted as an operation of "Brazil".

There are three extraction rules for an association R3: (1) the window size of the episode is 3; (2) the first and third words of the episode are instances; (3) the second word of the episode is tagged as a transitive verb (VB, VC, VD, VE, VF) 512, stative transitive verb (VI, VJ, VK, VL) 515, individual noun 522, countable abstract noun 523, abstract noun 524, collective noun 525 or common place noun 528. For example, from the episode "Brazil (Nca|team), win (VJ3), champion (Nad|award)", "win" is extracted as an association between "Brazil" and "champion".
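The three rule sets above can be sketched as a single function over annotated episodes. This is a sketch under stated assumptions. Only the tags VA, VB to VF, VI to VL and the Nad example appear in the text. The remaining tag prefixes (Nab, Nac, Nae, Ncb, VH) are assumed stand-ins for the noun classes 522, 523, 525, 528 and the verb class 514, whose tags the text does not give. The function name `extract` and the tuple layout are likewise hypothetical, and a word counts as an instance when it carries a concept annotation.

```python
# Tag prefixes per rule. Entries marked "assumed" are not given in the text.
NOUN_TAGS = ("Nab", "Nac", "Nad", "Nae", "Ncb")        # all but Nad assumed
ATTRIBUTE_TAGS = NOUN_TAGS + ("VH",)                    # VH assumed for 514
OPERATION_TAGS = ("VA",)                                # class 511
ASSOCIATION_TAGS = ("VB", "VC", "VD", "VE", "VF",       # class 512
                    "VI", "VJ", "VK", "VL") + NOUN_TAGS  # class 515 + nouns


def extract(episode):
    """Apply the pattern rules to one annotated episode.

    episode is a list of (word, tag, concept) triples, where concept is
    None for words that are not instances of any concept.
    Returns a tuple describing what was found, or None.
    """
    if len(episode) == 2:
        (first, _, concept), (second, tag, _) = episode
        if concept is None:
            return None                     # first word must be an instance
        if tag.startswith(ATTRIBUTE_TAGS):
            return ("attribute", first, second)
        if tag.startswith(OPERATION_TAGS):
            return ("operation", first, second)
    elif len(episode) == 3:
        (first, _, c1), (middle, tag, _), (third, _, c3) = episode
        if c1 is not None and c3 is not None and tag.startswith(ASSOCIATION_TAGS):
            return ("association", first, middle, third)
    return None


attribute = extract([("Brazil", "Nca", "team"), ("playing style", "Nad", None)])
operation = extract([("Brazil", "Nca", "team"), ("win the title", "VA", None)])
association = extract([("Brazil", "Nca", "team"), ("win", "VJ3", None),
                       ("champion", "Nad", "award")])
```

Because the rules only look at episode length, instance status and the tag of one word, they need no grammar parser, which is the point the passage makes.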

After extracting the attributes 132, operations 133 and associations R3, the ontology generating unit 26 can construct a domain ontology. FIG. 6 is a schematic diagram of an ontology architecture according to an embodiment of the invention; this ontology was constructed by the invention from 440 news articles related to the 2002 World Cup.

FIG. 7 is a flowchart of the automatic Chinese ontology construction method according to an embodiment of the invention. First, in step S71, at least one document 21 is input, and a Chinese word segmentation system (CKIP) is used to segment the Chinese sentences and tag parts of speech. In step S72, a stop-word filter deletes the meaningless words produced by step S71, for example punctuation marks and complements, leaving nouns and verbs of restricted types. In step S73, the nouns obtained in step S72 are input. The nouns with a high product of term frequency and inverse document frequency are selected first. The self-organizing map (SOM), an unsupervised learning model from neural network technology, is then used to analyze the strength of the relation between any two nouns and to group instances belonging to the same concept together. In step S74, the words and parts of speech obtained in step S72 are input and a plurality of episodes are generated using the algorithm shown in FIG. 4. An episode is an ordered combination of words within a window size. FIG. 3c shows two episodes with a window size of 3: "Argentina (Nc) - god of war (Na) - Maradona (Nb)" and "Argentina (Nc) - defeat (Vc) - England (Nc)". Next, in step S75, the set of episodes generated in step S74 is input and each episode is compared with the concepts generated in step S73; if a word in an episode is an instance of a concept, the concept name is annotated after it. In step S76, the attribute, operation and association pattern rules described above are used to extract, from the annotated episodes, the attributes 132, operations 133 and associations R3 used to construct the ontology. In step S77, the instances, attributes 132, operations 133 and associations R3 produced by step S76 are integrated to construct the domain ontology.

Furthermore, the invention provides a computer-readable storage medium that stores a computer program implementing the automatic Chinese ontology construction method; the method performs the steps described above. FIG. 8 is a schematic diagram of the computer-readable storage medium for the automatic Chinese ontology construction method according to an embodiment of the invention. The storage medium 80 stores a computer program 820 that implements the method. The computer program comprises seven pieces of logic: word segmentation and part-of-speech tagging logic 821, meaningless-word deletion logic 822, concept clustering logic 823, episode construction logic 824, instance annotation logic 825, attribute, operation and association extraction logic 826, and ontology generation logic 827.

Thus the automatic Chinese ontology construction system and method provided by the invention, besides constructing the ontology automatically, reduce the manpower required and allow the ontology to be updated more efficiently.

Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Anyone skilled in the art may make changes and refinements without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the appended claims.

Brief description of the drawings

To make the above objects, features and advantages of the invention more readily understood, embodiments are described in detail below with reference to the accompanying drawings:

FIG. 1 is a schematic diagram of a conventional ontology architecture;
FIG. 2 is a system diagram of the automatic Chinese ontology construction system according to an embodiment of the invention;
FIGS. 3a, 3b and 3c are schematic diagrams of example data according to an embodiment of the invention;
FIG. 4 is a schematic diagram of the episode processing algorithm according to an embodiment of the invention;
FIG. 5a is a schematic diagram of the Chinese verb parts of speech according to an embodiment of the invention;
FIG. 5b is a schematic diagram of the Chinese noun parts of speech according to an embodiment of the invention;
FIG. 6 is a schematic diagram of an ontology architecture according to an embodiment of the invention;
FIG. 7 is a flowchart of the automatic Chinese ontology construction method according to an embodiment of the invention;
FIG. 8 is a schematic diagram of the computer-readable storage medium for the automatic Chinese ontology construction method according to an embodiment of the invention.

Description of symbols

11: domain
12: category
13: concept
131: concept name
132: attribute
133: operation
21: document
22: Chinese dictionary
23: document processing unit
24: concept processing unit
25: episode processing unit
26: ontology generating unit
27: ontology
511, 512, ..., 515: Chinese verb parts of speech
521, 522, ..., 531: Chinese noun parts of speech
80: storage medium
820: computer program for automatic Chinese ontology construction
821: word segmentation and part-of-speech tagging logic
822: meaningless-word deletion logic
823: concept clustering logic
824: episode construction logic
825: instance annotation logic
826: attribute, operation and association extraction logic
827: ontology generation logic

Claims

1. A system for automatically constructing a Chinese ontology, adapted to receive a Chinese word stream comprising a sequential plurality of Chinese words and a part of speech corresponding to each of the Chinese words, and to generate a Chinese ontology, the system comprising: an episode processing unit for receiving the Chinese word stream and obtaining a plurality of strong two-word sequence combinations from the Chinese words and the part of speech corresponding to each of the Chinese words, wherein each strong two-word sequence combination includes a first Chinese word and a second Chinese word that exist among the Chinese words and are closely linked one after the other, and the number of times the strong two-word sequence combination appears in the Chinese word stream is greater than a first minimum support; and an ontology generating unit, coupled to the episode processing unit, for receiving the strong two-word sequence combinations, obtaining an attribute or an operation of a first concept according to the first concept corresponding to the first Chinese word in each strong two-word sequence combination and the part of speech corresponding to the second Chinese word, and establishing the Chinese ontology according to the first concept and the attribute or the operation.

2. The system for automatically constructing a Chinese ontology as claimed in claim 1, wherein, in the ontology generating unit, if the first Chinese word in the strong two-word sequence combination is an instance of the first concept and the part of speech of the second Chinese word in the strong two-word sequence combination is a noun or a stative intransitive predicate, the second Chinese word is the attribute corresponding to the first concept.

3. The system for automatically constructing a Chinese ontology as claimed in claim 2, wherein, in the ontology generating unit, the noun is an individual noun, a countable abstract noun, an abstract noun, a collective noun or a common place noun.

4. The system for automatically constructing a Chinese ontology as claimed in claim 1, wherein, in the ontology generating unit, if the first Chinese word in the strong two-word sequence combination is an instance of the first concept and the part of speech of the second Chinese word in the strong two-word sequence combination is an active intransitive predicate, the second Chinese word is the operation corresponding to the first concept.

5. The Chinese ontology automatic construction system according to claim 1, wherein the episode processing unit further obtains a plurality of strong three-word sequence combinations from the Chinese words, each strong three-word sequence combination includes a third Chinese word, a fourth Chinese word, and a fifth Chinese word that are immediately adjacent, in that order, among the Chinese words, and the number of times the strong three-word sequence combination appears in the Chinese word stream is greater than a second minimum support.

6. The Chinese ontology automatic construction system according to claim 5, wherein the ontology generating unit receives the strong three-word sequence combinations and obtains a relation corresponding to a second concept and a third concept according to the second concept, which corresponds to the third Chinese word in the strong three-word sequence combination, the third concept, which corresponds to the fifth Chinese word in the strong three-word sequence combination, and the part of speech corresponding to the fourth Chinese word in the strong three-word sequence combination.

7. The Chinese ontology automatic construction system according to claim 6, wherein, in the ontology generating unit, if the third Chinese word in the strong three-word sequence combination is a second entity corresponding to the second concept, the fifth Chinese word in the strong three-word sequence combination is a third entity corresponding to the third concept, and the part of speech of the fourth Chinese word in the strong three-word sequence combination is an action transitive predicate, then the fourth Chinese word is the relation corresponding to the second concept and the third concept.

8. The Chinese ontology automatic construction system according to claim 6, wherein the ontology generating unit further uses the relation corresponding to the second concept and the third concept to establish the Chinese ontology.

9. A method for automatically constructing a Chinese ontology, suitable for inputting a Chinese word stream that includes a plurality of Chinese words in sequence and a part of speech corresponding to each of the Chinese words, to generate a Chinese ontology, the method comprising the following steps:
receiving the Chinese word stream, which includes the Chinese words in sequence and the part of speech corresponding to each of the Chinese words;
obtaining a plurality of strong two-word sequence combinations, wherein each strong two-word sequence combination includes a first Chinese word and a second Chinese word that are immediately adjacent among the Chinese words, and the number of times the strong two-word sequence combination appears in the Chinese word stream is greater than a first minimum support;
obtaining an attribute or an operation corresponding to a first concept according to the first concept, which corresponds to the first Chinese word in each strong two-word sequence combination, and the part of speech corresponding to the second Chinese word in that combination; and
establishing the Chinese ontology according to the attribute or the operation corresponding to the first concept.

10. The method for automatically constructing a Chinese ontology according to claim 9, wherein, in the step of obtaining the attribute or the operation corresponding to the first concept, if the first Chinese word in the strong two-word sequence combination is a first entity corresponding to the first concept, and the part of speech of the second Chinese word in the strong two-word sequence combination is a noun or a stative intransitive predicate, then the second Chinese word is the attribute corresponding to the first concept.

11. The method for automatically constructing a Chinese ontology according to claim 10, wherein the noun is a material noun, an aphoristic noun, an abstract noun, a collective noun, or a common local noun.

12. The method for automatically constructing a Chinese ontology according to claim 9, wherein, in the step of obtaining the attribute or the operation corresponding to the first concept, if the first Chinese word in the strong two-word sequence combination is a first entity corresponding to the first concept, and the part of speech of the second Chinese word in the strong two-word sequence combination is an action intransitive predicate, then the second Chinese word is the operation corresponding to the first concept.
13. The method for automatically constructing a Chinese ontology according to claim 9, further comprising the following step: obtaining a plurality of strong three-word sequence combinations from the Chinese words, wherein each strong three-word sequence combination includes a third Chinese word, a fourth Chinese word, and a fifth Chinese word that are immediately adjacent, in that order, among the Chinese words, and the number of times the strong three-word sequence combination appears in the Chinese word stream is greater than a second minimum support.

14. The method for automatically constructing a Chinese ontology according to claim 13, further comprising the following step: receiving the strong three-word sequence combinations and obtaining a relation corresponding to a second concept and a third concept according to the second concept, which corresponds to the third Chinese word in the strong three-word sequence combination, the third concept, which corresponds to the fifth Chinese word in the strong three-word sequence combination, and the part of speech corresponding to the fourth Chinese word in the strong three-word sequence combination.

15. The method for automatically constructing a Chinese ontology according to claim 14, wherein, in the step of obtaining the relation corresponding to the second concept and the third concept, if the third Chinese word in the strong three-word sequence combination is a second entity corresponding to the second concept, the fifth Chinese word in the strong three-word sequence combination is a third entity corresponding to the third concept, and the part of speech of the fourth Chinese word in the strong three-word sequence combination is an action transitive predicate, then the fourth Chinese word is the relation corresponding to the second concept and the third concept.

16. The method for automatically constructing a Chinese ontology according to claim 14, further comprising the following step: using the relation corresponding to the second concept and the third concept to establish the Chinese ontology.

17. A computer-readable storage medium storing a computer program that, when loaded into a computer system, causes the computer system to execute the method according to any one of claims 9 to 16.
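
Illustrative sketch (not part of the claims). The steps recited in the method claims can be pictured roughly as follows: count adjacent two-word and three-word sequences in the tagged word stream, keep those whose count exceeds a minimum support, and map each one to an attribute, an operation, or a relation according to the part of speech of its second or fourth word. The function names, the tag constants, the dictionary layout of the ontology, and the assumption that a word-to-concept mapping for entities is already available are all hypothetical choices made for this example; the patent text does not specify them.

```python
from collections import Counter

# Hypothetical part-of-speech labels; the claims name these categories
# but do not prescribe a concrete tag set.
NOUN = "noun"
STATIVE_INTRANSITIVE = "stative_intransitive"
ACTION_INTRANSITIVE = "action_intransitive"
ACTION_TRANSITIVE = "action_transitive"


def strong_sequences(words, n, min_support):
    """Return adjacent n-word sequences that occur more than min_support times."""
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {seq for seq, count in counts.items() if count > min_support}


def build_ontology(stream, entity_concept, min_support_2, min_support_3):
    """Build a toy ontology from a tagged word stream.

    stream: list of (word, part_of_speech) pairs in document order.
    entity_concept: assumed mapping from an entity word to its concept name.
    """
    words = [word for word, _ in stream]
    pos_of = dict(stream)  # simplification: one part of speech per word
    ontology = {
        concept: {"attributes": set(), "operations": set(), "relations": set()}
        for concept in entity_concept.values()
    }

    # Two-word sequences: the second word becomes an attribute or an operation.
    for first, second in strong_sequences(words, 2, min_support_2):
        concept = entity_concept.get(first)
        if concept is None:
            continue
        if pos_of[second] in (NOUN, STATIVE_INTRANSITIVE):
            ontology[concept]["attributes"].add(second)
        elif pos_of[second] == ACTION_INTRANSITIVE:
            ontology[concept]["operations"].add(second)

    # Three-word sequences: the middle word becomes a relation between concepts.
    for third, fourth, fifth in strong_sequences(words, 3, min_support_3):
        source = entity_concept.get(third)
        target = entity_concept.get(fifth)
        if source is None or target is None:
            continue
        if pos_of[fourth] == ACTION_TRANSITIVE:
            ontology[source]["relations"].add((fourth, target))

    return ontology


# Made-up example stream: each phrase is repeated so that it exceeds a support of 1.
stream = (
    [("颱風", NOUN), ("強度", NOUN)] * 2
    + [("颱風", NOUN), ("登陸", ACTION_INTRANSITIVE)] * 2
    + [("颱風", NOUN), ("侵襲", ACTION_TRANSITIVE), ("台灣", NOUN)] * 2
)
entity_concept = {"颱風": "Typhoon", "台灣": "Place"}
ontology = build_ontology(stream, entity_concept, min_support_2=1, min_support_3=1)
```

In this made-up stream, "強度" (a noun following the entity "颱風") is recorded as an attribute of the Typhoon concept, "登陸" (an action intransitive predicate) as an operation, and "侵襲" (an action transitive predicate between two entities) as a relation from Typhoon to Place. A strict "greater than" comparison is used for the support test because the claims require the count to exceed the minimum support.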