TWI225997B - Chinese ontology auto-establishment system and method, and storage media - Google Patents

Chinese ontology auto-establishment system and method, and storage media

Info

Publication number
TWI225997B
Authority
TW
Taiwan
Prior art keywords
chinese
word
mentioned
strong
concept
Prior art date
Application number
TW92122079A
Other languages
Chinese (zh)
Other versions
TW200506655A (en)
Inventor
Yuan-Fang Kao
Chang-Shing Lee
Yau-Hwang Kuo
I-Heng Meng
Original Assignee
Inst Information Industry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inst Information Industry
Priority to TW92122079A
Application granted
Publication of TWI225997B
Publication of TW200506655A

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a system and method for automatically constructing a Chinese ontology. The system includes an episode processing unit and an ontology generating unit. The episode processing unit receives a Chinese word stream, which contains a plurality of ordered Chinese words and their parts of speech, and obtains a plurality of strong two-word sequences from those words. Each strong two-word sequence consists of a first Chinese word and a second Chinese word that occur close together and in order in the stream, and the number of times the sequence occurs in the stream is greater than a first minimum support. The ontology generating unit is coupled to the episode processing unit and receives the strong two-word sequences. Based on a first concept corresponding to the first Chinese word and the part of speech of the second Chinese word, it obtains an attribute or an operation of the first concept.

Description

V. Description of the invention

Technical field

This invention relates to a system and method for automatically constructing an ontology, and in particular to a system and method for automatically constructing a Chinese ontology.

Prior art

An ontology is a conceptual structure that describes the relationships between things. Fig. 1 is a schematic diagram of a conventional ontology architecture. The architecture contains several main elements: a domain 11, categories 12, concepts 13 (each with a concept name 131, attributes 132 and operations 133), and relations R1, R2 and R3. The domain 11 represents the specific field that the ontology describes, and each domain 11 can be divided into several categories 12. A concept 13 stored in the ontology comprises a concept name 131, attributes 132 and operations 133. Relations are of three kinds: association R1, generalization R2 and aggregation R3. Association R1 expresses a general semantic relation between concepts 13. Generalization R2 is a hierarchical relation between levels of abstraction; the higher a concept sits, the more abstract it is. Aggregation R3 is a grouping relation that expresses the set relationship among concepts.

A well-constructed ontology can be used by search engines, knowledge management, e-commerce and other applications to make search more efficient or to improve document processing. Several common English ontologies, such as WordNet and Cyc, and Chinese ontologies, such as HowNet, are available for users to download and use.
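The architecture of Fig. 1 can be pictured as a small set of record types. The following is a minimal sketch in Python, not part of the patent: the class names, field names and the sample football values are assumptions chosen for illustration, and only the reference numerals in the comments come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class RelationKind(Enum):
    ASSOCIATION = "R1"      # general semantic link between concepts
    GENERALIZATION = "R2"   # hierarchy; upper concepts are more abstract
    AGGREGATION = "R3"      # grouping (set) relationship


@dataclass
class Concept:
    name: str                                            # concept name 131
    attributes: list[str] = field(default_factory=list)  # attributes 132
    operations: list[str] = field(default_factory=list)  # operations 133


@dataclass
class Relation:
    kind: RelationKind
    source: str      # name of the concept the relation starts from
    target: str      # name of the concept the relation points to
    label: str = ""  # optional word that names the relation


@dataclass
class Category:                                          # category 12
    name: str
    concepts: dict[str, Concept] = field(default_factory=dict)


@dataclass
class Domain:                                            # domain 11
    name: str
    categories: dict[str, Category] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)


def build_example() -> Domain:
    """Build a tiny, purely illustrative ontology (sample values are hypothetical)."""
    team = Concept("team", attributes=["playing style"], operations=["win the title"])
    award = Concept("award")
    football = Category("football", concepts={"team": team, "award": award})
    domain = Domain("sports news", categories={"football": football})
    domain.relations.append(
        Relation(RelationKind.ASSOCIATION, "team", "award", label="win")
    )
    return domain
```

Storing relations on the domain rather than on each concept is one possible design choice; it keeps a concept record limited to the three parts the text names (name, attributes, operations).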

Several established ontologies, such as WordNet, Cyc and HowNet, are available today. Because each was built for its own domain, however, the concepts 13, attributes 132, operations 133 and relations R1, R2 and R3 they provide do not necessarily satisfy a user's needs. The ontology a system requires usually differs from one domain to another, so an application system often has to develop its own ontology to meet its needs.

At present, ontologies are constructed manually: a great deal of manpower is used to establish all the concepts 13, attributes 132, operations 133 and relations in a domain. This approach has several drawbacks. First, it consumes a great deal of manpower. Second, when two or more people build an ontology, their differing viewpoints mean that considerable time must be spent resolving disagreements. Finally, because knowledge evolves rapidly, a manually built ontology is often updated too slowly to meet the current needs of the application system.

To avoid these drawbacks, another feasible approach is to build a prototype ontology from a large number of documents (referred to as automatic ontology construction) and then have professionals revise it into a usable ontology. This reduces the manpower required and allows the ontology to be updated more efficiently.

For English, automatic ontology construction mostly relies on a grammar parser to extract concepts, attributes, operations and relations R1, R2 and R3 from a large number of documents. Chinese grammar rules, however, are complex, and a highly accurate Chinese grammar parser is lacking, so a grammar parser cannot simply be applied to construct a Chinese ontology automatically. A system and method for automatically constructing a Chinese ontology is therefore needed.

Summary of the invention

In view of this, an object of the present invention is to provide a system and method for automatically constructing a Chinese ontology which, besides constructing the ontology automatically, reduces the manpower required and allows the ontology to be updated more efficiently.

To this end, the system and method of the present invention provide a document, a Chinese dictionary, a document processing unit, a concept processing unit, an episode processing unit, an ontology generating unit and an ontology. The document processing unit takes at least one document as input and finds the meaningful nouns and verbs in it, producing a Chinese word stream that contains a plurality of ordered Chinese words and their parts of speech. The concept processing unit takes the nouns obtained by the document processing unit, analyzes the strength of the relationship between any two nouns, and clusters instances that belong to the same concept together. The episode processing unit takes the Chinese word stream obtained by the document processing unit and produces a plurality of episodes; an episode is an ordered combination of several words within a window size. The ontology generating unit takes the set of episodes produced by the episode processing unit and compares each episode with the concepts produced by the concept processing unit; if a word in an episode is an instance of some concept, the concept name is annotated after that word. After concept annotation, the ontology generating unit applies pattern rules to the annotated episodes to extract attributes, operations and associations, and then constructs a domain ontology from them.

Embodiments

Fig. 2 is a schematic diagram of a system for automatically constructing a Chinese ontology according to an embodiment of the present invention.


The system 2 for automatically constructing a Chinese ontology according to the embodiment includes a document 21, a Chinese dictionary 22, a document processing unit 23, a concept processing unit 24, an episode processing unit 25, an ontology generating unit 26 and an ontology 27.

The document 21 is the input to the system. It is an electronic Chinese document in Word, HTML, PowerPoint or any other electronic format that can store Chinese text.

The Chinese dictionary 22 is an electronic Chinese dictionary containing a plurality of Chinese words, each of which consists of at least one Chinese character.

The document processing unit 23 finds the meaningful nouns and verbs in the document 21. It first segments each sentence into words and tags each word with its part of speech, and then applies a stop word filter to keep the meaningful nouns and verbs. The process is illustrated below with real data. Suppose there is a passage about the Argentine "God of War" Maradona and the "Hand of God". After the document processing unit 23 segments and tags the sentence with a Chinese word segmentation system, the sentence is split, as shown in Fig. 3a, into 14 words such as "Argentina", "God of War" and "Maradona". Each word is followed by its part of speech in parentheses: tags beginning with N denote nouns, tags beginning with V denote verbs, P denotes a preposition, PARENTHESISCATEGORY denotes a parenthesis and PERIODCATEGORY denotes a full stop. Based on these words and tags, the document processing unit 23 then uses the stop word filter to keep only nouns and verbs of specified types, for example words tagged Na, Nb, Nc or Vc, which form a Chinese word stream as shown in Fig. 3b.
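The filtering step of the document processing unit 23 can be sketched as a simple tag whitelist. The code below is illustrative only: the function name and the tagged input are assumptions (the actual 14-word sentence of Fig. 3a is not reproduced here, and English glosses stand in for the Chinese words); the kept tags Na, Nb, Nc and Vc are the ones named in the text.

```python
# Part-of-speech tags that survive the stop word filter, as named in the text.
KEPT_TAGS = frozenset({"Na", "Nb", "Nc", "Vc"})


def filter_tagged_words(tagged_words, kept_tags=KEPT_TAGS):
    """Keep only (word, tag) pairs whose tag is whitelisted, preserving order."""
    return [(word, tag) for word, tag in tagged_words if tag in kept_tags]


# Hypothetical segmenter output: English glosses stand in for Chinese words.
tagged_sentence = [
    ("Argentina", "Nc"),
    ("God of War", "Na"),
    ("Maradona", "Nb"),
    ("(", "PARENTHESISCATEGORY"),
    ("by", "P"),
    ("defeat", "Vc"),
    ("England", "Nc"),
    (".", "PERIODCATEGORY"),
]

word_stream = filter_tagged_words(tagged_sentence)
```

Because the filter preserves input order, the result can be used directly as the ordered word stream that the later episode processing expects.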

The concept processing unit 24 takes the nouns obtained by the document processing unit 23 and first selects those with a high value of term frequency multiplied by inverse document frequency. It then uses a self-organizing map (SOM), an unsupervised-learning model from neural network technology, to analyze the strength of the relationship between any two nouns and to group instances that belong to the same concept.

The episode processing unit 25 takes the Chinese word stream obtained by the document processing unit 23 and produces a plurality of episodes. An episode is an ordered combination of several words within a window size. Fig. 3c shows two episodes with a window size of 3: "Argentina (Nc) - God of War (Na) - Maradona (Nb)" and "Argentina (Nc) - defeat (Vc) - England (Nc)".

Fig. 4 shows the episode processing algorithm according to an embodiment of the present invention; the figure contains pseudocode lines 400 to 420. The variables, parameters and data structures the algorithm uses are as follows:

(1) WindowSize, the window size, is an input parameter that limits the number of words in each episode;
(2) minimum_Support, the minimum support, is an input parameter that sets the minimum number of occurrences of each episode;
(3) Y<t1, t2, ..., tk> is a data structure that records the sentences in which the term sequence t1, t2, ..., tk occurs;
(4) Y<t1, t2, ..., tk>.cardinality records the total number of times the term sequence t1, t2, ..., tk occurs;
(5) ti.position records the position at which ti occurs in a sentence.
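For single terms, the bookkeeping in items (3) to (5) can be sketched as an index from each term to the sentences and positions where it occurs. This is a minimal sketch under stated assumptions: the function names and the nested-dictionary layout are illustrative choices, not the structure used in Fig. 4, and English glosses stand in for the Chinese words.

```python
from collections import defaultdict


def build_term_index(sentences):
    """Return {term: {sentence_num: [positions]}} for a list of word lists.

    This plays the role of Y<t> for single terms: it records in which
    sentences a term occurs and at which positions.
    """
    index = defaultdict(lambda: defaultdict(list))
    for sentence_num, sentence in enumerate(sentences):
        for position, term in enumerate(sentence):
            index[term][sentence_num].append(position)
    # Convert to plain dicts so that later lookups cannot create empty entries.
    return {term: dict(hits) for term, hits in index.items()}


def cardinality(index, term):
    """Total number of occurrences of the term across all sentences."""
    return sum(len(positions) for positions in index.get(term, {}).values())


sentences = [
    ["Argentina", "God of War", "Maradona"],
    ["Argentina", "defeat", "England"],
]
index = build_term_index(sentences)
```

Here `cardinality(index, "Argentina")` evaluates to 2, since the term occurs once in each of the two sentences.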

The time complexity of this computation is high. To speed up the algorithm, when the data is first read the number of every sentence in which each word occurs (sentence_num) is recorded in Y<t>, as shown at 401, so that later steps do not need to rescan all the sentences. If Y<t>.cardinality is greater than the minimum support, the word is a strong single word and is recorded in the set of strong single words, as shown at 402. Next, the words in the strong single-word set are combined in ordered pairs to form candidate two-word sequences. Every candidate two-word sequence <ta, tb> must satisfy two conditions: tb occurs after ta, and the distance between ta and tb does not exceed the window size (WindowSize). If the cardinality of a candidate two-word sequence is greater than the minimum support, the candidate is recorded in the set of strong two-word sequences (large-2-sequence), as shown at 406.

Once the large-2-sequence set has been found, the sets of strong k-word sequences (large-k-sequence) are derived from it, and the support of each strong k-word sequence can be obtained from the corresponding Y<t1, t2, ..., tk> records. The algorithm continues until no new strong sequence is found. Finally, every strong sequence that is contained in another strong sequence is deleted, and the remaining strong k-word sequences are the desired episodes. For the purpose of constructing an ontology, it is enough to stop at the large-3-sequence set, because strong sequences of two or three words already contain the information needed to construct the ontology.
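The first two levels of this search (strong single words, then strong two-word sequences) can be sketched as follows. This is a simplified illustration and not the pseudocode of Fig. 4: it recounts directly from the sentences instead of using the Y<...> records, the function name is hypothetical, and it assumes that "within the window" means the two words fit inside a span of `window_size` consecutive words.

```python
from collections import Counter


def strong_two_word_sequences(sentences, window_size, minimum_support):
    """Return {(ta, tb): count} for ordered pairs whose count exceeds the support.

    A pair (ta, tb) is counted when tb occurs after ta in the same sentence
    and both fall inside a span of `window_size` consecutive words (assumed
    reading of the window condition).
    """
    # Level 1: strong single words occur more than `minimum_support` times.
    word_counts = Counter(word for sentence in sentences for word in sentence)
    strong_words = {w for w, c in word_counts.items() if c > minimum_support}

    # Level 2: count ordered pairs built only from strong single words.
    pair_counts = Counter()
    for sentence in sentences:
        for i, first in enumerate(sentence):
            if first not in strong_words:
                continue
            last = min(i + window_size, len(sentence))
            for j in range(i + 1, last):
                if sentence[j] in strong_words:
                    pair_counts[(first, sentence[j])] += 1

    return {pair: c for pair, c in pair_counts.items() if c > minimum_support}


sentences = [
    ["Brazil", "win", "championship"],
    ["Brazil", "win", "championship"],
    ["Brazil", "playing style"],
    ["England", "lose"],
]
adjacent_pairs = strong_two_word_sequences(sentences, window_size=2, minimum_support=1)
wider_pairs = strong_two_word_sequences(sentences, window_size=3, minimum_support=1)
```

With a window of 2 only adjacent pairs qualify; widening the window to 3 additionally admits the pair ("Brazil", "championship"). Pruning candidates to strong single words first mirrors the level-wise idea in the text: a pair cannot be strong unless both of its words are.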

After receiving the set of episodes produced by the episode processing unit 25, the ontology generating unit 26 compares each episode with the concepts produced by the concept processing unit 24. If a word in an episode is an instance of some concept, the concept name is annotated after the word. For example, from the clusters produced by the concept processing unit 24 it is known that "South Korea", "Italy" and "Brazil" are instances of the concept "team", that "championship" is an instance of the concept "award", and that "Beckham" and "Rivaldo" (李瓦度) are instances of the concept "player". The resulting annotations are: South Korea (Nca|team), Italy (Nca|team), Brazil (Nca|team), championship (Nad|award), England (Nca|team), Beckham (Nba|player), Rivaldo (Nba|player), South Korea team (Nba|team).
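The annotation step amounts to a dictionary lookup from instance to concept. The sketch below assumes the clustering result is available as a plain mapping; the function name and the data layout are illustrative, while the "tag|concept" notation follows the examples in the text.

```python
def annotate_episode(episode, instance_to_concept):
    """Append '|concept' to the tag of every word that is a known instance.

    `episode` is a list of (word, part_of_speech) pairs; words that are not
    instances of any concept keep their tag unchanged.
    """
    annotated = []
    for word, pos in episode:
        concept = instance_to_concept.get(word)
        tag = f"{pos}|{concept}" if concept is not None else pos
        annotated.append((word, tag))
    return annotated


# Assumed shape of the clustering output: instance -> concept name.
instance_to_concept = {
    "South Korea": "team",
    "Italy": "team",
    "Brazil": "team",
    "championship": "award",
    "Beckham": "player",
}

episode = [("Brazil", "Nca"), ("win", "VJ3"), ("championship", "Nad")]
annotated_episode = annotate_episode(episode, instance_to_concept)
```

The verb "win" is left untouched because it is not an instance of any concept, which is what later lets the pattern rules treat it as the linking word.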

In general, words that frequently occur together are semantically related. In simple Chinese grammar, for example, a sentence exhibits simple sentence patterns such as "subject + verb + object" or "subject + verb + complement". The present invention, however, does not rely on grammatical sentence patterns to construct the ontology automatically. Instead, it takes patterns of the forms "instance-attribute-value", "concept-association-concept" and "instance-operation", which occur widely in large numbers of documents, and uses the order of the words in the episodes obtained above to find the attributes, operations and associations of the ontology.

After concept annotation, the ontology generating unit 26 applies the following pattern rules to the annotated episodes to extract attributes, operations and associations. Fig. 5a shows the Chinese verb parts of speech used in the embodiment, numbered 511 to 515. Fig. 5b shows the Chinese noun parts of speech used in the embodiment, numbered 521 to 531.

An attribute 132 is extracted when three conditions hold: (1) the window size of the episode is 2; (2) the first word of the episode is an instance; (3) the second word is tagged as an individual noun 522, a countable abstract noun 523, an abstract noun 524, a collective noun 525, a common place noun 528 or a stative intransitive predicate 514. For example, from the episode "Brazil (Nca|team), playing style (Nad)", "playing style" is extracted as an attribute of "Brazil".

An operation 133 is extracted when three conditions hold: (1) the window size of the episode is 2; (2) the first word of the episode is an instance; (3) the second word is tagged as an active intransitive predicate 511. For example, from the episode "Brazil (Nca|team), win the title (VA)", "win the title" is extracted as an operation of "Brazil".

An association R3 is extracted when three conditions hold: (1) the window size of the episode is 3; (2) the first and third words of the episode are instances; (3) the second word is tagged as a transitive verb (VB, VC, VD, VE, VF) 512, a stative transitive verb (VI, VJ, VK, VL) 515, an individual noun 522, a countable abstract noun 523, an abstract noun 524, a collective noun 525 or a common place noun 528. For example, from the episode "Brazil (Nca|team), win (VJ3), championship (Nad|award)", "win" is extracted as an association between "Brazil" and "championship".
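The three rule sets can be sketched as one classifier over annotated episodes. Only the verb codes VA, VB to VF and VI to VL and the tag Nad appear in the text; the remaining tag codes used below for the noun classes and for the stative intransitive predicate (Nab, Nac, Nae, Ncb, VH) are assumptions, as are the function name and the triple layout (word, tag, concept or None).

```python
# Tag prefixes per rule. Codes other than VA, VB-VF, VI-VL and Nad are assumed.
ATTRIBUTE_TAGS = ("Nab", "Nac", "Nad", "Nae", "Ncb", "VH")
OPERATION_TAGS = ("VA",)
ASSOCIATION_TAGS = (
    "VB", "VC", "VD", "VE", "VF",   # transitive verbs
    "VI", "VJ", "VK", "VL",         # stative transitive verbs
    "Nab", "Nac", "Nad", "Nae", "Ncb",
)


def classify_episode(episode):
    """Apply the attribute, operation and association rules to one episode.

    `episode` is a list of (word, tag, concept) triples, where `concept` is
    None for words that are not instances. Returns a tuple describing the
    extracted element, or None when no rule matches.
    """
    if len(episode) == 2:
        (first, _, first_concept), (second, second_tag, _) = episode
        if first_concept is None:
            return None
        if second_tag.startswith(ATTRIBUTE_TAGS):
            return ("attribute", first, second)
        if second_tag.startswith(OPERATION_TAGS):
            return ("operation", first, second)
    elif len(episode) == 3:
        (first, _, c1), (middle, middle_tag, _), (third, _, c3) = episode
        if c1 is not None and c3 is not None and middle_tag.startswith(ASSOCIATION_TAGS):
            return ("association", first, middle, third)
    return None


attribute = classify_episode(
    [("Brazil", "Nca", "team"), ("playing style", "Nad", None)]
)
operation = classify_episode(
    [("Brazil", "Nca", "team"), ("win the title", "VA", None)]
)
association = classify_episode(
    [("Brazil", "Nca", "team"), ("win", "VJ3", None), ("championship", "Nad", "award")]
)
```

Matching on tag prefixes rather than exact tags is a design choice made so that sub-typed tags such as VJ3 fall under the VJ class.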

Having extracted the attributes 132, operations 133 and associations R3, the ontology generating unit 26 can construct a domain ontology. Fig. 6 is a schematic diagram of an ontology architecture according to an embodiment of the present invention; this ontology was constructed by the invention from 440 news articles about the 2002 World Cup.

Fig. 7 is a flowchart of a method for automatically constructing a Chinese ontology according to an embodiment of the present invention.

First, in step S71, at least one document 21 is input, and a Chinese word segmentation system (CKIP) segments the Chinese sentences into words and tags their parts of speech. In step S72, a stop word filter deletes the meaningless words produced in step S71, such as punctuation marks and complements, leaving nouns and verbs of specified types.

Then, in step S73, the nouns obtained in step S72 are input. The nouns with a high value of term frequency multiplied by inverse document frequency are selected first, and a self-organizing map (SOM), an unsupervised-learning model from neural network technology, is then used to analyze the strength of the relationship between any two nouns and to group instances that belong to the same concept.

In step S74, the words and parts of speech obtained in step S72 are input and a plurality of episodes is produced by the algorithm shown in Fig. 4. An episode is an ordered combination of several words within a window size; Fig. 3c shows two episodes with a window size of 3, "Argentina (Nc) - God of War (Na) - Maradona (Nb)" and "Argentina (Nc) - defeat (Vc) - England (Nc)". Next, in step S75, the set of episodes produced in step S74 is input and each episode is compared with the concepts produced in step S73; if a word in an episode is an instance of some concept, the concept name is annotated after the word.
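The noun selection at the start of step S73 can be sketched as a ranking by term frequency times inverse document frequency. The patent does not give the exact formula, so the version below is an assumption: corpus-wide term frequency multiplied by the natural logarithm of (number of documents / document frequency). The function names and sample documents are hypothetical, and the SOM clustering that follows this step is not shown.

```python
import math
from collections import Counter


def tf_idf_scores(documents):
    """Score each noun as tf * ln(N / df) over a list of noun lists (assumed formula)."""
    n_docs = len(documents)
    term_freq = Counter(noun for doc in documents for noun in doc)
    doc_freq = Counter(noun for doc in documents for noun in set(doc))
    return {
        noun: term_freq[noun] * math.log(n_docs / doc_freq[noun])
        for noun in term_freq
    }


def top_nouns(documents, k):
    """Return the k highest-scoring nouns, breaking ties alphabetically."""
    scores = tf_idf_scores(documents)
    return sorted(scores, key=lambda noun: (-scores[noun], noun))[:k]


documents = [
    ["Brazil", "Brazil", "championship"],
    ["Brazil", "England"],
    ["England", "referee", "referee", "referee"],
]
selected = top_nouns(documents, 2)
```

Under this formula a noun that occurs in every document scores zero, so very common nouns are ranked last.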

In step S76, the pattern rules for attributes, operations and associations described above are applied to the annotated episodes to extract the attributes 132, operations 133 and associations R3 used to construct the ontology. In step S77, the instances, attributes 132, operations 133 and associations R3 produced in step S76 are integrated to construct the domain ontology.

The present invention further provides a computer-readable storage medium that stores a computer program. The computer program implements the method for automatically constructing a Chinese ontology and performs the steps described above.

Fig. 8 is a schematic diagram of a computer-readable storage medium for the method for automatically constructing a Chinese ontology according to an embodiment of the present invention. The storage medium 80 stores a computer program 820 that implements the method. The computer program comprises seven pieces of logic: word segmentation and part-of-speech tagging logic 821, meaningless-word deletion logic 822, concept clustering logic 823, episode construction logic 824, instance annotation logic 825, attribute, operation and association extraction logic 826, and ontology generation logic 827.

Thus, the system and method provided by the present invention not only construct an ontology automatically but also reduce the manpower required and allow the ontology to be updated more efficiently.

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Anyone skilled in the art may make changes and modifications without departing from the spirit and scope of the invention, and the scope of protection is therefore defined by the appended claims.

Brief description of the drawings

To make the above objects, features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings:

Fig. 1 is a schematic diagram of a conventional ontology architecture;
Fig. 2 is a schematic diagram of a system for automatically constructing a Chinese ontology according to an embodiment of the present invention;
Figs. 3a, 3b and 3c are schematic diagrams of example data according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the episode processing algorithm according to an embodiment of the present invention;
Fig. 5a is a schematic diagram of the Chinese verb parts of speech according to an embodiment of the present invention;
Fig. 5b is a schematic diagram of the Chinese noun parts of speech according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of an ontology architecture according to an embodiment of the present invention;
Fig. 7 is a flowchart of a method for automatically constructing a Chinese ontology according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of a computer-readable storage medium for the method for automatically constructing a Chinese ontology according to an embodiment of the present invention.

Explanation of symbols

11: domain
12: category
13: concept
131: concept name
132: attribute
133: operation
21: document
22: Chinese dictionary
23: document processing unit
24: concept processing unit
25: episode processing unit
26: ontology generating unit
27: ontology
511, 512, ..., 515: Chinese verb parts of speech
521, 522, ..., 531: Chinese noun parts of speech
80: storage medium
820: computer program for automatically constructing a Chinese ontology
821: word segmentation and part-of-speech tagging logic
822: meaningless-word deletion logic
823: concept clustering logic
824: episode construction logic
825: instance annotation logic
826: attribute, operation and association extraction logic
827: ontology generation logic


Claims (1)

1. A Chinese ontology auto-establishment system, adapted to receive a Chinese word stream comprising a plurality of sequential Chinese words and a part of speech corresponding to each of the Chinese words, for generating a Chinese ontology, the system comprising:
an episode processing unit for receiving the Chinese word stream, which comprises the sequential Chinese words and the part of speech corresponding to each of the Chinese words, and obtaining a plurality of strong two-word sequence combinations from the Chinese words, wherein each strong two-word sequence combination comprises a first Chinese word and a second Chinese word that are immediately adjacent to each other in the Chinese words, and the number of times the strong two-word sequence combination occurs in the Chinese word stream is greater than a first minimum support; and
an ontology generating unit, coupled to the episode processing unit, for receiving the strong two-word sequence combinations, obtaining an attribute or an operation corresponding to a first concept according to the first concept corresponding to the first Chinese word in each strong two-word sequence combination and the part of speech corresponding to the second Chinese word in each strong two-word sequence combination, and establishing the Chinese ontology according to the attribute or the operation of the first concept.

2. The Chinese ontology auto-establishment system as claimed in claim 1, wherein, in the ontology generating unit, if the first Chinese word in the strong two-word sequence combination is a first entity corresponding to the first concept and the part of speech of the second Chinese word in the strong two-word sequence combination is a noun or a stative intransitive predicate, the second Chinese word is the attribute corresponding to the first concept.

3. The Chinese ontology auto-establishment system as claimed in claim 2, wherein, in the ontology generating unit, the noun is a material noun, a countable abstract noun, an abstract noun, a collective noun, or a common place noun.

4. The Chinese ontology auto-establishment system as claimed in claim 1, wherein, in the ontology generating unit, if the first Chinese word in the strong two-word sequence combination is a first entity corresponding to the first concept and the part of speech of the second Chinese word in the strong two-word sequence combination is an active intransitive predicate, the second Chinese word is the operation corresponding to the first concept.

5. The Chinese ontology auto-establishment system as claimed in claim 1, wherein the episode processing unit obtains a plurality of strong three-word sequence combinations from the Chinese words, each strong three-word sequence combination comprises a third Chinese word, a fourth Chinese word, and a fifth Chinese word that are immediately adjacent to one another in the Chinese words, and the number of times the strong three-word sequence combination occurs in the Chinese word stream is greater than a second minimum support.

6. The Chinese ontology auto-establishment system as claimed in claim 5, wherein the ontology generating unit receives the strong three-word sequence combinations and obtains a relation corresponding to a second concept and a third concept according to the second concept corresponding to the third Chinese word in the strong three-word sequence combination, the third concept corresponding to the fifth Chinese word in the strong three-word sequence combination, and a part of speech corresponding to the fourth Chinese word in the strong three-word sequence combination.

7. The Chinese ontology auto-establishment system as claimed in claim 6, wherein, in the ontology generating unit, if the third Chinese word in the strong three-word sequence combination is a second entity corresponding to the second concept, the fifth Chinese word in the strong three-word sequence combination is a third entity corresponding to the third concept, and the part of speech of the fourth Chinese word in the strong three-word sequence combination is an active transitive predicate, the fourth Chinese word is the relation between the second concept and the third concept.

8. The Chinese ontology auto-establishment system as claimed in claim 6, wherein the ontology generating unit takes the relation corresponding to the second concept and the third concept as input to establish the Chinese ontology.

9. A Chinese ontology auto-establishment method, adapted to receive a Chinese word stream comprising a plurality of sequential Chinese words and a part of speech corresponding to each of the Chinese words, for generating a Chinese ontology, the method comprising the following steps:
receiving the Chinese word stream, which comprises the sequential Chinese words and the part of speech corresponding to each of the Chinese words;
obtaining a plurality of strong two-word sequence combinations, wherein each strong two-word sequence combination comprises a first Chinese word and a second Chinese word that are immediately adjacent to each other in the Chinese words, and the number of times the strong two-word sequence combination occurs in the Chinese word stream is greater than a first minimum support;
obtaining an attribute or an operation corresponding to a first concept according to the first concept corresponding to the first Chinese word in each strong two-word sequence combination and the part of speech corresponding to the second Chinese word in the strong two-word sequence combination; and
establishing the Chinese ontology according to the attribute or the operation corresponding to the first concept.

10. The Chinese ontology auto-establishment method as claimed in claim 9, wherein, in the step of obtaining the attribute or the operation corresponding to the first concept, if the first Chinese word in the strong two-word sequence combination is a first entity corresponding to the first concept and the part of speech of the second Chinese word in the strong two-word sequence combination is a noun or a stative intransitive predicate, the second Chinese word is the attribute corresponding to the first concept.

11. The Chinese ontology auto-establishment method as claimed in claim 10, wherein the noun is a material noun, a countable abstract noun, an abstract noun, a collective noun, or a common place noun.

12. The Chinese ontology auto-establishment method as claimed in claim 9, wherein, in the step of obtaining the attribute or the operation corresponding to the first concept, if the first Chinese word in the strong two-word sequence combination is a first entity corresponding to the first concept and the part of speech of the second Chinese word in the strong two-word sequence combination is an active intransitive predicate, the second Chinese word is the operation corresponding to the first concept.

13. The Chinese ontology auto-establishment method as claimed in claim 9, further comprising the following step:
obtaining a plurality of strong three-word sequence combinations from the Chinese words, wherein each strong three-word sequence combination comprises a third Chinese word, a fourth Chinese word, and a fifth Chinese word that are immediately adjacent to one another in the Chinese words, and the number of times the strong three-word sequence combination occurs in the Chinese word stream is greater than a second minimum support.

14. The Chinese ontology auto-establishment method as claimed in claim 13, further comprising the following step:
receiving the strong three-word sequence combinations, and obtaining a relation corresponding to a second concept and a third concept according to the second concept corresponding to the third Chinese word in the strong three-word sequence combination, the third concept corresponding to the fifth Chinese word in the strong three-word sequence combination, and a part of speech corresponding to the fourth Chinese word in the strong three-word sequence combination.

15. The Chinese ontology auto-establishment method as claimed in claim 14, wherein, in the step of obtaining the relation corresponding to the second concept and the third concept, if the third Chinese word in the strong three-word sequence combination is a second entity corresponding to the second concept, the fifth Chinese word in the strong three-word sequence combination is a third entity corresponding to the third concept, and the part of speech of the fourth Chinese word in the strong three-word sequence combination is an active transitive predicate, the fourth Chinese word is the relation between the second concept and the third concept.

16. The Chinese ontology auto-establishment method as claimed in claim 14, further comprising the following step:
taking the relation corresponding to the second concept and the third concept as input to establish the Chinese ontology.

17. A computer-readable storage medium for storing a computer program, the computer program being loaded into a computer system to cause the computer system to execute the method as claimed in any one of claims 9 to 16.
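The claimed steps (mining strong two-word and three-word sequence combinations, then reading attributes, operations, and relations off the part of speech) can be illustrated with a minimal Python sketch. It is not the patented implementation: the part-of-speech tag strings, the `entity_to_concept` mapping, the function names, and the ASCII tokens standing in for segmented Chinese words are all assumptions made for illustration, and the sketch counts adjacent words across the whole stream without modelling sentence boundaries.

```python
from collections import Counter

# Hypothetical tag strings; the claims name these categories but not a tag format.
ATTRIBUTE_POS = {"noun", "stative_intransitive"}
OPERATION_POS = {"active_intransitive"}
RELATION_POS = {"active_transitive"}


def strong_sequences(stream, n, min_support):
    """Count runs of n immediately adjacent (word, pos) pairs and keep
    those that occur more than min_support times."""
    counts = Counter(tuple(stream[i:i + n]) for i in range(len(stream) - n + 1))
    return {seq: c for seq, c in counts.items() if c > min_support}


def build_ontology(stream, entity_to_concept, min_support_2, min_support_3):
    """Derive attributes, operations, and relations from strong sequences.
    entity_to_concept (assumed input) maps an entity word to its concept."""
    ontology = {"attributes": set(), "operations": set(), "relations": set()}

    # Two-word combinations: entity word followed by an attribute or operation.
    for (w1, _), (w2, pos2) in strong_sequences(stream, 2, min_support_2):
        concept = entity_to_concept.get(w1)
        if concept is None:
            continue
        if pos2 in ATTRIBUTE_POS:
            ontology["attributes"].add((concept, w2))
        elif pos2 in OPERATION_POS:
            ontology["operations"].add((concept, w2))

    # Three-word combinations: entity, active transitive predicate, entity.
    for (w3, _), (w4, pos4), (w5, _) in strong_sequences(stream, 3, min_support_3):
        c2, c3 = entity_to_concept.get(w3), entity_to_concept.get(w5)
        if c2 is not None and c3 is not None and pos4 in RELATION_POS:
            ontology["relations"].add((c2, w4, c3))
    return ontology


# Toy stream; ASCII tokens stand in for segmented, tagged Chinese words.
stream = [
    ("typhoon", "noun"), ("speed", "noun"),
    ("typhoon", "noun"), ("speed", "noun"),
    ("typhoon", "noun"), ("lands", "active_intransitive"),
    ("typhoon", "noun"), ("lands", "active_intransitive"),
    ("typhoon", "noun"), ("strikes", "active_transitive"), ("taiwan", "noun"),
    ("typhoon", "noun"), ("strikes", "active_transitive"), ("taiwan", "noun"),
]
entity_to_concept = {"typhoon": "Typhoon", "taiwan": "Taiwan"}
onto = build_ontology(stream, entity_to_concept, min_support_2=1, min_support_3=1)
print(onto)
```

With both minimum supports set to 1, a combination must occur at least twice to count as strong, which matches the "greater than" wording of claims 1 and 5. A fuller version would also restrict counting to within each sentence.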
TW92122079A 2003-08-12 2003-08-12 Chinese ontology auto-establishment system and method, and storage media TWI225997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW92122079A TWI225997B (en) 2003-08-12 2003-08-12 Chinese ontology auto-establishment system and method, and storage media

Publications (2)

Publication Number Publication Date
TWI225997B true TWI225997B (en) 2005-01-01
TW200506655A TW200506655A (en) 2005-02-16

Family

ID=35613505

Family Applications (1)

Application Number Title Priority Date Filing Date
TW92122079A TWI225997B (en) 2003-08-12 2003-08-12 Chinese ontology auto-establishment system and method, and storage media

Country Status (1)

Country Link
TW (1) TWI225997B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110955748A (en) * 2018-09-26 2020-04-03 华硕电脑股份有限公司 Semantic processing method, electronic device and non-transitory computer readable recording medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI608367B (en) * 2012-01-11 2017-12-11 國立臺灣師範大學 Text readability measuring system and method thereof
TWI550422B (en) * 2015-04-08 2016-09-21 雲拓科技有限公司 Claim text generalizing method
TWI639927B (en) 2016-05-27 2018-11-01 雲拓科技有限公司 Method for corresponding element symbols in the specification to the corresponding element terms in claims
TWI598751B (en) 2016-12-05 2017-09-11 雲拓科技有限公司 Automatic claim computerized-translating apparatus

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110955748A (en) * 2018-09-26 2020-04-03 华硕电脑股份有限公司 Semantic processing method, electronic device and non-transitory computer readable recording medium
CN110955748B (en) * 2018-09-26 2022-10-28 华硕电脑股份有限公司 Semantic processing method, electronic device and non-transitory computer readable recording medium

Also Published As

Publication number Publication date
TW200506655A (en) 2005-02-16

Similar Documents

Publication Publication Date Title
Snoek et al. Adding semantics to detectors for video retrieval
Nadeau Semi-supervised named entity recognition: learning to recognize 100 entity types with little supervision
Cui et al. Soft pattern matching models for definitional question answering
Allahyari et al. Automatic topic labeling using ontology-based topic models
Wang et al. Using word embeddings to enhance keyword identification for scientific publications
Kothari et al. SMS based interface for FAQ retrieval
US8560485B2 (en) Generating a domain corpus and a dictionary for an automated ontology
Al-Zoghby et al. Arabic semantic web applications–a survey
US20130262086A1 (en) Generation of a semantic model from textual listings
US8200671B2 (en) Generating a dictionary and determining a co-occurrence context for an automated ontology
CN111159414B (en) Text classification method and system, electronic equipment and computer readable storage medium
US8000957B2 (en) English-language translation of exact interpretations of keyword queries
Zock et al. Deliberate word access: an intuition, a roadmap and some preliminary empirical results
Allahyari et al. A knowledge-based topic modeling approach for automatic topic labeling
Alami et al. Hybrid method for text summarization based on statistical and semantic treatment
Afzal et al. Semantically enhanced concept search of the Holy Quran: Qur’anic English WordNet
JP2010287020A (en) Synonym translation system and synonym translation method
Ahmed et al. Web-Based Arabic Question Answering System using Machine Learning Approach.
Lahbari et al. Toward a new arabic question answering system.
TWI225997B (en) Chinese ontology auto-establishment system and method, and storage media
Garrido et al. The GENIE project-a semantic pipeline for automatic document categorisation
Albukhitan et al. Semantic web annotation using deep learning with Arabic morphology
Bakari et al. Logic-based approach for improving Arabic question answering
AbuTaha et al. An ontology-based arabic question answering system
Deena et al. Keyword extraction using latent semantic analysis for question generation

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees