TWI225997B - Chinese ontology auto-establishment system and method, and storage media - Google Patents
Chinese ontology auto-establishment system and method, and storage media
- Publication number
- TWI225997B
- Authority
- TW
- Taiwan
- Prior art keywords
- chinese
- word
- mentioned
- strong
- concept
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical Field
The present invention relates to a system and method for automatically constructing an ontology, and in particular to a system and method for automatically constructing a Chinese ontology.
Prior Art
An ontology is a conceptual framework that describes the relationships between things. Fig. 1 is a schematic diagram of a conventional ontology architecture. The architecture contains several main elements: domain 11, category 12, concept 13 (with concept name 131, attributes 132 and operations 133) and relations R1, R2 and R3. Domain 11 is the specific domain that the ontology describes, and each domain 11 can be divided into multiple categories 12. Each concept 13 stored in the ontology includes a concept name 131, attributes 132 and operations 133. Relations are of three kinds: association R1, generalization R2 and aggregation R3. Association R1 expresses a general semantic link between concepts 13. Generalization R2 is a hierarchical relation between levels of abstraction, in which concepts higher in the hierarchy are more abstract. Aggregation R3 is a group relation that expresses that concepts form a set.
A well-constructed ontology can be used by application software such as search engines, knowledge management systems and e-commerce systems to make searching more efficient or to improve document processing. Several common English ontologies, such as WordNet and Cyc, and Chinese ontologies, such as HowNet, are available for users to download and use.
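The architecture of Fig. 1 can be pictured as a small data model. The following is a minimal sketch only, not part of the patent: the class names, field names and the library example are all made up for illustration, and only the element kinds (domain, category, concept with name, attributes and operations, relations R1 to R3) come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class RelationKind(Enum):
    ASSOCIATION = "R1"      # general semantic link between concepts
    GENERALIZATION = "R2"   # hierarchical link between levels of abstraction
    AGGREGATION = "R3"      # group link: concepts that form a set


@dataclass
class Concept:
    name: str                                        # concept name 131
    attributes: list = field(default_factory=list)   # attributes 132
    operations: list = field(default_factory=list)   # operations 133


@dataclass
class Relation:
    kind: RelationKind
    source: str   # name of the first concept
    target: str   # name of the second concept


@dataclass
class Ontology:
    domain: str                                      # domain 11
    categories: dict = field(default_factory=dict)   # category 12 -> list of Concept
    relations: list = field(default_factory=list)

    def add_concept(self, category, concept):
        self.categories.setdefault(category, []).append(concept)

    def relate(self, kind, source, target):
        self.relations.append(Relation(kind, source, target))


# Hypothetical example data, chosen only to exercise the model.
library = Ontology(domain="library")
library.add_concept("holdings", Concept("book", attributes=["title"], operations=["lend"]))
library.add_concept("holdings", Concept("publication"))
library.relate(RelationKind.GENERALIZATION, "publication", "book")
```

Keeping relations as separate records rather than as fields of a concept makes it straightforward to add the three relation kinds independently of how the concepts were found.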
Although several ready-made ontologies such as WordNet, Cyc and HowNet are available, the concepts 13, attributes 132, operations 133 and relations R1, R2 and R3 that they provide do not necessarily meet a user's needs in full, because of the domains for which they were built. The ontology required by a system usually differs from domain to domain, so an application system must develop its own ontology to meet its needs.
Traditionally, an ontology is constructed manually: a large amount of labor is spent entering all the concepts 13, attributes 132, operations 133 and relations of a domain. This approach has several drawbacks. First, it requires a great deal of manpower. Second, when two or more people build the ontology, their viewpoints differ and much time must be spent reconciling the differences. Finally, because knowledge evolves rapidly, a manually built ontology is often updated too slowly to meet the current needs of the application system.
To avoid these drawbacks, another feasible approach is to build a prototype ontology from a large number of documents (automatic ontology construction, for short) and then have experts revise it into a usable ontology. This reduces the manpower required and allows the ontology to be updated more efficiently.
Techniques for automatically constructing English ontologies mostly use a grammar parser to extract concepts, attributes, operations and relations R1, R2 and R3 from a large number of documents. Chinese grammar rules, however, are complex, and no sufficiently accurate Chinese grammar parser is available, so a grammar parser cannot simply be applied to construct a Chinese ontology automatically. A system and method for automatically constructing a Chinese ontology are therefore needed.
Summary of the Invention
In view of the above, an object of the present invention is to provide a system and method for automatically constructing a Chinese ontology which, besides constructing the ontology automatically, reduce the manpower required and allow the ontology to be updated more efficiently.
To this end, the system and method of the invention first provide a document, a Chinese dictionary, a document processing unit, a concept processing unit, an episode processing unit, an ontology generation unit and an ontology.
The document processing unit takes at least one document as input and finds the meaningful nouns and verbs in it, producing a Chinese word stream that contains multiple ordered Chinese words and their parts of speech. The concept processing unit takes as input the nouns obtained by the document processing unit, analyzes the strength of the relationship between any two nouns and clusters the instances that belong to the same concept. The episode processing unit takes as input the Chinese word stream obtained by the document processing unit and produces multiple episodes, where an episode is an ordered combination of words within a given window size. After receiving the set of episodes produced by the episode processing unit, the ontology generation unit compares each episode with the concepts produced by the concept processing unit. If a word in an episode is an instance of a concept, the concept name is annotated after the word. Once the concepts have been annotated, the ontology generation unit applies pattern rules to extract attributes, operations and associations from the annotated episodes, and then constructs a domain ontology from them.
Embodiments
Fig. 2 is a system diagram of a Chinese ontology automatic construction system according to an embodiment of the invention.
The Chinese ontology automatic construction system 2 according to the embodiment includes a document 21, a Chinese dictionary 22, a document processing unit 23, a concept processing unit 24, an episode processing unit 25, an ontology generation unit 26 and an ontology 27.
The document 21 is the input data of the system. It is an electronic Chinese document in Word, HTML, PowerPoint or any other electronic format that can store Chinese documents.
The Chinese dictionary 22 is an electronic Chinese dictionary containing multiple Chinese words, each of which consists of at least one Chinese character.
The document processing unit 23 takes the document 21 as input and finds the meaningful nouns and verbs in it. It first segments each sentence into words and tags each word with its part of speech, and then applies a stop word filter to find the meaningful nouns and verbs. The processing is illustrated below with actual data. Suppose the text contains a sentence about the Argentine "god of war" Maradona and the "Hand of God", in which Argentina defeats England. After the document processing unit 23 segments and tags this sentence with a Chinese word segmentation system, as shown in Fig. 3a, the sentence is split into 14 words such as "Argentina", "god of war" and "Maradona". Each word is followed by its part of speech in parentheses: tags beginning with N denote nouns, tags beginning with V denote verbs, P denotes a preposition, PARENTHESISCATEGORY denotes a bracket and PERIODCATEGORY denotes a full stop. The document processing unit 23 then applies the stop word filter to the segmented and tagged words and keeps only nouns and verbs of specified types, for example words tagged Na, Nb, Nc or VC, as shown in Fig. 3b. The result is a Chinese word stream.
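The filtering step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the document processing unit 23: the function name `filter_tokens`, the constant `KEPT_POS` and the tagged token list are hypothetical, and the segmentation and tagging are assumed to have been done already by a Chinese word segmentation system.

```python
# Parts of speech kept by the filter; the description names Na, Nb, Nc and VC
# as examples of the "specified types".
KEPT_POS = {"Na", "Nb", "Nc", "VC"}


def filter_tokens(tagged, kept_pos=KEPT_POS):
    """Keep only the (word, pos) pairs whose part of speech is in kept_pos.

    Word order is preserved, so the result can serve as a word stream.
    """
    return [(word, pos) for word, pos in tagged if pos in kept_pos]


# Hypothetical tagger output for part of a sentence (English glosses are
# used in place of the Chinese words).
tagged = [
    ("Argentina", "Nc"),
    ("god of war", "Na"),
    ("Maradona", "Nb"),
    ("with", "P"),
    ("(", "PARENTHESISCATEGORY"),
    ("defeat", "VC"),
    ("England", "Nc"),
    (".", "PERIODCATEGORY"),
]

stream = filter_tokens(tagged)
```

Filtering by tag rather than by a word list means that punctuation and function words are dropped without having to enumerate them.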
The concept processing unit 24 takes as input the nouns obtained by the document processing unit 23. It first selects the nouns with a high value of term frequency multiplied by inverse document frequency. It then uses a self-organizing map (SOM), an unsupervised-learning model from neural network technology, to analyze the strength of the relationship between any two nouns and to group the instances that belong to the same concept.
The episode processing unit 25 takes as input the Chinese word stream obtained by the document processing unit 23 and produces multiple episodes. An episode is an ordered combination of words within a given window size. Fig. 3c shows two episodes with a window size of 3: "Argentina (Nc) - god of war (Na) - Maradona (Nb)" and "Argentina (Nc) - defeat (VC) - England (Nc)".
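The noun-selection step of the concept processing unit 24 (term frequency multiplied by inverse document frequency) can be sketched as below. The description does not give the exact formula, so this sketch assumes the corpus-wide count as term frequency and log(N / document frequency) as inverse document frequency; the function name `top_nouns_by_tfidf` and the sample data are hypothetical. The subsequent SOM clustering is not shown.

```python
import math
from collections import Counter


def top_nouns_by_tfidf(docs, k):
    """Return the k nouns with the highest tf * idf score.

    docs is a list of noun lists, one list per document.
    Assumed scoring: tf = total occurrences in the corpus,
    idf = log(number of documents / documents containing the noun).
    Ties are broken alphabetically so the result is deterministic.
    """
    n_docs = len(docs)
    tf = Counter(noun for doc in docs for noun in doc)
    df = Counter(noun for doc in docs for noun in set(doc))
    scores = {noun: tf[noun] * math.log(n_docs / df[noun]) for noun in tf}
    ranked = sorted(scores, key=lambda noun: (-scores[noun], noun))
    return ranked[:k]


# Hypothetical nouns extracted from three documents.
docs = [
    ["Brazil", "championship", "Brazil"],
    ["Brazil", "playing style"],
    ["Brazil", "championship"],
]
selected = top_nouns_by_tfidf(docs, 2)
```

Under this assumed formula a noun that occurs in every document scores zero, so the selection favors nouns that distinguish some documents from others.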
Fig. 4 shows the episode processing algorithm according to the embodiment as pseudocode lines 400 to 420. The variables, parameters and data structures used by the algorithm are as follows:
(1) WindowSize, the window size, is an input parameter that limits the number of words in each episode.
(2) minimum_Support, the minimum support, is an input parameter that sets the minimum number of occurrences of each episode.
(3) y<t1, t2, ..., tk> is a data structure that records the sentences in which the term sequence t1, t2, ..., tk occurs.
(4) y<t1, t2, ..., tk>.cardinality records the total number of occurrences of the term sequence t1, t2, ..., tk.
(5) ti.Position records the position at which ti occurs in a sentence.
Because the time complexity of the computation is high, the algorithm speeds up execution by recording, on the first pass over the data, the number (sentence_num) of every sentence in which each word occurs in y<t>, as shown at line 401, so that later steps need not rescan all the sentences. If y<t>.cardinality is greater than the minimum support, the word is a strong single word and is recorded in the large-1-sequence set, as shown at line 402. Next, all the words in the large-1-sequence set are combined pairwise into ordered candidates, as shown at line 403. Each candidate two-word sequence <ta, tb> must satisfy two conditions: tb occurs after ta, and the distance between ta and tb does not exceed WindowSize. If y<ta, tb>.cardinality of a candidate is greater than the minimum support, the candidate is recorded in the large-2-sequence set, as shown at line 406. Once the large-2-sequence set has been found, the large-k-sequence sets are derived from it, and the support of each large-k-sequence can be computed from y<t1, t2, ..., tk>. The algorithm continues until no new strong sequence is found. Finally, every strong sequence that is contained in another strong sequence is deleted, and the remaining strong sequences are the required episodes. For the purpose of constructing an ontology it is sufficient to stop at the large-3-sequence set, because strong sequences of two or three words already contain the information needed to build the ontology.
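The level-by-level search described above can be sketched in Python. This is a simplified illustration and not the pseudocode of Fig. 4: the names `occurs`, `mine_episodes` and `max_len` are hypothetical, support is counted as the number of sentences that contain a sequence (an assumption), and "within the window" is interpreted as all words of the sequence fitting inside `window_size` consecutive positions (also an assumption).

```python
def occurs(sentence, seq, window):
    """True if seq appears in order in sentence inside one window of words."""
    for start, word in enumerate(sentence):
        if word != seq[0]:
            continue
        pos, matched = start, True
        for term in seq[1:]:
            try:
                # Search only the positions still inside the window.
                pos = sentence.index(term, pos + 1, start + window)
            except ValueError:
                matched = False
                break
        if matched:
            return True
    return False


def mine_episodes(sentences, window_size, min_support, max_len=3):
    """Return the maximal strong term sequences of up to max_len words."""
    # y[seq] holds the ids of the sentences in which seq occurs,
    # filled for single words on the first pass over the data.
    y = {}
    for sid, sentence in enumerate(sentences):
        for word in set(sentence):
            y.setdefault((word,), set()).add(sid)
    level = {seq: ids for seq, ids in y.items() if len(ids) > min_support}
    strong_words = sorted(seq[0] for seq in level)
    strong = dict(level)
    for _ in range(2, max_len + 1):
        next_level = {}
        for seq, ids in level.items():
            for word in strong_words:
                cand = seq + (word,)
                # Only sentences that contain the prefix need to be checked.
                hits = {sid for sid in ids
                        if occurs(sentences[sid], cand, window_size)}
                if len(hits) > min_support:
                    next_level[cand] = hits
        if not next_level:
            break
        strong.update(next_level)
        level = next_level

    def contained(short, long):
        # Ordered-subsequence test; the iterator is consumed as terms match.
        it = iter(long)
        return len(short) < len(long) and all(term in it for term in short)

    return sorted(s for s in strong
                  if not any(contained(s, other) for other in strong))


# Hypothetical word streams (English glosses), two sentences of each kind.
sentences = [
    ["Argentina", "god of war", "Maradona"],
    ["Argentina", "god of war", "Maradona"],
    ["Argentina", "defeat", "England"],
    ["Argentina", "defeat", "England"],
]
episodes = mine_episodes(sentences, window_size=3, min_support=1)
```

With these inputs the two three-word sequences of Fig. 3c remain after shorter sequences contained in them are deleted. The strict comparison with `min_support` follows the wording "greater than the minimum support".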
After receiving the set of episodes produced by the episode processing unit 25, the ontology generation unit 26 compares each episode with the concepts produced by the concept processing unit 24. If a word in an episode is an instance of a concept, the concept name is annotated after the word. For example, the clustering performed by the concept processing unit 24 shows that "South Korea", "Italy" and "Brazil" are instances of the concept "team", that "championship" is an instance of the concept "award", and that "Beckham" and "Rivaldo" (李瓦度) are instances of the concept "player". The resulting annotations are: South Korea (Nca|team), Italy (Nca|team), Brazil (Nca|team), championship (Nad|award), England (Nca|team), Beckham (Nba|player), Rivaldo (Nba|player), South Korea team (Nba|team).
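The annotation step amounts to a lookup from instance to concept name. A minimal sketch follows, assuming the clustering result is available as a dictionary; the names `annotate`, `render` and `concept_of` are hypothetical.

```python
def annotate(episode, concept_of):
    """Attach a concept name (or None) to each (word, pos) pair of an episode."""
    return [(word, pos, concept_of.get(word)) for word, pos in episode]


def render(token):
    """Format one annotated token in the 'word (pos|concept)' style used above."""
    word, pos, concept = token
    return f"{word} ({pos}|{concept})" if concept else f"{word} ({pos})"


# Clustering result taken from the example in the description.
concept_of = {
    "South Korea": "team",
    "Italy": "team",
    "Brazil": "team",
    "championship": "award",
    "Beckham": "player",
}

episode = [("Brazil", "Nca"), ("win", "VJ3"), ("championship", "Nad")]
annotated = annotate(episode, concept_of)
labels = [render(token) for token in annotated]
```

Words that are not instances of any concept keep only their part of speech, which is what the extraction rules rely on when they test whether a word is an instance.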
In general, words that frequently occur together are semantically related. In simple Chinese grammar, for example, a sentence exhibits simple patterns such as "subject + verb + object" or "subject + verb + complement". The invention, however, does not rely on grammatical sentence patterns to construct the ontology. Instead, it uses patterns of the forms "instance-attribute-value", "concept-association-concept" and "instance-operation", which recur across a large number of documents, and derives the attributes, operations and associations of the ontology from the word order of the episodes obtained above.
Once the concepts have been annotated, the ontology generation unit 26 applies the following pattern rules to extract attributes, operations and associations from the annotated episodes. Fig. 5a shows the Chinese verb parts of speech 511 to 515 according to the embodiment, and Fig. 5b shows the Chinese noun parts of speech 521 to 531.
An attribute 132 is extracted when three conditions hold: (1) the episode contains 2 words; (2) the first word of the episode is an instance; (3) the second word is tagged as an individual noun 522, a countable abstract noun 523, an abstract noun 524, a collective noun 525, a common place noun 528 or a stative intransitive verb 514. For example, from the episode "Brazil (Nca|team), playing style (Nad)", "playing style" is extracted as an attribute of "Brazil".
An operation 133 is extracted when three conditions hold: (1) the episode contains 2 words; (2) the first word of the episode is an instance; (3) the second word is tagged as an active intransitive verb 511. For example, from the episode "Brazil (Nca|team), win the title (VA)", "win the title" is extracted as an operation of "Brazil".
An association R3 is extracted when three conditions hold: (1) the episode contains 3 words; (2) the first and third words of the episode are instances; (3) the second word is tagged as a transitive verb (VB, VC, VD, VE, VF) 512, a stative transitive verb (VI, VJ, VK, VL) 515, an individual noun 522, a countable abstract noun 523, an abstract noun 524, a collective noun 525 or a common place noun 528. For example, from the episode "Brazil (Nca|team), win (VJ3), championship (Nad|award)", "win" is extracted as the association between "Brazil" and "championship".
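The three rule sets can be sketched as a single classification function. This is an illustration, not the patented implementation: the function name `classify` is hypothetical, and the mapping from the figure's reference numerals to tag prefixes (Nab, Nac, Nad, Nae, Ncb, VH) is an assumption based on the CKIP tag set, since the text gives tags only for VA, VB to VF, VI to VL and Nad.

```python
# Tag prefixes assumed for each rule; prefix matching lets "VJ3" match "VJ".
ATTRIBUTE_POS = ("Nab", "Nac", "Nad", "Nae", "Ncb", "VH")
OPERATION_POS = ("VA",)
ASSOCIATION_POS = ("VB", "VC", "VD", "VE", "VF", "VI", "VJ", "VK", "VL",
                   "Nab", "Nac", "Nad", "Nae", "Ncb")


def classify(episode):
    """Apply the pattern rules to one annotated episode.

    episode is a list of (word, pos, concept) tuples, where concept is None
    for words that are not instances of any concept.
    Returns a tuple describing the extracted item, or None if no rule matches.
    """
    def is_instance(token):
        return token[2] is not None

    if len(episode) == 2 and is_instance(episode[0]):
        first, second = episode
        if second[1].startswith(ATTRIBUTE_POS):
            return ("attribute", first[0], second[0])
        if second[1].startswith(OPERATION_POS):
            return ("operation", first[0], second[0])
    if len(episode) == 3 and is_instance(episode[0]) and is_instance(episode[2]):
        first, middle, third = episode
        if middle[1].startswith(ASSOCIATION_POS):
            return ("association", first[0], middle[0], third[0])
    return None


attribute = classify([("Brazil", "Nca", "team"), ("playing style", "Nad", None)])
operation = classify([("Brazil", "Nca", "team"), ("win the title", "VA", None)])
association = classify([
    ("Brazil", "Nca", "team"),
    ("win", "VJ3", None),
    ("championship", "Nad", "award"),
])
```

The three calls reproduce the worked examples given for the attribute, operation and association rules.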
After extracting the attributes 132, operations 133 and associations R3, the ontology generation unit 26 can construct a domain ontology. Fig. 6 is a schematic diagram of an ontology architecture according to the embodiment. This ontology was constructed by the invention from 440 news articles about the 2002 World Cup.
Fig. 7 is a flowchart of a Chinese ontology automatic construction method according to an embodiment of the invention.
First, in step S71, at least one document 21 is input, and the Chinese word segmentation system (CKIP) is used to segment the Chinese sentences into words and tag their parts of speech. In step S72, a stop word filter is used to delete the meaningless words produced by step S71, such as punctuation marks and complements, leaving nouns and verbs of specified types.
Next, in step S73, the nouns obtained in step S72 are input. The nouns with a high value of term frequency multiplied by inverse document frequency are selected first. A self-organizing map (SOM), an unsupervised-learning model from neural network technology, is then used to analyze the strength of the relationship between any two nouns and to group the instances that belong to the same concept.
In step S74, the words and parts of speech obtained in step S72 are input and multiple episodes are produced, using the algorithm shown in Fig. 4. An episode is an ordered combination of words within a given window size. Fig. 3c shows two episodes with a window size of 3: "Argentina (Nc) - god of war (Na) - Maradona (Nb)" and "Argentina (Nc) - defeat (VC) - England (Nc)". Next, in step S75, the set of episodes produced by step S74 is input and each episode is compared with the concepts produced by step S73. If a word in an episode is an instance of a concept, the concept name is annotated after the word.
In step S76, the pattern rules for attributes, operations and associations described above are used to extract, from the annotated episodes, the attributes 132, operations 133 and associations R3 needed to construct the ontology. Finally, in step S77, the instances, attributes 132, operations 133 and associations R3 produced by step S76 are integrated to construct the domain ontology.
The invention further provides a computer-readable storage medium that stores a computer program. The computer program implements the Chinese ontology automatic construction method and performs the steps described above.
Fig. 8 is a schematic diagram of a computer-readable storage medium for the Chinese ontology automatic construction method according to an embodiment of the invention. The storage medium 80 stores a computer program 820 that implements the method. The computer program contains seven pieces of logic: word segmentation and part-of-speech tagging logic 821, meaningless word deletion logic 822, concept clustering logic 823, episode construction logic 824, instance annotation logic 825, attribute, operation and association extraction logic 826, and ontology generation logic 827.
The Chinese ontology automatic construction system and method provided by the invention can therefore be used to construct an ontology automatically, and they also reduce the manpower required and allow the ontology to be updated more efficiently.
Although the invention has been disclosed above by way of preferred embodiments, these are not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention is defined by the appended claims.
Brief Description of the Drawings
To make the above objects, features and advantages of the invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings:
Fig. 1 is a schematic diagram of a conventional ontology architecture;
Fig. 2 is a system diagram of a Chinese ontology automatic construction system according to an embodiment of the invention;
Figs. 3a, 3b and 3c are schematic diagrams of example data according to an embodiment of the invention;
Fig. 4 is a schematic diagram of the episode processing algorithm according to an embodiment of the invention;
Fig. 5a is a schematic diagram of Chinese verb parts of speech according to an embodiment of the invention;
Fig. 5b is a schematic diagram of Chinese noun parts of speech according to an embodiment of the invention;
Fig. 6 is a schematic diagram of an ontology architecture according to an embodiment of the invention;
Fig. 7 is a flowchart of a Chinese ontology automatic construction method according to an embodiment of the invention;
Fig. 8 is a schematic diagram of a computer-readable storage medium for the Chinese ontology automatic construction method according to an embodiment of the invention.
Explanation of Reference Numerals
11: domain;
12: category;
13: concept;
131: concept name;
132: attribute;
133: operation;
21: document;
22: Chinese dictionary;
23: document processing unit;
24: concept processing unit;
25: episode processing unit;
26: ontology generation unit;
27: ontology;
511, 512, ..., 515: Chinese verb parts of speech;
521, 522, ..., 531: Chinese noun parts of speech;
80: storage medium;
820: computer program for automatic construction of a Chinese ontology;
821: word segmentation and part-of-speech tagging logic;
822: meaningless word deletion logic;
823: concept clustering logic;
824: episode construction logic;
825: instance annotation logic;
826: attribute, operation and association extraction logic;
827: ontology generation logic.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW92122079A TWI225997B (en) | 2003-08-12 | 2003-08-12 | Chinese ontology auto-establishment system and method, and storage media |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW92122079A TWI225997B (en) | 2003-08-12 | 2003-08-12 | Chinese ontology auto-establishment system and method, and storage media |
Publications (2)
Publication Number | Publication Date |
---|---|
TWI225997B true TWI225997B (en) | 2005-01-01 |
TW200506655A TW200506655A (en) | 2005-02-16 |
Family
ID=35613505
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW92122079A TWI225997B (en) | 2003-08-12 | 2003-08-12 | Chinese ontology auto-establishment system and method, and storage media |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI225997B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110955748A (en) * | 2018-09-26 | 2020-04-03 | 华硕电脑股份有限公司 | Semantic processing method, electronic device and non-transitory computer readable recording medium |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI608367B (en) * | 2012-01-11 | 2017-12-11 | 國立臺灣師範大學 | Text readability measuring system and method thereof |
TWI550422B (en) * | 2015-04-08 | 2016-09-21 | 雲拓科技有限公司 | Claim text generalizing method |
TWI639927B (en) | 2016-05-27 | 2018-11-01 | 雲拓科技有限公司 | Method for corresponding element symbols in the specification to the corresponding element terms in claims |
TWI598751B (en) | 2016-12-05 | 2017-09-11 | 雲拓科技有限公司 | Automatic claim computerized-translating apparatus |
-
2003
- 2003-08-12 TW TW92122079A patent/TWI225997B/en not_active IP Right Cessation
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110955748A (en) * | 2018-09-26 | 2020-04-03 | 华硕电脑股份有限公司 | Semantic processing method, electronic device and non-transitory computer readable recording medium |
CN110955748B (en) * | 2018-09-26 | 2022-10-28 | 华硕电脑股份有限公司 | Semantic processing method, electronic device and non-transitory computer readable recording medium |
Also Published As
Publication number | Publication date |
---|---|
TW200506655A (en) | 2005-02-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MM4A | Annulment or lapse of patent due to non-payment of fees |