TWI442249B - Domain Knowledge Network Construction Method and Its System - Google Patents

Domain Knowledge Network Construction Method and Its System

Info

Publication number
TWI442249B
Authority
TW
Taiwan
Prior art keywords
knowledge network
webpage
domain knowledge
domain
word
Prior art date
Application number
TW99106441A
Other languages
Chinese (zh)
Other versions
TW201131389A (en)
Original Assignee
Univ Nat Chi Nan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Chi Nan
Priority to TW99106441A
Publication of TW201131389A
Application granted
Publication of TWI442249B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Description

Domain knowledge network construction method and system thereof

The present invention relates to network resource integration technology, and in particular to a method and system for constructing a domain knowledge network (DKN).

With the rapid development and spread of network technology, the amount of information on the Internet has grown explosively. That information covers nearly every subject, so the Internet resembles a huge distributed database. When users begin searching or browsing, they often do not yet have a specific, clear idea of what they want; without concrete concepts to start from, they cannot readily obtain specific ideas or inspiration from the Internet.

Generally speaking, concepts are related to one another in many ways, and the meaning of a concept often depends on the domain, which gives rise to polysemy (one word with several meanings) and synonymy. For example, the word "bridge" means different things in the construction domain and in the networking domain. Resolving such cases usually requires looking at the surrounding context; the relations between concepts are therefore well suited to expressing meaning clearly.

One conventional way of constructing a knowledge network is disclosed by G. Marchionini et al. in "Toward a statistical knowledge network," Proceedings of the National Conference on Digital Government Research, pages 27-32, Boston, 2003, National Science Foundation. It requires extensive manual intervention, operation of a specific user interface, and procedural processing to complete the knowledge network. The present invention seeks a method and system that can automatically extract concepts from domain-related web pages and use the relations between those concepts to construct a domain knowledge network.

Accordingly, one object of the present invention is to provide a domain knowledge network construction method.

The domain knowledge network construction method of the present invention is suitable for execution by a system and comprises the following steps:

A) obtaining a plurality of web pages related to a domain;

B) dividing each web page into a plurality of blocks;

C) identifying, among the blocks of each web page, those belonging to a metadata type;

D) performing part-of-speech (POS) tagging on the web page text corresponding to the blocks of the metadata type, so as to assign a POS tag to each word in the text, and splitting the text into a plurality of sentences;

E) performing named entity recognition on the sentences tagged and split in step D);

F) performing inter-object relation processing according to the results of steps D) and E) to obtain a subject-verb-object (SVO) structure for each sentence; and

G) generating a domain knowledge network for the domain according to the result of step F).

Another object of the present invention is to provide a domain knowledge network construction system.

The domain knowledge network construction system of the present invention comprises a web page partition unit, a block type identification unit, a part-of-speech (POS) tagging unit, a named entity recognition unit, an inter-object relation processing unit, and a domain knowledge network generation unit.

The web page partition unit divides each of a plurality of web pages related to a domain into a plurality of blocks. The block type identification unit identifies, among the blocks of each web page, those belonging to a metadata type. The POS tagging unit performs POS tagging on the web page text corresponding to the blocks of the metadata type, so as to assign a POS tag to each word in the text, and splits the text into a plurality of sentences. The named entity recognition unit performs named entity recognition on the sentences tagged and split by the POS tagging unit. The inter-object relation processing unit performs inter-object relation processing according to the results of the POS tagging unit and the named entity recognition unit to obtain a subject-verb-object (SVO) structure for each sentence. The domain knowledge network generation unit generates a domain knowledge network for the domain according to the SVO structure of each sentence.

The foregoing and other technical content, features, and effects of the present invention will be clearly presented in the following detailed description of a preferred embodiment with reference to the drawings.

Referring to FIG. 1, a preferred embodiment of the domain knowledge network construction system 1 of the present invention comprises a web page acquisition unit 11, a database 12 connected to the web page acquisition unit 11, a web page partition (page partition) unit 13 connected to the database 12, a block type identification unit 14 connected to the database 12, a part-of-speech (POS) tagging unit 15 connected to the database 12, a named entity recognition (NER) unit 16 connected to the database 12, an inter-object relation (Object-Object Relation) processing unit 17 connected to the database 12, a domain knowledge network generation unit 18 connected to the database 12, and a user interface unit 19 connected to the domain knowledge network generation unit 18.

The web page acquisition unit 11 obtains a plurality of web pages related to a domain and stores them in the database 12. The web page partition unit 13 divides each of the domain-related web pages into a plurality of blocks and stores the partition result in the database 12. The block type identification unit 14 identifies, among the blocks, those belonging to a metadata type and stores the identification result in the database 12. The POS tagging unit 15 performs POS tagging on the web page text corresponding to the blocks of the metadata type, so as to assign POS tags to the words in the text, splits the text into a plurality of sentences, and stores the tagging result in the database 12. The named entity recognition unit 16 performs named entity recognition on the sentences tagged and split by the POS tagging unit 15 and stores the result in the database 12. The inter-object relation processing unit 17 performs inter-object relation processing according to the results of the POS tagging unit 15 and the named entity recognition unit 16 to obtain a subject-verb-object (SVO) structure for each sentence, and stores the result in the database 12. The domain knowledge network generation unit 18 generates a domain knowledge network for the domain according to the SVO structure of each sentence and provides it to the user through the user interface unit 19.

Referring to FIG. 1 and FIG. 2, a preferred embodiment of the domain knowledge network construction method of the present invention is executed by the domain knowledge network construction system 1 and comprises the following steps.

In step 301, the web page acquisition unit 11 searches the websites 2 using keywords entered by the user, so as to obtain web pages related to the domain; the keywords entered by the user are themselves related to the domain. In this preferred embodiment, the web page acquisition unit 11 is implemented with an existing metasearch engine, for example Web Crawler.

In step 302, the web page partition unit 13 divides each of the web pages obtained in step 301 into blocks. In this preferred embodiment, the web page partition unit 13 uses the web page partition technique disclosed in the master's thesis by Zhu Junbo (朱軍柏), supervised by Dr. Lin Xuanhua (林宣華), on an automatic site map generation system based on web page block identification and hyperlink analysis (「以網頁區塊辨識和超鏈結分析為基礎之網站地圖自動產生系統」); its details are not repeated here.

In step 303, the block type identification unit 14 identifies the blocks belonging to the metadata type according to the web page text corresponding to each block obtained in step 302. The identification is based on the attribute-value pairs in each block's text: within each web page, the block whose text has the highest frequency of attributes is taken to belong to the metadata type.
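The selection rule of step 303 can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how attribute-value pairs are detected, so the sketch assumes a pair is a line of the form `label: value`, and the names `ATTR_VALUE`, `count_attribute_pairs`, and `pick_metadata_block` are hypothetical.

```python
import re
from typing import List, Optional

# Assumption: an attribute-value pair is one line shaped like "label: value"
# (ASCII or full-width colon). The patent does not define the detection rule.
ATTR_VALUE = re.compile(r"^\s*([^:：\n]{1,30})[:：]\s*(\S.*)$")


def count_attribute_pairs(block_text: str) -> int:
    """Count the lines in a block that look like attribute-value pairs."""
    return sum(1 for line in block_text.splitlines() if ATTR_VALUE.match(line))


def pick_metadata_block(blocks: List[str]) -> Optional[int]:
    """Return the index of the block with the most attribute-value pairs.

    Returns None when the page has no blocks or no block contains a pair.
    """
    if not blocks:
        return None
    counts = [count_attribute_pairs(b) for b in blocks]
    best = max(range(len(blocks)), key=counts.__getitem__)
    return best if counts[best] > 0 else None
```

A real page would need a more robust pair detector (for example, one that ignores URLs containing colons); the point here is only the "highest attribute frequency wins" rule.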

In step 304, the POS tagging unit 15 performs POS tagging on the web page text corresponding to the blocks of the metadata type, assigning a POS tag to each word in the text. Then, based on the tagged words, the POS tagging unit 15 finds the sentence boundaries in the text and splits the text into sentences.

In this preferred embodiment, POS tagging is illustrated with Chinese-language web page text. For example, when the text 「洋基隊率先取得三連戰的第一勝」 ("The Yankees took the first win of the three-game series") is input to the POS tagging unit 15, the output is 「洋基(Nb)隊(Na)率先(D)取得(VC)三(Neu)連戰(Nb)的(DE)第一(Neu)勝(VH)」, where the label in parentheses after each word is its POS tag. The POS tagging unit 15 is implemented with the Chinese Term Segmentation (CTS) system of the Chinese Knowledge and Information Processing (CKIP) group; for details, see http://godel.iis.sinica.edu.tw/CKIP/. The POS tagging unit 15 may instead use other existing POS tagging tools suited to the language of the web page text, and is not limited to what this preferred embodiment discloses.
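To make the tagged output above easier to work with in later steps, it can be parsed into (word, tag) pairs. The sketch below assumes the `word(TAG)` layout shown in the example and does not call the CKIP system itself; `TAGGED_WORD` and `parse_tagged` are hypothetical names.

```python
import re
from typing import List, Tuple

# Assumption: each token is written as word(TAG), as in the example above.
TAGGED_WORD = re.compile(r"([^()\s]+)\(([A-Za-z0-9_]+)\)")


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split a tagged string such as '洋基(Nb)隊(Na)' into (word, tag) pairs."""
    return TAGGED_WORD.findall(text)


tagged = "洋基(Nb)隊(Na)率先(D)取得(VC)三(Neu)連戰(Nb)的(DE)第一(Neu)勝(VH)"
tokens = parse_tagged(tagged)
```

Here `tokens` holds nine pairs, beginning with `("洋基", "Nb")`.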

In step 305, the named entity recognition unit 16 first compares the sentences tagged and split in step 304 against a pre-built dictionary, so as to identify the corresponding name for each word tagged as a noun. For words whose name cannot be identified by the dictionary comparison, recognition is then performed with a conditional random field (Conditional Random Fields, CRFs) model. This step mainly identifies words tagged as nouns and maps them to appropriate proper names, such as "person", "organization", and "location". For implementation details of the CRF model, see Chen et al., "Chinese Named Entity Recognition with Conditional Random Fields," Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pp. 118-121, 2006; they are not repeated here.
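The two-stage order of step 305 (dictionary first, statistical model second) can be sketched as below. The CRF model itself is not implemented; `fallback` is a hypothetical callable standing in for it. The set `NOUN_TAGS` is also an assumption about which CKIP-style tags count as nouns.

```python
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Assumption: these CKIP-style tags mark nouns (common, proper, place, time).
NOUN_TAGS = {"Na", "Nb", "Nc", "Nd"}


def recognize_entities(
    tokens: List[Token],
    dictionary: Dict[str, str],
    fallback: Callable[[List[Token], int], str],
) -> List[Tuple[str, str]]:
    """Label noun tokens: dictionary lookup first, model fallback second."""
    entities = []
    for i, (word, tag) in enumerate(tokens):
        if tag not in NOUN_TAGS:
            continue  # only words tagged as nouns are candidates
        label = dictionary.get(word)
        if label is None:
            # Hypothetical stand-in for the CRF model of step 305.
            label = fallback(tokens, i)
        entities.append((word, label))
    return entities
```

Passing the whole token list and an index to `fallback` lets a sequence model see the neighbouring words, which is what a CRF relies on.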

It is worth noting that, in addition to the features used in the paper by Chen et al., the present invention adds two enhanced features when building the CRF model: word boundary features and character (char) features. As for word boundary features, certain nouns mark the boundary of a noun phrase. For example, a title for a person such as 「總統」 ("President") often marks the boundary of a person's name, which helps recognize noun phrases such as 「歐巴馬總統」 ("President Obama"); likewise, part of an organization name such as 「集團」 ("Group") often marks the boundary of a company name, which helps recognize noun phrases such as 「鴻海集團」 ("Hon Hai Group"). Character features are especially useful for names that follow a rule on the number of characters. For example, a person name, such as a Chinese family name plus given name, generally has a character length within a certain range.
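As a rough illustration of the two added feature types, the sketch below builds a feature dictionary for one word. The source does not specify the feature templates, so the feature names, the tiny word lists `PERSON_TITLES` and `ORG_SUFFIXES`, and the assumption that the boundary word is segmented as the following token are all hypothetical.

```python
from typing import Dict, List, Union

# Hypothetical boundary word lists; a real system would use larger lexicons.
PERSON_TITLES = {"總統"}
ORG_SUFFIXES = {"集團"}

Feature = Union[str, int, bool]


def boundary_and_char_features(words: List[str], i: int) -> Dict[str, Feature]:
    """Build word-boundary and character-length features for words[i]."""
    word = words[i]
    nxt = words[i + 1] if i + 1 < len(words) else ""
    return {
        "word": word,
        # Character feature: the length of the candidate name.
        "char_len": len(word),
        # Word boundary features: does a boundary word follow the candidate?
        "next_is_person_title": nxt in PERSON_TITLES,
        "next_is_org_suffix": nxt in ORG_SUFFIXES,
    }
```

Dictionaries of this shape are a common input format for CRF toolkits, but no particular toolkit is implied here.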

In step 306, the inter-object relation processing unit 17 performs inter-object relation processing according to the results of the POS tagging unit 15 and the named entity recognition unit 16 to obtain the SVO structure of each sentence. If a sentence contains names that belong to the same compound noun, the unit first groups them into a single compound-noun name and then performs the inter-object relation processing. For example, the sentence 「李登輝(Nb)參訪(VC)日本(Nc)東京(Nc)靖國(Nb)神社(Nc)」 ("Lee Teng-hui visits the Yasukuni Shrine in Tokyo, Japan") has the structure Nb+VC+Nc+Nc+Nb+Nc. Here 「日本(Nc)東京(Nc)靖國(Nb)神社(Nc)」 forms one compound noun, so the structure simplifies to Nb+VC+compound-noun name, and the relation between 李登輝 (Lee Teng-hui, the subject) and 日本東京靖國神社 (the Yasukuni Shrine in Tokyo, Japan, the object) is 參訪 (visits, the verb). Through this step, names serve as the basic concepts of the domain-related web pages, and the verbs or verb-like words between names serve as the relations between them.
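The compound-noun merge and SVO extraction of step 306 can be sketched on the example sentence as follows. This is a simplified illustration, not the patented procedure: it assumes that adjacent noun-tagged words always form one compound noun, that verb tags start with "V", and that the nearest noun on each side of the first usable verb is the subject and object. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Assumption: these CKIP-style tags mark nouns.
NOUN_TAGS = {"Na", "Nb", "Nc", "Nd"}


def merge_compound_nouns(tokens: List[Token]) -> List[Token]:
    """Join runs of adjacent noun tokens into one token tagged 'N'."""
    merged: List[Token] = []
    for word, tag in tokens:
        if tag in NOUN_TAGS and merged and merged[-1][1] == "N":
            merged[-1] = (merged[-1][0] + word, "N")
        elif tag in NOUN_TAGS:
            merged.append((word, "N"))
        else:
            merged.append((word, tag))
    return merged


def extract_svo(tokens: List[Token]) -> Optional[Tuple[str, str, str]]:
    """Return (subject, verb, object) for the first verb with nouns on both sides."""
    merged = merge_compound_nouns(tokens)
    for i, (word, tag) in enumerate(merged):
        if not tag.startswith("V"):
            continue
        subj = next((w for w, t in reversed(merged[:i]) if t == "N"), None)
        obj = next((w for w, t in merged[i + 1:] if t == "N"), None)
        if subj is not None and obj is not None:
            return (subj, word, obj)
    return None


sentence = [("李登輝", "Nb"), ("參訪", "VC"), ("日本", "Nc"),
            ("東京", "Nc"), ("靖國", "Nb"), ("神社", "Nc")]
triple = extract_svo(sentence)
```

For the example sentence, `triple` is `("李登輝", "參訪", "日本東京靖國神社")`, matching the Nb+VC+compound-noun structure described above.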

In step 307, the domain knowledge network generation unit 18 generates the domain knowledge network for the domain according to the result of step 306. For example, suppose a search with the keyword 「李登輝」 (Lee Teng-hui) finds a news web page for a certain date (for example, 2007/01/01). A sentence such as 「李登輝參訪日本東京靖國神社」 ("Lee Teng-hui visits the Yasukuni Shrine in Tokyo, Japan"), taken from the text of the metadata-type block of that news page, can then be used to construct a domain knowledge network similar to FIG. 3. In FIG. 3, the nodes are names and the lines between nodes are the relations between names. It is also worth noting that, for each node, additional domain-related information can be extracted automatically from important websites (for example, search engines or Wikipedia, http://www.wikipedia.org/) by applying the above steps again, thereby making the domain knowledge network shown in FIG. 3 more complete.
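One simple way to hold the network of step 307 in memory is an adjacency mapping from each name to its outgoing (relation, name) edges. The source does not prescribe a data structure, so the representation and the name `build_knowledge_network` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, verb, object)


def build_knowledge_network(
    triples: Iterable[Triple],
) -> Dict[str, List[Tuple[str, str]]]:
    """Map each name (node) to its outgoing (relation, target name) edges."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subj, verb, obj in triples:
        graph[subj].append((verb, obj))
        graph.setdefault(obj, [])  # make sure the object is also a node
    return dict(graph)


network = build_knowledge_network([("李登輝", "參訪", "日本東京靖國神社")])
```

Each additional SVO triple, including those obtained by re-running the steps on other websites, adds nodes and edges to the same mapping.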

In step 308, the user interface unit 19 follows the Scalable Vector Graphics (SVG) specification of the W3C (World Wide Web Consortium) SVG working group to build a graphical user interface for the domain knowledge network, which is then presented to the user.
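As a minimal sketch of what an SVG rendering of one relation might look like, the function below draws two labelled nodes joined by a labelled line. The layout, sizes, colours, and the name `render_triple_svg` are arbitrary choices for illustration; the source only states that SVG is used.

```python
from xml.sax.saxutils import escape


def render_triple_svg(subject: str, relation: str, obj: str) -> str:
    """Draw two labelled nodes joined by a labelled edge as an SVG string."""
    # Escape &, <, > so that names cannot break the XML markup.
    s, r, o = escape(subject), escape(relation), escape(obj)
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="480" height="120">'
        '<line x1="100" y1="60" x2="380" y2="60" stroke="black"/>'
        '<circle cx="100" cy="60" r="8" fill="steelblue"/>'
        '<circle cx="380" cy="60" r="8" fill="steelblue"/>'
        f'<text x="100" y="90" text-anchor="middle">{s}</text>'
        f'<text x="240" y="50" text-anchor="middle">{r}</text>'
        f'<text x="380" y="90" text-anchor="middle">{o}</text>'
        '</svg>'
    )


svg = render_triple_svg("李登輝", "參訪", "日本東京靖國神社")
```

Saving `svg` to a `.svg` file and opening it in a browser shows the two nodes and the relation between them; a full interface would also need a layout algorithm for many nodes.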

In summary, the domain knowledge network construction system 1 and method of the present invention automatically extract appropriate names from domain-related web pages to serve as concepts, and construct a domain knowledge network from the relations between those concepts, thereby achieving the object of the present invention.

The above is merely a preferred embodiment of the present invention and does not limit the scope in which the invention may be practiced. Simple equivalent changes and modifications made according to the claims and the description of the invention all remain within the scope covered by this patent.

1 ... Domain knowledge network construction system
11 ... Web page acquisition unit
12 ... Database
13 ... Web page partition unit
14 ... Block type identification unit
15 ... POS tagging unit
16 ... Named entity recognition unit
17 ... Inter-object relation processing unit
18 ... Domain knowledge network generation unit
19 ... User interface unit
2 ... Website
301~308 ... Steps

FIG. 1 is a block diagram illustrating a preferred embodiment of the domain knowledge network construction system of the present invention;

FIG. 2 is a flow chart illustrating a preferred embodiment of the domain knowledge network construction method of the present invention; and

FIG. 3 is a schematic diagram illustrating an example of a domain knowledge network constructed by the present invention.

Claims (10)

1. A domain knowledge network construction method, suitable for execution by a system, the method comprising the following steps: A) obtaining a plurality of web pages related to a domain; B) dividing each web page into a plurality of blocks; C) identifying, according to the attribute-value pairs of the web page text corresponding to each block, those blocks of each web page that belong to a metadata type, wherein the metadata type is that of the block of each web page whose corresponding web page text has the highest frequency of attributes; D) performing part-of-speech (POS) tagging on the web page text corresponding to the blocks of the metadata type, so as to assign a POS tag to each word in the text, and splitting the text into a plurality of sentences; E) performing named entity recognition on the sentences tagged and split in step D); F) performing inter-object relation processing according to the results of steps D) and E) to obtain a subject-verb-object (SVO) structure for each sentence; and G) generating a domain knowledge network for the domain according to the result of step F).

2. The domain knowledge network construction method of claim 1, wherein step E) comprises the following sub-steps: e-1) comparing the sentences against a pre-built dictionary, so as to identify the corresponding name for each word tagged as a noun; and e-2) for words whose corresponding name is not identified in sub-step e-1), performing recognition based on a conditional random field model.

3. The domain knowledge network construction method of claim 2, wherein the features used in the conditional random field model include word boundary features.

4. The domain knowledge network construction method of claim 2, wherein the features used in the conditional random field model include character features.

5. The domain knowledge network construction method of claim 2, wherein step F) further finds the names in each sentence that belong to the same compound noun and then performs the inter-object relation processing to obtain the SVO structure of each sentence.

6. A domain knowledge network construction system, comprising: a web page partition unit for dividing each of a plurality of web pages related to a domain into a plurality of blocks; a block type identification unit for identifying, according to the attribute-value pairs of the web page text corresponding to each block, those blocks of each web page that belong to a metadata type, wherein the metadata type is that of the block of each web page whose corresponding web page text has the highest frequency of attributes; a POS tagging unit for performing POS tagging on the web page text corresponding to the blocks of the metadata type, so as to assign a POS tag to each word in the text, and for splitting the text into a plurality of sentences; a named entity recognition unit for performing named entity recognition on the sentences tagged and split by the POS tagging unit; an inter-object relation processing unit for performing inter-object relation processing according to the results of the POS tagging unit and the named entity recognition unit to obtain a subject-verb-object (SVO) structure for each sentence; and a domain knowledge network generation unit for generating a domain knowledge network for the domain according to the SVO structure of each sentence.

7. The domain knowledge network construction system of claim 6, wherein the named entity recognition unit first compares the sentences against a pre-built dictionary, so as to identify the corresponding name for each word tagged as a noun, and then, for words whose corresponding name is not identified, performs recognition based on a conditional random field model.

8. The domain knowledge network construction system of claim 7, wherein the features used in the conditional random field model include word boundary features.

9. The domain knowledge network construction system of claim 7, wherein the features used in the conditional random field model include character features.

10. The domain knowledge network construction system of claim 7, wherein the inter-object relation processing unit further finds the names in each sentence that belong to the same compound noun and then performs the inter-object relation processing to obtain the SVO structure of each sentence.
TW99106441A 2010-03-05 2010-03-05 Domain Knowledge Network Construction Method and Its System TWI442249B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW99106441A TWI442249B (en) 2010-03-05 2010-03-05 Domain Knowledge Network Construction Method and Its System

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW99106441A TWI442249B (en) 2010-03-05 2010-03-05 Domain Knowledge Network Construction Method and Its System

Publications (2)

Publication Number Publication Date
TW201131389A TW201131389A (en) 2011-09-16
TWI442249B true TWI442249B (en) 2014-06-21

Family

ID=50180352

Family Applications (1)

Application Number Title Priority Date Filing Date
TW99106441A TWI442249B (en) 2010-03-05 2010-03-05 Domain Knowledge Network Construction Method and Its System

Country Status (1)

Country Link
TW (1) TWI442249B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI682287B (en) * 2018-10-25 2020-01-11 財團法人資訊工業策進會 Knowledge graph generating apparatus, method, and computer program product thereof
US11250035B2 (en) 2018-10-25 2022-02-15 Institute For Information Industry Knowledge graph generating apparatus, method, and non-transitory computer readable storage medium thereof

Also Published As

Publication number Publication date
TW201131389A (en) 2011-09-16

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees