TW201131388A - Domain metadata retrieval method and its system

Info

Publication number: TW201131388A
Application number: TW99106440A
Authority: TW
Inventors: Xuan-Hua Lin; Guan-Hong Liu; Guo-Ting Huang; Jun-Ming Qiu; Jun-Zhe Huang
Original assignee: Univ Nat Chi Nan
Priority date: 2010-03-05
Filing date: 2010-03-05
Publication date: 2011-09-16
Also published as: TWI423053B

Abstract

A domain metadata retrieval method includes the following steps: (A) receiving a plurality of domain-related web pages related to a domain; (B) analyzing the web page syntax of the domain-related web pages to obtain a plurality of first metadata attributes; (C) obtaining a plurality of keyphrases from the domain-related web pages based on a statistical language model, and using the keyphrases as a plurality of second metadata attributes; (D) retrieving a plurality of domain-related metadata records from the domain-related web pages according to the result of mapping the first and second metadata attributes to a set of Dublin Core fields; and (E) outputting the domain-related metadata to a carrier.

Description

VI. Description of the Invention

[Technical Field of the Invention]

The present invention relates to techniques for web data mining, and in particular to a domain metadata retrieval method and system.

[Prior Art]

With the rapid development and spread of network technology, the amount of information on the web has grown explosively. That information is all-encompassing, so the web resembles a huge distributed database. How to find the required information on the web quickly and correctly is an important research topic in current web knowledge engineering.

When a user wants to search for information on a certain domain or topic, the user can first enter keywords related to that domain or topic at a search site (for example, Google, Yahoo, Bing, or Openfind), whose search engine performs the search and returns a large number of web pages related to the keywords. The user then browses the returned pages and judges whether they meet the need; if they do not, the user repeats the actions of entering keywords and browsing the search results. In general, a user often has to repeat these actions many times to find suitable information, which is time-consuming, inefficient, and tiresome.

When the keywords entered by the user are not precise enough, the search results often lack information that is actually usable. The user can accumulate relevant domain knowledge over repeated searches, and thereby improve the precision of the keywords entered the next time the same domain or topic comes up. However, a search in another new domain or topic still requires considerable time to build up the relevant domain knowledge.

Furthermore, to serve users with different needs, existing websites often manage structured data in a back-end database system and present it dynamically through programs, for example Common Gateway Interface (CGI) programs, according to each information need. Ordinary search engines cannot obtain data from such back-end databases by following hyperlinks. A great deal of useful information therefore remains hidden, in what is called the hidden web or the deep web, and existing search engines cannot necessarily analyze and obtain it.

In view of this, the present invention aims to retrieve the corresponding domain metadata from the domain-related web (including the hidden web and the deep web) and to provide it to users and to web knowledge engineering applications.

[Summary of the Invention]

Accordingly, an object of the present invention is to provide a domain metadata retrieval method.

The domain metadata retrieval method of the present invention is adapted to be executed by a system and comprises the following steps: (A) receiving a plurality of domain-related web pages related to a domain; (B) analyzing the web page syntax of the domain-related web pages to obtain a plurality of first metadata attributes; (C) obtaining a plurality of keyphrases from the domain-related web pages based on a statistical language model, and using the keyphrases as a plurality of second metadata attributes; (D) retrieving a plurality of domain-related metadata records from the domain-related web pages according to the result of mapping the first and second metadata attributes to a set of Dublin Core fields; and (E) outputting the domain-related metadata to a carrier.

Another object of the present invention is to provide a domain metadata retrieval system.

The domain metadata retrieval system of the present invention comprises a memory unit and a domain metadata retrieval unit. The memory unit stores a plurality of domain-related web pages related to a domain. The domain metadata retrieval unit is connected to the memory unit and includes a syntax analysis module, a keyphrase statistics module, and a metadata retrieval module.

The syntax analysis module analyzes the web page syntax of the domain-related web pages to obtain a plurality of first metadata attributes. The keyphrase statistics module obtains a plurality of keyphrases from the domain-related web pages based on a statistical language model and uses the keyphrases as a plurality of second metadata attributes. The metadata retrieval module retrieves a plurality of domain-related metadata records from the domain-related web pages according to the result of mapping the first and second metadata attributes to a set of Dublin Core fields.

The effect of the invention is as follows: after the first and second metadata attributes are mapped to the set of Dublin Core fields, searching the content of the domain-related websites not only obtains information from the hidden web or deep web, but also improves search efficiency and the accuracy of the results, and the retrieved metadata can further be used to build domain knowledge.

[Embodiments]

The foregoing and other technical contents, features, and effects of the present invention will be clearly presented in the following detailed description of a preferred embodiment with reference to the drawings.

Referring to Figure 1, a preferred embodiment of the domain metadata retrieval system 1 of the present invention comprises a user interface unit 11, a domain data collection unit 12 connected to the user interface unit 11, a database 13 connected to the user interface unit 11 and the domain data collection unit 12, and a domain metadata retrieval unit 14 connected to the user interface unit 11 and the database 13.

The user interface unit 11 includes a keyword input interface 111, a collection result selection interface 112, a Dublin Core labeling interface 113, and a domain metadata retrieval result display interface 114. These four interfaces are implemented as interactive interfaces through which the user operates the domain metadata retrieval system 1 and through which execution results are presented to the user.

The domain data collection unit 12 includes a web page acquisition module 121 and a grouping and ranking module 122. The database 13 is one implementation of the memory unit. The domain metadata retrieval unit 14 includes a syntax analysis module 141, a keyphrase statistics module 142, and a metadata retrieval module 143.
Referring to Figures 1 and 2, the domain metadata retrieval method of the present invention, corresponding to the above preferred embodiment of the domain metadata retrieval system 1, is executed by the system 1 and comprises the following steps.

When a user wants to search for and compile data on a certain domain, the user enters at least one keyword for that domain through the keyword input interface 111 of the user interface unit 11. For example, a user who wants to find data related to the travel domain can enter "travel" as the keyword through the keyword input interface 111.

In step S301, the web page acquisition module 121 of the domain data collection unit 12 searches according to the keyword, obtains the web pages related to the keyword from websites 2, and stores the web pages in the database 13. In this preferred embodiment, the web page acquisition module 121 uses an existing metasearch engine, for example WebCrawler, to search the websites 2 and obtain Hypertext Markup Language (HTML) web pages related to the keyword. Because searching for and obtaining web pages by keyword is a known technique and is not the focus of the present invention, its implementation details are not described here.

In step S302, the syntax analysis module 141 of the domain metadata retrieval unit 14 analyzes the web page syntax of the web pages to obtain a Document Object Model (DOM) tree for each web page, builds tokens and an inverted index for each web page using known lexical analysis and indexing methods, and stores the tokens and the inverted index in the database 13. For details of the lexical analysis and indexing methods, see S. Deerwester et al., "Indexing by Latent Semantic Analysis," J. Amer. Soc. Info. Sci., vol. 41, pp. 391-407, 1990, and G. Salton et al., "A Vector Space Model for Automatic Indexing," Communications of the ACM, vol. 18, no. 11, pp. 613-620, 1975.
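The tokenizing and inverted-index part of step S302 can be illustrated with a minimal Python sketch. The description does not specify the tokenizer or the index layout, so the `tokenize` and `build_inverted_index` helpers, the regular expression, and the sample pages below are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each token to the sorted list of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return {token: sorted(ids) for token, ids in index.items()}


# Hypothetical page texts, standing in for text extracted from the DOM trees.
pages = {
    "p1": "Travel guide to Taipei hotels",
    "p2": "Hotels and travel deals",
}
index = build_inverted_index(pages)
```

A real system would tokenize Chinese text differently (for example by character or by word segmentation); the sketch only shows the token-to-pages structure that an inverted index provides.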
In step S303, the grouping and ranking module 122 of the domain data collection unit 12 first groups the web pages obtained in step S301 into conceptual objects at three levels of granularity (3-level granularities): site-level conceptual objects, directory-level conceptual objects, and page-level conceptual objects. The grouping and ranking module 122 then ranks the site-level, directory-level, and page-level conceptual objects to form a domain data collection result corresponding to the keyword. Step S303 is further described as follows.

First, the grouping and ranking module 122 groups the web pages according to the following definitions.

A site-level conceptual object is the Uniform Resource Locator (URL) of a website's home page, as indicated by a host name registered with an Internet Domain Name Server (DNS). For example, "http://travel.network.com.tw/" is a site-level conceptual object.

A directory-level conceptual object is one that contains a plurality of domain-related web pages. In general, a website can be further divided into several domain-related directories, and these directories are directory-level conceptual objects. For example, "http://travel.network.com.tw/tourguide/" is a directory-level conceptual object.

A page-level conceptual object is a single domain-related web page. Such a page usually contains domain-related information, or contains hyperlinks to pages with more domain-related information.
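As a rough illustration of the three granularity levels defined above, the following Python sketch classifies a URL by its path. The rule used here (an empty path means site level, a trailing slash means directory level, anything else means page level) is an assumed heuristic, not the grouping procedure of the invention, and the `granularity` function name and the page URL are hypothetical.

```python
from urllib.parse import urlparse


def granularity(url):
    """Classify a URL as 'site', 'directory', or 'page' (assumed heuristic)."""
    path = urlparse(url).path
    if path in ("", "/"):
        return "site"        # home page of a host name
    if path.endswith("/"):
        return "directory"   # a directory that groups several pages
    return "page"            # a single web page


levels = {
    url: granularity(url)
    for url in (
        "http://travel.network.com.tw/",
        "http://travel.network.com.tw/tourguide/",
        "http://travel.network.com.tw/tourguide/taipei.html",  # hypothetical page
    )
}
```

The first two URLs are the examples given in the description; the third is made up to show the page-level case.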
Next, the grouping and ranking module 122 ranks the site-level, directory-level, and page-level conceptual objects according to the following parameters: a cluster size, a Term Frequency-Inverse Document Frequency (TF-IDF) similarity, and an authority and hub value. In this preferred embodiment, for each site-level, directory-level, or page-level conceptual object, the grouping and ranking module 122 multiplies its cluster size, TF-IDF similarity, and authority and hub value by their respective weights and sums the products to form a ranking score. The larger the ranking score, the earlier the object appears in the order.

The cluster size is the number of web pages contained in the conceptual object. The TF-IDF similarity is computed with a known statistical technique from the user-defined keyword and the tokens of the web pages, and evaluates how important a term is to one document in a document collection or corpus; for details see "http://en.wikipedia.org/wiki/Tf-idf". The authority and hub value is obtained with the known Hyperlink-Induced Topic Search (HITS) algorithm; for details see "http://en.wikipedia.org/wiki/HITS_algorithm". The weights used in step S303 are listed in Table 1.

Table 1
Parameter                  Site level   Directory level   Page level
Cluster size               weight = 1   weight = 1        N/A
Authority and hub value    weight = 2   weight = 2        weight = 1
TF-IDF similarity          weight = 3   weight = 3        weight = 2

In step S304, to make the subsequent domain metadata retrieval results more accurate and practically meaningful, the user interface unit 11 presents the domain data collection result to the user through the collection result selection interface 112. Through the collection result selection interface 112, the user selects the plurality of domain-related web pages to be used for the subsequent retrieval of domain metadata; the selected domain-related web pages are passed to the domain metadata retrieval unit 14 and stored in the database 13. The domain-related web pages belong to the web pages obtained earlier. Because the domain data collection result produced by steps S301 to S303 is presented in grouped and ranked form, the user can select the domain-related web pages easily and quickly.
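The weighted ranking score of step S303 can be sketched in Python using the weights of Table 1. The feature values below are invented for illustration, and the description does not say whether the parameters are normalized or whether the three levels are ranked in one list or separately; this sketch assumes raw values and a single combined list. The names `WEIGHTS`, `ranking_score`, and `objects` are hypothetical.

```python
# Weights from Table 1; the page level has no cluster size (N/A).
WEIGHTS = {
    "site": {"cluster_size": 1, "authority_hub": 2, "tfidf": 3},
    "directory": {"cluster_size": 1, "authority_hub": 2, "tfidf": 3},
    "page": {"authority_hub": 1, "tfidf": 2},
}


def ranking_score(level, features):
    """Sum each parameter multiplied by its weight for the given level."""
    return sum(
        weight * features.get(name, 0.0)
        for name, weight in WEIGHTS[level].items()
    )


# Hypothetical conceptual objects: (URL, level, parameter values).
objects = [
    ("http://travel.network.com.tw/tourguide/taipei.html", "page",
     {"authority_hub": 0.5, "tfidf": 0.25}),
    ("http://travel.network.com.tw/", "site",
     {"cluster_size": 12, "authority_hub": 0.5, "tfidf": 0.25}),
    ("http://travel.network.com.tw/tourguide/", "directory",
     {"cluster_size": 4, "authority_hub": 0.25, "tfidf": 0.5}),
]

# A larger ranking score places the object earlier in the result.
ranked = sorted(objects, key=lambda o: ranking_score(o[1], o[2]), reverse=True)
ranked_urls = [url for url, _, _ in ranked]
```

With raw values, the cluster size tends to dominate the score, so a practical implementation would probably scale the three parameters to comparable ranges before weighting them.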

It is worth noting that having the user select the domain-related web pages can greatly improve the accuracy of the subsequent processing. However, the present invention can also automatically take the web pages corresponding to the top N ranked conceptual objects (site-level, directory-level, or page-level conceptual objects) in the domain data collection result directly as the domain-related web pages, and then continue with the subsequent processing.

In step S305, the domain metadata retrieval unit 14 obtains the domain-related web pages.

In step S306, the syntax analysis module 141 of the domain metadata retrieval unit 14 further analyzes the web page syntax of the domain-related web pages to obtain a plurality of first metadata attributes. In general, the regions of a website that relate to its search service are encapsulated in HTML <form> elements. In this preferred embodiment, the syntax analysis module 141 analyzes the attribute-value pairs in the HTML <form> elements of the domain-related web pages to obtain the first metadata attributes. The relationship between an attribute and its values can be 1-to-1 (1-to-1 mapping), N-to-1 (N-to-1 mapping), or 1-to-N (1-to-N mapping); the syntax analysis module 141 uses only attribute-value pairs with a 1-to-1 or 1-to-N relationship to obtain the first metadata attributes.
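One way to picture step S306 is to collect the named fields of each <form> element together with the values they offer. The sketch below uses Python's standard `html.parser` module. The `FormAttributeParser` class, the sample markup, and the choice to treat `<input>` and `<select>`/`<option>` as the sources of attribute-value pairs are assumptions; filtering out N-to-1 relationships is not implemented here.

```python
from html.parser import HTMLParser


class FormAttributeParser(HTMLParser):
    """Collect field names and their candidate values inside <form> elements."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.current_select = None
        self.pairs = {}  # field name -> list of values

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif not self.in_form:
            return  # ignore fields that are not inside a form
        elif tag == "input" and attrs.get("name"):
            values = self.pairs.setdefault(attrs["name"], [])
            if attrs.get("value"):
                values.append(attrs["value"])
        elif tag == "select" and attrs.get("name"):
            self.current_select = attrs["name"]
            self.pairs.setdefault(self.current_select, [])
        elif tag == "option" and self.current_select is not None:
            if attrs.get("value"):
                self.pairs[self.current_select].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False
            self.current_select = None
        elif tag == "select":
            self.current_select = None


# Hypothetical search form from a travel website.
SAMPLE_HTML = """
<form action="/search">
  <input type="text" name="keyword">
  <select name="region">
    <option value="north">North</option>
    <option value="south">South</option>
  </select>
</form>
<input name="outside" value="x">
"""

parser = FormAttributeParser()
parser.feed(SAMPLE_HTML)
pairs = parser.pairs
```

In this example, `region` offers several values (a 1-to-N relationship), while `keyword` is a free-text field with no preset value; the field named `outside` is skipped because it does not sit inside a form.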

In step S307, the keyphrase statistics module 142 of the domain metadata retrieval unit 14 uses the tokens of the domain-related web pages built in step S302 and, based on a statistical language model, obtains a plurality of keyphrases from the domain-related web pages and uses the keyphrases as a plurality of second metadata attributes. In this preferred embodiment, an n-gram language model is used for the statistics. For details see W. B. Cavnar et al., "N-Gram-Based Text Categorization," Proc. 3rd Symp. on Document Analysis and Information Retrieval, pp. 161-175, 1994, and L. F. Chien, "PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval," Proc. 24th ACM SIGIR Int'l Conf. Research and Development in Information Retrieval, pp. 50-58, 1997.
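A very small Python sketch of n-gram counting gives a feel for step S307. It is only a frequency count over token n-grams, not the PAT-tree method cited above, and the `ngrams` and `keyphrases` helpers, the `min_count` threshold, and the token lists are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def keyphrases(token_lists, n=2, min_count=2):
    """Count n-grams over all pages and keep those seen at least min_count times."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(ngrams(tokens, n))
    return [
        (" ".join(gram), count)
        for gram, count in counts.most_common()
        if count >= min_count
    ]


# Hypothetical tokens of three domain-related web pages (from step S302).
token_lists = [
    ["hot", "spring", "hotel", "taipei"],
    ["hot", "spring", "resort"],
    ["taipei", "hot", "spring"],
]
phrases = keyphrases(token_lists)
```

Here only the bigram "hot spring" occurs in more than one place, so it is the single candidate keyphrase; a fuller implementation would combine several values of n and apply a statistical significance test rather than a fixed count.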
In step S308, the user interface unit 11 provides the Dublin Core labeling interface 113, through which the user maps the first and second metadata attributes to a set of Dublin Core fields. Dublin Core is a metadata format widely used for web resources, government publications, library collections, and museum collections. The purpose of mapping the first and second metadata attributes to the set of Dublin Core fields is to make the formats of different kinds of metadata interoperable, which improves interoperability when metadata is retrieved across different websites. The set of Dublin Core fields includes 15 fields: title, subject, description, type, date, publisher, creator, contributor, format, identifier, source, language, relation, coverage, and rights.

Taking the travel domain as an example again, the Dublin Core labeling interface 113 is shown in Figure 3, and the mapping results after labeling through the Dublin Core labeling interface 113 are listed in Table 2.

Table 2 (only a fragment survives in the source text: accommodation, hotels, countries)

It is worth noting that, once the first website in a domain has been labeled through step S308, other websites in the same domain can be judged automatically from the accumulated mapping results. The judgment is made by character-level sequence alignment or by computing an identity score. Taking the library domain as an example, suppose the metadata attribute "book/journal title" has been mapped to the Dublin Core field "Title". For other websites in the library domain, their metadata attributes are compared with the previously established mapping results, by character-level sequence alignment or identity score, and are thereby mapped automatically to the set of Dublin Core fields. The more data accumulated in the same domain, the more complete the mapping results become; in the process, manual corrections can also be made as needed to gradually improve the accuracy of the mapping.
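The automatic mapping described above can be approximated with a character-level similarity measure. The description does not define the identity score, so the Python sketch below assumes one: the number of characters matched by `difflib.SequenceMatcher` divided by the length of the longer string. The `identity_score` and `auto_map` functions, the 0.6 threshold, and the sample mappings are hypothetical.

```python
from difflib import SequenceMatcher


def identity_score(a, b):
    """Share of matched characters relative to the longer string (assumed definition)."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(a), len(b))


def auto_map(attribute, known_mappings, threshold=0.6):
    """Return the Dublin Core field of the most similar known attribute, or None."""
    best_label, best_score = None, 0.0
    for label in known_mappings:
        score = identity_score(attribute, label)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is not None and best_score >= threshold:
        return known_mappings[best_label]
    return None


# Hypothetical mappings accumulated from a first library website.
known_mappings = {"book title": "title", "author name": "creator"}

mapped = auto_map("book titles", known_mappings)
unmapped = auto_map("price", known_mappings)
```

An attribute that falls below the threshold is left unmapped, which corresponds to the cases where a person would make the manual correction mentioned in the description.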
In steps S309 to S310, the metadata retrieval module 143 of the domain metadata retrieval unit 14 first uses the first and second metadata attributes mapped to the Dublin Core fields to retrieve, automatically, a plurality of domain-related metadata records from the domain-related web pages (for example, by extracting all the corresponding attribute-value pairs). The metadata is then output to a carrier: for example, it is stored in the database 13 for the subsequent construction of domain knowledge, or it is presented to the user through the domain metadata retrieval result display interface 114.

In summary, by mapping the first and second metadata attributes to the set of Dublin Core fields and then searching the content of the domain-related websites, the present invention can not only obtain information from the hidden web or deep web through each website's own search service, but also improve search efficiency and the accuracy of the results. The retrieved metadata can further be used to build specialized domain knowledge, so the object of the present invention is indeed achieved.

The foregoing is only a preferred embodiment of the present invention and does not limit the scope of the invention; simple equivalent changes and modifications made according to the claims and the description of the invention remain within the scope covered by the patent.

[Brief Description of the Drawings]

Figure 1 is a block diagram illustrating a preferred embodiment of the domain metadata retrieval system of the present invention;
Figure 2 is a flowchart illustrating the domain metadata retrieval method of the present invention; and
Figure 3 is a schematic diagram illustrating a Dublin Core labeling interface of a user interface unit.

[Description of Main Reference Numerals]

1          domain metadata retrieval system
11         user interface unit
111        keyword input interface
112        collection result selection interface
113        Dublin Core labeling interface
114        domain metadata retrieval result display interface
12         domain data collection unit
121        web page acquisition module
122        grouping and ranking module
13         database
14         domain metadata retrieval unit
141        syntax analysis module
142        keyphrase statistics module
143        metadata retrieval module
2          website
S301-S310  steps


Claims

VII. Scope of Patent Application:

1. A domain metadata retrieval method adapted to be executed by a system, the method comprising the following steps: (A) receiving a plurality of domain-related web pages related to a domain; (B) analyzing the web page syntax of the domain-related web pages to obtain a plurality of first metadata attributes; (C) obtaining a plurality of keyphrases from the domain-related web pages based on a statistical language model, and using the keyphrases as a plurality of second metadata attributes; (D) retrieving a plurality of domain-related metadata records from the domain-related web pages according to the result of mapping the first and second metadata attributes to a set of Dublin Core fields; and (E) outputting the domain-related metadata to a carrier.

2. The domain metadata retrieval method of claim 1, wherein in step (B), a plurality of attribute-value pairs corresponding to the domain-related web pages are analyzed to obtain the first metadata attributes.

3. The domain metadata retrieval method of claim 1, wherein in step (C), statistics are performed based on an n-gram language model to obtain the keyphrases.
4. The domain metadata retrieval method of claim 1, further comprising, before step (A), the following steps: (F) searching a plurality of websites according to at least one keyword related to the domain to obtain a plurality of web pages related to the keyword; (G) grouping each web page into one of a site-level conceptual object, a directory-level conceptual object, and a page-level conceptual object; (H) ranking the site-level, directory-level, and page-level conceptual objects; and (I) obtaining the domain-related web pages according to the ranking result of step (H), wherein the domain-related web pages belong to the web pages.

5. The domain metadata retrieval method of claim 4, wherein in step (H), each of the site-level, directory-level, and page-level conceptual objects is ranked according to its corresponding cluster size, term frequency-inverse document frequency similarity, and authority and hub value.

6. The domain metadata retrieval method of claim 5, wherein in step (H), for each of the site-level, directory-level, and page-level conceptual objects, its corresponding cluster size, term frequency-inverse document frequency similarity, and authority and hub value are multiplied by respective weights and then summed into a ranking score, and the conceptual objects are ranked by the ranking score.

7. A domain metadata retrieval system, comprising: a memory unit storing a plurality of domain-related web pages related to a domain; and a domain metadata retrieval unit connected to the memory unit, the domain metadata retrieval unit including a syntax analysis module, a keyphrase statistics module, and a metadata retrieval module, wherein the syntax analysis module analyzes the web page syntax of the domain-related web pages to obtain a plurality of first metadata attributes, the keyphrase statistics module obtains a plurality of keyphrases from the domain-related web pages based on a statistical language model and uses the keyphrases as a plurality of second metadata attributes, and the metadata retrieval module retrieves a plurality of domain-related metadata records from the domain-related web pages according to the result of mapping the first and second metadata attributes to a set of Dublin Core fields.

8. The domain metadata retrieval system of claim 7, wherein the syntax analysis module analyzes a plurality of attribute-value pairs corresponding to the domain-related web pages to obtain the first metadata attributes.
9. The domain metadata retrieval system according to the scope of the patent application, wherein the keyphrase statistics module obtains the keyphrases based on an n-gram language model.

10. The domain metadata retrieval system according to the scope of the patent application, further comprising a user interface unit coupled to the memory unit and the domain metadata retrieval unit, wherein the user interface unit includes a Dublin Core field mapping interface for a user to map the first and second metadata attributes to the set of Dublin Core fields.

11. The domain metadata retrieval system of claim 10, wherein the user interface unit further includes a domain metadata retrieval result display interface for presenting the domain-related metadata to the user.

12. The domain metadata retrieval system of claim 10, further comprising a domain data collection unit coupled to the memory unit and the domain metadata retrieval unit, wherein the domain data collection unit includes a web page acquisition module and a grouping and ranking module, the web page acquisition module searches a plurality of websites according to at least one keyword related to the domain to obtain a plurality of web pages related to the keyword, and the grouping and ranking module groups each web page into one of a concept object of a website level, a concept object of a directory level, and a concept object of a web page level, and ranks the concept objects of the website level, the directory level, and the web page level to form a domain data collection result corresponding to the keyword.
13. The domain metadata retrieval system of claim 12, wherein, for the concept objects of any one of the website level, the directory level, and the web page level, the grouping and ranking module ranks the concept objects according to one of their corresponding cluster size, term frequency-inverse document frequency (TF-IDF) similarity, and authority and divergence value.

14. The domain metadata retrieval system of claim 13, wherein, for the concept objects of any one of the website level, the directory level, and the web page level, the grouping and ranking module multiplies the corresponding cluster size, TF-IDF similarity, and authority and divergence value by respective weights, sums the products into a ranking score, and ranks the concept objects by the ranking score.

15. The domain metadata retrieval system of claim 12, wherein the user interface unit further includes a keyword input interface for the user to enter the keyword related to the domain.

16. The domain metadata retrieval system of claim 12, wherein the user interface unit further includes a collection result selection interface for the user to select the domain-related web pages from the domain data collection result, the domain-related web pages belong to the web pages, and the user interface unit further stores the domain-related web pages in the memory unit.
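By way of non-limiting illustration, the ranking recited for the concept objects of one level (ranking by a single criterion, or by a weighted ranking score) may be sketched as follows. The class name `ConceptObject`, the function names, the field names, and the weight values are assumptions introduced only for this sketch; the claims do not specify how the three quantities are computed or scaled, so they are taken here as given inputs.

```python
from dataclasses import dataclass


@dataclass
class ConceptObject:
    """One concept object of a single level (website, directory, or web page).

    The three numeric fields are assumed to be precomputed; their
    computation is outside the scope of this sketch.
    """
    name: str
    cluster_size: float
    tfidf_similarity: float
    authority_divergence: float


def rank_by(objects, criterion):
    """Rank concept objects of one level by a single criterion, highest first."""
    return sorted(objects, key=lambda o: getattr(o, criterion), reverse=True)


def ranking_score(obj, weights):
    """Multiply each quantity by its own weight and sum into one ranking score."""
    w_size, w_sim, w_auth = weights
    return (w_size * obj.cluster_size
            + w_sim * obj.tfidf_similarity
            + w_auth * obj.authority_divergence)


def rank_level(objects, weights):
    """Rank concept objects of one level by their weighted ranking score."""
    return sorted(objects, key=lambda o: ranking_score(o, weights), reverse=True)


if __name__ == "__main__":
    # Hypothetical concept objects and weights, for illustration only.
    site_a = ConceptObject("site-a", cluster_size=10, tfidf_similarity=0.2,
                           authority_divergence=0.5)
    site_b = ConceptObject("site-b", cluster_size=3, tfidf_similarity=0.9,
                           authority_divergence=0.8)
    for obj in rank_level([site_a, site_b], weights=(0.1, 1.0, 1.0)):
        print(obj.name)
```

In this sketch each level is ranked separately by calling `rank_level` once per level; a small weight on the cluster size keeps a large but weakly related cluster from dominating the ranking score.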

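By way of non-limiting illustration, the obtaining of the first and second metadata attributes and their mapping to Dublin Core fields may be sketched as follows. Several assumptions are made only for this sketch: the attribute-value pairs are read from HTML `<meta>` tags, the n-gram language model is reduced to a plain frequency count of word n-grams, and the mapping table is a hypothetical stand-in for the correspondence a user would set through the mapping interface. All function and variable names are illustrative.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class MetaPairParser(HTMLParser):
    """Collect name/content attribute-value pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if attr.get("name") and attr.get("content"):
                self.pairs[attr["name"].lower()] = attr["content"]


def first_attributes(html):
    """First metadata attributes: attribute-value pairs found in the page syntax."""
    parser = MetaPairParser()
    parser.feed(html)
    return parser.pairs


def second_attributes(text, n=2, top_k=3):
    """Second metadata attributes: the most frequent word n-grams.

    A frequency count is a simplification assumed here, not the
    statistical language model of the claims.
    """
    words = re.findall(r"[a-z]+", text.lower())
    grams = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return [gram for gram, _ in grams.most_common(top_k)]


def map_to_dublin_core(attributes, mapping):
    """Group attribute names under the Dublin Core field each is mapped to."""
    result = {}
    for name in attributes:
        if name in mapping:
            result.setdefault(mapping[name], []).append(name)
    return result


if __name__ == "__main__":
    # Hypothetical page content and user-chosen mapping, for illustration only.
    html = ('<html><head><meta name="Author" content="Lin">'
            '<meta name="description" content="Course list"></head></html>')
    text = "course name course name instructor course name"
    mapping = {"author": "creator", "course name": "title"}
    attributes = list(first_attributes(html)) + second_attributes(text, top_k=1)
    print(map_to_dublin_core(attributes, mapping))
```

Attributes with no entry in the mapping table are simply left out of the result, which mirrors the idea that only attributes corresponding to a Dublin Core field drive the retrieval of the domain-related metadata.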