TW201131388A - Domain metadata retrieval method and its system - Google Patents

Domain metadata retrieval method and its system

Info

Publication number
TW201131388A
TW201131388A TW99106440A TW99106440A TW201131388A TW 201131388 A TW201131388 A TW 201131388A TW 99106440 A TW99106440 A TW 99106440A TW 99106440 A TW99106440 A TW 99106440A TW 201131388 A TW201131388 A TW 201131388A
Authority
TW
Taiwan
Prior art keywords
data
hierarchy
field
domain
interpretation
Prior art date
Application number
TW99106440A
Other languages
Chinese (zh)
Other versions
TWI423053B (en)
Inventor
Xuan-Hua Lin
Guan-Hong Liu
Guo-Ting Huang
Jun-Ming Qiu
Jun-Zhe Huang
Original Assignee
Univ Nat Chi Nan
Priority date
Filing date
Publication date
Application filed by Univ Nat Chi Nan
Priority to TW99106440A
Publication of TW201131388A
Application granted
Publication of TWI423053B


Abstract

A domain metadata retrieval method includes the following steps: (A) receiving a plurality of domain-related web pages related to a domain; (B) analyzing the web page syntax of the domain-related web pages to obtain a plurality of first metadata attributes; (C) obtaining a plurality of keyphrases from the domain-related web pages based on a statistical language model, and using the keyphrases as a plurality of second metadata attributes; (D) retrieving a plurality of domain-related metadata records from the domain-related web pages according to the result of mapping the first and second metadata attributes to a set of Dublin Core fields; and (E) outputting the domain-related metadata to a carrier.

Description

VI. Description of the Invention

[Technical Field of the Invention]

The present invention relates to web data mining, and in particular to a domain metadata retrieval method and a system thereof.

[Prior Art]

With the rapid development and spread of network technology, the amount of information on the Web has grown explosively. The Web covers every kind of information and resembles a huge distributed database, so finding the required information quickly and correctly is an important research topic in current Web knowledge engineering.

When a user wants information on a certain domain or topic, the user can enter keywords related to that domain or topic at a search site (for example, Google, Yahoo, Bing, or Openfind), whose search engine then returns a large number of web pages related to the keywords. The user browses the returned pages and judges whether they meet the need; if they do not, the user repeats the cycle of entering keywords and browsing results. In general, the user has to repeat this cycle many times before finding suitable information, which is time-consuming, inefficient, and tedious.

When the keywords are not precise enough, the search results often lack information that is actually usable. A user can accumulate domain knowledge over repeated searches and so enter more precise keywords the next time the same domain or topic comes up, but a new domain or topic again takes considerable time before the relevant domain knowledge is built up. Furthermore, to serve users with different needs, existing websites often manage structured data in a back-end database system and present it dynamically by program according to each information need, for example through a CGI (Common Gateway Interface) program. A general search engine cannot reach such a back-end database by following hyperlinks. A great deal of useful information, known as the hidden Web or deep Web, therefore cannot necessarily be analyzed and obtained by existing search engines.

In view of this, the present invention aims to retrieve the corresponding domain metadata from domain-related web resources (including the hidden Web and the deep Web), and to provide the metadata to users and to applications of Web knowledge engineering.

[Summary of the Invention]

Accordingly, one object of the present invention is to provide a domain metadata retrieval method.

The domain metadata retrieval method of the present invention is adapted to be executed by a system and includes the following steps:
A) receiving a plurality of domain-related web pages related to a domain;
B) analyzing the web page syntax of the domain-related web pages to obtain a plurality of first metadata attributes;
C) obtaining a plurality of keyphrases from the domain-related web pages based on a statistical language model, and using the keyphrases as a plurality of second metadata attributes;
D) retrieving a plurality of domain-related metadata records from the domain-related web pages according to the result of mapping the first and second metadata attributes to a set of Dublin Core fields; and
E) outputting the domain-related metadata to a carrier.

Another object of the present invention is to provide a domain metadata retrieval system.

The domain metadata retrieval system of the present invention includes a memory unit and a domain metadata retrieval unit. The memory unit stores a plurality of domain-related web pages related to a domain. The domain metadata retrieval unit is connected to the memory unit and includes a syntax analysis module, a keyphrase statistics module, and a metadata retrieval module. The syntax analysis module analyzes the web page syntax of the domain-related web pages.

From that analysis, the syntax analysis module obtains a plurality of first metadata attributes. The keyphrase statistics module obtains a plurality of keyphrases from the domain-related web pages based on a statistical language model and uses the keyphrases as a plurality of second metadata attributes. The metadata retrieval module retrieves a plurality of domain-related metadata records from the domain-related web pages according to the result of mapping the first and second metadata attributes to a set of Dublin Core fields.

The effect of the present invention is as follows. Once the first and second metadata attributes have been mapped to the set of Dublin Core fields, the content of the domain-related web pages is searched. This not only makes information in the hidden Web or deep Web obtainable, but also improves the efficiency of the search and the accuracy of its results, and the retrieved metadata can further be used to build domain knowledge.

[Embodiment]

The foregoing and other technical contents, features, and effects of the present invention will be clearly presented in the following detailed description of a preferred embodiment with reference to the drawings.

Referring to FIG. 1, the preferred embodiment of the domain metadata retrieval system 1 of the present invention includes a user interface unit 11, a domain data collection unit 12 connected to the user interface unit 11, a database 13 connected to the user interface unit 11 and the domain data collection unit 12, and a domain metadata retrieval unit 14 connected to the user interface unit 11 and the database 13.

The user interface unit 11 includes a keyword input interface 111, a collection result selection interface 112, a Dublin Core labeling interface 113, and a domain metadata retrieval result display interface 114. The interfaces 111 to 114 are implemented in interactive form, so that the user can operate the domain metadata retrieval system 1 and see the results of its execution.

The domain data collection unit 12 includes a web page acquisition module 121 and a generalization and ranking module 122. The database 13 is one implementation of the memory unit. The domain metadata retrieval unit 14 includes a syntax analysis module 141, a keyphrase statistics module 142, and a metadata retrieval module 143.

Referring to FIGS. 1 and 2, the domain metadata retrieval method of the present invention, corresponding to the preferred embodiment of the system described above, is executed by the domain metadata retrieval system 1.
The preferred embodiment of the invention [4] f material extraction method is based on the s full release data #贞, The system includes the following steps: _ When the backup user wants to search and summarize the data of a certain field, the keyword input interface ηι of the user interface 11 of the user can input at least one of the fields corresponding to the 201131388 field. For example, when the user wants to search for information related to the tourism field, the user can input "tour" as the keyword through the keyword input interface 111. In step S301, the domain data collecting unit 12 The webpage obtaining unit 121 searches for the keyword according to the keyword, and obtains the webpages related to the keyword in the website 2, and stores the webpages in the database 13. In the preferred embodiment, the webpage The obtaining unit 121 is an existing meta search engine, such as 'WebCrawler, from which the website 2 searches for and obtains the hypertext markup language associated with the keyword. Hypertext Markup Language (hereinafter referred to as HTML) webpage. Since the technology for searching and acquiring webpages by keywords is a well-known technology and is not the focus of the present invention, the implementation details are not described herein. In step S302, the domain interpreting materials The grammar analysis module 141 of the capture unit 14 analyzes the web page grammar of the web pages to obtain a Document Object Model (DOM) tree corresponding to each web page, and uses a conventional vocabulary analysis (lexical). The analysis and indexing methods establish a token and an inverted index corresponding to each web page, and store the token and the inverse index in the database 13. For details of the vocabulary analysis and indexing method, refer to "Indexing by latent Semantic Analysis, J. Amer. Soc. Info. Sci., vol. 41, pp. 391-407, 1990." by S. Deerwester et al. 
For the indexing method, see also G. Salton et al., "A Vector Space Model for Automatic Indexing," Communications of the ACM, vol. 18, no. 11, pp. 613-620, 1975.

In step S303, the generalization and ranking module 122 of the domain data collection unit 12 first generalizes the web pages obtained in step S301 into conceptual objects of three levels of granularity: site-level conceptual objects, directory-level conceptual objects, and page-level conceptual objects. The generalization and ranking module 122 then ranks the site-level, directory-level, and page-level conceptual objects to form a domain data collection result corresponding to the keyword. Step S303 is described further below.

First, the generalization and ranking module 122 generalizes the web pages according to the following definitions. A site-level conceptual object is the Uniform Resource Locator (URL) of a website home page, as indicated by the host name registered with an Internet Domain Name Server (DNS); for example, "http://travel.network.com.tw/" is a site-level conceptual object. A directory-level conceptual object is one that contains a plurality of domain-related web pages. A website can usually be divided into several domain-related directories, and these directories are directory-level conceptual objects; for example, "http://travel.network.com.tw/tourguide/" is a directory-level conceptual object. A page-level conceptual object is a single domain-related web page, which usually contains domain-related information or hyperlinks to further domain-related information.
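As a rough illustration of the three levels defined above, the sketch below classifies a URL by its path alone. The rule (empty path means site, trailing slash means directory, anything else means page) is an assumption made for this example; the patent does not state how the levels are detected. The page URL used in the usage note is hypothetical.

```python
from urllib.parse import urlparse


def conceptual_level(url):
    """Return 'site', 'directory', or 'page' for a URL, using an assumed path-based rule."""
    path = urlparse(url).path
    if path in ("", "/"):
        return "site"        # home page of a host
    if path.endswith("/"):
        return "directory"   # a directory that groups several pages
    return "page"            # a single web page
```

With this rule, the two URLs quoted in the text come out as a site-level and a directory-level object respectively, while a hypothetical "http://travel.network.com.tw/tourguide/taipei.html" would be page-level.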
Then, the generalization and ranking module 122 ranks the site-level, directory-level, and page-level conceptual objects according to the following parameters: a cluster size, a Term Frequency-Inverse Document Frequency (TF-IDF) similarity, and an authority and hub value. In this preferred embodiment, for any one of the site-level, directory-level, and page-level conceptual objects, the generalization and ranking module 122 multiplies its cluster size, TF-IDF similarity, and authority and hub value by their respective weights and sums the products into a ranking score; the larger the ranking score, the earlier the object appears in the order. The cluster size is the number of web pages contained in the conceptual object. The TF-IDF similarity is computed with a conventional statistical technique from the user-defined keyword and the tokens of the web pages, and evaluates how important a term is to one document in a document collection or corpus; for details, see "http://en.wikipedia.org/wiki/Tf-idf". The authority and hub values are obtained with the conventional Hyperlink-Induced Topic Search (HITS) algorithm.
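The weighted sum described above can be sketched as follows, using the weights the patent lists in Table 1. Several points are assumptions of this sketch rather than statements of the patent: the "N/A" cluster-size weight for page-level objects is treated as 0, the three inputs are used as raw numbers without normalization, all levels are ranked in one list, and the dictionary keys and sample objects are hypothetical.

```python
# Weights per level, as listed in Table 1 of the patent.
WEIGHTS = {
    "site":      {"cluster_size": 1, "authority_hub": 2, "tfidf": 3},
    "directory": {"cluster_size": 1, "authority_hub": 2, "tfidf": 3},
    "page":      {"cluster_size": 0, "authority_hub": 1, "tfidf": 2},  # N/A treated as 0 (assumption)
}


def ranking_score(level, cluster_size, authority_hub, tfidf):
    """Weighted sum of the three ranking parameters for one conceptual object."""
    w = WEIGHTS[level]
    return (w["cluster_size"] * cluster_size
            + w["authority_hub"] * authority_hub
            + w["tfidf"] * tfidf)


def rank(objects):
    """Return the conceptual objects sorted by descending ranking score."""
    return sorted(
        objects,
        key=lambda o: ranking_score(o["level"], o["cluster_size"],
                                    o["authority_hub"], o["tfidf"]),
        reverse=True,
    )


# Hypothetical conceptual objects with made-up parameter values.
objects = [
    {"name": "a", "level": "page", "cluster_size": 1, "authority_hub": 0.5, "tfidf": 0.25},
    {"name": "b", "level": "site", "cluster_size": 10, "authority_hub": 0.5, "tfidf": 0.25},
]
ordered = [o["name"] for o in rank(objects)]
```

Because the cluster size is an unbounded page count while the other two values are typically small, a practical implementation would probably rescale the inputs first; the patent text does not say whether it does.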

For details of the HITS algorithm, see "http://en.wikipedia.org/wiki/HITS_algorithm". The weights used in step S303 are listed in Table 1.

Table 1
                           Site level    Directory level    Page level
Cluster size               weight = 1    weight = 1         N/A
Authority and hub value    weight = 2    weight = 2         weight = 1
TF-IDF similarity          weight = 3    weight = 3         weight = 2

In step S304, to make the subsequent retrieval of domain metadata more accurate and meaningful, the user interface unit 11 presents the domain data collection result to the user through the collection result selection interface 112. Through the same interface the user selects the domain-related web pages to be used for retrieving domain metadata; the selected domain-related web pages are sent to the domain metadata retrieval unit 14 and stored in the database 13. The domain-related web pages are a subset of the web pages obtained in step S301. Because the domain data collection result produced by steps S301 to S303 is presented in generalized and ranked form, the user can pick out the domain-related web pages easily and quickly.

It is worth noting that, although having the user select the domain-related web pages greatly improves the accuracy of the subsequent processing, the present invention can also automatically take the web pages corresponding to the top N conceptual objects (site-level, directory-level, or page-level) in the domain data collection result as the domain-related web pages and continue with the subsequent processing.

In step S305, the domain metadata retrieval unit 14 obtains the domain-related web pages.

In step S306, the syntax analysis module 141 of the domain metadata retrieval unit 14 further analyzes the web page syntax of the domain-related web pages to obtain a plurality of first metadata attributes. In general, the area of a website related to its search service is marked up inside an HTML <form> element. In this preferred embodiment, the syntax analysis module 141 therefore analyzes the attribute-value pairs in the HTML <form> elements of the domain-related web pages to obtain the first metadata attributes. The relationship between attribute and value can be one-to-one (1-to-1 mapping), many-to-one (N-to-1 mapping), or one-to-many (1-to-N mapping); the syntax analysis module 141 uses only the attribute-value pairs whose relationship is 1-to-1 or 1-to-N to obtain the first metadata attributes.
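One way to picture the attribute-value analysis of step S306 is the sketch below, which uses Python's standard `html.parser`. It assumes that the `name` of a form control plays the role of the attribute and its `value` entries play the role of the values; that reading, the class name, and the sample form (including its `action` and field names) are assumptions for illustration. The sketch does not try to detect N-to-1 relationships.

```python
from html.parser import HTMLParser


class FormAttributeParser(HTMLParser):
    """Collect attribute -> list of values for controls found inside <form> elements."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.current_select = None
        self.pairs = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif not self.in_form:
            return  # ignore controls outside any form
        elif tag == "input" and attrs.get("name"):
            # One input carries one value: a 1-to-1 pair.
            self.pairs.setdefault(attrs["name"], []).append(attrs.get("value") or "")
        elif tag == "select" and attrs.get("name"):
            # A select groups several option values: a 1-to-N pair.
            self.current_select = attrs["name"]
            self.pairs.setdefault(self.current_select, [])
        elif tag == "option" and self.current_select is not None:
            self.pairs[self.current_select].append(attrs.get("value") or "")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False
        elif tag == "select":
            self.current_select = None


# Hypothetical search form of a travel site.
html = """
<form action="/search">
  <input type="text" name="keyword" value="">
  <select name="region">
    <option value="north">North</option>
    <option value="south">South</option>
  </select>
</form>
<input name="outside" value="ignored">
"""
parser = FormAttributeParser()
parser.feed(html)
pairs = parser.pairs
```

Here `region` with its two option values is an example of a 1-to-N pair, and `keyword` is a 1-to-1 pair; the control outside the form is skipped.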

In step S307, the keyphrase statistics module 142 of the domain metadata retrieval unit 14 uses the tokens of the domain-related web pages built in step S302 and, based on a statistical language model, obtains a plurality of keyphrases from the domain-related web pages and uses the keyphrases as a plurality of second metadata attributes. In this preferred embodiment, an n-gram language model is used for the statistics; for details, see W. B. Cavnar et al., "N-Gram-Based Text Categorization," Proc. 3rd Symp. on Document Analysis and Information Retrieval, pp. 161-175, 1994, and L. F. Chien, "PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval," Proc. 24th ACM SIGIR Int'l Conf. Research and Development in Information Retrieval, pp. 50-58, 1997.
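The idea of n-gram statistics over page tokens can be sketched with a simple frequency count. This is only a stand-in: it is not the PAT-tree method of the cited paper, and the frequency threshold, the minimum phrase length, the function names, and the sample tokens are all assumptions of the example.

```python
from collections import Counter


def ngram_counts(tokens, n_max=3):
    """Count every n-gram of length 1..n_max in one token list."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def keyphrases(token_lists, n_max=3, min_len=2, min_count=2):
    """Return n-grams of at least min_len tokens seen at least min_count times."""
    total = Counter()
    for tokens in token_lists:
        total.update(ngram_counts(tokens, n_max))
    frequent = [(gram, count) for gram, count in total.items()
                if len(gram) >= min_len and count >= min_count]
    # Most frequent first, then longer phrases, then alphabetical for stability.
    frequent.sort(key=lambda item: (-item[1], -len(item[0]), item[0]))
    return [" ".join(gram) for gram, _ in frequent]


# Hypothetical token lists of two travel pages.
token_lists = [
    ["hot", "spring", "hotel"],
    ["hot", "spring", "tour"],
]
phrases = keyphrases(token_lists)
```

For Chinese pages the tokens would more likely be single characters, in which case the n-grams are character sequences; the counting logic stays the same.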
In step S308, the user interface unit 11 provides the Dublin Core labeling interface 113, through which the user maps the first and second metadata attributes to a set of Dublin Core fields. Dublin Core is a metadata format widely used for network resources, government publications, library collections, and museum collections. Mapping the first and second metadata attributes to the set of Dublin Core fields makes the formats of different kinds of metadata interoperable and so improves the interoperability of metadata retrieved across different websites. The set of Dublin Core fields includes 15 fields: title, subject, description, type, date, publisher, creator, contributor, format, identifier, source, language, relation, coverage, and rights.

Taking the travel domain as an example again, the Dublin Core labeling interface 113 is as shown in FIG. 3, and the mapping results obtained after labeling through the Dublin Core labeling interface 113 are summarized in Table 2.

[Table 2: mapping of the travel-domain metadata attributes to Dublin Core fields. Only the entries "accommodation, hotel, country" are recoverable.]
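A mapping of the kind produced in step S308 can be held in a plain dictionary and checked against the 15 Dublin Core element names. The sketch below is an illustration only: the attribute-to-field pairs are hypothetical and are not the contents of Table 2, and the helper names are invented for this example.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_FIELDS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def check_mapping(mapping):
    """Return the mapping unchanged, or raise ValueError on a non-Dublin-Core target."""
    unknown = sorted(set(mapping.values()) - DUBLIN_CORE_FIELDS)
    if unknown:
        raise ValueError(f"not Dublin Core fields: {unknown}")
    return mapping


def to_record(page_attributes, mapping):
    """Regroup one page's attribute -> values dict under Dublin Core field names."""
    record = {}
    for attribute, values in page_attributes.items():
        field = mapping.get(attribute)
        if field is not None:  # unmapped attributes are dropped
            record.setdefault(field, []).extend(values)
    return record


# Hypothetical mapping and page data.
mapping = check_mapping({"hotel": "title", "country": "coverage"})
record = to_record(
    {"hotel": ["Grand Hotel"], "country": ["Taiwan"], "session_id": ["42"]},
    mapping,
)
```

Keeping the mapping as data rather than code makes it easy to accumulate per domain, which is what the later automatic mapping relies on.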

It is worth noting that, once the first website of a domain has been labeled in step S308, other websites of the same domain can be mapped automatically from the accumulated mapping data. The judgment is made by sequence alignment, or by computing an identity score, character by character. Taking the library domain as another example, suppose the metadata attribute "book/journal title" has been mapped to the Dublin Core field "Title". For other websites in the library domain, each metadata attribute is compared with the previously established mapping results by character-level sequence alignment or identity score, and is thereby mapped "automatically" to the set of Dublin Core fields. The more data accumulated for a domain, the more complete the mapping results become; during this process, manual corrections can also be made where needed to improve the mapping accuracy step by step.
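The automatic mapping described above can be sketched with a character-level similarity measure. The patent names sequence alignment and an identity score but gives no formula, so this sketch substitutes `difflib.SequenceMatcher.ratio()` from the Python standard library as a stand-in; the threshold of 0.6, the function names, and the sample attributes are assumptions.

```python
from difflib import SequenceMatcher


def identity_score(a, b):
    """Character-level similarity in [0, 1]; difflib's ratio is used as a stand-in."""
    return SequenceMatcher(None, a, b).ratio()


def auto_map(attribute, known_mappings, threshold=0.6):
    """Return the Dublin Core field of the most similar known attribute, or None."""
    best_field, best_score = None, 0.0
    for known_attribute, field in known_mappings.items():
        score = identity_score(attribute, known_attribute)
        if score > best_score:
            best_field, best_score = field, score
    return best_field if best_score >= threshold else None


# Mappings accumulated from an already-labeled library website (hypothetical).
known = {"book/journal title": "title", "author": "creator"}
suggested = auto_map("book title", known)
```

A result of `None` marks an attribute that is too dissimilar to anything seen so far, which is where the manual correction mentioned in the text would come in.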
In steps S309 to S310, the metadata retrieval module 143 of the domain metadata retrieval unit 14 first uses the first and second metadata attributes that have been mapped to the Dublin Core fields to retrieve, automatically, a plurality of domain-related metadata records from the domain-related web pages (for example, by extracting all the corresponding attribute-value pairs). The metadata is then output to a carrier: for example, it is stored in the database 13 for later construction of domain knowledge, or presented to the user through the domain metadata retrieval result display interface 114 of the user interface unit 11.

In summary, by mapping the first and second metadata attributes to the set of Dublin Core fields and then retrieving the content of the domain-related web pages, the present invention not only obtains information in the hidden Web or deep Web through each website's own search service, but also improves the efficiency of the search and the accuracy of its results. The retrieved metadata can further be used to build specialized domain knowledge, so the object of the present invention is indeed achieved.

The above is only a preferred embodiment of the present invention and does not limit the scope in which the invention may be practiced. All simple equivalent changes and modifications made according to the claims and the description of the invention remain within the scope covered by this patent.

[Brief Description of the Drawings]

FIG. 1 is a block diagram illustrating a preferred embodiment of the domain metadata retrieval system of the present invention;
FIG. 2 is a flowchart illustrating the domain metadata retrieval method of the present invention; and
FIG. 3 is a schematic diagram illustrating a Dublin Core labeling interface of a user interface unit.

[Description of Main Reference Numerals]

1 ............ domain metadata retrieval system
11 ........... user interface unit
111 .......... keyword input interface
112 .......... collection result selection interface
113 .......... Dublin Core labeling interface
114 .......... domain metadata retrieval result display interface
12 ........... domain data collection unit
121 .......... web page acquisition module
122 .......... generalization and ranking module
13 ........... database
14 ........... domain metadata retrieval unit
141 .......... syntax analysis module
142 .......... keyphrase statistics module
143 .......... metadata retrieval module
2 ............ websites
S301-S310 .... steps


Claims (16)

VII. Scope of the Patent Application:

1. A domain metadata retrieval method, adapted to be executed by a system, the method comprising the following steps:
A) receiving a plurality of domain-related web pages related to a domain;
B) analyzing the web page syntax of the domain-related web pages to obtain a plurality of first metadata attributes;
C) obtaining a plurality of keyphrases from the domain-related web pages based on a statistical language model, and using the keyphrases as a plurality of second metadata attributes;
D) retrieving a plurality of domain-related metadata records from the domain-related web pages according to the result of mapping the first and second metadata attributes to a set of Dublin Core fields; and
E) outputting the domain-related metadata to a carrier.

2. The domain metadata retrieval method of claim 1, wherein in step B) a plurality of attribute-value pairs corresponding to the domain-related web pages are analyzed to obtain the first metadata attributes.

3. The domain metadata retrieval method of claim 1, wherein in step C) the statistics are performed based on an n-gram language model to obtain the keyphrases.

4. The domain metadata retrieval method of claim 1, further comprising the following steps before step A):
F) searching a plurality of websites according to at least one keyword related to the domain, to obtain a plurality of web pages related to the keyword;
G) generalizing each web page into one of a site-level conceptual object, a directory-level conceptual object, and a page-level conceptual object;
H) ranking the site-level, directory-level, and page-level conceptual objects; and
I) obtaining the domain-related web pages according to the ranking result of step H);
wherein the domain-related web pages belong to the web pages.

5. The domain metadata retrieval method of claim 4, wherein in step H) any one of the site-level, directory-level, and page-level conceptual objects is ranked according to its corresponding cluster size, Term Frequency-Inverse Document Frequency similarity, and authority and hub value.

6. The domain metadata retrieval method of claim 5, wherein in step H), for any one of the site-level, directory-level, and page-level conceptual objects, its corresponding cluster size, Term Frequency-Inverse Document Frequency similarity, and authority and hub value are multiplied by respective weights and summed into a ranking score, and the ranking is performed according to the ranking score.

7. A domain metadata retrieval system, comprising:
a memory unit storing a plurality of domain-related web pages related to a domain; and
a domain metadata retrieval unit connected to the memory unit and including a syntax analysis module, a keyphrase statistics module, and a metadata retrieval module, the syntax analysis module analyzing the web page syntax of the domain-related web pages to obtain a plurality of first metadata attributes, the keyphrase statistics module obtaining a plurality of keyphrases from the domain-related web pages based on a statistical language model and using the keyphrases as a plurality of second metadata attributes, and the metadata retrieval module retrieving a plurality of domain-related metadata records from the domain-related web pages according to the result of mapping the first and second metadata attributes to a set of Dublin Core fields.

8. The domain metadata retrieval system of claim 7, wherein the syntax analysis module analyzes a plurality of attribute-value pairs corresponding to the domain-related web pages to obtain the first metadata attributes.

9. The domain metadata retrieval system of claim 7, wherein the keyphrase statistics module performs the statistics based on an n-gram language model to obtain the keyphrases.

10. The domain metadata retrieval system of claim 7, further comprising a user interface unit connected to the memory unit and the domain metadata retrieval unit, the user interface unit including a Dublin Core labeling interface through which a user maps the first and second metadata attributes to the set of Dublin Core fields.

11. The domain metadata retrieval system of claim 10, wherein the user interface unit further includes a domain metadata retrieval result display interface for presenting the domain-related metadata to the user.

12. The domain metadata retrieval system of claim 10, further comprising a domain data collection unit connected to the memory unit and the domain metadata retrieval unit, the domain data collection unit including a web page acquisition module and a generalization and ranking module, the web page acquisition module searching a plurality of websites according to at least one keyword related to the domain to obtain a plurality of web pages related to the keyword, and the generalization and ranking module generalizing each web page into one of a site-level conceptual object, a directory-level conceptual object, and a page-level conceptual object, and ranking the site-level, directory-level, and page-level conceptual objects to form a domain data collection result corresponding to the keyword.

13. The domain metadata retrieval system of claim 12, wherein, for any one of the site-level, directory-level, and page-level conceptual objects, the generalization and ranking module ranks the conceptual objects according to the corresponding cluster size, Term Frequency-Inverse Document Frequency similarity, and authority and hub value.

14. The domain metadata retrieval system of claim 13, wherein, for any one of the site-level, directory-level, and page-level conceptual objects, the generalization and ranking module multiplies its corresponding cluster size, Term Frequency-Inverse Document Frequency similarity, and authority and hub value by respective weights and sums them into a ranking score, and ranks the conceptual objects according to their respective ranking scores.

15. The domain metadata retrieval system of claim 12, wherein the user interface unit further includes a keyword input interface through which the user enters the keyword related to the domain.

16. The domain metadata retrieval system of claim 12, wherein the user interface unit further includes a collection result selection interface through which the user selects the domain-related web pages from the domain data collection result, the domain-related web pages belonging to the web pages, and the user interface unit further stores the domain-related web pages in the memory unit.
As for the group of Dublin cores, according to the scope of the application for patents, the full range of data in the first paragraph of the scope of the patent application & fruit display - two users, one side also includes - field transfer information, take the picture ·丨面,, used to relate users to these fields. The full release of poverty is presented to the field of information in accordance with paragraph 1 of the scope of the patent application. It also contains a data collection unit with the memory unit and the data interpretation unit of the field: 18 12. 201131388. The data collection unit of the domain includes a webpage obtaining module, and an inductive and sorting module, wherein the webpage obtaining module is configured to perform a search on a plurality of websites according to at least one keyword related to the domain to obtain a plurality of webpages associated with the keyword, the inductive and sorting module is used to group each webpage into a conceptual object of a website hierarchy, a conceptual object of a directory hierarchy, and a conceptual object of a webpage hierarchy, and The hierarchical objects of the website, the directory hierarchy, and the conceptual objects of the web page are sorted to form a data collection result in one of the corresponding keywords. η• According to the towel, please (4) encircle the field mentioned in 12 items (4) data manipulation system, in which the concept of the website class, the directory class, and the page level is the towel--the sorting mode_ The ranking of the conceptual objects of the website hierarchy, the directory hierarchy, and the webpage hierarchy is performed according to one of its corresponding cluster size, vocabulary frequency, anti-file frequency similarity, and an authoritative and divergent value. 14. 
According to the application for patents _ 13 of the full release data extraction system, in which the site hierarchy, the directory hierarchy, and the webpage concept object - the 'induction and sorting module system corresponding to it The cluster size, the vocabulary frequency _ anti-file frequency similarity, and (d) multiply the divergence values by respective weights, then total __ = 'and the site hierarchy' directory hierarchy, and the page hierarchy: Other sorting points are sorted. ~ 15. According to the scope of the patent, please refer to the field of information in the 12th paragraph of the patent scope, where the user interface unit also includes the keyword wheeling interface, 19 201131388 16. For the user to enter and The domain is related to this keyword. According to the field of patent application 胄12, the field of the data is provided, wherein the user interface unit further includes a collection result selection interface for the user to select the fields from the data collection result in the field: The related webpages such as Yanhai belong to the webpages, and the user interface unit also stores related webpages in the domain in the quotation unit. 2020
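Illustrative note (not part of the claims): the weighted ranking recited in claims 6 and 14 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patentees' implementation. The class name `ConceptObject`, the function names, the weight values, the normalization of each factor to a comparable scale, and the descending sort order are all hypothetical; the claims only require that the three factors be multiplied by respective weights and summed into a ranking score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptObject:
    """A website-, directory-, or page-level concept object (claims 4 and 12)."""
    name: str
    level: str                   # "website", "directory", or "page"
    cluster_size: float          # assumed here to be pre-normalized to [0, 1]
    tfidf_similarity: float      # TF-IDF similarity to the keyword
    authority_divergence: float  # authority and divergence value


# Hypothetical weights; the claims do not specify any values.
WEIGHTS = (0.2, 0.5, 0.3)


def ranking_score(obj, weights=WEIGHTS):
    """Weighted sum of the three factors, following claims 6 and 14."""
    w_cluster, w_tfidf, w_auth = weights
    return (w_cluster * obj.cluster_size
            + w_tfidf * obj.tfidf_similarity
            + w_auth * obj.authority_divergence)


def rank(objects, weights=WEIGHTS):
    """Sort concept objects by ranking score, highest first (an assumption)."""
    return sorted(objects, key=lambda o: ranking_score(o, weights), reverse=True)


if __name__ == "__main__":
    candidates = [
        ConceptObject("site-a", "website", 0.9, 0.2, 0.4),
        ConceptObject("dir-b", "directory", 0.3, 0.8, 0.5),
        ConceptObject("page-c", "page", 0.1, 0.6, 0.9),
    ]
    for obj in rank(candidates):
        print(obj.level, obj.name, round(ranking_score(obj), 3))
```

With these made-up inputs the directory-level object ranks first, because the assumed weights favor TF-IDF similarity; a different choice of weights would change the order.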

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW99106440A TWI423053B (en) 2010-03-05 2010-03-05 Domain Interpretation Data Retrieval Method and Its System

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW99106440A TWI423053B (en) 2010-03-05 2010-03-05 Domain Interpretation Data Retrieval Method and Its System

Publications (2)

Publication Number Publication Date
TW201131388A (en) 2011-09-16
TWI423053B (en) 2014-01-11

Family

ID=50180351

Family Applications (1)

Application Number Title Priority Date Filing Date
TW99106440A TWI423053B (en) 2010-03-05 2010-03-05 Domain Interpretation Data Retrieval Method and Its System

Country Status (1)

Country Link
TW (1) TWI423053B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI681303B (en) * 2018-10-19 2020-01-01 國立暨南國際大學 Flexible web data management system and a method thereof

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6389467B1 (en) * 2000-01-24 2002-05-14 Friskit, Inc. Streaming media search and continuous playback system of media resources located by multiple network addresses
US7693827B2 (en) * 2003-09-30 2010-04-06 Google Inc. Personalization of placed content ordering in search results
US20090089275A1 (en) * 2007-10-02 2009-04-02 International Business Machines Corporation Using user provided structure feedback on search results to provide more relevant search results
TWI385540B (en) * 2007-11-02 2013-02-11 Intumit Inc Article content value-added service system and method of the same

Also Published As

Publication number Publication date
TWI423053B (en) 2014-01-11

Similar Documents

Publication Publication Date Title
KR101450358B1 (en) Searching structured geographical data
US9262532B2 (en) Ranking entity facets using user-click feedback
JP5256293B2 (en) System and method for including interactive elements on a search results page
KR101443475B1 (en) Search suggestion clustering and presentation
CN103365924B (en) A kind of method of internet information search, device and terminal
TWI695277B (en) Automatic website data collection method
CN101441636A (en) Hospital information search engine and system based on knowledge base
WO2002101588A1 (en) Content management system
KR100954842B1 (en) Method and System of classifying web page using category tag information and Recording medium using by the same
Zhao et al. A new keywords method to improve web search
Jeong et al. Efficient keyword extraction and text summarization for reading articles on smart phone
TW201131388A (en) Domain metadata retrieval method and its system
CN102495844B (en) Improved GuTao method for creating user models
Zhang et al. A semantics-based method for clustering of Chinese web search results
JP2007012100A (en) Retrieval method and retrieval device or information providing system based on personal information
Thanadechteemapat et al. Thai word segmentation for visualization of thai web sites
Walther et al. Federated product search with information enrichment using heterogeneous sources
Rocha et al. LODifying personal content sharing
Wen et al. Automatic Web page classification using various features
Barman et al. Ad-hoc information retrieval focused on wikipedia based query expansion and entropy based ranking
Sengupta et al. Semantic thumbnails: a novel method for summarizing document collections
Akamine et al. Development of a large-scale web crawler and search engine infrastructure
Kanavos et al. Extracting knowledge from web search engine using wikipedia
Li et al. Timeline: a Chinese event extraction and exploration system
OMAR et al. Gathering Web pages of entities with high precision

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees