1293737

IX. Description of the Invention:

[Technical Field of the Invention]
The present invention relates to an extraction method. In particular, it relates to a method in which a rule template is marked on a training web page in order to find other extraction rules, and in which further similar extraction rules are learned from a testing web page, so that the set of extraction rules becomes more complete and the extraction system is improved.

[Prior Art]
Web pages are the prevailing way of presenting online data today, and they naturally contain many kinds of information, for example the main data area, navigation links, advertisements, pictures, and so on. The part that most users are interested in is most likely the main data area, so it is necessary to filter and integrate the information: the main data area of interest is extracted and the material that users do not care about is filtered out. For example, ACM, IEEE, and CiteSeer are all paper websites. If websites of this type could be integrated, users could more conveniently receive information from several sites of the same type, saving search time and receiving more information. For this reason each website must have its own extraction rules (Extraction Rule), with which the information the user wants is extracted.
With the Internet as developed as it is today, much data is stored in databases and then presented through web pages. These pages are currently generated by Common Gateway Interface (CGI) programs, so pages generated by the same CGI program necessarily follow a regular pattern, and that pattern can be used in reverse to extract the data.

The display types of such web pages fall roughly into two classes:
(A) Singular page: a single record is displayed on the page; and
(B) Multi-record page: two or more records are displayed on the page.
A program that extracts information from such pages is called a wrapper. A wrapper retrieves the original data source and stores the data in a self-defined format so that the processed data can be integrated further. Because web pages change frequently, the trend in web information extraction is to design a learnable extraction system; that is, the system must generate the corresponding extraction rules from the content to be extracted on a training page and hand them to the wrapper for processing.

Extracting web data is difficult, however, mainly because a record may comprise several attributes. When the attributes appear in a different order or are not uniform across records, the extraction rules often fail and the information the user wants cannot be extracted correctly.

Several current approaches to generating extraction rules are described below.

I. Web information extraction systems that require the user to label: the user must mark the desired information as an example, from which extraction rules are generated to extract data similar to the example.
Two methods of this kind are described below.

1. WIEN: a learnable information extraction system proposed by Nicholas Kushmerick in 1997, which uses different extraction programs for web pages of different formats. The system uses the Head-Open-Close-Left-Right-Tail (HOCLRT) technique, which combines the Open-Close-Left-Right (OCLR) technique with the Head-Left-Right-Tail (HLRT) technique. OCLR treats a page as a sequence of records with irrelevant text between them, and can be used to find the delimiters between records. HLRT divides the page into three parts: the head, the block containing the data, and the tail. The user first marks the start of the page and the start of the data as the head delimiter, and then marks the end of the data and the end of the page as the tail delimiter; the data block lies between the two. Each record then has several attributes, each enclosed by its own delimiters. An attribute is extracted from its left delimiter up to its right delimiter, then the second attribute is handled, and so on. Although WIEN uses different extraction programs, each program can only extract pages of a fixed form; when attributes are missing or not uniform, extraction fails.

2. SoftMealy: a system published in 1998 by Dr. Chun-Nan Hsu et al. Its distinguishing feature is that the extraction program is represented as a finite-state transducer (FST) and extraction is performed through contextual rules, which are induced from the user's labeling of attributes. However, any attribute ordering that has not been seen before must be labeled again. A multi-pass approach that extracts each attribute separately was proposed later, but its steps are complicated and its execution efficiency is low.

II. Label-free information extraction systems: as the name suggests, the user need not mark the desired information or its attributes. The system analyzes the repetitive blocks of the page and determines from them the information the user wants. Two methods of this kind are described below.

1. Record Boundary: published in 2000 by D. W. Embley, Y. K. Ng, and Y. S. Jiang. Unlike earlier wrapper-generation work, it mainly analyzes the consistent separators in multi-record pages and proposes a way to find the separator between one record and the next. The method combines statistical ideas with specific heuristics to estimate candidate separators, so it requires a multi-record page and is unsuitable for singular pages. Its drawbacks are that separating records by only one or two tags is not ideal, since a single tag may also occur outside the data block, and that it only finds record boundaries and cannot extract any further.

2. IEPAD: published in 2001 by Dr. Chia-Hui Chang et al. The method does not require the user to label examples; instead the user selects a record pattern and the desired attributes to generate extraction rules. Exploiting the repeated records of a page, it uses the PAT tree (Patricia tree), a data structure widely used in information extraction, to find strings that share the same prefix among the repeated records; these are called repetitive patterns. This method, too, has complicated steps and low execution efficiency.

The conventional techniques therefore do not meet users' needs in practical use.
[Summary of the Invention]
Accordingly, the present invention proposes an extraction method whose main purpose is to extract web page information in a signal-based manner. It distinguishes the positions of data blocks from those of non-data blocks, obtains the information the user wants, and at the same time mines its extraction rules, so as to reduce the time users spend searching for information on the Internet and to add value through information integration.

To achieve the above purpose, the present invention provides a method that separates the individual records in a multi-record data block. The user marks one record in a training page, from which other similar records are found and extraction rules are established. Then, in a testing page, the established extraction rules are used to look for data. When a new type of information appears that is similar to, but not entirely equal to, an existing extraction rule, an extraction rule is added accordingly, making the rule set more complete.

[Embodiments]
The present invention first represents an information web page as a signal (called signalization). After a user is asked to mark the desired information in a training page, the marked information is likewise signalized. A measure called the histogram and boundary-tag-based correlation coefficient (HBCC) is then used to find, in the signalized page, the blocks similar to the signalized desired information. These blocks produce high peaks, from which the records of the desired information in the page are identified, so that the invention finds both the desired information and its extraction rules.
In a testing page, an on-line learning mechanism is further applied to learn other extraction rules similar to the user's desired information, so that a more complete rule set is obtained and the extraction system is improved.

The present invention therefore comprises a module and a method for extracting web page information, detailed below.

(I) Module for extracting web page information

The module comprises a training module and a testing module. The training module processes the training page to establish the main extraction rules. The testing module then processes the testing page to find extraction rules that were not established from the training page and creates new rules, which gives the module its on-line learning mechanism. The module comprises the following parts.

(1) The training module establishes the main extraction rules through the following steps:

(A) Using the extraction method of the present invention, an information example and the training page are signalized and compared to find the data in the training page similar to the example, yielding a plurality of high peaks that exceed a preset threshold. The next step starts from the first high peak.
(B) Take the high peak. When its value is not 1, and the peak differs in peak value or in total tag count from every type of extraction rule currently in the rule base, all the tags lying between the page content represented by this peak and the page content represented by the next peak are added as an extraction rule, and the rule is stored in the rule base.
(C) Repeat with the next high peak, deciding as in step (B) whether to add an extraction rule, until the last high peak.

In this way extraction rules of several types are obtained, although the total tag count of such a rule is not necessarily the same as that of the information example.

(2) The testing module applies an on-line learning function to add extraction rules that were not established from the training page. The on-line learning mechanism comprises the following steps:

(A) Using the rule base established from the training page, compute the average tag count of all extraction rules in the rule base.
(B) Using the extraction method of the present invention, take each extraction rule in the rule base in turn as the information example and find the earliest high peak in the testing page. If the peak value is 1, the rule base already holds this type of rule for the testing page and no repeated extraction is needed, so move directly to the next high peak until the first peak whose value is not 1 is obtained, and also obtain the high peak that follows it.
(C) Set the start of a new extraction rule to the tag in the training page represented by that high peak, and set its end to the tag immediately before the tag represented by the following high peak.
(D) When the total tag count of the new rule is greater than one half and less than three halves of the average tag count, the new rule is similar, but not identical, to the rule taken from the rule base, and it is added to the rule base.
(E) Repeat for each subsequent high peak whose value is not 1 and the peak that follows it, deciding as in steps (C) and (D) whether to add a new rule to the rule base, until the last high peak of the testing page whose value is not 1.

In this way, extraction rules are obtained that are absent from, but similar to, those in the rule base produced from the training page. The HBCC value and total tag count of such a new rule are not necessarily the same as those of the information example.

(II) Method for extracting web page information

The method represents an information page as a signal and asks the user to mark the desired information, which is also signalized. The HBCC measure is then applied to the signalized page and the signalized desired information to find the similar blocks. By setting a threshold, high peaks above the threshold are obtained, from which one can determine how many records of the desired information appear in the page and where each record starts and ends, and hence find the information the user needs.

The method comprises the following steps.

(A) Marking the information example: the user marks, in an information page, an information example that matches the desired information, and the example is signalized.

The method is characterized by applying signal processing to information extraction, which suits multi-record pages because such a page contains many repeated records.
If a multi-record information page is signalized (that is, represented as a signal), its data blocks can be analyzed by signal processing.

The source of a web page basically contains two kinds of information: Hypertext Markup Language (HTML) tags and plain text. The signalization of the present method therefore assigns a specific number to each type of HTML tag in the data to be signalized. Tags divide into symmetric tags and asymmetric tags. For symmetric tags, the number assigned to a left tag (such as <a>), with a minus sign added, represents the corresponding right tag (such as </a>). For example, if <a> is represented by 10, then </a> is represented by -10. In this way every tag type is represented by a fixed number. There are also non-structural tags, such as <!> comment tags; these are numbered 0 and are disregarded when the histogram and the boundary tags are computed for the HBCC.

Besides assigning numbers to HTML tags as above, the signalization performs a histogram statistic, or a normalized histogram statistic (normalized by the number of tag types), on the page data. The histogram statistic arranges the different HTML tag types of the data in their order of appearance and accumulates the number of occurrences of each type to yield a statistical result. Fig. 1 compares the waveforms of the histogram results of an information example and a data window; the horizontal axis is the order of appearance of the tag types and the vertical axis is the number of occurrences of each tag type (graph data taken from http://www.dkfz-heidelbera.de/mrph vs/fmri/hmever/fmri basics/cc.html).

Normalization of the histogram statistic means the following. When a data window (described below) is taken from the page, it does not necessarily contain as many tag types as the information example. When it has fewer, the method pads the missing types so that the horizontal axis covers the full number of tag types, with the vertical value of each padded type set to 0. Conversely, when the data window has more tag types, it is normalized to the number of tag types of the information example. By comparing the histogram result of the information example with the normalized histogram result of a data window of the page, an HBCC value can be computed that expresses their similarity.

This step therefore first asks the user to mark the desired information in the information page as the information example, and signalizes the example so that the page can subsequently be analyzed by signal processing.

(B) Taking a data window: starting from the beginning of the information page, a data window is taken and processed by the signalization described above.

The page contains a large number of tags, while the information example is only one record. The method therefore takes only a portion of the page whose total tag count equals that of the information example; this portion is called a data window. The normalized histogram statistic is then performed on the data window in preparation for the comparison that follows.
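The signalization, the histogram statistic with its normalization, and the data window described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the regular expression, the numbering of tag types in order of first appearance, the choice to drop tag types that the information example lacks, and the one-tag stride of the data window are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical tag matcher: captures an optional "/" and the tag name.
# "!" covers comment-like tags such as <!-- ... -->, which get code 0.
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*|!)[^>]*>")


def signalize(html, codebook=None):
    """Turn the tags of an HTML string into a list of signed integers.

    A left tag gets a positive code, its right tag the negated code,
    and non-structural (comment) tags get 0. Codes are handed out in
    order of first appearance, which is an arbitrary choice here.
    """
    codebook = {} if codebook is None else codebook
    signal = []
    for closing, name in TAG_RE.findall(html):
        name = name.lower()
        if name == "!":
            signal.append(0)
            continue
        code = codebook.setdefault(name, len(codebook) + 1)
        signal.append(-code if closing else code)
    return signal, codebook


def histogram(signal):
    """Count each tag code in order of first appearance, ignoring code 0."""
    return dict(Counter(code for code in signal if code != 0))


def normalized_histogram(window_signal, example_types):
    """Align a window's histogram to the tag types of the information example.

    Types missing from the window are padded with 0. Types the example
    does not have are dropped (an assumption about "normalizing down").
    """
    counts = histogram(window_signal)
    return [counts.get(code, 0) for code in example_types]


def data_windows(page_signal, size):
    """Yield (start, window) pairs of `size` tags, sliding one tag at a time."""
    for start in range(len(page_signal) - size + 1):
        yield start, page_signal[start:start + size]


signal, codebook = signalize('<tr><td><a href="x">A</a></td></tr><!-- ad -->')
print(signal)  # [1, 2, 3, -3, -2, -1, 0]
```

Keeping the codebook as a return value lets the same tag-to-number mapping be reused for the information example and for the page, which the comparison in the later steps relies on.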
(C) Computing HBCC values to obtain the waveform: the signalized data window is compared with the information example to compute an HBCC value, and the waveform is obtained.

A correlation coefficient physically represents how the frequencies of two signals correlate when they are compared along a time axis, and the HBCC value of the present method is likewise a correlation coefficient. It is computed in the following steps:

Step (I): perform a correlation between the information example and the data window to obtain a coefficient value.

Step (II): take the boundary tags of the information example and of the data window to obtain a boundary-tag value. The page contains data blocks that match the desired information but also non-data blocks that do not, so the method adds boundary-tag information in order to judge whether the starting position of an HBCC high peak really is the desired information.

The boundary tags are obtained as follows. The method gives extra weight to the order of the tags that appear in the first tenth or the first fifth of the information example, and the fraction depends on the example's total tag count: if the total is less than 40, the first fifth of the tags are taken as boundary tags; if it is more than 40, the first tenth are taken.

When the tags in the first tenth or first fifth of the data window appear in the same order as those in the first tenth or first fifth of the information example, the boundary-tag value is set to 1; otherwise it is 0.

Step (III): average the coefficient value obtained by correlating the data window with the information example and the boundary-tag value to obtain the HBCC value.

After the HBCC value between the first data window and the information example is computed, another data window is taken in the page, starting from the next HTML tag of the first data window, and another HBCC value is computed. This is repeated until the end of the data, which yields the HBCC waveform.

(D) Obtaining the high peaks: a threshold is set and the peaks in the waveform that exceed the threshold are found. The parts of the page that these high peaks represent are the information the user needs.

Fig. 2 shows an HBCC waveform of the method; the horizontal axis is the order of appearance of the tags and the vertical axis is the HBCC value. Several relatively high HBCC values can be seen in the figure. The user can preset a threshold, and every HBCC value above it is a high peak. Because the HBCC value expresses the similarity between the data window and the information example, a high peak means that the part of the page represented by the data window is highly similar to the desired information represented by the example, and the method therefore treats it as the information the user needs.

(E) Self-calibrating the marking: the high peaks are used to automatically correct (self-calibrate) an incorrectly marked information example. Because the page is a multi-record page, an incorrectly selected information example still resembles the next record in structure, so when it is matched against the next similar record a number of tags lie in between, and this gives the decision rule.

Because the start and end positions that the user marks may be incorrect, producing an incomplete information example, the method provides a self-calibration function to correct it.

First, whether self-calibration is needed is determined as follows:

Step (I): the user marks from the j-th tag to the (j+k)-th tag of the page to set the information example, so the example has k+1 tags in total;

Step (II): steps (B) to (D) of the method are performed, that is, the information example is compared against the page to obtain an HBCC waveform;

Step (III): let the start of the marked information be the x-th high peak, located at the j-th tag, and let the next high peak (the (x+1)-th) be located at the r-th tag; the number of tags between the two peaks (the x-th and the (x+1)-th) is then r-j;

Step (IV): if r-j is not equal to k+1 (the tag count of the information example), the information example marked by the user is incorrect, and its start and end positions must be changed.

Next, when self-calibration is required, it proceeds as follows:

Step (I): move the end of the information example to the tag immediately before the (x+1)-th high peak, so that the example runs from the j-th tag to the (r-1)-th tag, a total of w tags having been appended (w = (r-j) - (k+1)).

Step (II): if the (j-1)-th tag and the (r-1)-th tag are equal, move the start of the information example to the tag before the x-th high peak, so that the example runs from the (j-1)-th tag to the (r-1)-th tag.

Step (III): repeat step (II) n times, until w-n+1 equals 0, or until the (j-n+1)-th tag differs from the (r-n+1)-th tag and the (j-n+1)-th tag also differs from the (r-n)-th tag (that is, the tag before the start of the original example differs both from the tag at the end of the adjusted example and from the tag before that end). Self-calibration is then complete.

(F) Re-obtaining the waveform and the desired information: when self-calibration changes the marking of the information example and yields a new example, the new example is signalized and used to obtain the waveform and the desired information again.

In this way, the information the user needs is obtained through the above steps.
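Steps (B) to (D) of the method can be sketched end to end as below. This is a hedged illustration rather than the patented implementation. The text only says that "a correlation" is performed, so the use of the Pearson coefficient is an assumption, as are the value returned for flat histograms, the handling of an example with exactly 40 tags, the one-tag stride of the data window, and all function names. Signals are assumed to be lists of non-zero tag codes, with comment tags already removed.

```python
from collections import Counter
from math import sqrt


def tag_histogram(signal, types):
    """Occurrences of each tag type in `types` (missing types count as 0)."""
    counts = Counter(signal)
    return [counts[t] for t in types]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (assumed measure)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        # Flat histogram: the coefficient is undefined; this fallback
        # is an arbitrary convention chosen for the sketch.
        return 1.0 if xs == ys else 0.0
    return cov / (sx * sy)


def boundary_length(example):
    """First fifth of the tags below 40 tags in total, else the first tenth."""
    divisor = 5 if len(example) < 40 else 10
    return max(1, len(example) // divisor)


def boundary_value(example, window):
    """1 when the window starts with the same tags, in order, as the example."""
    b = boundary_length(example)
    return 1.0 if window[:b] == example[:b] else 0.0


def hbcc(example, window):
    """Average of the histogram correlation and the boundary-tag value."""
    types = list(dict.fromkeys(example))  # example's types, first-seen order
    coeff = pearson(tag_histogram(example, types), tag_histogram(window, types))
    return (coeff + boundary_value(example, window)) / 2


def hbcc_waveform(page, example):
    """HBCC value of every data window, sliding one tag at a time."""
    size = len(example)
    return [hbcc(example, page[i:i + size]) for i in range(len(page) - size + 1)]


def high_peaks(waveform, threshold):
    """Tag positions whose HBCC value exceeds the threshold."""
    return [i for i, value in enumerate(waveform) if value > threshold]


example = [1, 2, -2, 2, -2, -1]            # e.g. <tr><td></td><td></td></tr>
page = [9, -9] + example + example + [7]   # two records between unrelated tags
print(high_peaks(hbcc_waveform(page, example), 0.9))  # [2, 8]
```

In this toy page a window shifted by one tag has the same histogram as the example, so its correlation is still 1, but its boundary-tag value is 0 and its HBCC value drops to 0.5; this is the role the description gives to the boundary tags. The two peaks are 6 tags apart, equal to the tag count of the example, which is the condition under which step (E) would not need to adjust the marking.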
In summary, the web information extraction module and method with on-line learning of the present invention comprise a method that extracts web page information in a signal-based manner, distinguishing the positions of data blocks from those of non-data blocks and obtaining the information the user wants, and a module with an automatic learning function that mines the extraction rules, thereby reducing the time users spend searching for information on the Internet and adding value through information integration. The invention is thus more advanced, more practical, and better suited to users' needs, and it meets the requirements for a method patent application. A patent application is hereby filed in accordance with the law, and the examiners are respectfully asked to examine it in detail and to grant the patent at an early date.

The foregoing, however, is merely a preferred embodiment of the method and does not limit the scope in which the method may be practiced. All simple equivalent changes and modifications made in accordance with the claims and the description of the method remain within the scope covered by this patent.

[Brief Description of the Drawings]
Fig. 1 is a waveform comparison of the histogram results of the information example and a data window of the method.
Fig. 2 is an HBCC waveform of the method.

[Description of Main Reference Numerals]