TW200535641A - Online extraction rule analysis for semi-structured documents - Google Patents
- Publication number
- TW200535641A
- Authority
- TW
- Taiwan
- Prior art keywords
- data
- patent application
- information extraction
- scope
- extraction rules
- Prior art date
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
200535641 V. Description of the invention (12652TWF.PTD)

Technical field of the invention

The present invention relates to an information extraction method, and in particular to an online information extraction method for semi-structured documents.

Prior art

With the growth of the Internet, more and more information is presented in Hypertext Markup Language (HTML). Useful and useless information are mixed together, and users often spend a great deal of time looking for the data they need. How to design an information extraction system that presents input data in structured form, integrates it, supports rich search engines, automates extraction, and improves search efficiency and precision is therefore an important topic in online information extraction.

The most direct way to build an information extraction system is to hand-write a wrapper program for each website. Because a site's format may change at any time, however, generating extraction programs quickly and automatically is the biggest challenge in designing such a system.

Wrapper induction methods have been proposed since 1997. The user labels example pages to tell the system which information to extract, the system generates extraction rules, and the rules are then used to extract information from the site (see C.-N. Hsu and C.-C. Chang, Finite-state transducers for semi-structured text mining, in Proceedings of the IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, pp. 38-49, Stockholm, Sweden, 1999; C.-N. Hsu and Dung, Generating finite-state transducers for semi-structured data extraction from the web, Information Systems, 23(8):521-538, 1998; N. Kushmerick, D. Weld, and R. Doorenbos, Wrapper induction for information extraction, in Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI), pp. 729-737, Japan, 1997; I. Muslea, S. Minton, and C. Knoblock, A hierarchical approach to wrapper induction, in Proceedings of the 3rd International Conference on Autonomous Agents, pp. 190-197, Seattle, WA, 1999). These supervised approaches achieve good extraction rates, but the labeling needed to produce the rules is laborious and inconvenient for users. Reducing the amount of user labeling is therefore a major challenge in system design.

Among the unsupervised approaches, which need no user labeling, IEPAD (see C.-H. Chang and S.-C. Lui, IEPAD: Information extraction based on pattern discovery, in Proceedings of the 10th International Conference on World Wide Web, pp. 681-688, Hong Kong, 2001) assumes that the document contains multiple records of the data to be extracted, so that repeated-pattern mining can be used to guess likely extraction targets. It therefore offers no solution for single-record pages. RoadRunner (see V. Crescenzi, G. Mecca, and P. Merialdo, RoadRunner: Towards automatic data extraction from large web sites, in Proceedings of the 27th International Conference on Very Large Data Bases, pp. 109-118, 2001) and EXALG (see A. Arasu and H. Garcia-Molina, Extracting structured data from web pages, in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 337-348, 2003) assume that the data to be extracted is the data of the whole page, and try to separate the page template from the data. The data each person needs may differ, however, and even the template may contain data a user wants. Without user involvement, post-processing is still required to pick out the data the user needs. The present invention adopts a design that lies between the supervised and unsupervised approaches, combines the strengths of both, and performs well.

Summary of the invention

In view of the prior art described above, the present invention proposes an effective method for building an automated information extraction system that lets users obtain complete data without laborious labeling, and that handles both single-record and multi-record web pages.

The present invention provides an information extraction method suitable for extracting information from a semi-structured document. The user encloses a target region in the semi-structured document, and the target information is analyzed in detail so that the semi-structured document becomes structured.
Finally, attributes are specified to raise the extraction rate of the method. The detailed analysis comprises: (1) drill-down, which analyzes the data at a finer level; (2) roll-up, which merges data into a higher-level attribute; and (3) specify, which determines a schema and extracts and integrates data according to it. The method carries out these steps together with hierarchical encoding, approximate pattern matching, and string alignment. The approximate pattern matching step is implemented, for example, in a manner similar to string alignment (see D. Gusfield, Efficient methods for multiple sequence alignment with guaranteed error bounds, Bull. Math. Biol., 55:141-154, 1993). The string alignment step is implemented, for example, by multiple string alignment. The target information is recorded, for example, in Hypertext Markup Language (HTML), and may consist of, for example, a single-record page or a multi-record page.
In addition, the extraction rules that are analyzed are of three kinds: (1) delimiter-based, which relies on delimiters and extracts an attribute according to the format of the context before and after it; (2) content-based, which relies on the data content itself; and (3) a combination of the two. The results of the information extraction method are stored, for example, in Extensible Markup Language (XML) format, so they can be used with XML-related applications. The target region may also be enclosed by, for example, multiple selections, to handle pages in which attributes are arranged irregularly.

The present invention also provides a method for generating information extraction rules by training on one or more record pages. In this method, the user encloses a target region containing at least one piece of target information in the record pages and performs detailed data designation on that target information. The detailed analysis steps comprise: (1) drill-down, which analyzes the target data at a finer level; (2) roll-up, which merges the target data into a higher-level attribute; and (3) specify, which determines a schema and extracts and integrates data according to it. When implemented, the method further includes pattern matching, string alignment, and a comparison of plain-text content. The drill-down step extracts attribute values further on the basis of delimiters. The roll-up step merges some of the blocks in the target region. The final specify step is completed by the user clicking the mouse and entering attribute names.
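The three kinds of extraction rule named in this summary can be illustrated with a minimal Python sketch. The regular expressions, the sample record string, and the function names are assumptions chosen for illustration; the patent text does not specify a rule syntax.

```python
import re

def delimiter_based(text, left, right):
    """Return the text found between a left and a right delimiter, or None."""
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

def content_based(text, content_pattern):
    """Return the first substring whose own form matches content_pattern."""
    match = re.search(content_pattern, text)
    return match.group(0) if match else None

def combined(text, left, content_pattern):
    """Use both cues: a left delimiter followed by content of a given form."""
    match = re.search(re.escape(left) + r"\s*(" + content_pattern + r")", text)
    return match.group(1) if match else None

# Hypothetical record text, not taken from the patent figures.
record = "ISBN: 0123456789; Pub. Date: May 2000; Price: $39.99"

isbn = delimiter_based(record, "ISBN:", ";")                      # context only
price = content_based(record, r"\$\d+\.\d{2}")                    # content only
pub_date = combined(record, "Pub. Date:", r"[A-Z][a-z]+ \d{4}")   # both
```

A delimiter-based rule keeps working when the value changes form but breaks if the surrounding labels change, whereas a content-based rule has the opposite trade-off, which may be why the text allows the two to be combined.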
The present invention further provides a method for generating information extraction rules by training on one or more record pages. The record pages are first processed into a structured document, for example by hierarchical encoding, approximate pattern matching, and string alignment. Detailed data designation is then performed on the structured document to extract attribute values, in steps comprising: (1) drill-down, which analyzes the target data at a finer level; (2) roll-up, which merges the target data into a higher-level attribute; and (3) specify, which determines a schema and performs extraction and integration according to it.
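As a rough illustration of the specify step and of storing results as XML, the sketch below attaches user-chosen attribute names to the columns of already-extracted records and serializes them with Python's standard `xml.etree.ElementTree` module. The element names, the sample row, and the helper `records_to_xml` are hypothetical and are not taken from the patent.

```python
import xml.etree.ElementTree as ET

def records_to_xml(schema, rows, root_tag="records", record_tag="record"):
    """Label each column with its schema name and build an XML string."""
    root = ET.Element(root_tag)
    for row in rows:
        if len(row) != len(schema):
            raise ValueError("row width does not match schema")
        record = ET.SubElement(root, record_tag)
        for name, value in zip(schema, row):
            ET.SubElement(record, name).text = value
    return ET.tostring(root, encoding="unicode")

# Attribute names as a user might enter them, written without spaces
# because XML element names cannot contain them (an assumption of this sketch).
schema = ["BookTitle", "Author", "Publisher"]
# Hypothetical extracted row.
rows = [["Example Title", "A. Writer", "Example Press"]]

xml_text = records_to_xml(schema, rows)
```

Because the schema is only a list of names applied at the end, it can be chosen after the columns have been produced, which matches the order of operations described in the text.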
To make the above and other objects, features, and advantages of the present invention easier to understand, a preferred embodiment is described in detail below with reference to the accompanying drawings.

Embodiments

The online extraction rule analysis (OLERA) system proposed by the present invention is designed to avoid complicated labeling. During extraction the user needs only a few simple mouse clicks, and, following the WYSIWYG (what you see is what you get) idea, laborious attribute labeling is reduced to three operations. (1) Enclosing the target region: the user drags the mouse to frame the target to be extracted, and the system extracts the data and matches it against other possible records. (2) Drill-down or roll-up of blocks: extraction is refined on the basis of the records from the first step, and extraction rules are built through approximate pattern matching and string alignment. (3) Specifying the relevant attributes: a comparison of plain-text content is added to raise the success rate of the whole system. For both singular (single-record) pages and multi-record pages, the result is automated and achieves a high extraction rate.

In general, the content of a web page can be divided into regions that interest the user and regions that do not. The regions of interest are where the important data is displayed. On a page introducing a book, for example, these may show the author, publisher, publication date, and so on, whereas the regions of no interest contain navigation buttons for visitors or an array of advertisements. Different users need different data, so the enclosing step is required to let the user tell the system which region is of interest.

After the target region is enclosed, the system processes it mainly with an improved alignment algorithm, described here. The content of the enclosed region is first encoded. Because the target document is built in HTML, HTML tags are used as delimiters when the page is read; the content between two adjacent delimiters is not examined and is treated as a single unit. To run the alignment algorithm, this page encoding is applied hierarchically to the enclosed content. The encoding hierarchy can, for example, be divided by scope into three levels: markup level, text level, and word level. The table below illustrates the three levels. It is given only as an example, and any delimiters of the same level can be classified by a similar hierarchical encoding. The result is a matching pattern, which can also be regarded as the reference extraction rule produced in the training procedure and used for subsequent testing and extraction.
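The hierarchical encoding just described can be sketched as follows, assuming a simple regular-expression tokenizer rather than a full HTML parser. The token labels (`TAG`, `TEXT`), the function names, and the sample HTML are illustrative assumptions, not the encoding scheme of the invention itself.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")

def encode_markup_level(html):
    """Split HTML at tags; each tag is a token, each text run one TEXT unit."""
    tokens = []
    pos = 0
    for match in TAG_RE.finditer(html):
        text = html[pos:match.start()].strip()
        if text:
            tokens.append(("TEXT", text))
        # Keep only the tag name (with a leading slash), dropping attributes.
        name = re.match(r"<\s*(/?\s*\w+)", match.group(0))
        tag = name.group(1).replace(" ", "").upper() if name else match.group(0)
        tokens.append(("TAG", tag))
        pos = match.end()
    tail = html[pos:].strip()
    if tail:
        tokens.append(("TEXT", tail))
    return tokens

def encode_text_level(text):
    """Refine one TEXT unit: split after . ? ! and at newlines or tabs."""
    parts = re.split(r"(?<=[.?!])\s+|[\n\r\t]+", text)
    return [p.strip() for p in parts if p and p.strip()]

def encode_word_level(text):
    """Refine further: split at phrase delimiters and blanks."""
    parts = re.split(r"[:,;\[\](){}\s]+", text)
    return [p for p in parts if p]

# Hypothetical fragment of a book page.
html = "<b>Format:</b> Paperback, 1st ed.<br><i>Pub. Date:</i> May 2000"
markup_tokens = encode_markup_level(html)
```

At the markup level every text run collapses to a single unit, so two records with different wording but the same tag layout produce the same token sequence, which is what makes them comparable by alignment.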
Level | Encoding scheme | Delimiters |
---|---|---|
Markup | Block-level tags | Block-level tags |
Text | Paragraph | NL, CR, Tab |
Text | Sentence | . ? ! |
Word | Phrase | : , ; [ ] ( ) { } |
Word | [illegible] | Blank, [illegible] $ - / |

After hierarchical encoding, approximate matching is performed to find records in the training page that resemble the enclosed block. A candidate qualifies when its measured difference from the enclosed block is below a threshold.

The records found by approximate matching can then be processed by a multiple string alignment algorithm. For k strings of length n, the algorithm uses dynamic programming and takes on the order of n^k steps to obtain the alignment. For example, running the algorithm on three strings B1, B2, and B3 yields the alignment result below:

[Alignment of strings B1, B2, and B3: figure not reproduced in the source text.]
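The dynamic-programming idea behind approximate matching and alignment can be sketched for the pairwise case. This is a minimal illustration under stated assumptions: it aligns two token sequences with unit edit costs, whereas the text describes aligning k strings at once. The unit costs, the gap symbol, the threshold value, and the function names are all choices made for this sketch.

```python
def align(a, b, gap="-"):
    """Globally align two token sequences with unit edit costs.

    Returns (distance, aligned_a, aligned_b); the aligned lists have equal
    length and use `gap` wherever one sequence has no counterpart.
    """
    n, m = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = dist[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            dist[i][j] = min(substitute, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            out_a.append(a[i - 1])
            out_b.append(gap)
            i -= 1
        else:
            out_a.append(gap)
            out_b.append(b[j - 1])
            j -= 1
    return dist[n][m], out_a[::-1], out_b[::-1]

def is_approximate_match(a, b, threshold):
    """Accept b as similar to a when the edit distance is below threshold."""
    return align(a, b)[0] < threshold

# Hypothetical markup-level encodings of an enclosed block and a candidate.
selected = ["B", "TEXT", "/B", "TEXT", "BR"]
candidate = ["B", "TEXT", "/B", "I", "TEXT", "/I", "BR"]
distance, row_a, row_b = align(selected, candidate)
```

The gap positions in the aligned rows mark tokens that appear in only one record, so columns that line up across records are candidates for attribute values.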
This method has been applied in IEPAD, where it makes up for the shortcomings of the PAT-tree algorithm, and it is especially suitable when the strings are of similar length. When the number of strings is very large, the known center star approximation algorithm (see D. Gusfield, Efficient methods for multiple sequence alignment with guaranteed error bounds, Bull. Math. Biol., 55:141-154, 1993) can be used to speed up the computation.

At this point, the web page supplied by the user has been converted by the system into structured data. The invention must nevertheless go on to extract the designated targets correctly. A target can be an attribute value such as a price, the name of a book's author, or a date, and having structured data does not mean these attribute values have been obtained. All of the processing so far uses HTML tags as delimiters, so at most it yields a passage of text that contains the attribute value to be extracted. That text must be processed further, which is the purpose of detailed data designation.

Detailed data designation is divided into three parts: drill-down, roll-up, and specify.
The main ideas of drill-down and roll-up come from the OLAP multidimensional data model, which likewise includes drill-down and roll-up: drill-down analyzes data at a finer level, and roll-up aggregates data to a higher level. The same applies in detailed data designation. Drill-down analyzes plain-text content more finely, splitting it at delimiters such as punctuation marks or line breaks to extract attribute values. Roll-up merges the data of a block, for example combining author data into an author list, or combining a book's publication date and publisher into publication information, as the user requires. Both drill-down and roll-up are carried out in two steps, page encoding and the multiple string alignment algorithm. These steps are the same as those described for the enclosing step and are not repeated here.
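Drill-down and roll-up as described above can be sketched on rows of already-extracted columns. This simplified illustration works directly on strings and omits the encoding and alignment steps that the text says are applied; the sample row and the function names are hypothetical.

```python
import re

def drill_down(rows, column, delimiters):
    """Split one column of every row at any of the given delimiter characters."""
    splitter = re.compile("[" + re.escape(delimiters) + "]")
    result = []
    for row in rows:
        pieces = [p.strip() for p in splitter.split(row[column]) if p.strip()]
        result.append(row[:column] + pieces + row[column + 1:])
    return result

def roll_up(rows, start, end, separator=" "):
    """Merge columns start..end-1 of every row into a single column."""
    result = []
    for row in rows:
        merged = separator.join(row[start:end])
        result.append(row[:start] + [merged] + row[end:])
    return result

# Hypothetical extracted row: title, authors, publisher, publication date.
rows = [["Example Title", "A. Writer; B. Author", "Example Press", "May 2000"]]

finer = drill_down(rows, 1, ";")               # authors become separate columns
coarser = roll_up(rows, 2, 4, separator=", ")  # publisher and date merged
```

If different rows split into different numbers of pieces, the result is ragged, which is presumably one reason the text applies alignment again after drill-down.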
Drill-down and roll-up give the user a flexible way to extract attribute values. The remaining function, specify, performs the important task of determining the schema; once the schema is determined, extraction and integration proceed according to it. In conventional information extraction systems the schema is usually decided at the outset, whereas in the OLERA system it is the last step, completed with a few mouse clicks and by typing attribute names.

In the proposed system, following the what-you-see-is-what-you-get (WYSIWYG) idea, laborious attribute labeling is reduced to three simple operations performed in sequence. FIG. 1 illustrates the target-region enclosing method according to a preferred embodiment of the present invention. With a simple mouse click and drag, or through another user peripheral with the same function (for example, a touchpad), the user determines the content of the target page from which information is to be extracted. The content includes, for example, plain text and images, and is expressed in Hypertext Markup Language (HTML). FIG. 1 takes the website of the Barnes & Noble online bookstore as an example.
On the J++ reference-book page returned by a search, the user selects the content of interest, including the title, author, format, International Standard Book Number (ISBN), publisher, publication date (Pub. Date), and edition description (Edition Desc.) (102). The unselected part of the page includes a text notice on the right side of the layout (104).

FIG. 2 is a schematic diagram of the interactive interface for detailed data designation according to a preferred embodiment of the present invention. After the OLERA system performs page encoding, approximate pattern matching, and multiple string alignment, the data is obtained in the tabular form 210 shown in FIG. 2. The shaded part 220 is the extraction rule produced by OLERA, and the data in the selected block of other similar book-query pages can be structured according to this rule.

Data extraction then uses the extraction rules, whose representation falls into roughly three types. (1) Delimiter-based: relies on delimiters and extracts an attribute according to the format of its surrounding context, for example the symbols "$,." entered in the pop-up interactive window 330 shown in FIG. 3. (2) Content-based: relies on the data content, for example the particular formats of the title, author, format, publisher, ISBN, publication date, and edition in the preferred embodiment. (3) Context-based: uses (1) and (2) together. FIG. 3 shows an example of drill-down. In this embodiment, drill-down is performed on the column containing the format information. According to the delimiters typed in the interactive window 330 (three symbols are typed here, but the one that takes effect is the comma ","), the format data is parsed further into four columns, "Format:", "Paperback", "1st ed.", and "992pp.", giving the tabular data 310.

FIG. 4 illustrates an embodiment of roll-up.
Roll-up can be regarded as the reverse of drill-down: data subdivided into several columns is combined into one higher-level item. In the shaded part 420 of FIG. 4, for example, the columns for paperback, first edition, and total page count are combined into a single "Format" item. The user can enter the separator to be inserted in field 430 as needed.

FIG. 5 illustrates attribute specification (specify): the user types "Book Title", "Author", and "Publisher" onto the desired data columns.

FIG. 6 is a flow diagram of the OLERA system architecture of the present invention. As shown, the operation of the OLERA system is divided into two parts: the training procedure 610 in the left half of the figure and the test/extraction procedure 620 in the right half. In the training procedure, a sample document 602, for example a single-record or multi-record web page, passes through block enclosing 612 by the user, then drill-down/roll-up 614 and attribute specification 616, to yield the extraction pattern 604. In the test/extraction procedure 620, the target document is processed in step 626, which applies the extracted pattern 604 to produce structured data 608. Note that in the flow diagram all steps drawn as ellipses are user operations, performed for example by clicking and dragging the mouse or through an equivalent interface, and all steps drawn as rectangles are the corresponding operations of the OLERA system. When the user encloses a block (step 612), the corresponding system step 622 includes record-page encoding, approximate pattern matching, and multiple string alignment. The system operations corresponding to the user's drill-down/roll-up step (614) include record-page encoding and multiple string alignment. In the test/extraction procedure, the test or extraction target document 606 goes through steps including record-page encoding and multiple string alignment to yield the structured data 608.

Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, and the scope of protection is therefore defined by the appended claims.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a target-region selection method according to a preferred embodiment of the present invention.
FIG. 2 is a schematic diagram of an interactive interface for specifying detailed data according to a preferred embodiment of the present invention.
FIG. 3 is a schematic diagram of an interface for finer-grained analysis of data by drill-down according to a preferred embodiment of the present invention.
FIG. 4 is a schematic diagram of a roll-up interface for consolidating data into a higher level according to a preferred embodiment of the present invention.
FIG. 5 is a schematic diagram of an interactive interface for specifying detailed data according to a preferred embodiment of the present invention.
FIG. 6 is a schematic diagram of the data extraction flow of the OLERA system according to a preferred embodiment of the present invention.
Description of Reference Numerals
602 Document
604 Extraction pattern
606 Document
608 Structured data
610 Training procedure
612 Block selection
614 Drill-down/roll-up
616 Attribute specification
620 Test/extraction
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW93110811A TWI237780B (en) | 2004-04-19 | 2004-04-19 | Online extraction rule analysis for semi-structured documents |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW93110811A TWI237780B (en) | 2004-04-19 | 2004-04-19 | Online extraction rule analysis for semi-structured documents |
Publications (2)
Publication Number | Publication Date |
---|---|
TWI237780B (en) | 2005-08-11 |
TW200535641A (en) | 2005-11-01 |
Family
ID=36929938
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW93110811A TWI237780B (en) | 2004-04-19 | 2004-04-19 | Online extraction rule analysis for semi-structured documents |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI237780B (en) |
2004
- 2004-04-19 TW TW93110811A, patent TWI237780B (en): not active, IP Right Cessation
Also Published As
Publication number | Publication date |
---|---|
TWI237780B (en) | 2005-08-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8005815B2 (en) | | Search engine |
JP2005122295A (en) | | Relationship figure creation program, relationship figure creation method, and relationship figure generation device |
CN108090199B (en) | | Semantic information extraction and visualization method for large-scale image set |
CN113312503A (en) | | Novel teaching video content abstract and visual browsing method |
CN111813874B (en) | | Terahertz knowledge graph construction method and system |
Abrami et al. | | Text2scenevr: Generating hypertexts with vannotator as a pre-processing step for text2scene systems |
JP2024091709A (en) | | Sentence preparation apparatus, sentence preparation method, and sentence preparation program |
Martín-Valdivia et al. | | Using information gain to improve multi-modal information retrieval systems |
Yurtsever et al. | | Figure search by text in large scale digital document collections |
Kadam et al. | | A survey on HTML structure aware and tree based web data scraping technique |
Sabri et al. | | Improving performance of DOM in semi-structured data extraction using WEIDJ model |
US20080015843A1 (en) | | Linguistic Image Label Incorporating Decision Relevant Perceptual, Semantic, and Relationships Data |
JP2014102625A (en) | | Information retrieval system, program, and method |
Sanoja et al. | | Block-o-matic: a web page segmentation tool and its evaluation |
CN114238735B (en) | | Intelligent internet data acquisition method |
Ishihara et al. | | Analyzing visual layout for a non-visual presentation-document interface |
Fung et al. | | Discover information and knowledge from websites using an integrated summarization and visualization framework |
Yeh et al. | | A case for query by image and text content: searching computer help using screenshots and keywords |
Ramezani et al. | | Automated text summarization: An overview |
Ibrahim et al. | | Exquisite: explaining quantities in text |
CN116209992A (en) | | Multimodal table encoding for information retrieval systems |
TWI237780B (en) | | Online extraction rule analysis for semi-structured documents |
Tsapatsoulis | | Web image indexing using WICE and a learning-free language model |
Adefowoke Ojokoh et al. | | Automated document metadata extraction |
Kolkur et al. | | Web Data Extraction Using Tree Structure Algorithms-A Comparison |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| MM4A | Annulment or lapse of patent due to non-payment of fees | |