TWI611308B

TWI611308B - Webpage data extraction device and webpage data extraction method thereof

Info

Publication number: TWI611308B
Application number: TW105135730A
Authority: TW
Inventors: 黃奕翔; 邱育賢; 蕭暉議
Original assignee: 財團法人資訊工業策進會
Priority date: 2016-11-03
Filing date: 2016-11-03
Publication date: 2018-01-11
Also published as: CN108021600A; TW201818268A; US20180121558A1

Abstract

A webpage data extraction device and a webpage data extraction method thereof. The webpage data extraction device: divides webpage data into URL groups according to the address relevance of the URLs of the webpage data; selects first webpage data and second webpage data from the webpage data of a URL group; parses the first webpage data and the second webpage data to obtain a webpage node data set; divides the webpage node data of the webpage node data set into a plurality of webpage node data groups according to the path relevance of the XML Path Language (XPath) expressions and the text relevance of the text contents of the webpage node data; calculates a text content sum for each webpage node data group; determines a main webpage node data group among the webpage node data groups according to the text content sums; and determines webpage main-content extraction information according to the XPath expressions of the webpage node data included in the main webpage node data group.

Description

Webpage data extraction device and webpage data extraction method thereof

The present invention relates to a webpage data extraction device and a webpage data extraction method thereof. More specifically, the present invention relates to an automated webpage data extraction device and a webpage data extraction method thereof.

With the development of Internet applications, all kinds of information can be obtained from different webpages. Therefore, when there is a specific data analysis requirement, the main content of the webpages of the relevant websites can be extracted and then analyzed.

Conventional approaches to extracting the main content of webpages mostly rely on manual collection and analysis. However, manually determining the main content of different webpages on different websites is quite inefficient. Accordingly, to improve extraction efficiency, techniques based on custom programs have been developed that use the various templates and layouts of webpages as training data for webpage analysis and main content extraction.

However, such custom programs can only handle the templates and layouts of specific webpages. Therefore, when a webpage is redesigned or its syntactic structure is slightly adjusted, the analysis and extraction results will contain obvious errors unless the custom program is adjusted accordingly.

Moreover, as webpage layouts grow increasingly complex, the amount of information in a webpage has also increased substantially, and a single webpage may contain nearly a thousand webpage nodes. Accordingly, when the structure or form of a webpage changes, adjusting the custom program becomes even more difficult and may even require rewriting it, which likewise makes main content determination inefficient.

Therefore, how to remedy the inefficiency of conventional webpage main content extraction is a goal the industry needs to work toward together.

The main objective of the present invention is to provide a webpage data extraction method for a webpage data extraction device. The webpage data extraction device receives a plurality of pieces of webpage data from a web server. The webpage data extraction method comprises: (a) causing the webpage data extraction device to divide the plurality of webpage data into at least one URL group according to the address relevance of a plurality of uniform resource locators (URLs) of the webpage data, wherein the at least one URL group includes a first URL group, and the first URL group includes at least part of the webpage data; (b) causing the webpage data extraction device to select first webpage data and second webpage data from the part of the webpage data in the first URL group; and (c) causing the webpage data extraction device to parse the first webpage data and the second webpage data to obtain a webpage node data set, wherein the webpage node data set includes a plurality of pieces of webpage node data, and each piece of webpage node data includes a corresponding XML Path Language (XPath) expression and a text content.

The aforementioned webpage data extraction method further comprises: (d) causing the webpage data extraction device to divide the webpage node data of the webpage node data set into a plurality of webpage node data groups according to the path relevance of the XML Path Language (XPath) expressions and the text relevance of the text contents of the webpage node data, wherein each webpage node data group includes at least part of the webpage node data; (e) causing the webpage data extraction device to calculate, for each webpage node data group, a text content sum of the webpage node data in that group; (f) causing the webpage data extraction device to determine at least one main webpage node data group among the webpage node data groups according to the text content sums; and (g) causing the webpage data extraction device to determine webpage main-content extraction information according to the XPath expressions of the webpage node data included in the at least one main webpage node data group.

To achieve the above objective, the present invention discloses a webpage data extraction device that includes a receiving unit and a processing unit. The receiving unit is configured to receive a plurality of pieces of webpage data from a web server. The processing unit is configured to: divide the plurality of webpage data into at least one URL group according to the address relevance of the URLs of the webpage data, wherein the at least one URL group includes a first URL group, and the first URL group includes at least part of the webpage data; select first webpage data and second webpage data from the part of the webpage data in the first URL group; and parse the first webpage data and the second webpage data to obtain a webpage node data set, wherein the webpage node data set includes a plurality of pieces of webpage node data, and each piece of webpage node data includes a corresponding XML Path Language (XPath) expression and a text content.

The aforementioned processing unit is further configured to: divide the webpage node data of the webpage node data set into a plurality of webpage node data groups according to the path relevance of the XML Path Language (XPath) expressions and the text relevance of the text contents of the webpage node data, wherein each webpage node data group includes at least part of the webpage node data; calculate, for each webpage node data group, a text content sum of the webpage node data in that group; determine at least one main webpage node data group among the webpage node data groups according to the text content sums; and determine webpage main-content extraction information according to the XPath expressions of the webpage node data included in the at least one main webpage node data group.

In addition, after reviewing the drawings and the embodiments described below, a person having ordinary skill in the art will understand other objectives of the present invention, as well as its technical means and implementations.

1, 2‧‧‧webpage data extraction device

11, 21‧‧‧receiving unit

13, 23‧‧‧processing unit

wp‧‧‧webpage data

ul‧‧‧uniform resource locator (URL)

ug‧‧‧at least one URL group

UL1‧‧‧first URL group

WP1‧‧‧first webpage data

WP2‧‧‧second webpage data

ND‧‧‧webpage node data

NDX‧‧‧XML Path Language (XPath) expression

NDT‧‧‧text content

wpg‧‧‧webpage node data set

ndg‧‧‧webpage node data group

MNDG‧‧‧at least one main webpage node data group

MX‧‧‧webpage main-content extraction information

FIG. 1A is a schematic diagram of the webpage data extraction operation of the first embodiment of the present invention; FIG. 1B is a block diagram of the webpage data extraction device of the first embodiment of the present invention; FIG. 2A is a schematic diagram of the webpage data extraction operation of the second embodiment of the present invention; FIG. 2B is a block diagram of the webpage data extraction device of the second embodiment of the present invention; FIG. 3 is a flowchart of the webpage data extraction method of the third embodiment of the present invention; and FIGS. 4A-4B are a flowchart of the webpage data extraction method of the fourth embodiment of the present invention.

The content of the present invention is explained below through embodiments. It should be noted that the embodiments are not intended to limit the present invention to any specific environment, application, or particular manner described in the embodiments. Therefore, the description of the embodiments is only for the purpose of explaining the present invention, not limiting it, and the scope claimed in this application is defined by the claims. In addition, in the following embodiments and drawings, elements not directly related to the present invention are omitted and not shown, and the dimensional relationships among elements in the drawings are only for ease of understanding and do not limit the actual proportions.

Please refer to FIGS. 1A-1B. FIG. 1A is a schematic diagram of the webpage data extraction operation of the first embodiment of the present invention, and FIG. 1B is a block diagram of a webpage data extraction device 1 of the first embodiment. The webpage data extraction device 1 includes a receiving unit 11 and a processing unit 13, and connects to a web server 9 through the receiving unit 11. The interactions among these elements are described further below.

First, when the webpages of the web server 9 need to be analyzed, the receiving unit 11 of the webpage data extraction device 1 receives a plurality of pieces of webpage data wp from the web server 9. In keeping with how the Internet works, each piece of webpage data wp has its own corresponding uniform resource locator (URL) ul.

Next, the processing unit 13 of the webpage data extraction device 1 divides the plurality of webpage data wp into at least one URL group ug according to the address relevance of the URLs ul of the webpage data wp. The at least one URL group ug includes a first URL group UL1, and the first URL group UL1 includes at least part of the webpage data wp.

Note that the purpose of this grouping is to make a preliminary classification of webpages with highly similar content according to URL characteristics, to facilitate subsequent comparison and analysis. In other words, because webpages sharing the same template and layout usually have URLs of similar form, a preliminary grouping can be made according to the address relevance of the URLs of the webpage data.

Subsequently, the processing unit 13 of the webpage data extraction device 1 selects first webpage data WP1 and second webpage data WP2 from the part of the webpage data in the first URL group UL1, and parses the first webpage data WP1 and the second webpage data WP2 to obtain a webpage node data set wpg.

In detail, because a single webpage contains multiple webpage nodes, parsing the syntax of the first webpage data WP1 and the second webpage data WP2 yields the webpage node data set wpg, which includes a plurality of pieces of webpage node data ND. Each piece of webpage node data ND includes a corresponding XML Path Language (XPath) expression NDX and a text content NDT.
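As a rough illustration of this parsing step, the sketch below walks an HTML document with Python's standard `html.parser` module and records, for every piece of text, a positional path and the text itself. The names `NodeCollector` and `parse_nodes`, the `tag[n]` index on every path step, and the handling of void elements are assumptions made for the example; the patent does not name a parser or fix an exact path format, and the sample page is hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are not pushed on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NodeCollector(HTMLParser):
    """Collect (path, text) pairs for every non-empty text run in a page."""

    def __init__(self):
        super().__init__()
        self.path = []       # stack of "tag[n]" steps for the open elements
        self.counts = [{}]   # counts[d] = sibling counters for children at depth d
        self.nodes = []      # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        self.path.append(f"{tag}[{siblings[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Close the nearest open element with this tag name, if there is one.
        for depth in range(len(self.path) - 1, -1, -1):
            if self.path[depth].startswith(tag + "["):
                del self.path[depth:]
                del self.counts[depth + 1:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.path:
            self.nodes.append(("/".join(self.path), text))


def parse_nodes(html):
    """Return the list of (path, text) pairs found in an HTML string."""
    collector = NodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


# Hypothetical page used only to show the output shape.
page = "<html><body><div><p>Hello</p><p>World</p></div></body></html>"
nodes = parse_nodes(page)
```

Note that this sketch also records the contents of `script` and `style` elements as text; a real extractor would probably want to skip them.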

Accordingly, the processing unit 13 of the webpage data extraction device 1 can divide the webpage node data ND of the webpage node data set wpg into a plurality of webpage node data groups ndg according to the path relevance of the XPath expressions NDX and the text relevance of the text contents NDT of the webpage node data ND. Each webpage node data group ndg includes at least part of the webpage node data ND.

Note that, similarly, the purpose of this grouping is to classify webpage nodes with highly similar content according to the characteristics of their XML syntax and text content, to facilitate the subsequent determination of the main content. In other words, webpage nodes with highly similar XML syntax are grouped according to the path relevance of their XPath expressions, and webpage nodes with highly similar content can also be grouped according to the text relevance of their text contents.

Next, the processing unit 13 of the webpage data extraction device 1 calculates, for each webpage node data group ndg, a text content sum (not shown) of the webpage node data ND in that group, that is, the total text length of the webpage node data ND belonging to the same webpage node data group ndg. According to the text content sums, it then determines at least one main webpage node data group MNDG among the webpage node data groups ndg.

Specifically, within the same webpage, webpage node data carrying the main content usually has a large amount of text. The foregoing division therefore separates webpage node data carrying the main content from webpage node data that does not, based on the text content sum of the webpage node data in each webpage node data group.
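The text content sum can be pictured with a very small sketch. It assumes that "total text length" means the number of characters and that each group is a list of (path, text) pairs; the function name `text_content_sums`, the group names, and the sample data are hypothetical.

```python
def text_content_sums(groups):
    """Map each group name to the total character count of its node texts."""
    return {
        name: sum(len(text) for _path, text in nodes)
        for name, nodes in groups.items()
    }


# Hypothetical groups: one with article-like text, one with a short link label.
groups = {
    "g1": [("html/body/div[1]/p[1]", "Hello"), ("html/body/div[1]/p[2]", "World!")],
    "g2": [("html/body/div[2]", "Home")],
}
sums = text_content_sums(groups)
```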

Accordingly, the processing unit 13 of the webpage data extraction device 1 can determine webpage main-content extraction information MX according to the XPath expressions NDX of the webpage node data ND included in the at least one main webpage node data group MNDG. More specifically, the webpage main-content extraction information MX is essentially a set of XPath expressions NDX.

In this way, provided that the webpages in the aforementioned URL group share the same properties (for example, template and layout), the processing unit 13 of the webpage data extraction device 1 can subsequently use this set of XPath expressions NDX to directly select the webpage nodes carrying the main content within the URL group, for subsequent analysis and use of the main content.

Please refer to FIGS. 2A-2B. FIG. 2A is a schematic diagram of the webpage data extraction operation of the second embodiment of the present invention, and FIG. 2B is a block diagram of a webpage data extraction device 2 of the second embodiment. The webpage data extraction device 2 includes a receiving unit 21 and a processing unit 23, and connects to the web server 9 through the receiving unit 21. The second embodiment mainly uses an example to explain in more detail how the webpage data extraction device 2 extracts and analyzes webpages.

Similarly, when the webpages of the web server 9 need to be analyzed, the receiving unit 21 of the webpage data extraction device 2 receives a plurality of pieces of webpage data wp from the web server 9. In keeping with how the Internet works, each piece of webpage data wp has its own corresponding URL ul. The webpage data wp and the corresponding URLs ul are shown in the following table:

Next, the processing unit 23 of the webpage data extraction device 2 divides the plurality of webpage data wp into at least one URL group ug according to the address relevance of the URLs ul of the webpage data wp. The at least one URL group ug includes the first URL group UL1, and the first URL group UL1 includes at least part of the webpage data wp. Note that in the second embodiment, this URL grouping is performed mainly on the basis of the minimum edit distance (MED).

In detail, the processing unit 23 of the webpage data extraction device 2 computes the minimum edit distance between every pair of URLs ul of the webpage data wp, with the results shown in the following table:

Accordingly, based on the table above, the processing unit 23 of the webpage data extraction device 2 can add each pair of webpage data whose MED value is less than a URL threshold to the same URL group. In the second embodiment the URL threshold is 2, so webpage pairs with an MED value of 1 are placed in the same URL group.
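The URL grouping above can be sketched as follows. The sketch assumes that the minimum edit distance is the character-level Levenshtein distance and that a URL joins a group as soon as it is within the threshold of any member (so groups are merged transitively); the patent text only states that pairs below the threshold share a group. The function names are hypothetical, and the five URLs are expanded from the `item1~3.html` and `list1~2.html` ranges mentioned in the description.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def group_by_distance(items, threshold):
    """Group items so that any pair with distance < threshold shares a group."""
    groups = []
    for item in items:
        linked = [g for g in groups
                  if any(edit_distance(item, other) < threshold for other in g)]
        merged = [member for g in linked for member in g] + [item]
        groups = [g for g in groups if g not in linked] + [merged]
    return groups


urls = [
    "http://www.aaaaa.com/item1.html",
    "http://www.aaaaa.com/item2.html",
    "http://www.aaaaa.com/item3.html",
    "http://www.aaaaa.com/list1.html",
    "http://www.aaaaa.com/list2.html",
]
url_groups = group_by_distance(urls, threshold=2)
```

With the threshold of 2 used in the second embodiment, the three `item` pages end up in one group and the two `list` pages in another.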

In detail, the webpage data wp included in the first URL group UL1 is http://www.aaaaa.com/item1~3.html. In addition, the at least one URL group ug may also include a second URL group (not shown), which includes at least part of the webpage data wp, namely http://www.aaaaa.com/list1~2.html. Because every URL group is processed in the same way, the following description focuses only on the first URL group UL1.

Next, the processing unit 23 of the webpage data extraction device 2 selects, from the part of the webpage data in the first URL group UL1, the first webpage data WP1 with the largest data size (that is, the HTML size of the webpage data) and the second webpage data WP2 with the second-largest data size, and parses the first webpage data WP1 and the second webpage data WP2 to obtain the webpage node data set wpg.
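Selecting the two largest pages is simple enough to show in a few lines. The sketch assumes "HTML size" means the byte length of the UTF-8 encoded source; the function name `pick_two_largest` and the sample pages are hypothetical.

```python
def pick_two_largest(pages):
    """pages maps URL -> HTML source; return the two URLs with the largest HTML size."""
    ranked = sorted(pages,
                    key=lambda url: len(pages[url].encode("utf-8")),
                    reverse=True)
    return ranked[:2]


# Hypothetical page sources of different sizes.
pages = {
    "http://www.aaaaa.com/item1.html": "<html>" + "a" * 300 + "</html>",
    "http://www.aaaaa.com/item2.html": "<html>" + "b" * 100 + "</html>",
    "http://www.aaaaa.com/item3.html": "<html>" + "c" * 200 + "</html>",
}
wp1, wp2 = pick_two_largest(pages)
```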

In detail, because a single webpage contains multiple webpage nodes, parsing the syntax of the first webpage data WP1 and the second webpage data WP2 yields the webpage node data set wpg, which includes a plurality of pieces of webpage node data ND. Each piece of webpage node data ND includes a corresponding XPath expression NDX and a text content NDT, as detailed in the following table:

Subsequently, in the second embodiment, duplicate or invalid webpage node data ND can further be deleted from the webpage node data set wpg. Specifically, based on the table above, the processing unit 23 of the webpage data extraction device 2 selects, from the text contents NDT, at least one invalid text content and at least one piece of duplicate node data. Taking the table above as an example, the invalid text contents are '0' and 'null', and the duplicate node data is 'html/body/div[1]/div[2]/div[2]/div[3]/div[3]/div[6]∥返回首頁' (the text content means "Return to home page"). The webpage node data ND of the adjusted webpage node data set wpg is therefore as shown in the following table:

The processing unit 23 of the webpage data extraction device 2 can then divide the webpage node data ND of the webpage node data set wpg into a plurality of webpage node data groups ndg according to the path relevance of the XPath expressions NDX and the text relevance of the text contents NDT of the webpage node data ND.

In more detail, in the second embodiment the grouping of webpage node data is carried out in two parts. In the first part, similarly to the URL grouping, the minimum edit distance is computed between every pair of XPath expressions NDX of the webpage node data ND in the table above, and each pair of webpage node data ND whose MED value is less than an XML threshold (not shown) is added to the same path group XG. For the second embodiment, the resulting grouping is shown in the following table:

Then, in the second part, within each path group XG, a TF-IDF (term frequency-inverse document frequency) calculation is performed on the text contents NDT of the webpage node data ND to obtain the corresponding term frequency vectors, and the cosine between the term frequency vectors of every pair of text contents is calculated. If the cosine is greater than a text content threshold (not shown), the pair is added to the same webpage node data group ndg. For the second embodiment, the resulting grouping is shown in the following table:

In this way, combining the two parts of the grouping described above yields the webpage node data groups ndg, as shown in the following table:

Note that performing a TF-IDF calculation on text content using keywords to obtain the corresponding vectors, and computing the cosine between pairs of vectors to judge their relevance, is a technique that a person skilled in the art can readily understand from the prior art, so it is not described further here. The present invention mainly uses it as the relevance criterion for grouping.
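For readers who want a concrete picture of the TF-IDF and cosine step, a minimal pure-Python sketch follows. It is not taken from the patent: the whitespace tokenizer, the smoothed IDF formula `log((1 + N) / (1 + df)) + 1`, the function names, and the sample texts are all assumptions chosen for the example. Chinese text in particular would need a word segmenter instead of `split()`.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict term -> weight) per text."""
    docs = [Counter(text.lower().split()) for text in texts]
    n_docs = len(docs)
    # Document frequency: in how many texts each term appears.
    df = Counter(term for doc in docs for term in doc)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical text contents of three webpage nodes.
texts = [
    "price of the red phone",
    "price of the blue phone",
    "return to home page",
]
vectors = tfidf_vectors(texts)
similar = cosine(vectors[0], vectors[1])    # share most terms
unrelated = cosine(vectors[0], vectors[2])  # share no terms
```

Two nodes would then be placed in the same webpage node data group when their cosine exceeds the text content threshold, whose value the description does not give.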

Next, the processing unit 23 of the webpage data extraction device 2 calculates, for each webpage node data group ndg, the text content sum of the webpage node data ND in that group, that is, the total text length of the webpage node data ND belonging to the same webpage node data group ndg, as detailed in the following table:

Next, the processing unit 23 of the webpage data extraction device 2 sorts the text content sums of the different webpage node data groups ndg into a text content sum sequence, as shown in the following table:

Subsequently, the processing unit 23 of the webpage data extraction device 2 calculates the differences between adjacent text content sums in the sorted text content sum sequence, namely 1, 2, 1, 44, and 1, and selects the largest difference, 44. As before, because webpage node data carrying the main content usually has a large amount of text within the same webpage, the position where the largest difference occurs marks the boundary between webpage node data carrying the main content and webpage node data that does not.
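The largest-gap boundary described above can be sketched as follows. The text content sums used here are hypothetical values chosen only so that the adjacent differences come out as 1, 2, 1, 44, 1; the actual sums are not reproduced in this text. The ascending sort order, the function name `split_by_largest_gap`, and the group names A to F are likewise assumptions.

```python
def split_by_largest_gap(totals):
    """Split groups into (main, minor) at the largest gap between sorted totals."""
    if len(totals) < 2:
        raise ValueError("need at least two groups to find a gap")
    ordered = sorted(totals.items(), key=lambda kv: kv[1])  # ascending by total
    gaps = [hi[1] - lo[1] for lo, hi in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1  # first position after the largest gap
    minor = [name for name, _total in ordered[:cut]]
    main = [name for name, _total in ordered[cut:]]
    return main, minor, gaps


# Hypothetical text content sums for six webpage node data groups.
totals = {"A": 3, "B": 4, "C": 6, "D": 7, "E": 51, "F": 52}
main, minor, gaps = split_by_largest_gap(totals)
```

Groups on the high side of the cut would form the region treated as carrying the main content, and the rest would form the other region.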

Therefore, the processing unit 23 of the webpage data extraction device 2 can divide the text content sum sequence into a primary region and a secondary region according to the largest difference, and determine the at least one main webpage node data group MNDG among the webpage node data groups ndg according to the primary region, as shown in the following table:

Therefore, in the second embodiment, the XPath expressions NDX of the webpage node data ND included in the main webpage node data group MNDG are as shown in the following table:

Subsequently, the processing unit 23 of the webpage data extraction device 2 can apply a longest common subsequence (LCS) algorithm to the XPath expressions NDX of the webpage node data ND included in the main webpage node data group MNDG to determine the webpage main-content extraction information MX. In the second embodiment, the webpage main-content extraction information MX is: 'html/body/div[1]/main[1]/article[[0-9]+].*'.
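To make the longest common subsequence step more tangible, the sketch below computes an LCS over path steps (the pieces between slashes) with the classic dynamic-programming table, and folds it pairwise over several paths. Several things here are assumptions: the description does not say whether the LCS runs over characters or path steps, pairwise folding yields a common subsequence of all paths but not necessarily the longest one, and the three sample paths are hypothetical. How the common part is then generalized into a pattern such as `article[[0-9]+].*` is not spelled out in the description and is not attempted here.

```python
from functools import reduce


def lcs(a, b):
    """Longest common subsequence of two sequences, via dynamic programming."""
    m, n = len(a), len(b)
    # table[i][j] = length of the LCS of a[i:] and b[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover one LCS.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical paths of nodes in a main webpage node data group.
paths = [
    "html/body/div[1]/main[1]/article[1]/h1[1]",
    "html/body/div[1]/main[1]/article[2]/p[1]",
    "html/body/div[1]/main[1]/article[3]/p[2]",
]
common_steps = reduce(lcs, (p.split("/") for p in paths))
common_path = "/".join(common_steps)
```

For these sample paths the shared part is the stem up to `main[1]`; the steps that differ only in their index are where a generalized pattern would take over.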

In this way, provided that the webpages of the aforementioned URL group (that is, http://www.aaaaa.com/item1~3.html) share the same properties (for example, template and layout), the processing unit 23 of the webpage data extraction device 2 can subsequently select the webpage nodes that match the same main-content extraction information MX (that is, html/body/div[1]/main[1]/article[[0-9]+].*), for subsequent analysis and use of the main content.
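One way to apply the extraction information to a newly parsed page is to treat it as a regular expression over node paths, as sketched below. This reading is an assumption: written literally, the brackets in `div[1]` would be interpreted by a regex engine as character classes, so the sketch escapes the literal brackets and keeps only `[0-9]+` and `.*` as pattern syntax. The function name `select_main_nodes` and the sample nodes are hypothetical.

```python
import re

# The extraction information MX, with literal brackets escaped for the regex engine.
MX_PATTERN = re.compile(r"html/body/div\[1\]/main\[1\]/article\[[0-9]+\].*")


def select_main_nodes(nodes):
    """Keep the (path, text) pairs whose path matches the extraction pattern."""
    return [(path, text) for path, text in nodes if MX_PATTERN.fullmatch(path)]


# Hypothetical node data from another page in the same URL group.
nodes = [
    ("html/body/div[1]/main[1]/article[2]/p[1]", "Product description ..."),
    ("html/body/div[1]/div[2]/div[6]", "Return to home page"),
    ("html/body/div[1]/main[1]/article[10]", "Specifications ..."),
]
main_nodes = select_main_nodes(nodes)
```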

The third embodiment of the present invention is a webpage data extraction method; please refer to FIG. 3 for its flowchart. The method of the third embodiment is for use in a webpage data extraction device (for example, the webpage data extraction device 1 of the foregoing embodiment). The webpage data extraction device receives a plurality of pieces of webpage data from a web server. The detailed steps of the third embodiment are as follows.

First, step 301 is executed to cause the webpage data extraction device to divide the plurality of webpage data into at least one URL group according to the address relevance of the URLs of the webpage data. The at least one URL group includes a first URL group, and the first URL group includes at least part of the webpage data. Step 302 is executed to cause the webpage data extraction device to select first webpage data and second webpage data from the part of the webpage data in the first URL group.

Step 303 is executed to cause the webpage data extraction device to parse the first webpage data and the second webpage data to obtain a webpage node data set. The webpage node data set includes a plurality of pieces of webpage node data, and each piece of webpage node data includes a corresponding XML Path Language (XPath) expression and a text content.

Step 304 is executed to cause the webpage data extraction device to divide the webpage node data of the webpage node data set into a plurality of webpage node data groups according to the path relevance of the XPath expressions and the text relevance of the text contents of the webpage node data. Each webpage node data group includes at least part of the webpage node data.

Step 305 is executed, in which the webpage data extraction device calculates, for each webpage node data group, a text content sum of the portion of the webpage node data in that group. Step 306 is executed, in which the webpage data extraction device determines at least one main webpage node data group among the plurality of webpage node data groups according to the plurality of text content sums. Finally, step 307 is executed, in which the webpage data extraction device determines a webpage main content extraction information according to the XML Path Languages of the portion of the webpage node data included in the at least one main webpage node data group.

The fourth embodiment of the present invention is a webpage data extraction method, whose flowchart is shown in FIG. 4. The method of the fourth embodiment is for use in a webpage data extraction device (for example, the webpage data extraction device 2 of the foregoing embodiments). The webpage data extraction device receives a plurality of webpage data from a web server. The detailed steps of the fourth embodiment are as follows.

First, step 401 is executed, in which the webpage data extraction device divides the plurality of webpage data into at least one URL group according to the address relevance of the URLs of the webpage data. The at least one URL group includes a first URL group, the first URL group includes at least a portion of the webpage data, and within the first URL group the minimum edit distances between the URLs of the portion of the webpage data are all less than a URL threshold.
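The grouping in step 401 can be illustrated with a minimal sketch. It assumes a character-level Levenshtein distance and a greedy grouping in which a URL joins the first group whose members are all closer than the threshold; the function names, the example URLs, and the threshold value are hypothetical and are not taken from the embodiment.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum edit distance (Levenshtein) using a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def group_urls(urls, url_threshold):
    """Greedy grouping (an assumption): a URL joins the first group in which
    its distance to every member is below url_threshold."""
    groups = []
    for url in urls:
        for group in groups:
            if all(edit_distance(url, member) < url_threshold for member in group):
                group.append(url)
                break
        else:
            groups.append([url])
    return groups


urls = [
    "http://example.com/news/1001",
    "http://example.com/news/1002",
    "http://example.com/about-us/company",
]
url_groups = group_urls(urls, url_threshold=5)
```

With these hypothetical inputs, the two news URLs differ by one character and fall into the same group, while the third URL starts a group of its own.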

Step 402 is executed, in which the webpage data extraction device selects, from the portion of the webpage data of the first URL group, a first webpage data having the largest data volume and a second webpage data having the second largest data volume. Step 403 is executed, in which the webpage data extraction device parses the first webpage data and the second webpage data to obtain a webpage node data set. The webpage node data set includes a plurality of webpage node data, and each webpage node data includes a corresponding XML Path Language (XPath) expression and a text content.
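Steps 402 and 403 can be sketched as follows, using only the Python standard library. The sketch assumes that "data volume" means the length of the HTML string, that the markup is well-formed, and that each text run is recorded with an indexed, XPath-like path to its enclosing element; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NodeCollector(HTMLParser):
    """Collects (xpath, text) records; assumes well-formed markup."""

    def __init__(self):
        super().__init__()
        self.steps = []       # current path, e.g. ["html[1]", "body[1]"]
        self.counts = [{}]    # per-depth count of child tags seen so far
        self.nodes = []       # collected (xpath, text) records

    def handle_starttag(self, tag, attrs):
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        if tag in VOID_TAGS:
            return
        self.steps.append(f"{tag}[{siblings[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.steps:
            return
        self.steps.pop()
        self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.steps:
            self.nodes.append(("/" + "/".join(self.steps), text))


def pick_two_largest(pages):
    """pages: list of (url, html). Returns the two largest by HTML length."""
    return sorted(pages, key=lambda page: len(page[1]), reverse=True)[:2]


def parse_nodes(html):
    collector = NodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


pages = [
    ("u1", "<html><body><p>Hi</p></body></html>"),
    ("u2", "<html><body><div><p>Hello</p><p>World</p></div></body></html>"),
    ("u3", "<html><body><div><p>Longer text here</p><br><p>More</p></div></body></html>"),
]
first, second = pick_two_largest(pages)
node_set = parse_nodes(first[1]) + parse_nodes(second[1])
```

A production parser would also need to handle malformed markup and script or style content, which this sketch does not attempt.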

Step 404 is executed, in which the webpage data extraction device selects at least one invalid text content and at least one duplicate node data from the text contents, and deletes the webpage nodes corresponding to the at least one invalid text content and the at least one duplicate node data from the webpage node data set.
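This passage does not define what makes a text content invalid or a node a duplicate, so the following is only a sketch under stated assumptions: a text is treated as invalid when it is shorter than a hypothetical `min_chars` or contains no letter or digit, and a node is treated as a duplicate when the same (path, text) pair occurs more than once in the set, in which case every copy is dropped.

```python
from collections import Counter


def filter_nodes(nodes, min_chars=2):
    """nodes: list of (xpath, text). Returns the records that survive the
    assumed invalid-text and duplicate-node checks."""
    occurrences = Counter(nodes)
    kept = []
    for xpath, text in nodes:
        stripped = text.strip()
        has_word_char = any(ch.isalnum() for ch in stripped)
        if len(stripped) < min_chars or not has_word_char:
            continue  # assumed "invalid text content"
        if occurrences[(xpath, text)] > 1:
            continue  # assumed "duplicate node data", e.g. a shared menu entry
        kept.append((xpath, text))
    return kept


nodes = [
    ("/html[1]/body[1]/ul[1]/li[1]", "Home"),         # appears on both pages
    ("/html[1]/body[1]/ul[1]/li[1]", "Home"),
    ("/html[1]/body[1]/span[1]", "|"),                # separator only
    ("/html[1]/body[1]/div[1]/p[1]", "Article one"),
    ("/html[1]/body[1]/div[1]/p[1]", "Article two"),
]
cleaned = filter_nodes(nodes)
```

Under these assumptions, navigation text repeated on both pages and punctuation-only separators are removed, while text that differs between the two pages is kept.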

Step 405 is executed, in which the webpage data extraction device divides the plurality of webpage node data of the webpage node data set into a plurality of path groups according to the path relevance of the XML Path Languages of the webpage node data. Within each path group, the minimum edit distances between the XML Path Languages of the portion of the webpage node data are all less than an XML threshold.
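As an illustration of step 405, the sketch below measures the edit distance between two paths in units of path steps rather than characters. The embodiment does not state the unit, so this choice, the greedy grouping, and the threshold value are assumptions; all names are hypothetical.

```python
def path_steps(xpath):
    """Split '/html[1]/body[1]/p[2]' into ['html[1]', 'body[1]', 'p[2]']."""
    return [step for step in xpath.split("/") if step]


def step_edit_distance(a, b):
    """Minimum edit distance between two lists of path steps."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, start=1):
        curr = [i]
        for j, sb in enumerate(b, start=1):
            cost = 0 if sa == sb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def group_by_path(nodes, xml_threshold):
    """nodes: list of (xpath, text). A node joins the first path group whose
    members are all within xml_threshold of it (greedy, an assumption)."""
    groups = []
    for node in nodes:
        steps = path_steps(node[0])
        for group in groups:
            if all(step_edit_distance(steps, path_steps(m[0])) < xml_threshold
                   for m in group):
                group.append(node)
                break
        else:
            groups.append([node])
    return groups


nodes = [
    ("/html[1]/body[1]/div[1]/p[1]", "First paragraph"),
    ("/html[1]/body[1]/div[1]/p[2]", "Second paragraph"),
    ("/html[1]/body[1]/ul[1]/li[1]/a[1]", "Menu link"),
]
path_groups = group_by_path(nodes, xml_threshold=2)
```

Sibling paragraphs differ in a single step and share a path group, whereas the menu link differs in three steps and forms a separate group.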

Step 406 is executed, in which the webpage data extraction device, for each path group, divides the path group into a plurality of webpage node data groups according to the text relevance of the text contents of the portion of the webpage node data. Within each path group, each text content of the portion of the webpage node data has a term frequency vector, and within each webpage node data group of each path group, the cosine values between the term frequency vectors of the text contents of the portion of the webpage node data are greater than a text content threshold.
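The text relevance test of step 406 can be written as the cosine of the angle between two term frequency vectors. The sketch below assumes whitespace tokenization and lower-casing, which suits English text only; Chinese text would need a word segmenter, which is outside this sketch. The function names are hypothetical.

```python
import math
from collections import Counter


def term_frequency(text):
    """Term frequency vector as a Counter; whitespace tokens are an assumption."""
    return Counter(text.lower().split())


def cosine(tf_a, tf_b):
    """Cosine value between two term frequency vectors (0.0 if either is empty)."""
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


similarity = cosine(term_frequency("the cat sat"), term_frequency("the cat ran"))
# Two of three terms are shared, so the cosine value is 2/3.
same_group = similarity > 0.5  # 0.5 is a hypothetical text content threshold
```

Two nodes of the same path group would then be placed in the same webpage node data group only when this value exceeds the text content threshold.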

Step 407 is executed, in which the webpage data extraction device sorts the plurality of text content sums into a text content sum sequence. Step 408 is executed, in which the webpage data extraction device calculates a plurality of differences between adjacent text content sums in the text content sum sequence. Step 409 is executed, in which the webpage data extraction device selects a maximum difference among the plurality of differences. Step 410 is executed, in which the webpage data extraction device divides the text content sum sequence into a main area and a secondary area according to the maximum difference.
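Steps 407 to 410 amount to cutting a sorted sequence at its largest gap. The sketch below assumes that each text content sum is a character count, that the sequence is sorted in descending order, and that the main area is the side with the larger sums; the group identifiers and numbers are hypothetical.

```python
def split_by_largest_gap(text_sums):
    """text_sums: dict mapping a group id to its text content sum.
    Returns (main_area, secondary_area) as lists of (group_id, sum)."""
    ordered = sorted(text_sums.items(), key=lambda item: item[1], reverse=True)
    if len(ordered) < 2:
        return ordered, []
    # Differences between adjacent sums in the sorted sequence.
    gaps = [ordered[i][1] - ordered[i + 1][1] for i in range(len(ordered) - 1)]
    # Cut immediately after the position of the largest difference.
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


text_sums = {"g1": 120, "g2": 5400, "g3": 90, "g4": 4800, "g5": 60}
main_area, secondary_area = split_by_largest_gap(text_sums)
```

Here the sorted sums are 5400, 4800, 120, 90, 60; the largest difference (4680) lies between 4800 and 120, so groups g2 and g4 form the main area.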

Step 411 is executed, in which the webpage data extraction device determines at least one main webpage node data group among the plurality of webpage node data groups according to the main area. Step 412 is executed, in which the webpage data extraction device performs a longest common subsequence algorithm on the XML Path Languages of the portion of the webpage node data included in the at least one main webpage node data group. Step 413 is executed, in which the webpage data extraction device determines the webpage main content extraction information according to the result of step 412.
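Step 412 can be sketched with the standard dynamic-programming longest common subsequence, applied here to path steps. Two points are assumptions rather than statements of the embodiment: the paths of a group are folded pairwise (which approximates, but is not guaranteed to equal, a true multi-sequence LCS), and the common steps are joined back into a single path to serve as the extraction information. All names are hypothetical.

```python
from functools import reduce


def lcs(a, b):
    """Longest common subsequence of two lists, by dynamic programming."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover one longest subsequence.
    result, i, j = [], rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


def common_path(xpaths):
    """Fold LCS over all paths of a main group and join the shared steps."""
    step_lists = [[s for s in path.split("/") if s] for path in xpaths]
    if not step_lists:
        return ""
    return "/" + "/".join(reduce(lcs, step_lists))


main_group_paths = [
    "/html[1]/body[1]/div[2]/p[1]",
    "/html[1]/body[1]/div[2]/p[2]",
    "/html[1]/body[1]/div[2]/p[3]",
]
extraction_path = common_path(main_group_paths)
```

For these hypothetical paths the shared steps yield `/html[1]/body[1]/div[2]`, the container whose children hold the main content.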

In summary, the webpage data extraction device and the webpage data extraction method of the present invention automatically analyze the templates and layout syntax of different webpage groups and, on that basis, automatically identify the webpage nodes that carry the main content. Webpage data can thus be extracted more efficiently, which facilitates subsequent data analysis.

The above embodiments merely illustrate implementations of the present invention and explain its technical features; they are not intended to limit the scope of protection of the present invention. Any change or equivalent arrangement that can be readily accomplished by a person skilled in the art falls within the scope claimed by the present invention, and the scope of protection of the present invention shall be determined by the claims.

301~307: Steps

Claims

A webpage data extraction method for a webpage data extraction device, the webpage data extraction device receiving a plurality of webpage data from a web server, the webpage data extraction method comprising: (a) causing the webpage data extraction device to divide the webpage data into at least one URL group according to an address relevance of a plurality of uniform resource locators (URLs) of the webpage data, wherein the at least one URL group includes a first URL group, and the first URL group includes at least a portion of the webpage data; (b) causing the webpage data extraction device to select a first webpage data and a second webpage data from the portion of the webpage data of the first URL group; (c) causing the webpage data extraction device to parse the first webpage data and the second webpage data to obtain a webpage node data set, wherein the webpage node data set includes a plurality of webpage node data, and each webpage node data includes a corresponding XML Path Language and a text content; (d) causing the webpage data extraction device to divide the webpage node data of the webpage node data set into a plurality of webpage node data groups according to a path relevance of the XML Path Languages of the webpage node data and a text relevance of the text contents, wherein each webpage node data group includes at least a portion of the webpage node data; (e) causing the webpage data extraction device to calculate, for each webpage node data group, a text content sum of the portion of the webpage node data in that group; (f) causing the webpage data extraction device to determine at least one main webpage node data group among the webpage node data groups according to the text content sums; and (g) causing the webpage data extraction device to determine a webpage main content extraction information according to the XML Path Languages of the portion of the webpage node data included in the at least one main webpage node data group.

The webpage data extraction method according to claim 1, wherein, in the first URL group, the minimum edit distances between the URLs of the portion of the webpage data are all less than a URL threshold.

The webpage data extraction method according to claim 1, wherein the step (b) further comprises: (b1) causing the webpage data extraction device to select, from the portion of the webpage data of the first URL group, the first webpage data having the largest data volume and the second webpage data having the second largest data volume.

The webpage data extraction method according to claim 1, wherein the step (c) further comprises: (c1) causing the webpage data extraction device to select at least one invalid text content and at least one duplicate node data from the text contents, and to delete the webpage nodes corresponding to the at least one invalid text content and the at least one duplicate node data from the webpage node data set.

The webpage data extraction method according to claim 1, wherein the step (d) further comprises: (d1) causing the webpage data extraction device to divide the webpage node data of the webpage node data set into a plurality of path groups according to the path relevance of the XML Path Languages of the webpage node data, wherein the minimum edit distances between the XML Path Languages of the portion of the webpage node data of each path group are all less than an XML threshold; and (d2) causing the webpage data extraction device to divide, for each of the path groups, the path group into the webpage node data groups according to the text relevance of the text contents of the portion of the webpage node data; wherein, in each of the path groups, each text content of the portion of the webpage node data has a term frequency vector; and wherein, in each of the path groups, the cosine values between the term frequency vectors of the text contents of the portion of the webpage node data of each webpage node data group are greater than a text content threshold.

The webpage data extraction method according to claim 1, wherein the step (f) further comprises: (f1) causing the webpage data extraction device to sort the text content sums into a text content sum sequence; (f2) causing the webpage data extraction device to calculate a plurality of differences between adjacent text content sums in the text content sum sequence; (f3) causing the webpage data extraction device to select a maximum difference among the differences; (f4) causing the webpage data extraction device to divide the text content sum sequence into a main area and a secondary area according to the maximum difference; and (f5) causing the webpage data extraction device to determine the at least one main webpage node data group among the webpage node data groups according to the main area.

The webpage data extraction method according to claim 1, wherein the step (g) further comprises: (g1) causing the webpage data extraction device to perform a longest common subsequence algorithm on the XML Path Languages of the portion of the webpage node data included in the at least one main webpage node data group; and (g2) causing the webpage data extraction device to determine the webpage main content extraction information according to the result of the step (g1).

A webpage data extraction device comprising: a receiving unit configured to receive a plurality of webpage data from a web server; and a processing unit configured to: divide the webpage data into at least one URL group according to an address relevance of a plurality of uniform resource locators (URLs) of the webpage data, wherein the at least one URL group includes a first URL group, and the first URL group includes at least a portion of the webpage data; select a first webpage data and a second webpage data from the portion of the webpage data of the first URL group; parse the first webpage data and the second webpage data to obtain a webpage node data set, wherein the webpage node data set includes a plurality of webpage node data, and each webpage node data includes a corresponding XML Path Language and a text content; divide the webpage node data of the webpage node data set into a plurality of webpage node data groups according to a path relevance of the XML Path Languages of the webpage node data and a text relevance of the text contents, wherein each webpage node data group includes at least a portion of the webpage node data; calculate, for each webpage node data group, a text content sum of the portion of the webpage node data in that group; determine at least one main webpage node data group among the webpage node data groups according to the text content sums; and determine a webpage main content extraction information according to the XML Path Languages of the portion of the webpage node data included in the at least one main webpage node data group.

The webpage data extraction device of claim 8, wherein, in the first URL group, the minimum edit distances between the URLs of the portion of the webpage data are all less than a URL threshold.

The webpage data extraction device of claim 8, wherein the processing unit is further configured to: select, from the portion of the webpage data of the first URL group, the first webpage data having the largest data volume and the second webpage data having the second largest data volume.

The webpage data extraction device of claim 8, wherein the processing unit is further configured to: select at least one invalid text content and at least one duplicate node data from the text contents, and delete the webpage nodes corresponding to the at least one invalid text content and the at least one duplicate node data from the webpage node data set.

The webpage data extraction device of claim 8, wherein the processing unit is further configured to: divide the webpage node data of the webpage node data set into a plurality of path groups according to the path relevance of the XML Path Languages of the webpage node data, wherein the minimum edit distances between the XML Path Languages of the portion of the webpage node data of each path group are all less than an XML threshold; and divide, for each of the path groups, the path group into the webpage node data groups according to the text relevance of the text contents of the portion of the webpage node data; wherein, in each of the path groups, each text content of the portion of the webpage node data has a term frequency vector; and wherein, in each of the path groups, the cosine values between the term frequency vectors of the text contents of the portion of the webpage node data of each webpage node data group are greater than a text content threshold.

The webpage data extraction device of claim 8, wherein the processing unit is further configured to: sort the text content sums into a text content sum sequence; calculate a plurality of differences between adjacent text content sums in the text content sum sequence; select a maximum difference among the differences; divide the text content sum sequence into a main area and a secondary area according to the maximum difference; and determine the at least one main webpage node data group among the webpage node data groups according to the main area.

The webpage data extraction device of claim 8, wherein the processing unit is further configured to: perform a longest common subsequence (Longest Common Subsequence) algorithm on the XML Path Languages of the portion of the webpage node data included in the at least one main webpage node data group; and determine the webpage main content extraction information according to the result of the longest common subsequence algorithm.