TWI611308B - Webpage data extraction device and webpage data extraction method thereof - Google Patents


Info

Publication number
TWI611308B
Authority
TW
Taiwan
Prior art keywords
webpage
data
node data
node
group
Prior art date
Application number
TW105135730A
Other languages
Chinese (zh)
Other versions
TW201818268A (en)
Inventor
黃奕翔
邱育賢
蕭暉議
Original Assignee
財團法人資訊工業策進會
Priority date
Filing date
Publication date
Application filed by 財團法人資訊工業策進會
Priority to TW105135730A (TWI611308B)
Priority to CN201611000331.0A (CN108021600A)
Priority to US15/358,119 (US20180121558A1)
Application granted
Publication of TWI611308B
Publication of TW201818268A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/457 Network directories; Name-to-address mapping containing identifiers of data entities on a computer, e.g. file names

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A webpage data extraction device and a webpage data extraction method thereof. The webpage data extraction device: divides webpage data into URL groups according to the address relevance of the URLs of the webpage data; selects first webpage data and second webpage data from the webpage data of a URL group; parses the first webpage data and the second webpage data to obtain a webpage node data set; divides the webpage node data of the webpage node data set into a plurality of webpage node data groups according to the path relevance of the XML Path Language (XPath) expressions and the text relevance of the text contents of the webpage node data; calculates a text content sum for each webpage node data group; determines, according to the text content sums, the main webpage node data group among the webpage node data groups; and determines main webpage content extraction information according to the XPath expressions of the webpage node data included in the main webpage node data group.

Description

Webpage data extraction device and webpage data extraction method thereof

The present invention relates to a webpage data extraction device and a webpage data extraction method thereof. More specifically, the present invention relates to an automated webpage data extraction device and a webpage data extraction method thereof.

With the development of Internet applications, all kinds of information can be obtained from different webpages. Therefore, when a specific data analysis need arises, the main content of the webpages of the relevant websites can be extracted and then analyzed.

Conventional approaches to extracting the main content of webpages mostly rely on manual crawling and analysis. However, manually determining the main content of different webpages across different websites is quite inefficient. To improve extraction efficiency, techniques based on custom programs have been developed that use the various templates and layouts of webpages as training data for webpage analysis and main content extraction.

However, such custom programs can only handle the templates and layouts of specific webpages. When a webpage is redesigned or its markup structure is slightly adjusted, the analysis and extraction results will be clearly wrong unless the custom program is adjusted accordingly.

Moreover, as webpage layouts grow more complex, the amount of information on a webpage has also increased substantially, and a single webpage may contain nearly a thousand webpage nodes. Consequently, when the structure or form of a webpage changes, adjusting the custom program becomes even more difficult and may even require rewriting it, which likewise makes main content determination inefficient.

Therefore, how to overcome the inefficiency of conventional main content extraction is a goal the industry must work toward together.

The main objective of the present invention is to provide a webpage data extraction method for a webpage data extraction device. The webpage data extraction device receives a plurality of webpage data from a webpage server. The webpage data extraction method comprises: (a) causing the webpage data extraction device to divide the plurality of webpage data into at least one URL group according to the address relevance of the uniform resource locators (URLs) of the webpage data, wherein the at least one URL group includes a first URL group and the first URL group includes at least part of the webpage data; (b) causing the webpage data extraction device to select first webpage data and second webpage data from the webpage data of the first URL group; and (c) causing the webpage data extraction device to parse the first webpage data and the second webpage data to obtain a webpage node data set, wherein the webpage node data set includes a plurality of webpage node data and each webpage node data includes a corresponding XML Path Language (XPath) expression and text content.

The webpage data extraction method further comprises: (d) causing the webpage data extraction device to divide the webpage node data of the webpage node data set into a plurality of webpage node data groups according to the path relevance of the XPath expressions and the text relevance of the text contents of the webpage node data, wherein each webpage node data group includes at least part of the webpage node data; (e) causing the webpage data extraction device to calculate, for each webpage node data group, a text content sum of the webpage node data in that group; (f) causing the webpage data extraction device to determine, according to the text content sums, at least one main webpage node data group among the webpage node data groups; and (g) causing the webpage data extraction device to determine main webpage content extraction information according to the XPath expressions of the webpage node data included in the at least one main webpage node data group.

To achieve the above objective, the present invention discloses a webpage data extraction device comprising a receiving unit and a processing unit. The receiving unit is configured to receive a plurality of webpage data from a webpage server. The processing unit is configured to: divide the plurality of webpage data into at least one URL group according to the address relevance of the URLs of the webpage data, wherein the at least one URL group includes a first URL group and the first URL group includes at least part of the webpage data; select first webpage data and second webpage data from the webpage data of the first URL group; and parse the first webpage data and the second webpage data to obtain a webpage node data set, wherein the webpage node data set includes a plurality of webpage node data and each webpage node data includes a corresponding XPath expression and text content.

The processing unit is further configured to: divide the webpage node data of the webpage node data set into a plurality of webpage node data groups according to the path relevance of the XPath expressions and the text relevance of the text contents of the webpage node data, wherein each webpage node data group includes at least part of the webpage node data; calculate, for each webpage node data group, a text content sum of the webpage node data in that group; determine, according to the text content sums, at least one main webpage node data group among the webpage node data groups; and determine main webpage content extraction information according to the XPath expressions of the webpage node data included in the at least one main webpage node data group.

In addition, after referring to the drawings and the embodiments described below, a person having ordinary skill in the art will understand other objectives of the present invention, as well as its technical means and implementations.

1, 2‧‧‧webpage data extraction device

11, 21‧‧‧receiving unit

13, 23‧‧‧processing unit

wp‧‧‧webpage data

ul‧‧‧uniform resource locator (URL)

ug‧‧‧at least one URL group

UL1‧‧‧first URL group

WP1‧‧‧first webpage data

WP2‧‧‧second webpage data

ND‧‧‧webpage node data

NDX‧‧‧XML Path Language (XPath) expression

NDT‧‧‧text content

wpg‧‧‧webpage node data set

ndg‧‧‧webpage node data group

MNDG‧‧‧at least one main webpage node data group

MX‧‧‧main webpage content extraction information

FIG. 1A is a schematic diagram of the webpage data extraction operation of the first embodiment of the present invention; FIG. 1B is a block diagram of the webpage data extraction device of the first embodiment; FIG. 2A is a schematic diagram of the webpage data extraction operation of the second embodiment; FIG. 2B is a block diagram of the webpage data extraction device of the second embodiment; FIG. 3 is a flowchart of the webpage data extraction method of the third embodiment; and FIGS. 4A-4B are flowcharts of the webpage data extraction method of the fourth embodiment.

The content of the present invention is explained below through embodiments. Note that the embodiments are not intended to limit the invention to any particular environment, application, or manner of implementation described therein. The description of the embodiments is therefore only for the purpose of illustrating the invention rather than limiting it, and the scope claimed in this application is defined by the claims. In addition, in the following embodiments and drawings, elements not directly related to the invention are omitted and not shown, and the dimensional relationships among elements in the drawings are provided only for ease of understanding and do not limit the actual scale.

Please refer to FIGS. 1A-1B. FIG. 1A is a schematic diagram of the webpage data extraction operation of the first embodiment of the present invention, and FIG. 1B is a block diagram of a webpage data extraction device 1 of the first embodiment. The webpage data extraction device 1 includes a receiving unit 11 and a processing unit 13, and connects to a webpage server 9 through the receiving unit 11. The interactions among the elements are further described below.

First, when the webpages of the webpage server 9 need to be analyzed, the receiving unit 11 of the webpage data extraction device 1 receives a plurality of webpage data wp from the webpage server 9. In accordance with how the Internet is used, each webpage data wp has a corresponding uniform resource locator (URL) ul.

Next, the processing unit 13 of the webpage data extraction device 1 divides the webpage data wp into at least one URL group ug according to the address relevance of the URLs ul of the webpage data wp. The at least one URL group ug includes a first URL group UL1, and the first URL group UL1 includes at least part of the webpage data wp.

Note that the purpose of this grouping is to make a preliminary classification, based on URL characteristics, of webpages whose content is highly similar, to facilitate the subsequent comparison and analysis. In other words, because webpages sharing the same template and layout usually have URLs of similar form, a preliminary grouping can be made according to the address relevance of the URLs of the webpage data.

Subsequently, the processing unit 13 of the webpage data extraction device 1 selects a first webpage data WP1 and a second webpage data WP2 from the webpage data of the first URL group UL1, and parses the first webpage data WP1 and the second webpage data WP2 to obtain a webpage node data set wpg.

Specifically, because a single webpage contains multiple webpage nodes, parsing the markup of the first webpage data WP1 and the second webpage data WP2 yields a webpage node data set wpg that includes a plurality of webpage node data ND. Each webpage node data ND includes a corresponding XML Path Language (XPath) expression NDX and a text content NDT.

Accordingly, the processing unit 13 of the webpage data extraction device 1 can divide the webpage node data ND of the webpage node data set wpg into a plurality of webpage node data groups ndg according to the path relevance of the XPath expressions NDX and the text relevance of the text contents NDT of the webpage node data ND. Each webpage node data group ndg includes at least part of the webpage node data ND.

Similarly, the purpose of this grouping is to classify webpage nodes with highly similar content, based on the characteristics of their XML syntax and text content, to facilitate the subsequent determination of the main content. In other words, webpage nodes with highly similar XML syntax are grouped according to the path relevance of their XPath expressions, and webpage nodes with highly similar content can likewise be grouped according to the text relevance of their text contents.

Next, the processing unit 13 of the webpage data extraction device 1 calculates, for each webpage node data group ndg, a text content sum (not shown) of the webpage node data ND in that group, that is, the total text length of the webpage node data ND belonging to the same webpage node data group ndg, and determines at least one main webpage node data group MNDG among the webpage node data groups ndg according to the text content sums.

Specifically, because webpage node data carrying the main content of a webpage usually have a larger amount of text, the determination above uses the text content sum of the webpage node data in each webpage node data group to separate the webpage node data that carry the main content from those that do not.

Accordingly, the processing unit 13 of the webpage data extraction device 1 can determine main webpage content extraction information MX according to the XPath expressions NDX of the webpage node data ND included in the at least one main webpage node data group MNDG. More precisely, the main webpage content extraction information MX is mainly a set of XPath expressions NDX.

In this way, given that the webpages of the aforementioned URL group share the same properties (such as template and layout), the processing unit 13 of the webpage data extraction device 1 can subsequently use this set of XPath expressions NDX to directly select the webpage nodes carrying the main content from the webpages of the URL group, for subsequent analysis and use of the main content.

Please refer to FIGS. 2A-2B. FIG. 2A is a schematic diagram of the webpage data extraction operation of the second embodiment of the present invention, and FIG. 2B is a block diagram of a webpage data extraction device 2 of the second embodiment. The webpage data extraction device 2 includes a receiving unit 21 and a processing unit 23, and connects to the webpage server 9 through the receiving unit 21. The second embodiment uses a worked example to further explain the details of how the webpage data extraction device 2 extracts and analyzes webpages.

Likewise, when the webpages of the webpage server 9 need to be analyzed, the receiving unit 21 of the webpage data extraction device 2 receives a plurality of webpage data wp from the webpage server 9, and in accordance with how the Internet is used, each webpage data wp has a corresponding URL ul. The webpage data wp and their corresponding URLs ul are shown in the following table:

Figure TWI611308BD00001

Next, the processing unit 23 of the webpage data extraction device 2 divides the webpage data wp into at least one URL group ug according to the address relevance of the URLs ul of the webpage data wp. The at least one URL group ug includes the first URL group UL1, and the first URL group UL1 includes at least part of the webpage data wp. Note that in the second embodiment, the URL grouping is performed mainly on the basis of the minimum edit distance (MED).

Specifically, the processing unit 23 of the webpage data extraction device 2 calculates the minimum edit distance between every pair of URLs ul of the webpage data wp, with the results shown in the following table:

Figure TWI611308BD00002

Accordingly, based on the table above, the processing unit 23 of the webpage data extraction device 2 places each pair of webpage data whose MED value is less than a URL threshold into the same URL group. In the second embodiment the URL threshold is 2, so webpage pairs with an MED value of 1 are placed in the same URL group.

Specifically, the webpage data wp included in the first URL group UL1 are http://www.aaaaa.com/item1~3.html. In addition, the at least one URL group ug may also include a second URL group (not shown) that includes at least part of the webpage data wp, namely http://www.aaaaa.com/list1~2.html. Because every URL group is processed in the same way, the following description focuses on the first URL group UL1 only.
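The MED-based URL grouping described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the shorthand item1~3.html and list1~2.html expands to individual URLs ending in item1.html through item3.html and list1.html through list2.html, and it assumes that pairs are merged transitively (single linkage), which the text does not state. The names `edit_distance` and `group_by_med` are hypothetical.

```python
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def group_by_med(items, threshold):
    """Put items in the same group when their MED is below the threshold.

    Assumption: merging is transitive (single linkage via union-find).
    """
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if edit_distance(items[i], items[j]) < threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())

# Assumed expansion of the example URLs in the second embodiment.
urls = [f"http://www.aaaaa.com/item{i}.html" for i in (1, 2, 3)] + \
       [f"http://www.aaaaa.com/list{i}.html" for i in (1, 2)]
url_groups = group_by_med(urls, threshold=2)  # URL threshold of 2, as in the text
```

With these inputs the item pages and the list pages end up in two separate groups, matching the first and second URL groups described in the text.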

Next, the processing unit 23 of the webpage data extraction device 2 selects, from the webpage data of the first URL group UL1, the first webpage data WP1 having the largest data size (that is, the HTML size of the webpage data) and the second webpage data WP2 having the second-largest data size, and parses the first webpage data WP1 and the second webpage data WP2 to obtain the webpage node data set wpg.

Specifically, because a single webpage contains multiple webpage nodes, parsing the markup of the first webpage data WP1 and the second webpage data WP2 yields a webpage node data set wpg that includes a plurality of webpage node data ND. Each webpage node data ND includes a corresponding XPath expression NDX and a text content NDT, as detailed in the following table:

Figure TWI611308BD00003
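As an illustration of what a webpage node data entry (an XPath plus its text content) might look like in practice, the sketch below walks a parsed document and emits one `(xpath, text)` pair per element that has text. It assumes well-formed XHTML so that Python's standard `xml.etree.ElementTree` can parse it; real-world HTML would need a more tolerant parser. The function name `extract_nodes`, the sample markup, and the convention of indexing every step below the root are assumptions for illustration, and text that follows a child element (its tail) is ignored.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def extract_nodes(markup: str):
    """Return a list of (xpath, text) pairs for elements with non-empty text."""
    root = ET.fromstring(markup)
    nodes = []

    def walk(elem, path):
        text = (elem.text or "").strip()
        if text:
            nodes.append((path, text))
        seen = Counter()  # position of each child among same-tag siblings
        for child in elem:
            seen[child.tag] += 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, root.tag)
    return nodes

# Hypothetical page used only to show the output shape.
page = ("<html><body><div><p>Hello</p><p>World</p></div>"
        "<div>Footer</div></body></html>")
node_data = extract_nodes(page)
```

Running `extract_nodes` on the first and second webpage data and concatenating the results would give one possible form of the webpage node data set wpg.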

Subsequently, in the second embodiment, duplicate or invalid webpage node data ND may further be deleted from the webpage node data set wpg. Specifically, the processing unit 23 of the webpage data extraction device 2 identifies, based on the table above, at least one invalid text content among the text contents NDT and at least one duplicate node data. Taking the table above as an example, the invalid text contents are '0' and 'null', and the duplicate node data is 'html/body/div[1]/div[2]/div[2]/div[3]/div[3]/div[6]∥返回首頁' (the text means "Return to home page"). The webpage node data ND of the adjusted webpage node data set wpg are therefore as shown in the following table:

Figure TWI611308BD00004
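The clean-up step can be sketched as a simple filter over `(xpath, text)` pairs. The invalid values '0' and 'null' come from the text; the assumption that every occurrence of a duplicated pair is dropped (rather than keeping one copy) is an interpretation, since a node repeated verbatim on both pages is typically template boilerplate. The name `clean_nodes` and the sample data are hypothetical.

```python
from collections import Counter

INVALID_TEXTS = {"0", "null"}  # invalid text contents named in the second embodiment

def clean_nodes(nodes):
    """Drop nodes with invalid text and nodes whose (xpath, text) pair repeats.

    Assumption: all copies of a repeated pair are removed, not just the extras.
    """
    counts = Counter(nodes)
    return [(xpath, text) for xpath, text in nodes
            if text not in INVALID_TEXTS and counts[(xpath, text)] == 1]

# Hypothetical node data from two pages of the same URL group.
sample = [
    ("html/body/div[1]/main[1]/article[1]", "Product A description"),
    ("html/body/div[1]/span[1]", "0"),
    ("html/body/div[1]/div[6]", "Return to home page"),
    ("html/body/div[1]/main[1]/article[1]", "Product B description"),
    ("html/body/div[1]/span[2]", "null"),
    ("html/body/div[1]/div[6]", "Return to home page"),
]
cleaned = clean_nodes(sample)
```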

The processing unit 23 of the webpage data extraction device 2 can then divide the webpage node data ND of the webpage node data set wpg into a plurality of webpage node data groups ndg according to the path relevance of the XPath expressions NDX and the text relevance of the text contents NDT of the webpage node data ND.

In more detail, in the second embodiment the grouping of webpage node data is carried out in two parts. In the first part, the minimum edit distance is calculated, as before, between every pair of XPath expressions NDX of the webpage node data ND in the table above, and each pair of webpage node data ND whose MED value is less than an XML threshold (not shown) is placed in the same path group XG. In the second embodiment, the grouping is as shown in the following table:

Figure TWI611308BD00005

Then, in the second part, within each path group XG, TF-IDF (term frequency-inverse document frequency) is computed on the text contents NDT of the webpage node data ND to obtain the corresponding term frequency vectors, and the cosine between the term frequency vectors of each pair of text contents is calculated. If the cosine is greater than a text content threshold (not shown), the pair is placed in the same webpage node data group ndg. In the second embodiment, the grouping is as shown in the following table:

Figure TWI611308BD00006

Combining the two parts of the grouping described above yields the webpage node data groups ndg, as shown in the following table:

Figure TWI611308BD00007
Figure TWI611308BD00008

Note that performing TF-IDF computation on text content using keywords to obtain the corresponding vectors, and calculating the cosine between pairs of vectors to judge their relevance, is a technique that a person skilled in the art can readily understand from known art and is not detailed here; the present invention mainly uses it as the relevance basis for grouping.
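For readers unfamiliar with the technique, a minimal TF-IDF and cosine similarity sketch follows. It is not taken from the patent: it assumes whitespace tokenization (Chinese text would need a word segmenter) and a smoothed IDF of the form log((1 + N) / (1 + df)) + 1, and the function names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict term -> weight) per text.

    Assumptions: whitespace tokenization and smoothed IDF.
    """
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    # Document frequency: each document counts a term at most once.
    df = Counter(term for doc in docs for term in doc)
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in df}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical text contents from one path group.
texts = ["price 100 dollars", "price 250 dollars", "contact us today"]
vecs = tfidf_vectors(texts)
similar = cosine(vecs[0], vecs[1])    # shares "price" and "dollars"
unrelated = cosine(vecs[0], vecs[2])  # no shared terms
```

Pairs whose cosine exceeds the text content threshold would then be placed in the same webpage node data group ndg.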

Next, the processing unit 23 of the webpage data extraction device 2 calculates, for each webpage node data group ndg, the text content sum of the webpage node data ND in that group, that is, the total text length of the webpage node data ND belonging to the same webpage node data group ndg, as detailed in the following table:

Figure TWI611308BD00009
Figure TWI611308BD00010

Next, the processing unit 23 of the webpage data extraction device 2 sorts the text content sums of the different webpage node data groups ndg into a text content sum sequence, as shown in the following table:

Figure TWI611308BD00011

Subsequently, the processing unit 23 of the webpage data extraction device 2 calculates the differences between adjacent text content sums in the sorted text content sum sequence, namely 1, 2, 1, 44 and 1, and selects the largest difference, 44. As noted above, within the same webpage, webpage node data carrying the main content usually have a larger amount of text; the position of the largest difference therefore marks the boundary between webpage node data carrying the main content and webpage node data that do not.

Therefore, the processing unit 23 of the webpage data extraction device 2 can divide the text content sum sequence into a main region and a secondary region according to the maximum difference, and determine, according to the main region, at least one main webpage node data group MNDG among the webpage node data groups ndg, as shown in the following table:

Figure TWI611308BD00012
Figure TWI611308BD00013
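The sorting, difference, and splitting steps described above can be sketched as follows. The group labels and the six sums are hypothetical values chosen only so that the adjacent differences reproduce the sequence 1, 2, 1, 44, 1 given in the description; ascending sort order and the function name `split_by_largest_gap` are likewise assumptions.

```python
def split_by_largest_gap(sums):
    """Split groups into (main, secondary) at the largest gap between sorted sums.

    sums: mapping from group label to text content sum; needs at least two groups.
    """
    ordered = sorted(sums.items(), key=lambda item: item[1])  # ascending (assumed)
    gaps = [b[1] - a[1] for a, b in zip(ordered, ordered[1:])]
    cut = max(range(len(gaps)), key=gaps.__getitem__)  # index of the largest gap
    secondary = [group for group, _ in ordered[:cut + 1]]
    main = [group for group, _ in ordered[cut + 1:]]  # side with the larger sums
    return gaps, main, secondary


# Hypothetical sums; only the differences between them follow the description.
sums = {"g1": 10, "g2": 11, "g3": 13, "g4": 14, "g5": 58, "g6": 59}
gaps, main, secondary = split_by_largest_gap(sums)
```

With these hypothetical values the largest gap (44) separates g5 and g6, which form the main region, from g1 to g4, which form the secondary region.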

Therefore, in the second embodiment, the XML Path Language expressions NDX of the webpage node data ND included in the main webpage node data group MNDG are as shown in the following table:

Figure TWI611308BD00014

Subsequently, the processing unit 23 of the webpage data extraction device 2 can apply the longest common subsequence (Longest Common Subsequence) algorithm to the XML Path Language expressions NDX of the webpage node data ND included in the main webpage node data group MNDG to determine the webpage main content extraction information MX. In the second embodiment, the webpage main content extraction information MX is: 'html/body/div[1]/main[1]/article[[0-9]+].*'.
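A longest common subsequence over path expressions can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's procedure: the three paths are hypothetical, the paths are compared step by step (split on '/') rather than character by character, and the way the common part is turned into the wildcard form of MX is not specified in this passage and is therefore not sketched.

```python
from functools import reduce


def lcs(a, b):
    """Longest common subsequence of two lists of path steps (classic DP)."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Walk back through the table to recover one longest subsequence.
    out = []
    i, j = m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


# Hypothetical path expressions of nodes in a main group.
paths = [
    "html/body/div[1]/main[1]/article[1]/p[1]",
    "html/body/div[1]/main[1]/article[2]/p[1]",
    "html/body/div[1]/main[1]/article[3]/p[2]",
]
common_steps = reduce(lcs, (p.split("/") for p in paths))
common_path = "/".join(common_steps)
```

For these hypothetical paths the steps shared by all three are html, body, div[1], and main[1]; the differing article index is the part that the extraction information above expresses as a numeric wildcard.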

In this way, when the webpages of the aforementioned URL group (i.e., http://www.aaaaa.com/item1~3.html) share the same characteristics (for example, template and layout), the processing unit 23 of the webpage data extraction device 2 can subsequently select the webpage nodes that match the same main content extraction information MX (i.e., html/body/div[1]/main[1]/article[[0-9]+].*), to facilitate subsequent analysis and use of the main content.

The third embodiment of the present invention is a webpage data extraction method; refer to FIG. 3 for its flowchart. The method of the third embodiment is used in a webpage data extraction device (for example, the webpage data extraction device 1 of the foregoing embodiment). The webpage data extraction device receives a plurality of webpage data from a web server. The detailed steps of the third embodiment are as follows.

First, step 301 is executed so that the webpage data extraction device divides the plurality of webpage data into at least one URL group according to the address relevance of the URLs of the webpage data. The at least one URL group includes a first URL group, and the first URL group includes at least part of the webpage data. Step 302 is executed so that the webpage data extraction device selects a first webpage data and a second webpage data from the webpage data of the first URL group.

Step 303 is executed so that the webpage data extraction device parses the first webpage data and the second webpage data to obtain a webpage node data set. The webpage node data set includes a plurality of webpage node data, and each webpage node data includes a corresponding XML Path Language expression and a text content.
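The parsing of step 303 can be sketched with Python's standard `html.parser` module as follows. This is a minimal sketch under several assumptions: the markup is well formed (every non-void element has an explicit end tag), every path step carries a sibling index (so the output reads html[1]/body[1]/..., unlike the notation used elsewhere in this description), and the class name `NodeCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have an end tag and never contain text.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class NodeCollector(HTMLParser):
    """Collect (path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []     # path steps of the open elements, e.g. "div[1]"
        self.counts = [{}]  # per open element: child tags seen so far
        self.nodes = []     # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1  # 1-based index among same-tag siblings
        self.stack.append(f"{tag}[{seen[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID or not self.stack:
            return
        # Assumes well-formed markup: the end tag closes the innermost element.
        self.stack.pop()
        self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.nodes.append(("/".join(self.stack), text))


# Hypothetical webpage data.
page = "<html><body><div><p>Hello</p><br><p>World</p></div></body></html>"
collector = NodeCollector()
collector.feed(page)
collector.close()
node_data = collector.nodes
```

A production parser would also need to tolerate unclosed tags and to skip script and style content, neither of which this sketch attempts.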

Step 304 is executed so that the webpage data extraction device divides the plurality of webpage node data of the webpage node data set into a plurality of webpage node data groups according to the path relevance of the XML Path Language expressions and the text relevance of the text contents of the webpage node data. Each webpage node data group includes at least part of the webpage node data.

Step 305 is executed so that the webpage data extraction device calculates, for each webpage node data group, a text content sum of the webpage node data in that group. Step 306 is executed so that the webpage data extraction device determines at least one main webpage node data group among the webpage node data groups according to the text content sums. Finally, step 307 is executed so that the webpage data extraction device determines a webpage main content extraction information according to the XML Path Language expressions of the webpage node data included in the at least one main webpage node data group.

The fourth embodiment of the present invention is a webpage data extraction method; refer to FIG. 4 for its flowchart. The method of the fourth embodiment is used in a webpage data extraction device (for example, the webpage data extraction device 2 of the foregoing embodiment). The webpage data extraction device receives a plurality of webpage data from a web server. The detailed steps of the fourth embodiment are as follows.

First, step 401 is executed so that the webpage data extraction device divides the plurality of webpage data into at least one URL group according to the address relevance of the URLs of the webpage data. The at least one URL group includes a first URL group, the first URL group includes at least part of the webpage data, and, within the first URL group, the minimum edit distances between the URLs of the webpage data are all less than a URL threshold.
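The URL grouping of step 401 can be sketched as follows. The edit distance is the standard Levenshtein distance; the greedy grouping strategy, the threshold value of 3, the function names, and the fourth URL are assumptions made for illustration, while the three item URLs follow the example URL group of the second embodiment.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def group_urls(urls, threshold):
    """Greedy grouping: a URL joins the first group in which its distance
    to every member is below the threshold, otherwise it starts a new group."""
    groups = []
    for url in urls:
        for group in groups:
            if all(edit_distance(url, other) < threshold for other in group):
                group.append(url)
                break
        else:
            groups.append([url])
    return groups


urls = [
    "http://www.aaaaa.com/item1.html",
    "http://www.aaaaa.com/item2.html",
    "http://www.aaaaa.com/item3.html",
    "http://www.aaaaa.com/about/contact.html",  # hypothetical extra page
]
url_groups = group_urls(urls, threshold=3)  # threshold value is an assumption
```

Requiring the distance to every existing member to be below the threshold keeps the property stated in step 401 that all pairwise distances within a group are below the URL threshold.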

Step 402 is executed so that the webpage data extraction device selects, from the webpage data of the first URL group, the first webpage data having the largest data amount and the second webpage data having the second largest data amount. Step 403 is executed so that the webpage data extraction device parses the first webpage data and the second webpage data to obtain a webpage node data set. The webpage node data set includes a plurality of webpage node data, and each webpage node data includes a corresponding XML Path Language expression and a text content.

Step 404 is executed so that the webpage data extraction device selects at least one invalid text content and at least one duplicate node data from the text contents, and deletes the webpage nodes corresponding to the at least one invalid text content and the at least one duplicate node data from the webpage node data set.
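The filtering of step 404 can be sketched as follows. This passage does not define what counts as invalid text content or duplicate node data, so the criteria used here are assumptions: text that is empty or whitespace only is treated as invalid, and a (path, text) pair that occurs more than once across the two parsed pages is treated as duplicate node data, with all of its copies removed. The function name `clean_nodes` and the sample nodes are hypothetical.

```python
from collections import Counter


def clean_nodes(nodes):
    """Drop nodes with blank text and nodes whose (path, text) pair repeats.

    nodes: list of (path, text) pairs gathered from both parsed pages.
    """
    occurrences = Counter(nodes)
    return [
        (path, text)
        for path, text in nodes
        if text.strip() and occurrences[(path, text)] == 1
    ]


# Hypothetical node data from two pages that share a navigation label.
nodes = [
    ("html/body/div[1]/a[1]", "Home"),          # from the first page
    ("html/body/div[2]/p[1]", "First article"),
    ("html/body/div[1]/a[1]", "Home"),          # same node on the second page
    ("html/body/div[2]/p[1]", "Second article"),
    ("html/body/div[3]/span[1]", "   "),        # blank text
]
kept = clean_nodes(nodes)
```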

Step 405 is executed so that the webpage data extraction device divides the plurality of webpage node data of the webpage node data set into a plurality of path groups according to the path relevance of the XML Path Language expressions of the webpage node data. Within each path group, the minimum edit distances between the XML Path Language expressions of the webpage node data are all less than an XML threshold.

Step 406 is executed so that the webpage data extraction device divides each path group into a plurality of webpage node data groups according to the text relevance of the text contents of the webpage node data in that path group. In each path group, each text content of the webpage node data has a term frequency vector, and, within each webpage node data group of each path group, the cosine values between the term frequency vectors of the text contents of the webpage node data are greater than a text content threshold.

Step 407 is executed so that the webpage data extraction device sorts the text content sums into a text content sum sequence. Step 408 is executed so that the webpage data extraction device calculates a plurality of differences between adjacent text content sums in the text content sum sequence. Step 409 is executed so that the webpage data extraction device selects a maximum difference among the differences. Step 410 is executed so that the webpage data extraction device divides the text content sum sequence into a main region and a secondary region according to the maximum difference.

Step 411 is executed so that the webpage data extraction device determines, according to the main region, at least one main webpage node data group among the webpage node data groups. Step 412 is executed so that the webpage data extraction device applies the longest common subsequence algorithm to the XML Path Language expressions of the webpage node data included in the at least one main webpage node data group. Step 413 is executed so that the webpage data extraction device determines the webpage main content extraction information according to the result of step 412.

In summary, the webpage data extraction device and the webpage data extraction method of the present invention can automatically analyze the template and layout syntax of different webpage groups and, on that basis, automatically locate the webpage nodes that carry the main content. Webpage data can thus be extracted more efficiently, which facilitates subsequent data analysis.

The above embodiments merely illustrate implementations of the present invention and explain its technical features; they are not intended to limit the scope of protection of the present invention. Any change or equivalent arrangement that can be readily accomplished by a person skilled in the art falls within the scope claimed by the present invention, and the scope of protection of the present invention is defined by the claims.

301~307‧‧‧Steps

Claims (14)

1. A webpage data extraction method for a webpage data extraction device, the webpage data extraction device receiving a plurality of webpage data from a web server, the webpage data extraction method comprising: (a) causing the webpage data extraction device to divide the webpage data into at least one URL group according to an address relevance of a plurality of uniform resource locators (URLs) of the webpage data, wherein the at least one URL group includes a first URL group, and the first URL group includes at least part of the webpage data; (b) causing the webpage data extraction device to select a first webpage data and a second webpage data from the part of the webpage data of the first URL group; (c) causing the webpage data extraction device to parse the first webpage data and the second webpage data to obtain a webpage node data set, wherein the webpage node data set includes a plurality of webpage node data, and each webpage node data includes a corresponding XML Path Language expression and a text content; (d) causing the webpage data extraction device to divide the webpage node data of the webpage node data set into a plurality of webpage node data groups according to a path relevance of the XML Path Language expressions and a text relevance of the text contents of the webpage node data, wherein each webpage node data group includes at least part of the webpage node data; (e) causing the webpage data extraction device to calculate, for each webpage node data group, a text content sum of the part of the webpage node data; (f) causing the webpage data extraction device to determine at least one main webpage node data group among the webpage node data groups according to the text content sums; and (g) causing the webpage data extraction device to determine a webpage main content extraction information according to the XML Path Language expressions of the part of the webpage node data included in the at least one main webpage node data group.

2. The webpage data extraction method of claim 1, wherein, in the first URL group, the minimum edit distances (Minimum Edit Distance) between the URLs of the part of the webpage data are all less than a URL threshold.

3. The webpage data extraction method of claim 1, wherein step (b) further comprises: (b1) causing the webpage data extraction device to select, from the part of the webpage data of the first URL group, the first webpage data having the largest data amount and the second webpage data having the second largest data amount.

4. The webpage data extraction method of claim 1, further comprising, after step (c): (c1) causing the webpage data extraction device to select at least one invalid text content and at least one duplicate node data from the text contents, and to delete the webpage nodes corresponding to the at least one invalid text content and the at least one duplicate node data from the webpage node data set.

5. The webpage data extraction method of claim 1, wherein step (d) further comprises: (d1) causing the webpage data extraction device to divide the webpage node data of the webpage node data set into a plurality of path groups according to the path relevance of the XML Path Language expressions of the webpage node data, wherein, in each path group, the minimum edit distances between the XML Path Language expressions of the part of the webpage node data are all less than an XML threshold; and (d2) causing the webpage data extraction device to divide each path group into the webpage node data groups according to the text relevance of the text contents of the part of the webpage node data in that path group; wherein, in each path group, each text content of the part of the webpage node data has a term frequency vector; and wherein, in each path group, the cosine values between the term frequency vectors of the text contents of the part of the webpage node data of each webpage node data group are greater than a text content threshold.

6. The webpage data extraction method of claim 1, wherein step (f) further comprises: (f1) causing the webpage data extraction device to sort the text content sums into a text content sum sequence; (f2) causing the webpage data extraction device to calculate a plurality of differences between adjacent text content sums in the text content sum sequence; (f3) causing the webpage data extraction device to select a maximum difference among the differences; (f4) causing the webpage data extraction device to divide the text content sum sequence into a main region and a secondary region according to the maximum difference; and (f5) causing the webpage data extraction device to determine the at least one main webpage node data group among the webpage node data groups according to the main region.

7. The webpage data extraction method of claim 1, wherein step (g) further comprises: (g1) causing the webpage data extraction device to apply a longest common subsequence (Longest Common Subsequence) algorithm to the XML Path Language expressions of the part of the webpage node data included in the at least one main webpage node data group; and (g2) causing the webpage data extraction device to determine the webpage main content extraction information according to the result of step (g1).

8. A webpage data extraction device, comprising: a receiving unit configured to receive a plurality of webpage data from a web server; and a processing unit configured to: divide the webpage data into at least one URL group according to an address relevance of a plurality of uniform resource locators (URLs) of the webpage data, wherein the at least one URL group includes a first URL group, and the first URL group includes at least part of the webpage data; select a first webpage data and a second webpage data from the part of the webpage data of the first URL group; parse the first webpage data and the second webpage data to obtain a webpage node data set, wherein the webpage node data set includes a plurality of webpage node data, and each webpage node data includes a corresponding XML Path Language expression and a text content; divide the webpage node data of the webpage node data set into a plurality of webpage node data groups according to a path relevance of the XML Path Language expressions and a text relevance of the text contents of the webpage node data, wherein each webpage node data group includes at least part of the webpage node data; calculate, for each webpage node data group, a text content sum of the part of the webpage node data; determine at least one main webpage node data group among the webpage node data groups according to the text content sums; and determine a webpage main content extraction information according to the XML Path Language expressions of the part of the webpage node data included in the at least one main webpage node data group.

9. The webpage data extraction device of claim 8, wherein, in the first URL group, the minimum edit distances (Minimum Edit Distance) between the URLs of the part of the webpage data are all less than a URL threshold.

10. The webpage data extraction device of claim 8, wherein the processing unit is further configured to: select, from the part of the webpage data of the first URL group, the first webpage data having the largest data amount and the second webpage data having the second largest data amount.

11. The webpage data extraction device of claim 8, wherein the processing unit is further configured to: select at least one invalid text content and at least one duplicate node data from the text contents, and delete the webpage nodes corresponding to the at least one invalid text content and the at least one duplicate node data from the webpage node data set.

12. The webpage data extraction device of claim 8, wherein the processing unit is further configured to: divide the webpage node data of the webpage node data set into a plurality of path groups according to the path relevance of the XML Path Language expressions of the webpage node data, wherein, in each path group, the minimum edit distances between the XML Path Language expressions of the part of the webpage node data are all less than an XML threshold; and divide each path group into the webpage node data groups according to the text relevance of the text contents of the part of the webpage node data in that path group; wherein, in each path group, each text content of the part of the webpage node data has a term frequency vector; and wherein, in each path group, the cosine values between the term frequency vectors of the text contents of the part of the webpage node data of each webpage node data group are greater than a text content threshold.

13. The webpage data extraction device of claim 8, wherein the processing unit is further configured to: sort the text content sums into a text content sum sequence; calculate a plurality of differences between adjacent text content sums in the text content sum sequence; select a maximum difference among the differences; divide the text content sum sequence into a main region and a secondary region according to the maximum difference; and determine the at least one main webpage node data group among the webpage node data groups according to the main region.

14. The webpage data extraction device of claim 8, wherein the processing unit is further configured to: apply a longest common subsequence (Longest Common Subsequence) algorithm to the XML Path Language expressions of the part of the webpage node data included in the at least one main webpage node data group; and determine the webpage main content extraction information according to the result of the longest common subsequence algorithm.
TW105135730A 2016-11-03 2016-11-03 Webpage data extraction device and webpage data extraction method thereof TWI611308B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
TW105135730A TWI611308B (en) 2016-11-03 2016-11-03 Webpage data extraction device and webpage data extraction method thereof
CN201611000331.0A CN108021600A (en) 2016-11-03 2016-11-14 Webpage data capturing equipment and webpage data capturing method thereof
US15/358,119 US20180121558A1 (en) 2016-11-03 2016-11-21 Webpage data extraction device and webpage data extraction method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW105135730A TWI611308B (en) 2016-11-03 2016-11-03 Webpage data extraction device and webpage data extraction method thereof

Publications (2)

Publication Number Publication Date
TWI611308B true TWI611308B (en) 2018-01-11
TW201818268A TW201818268A (en) 2018-05-16

Family

ID=61728282

Family Applications (1)

Application Number Title Priority Date Filing Date
TW105135730A TWI611308B (en) 2016-11-03 2016-11-03 Webpage data extraction device and webpage data extraction method thereof

Country Status (3)

Country Link
US (1) US20180121558A1 (en)
CN (1) CN108021600A (en)
TW (1) TWI611308B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018159337A1 (en) * 2017-03-03 2018-09-07 日本電信電話株式会社 Profile generation device, attack detection apparatus, profile generation method, and profile generation program
US10977289B2 (en) * 2019-02-11 2021-04-13 Verizon Media Inc. Automatic electronic message content extraction method and apparatus
CN110134901B (en) * 2019-04-30 2023-06-16 哈尔滨英赛克信息技术有限公司 Multilink webpage tampering judging method based on flow analysis
CN110704761A (en) * 2019-09-25 2020-01-17 恩亿科(北京)数据科技有限公司 Method for acquiring webpage information and computer storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200900957A (en) * 2007-03-15 2009-01-01 Seung-June Song Internet service system in connection with a contacted website and a method for the same
CN101517570A (en) * 2006-07-10 2009-08-26 网圣公司 System and method of analyzing web content
TW201030542A (en) * 2008-11-18 2010-08-16 Yahoo Inc System and method for URL based query for retrieving data related to a context
US20120054129A1 (en) * 2010-08-30 2012-03-01 International Business Machines Corporation Method for classification of objects in a graph data stream
CN105843965A (en) * 2016-04-20 2016-08-10 广州精点计算机科技有限公司 Deep web crawler form filling method and device based on URL (uniform resource locator) subject classification

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090063538A1 (en) * 2007-08-30 2009-03-05 Krishna Prasad Chitrapura Method for normalizing dynamic urls of web pages through hierarchical organization of urls from a web site
CN102314497B (en) * 2011-08-26 2014-12-10 百度在线网络技术(北京)有限公司 Method and equipment for identifying body contents of markup language files
CN102298638A (en) * 2011-08-31 2011-12-28 北京中搜网络技术股份有限公司 Method and system for extracting news webpage contents by clustering webpage labels
WO2013159246A1 (en) * 2012-04-28 2013-10-31 Hewlett-Packard Development Company, L.P. Detecting valuable sections in webpage
US20150067476A1 (en) * 2013-08-29 2015-03-05 Microsoft Corporation Title and body extraction from web page
EP3161610B1 (en) * 2014-06-26 2020-08-05 Google LLC Optimized browser rendering process
CN106021582B (en) * 2016-06-02 2020-06-05 腾讯科技(深圳)有限公司 Method for filtering position information, method and device for extracting effective webpage information
US10148700B2 (en) * 2016-06-30 2018-12-04 Fortinet, Inc. Classification of top-level domain (TLD) websites based on a known website classification

Also Published As

Publication number Publication date
CN108021600A (en) 2018-05-11
TW201818268A (en) 2018-05-16
US20180121558A1 (en) 2018-05-03
