TWI764491B - Text information automatically mining method and system - Google Patents

Text information automatically mining method and system

Info

Publication number
TWI764491B
Authority
TW
Taiwan
Prior art keywords
node
dom tree
text
depth
text information
Prior art date
Application number
TW109147025A
Other languages
Chinese (zh)
Other versions
TW202227990A (en)
Inventor
郭博鈞
歐曜瑋
Original Assignee
重量科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 重量科技股份有限公司 filed Critical 重量科技股份有限公司
Priority to TW109147025A priority Critical patent/TWI764491B/en
Application granted granted Critical
Publication of TWI764491B publication Critical patent/TWI764491B/en
Publication of TW202227990A publication Critical patent/TW202227990A/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An automatic text information mining method and system are provided. After the source code of a web page is built into a Document Object Model (DOM) tree, the method runs a depth-first search over the tree and, based on node depth and node distance, retrieves the content of DOM tree nodes that are highly likely to contain text information.

Description

Automatic text information mining method and automatic text information mining system

The present invention relates to an information mining method and system, and in particular to an automatic text information mining method and an automatic text information mining system.

A website may display articles alongside various other content, such as advertisements, links, share buttons, and controls for printing or emailing the article. For readers of a web page, for example a news page, this additional content (advertisements in particular) can be distracting and can easily obscure the text the reader wants to read.

Therefore, how to obtain the important content of a web page quickly and effectively has become one of the pressing goals in this field.

The technical problem to be solved by the present invention is to provide an automatic text information mining method and an automatic text information mining system that address the shortcomings of the prior art.

To solve the above technical problem, one technical solution adopted by the present invention is to provide an automatic text information mining method, which includes: configuring a web page capture module to obtain web page source code data of a target web page; configuring a model tree construction module to parse the obtained web page source code data, obtain a plurality of Document Object Model (DOM) tree nodes from the parsing result, and construct a DOM tree from those DOM tree nodes; configuring a first search module to execute a depth-first search algorithm on the DOM tree nodes of the DOM tree, so as to traverse the DOM tree in a depth-first search order and, for each of the DOM tree nodes: determine whether the current DOM tree node is a text node; if not, proceed to the next DOM tree node; if so, determine whether a node depth between the current DOM tree node and the previous DOM tree node determined to be a text node is greater than a predetermined depth and whether a node distance between them is greater than a predetermined distance, wherein the node depth is determined by the number of node levels between the DOM tree node and the previous text node, and the node distance is determined by the number of node traversal steps between the DOM tree node and the previous text node; and, in response to determining that the node depth and the node distance between the current DOM tree node and the previous text node are respectively greater than the predetermined depth and the predetermined distance, record the content of the current DOM tree node and proceed to the next DOM tree node in the depth-first search order; and configuring a storage unit to store the content recorded after executing the depth-first search algorithm as text information of the target web page.

In some embodiments, determining whether the current DOM tree node is a text node means determining whether the format of the current DOM tree node conforms to a text node as defined in Hyper Text Markup Language (HTML).

In some embodiments, the predetermined depth is in the range of 3 to 7, the predetermined distance is in the range of 8 to 12, and both the predetermined depth and the predetermined distance are positive integers.

In some embodiments, the automatic text information mining method further includes configuring a checking module to check whether the content recorded for a single DOM tree node exceeds a predetermined length and, if so, configuring the storage unit to store the checked content.

In some embodiments, the automatic text information mining method further includes configuring a second search module to determine whether any of the DOM tree nodes of the DOM tree both is an element node as defined in HTML and matches a specific description tag and, if so, directly obtaining and recording the content of that DOM tree node.

In some embodiments, the automatic text information mining method further includes configuring the storage unit to store the content recorded by the second search module as the text information of the target web page.

In some embodiments, the specific description tag describes the material type of material inserted and placed on the target web page.

In some embodiments, the specific description tag is a <style> tag, a <script> tag, or an <iframe> tag as defined under Golang's DataAtom.

To solve the above technical problem, another technical solution adopted by the present invention is to provide an automatic text information determination and mining system, which includes a web page capture module, a model tree construction module, a first search module, and a storage unit. The web page capture module is configured to obtain web page source code data of a target web page. The model tree construction module is configured to parse the obtained web page source code data, obtain a plurality of Document Object Model (DOM) tree nodes from the parsing result, and construct a DOM tree from those DOM tree nodes. The first search module is configured to execute a depth-first search algorithm on the DOM tree nodes of the DOM tree, so as to traverse the DOM tree in a depth-first search order and, for each of the DOM tree nodes: determine whether the current DOM tree node is a text node; if not, proceed to the next DOM tree node; if so, determine whether a node depth between the current DOM tree node and the previous DOM tree node determined to be a text node is greater than a predetermined depth and whether a node distance between them is greater than a predetermined distance, wherein the node depth is determined by the number of node levels between the DOM tree node and the previous text node, and the node distance is determined by the number of node traversal steps between the DOM tree node and the previous text node; and, in response to determining that the node depth and the node distance between the current DOM tree node and the previous text node are respectively greater than the predetermined depth and the predetermined distance, record the content of the current DOM tree node and proceed to the next DOM tree node in the depth-first search order. The storage unit is configured to store the content recorded after executing the depth-first search algorithm as text information of the target web page.

One beneficial effect of the present invention is that the automatic text information mining method and system provided herein can build the web page source code into a DOM tree and then execute a depth-first search to retrieve, based on depth and distance, the content of DOM tree nodes that are highly likely to contain text information.

In addition, the automatic text information mining method and system provided by the present invention can use a specific node type and specific description tags to determine whether the content of the target web page contains data that meets specific format conditions, and can then obtain that data directly, saving the time and cost of manual data mining.

For a further understanding of the features and technical content of the present invention, refer to the following detailed description and drawings. The drawings are provided for reference and illustration only and are not intended to limit the present invention.

The following specific embodiments illustrate implementations of the "automatic text information mining method and automatic text information mining system" disclosed herein. Those skilled in the art can understand the advantages and effects of the present invention from the content of this specification. The present invention can be implemented or applied through other specific embodiments, and the details in this specification can be modified and changed based on different viewpoints and applications without departing from the concept of the present invention. Note that the drawings are simplified schematic illustrations and are not drawn to scale. The following embodiments describe the related technical content in further detail, but the disclosed content is not intended to limit the scope of protection of the present invention. In addition, the term "or" as used herein may, depending on the circumstances, include any one of the associated listed items or a combination of several of them.

FIG. 1 is a block diagram of an automatic text information mining system according to an embodiment of the present invention. Referring to FIG. 1, an embodiment of the present invention provides an automatic text information determination and mining system 1, which includes a web page capture module 100, a model tree construction module 102, a first search module 104, and a storage unit 106. The web page capture module 100 is configured to obtain web page source code data of a target web page 2.

The model tree construction module 102 is configured to parse the obtained web page source code data, obtain a plurality of Document Object Model (DOM) tree nodes from the parsing result, and construct a DOM tree from those DOM tree nodes.

In detail, the DOM is not part of the Hyper Text Markup Language (HTML) file itself; it is data that the browser builds in memory. The browser uses the DOM tree structure to build this model, and because the model is constructed from objects, it is called an object model.

In general, a DOM tree is built from four kinds of nodes: document nodes, element nodes, attribute nodes, and text nodes. The document node sits at the top of the DOM tree and represents the entire web page; it is also called the document object. Element nodes describe the structure of the HTML page and are formed from HTML tags. Data in the DOM tree is accessed by locating an element; once the element is found, its attributes or text can be accessed. An attribute node is an attribute placed in the start tag of an HTML element, and a text node is the text between the start tag and the end tag of an HTML element.

Based on the above, the model tree construction module 102 can analyze the web page source code data to generate a plurality of DOM tree nodes and construct the DOM tree. See, for example, FIG. 2, which is a schematic diagram of a DOM tree according to an embodiment of the present invention.
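As an illustration of the tree construction step, the following is a minimal sketch that uses Python's standard `html.parser` module. It is not the implementation of the model tree construction module 102. The names `Node`, `TreeBuilder`, `build_tree`, and `VOID_TAGS` are hypothetical, and as a simplifying assumption attributes are stored in a dictionary on each element rather than as separate attribute nodes.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of tags that never have children.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """A simplified DOM tree node: document, element, or text."""

    def __init__(self, kind, name, parent=None):
        self.kind = kind      # "document", "element", or "text"
        self.name = name      # tag name, or the text itself for text nodes
        self.attrs = {}       # attributes kept as a dict (simplification)
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document", "#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node("element", tag, self.current)
        node.attrs = dict(attrs)
        if tag not in VOID_TAGS:
            self.current = node   # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag name and close it.
        node = self.current
        while node is not self.root and node.name != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():          # skip whitespace-only text
            Node("text", data.strip(), self.current)


def build_tree(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

Calling `build_tree` on a page's source returns the document node, whose `children` can then be walked by a search routine.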

Next, the first search module 104 is configured to execute a depth-first search algorithm on the DOM tree nodes of the DOM tree. In detail, depth-first search is an algorithm that can be used to traverse the DOM tree. The search starts from the root of the DOM tree (the topmost document node DN in FIG. 2) and first explores an unvisited DOM tree node along an edge (for example, the link between the document node DN and the first element node EN1 below it is an edge), searching as deep as possible. Once all nodes along the edges of a node have been explored, the search backtracks to the previous node and repeats the exploration of unvisited DOM tree nodes until the target node is found or all nodes have been traversed. This order is referred to as the depth-first search order, and the first search module 104 traverses the DOM tree in this order to find DOM tree nodes that are highly likely to be associated with the main text.
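The traversal order described above can be sketched as an iterative pre-order walk. This is only an illustration: the `Node` class, the `depth_first` function, and the node names other than DN and EN1 are hypothetical, and the tree shape is not taken from FIG. 2.

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def depth_first(root):
    """Yield (node, level) pairs in depth-first (pre-order) order."""
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        yield node, level
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((child, level + 1))


tree = Node("DN", [
    Node("EN1", [
        Node("EN2", [Node("T1")]),
        Node("EN3", [Node("T2")]),
    ]),
])
order = [(node.name, level) for node, level in depth_first(tree)]
```

Here the walk goes as deep as possible under EN2 before backtracking to EN1 and continuing with EN3.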

Refer to FIG. 3, which is a flowchart of the depth-first search algorithm according to an embodiment of the present invention. As shown, the first search module 104 performs the following steps for each of the DOM tree nodes:

Step S30: Proceed to the next DOM tree node in the depth-first search order.

Step S31: Determine whether the current DOM tree node is a text node. If not, return to step S30 and proceed to the next DOM tree node. If so, go to step S32.

More specifically, in this step, determining whether the current DOM tree node is a text node means determining whether the format of the current DOM tree node conforms to a text node as defined in HTML.

Step S32: Determine whether the node depth between the current DOM tree node and the previous DOM tree node determined to be a text node is greater than the predetermined depth, and whether the node distance between the current DOM tree node and the previous DOM tree node determined to be a text node is greater than the predetermined distance.

In detail, the node depth is determined by the number of node levels between the DOM tree node and the previous text node, and the node distance is determined by the number of node traversal steps between the DOM tree node and the previous text node. For example, as shown in FIG. 2, when the traversal moves from the leftmost attribute node AN1 to the attribute node AN2, the node depth between the two is 0. As for the node distance, all ancestor nodes of the attribute node AN1 were already visited before AN1 was reached and need not be visited again, so the traversal moves from AN1 directly to the parent node of AN2 and then to AN2. The node distance between AN1 and AN2 is therefore 2, and the same reasoning applies to all DOM tree nodes.
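The text gives no formulas for these two quantities, so the following sketch rests on one reading of the AN1/AN2 example: node depth is taken as the absolute difference in tree level, and node distance as the difference in depth-first visiting order. The helper names, the tuple representation, and the parent nodes P1 and P2 are hypothetical and are not taken from FIG. 2.

```python
def index_nodes(tree, level=0, table=None):
    """Map each (unique) node name to (pre-order index, level)."""
    table = {} if table is None else table
    name, children = tree
    table[name] = (len(table), level)   # index assigned in visiting order
    for child in children:
        index_nodes(child, level + 1, table)
    return table


def node_depth(table, a, b):
    """Assumed reading: difference in tree level between two nodes."""
    return abs(table[a][1] - table[b][1])


def node_distance(table, a, b):
    """Assumed reading: number of traversal steps between two nodes."""
    return abs(table[a][0] - table[b][0])


# AN1 and AN2 sit at the same level under different parents.
tree = ("DN", [("EN1", [("P1", [("AN1", [])]), ("P2", [("AN2", [])])])])
table = index_nodes(tree)
```

Under these assumptions, `node_depth(table, "AN1", "AN2")` is 0 and `node_distance(table, "AN1", "AN2")` is 2, which agrees with the example in the text.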

In a preferred embodiment, the predetermined depth may be in the range of 3 to 7, the predetermined distance may be in the range of 8 to 12, and both the predetermined depth and the predetermined distance are positive integers. In a most preferred embodiment, the predetermined depth may be 5 and the predetermined distance may be 12. This mechanism corresponds to the way news web pages are typically written, so using it to assist the depth-first search algorithm gives a high probability of obtaining the text nodes that correspond to the main text information of the target web page 2.

In response to determining in step S32 that the node depth and the node distance between the current DOM tree node and the previous text node are respectively greater than the predetermined depth and the predetermined distance, the flow goes to step S33: record the content of the current DOM tree node, return to step S30, and proceed to the next DOM tree node in the depth-first search order. If, in step S32, the node depth is not greater than the predetermined depth and/or the node distance is not greater than the predetermined distance, the flow returns to step S30.
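Steps S30 to S33 can be sketched as a single loop. This is a hedged illustration, not the patented implementation. It assumes that node depth is the level difference and node distance is the visiting-order difference, and that the first text node encountered only sets the reference point and is not recorded (the text does not say how the first text node is handled). The function name `extract_text`, its parameters, and the tuple representation `(kind, value, children)` are hypothetical; the defaults 5 and 12 follow the most preferred embodiment.

```python
def extract_text(root, min_depth=5, min_distance=12):
    recorded = []
    prev = None                  # (index, level) of the previous text node
    stack = [(root, 0)]
    index = -1
    while stack:
        node, level = stack.pop()            # S30: next node in DFS order
        index += 1
        kind, value, children = node
        for child in reversed(children):
            stack.append((child, level + 1))
        if kind != "text":                   # S31: not a text node
            continue
        if prev is not None:                 # S32: compare with previous text node
            depth = abs(level - prev[1])
            distance = index - prev[0]
            if depth > min_depth and distance > min_distance:
                recorded.append(value)       # S33: record the content
        prev = (index, level)
    return recorded


page = ("document", "", [("element", "html", [("element", "body", [
    ("element", "nav", [
        ("element", "a", [("text", "Home", [])]),
        ("element", "a", [("text", "About", [])]),
    ]),
    ("element", "div", [("element", "div", [("element", "section", [
        ("element", "p", [("text", "Article body", [])]),
    ])])]),
])])])
```

On this small hypothetical page the default thresholds record nothing, because the tree is too shallow; with `min_depth=1` and `min_distance=4` only the article text is recorded, while the adjacent navigation links are skipped.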

Thereafter, the storage unit 106 may be configured to store the content recorded after executing the depth-first search algorithm as the text information of the target web page.

In addition, as shown in FIG. 1, the automatic text information mining system 1 may further include a checking module 110 configured to check whether the content recorded for a single DOM tree node exceeds a predetermined length; only if it does is the checked content stored in the storage unit 106. In a specific embodiment, the checking module 110 may remove the space characters from the content to be recorded before calculating whether it exceeds the predetermined length. In a preferred embodiment, the predetermined length is in the range of 30 to 80 characters. In a most preferred embodiment, the predetermined length is 50 characters, counted after the space characters are removed.
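The length check can be sketched in a few lines. The function name `passes_length_check` is hypothetical, and removing every whitespace character (not only the space character) is an assumption; the default of 50 follows the most preferred embodiment.

```python
def passes_length_check(text, min_length=50):
    """Return True if the text, with whitespace removed, exceeds min_length."""
    stripped = "".join(text.split())   # drops spaces, tabs, and newlines
    return len(stripped) > min_length
```

A short label such as "Share this" would be rejected, while a paragraph of more than 50 non-blank characters would be kept.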

Referring again to FIG. 1, the automatic text information mining system 1 may further include a second search module 108 configured to determine whether any of the DOM tree nodes of the DOM tree both is an element node as defined in HTML and matches a specific description tag. If so, the content of that DOM tree node is obtained directly and recorded. The storage unit 106 can store the content recorded by the second search module 108 as the text information of the target web page 2.

For example, the specific description tag describes the material type of material inserted and placed on the target web page 2. For instance, the specific description tag is a <style> tag, a <script> tag, or an <iframe> tag as defined under Golang's DataAtom. Taking the <iframe> tag as an example, it defines an inline frame as material, which embeds another HTML document in the current HTML document. The <script> tag defines a client-side script, such as JavaScript. A <script> element can either contain script statements or point to an external script file through its src attribute.
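The tag-based lookup can be illustrated with Python's standard `html.parser`, as a stand-in for the Go tooling the text mentions. This is a sketch under assumptions: the class name `TaggedContentCollector`, the `TARGET_TAGS` set, and the choice to record both attributes and inner text are hypothetical, and nested target elements are not handled.

```python
from html.parser import HTMLParser

TARGET_TAGS = {"style", "script", "iframe"}


class TaggedContentCollector(HTMLParser):
    """Collect (tag, attributes, inner text) for each target element."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._open = None          # [tag, attrs, text parts] while inside a target

    def handle_starttag(self, tag, attrs):
        if tag in TARGET_TAGS:
            self._open = [tag, dict(attrs), []]

    def handle_data(self, data):
        if self._open is not None:
            self._open[2].append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            name, attrs, parts = self._open
            self.records.append((name, attrs, "".join(parts).strip()))
            self._open = None


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    '<script src="a.js"></script>'
    '<iframe src="https://example.com/embed"></iframe>'
    "<p>text</p></body></html>"
)
collector = TaggedContentCollector()
collector.feed(page)
collector.close()
```

After feeding the page, `collector.records` holds one entry per matching element, and ordinary elements such as `<p>` are ignored.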

Note that differences in web page structure may cause the capture rates of the first search module 104 and the second search module 108 to vary when they obtain the text information of the target web page 2. In most cases, however, the first search module 104 and the second search module 108 complement each other in capture rate. In other words, the content obtained separately by the first search module 104 and the second search module 108 can be combined to form the final text information of the target web page 2.

To summarize the above, refer to FIG. 4, which is a flowchart of the automatic text information mining method according to an embodiment of the present invention.

As shown in FIG. 4, the automatic text information mining method may include the following steps:

Step S40: Configure the web page capture module to obtain web page source code data of the target web page.

Step S41: Configure the model tree construction module to parse the obtained web page source code data, obtain a plurality of DOM tree nodes from the parsing result, and construct the DOM tree from the DOM tree nodes.

Step S42: Configure the first search module to execute the depth-first search algorithm on the DOM tree nodes, so as to traverse the DOM tree in the depth-first search order and obtain content.

Step S43: Configure the second search module to determine whether the DOM tree contains a DOM tree node that both is an element node as defined in HTML and matches a specific description tag and, if so, directly obtain and record the content of that DOM tree node.

Step S44: Configure the storage unit to store the content recorded by the first search module and the second search module as the text information of the target web page.

The automatic text information mining method of the foregoing embodiment can be executed by the automatic text information mining system 1 of the foregoing embodiment. The system 1 can be implemented by a computer system (such as a desktop computer or a server) that has electronic components such as a central processing unit, a northbridge and southbridge, volatile memory, a storage unit, and a network chip.

The storage unit 106 may be, for example, a logical disk array such as a Redundant Array of Independent Disks (RAID) or a Just a Bunch Of Disks (JBOD) system. Alternatively, the storage unit 106 may be a non-volatile storage device such as a hard disk drive (HDD).

The automatic text information determination and mining system 1, the web page capture module 100, the model tree construction module 102, the first search module 104, the second search module 108, and the checking module 110 are computer program products stored in the storage unit 106 that can be executed by the central processing unit to perform specific functions.

[Beneficial effects of the embodiments]

One beneficial effect of the present invention is that the automatic text information mining method and system provided herein can build the web page source code into a DOM tree and then execute a depth-first search to retrieve, based on depth and distance, the content of DOM tree nodes that are highly likely to contain text information.

In addition, the automatic text information mining method and system provided by the present invention can use a specific node type and specific description tags to determine whether the content of the target web page contains data that meets specific format conditions, and can then obtain that data directly, saving the time and cost of manual data mining.

The content disclosed above covers only preferred feasible embodiments of the present invention and does not limit the scope of its claims. All equivalent technical changes made using the content of this specification and the drawings are therefore included in the scope of the claims of the present invention.

1: Automatic text information determination and mining system

100: Web page capture module

102: Model tree construction module

104: First search module

106: Storage unit

108: Second search module

110: Checking module

2: Target web page

DN: Document node

AN2: Attribute node

AN1: Attribute node

EN1: Element node

FIG. 1 is a block diagram of an automatic text information mining system according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of a Document Object Model tree according to an embodiment of the present invention.

FIG. 3 is a flowchart of a depth-first search algorithm according to an embodiment of the present invention.

FIG. 4 is a flowchart of an automatic text information mining method according to an embodiment of the present invention.


Claims (16)

1. An automatic text information mining method, comprising:
configuring a web page capture module to obtain web page source code data of a target web page;
configuring a model tree construction module to parse the obtained web page source code data, obtain a plurality of Document Object Model (DOM) tree nodes according to a parsing result, and construct a DOM tree from the DOM tree nodes;
configuring a first search module to execute a depth-first search algorithm on the DOM tree nodes of the DOM tree, so as to traverse the DOM tree in a depth-first search order and, for each of the DOM tree nodes:
determine whether the current DOM tree node is a text node; if not, proceed to the next DOM tree node; if so, determine whether a node depth between the current DOM tree node and the previous DOM tree node determined to be a text node is greater than a predetermined depth and whether a node distance is greater than a predetermined distance, wherein the node depth is determined by the number of node levels between the DOM tree node and the previous text node, and the node distance is determined by the number of nodes traversed between the DOM tree node and the previous text node; and
in response to determining that the node depth and the node distance between the current DOM tree node and the previous text node are respectively greater than the predetermined depth range and the predetermined distance, record the content of the current DOM tree node and proceed to the next DOM tree node in the depth-first search order; and
configuring a storage unit to store the content recorded after executing the depth-first search algorithm as text information of the target web page.

2. The automatic text information mining method according to claim 1, wherein determining whether the current DOM tree node is a text node comprises determining whether the format of the current DOM tree node conforms to a text node as defined in the Hyper Text Markup Language (HTML).

3. The automatic text information mining method according to claim 1, wherein the predetermined depth is in the range of 3 to 7, the predetermined distance is in the range of 8 to 12, and the predetermined depth and the predetermined range are both positive integers.

4. The automatic text information mining method according to claim 1, further comprising configuring a checking module to check whether the content recorded for a single DOM tree node exceeds a predetermined length and, if so, configuring the storage unit to store the checked content.

5. The automatic text information mining method according to claim 1, further comprising: configuring a second search module to determine whether any of the DOM tree nodes of the DOM tree both is an element node as defined in HTML and matches a specific description tag and, if so, directly obtaining the content of that DOM tree node for recording.

6. The automatic text information mining method according to claim 5, further comprising configuring the storage unit to store the content recorded by the second search module as the text information of the target web page.

7. The automatic text information mining method according to claim 5, wherein the specific description tag describes a material type of material inserted and placed on the target web page.

8. The automatic text information mining method according to claim 7, wherein the specific description tag is a <style> tag, a <script> tag, or an <iframe> tag defined under DataAtom of the Golang programming language.

9. An automatic text information mining system, comprising:
a web page capture module configured to obtain web page source code data of a target web page;
a model tree construction module configured to parse the obtained web page source code data, obtain a plurality of Document Object Model (DOM) tree nodes according to a parsing result, and construct a DOM tree from the DOM tree nodes;
a first search module configured to execute a depth-first search algorithm on the DOM tree nodes of the DOM tree, so as to traverse the DOM tree in a depth-first search order and, for each of the DOM tree nodes:
determine whether the current DOM tree node is a text node; if not, proceed to the next DOM tree node; if so, determine whether a node depth between the current DOM tree node and the previous DOM tree node determined to be a text node is greater than a predetermined depth and whether a node distance is greater than a predetermined distance, wherein the node depth is determined by the number of node levels between the DOM tree node and the previous text node, and the node distance is determined by the number of nodes traversed between the DOM tree node and the previous text node; and
in response to determining that the node depth and the node distance between the current DOM tree node and the previous text node are respectively greater than the predetermined depth range and the predetermined distance, record the content of the current DOM tree node and proceed to the next DOM tree node in the depth-first search order; and
a storage unit configured to store the content recorded after executing the depth-first search algorithm as text information of the target web page.

10. The automatic text information mining system according to claim 9, wherein determining whether the current DOM tree node is a text node comprises determining whether the format of the current DOM tree node conforms to a text node as defined in the Hyper Text Markup Language (HTML).

11. The automatic text information mining system according to claim 9, wherein the predetermined depth is in the range of 3 to 7, the predetermined distance is in the range of 8 to 12, and the predetermined depth and the predetermined range are both positive integers.

12. The automatic text information mining system according to claim 9, further comprising a checking module configured to check whether the content recorded for a single DOM tree node exceeds a predetermined length, wherein, if so, the storage unit is configured to store the checked content.

13. The automatic text information mining system according to claim 9, further comprising a second search module configured to determine whether any of the DOM tree nodes of the DOM tree both is an element node as defined in HTML and matches a specific description tag and, if so, to directly obtain the content of that DOM tree node for recording.

14. The automatic text information mining system according to claim 13, wherein the storage unit is further configured to store the content recorded by the second search module as the text information of the target web page.

15. The automatic text information mining system according to claim 13, wherein the specific description tag describes a material type of material inserted and placed on the target web page.

16. The automatic text information mining system according to claim 15, wherein the specific description tag is a <style> tag, a <script> tag, or an <iframe> tag defined under DataAtom of the Golang programming language.
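
The traversal recited in the independent claims can be illustrated with a minimal Go sketch. It is not the patentee's implementation. The `Node` type, the `Mine` function, and the `miner` helper are hypothetical names introduced only for illustration, and three interpretations are assumptions rather than statements of the claims: the node depth is taken as the absolute difference in tree level between the current text node and the previous text node, the node distance is taken as the number of depth-first steps since the previous text node, and the first text node encountered serves only as a reference point and is not recorded.

```go
package main

// Node is a hypothetical, minimal stand-in for a parsed DOM tree node.
// A real system would obtain nodes from an HTML parser instead.
type Node struct {
	IsText   bool    // true for a text node, false for an element node
	Data     string  // text content (text node) or tag name (element node)
	Children []*Node // child nodes in document order
}

// miner carries the thresholds and the traversal state.
type miner struct {
	minDepth, minDist int      // predetermined depth and predetermined distance
	step              int      // number of nodes visited so far
	lastStep          int      // step at which the previous text node was seen
	lastLevel         int      // tree level of the previous text node
	seenText          bool     // whether any text node has been seen yet
	recorded          []string // content recorded during the traversal
}

func absInt(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// walk visits n and its descendants in depth-first (pre-order) order.
func (m *miner) walk(n *Node, level int) {
	if n == nil {
		return
	}
	m.step++
	if n.IsText {
		if m.seenText {
			// Assumed definitions: level difference and traversal steps
			// measured against the previous text node.
			depth := absInt(level - m.lastLevel)
			dist := m.step - m.lastStep
			if depth > m.minDepth && dist > m.minDist {
				m.recorded = append(m.recorded, n.Data)
			}
		}
		// The current text node becomes the reference for the next one.
		m.seenText, m.lastStep, m.lastLevel = true, m.step, level
	}
	for _, c := range n.Children {
		m.walk(c, level+1)
	}
}

// Mine returns the content of every text node whose depth and distance
// from the previous text node both exceed the given thresholds.
func Mine(root *Node, minDepth, minDist int) []string {
	m := &miner{minDepth: minDepth, minDist: minDist}
	m.walk(root, 0)
	return m.recorded
}
```

Calling `Mine(root, 3, 8)` would apply thresholds at the low end of the ranges recited in claims 3 and 11. The returned slice corresponds to the content that the storage unit would store as the text information of the target web page.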
TW109147025A 2020-12-31 2020-12-31 Text information automatically mining method and system TWI764491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW109147025A TWI764491B (en) 2020-12-31 2020-12-31 Text information automatically mining method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW109147025A TWI764491B (en) 2020-12-31 2020-12-31 Text information automatically mining method and system

Publications (2)

Publication Number Publication Date
TWI764491B true TWI764491B (en) 2022-05-11
TW202227990A TW202227990A (en) 2022-07-16

Family

ID=82594101

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109147025A TWI764491B (en) 2020-12-31 2020-12-31 Text information automatically mining method and system

Country Status (1)

Country Link
TW (1) TWI764491B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740417A (en) * 2016-01-29 2016-07-06 青岛海信移动通信技术股份有限公司 Webpage based target data search method and module, browser and terminal
CN108052632A (en) * 2017-12-20 2018-05-18 成都律云科技有限公司 A kind of method for obtaining network information, system and company information search system
US10586397B1 (en) * 2018-08-24 2020-03-10 VIRNECT inc. Augmented reality service software as a service based augmented reality operating system
CN111737421A (en) * 2020-08-07 2020-10-02 杭州六棱镜知识产权科技有限公司 Intellectual property big data information retrieval system and storage medium

Also Published As

Publication number Publication date
TW202227990A (en) 2022-07-16

Similar Documents

Publication Publication Date Title
Mitchell Web scraping with Python: Collecting more data from the modern web
US9298680B2 (en) Display of hypertext documents grouped according to their affinity
US20120110437A1 (en) Style and layout caching of web content
US11205041B2 (en) Web element rediscovery system and method
US11256912B2 (en) Electronic form identification using spatial information
US10713429B2 (en) Joining web data with spreadsheet data using examples
US20130339840A1 (en) System and method for logical chunking and restructuring websites
US8359307B2 (en) Method and apparatus for building sales tools by mining data from websites
US10146749B2 (en) Tracking JavaScript actions
KR20050097444A (en) Method and apparatus for searching element, and recording medium storing a program to implement thereof
Hajba Website Scraping with Python
Parameswaran et al. Optimal schemes for robust web extraction
US10042827B2 (en) System and method for recognizing non-body text in webpage
WO2022179128A1 (en) Crawler-based data crawling method and apparatus, computer device, and storage medium
CN114021042A (en) Webpage content extraction method and device, computer equipment and storage medium
CN104778232B (en) Searching result optimizing method and device based on long query
US20120124077A1 (en) Domain Constraint Based Data Record Extraction
TWI764491B (en) Text information automatically mining method and system
Soulemane et al. Crawling the hidden web: An approach to dynamic web indexing
US10713329B2 (en) Deriving links to online resources based on implicit references
KR102365434B1 (en) Content search method and content search system
CN105224583A (en) The method for cleaning of journal file and device
US20150324333A1 (en) Systems and methods for automatically generating hyperlinks
Zeleny et al. Cluster-based Page Segmentation-a fast and precise method for web page pre-processing
CN110618809B (en) Front-end webpage input constraint extraction method and device