TWI738126B

TWI738126B - Web content filtering method

Info

Publication number: TWI738126B
Application number: TW108142681A
Authority: TW
Inventors: 許展源; 丘祐瑋
Original assignee: 大數軟體有限公司
Priority date: 2019-11-25
Filing date: 2019-11-25
Publication date: 2021-09-01
Also published as: TW202121200A

Abstract

A web content filtering method includes: S10: visiting a website; S20: performing cluster identification on the content of a homepage of the website; S30: finding an article list on the homepage, the article list including a plurality of URL links; S40: taking two of the URL links from the article list; S50: following each of the two URL links to obtain two article pages, each article page including a plurality of data fields; S60: comparing the two article pages and removing the data fields that have the same content; S70: marking the remaining data fields as collection fields; and S80: capturing the data in the collection fields.

Description

Methods of Web Content Screening

The present invention relates to a data screening method, and in particular to a method for screening web content.

A web crawler is a widely used data collection tool that automatically browses web pages and downloads and collects their content. Because crawlers can replace manual labor, and in line with current technological trends, they play an important role in the big data industry.

However, although a web crawler can retrieve web content automatically, it cannot tell the collected data apart, so invalid data and advertisements still make up most of what is collected. The collected data is therefore usually screened by manual identification, or a separate algorithm is designed to identify it. This is inconvenient for certain fields of analysis and also affects the accuracy of the analysis results.

How to improve the accuracy of the data collected by a web crawler is therefore a question worth considering for those of ordinary skill in the art.

The present invention provides a method for screening web content, including: S10: visiting a website; S20: performing cluster identification on the content of a homepage of the website; S30: finding an article list on the homepage, the article list including a plurality of URL links; S40: taking two of the URL links from the article list; S50: following each of the two URL links to obtain two article webpages, each article webpage including a plurality of data fields; S60: comparing the two article webpages and removing the data fields that have the same content; S70: marking the remaining data fields as collection fields; and S80: capturing the data in the collection fields.

In the above method for screening web content, the website is a news website.

In the above method for screening web content, in step S10, a meta tag of the website is used to determine whether the website is a news website.

The above method for screening web content further includes: S90: browsing article webpages that have the same position; and S100: capturing the data in the collection fields of those article webpages.

In the above method for screening web content, in step S90, whether article webpages have the same position is determined from the XPath position of each article webpage.

In the above method for screening web content, in step S20, the website is clustered by means of machine learning.

In the above method for screening web content, the method is executed by a data collection system.

S10~S100: Flowchart steps

110: Homepage

111: Category field

112: Advertisement slot

113: Headline field

114: Article list

1141, 1141a~d: URL links

120: Article webpage

121: News category

122: News headline

123, 125: Advertisement slots

124, 126: News main text

200: Data collection system

210: Browsing module

220: Judgment module

230: Comparison module

240: Collection module

250: Database

FIG. 1A shows the method for screening web content of the present invention.

FIG. 1B shows the data collection system.

FIG. 2 is a schematic diagram of the homepage.

FIG. 3 is a schematic diagram of an article webpage.

FIG. 4 is a schematic diagram of removing data fields.

The present invention provides a method for screening web content that correctly extracts the main content from webpages while avoiding erroneous or invalid content, thereby improving the accuracy of data collection.

In this embodiment, the method for screening web content is executed by a data collection system 200. Please refer to FIG. 1B, which shows the data collection system. The data collection system 200 is, for example, a web crawler. It includes a browsing module 210, a judgment module 220, a comparison module 230, a collection module 240, and a database 250.

The data collection system 200 connects to and browses a plurality of websites 100 through the browsing module 210. The browsing module 210 determines the category of a website 100 from its meta tag, and preferably browses news websites. The judgment module 220 is adapted to analyze the homepage 110 of the website 100 and find specific positions on it. In this embodiment, the judgment module 220 is a judgment model trained by machine learning or deep learning, so it can identify the content blocks on the homepage 110 by itself, find the article list 114 among them, and obtain two URL links from the article list 114. The browsing module 210 then browses the article webpages 120 corresponding to the two URL links.

The comparison module 230 is adapted to compare article webpages 120. Once the browsing module 210 has browsed the two article webpages 120, the comparison module 230 compares their content. After the comparison, the fields that are identical in the two article webpages 120 are removed and the fields whose content differs are retained. The retained fields are defined as collection fields.

The collection module 240 is adapted to collect data from the article webpages 120, specifically the data in the collection fields. Furthermore, once the data collection system 200 has determined the collection fields of the website 100 through the comparison module 230, the collection module 240 also collects the data in the collection fields of the other article webpages 120 of the website 100. The collected data is sent to the database 250 for storage.
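As a rough illustration of how the modules described above might fit together, the following Python sketch wires them up as injected callables. The class name `DataCollectionSystem`, its method names, and the representation of a page as a dictionary from field position to text are assumptions made for this example; the patent does not specify any of them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Fields = Dict[str, str]  # hypothetical representation: field position -> text


@dataclass
class DataCollectionSystem:
    browse: Callable[[str], str]             # browsing module 210: URL -> HTML
    pick_links: Callable[[str], List[str]]   # judgment module 220: homepage HTML -> article URLs
    extract: Callable[[str], Fields]         # splits an article page into data fields
    database: List[Fields] = field(default_factory=list)  # stands in for database 250

    def learn_collection_fields(self, home_url: str) -> List[str]:
        """Comparison module 230: keep positions whose content differs."""
        links = self.pick_links(self.browse(home_url))
        # Raises ValueError if the article list has fewer than two links.
        first, second = (self.extract(self.browse(u)) for u in links[:2])
        return [p for p in first if p in second and first[p] != second[p]]

    def collect(self, url: str, collection_fields: List[str]) -> Fields:
        """Collection module 240: capture the collection fields and store them."""
        fields = self.extract(self.browse(url))
        record = {p: fields[p] for p in collection_fields if p in fields}
        self.database.append(record)
        return record
```

Passing the modules in as functions keeps the sketch independent of any particular HTTP client or HTML parser.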

The method for screening web content is described in detail below. Please refer to FIG. 1A, which shows the method for screening web content of the present invention. First, a website 100 is visited (step S10). In a preferred embodiment, the website is a news website, and the meta tag of the website can be used to determine whether the website 100 is a news website. After the website 100 is visited, clustering identification is performed on the content of the homepage 110 of the website 100 (step S20).
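The meta tag check in step S10 could look roughly like the Python sketch below, which uses only the standard library. The patent does not say which meta tags or values are inspected, so the keys examined here (`og:type`, `og:site_name`, `keywords`, `description`) and the hint words in `NEWS_HINTS` are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta> name/property -> content pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        if key and attr.get("content") is not None:
            self.meta[key.lower()] = attr["content"]


# Hypothetical hint words; the patent does not list the values it checks.
NEWS_HINTS = ("news", "新聞")


def looks_like_news_site(html: str) -> bool:
    """Return True if any inspected meta tag mentions one of the hint words."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    for key in ("og:type", "og:site_name", "keywords", "description"):
        value = collector.meta.get(key, "").lower()
        if any(hint in value for hint in NEWS_HINTS):
            return True
    return False
```

A crawler could call `looks_like_news_site` on the homepage HTML and skip the remaining steps when it returns `False`.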

Please refer to FIG. 2, which is a schematic diagram of the homepage. In the embodiment of FIG. 2, the homepage 110 is, for example, the homepage of a news website. The homepage 110 therefore includes a category field 111, an advertisement slot 112, a headline field 113, and an article list 114. In step S20, the category field 111, the advertisement slot 112, the headline field 113, and the article list 114 are identified by clustering.

After the content of the homepage 110 has been clustered, the article list 114 on the homepage 110 is found (step S30). The article list 114 contains a plurality of URL links 1141, each of which leads to an article webpage 120 of the website. Two of the URL links 1141 are then taken from the article list 114 (step S40); in the embodiment of FIG. 2 these are, for example, URL links 1141a and 1141b. Next, the webpages corresponding to URL links 1141a and 1141b are visited to obtain two article webpages 120 (step S50).
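Steps S20 to S40 can be approximated with a simple structural heuristic. Note that this is a stand-in: the embodiment uses a trained model to recognize the content blocks, whereas the sketch below merely groups links by the tag/class path of their ancestors and treats the largest group as the article list. The names `LinkGrouper` and `pick_two_article_links` are hypothetical, and the parser assumes that non-void elements are explicitly closed.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkGrouper(HTMLParser):
    """Groups every <a href> by the tag/class path of its ancestors."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # (tag, label) of currently open elements
        self.groups = defaultdict(list)  # ancestor path -> hrefs in document order

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            path = "/".join(label for _, label in self.stack)
            self.groups[path].append(attr["href"])
        if tag not in VOID_TAGS:
            css_class = attr.get("class")
            label = f"{tag}.{css_class}" if css_class else tag
            self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def pick_two_article_links(home_html: str) -> list:
    """Return the first two links of the largest link group (assumed article list)."""
    grouper = LinkGrouper()
    grouper.feed(home_html)
    grouper.close()
    if not grouper.groups:
        return []
    article_list = max(grouper.groups.values(), key=len)
    return article_list[:2]
```

On a homepage where the category bar or an advertisement block contains more links than the article list, this heuristic would pick the wrong group, which is one reason a trained classifier is the more robust choice.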

Please refer to FIG. 3, which is a schematic diagram of an article webpage. The article webpage 120 includes a plurality of data fields, which are the fields holding the content of the webpage. In this embodiment, the data fields include a news category 121, a news headline 122, advertisement slots 123 and 125, and news main text 124 and 126.

Two article webpages 120 were obtained in step S50, so the data fields of the two article webpages 120 are compared next and the data fields with the same content are removed (step S60). Please refer to FIG. 3 and FIG. 4; FIG. 4 is a schematic diagram of removing data fields.

In this embodiment, because the webpages are browsed with the data collection system 200, and given current web technology, some data fields are identical across the webpages browsed by the same terminal. For example, the news category 121 is where the website lets users choose a news category, so it does not change. As for the advertisement slots 123 and 125, advertisement content is usually inserted when a browser renders the webpage, and the interface through which the data collection system 200 browses webpages is not a browser. The advertisement slots 123 and 125 therefore contain only a string of code that loads an advertisement, so they also have the same content from the point of view of the data collection system 200.

In summary, after the comparison, the data fields with the same content, namely the news category 121 and the advertisement slots 123 and 125, can be removed. What remains are the data fields that hold the news content, such as the news headline 122 and the news main text 124 and 126.
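The comparison described above can be sketched in Python as follows. The patent does not define how a page is split into data fields, so this example assumes that each element's XPath-like position (tag names with sibling indices) identifies a field and that the field content is the text directly inside that element. The names `FieldExtractor`, `extract_fields`, and `find_collection_fields` are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FieldExtractor(HTMLParser):
    """Maps the XPath-like position of each element to the text directly inside it."""

    def __init__(self):
        super().__init__()
        self.stack = []       # (tag, path) of currently open elements
        self.counters = [{}]  # per-depth sibling counters: tag -> count so far
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.counters[-1]
        counts[tag] = counts.get(tag, 0) + 1
        parent = self.stack[-1][1] if self.stack else ""
        self.stack.append((tag, f"{parent}/{tag}[{counts[tag]}]"))
        self.counters.append({})

    def handle_endtag(self, tag):
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                del self.counters[i + 1:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = self.stack[-1][1]
            self.fields[path] = (self.fields.get(path, "") + " " + text).strip()


def extract_fields(html: str) -> dict:
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields


def find_collection_fields(page_a: str, page_b: str) -> list:
    """Step S60/S70: drop positions with identical content, keep the rest."""
    fields_a, fields_b = extract_fields(page_a), extract_fields(page_b)
    shared = fields_a.keys() & fields_b.keys()
    return sorted(p for p in shared if fields_a[p] != fields_b[p])
```

Because the standard-library parser delivers the body of a `<script>` element as plain text, an advertisement slot that contains only loader code produces the same field content on both pages and is filtered out, which matches the behavior described for advertisement slots 123 and 125. Positions that occur on only one of the two pages are ignored here; the source does not say how they should be treated.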

Next, the retained data fields are marked as collection fields (step S70), and the data in the collection fields is captured (step S80). Steps S10 to S80 establish the article layout of the website, for example the positions of the news headline 122 and the news main text 124 and 126, so that the positions of these data fields can be applied when collecting the other articles of the website. In other words, the main purpose of steps S10 to S80 is to confirm the article structure of the website 100, find the positions of the content to be collected, and exclude the positions of data that does not need to be collected. When another website is to be collected, steps S10 to S80 are executed on that website to confirm its article structure.

Accordingly, article webpages 120 that have the same position are browsed next (step S90). In this embodiment, whether article webpages have the same position is determined from the XPath position of the article webpage 120. In other words, these article webpages belong to the same website and share the same article layout. The data in the collection fields of these article webpages can then be captured (step S100).
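A minimal sketch of steps S90 and S100 is shown below. It assumes each page has already been reduced to a dictionary from XPath position to text, and it interprets "same position" as "every collection field position is present on the page"; the source leaves the exact test open, so this reading, as well as the function names, is an assumption.

```python
def has_same_layout(page_fields: dict, collection_paths: list) -> bool:
    """True when every collection field position exists on the page."""
    return all(path in page_fields for path in collection_paths)


def capture_collection_fields(pages: dict, collection_paths: list) -> list:
    """Return one record per page that shares the learned article layout.

    pages maps a URL to that page's {XPath position: text} dictionary.
    """
    records = []
    for url, fields in pages.items():
        if not has_same_layout(fields, collection_paths):
            continue  # different layout, so the learned positions do not apply
        record = {"url": url}
        record.update({path: fields[path] for path in collection_paths})
        records.append(record)
    return records
```

Pages that lack one of the learned positions are skipped rather than partially captured, so the stored records all have the same set of keys.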

Through steps S10 to S80 above, two webpages of the same website are retrieved and their content is compared to determine the data fields that need to be collected (such as the news main text, headline, and publication time) and to remove the data fields that do not need to be collected (such as advertisements and news categories). The determination made in steps S10 to S80 identifies the article layout of the website. Steps S90 to S100 then collect the content of the other articles of the same website. By repeating steps S10 to S100, webpage data can be collected more efficiently and invalid data can be excluded.

The present invention has been described above, but the description is not intended to limit the scope of the patent rights claimed. The scope of patent protection shall be determined by the appended claims and their equivalents. Any changes or modifications made by a person of ordinary skill in the art without departing from the spirit or scope of this patent are equivalent changes or designs made within the spirit disclosed herein and shall fall within the scope of the following claims.

Claims

A method for screening web content, including: S10: visiting a website; S20: performing cluster identification on the content of a homepage of the website; S30: finding an article list on the homepage, the article list including a plurality of URL links; S40: taking two of the URL links from the article list; S50: following each of the two URL links to obtain two article webpages, each article webpage including a plurality of data fields; S60: comparing the two article webpages and removing the data fields that have the same content; S70: marking the remaining data fields as collection fields; and S80: capturing the data in the collection fields.

The method for screening web content according to claim 1, wherein the website is a news website.

The method for screening web content according to claim 2, wherein in step S10, a meta tag of the website is used to determine whether the website is a news website.

The method for screening web content according to claim 1, further including: S90: browsing article webpages that have the same position; and S100: capturing the data in the collection fields of those article webpages.

The method for screening web content according to claim 4, wherein in step S90, whether article webpages have the same position is determined from the XPath position of each article webpage.

The method for screening web content according to claim 1, wherein in step S20, the website is clustered by means of machine learning.

The method for screening web content according to any one of claims 1 to 6, wherein the method is executed by a data collection system.