TWI738126B - Web content filtering method - Google Patents
- Publication number: TWI738126B
- Application number: TW108142681A
- Authority: TW (Taiwan)
- Prior art keywords
- article
- website
- data
- web content
- webpage
Description
A data screening method, in particular a method for screening web content.
A web crawler is a widely used data collection tool that automatically browses web pages and downloads and collects their content. Because crawlers can replace manual labor, and given current technology trends, they play an important role in the big data industry.
However, although a crawler can retrieve web content automatically, it cannot tell the collected data apart, so invalid data and advertisements still make up most of what is collected. The collected data is therefore mostly screened by manual identification, or by separately designed algorithms. This is inconvenient for certain analysis fields and also affects the accuracy of the analysis results.
How to improve the accuracy of crawler data collection is therefore a question worth considering for those of ordinary skill in the art.
The present invention provides a method for screening web content, including: S10: visiting a website; S20: performing cluster identification on the content of a home page of the website; S30: finding an article list on the home page, the article list including multiple URL links; S40: taking two of the URL links from the article list; S50: following each of the two URL links to obtain two article pages, each article page including multiple data fields; S60: comparing the two article pages and removing the data fields with identical content; S70: marking the remaining data fields as collection fields; and S80: retrieving the data in the collection fields.
In the above method for screening web content, the website is a news website.
In the above method for screening web content, in step S10, the Meta tag of the website is used to determine whether the website is a news website.
The above method for screening web content further includes: S90: browsing the article pages that have the same position; and S100: retrieving the data in the collection fields of those article pages.
In the above method for screening web content, in step S90, the XPath position of an article page is used to determine whether it has the same position.
In the above method for screening web content, in step S20, the website is grouped by means of machine learning.
In the above method for screening web content, the method is executed by a data collection system.
S10~S100: Flowchart steps
110: Home page
111: Category field
112: Ad slot
113: Headline field
114: Article list
1141, 1141a~d: URL links
120: Article page
121: News category
122: News title
123, 125: Ad slots
124, 126: News body text
200: Data collection system
210: Browsing module
220: Judgment module
230: Comparison module
240: Collection module
250: Database
FIG. 1A shows the web content screening method of the present invention.
FIG. 1B shows the data collection system.
FIG. 2 is a schematic diagram of the home page.
FIG. 3 is a schematic diagram of an article page.
FIG. 4 is a schematic diagram of removing data fields.
The present invention provides a method for screening web content that correctly extracts the main content from web pages while avoiding erroneous or invalid content, improving the accuracy of data collection.
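In this embodiment, the web content screening method is carried out by a data collection system 200. Referring to FIG. 1B, which shows the data collection system, the data collection system 200 is, for example, a web crawler. It includes a browsing module 210, a judgment module 220, a comparison module 230, a collection module 240, and a database 250.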
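The data collection system 200 connects to and browses multiple websites 100 through the browsing module 210. The browsing module 210 determines a website's category from the Meta tag of the website 100, and preferably browses news websites. The judgment module 220 examines the home page 110 of the website 100 and locates specific positions on it. In this embodiment, the judgment module 220 is a model trained by machine learning or deep learning, so it can identify the content blocks on the home page 110 on its own, find the article list 114 among them, and obtain two URL links from the article list 114. The browsing module 210 then browses the article pages 120 corresponding to the two URL links.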
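The comparison module 230 compares article pages 120. Once the browsing module 210 has browsed the two article pages 120, the comparison module 230 compares their content. After the comparison, fields that are identical in both article pages 120 are removed and fields whose content differs are kept. The kept fields are defined as collection fields.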
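The collection module 240 collects data from the article pages 120, specifically the data in the collection fields. Furthermore, once the data collection system 200 has determined the collection fields of the website 100 through the comparison module 230, the collection module 240 also collects the data in the collection fields of the other article pages 120 of the website 100. The collected data is sent to the database 250 for storage.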
The web content screening method is described in detail below. Referring to FIG. 1A, which shows the web content screening method of the present invention, a website 100 is first visited (step S10). In a preferred embodiment, the website is a news website, and the Meta tag of the website can be used to determine whether the website 100 is a news website. After the website 100 is visited, cluster identification is performed on the content of the home page 110 of the website 100 (step S20).
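The Meta tag check in step S10 could be sketched as follows. This is only an illustration under stated assumptions: the text does not say which Meta tags are read or what rule decides that a site is a news site, so the inspected keys (`keywords`, `description`, `og:site_name`, `og:title`), the `NEWS_HINTS` keyword list, and the function name `looks_like_news_site` are all hypothetical.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta> tags as a {name-or-property: content} mapping."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if key and content:
            self.meta[key.lower()] = content.lower()


# Hypothetical keyword list; the source does not specify the actual rule.
NEWS_HINTS = ("news",)

# Hypothetical choice of Meta tags to inspect.
INSPECTED_KEYS = ("keywords", "description", "og:site_name", "og:title")


def looks_like_news_site(html):
    """Return True if any inspected Meta tag mentions a news keyword."""
    collector = MetaTagCollector()
    collector.feed(html)
    text = " ".join(collector.meta.get(key, "") for key in INSPECTED_KEYS)
    return any(hint in text for hint in NEWS_HINTS)
```

A real crawler would likely need a richer rule (language-specific keywords, structured metadata), but the shape stays the same: read the Meta tags first, then decide whether to continue with the site.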
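Referring to FIG. 2, which is a schematic diagram of the home page: in the embodiment of FIG. 2, the home page 110 is, for example, the home page of a news website, so it includes a category field 111, an ad slot 112, a headline field 113, and an article list 114. In step S20, the category field 111, the ad slot 112, the headline field 113, and the article list 114 are identified as separate groups.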
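After the content of the home page 110 is grouped, the article list 114 on the home page 110 is located (step S30). The article list 114 contains multiple URL links 1141, each linking to an article page 120 of the website. Two of the URL links 1141 are then taken from the article list 114 (step S40), for example URL links 1141a and 1141b in the embodiment of FIG. 2. Next, the pages corresponding to URL links 1141a and 1141b are visited to obtain two article pages 120 (step S50).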
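Steps S20 to S40 can be illustrated with a simple structural heuristic. Note that this is a stand-in, not the trained model the embodiment describes: it groups links by the tag path of their container and assumes the largest group is the article list. The class `LinkGrouper`, the function `pick_two_article_links`, and the "largest group wins" rule are assumptions made for this sketch.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkGrouper(HTMLParser):
    """Groups <a href> links by the labelled tag path of their container."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stack = []                  # open elements as (tag, label)
        self.groups = defaultdict(list)  # container path -> absolute URLs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        label = tag
        if attributes.get("id"):
            label += "#" + attributes["id"]
        elif classes:
            label += "." + classes[0]
        if tag == "a" and attributes.get("href"):
            container = "/".join(lbl for _, lbl in self.stack)
            url = urljoin(self.base_url, attributes["href"])
            self.groups[container].append(url)
        self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if any(open_tag == tag for open_tag, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass


def pick_two_article_links(home_html, base_url):
    """Assume the largest link group is the article list; return two links."""
    grouper = LinkGrouper(base_url)
    grouper.feed(home_html)
    if not grouper.groups:
        return []
    article_list = max(grouper.groups.values(), key=len)
    return article_list[:2]
```

The heuristic fails on pages where a navigation menu holds more links than the article list, which is one reason a trained classifier, as in the embodiment, may be preferable.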
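Referring to FIG. 3, which is a schematic diagram of an article page: an article page 120 includes multiple data fields, which are the fields that hold the page's content. In this embodiment, the data fields include a news category 121, a news title 122, ad slots 123 and 125, and news body text 124 and 126.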
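One way to represent the data fields of an article page is a mapping from each element's position to its text. The sketch below is an assumption about representation, not something the text prescribes: it builds XPath-like positional keys such as `/html[1]/body[1]/h1[1]` using only the standard library, and it assumes well-formed HTML in which every non-void element is explicitly closed. `FieldExtractor` and `extract_fields` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and carry no text of their own.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FieldExtractor(HTMLParser):
    """Maps an XPath-like position to the text found at that position."""

    def __init__(self):
        super().__init__()
        self.path = []        # e.g. ["html[1]", "body[1]"]
        self.counters = [{}]  # per depth: how many children of each tag so far
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        siblings = self.counters[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        self.path.append(f"{tag}[{siblings[tag]}]")
        self.counters.append({})

    def handle_endtag(self, tag):
        # Assumes well-formed input: each end tag closes the innermost element.
        if tag in VOID_TAGS or not self.path:
            return
        self.path.pop()
        self.counters.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.path:
            key = "/" + "/".join(self.path)
            self.fields[key] = (self.fields.get(key, "") + " " + text).strip()


def extract_fields(html):
    """Return {position: text} for every element that directly holds text."""
    extractor = FieldExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.fields
```

Keying fields by position rather than by content makes two pages of the same site directly comparable, since the same template places the title, body, and ad slots at the same positions.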
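Two article pages 120 were obtained in step S50, so the data fields of the two article pages 120 are compared next, and the data fields with identical content are removed (step S60). Refer to FIG. 3 and FIG. 4; FIG. 4 is a schematic diagram of removing data fields.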
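In this embodiment, because the web pages are browsed with the data collection system 200, and given current web technology, some data fields are identical across the pages browsed from the same terminal. For example, the news category 121 is where the website lets users choose a news category, so it does not change. As for the ad slots 123 and 125, ad content is usually injected when a browser renders the page, but the interface through which the data collection system 200 browses pages is not a browser. The ad slots 123 and 125 therefore contain only a string of ad-loading code, and so also have identical content as far as the data collection system 200 is concerned.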
In summary, after the comparison, the data fields with identical content, namely the news category 121 and the ad slots 123 and 125, can be removed. What remains are the data fields that hold the news content, such as the news title 122 and the news body text 124 and 126.
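The comparison in steps S60 and S70 reduces to a field-by-field diff. A minimal sketch, assuming each article page has already been turned into a `{position: content}` dictionary; the function name `find_collection_fields` and the choice to ignore positions present in only one page are assumptions, since the text does not cover that case.

```python
def find_collection_fields(page_a, page_b):
    """Return the positions whose content differs between two article pages.

    Positions with identical content (for example the category menu or the
    ad-loading code) are dropped. Positions missing from either page are
    ignored, which is an assumption of this sketch.
    """
    shared_positions = page_a.keys() & page_b.keys()
    return sorted(pos for pos in shared_positions if page_a[pos] != page_b[pos])


# Two hypothetical article pages from the same site.
page_a = {
    "category": "Politics | Sports | Life",
    "title": "Title A",
    "ad": "<script src='ad.js'></script>",
    "body": "Text A",
}
page_b = {
    "category": "Politics | Sports | Life",
    "title": "Title B",
    "ad": "<script src='ad.js'></script>",
    "body": "Text B",
}

collection_fields = find_collection_fields(page_a, page_b)
```

Here `collection_fields` contains `body` and `title`, while `category` and `ad` are discarded because both pages carry the same content at those positions.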
Next, the retained data fields are marked as collection fields (step S70), and the data in the collection fields is retrieved (step S80). Steps S10 to S80 establish the article layout of the website, for example the positions of the news title 122 and the news body text 124 and 126, so that the field positions can be applied to collect the other articles of the website. In other words, the main purpose of steps S10 to S80 is to confirm the article structure of the website 100, find the positions of the content to be collected, and exclude the positions of data that need not be collected. When another website is to be collected, steps S10 to S80 are performed on that website to confirm its article structure.
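Article pages 120 with the same position are therefore browsed next (step S90). In this embodiment, the XPath position of an article page 120 is used to determine whether it has the same position, that is, whether these article pages belong to the same website and share the same article layout. The data in the collection fields of these article pages can then be retrieved (step S100).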
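Steps S90 and S100 can be sketched as a filter followed by an extraction. The text does not define exactly how the XPath position is compared, so this sketch assumes a page "has the same position" when every collection-field position exists in it. The names `has_same_layout` and `collect` are hypothetical, and pages are assumed to be pre-parsed into `{position: content}` dictionaries.

```python
def has_same_layout(fields, collection_positions):
    """Assumed test for step S90: all collection positions exist in the page."""
    return all(pos in fields for pos in collection_positions)


def collect(pages, collection_positions):
    """Step S100: pull the collection fields out of every matching page.

    `pages` maps a URL to that page's {position: content} dictionary.
    Pages with a different layout are skipped rather than partially collected.
    """
    records = []
    for url, fields in pages.items():
        if not has_same_layout(fields, collection_positions):
            continue
        record = {"url": url}
        record.update({pos: fields[pos] for pos in collection_positions})
        records.append(record)
    return records
```

Skipping pages that lack a collection position keeps non-article pages of the same site, such as an "about" page, out of the collected data.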
Through steps S10 to S80 above, two pages of the same website are retrieved and their content is compared to determine the data fields that need to be collected (such as news body text, title, and publication time) and to remove the data fields that need not be collected (such as advertisements and news categories). The determinations of steps S10 to S80 reveal the article layout used across the website. Steps S90 to S100 then collect the content of the other articles of the same website. By repeating steps S10 to S100, web page data can be collected more efficiently and invalid data excluded.
The present invention is described above, but the description is not intended to limit the scope of the patent rights claimed. The scope of patent protection is determined by the appended claims and their equivalents. Any changes or refinements made by a person of ordinary skill in the art without departing from the spirit or scope of this patent are equivalent changes or designs completed within the spirit disclosed herein, and shall fall within the scope of the following claims.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW108142681A TWI738126B (en) | 2019-11-25 | 2019-11-25 | Web content filtering method |
Publications (2)
Publication Number | Publication Date |
---|---|
TW202121200A TW202121200A (en) | 2021-06-01 |
TWI738126B true TWI738126B (en) | 2021-09-01 |
Family
ID=77516512
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW108142681A TWI738126B (en) | 2019-11-25 | 2019-11-25 | Web content filtering method |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI738126B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101630325A (en) * | 2009-08-18 | 2010-01-20 | 北京大学 | Webpage clustering method based on script feature |
CN105138701A (en) * | 2015-09-29 | 2015-12-09 | 北京奇虎科技有限公司 | Method and device for extracting contents of index pages and search engine |
TW201705021A (en) * | 2015-07-23 | 2017-02-01 | 葆光資訊有限公司 | An information retrieving method utilizing webpage visual features and webpage language features and a system using thereof |
US20190222560A1 (en) * | 2015-08-05 | 2019-07-18 | Intralinks, Inc. | Systems and methods of secure data exchange |
2019-11-25: TW application TW108142681A, patent TWI738126B (active)