CN111125485A - Website URL crawling method based on Scapy - Google Patents

Website URL crawling method based on Scapy

Info

Publication number
CN111125485A
Authority
CN
China
Prior art keywords
url
crawling
database
website
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911323361.9A
Other languages
Chinese (zh)
Inventor
何建锋
袁莺
马昱阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jiaotong University Jump Network Technology Co ltd
Original Assignee
Xi'an Jiaotong University Jump Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jiaotong University Jump Network Technology Co ltd filed Critical Xi'an Jiaotong University Jump Network Technology Co ltd
Priority to CN201911323361.9A
Publication of CN111125485A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a website URL crawling method based on Scrapy, which comprises the following steps: reading a target URL from a first database, requesting the target URL and downloading its webpage; extracting all URLs from the webpage and de-duplicating them with a bloom filter; storing the de-duplicated URLs in both the first database and a second database; and reading URLs with the same domain name from the first database, de-duplicating and storing them, constructing HTTP requests for the de-duplicated URLs, and crawling again. The method reduces the resources taken up by inefficient crawling and helps improve overall efficiency.

Description

Website URL crawling method based on Scrapy
Technical Field
The invention belongs to the technical field of computers and networks, and particularly relates to a website URL crawling method based on Scrapy.
Background
With the evolution of information technology and the growing use of networked information systems, the information technology revolution represented by the internet has reshaped many areas of social life and production. Employees who engage in work-unrelated online entertainment during working hours reduce productivity; according to statistics, nearly 37% of enterprise employees' daily internet activity is spent on online chatting, browsing news and entertainment, following stock quotes, visiting pornographic websites, or handling personal matters.
An information auditing system uses various technical means to monitor network behavior and communication content in a network environment in real time, so that they can be collected, recorded, analyzed, alerted on, and processed in a centralized manner. One of its core tasks is to analyze large amounts of website data, quickly classify and identify URLs, and characterize webpage-browsing behavior for auditing purposes. How to crawl website URLs efficiently is therefore a key technical problem.
Disclosure of Invention
In view of this, a website URL crawling method based on Scrapy is provided. It builds on and improves the Python-based Scrapy framework to crawl website URLs quickly while occupying less memory.
The website URL crawling method based on Scrapy comprises the following steps:
S1, reading a target URL from a first database, requesting the target URL and downloading its webpage;
S2, extracting all URLs from the webpage, and performing bloom-filter de-duplication;
S3, storing the de-duplicated URLs in both the first database and a second database;
and S4, reading URLs with the same domain name from the first database, de-duplicating and storing them, constructing HTTP requests for the de-duplicated URLs, crawling again, and returning to step S1.
In S4, when the condition for stopping crawling of a URL is triggered, a new target URL is read from the first database and crawling starts again.
Preferably, the condition for stopping crawling of a URL includes: when the DNS query for the URL times out and/or the webpage download times out, stopping crawling of the current URL and reading the next target URL; and requiring that at least M new URLs be generated within the most recent time window of length N, so that when fewer than M new URLs are added from the URL's webpages within the latest window of length N, crawling stops and the next target URL is read.
Further, in step S4, at least two crawling processes are set; each crawling process crawls webpages under one domain name and maintains its own queue of URLs to be crawled.
When the crawling stop condition of a URL is triggered, the URL queue of the current crawling process is emptied.
Preferably, URLs are extracted from the webpage using XPath; the first database is a Redis database in which the URLs to be crawled are stored as a distributed message queue; the second database is a MySQL database that stores the crawled URLs awaiting analysis; and the de-duplication of URLs with the same domain name in S4 is implemented with a Python set container.
The URL crawling method further includes extracting the website icon (favicon):
S51, reading a URL to be analyzed from the second database and downloading its webpage;
S52, extracting the website title and all links from the webpage, and filtering the links by keywords to obtain picture links and the website icon link;
S53, sending HTTP requests after de-duplicating the links, and judging whether the website icon file is downloaded successfully:
if the download succeeds, cropping the picture file according to the configuration requirements and saving it;
if the download fails, saving another successfully downloaded picture as the website icon, and if none of the webpage's pictures is downloaded successfully, drawing the first Chinese character of the website title and saving it as the website icon;
and using the MD5 code of the website icon picture as its file name and writing it into the database.
The website icon link refers to the link formed by the website domain name plus '/favicon.ico'.
Preferably, the URLs in the second database are classified according to a keyword dictionary; the keyword dictionary comprises a mapping table between URL types and keywords appearing in URLs.
According to the above technical scheme, the crawler is built on an improved Scrapy framework and is realized quickly, and the content or pictures of specified websites are captured and stored in a database for analysis. Redis operations are atomic and Redis is an in-memory database, so using it as a distributed message queue effectively improves crawling efficiency. The bloom filter used for URL de-duplication is fast and has a low memory footprint. Moreover, necessary stop conditions are set for the crawling process, reducing the resources consumed by inefficient crawling and improving overall efficiency. The icons of crawled websites are extracted or drawn so that websites can be located quickly during auditing. The crawled URLs are classified according to a keyword dictionary, which enables fast recognition of website browsing, improves URL matching efficiency, and supports URL auditing.
Drawings
FIG. 1 is a schematic diagram of the Scrapy framework according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a crawling workflow according to an embodiment of the website URL crawling method of the present invention;
FIG. 3 is a schematic view of the website icon (favicon) extraction process according to an embodiment of the website URL crawling method of the present invention.
Detailed Description
A web crawler is a program that automatically fetches webpages. Starting from the URLs of one or more initial webpages, it acquires the URLs on those pages and, while fetching pages, continuously extracts new URLs from the current page and puts them into a queue. The workflow of a focused crawler is more complex: links irrelevant to the topic are filtered out according to a webpage analysis algorithm, and useful links are kept and placed into the queue of URLs to be fetched. The crawler then selects the next webpage URL from the queue according to a search strategy and repeats the process.
Scrapy is an application framework written for crawling website data and extracting structured data; it is widely used in programs for data mining, information processing, and historical data storage.
As shown in FIG. 1, the framework structure of Scrapy includes:
the Scrapy Engine, which is responsible for communication, signal transmission, data transfer, and the like among the Spider, Item Pipeline, Downloader, and Scheduler;
the Scheduler, which receives Requests sent by the engine, arranges and enqueues them in a certain order, and returns them to the engine when the engine needs them;
the Downloader, which downloads all Requests sent by the Scrapy Engine and returns the obtained Responses to the Scrapy Engine, which hands them to the Spider for processing;
the Spider, which processes all Responses, analyzes and extracts data from them to fill the Item fields, and submits the URLs that need to be followed to the engine so that they enter the Scheduler again;
the Item Pipeline, which processes the Items obtained from the Spider and performs post-processing (detailed analysis, filtering, storage, and the like);
the Downloader Middlewares, components through which the download functionality can be customized and extended;
the Spider Middlewares, functional components through which the communication between the engine and the Spiders can be customized and extended (such as Responses entering the Spider and Requests leaving the Spider).
With the Scrapy framework, a crawler that captures the content or pictures of a specified website can therefore be implemented simply.
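For illustration only (this sketch is not part of the patent; the spider name, start URL, and item format are assumed placeholders), a minimal Scrapy spider covering the basic loop of downloading a page, extracting links with XPath, and following them might look as follows:

    import scrapy

    class SiteUrlSpider(scrapy.Spider):
        """Minimal spider: download a page, extract every hyperlink with XPath,
        emit it as an item, and follow it for further crawling."""
        name = "site_url_spider"                  # placeholder spider name
        start_urls = ["https://example.com/"]     # placeholder target URL

        def parse(self, response):
            for href in response.xpath("//a/@href").getall():
                yield {"url": response.urljoin(href)}             # handed to the Item Pipeline
                yield response.follow(href, callback=self.parse)  # sent back to the Scheduler

Yielding a dictionary hands the extracted URL to the Item Pipeline, while yielding a Request sends it back through the engine to the Scheduler, mirroring the component roles listed above.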
As shown in FIG. 2, the website URL crawling method based on Scrapy includes:
Step 1, reading a target URL from the first database, requesting the target URL and downloading its webpage;
Step 2, extracting all URLs from the webpage and de-duplicating them with a bloom filter (see the bloom-filter sketch below);
Step 3, storing the de-duplicated URLs in both the first database and the second database;
and Step 4, reading URLs with the same domain name from the first database, de-duplicating and storing them, constructing HTTP requests for the de-duplicated URLs, crawling again, and returning to Step 1.
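As an illustrative sketch of the bloom de-duplication in Step 2 (not the patented implementation; the bit-array size, hash count, and example URLs are assumptions), a simple bloom filter can be built from salted MD5 digests:

    import hashlib

    class SimpleBloomFilter:
        """Toy bloom filter for URL de-duplication. A production crawler would size
        the bit array from the expected URL count and the acceptable false-positive
        rate, or use an existing bloom-filter library instead."""

        def __init__(self, size_bits=1 << 23, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, url):
            # Derive several bit positions from salted MD5 digests of the URL.
            for salt in range(self.num_hashes):
                digest = hashlib.md5(f"{salt}:{url}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, url):
            for pos in self._positions(url):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def seen(self, url):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

    # Usage in Step 2: only URLs the filter has not seen are stored and scheduled.
    bloom = SimpleBloomFilter()
    for url in ["https://example.com/a", "https://example.com/a"]:
        if not bloom.seen(url):
            bloom.add(url)   # new URL: remember it, then push it to both databases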
In Step 4, when the condition for stopping crawling of a URL is triggered, a new target URL is read from the first database and crawling starts again. The conditions for stopping crawling of a URL include: when the DNS query for the URL times out and/or the webpage download times out, crawling of the current URL stops and the next target URL is read; and, for large websites with few external links and many internal links, in order to avoid wasting too much time and network bandwidth, at least M new URLs must be generated within the most recent time window of length N; when fewer than M new URLs are added from the URL's webpages within the latest window of length N, crawling stops and the next target URL is read.
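The DNS and download timeouts would normally be enforced by the downloader itself (for example through Scrapy's DNS_TIMEOUT and DOWNLOAD_TIMEOUT settings), while the "fewer than M new URLs within the latest window of length N" condition can be tracked with a small helper such as the following sketch (the class name and the default values of M and N are assumptions for illustration):

    import time
    from collections import deque

    class CrawlStopMonitor:
        """Stop crawling the current target when fewer than min_new_urls (M) new
        URLs were discovered within the most recent window_seconds (N) seconds."""

        def __init__(self, min_new_urls=10, window_seconds=60):
            self.min_new_urls = min_new_urls
            self.window_seconds = window_seconds
            self.discovery_times = deque()

        def record_new_url(self):
            self.discovery_times.append(time.time())

        def should_stop(self):
            cutoff = time.time() - self.window_seconds
            while self.discovery_times and self.discovery_times[0] < cutoff:
                self.discovery_times.popleft()
            return len(self.discovery_times) < self.min_new_urls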
Preferably, in Step 4, at least two crawling processes are set; each crawling process crawls webpages under one domain name and maintains its own queue of URLs to be crawled. When the crawling stop condition of a URL is triggered, the URL queue of the current crawling process is emptied.
Preferably, URLs are extracted from the webpage using XPath. XPath is a language for locating information in XML documents; it can traverse the elements and attributes of an XML document, and it is also quite efficient when applied to HTML webpage source code, traversing each HTML tag and attribute, locating the required information, and extracting it.
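A minimal sketch of such XPath-based link extraction, assuming the third-party lxml library (Scrapy's own selectors expose an equivalent xpath() interface):

    from lxml import html

    def extract_urls(page_source, page_url):
        """Parse the downloaded HTML and return every hyperlink target as an absolute URL."""
        tree = html.fromstring(page_source)
        tree.make_links_absolute(page_url)   # resolve relative links against the page URL
        return tree.xpath("//a/@href")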
The first database is a Redis database in which the URLs to be crawled are stored as a distributed message queue; Redis operations are atomic and Redis is an in-memory database, so it can be used as a distributed message queue.
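The following sketch shows how a Redis list could serve as the shared queue of URLs to be crawled, assuming the redis-py client and a hypothetical key name urls_to_crawl:

    import redis  # redis-py client assumed

    r = redis.Redis(host="localhost", port=6379, db=0)
    TO_CRAWL_KEY = "urls_to_crawl"   # hypothetical queue key

    def push_urls(urls):
        # Producer side: add newly discovered, de-duplicated URLs to the queue.
        if urls:
            r.lpush(TO_CRAWL_KEY, *urls)

    def pop_url(timeout=5):
        # Consumer side: block until a URL is available, then return it.
        item = r.brpop(TO_CRAWL_KEY, timeout=timeout)
        return item[1].decode() if item else None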
The second database is a MySQL database that stores the crawled URLs awaiting analysis.
and (4) the duplicate removal of the URL with the same domain name in the step 4 is realized by using a Set container of Python.
As shown in FIG. 3, the URL crawling method further includes extracting the website icon. It should be noted that the website icon (favicon, short for Favorites Icon) can be displayed on a browser tab, to the left of the address bar, and in the favorites list; it is a small logo expressing the character of a website and is sometimes called the website avatar. It lets the browser's favorites show a corresponding title and distinguishes different websites by icon, making websites easy to identify at a glance. The favicon of most webpages is stored in a favicon.ico file found in the website directory.
The process of extracting the website icon (favicon) comprises the following steps:
Step 1, reading a URL to be analyzed from the second database (i.e. the MySQL database) and downloading its webpage;
Step 2, extracting the website title and all links from the webpage, and filtering the links by keywords to obtain picture links and the website icon link;
Step 3, sending HTTP requests after de-duplicating the links, and judging whether the website icon file is downloaded successfully (see the sketch after this list):
if the download succeeds, the picture file is cropped according to the configuration requirements and then saved;
if the download fails, another successfully downloaded picture is saved as the website icon, and if none of the webpage's pictures is downloaded successfully, the first Chinese character of the website title is drawn and saved as the website icon;
the MD5 code of the website icon picture is used as its file name and written into the database.
The website icon link refers to the link formed by the website domain name plus '/favicon.ico'.
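An illustrative sketch of the favicon download and MD5-based file naming described above, assuming the third-party requests library (the function name and fallback handling are assumptions; drawing the first character of the title is left to the caller):

    import hashlib
    from urllib.parse import urljoin
    import requests

    def fetch_site_icon(site_url, fallback_image_urls=(), timeout=5):
        """Try the conventional /favicon.ico location first, then fall back to other
        picture links extracted from the page; the saved file is named after its MD5."""
        candidates = [urljoin(site_url, "/favicon.ico"), *fallback_image_urls]
        for url in candidates:
            try:
                resp = requests.get(url, timeout=timeout)
            except requests.RequestException:
                continue
            if resp.status_code == 200 and resp.content:
                name = hashlib.md5(resp.content).hexdigest()
                with open(name + ".ico", "wb") as fh:   # MD5 digest becomes the file name
                    fh.write(resp.content)
                return name
        return None   # caller then draws the first character of the site title instead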
Preferably, the URLs in the second database are classified according to a keyword dictionary. The keyword dictionary includes a mapping table between URL types and keywords appearing in URLs; for example, the URL keywords of a technology website include tech, it, csdn, and so on. When a user browses a webpage, the information auditing system matches its URL against the keyword dictionary to decide whether access to the webpage should be allowed.
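A minimal sketch of this keyword-dictionary classification; the dictionary below only extends the tech/it/csdn example from the text, and the other entries are assumptions:

    # Hypothetical keyword dictionary: URL category -> keywords appearing in URLs.
    URL_KEYWORD_DICT = {
        "technology": ["tech", "it", "csdn"],
        "news": ["news", "daily"],
    }

    def classify_url(url):
        """Return the first category whose keywords occur in the URL, or None."""
        lowered = url.lower()
        for category, keywords in URL_KEYWORD_DICT.items():
            if any(keyword in lowered for keyword in keywords):
                return category
        return None

    # e.g. classify_url("https://www.csdn.net/article") -> "technology"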
With this technical scheme, a large number of URLs can be crawled quickly and efficiently for analysis, improving the efficiency and accuracy of URL filtering and auditing in the information auditing system.

Claims (10)

1. The website URL crawling method based on Scrapy is characterized by comprising the following steps:
s1, reading a target URL from a first database, and requesting and downloading a webpage of the target URL;
s2, extracting all URLs from the webpage, and performing bloom deduplication;
s3, simultaneously storing the duplicate-removed URL into a first database and a second database;
and S4, reading the URL with the same domain name from the first database, removing the duplicate, storing, constructing an http request for the URL after the duplicate removal, crawling again, and returning to the step S1.
2. The URL crawling method according to claim 1, wherein in S4, when the URL crawling stop condition is triggered, a new target URL is read from the first database and crawling starts again.
3. The URL crawling method according to claim 2, wherein the condition for stopping crawling of the URL comprises: when the DNS query for the URL times out and/or the webpage download times out, stopping crawling of the current URL and reading the next target URL.
4. The URL crawling method according to claim 2, wherein the condition for stopping crawling of the URL comprises: at least M new URLs being generated within the most recent time window of length N; when fewer than M new URLs are added from the URL's webpages within the latest window of length N, stopping crawling and reading the next target URL.
5. The URL crawling method according to claim 1, wherein in S4, at least two crawling processes are set, each crawling process crawls webpages under one domain name, and each crawling process maintains a corresponding queue of URLs to be crawled.
6. The URL crawling method according to any one of claims 2 to 5, wherein when the crawling stop condition of the URL is triggered, the URL queue of the current crawling process is emptied.
7. The URL crawling method as claimed in claim 1, wherein the URLs in the webpage are extracted using XPath; the first database is a Redis database in which the URLs to be crawled are stored as a distributed message queue; the second database is a MySQL database that stores the crawled URLs awaiting analysis; and the de-duplication of URLs with the same domain name in S4 is implemented with a Python set container.
8. The URL crawling method according to claim 1, further comprising extracting the website icon (favicon):
S51, reading a URL to be analyzed from the second database and downloading its webpage;
S52, extracting the website title and all links from the webpage, and filtering the links by keywords to obtain picture links and the website icon link;
S53, sending HTTP requests after de-duplicating the links, and judging whether the website icon file is downloaded successfully:
if the download succeeds, cropping the picture file according to the configuration requirements and saving it;
if the download fails, saving another successfully downloaded picture as the website icon, and if none of the webpage's pictures is downloaded successfully, drawing the first Chinese character of the website title and saving it as the website icon;
and using the MD5 code of the website icon picture as its file name and writing it into the database.
9. The URL crawling method as claimed in claim 8, wherein the website icon link is the link formed by the website domain name plus '/favicon.ico'.
10. The URL crawling method according to claim 1 or 8, wherein the URLs in the second database are classified according to a keyword dictionary; the keyword dictionary comprises a mapping table between URL types and keywords appearing in URLs.
CN201911323361.9A 2019-12-20 2019-12-20 Website URL crawling method based on Scapy Pending CN111125485A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911323361.9A CN111125485A (en) 2019-12-20 2019-12-20 Website URL crawling method based on Scapy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911323361.9A CN111125485A (en) 2019-12-20 2019-12-20 Website URL crawling method based on Scapy

Publications (1)

Publication Number Publication Date
CN111125485A true CN111125485A (en) 2020-05-08

Family

ID=70500953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911323361.9A Pending CN111125485A (en) 2019-12-20 2019-12-20 Website URL crawling method based on Scapy

Country Status (1)

Country Link
CN (1) CN111125485A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112422707A (en) * 2020-10-22 2021-02-26 北京安博通科技股份有限公司 Domain name data mining method and device and Redis server
CN112579862A (en) * 2020-12-22 2021-03-30 福建江夏学院 Xpath automatic extraction method based on MD5 value comparison
CN112579862B (en) * 2020-12-22 2022-06-14 福建江夏学院 Xpath automatic extraction method based on MD5 value comparison
CN113220703A (en) * 2021-05-31 2021-08-06 普瑞纯证医疗科技(广州)有限公司 Method, server and system for updating medical data based on big data platform

Similar Documents

Publication Publication Date Title
US9614862B2 (en) System and method for webpage analysis
US9390176B2 (en) System and method for recursively traversing the internet and other sources to identify, gather, curate, adjudicate, and qualify business identity and related data
CN109902220B (en) Webpage information acquisition method, device and computer readable storage medium
US8601120B2 (en) Update notification method and system
CN102200980B (en) Method and system for providing network resources
US6910071B2 (en) Surveillance monitoring and automated reporting method for detecting data changes
CN108737423B (en) Phishing website discovery method and system based on webpage key content similarity analysis
CN102930059B (en) Method for designing focused crawler
US20150134913A1 (en) Method and apparatus for cleaning files in a mobile terminal and associated mobile terminal
US20100115003A1 (en) Methods For Merging Text Snippets For Context Classification
CN108021598B (en) Page extraction template matching method and device and server
CN111125485A (en) Website URL crawling method based on Scapy
US20150341771A1 (en) Hotspot aggregation method and device
CN111008348A (en) Anti-crawler method, terminal, server and computer readable storage medium
US11443006B2 (en) Intelligent browser bookmark management
CN109600385B (en) Access control method and device
WO2021068681A1 (en) Tag analysis method and device, and computer readable storage medium
CN104391978A (en) Method and device for storing and processing web pages of browsers
CN111444408A (en) Network search processing method and device and electronic equipment
CN114528457A (en) Web fingerprint detection method and related equipment
CN111460255A (en) Music work information data acquisition and storage method
CN103605742A (en) Method and device for recognizing network resource entity content page
Hurst et al. Social streams blog crawler
JPWO2018056299A1 (en) INFORMATION COLLECTION SYSTEM, INFORMATION COLLECTION METHOD, AND PROGRAM
CN116166867A (en) Content filtering method, device, equipment and storage medium for network acquisition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination