CN111125485A - Website URL crawling method based on Scapy - Google Patents

Website URL crawling method based on Scapy

Info

Publication number
CN111125485A
Authority
CN
China
Prior art keywords
url
crawling
database
website
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911323361.9A
Other languages
Chinese (zh)
Inventor
何建锋
袁莺
马昱阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jiaotong University Jump Network Technology Co ltd
Original Assignee
Xi'an Jiaotong University Jump Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jiaotong University Jump Network Technology Co ltd filed Critical Xi'an Jiaotong University Jump Network Technology Co ltd
Priority to CN201911323361.9A
Publication of CN111125485A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a website URL crawling method based on Scrapy, which comprises the following steps: reading a target URL from a first database, requesting the target URL and downloading its webpage; extracting all URLs from the webpage and de-duplicating them with a bloom filter; storing the de-duplicated URLs in both the first database and a second database; and reading URLs with the same domain name from the first database, de-duplicating and storing them, constructing HTTP requests for the de-duplicated URLs, and crawling again. The method reduces the resources taken up by inefficient crawling and helps improve overall efficiency.

Description

Website URL crawling method based on Scrapy
Technical Field
The invention belongs to the technical field of computers and networks, and particularly relates to a website URL crawling method based on Scrapy.
Background
With the evolution of information technology and the growing use of networked information systems, the information technology revolution represented by the internet has reshaped many areas of social life and production. Employees who engage in work-unrelated online entertainment during working hours reduce productivity; according to statistics, nearly 37% of enterprise employees' daily internet activity is spent on online chatting, browsing news and entertainment, following stock quotes, visiting pornographic websites, or handling personal matters.
An information auditing system uses various technical means to monitor network behavior and communication content in a network environment in real time, so that they can be collected, recorded, analyzed, alerted on, and processed in a centralized manner. One of its core tasks is to analyze large amounts of website data, quickly classify and identify URLs, and characterize webpage-browsing behavior for auditing purposes. How to crawl website URLs efficiently is therefore a key technical problem.
Disclosure of Invention
In view of this, a website URL crawling method based on Scrapy is provided. It builds on and improves the Python-based Scrapy framework to crawl website URLs quickly while occupying less memory.
The website URL crawling method based on Scrapy comprises the following steps:
S1, reading a target URL from a first database, requesting the target URL and downloading its webpage;
S2, extracting all URLs from the webpage, and performing bloom-filter de-duplication;
S3, storing the de-duplicated URLs in both the first database and a second database;
and S4, reading URLs with the same domain name from the first database, de-duplicating and storing them, constructing HTTP requests for the de-duplicated URLs, crawling again, and returning to step S1.
In S4, when the condition for stopping crawling of a URL is triggered, a new target URL is read from the first database and crawling starts again.
Preferably, the condition for stopping crawling of a URL includes: when the DNS query for the URL times out and/or the webpage download times out, stopping crawling of the current URL and reading the next target URL; and requiring that at least M new URLs be generated within the most recent time window of length N, so that when fewer than M new URLs are added from the URL's webpages within the latest window of length N, crawling stops and the next target URL is read.
Further, in step S4, at least two crawling processes are set; each crawling process crawls webpages under one domain name and maintains its own queue of URLs to be crawled.
When the crawling stop condition of a URL is triggered, the URL queue of the current crawling process is emptied.
Preferably, URLs are extracted from the webpage using XPath; the first database is a Redis database in which the URLs to be crawled are stored as a distributed message queue; the second database is a MySQL database that stores the crawled URLs awaiting analysis; and the de-duplication of URLs with the same domain name in S4 is implemented with a Python set container.
The URL crawling method further includes extracting the website icon (favicon):
S51, reading a URL to be analyzed from the second database and downloading its webpage;
S52, extracting the website title and all links from the webpage, and filtering the links by keywords to obtain picture links and the website icon link;
S53, sending HTTP requests after de-duplicating the links, and judging whether the website icon file is downloaded successfully:
if the download succeeds, cropping the picture file according to the configuration requirements and saving it;
if the download fails, saving another successfully downloaded picture as the website icon, and if none of the webpage's pictures is downloaded successfully, drawing the first Chinese character of the website title and saving it as the website icon;
and using the MD5 code of the website icon picture as its file name and writing it into the database.
The website icon link refers to the link formed by the website domain name plus '/favicon.ico'.
Preferably, the URLs in the second database are classified according to a keyword dictionary; the keyword dictionary comprises a mapping table between URL types and keywords appearing in URLs.
According to the above technical scheme, the crawler is built on an improved Scrapy framework and is realized quickly, and the content or pictures of specified websites are captured and stored in a database for analysis. Redis operations are atomic and Redis is an in-memory database, so using it as a distributed message queue effectively improves crawling efficiency. The bloom filter used for URL de-duplication is fast and has a low memory footprint. Moreover, necessary stop conditions are set for the crawling process, reducing the resources consumed by inefficient crawling and improving overall efficiency. The icons of crawled websites are extracted or drawn so that websites can be located quickly during auditing. The crawled URLs are classified according to a keyword dictionary, which enables fast recognition of website browsing, improves URL matching efficiency, and supports URL auditing.
Drawings
FIG. 1 is a schematic diagram of the Scrapy framework according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a crawling workflow according to an embodiment of the website URL crawling method of the present invention;
FIG. 3 is a schematic view of the website icon (favicon) extraction process according to an embodiment of the website URL crawling method of the present invention.
Detailed Description
A web crawler is a program that automatically fetches webpages. Starting from the URLs of one or more initial webpages, it acquires the URLs on those pages and, while fetching pages, continuously extracts new URLs from the current page and puts them into a queue. The workflow of a focused crawler is more complex: links irrelevant to the topic are filtered out according to a webpage analysis algorithm, and useful links are kept and placed into the queue of URLs to be fetched. The crawler then selects the next webpage URL from the queue according to a search strategy and repeats the process.
Scrapy is an application framework written for crawling website data and extracting structured data; it is widely used in programs for data mining, information processing, and historical data storage.
As shown in FIG. 1, the framework structure of Scrapy includes:
the Scrapy Engine, which is responsible for communication, signal transmission, data transfer, and the like among the Spider, Item Pipeline, Downloader, and Scheduler;
the Scheduler, which receives Requests sent by the engine, arranges and enqueues them in a certain order, and returns them to the engine when the engine needs them;
the Downloader, which downloads all Requests sent by the Scrapy Engine and returns the obtained Responses to the Scrapy Engine, which hands them to the Spider for processing;
the Spider, which processes all Responses, analyzes and extracts data from them to fill the Item fields, and submits the URLs that need to be followed to the engine so that they enter the Scheduler again;
the Item Pipeline, which processes the Items obtained from the Spider and performs post-processing (detailed analysis, filtering, storage, and the like);
the Downloader Middlewares, components through which the download functionality can be customized and extended;
the Spider Middlewares, functional components through which the communication between the engine and the Spiders can be customized and extended (such as Responses entering the Spider and Requests leaving the Spider).
With the Scrapy framework, a crawler that captures the content or pictures of a specified website can therefore be implemented simply.
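For illustration only (this sketch is not part of the patent; the spider name, start URL, and item format are assumed placeholders), a minimal Scrapy spider covering the basic loop of downloading a page, extracting links with XPath, and following them might look as follows:

    import scrapy

    class SiteUrlSpider(scrapy.Spider):
        """Minimal spider: download a page, extract every hyperlink with XPath,
        emit it as an item, and follow it for further crawling."""
        name = "site_url_spider"                  # placeholder spider name
        start_urls = ["https://example.com/"]     # placeholder target URL

        def parse(self, response):
            for href in response.xpath("//a/@href").getall():
                yield {"url": response.urljoin(href)}             # handed to the Item Pipeline
                yield response.follow(href, callback=self.parse)  # sent back to the Scheduler

Yielding a dictionary hands the extracted URL to the Item Pipeline, while yielding a Request sends it back through the engine to the Scheduler, mirroring the component roles listed above.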
As shown in FIG. 2, the website URL crawling method based on Scrapy includes:
Step 1, reading a target URL from the first database, requesting the target URL and downloading its webpage;
Step 2, extracting all URLs from the webpage and de-duplicating them with a bloom filter (see the bloom-filter sketch below);
Step 3, storing the de-duplicated URLs in both the first database and the second database;
and Step 4, reading URLs with the same domain name from the first database, de-duplicating and storing them, constructing HTTP requests for the de-duplicated URLs, crawling again, and returning to Step 1.
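As an illustrative sketch of the bloom de-duplication in Step 2 (not the patented implementation; the bit-array size, hash count, and example URLs are assumptions), a simple bloom filter can be built from salted MD5 digests:

    import hashlib

    class SimpleBloomFilter:
        """Toy bloom filter for URL de-duplication. A production crawler would size
        the bit array from the expected URL count and the acceptable false-positive
        rate, or use an existing bloom-filter library instead."""

        def __init__(self, size_bits=1 << 23, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, url):
            # Derive several bit positions from salted MD5 digests of the URL.
            for salt in range(self.num_hashes):
                digest = hashlib.md5(f"{salt}:{url}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, url):
            for pos in self._positions(url):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def seen(self, url):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

    # Usage in Step 2: only URLs the filter has not seen are stored and scheduled.
    bloom = SimpleBloomFilter()
    for url in ["https://example.com/a", "https://example.com/a"]:
        if not bloom.seen(url):
            bloom.add(url)   # new URL: remember it, then push it to both databases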
In Step 4, when the condition for stopping crawling of a URL is triggered, a new target URL is read from the first database and crawling starts again. The conditions for stopping crawling of a URL include: when the DNS query for the URL times out and/or the webpage download times out, crawling of the current URL stops and the next target URL is read; and, for large websites with few external links and many internal links, in order to avoid wasting too much time and network bandwidth, at least M new URLs must be generated within the most recent time window of length N; when fewer than M new URLs are added from the URL's webpages within the latest window of length N, crawling stops and the next target URL is read.
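The DNS and download timeouts would normally be enforced by the downloader itself (for example through Scrapy's DNS_TIMEOUT and DOWNLOAD_TIMEOUT settings), while the "fewer than M new URLs within the latest window of length N" condition can be tracked with a small helper such as the following sketch (the class name and the default values of M and N are assumptions for illustration):

    import time
    from collections import deque

    class CrawlStopMonitor:
        """Stop crawling the current target when fewer than min_new_urls (M) new
        URLs were discovered within the most recent window_seconds (N) seconds."""

        def __init__(self, min_new_urls=10, window_seconds=60):
            self.min_new_urls = min_new_urls
            self.window_seconds = window_seconds
            self.discovery_times = deque()

        def record_new_url(self):
            self.discovery_times.append(time.time())

        def should_stop(self):
            cutoff = time.time() - self.window_seconds
            while self.discovery_times and self.discovery_times[0] < cutoff:
                self.discovery_times.popleft()
            return len(self.discovery_times) < self.min_new_urls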
Preferably, in Step 4, at least two crawling processes are set; each crawling process crawls webpages under one domain name and maintains its own queue of URLs to be crawled. When the crawling stop condition of a URL is triggered, the URL queue of the current crawling process is emptied.
Preferably, URLs are extracted from the webpage using XPath. XPath is a language for locating information in XML documents; it can traverse the elements and attributes of an XML document, and it is also quite efficient when applied to HTML webpage source code, traversing each HTML tag and attribute, locating the required information, and extracting it.
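A minimal sketch of such XPath-based link extraction, assuming the third-party lxml library (Scrapy's own selectors expose an equivalent xpath() interface):

    from lxml import html

    def extract_urls(page_source, page_url):
        """Parse the downloaded HTML and return every hyperlink target as an absolute URL."""
        tree = html.fromstring(page_source)
        tree.make_links_absolute(page_url)   # resolve relative links against the page URL
        return tree.xpath("//a/@href")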
The first database is a Redis database in which the URLs to be crawled are stored as a distributed message queue; Redis operations are atomic and Redis is an in-memory database, so it can be used as a distributed message queue.
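The following sketch shows how a Redis list could serve as the shared queue of URLs to be crawled, assuming the redis-py client and a hypothetical key name urls_to_crawl:

    import redis  # redis-py client assumed

    r = redis.Redis(host="localhost", port=6379, db=0)
    TO_CRAWL_KEY = "urls_to_crawl"   # hypothetical queue key

    def push_urls(urls):
        # Producer side: add newly discovered, de-duplicated URLs to the queue.
        if urls:
            r.lpush(TO_CRAWL_KEY, *urls)

    def pop_url(timeout=5):
        # Consumer side: block until a URL is available, then return it.
        item = r.brpop(TO_CRAWL_KEY, timeout=timeout)
        return item[1].decode() if item else None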
The second database is a MySQL database that stores the crawled URLs awaiting analysis.
and (4) the duplicate removal of the URL with the same domain name in the step 4 is realized by using a Set container of Python.
As shown in FIG. 3, the URL crawling method further includes extracting the website icon. It should be noted that the website icon (favicon, short for Favorites Icon) can be displayed on a browser tab, to the left of the address bar, and in the favorites list; it is a small logo expressing the character of a website and is sometimes called the website avatar. It lets the browser's favorites show a corresponding title and distinguishes different websites by icon, making websites easy to identify at a glance. The favicon of most webpages is stored in a favicon.ico file found in the website directory.
The process of extracting the website icon (favicon) comprises the following steps:
Step 1, reading a URL to be analyzed from the second database (i.e. the MySQL database) and downloading its webpage;
Step 2, extracting the website title and all links from the webpage, and filtering the links by keywords to obtain picture links and the website icon link;
Step 3, sending HTTP requests after de-duplicating the links, and judging whether the website icon file is downloaded successfully (see the sketch after this list):
if the download succeeds, the picture file is cropped according to the configuration requirements and then saved;
if the download fails, another successfully downloaded picture is saved as the website icon, and if none of the webpage's pictures is downloaded successfully, the first Chinese character of the website title is drawn and saved as the website icon;
the MD5 code of the website icon picture is used as its file name and written into the database.
The website icon link refers to the link formed by the website domain name plus '/favicon.ico'.
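An illustrative sketch of the favicon download and MD5-based file naming described above, assuming the third-party requests library (the function name and fallback handling are assumptions; drawing the first character of the title is left to the caller):

    import hashlib
    from urllib.parse import urljoin
    import requests

    def fetch_site_icon(site_url, fallback_image_urls=(), timeout=5):
        """Try the conventional /favicon.ico location first, then fall back to other
        picture links extracted from the page; the saved file is named after its MD5."""
        candidates = [urljoin(site_url, "/favicon.ico"), *fallback_image_urls]
        for url in candidates:
            try:
                resp = requests.get(url, timeout=timeout)
            except requests.RequestException:
                continue
            if resp.status_code == 200 and resp.content:
                name = hashlib.md5(resp.content).hexdigest()
                with open(name + ".ico", "wb") as fh:   # MD5 digest becomes the file name
                    fh.write(resp.content)
                return name
        return None   # caller then draws the first character of the site title instead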
Preferably, the URLs in the second database are classified according to a keyword dictionary. The keyword dictionary includes a mapping table between URL types and keywords appearing in URLs; for example, the URL keywords of a technology website include tech, it, csdn, and so on. When a user browses a webpage, the information auditing system matches its URL against the keyword dictionary to decide whether access to the webpage should be allowed.
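A minimal sketch of this keyword-dictionary classification; the dictionary below only extends the tech/it/csdn example from the text, and the other entries are assumptions:

    # Hypothetical keyword dictionary: URL category -> keywords appearing in URLs.
    URL_KEYWORD_DICT = {
        "technology": ["tech", "it", "csdn"],
        "news": ["news", "daily"],
    }

    def classify_url(url):
        """Return the first category whose keywords occur in the URL, or None."""
        lowered = url.lower()
        for category, keywords in URL_KEYWORD_DICT.items():
            if any(keyword in lowered for keyword in keywords):
                return category
        return None

    # e.g. classify_url("https://www.csdn.net/article") -> "technology"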
With this technical scheme, a large number of URLs can be crawled quickly and efficiently for analysis, improving the efficiency and accuracy of URL filtering and auditing in the information auditing system.

Claims (10)

1. The website URL crawling method based on Scrapy is characterized by comprising the following steps:
s1, reading a target URL from a first database, and requesting and downloading a webpage of the target URL;
s2, extracting all URLs from the webpage, and performing bloom deduplication;
s3, simultaneously storing the duplicate-removed URL into a first database and a second database;
and S4, reading the URL with the same domain name from the first database, removing the duplicate, storing, constructing an http request for the URL after the duplicate removal, crawling again, and returning to the step S1.
2. The URL crawling method according to claim 1, wherein in S4, when the URL crawling stop condition is triggered, a new target URL is read from the first database and crawling starts again.
3. The URL crawling method according to claim 2, wherein the condition for stopping crawling of the URL comprises: when the DNS query for the URL times out and/or the webpage download times out, stopping crawling of the current URL and reading the next target URL.
4. The URL crawling method according to claim 2, wherein the condition for stopping crawling of the URL comprises: at least M new URLs being generated within the most recent time window of length N; when fewer than M new URLs are added from the URL's webpages within the latest window of length N, stopping crawling and reading the next target URL.
5. The URL crawling method according to claim 1, wherein in S4, at least two crawling processes are set, each crawling process crawls webpages under one domain name, and each crawling process maintains a corresponding queue of URLs to be crawled.
6. The URL crawling method according to any one of claims 2 to 5, wherein when the crawling stop condition of the URL is triggered, the URL queue of the current crawling process is emptied.
7. The URL crawling method as claimed in claim 1, wherein the URLs in the webpage are extracted using XPath; the first database is a Redis database in which the URLs to be crawled are stored as a distributed message queue; the second database is a MySQL database that stores the crawled URLs awaiting analysis; and the de-duplication of URLs with the same domain name in S4 is implemented with a Python set container.
8. The URL crawling method according to claim 1, further comprising extracting the website icon (favicon):
S51, reading a URL to be analyzed from the second database and downloading its webpage;
S52, extracting the website title and all links from the webpage, and filtering the links by keywords to obtain picture links and the website icon link;
S53, sending HTTP requests after de-duplicating the links, and judging whether the website icon file is downloaded successfully:
if the download succeeds, cropping the picture file according to the configuration requirements and saving it;
if the download fails, saving another successfully downloaded picture as the website icon, and if none of the webpage's pictures is downloaded successfully, drawing the first Chinese character of the website title and saving it as the website icon;
and using the MD5 code of the website icon picture as its file name and writing it into the database.
9. The URL crawling method as claimed in claim 8, wherein the website icon link is the link formed by the website domain name plus '/favicon.ico'.
10. The URL crawling method according to claim 1 or 8, wherein the URLs in the second database are classified according to a keyword dictionary; the keyword dictionary comprises a mapping table between URL types and keywords appearing in URLs.
CN201911323361.9A 2019-12-20 2019-12-20 Website URL crawling method based on Scapy Pending CN111125485A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911323361.9A CN111125485A (en) 2019-12-20 2019-12-20 Website URL crawling method based on Scapy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911323361.9A CN111125485A (en) 2019-12-20 2019-12-20 Website URL crawling method based on Scapy

Publications (1)

Publication Number Publication Date
CN111125485A true CN111125485A (en) 2020-05-08

Family

ID=70500953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911323361.9A Pending CN111125485A (en) 2019-12-20 2019-12-20 Website URL crawling method based on Scapy

Country Status (1)

Country Link
CN (1) CN111125485A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112422707A (en) * 2020-10-22 2021-02-26 北京安博通科技股份有限公司 Domain name data mining method and device and Redis server
CN112579862A (en) * 2020-12-22 2021-03-30 福建江夏学院 Xpath automatic extraction method based on MD5 value comparison
CN112579862B (en) * 2020-12-22 2022-06-14 福建江夏学院 Xpath automatic extraction method based on MD5 value comparison
CN113220703A (en) * 2021-05-31 2021-08-06 普瑞纯证医疗科技(广州)有限公司 Method, server and system for updating medical data based on big data platform

Similar Documents

Publication Publication Date Title
US9614862B2 (en) System and method for webpage analysis
US9390176B2 (en) System and method for recursively traversing the internet and other sources to identify, gather, curate, adjudicate, and qualify business identity and related data
CN109902220B (en) Webpage information acquisition method, device and computer readable storage medium
US8601120B2 (en) Update notification method and system
CN102200980B (en) Method and system for providing network resources
US6910071B2 (en) Surveillance monitoring and automated reporting method for detecting data changes
CN108737423B (en) Phishing website discovery method and system based on webpage key content similarity analysis
CN102930059B (en) Method for designing focused crawler
US20150134913A1 (en) Method and apparatus for cleaning files in a mobile terminal and associated mobile terminal
US20100115003A1 (en) Methods For Merging Text Snippets For Context Classification
CN108021598B (en) Page extraction template matching method and device and server
CN111125485A (en) Website URL crawling method based on Scapy
US20150341771A1 (en) Hotspot aggregation method and device
CN111008348A (en) Anti-crawler method, terminal, server and computer readable storage medium
US11443006B2 (en) Intelligent browser bookmark management
CN109600385B (en) Access control method and device
WO2021068681A1 (en) Tag analysis method and device, and computer readable storage medium
CN104391978A (en) Method and device for storing and processing web pages of browsers
CN111444408A (en) Network search processing method and device and electronic equipment
CN114528457A (en) Web fingerprint detection method and related equipment
CN111460255A (en) Music work information data acquisition and storage method
CN103605742A (en) Method and device for recognizing network resource entity content page
Hurst et al. Social streams blog crawler
JPWO2018056299A1 (en) INFORMATION COLLECTION SYSTEM, INFORMATION COLLECTION METHOD, AND PROGRAM
CN116166867A (en) Content filtering method, device, equipment and storage medium for network acquisition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination