CN111881337B - Data acquisition method and system based on Scapy framework and storage medium - Google Patents

Data acquisition method and system based on Scrapy framework and storage medium

Info

Publication number: CN111881337B
Application number: CN202010784262.7A
Authority: CN (China)
Prior art keywords: request, data, engine, cookie, response
Legal status: Active (granted)
Inventors: 岳希, 梁云浩, 唐聃, 高燕, 蔡红亮, 张海清
Assignee: Chengdu University of Information Technology
Other versions: CN111881337A (application publication, Chinese)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The invention discloses a data acquisition method, system and storage medium based on the Scrapy framework. For each page, two requests are sent in sequence: the first obtains an updated cookie, and the second carries that updated cookie to obtain the data. The paired requests are combined with a request delay and sent in order, each request using the updated cookie returned by the previous request and extracting the newly returned updated cookie afterwards. This overcomes both the IP limitation and the dynamic-cookie limitation, solves the prior-art problem that dynamic web-page data are difficult to crawl, and achieves the aims of defeating the anti-crawling mechanism of dynamic web pages and obtaining the required data efficiently and quickly.

Description

Data acquisition method and system based on Scrapy framework and storage medium
Technical Field
The invention relates to the field of data acquisition, and in particular to a data acquisition method, system and storage medium based on the Scrapy framework.
Background
With the rapid development of information and network technology, network information is also growing rapidly, and effectively collecting and extracting this information is a serious challenge; search engines were created to solve this problem. A search engine is a system that provides a retrieval service to users: it collects information on the Internet with specific computer programs according to certain strategies, organizes and processes the information, and displays the processed information to users. A search engine's collection of network information depends on web crawlers crawling website information, a web crawler being a program that automatically extracts network information.
The Scrapy crawler framework is an asynchronous processing framework based on Twisted, implemented in pure Python and used to crawl website data and extract structured data. It is faster and more efficient at crawling data than other crawlers; its modules are loosely coupled, it is highly extensible, and it can flexibly meet various requirements, making it highly customizable.
Data analysis and data mining require massive amounts of data, which must be crawled from the corresponding websites. With the development of big-data technology and the growth in visitors to major websites, the sites are continuously upgraded and expanded, the amount of information on each site keeps increasing, and massive information is stored in the databases of website servers. At the same time, websites continuously improve their anti-crawling mechanisms to prevent data from being harvested, to ensure security, to block malicious attacks by crawlers, and to keep their servers from being paralyzed. The key questions in data crawling are therefore how to obtain the required data efficiently and quickly and how to defeat the anti-crawling mechanisms.
IP restriction is a common anti-crawling mechanism. On some websites, when the number of accesses per unit time from the same IP exceeds a certain threshold, the server blocks the IP, returns an HTTP Error 403: Forbidden, and returns erroneous data. The prior art offers mainly two solutions. (1) Reduce the request frequency by pausing briefly, typically 3-5 ms, before each request. This is the safest and most reliable method, but because every request is delayed, crawler performance suffers somewhat; the shorter the delay, the smaller the impact, but if the delay is too short, for example 0.05 ms, the server still detects and limits the IP. (2) Access with different IPs, i.e. use a different proxy IP for each visit so that the requests appear to come from different computers and no single IP is blocked; however, proxy IPs must be replaced frequently and are not easy to obtain.
In addition, some dynamic web pages acquire data from the server through Ajax requests. Prior-art countermeasures, such as setting a request header or using an IP proxy, are still detected by such websites during crawling, which respond with messages like "you operate too frequently, please visit later".
Disclosure of Invention
The invention provides a data acquisition method, system and storage medium based on the Scrapy framework, which aim to solve the prior-art problem that dynamic web-page data are difficult to crawl, defeat the anti-crawling mechanism of dynamic web pages, and acquire the required data efficiently and quickly.
The invention is realized by the following technical scheme:
A data acquisition method based on the Scrapy framework comprises the following steps:
S1, the crawler module sends the requests for the website detail pages to be crawled to the engine;
S2, the engine sends all the requests to the scheduler, and the scheduler presses them into a queue;
S3, the engine acquires a request from the scheduler and sends it to the downloader through the downloader middleware;
S4, after receiving the request, the downloader sends it to the server, downloads the source code of the corresponding web page, generates the response for that page, and returns the response to the engine through the downloader middleware; the request sent to the server carries updated cookie data, and the downloaded page source code is included in the response;
S5, the engine receives the response from the downloader and sends it to the crawler module through the crawler middleware;
S6, the crawler module processes and analyzes the response, extracts the required information from it, extracts the next request from the queue, packages the required information into entity data, and sends the entity data together with the next request to the engine;
S7, the engine sends the returned entity data to the pipeline, which stores it in the database, and sends the next request to the scheduler; the scheduler sends the next request to the engine, and the engine sends it to the downloader through the downloader middleware;
S8, repeat S4-S7 until no requests remain in the queue.
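The S1-S8 request/response cycle can be sketched as a framework-free simulation in Python. All names here are illustrative stand-ins rather than Scrapy APIs: the scheduler is a plain FIFO queue, fetch plays the downloader plus server, and parse plays the crawler module's analysis rules.

```python
from collections import deque

def crawl(start_urls, fetch, parse):
    """Simplified sketch of the S1-S8 loop: a FIFO scheduler feeds a
    downloader (fetch) whose responses are parsed into entity data."""
    scheduler = deque(start_urls)               # S2: requests pressed into a queue
    database = []                               # stands in for the pipeline's store
    while scheduler:                            # S8: repeat until the queue is empty
        request = scheduler.popleft()           # S3: engine takes the next request
        response = fetch(request)               # S4: downloader contacts the server
        items, next_requests = parse(response)  # S6: crawler module parses the response
        database.extend(items)                  # S7: pipeline stores the entity data
        scheduler.extend(next_requests)         # S7: follow-up requests are scheduled
    return database

# Toy stand-ins for the downloader and the crawler module's parse rules.
pages = {"/page1": (["item-a"], ["/page2"]), "/page2": (["item-b"], [])}
fetch = lambda url: pages[url]
parse = lambda resp: resp

print(crawl(["/page1"], fetch, parse))  # ['item-a', 'item-b']
```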
The method uses the Scrapy framework to crawl massive data quickly. The traditional approach, which only uses the requests library, traverses all the URLs one by one and, after obtaining each page's source code, traverses that source code again to extract the useful information. Compared with this, the method overcomes the low efficiency of traditional crawlers when handling frequent I/O requests and avoids the difficulty of independently implementing asynchronous and multithreaded requests: the framework's components cooperate with one another, each performing different work, and support asynchronous processing, which maximizes the utilization of network bandwidth and greatly improves the efficiency of crawling and processing data, for example when collecting data from a recruitment website.
In the prior art, crawling dynamic web-page data easily triggers messages such as "you operate too frequently, please visit later". The inertial thinking of those skilled in the art attributes this to crawlers visiting websites too frequently, so solutions are naturally sought in increasing the interval between adjacent requests. This traditional approach, however, not only obviously reduces crawler efficiency but is also largely ineffective against the anti-crawling mechanism of dynamic web pages. After extensive research, the inventors of the present application found that the problem is not caused by overly frequent visits but by the website limiting cookies: the cookie changes on every visit, i.e. such websites adopt a dynamic-cookie mechanism, so reusing one cookie is always detected and crawling cannot continue. The application therefore encapsulates updated cookie data in each request the downloader sends to the server. By updating the cookie data in every request, it overcomes the dynamic-cookie anti-crawling mechanism of the prior art, prevents the IP from being blocked, and thus obtains the required data efficiently and quickly.
Further, in step S4, for each request the downloader contacts the server twice: the first send obtains the cookie parameter returned by the server, the second send carries that returned cookie parameter as the updated cookie data, and the response to the second send is the one from which the data is obtained. In this scheme each request is sent twice. No data can be obtained from the first send, but a cookie can: the server returns a cookie parameter that updates the cookie. The updated cookie obtained from that parameter is packaged into the request for the second send, after which the required data can be obtained.
Further, in step S4, if the downloader receives the first request for a website detail page, it sets the cookie parameter to the cookie of the logged-in account obtained through the browser's developer mode, sends the request to the server, and, after obtaining the server's response, processes the response to obtain updated cookie data;
if the downloader receives the second or a subsequent request, it sets the cookie parameter to the cookie obtained in the previous request, uses it as the updated cookie data, and sends the request to the server; after obtaining the server's response, it processes the response to obtain updated cookie data for the next request.
This scheme specifically handles the first request for a website detail page, i.e. the first crawl of a website: the request is sent to the server with the cookie of the logged-in account obtained through the browser's developer mode, the server returns a response, and the response is processed to obtain updated cookie data for use. For each URL thereafter (from the second request onward), the scheme sends only one request, setting the cookie parameter to the cookie obtained in the previous request as the updated cookie data; after the server's response is obtained, it is processed to obtain updated cookie data for the next request.
The scheme thus defeats the anti-crawling mechanism of dynamic web pages: each request carries the updated cookie data obtained by processing the previous request, and after each request the newly returned updated cookie data is extracted from the response for the next request. The interval between two requests is the time spent obtaining the updated cookie data, which can serve as the request-delay time. The scheme therefore solves the dynamic-cookie limitation and simultaneously overcomes the common IP-blocking mechanism, so the prior-art need to set an explicit request delay against the IP restriction is resolved at the same time; compared with the prior art the scheme has prominent substantive features and represents notable progress. In other words, it simultaneously defeats two anti-crawling measures: IP limitation and dynamic web pages.
Further, the response is processed to obtain the updated cookie data as follows: after the server returns a response, the updated cookie is read from the response using the requests library of the Python programming language, via the cookies attribute of the response object. For the second and subsequent requests, the request is sent with the requests library's get method, its cookie parameter set to the updated cookie obtained from the previous request; after the server returns the response, the updated cookie read from the response's cookies attribute is provided to the next request.
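A minimal sketch of this double-send scheme follows, with the server faked by an injected send callable; in a real crawler the send would be an HTTP call through Python's requests library, reading the refreshed cookie from the response's cookies attribute. The names fetch_with_cookie_refresh and fake_send, and the dict-based response shape, are assumptions for illustration.

```python
def fetch_with_cookie_refresh(send, url, cookie):
    """Sketch of the double-send scheme: the first send yields no usable
    data but refreshes the cookie; the second send carries the refreshed
    cookie, obtains the data, and returns the cookie for the next request."""
    first = send(url, cookie)           # first send: only the cookie matters
    cookie = first["set_cookie"]        # extract the server's updated cookie
    second = send(url, cookie)          # second send carries the updated cookie
    return second["data"], second["set_cookie"]

# Fake server with a dynamic-cookie mechanism: the cookie changes on every
# visit, and only a request carrying the latest cookie receives real data.
state = {"cookie": "c1", "n": 0}
def fake_send(url, cookie):
    ok = cookie == state["cookie"]
    state["n"] += 1
    state["cookie"] = "c%d" % (state["n"] + 1)
    return {"data": "payload" if ok else None, "set_cookie": state["cookie"]}

data, next_cookie = fetch_with_cookie_refresh(fake_send, "/detail", "stale")
# data == "payload"; next_cookie would seed the next request
```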
Further, the request in step S1 includes Token information, comprising the Token value of the logged-in user.
Some websites only supply data after detecting that the user has logged in: visitors must register as members, and when a registered member logs in, a cookie is set in the browser and compared on each visit. Users who are not logged in cannot view the data, and a plain request obtains nothing. The prior-art solution still cracks this with cookies: the corresponding cookie is first acquired in the browser's developer mode and then set on the request when it is sent.
A Token, i.e. the token of computer identity authentication, is a character string generated by the server as a credential for the client's requests. After the first login the server generates a Token and returns it to the client; for subsequent data requests the client only needs to carry the Token, without sending the user name and password again.
Compared with prior-art cookie cracking, Token cracking has the following advantages. (1) When crawling data of mobile applications, cookies are not supported and must be handled with a cookie container, a complex and computation-heavy process; cracking with the Token is simple, significantly reducing the amount of computation and improving cracking efficiency. (2) Using the Token gives better crawler performance. With cookie cracking, after a user logs in the server stores a session containing the user ID, user name and similar information; the session's ID value is stored in a cookie on the client (i.e. the browser). On each new request the server queries the database to compare the session ID, and if verification passes, the user is judged to be logged in; that is, one database comparison is performed per request. When this scheme cracks with the Token, the server receives the request and compares the Token carried by the request with the stored Token; if verification succeeds, the user is judged to be logged in, with no database query.
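The difference between the two verification schemes can be illustrated with a small hypothetical sketch. SESSION_DB, the function names and the header layout are assumptions for illustration; only the user_trace_token key comes from the embodiment described later.

```python
# Server-side comparison of the two verification schemes (hypothetical
# sketch; SESSION_DB and all names are invented for illustration).
SESSION_DB = {"sid-42": {"user_id": 7, "user_name": "alice"}}

def verify_by_cookie(session_id):
    # cookie scheme: every request triggers a session lookup in the database
    return session_id in SESSION_DB

def verify_by_token(request_token, stored_token):
    # token scheme: a plain string comparison, no database query
    return request_token == stored_token

# Crawler-side: carry the logged-in user's token in the request header,
# using the user_trace_token key found in the browser's developer mode.
def build_headers(token):
    return {"User-Agent": "Mozilla/5.0", "user_trace_token": token}

headers = build_headers("tok-abc")
```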
Further, in step S4, each time a request is sent, the next request is delayed, or the request is sent through a proxy IP. On some websites, when the number of accesses per unit time from the same IP exceeds a certain threshold, the server blocks the IP, returns an HTTP Error 403: Forbidden, and returns erroneous data. Delaying requests or proxying the IP further improves the method's resilience to the IP-restriction anti-crawling mechanism.
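A sketch of these two mitigations, assuming a caller-supplied send function; the proxy addresses are placeholders and the delay bounds are illustrative defaults, not values fixed by the method.

```python
import itertools
import random
import time

PROXIES = ["203.0.113.1:8080", "203.0.113.2:8080"]  # placeholder proxy IPs
proxy_pool = itertools.cycle(PROXIES)               # round-robin rotation

def throttled_request(url, send, min_delay=3.0, max_delay=5.0, use_proxy=False):
    """Avoid tripping the per-IP access threshold: either route the request
    through the next proxy IP, or pause before sending it."""
    if use_proxy:
        return send(url, proxy=next(proxy_pool))      # different exit IP each time
    time.sleep(random.uniform(min_delay, max_delay))  # delay before the request
    return send(url, proxy=None)
```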
Further, the method also comprises the step of decrypting the entity data stored in the database, and the decryption method comprises the following steps:
s901, viewing webpage source codes in a browser developer mode, and finding out corresponding codes for generating encrypted fonts;
s902, decoding the encrypted font to obtain a decoded font file;
S903, constructing the dictionary of characters mapped by Unicode in the font file: obtain the glyph shapes of all fonts in the font file and map them one by one into a Unicode list using a Unicode decoding method;
s904, creating a mapping relation, storing the corresponding relation between the encrypted data and the actual data, traversing the crawled data one by one, and mapping the crawled characters into corresponding actual values;
s905, the meaning of each encryption code mapped and represented is obtained, and the entity data stored in the database is converted.
Some important information on a website is not provided directly in the page source code; for example, data refreshed in real time may appear in the page source as a string of irregular letters, numbers and special symbols, i.e. the crawled data has been encrypted with a custom font. To overcome this, the meaning mapped by each encrypted code can be obtained through steps S901-S905, so the data can be converted after collection and the information-encryption anti-crawler mechanism is cracked.
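The final mapping step (S904-S905) can be sketched as follows, assuming the glyph-to-character table has already been recovered from the font file in S901-S903; the private-use code points and digit values in GLYPH_MAP are invented for illustration.

```python
# Hypothetical glyph table: in practice it is recovered from the site's
# font file by walking the glyphs (steps S901-S904); here the private-use
# code points and digit values are invented for illustration.
GLYPH_MAP = {"\ue001": "3", "\ue002": "1", "\ue003": "7"}

def decode_field(encrypted_text, glyph_map):
    """S904-S905: map each crawled character to its actual value;
    characters outside the mapping are kept unchanged."""
    return "".join(glyph_map.get(ch, ch) for ch in encrypted_text)

salary = decode_field("\ue002\ue001k-\ue002\ue003k", GLYPH_MAP)
# salary == "13k-17k"
```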
Further, the method also comprises request preprocessing for cracking signature verification, and the method for the request preprocessing comprises the following steps:
s001, viewing the webpage source codes in a browser developer mode, finding out parameters which change every time in the data parameters of the request form, and defining the parameters as encryption parameters;
s002, capturing all webpage source codes, finding out the method for generating the encryption parameters in the webpage source codes, and defining the method as an encryption method;
and S003, before the request is sent, generating a new encryption parameter according to the obtained encryption method, and carrying the new encryption parameter in each request.
Some websites with strong anti-crawler protection also use signature verification to prevent crawler programs from obtaining their data, in particular dynamic data returned by the back-end interface of a submitted form or of an input box responding to the entered content. A signature is a computation or encryption over the data source that yields a unique and consistent character string. The signature result becomes a condition for verifying the data source and data integrity, effectively preventing the server from treating forged or tampered data as normal data; existing measures such as setting a User-Agent or carrying a cookie in the request header are therefore ineffective. By preprocessing the request, the present application defeats the signature-verification anti-crawler mechanism and solves the prior art's inability to crack signature verification.
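The preprocessing of S001-S003 can be sketched under an assumed signing scheme: an MD5 hash over the sorted form fields, a timestamp and a secret embedded in the page source. In a real crawl, the site's actual encryption method recovered in S002 would replace make_signature; all names and the scheme itself are assumptions for illustration.

```python
import hashlib
import time

def make_signature(form, secret, now=None):
    """Assumed signing scheme: hash the sorted form fields together with a
    timestamp and a secret taken from the page source (the 'encryption
    method' recovered in S002 would replace this in a real crawl)."""
    now = int(time.time()) if now is None else now
    payload = "&".join("%s=%s" % (k, form[k]) for k in sorted(form))
    payload += "&t=%d&s=%s" % (now, secret)
    return hashlib.md5(payload.encode()).hexdigest(), now

def prepare_request(form, secret, now=None):
    """S003: regenerate the ever-changing encryption parameter before
    every send and carry it in the request."""
    sign, ts = make_signature(form, secret, now)
    return {**form, "timestamp": ts, "sign": sign}

req = prepare_request({"kw": "python", "page": 1}, secret="demo-key", now=1600000000)
```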
A data acquisition system based on the Scrapy framework comprises:
a crawler module: the crawling logic and the page parsing rules are defined here; it parses responses, extracts data from them to obtain the required data, and submits the requests to be followed to the engine so that they enter the scheduler again;
an engine: handles the data-stream processing of the entire system and obtains the updated cookie data;
a downloader: downloads the requests sent by the engine and delivers the obtained content back to the engine as responses;
a pipeline: processes the entity data acquired from the web pages by the crawler module;
downloader middleware: middleware located between the engine and the downloader, mainly processing the requests and responses between them;
crawler middleware: middleware located between the engine and the crawler module, processing the crawler module's response input and request output.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the Scrapy-framework-based data acquisition method described above.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The data acquisition method, system and storage medium based on the Scrapy framework send two requests in sequence, the first obtaining an updated cookie and the second carrying the updated cookie to obtain the data, thereby overcoming the website's dynamic-cookie limitation;
2. they combine a request delay with the paired requests, each request using the updated cookie returned by the previous one and extracting the newly returned updated cookie for the next request after sending, thereby solving the IP limitation and the dynamic-cookie limitation simultaneously;
3. they use the Token to crack the website's detection of user login, greatly improving performance over prior-art cracking methods;
4. they convert the collected data after acquisition, cracking the prior art's information-encryption anti-crawler mechanism;
5. they generate the encryption parameters with the recovered encryption method and carry them in each request before it is sent, cracking signature verification; compared with traditional crawler algorithms they break the website's anti-crawler mechanisms and greatly improve the crawler's usability.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is an architectural diagram of the Scrapy framework used in an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a collection method according to an embodiment of the present invention;
FIG. 3 is an architecture diagram for decrypting entity data in accordance with an embodiment of the present invention;
FIG. 4 is an architecture diagram of the request for preprocessing according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1:
A data acquisition method based on the Scrapy framework comprises the following steps:
The Scrapy framework mainly comprises the following components:
crawler (Spiders): the crawling logic and the page parsing rules are defined here; it is mainly responsible for parsing responses, extracting data from them to fill the Item fields, submitting the requests to be followed to the engine, and sending them into the scheduler again;
engine (Engine): responsible for communication, signals and data transfer among the Spiders, ItemPipeline, downloader and scheduler; it handles the data-stream processing and event triggering of the whole system and is the core of the framework;
scheduler (Scheduler): responsible for receiving the requests sent by the engine, adding them to a queue in a certain order, and handing them back when the engine needs them;
downloader (Downloader): responsible for downloading all the requests sent by the engine and returning the acquired content to the engine as responses;
ItemPipeline (pipeline): mainly processes the Items acquired from the web pages by the Spider; its main tasks are cleaning, verifying and storing the data;
downloader middleware (DownloaderMiddlewares): middleware located between the engine and the downloader, mainly processing the requests and responses between them;
crawler middleware (SpiderMiddlewares): middleware between the Scrapy engine and the crawler, whose main work is to process the Spider's response input and request output;
scheduler middleware (SchedulerMiddlewares): middleware between the Scrapy engine and the scheduler, processing the requests and responses sent between them.
The website data acquisition method using the Scrapy crawler framework in this embodiment comprises the following steps:
S1, the Scrapy crawler module sends the requests for the URLs of the website detail pages to be crawled to the Scrapy engine; the token in the header of each crawl request is set to the token value of a logged-in user, defeating the anti-crawler detection of websites that show data only to logged-in users;
S2, the engine sends all the URLs to the scheduler, which receives the requests and presses them into a queue;
S3, the engine acquires the request for one URL from the scheduler and sends it to the downloader through the downloader middleware;
S4, after the downloader receives the request, a delay period is set after each access, or a proxy IP is used, to prevent the website's anti-crawling mechanism from blocking the IP. If the downloader receives the request for the first URL, i.e. the website is being crawled for the first time, the cookie parameter of the get method is set to the cookie of the logged-in account obtained through the browser's developer mode, the request is sent to the server, and after the server's response is obtained, the updated cookie is read from the response with the cookies attribute of Python's requests library. For the second and subsequent requests, the cookie parameter of the get method is set to the updated cookie obtained from the previous request; after the server returns the response, the updated cookie read with the requests library's cookies attribute is provided to the next request. The source code of the corresponding web page is downloaded after the request, and once the download finishes, the response is returned to the engine through the downloader middleware. Reading cookies with the requests library of the Python programming language is prior art.
S5, the engine receives response from the downloader and sends the response to the crawler through the crawler middleware;
s6, the crawler module processes and analyzes the response, extracts the required information from the response, extracts the next request from the queue, packages the required information into entity data, and sends the entity data and the next request to the engine together;
s7, the engine sends the returned entity data to the pipeline, the pipeline stores the entity data in the database and sends the next request to the scheduler, the scheduler sends the next request to the engine, and the engine sends the next request to the downloader through the downloading middleware;
S8, repeat S4-S7 until the scheduler has no requests left, completing the crawl of the recruitment website's information.
The solution to the anti-crawling mechanism of the present embodiment mainly includes the following two points:
1. Some dynamic web pages acquire data from the server through Ajax requests; even when a request header, an IP proxy, etc. are set, the crawler is still detected and receives the prompt "you operate too frequently, please visit later". The problem is not that access is too frequent but that the website limits the cookie, which changes on every visit. To acquire page data we therefore need to request twice: the first request cannot obtain data but does obtain a cookie, since the server returns a cookie parameter to update the cookie. After the first request, the updated cookie is taken from the server's response via the cookies attribute packaged in Python's requests library, and for the second request the cookie parameter of the get method is set to this updated cookie; the data can then be obtained after sending. At the same time, this is combined with the first method of lifting the IP restriction: after each request, the cookie of the next request is set to the updated cookie returned by the previous one, and the time spent returning the cookie between the two requests serves as the request delay, so IP blocking and the dynamic-cookie limitation are solved at the same time.
2. Websites that detect whether a user is logged in can be cracked by setting a token. A token is a character string generated by the server as a credential for client requests: after the user logs in for the first time, the server generates a token and returns it to the client, and subsequent data requests only need to carry the token rather than the username and password again. Concretely, open the browser developer mode, find the back-end interface of the page request under the Network tab, and locate the JSON key-value pair user_trace_token: XXX in the interface's request header; this is the token we need. Carrying it in the request header when sending a request breaks the website's restriction that data is only served to logged-in users. Compared with a cookie, cracking with a token has two benefits: (1) when crawling data from mobile applications, cookies are not supported and a cookie container would be needed, whereas cracking with a token is much simpler.
(2) A token also performs better for a crawler. With a cookie, after the user logs in the server stores a session containing the user ID, username and similar information, while the session ID is stored in the cookie on the client (i.e. the browser); on each new request the server queries the database to compare the session ID, and only if the check passes is the user judged to be logged in. With a token, the server receiving a request simply compares the token the request carries against the one issued to the client, and a successful comparison means the user is logged in. Compared with cookie-based identity cracking, the token requires no database query and comparison per network request, so performance improves considerably.
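A minimal sketch of carrying the token, assuming the header key `user_trace_token` named above; the helper names and token value are illustrative, and the comparison function only mirrors the string-equality check described (a real server would issue and store the token itself):

```python
def build_headers(token, user_agent="Mozilla/5.0"):
    # The token replaces username/password on every request after first login.
    return {
        "User-Agent": user_agent,
        "user_trace_token": token,
    }

def verify_request(headers, server_token):
    # Token verification is a straight string comparison -- no database
    # lookup, which is the performance point made above.
    return headers.get("user_trace_token") == server_token

headers = build_headers("XXX-token-from-devtools")
```

These headers would then be passed to, e.g., `requests.get(url, headers=headers)`.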
Example 2:
On the basis of embodiment 1, this embodiment also decrypts the entity data stored in the database. The principle is as follows:
Some important information on a website is not provided directly in the page source. Real-time refreshed data, for example, appears in the source as an irregular string of letters, digits and special symbols: the page uses @font-face to define a character set that is displayed through a Unicode mapping, and a web font allows fonts to be loaded from the network, so the site can create its own font with a customized character-mapping table, making the data displayed on the page differ from the characters in the source. In other words, the data we crawl is encrypted by the font. This embodiment solves it in the following way: (1) inspect the page source in browser developer mode, find the code that generates the encrypted font, and obtain the font stored in the page after base64 encoding and encryption; (2) base64-decode the encrypted font to obtain the decoded font file; (3) build the Unicode-mapped character dictionary of the font file: first obtain the glyph shapes of all fonts in the file, apply the Unicode decoding method to them one by one, and map the fonts into a Unicode list; (4) create the mapping relation, store the correspondence between encrypted data and actual data, traverse the crawled data one by one, and map each crawled character to its corresponding actual value. These steps yield the meaning represented by each encrypted code, so the data can be converted after collection and the anti-crawler mechanism of information encryption is cracked.
The specific implementation of cracking data encryption in this embodiment is as follows: analyze the JavaScript code that renders the data and find the back-end interface of the request; it can be seen that the font is stored in the page after base64 encryption. Therefore, in Python, the base64-encrypted data is extracted from the page source with a regular expression and decoded with base64.b64decode(); the decoded data is saved locally as a .ttf file. Next, the Unicode-mapped character dictionary of the .ttf font file is built: the fontforge library, used for editing and creating fonts, is imported into Python; the .ttf file is opened with fontforge, all glyph shapes in the font file are obtained through the glyphs() method of the fontforge library, and each is traversed with the unicode method to map the fonts into a Unicode list. A mapping dictionary is then created to store the correspondence between encrypted data and actual data; the crawled data is traversed, and each crawled character is mapped to its actual value through the dictionary.
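The base64 extraction and the final character mapping can be sketched with the standard library alone. The page snippet, the regular expression and the toy glyph table below are assumptions for illustration; building the real table requires opening the decoded .ttf with a font library, as described above:

```python
import base64
import re

# Illustrative page fragment embedding a (truncated) base64 font.
page_source = 'src:url(data:font/ttf;base64,AAEAAAALAIAAAwAwT1MvMg==) format("truetype")'

# Step (1)-(2): pull the base64 payload out of the source and decode it.
match = re.search(r'base64,([A-Za-z0-9+/=]+)', page_source)
font_bytes = base64.b64decode(match.group(1))
# In the full flow these bytes would be written to a .ttf file and opened
# with a font library to build the unicode -> real character table (step 3).

# Step (4): a toy mapping table from obfuscated code points to real digits.
glyph_map = {"\ue001": "1", "\ue002": "2", "\ue003": "3"}
crawled = "\ue001\ue002k-\ue002\ue003k"          # what the crawler sees
decoded = "".join(glyph_map.get(ch, ch) for ch in crawled)
```

Here `decoded` recovers the salary string the page actually displays, and the first four bytes of `font_bytes` are the standard TrueType sfnt header.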
Example 3:
On the basis of any of the above embodiments, this embodiment also performs request preprocessing to crack signature verification. The principle is as follows: some websites with strong anti-crawler capability also use signature verification to keep a crawler program from obtaining their data, especially dynamic data returned by a back-end interface in response to a submitted form or the content of an input box. A signature is the result of a calculation or encryption over the data source: a character string with uniqueness and consistency. That makes the signature a condition for verifying the data source and data integrity, and it effectively prevents the server from treating forged or tampered data as normal; in other words, merely setting a User-Agent or carrying a cookie in the request header no longer works. The verification principle is that the client generates some random values and an irreversible MD5-encrypted string and sends them to the server when initiating a request; the server performs the same calculation and MD5 encryption on the random values, and if the MD5 value it obtains matches the one submitted by the front end the request is treated as normal, otherwise an error page or a 403 page is returned. The cracking method is: (1) inspect the page source in browser developer mode and find the parameters in the request-form data that change on every request; (2) capture the full page source and find the method that generates these encryption parameters; (3) before sending each request, generate the encryption parameters with the obtained encryption method and carry them in the request, which breaks the signature verification.
The specific implementation of cracking signature verification in this embodiment is as follows: download the JavaScript file loaded by the page to the local machine and search for "salt" in the JavaScript code to locate the encryption algorithm used for signature verification; reimplement the logic of that JavaScript encryption algorithm in the crawler program in Python; build the generated salt and sign values into a Python dictionary, store the other unchanging form data and the data to be entered into the form in the same dictionary as key-value pairs, and set this dictionary as the data parameter of the post request method. Signature verification is thereby cracked.
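A hedged sketch of this scheme: the concatenation formula (keyword + salt) and the field names kw/salt/sign are illustrative assumptions, since the real formula is defined by the site's JavaScript; only the MD5 recompute-and-compare structure is taken from the text:

```python
import hashlib
import random
import time

def make_sign(keyword, salt):
    # Irreversible MD5 digest over the data source plus the random salt.
    return hashlib.md5((keyword + salt).encode("utf-8")).hexdigest()

def build_form(keyword):
    # Client side: a millisecond timestamp plus a random digit as the salt.
    salt = str(int(time.time() * 1000)) + str(random.randint(0, 9))
    return {"kw": keyword, "salt": salt, "sign": make_sign(keyword, salt)}

def server_check(form):
    # Server side: recompute the MD5 from the submitted salt and compare;
    # a mismatch would yield an error page or a 403 page.
    return form["sign"] == make_sign(form["kw"], form["salt"])

form = build_form("java")
```

The crawler reproduces `make_sign` from the site's JavaScript and sends `form` as the `data` parameter of the POST request.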
The beneficial effects of this embodiment include:
(1) Compared with the traditional approach of using the Requests library alone to traverse all URLs one by one and, after obtaining the page source, traversing parts of the source again to extract useful information, this method solves the low efficiency of traditional crawlers under frequent I/O requests and avoids the difficulty of implementing asynchronous and multithreaded requests by hand: different components perform different work, the components support asynchronous processing, and Scrapy makes maximal use of the network bandwidth, so the efficiency of crawling and processing data from a recruitment website improves greatly.
(2) Corresponding cracks are performed against the website's anti-crawler mechanisms: setting a User-Agent cracks the website's inspection of the access request header; setting a cookie cracks the mechanism that data can only be viewed after the site detects the cookie of a logged-in user;
(3) request delays or IP proxies crack the mechanism by which the server blocks an IP when it detects many requests from it at the same time;
(4) sending two requests in succession, obtaining the updated cookie from the first and setting it on the second to fetch the data, defeats the website's dynamic-cookie restriction; combining the delayed request with the two successive requests, where each request uses the updated cookie returned by the previous one and extracts the newly returned cookie afterwards, solves the IP restriction and the dynamic cookie at the same time; using a token to crack the website's login detection improves performance greatly compared with the cookie approach;
(5) for the font-face anti-crawling mechanism, the base64-encrypted font file is decrypted, all glyph shapes are obtained and Unicode-decoded, the fonts are mapped into a Unicode list, the corresponding mapping relation is created from the list, and each crawled character is mapped to its corresponding data to complete the decryption;
(6) for signature verification, the form-data parameters that change on each request are identified and obtained first; the full page source is then captured to find the method that generates the encryption parameters; before each request is sent, the encryption parameters generated by that method are carried in the request, and signature verification is cracked. Compared with traditional crawler algorithms, the website's anti-crawler mechanisms are broken and the usability of the crawler improves greatly.
Example 4:
The Scrapy framework adopted by this embodiment comprises: an engine (Engine), entities (Items), a scheduler (Scheduler), a downloader (Downloader), crawlers (Spiders), an item pipeline (Item Pipeline), downloader middlewares (Downloader Middlewares) and spider middlewares (Spider Middlewares);
the engine handles the data-flow processing of the whole system and triggers events; it is the core of the framework;
the scheduler receives requests sent by the engine, pushes them into a queue, and returns them when the engine asks again; it can be imagined as a priority queue of URLs (the addresses or links of pages to grab) that decides which address to grab next while removing duplicate addresses;
the downloader downloads the webpage content and returns it to the crawler;
crawlers extract the information they need, the so-called entities (Items), from specific web pages; the user can also extract links from a page for Scrapy to continue grabbing the next page;
the pipeline is responsible for processing the entities the crawler extracts from the pages; its main functions are persisting entities, verifying their validity, and removing unneeded information. When a page has been parsed by the crawler, it is sent to the item pipeline, where the data passes through several specific steps in order.
The downloader middleware lies between the Scrapy engine and the downloader and mainly processes the requests and responses passing between them.
The crawler middleware lies between the Scrapy engine and the crawler; its main work is processing the crawler's response input and request output.
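The component roles above can be modeled in a few lines of plain Python. This is a toy illustration of the engine/scheduler/downloader/spider/pipeline data flow and of the scheduler's URL de-duplication, not Scrapy's real API:

```python
from collections import deque

class Scheduler:
    """Queues requests from the engine and removes duplicate URLs."""
    def __init__(self):
        self.queue, self.seen = deque(), set()
    def enqueue(self, request):
        if request not in self.seen:      # duplicate addresses are dropped
            self.seen.add(request)
            self.queue.append(request)
    def next_request(self):
        return self.queue.popleft() if self.queue else None

def downloader(request):
    # Stands in for fetching the page; returns a fake "response".
    return f"<html>{request}</html>"

def spider_parse(response):
    # Extracts an entity (Item) from the response.
    return {"body": response}

def run_engine(start_urls, pipeline_store):
    sched = Scheduler()
    for url in start_urls:
        sched.enqueue(url)
    while (req := sched.next_request()) is not None:
        item = spider_parse(downloader(req))
        pipeline_store.append(item)       # the pipeline persists the entity

store = []
run_engine(["u1", "u2", "u1"], store)
```

The duplicate "u1" is filtered by the scheduler, so only two items reach the pipeline.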
This embodiment is applied in the field of recruitment websites, which are currently the first choice for job hunting. A certain recruitment website focused on Internet jobs was selected for crawling: it carries a large amount of recruitment information and its anti-crawler mechanism is well developed.
As shown in fig. 2, the recruitment-website data collection case based on the Scrapy crawler framework provided by this embodiment includes the following steps:
s1: the Java engineer posts on the website are selected for crawling, and the URLs of the first 15000 items of post data are obtained in advance. After the Scrapy crawler framework starts, the crawler sends the requests for all URLs to the engine through Python's yield keyword, with the token value in the request header set to the token of a logged-in account obtained in browser developer mode, preventing the website's anti-crawling mechanism from detecting a non-logged-in user and refusing access;
s2: the engine forwards the requests to the scheduler without any processing, and the scheduler pushes them into the queue in priority order;
s3: after all requests have been pushed into the scheduler's queue, the engine obtains the first request from the scheduler and sends it to the downloader through the downloader middleware;
s4: after the downloader takes a request, it issues the request through the get method of the Requests library in Python, with the cookie in the request header set to the updated cookie obtained after the previous request. After the request, the webpage source code of the recruitment-post detail page is downloaded and the updated cookie from the requested website is returned; once the download finishes, the response of the page is generated and returned to the engine through the downloader middleware;
The cookie is set to keep the website from detecting that a non-logged-in user is browsing many pages in a short time and flagging the program as a crawler; otherwise only the first piece of information could be obtained, the later pages would be empty, and no Ajax request could be sent to the server for data.
The cookie is set, at request time, to the updated cookie returned after the previous request because the website's cookie is updated dynamically: if one cookie were used throughout, it would be detected and crawling could not continue. A prepared cookie obtains data on the first request, but from the second request on the site prompts the error "you operate too frequently, please visit again later"; using the most recently obtained updated cookie for each request solves this problem. At the same time, the server may also block the IP address, producing an HTTPError 403: Forbidden. A crawler program using GET to pull a large amount of data from the website in a short time can be taken by the server as an attack on the site, so its requests are refused, its IP is blocked automatically, no results can be obtained, and everything the downloader fetches is the source code of the site's designated 403 error page.
If the above problems were solved along the lines of the prior art, the options would be: 1. adding a request header to package the request as a browser request; however, this has no effect on this website, since adding headers only disguises the request as coming from different browsers, and frequent access from the same IP in a short time is still rejected. 2. Proxy IPs "confuse" the server by automatically switching among different IPs, so that access requests appear to come from different computers and cannot all be denied; the drawback is that although a different IP is produced for each access, a large pool of proxy IPs must be prepared in advance, and under frequent access the server bans them one by one in a short time. 3. Reducing the access frequency by simply pausing briefly before each request is the safest method; against the measures this website takes, a delay of 3-5 ms is generally set, but delaying requests hurts performance.
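The three conventional counter-measures can be sketched as follows; the proxy addresses and user-agent strings are placeholders, and this only shows how the request keyword arguments would be assembled (e.g. for requests.get), not a working proxy pool:

```python
import itertools
import random
import time

# Placeholder pools -- a real crawler would maintain verified proxies.
USER_AGENTS = ["Mozilla/5.0 (Windows NT 10.0)", "Mozilla/5.0 (X11; Linux x86_64)"]
PROXIES = itertools.cycle(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])

def request_kwargs(delay=0.0):
    """Assemble disguised headers, a rotated proxy, and an optional delay."""
    if delay:
        time.sleep(delay)                 # crude rate limiting (method 3)
    proxy = next(PROXIES)                 # rotate the proxy IP (method 2)
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},  # method 1
        "proxies": {"http": proxy, "https": proxy},
    }

kw1, kw2 = request_kwargs(), request_kwargs()
```

Successive calls rotate the proxy, mirroring why a large proxy pool has to be prepared in advance.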
Therefore, this embodiment sets the cookie of every request to the updated cookie returned after the previous request, so no request delay needs to be configured: after each request, i.e. after the get method of the Requests library executes, a statement that obtains the updated cookie runs before the next URL is requested, and the time spent executing that statement stands in for the request delay. This prevents the IP from being blocked while also defeating the dynamic-cookie detection.
S5: the engine sends the response to the crawler through the crawler middleware. In the crawler's parsing function parse(), a BeautifulSoup object is created with the Beautiful Soup library and parsed with the lxml parser, and the needed information is located through the parsing methods encapsulated by the library.
The HTML tags and content of the page source corresponding to the required information are checked in the browser's developer mode. The tag of the position name is <h1 class="name">java development engineer</h1>, and the position name is obtained through the Beautiful Soup method selector find(). The tag carrying the salary, workplace, experience and education requirements is <h3 class="job_request"><span class="salary">12k-20k</span><span>/Chengdu/</span><span>experience 5-10 years/</span><span>bachelor and above/</span></h3>: the post salary is obtained through the CSS selector select('.job_request h3 span')[0].get_text(), the working-experience requirement through select('.job_request h3 span:nth-child(3)')[0].get_text(), and the academic requirement through select('.job_request h3 span:nth-child(4)')[0].get_text(). The HTML source corresponding to the specific work content of the post is a <div class="job_detail"> block listing, separated by <br> tags, the work content (communicating business requirements with the product manager, negotiating requirement design and independently designing development schemes; participating in the development of new and old systems while weighing business demand, development efficiency and code robustness; writing related documents, including code comments and knowledge-base pages; completing daily development work and autonomously solving most technical problems; finishing other work assigned by superiors on time and with quality) and the post requirements (a good grasp of and application experience with the Java basic technology system, including the JVM, the class-loading mechanism, multithreaded concurrency, IO and networking; a good understanding of object-oriented design, familiarity with its principles, and command of design patterns and their application scenarios; strong problem analysis and handling ability, excellent practical skill and enthusiasm for technology; familiarity with common middleware and distributed technologies, including caching, messaging systems, hot deployment and JMX; practical project and product experience with high concurrency, high stability and availability, high performance and large-scale data processing; familiarity with the software-project development process, a basic knowledge of agile development, and some ability to analyze user experience, interaction flows and user requirements). The post's specific skill requirements and work content are obtained through the CSS selector select('.job_detail')[0].get_text(). This information is encapsulated into an entity (Item): the entity defines the data structure of a crawl result, and the crawl result is assigned to the Item object, whose class inherits scrapy.Item with fields defined as the scrapy.Field type. The crawler returns the entity and the new request to the engine.
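As a standard-library stand-in for the Beautiful Soup calls above, the <span> texts of the job_request block can be collected with html.parser; the HTML snippet is a simplified reconstruction of the page fragment and the class names are illustrative:

```python
from html.parser import HTMLParser

# Reconstructed fragment of the detail page described in the text.
HTML = ('<h3 class="job_request"><span class="salary">12k-20k</span>'
        '<span>Chengdu</span><span>5-10 years experience</span>'
        '<span>bachelor and above</span></h3>')

class SpanCollector(HTMLParser):
    """Collects the text of every <span>, mirroring what
    select('.job_request h3 span:nth-child(n)')[0].get_text() returns."""
    def __init__(self):
        super().__init__()
        self.in_span, self.spans = False, []
    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True
    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False
    def handle_data(self, data):
        if self.in_span:
            self.spans.append(data)

parser = SpanCollector()
parser.feed(HTML)
salary, city, experience, degree = parser.spans
```

With Beautiful Soup installed, the equivalent would be `BeautifulSoup(HTML, "lxml").select(".job_request span")`.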
S6: the engine sends the new request to the scheduler and the entity to the pipeline, and the pipeline stores the data in the entity into the MySQL database.
S7: steps S2-S6 are repeated until no request remains in the scheduler, and the crawl is complete.
Example 5:
A Scrapy-framework-based data acquisition system, comprising:
a crawler module, in which the logic responsible for crawling and the parsing rules of the pages are defined, used to process responses, parse and extract data from them, obtain the required data, and submit follow-up requests to the engine so that they enter the scheduler again;
an engine, used for the data-flow processing of the whole system and for obtaining updated cookie data;
a downloader, used to download the requests sent by the engine and return the obtained content to the engine in the form of responses;
a pipeline, used to process the entity data acquired from the webpages by the crawler module;
a downloader middleware, located between the engine and the downloader, mainly processing the requests and responses between them;
a crawler middleware, located between the engine and the crawler module, used to process the crawler module's response input and request output.
Example 6:
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the Scrapy-framework-based data collection method described in any of embodiments 1-4.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Claims (5)

1. A data acquisition method based on the Scrapy framework, characterized in that it is implemented by means of:
a crawler module, in which the logic responsible for crawling and the parsing rules of the pages are defined, used to process responses, parse and extract data from them, obtain the required data, and submit follow-up requests to the engine so that they enter the scheduler again;
an engine, used for the data-flow processing of the whole system and for obtaining updated cookie data;
a downloader, used to download the requests sent by the engine and deliver the obtained content back to the engine in the form of responses;
a pipeline, used to process the entity data acquired from the webpages by the crawler module;
a downloader middleware, located between the engine and the downloader, mainly processing the requests and responses between them;
a crawler middleware, located between the engine and the crawler module, used to process the crawler module's response input and request output;
the data acquisition method comprises the following steps:
request preprocessing for cracking signature verification, the request preprocessing method comprising the following steps:
s001, viewing the webpage source code in browser developer mode, finding the parameters in the request-form data that change on every request, and defining them as encryption parameters;
s002, capturing the entire webpage source code, finding in it the method that generates the encryption parameters, and defining it as the encryption method;
s003, before each request is sent, generating new encryption parameters according to the obtained encryption method and carrying them in the request;
s1, the crawler module sends the requests for the website detail pages to be crawled to the engine;
s2, the engine sends all requests to the scheduler, and the scheduler pushes them into a queue;
s3, the engine obtains a request from the scheduler and sends it to the downloader through the download middleware;
s4, after receiving the request, the downloader sends it to the server, downloads the corresponding webpage source code, generates the response of the corresponding webpage, and returns the response to the engine through the downloader middleware; the request sent to the server carries updated cookie data, and the downloaded webpage source code is included in the response;
the method for carrying updated cookie data in the request sent to the server is:
if the downloader obtains the first request for a website detail page, the cookie parameter is set to the cookie of a logged-in account obtained through the browser developer mode, the request is sent to the server, and after the response returned by the server is obtained, it is processed to obtain updated cookie data;
if the downloader receives the second or a subsequent request, the cookie parameter is set to the cookie obtained in the previous request, serving as the updated cookie data, and the request is then sent to the server; after the response returned by the server is obtained, it is processed to obtain updated cookie data for the next request to use;
s5, the engine receives a response from the downloader and sends the response to the crawler module through the crawler middleware;
s6, the crawler module processes and analyzes the response, extracts the required information from the response, extracts the next request from the queue, packages the required information into entity data, and sends the entity data and the next request to the engine together;
s7, the engine sends the returned entity data to the pipeline, the pipeline stores the entity data in the database and sends the next request to the scheduler, the scheduler sends the next request to the engine, and the engine sends the next request to the downloader through the downloading middleware;
s8, repeating S4-S7 until no request remains in the queue;
decrypting the entity data stored in the database, wherein the decryption method comprises the following steps:
s901, viewing webpage source codes in a browser developer mode, and finding out corresponding codes for generating encrypted fonts;
s902, decoding the encrypted font to obtain a decoded font file;
s903, constructing a character dictionary mapped by the Unicode in the font file: obtaining font shapes of all fonts in the font file, and mapping the fonts into a unicode list by using a unicode decoding method one by one;
s904, creating a mapping relation, storing the corresponding relation between the encrypted data and the actual data, traversing the crawled data one by one, and mapping the crawled characters into corresponding actual values;
s905, the meaning of each encryption code mapped and represented is obtained, and the entity data stored in the database is converted.
2. The data collection method based on the Scrapy framework according to claim 1, wherein the method for processing the response to obtain the updated cookie data comprises: after the server returns a response, obtaining the updated cookie data using the Requests library of the Python programming language.
3. The data collection method based on the Scrapy framework according to claim 1, wherein the request in step S1 includes token information, and the token information includes the token value of a logged-in user.
4. The data collection method based on the Scrapy framework according to claim 1, wherein in step S4, each time a request is sent, the next request is sent with a delay, or a proxy IP is used to send the request.
5. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the Scrapy-framework-based data collection method according to any of claims 1 to 4.
CN202010784262.7A 2020-08-06 2020-08-06 Data acquisition method and system based on Scapy framework and storage medium Active CN111881337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010784262.7A CN111881337B (en) 2020-08-06 2020-08-06 Data acquisition method and system based on Scapy framework and storage medium


Publications (2)

Publication Number Publication Date
CN111881337A CN111881337A (en) 2020-11-03
CN111881337B true CN111881337B (en) 2021-06-01

Family

ID=73210858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010784262.7A Active CN111881337B (en) 2020-08-06 2020-08-06 Data acquisition method and system based on Scapy framework and storage medium

Country Status (1)

Country Link
CN (1) CN111881337B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528120A (en) * 2020-12-21 2021-03-19 北京中安智达科技有限公司 Method for web data crawler to use browser to divide body and proxy
CN113536301A (en) * 2021-07-19 2021-10-22 北京计算机技术及应用研究所 Behavior characteristic analysis-based anti-crawling method
CN113660312A (en) * 2021-07-23 2021-11-16 中建材(合肥)粉体科技装备有限公司 Cement plant equipment data acquisition system and method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033115A (en) * 2017-06-12 2018-12-18 广东技术师范学院 A kind of dynamic web page crawler system
CN109902216A (en) * 2019-03-04 2019-06-18 桂林电子科技大学 A kind of data collection and analysis method based on social networks
CN111274466A (en) * 2019-12-18 2020-06-12 成都迪普曼林信息技术有限公司 Non-structural data acquisition system and method for overseas server

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9646095B1 (en) * 2012-03-01 2017-05-09 Pathmatics, Inc. Systems and methods for generating and maintaining internet user profile data
CN104252530B (en) * 2014-09-10 2017-09-15 北京京东尚科信息技术有限公司 A stand-alone crawler capture method and system
CN105956175B (en) * 2016-05-24 2017-09-05 考拉征信服务有限公司 Method and apparatus for crawling web page content
CN109508422A (en) * 2018-12-05 2019-03-22 南京邮电大学 A highly covert crawler system with multi-threaded intelligent scheduling

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on crawler and anti-crawler techniques based on the Scrapy framework; Han Bei et al.; Computer Technology and Development; 2018-11-15; Vol. 29, No. 2; pp. 139-142, Sections 1 and 3 and Figure 1 *
Implementation of a web crawler based on the Scrapy framework and analysis of data crawling; An Zijian; China Masters' Theses Full-text Database, Information Science and Technology; 2017-10-15; I138-276 *
Clothing information collection based on a topic-focused web crawler; Li Jun et al.; Information Technology and Informatization; 2018-08-25; pp. 97-99 *

Also Published As

Publication number Publication date
CN111881337A (en) 2020-11-03

Similar Documents

Publication Publication Date Title
US11552993B2 (en) Automated collection of branded training data for security awareness training
CN111881337B (en) Data acquisition method and system based on Scapy framework and storage medium
EP2558973B1 (en) Streaming insertion of tokens into content to protect against CSRF
CN110602052B (en) Micro-service processing method and server
US11126749B2 (en) Apparatus and method for securing web application server source code
US9317693B2 (en) Systems and methods for advanced dynamic analysis scanning
CN110881044B (en) Computer firewall dynamic defense security platform
EP2673708B1 (en) Distinguishing valid users from bots, OCRs and third party solvers when presenting CAPTCHA
US8495358B2 (en) Software based multi-channel polymorphic data obfuscation
US20120167182A1 (en) Device independent authentication system and method
CN101964025A (en) XSS (Cross Site Scripting) detection method and device
CN102571846A (en) Method and device for forwarding a hypertext transfer protocol (HTTP) request
WO2010108421A1 (en) Method and apparatus for authenticating a website
WO2011109766A2 (en) Input parameter filtering for web application security
CN113660250B (en) Defense method, device and system based on WEB application firewall and electronic device
US20130160132A1 (en) Cross-site request forgery protection
CN110581841B (en) Back-end anti-crawler method
CN116324766A (en) Optimizing crawling requests by browsing profiles
Wedman et al. An analytical study of web application session management mechanisms and HTTP session hijacking attacks
Ham et al. Big Data Preprocessing Mechanism for Analytics of Mobile Web Log.
CN104038344B (en) Identity authentication method based on regular expression
KR101296384B1 (en) System and method for verifying integrity of web page
Gupta et al. POND: polishing the execution of nested context-familiar runtime dynamic parsing and sanitisation of XSS worms on online edge servers of fog computing
AU2016340025B2 (en) Dynamic Cryptographic Polymorphism (DCP) system and method
Cheah et al. A review of common web application breaching techniques (SQLi, XSS, CSRF)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant