CN116108252A

CN116108252A - Method, system, computer device and storage medium for limiting data crawling

Info

Publication number: CN116108252A
Application number: CN202310396461.4A
Authority: CN
Inventors: 毛文浩; 苏睿
Original assignee: Shenzhen Hexun Huagu Information Technology Co ltd
Current assignee: Shenzhen Hexun Huagu Information Technology Co ltd
Priority date: 2023-04-14
Filing date: 2023-04-14
Publication date: 2023-05-12

Abstract

The application relates to the technical field of data processing and provides a method, a system, a computer device and a storage medium for limiting data crawling, which aim to limit crawling of website data by external requests identified as crawlers. The method mainly comprises the following steps: when an external request is received, judging whether the external request meets a preset crawler identification standard; if the external request meets the crawler identification standard, executing a preset access limiting strategy; and if the external request does not meet the crawler identification standard, allowing a response to the external request.

Description

Method, system, computer device and storage medium for limiting data crawling

Technical Field

The application belongs to the technical field of data processing, and particularly relates to a method, a system, a computer device and a storage medium for limiting data crawling.

Background

In the prior art, malicious crawler requests are common on the internet. For example, a crawler may be used to maliciously crawl a competitor's website data (such as core e-commerce data like prices and sales volumes), leaking a large amount of website data. As another example, a crawler may simulate a real user's access requests for website data, parse the web page content and capture useful information; when a large number of crawler requests hit a website at the same time, they interfere with normal access by real users and may even overload the server and cause downtime.

Disclosure of Invention

The application aims to provide a method, a system, computer equipment and a storage medium for limiting data crawling, which aim to limit data crawling of a website for external requests which are identified as crawlers.

In a first aspect, the present application provides a restricted data crawling method, including:

when an external request is received, judging whether the external request meets a preset crawler identification standard or not;

if the external request meets the crawler identification standard, executing a preset access limiting strategy;

and if the external request does not meet the crawler identification standard, allowing to respond to the external request.

Optionally, the determining whether the external request meets a preset crawler identification criterion includes:

determining target request characteristic data carried in the external request;

and judging whether the target request characteristic data meets a preset crawler identification standard or not.

Optionally, the crawler identification criteria include an access frequency upper limit, request header parameter integrity, a blacklist and preset criteria.

Optionally, the method further comprises:

collecting normal external requests sent by a plurality of verified real users to obtain a normal external request set;

extracting user characteristic data from the normal external request set to obtain a user characteristic data set;

classifying the user characteristic data set to obtain X categories of user characteristic data diversity, wherein X is a positive integer greater than 0;

and selecting Y categories of target user characteristic data from the user characteristic data diversity as the request header parameter integrity requirement, wherein Y is a positive integer which is more than 0 and less than or equal to X.

Optionally, after obtaining the user feature data set, the method further comprises:

carrying out access frequency analysis of a real user on the user characteristic data set to obtain a real access frequency data set;

and selecting the target access frequency in the real access frequency data set as the access frequency upper limit.

Optionally, after obtaining the diversity of the X kinds of user characteristic data, the method further comprises:

displaying the normal external request set, the user characteristic data set and the user characteristic data diversity of the X categories;

and receiving the preset standard input by the user.

Optionally, before allowing the response to the external request, the method further comprises:

performing confusion processing on a front-end code of the front end of the webpage to be accessed by the external request by using a code confusion technology;

and carrying out encryption processing on data transmission between the front end of the webpage and the rear end of the webpage.

In a second aspect, the present application provides a restricted data crawling system comprising:

the judging unit is used for judging whether the external request meets the preset crawler identification standard or not when the external request is received;

the limiting unit is used for executing a preset access limiting strategy if the external request meets the crawler identification standard;

and the response unit is used for allowing to respond to the external request if the external request does not meet the crawler identification standard.

Optionally, when the judging unit judges whether the external request meets a preset crawler identification standard, the judging unit is specifically configured to:

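determining target request characteristic data carried in the external request;

and judging whether the target request characteristic data meets the preset crawler identification standard.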

Optionally, the system further comprises:

the collecting unit is used for collecting normal external requests sent by a plurality of verified real users to obtain a normal external request set;

the extracting unit is used for extracting the user characteristic data from the normal external request set to obtain a user characteristic data set;

the classification unit is used for classifying the user characteristic data set to obtain X categories of user characteristic data diversity, wherein X is a positive integer greater than 0;

and the selection unit is used for selecting Y categories of target user characteristic data from the user characteristic data diversity as the request header parameter integrity requirement, wherein Y is a positive integer which is more than 0 and less than or equal to X.

Optionally, the system further comprises:

the analysis unit is used for analyzing the access frequency of the real user to the user characteristic data set to obtain a real access frequency data set;

the selecting unit is further configured to select a target access frequency in the real access frequency dataset as the access frequency upper limit.

Optionally, the system further comprises:

the display unit is used for displaying the normal external request set, the user characteristic data set and the user characteristic data diversity of the X types;

and the receiving unit is used for receiving the preset standard input by the user.

Optionally, the system further comprises:

the confusion unit is used for carrying out confusion processing on the front end code of the front end of the webpage to be accessed by the external request by using a code confusion technology;

and the encryption unit is used for carrying out encryption processing on the data transmission between the front end of the webpage and the rear end of the webpage.

In a third aspect, the present application provides a computer device comprising:

a processor, a memory, a bus, an input-output interface, and a wireless network interface;

the processor is connected with the memory, the input/output interface and the wireless network interface through buses;

the memory stores a program;

the processor implements the restricted data crawling method of the first aspect when executing the program stored in the memory.

In a fourth aspect, the present application provides a computer-readable storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the restricted data crawling method of the first aspect.

In a fifth aspect, the present application provides a computer program product which, when executed on a computer, causes the computer to perform the restricted data crawling method as described in the first aspect above.

As can be seen from the above technical solutions, the embodiments of the application have the following advantages:

When an external request is received, the method for limiting data crawling of this embodiment first judges whether the external request meets the preset crawler identification standard. If the external request meets the crawler identification standard, the external request is very likely disguised by crawler technology, so a preset access limiting strategy is executed to limit or prohibit access by the external request, thereby limiting crawling of website data by external requests identified as crawlers. If the external request does not meet the crawler identification standard, the external request is very likely not disguised by crawler technology, and the website is allowed to respond to it.

Drawings

FIG. 1 is a flow chart illustrating an embodiment of a method for restricting crawling data according to the present application;

FIG. 2 is a flow chart illustrating another embodiment of a method for restricting crawling data according to the present application;

FIG. 3 is a schematic diagram illustrating one embodiment of a restricted data crawling system of the present application;

FIG. 4 is a schematic diagram illustrating another embodiment of a restricted data crawling system according to the present application;

FIG. 5 is a schematic diagram of an embodiment of a computer device of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

A web crawler, also called a web spider or web robot, is a program or script that automatically captures web information according to certain rules. Crawler technology is now mature both domestically and internationally. On the one hand, search engines use crawlers to index websites so that users can search them; on the other hand, artificial intelligence and big data analysis rely on retrieving public information data. However, malicious crawler requests also occur on the internet in the prior art. For example, a crawler may be used to maliciously crawl a competitor's website data (such as core e-commerce data like prices and sales volumes), leaking a large amount of website data. As another example, a crawler may simulate a real user's access requests for website data, parse the web page content and obtain useful information; when a large number of crawler requests hit a website at the same time, they interfere with normal access by real users and may even overload the server and cause downtime. In view of this, identifying web crawlers and limiting their access to websites is a major challenge.

Based on the above understanding of web crawlers, referring to fig. 1, one embodiment of the method for limiting data crawling of the present application includes:

101. When an external request is received, judging whether the external request meets a preset crawler identification standard, and if the external request meets the preset crawler identification standard, executing step 102; if the external request does not meet the preset crawler identification standard, executing step 103.

It should be noted that, in this embodiment, a preset crawler identification standard needs to be stored in advance. The crawler identification standard is used to judge whether an external request reaching the website server is disguised by crawler technology. If the external request meets the preset crawler identification standard, the external request is very likely disguised by crawler technology and can be considered an abnormal operation; if the external request does not meet the crawler identification standard, the external request is very likely not disguised by crawler technology and can be considered a normal operation. The crawler identification standard can be set or dynamically adjusted according to actual needs, and is not limited herein.

102. And executing a preset restricted access policy.

Specifically, after step 101 determines that the external request is very likely disguised by crawler technology, this step executes a preset access limiting policy to limit or prohibit access by the external request, thereby limiting access to and crawling of website data by external requests identified as crawlers. The preset access limiting policies in this step include prohibiting access, one-time or repeated pop-up verification, limiting the number of accesses and the like; a policy can be selected according to actual needs and is not further limited here.

103. Allowing response to external requests.

Specifically, when it is proved in step 101 that the external request is not camouflaged by the crawler technology, this step allows the website to respond to the external request. For example, the external request carries information such as reply address and reply content, and this step sends the reply content that it needs to the reply address. The content of the external request is not further defined herein.
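The flow of steps 101 to 103 can be sketched as follows. This is a minimal illustration only; the names `ExternalRequest`, `handle_request` and `no_user_agent` are hypothetical and do not appear in the application, and the single rule used here (a missing User-Agent header) merely stands in for whatever crawler identification standard is actually preset.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ExternalRequest:
    """Hypothetical, simplified view of an incoming external request."""
    ip: str
    headers: Dict[str, str] = field(default_factory=dict)


def handle_request(
    request: ExternalRequest,
    meets_crawler_standard: Callable[[ExternalRequest], bool],
    restrict: Callable[[ExternalRequest], str],
    respond: Callable[[ExternalRequest], str],
) -> str:
    # Step 101: judge the request against the preset crawler identification standard.
    if meets_crawler_standard(request):
        # Step 102: execute the preset access limiting policy.
        return restrict(request)
    # Step 103: allow the website to respond to the request.
    return respond(request)


# Illustrative standard only (an assumption): a missing User-Agent looks crawler-like.
def no_user_agent(req: ExternalRequest) -> bool:
    return "User-Agent" not in req.headers


browser = ExternalRequest("203.0.113.5", {"User-Agent": "Mozilla/5.0"})
script = ExternalRequest("203.0.113.9")

result_browser = handle_request(browser, no_user_agent, lambda r: "restricted", lambda r: "ok")
result_script = handle_request(script, no_user_agent, lambda r: "restricted", lambda r: "ok")
```

Passing the standard and the two handlers in as callables keeps the dispatch independent of any particular standard, which matches the statement that the standard can be set or adjusted according to actual needs.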

Referring to fig. 2, another embodiment of the limited data crawling method of the present application includes:

201. Normal external requests sent by a plurality of verified real users are collected to obtain a normal external request set.

It should be noted that, in order to accurately distinguish normal external requests sent by real users from external requests disguised by crawler technology, this step collects verified normal external requests sent by real users and stores them to obtain a normal external request set for research and analysis in the subsequent steps. For example, a normal external request may be sent from a PC, an app, an H5 page or the like to access the website of this embodiment.

202. Extracting user characteristic data from the normal external request set to obtain a user characteristic data set.

Each normal external request in the set obtained in step 201 is cleaned, parsed and extracted to obtain the user feature data corresponding to each external request, thereby forming a user feature data set. For example, the user feature data typically includes browser information, the Internet Protocol (IP) address, account cookies, custom request headers and the like. A normal external request generally includes some or all of these user feature data; the user feature data specifically included in a normal external request is not limited herein.
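The extraction in step 202 might look like the sketch below. The request layout (a dict with `remote_addr` and `headers` keys), the function name `extract_user_features` and the convention that custom headers start with `X-` are all assumptions made for illustration; the application does not specify them.

```python
from typing import Any, Dict, List


def extract_user_features(raw_request: Dict[str, Any]) -> Dict[str, str]:
    """Pull the feature categories named in step 202 out of one request."""
    headers: Dict[str, str] = raw_request.get("headers", {})
    features = {
        "browser": headers.get("User-Agent"),    # browser information
        "ip": raw_request.get("remote_addr"),    # IP address
        "account_cookie": headers.get("Cookie"), # account cookie
    }
    # Assumption: custom request headers are recognised by an "X-" prefix.
    for name, value in headers.items():
        if name.lower().startswith("x-"):
            features["custom:" + name.lower()] = value
    # Drop categories the request does not carry.
    return {key: value for key, value in features.items() if value is not None}


normal_external_request_set: List[Dict[str, Any]] = [
    {
        "remote_addr": "203.0.113.5",
        "headers": {"User-Agent": "Mozilla/5.0", "Cookie": "sid=abc", "X-App-Version": "1.2"},
    },
    {"remote_addr": "203.0.113.6", "headers": {"User-Agent": "Mozilla/5.0"}},
]

user_feature_data_set = [extract_user_features(r) for r in normal_external_request_set]
```

Categories that a request does not carry are simply omitted, reflecting that a normal request may include only some of the user feature data.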

203. And classifying the user characteristic data set to obtain X categories of user characteristic data diversity, wherein X is a positive integer greater than 0.

The feature data of each same category in the user feature data set of step 202 is identified, classified and summarized to obtain user feature data diversity of different categories, from which the category names, the number of categories and so on of the user feature data contained in each external request can be obtained. For example, the categories of user feature data may include: the request line (request method, requested resource path, HTTP protocol version, etc.); the request headers, that is, metadata of the request such as Content-Type (the type of the request body), Accept (the response types the client accepts) and User-Agent (the client's browser or agent information); and the request body, the optional request payload generally used to transmit data with methods such as POST or PUT. The foregoing is merely illustrative; in practical applications, the user feature data contained in each external request may be some or all of the X categories of user feature data.

204. Target user feature data of Y categories is selected from the user feature data diversity as the request header parameter integrity requirement, Y being a positive integer greater than 0 and less than or equal to X.

Notably, research has shown that external requests formed using crawler technology often lack certain user feature data; for example, crawler request headers often lack some fixed parameters that a normal browser sends. This step may therefore require that an external request carry certain fixed browser parameters, and otherwise treat the external request as disguised by a crawler. Research further shows that external requests formed using crawler technology often contain relatively few categories of user feature data, so this step can select target user feature data of Y categories from the user feature data diversity as the request header parameter integrity requirement, which allows some external requests disguised by crawler technology to be identified effectively from the number of user feature data categories they carry.
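Steps 203 and 204 can be illustrated with the sketch below, which treats each header name as one category. The selection rule (keep the categories present in at least a given share of normal requests) and the names `classify_header_categories`, `select_required_headers`, `passes_header_integrity` and `min_share` are assumptions; the application leaves open how the Y categories are chosen.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def classify_header_categories(normal_requests: Iterable[Dict[str, str]]) -> Counter:
    """Step 203: count in how many normal requests each header category appears."""
    counts: Counter = Counter()
    for headers in normal_requests:
        counts.update({name.lower() for name in headers})
    return counts


def select_required_headers(counts: Counter, total: int, min_share: float = 0.95) -> Set[str]:
    """Step 204 (assumed rule): keep categories carried by nearly all real users."""
    return {name for name, n in counts.items() if n / total >= min_share}


def passes_header_integrity(headers: Dict[str, str], required: Set[str]) -> bool:
    """True if the request carries every required header category."""
    return required <= {name.lower() for name in headers}


normal_headers = [
    {"User-Agent": "Mozilla/5.0", "Accept": "text/html", "Accept-Language": "en"},
    {"User-Agent": "Mozilla/5.0", "Accept": "text/html"},
    {"User-Agent": "Mozilla/5.0", "Accept": "text/html", "Referer": "https://example.com/"},
]

counts = classify_header_categories(normal_headers)            # X = 4 categories
required = select_required_headers(counts, len(normal_headers))  # Y = 2 categories

browser_ok = passes_header_integrity({"User-Agent": "Mozilla/5.0", "Accept": "*/*"}, required)
script_ok = passes_header_integrity({"User-Agent": "python-requests"}, required)
```

A request that fails this check lacks parameters that real browsers were observed to send, so it would count toward meeting the crawler identification standard.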

205. And carrying out access frequency analysis of the real user on the user characteristic data set to obtain a real access frequency data set.

After the normal external request set is obtained in step 201, the user feature data set is obtained in step 202, and/or the user feature data diversity is obtained in step 203, the user feature data of each real user can be further tracked and analyzed for access frequency. That is, data such as the number of accesses to the website server and the access time intervals of each real user within a certain period can be obtained, and the real access frequency of each real user can then be calculated (for example, a normal user sends no more than 10 external requests in one second). Tracking and summarizing real users in this way provides stronger data support for setting the crawler identification standard; different indexes can also be sampled and counted for different groups of real users to reduce sampling error. The results are summarized to obtain the real access frequency data set of all real users, which reflects the frequency interval (the upper and lower limits of the access frequency) of real users' access to the website server. In addition, based on the obtained number of accesses and access time intervals, other time-related or count-related values, such as request time and response time, or values that need to be identified by accumulation, can be derived.

206. The target access frequency in the real access frequency dataset is selected as the access frequency upper limit.

Specifically, the access frequency upper limit may be used as one of the crawler identification criteria. In this step, the target access frequency may be derived from the real access frequency data set of step 205, for example as the average or the highest frequency of real users' access to the website server.
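As a rough sketch of steps 205 and 206, the code below derives a per-user peak request rate from timestamps and takes the highest observed value as the upper limit. Whole-second buckets are a simplification chosen for brevity, and the names `peak_requests_per_second`, `real_access_frequency_dataset`, `select_upper_limit` and `margin` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def peak_requests_per_second(timestamps: List[float]) -> int:
    """Highest number of requests that fall into any single whole-second bucket."""
    if not timestamps:
        return 0
    buckets = Counter(int(t) for t in timestamps)
    return max(buckets.values())


def real_access_frequency_dataset(logs: Dict[str, List[float]]) -> Dict[str, int]:
    """Step 205: one real access frequency value per verified real user."""
    return {user: peak_requests_per_second(ts) for user, ts in logs.items()}


def select_upper_limit(dataset: Dict[str, int], margin: float = 1.0) -> int:
    """Step 206 (assumed rule): highest observed frequency, optionally scaled."""
    return int(max(dataset.values()) * margin)


# Request timestamps in seconds for two hypothetical verified users.
access_logs = {
    "alice": [0.1, 0.5, 0.9, 1.2],
    "bob": [10.0, 10.2, 11.5],
}

frequency_dataset = real_access_frequency_dataset(access_logs)
upper_limit = select_upper_limit(frequency_dataset)
```

Using the average instead of the maximum, or a percentile, would be an equally valid reading of the step; the choice affects how many real users risk being misjudged.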

207. And displaying the normal external request set, the user characteristic data set and the X kinds of user characteristic data diversity.

In other embodiments, after the normal external request set is obtained in step 201, the user feature data set is obtained in step 202, and/or the X categories of user feature data diversity are obtained in step 203, a view of this data may be presented to the manager via a display device so that the manager can perform comprehensive decision analysis.

208. And receiving preset standards input by a user.

It will be appreciated that, in this embodiment, the user (manager) may define a preset standard for crawler identification as part of the crawler identification standard; the restricted data crawling system executes the standard after receiving it. The preset standard may be summarized by the manager from the patterns observed in the data displayed in step 207, or may be an industry crawler identification standard entered by the manager, which is not limited herein. The preset standard may be dynamically modified according to actual needs, and the modification then takes effect.

209. And taking the upper limit of the access frequency, the integrity of the request header parameters, the blacklist and the preset standard as the crawler identification standard.

It should be noted that, in addition to using the access frequency upper limit, the request header parameter integrity and the preset standard as crawler identification criteria, this step may also set a blacklist for crawler identification. Requests from sending addresses, data download addresses and the like recorded in the blacklist receive no response at all. For example, if the same IP address sends more than a certain number of external requests within a certain period and is therefore judged to be a crawler, the IP address is written into the blacklist and blocked.
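The blacklist example above (an IP address that exceeds a request count within a period is written into the blacklist and blocked) can be sketched with a sliding window. The class name `FrequencyBlacklist` and its parameters are hypothetical, and the in-memory storage is an assumption made to keep the example self-contained.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Set


class FrequencyBlacklist:
    """Blacklists an IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)
        self.blacklist: Set[str] = set()

    def is_blocked(self, ip: str, now: float) -> bool:
        # Addresses already on the blacklist receive no response.
        if ip in self.blacklist:
            return True
        recent = self._recent[ip]
        recent.append(now)
        # Discard timestamps that have fallen out of the window.
        while recent and now - recent[0] > self.window_seconds:
            recent.popleft()
        if len(recent) > self.max_requests:
            self.blacklist.add(ip)
            return True
        return False


guard = FrequencyBlacklist(max_requests=3, window_seconds=1.0)
decisions = [guard.is_blocked("198.51.100.7", t) for t in (0.0, 0.1, 0.2, 0.3, 5.0)]
other_ip = guard.is_blocked("203.0.113.5", 5.0)
```

In this run the fourth request within one second trips the limit, and the later request at 5.0 seconds is still refused because the address stays on the blacklist.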

210. An external request is received.

It will be appreciated that the external request of the verified real user received in this step may also be used as a component of the normal external request set in step 201, so as to dynamically update the basic data of the crawler identification standard, and dynamically optimize the data source of the crawler identification standard.

211. And determining target request characteristic data carried in the external request.

The external request received in step 210 is subjected to data analysis, so that the target request feature data carried in the external request can be determined. The target request feature data carried in the external request includes, for example: browser information, internet protocol address (Internet Protocol Address, IP), account cookie, custom request header, etc.

212. Judging whether the target request feature data meets the preset crawler identification standard, and if the target request feature data meets the preset crawler identification standard, executing step 213; if the target request feature data does not meet the preset crawler identification criteria, step 214 is performed.

According to the target request feature data determined in step 211 and the crawler identification standard preset in step 209, it is judged whether the external request accessing the website server is disguised by crawler technology. If the target request feature data meets the preset crawler identification standard, the external request is very likely disguised by crawler technology; if it does not, the external request is very likely not disguised by crawler technology.
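One way to combine the four criteria of step 209 in the judgment of step 212 is sketched below. The application does not say how the criteria are combined, so treating any single matched criterion as sufficient is an assumption, as are the names `TargetFeatures`, `CrawlerStandard` and `matched_criteria`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class TargetFeatures:
    """Hypothetical target request feature data determined in step 211."""
    ip: str
    headers: Dict[str, str]
    requests_last_second: int


@dataclass
class CrawlerStandard:
    frequency_upper_limit: int
    required_headers: Set[str]
    blacklist: Set[str] = field(default_factory=set)
    preset_rules: List[Callable[[TargetFeatures], bool]] = field(default_factory=list)

    def matched_criteria(self, features: TargetFeatures) -> List[str]:
        """Return the names of all criteria the request matches (empty means normal)."""
        hits: List[str] = []
        if features.ip in self.blacklist:
            hits.append("blacklist")
        if features.requests_last_second > self.frequency_upper_limit:
            hits.append("access_frequency")
        present = {name.lower() for name in features.headers}
        if not self.required_headers <= present:
            hits.append("header_integrity")
        if any(rule(features) for rule in self.preset_rules):
            hits.append("preset_standard")
        return hits


standard = CrawlerStandard(
    frequency_upper_limit=10,
    required_headers={"user-agent", "accept"},
    blacklist={"198.51.100.7"},
)

normal = TargetFeatures("203.0.113.5", {"User-Agent": "Mozilla/5.0", "Accept": "text/html"}, 2)
suspect = TargetFeatures("198.51.100.7", {"User-Agent": "x"}, 50)

normal_hits = standard.matched_criteria(normal)    # step 214 onward
suspect_hits = standard.matched_criteria(suspect)  # step 213
```

Returning the list of matched criteria rather than a bare boolean makes it possible to log why a request was restricted, which can support the dynamic adjustment of the standard mentioned elsewhere in this embodiment.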

213. And executing a preset restricted access policy.

After step 212 determines that the external request is very likely disguised by crawler technology, this step executes a preset access limiting policy to limit or prohibit access by the external request, thereby limiting access to and crawling of website data by external requests identified as crawlers. The preset access limiting policies in this step include prohibiting access, one-time or repeated verification, limiting the number of accesses and the like; a policy can be selected according to actual needs and is not further limited here.
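The three kinds of access limiting policy named above could be dispatched as in the following sketch. The enum `RestrictPolicy`, the function `execute_policy` and the choice of HTTP status codes (403 and 429) are assumptions for illustration; the application does not prescribe any particular response.

```python
from enum import Enum
from typing import Tuple


class RestrictPolicy(Enum):
    PROHIBIT = "prohibit"        # prohibit access outright
    VERIFY = "verify"            # one-time or repeated verification
    LIMIT_COUNT = "limit_count"  # limit the number of accesses


def execute_policy(
    policy: RestrictPolicy,
    *,
    verified: bool = False,
    remaining_quota: int = 0,
) -> Tuple[int, str]:
    """Return an (HTTP status, message) pair for a request identified as a crawler."""
    if policy is RestrictPolicy.PROHIBIT:
        return 403, "access prohibited"
    if policy is RestrictPolicy.VERIFY:
        if verified:
            return 200, "verification passed"
        return 403, "verification required"
    # LIMIT_COUNT: serve the request only while quota remains.
    if remaining_quota > 0:
        return 200, "allowed, quota left: %d" % (remaining_quota - 1)
    return 429, "access count limit reached"


blocked = execute_policy(RestrictPolicy.PROHIBIT)
challenge = execute_policy(RestrictPolicy.VERIFY)
throttled = execute_policy(RestrictPolicy.LIMIT_COUNT, remaining_quota=0)
```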

214. The front end code of the front end of the web page to be accessed by the external request is obfuscated using a code obfuscation technique.

To further protect the front-end code of the web page served by the website server, code obfuscation may be used to modify the structure and format of the front-end code before it is packaged and deployed, making the code hard for people to read and understand. For example, JavaScript code may be processed at build time by renaming identifiers (variables, functions, classes and so on), replacing constant values, and deleting comments and whitespace. This makes the source logic hard to follow, raises the difficulty of reverse engineering after a crawler obtains data, and improves code security. Code obfuscation is a powerful protective measure, but its benefits and costs need to be weighed, and it should be applied and adjusted according to the actual situation. As a further example, Unicode font mapping may be used: a custom font dynamically changes the correspondence between the characters in the page source and the glyphs actually displayed, so that the front-end content is visible but cannot be copied, which improves data security.
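The font mapping idea can be illustrated on the server side as below: sensitive characters are replaced with code points from the Unicode Private Use Area, and a matching custom font (not generated here) would draw the original glyphs for those code points. The function names, the use of a seed to vary the mapping, and the restriction to digits are assumptions made for this sketch.

```python
import random
from typing import Dict, Tuple

PUA_START = 0xE000  # first code point of the Basic Multilingual Plane Private Use Area


def build_font_mapping(chars: str, seed: int) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Assign each character a distinct private-use code point, varying with the seed."""
    rng = random.Random(seed)
    codepoints = rng.sample(range(PUA_START, PUA_START + 0x100), len(chars))
    encode = {ch: chr(cp) for ch, cp in zip(chars, codepoints)}
    decode = {obscured: ch for ch, obscured in encode.items()}
    return encode, decode


def obscure(text: str, encode: Dict[str, str]) -> str:
    """Replace mapped characters; the page source then no longer contains them."""
    return "".join(encode.get(ch, ch) for ch in text)


encode_map, decode_map = build_font_mapping("0123456789", seed=42)
page_text = obscure("Price: 1299", encode_map)

# Only a holder of the mapping (in practice, the custom font) can recover the digits.
restored = "".join(decode_map.get(ch, ch) for ch in page_text)
```

A crawler that copies `page_text` from the page source obtains private-use characters rather than the price, while a browser that loads the matching font still shows the digits to the user.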

215. And carrying out encryption processing on data transmission between the front end of the webpage and the back end of the webpage.

Because data is easily leaked if the traffic between the front-end code and the back-end code is transparent, this step encrypts data transmission between the front end and the back end of the web page. Encryption here means applying an encryption algorithm, for example symmetric or asymmetric encryption, to data requests between the front end and the back end of the web page.

216. Allowing response to external requests.

When it turns out in step 212 that the external request is not camouflaged by the crawler technique, this step allows the website to respond to the external request. For example, the external request carries information such as reply address and reply content, and this step sends the reply content that it needs to the reply address.

Therefore, this embodiment summarizes and analyzes the online behavior of real users, analyzes and displays user access behavior, and extracts general user feature data to distinguish crawler requests from normal requests. The points that real users focus on and dwell on inform the crawler identification standard and provide data support for supplementing and dynamically adjusting it, so the crawler identification standard and user access rights can be adjusted dynamically. This makes it harder for malicious visitors to find and exploit patterns in the rules, reduces the risk of website data leakage, and helps keep website data secure.

The foregoing describes embodiments of the method for limiting data crawling of the present application. The following describes embodiments of the restricted data crawling system of the present application. Referring to fig. 3, one embodiment of the restricted data crawling system includes:

a judging unit 301, configured to, when an external request is received, judge whether the external request meets a preset crawler identification criterion;

a limiting unit 302, configured to execute a preset access limiting policy if the external request meets the crawler identification criteria;

and the response unit 303 is configured to allow response to the external request if the external request does not meet the crawler identification criteria.

Operations performed by the restricted data crawling system are similar to those performed in the foregoing embodiment of fig. 1, and will not be described herein.

Referring to FIG. 4, another embodiment of a restricted data crawling system includes:

a judging unit 401, configured to, when an external request is received, judge whether the external request meets a preset crawler identification criterion;

a limiting unit 402, configured to execute a preset access limiting policy if the external request meets the crawler identification criteria;

and the response unit 403 is configured to allow response to the external request if the external request does not meet the criteria for crawler identification.

Optionally, when the determining unit 401 determines whether the external request meets a preset crawler identification criteria, the determining unit is specifically configured to:

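determining target request characteristic data carried in the external request;

and judging whether the target request characteristic data meets the preset crawler identification standard.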

Optionally, the system further comprises:

a collecting unit 404, configured to collect normal external requests sent by a plurality of verified real users, so as to obtain a normal external request set;

an extracting unit 405, configured to extract user feature data from the normal external request set, so as to obtain a user feature data set;

a classifying unit 406, configured to classify the user feature data set to obtain X types of user feature data diversity, where X is a positive integer greater than 0;

a selection unit 407, configured to select target user feature data of Y categories from the user feature data diversity as the request header parameter integrity requirement, where Y is a positive integer greater than 0 and less than or equal to X.

Optionally, the system further comprises:

an analysis unit 408, configured to perform access frequency analysis of the real user on the user feature data set, so as to obtain a real access frequency data set;

the selecting unit 407 is further configured to select a target access frequency in the real access frequency dataset as the access frequency upper limit.

Optionally, the system further comprises:

a display unit 409, configured to display the normal external request set, the user feature data set, and the X-class user feature data diversity;

and a receiving unit 410, configured to receive the preset standard input by the user.

Optionally, the system further comprises:

a confusion unit 411, configured to use a code confusion technique to perform confusion processing on a front end code of a front end of a web page to be accessed by the external request;

and the encryption unit 412 is configured to encrypt data transmission between the front end of the web page and the back end of the web page.

Operations performed by the restricted data crawling system are similar to those performed in the foregoing embodiment of fig. 2, and will not be described herein.

Turning now to the description of the computer device of the embodiments of the present application, referring to fig. 5, one embodiment of the computer device of the embodiments of the present application includes:

The computer device 500 may include one or more processors (central processing units, CPUs) 501 and a memory 502, with one or more applications or data stored in the memory 502. The memory 502 may be volatile storage or persistent storage. The program stored in the memory 502 may include one or more modules, each of which may include a series of instruction operations for the computer device. Further, the processor 501 may be configured to communicate with the memory 502 and to execute, on the computer device 500, the series of instruction operations stored in the memory 502. The computer device 500 may also include one or more wireless network interfaces 503, one or more input/output interfaces 504, and/or one or more operating systems, such as Windows Server, macOS, Unix, Linux, or FreeBSD. The processor 501 may perform the operations performed in the foregoing embodiments of fig. 1 or fig. 2, and details are not described again here.

In the several embodiments provided in the present application, those skilled in the art should understand that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated units may be implemented in the form of hardware or in the form of software functional units. If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

The foregoing describes only preferred embodiments of the present application and is not intended to limit it. Any modification, equivalent, or alternative made within the spirit and principles of the present application is intended to be covered.

Claims

1. A method of restricting crawling of data, comprising:

2. The method of claim 1, wherein determining whether the external request meets a preset crawler identification criteria comprises:

determining target request characteristic data carried in the external request;

3. The restricted data crawling method of claim 2, wherein the crawler identification criteria include an access frequency upper limit, request header parameter integrity, blacklist, preset criteria.

4. A method of restricting data crawling according to claim 3, said method further comprising:

5. The restricted data crawling method of claim 4, wherein after the user characteristic data set is obtained, the method further comprises:

6. The restricted data crawling method of claim 4, further comprising, after obtaining the diversity of the X categories of user characteristic data:

and receiving the preset standard input by the user.

7. The restricted data crawling method of claim 1, wherein prior to allowing response to the external request, the method further comprises:

8. A restricted data crawling system, comprising:

9. A computer device, comprising:

the memory stores a program;

the processor, when executing the program stored in the memory, implements the restricted data crawling method of any one of claims 1 to 7.

10. A computer readable storage medium having stored therein instructions which, when executed on a computer, cause the computer to perform the restricted data crawling method of any of claims 1 to 7.