CN109246070B - Anti-data crawling method

Info

Publication number: CN109246070B
Application number: CN201810689937.2A
Authority: CN
Inventors: 朱秀松; 程国艮
Original assignee: Global Tone Communication Technology Co ltd
Current assignee: Global Tone Communication Technology Co ltd
Priority date: 2018-06-28
Filing date: 2018-06-28
Publication date: 2021-04-30
Anticipated expiration: 2038-06-28
Also published as: CN109246070A

Abstract

The invention discloses an anti-data-crawling method. The method counts the number of interfaces called when a user views data. If the number of interfaces called when the user views a webpage is greater than or equal to a preset interface number threshold, the user is judged to be an ordinary user. If the number of interfaces called is smaller than the preset interface number threshold and the number of calls to an interface exceeds a preset call count threshold, the user is judged to be a crawler, and the user's IP address is added to the firewall and blocked. Because the method distinguishes normal users from crawlers by the pattern in which each accesses data interfaces, rather than by access volume alone, its accuracy is higher and its false-blocking rate is lower.

Description

Anti-data crawling method

Technical Field

The invention belongs to the technical field of information protection and network security, and particularly relates to a method for anti-data crawling.

Background

With the rapid growth of networks, the World Wide Web has become a carrier of vast amounts of information, and how to extract and use that information efficiently has become a major challenge. Search engines, as tools that help people retrieve information, have become the entry point and guide for users accessing the World Wide Web. However, general-purpose search engines have certain limitations, such as:

(1) Users in different fields and with different backgrounds often have different retrieval purposes and requirements, and the results returned by a general search engine contain many web pages the user does not care about.

(2) The goal of a general search engine is to maximize network coverage, and the conflict between limited search engine server resources and unlimited network data resources will be further exacerbated.

(3) As the data forms on the World Wide Web grow richer and network technology continues to develop, different kinds of data such as pictures, databases, and audio and video multimedia appear in large quantities; a general search engine often cannot handle such information-dense, structured data and cannot discover and obtain it well.

(4) Most general search engines provide keyword-based retrieval and have difficulty supporting queries based on semantic information.

To address these problems, focused crawlers, which fetch relevant web resources in a targeted way, have emerged. A focused crawler is a program that automatically downloads web pages; it selectively visits pages and related links on the World Wide Web according to a defined crawl target in order to obtain the required information. Unlike general-purpose crawlers, focused crawlers do not pursue broad coverage; instead they aim to fetch pages related to a particular topic, preparing data resources for topic-oriented user queries.

A web crawler is a program that automatically fetches web pages; it downloads pages from the World Wide Web for a search engine and is an important component of one. A traditional crawler starts from the URLs of one or more seed pages, obtains the URLs found on those pages, and, as it fetches pages, keeps extracting new URLs from the current page and adding them to a queue until a stop condition of the system is met. The workflow of a focused crawler is more complex: it filters out links irrelevant to the topic according to a web page analysis algorithm, keeps the useful links, and puts them into the queue of URLs to be fetched. It then selects the next URL from the queue according to a search strategy and repeats the process until a stop condition of the system is reached. In addition, all pages fetched by the crawler are stored by the system, analyzed, filtered, and indexed for later query and retrieval; for a focused crawler, the results of this analysis may also provide feedback and guidance for subsequent crawling.
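The queue-driven loop of a traditional crawler can be sketched as follows. This is a minimal illustration rather than anything specified by the patent: the in-memory `LINKS` graph stands in for real page fetching, and the names `crawl`, `LINKS`, and `max_pages` are hypothetical.

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP fetches:
# each URL maps to the URLs found on that page.
LINKS = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def crawl(seeds, max_pages=10):
    """Breadth-first crawl: take a URL from the queue, record it,
    and enqueue any links on that page that have not been seen yet."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    # Stop condition: the queue is empty or the page budget is used up.
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

pages = crawl(["a"])
```

A focused crawler would add a relevance filter before `queue.append(link)` and might replace the first-in-first-out queue with a priority queue; both are left out of this sketch.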

Anti-crawling is an information protection practice that uses various technical means to prevent others from harvesting a website's information in bulk. Traditional anti-crawling generally takes one of two approaches. In the first, the back end counts user access requests, and if the number of requests from a single User Agent (UA) exceeds a threshold, that User Agent is blocked. This approach stops crawlers very effectively, but false blocking is severe, and normal users' page visits are blocked as well. In the second, the back end counts user access requests, and if the number of requests from a single user session exceeds a threshold, that session is blocked. This approach looks feasible, but its anti-crawling effect is even worse: sessions can be obtained for free, so when a session is blocked the crawler simply requests a new one and continues crawling. In practice, therefore, this approach is not workable.
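The weaknesses of both traditional approaches can be illustrated with a small sketch. Everything here is an assumption made for illustration: the request records, the threshold of 10, and the helper name `blocked_keys` do not come from the patent.

```python
from collections import Counter

def blocked_keys(requests, key, threshold):
    """Return the keys (e.g. session IDs or User-Agent strings)
    whose request count exceeds the threshold."""
    counts = Counter(r[key] for r in requests)
    return {k for k, n in counts.items() if n > threshold}

# Hypothetical crawler: 50 requests, switching to a fresh session every 5.
crawler = [{"session": f"s{i // 5}", "ua": "Mozilla/5.0"} for i in range(50)]
# Ten ordinary users with the same browser User-Agent, 3 requests each.
users = [{"session": f"u{i}", "ua": "Mozilla/5.0"}
         for i in range(10) for _ in range(3)]

by_session = blocked_keys(crawler + users, "session", threshold=10)
by_ua = blocked_keys(crawler + users, "ua", threshold=10)
```

With this data, `by_session` is empty because the crawler never stays in one session long enough to cross the threshold, while `by_ua` contains the shared User-Agent string, so blocking it would also lock out all ten ordinary users.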

When a normal user browses a webpage, multiple data interfaces are called, and the user accesses all of them. A crawler, by contrast, picks out of the many data interfaces only the one carrying the data it wants, skips the interfaces it considers useless, and then crawls that single interface in bulk, continuously. The network traffic generated by one crawler fetching data in bulk is equivalent to that of hundreds of normal users. The page's access volume rises, the server's transmission speed falls, and because the crawler hammers a single data interface, that interface slows down and the load may even exceed the server's capacity, bringing the server down. Crawling can therefore slow data access for normal users and sometimes prevent them from reaching the information they need.

Disclosure of Invention

To solve problems in existing anti-crawling techniques such as false blocking and poor anti-crawling effect, the invention provides an anti-data-crawling method that judges whether a client initiating data access is a crawler by counting the number of interfaces called when a user views data. The method can greatly reduce the likelihood of false blocking and markedly improve the anti-crawling effect.

In order to achieve the above object, the present invention adopts the following technical solutions.

The method counts the number of interfaces called when a user views data. If the number of interfaces called when the user views a webpage is greater than or equal to a preset interface number threshold, the user is judged to be an ordinary user. If the number of interfaces called by the user is smaller than the preset interface number threshold and the number of calls to an interface exceeds a preset call count threshold, the user is judged to be a crawler, and the user's IP address is added to the firewall and blocked.

A method of anti-data crawling, the method comprising the steps of:

1) intercepting a user access request;

2) recording the IP address of the user;

3) recording a data interface accessed by a user;

4) calling log data, and counting the number of data interfaces called by the IP address of the user;

5) if the number of interfaces called by the user's IP address is greater than or equal to the preset interface number threshold, judging that the user is a normal user, performing no interception, and having the server return the response to the user as normal;

6) if the number of interfaces called by the user's IP address is smaller than the preset interface number threshold, counting the number of calls from the IP address to each data interface, and if the number of calls to a certain data interface exceeds the preset call count threshold, judging that the user is a crawler;

7) adding the crawler user's IP address to the firewall's list of banned IPs, thereby prohibiting that IP address from accessing the website.
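Steps 1) to 7) can be sketched roughly as follows. This is an illustrative reading, not the patent's implementation: the access log is assumed to be an in-memory list of (IP, interface) records, the firewall is represented by a plain set of banned IPs, the normal-user test uses "greater than or equal to" as in the abstract and claim 1, and the absolute `CALL_THRESHOLD` of 100 as well as the names `find_crawlers` and `log` are hypothetical.

```python
from collections import Counter, defaultdict

INTERFACE_THRESHOLD = 2   # the preferred value given in the description
CALL_THRESHOLD = 100      # assumed absolute call count, not from the source

def find_crawlers(log, interface_threshold=INTERFACE_THRESHOLD,
                  call_threshold=CALL_THRESHOLD):
    """log: iterable of (ip, interface) records from intercepted requests.
    Returns the set of IPs judged to be crawlers."""
    calls = defaultdict(Counter)              # ip -> {interface: call count}
    for ip, interface in log:                 # steps 2-4: record and count
        calls[ip][interface] += 1

    banned = set()
    for ip, per_interface in calls.items():
        # Step 5: enough distinct interfaces means a normal user.
        if len(per_interface) >= interface_threshold:
            continue
        # Step 6: few interfaces and one of them called too often.
        if any(n > call_threshold for n in per_interface.values()):
            banned.add(ip)                    # step 7: stand-in for the firewall list
    return banned

# Hypothetical log: one IP hammering a single interface, one normal
# visitor touching three interfaces, one light single-interface visitor.
log = (
    [("10.0.0.1", "/api/list")] * 150
    + [("10.0.0.2", "/api/list"), ("10.0.0.2", "/api/user"),
       ("10.0.0.2", "/api/ads")] * 60
    + [("10.0.0.3", "/api/list")] * 5
)
banned_ips = find_crawlers(log)
```

Here only 10.0.0.1 is banned: 10.0.0.2 makes more requests in total but spreads them over three interfaces, and 10.0.0.3 touches a single interface only a few times.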

Preferably, the interface number threshold is 2.

Preferably, the call count threshold is eighty percent, meaning that the ratio of the number of calls to the page interface to the number of calls to the other interfaces exceeds eighty percent. For example, suppose there are two interfaces, and one page view calls the first interface once and the second three times. A normal user's call counts (first_second) then grow as 2_6, 3_9, whereas a crawler's grow as 1_6, 1_9.
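One possible reading of the eighty percent threshold is that the most-called interface's share of all calls from an IP must exceed 0.8. The sketch below applies that reading to the example figures, treating each pair as (calls to the first interface)_(calls to the second interface). Both the reading and the helper name `dominant_share` are assumptions; the text does not define the ratio precisely.

```python
RATIO_THRESHOLD = 0.8  # "eighty percent" from the description

def dominant_share(counts):
    """Fraction of all calls that went to the most-called interface."""
    total = sum(counts)
    return max(counts) / total if total else 0.0

examples = [
    ("normal 2_6", (2, 6)),
    ("normal 3_9", (3, 9)),
    ("crawler 1_6", (1, 6)),
    ("crawler 1_9", (1, 9)),
]
for label, counts in examples:
    share = dominant_share(counts)
    print(label, round(share, 3), share > RATIO_THRESHOLD)
```

Under this reading the normal user stays at 0.75 however many pages are viewed, while the crawler's share rises to about 0.857 and then 0.9, crossing the threshold.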

The advantages and beneficial effects of the invention are as follows:

1) the method distinguishes normal users from crawlers by the pattern in which each accesses data interfaces, rather than by access volume alone, so its accuracy is higher and its false-blocking rate is lower;

2) the anti-crawler program runs on the server side, which makes it easy to manage and keeps the client responsive, improving the user experience;

3) the defense is keyed to the user's IP address, which makes it easy to see which crawler users have been banned and, if a ban turns out to be mistaken, easy to restore the user's access by IP address.

Detailed Description

The present invention will be further described with reference to the following examples.

Example 1

A method of anti-data crawling, comprising the steps of:

1) intercepting a user access request;

2) recording the IP address of the user;

3) recording a data interface accessed by a user;

5) if the number of interfaces called by the user's IP address is greater than or equal to 2, judging that the user is a normal user, performing no interception, and having the server return the response to the user as normal;

6) if the number of interfaces called by the user's IP address is 1, counting the number of calls from the IP address to that data interface, and if the calls to that data interface exceed the 80 percent threshold, judging that the user is a crawler.

Finally, it should be noted that the above examples are given only to illustrate the present invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived from the above remain within the scope of the invention.

Claims

1. A method of anti-data crawling, characterized in that: the method counts the number of interfaces called when a user views data; if the number of interfaces called when the user views a webpage is greater than or equal to a preset interface number threshold, the user is judged to be an ordinary user; if the number of interfaces called by the user is smaller than the preset interface number threshold and the number of calls to a certain interface exceeds a preset call count threshold, the user is judged to be a crawler, and the user's IP address is added to a firewall and blocked;

wherein the method comprises the steps of:

1) intercepting a user access request;

2) recording the IP address of the user;

3) recording a data interface accessed by a user;

2. A method of anti-data crawling in accordance with claim 1, wherein: the interface number threshold is 2.

3. A method of anti-data crawling in accordance with claim 1, wherein: the call count threshold is eighty percent, meaning that the ratio of the number of calls to the page interface to the number of calls to the other interfaces exceeds eighty percent.