CN109246070B - Anti-data crawling method - Google Patents
- Publication number
- CN109246070B
- Authority
- CN
- China
- Prior art keywords
- user
- data
- interface
- crawler
- address
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/1483—Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses an anti-data-crawling method. The method counts the number of interfaces called when a user views data. If the number of interfaces called when the user views a web page is greater than or equal to a preset interface count threshold, the user is judged to be a normal user. If the number of interfaces called is smaller than the preset interface count threshold and the number of calls to any one interface exceeds a preset call count threshold, the user is judged to be a crawler, and the user's IP address is added to the firewall and blocked. Because the method distinguishes normal users from crawlers by the pattern in which each accesses the data interfaces, rather than by access volume alone, its accuracy is higher and its false-blocking rate is lower.
Description
Technical Field
The invention belongs to the technical field of information protection and network security, and in particular relates to an anti-data-crawling method.
Background
With the rapid growth of networks, the world wide web has become a carrier of a large amount of information, and how to efficiently extract and utilize such information has become a great challenge. Search engines, as a tool to assist people in retrieving information, have become the entry and guide for users to access the world wide web. However, these general search engines also have certain limitations, such as:
(1) Users in different fields and with different backgrounds often have different retrieval purposes and requirements, and the results returned by a general search engine contain many web pages the user does not care about.
(2) The goal of a general search engine is to maximize network coverage, so the contradiction between limited search engine server resources and unlimited network data resources will be further exacerbated.
(3) The World Wide Web holds data in rich forms, and network technology keeps developing; pictures, databases, audio, video, and other multimedia data appear in large quantities, and general search engines are often unable to discover and retrieve such information-dense, structured data well.
(4) Most general search engines provide keyword-based retrieval and have difficulty supporting queries based on semantic information.
To address these problems, focused crawlers, which fetch relevant web resources in a targeted way, emerged. A focused crawler is a program that automatically downloads web pages; according to a defined crawling target, it selectively visits web pages and related links on the World Wide Web to obtain the required information. Unlike general-purpose crawlers, focused crawlers do not pursue broad coverage; instead they aim to fetch pages related to a specific topic, preparing data resources for topic-oriented user queries.
A web crawler is a program that automatically fetches web pages; it downloads pages from the World Wide Web for a search engine and is an important component of one. A traditional crawler starts from the URLs of one or more seed pages, obtains the URLs on those pages, and, as it fetches pages, continuously extracts new URLs from the current page and adds them to a queue until the system's stop conditions are met. The workflow of a focused crawler is more complex: it must filter out links unrelated to the topic using a web page analysis algorithm, keep the useful links, and place them in the queue of URLs to be fetched. It then selects the next URL from the queue according to a search strategy and repeats the process until a system stop condition is reached. In addition, all pages fetched by the crawler are stored by the system, analyzed and filtered, and indexed for later query and retrieval; for a focused crawler, the results of this analysis may also feed back into and guide subsequent crawling.
Anti-crawling is an information-protection practice that uses various technical means to prevent others from harvesting a website's information in bulk. Traditional anti-crawling generally takes one of two approaches. In the first, the back end counts user access requests and blocks a User Agent (UA) once the number of requests from that single UA exceeds a threshold. This blocks crawlers very effectively, but it causes severe false blocking: normal users' web page access is blocked as well. In the second, the back end counts user access requests and blocks a user session once the number of requests from that single session exceeds a threshold. This approach looks feasible, but its anti-crawling effect is worse: sessions can be obtained for free, so when one session is blocked the crawler simply requests a new one and carries on crawling. In practice, therefore, this approach is not workable.
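The weakness of both traditional approaches can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `blocked_keys` helper, the record fields `ua` and `session`, the threshold of 5, and the sample traffic are all hypothetical and do not come from the patent.

```python
from collections import Counter


def blocked_keys(requests, key, threshold):
    """Return the values of `key` whose request count exceeds `threshold`."""
    counts = Counter(request[key] for request in requests)
    return {value for value, count in counts.items() if count > threshold}


# Hypothetical traffic: a crawler that obtains a fresh session for every
# request, and a normal user who happens to share the same browser UA.
requests = (
    [{"ua": "Mozilla/5.0", "session": f"crawler-{i}"} for i in range(6)]
    + [{"ua": "Mozilla/5.0", "session": "alice"} for _ in range(4)]
)

# Approach 1: counting per User Agent blocks the shared UA, so the normal
# user is blocked together with the crawler (false blocking).
by_ua = blocked_keys(requests, "ua", threshold=5)

# Approach 2: counting per session blocks nobody, because the crawler never
# accumulates requests on a single session.
by_session = blocked_keys(requests, "session", threshold=5)
```

Under these assumptions, `by_ua` contains the shared User Agent and `by_session` is empty, which mirrors the two failure modes described above.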
When a normal user browses a web page, multiple data interfaces are called, and the normal user accesses all of them. A crawler, by contrast, picks out from the many data interfaces the one that holds the data it wants, skips the interfaces it considers useless, and then crawls that single interface in bulk, continuously. The network traffic generated by one crawler fetching data in bulk is equivalent to the visits of hundreds of normal users. As page traffic rises, the server's transmission speed drops; because the crawler is hammering a single data interface, that interface slows down, and the load may even exceed the server's capacity and bring the server down. Crawling can therefore slow data access for normal users, and at times prevents them from reaching the information they need.
Disclosure of Invention
To solve the problems of false blocking and poor anti-crawling effectiveness in existing anti-crawling techniques, the invention provides an anti-data-crawling method that judges whether the client initiating data access is a crawler by counting the number of interfaces called when a user views data. The method greatly reduces the likelihood of false blocking and markedly improves anti-crawling effectiveness.
In order to achieve the above object, the present invention adopts the following technical solutions.
The method counts the number of interfaces called when a user views data. If the number of interfaces called when the user views a web page is greater than or equal to a preset interface count threshold, the user is judged to be a normal user. If the number of interfaces called is smaller than the preset interface count threshold and the number of calls to any one interface exceeds a preset call count threshold, the user is judged to be a crawler, and the user's IP address is added to the firewall and blocked.
A method of anti-data crawling, the method comprising the steps of:
1) intercepting a user access request;
2) recording the IP address of the user;
3) recording a data interface accessed by a user;
4) reading the log data and counting the number of data interfaces called by the user's IP address;
5) if the number of interfaces called by the user's IP address is greater than or equal to the preset interface count threshold, judging the user to be a normal user, performing no interception, and letting the server return the response to the user as normal;
6) if the number of interfaces called by the user's IP address is smaller than the preset interface count threshold, counting the number of calls the IP address makes to each data interface, and if the number of calls to any one data interface exceeds the preset call count threshold, judging the user to be a crawler;
7) adding the crawler's IP address to the firewall's blocked-IP list, thereby denying that IP address access to the website.
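Steps 4) to 6) can be sketched roughly as follows. This is an illustrative reading, not the patented implementation: the log is assumed to be a list of `(ip, interface)` pairs, the function name `find_crawler_ips` is hypothetical, and the call count threshold of 100 is an arbitrary placeholder, since the text leaves that value open at this point.

```python
from collections import Counter, defaultdict

INTERFACE_COUNT_THRESHOLD = 2  # preferred value stated in the text
CALL_COUNT_THRESHOLD = 100     # assumed placeholder, not from the patent


def find_crawler_ips(log_records,
                     interface_threshold=INTERFACE_COUNT_THRESHOLD,
                     call_threshold=CALL_COUNT_THRESHOLD):
    """Classify IP addresses from (ip, interface) log records.

    An IP that called at least `interface_threshold` distinct interfaces is
    treated as a normal user. Otherwise it is flagged as a crawler if any
    single interface was called more than `call_threshold` times.
    """
    calls_by_ip = defaultdict(Counter)
    for ip, interface in log_records:
        calls_by_ip[ip][interface] += 1

    crawlers = set()
    for ip, calls in calls_by_ip.items():
        if len(calls) >= interface_threshold:
            continue  # step 5: enough distinct interfaces, normal user
        if any(count > call_threshold for count in calls.values()):
            crawlers.add(ip)  # step 6: one interface called excessively
    return crawlers


# Hypothetical log: one normal user, one crawler, one light single-page visitor.
log_records = (
    [("192.0.2.10", "/api/list"),
     ("192.0.2.10", "/api/detail"),
     ("192.0.2.10", "/api/comments")] * 50
    + [("203.0.113.9", "/api/detail")] * 150
    + [("198.51.100.7", "/api/detail")] * 3
)

crawler_ips = find_crawler_ips(log_records)
```

With this sample log only `203.0.113.9` is flagged: the first address calls three distinct interfaces, and the third calls a single interface only three times.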
Preferably, the interface number threshold is 2.
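Preferably, the call count threshold is eighty percent; that is, the threshold is exceeded when the proportion of calls to one page interface relative to the other interfaces exceeds eighty percent. For example, suppose a page uses two interfaces, and a single page view calls the first interface once and the second three times. Writing the counts as first_second, a normal user's counts grow together, as 2_6 and then 3_9, whereas a crawler's counts grow as 1_6 and then 1_9: the first interface stays at one call while the second keeps climbing.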
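The eighty-percent figure can be checked against the example numbers under one possible reading, which is an assumption here: that the "ratio" is the share of an IP address's total calls that go to its most-called interface. The function names `max_interface_share` and `exceeds_ratio` are hypothetical.

```python
def max_interface_share(calls):
    """Share of all calls that went to the most-called interface (0.0 if none)."""
    total = sum(calls.values())
    return max(calls.values()) / total if total else 0.0


def exceeds_ratio(calls, ratio_threshold=0.8):
    """True when the most-called interface exceeds the assumed share threshold."""
    return max_interface_share(calls) > ratio_threshold


# Call counts taken from the example in the text, written as first_second.
normal_2_6 = {"first": 2, "second": 6}   # share of second: 6/8  = 0.75
normal_3_9 = {"first": 3, "second": 9}   # share of second: 9/12 = 0.75
crawler_1_6 = {"first": 1, "second": 6}  # share of second: 6/7, about 0.857
crawler_1_9 = {"first": 1, "second": 9}  # share of second: 9/10 = 0.9

flags = [exceeds_ratio(c) for c in (normal_2_6, normal_3_9, crawler_1_6, crawler_1_9)]
```

Under this reading the normal user stays at 75 percent, below the threshold, while the crawler's share rises above 80 percent, so `flags` separates the two groups.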
The invention has the advantages and beneficial effects that:
1) the method distinguishes normal users from crawlers by the pattern in which each accesses the data interfaces, rather than by access volume alone, so accuracy is higher and the false-blocking rate is lower;
2) the anti-crawler program runs on the server side, which makes it easy to manage and also keeps the client responsive, improving the user experience;
3) crawlers are blocked by user IP address, which makes it easy to see which crawler users have been blocked and, if a user is blocked by mistake, easy to restore that user's access by IP address.
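The third advantage, that an IP-based block is easy to inspect and easy to undo, can be pictured with a small in-memory block list. This is only a sketch: the `IpBlockList` class is hypothetical, and a real deployment would push these entries to the firewall, which the text does not detail.

```python
import ipaddress


class IpBlockList:
    """In-memory stand-in for the firewall's blocked-IP list (illustrative only)."""

    def __init__(self):
        self._blocked = set()

    def ban(self, ip):
        # ip_address() raises ValueError for malformed input.
        self._blocked.add(ipaddress.ip_address(ip))

    def unban(self, ip):
        # Restoring access after a mistaken block is a single removal.
        self._blocked.discard(ipaddress.ip_address(ip))

    def is_blocked(self, ip):
        return ipaddress.ip_address(ip) in self._blocked

    def banned(self):
        # Sorted listing, so blocked crawlers can be reviewed at a glance.
        return [str(ip) for ip in sorted(self._blocked)]


blocklist = IpBlockList()
blocklist.ban("203.0.113.9")     # judged to be a crawler
blocklist.ban("198.51.100.7")    # blocked by mistake
blocklist.unban("198.51.100.7")  # access restored by IP address
```

After these calls only `203.0.113.9` remains blocked, and `blocklist.banned()` lists exactly that address.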
Detailed Description
The present invention will be further described with reference to the following examples.
Example 1
A method of anti-data crawling, comprising the steps of:
1) intercepting a user access request;
2) recording the IP address of the user;
3) recording a data interface accessed by a user;
4) reading the log data and counting the number of data interfaces called by the user's IP address;
5) if the number of interfaces called by the user's IP address is greater than or equal to 2, judging the user to be a normal user, performing no interception, and letting the server return the response to the user as normal;
6) if the number of interfaces called by the user's IP address is 1, counting the number of calls the IP address makes to that data interface, and if the proportion of calls to that data interface exceeds 80 percent, judging the user to be a crawler;
7) adding the crawler's IP address to the firewall's blocked-IP list, thereby denying that IP address access to the website.
Finally, it should be noted that the above examples are given only to illustrate the invention clearly and do not limit its embodiments. Other variations and modifications will be apparent to those skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Obvious variations or modifications derived from the above remain within the scope of protection of the invention.
Claims (3)
1. An anti-data-crawling method, characterized in that: the method counts the number of interfaces called when a user views data; if the number of interfaces called when the user views a web page is greater than or equal to a preset interface count threshold, the user is judged to be a normal user; if the number of interfaces called by the user is smaller than the preset interface count threshold and the number of calls to a certain interface exceeds a preset call count threshold, the user is judged to be a crawler, and the user's IP address is added to the firewall and blocked;
wherein the method comprises the steps of:
1) intercepting a user access request;
2) recording the IP address of the user;
3) recording a data interface accessed by a user;
4) reading the log data and counting the number of data interfaces called by the user's IP address;
5) if the number of interfaces called by the user's IP address is greater than or equal to the preset interface count threshold, judging the user to be a normal user, performing no interception, and letting the server return the response to the user as normal;
6) if the number of interfaces called by the user's IP address is smaller than the preset interface count threshold, counting the number of calls the IP address makes to each data interface, and if the number of calls to a certain data interface exceeds the preset call count threshold, judging the user to be a crawler;
7) adding the crawler's IP address to the firewall's blocked-IP list, thereby denying that IP address access to the website.
2. The anti-data-crawling method according to claim 1, wherein the interface count threshold is 2.
3. The anti-data-crawling method according to claim 1, wherein the call count threshold is eighty percent, meaning that the proportion of calls to the page interface relative to the other interfaces exceeds eighty percent.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810689937.2A CN109246070B (en) | 2018-06-28 | 2018-06-28 | Anti-data crawling method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810689937.2A CN109246070B (en) | 2018-06-28 | 2018-06-28 | Anti-data crawling method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109246070A (en) | 2019-01-18 |
CN109246070B (en) | 2021-04-30 |
Family
ID=65072168
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810689937.2A Active CN109246070B (en) | 2018-06-28 | 2018-06-28 | Anti-data crawling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109246070B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110035068B (en) * | 2019-03-14 | 2021-10-01 | 微梦创科网络科技(中国)有限公司 | Sealing forbidding method and device for anti-grabbing station system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102880698A (en) * | 2012-09-21 | 2013-01-16 | 新浪网技术(中国)有限公司 | Method and device for determining caught website |
CN106254537A (en) * | 2016-09-22 | 2016-12-21 | 北京小米移动软件有限公司 | Interface interchange method and apparatus |
CN106411639A (en) * | 2016-09-18 | 2017-02-15 | 合网络技术(北京)有限公司 | Method and system for monitoring access data |
CN107105071A (en) * | 2017-05-05 | 2017-08-29 | 北京京东金融科技控股有限公司 | IP call methods and device, storage medium, electronic equipment |
CN107392022A (en) * | 2017-07-20 | 2017-11-24 | 北京小度信息科技有限公司 | Reptile identification, processing method and relevant apparatus |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10242102B2 (en) * | 2014-12-29 | 2019-03-26 | Samsung Electronics Co., Ltd. | Network crawling prioritization |
Also Published As
Publication number | Publication date |
---|---|
CN109246070A (en) | 2019-01-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yu et al. | Summary of web crawler technology research | |
CN103179132B (en) | A kind of method and device detecting and defend CC attack | |
US7827191B2 (en) | Discovering web-based multimedia using search toolbar data | |
CN109688097A (en) | Website protection method, website protective device, website safeguard and storage medium | |
CN103428224B (en) | A kind of method and apparatus of intelligence defending DDoS (Distributed Denial of Service) attacks | |
CN104978408A (en) | Berkeley DB database based topic crawler system | |
US8510443B2 (en) | Real-time harmful website blocking method using object attribute access engine | |
CN106656922A (en) | Flow analysis based protective method and device against network attack | |
CN107341160A (en) | A kind of method and device for intercepting reptile | |
CN109729044B (en) | Universal internet data acquisition reverse-crawling system and method | |
CN101350822A (en) | Method for discovering and tracing Internet malevolence code | |
CN103177005A (en) | Processing method and system of data access | |
EP2719153A1 (en) | Model-based method for managing information derived from network traffic | |
CN107341395A (en) | A kind of method for intercepting reptile | |
CN109768992A (en) | Webpage malicious scanning processing method and device, terminal device, readable storage medium storing program for executing | |
CN109600385B (en) | Access control method and device | |
CN103984743B (en) | A kind of method and device of managing internal memory resource | |
CN103559203A (en) | Method, device and system for web page sorting | |
CN113518077A (en) | Malicious web crawler detection method, device, equipment and storage medium | |
CN107784113A (en) | Html web page collecting method, device and computer-readable recording medium | |
CN109246070B (en) | Anti-data crawling method | |
CN107743128A (en) | It is a kind of that domain name and the illegal website method for digging with service IP are associated based on homepage | |
CN116389156A (en) | Real-time abnormality detection system in large-data-volume environment | |
CN111444412B (en) | Method and device for scheduling web crawler tasks | |
CN103268347A (en) | System and method for mobile internet searching system based on messages |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||