CN109246070B - Anti-data crawling method - Google Patents

Publication number
CN109246070B
Authority
CN
China
Prior art keywords
user
data
interface
crawler
address
Prior art date
Legal status
Active
Application number
CN201810689937.2A
Other languages
Chinese (zh)
Other versions
CN109246070A (en)
Inventor
朱秀松
程国艮
Current Assignee
Global Tone Communication Technology Co ltd
Original Assignee
Global Tone Communication Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Global Tone Communication Technology Co ltd
Priority to CN201810689937.2A
Publication of CN109246070A
Application granted
Publication of CN109246070B
Current legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic: service impersonation, e.g. phishing, pharming or web spoofing

Abstract

The invention discloses an anti-data-crawling method. The method counts the number of interfaces called when a user views data. If the number of interfaces called when the user views a webpage is greater than or equal to a preset interface number threshold, the user is judged to be an ordinary visitor. If the number of interfaces called by the user is less than the preset interface number threshold and the number of calls to any one interface exceeds a preset call-count threshold, the user is judged to be a crawler, and the user's IP address is added to the firewall and blocked. Because the method distinguishes normal users from crawlers by the pattern in which each accesses data interfaces, rather than by user access volume alone, its accuracy is higher and its false-block rate is lower.

Description

Anti-data crawling method
Technical Field
The invention belongs to the technical field of information protection and network security, and particularly relates to a method for anti-data crawling.
Background
With the rapid growth of networks, the world wide web has become a carrier of a large amount of information, and how to efficiently extract and utilize such information has become a great challenge. Search engines, as a tool to assist people in retrieving information, have become the entry and guide for users to access the world wide web. However, these general search engines also have certain limitations, such as:
(1) Users in different fields and with different backgrounds often have different retrieval purposes and requirements, and the results returned by a general search engine contain a large number of web pages that the user does not care about.
(2) The goal of a general search engine is to maximize network coverage, so the conflict between limited search engine server resources and unlimited network data resources will intensify further.
(3) As the forms of data on the World Wide Web grow richer and network technology continues to develop, different kinds of data such as pictures, databases, audio, and video appear in large quantities, and a general search engine is often unable to discover and obtain such information-dense, structured data effectively.
(4) Most general search engines provide keyword-based retrieval and have difficulty supporting queries based on semantic information.
To solve these problems, the focused crawler, which fetches relevant web resources in a targeted way, emerged. A focused crawler is a program that automatically downloads web pages; according to a defined crawling target, it selectively visits web pages and related links on the World Wide Web to obtain the required information. Unlike a general-purpose crawler, a focused crawler does not pursue broad coverage; its goal is to crawl web pages related to a particular topic, preparing data resources for topic-oriented user queries.
A web crawler is a program that automatically extracts web pages; it downloads pages from the World Wide Web for a search engine and is an important component of the search engine. A traditional crawler starts from the URLs of one or more initial web pages, obtains the URLs on those pages, and, while crawling, continuously extracts new URLs from the current page and adds them to a queue until a stop condition of the system is met. The workflow of a focused crawler is more complex: it must filter out links irrelevant to the topic according to a web page analysis algorithm, keep the useful links, and place them in a queue of URLs waiting to be crawled. It then selects the next URL from the queue according to a search strategy and repeats the process until a stop condition of the system is reached. In addition, all pages fetched by the crawler are stored by the system, analyzed and filtered, and indexed for later query and retrieval; for a focused crawler, the results of this analysis may also provide feedback and guidance for subsequent crawling.
Anti-crawling is an information protection practice that uses various technical means to prevent others from obtaining a website's information in bulk. Traditional anti-crawling generally takes one of two approaches. In the first, the back end counts user access requests and blocks a User Agent (UA) if the number of requests from that single UA exceeds a threshold. This approach stops crawlers very effectively, but it suffers from severe false blocking and cuts off users' normal web access. In the second, the back end counts user access requests and blocks a user session if the number of requests from that single session exceeds a threshold. This approach looks feasible but is even less effective against crawlers: sessions can be obtained for free, so when a session is blocked the crawler can simply obtain a new one at no cost and continue crawling. In practice, therefore, this approach does not work.
When a normal user browses a web page, multiple data interfaces are called, and the normal user accesses all of them. A crawler, by contrast, picks out of the many data interfaces only the one holding the data it wants, skips the interfaces it considers useless, and then crawls that single interface in bulk continuously. The network traffic generated by one crawler fetching data in bulk is equivalent to the visits of hundreds of normal users. The volume of page requests rises and the server's transmission speed falls; because the crawler fetches a single data interface in bulk, that interface slows down, and the load may even exceed the server's capacity and bring the server down. Crawling can therefore slow normal users' access to data, and sometimes prevents them from reaching the information they need.
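The traditional crawl loop described above can be sketched as a breadth-first traversal over a URL queue. This is only an illustration and is not taken from the patent: the in-memory `LINKS` graph stands in for fetching and parsing real pages, and `crawl`, `max_pages`, `is_relevant`, and the example URLs are hypothetical names chosen for the sketch.

```python
from collections import deque

# Hypothetical link graph standing in for real page fetches:
# each URL maps to the URLs found on that page.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(seeds, max_pages=100, is_relevant=lambda url: True):
    """Visit pages breadth-first, starting from the seed URLs.

    is_relevant models a focused crawler's link filter; the default keeps
    every link, which corresponds to a general-purpose crawler.
    """
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:   # stop condition
        url = queue.popleft()
        visited.append(url)
        for link in LINKS.get(url, []):         # extract new URLs from the page
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return visited


order = crawl(["http://example.com/"])
focused = crawl(["http://example.com/"], is_relevant=lambda u: not u.endswith("/c"))
```

Passing a stricter `is_relevant` function is where a focused crawler would apply its topic filter before a link ever enters the queue.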
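The weaknesses of both traditional approaches can be seen in a small simulation. The sketch below is an assumption-laden illustration, not code from the patent: `RequestCounter`, the threshold of 5, and the session naming are all hypothetical.

```python
from collections import Counter


class RequestCounter:
    """Block any key (a UA string or a session ID) whose request count exceeds a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()
        self.blocked = set()

    def allow(self, key):
        if key in self.blocked:
            return False
        self.counts[key] += 1
        if self.counts[key] > self.threshold:
            self.blocked.add(key)
            return False
        return True


# Approach 1: many real users share one User-Agent string,
# so once the count passes the threshold they are all blocked together.
by_ua = RequestCounter(threshold=5)
ua_results = [by_ua.allow("Mozilla/5.0") for _ in range(8)]

# Approach 2: a crawler that obtains a fresh session whenever it is
# blocked loses only one request per block and keeps crawling.
by_session = RequestCounter(threshold=5)
session_id, served = 0, 0
for _ in range(100):
    if by_session.allow(f"session-{session_id}"):
        served += 1
    else:
        session_id += 1   # a new session costs the crawler nothing
```

In the first case every request after the fifth is refused regardless of who sent it; in the second, most of the crawler's 100 requests are still served.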
Disclosure of Invention
To solve the problems of false blocking and poor anti-crawling effectiveness in existing anti-crawling techniques, the invention provides an anti-data-crawling method that determines whether a client initiating data access is a crawler by counting the number of interfaces called when a user views data. The method can greatly reduce the likelihood of false blocking and significantly improve anti-crawling effectiveness.
In order to achieve the above object, the present invention adopts the following technical solutions.
The method counts the number of interfaces called when a user views data. If the number of interfaces called when the user views a webpage is greater than or equal to a preset interface number threshold, the user is judged to be an ordinary visitor. If the number of interfaces called by the user is less than the preset interface number threshold and the number of calls to any one interface exceeds a preset call-count threshold, the user is judged to be a crawler, and the user's IP address is added to the firewall and blocked.
A method of anti-data crawling, the method comprising the steps of:
1) intercepting a user access request;
2) recording the IP address of the user;
3) recording a data interface accessed by a user;
4) retrieving the log data and counting the number of data interfaces called by the user's IP address;
5) if the number of interfaces called by the user's IP address is greater than or equal to the preset interface number threshold, judging the user to be a normal user, performing no interception, and having the server return the response to the user as normal;
6) if the number of interfaces called by the user's IP address is less than the preset interface number threshold, counting the number of calls the IP address makes to each data interface, and, if the number of calls to any one data interface exceeds the preset call-count threshold, judging the user to be a crawler;
7) adding the crawler user's IP address to the firewall's banned IP list, thereby denying that IP address access to the website.
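Steps 4 to 7 can be sketched as a pass over a request log. This is a minimal sketch under stated assumptions, not the patented implementation: the log is modeled as a list of `(ip, interface)` pairs, the function names are hypothetical, the absolute `call_threshold` of 20 is an invented placeholder, and the `"undecided"` label covers a case the text does not address (few interfaces, but no interface called often enough).

```python
from collections import Counter


def classify_ip(log, ip, interface_threshold, call_threshold):
    """Apply steps 4-6 to one IP address.

    log is a list of (ip, interface) pairs recorded in steps 2-3.
    Returns "normal", "crawler", or "undecided" (the last is an assumption).
    """
    # Step 4: count calls per data interface for this IP.
    calls = Counter(iface for entry_ip, iface in log if entry_ip == ip)
    # Step 5: uses >= as in the abstract and Example 1.
    if len(calls) >= interface_threshold:
        return "normal"
    # Step 6: few interfaces, and one of them is called too often.
    if any(n > call_threshold for n in calls.values()):
        return "crawler"
    return "undecided"


def update_blocklist(log, blocklist, interface_threshold=2, call_threshold=20):
    """Step 7: add every IP classified as a crawler to the blocklist set."""
    for ip in {entry_ip for entry_ip, _ in log}:
        if classify_ip(log, ip, interface_threshold, call_threshold) == "crawler":
            blocklist.add(ip)
    return blocklist


# A browser-like client touching three interfaces, and a client
# requesting a single interface 30 times.
log = [
    ("192.0.2.1", "/api/list"),
    ("192.0.2.1", "/api/detail"),
    ("192.0.2.1", "/api/user"),
] + [("203.0.113.9", "/api/detail")] * 30

blocked = update_blocklist(log, set())
```

Here only `203.0.113.9` ends up in `blocked`; in a deployment the set would be pushed to the firewall's banned IP list, which the sketch leaves out.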
Preferably, the interface number threshold is 2.
Preferably, the call-count threshold is eighty percent; that is, the ratio of the number of calls to one page interface relative to the other interfaces exceeds eighty percent. For example, suppose a page has two interfaces, and one page view calls the first interface once and the second three times. A normal user's call counts then grow in step, to (2, 6) and then (3, 9), whereas a crawler's counts grow on only one interface, to (1, 6) and then (1, 9).
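The text does not spell out how the eighty percent figure is computed. One reading that fits the example is to treat each pair such as "2_6" as the call counts of the first and second interface, and to compare the busiest interface's share of all calls against 0.80. The sketch below works through the arithmetic under that assumption; `interface_share` and `THRESHOLD` are hypothetical names.

```python
def interface_share(counts):
    """Return the largest share of total calls taken by a single interface."""
    total = sum(counts)
    return max(counts) / total if total else 0.0


THRESHOLD = 0.80  # the preferred call-count threshold, read as a share of all calls

samples = {
    "normal (2, 6)": (2, 6),
    "normal (3, 9)": (3, 9),
    "crawler (1, 6)": (1, 6),
    "crawler (1, 9)": (1, 9),
}

for label, counts in samples.items():
    share = interface_share(counts)
    verdict = "crawler" if share > THRESHOLD else "normal"
    print(f"{label}: share = {share:.2f} -> {verdict}")
```

Under this reading the normal user stays at a share of 0.75 however many pages are viewed, while the crawler's share is 6/7 and then 9/10, both above the threshold.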
The invention has the advantages and beneficial effects that:
1) the method distinguishes normal users from crawlers by the pattern in which users and crawlers access data interfaces, rather than by user access volume alone, so its accuracy is higher and its false-block rate is lower;
2) the anti-crawler program runs on the server side, which makes it easier to manage and also keeps the client responsive, improving the user experience;
3) crawlers are blocked by user IP address, so the banned crawler users can be identified more directly, and if a user is banned by mistake, access is easy to restore based on the IP address.
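The third advantage, that an IP-based ban is easy to inspect and to undo, can be illustrated with a small in-memory list. This is a sketch only: `IpBlocklist` and its methods are hypothetical, and a real deployment would synchronize the list with the firewall, which is not shown.

```python
import ipaddress


class IpBlocklist:
    """In-memory stand-in for the firewall's banned IP list."""

    def __init__(self):
        self._blocked = {}   # parsed IP address -> reason for the ban

    def block(self, ip, reason):
        # ip_address() raises ValueError for strings that are not valid IPs.
        self._blocked[ipaddress.ip_address(ip)] = reason

    def unblock(self, ip):
        """Restore access; returns True if the IP had been blocked."""
        return self._blocked.pop(ipaddress.ip_address(ip), None) is not None

    def is_blocked(self, ip):
        return ipaddress.ip_address(ip) in self._blocked

    def listing(self):
        """Banned addresses with reasons, for review by an operator."""
        return sorted((str(ip), reason) for ip, reason in self._blocked.items())


bans = IpBlocklist()
bans.block("203.0.113.9", "single interface called in bulk")
bans.block("198.51.100.7", "single interface called in bulk")
restored = bans.unblock("198.51.100.7")   # undo a mistaken ban
```

Because each entry is keyed by one address, reviewing `listing()` and calling `unblock` on a wrongly banned IP is a single lookup.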
Detailed Description
The present invention will be further described with reference to the following examples.
Example 1
A method of anti-data crawling, comprising the steps of:
1) intercepting a user access request;
2) recording the IP address of the user;
3) recording a data interface accessed by a user;
4) retrieving the log data and counting the number of data interfaces called by the user's IP address;
5) if the number of interfaces called by the user's IP address is greater than or equal to 2, judging the user to be a normal user, performing no interception, and having the server return the response to the user as normal;
6) if the number of interfaces called by the user's IP address is 1, counting the number of calls the IP address makes to that data interface, and, if the calls to that data interface exceed 80 percent, judging the user to be a crawler;
7) adding the crawler user's IP address to the firewall's banned IP list, thereby denying that IP address access to the website.
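The embodiment can also be sketched as a guard that inspects each request as it arrives rather than reading a log afterwards. The sketch makes several assumptions that are not in the patent: the class and method names, the HTTP-style status codes, and in particular `MAX_CALLS`, an invented absolute limit used in place of the 80 percent figure because the text does not say how that percentage is evaluated when only one interface has been called.

```python
from collections import Counter, defaultdict


class AntiCrawlGuard:
    INTERFACE_THRESHOLD = 2   # Example 1, step 5
    MAX_CALLS = 20            # hypothetical stand-in for the 80 percent rule

    def __init__(self):
        self.calls = defaultdict(Counter)   # ip -> interface -> call count
        self.blocked = set()                # stand-in for the firewall list

    def handle(self, ip, interface):
        """Process one intercepted request and return an HTTP-style status."""
        if ip in self.blocked:                      # already banned (step 7)
            return 403
        self.calls[ip][interface] += 1              # steps 2-4
        per_interface = self.calls[ip]
        if len(per_interface) >= self.INTERFACE_THRESHOLD:
            return 200                              # step 5: normal user
        if per_interface[interface] > self.MAX_CALLS:
            self.blocked.add(ip)                    # steps 6-7: crawler
            return 403
        return 200


guard = AntiCrawlGuard()

# A browser-like client that loads three interfaces per page view.
browser = [
    guard.handle("192.0.2.1", iface)
    for _ in range(30)
    for iface in ("/api/list", "/api/detail", "/api/user")
]

# A client that requests only one interface, 25 times in a row.
crawler = [guard.handle("203.0.113.9", "/api/detail") for _ in range(25)]
```

In this run every browser request is served, while the single-interface client is served 20 times and refused from its 21st request onward.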
Finally, it should be noted that the above examples are given only to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to those skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived from the above remain within the scope of the invention.

Claims (3)

1. A method of anti-data crawling, characterized in that: the method is based on counting the number of interfaces called when a user views data; if the number of interfaces called when the user views a webpage is greater than or equal to a preset interface number threshold, the user is judged to be an ordinary visitor; if the number of interfaces called by the user is less than the preset interface number threshold and the number of calls to any one interface exceeds a preset call-count threshold, the user is judged to be a crawler, and the user's IP address is added to the firewall and blocked;
wherein the method comprises the steps of:
1) intercepting a user access request;
2) recording the IP address of the user;
3) recording a data interface accessed by a user;
4) retrieving the log data and counting the number of data interfaces called by the user's IP address;
5) if the number of interfaces called by the user's IP address is greater than or equal to the preset interface number threshold, judging the user to be a normal user, performing no interception, and having the server return the response to the user as normal;
6) if the number of interfaces called by the user's IP address is less than the preset interface number threshold, counting the number of calls the IP address makes to each data interface, and, if the number of calls to any one data interface exceeds the preset call-count threshold, judging the user to be a crawler;
7) adding the crawler user's IP address to the firewall's banned IP list, thereby denying that IP address access to the website.
2. A method of anti-data crawling in accordance with claim 1, wherein: the interface number threshold is 2.
3. A method of anti-data crawling in accordance with claim 1, wherein: the call-count threshold is eighty percent, meaning that the ratio of the number of calls to one page interface relative to the other interfaces exceeds eighty percent.
CN201810689937.2A 2018-06-28 2018-06-28 Anti-data crawling method Active CN109246070B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810689937.2A CN109246070B (en) 2018-06-28 2018-06-28 Anti-data crawling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810689937.2A CN109246070B (en) 2018-06-28 2018-06-28 Anti-data crawling method

Publications (2)

Publication Number Publication Date
CN109246070A CN109246070A (en) 2019-01-18
CN109246070B CN109246070B (en) 2021-04-30

Family

ID=65072168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810689937.2A Active CN109246070B (en) 2018-06-28 2018-06-28 Anti-data crawling method

Country Status (1)

Country Link
CN (1) CN109246070B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110035068B (en) * 2019-03-14 2021-10-01 微梦创科网络科技(中国)有限公司 Sealing forbidding method and device for anti-grabbing station system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880698A (en) * 2012-09-21 2013-01-16 新浪网技术(中国)有限公司 Method and device for determining caught website
CN106254537A (en) * 2016-09-22 2016-12-21 北京小米移动软件有限公司 Interface interchange method and apparatus
CN106411639A (en) * 2016-09-18 2017-02-15 合网络技术(北京)有限公司 Method and system for monitoring access data
CN107105071A (en) * 2017-05-05 2017-08-29 北京京东金融科技控股有限公司 IP call methods and device, storage medium, electronic equipment
CN107392022A (en) * 2017-07-20 2017-11-24 北京小度信息科技有限公司 Reptile identification, processing method and relevant apparatus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10242102B2 (en) * 2014-12-29 2019-03-26 Samsung Electronics Co., Ltd. Network crawling prioritization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant