CN109246070A - A method for preventing data crawling - Google Patents

Publication number
CN109246070A
Authority
CN
China
Prior art keywords
user
data
interface
threshold value
address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810689937.2A
Other languages
Chinese (zh)
Other versions
CN109246070B (en)
Inventor
朱秀松
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese Translation Language Through Polytron Technologies Inc
Original Assignee
Chinese Translation Language Through Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese Translation Language Through Polytron Technologies Inc
Priority to CN201810689937.2A
Publication of CN109246070A
Application granted
Publication of CN109246070B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic: service impersonation, e.g. phishing, pharming or web spoofing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a method for preventing data crawling. The method counts the number of interfaces a user calls when viewing data. If the number of interfaces called when the user views a web page is greater than or equal to a preset interface count threshold, the user is judged to be an ordinary visiting user. If the number of interfaces the user calls is less than the preset interface count threshold, and the number of calls to each interface exceeds a preset call count threshold, the user is judged to be a crawler, and the user's IP is added to the firewall and blocked. The method distinguishes normal users from crawler behavior based on the patterns with which users and crawlers access data interfaces, rather than on user access volume alone, so its accuracy is higher and its false-block rate is lower.

Description

A method for preventing data crawling
Technical field
The invention belongs to the technical field of information protection and network security, and in particular relates to a method for preventing data crawling.
Background art
With the rapid development of the network, the World Wide Web has become the carrier of a vast amount of information, and efficiently extracting and using that information has become a major challenge. Search engines, as tools that help people retrieve information, have become users' entrance and guide to the World Wide Web. However, general-purpose search engines have certain limitations, for example:
(1) Users in different fields and with different backgrounds often have different retrieval purposes and needs, and the results returned by a general-purpose search engine include a large number of web pages that the user does not care about.
(2) The goal of a general-purpose search engine is the widest possible network coverage, so the contradiction between limited search engine server resources and unlimited network data resources will deepen further.
(3) As the forms of data on the World Wide Web grow richer and network technology continues to develop, large amounts of different data such as pictures, databases, audio, and video multimedia appear. General-purpose search engines are often powerless against such information-dense, structured data and cannot discover and retrieve it well.
(4) Most general-purpose search engines provide keyword-based retrieval and have difficulty supporting queries based on semantic information.
To solve the above problems, focused crawlers, which crawl relevant web resources in a targeted way, emerged. A focused crawler is a program that automatically downloads web pages. According to a set crawl target, it selectively visits web pages and related links on the World Wide Web to obtain the required information. Unlike a general-purpose web crawler, a focused crawler does not pursue broad coverage; instead, it aims to crawl web pages related to a specific topic, preparing data resources for topic-oriented user queries.
A web crawler is a program that automatically extracts web pages. It downloads web pages from the World Wide Web for a search engine and is an important component of a search engine. A traditional crawler starts from the URLs of one or more initial pages, obtains the URLs on those pages, and, while crawling, continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met. The workflow of a focused crawler is more complex: it must filter out links unrelated to the topic according to some web page analysis algorithm, keep the useful links, and put them into the queue of URLs waiting to be crawled. It then selects the next URL to crawl from the queue according to some search strategy and repeats the process until a stop condition of the system is reached. In addition, all pages fetched by the crawler are stored by the system, analyzed, filtered, and indexed for later query and retrieval; for a focused crawler, the analysis results obtained in this process may also provide feedback and guidance for subsequent crawling.
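The queue-driven workflow described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the in-memory `LINKS` table stands in for real web pages, and the names `crawl` and `is_relevant` are hypothetical. The default filter keeps every link (the traditional crawler); passing a stricter `is_relevant` mimics the link filtering of a focused crawler.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs linked from that page.
LINKS = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def crawl(seeds, max_pages, is_relevant=lambda url: True):
    """Visit pages breadth-first, starting from the seed URLs.

    is_relevant stands in for a focused crawler's link filter; the default
    keeps every link, which corresponds to a traditional crawler.
    """
    queue = deque(seeds)   # URLs waiting to be crawled
    seen = set(seeds)      # URLs already queued, to avoid revisiting
    visited = []
    while queue and len(visited) < max_pages:  # stop condition
        url = queue.popleft()
        visited.append(url)
        for link in LINKS.get(url, []):        # extract new URLs from the page
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return visited
```

A first-in, first-out queue is only one possible search strategy; a focused crawler could equally use a priority queue ordered by an estimated relevance score.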
Anti-crawling is the information protection practice of using various technical means to prevent others from obtaining one's site information in bulk. Traditional anti-crawling generally takes one of two approaches. In the first, the back end counts user access requests; if the number of access requests from a single user agent (User Agent, UA) exceeds a threshold, that user agent is blocked. This approach is very effective against crawlers, but it causes serious false blocking: normal users' web access is frequently blocked. In the second, the back end counts user access requests; if the number of access requests from a single user session exceeds a threshold, that session is blocked. This approach looks workable but is in fact even less effective against crawlers, because sessions can be requested freely: when a session is blocked, the user can simply request a new session at no cost and carry on crawling, so this approach performs poorly in practice.
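Both weaknesses of the traditional approaches can be seen in a small simulation. The following Python sketch is illustrative only: `RequestCounterBlocker`, `served_with_session_rotation`, and the threshold value are hypothetical names and numbers, not part of the patent. The same counter is keyed either by user agent or by session ID.

```python
from collections import Counter


class RequestCounterBlocker:
    """Block a key (user agent or session ID) once its request count exceeds a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()
        self.blocked = set()

    def allow(self, key):
        if key in self.blocked:
            return False
        self.counts[key] += 1
        if self.counts[key] > self.threshold:
            self.blocked.add(key)
            return False
        return True


def served_with_session_rotation(blocker, total_requests):
    """Simulate a client that obtains a fresh session whenever a request is refused."""
    session_no = 0
    served = 0
    for _ in range(total_requests):
        if blocker.allow(f"session-{session_no}"):
            served += 1
        else:
            session_no += 1  # blocked: simply request a new session
    return served


# Keyed by user agent: once a crawler trips the threshold, a normal user
# sending the same user agent string is refused as well (a false block).
ua_blocker = RequestCounterBlocker(threshold=3)
for _ in range(4):
    ua_blocker.allow("Mozilla/5.0")
normal_user_allowed = ua_blocker.allow("Mozilla/5.0")

# Keyed by session: rotating sessions keeps most requests flowing.
rotating_served = served_with_session_rotation(RequestCounterBlocker(threshold=3), 20)
```

In this toy run the session-rotating client loses only the one request per session that trips the threshold, which is why per-session counting gives little protection.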
A normal user calls many data interfaces when browsing a web page; this is normal user access. A crawler, by contrast, picks out from the many data interfaces the one that serves the data it wants, skips the interfaces it considers useless, and crawls only that one data interface, fetching data from it in bulk continuously. The network traffic generated by one crawler fetching data in bulk is equivalent to the traffic of hundreds of normal users. Page traffic then rises and server transmission speed drops. Because the crawler hits a single data interface in bulk, that interface's transmission rate slows, and the load may even exceed the server's capacity and bring the server down. While a crawler is crawling data, normal users' access becomes sluggish, and sometimes normal users cannot reach the information they need at all.
Summary of the invention
To solve the problems of false blocking and poor anti-crawling effect in existing anti-crawling techniques, the present invention provides a method for preventing data crawling. The method judges whether the client initiating data access is a crawler by counting the number of interfaces the user calls when viewing data. The method not only greatly reduces the likelihood of false blocking but also significantly improves the anti-crawling effect.
To achieve the above objective, the present invention adopts the following technical solution.
The method counts the number of interfaces a user calls when viewing data. If the number of interfaces called when the user views a web page is greater than or equal to a preset interface count threshold, the user is judged to be an ordinary visiting user. If the number of interfaces the user calls is less than the preset interface count threshold, and the number of calls to each interface exceeds a preset call count threshold, the user is judged to be a crawler, and the user's IP is added to the firewall and blocked.
A method for preventing data crawling, the method comprising the following steps:
1) intercepting the user's access request;
2) recording the IP address;
3) recording the data interfaces accessed by the user;
4) reading the log data and counting the number of data interfaces called by the IP address;
5) if the number of interfaces called by the IP address is greater than the preset interface count threshold, judging the user to be a normal user, performing no interception, and having the server return the response to the user normally;
6) if the number of interfaces called by the IP address is less than the preset interface count threshold, counting the IP address's calls to each data interface, and, if the number of calls to any data interface exceeds the preset call count threshold, judging the user to be a crawler;
7) adding the crawler user's IP address to the firewall's banned IP list, prohibiting that IP address from accessing the website.
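The seven steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `find_crawler_ips` and the log format (one `(ip, interface)` pair per intercepted request) are hypothetical, the call count threshold is treated as an absolute number of calls, and an interface count equal to the threshold is treated as normal, following the "greater than or equal to" wording used elsewhere in the description.

```python
from collections import Counter, defaultdict


def find_crawler_ips(access_log, interface_threshold, call_threshold):
    """Return the set of IP addresses judged to be crawlers.

    access_log is an iterable of (ip, interface) pairs, one per intercepted
    request (an assumed log format, not specified by the source).
    """
    # Steps 1-3: record, per IP address, which data interfaces were called.
    calls = defaultdict(Counter)  # ip -> interface -> number of calls
    for ip, interface in access_log:
        calls[ip][interface] += 1

    crawlers = set()
    for ip, per_interface in calls.items():
        # Step 4: count the distinct data interfaces called by this IP.
        distinct_interfaces = len(per_interface)
        # Step 5: enough distinct interfaces means a normal user; no interception.
        if distinct_interfaces >= interface_threshold:
            continue
        # Step 6: few interfaces, so check the call count of each one.
        if any(count > call_threshold for count in per_interface.values()):
            # Step 7: this IP would be added to the firewall's banned list.
            crawlers.add(ip)
    return crawlers
```

Note that an IP calling many interfaces is never flagged here, however many requests it makes; the decision rests on the access pattern rather than on raw volume, which is the point the description emphasizes.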
Preferably, the interface count threshold is 2.
Preferably, the call count threshold is 80 percent, where 80 percent means that the ratio between the number of calls to the page interface and the number of calls to the other interfaces exceeds 80 percent. For example, suppose there are two interfaces, the first called once and the second called 3 times. For a normal user, the two counts grow together, to 2 and 6 and then to 3 and 9; for a crawler, only the second count grows, to 1 and 6 and then to 1 and 9.
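The source does not state precisely how the 80 percent ratio is computed. One possible reading, assumed here purely for illustration, is the share of an IP address's total calls that go to its most-called interface. Under that assumption, the example counts above can be checked with a short Python sketch; `dominant_share` and `exceeds_threshold` are hypothetical names.

```python
def dominant_share(call_counts):
    """Share of all calls that go to the most-called interface (assumed reading)."""
    total = sum(call_counts)
    return max(call_counts) / total if total else 0.0


def exceeds_threshold(call_counts, threshold=0.80):
    return dominant_share(call_counts) > threshold


# Counts taken from the example in the text: (first interface, second interface).
normal_growth = [(1, 3), (2, 6), (3, 9)]   # both counts grow together
crawler_growth = [(1, 6), (1, 9)]          # only the second count grows

normal_flags = [exceeds_threshold(c) for c in normal_growth]
crawler_flags = [exceeds_threshold(c) for c in crawler_growth]
```

Under this reading the normal user's share stays at 3/4 however long the session runs, while the crawler's share rises to 6/7 and then 9/10, crossing the 80 percent line. Other readings of the ratio are possible and would give different numbers.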
The advantages and beneficial effects of the present invention are:
1) The method distinguishes normal users from crawler behavior based on the patterns with which users and crawlers access data interfaces, rather than on user access volume alone, so its accuracy is higher and its false-block rate is lower;
2) The anti-crawling program in the method runs on the server side, which on the one hand simplifies management and on the other hand improves client responsiveness and the user experience;
3) Because anti-crawling is based on the user's IP, blocked crawler users can be identified more directly, and if a false block does occur, the user's access can easily be restored based on the IP address.
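The third advantage, listing blocked users and restoring access by IP address, can be illustrated with a small in-memory ban list. This Python sketch is an assumption for illustration only: the class `IpBlocklist` and its methods are hypothetical, and a real deployment would push these entries to an actual firewall, which the source does not specify.

```python
import ipaddress


class IpBlocklist:
    """In-memory stand-in for a firewall's banned IP list."""

    def __init__(self):
        self._banned = {}  # normalized IP string -> reason for the ban

    @staticmethod
    def _normalize(ip):
        # Raises ValueError for strings that are not valid IP addresses.
        return str(ipaddress.ip_address(ip))

    def ban(self, ip, reason):
        self._banned[self._normalize(ip)] = reason

    def restore(self, ip):
        """Lift a ban, e.g. after a false block; return True if the IP was banned."""
        key = self._normalize(ip)
        if key in self._banned:
            del self._banned[key]
            return True
        return False

    def is_banned(self, ip):
        return self._normalize(ip) in self._banned

    def listing(self):
        """Banned IPs with reasons, for an operator to review."""
        return sorted(self._banned.items())
```

Keying the list by IP address keeps both review and reversal simple: an operator can read `listing()` and call `restore()` for any address that turns out to belong to a normal user.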
Specific embodiments
The invention is further described below with reference to an embodiment.
Embodiment 1
A method for preventing data crawling, comprising the following steps:
1) intercepting the user's access request;
2) recording the IP address;
3) recording the data interfaces accessed by the user;
4) reading the log data and counting the number of data interfaces called by the IP address;
5) if the number of interfaces called by the IP address is greater than or equal to 2, judging the user to be a normal user, performing no interception, and having the server return the response to the user normally;
6) if the number of interfaces called by the IP address is 1, counting the IP address's calls to that data interface, and, if the call count for that data interface exceeds 80 percent, judging the user to be a crawler;
7) adding the crawler user's IP address to the firewall's banned IP list, prohibiting that IP address from accessing the website.
Finally, it should be noted that the above embodiment is merely an example given to illustrate the invention clearly and is not a limitation on the embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious changes or variations derived from the above remain within the scope of protection of the present invention.

Claims (4)

1. A method for preventing data crawling, characterized in that: the method is based on counting the number of interfaces a user calls when viewing data; if the number of interfaces called when the user views a web page is greater than or equal to a preset interface count threshold, the user is judged to be an ordinary visiting user; if the number of interfaces the user calls is less than the preset interface count threshold, and the number of calls to each interface exceeds a preset call count threshold, the user is judged to be a crawler, and the user's IP is added to the firewall and blocked.
2. The method for preventing data crawling according to claim 1, characterized in that the method comprises the following steps:
1) intercepting the user's access request;
2) recording the IP address;
3) recording the data interfaces accessed by the user;
4) reading the log data and counting the number of data interfaces called by the IP address;
5) if the number of interfaces called by the IP address is greater than the preset interface count threshold, judging the user to be a normal user, performing no interception, and having the server return the response to the user normally;
6) if the number of interfaces called by the IP address is less than the preset interface count threshold, counting the IP address's calls to each data interface, and, if the number of calls to any data interface exceeds the preset call count threshold, judging the user to be a crawler;
7) adding the crawler user's IP address to the firewall's banned IP list, prohibiting that IP address from accessing the website.
3. The method for preventing data crawling according to claim 1 or 2, characterized in that: the interface count threshold is 2.
4. The method for preventing data crawling according to claim 1 or 2, characterized in that: the call count threshold is 80 percent, where 80 percent means that the ratio between the number of calls to the page interface and the number of calls to the other interfaces exceeds 80 percent.
CN201810689937.2A 2018-06-28 2018-06-28 Anti-data crawling method Active CN109246070B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810689937.2A CN109246070B (en) 2018-06-28 2018-06-28 Anti-data crawling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810689937.2A CN109246070B (en) 2018-06-28 2018-06-28 Anti-data crawling method

Publications (2)

Publication Number Publication Date
CN109246070A true CN109246070A (en) 2019-01-18
CN109246070B CN109246070B (en) 2021-04-30

Family

ID=65072168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810689937.2A Active CN109246070B (en) 2018-06-28 2018-06-28 Anti-data crawling method

Country Status (1)

Country Link
CN (1) CN109246070B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880698A (en) * 2012-09-21 2013-01-16 新浪网技术(中国)有限公司 Method and device for determining caught website
US20160188717A1 (en) * 2014-12-29 2016-06-30 Quixey, Inc. Network crawling prioritization
CN106254537A (en) * 2016-09-22 2016-12-21 北京小米移动软件有限公司 Interface interchange method and apparatus
CN106411639A (en) * 2016-09-18 2017-02-15 合网络技术(北京)有限公司 Method and system for monitoring access data
CN107105071A (en) * 2017-05-05 2017-08-29 北京京东金融科技控股有限公司 IP call methods and device, storage medium, electronic equipment
CN107392022A (en) * 2017-07-20 2017-11-24 北京小度信息科技有限公司 Reptile identification, processing method and relevant apparatus

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110035068A (en) * 2019-03-14 2019-07-19 微梦创科网络科技(中国)有限公司 It is a kind of it is counter grab station system close method and device down
CN110035068B (en) * 2019-03-14 2021-10-01 微梦创科网络科技(中国)有限公司 Sealing forbidding method and device for anti-grabbing station system

Also Published As

Publication number Publication date
CN109246070B (en) 2021-04-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant