CN109246070A - Anti-data crawling method - Google Patents
- Publication number
- CN109246070A (application number CN201810689937.2A)
- Authority
- CN
- China
- Prior art keywords
- user
- data
- interface
- threshold value
- address
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/1483—Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention discloses an anti-data-crawling method. The method counts the number of interfaces a user calls when viewing data. If the number of interfaces called when the user views a web page is greater than or equal to a preset interface-count threshold, the user is judged to be an ordinary visitor. If the number of interfaces the user calls is less than the preset interface-count threshold and the number of calls to each interface exceeds a preset call-count threshold, the user is judged to be a crawler, and the user's IP address is added to the firewall and blocked. The method distinguishes normal users from crawler behavior based on the patterns in which users and crawlers access data interfaces, rather than on user access volume alone, so its accuracy is higher and its false-ban rate is lower.
Description
Technical field
The invention belongs to the technical field of information protection and network security, and in particular relates to an anti-data-crawling method.
Background art
With the rapid development of the network, the World Wide Web has become the carrier of a vast amount of information, and how to extract and use this information effectively has become a huge challenge. Search engines, as tools that help people retrieve information, have become the entrance and guide for users accessing the World Wide Web. However, these general-purpose search engines also have certain limitations, such as:
(1) Users in different fields and with different backgrounds often have different retrieval purposes and needs, and the results returned by a general-purpose search engine include a large number of web pages that the user does not care about.
(2) The goal of a general-purpose search engine is the widest possible network coverage, so the contradiction between limited search engine server resources and unlimited network data resources will deepen further.
(3) With the enrichment of World Wide Web data formats and the continuous development of network technology, different kinds of data such as pictures, databases, audio, and video multimedia appear in large quantities. General-purpose search engines are often powerless against such information-dense data with a certain structure and cannot discover and retrieve it well.
(4) Most general-purpose search engines provide keyword-based retrieval and have difficulty supporting queries posed in terms of semantic information.
To solve the above problems, focused crawlers that fetch relevant web resources in a targeted way came into being. A focused crawler is a program that downloads web pages automatically. According to a set crawling target, it selectively accesses web pages and related links on the World Wide Web to obtain the required information. Unlike a general-purpose web crawler, a focused crawler does not pursue broad coverage; instead, it aims to fetch web pages related to a specific topic and prepares data resources for topic-oriented user queries.
A web crawler is a program that extracts web pages automatically. It downloads web pages from the World Wide Web for a search engine and is an important component of the search engine. A traditional crawler starts from the URLs of one or several initial pages, obtains the URLs on those pages, and, while fetching web pages, continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met. The workflow of a focused crawler is more complex: it filters out links unrelated to the topic according to a web page analysis algorithm, keeps the useful links, and puts them into the queue of URLs waiting to be fetched. It then selects the next URL to fetch from the queue according to a search strategy and repeats the process until a stop condition of the system is reached. In addition, all web pages fetched by the crawler are stored by the system, analyzed, filtered, and indexed for later query and retrieval. For a focused crawler, the analysis results obtained in this process may also provide feedback and guidance for the subsequent crawling process.
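The queue-driven workflow of a traditional crawler described above can be sketched as follows. This is a minimal illustration, not part of the patent: the function name `crawl`, the in-memory `links` mapping (which stands in for fetching a page and extracting its URLs), and the `max_pages` stop condition are all assumptions made for the example.

```python
from collections import deque


def crawl(seeds, links, max_pages):
    """Breadth-first crawl over a link graph held in memory.

    `links` maps a URL to the URLs found on that page; it is a
    hypothetical stand-in for downloading and parsing the page.
    `max_pages` plays the role of the system's stop condition.
    """
    queue = deque(seeds)      # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, to avoid repeats
    fetched = []              # pages "downloaded", in fetch order
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        fetched.append(url)
        # Extract new URLs from the current page and enqueue them.
        for found in links.get(url, []):
            if found not in seen:
                seen.add(found)
                queue.append(found)
    return fetched
```

A focused crawler would add a relevance filter before the `queue.append` call and could replace the FIFO queue with a priority queue that implements its search strategy.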
Anti-crawling is an information protection practice that prevents others from using various technical means to obtain one's own site information in bulk. Traditional anti-crawling generally takes one of two approaches. In the first approach, the backend counts user access requests, and if the number of access requests from a single user agent (User Agent, UA) exceeds a threshold, that user agent is blocked. This approach is very effective against crawlers, but its problem is that false blocking is severe, and users' normal web access is frequently blocked. In the second approach, the backend counts user access requests, and if the number of access requests from a single user session (session) exceeds a threshold, that session is blocked. This approach appears workable, but its anti-crawling effect is actually worse: because sessions can be applied for freely, a user whose session is blocked can apply for a new session at no cost and continue crawling, so this approach performs poorly in practice.
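The weakness of session-based counting can be made concrete with a small simulation. This is only a sketch under assumptions: the class `SessionLimiter`, the function `crawl_with_fresh_sessions`, and the integer session IDs are hypothetical names invented for the example and do not come from the patent.

```python
from collections import Counter
from itertools import count


class SessionLimiter:
    """Blocks a session once its request count exceeds `threshold`."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.requests = Counter()
        self.blocked = set()

    def allow(self, session_id):
        if session_id in self.blocked:
            return False
        self.requests[session_id] += 1
        if self.requests[session_id] > self.threshold:
            self.blocked.add(session_id)
            return False
        return True


def crawl_with_fresh_sessions(limiter, attempts):
    """Simulates a crawler that requests a new session whenever it is blocked."""
    session_ids = count(1)
    session = next(session_ids)
    served = 0
    for _ in range(attempts):
        if limiter.allow(session):
            served += 1
        else:
            # A new session costs the crawler nothing, so it just moves on.
            session = next(session_ids)
    return served
```

With a threshold of 3, the simulated crawler loses only one request in every four, which illustrates why blocking by session does little to stop bulk crawling.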
A normal user calls many data interfaces when browsing web pages; this is normal user access. A crawler, by contrast, picks the data it wants out of the many data interfaces, skips the data interfaces it considers useless, and therefore crawls only that one data interface, fetching data from it in bulk continuously. The network traffic generated by one crawler fetching data in bulk is equivalent to the access volume of hundreds of normal users. Web page traffic then rises and server transmission speed drops. Moreover, because the crawler fetches in bulk from a single data interface, the transmission rate of that data interface slows down and the load may even exceed what the server can bear, bringing the server down. While a crawler is crawling data, access for normal users becomes sluggish, and normal users may sometimes be unable to reach the information they need.
Summary of the invention
To solve problems of existing anti-crawling techniques such as false blocking and poor anti-crawling effect, the present invention provides an anti-data-crawling method. The method judges whether the client initiating data access is a crawler by counting the number of interfaces the user calls when viewing data. This method can both greatly reduce the likelihood of false blocking and significantly improve the anti-crawling effect.
To achieve the above objective, the technical solution adopted by the present invention is as follows.
The method counts the number of interfaces a user calls when viewing data. If the number of interfaces called when the user views a web page is greater than or equal to a preset interface-count threshold, the user is judged to be an ordinary visitor. If the number of interfaces the user calls is less than the preset interface-count threshold and the number of calls to each interface exceeds a preset call-count threshold, the user is judged to be a crawler, and the user's IP address is added to the firewall and blocked.
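The decision rule in the preceding paragraph can be sketched as a small function. The function name `classify`, the default `call_threshold` of 100 (treated here as an absolute call count), and the "undetermined" outcome for cases the text does not cover are assumptions made for illustration; the text itself states only the two conditions.

```python
def classify(calls_per_interface, interface_threshold=2, call_threshold=100):
    """Classify one client from its per-interface call counts.

    `calls_per_interface` maps an interface name to how many times the
    client called it. The default thresholds are illustrative assumptions.
    """
    used = {name: n for name, n in calls_per_interface.items() if n > 0}
    if not used:
        return "undetermined"      # no calls recorded; the text is silent here
    if len(used) >= interface_threshold:
        return "normal"            # many distinct interfaces: ordinary visitor
    # Few distinct interfaces: crawler only if each one is called heavily.
    if all(n > call_threshold for n in used.values()):
        return "crawler"
    return "undetermined"          # few interfaces, few calls: not covered
```

For example, `classify({"/list": 5, "/detail": 7})` returns `"normal"`, while `classify({"/detail": 500})` returns `"crawler"`.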
An anti-data-crawling method, the method comprising the following steps:
1) intercept the user access request;
2) record the IP address;
3) record the data interface the user accesses;
4) read the log data and count the number of data interfaces called by the IP address;
5) if the number of interfaces called by the IP address is greater than the preset interface-count threshold, judge the user to be a normal user, perform no interception, and have the server return the response to the user normally;
6) if the number of interfaces called by the IP address is less than the preset interface-count threshold, count the number of calls the IP address makes to each data interface, and if the number of calls to some data interface exceeds the preset call-count threshold, judge the user to be a crawler;
7) add the crawler user's IP address to the firewall's banned-IP list, prohibiting that IP address from accessing this website.
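The steps above can be sketched end to end over an access log. This is a hedged illustration rather than the patented implementation: the log format (a list of `(ip, interface)` pairs), the function names, the absolute `call_threshold` of 100, and the use of a plain Python set in place of a real firewall ban list are all assumptions. The case where the interface count equals the threshold is treated as normal, following the abstract.

```python
from collections import Counter, defaultdict


def find_crawler_ips(access_log, interface_threshold=2, call_threshold=100):
    """Return the set of IP addresses judged to be crawlers.

    `access_log` is an iterable of (ip, interface) pairs, one per
    intercepted request (steps 1 to 3).
    """
    # Step 4: count calls per interface for every IP address.
    calls = defaultdict(Counter)
    for ip, interface in access_log:
        calls[ip][interface] += 1

    banned = set()
    for ip, per_interface in calls.items():
        # Step 5: enough distinct interfaces means a normal user.
        if len(per_interface) >= interface_threshold:
            continue
        # Step 6: few interfaces and some interface called too often.
        if any(n > call_threshold for n in per_interface.values()):
            banned.add(ip)
    return banned


def handle_request(ip, banned):
    """Step 7: refuse requests from banned IP addresses (set stands in for the firewall)."""
    return "rejected" if ip in banned else "served"
```

Keeping the ban list keyed by IP address also makes it simple to lift a ban: removing the address from the set restores access.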
Preferably, the interface-count threshold is 2.
Preferably, the call-count threshold is 80 percent, where 80 percent means that the ratio of the number of calls to the page interface to the number of calls to the other interfaces exceeds 80 percent. For example, suppose there are two interfaces, the first called once and the second called three times. A normal user's call counts then grow proportionally as 2_6, 3_9, whereas a crawler's grow as 1_6, 1_9, with only the second interface increasing.
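The 80 percent figure can be checked against the example counts under one possible reading. The sketch below assumes that the ratio means the share of a client's total calls that go to its most-called interface; this interpretation, and the names `dominant_share` and `exceeds_ratio`, are assumptions, since the text does not define the ratio precisely.

```python
def dominant_share(counts):
    """Fraction of all calls that go to the most-called interface."""
    total = sum(counts)
    return max(counts) / total if total else 0.0


def exceeds_ratio(counts, ratio_threshold=0.8):
    """True when the most-called interface accounts for more than the threshold."""
    return dominant_share(counts) > ratio_threshold


# Call-count pairs taken from the example in the text.
normal_user = [(1, 3), (2, 6), (3, 9)]
crawler = [(1, 6), (1, 9)]
normal_flags = [exceeds_ratio(pair) for pair in normal_user]
crawler_flags = [exceeds_ratio(pair) for pair in crawler]
```

Under this reading the normal user's share stays at 3/4 for every pair and never crosses 80 percent, while the crawler's share is 6/7 and then 9/10, both above the threshold.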
The advantages and beneficial effects of the present invention are:
1) The method distinguishes normal users from crawler behavior based on the patterns in which users and crawlers access data interfaces, rather than on user access volume alone, so its accuracy is higher and its false-ban rate is lower;
2) The anti-crawler program in the method runs on the server side, which on the one hand simplifies management and on the other hand improves client fluency and thus the user experience;
3) Because anti-crawling is based on the user's IP address, the banned crawler users can be seen more directly, and, if a false ban occurs, the user's access rights can easily be restored based on the IP address.
Specific embodiment
The invention is further described below with reference to an embodiment.
Embodiment 1
An anti-data-crawling method, comprising the following steps:
1) intercept the user access request;
2) record the IP address;
3) record the data interface the user accesses;
4) read the log data and count the number of data interfaces called by the IP address;
5) if the number of interfaces called by the IP address is greater than or equal to 2, judge the user to be a normal user, perform no interception, and have the server return the response to the user normally;
6) if the number of interfaces called by the IP address is 1, count the number of calls the IP address makes to that data interface, and if the calls to that data interface exceed 80 percent, judge the user to be a crawler;
7) add the crawler user's IP address to the firewall's banned-IP list, prohibiting that IP address from accessing this website.
Finally, it should be noted that the above embodiment is merely an example given to illustrate the present invention clearly and does not limit the embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious changes or variations derived from this description remain within the scope of protection of the present invention.
Claims (4)
1. An anti-data-crawling method, characterized in that: the method is based on counting the number of interfaces a user calls when viewing data; if the number of interfaces called when the user views a web page is greater than or equal to a preset interface-count threshold, the user is judged to be an ordinary visitor; if the number of interfaces the user calls is less than the preset interface-count threshold and the number of calls to each interface exceeds a preset call-count threshold, the user is judged to be a crawler, and the user's IP address is added to the firewall and blocked.
2. The anti-data-crawling method according to claim 1, characterized in that the method comprises the following steps:
1) intercept the user access request;
2) record the IP address;
3) record the data interface the user accesses;
4) read the log data and count the number of data interfaces called by the IP address;
5) if the number of interfaces called by the IP address is greater than the preset interface-count threshold, judge the user to be a normal user, perform no interception, and have the server return the response to the user normally;
6) if the number of interfaces called by the IP address is less than the preset interface-count threshold, count the number of calls the IP address makes to each data interface, and if the number of calls to some data interface exceeds the preset call-count threshold, judge the user to be a crawler;
7) add the crawler user's IP address to the firewall's banned-IP list, prohibiting that IP address from accessing this website.
3. The anti-data-crawling method according to claim 1 or 2, characterized in that: the interface-count threshold is 2.
4. The anti-data-crawling method according to claim 1 or 2, characterized in that: the call-count threshold is 80 percent, where 80 percent means that the ratio of the number of calls to the page interface to the number of calls to the other interfaces exceeds 80 percent.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810689937.2A CN109246070B (en) | 2018-06-28 | 2018-06-28 | Anti-data crawling method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810689937.2A CN109246070B (en) | 2018-06-28 | 2018-06-28 | Anti-data crawling method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109246070A true CN109246070A (en) | 2019-01-18 |
CN109246070B CN109246070B (en) | 2021-04-30 |
Family
ID=65072168
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810689937.2A Active CN109246070B (en) | 2018-06-28 | 2018-06-28 | Anti-data crawling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109246070B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110035068A (en) * | 2019-03-14 | 2019-07-19 | 微梦创科网络科技(中国)有限公司 | It is a kind of it is counter grab station system close method and device down |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102880698A (en) * | 2012-09-21 | 2013-01-16 | 新浪网技术(中国)有限公司 | Method and device for determining caught website |
US20160188717A1 (en) * | 2014-12-29 | 2016-06-30 | Quixey, Inc. | Network crawling prioritization |
CN106254537A (en) * | 2016-09-22 | 2016-12-21 | 北京小米移动软件有限公司 | Interface interchange method and apparatus |
CN106411639A (en) * | 2016-09-18 | 2017-02-15 | 合网络技术(北京)有限公司 | Method and system for monitoring access data |
CN107105071A (en) * | 2017-05-05 | 2017-08-29 | 北京京东金融科技控股有限公司 | IP call methods and device, storage medium, electronic equipment |
CN107392022A (en) * | 2017-07-20 | 2017-11-24 | 北京小度信息科技有限公司 | Reptile identification, processing method and relevant apparatus |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110035068A (en) * | 2019-03-14 | 2019-07-19 | 微梦创科网络科技(中国)有限公司 | It is a kind of it is counter grab station system close method and device down |
CN110035068B (en) * | 2019-03-14 | 2021-10-01 | 微梦创科网络科技(中国)有限公司 | Sealing forbidding method and device for anti-grabbing station system |
Also Published As
Publication number | Publication date |
---|---|
CN109246070B (en) | 2021-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11108807B2 (en) | Performing rule-based actions for newly observed domain names | |
CN106599075B (en) | A kind of method and device of counting user behavioral data | |
CN106503134B (en) | Browser jumps to the method for data synchronization and device of application program | |
US20170048265A1 (en) | Detection of Potential Security Threats Based on Categorical Patterns | |
CN109688097A (en) | Website protection method, website protective device, website safeguard and storage medium | |
CN107395782A (en) | A kind of IP limitation controlled source information extraction methods based on agent pool | |
CN109768992A (en) | Webpage malicious scanning processing method and device, terminal device, readable storage medium storing program for executing | |
US20090216868A1 (en) | Anti-spam tool for browser | |
CN112787992A (en) | Method, device, equipment and medium for detecting and protecting sensitive data | |
CN107341160A (en) | A kind of method and device for intercepting reptile | |
CN106528659B (en) | Control method and device for browser to jump to application program | |
CN105589943B (en) | The method, apparatus and server of the picture adaptive processes of result of page searching | |
US8510443B2 (en) | Real-time harmful website blocking method using object attribute access engine | |
CN107341395A (en) | A kind of method for intercepting reptile | |
CN102045319A (en) | Method and device for detecting SQL (Structured Query Language) injection attack | |
CN105657471A (en) | Account management method and device | |
CN108429785A (en) | A kind of generation method, reptile recognition methods and the device of reptile identification encryption string | |
CN106599270B (en) | Network data capturing method and crawler | |
CN107800689A (en) | A kind of Website Usability ensures processing method and processing device | |
CN106230835A (en) | Method based on the anti-malicious access that Nginx log analysis and IPTABLES forward | |
CN107590386A (en) | Processing method, device, storage medium and the computer equipment of security event information | |
CN109635210A (en) | Report method, device, equipment and the storage medium of behavioral data | |
CN104008213B (en) | A kind of more new discovery of info web and the method and apparatus of statistics | |
CN109246070A (en) | A kind of method that anti-data crawl | |
CN111049837A (en) | Malicious website identification and interception technology based on communication operator network transport layer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||