CN114338099A - Crawler behavior identification method and prevention system - Google Patents
- Publication number
- CN114338099A (CN202111510989.7A)
- Authority
- CN
- China
- Prior art keywords
- access
- connection address
- similarity
- user
- access request
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a crawler behavior identification method and a prevention system. The method comprises the following steps: S1: receiving an access request from a user; S2: storing the connection address, the access object and the status code; S3: judging whether the connection address exhibits crawler behavior; if yes, go to S6; if no, go to S4; S4: judging the similarity between the access request and crawler behavior; if the similarity is in a first range, the access request is accepted; if it is in a second range, go to S5; if it is in a third range, go to S6; S5: performing verification-code verification; if the user passes, the access request is accepted; if not, go to S6; S6: storing the connection address in a blacklist. The beneficial effects of the invention are as follows: the user's accumulated access requests and real-time access requests are analyzed separately, which improves the accuracy of crawler behavior identification. In addition, suitably chosen similarity ranges and a verification-code mechanism effectively distinguish normal access requests from crawler behavior, which improves the user experience.
Description
Technical Field
The invention relates to the technical field of internet security, in particular to a crawler behavior identification method and a prevention system.
Background
Crawler software is a type of computer program that obtains specific data from specific web pages, interfaces and the like according to preset rules. It is widely used in many fields and can automatically acquire new data from a website and store it for convenient access, analysis and use. After being sorted, processed and screened, the data acquired by crawler software can serve as a basis for big data analysis of specific objects, which is a very common analysis method in internet enterprises. In the pharmaceutical e-commerce industry, crawler software is often used to collect competitors' data such as product categories, commodity names, prices and discounts. After analysis and processing, the data can be used to guide commodity pricing. For example, on a certain pharmaceutical e-commerce platform, the number of access requests per day is about 7,000 thousand, including access requests from several crawlers with different functions. Crawler software occupies server resources and network bandwidth, degrades the user experience and adversely affects the operation of the pharmaceutical e-commerce business. It is therefore necessary to design a corresponding identification and prevention system for crawler software.
In the prior art, the number of requests from a single IP address within a specific time interval is limited, but this method is not effective against crawler software that uses random IP addresses. To some extent it also affects users with legitimately high access volumes, such as distributors making large purchases and retail terminals.
Disclosure of Invention
To address the above problems in the prior art, a crawler behavior identification method and a prevention system are provided.
The specific technical scheme is as follows:
A method of identifying crawler behavior, comprising:
Step S1: receiving an access request from a user, and recording a connection address, an access object and a status code;
Step S2: storing the connection address, the access object and the status code in a log module;
Step S3: analyzing the access object and the status code of the connection address by means of the log module, and judging from the analysis result whether the connection address exhibits crawler behavior;
if yes, go to step S6;
if not, go to step S4;
step S4: adopting an analysis module to judge the similarity between the access request and the crawler behavior;
if the similarity is in a preset first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if the similarity is within a second predetermined range, go to step S5;
if the similarity is within a preset third range, go to step S6;
step S5: initiating verification of a verification code to the user and judging whether the user passes the verification;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, go to step S6;
step S6: and storing the connection address into a blacklist, and then finishing the judgment.
Preferably, the step S3 includes:
step S31: judging whether the connection address is in a preset high-risk connection address range or not;
if so, increasing the numerical value of the similarity, and then turning to step S32;
if not, go to the step S32;
step S32: judging whether a first access rule is met according to the access object and the status code;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S33;
step S33: judging whether a second access rule is met or not according to the access object;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S34;
step S34: judging whether a third access rule is met or not according to the state code and the connection address;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S35;
step S35: judging whether the number of times of the access requests sent from the connection address exceeds an access limit value within preset time;
if so, increasing the numerical value of the similarity, and then turning to step S4;
if not, the process goes to step S4.
Preferably, the calculation formula of the similarity is as follows:
wherein: cos θ is the similarity, x1 is the connection address, x2 is the frequency of the connection address in the blacklist, y1 is the access object, and y2 is the frequency of the access object in the blacklist.
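A plausible reading of the formula, assuming the conventional cosine similarity between the request vector (x1, y1) and the blacklist-frequency vector (x2, y2), is sketched below. This form is an assumption inferred from the symbol cos θ and the four variables defined above; how the connection address and the access object are encoded as numbers is not specified here.

```latex
\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^{2} + y_1^{2}}\,\sqrt{x_2^{2} + y_2^{2}}}
```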
Preferably, the step S4 further includes:
step S41: extracting a user identification from the access request;
step S42: judging whether the user identification is in a preset user identification range or not;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, go to step S43;
step S43: calculating and judging the similarity between the access request and the crawler behavior;
if the similarity is in the first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if the similarity is in the second range, go to step S5;
if the similarity is in the third range, the process goes to step S6.
Preferably, the step S5 includes:
step S51: sending a verification code authentication request to the user, and recording the authentication times;
step S52: judging whether the authentication times reach an authentication upper limit value or not;
if yes, go to step S6;
if not, go to step S53;
step S53: judging whether the authentication request passes or not;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, the process returns to the step S51.
Preferably, the step S6 includes:
setting a period for storing the connection address in the blacklist according to the connection address or the access object;
the period is a preset first period or a preset second period.
A prevention system for crawler behavior, wherein the prevention system is used to implement the identification method described above and comprises:
a load balancing module, connected with a plurality of users and used for receiving the access requests sent by the users;
an analysis module, which receives the access request forwarded by the load balancing module and judges the similarity between the access request and a preset crawler model;
a verification module, connected with the analysis module, which sends a verification-code verification request to the user according to the similarity;
a log module, connected with the analysis module and the verification module, which adds users with extremely high similarity and/or users who fail the verification-code verification request to a blacklist;
wherein the load balancing module judges, according to the blacklist, whether to forward the access request of an external user.
Preferably, the log module further comprises:
a behavior analysis submodule, which reads the user, the connection address, the access object and the status code from the log stored in the log module and judges from them whether to add the user to the blacklist.
The technical scheme has the following advantages or beneficial effects: by arranging the log module and the analysis module to analyze the accumulated access request and the real-time access request of the user respectively, the accuracy of crawler behavior identification is improved, and a better identification effect is realized. And the similarity range and the verification mechanism of the verification code are reasonably set to effectively distinguish normal access requests and crawler behaviors, so that the user experience is improved.
Drawings
Embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings. The drawings are, however, to be regarded as illustrative and explanatory only and are not restrictive of the scope of the invention.
FIG. 1 is an overall schematic diagram of an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating the substep of step S3 according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the substep of step S4 according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating the substep of step S5 according to an embodiment of the present invention;
fig. 5 is a functional block diagram according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
In the prior art, crawler software is mainly countered by limiting the number of requests from a single IP connection address within a certain time. For example, an nginx reverse proxy server can be configured with the limit_conn_zone and limit_req_zone directives to limit the number of connections and the request frequency of a user. This approach has two problems. First, some crawler software uses an IP address pool to evade the per-IP limit. Second, on a pharmaceutical e-commerce platform, the networks of some purchasers, distributors, drug stores and platform merchants share a single egress, so the access volume from a single IP address is naturally high, and a per-IP connection limit inconveniences these users. To address these technical problems, the invention provides a system that identifies specific crawler behaviors in order to guard against crawler software.
The invention comprises the following steps:
a method for identifying a crawler behavior, as shown in fig. 1, includes:
step S1: receiving an access request from a user, and recording a connection address, an access object and a status code;
step S2: storing the connection address, the access object and the status code in a log module;
step S3: analyzing the access object and the status code of the connection address by means of the log module, and judging from the analysis result whether the connection address exhibits crawler behavior;
if yes, go to step S6;
if not, go to step S4;
step S4: adopting an analysis module to judge the similarity between the access request and the crawler behavior;
if the similarity is in a preset first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if the similarity is within a second predetermined range, go to step S5;
if the similarity is within a preset third range, go to step S6;
As an optional embodiment, the first range, the second range and the third range are contiguous and together span the values 0-10; the smaller the value, the lower the similarity between the access request and crawler behavior.
Step S5: initiating verification of a verification code to a user and judging whether the user passes the verification;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, go to step S6;
step S6: and storing the connection address into a blacklist, and then finishing the judgment.
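The decision flow of steps S3 to S6 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `Decision` enum, and the thresholds 2 and 6 (borrowed from the optional 0-2 / 3-6 / 7-10 split mentioned in this description) are assumptions, and steps S1 and S2 (recording and logging) are assumed to have already run.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    BLACKLIST = "blacklist"


def classify(similarity, first_max=2, second_max=6):
    """Map a 0-10 similarity value onto the three preset ranges (assumed bounds)."""
    if similarity <= first_max:
        return "first"
    if similarity <= second_max:
        return "second"
    return "third"


def handle_request(address, log_flags_crawler, similarity, captcha_passed, blacklist):
    """Apply steps S3-S6 to a single request."""
    # S3: the log analysis has already marked this address as a crawler.
    if log_flags_crawler:
        blacklist.add(address)  # S6
        return Decision.BLACKLIST
    # S4: route the request according to its similarity range.
    band = classify(similarity)
    if band == "first":
        return Decision.ACCEPT
    # S5: the second range is accepted only after passing the verification code.
    if band == "second" and captcha_passed:
        return Decision.ACCEPT
    blacklist.add(address)  # S6
    return Decision.BLACKLIST
```

Passing the verification result in as a boolean keeps the sketch free of any particular verification-code service.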
Specifically, in the present technical solution, the hardware architecture mainly includes the following. The load balancing server serves as the system's unified external interface and forwards and balances users' access requests, so as to avoid service congestion caused by too many connections. The log server records information related to users' access requests; in one embodiment, a user is identified by recording the user's IP address, i.e. the connection address, and the log server may be an nginx server cluster that records access requests through its access log (access_log). The log analysis server accesses the log server through a software interface, analyzes the access requests recorded in the log and judges whether they exhibit crawler behavior. The behavior analysis server receives the access requests forwarded by the load balancing server, evaluates the user's current access behavior and generates the similarity between that behavior and crawler behavior; it directly allows access when the similarity is low, adds the connection address to the blacklist when the similarity is high, and otherwise uses a verification code to judge whether the access request comes from crawler software. The blacklist stores the users prohibited from accessing and their access requests, and provides a basis for the behavior analysis performed by the behavior analysis server.
In a preferred embodiment, as shown in fig. 2, step S3 includes:
step S31: judging whether the connection address is in a preset high-risk connection address range or not;
if yes, increasing the numerical value of the similarity, and turning to step S32;
if not, go to step S32;
Specifically, some crawler software uses an IP address pool to evade the prior-art limit on the number of accesses from a single IP connection address. Most crawlers obtain their IP address pools from an Internet Data Center (IDC), and such addresses typically fall within the same IP address segment. Therefore, by presetting specific high-risk IP address segments and raising the similarity value of access requests that originate from them, which is equivalent to lowering the threshold at which such requests are judged to be crawler behavior, the accuracy of identifying this type of crawler software can be increased.
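The high-risk address check of step S31 might look like the sketch below. The network ranges are placeholders taken from documentation-reserved address blocks, and the `bump` and `cap` values are hypothetical; the text does not state how much the similarity is raised.

```python
import ipaddress

# Hypothetical high-risk segments; these are documentation-reserved ranges,
# standing in for IDC address segments an operator would configure.
HIGH_RISK_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def adjust_similarity(address, similarity, bump=2.0, cap=10.0):
    """Raise the similarity value when the address lies in a high-risk segment (S31)."""
    ip = ipaddress.ip_address(address)
    if any(ip in net for net in HIGH_RISK_NETWORKS):
        return min(similarity + bump, cap)
    return similarity
```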
Step S32: judging whether the first access rule is met according to the access object and the status code;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S33;
Specifically, if within a preset period, such as 1 hour, the requests from a connection address include a high-risk request, such as a login request (login), an administrator instruction (admin), a backup request (backup) or a database connection request (sql), and the status code of that request is 404, the connection address is judged to be a high-risk connection address and is added to the blacklist, so that the load balancing server refuses access from it.
As an optional technical feature, the source of the connection address may be further analyzed in step S32, and if the connection address belongs to an address segment of an internet data center, the connection address is added to the blacklist.
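A minimal sketch of the first access rule, assuming the log entries for one connection address within the preset period are available as (URI, status code) pairs. The keyword list comes from the examples in the text; substring matching on the URI is an assumption.

```python
# Keywords named in the description as examples of high-risk requests.
HIGH_RISK_KEYWORDS = ("login", "admin", "backup", "sql")


def violates_first_rule(entries):
    """entries: iterable of (uri, status_code) for one address within the period.

    Returns True if any request targets a high-risk object and returned 404.
    """
    return any(
        status == 404 and any(keyword in uri.lower() for keyword in HIGH_RISK_KEYWORDS)
        for uri, status in entries
    )
```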
Step S33: judging whether the second access rule is met or not according to the access object;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S34;
Specifically, if within a short preset period, such as 30 minutes, the access objects are highly similar to one another, the access behavior does not match normal user behavior and may be generated by software. For example, the URI can be parsed with regexp_extract; if the URI prefixes are the same while the id changes in a regular pattern, such as repeatedly increasing or decreasing by 1, the requests may have been generated by crawler software.
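The second access rule can be illustrated with Python's `re` module in place of `regexp_extract`. The URI pattern (a prefix followed by a trailing numeric id) and the `min_run` threshold are assumptions chosen for the sketch.

```python
import re

# Assumed URI shape: some prefix ending in "/" followed by a numeric id.
URI_PATTERN = re.compile(r"^(?P<prefix>.*?/)(?P<id>\d+)$")


def violates_second_rule(uris, min_run=5):
    """True if consecutive URIs share a prefix and their ids step by +1 or -1
    at least min_run times in a row."""
    run = 0
    prev = None
    for uri in uris:
        match = URI_PATTERN.match(uri)
        cur = (match.group("prefix"), int(match.group("id"))) if match else None
        if prev and cur and cur[0] == prev[0] and abs(cur[1] - prev[1]) == 1:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0  # the regular pattern was interrupted
        prev = cur
    return False
```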
Step S34: judging whether the third access rule is met or not according to the state code and the connection address;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S35;
Specifically, if within a short preset period, for example 30 minutes, the status codes are abnormal, for example 4xx status codes appear frequently, and the connection address comes from an internet data center, the address is judged to be engaging in crawler behavior and is added to the blacklist so that the load balancing server blocks its access.
Step S35: judging whether the number of access requests sent from the connection address exceeds an access limit value within a preset time;
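One way to express the third access rule is shown below. The text only says that 4xx codes appear "frequently", so the minimum sample size and the ratio threshold are hypothetical parameters.

```python
def violates_third_rule(status_codes, from_idc, min_requests=20, max_4xx_ratio=0.5):
    """status_codes: status codes seen for one address within the period.
    from_idc: whether the address belongs to an internet data center segment.
    """
    if not from_idc or len(status_codes) < min_requests:
        return False
    client_errors = sum(1 for status in status_codes if 400 <= status < 500)
    return client_errors / len(status_codes) > max_4xx_ratio
```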
if so, increasing the numerical value of the similarity, and then turning to step S4;
if not, the process goes to step S4.
Specifically, if within a very short preset period, for example 5 minutes, the number of access requests sent from a connection address significantly exceeds the normal number of accesses, for example 200, that is, the access frequency is much higher than that of a normal user, the requests may be considered to be generated by crawler software, and the connection address needs to be blacklisted so that the load balancing server blocks its access.
As an optional technical feature, the amount by which the similarity is increased may differ for access requests matching different access rules. For example, in step S35, if the number of access requests exceeds the access limit within the preset time, the similarity may be raised directly into the second range to force the access request to undergo verification-code verification.
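The access-count check of step S35 can be sketched with a sliding window per connection address. The class name is hypothetical; the defaults mirror the 5-minute, 200-request example in the text.

```python
from collections import deque


class AccessCounter:
    """Sliding-window request counter per connection address."""

    def __init__(self, window_seconds=300, limit=200):
        self.window = window_seconds
        self.limit = limit
        self.hits = {}

    def record(self, address, now):
        """Record a request at time `now` (seconds); return True if the
        number of requests inside the window exceeds the limit."""
        timestamps = self.hits.setdefault(address, deque())
        timestamps.append(now)
        # Drop timestamps that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        return len(timestamps) > self.limit
```

A caller would raise the similarity value, for instance into the second range, whenever `record` returns True.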
In a preferred embodiment, the similarity is calculated by the formula:
wherein: cos θ is the similarity, x1 is the connection address, x2 is the frequency of the connection address in the blacklist, y1 is the access object, and y2 is the frequency of the access object in the blacklist.
Specifically, in this technical scheme, the similarity is calculated mainly from two parameters, the connection address of the access request and the access object, against the past crawler behavior data stored in the blacklist. Because the past crawler behavior data recorded in the blacklist is cleaned up regularly, each calculation by the analysis module is based on reasonably current data. Generally, for the same crawler software, the last proxy servers through which its access requests pass tend to be regionally correlated, so using the connection address as a parameter of the similarity calculation makes it possible to judge effectively whether an access request originates from a high-risk connection address segment. Furthermore, in practice the content of interest to a crawler, such as the product SKU, product category, price and discount price, typically lies in a fixed range, whereas a normal user stays on a single page for a period of time and views product pictures, parameters, feature descriptions and so on. Whether an access request resembles the past crawler behavior data recorded in the blacklist can therefore be judged from the requested object. Inputting the two parameters into the formula yields the similarity value.
As an optional implementation, the similarity ranges from 0 to 10, wherein 0-2 is the first range, in which access is allowed; 3-6 is the second range, in which verification by verification code is required; and 7-10 is the third range, which indicates that the request is too similar to existing past crawler behavior data and the connection address is added directly to the blacklist.
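The similarity calculation could be sketched as follows, assuming the formula is the conventional two-dimensional cosine similarity and that the result is scaled linearly to the 0-10 range. Both points are assumptions: the text does not state the scaling, nor how the connection address and access object are turned into the numbers x1 and y1.

```python
import math


def cosine_similarity(x1, y1, x2, y2):
    """Cosine of the angle between (x1, y1) and (x2, y2); 0.0 for a zero vector."""
    norm = math.hypot(x1, y1) * math.hypot(x2, y2)
    if norm == 0:
        return 0.0
    return (x1 * x2 + y1 * y2) / norm


def similarity_score(x1, y1, x2, y2):
    """Scale cos θ to an integer in 0-10 (assumed linear scaling, negatives clamped)."""
    return round(10 * max(0.0, cosine_similarity(x1, y1, x2, y2)))
```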
In a preferred embodiment, as shown in fig. 3, step S4 further includes:
step S41: extracting a user identification from the access request;
step S42: judging whether the user identification is in a preset user identification range or not;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, go to step S43;
Specifically, some search engines, such as Baidu, Sogou and 360 Search, also use computer programs to access websites periodically and automatically in order to build indexes for their users. Such programs do not affect the operation of the pharmaceutical e-commerce platform, and the source of their access requests is usually indicated by a user identifier (user_agent). These access requests can therefore be identified and allowed through by presetting a user identifier range, which improves the efficiency of crawler behavior identification.
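The user identifier check of step S42 can be illustrated as a simple allowlist on the User-Agent string. The tokens below are illustrative examples for search engine crawlers and are assumptions, not a list given in the text; since a User-Agent header can be set freely by the client, a deployment would likely combine this with other checks.

```python
# Illustrative User-Agent tokens for search engine crawlers (assumed values).
ALLOWED_AGENT_TOKENS = ("baiduspider", "sogou", "360spider")


def is_allowed_agent(user_agent):
    """True if the user identifier falls within the preset allowlist (S42)."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in ALLOWED_AGENT_TOKENS)
```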
Step S43: calculating and judging the similarity between the access request and the crawler behavior;
if the similarity is in the first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if the similarity is in the second range, go to step S5;
if the similarity is in the third range, the process goes to step S6.
As an optional implementation manner, step S4 further includes: inquiring whether the corresponding connection address is set with a corresponding mark or not according to the log module, and if so, acquiring a corresponding similarity value for operation; by the method, the similarity can be quickly judged based on the connection address, and the judgment efficiency is improved.
As an optional implementation manner, step S4 further includes: it is determined whether to directly jump to step S5 according to whether the access object includes a key object, such as a login request, an administrator instruction, a backup request, or a database connection request. In this way, good security can be achieved by verification of the authentication code when the access request includes critical operations.
In a preferred embodiment, as shown in fig. 4, step S5 includes:
step S51: sending a verification code authentication request to a user, and recording the authentication times;
step S52: judging whether the authentication times reach an authentication upper limit value or not;
if yes, go to step S6;
if not, go to step S53;
step S53: judging whether the authentication request passes;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, the process returns to step S51.
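The loop formed by steps S51 to S53 can be sketched as below. It follows the listed order literally, so the upper limit is checked before the answer is evaluated; the `upper_limit` default and the "pending" return value (used when no further answers are available) are assumptions.

```python
def run_verification(answers, upper_limit=3):
    """answers: iterable of booleans, one per verification-code attempt.

    Returns "accept", "blacklist", or "pending" if the answers run out.
    """
    count = 0
    for passed in answers:
        count += 1                  # S51: send the challenge and record the count
        if count >= upper_limit:    # S52: upper limit reached, go to S6
            return "blacklist"
        if passed:                  # S53: the authentication request passed
            return "accept"
    return "pending"
```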
In a preferred embodiment, step S6 includes:
setting a period for storing the connection address in a blacklist according to the connection address or the access object;
the period is a preset first period or a preset second period.
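A blacklist with two retention periods might be sketched like this. The class, the period lengths and the `use_second_period` flag are hypothetical; the text only states that the period is chosen from the connection address or the access object.

```python
import time


class Blacklist:
    """Stores connection addresses with an expiry chosen from two preset periods."""

    def __init__(self, first_period=3600.0, second_period=86400.0):
        self.first_period = first_period
        self.second_period = second_period
        self._expiry = {}

    def add(self, address, use_second_period=False, now=None):
        now = time.time() if now is None else now
        period = self.second_period if use_second_period else self.first_period
        self._expiry[address] = now + period

    def contains(self, address, now=None):
        now = time.time() if now is None else now
        expiry = self._expiry.get(address)
        if expiry is None:
            return False
        if now >= expiry:
            del self._expiry[address]  # the storage period has elapsed
            return False
        return True
```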
A prevention system for crawler behavior, as shown in fig. 5, is used to implement the identification method described above and comprises:
the load balancing module 1 is connected with a plurality of users and used for receiving access requests sent by the users;
the analysis module 2 receives the access request forwarded by the load balancing module and judges the similarity between the access request and a preset crawler model;
the checking module 3 is connected with the analysis module 2, and sends a verification code checking request to the user according to the similarity;
the log module 4 is connected with the analysis module and the verification module and is used for adding users with extremely high similarity and/or users who do not pass verification requests of the verification codes into a blacklist;
and the load balancing module 1 judges whether to forward the access request of the external user according to the blacklist.
In a preferred embodiment, the log module further comprises:
and the behavior analysis submodule 41 reads the user, the connection address, the resource access request and the status code from the log stored in the log module 4, and judges whether to add the user to the blacklist according to the user, the connection address, the resource access request and the status code.
The invention has the beneficial effects that: by arranging the log module and the analysis module to analyze the accumulated access request and the real-time access request of the user respectively, the accuracy of crawler behavior identification is improved, and a better identification effect is realized. And the similarity range and the verification mechanism of the verification code are reasonably set to effectively distinguish normal access requests and crawler behaviors, so that the user experience is improved.
It should be noted that some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
As will be appreciated by one of ordinary skill in the art, various aspects of the invention, or possible implementations of various aspects, may be embodied as a system, method, or computer program product. Accordingly, aspects of the present invention, or possible implementations of aspects, may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, aspects of the invention, or possible implementations of aspects, may take the form of a computer program product, which refers to computer-readable program code stored in a computer-readable medium.
The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing, such as Random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, and portable read-only memory (CD-ROM).
A processor in the computer reads the computer-readable program code stored in the computer-readable medium, so that the processor can perform the functional actions specified in each step, or combination of steps, in the flowcharts, and can implement the functional actions specified in each block, or combination of blocks, of the block diagrams.
It should be understood that a processor in a computer may be implemented as one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, microcontrollers (MCUs), microprocessors, or other electronic components for executing the aforementioned computer readable program code.
The computer readable program code may execute entirely on the user's local computer, partly on the user's local computer, as a stand-alone software package, partly on the user's local computer and partly on a remote computer or entirely on the remote computer or server. It should also be noted that, in some alternative implementations, the functions noted in the flowchart or block diagram block may occur out of the order noted in the figures. For example, two steps or two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.
Claims (8)
1. A crawler behavior identification method, characterized by comprising the following steps:
step S1: receiving an access request of a user, and recording a connection address, an access object and a state code;
step S2: storing the connection address, the access object and the state code into a log module;
step S3: analyzing the access object and the state code of the connection address by adopting the log module, and judging whether the connection address is a connection address with a crawler behavior according to an analysis result;
if yes, go to step S6;
if not, go to step S4;
step S4: adopting an analysis module to judge the similarity between the access request and the crawler behavior;
if the similarity is in a preset first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if the similarity is within a preset second range, go to step S5;
if the similarity is within a preset third range, go to step S6;
step S5: initiating verification of a verification code to the user and judging whether the user passes the verification;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, go to step S6;
step S6: storing the connection address in a blacklist, and then finishing the judgment.
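The control flow of steps S1 to S6 can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: the class and field names, the callbacks `log_flags_crawler`, `similarity_of` and `verify_captcha`, and the range boundaries 0.3 and 0.7 are all assumptions, since the claim leaves the three ranges and the module internals unspecified.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class Request:
    address: str  # connection address (for example a client IP)
    target: str   # access object (for example a requested path)
    status: int   # state code recorded for the request


@dataclass
class Gatekeeper:
    # The three callbacks are hypothetical stand-ins for the log module (S3),
    # the analysis module (S4) and the verification step (S5).
    log_flags_crawler: Callable[[Request], bool]
    similarity_of: Callable[[Request], float]
    verify_captcha: Callable[[Request], bool]
    low: float = 0.3   # assumed upper bound of the first range
    high: float = 0.7  # assumed lower bound of the third range
    log: List[Tuple[str, str, int]] = field(default_factory=list)
    blacklist: Set[str] = field(default_factory=set)

    def handle(self, req: Request) -> str:
        # S1/S2: record the connection address, access object and state code.
        self.log.append((req.address, req.target, req.status))
        # S3: log analysis may flag the address directly.
        if self.log_flags_crawler(req):
            return self._block(req)
        # S4: otherwise classify by similarity to crawler behavior.
        sim = self.similarity_of(req)
        if sim < self.low:
            return "serve"
        if sim >= self.high:
            return self._block(req)
        # S5: middle range, ask the user for a verification code.
        return "serve" if self.verify_captcha(req) else self._block(req)

    def _block(self, req: Request) -> str:
        # S6: store the connection address in the blacklist.
        self.blacklist.add(req.address)
        return "block"
```

Passing the three decisions in as callables keeps the sketch focused on the branching order of the claim; each callable would be replaced by the corresponding module in a real deployment.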
2. The identification method according to claim 1, wherein the step S3 includes:
step S31: judging whether the connection address is in a preset high-risk connection address range or not;
if so, increasing the numerical value of the similarity, and then turning to step S32;
if not, go to the step S32;
step S32: judging whether a first access rule is met according to the access object and the state code;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S33;
step S33: judging whether a second access rule is met or not according to the access object;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S34;
step S34: judging whether a third access rule is met or not according to the state code and the connection address;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S35;
step S35: judging whether the number of times of the access requests sent from the connection address exceeds an access limit value within preset time;
if so, increasing the numerical value of the similarity, and then turning to step S4;
if not, the process goes to step S4.
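Steps S31 to S35 can be sketched as a single log-analysis pass. The claim does not define the three access rules, so the sketch takes them as caller-supplied predicates; the class name, the fixed similarity increment `bump`, and the sliding-window counter are likewise assumptions made for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Iterable, Optional, Tuple

# A rule receives (address, target, status) and reports whether it matched.
Rule = Callable[[str, str, int], bool]


class LogAnalyzer:
    def __init__(self, high_risk: Iterable[str], rules: Iterable[Rule],
                 limit: int, window_s: float, bump: float = 0.1) -> None:
        self.high_risk = set(high_risk)  # preset high-risk addresses (S31)
        self.rules = list(rules)         # first/second/third access rules
        self.limit = limit               # access limit value (S35)
        self.window_s = window_s         # preset time window (S35)
        self.bump = bump                 # assumed similarity increment
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)

    def analyze(self, address: str, target: str, status: int,
                now: Optional[float] = None) -> Tuple[bool, float]:
        """Return (is_crawler, similarity_bonus) for one logged request."""
        now = time.monotonic() if now is None else now
        bonus = 0.0
        # S31: a high-risk address raises the similarity value.
        if address in self.high_risk:
            bonus += self.bump
        # S32-S34: any matching access rule marks the address as a crawler.
        if any(rule(address, target, status) for rule in self.rules):
            return True, bonus
        # S35: count requests from this address inside the time window.
        q = self.hits[address]
        q.append(now)
        while q and now - q[0] > self.window_s:
            q.popleft()
        if len(q) > self.limit:
            bonus += self.bump
        return False, bonus
```

A result of `(True, ...)` corresponds to the jump to step S6; otherwise the bonus would be added to the similarity computed in step S4.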
3. The identification method according to claim 1, wherein the similarity is calculated by the formula:
wherein: $\cos\theta$ is the similarity, $x_1$ is the connection address, $x_2$ is the frequency of the connection address in the blacklist, $y_1$ is the access object, and $y_2$ is the frequency of the access object in the blacklist.
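The symbol definitions in claim 3 are consistent with ordinary cosine similarity between the two-dimensional vectors (x1, y1) and (x2, y2), that is, cos θ = (x1·x2 + y1·y2) / (√(x1² + y1²) · √(x2² + y2²)). That pairing of the symbols is an assumption, as is treating x1 and y1 as numeric counts for the current connection address and access object. A minimal Python sketch under those assumptions:

```python
import math


def cosine_similarity(x1: float, y1: float, x2: float, y2: float) -> float:
    """Cosine of the angle between (x1, y1) and (x2, y2).

    Returns 0.0 when either vector has zero length, a convention chosen
    here to avoid division by zero; the claim does not address that case.
    """
    dot = x1 * x2 + y1 * y2
    norm = math.hypot(x1, y1) * math.hypot(x2, y2)
    return dot / norm if norm else 0.0
```

For non-negative inputs the result lies between 0 and 1, which makes it straightforward to split into the three preset ranges used in step S4.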
4. The identification method according to claim 1, wherein the step S4 further comprises:
step S41: extracting a user identification from the access request;
step S42: judging whether the user identification is in a preset user identification range or not;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, go to step S43;
step S43: calculating and judging the similarity between the access request and the crawler behavior;
if the similarity is in the first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if the similarity is in the second range, go to step S5;
if the similarity is in the third range, the process goes to step S6.
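Steps S41 to S43 amount to an allow-list check followed by a three-way classification. The sketch below assumes the preset user identification range is a set of trusted identifiers and reuses illustrative boundaries of 0.3 and 0.7; the function name and return labels are hypothetical.

```python
from typing import Callable, Set


def analyze_request(user_id: str, trusted_ids: Set[str],
                    similarity: Callable[[], float],
                    low: float = 0.3, high: float = 0.7) -> str:
    # S41/S42: a user identification in the preset range is served at once,
    # without computing the similarity at all.
    if user_id in trusted_ids:
        return "serve"
    # S43: otherwise compute the similarity and classify it.
    sim = similarity()
    if sim < low:
        return "serve"      # first range
    if sim < high:
        return "verify"     # second range, continue with step S5
    return "blacklist"      # third range, continue with step S6
```

Taking the similarity as a zero-argument callable reflects the ordering in the claim: the calculation in step S43 only runs when the identification check in step S42 fails.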
5. The identification method according to claim 1, wherein the step S5 includes:
step S51: sending a verification code authentication request to the user, and recording the authentication times;
step S52: judging whether the authentication times reach an authentication upper limit value or not;
if yes, go to step S6;
if not, go to step S53;
step S53: judging whether the authentication request passes or not;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, the process returns to the step S51.
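The retry loop of steps S51 to S53 can be sketched as follows. The sketch follows the order of the claim literally, so the limit check in S52 runs before the pass check in S53 and the attempt on which the counter reaches the limit is not evaluated. Reading "reach" as "greater than or equal to" is an assumption, as are the function and parameter names.

```python
from typing import Callable


def run_verification(attempt_passes: Callable[[int], bool],
                     max_attempts: int) -> bool:
    """Return True if the user passes, False if the upper limit is reached."""
    attempts = 0
    while True:
        # S51: send a verification code request and record the count.
        attempts += 1
        # S52: reaching the upper limit sends the flow to S6 (blacklist).
        if attempts >= max_attempts:
            return False
        # S53: a passed challenge ends the loop; a failure returns to S51.
        if attempt_passes(attempts):
            return True
```

With `max_attempts=3`, for example, the user effectively gets two evaluated attempts before the connection address is blacklisted.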
6. The identification method according to claim 1, wherein the step S6 includes:
setting a period for storing the connection address in the blacklist according to the connection address or the access object;
the period is a preset first period or a preset second period.
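A blacklist with two storage periods could look like the sketch below. The claim does not say how the connection address or access object selects the period, so the sketch assumes two hypothetical sets, `long_addresses` and `long_targets`, whose members trigger the second period; it also assumes the first period is the shorter one.

```python
import time
from typing import Dict, Optional, Set


class TimedBlacklist:
    def __init__(self, short_s: float, long_s: float,
                 long_addresses: Set[str], long_targets: Set[str]) -> None:
        self.short_s = short_s  # preset first period (assumed shorter)
        self.long_s = long_s    # preset second period
        self.long_addresses = long_addresses
        self.long_targets = long_targets
        self._expiry: Dict[str, float] = {}

    def add(self, address: str, target: str,
            now: Optional[float] = None) -> None:
        # Choose the period from the connection address or the access object.
        now = time.time() if now is None else now
        longer = address in self.long_addresses or target in self.long_targets
        self._expiry[address] = now + (self.long_s if longer else self.short_s)

    def contains(self, address: str, now: Optional[float] = None) -> bool:
        # An entry stops counting once its period has elapsed.
        now = time.time() if now is None else now
        expiry = self._expiry.get(address)
        if expiry is None:
            return False
        if now >= expiry:
            del self._expiry[address]
            return False
        return True
```

Expired entries are dropped lazily on lookup, which avoids a background cleanup task in this small example.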
7. A prevention system for crawler behavior, for implementing the identification method of any one of claims 1 to 6, comprising:
the load balancing module is connected with a plurality of users and used for receiving access requests sent by the users;
the analysis module receives the access request forwarded by the load balancing module and judges the similarity between the access request and a preset crawler model;
the verification module is connected with the analysis module and sends a verification code verification request to the user according to the similarity;
the log module is connected with the analysis module and the verification module and is used for adding the users with extremely high similarity and/or the users who fail the verification code verification request into a blacklist;
and the load balancing module judges whether to forward the access request of the external user according to the blacklist.
8. The prevention system of claim 7, wherein the log module further comprises:
a behavior analysis submodule, which reads the user, the connection address, the access object and the state code from the log stored in the log module, and judges whether to add the user into the blacklist according to the user, the connection address, the access object and the state code.
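The role of the load balancing module in claim 7, filtering requests against the blacklist before they reach the analysis module, can be sketched as below. The class name, the `backend` callable standing in for the analysis module, and keying the blacklist by connection address are assumptions made for illustration.

```python
from typing import Callable, Optional, Set


class LoadBalancer:
    def __init__(self, blacklist: Set[str],
                 backend: Callable[[str, str], str]) -> None:
        self.blacklist = blacklist  # same set object the log module updates
        self.backend = backend      # hypothetical analysis-module entry point

    def forward(self, address: str, target: str) -> Optional[str]:
        # Requests from blacklisted addresses are dropped before analysis.
        if address in self.blacklist:
            return None
        return self.backend(address, target)
```

Because the set is shared by reference, an address added by the log module is rejected on its next request without any extra synchronization step in this single-process sketch.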
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111510989.7A CN114338099A (en) | 2021-12-10 | 2021-12-10 | Crawler behavior identification method and prevention system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114338099A (en) | 2022-04-12 |
Family
ID=81050987
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111510989.7A Pending CN114338099A (en) | 2021-12-10 | 2021-12-10 | Crawler behavior identification method and prevention system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114338099A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113343232A (en) * | 2021-07-13 | 2021-09-03 | 壹药网科技(上海)股份有限公司 | Reversal crawler system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107800684A (en) * | 2017-09-20 | 2018-03-13 | 贵州白山云科技有限公司 | A kind of low frequency reptile recognition methods and device |
CN108154029A (en) * | 2017-10-25 | 2018-06-12 | 上海观安信息技术股份有限公司 | Intrusion detection method, electronic equipment and computer storage media |
CN109274664A (en) * | 2018-09-12 | 2019-01-25 | 珠海天燕科技有限公司 | A kind of anti-crawler method and apparatus |
US20190215330A1 (en) * | 2018-01-07 | 2019-07-11 | Microsoft Technology Licensing, Llc | Detecting attacks on web applications using server logs |
CN112073412A (en) * | 2020-09-08 | 2020-12-11 | 北京天融信网络安全技术有限公司 | Anti-crawler method, device, processor and computer readable medium |
CN112434208A (en) * | 2020-12-03 | 2021-03-02 | 百果园技术(新加坡)有限公司 | Training of isolated forest and identification method and related device of web crawler of isolated forest |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||