CN114338099A - Crawler behavior identification method and prevention system - Google Patents


Publication number
CN114338099A
Authority
CN
China
Prior art keywords
access
connection address
similarity
user
access request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111510989.7A
Other languages
Chinese (zh)
Inventor
王文彪
于刚
李志刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yiyaowang Technology Shanghai Co ltd
Original Assignee
Yiyaowang Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yiyaowang Technology Shanghai Co ltd filed Critical Yiyaowang Technology Shanghai Co ltd
Priority to CN202111510989.7A priority Critical patent/CN114338099A/en
Publication of CN114338099A publication Critical patent/CN114338099A/en
Pending legal-status Critical Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a crawler behavior identification method and a prevention system. The method comprises the following steps: S1: receiving an access request of a user; S2: storing the connection address, the access object and the status code; S3: judging whether the connection address is a connection address with crawler behavior; if yes, go to S6; if not, go to S4; S4: judging the similarity between the access request and crawler behavior; if the similarity is in a first range, the access request is accepted; if the similarity is in a second range, go to S5; if the similarity is in a third range, go to S6; S5: verifying the user with a verification code; if the verification passes, the access request is accepted; if not, go to S6; S6: storing the connection address in a blacklist. The beneficial effects of the invention are as follows: the accumulated access requests and the real-time access requests of users are analyzed separately, which improves the accuracy of crawler behavior identification. In addition, the similarity ranges and the verification-code mechanism are set so that normal access requests and crawler behavior are distinguished effectively, which improves the user experience.

Description

Crawler behavior identification method and prevention system
Technical Field
The invention relates to the technical field of internet security, in particular to a crawler behavior identification method and a prevention system.
Background
Crawler software is a type of computer program that obtains specific data from specific web pages, interfaces and the like according to preset rules. Crawler software is widely used in many fields: it can automatically acquire new data from a website and store it for convenient access, analysis and use. After being sorted, processed and screened, the data acquired by crawler software can serve as a basis for big data analysis of specific objects, which is a very common means of analysis in internet enterprises. In the pharmaceutical e-commerce industry, crawler software is often used as an analysis tool to collect competitors' data such as product categories, commodity names, prices and discounts. After analysis and processing, the data can be used to guide commodity pricing. For example, on a certain pharmaceutical e-commerce platform, the number of access requests per day is about 7000 thousand, including access requests from several crawlers with different functions. Crawler software occupies server resources and network bandwidth, degrades the user experience, and adversely affects the operation of the pharmaceutical e-commerce business. Therefore, it is necessary to design a corresponding identification and prevention system for crawler software.
In the prior art, the number of requests from a single IP address within a specific time interval is limited, but this method is not effective against crawler software that uses random IP addresses. To some extent, it also affects users with legitimately high access demand, such as distributors with large purchase volumes and retail terminals.
Disclosure of Invention
Aiming at the problems in the prior art, a crawler behavior identification method and a prevention system are provided.
The specific technical scheme is as follows:
a method of identifying crawler behavior, comprising:
step S1: receiving an access request of a user, and recording a connection address, an access object and a state code;
step S2: storing the connection address, the access object and the state code into a log module;
step S3: analyzing the access object and the state code of the connection address by adopting the log module, and judging whether the connection address is a connection address with a crawler behavior according to an analysis result;
if yes, go to step S6;
if not, go to step S4;
step S4: adopting an analysis module to judge the similarity between the access request and the crawler behavior;
if the similarity is in a preset first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if the similarity is within a second predetermined range, go to step S5;
if the similarity is within a preset third range, go to step S6;
step S5: initiating verification of a verification code to the user and judging whether the user passes the verification;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, go to step S6;
step S6: and storing the connection address into a blacklist, and then finishing the judgment.
Preferably, the step S3 includes:
step S31: judging whether the connection address is in a preset high-risk connection address range or not;
if so, increasing the numerical value of the similarity, and then turning to step S32;
if not, go to the step S32;
step S32: judging whether a first access rule is met according to the access object and the state code;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S33;
step S33: judging whether a second access rule is met or not according to the access object;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S34;
step S34: judging whether a third access rule is met or not according to the state code and the connection address;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S35;
step S35: judging whether the number of times of the access requests sent from the connection address exceeds an access limit value within preset time;
if so, increasing the numerical value of the similarity, and then turning to step S4;
if not, the process goes to step S4.
Preferably, the calculation formula of the similarity is as follows:
Figure BDA0003405345080000031
wherein: cos θ is the similarity, x1 is the connection address, x2 is the frequency of the connection address in the blacklist, y1 is the access object, and y2 is the frequency of the access object in the blacklist.
Preferably, the step S4 further includes:
step S41: extracting a user identification from the access request;
step S42: judging whether the user identification is in a preset user identification range or not;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, go to step S43;
step S43: calculating and judging the similarity between the access request and the crawler behavior;
if the similarity is in the first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if the similarity is in the second range, go to step S5;
if the similarity is in the third range, the process goes to step S6.
Preferably, the step S5 includes:
step S51: sending a verification code authentication request to the user, and recording the authentication times;
step S52: judging whether the authentication times reach an authentication upper limit value or not;
if yes, go to step S6;
if not, go to step S53;
step S53: judging whether the authentication request passes or not;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, the process returns to the step S51.
Preferably, the step S6 includes:
setting a period for storing the connection address in the blacklist according to the connection address or the access object;
the period is a preset first period or a preset second period.
A prevention system for crawler behavior, characterized in that the prevention system is used for implementing the identification method and comprises:
the load balancing module is connected with a plurality of users and used for receiving access requests sent by the users;
the analysis module receives the access request forwarded by the load balancing module and judges the similarity between the access request and a preset crawler model;
the verification module is connected with the analysis module and sends a verification code verification request to the user according to the similarity;
the log module is connected with the analysis module and the verification module and is used for adding users with extremely high similarity and/or users who fail the verification-code check into a blacklist;
and the load balancing module judges whether to forward the access request of the external user according to the blacklist.
Preferably, the log module further comprises:
and the behavior analysis submodule reads the user, the connection address, the access object and the state code from the log stored in the log module and judges whether to add the user into the blacklist or not according to the user, the connection address, the access object and the state code.
The technical scheme has the following advantages or beneficial effects: by arranging the log module and the analysis module to analyze the accumulated access request and the real-time access request of the user respectively, the accuracy of crawler behavior identification is improved, and a better identification effect is realized. And the similarity range and the verification mechanism of the verification code are reasonably set to effectively distinguish normal access requests and crawler behaviors, so that the user experience is improved.
Drawings
Embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings. The drawings are, however, to be regarded as illustrative and explanatory only and are not restrictive of the scope of the invention.
FIG. 1 is an overall schematic diagram of an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating the substep of step S3 according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the substep of step S4 according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating the substep of step S5 according to an embodiment of the present invention;
FIG. 5 is a functional block diagram according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
In the prior art, crawler software is mainly prevented by limiting the number of requests from a single IP connection address within a certain time. For example, an nginx reverse proxy server can be configured with the limit_conn_zone and limit_req_zone directives to limit the access frequency and the number of accesses of a user. This approach has two problems. First, some crawler software uses an IP address pool to evade the per-IP limit. Second, on a pharmaceutical e-commerce platform, the purchasing networks of some purchasers, distributors, drug stores and platform merchants share a unified network egress, so the access volume of a single IP address is naturally high, and a per-IP connection limit inconveniences these users. To address these technical problems, the invention provides a system that identifies specific crawler behavior in order to prevent crawler software.
The invention comprises the following steps:
a method for identifying a crawler behavior, as shown in fig. 1, includes:
step S1: receiving an access request of a user, and recording a connection address, an access object and a state code;
step S2: storing the connection address, the access object and the state code into a log module;
step S3: analyzing the access object and the state code of the connection address by adopting a log module, and judging whether the connection address is a connection address with a crawler behavior according to an analysis result;
if yes, go to step S6;
if not, go to step S4;
step S4: adopting an analysis module to judge the similarity between the access request and the crawler behavior;
if the similarity is in a preset first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if the similarity is within a second predetermined range, go to step S5;
if the similarity is within a preset third range, go to step S6;
As an optional embodiment, the first range, the second range and the third range are consecutive ranges that together span the values 0 to 10; the smaller the value, the lower the similarity between the access request and crawler behavior.
Step S5: initiating verification of a verification code to a user and judging whether the user passes the verification;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, go to step S6;
step S6: and storing the connection address into a blacklist, and then finishing the judgment.
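The branching of steps S3 to S6 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `identify`, the `Decision` enum, the callable `captcha_passed`, and the range boundaries `first_max` and `second_max` are all assumptions introduced here.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    BLACKLIST = "blacklist"


def identify(log_flags_crawler, similarity, captcha_passed,
             first_max=2, second_max=6):
    """Walk steps S3-S6 for one access request and return the outcome.

    log_flags_crawler: verdict of the log analysis for the connection address (S3).
    similarity: similarity score produced by the analysis module (S4).
    captcha_passed: callable returning True if the user passes verification (S5).
    first_max / second_max: hypothetical upper bounds of the first and second range.
    """
    # S3: the log analysis already considers this address a crawler -> S6
    if log_flags_crawler:
        return Decision.BLACKLIST
    # S4: first range, accept the request
    if similarity <= first_max:
        return Decision.ALLOW
    # S4: second range, defer to verification-code check (S5)
    if similarity <= second_max:
        return Decision.ALLOW if captcha_passed() else Decision.BLACKLIST
    # S4: third range -> S6
    return Decision.BLACKLIST
```

Passing the verification step as a callable keeps the check lazy: it is only triggered for requests that land in the second range.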
Specifically, in the present technical solution, the hardware architecture mainly includes the following components. The load balancing server serves as the unified external interface of the system and forwards and balances the access requests of users, so as to avoid service congestion caused by an excessive number of connections. The log server records information about the access requests of users; in one embodiment, a user is identified by recording the user's IP address, that is, the connection address. The log server may be an nginx server cluster, which records the access requests of users through its access log component (access log). The log analysis server accesses the log server through a software interface, analyzes the access requests recorded in the log, and judges whether they show crawler behavior. The behavior analysis server receives the access requests forwarded by the load balancing server, evaluates the current access behavior of the user, and generates the similarity between that behavior and crawler behavior: requests with low similarity are allowed directly, requests with high similarity are added to the blacklist, and the remaining requests are checked with a verification code to judge whether they come from crawler software. The blacklist stores the users who are prohibited from accessing, together with their access requests, and provides a basis for the analysis performed by the behavior analysis server.
In a preferred embodiment, as shown in fig. 2, step S3 includes:
step S31: judging whether the connection address is in a preset high-risk connection address range or not;
if yes, increasing the numerical value of the similarity, and turning to step S32;
if not, go to step S32;
Specifically, some crawler software uses an IP address pool to evade the prior-art limit on the number of accesses from a single IP connection address. Most crawlers obtain their IP address pool from an Internet Data Center (IDC), and such addresses typically fall within the same IP address segment. Therefore, a specific high-risk IP address segment is preset and a similarity value is added to access requests from that segment. This is equivalent to lowering the threshold at which such a request is judged to be crawler behavior, and it increases the accuracy with which this type of crawler software is identified.
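A minimal sketch of the high-risk segment check in step S31, using Python's standard `ipaddress` module. The listed networks are documentation-reserved placeholder ranges, and the `bonus` value and the cap at 10 are assumptions made for illustration only.

```python
import ipaddress

# Hypothetical high-risk segments; documentation-reserved ranges stand in
# for real data-center address segments.
HIGH_RISK_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def adjust_similarity(address: str, similarity: float, bonus: float = 2.0) -> float:
    """Raise the similarity of requests that originate from a high-risk segment."""
    ip = ipaddress.ip_address(address)
    if any(ip in network for network in HIGH_RISK_NETWORKS):
        # Cap at the assumed top of the 0-10 similarity scale.
        return min(10.0, similarity + bonus)
    return similarity
```

Because the function only adds to the score, an address in a high-risk segment is not blocked outright; it simply reaches the verification or blacklist ranges sooner.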
Step S32: judging whether the first access rule is met or not according to the access object and the state code;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S33;
Specifically, within a preset period, such as 1 hour, if the access requests from a connection address include a high-risk request, for example a login request (login), an administrator instruction (admin), a backup request (backup) or a database connection request (sql), and the status code of that request is 404, the connection address is judged to be a high-risk connection address and is added to the blacklist, so that the load balancing server refuses access from that connection address.
As an optional technical feature, in step S32 the source of the connection address may be further analyzed, and if the connection address belongs to an address segment of an Internet data center, the connection address is added to the blacklist.
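The first access rule can be sketched as a scan over the log entries of one connection address. The tuple layout `(timestamp, uri, status)`, the function name, and the substring matching on the keywords are assumptions for illustration; the keywords and the 1-hour window come from the description above.

```python
from datetime import datetime, timedelta

# Keywords named in the description as markers of high-risk requests.
HIGH_RISK_KEYWORDS = ("login", "admin", "backup", "sql")


def violates_first_rule(entries, now, window=timedelta(hours=1)):
    """Return True if a recent high-risk request from this address got a 404.

    entries: iterable of (timestamp, uri, status) tuples for one connection address.
    """
    for timestamp, uri, status in entries:
        recent = now - timestamp <= window
        risky = any(keyword in uri.lower() for keyword in HIGH_RISK_KEYWORDS)
        if recent and risky and status == 404:
            return True
    return False
```

A 404 on such a path suggests the client is probing for resources that do not exist, which an ordinary user following site links would rarely do.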
Step S33: judging whether the second access rule is met or not according to the access object;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S34;
Specifically, within a short preset period, such as 30 minutes, if the access objects are similar to one another, the access behavior does not conform to normal user behavior and may be generated by software. For example, the URI can be extracted with regexp_extract; if the URI prefixes are similar but the id changes regularly, for example by repeated steps of +1 or -1, the access requests may have been generated by crawler software.
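The regular-id pattern of the second access rule can be sketched with a regular expression that splits each URI into a prefix and a trailing numeric id. The pattern, the function name `looks_sequential`, and the run length `min_run` are assumptions; the source only states that a shared prefix with ids stepping by +1 or -1 is suspicious.

```python
import re

# Splits "/product/101" into prefix "/product/" and id "101" (assumed URI shape).
ID_PATTERN = re.compile(r"^(?P<prefix>.*?/)(?P<id>\d+)$")


def looks_sequential(uris, min_run=5):
    """True if min_run consecutive requests share a prefix and step the id by 1."""
    run = 1
    previous = None
    for uri in uris:
        match = ID_PATTERN.match(uri)
        current = (match.group("prefix"), int(match.group("id"))) if match else None
        if (previous and current and current[0] == previous[0]
                and abs(current[1] - previous[1]) == 1):
            run += 1
            if run >= min_run:
                return True
        else:
            run = 1  # the pattern was interrupted, start counting again
        previous = current
    return False
```

The input is expected to be the URIs of one connection address in request order, restricted to the preset period.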
Step S34: judging whether the third access rule is met or not according to the state code and the connection address;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S35;
Specifically, within a short preset period, for example 30 minutes, if the status codes are abnormal, for example 4XX status codes appear frequently, and the connection address comes from an Internet data center, the connection address is exhibiting crawler behavior and needs to be added to the blacklist so that the load balancing server blocks its access.
Step S35: judging whether the number of access requests sent from the connection address exceeds an access limit value within a preset time;
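The third access rule can be sketched as an error-ratio check. The source only says that 4XX codes "appear frequently"; the thresholds `min_requests` and `max_error_ratio` below are hypothetical values chosen to make that concrete.

```python
def violates_third_rule(statuses, from_data_center,
                        min_requests=20, max_error_ratio=0.5):
    """Return True if a data-center address produces mostly 4XX responses.

    statuses: status codes of the requests from one address within the period.
    from_data_center: whether the address belongs to an IDC segment.
    """
    if not from_data_center or len(statuses) < min_requests:
        return False
    errors = sum(1 for status in statuses if 400 <= status < 500)
    return errors / len(statuses) > max_error_ratio
```

Requiring a minimum number of requests avoids flagging an address on the basis of one or two mistyped URLs.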
if so, increasing the numerical value of the similarity, and then turning to step S4;
if not, the process goes to step S4.
Specifically, within a very short preset period, for example 5 minutes, if the number of access requests sent from a connection address significantly exceeds the normal number of accesses, for example 200, that is, the access frequency is much higher than that of a normal user, the requests may be considered to be generated by crawler software, and the connection address needs to be blacklisted so that the load balancing server blocks its access.
As an optional technical feature, the similarity increment may differ for access requests that match different access rules. For example, in step S35, if the number of access requests exceeds the access limit within the preset time, the similarity may be raised directly into the second range so that the request is forced through verification-code checking.
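The frequency check of step S35 can be sketched as a sliding-window counter kept per connection address. The class name `AccessCounter` and the use of plain numeric timestamps in seconds are assumptions; the defaults mirror the 5-minute, 200-request example above.

```python
from collections import deque


class AccessCounter:
    """Sliding-window request counter for a single connection address."""

    def __init__(self, window_seconds=300, limit=200):
        self.window = window_seconds
        self.limit = limit
        self.times = deque()

    def record(self, timestamp):
        """Record one request; return True if the window count exceeds the limit."""
        self.times.append(timestamp)
        # Drop requests that have fallen out of the window.
        while self.times and self.times[0] <= timestamp - self.window:
            self.times.popleft()
        return len(self.times) > self.limit
```

In use, one counter would be held per connection address (for example in a dictionary), and a `True` result would trigger the similarity increase described for step S35.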
In a preferred embodiment, the similarity is calculated by the formula:
Figure BDA0003405345080000091
wherein: cos θ is the similarity, x1 is the connection address, x2 is the frequency of the connection address in the blacklist, y1 is the access object, and y2 is the frequency of the access object in the blacklist.
Specifically, in this technical scheme the similarity is calculated mainly from two parameters of the access request, namely the connection address and the access object, against the historical crawler behavior data stored in the blacklist. Because the historical crawler behavior data recorded in the blacklist is cleaned regularly, each calculation by the analysis module is based on reasonably current data. Generally, for requests issued by the same crawler software, the last proxy servers through which the requests pass are regionally correlated, so using the connection address as a parameter of the similarity makes it possible to judge effectively whether the access request comes from a high-risk connection address segment. Furthermore, in practice the content of interest to a crawler, such as product SKU, product category, pricing and discount price, usually lies in a fixed range, whereas a normal user stays on a single page for a period of time and views product pictures, parameters, functional descriptions and so on. Therefore, the request object can be used to judge whether the access request is similar to the historical crawler behavior data recorded in the blacklist. Inputting the two parameters into the formula yields the numerical value of the similarity.
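The formula itself survives only as an image reference in this text. Assuming it is the standard cosine of the angle between the two-dimensional vectors (x1, y1) and (x2, y2), which is what the symbols cos θ, x1, x2, y1 and y2 suggest, it would read:

```latex
\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^{2} + y_1^{2}}\;\sqrt{x_2^{2} + y_2^{2}}}
```

Under this assumption the value lies between 0 and 1 for non-negative inputs. How that value is mapped onto the 0 to 10 scale used for the three ranges is not stated in the text; a linear scaling is one possibility.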
As an optional implementation mode, the similarity ranges from 0 to 10, wherein 0-2 is the first range, in which access is allowed; 3-6 is the second range, in which verification through a verification code is required; and 7-10 is the third range, in which the request is too similar to existing historical crawler behavior data and the connection address is added directly to the blacklist.
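The mapping from a score to one of the three ranges can be sketched as follows. The cosine helper assumes the standard two-dimensional cosine similarity, since the original formula is not reproduced in the text, and `classify` assumes integer scores because the ranges are given as 0-2, 3-6 and 7-10; both function names are introduced here for illustration.

```python
import math


def cosine_similarity(x1, y1, x2, y2):
    """Cosine of the angle between (x1, y1) and (x2, y2); 0.0 if either is zero."""
    denominator = math.hypot(x1, y1) * math.hypot(x2, y2)
    return 0.0 if denominator == 0 else (x1 * x2 + y1 * y2) / denominator


def classify(score: int) -> str:
    """Map an integer 0-10 similarity score to the action of its range."""
    if score <= 2:
        return "allow"      # first range: accept the request
    if score <= 6:
        return "verify"     # second range: verification code required
    return "blacklist"      # third range: add the address to the blacklist
```

How numeric values are derived from a connection address or an access object before they enter the formula is not described in the text, so the helper simply takes four numbers.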
In a preferred embodiment, as shown in fig. 3, step S4 further includes:
step S41: extracting a user identification from the access request;
step S42: judging whether the user identification is in a preset user identification range or not;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, go to step S43;
Specifically, some search engines, such as Baidu, Sogou and 360 Search, also use computer programs to access websites periodically and automatically in order to build the index used by the search engine's users. Such programs do not affect the operation of the pharmaceutical e-commerce platform, and the source of their access requests is usually indicated by a user identifier (user_agent). By presetting a range of user identifiers, these access requests can be identified and allowed through, which improves the efficiency of crawler behavior identification.
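The user-identifier check of steps S41 and S42 can be sketched as a substring match against a preset list. The tokens below are illustrative examples of search-engine crawler identifiers, and the function name is an assumption; a real deployment would maintain its own list.

```python
# Illustrative tokens for search-engine crawlers; the actual preset range
# of user identifiers is not given in the source.
ALLOWED_AGENT_TOKENS = ("baiduspider", "sogou", "360spider")


def is_allowed_agent(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a preset search-engine token."""
    normalized = user_agent.lower()
    return any(token in normalized for token in ALLOWED_AGENT_TOKENS)
```

Since the User-Agent header is supplied by the client, this check identifies only what the request claims to be; the source relies on it as an efficiency shortcut before the similarity calculation of step S43.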
Step S43: calculating and judging the similarity between the access request and the crawler behavior;
if the similarity is in the first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if the similarity is in the second range, go to step S5;
if the similarity is in the third range, the process goes to step S6.
As an optional implementation manner, step S4 further includes: querying the log module to determine whether a mark has been set for the connection address, and if so, obtaining the corresponding similarity value for the calculation. In this way, the similarity can be judged quickly from the connection address, which improves the efficiency of the judgment.
As an optional implementation manner, step S4 further includes: determining whether to jump directly to step S5 according to whether the access object includes a key object, such as a login request, an administrator instruction, a backup request, or a database connection request. In this way, when the access request involves a critical operation, verification-code checking provides good security.
In a preferred embodiment, as shown in fig. 4, step S5 includes:
step S51: sending a verification code authentication request to a user, and recording the authentication times;
step S52: judging whether the authentication times reach an authentication upper limit value or not;
if yes, go to step S6;
if not, go to step S53;
step S53: judging whether the authentication request passes;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, the process returns to step S51.
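The retry loop of steps S51 to S53 can be sketched as below. The callable `challenge`, which stands for sending one verification code and collecting the user's answer, and the default `upper_limit` are assumptions. The order of the checks follows the steps literally: the attempt count is compared with the upper limit (S52) before the result of the attempt is examined (S53).

```python
def run_verification(challenge, upper_limit=3):
    """Repeat verification-code checks until the user passes or the limit is hit.

    challenge: callable that issues one verification code and returns True
               if the user's answer is correct (hypothetical interface).
    """
    count = 0
    while True:
        passed = challenge()        # S51: send the authentication request
        count += 1                  # S51: record the number of attempts
        if count >= upper_limit:    # S52: upper limit reached -> S6
            return "blacklist"
        if passed:                  # S53: authentication passed
            return "allow"
        # S53 failed: return to S51
```

Read this way, a user effectively gets `upper_limit - 1` attempts whose results are evaluated; an implementation that checks the result before the limit would be a small variation on this loop.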
In a preferred embodiment, step S6 includes:
setting a period for storing the connection address in a blacklist according to the connection address or the access object;
the period is a preset first period or a preset second period.
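A blacklist with two retention periods, as in step S6, can be sketched with an expiry timestamp per connection address. The class, the period lengths in seconds, and the `"first"` / `"second"` labels are assumptions; the source does not say how the period is chosen from the connection address or access object, so the caller passes it in.

```python
class Blacklist:
    """Blacklist whose entries expire after a preset first or second period."""

    def __init__(self, first_period=3600, second_period=86400):
        # Hypothetical durations in seconds.
        self.periods = {"first": first_period, "second": second_period}
        self.expiry = {}

    def add(self, address, now, period="first"):
        """Store the address until now + the chosen period."""
        self.expiry[address] = now + self.periods[period]

    def contains(self, address, now):
        """Return True while the entry is still within its retention period."""
        expires = self.expiry.get(address)
        if expires is None:
            return False
        if now >= expires:
            del self.expiry[address]   # entry has aged out, clean it up
            return False
        return True
```

Expiring entries this way also matches the remark in the description that the crawler data in the blacklist is cleaned regularly.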
A prevention system for crawler behavior, as shown in FIG. 5, is used for implementing the identification method and comprises:
the load balancing module 1 is connected with a plurality of users and used for receiving access requests sent by the users;
the analysis module 2 receives the access request forwarded by the load balancing module and judges the similarity between the access request and a preset crawler model;
the checking module 3 is connected with the analysis module 2, and sends a verification code checking request to the user according to the similarity;
the log module 4 is connected with the analysis module 2 and the checking module 3 and is used for adding users with extremely high similarity and/or users who fail the verification-code check into a blacklist;
and the load balancing module 1 judges whether to forward the access request of the external user according to the blacklist.
In a preferred embodiment, the log module further comprises:
and the behavior analysis submodule 41 reads the user, the connection address, the resource access request and the status code from the log stored in the log module 4, and judges whether to add the user to the blacklist according to the user, the connection address, the resource access request and the status code.
The invention has the beneficial effects that: by arranging the log module and the analysis module to analyze the accumulated access request and the real-time access request of the user respectively, the accuracy of crawler behavior identification is improved, and a better identification effect is realized. And the similarity range and the verification mechanism of the verification code are reasonably set to effectively distinguish normal access requests and crawler behaviors, so that the user experience is improved.
It should be noted that some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
As will be appreciated by one of ordinary skill in the art, various aspects of the invention, or possible implementations of various aspects, may be embodied as a system, method, or computer program product. Accordingly, aspects of the present invention, or possible implementations of aspects, may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module," or "system." Furthermore, aspects of the invention, or possible implementations of aspects, may take the form of a computer program product, which refers to computer-readable program code stored in a computer-readable medium.
The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing, such as Random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, and portable compact disc read-only memory (CD-ROM).
A processor in the computer reads the computer-readable program code stored in the computer-readable medium, so that the processor can perform the functional actions specified in each step, or combination of steps, of the flowcharts, and can implement the functional actions specified in each block, or combination of blocks, of the block diagrams.
It should be understood that a processor in a computer may be understood as one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers (MCUs), microprocessors (microprocessors), or other electronic component implementations for executing the aforementioned computer readable program code.
The computer readable program code may execute entirely on the user's local computer, partly on the user's local computer, as a stand-alone software package, partly on the user's local computer and partly on a remote computer or entirely on the remote computer or server. It should also be noted that, in some alternative implementations, the functions noted in the flowchart or block diagram block may occur out of the order noted in the figures. For example, two steps or two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims (8)

1. A crawler behavior identification method, characterized by comprising the following steps:
step S1: receiving an access request of a user, and recording a connection address, an access object and a state code;
step S2: storing the connection address, the access object and the state code into a log module;
step S3: analyzing the access object and the state code of the connection address by adopting the log module, and judging whether the connection address is a connection address with a crawler behavior according to an analysis result;
if yes, go to step S6;
if not, go to step S4;
step S4: adopting an analysis module to judge the similarity between the access request and the crawler behavior;
if the similarity is in a preset first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if the similarity is within a preset second range, go to step S5;
if the similarity is within a preset third range, go to step S6;
step S5: initiating verification of a verification code to the user and judging whether the user passes the verification;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, go to step S6;
step S6: storing the connection address into a blacklist, and then finishing the judgment.
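The flow of steps S1 to S6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the three hooks (`log_flags_crawler`, `similarity`, `captcha_passed`), the class names, and the range boundaries 0.3 and 0.7 are all hypothetical, since the claim only requires three preset ranges.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Set


class Decision(Enum):
    ACCEPT = "accept"
    BLACKLIST = "blacklist"


@dataclass
class AccessRequest:
    address: str  # connection address (for example, a client IP)
    target: str   # access object (for example, the requested URL)
    status: int   # state code recorded for the request


@dataclass
class CrawlerIdentifier:
    # Hypothetical hooks standing in for the log module (S3),
    # the analysis module (S4) and the verification code check (S5).
    log_flags_crawler: Callable[[List[AccessRequest]], bool]
    similarity: Callable[[AccessRequest], float]
    captcha_passed: Callable[[AccessRequest], bool]
    first_upper: float = 0.3  # assumed upper bound of the first range
    third_lower: float = 0.7  # assumed lower bound of the third range
    log: List[AccessRequest] = field(default_factory=list)
    blacklist: Set[str] = field(default_factory=set)

    def handle(self, req: AccessRequest) -> Decision:
        self.log.append(req)  # S1 and S2: record and store the request
        history = [r for r in self.log if r.address == req.address]
        if self.log_flags_crawler(history):  # S3: accumulated requests
            return self._block(req)
        score = self.similarity(req)  # S4: real-time request
        if score < self.first_upper:
            return Decision.ACCEPT
        if score >= self.third_lower:
            return self._block(req)
        if self.captcha_passed(req):  # S5: second range only
            return Decision.ACCEPT
        return self._block(req)

    def _block(self, req: AccessRequest) -> Decision:
        self.blacklist.add(req.address)  # S6
        return Decision.BLACKLIST
```

Passing the three checks in as callables keeps the branching of claim 1 separate from the details that the dependent claims refine.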
2. The identification method according to claim 1, wherein the step S3 includes:
step S31: judging whether the connection address is in a preset high-risk connection address range or not;
if so, increasing the numerical value of the similarity, and then turning to step S32;
if not, go to the step S32;
step S32: judging whether a first access rule is met according to the access object and the state code;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S33;
step S33: judging whether a second access rule is met or not according to the access object;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S34;
step S34: judging whether a third access rule is met or not according to the state code and the connection address;
if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;
if not, go to step S35;
step S35: judging whether the number of times of the access requests sent from the connection address exceeds an access limit value within preset time;
if so, increasing the numerical value of the similarity, and then turning to step S4;
if not, the process goes to step S4.
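Steps S31 to S35 amount to one pass over the log entries of a connection address. The sketch below is illustrative only: the claim does not define the three access rules, so they are passed in as predicates, and `repeated_not_found`, the 60-second window, the limit of 100 requests, and the similarity increment of 0.1 are hypothetical values chosen for the example.

```python
from typing import Callable, Iterable, NamedTuple, Sequence, Set, Tuple


class LogEntry(NamedTuple):
    address: str      # connection address
    target: str       # access object
    status: int       # state code
    timestamp: float  # seconds


Rule = Callable[[Sequence[LogEntry]], bool]


def repeated_not_found(entries: Sequence[LogEntry]) -> bool:
    """Hypothetical access rule: five or more 404 responses."""
    return sum(e.status == 404 for e in entries) >= 5


def analyze_log(
    address: str,
    entries: Sequence[LogEntry],
    high_risk: Set[str],
    rules: Iterable[Rule],
    now: float,
    window_s: float = 60.0,
    access_limit: int = 100,
    bump: float = 0.1,
) -> Tuple[bool, float]:
    """Return (is_crawler, similarity_increase) for one connection address."""
    increase = 0.0
    if address in high_risk:  # S31: high-risk address raises the similarity
        increase += bump
    for rule in rules:        # S32 to S34: any matching rule leads to S6
        if rule(entries):
            return True, increase
    recent = [e for e in entries if now - e.timestamp <= window_s]
    if len(recent) > access_limit:  # S35: too many requests in the window
        increase += bump
    return False, increase
```

A result of `(False, increase)` corresponds to moving on to step S4 with the similarity raised by `increase`.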
3. The identification method according to claim 1, wherein the similarity is calculated by the formula:
Figure FDA0003405345070000021
wherein: cos θ is the similarity, x₁ is the connection address, x₂ is the frequency of the connection address in the blacklist, y₁ is the access object, and y₂ is the frequency of the access object in the blacklist.
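The formula itself is published as an image and is not reproduced in this text. The sketch below therefore rests on an assumption: that the formula is the standard cosine of the angle between the two-dimensional vectors (x1, y1) and (x2, y2), that is, (x1·x2 + y1·y2) divided by the product of the two vector lengths. It also assumes that the connection address and the access object have already been encoded as numbers, which the claim does not specify.

```python
import math


def cosine_similarity(x1: float, y1: float, x2: float, y2: float) -> float:
    """Cosine of the angle between (x1, y1) and (x2, y2).

    Returns 0.0 when either vector has zero length, an assumed convention
    for the case where the cosine is undefined.
    """
    dot = x1 * x2 + y1 * y2
    norm = math.hypot(x1, y1) * math.hypot(x2, y2)
    return dot / norm if norm else 0.0
```

With non-negative inputs the result lies between 0 and 1, which makes it straightforward to divide into the three preset ranges of claim 1.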
4. The identification method according to claim 1, wherein the step S4 further comprises:
step S41: extracting a user identification from the access request;
step S42: judging whether the user identification is in a preset user identification range or not;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, go to step S43;
step S43: calculating and judging the similarity between the access request and the crawler behavior;
if the similarity is in the first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if the similarity is in the second range, go to step S5;
if the similarity is in the third range, the process goes to step S6.
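A minimal sketch of steps S41 to S43, assuming the preset user identification range is a set of trusted identifiers and that the similarity is supplied by a callable. The function name, the string results, and the boundaries 0.3 and 0.7 are hypothetical.

```python
from typing import Callable, Set


def route_request(
    user_id: str,
    trusted_ids: Set[str],
    score_fn: Callable[[], float],
    first_upper: float = 0.3,  # assumed upper bound of the first range
    third_lower: float = 0.7,  # assumed lower bound of the third range
) -> str:
    """Return 'accept', 'verify' (step S5) or 'blacklist' (step S6)."""
    if user_id in trusted_ids:  # S41 and S42: trusted users are accepted
        return "accept"
    score = score_fn()          # S43: computed only for other users
    if score < first_upper:
        return "accept"
    if score < third_lower:
        return "verify"
    return "blacklist"
```

Because the identifier check comes first, the similarity is never computed for users in the preset range, so known users are not delayed or challenged.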
5. The identification method according to claim 1, wherein the step S5 includes:
step S51: sending a verification code authentication request to the user, and recording the authentication times;
step S52: judging whether the authentication times reach an authentication upper limit value or not;
if yes, go to step S6;
if not, go to step S53;
step S53: judging whether the authentication request passes or not;
if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;
if not, the process returns to the step S51.
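The loop of steps S51 to S53 can be sketched as below. The two callables and the default limit of 3 are hypothetical. The sketch follows the literal order of the claim, in which the count is compared with the upper limit before the answer is checked.

```python
from typing import Callable


def verify_user(
    send_challenge: Callable[[], None],
    check_answer: Callable[[int], bool],
    upper_limit: int = 3,
) -> bool:
    """Return True if the user passes, False if the flow goes to step S6."""
    attempts = 0
    while True:
        send_challenge()             # S51: send the authentication request
        attempts += 1                # S51: record the authentication count
        if attempts >= upper_limit:  # S52: upper limit reached, go to S6
            return False
        if check_answer(attempts):   # S53: did this attempt pass?
            return True
        # Not passed: return to S51 and issue a new challenge.
```

Under this literal reading, the answer to the attempt that reaches the limit is never evaluated, so a limit of 3 allows two answered attempts. An implementation that wants three answered attempts could compare with `>` instead.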
6. The identification method according to claim 1, wherein the step S6 includes:
setting a period for storing the connection address in the blacklist according to the connection address or the access object;
the period is a preset first period or a preset second period.
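One way to realize a blacklist whose entries carry a storage period is an expiry timestamp per connection address. The sketch below is an assumption-laden illustration: it selects the period by access object only (the claim also allows selection by connection address), and the class name, the `long_ban_targets` parameter, and the one-hour and one-day defaults are hypothetical.

```python
import time
from typing import Callable, Dict, FrozenSet


class ExpiringBlacklist:
    def __init__(
        self,
        first_period_s: float = 3600.0,    # assumed first period
        second_period_s: float = 86400.0,  # assumed second period
        long_ban_targets: FrozenSet[str] = frozenset(),
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._first = first_period_s
        self._second = second_period_s
        self._long_ban_targets = long_ban_targets
        self._clock = clock
        self._expiry: Dict[str, float] = {}

    def add(self, address: str, target: str) -> None:
        """Store the address for the period selected by the access object."""
        period = self._second if target in self._long_ban_targets else self._first
        self._expiry[address] = self._clock() + period

    def contains(self, address: str) -> bool:
        """True while the address is still inside its storage period."""
        expiry = self._expiry.get(address)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expiry[address]  # period over: drop the entry
            return False
        return True
```

Injecting the clock keeps the class testable without waiting for real time to pass.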
7. A prevention system for crawler behavior, for implementing the identification method of any one of claims 1 to 6, comprising:
the load balancing module is connected with a plurality of users and used for receiving access requests sent by the users;
the analysis module receives the access request forwarded by the load balancing module and judges the similarity between the access request and a preset crawler model;
the verification module is connected with the analysis module and sends a verification code verification request to the user according to the similarity;
the log module is connected with the analysis module and the verification module and is used for adding users with extremely high similarity and/or users who fail the verification code verification request to a blacklist;
and the load balancing module judges whether to forward the access request of the external user according to the blacklist.
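The gating role of the load balancing module can be sketched as below, assuming the blacklist is a set shared with the log module and that the analysis module is reachable through a callable. The class name, method name, and the use of `None` for a refused request are hypothetical.

```python
from typing import Callable, Optional, Set


class LoadBalancer:
    def __init__(self, blacklist: Set[str], backend: Callable[[str, str], str]) -> None:
        self.blacklist = blacklist  # shared set, written by the log module
        self.backend = backend      # stands in for the analysis module

    def dispatch(self, address: str, target: str) -> Optional[str]:
        """Forward the request unless its connection address is blacklisted."""
        if address in self.blacklist:
            return None  # refused before reaching the analysis module
        return self.backend(address, target)
```

Because the set is shared by reference, an address added by the log module is refused on its next request without any extra notification step.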
8. The prevention system of claim 7, wherein the log module further comprises:
and the behavior analysis submodule reads the user, the connection address, the access object and the state code from the log stored in the log module and judges whether to add the user into the blacklist or not according to the user, the connection address, the access object and the state code.
CN202111510989.7A 2021-12-10 2021-12-10 Crawler behavior identification method and prevention system Pending CN114338099A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111510989.7A CN114338099A (en) 2021-12-10 2021-12-10 Crawler behavior identification method and prevention system

Publications (1)

Publication Number Publication Date
CN114338099A true CN114338099A (en) 2022-04-12

Family

ID=81050987

Country Status (1)

Country Link
CN (1) CN114338099A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343232A (en) * 2021-07-13 2021-09-03 壹药网科技(上海)股份有限公司 Reversal crawler system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107800684A (en) * 2017-09-20 2018-03-13 贵州白山云科技有限公司 Low-frequency crawler identification method and device
CN108154029A (en) * 2017-10-25 2018-06-12 上海观安信息技术股份有限公司 Intrusion detection method, electronic equipment and computer storage media
CN109274664A (en) * 2018-09-12 2019-01-25 珠海天燕科技有限公司 A kind of anti-crawler method and apparatus
US20190215330A1 (en) * 2018-01-07 2019-07-11 Microsoft Technology Licensing, Llc Detecting attacks on web applications using server logs
CN112073412A (en) * 2020-09-08 2020-12-11 北京天融信网络安全技术有限公司 Anti-crawler method, device, processor and computer readable medium
CN112434208A (en) * 2020-12-03 2021-03-02 百果园技术(新加坡)有限公司 Training of isolated forest and identification method and related device of web crawler of isolated forest

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination