CN114338099A

CN114338099A - Crawler behavior identification method and prevention system

Info

Publication number: CN114338099A
Application number: CN202111510989.7A
Authority: CN
Inventors: 王文彪; 于刚; 李志刚
Original assignee: Yiyaowang Technology Shanghai Co ltd
Current assignee: Yiyaowang Technology Shanghai Co ltd
Priority date: 2021-12-10
Filing date: 2021-12-10
Publication date: 2022-04-12

Abstract

The invention discloses a crawler behavior identification method and a prevention system. The method comprises the following steps: S1: receiving an access request of a user; S2: storing the connection address, the access object and the status code; S3: judging whether the connection address is a connection address with crawler behavior; if yes, go to S6; if not, go to S4; S4: judging the similarity between the access request and crawler behavior; if the similarity is in a first range, accepting the access request; if the similarity is in a second range, go to S5; if the similarity is in a third range, go to S6; S5: performing verification-code verification; if the verification passes, accepting the access request; if not, go to S6; S6: storing the connection address in a blacklist. The beneficial effects of the invention are as follows: the accumulated access requests and the real-time access requests of users are analyzed separately, which improves the accuracy of crawler behavior identification. In addition, the similarity ranges and the verification-code mechanism are set so as to effectively distinguish normal access requests from crawler behavior, which improves the user experience.

Description

Crawler behavior identification method and prevention system

Technical Field

The invention relates to the technical field of internet security, in particular to a crawler behavior identification method and a prevention system.

Background

Crawler software is a type of computer program that obtains specific data from specific web pages, interfaces and the like according to preset rules. It is widely used in many fields and can automatically acquire new data from a website and store it for convenient access, analysis and use. After being sorted, processed and screened, data acquired through crawler software can serve as a basis for big data analysis of specific objects, which is a very common analysis method in internet enterprises. In the pharmaceutical e-commerce industry, crawler software is often used as a common analysis tool to collect data such as product categories, commodity names, prices and discounts from competitors in the same industry. After analysis and processing, these data can be used to guide commodity pricing. For example, on a certain pharmaceutical e-commerce platform, the number of access requests per day is about 7000 thousand, including access requests from several crawlers with different functions. Crawler software occupies server resources and network bandwidth, degrades the user experience, and adversely affects the operation of the pharmaceutical e-commerce business. It is therefore necessary to design a corresponding identification and prevention system for crawler software.

In the prior art, the number of requests from a single IP address within a specific time interval is limited, but this method is not effective against crawler software that uses random IP addresses. To some extent it also affects users with legitimately high access volumes, such as distributors that purchase in large quantities and retail terminals.

Disclosure of Invention

Aiming at the problems in the prior art, a crawler behavior identification method and a prevention system are provided.

The specific technical scheme is as follows:

a method of identifying crawler behavior, comprising:

step S1: receiving an access request of a user, and recording a connection address, an access object and a state code;

step S2: storing the connection address, the access object and the state code into a log module;

step S3: analyzing the access object and the state code of the connection address by adopting the log module, and judging whether the connection address is a connection address with a crawler behavior according to an analysis result;

if yes, go to step S6;

if not, go to step S4;

step S4: adopting an analysis module to judge the similarity between the access request and the crawler behavior;

if the similarity is in a preset first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;

if the similarity is within a second predetermined range, go to step S5;

if the similarity is within a preset third range, go to step S6;

step S5: initiating verification of a verification code to the user and judging whether the user passes the verification;

if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;

if not, go to step S6;

step S6: and storing the connection address into a blacklist, and then finishing the judgment.
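The control flow of steps S1 to S6 can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the class and field names, the three callable hooks standing in for steps S3 to S5, and the numeric thresholds separating the three ranges are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class Request:
    address: str   # connection address (client IP)
    target: str    # access object (requested URI)
    status: int    # status code


@dataclass
class CrawlerIdentifier:
    # The three callables are hypothetical hooks standing in for steps S3-S5.
    flagged_by_log: Callable[[Request], bool]
    similarity: Callable[[Request], float]
    passes_verification: Callable[[Request], bool]
    first_max: float = 2    # assumed upper bound of the first range
    second_max: float = 6   # assumed upper bound of the second range
    blacklist: Set[str] = field(default_factory=set)
    log: List[Tuple[str, str, int]] = field(default_factory=list)

    def handle(self, req: Request) -> str:
        self.log.append((req.address, req.target, req.status))  # S1, S2
        if self.flagged_by_log(req):                             # S3
            return self._blacklist(req)
        score = self.similarity(req)                             # S4
        if score <= self.first_max:
            return "accept"
        # S5 is only reached when the score lies in the second range.
        if score <= self.second_max and self.passes_verification(req):
            return "accept"
        return self._blacklist(req)                              # S6

    def _blacklist(self, req: Request) -> str:
        self.blacklist.add(req.address)
        return "blacklisted"
```

Keeping the log-based judgment, the similarity score and the verification as separate hooks mirrors the separation between the log module, the analysis module and the verification step in the method.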

Preferably, the step S3 includes:

step S31: judging whether the connection address is in a preset high-risk connection address range or not;

if so, increasing the numerical value of the similarity, and then turning to step S32;

if not, go to the step S32;

step S32: judging whether a first access rule is met according to the access object and the state code;

if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;

if not, go to step S33;

step S33: judging whether a second access rule is met or not according to the access object;

if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;

if not, go to step S34;

step S34: judging whether a third access rule is met or not according to the state code and the connection address;

if yes, judging that the connection address is the connection address with the crawler behavior, and then turning to step S6;

if not, go to step S35;

step S35: judging whether the number of times of the access requests sent from the connection address exceeds an access limit value within preset time;

if so, increasing the numerical value of the similarity, and then turning to step S4;

if not, the process goes to step S4.

Preferably, the calculation formula of the similarity is as follows:

wherein: cos θ is the similarity, x₁ is the connection address, x₂ is the frequency of the connection address in the blacklist, y₁ is the access object, and y₂ is the frequency of the access object in the blacklist.

Preferably, the step S4 further includes:

step S41: extracting a user identification from the access request;

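step S42: judging whether the user identification is in a preset user identification range or not;

if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;

if not, go to step S43;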

step S43: calculating and judging the similarity between the access request and the crawler behavior;

if the similarity is in the first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;

if the similarity is in the second range, go to step S5;

if the similarity is in the third range, the process goes to step S6.

Preferably, the step S5 includes:

step S51: sending a verification code authentication request to the user, and recording the authentication times;

step S52: judging whether the authentication times reach an authentication upper limit value or not;

if yes, go to step S6;

if not, go to step S53;

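step S53: judging whether the authentication request passes or not;

if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;

if not, the process returns to the step S51.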

Preferably, the step S6 includes:

setting a period for storing the connection address in the blacklist according to the connection address or the access object;

the period is a preset first period or a preset second period.

A prevention system for crawler behavior, which is used for implementing the identification method and comprises:

the load balancing module is connected with a plurality of users and used for receiving access requests sent by the users;

the analysis module receives the access request forwarded by the load balancing module and judges the similarity between the access request and a preset crawler model;

the verification module is connected with the analysis module and sends a verification code verification request to the user according to the similarity;

the log module is connected with the analysis module and the verification module and is used for adding the users with extremely high similarity and/or the users who do not pass the verification code verification request into a blacklist;

and the load balancing module judges whether to forward the access request of the external user according to the blacklist.

Preferably, the log module further comprises:

and the behavior analysis submodule reads the user, the connection address, the access object and the state code from the log stored in the log module and judges whether to add the user into the blacklist or not according to the user, the connection address, the access object and the state code.

The technical scheme has the following advantages or beneficial effects: by arranging the log module and the analysis module to analyze the accumulated access request and the real-time access request of the user respectively, the accuracy of crawler behavior identification is improved, and a better identification effect is realized. And the similarity range and the verification mechanism of the verification code are reasonably set to effectively distinguish normal access requests and crawler behaviors, so that the user experience is improved.

Drawings

Embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings. The drawings are, however, to be regarded as illustrative and explanatory only and are not restrictive of the scope of the invention.

FIG. 1 is an overall schematic diagram of an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating the substep of step S3 according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating the substep of step S4 according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating the substep of step S5 according to an embodiment of the present invention;

FIG. 5 is a functional block diagram according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.

In the prior art, crawler software is mainly prevented by limiting the number of requests from a single IP address within a certain time. For example, an nginx reverse proxy server is configured with the two directives limit_conn_zone and limit_req_zone to limit the access frequency and the number of accesses of a user. This presents the following two problems: some crawler software evades the per-IP limit by using an IP address pool; and on a pharmaceutical e-commerce platform, the purchasing networks of some purchasers, distributors, drug stores and platform merchants share a unified egress, so the access volume of a single IP address is naturally high, and limiting connections per IP address inconveniences these users. To address these technical problems, the invention provides a system that identifies specific crawler behavior so as to prevent crawler software.

The invention comprises the following steps:

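a method for identifying crawler behavior, as shown in fig. 1, includes:

step S1: receiving an access request of a user, and recording a connection address, an access object and a status code;

step S2: storing the connection address, the access object and the status code into a log module;

step S3: analyzing the access object and the status code of the connection address by adopting the log module, and judging whether the connection address is a connection address with crawler behavior according to the analysis result;

if yes, go to step S6;

if not, go to step S4;

step S4: adopting an analysis module to judge the similarity between the access request and crawler behavior;

if the similarity is within a preset first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;

if the similarity is within a preset second range, go to step S5;

if the similarity is within a preset third range, go to step S6;

as an optional embodiment, the first range, the second range and the third range are consecutive ranges that together span the values 0-10, and the smaller the value, the lower the similarity between the access request and crawler behavior.

step S5: initiating verification-code verification to the user and judging whether the user passes the verification;

if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;

if not, go to step S6;

step S6: storing the connection address into a blacklist, and then finishing the judgment.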

Specifically, in this technical solution, the hardware architecture mainly includes the following. The load balancing server serves as the unified external interface of the system and forwards and balances users' access requests, so as to avoid service congestion caused by too many connections. The log server records information related to users' access requests; in one embodiment, a user is identified by recording the user's IP address, that is, the connection address. The log server may be an nginx server cluster, which records users' access requests through its access log component (access log). The log analysis server accesses the log server through a software interface, analyzes the access requests recorded in the log, and judges whether crawler behavior is present. The behavior analysis server receives the access requests forwarded by the load balancing server, evaluates the user's current access behavior, and generates the similarity between that behavior and crawler behavior; it directly allows access with low similarity, adds access with high similarity to the blacklist, and otherwise uses a verification code to judge whether the access request comes from crawler software. The blacklist stores the users prohibited from access and their access requests, and provides the basis for the behavior analysis performed by the behavior analysis server.
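As an illustration of how a log analysis server might read the connection address, access object and status code from an nginx access log, the following Python sketch parses one line. It assumes the log is written in nginx's default "combined" format; the record type and function name are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Matches nginx's default "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
LINE_RE = re.compile(
    r'(?P<addr>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


class LogRecord(NamedTuple):
    address: str      # connection address
    uri: str          # access object
    status: int       # status code
    user_agent: str   # user identification


def parse_line(line: str) -> Optional[LogRecord]:
    """Return the fields of one access log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return LogRecord(m["addr"], m["uri"], int(m["status"]), m["agent"])
```

Lines that do not match the expected format are skipped by returning None rather than raising, so that one malformed entry does not stop the analysis of the rest of the log.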

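In a preferred embodiment, as shown in fig. 2, step S3 includes:

step S31: judging whether the connection address is in a preset high-risk connection address range;

if yes, increasing the numerical value of the similarity, and turning to step S32;

if not, go to step S32;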

Specifically, some crawler software uses an IP address pool to evade the prior-art limit on the number of accesses from a single IP address. Most crawlers use IP address pools obtained through an Internet Data Center (IDC), and such pools typically share the same IP address segment. Therefore, by presetting specific high-risk IP address segments and increasing the similarity value of access requests from those segments, which is equivalent to lowering the threshold at which an access request is judged to be crawler behavior, the accuracy of identifying this type of crawler software can be increased.
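A minimal Python sketch of the check in step S31 is given below. The network list uses reserved documentation address blocks as placeholders, and the size of the increase and the cap at 10 are assumptions; the text does not specify either.

```python
import ipaddress

# Placeholder high-risk segments (reserved documentation ranges, not real IDC ranges).
HIGH_RISK_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]
RISK_INCREMENT = 2    # assumed amount added to the similarity
MAX_SIMILARITY = 10   # assumed upper end of the similarity scale


def adjust_similarity(address: str, similarity: float) -> float:
    """Raise the similarity when the address lies in a high-risk segment (S31)."""
    ip = ipaddress.ip_address(address)
    if any(ip in net for net in HIGH_RISK_NETWORKS):
        return min(MAX_SIMILARITY, similarity + RISK_INCREMENT)
    return similarity
```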

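Step S32: judging whether the first access rule is met according to the access object and the status code;

if yes, judging that the connection address is a connection address with crawler behavior, and then going to step S6;

if not, go to step S33;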

Specifically, within a preset period, such as 1 hour, if the access objects requested from a connection address include high-risk requests, such as a login request (login), an administrator instruction (admin), a backup request (backup) or a database connection request (sql), and the status code of such a request is 404, the connection address is judged to be a high-risk connection address and needs to be added to the blacklist, so that the load balancing server refuses access from it.

As an optional technical feature, in step S32 the source of the connection address may be further analyzed, and if the address segment of the connection address belongs to an internet data center, the connection address is added to the blacklist.
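The first access rule can be sketched in Python as follows. The keyword list and the one-hour window come from the example in the text; matching keywords by substring, the record layout and the function name are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

HIGH_RISK_KEYWORDS = ("login", "admin", "backup", "sql")
WINDOW = timedelta(hours=1)


def meets_first_rule(records: Iterable[Tuple[datetime, str, int]],
                     now: datetime) -> bool:
    """records holds (timestamp, uri, status) entries for one connection address."""
    for ts, uri, status in records:
        recent = timedelta(0) <= now - ts <= WINDOW
        risky = any(keyword in uri.lower() for keyword in HIGH_RISK_KEYWORDS)
        # A high-risk object that does not exist (404) suggests probing.
        if recent and risky and status == 404:
            return True
    return False
```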

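Step S33: judging whether the second access rule is met according to the access object;

if yes, judging that the connection address is a connection address with crawler behavior, and then going to step S6;

if not, go to step S34;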

Specifically, within a short preset period, such as 30 minutes, if the access objects are similar, the access behavior does not conform to normal user behavior and may be generated by software. For example, the uri may be extracted with regexp_extract; if the uri prefixes are similar but the id changes regularly, such as repeatedly by +1 or -1, the access requests may have been generated by crawler software.
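One way to detect such regularly changing ids is sketched below in Python. It assumes that the id is a trailing numeric path segment and that a run of five consecutive requests whose ids differ by exactly 1 is suspicious; both the URI layout and the run length are illustrative assumptions.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Splits a URI such as "/product/1001" into a prefix and a trailing numeric id.
ID_RE = re.compile(r"^(?P<prefix>.*/)(?P<id>\d+)$")


def looks_sequential(uris: Iterable[str], min_run: int = 5) -> bool:
    """True if some prefix shows min_run requests in a row with ids stepping by 1."""
    ids_by_prefix: Dict[str, List[int]] = defaultdict(list)
    for uri in uris:
        m = ID_RE.match(uri)
        if m:
            ids_by_prefix[m["prefix"]].append(int(m["id"]))
    for ids in ids_by_prefix.values():
        run = 1
        for prev, cur in zip(ids, ids[1:]):
            run = run + 1 if abs(cur - prev) == 1 else 1
            if run >= min_run:
                return True
    return False
```

The URIs are expected in request order for a single connection address within the preset period.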

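Step S34: judging whether the third access rule is met according to the status code and the connection address;

if yes, judging that the connection address is a connection address with crawler behavior, and then going to step S6;

if not, go to step S35;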

Specifically, within a short preset period, for example 30 minutes, if the status codes are abnormal, for example 4XX status codes appear frequently, and the connection address comes from an internet data center, the address is exhibiting crawler behavior and needs to be added to the blacklist so that the load balancing server blocks its access.
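The third access rule could be approximated as in the Python sketch below. The text does not quantify "frequently", so the minimum sample size and the error ratio used here are assumed values, and the from_idc flag is presumed to be supplied by a separate address lookup.

```python
from typing import Sequence


def meets_third_rule(statuses: Sequence[int], from_idc: bool,
                     min_requests: int = 20,
                     max_error_ratio: float = 0.5) -> bool:
    """statuses holds the status codes of one address within the preset period."""
    if not from_idc or len(statuses) < min_requests:
        return False
    client_errors = sum(1 for status in statuses if 400 <= status < 500)
    return client_errors / len(statuses) > max_error_ratio
```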

Step S35: judging whether the number of access requests sent from the connection address exceeds an access limit value within a preset time;

if so, increasing the numerical value of the similarity, and then going to step S4;

if not, the process goes to step S4.

Specifically, within a very short preset period, for example 5 minutes, if the number of access requests sent from a connection address significantly exceeds the normal number of accesses, for example 200 times, that is, the access frequency is much higher than that of a normal user, the requests may be considered to be generated by crawler software, and the connection address needs to be blacklisted so that the load balancing server blocks its access.

As an optional technical feature, the amount by which the similarity is increased may differ for access requests that match different access rules. For example, in step S35, if the number of access requests exceeds the access limit value within the preset time, the similarity may be raised directly into the second range to force verification-code verification for those access requests.
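The count in step S35 can be kept with a sliding window per connection address, as in this Python sketch. The 5-minute window and the limit of 200 follow the example values in the text; the class name and the use of numeric timestamps in seconds are assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RequestCounter:
    def __init__(self, window_seconds: float = 300, limit: int = 200) -> None:
        self.window = window_seconds
        self.limit = limit
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, address: str, timestamp: float) -> bool:
        """Record one request; return True if the address now exceeds the limit."""
        hits = self._hits[address]
        hits.append(timestamp)
        # Drop requests that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window:
            hits.popleft()
        return len(hits) > self.limit
```

A True result would then be used to raise the similarity before step S4, rather than to reject the request outright.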

In a preferred embodiment, the similarity is calculated by the formula:

wherein: cos θ is the similarity, x₁ is the connection address, x₂ is the frequency of the connection address in the blacklist, y₁ is the access object, and y₂ is the frequency of the access object in the blacklist.

Specifically, in this technical solution, the similarity is calculated mainly from two parameters of the access request, namely the connection address and the access object, against the historical crawler behavior data stored in the blacklist. Because the historical crawler behavior data recorded in the blacklist is cleaned up regularly, the analysis module obtains better timeliness each time it calculates against that data. Generally, for the same crawler software, the last proxy servers through which its access requests pass are regionally related, so using the connection address as a parameter of the similarity calculation effectively indicates whether the access request comes from a high-risk connection address segment. Furthermore, in practice the content of interest to a crawler, such as product SKUs, product categories, pricing and discount prices, typically falls in a fixed range, whereas a normal user stays on a single page for a period of time and accesses product pictures, parameters, functional descriptions and the like. Therefore, whether the access request is similar to the historical crawler behavior data recorded in the blacklist can also be judged from the requested object. The two parameters are input into the formula to obtain the value of the similarity.
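Read as the components of two vectors, (x₁, y₁) and (x₂, y₂), the symbols defined for the similarity fit the standard two-dimensional cosine similarity. Under that reading, which is an interpretation of the symbol definitions and not a formula stated in this text, the calculation would take the form:

```latex
\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}
```

How the connection address and the access object are encoded as the numeric values x₁ and y₁ is not specified here and would have to be chosen by the implementer.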

As an optional implementation, the similarity ranges from 0 to 10, where 0-2 is the first range, in which access is allowed; 3-6 is the second range, in which verification through a verification code is required; and 7-10 is the third range, which indicates that the request is too similar to existing historical crawler behavior data and the connection address is to be added directly to the blacklist.
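The mapping from a similarity value to the three ranges can be written as a small Python function. The range boundaries are those of the optional implementation above; treating the similarity as an integer score and the returned labels are assumptions.

```python
def classify(similarity: int) -> str:
    """Map an integer similarity in 0-10 to the action of its range."""
    if not 0 <= similarity <= 10:
        raise ValueError("similarity must be between 0 and 10")
    if similarity <= 2:
        return "accept"      # first range: access allowed
    if similarity <= 6:
        return "verify"      # second range: verification code required
    return "blacklist"       # third range: add to blacklist
```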

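In a preferred embodiment, as shown in fig. 3, step S4 further includes:

step S41: extracting a user identification from the access request;

step S42: judging whether the user identification is in a preset user identification range;

if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;

if not, go to step S43;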

Specifically, some search engines, such as Baidu, Sogou and 360 Search, also periodically and automatically access websites through computer programs in order to build indexes for their users. Such programs do not affect the operation of the pharmaceutical e-commerce platform, and the source of their access requests is usually indicated by a user identifier (user_agent). Such access requests can therefore be identified and released by presetting a user identifier range, which improves the efficiency of crawler behavior identification.
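A preset user identifier range can be as simple as a list of substrings to look for in the User-Agent header, as in this Python sketch. The tokens listed are illustrative examples of search engine crawler identifiers and would need to be checked against each search engine's own documentation before use.

```python
# Illustrative tokens for search engine crawlers; verify before relying on them.
ALLOWED_AGENT_TOKENS = ("Baiduspider", "Sogou web spider", "360Spider")


def is_allowed_agent(user_agent: str) -> bool:
    """True if the user identifier falls in the preset range (S42)."""
    lowered = user_agent.lower()
    return any(token.lower() in lowered for token in ALLOWED_AGENT_TOKENS)
```

Because the User-Agent header is supplied by the client and can be forged, a deployment would likely combine this check with the address-based rules rather than rely on it alone.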

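step S43: calculating and judging the similarity between the access request and crawler behavior;

if the similarity is in the first range, receiving the access request and returning the requested access object to the user, and then finishing the judgment;

if the similarity is in the second range, go to step S5;

if the similarity is in the third range, the process goes to step S6.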

As an optional implementation, step S4 further includes: querying the log module to determine whether a corresponding mark has been set for the connection address, and if so, acquiring the corresponding similarity value for the calculation. In this way the similarity can be judged quickly based on the connection address, which improves the judgment efficiency.

As an optional implementation, step S4 further includes: determining whether to jump directly to step S5 according to whether the access object includes a key object, such as a login request, an administrator instruction, a backup request or a database connection request. In this way, when the access request includes critical operations, good security is achieved through verification-code verification.

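In a preferred embodiment, as shown in fig. 4, step S5 includes:

step S51: sending a verification code authentication request to the user, and recording the number of authentication attempts;

step S52: judging whether the number of authentication attempts has reached an authentication upper limit value;

if yes, go to step S6;

if not, go to step S53;

step S53: judging whether the authentication request passes;

if so, receiving the access request and returning the requested access object to the user, and then finishing the judgment;

if not, the process returns to step S51.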
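The loop formed by steps S51 to S53 can be sketched in Python as follows. The challenge callable is a hypothetical stand-in for sending a verification code and checking the user's answer, and the upper limit of 3 is an assumed value.

```python
from typing import Callable


def run_verification(challenge: Callable[[], bool], max_attempts: int = 3) -> str:
    """Follow the S51 -> S52 -> S53 order; return 'accept' or 'blacklist'."""
    attempts = 0
    while True:
        attempts += 1                  # S51: count the authentication request
        if attempts >= max_attempts:   # S52: upper limit reached?
            return "blacklist"         # go to S6
        if challenge():                # S53: did the user pass?
            return "accept"
        # otherwise return to S51
```

Because the limit is checked in S52 before the result is judged in S53, an upper limit of 3 leaves the user two evaluated attempts under this literal reading of the step order.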

In a preferred embodiment, step S6 includes:

setting a period for storing the connection address in a blacklist according to the connection address or the access object;

the period is a preset first period or a preset second period.
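A blacklist whose entries expire after one of two preset periods can be sketched in Python as follows. The concrete durations, the severe flag used to choose between the two periods, and the injectable clock are assumptions for illustration; the text only states that the period depends on the connection address or the access object.

```python
import time
from typing import Callable, Dict


class Blacklist:
    def __init__(self, first_period: float = 3600, second_period: float = 86400,
                 clock: Callable[[], float] = time.time) -> None:
        self.first_period = first_period      # assumed shorter period, in seconds
        self.second_period = second_period    # assumed longer period, in seconds
        self._clock = clock
        self._expires: Dict[str, float] = {}

    def add(self, address: str, severe: bool = False) -> None:
        """Store the address until the chosen period has elapsed."""
        period = self.second_period if severe else self.first_period
        self._expires[address] = self._clock() + period

    def contains(self, address: str) -> bool:
        """True while the address is still within its storage period."""
        expires = self._expires.get(address)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._expires[address]   # expired entries are cleaned up lazily
            return False
        return True
```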

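A prevention system for crawler behavior, as shown in fig. 5, is used for implementing the identification method and comprises:

the load balancing module 1 is connected with a plurality of users and used for receiving access requests sent by the users;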

the analysis module 2 receives the access request forwarded by the load balancing module and judges the similarity between the access request and a preset crawler model;

the verification module 3 is connected with the analysis module 2, and sends a verification code verification request to the user according to the similarity;

the log module 4 is connected with the analysis module 2 and the verification module 3 and is used for adding users with extremely high similarity and/or users who do not pass the verification code verification request into a blacklist;

and the load balancing module 1 judges whether to forward the access request of the external user according to the blacklist.

In a preferred embodiment, the log module further comprises:

and the behavior analysis submodule 41 reads the user, the connection address, the resource access request and the status code from the log stored in the log module 4, and judges whether to add the user to the blacklist according to the user, the connection address, the resource access request and the status code.

The invention has the beneficial effects that: by arranging the log module and the analysis module to analyze the accumulated access request and the real-time access request of the user respectively, the accuracy of crawler behavior identification is improved, and a better identification effect is realized. And the similarity range and the verification mechanism of the verification code are reasonably set to effectively distinguish normal access requests and crawler behaviors, so that the user experience is improved.

It should be noted that some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

As will be appreciated by one of ordinary skill in the art, various aspects of the invention, or possible implementations of various aspects, may be embodied as a system, method, or computer program product. Accordingly, aspects of the present invention, or possible implementations of aspects, may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, aspects of the invention, or possible implementations of aspects, may take the form of a computer program product, which refers to computer-readable program code stored in a computer-readable medium.

The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing, such as Random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, and portable compact disc read-only memory (CD-ROM).

A processor in the computer reads the computer-readable program code stored in the computer-readable medium, so that the processor can perform the functional actions specified in each step, or combination of steps, of the flowcharts, and so that means are produced for implementing the functional actions specified in each block, or combination of blocks, of the block diagrams.

It should be understood that a processor in a computer may be understood as one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers (MCUs), microprocessors, or other electronic components for executing the aforementioned computer readable program code.

The computer readable program code may execute entirely on the user's local computer, partly on the user's local computer, as a stand-alone software package, partly on the user's local computer and partly on a remote computer or entirely on the remote computer or server. It should also be noted that, in some alternative implementations, the functions noted in the flowchart or block diagram block may occur out of the order noted in the figures. For example, two steps or two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims

1. A crawler behavior recognition method is characterized by comprising the following steps:

if yes, go to step S6;

if not, go to step S4;

if the similarity is within a second predetermined range, go to step S5;

if the similarity is within a preset third range, go to step S6;

if not, go to step S6;

2. The identification method according to claim 1, wherein the step S3 includes:

if not, go to the step S32;

if not, go to step S33;

if not, go to step S34;

if not, go to step S35;

if not, the process goes to step S4.

3. The recognition method according to claim 1, wherein the similarity is calculated by the formula:

wherein: cos θ is the similarity, x₁ is the connection address, x₂ is the frequency of the connection address in the blacklist, y₁ is the access object, and y₂ is the frequency of the access object in the blacklist.

4. The identification method according to claim 1, wherein the step S4 further comprises:

step S41: extracting a user identification from the access request;

if not, go to step S43;

if the similarity is in the second range, go to step S5;

if the similarity is in the third range, the process goes to step S6.

5. The identification method according to claim 1, wherein the step S5 includes:

if yes, go to step S6;

if not, go to step S53;

step S53: judging whether the authentication request passes or not;

if not, the process returns to the step S51.

6. The identification method according to claim 1, wherein the step S6 includes:

the period is a preset first period or a preset second period.

7. A prevention system for crawler behavior, for implementing the identification method of any one of claims 1 to 6, comprising:

8. The identification system of claim 7, wherein the log module further comprises: