CN112383513B - Crawler behavior detection method and device based on proxy IP address pool and storage medium - Google Patents


Publication number
CN112383513B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011164587.1A
Other languages
Chinese (zh)
Other versions
CN112383513A (en)
Inventor
许祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dt Dream Technology Co Ltd
Original Assignee
Hangzhou Dt Dream Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dt Dream Technology Co Ltd
Priority to CN202011164587.1A
Publication of CN112383513A
Application granted
Publication of CN112383513B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/50 Address allocation
    • H04L61/5061 Pools of addresses
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/59 Network arrangements, protocols or services for addressing or naming using proxies for addressing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/145 Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms

Abstract

The application provides a crawler behavior detection method, a crawler behavior detection device and a storage medium based on a proxy IP address pool. The method comprises: obtaining request data to be tested; if the source IP address of the request data to be tested is a proxy IP address to be tested, determining a target proxy IP address pool to which the proxy IP address to be tested belongs, wherein the target proxy IP address pool is one of at least one proxy IP address pool, and each proxy IP address pool has corresponding access behavior characteristics; and detecting whether malicious crawler behavior exists in the request data to be tested according to the target access behavior characteristics of the target proxy IP address pool. The application can effectively detect crawler behavior that uses an IP proxy pool to spread its requests across many addresses and thereby evade detection, so that malicious crawler behavior based on an IP proxy pool is effectively identified and the detection effect for such behavior is improved.

Description

Crawler behavior detection method and device based on proxy IP address pool and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a crawler behavior detection method and apparatus based on a proxy IP address pool, and a storage medium.
Background
With the development of the internet, a large number of network data access services are provided over the internet, and a large number of crawlers targeting that data have emerged, for example crawlers that scrape remaining-ticket information from ticket purchasing systems and crawlers that harvest promotional offers. Malicious crawlers seriously affect the data security and business security of various business systems, so how to detect malicious crawler behavior is very important.
In the related art, malicious crawler behavior is generally identified either by detecting the request frequency or by algorithmically analyzing overall characteristics of the request behavior, such as the proportion of requests for static files and the request frequency.
With these methods, when malicious crawler behavior is carried out through IP addresses in a proxy IP (Internet Protocol) address pool, the identification is not accurate enough, the identification process is cumbersome, and the identification effect is poor.
Disclosure of Invention
The present application is directed to solving, at least in part, one of the technical problems in the related art.
Therefore, an object of the present application is to provide a crawler behavior detection method, device and storage medium based on a proxy IP address pool, which can effectively detect crawler behavior that uses an IP proxy pool to spread its requests across many addresses and thereby evade detection, so that malicious crawler behavior based on the IP proxy pool is effectively identified and the detection effect for such behavior is improved.
In order to achieve the above object, an embodiment of the first aspect of the present application provides a crawler behavior detection method based on a proxy IP address pool, including: acquiring request data to be tested; if the source IP address of the request data to be tested is a proxy IP address to be tested, determining a target proxy IP address pool to which the proxy IP address to be tested belongs, wherein the target proxy IP address pool is one of at least one proxy IP address pool, and each proxy IP address pool has corresponding access behavior characteristics; and detecting whether malicious crawler behavior exists in the request data to be tested according to the target access behavior characteristics of the target proxy IP address pool.
According to the crawler behavior detection method based on the proxy IP address pool provided by the embodiment of the first aspect of the application, request data to be tested is obtained; when the source IP address of the request data to be tested is a proxy IP address to be tested, the target proxy IP address pool to which it belongs is determined, the target proxy IP address pool being one of at least one proxy IP address pool, each of which has corresponding access behavior characteristics; and whether malicious crawler behavior exists in the request data to be tested is detected according to the target access behavior characteristics of the target proxy IP address pool. Crawler behavior that uses an IP proxy pool to spread its requests across many addresses can thus be effectively detected, so that malicious crawler behavior based on the IP proxy pool is effectively identified and the detection effect for such behavior is improved.
In order to achieve the above object, an embodiment of the second aspect of the present application provides a crawler behavior detection apparatus based on a proxy IP address pool, including: a first acquisition module, configured to acquire request data to be tested; a determining module, configured to determine, when the source IP address of the request data to be tested is a proxy IP address to be tested, a target proxy IP address pool to which the proxy IP address to be tested belongs, where the target proxy IP address pool is one of at least one proxy IP address pool, and each proxy IP address pool has corresponding access behavior characteristics; and a detection module, configured to detect whether malicious crawler behavior exists in the request data to be tested according to the target access behavior characteristics of the target proxy IP address pool.
According to the crawler behavior detection apparatus based on the proxy IP address pool provided by the embodiment of the second aspect of the application, request data to be tested is acquired; when the source IP address of the request data to be tested is a proxy IP address to be tested, the target proxy IP address pool to which it belongs is determined, the target proxy IP address pool being one of at least one proxy IP address pool, each of which has corresponding access behavior characteristics; and whether malicious crawler behavior exists in the request data to be tested is detected according to the target access behavior characteristics of the target proxy IP address pool. Crawler behavior that uses an IP proxy pool to spread its requests across many addresses can thus be effectively detected, so that malicious crawler behavior based on the IP proxy pool is effectively identified and the detection effect for such behavior is improved.
An embodiment of the third aspect of the present application provides a non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of a computer device, enable the computer device to perform the crawler behavior detection method based on a proxy IP address pool provided by the embodiment of the first aspect of the application.
With the non-transitory computer-readable storage medium provided in the embodiment of the third aspect of the present application, request data to be tested is obtained; when the source IP address of the request data to be tested is a proxy IP address to be tested, the target proxy IP address pool to which it belongs is determined, the target proxy IP address pool being one of at least one proxy IP address pool, each of which has corresponding access behavior characteristics; and whether malicious crawler behavior exists in the request data to be tested is detected according to the target access behavior characteristics of the target proxy IP address pool. Crawler behavior that uses an IP proxy pool to spread its requests across many addresses can thus be effectively detected, so that malicious crawler behavior based on the IP proxy pool is effectively identified and the detection effect for such behavior is improved.
An embodiment of the fourth aspect of the present application provides a computer device, including: a housing, a processor, a memory, a circuit board and a power supply circuit, wherein the circuit board is arranged in a space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the computer device; the memory is used for storing executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to execute the crawler behavior detection method based on the proxy IP address pool provided by the embodiment of the first aspect of the present application.
With the computer device provided by the embodiment of the fourth aspect of the application, request data to be tested is acquired; when the source IP address of the request data to be tested is a proxy IP address to be tested, the target proxy IP address pool to which it belongs is determined, the target proxy IP address pool being one of at least one proxy IP address pool, each of which has corresponding access behavior characteristics; and whether malicious crawler behavior exists in the request data to be tested is detected according to the target access behavior characteristics of the target proxy IP address pool. Crawler behavior that uses an IP proxy pool to spread its requests across many addresses can thus be effectively detected, so that malicious crawler behavior based on the IP proxy pool is effectively identified and the detection effect for such behavior is improved.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Fig. 1 is a schematic flowchart of a crawler behavior detection method based on a proxy IP address pool according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a crawler behavior detection method based on a proxy IP address pool according to another embodiment of the present application;
Fig. 3 is a schematic flowchart of a crawler behavior detection method based on a proxy IP address pool according to another embodiment of the present application;
Fig. 4 is a schematic flowchart of a crawler behavior detection method based on a proxy IP address pool according to another embodiment of the present application;
Fig. 5 is a schematic structural diagram of a crawler behavior detection apparatus based on a proxy IP address pool according to an embodiment of the present application;
Fig. 6 is a schematic flowchart of a crawler behavior detection method based on a proxy IP address pool according to another embodiment of the present application;
Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application. On the contrary, the embodiments of the application include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
Fig. 1 is a schematic flowchart of a crawler behavior detection method based on a proxy IP address pool according to an embodiment of the present application.
This embodiment is described by taking as an example the case in which the crawler behavior detection method based on the proxy IP address pool is configured in a crawler behavior detection apparatus based on the proxy IP address pool.
The crawler behavior detection apparatus based on the proxy IP address pool may be set in a server or in a computer device, which is not limited in this embodiment of the present application.
In the following, the case in which the apparatus, and therefore the method, is configured in a computer device is taken as the example.
It should be noted that the execution subject of the embodiment of the present application may be, in terms of hardware, a Central Processing Unit (CPU) in a server or a computer device, and, in terms of software, a related background service in the server or the computer device, which is not limited here.
Referring to fig. 1, the method includes:
S101: Acquire the request data to be tested.
The request message that is currently being checked for malicious crawler behavior is referred to as the request message to be tested, and the request data to be tested is the data related to that message, for example its source IP address, its destination IP address, and the interface name of the called service.
The request message to be tested may be, for example, a request message supporting HTTP (Hypertext Transfer Protocol), or a request message supporting any other communication protocol, which is not limited here.
Generally, in a data communication scenario, a computer device responds to a user access instruction by generating a corresponding request message and sending it to a background server to obtain the corresponding service. When the background server receives the request message, it performs malicious crawler behavior detection on it: it parses the content of the request message to obtain the source IP address, the destination IP address, the interface name of the called service, and the like, which are used as the request data to be tested.
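As a minimal sketch of S101, the request data to be tested can be modeled as a small record built from one incoming request. The class name `RequestData`, the function `extract_request_data`, and the choice of the URL path as the interface name are assumptions made for illustration; the patent does not specify a data layout.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class RequestData:
    """Request data to be tested, extracted from one request message."""
    source_ip: str
    destination_ip: str
    interface: str  # name of the called service interface


def extract_request_data(source_ip: str, destination_ip: str, url: str) -> RequestData:
    # Assumption: the interface is identified by the URL path alone, so that
    # differing query strings do not split one interface into many.
    return RequestData(source_ip, destination_ip, urlsplit(url).path)


sample = extract_request_data(
    "203.0.113.7", "198.51.100.10",
    "https://example.com/api/tickets?date=2020-10-27",
)
```

Keeping the record immutable makes it safe to use as a key when request counts are later aggregated per interface or per pool.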
S102: If the source IP address of the request data to be tested is a proxy IP address to be tested, determine a target proxy IP address pool to which the proxy IP address to be tested belongs, wherein the target proxy IP address pool is one of at least one proxy IP address pool, and each proxy IP address pool has corresponding access behavior characteristics.
In the embodiment of the application, after the request data to be tested is obtained, whether the source IP address it carries is a proxy IP address to be tested is analyzed. A source IP address that is a proxy IP address to be tested has been identified as a proxy IP address and belongs to a pre-established proxy IP address pool, through which user-side devices send a large number of request messages to the background server; the steps for establishing the proxy IP address pool are described in the subsequent embodiments and are not repeated here. In that case, the proxy IP address pool to which the proxy IP address to be tested belongs is identified and taken as the target proxy IP address pool, so that the access behavior characteristics of the crawler behavior carried out through the pool can subsequently be determined and used to assist in detecting malicious crawler behavior.
For the method of analyzing whether the source IP address in the request data to be tested is a proxy IP address, reference may be made to the related art, and details are not repeated here.
S103: Detect whether malicious crawler behavior exists in the request data to be tested according to the target access behavior characteristics of the target proxy IP address pool.
The access behavior characteristics describe the overall access behavior of the proxy IP addresses in a proxy IP address pool, for example the distribution, over all proxy IP addresses in the pool, of the queries per second (QPS) for a certain interface or of the call duration for a certain interface. Any other possible measure of overall access behavior may also be used, which is not limited here.
Each proxy IP address pool has corresponding access behavior characteristics, which may be labeled in advance. That is, massive request data for a network application may be learned in advance: the source IP addresses carried by the request data are first grouped to obtain the proxy IP address pools; the access behavior of each proxy IP address in a pool is then analyzed, together with the overall access behavior of the pool; and the access behavior characteristics of the pool are labeled according to the analysis result. Any other possible method may also be used to determine the access behavior characteristics corresponding to a proxy IP address pool, which is not limited here.
That is to say, in the embodiment of the present application, the proxy IP address pool to which the proxy IP address to be tested belongs is identified and taken as the target proxy IP address pool, the access behavior characteristics of the crawler behavior carried out through that pool are determined, and those characteristics are used to assist in detecting malicious crawler behavior.
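The pool lookup in S102 can be sketched as an inverted index from proxy IP address to pool. The pool identifiers, the example addresses, and the names `PROXY_POOLS`, `IP_TO_POOL` and `find_target_pool` are hypothetical; how the pools are actually stored is not stated in the source.

```python
from typing import Dict, Optional, Set

# Hypothetical pre-built pools: pool id -> member proxy IP addresses.
PROXY_POOLS: Dict[str, Set[str]] = {
    "pool-a": {"203.0.113.1", "203.0.113.2"},
    "pool-b": {"198.51.100.5"},
}

# Inverted index so that each lookup is a single dictionary access.
IP_TO_POOL: Dict[str, str] = {
    ip: pool_id
    for pool_id, members in PROXY_POOLS.items()
    for ip in members
}


def find_target_pool(source_ip: str) -> Optional[str]:
    """Return the id of the pool containing source_ip, or None if the
    address is not a known proxy IP address."""
    return IP_TO_POOL.get(source_ip)
```

A `None` result corresponds to a source IP address that is not a proxy IP address to be tested, in which case pool-level detection does not apply.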
Optionally, in some embodiments, referring to fig. 2, after acquiring the request data to be tested, the method further includes:
S201: Judge whether the source IP address of the request data to be tested is in the source IP address list.
S202: If the source IP address of the request data to be tested is in the source IP address list, detect whether malicious crawler behavior exists in the request data to be tested by using a general malicious crawler behavior detection method.
S203: If the source IP address of the request data to be tested is not in the source IP address list, directly judge that no malicious crawler behavior exists in the request data to be tested.
That is, in this embodiment of the application, it is first determined whether the source IP address of the request data to be tested is a proxy IP address to be tested. If not, a check of whether the source IP address is in the source IP address list may be triggered. The source IP address list may be configured in advance (see the following embodiments), and the source IP addresses in it may be IP addresses suspected of malicious crawler behavior. When the source IP address is in the list, it is suspected of malicious crawler behavior but does not belong to a specific proxy IP address pool. In that case, a general malicious crawler behavior detection method may be used to detect whether malicious crawler behavior exists in the request data to be tested, for example by examining the access frequency of the source IP address, the proportion of requests for static files, and the request frequency.
If the source IP address of the request data to be tested is not in the source IP address list, the source IP address is neither a proxy IP address nor an IP address highly suspected of malicious crawler behavior, so it is directly determined that no malicious crawler behavior exists in the request data to be tested.
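The three-way decision described in S102/S103 and S201 to S203 can be sketched as a single function. The two detectors are passed in as callables because the source does not fix their implementation; the function name `detect` and all parameter names are assumptions.

```python
from typing import Callable, Dict, Set


def detect(source_ip: str,
           ip_to_pool: Dict[str, str],
           suspect_ips: Set[str],
           pool_is_malicious: Callable[[str], bool],
           general_check: Callable[[str], bool]) -> bool:
    """Return True if the request is judged to be malicious crawler behavior."""
    pool_id = ip_to_pool.get(source_ip)
    if pool_id is not None:
        # S102/S103: the source is a proxy IP address, so judge it by the
        # access behavior characteristics of its whole pool.
        return pool_is_malicious(pool_id)
    if source_ip in suspect_ips:
        # S201/S202: not a proxy, but in the source IP address list, so
        # apply a general per-address detection method.
        return general_check(source_ip)
    # S203: neither a proxy nor a suspect address.
    return False
```

This sketch checks pool membership first; as the description notes elsewhere, the two membership checks could equally be made in the other order or in parallel.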
For example, consider an application scenario in which the decision rule is: if the request frequency for accessing a certain interface reaches 100 QPS, the request data accessing that interface is considered crawler behavior. If a crawler uses 100 proxy IP addresses and each proxy IP address crawls data at 1 QPS, this crawler behavior based on the proxy IP address pool cannot be detected by the methods in the related art, because no single address comes near the threshold. With the detection method of the present application, a source IP address belonging to the source IP address list may be identified with a general crawler identification policy, while for a proxy IP address pool, crawler behavior detection is performed on the aggregated access behavior characteristics of the pool as a whole. Specifically, the 100 proxy IP addresses that each crawl the target system at 1 QPS are considered to belong to one proxy IP address pool, and the request rate of the pool is calculated to be 100 QPS (the target access behavior characteristic), so malicious crawler behavior detection is performed on the pool's request rate of 100 QPS.
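The arithmetic of this example can be reproduced with a short sketch that counts requests per pool over a fixed window. The function `pool_qps`, the fixed-window approach, and the synthetic addresses are assumptions for illustration; only the 100 QPS rule and the 100 addresses at 1 QPS come from the example above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

QPS_THRESHOLD = 100.0  # decision rule taken from the example above


def pool_qps(requests: Iterable[Tuple[str, float]],
             ip_to_pool: Dict[str, str],
             window_seconds: float) -> Dict[str, float]:
    """Aggregate request counts per pool and convert them to QPS.

    requests holds (source_ip, timestamp) pairs that all fall inside
    one observation window of window_seconds.
    """
    counts: Dict[str, int] = defaultdict(int)
    for source_ip, _timestamp in requests:
        pool_id = ip_to_pool.get(source_ip)
        if pool_id is not None:
            counts[pool_id] += 1
    return {pool_id: n / window_seconds for pool_id, n in counts.items()}


# 100 proxy IP addresses, each sending 1 request per second for 10 seconds.
ips = [f"10.0.0.{i}" for i in range(1, 101)]
ip_to_pool = {ip: "pool-a" for ip in ips}
requests = [(ip, float(t)) for t in range(10) for ip in ips]

rates = pool_qps(requests, ip_to_pool, window_seconds=10.0)
flagged = {pool for pool, qps in rates.items() if qps >= QPS_THRESHOLD}
```

Each individual address stays at 1 QPS, far below the rule, while the pool as a whole reaches the 100 QPS threshold and is flagged.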
In the embodiment of the application, the proxy IP address pool is used as the analysis target, so that crawler behavior that spreads its requests across a proxy IP address pool can be effectively detected. The pool-based analysis can also be combined with a general malicious crawler behavior detection method, so that detection continues seamlessly when a source IP address is not a proxy IP address, which effectively improves the comprehensiveness and the effect of detection.
In addition, it should be noted that, after the request data to be tested is obtained, the two determinations (whether the source IP address is a proxy IP address to be tested, and whether it is in the source IP address list) may be made at the same time, or the proxy IP address determination may be made first and the source IP address list determination second, or the source IP address list determination may be made first and the proxy IP address determination second. The embodiment of the present application does not limit the execution order of the two determinations.
In this embodiment, request data to be tested is acquired; when the source IP address of the request data to be tested is a proxy IP address to be tested, the target proxy IP address pool to which it belongs is determined, the target proxy IP address pool being one of at least one proxy IP address pool, each of which has corresponding access behavior characteristics; and whether malicious crawler behavior exists in the request data to be tested is detected according to the target access behavior characteristics of the target proxy IP address pool. Crawler behavior that uses an IP proxy pool to spread its requests across many addresses can thus be effectively detected, so that malicious crawler behavior based on the IP proxy pool is effectively identified and the detection effect for such behavior is improved.
Fig. 3 is a flowchart illustrating a crawler behavior detection method based on a proxy IP address pool according to an embodiment of the present application.
In the embodiment of the present application, before obtaining request data to be tested, referring to fig. 3, the method includes:
S301: Obtain a plurality of pieces of request data for a web application, each piece of request data including a source IP address, and determine proxy IP addresses from the plurality of source IP addresses.
The plurality of request data described above is historical request data for web applications that may be used to model and form a plurality of candidate proxy IP address pools.
In some embodiments, the multiple pieces of request data for the web application may be obtained by means of traffic mirroring, collection of reported application logs, RASP (Runtime Application Self-Protection), and the like.
In other embodiments, the collection API may be configured to monitor request data for accessing the network application, and the collection API is executed to obtain multiple pieces of request data for the network application, which is not limited in this respect.
In acquiring the pieces of request data for the web application, the pieces of request data for the web application may be acquired within a time range, for example, 12 hours, one day, three days, one week, etc., without limitation.
In this embodiment, after the plurality of pieces of request data for the network application are acquired, the source IP address of each piece of request data may be parsed out, and the proxy IP addresses determined from the plurality of source IP addresses. A source IP address identifies the IP address of a user-side device accessing the network application; the behavior characteristics of each source IP address may be analyzed to determine whether it is a proxy IP address.
In this embodiment, taking request data that supports HTTP (Hypertext Transfer Protocol) as an example, the multiple pieces of request data may be parsed according to the HTTP protocol standard, and the parsed results stored on a big data platform, which facilitates subsequent big data processing and ensures efficient processing of massive request data. The parsed results may include the main fields of the HTTP protocol, including but not limited to source IP, destination IP, source port, destination port, Host, Cookie, User-Agent, URL, and postData; descriptions of these fields are available in the related art and are not repeated here.
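A minimal sketch of this parsing step, assuming a raw HTTP/1.x request is available as text, is shown below. The function name `parse_http_request` and the flat dictionary layout are illustrative assumptions. The source and destination IP addresses and ports are not part of the HTTP text itself; they would come from the transport layer or the log record, so they are omitted here.

```python
from typing import Dict


def parse_http_request(raw: str) -> Dict[str, str]:
    """Parse the request line, headers and body of a raw HTTP/1.x request
    into a flat record with lowercase keys."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, url, _version = lines[0].split(" ", 2)
    record = {"method": method, "url": url, "post_data": body}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize them.
        record[name.strip().lower()] = value.strip()
    return record


raw = ("POST /api/tickets HTTP/1.1\r\n"
       "Host: example.com\r\n"
       "User-Agent: demo-client/1.0\r\n"
       "Cookie: SESSIONID=abc123\r\n"
       "\r\n"
       "date=2020-10-27")
record = parse_http_request(raw)
```

A production parser would also need to handle repeated headers and malformed requests, which this sketch ignores.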
S302: Form a source IP address list from the source IP addresses, among the plurality of source IP addresses, other than the proxy IP addresses.
That is, each source IP address may or may not be a proxy IP address. In this embodiment, at least one proxy IP address pool may be formed from the identified proxy IP addresses, and a source IP address list may be formed from the other source IP addresses that are not proxy IP addresses.
The source IP addresses in the source IP address list may be IP addresses suspected of malicious crawler behavior. Thus, in this embodiment, IP addresses suspected of malicious crawler behavior may be identified among the other source IP addresses, and the source IP address list may be formed from them.
In this embodiment, at least one proxy IP address pool and a source IP address list are obtained in advance by learning and modeling on the source IP addresses of massive request data, which effectively supports the comprehensiveness and effectiveness of detection in the actual crawler behavior detection process.
S303: Parse the request data to obtain the corresponding Cookie and/or User_agent.
This embodiment provides an implementation for grouping the proxy IP addresses to form at least one proxy IP address pool. First, the request data is parsed to obtain the corresponding Cookie and/or User_agent.
The Cookie is the session data corresponding to a key of the Cookie field in the request data. For example, if the value of the Cookie field of a piece of request data is:
XSRF-TOKEN=9cacc300-47ce-440d-aa55-86905b17cd01;
SESSIONID=OTJmN2JmYmEtM2E2YS00ODNjLTkzMTktNWM2ZDQ5OTAwZTBk;
then the following session data is extracted:
SESSIONID=OTJmN2JmYmEtM2E2YS00ODNjLTkzMTktNWM2ZDQ5OTAwZTBk, and this session data is taken as the Cookie.
Extracting the session data corresponding to a key of the Cookie field as the Cookie effectively prevents randomly changing values in the Cookie field from interfering with grouping accuracy, and thus improves the grouping result.
Of course, the form of the Cookie may also be customized by the user according to the requirements of the actual application scenario; for example, for a special scenario the user may designate a particular application and extract the values of other, custom Cookie field keys. Before the proxy IP addresses are grouped by Cookie, request data whose Cookies are all empty, as well as groups containing only one IP, may also be removed to obtain a better grouping result.
The value of the User_agent field in the request data may be used directly, without limitation.
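The extraction of session data described above can be sketched as follows. This is only an illustrative sketch: the function name extract_session and the default key SESSIONID are assumptions taken from the example value above, and a real deployment would use whichever Cookie key carries the session for the application in question.

```python
from typing import Optional

# Assumed default: the Cookie key that carries the session identifier.
SESSION_KEY = "SESSIONID"


def extract_session(cookie_field: str, key: str = SESSION_KEY) -> Optional[str]:
    """Return 'KEY=value' for the session key, or None if absent or empty."""
    for part in cookie_field.split(";"):
        # Split on the first '=' only, since values may themselves contain '='.
        name, sep, value = part.strip().partition("=")
        if sep and name == key and value:
            return f"{name}={value}"
    return None


cookie_field = (
    "XSRF-TOKEN=9cacc300-47ce-440d-aa55-86905b17cd01; "
    "SESSIONID=OTJmN2JmYmEtM2E2YS00ODNjLTkzMTktNWM2ZDQ5OTAwZTBk;"
)
session = extract_session(cookie_field)
```

The randomly changing XSRF-TOKEN value is ignored, so two requests from the same session yield the same grouping key.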
S304: Group the proxy IP addresses according to the Cookie and/or User_agent in combination with a grouping policy to form at least one proxy IP address pool, where the proxy IP addresses in a proxy IP address pool correspond to the same user-side device.
After each piece of request data is parsed to obtain its Cookie and/or User_agent, the proxy IP addresses may be grouped according to the Cookie and/or User_agent in combination with a grouping policy to form at least one proxy IP address pool, so that the proxy IP addresses in a proxy IP address pool correspond to the same user-side device and have similar access behavior characteristics.
The grouping policy may be configured in advance or set dynamically according to the requirements of the actual usage scenario. The embodiment shown in fig. 4 of the present application provides one grouping implementation; of course, any other feasible manner, such as a model-based manner or an engineering manner, may also be used to group the proxy IP addresses according to the Cookie and/or User_agent, without limitation.
It should be noted that "and/or" means that the embodiments of the present application support any of the following: parsing the request data to obtain the corresponding Cookie, and then grouping the proxy IP addresses according to the Cookie in combination with a grouping policy to form at least one proxy IP address pool; parsing the request data to obtain the corresponding User_agent, and then grouping the proxy IP addresses according to the User_agent in combination with a grouping policy to form at least one proxy IP address pool; or parsing the request data to obtain both the corresponding Cookie and User_agent, and then grouping the proxy IP addresses according to the Cookie and the User_agent in combination with a grouping policy to form at least one proxy IP address pool.
Thus, in the embodiment shown in fig. 3, a plurality of pieces of request data for the web application are obtained, the request data including a source IP address; proxy IP addresses are determined from the plurality of source IP addresses; a source IP address list is formed from the source IP addresses other than the proxy IP addresses; the request data is parsed to obtain the corresponding Cookie and/or User_agent; and the proxy IP addresses are grouped according to the Cookie and/or User_agent in combination with a grouping policy to form at least one proxy IP address pool. This effectively supports the comprehensiveness and effectiveness of detection in the actual crawler behavior detection process. In addition, extracting the session data corresponding to a key of the Cookie field as the Cookie effectively prevents randomly changing values in the Cookie field from interfering with grouping accuracy, which improves the grouping result.
Referring to fig. 4, grouping the proxy IP addresses according to the Cookie and the User_agent in combination with the grouping policy to form at least one proxy IP address pool includes:
S401: Determine whether the target key values corresponding to the Cookies of the proxy IP addresses are the same, and determine whether the User_agents of the proxy IP addresses are the same.
S402: Divide the proxy IP addresses having the same target key value into the same first proxy IP address group to obtain a plurality of first proxy IP address groups.
A Cookie-based group may be referred to as a first proxy IP address group. The proxy IP addresses in a first proxy IP address group behave consistently with respect to the target key value corresponding to the Cookie (the target key value being, for example, the session data corresponding to a key of the Cookie field).
S403: Delete the proxy IP addresses whose Cookie is empty from the first proxy IP address groups, and add those proxy IP addresses to the source IP address list.
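Steps S402 and S403 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name group_by_cookie and the input shape (pairs of proxy IP address and extracted session Cookie) are hypothetical, and a proxy IP address seen with an empty Cookie is simply recorded for the source IP address list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


def group_by_cookie(
    records: Iterable[Tuple[str, Optional[str]]],
) -> Tuple[Dict[str, Set[str]], List[str]]:
    """Group proxy IPs by session Cookie (S402); collect empty-Cookie IPs (S403).

    records: (proxy_ip, session_cookie) pairs, one per request.
    Returns (first_groups, empty_cookie_ips).
    """
    groups: Dict[str, Set[str]] = defaultdict(set)
    empty_cookie_ips: List[str] = []
    for ip, cookie in records:
        if cookie:
            groups[cookie].add(ip)
        elif ip not in empty_cookie_ips:
            # Empty Cookie: keep out of the first groups, hand to the source IP list.
            empty_cookie_ips.append(ip)
    return dict(groups), empty_cookie_ips


records = [
    ("198.51.100.1", "SESSIONID=aaa"),
    ("198.51.100.2", "SESSIONID=aaa"),
    ("198.51.100.3", None),
    ("198.51.100.4", "SESSIONID=bbb"),
    ("198.51.100.5", ""),
]
first_groups, to_source_list = group_by_cookie(records)
```

Here the two addresses sharing SESSIONID=aaa land in one first proxy IP address group, while the addresses without a Cookie are returned separately so they can be added to the source IP address list.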
S404: Divide the proxy IP addresses having the same User_agent into the same second proxy IP address group to obtain a plurality of second proxy IP address groups.
A User_agent-based group may be referred to as a second proxy IP address group. The proxy IP addresses in a second proxy IP address group behave consistently with respect to the User_agent.
For example, a data algorithm model on the big data platform first groups the source IP addresses of all request data within N hours by the User_agent of the request data to obtain User_agent-based groups, and at the same time groups all request data within N hours by the Cookie of the request data to obtain Cookie-based groups.
That is, the proxy IP addresses are grouped according to the Cookie and the User_agent in combination with the grouping policy to form at least one proxy IP address pool as follows. The request data is parsed and learned on the basis of big data analysis to obtain the Cookie and the User_agent. Cookie-based groups are then generated by applying the default grouping policy or a user-defined grouping policy to the Cookie (a Cookie-based group may be referred to as a first proxy IP address group). The User_agent-based groups are analyzed in terms of destination distribution, time-dimension distribution, request-frequency distribution, and any user-defined dimension, so as to work out which IP addresses within a group may form one set (and may belong to one proxy IP address pool); a User_agent-based group may be referred to as a second proxy IP address group.
After steps S401 to S404, the first proxy IP address groups and the second proxy IP address groups are obtained. In this embodiment, because all source IP addresses in the first proxy IP address groups are grouped by Cookie, the source IP addresses in the same group all carry the same Cookie; therefore, if a group contains multiple source IP addresses, they all belong to the same source (one visitor, which may be an entire proxy pool).
In special scenarios, for example where the service can be crawled without logging in so that the attacker carries no Cookie, or where a service limit on the number of requests leads the attacker to register a large number of accounts for crawling, the first proxy IP address groups cannot cover all source IP addresses exhibiting crawler behavior; in that case the data can be supplemented from the second proxy IP address groups. Conversely, if in a special scenario the Cookies of several users happen to be exactly the same, those users fall into the same first proxy IP address group even though they are not actually the same visitor.
S405: Optimize and adjust the second proxy IP address groups according to the access parameters of the proxy IP addresses in the second proxy IP address groups.
Optionally, in some embodiments, optimizing and adjusting the second proxy IP address groups according to the access parameters of the proxy IP addresses in the second proxy IP address groups includes: for each second proxy IP address group, determining the access parameters of the request data to which each proxy IP address belongs; and adjusting the proxy IP addresses having matching access parameters into the same second proxy IP address group.
The access parameters describe, for example, the access time, destination IP address, and access frequency of the request data, without limitation. Of course, the access parameters may be adaptively adjusted according to actual detection requirements, so that the grouping policy is more flexible and adapts to the requirements of different usage scenarios.
Optionally, in some embodiments, adjusting the proxy IP addresses having matching access parameters into the same second proxy IP address group includes: when the access parameter is the destination IP address of the request data, adjusting the proxy IP addresses having the same destination IP address into the same second proxy IP address group.
For example, an algorithm model on the big data platform may determine whether the access destination IP addresses corresponding to the source IP addresses in a second proxy IP address group are consistent. Because the Cookies are the same, the multiple source IP addresses may be considered to belong to the same target; the data is grouped by access destination IP address, and source IP addresses that have the same User-Agent and all access the same destination IP address are considered to belong to one class. For the grouped data, if the count value Count is 1 (that is, under the same User-Agent only one source IP address accesses a given destination IP address), the record may be removed from the second proxy IP address group, yielding the optimized and adjusted second proxy IP address group.
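The destination-based adjustment can be sketched as follows. This is an illustrative sketch only: the function name refine_by_destination and the record layout (source IP, User-Agent, destination IP) are assumptions; the removal of groups whose count is 1 follows the example above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

GroupKey = Tuple[str, str]  # (user_agent, destination_ip)


def refine_by_destination(
    records: Iterable[Tuple[str, str, str]],
) -> Tuple[Dict[GroupKey, Set[str]], Set[str]]:
    """records: (source_ip, user_agent, destination_ip) triples.

    Returns (kept_groups, removed_ips): groups with more than one source IP,
    and the source IPs that do not appear in any kept group.
    """
    buckets: Dict[GroupKey, Set[str]] = defaultdict(set)
    for src, ua, dst in records:
        buckets[(ua, dst)].add(src)
    # Count == 1: only one source IP reaches this destination under this User-Agent.
    kept = {key: ips for key, ips in buckets.items() if len(ips) > 1}
    grouped = set().union(*kept.values())
    seen = set().union(*buckets.values())
    return kept, seen - grouped


records = [
    ("198.51.100.1", "UA-1", "203.0.113.10"),
    ("198.51.100.2", "UA-1", "203.0.113.10"),
    ("198.51.100.3", "UA-1", "203.0.113.20"),
    ("198.51.100.4", "UA-2", "203.0.113.10"),
]
second_groups, removed = refine_by_destination(records)
```

The removed addresses are candidates for the source IP address list rather than for a proxy IP address pool.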
Optionally, in other embodiments, adjusting the proxy IP addresses having matching access parameters into the same second proxy IP address group includes: when the access parameters are access time and access frequency, analyzing the access behavior characteristics corresponding to each proxy IP address according to the access time and the access frequency; and adjusting the proxy IP addresses having similar access behavior characteristics into the same second proxy IP address group.
That is, after the proxy IP addresses having the same destination IP address are adjusted into the same second proxy IP address group, the second proxy IP address group may be further optimized according to access time and access frequency.
The second proxy IP address group may be optimized according to access time as follows: the access time of each source IP is analyzed (the access time may include a start time and an end time, that is, the access time is a period).
For example, suppose IP A starts access at minute 10 and ends at minute 15, IP B starts at minute 15 and ends at minute 20, IP F ends its access at minute 25, and IP D starts at minute 18 and ends at minute 24. The accesses of A, B and F are recognized as consecutive, so these IPs can be divided into one second proxy IP address group. Alternatively, if the accesses are distributed along the time axis like abceabceabdacbe (where a denotes a request issued by IP A, b a request issued by IP B, and so on), it can be determined that A, B, C and E act together, and they are assigned to one second proxy IP address group. The results of the two methods can then be merged, the merging rule being the union; that is, A, B, C, E and F under this User-Agent are determined to belong to the same second proxy IP address group. Likewise, a group containing only a single IP address is removed, yielding the optimized second proxy IP address group in the User-Agent time dimension.
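The first time-based method (consecutive access windows) can be sketched as follows. This is a simplified illustration, not the embodiment's definitive algorithm: the function name handover_chains and the tolerance parameter are assumptions, and because the example above does not state when IP F starts, the start minute 20 used below is an assumed value chosen only so that the example can run.

```python
from typing import Dict, List, Set, Tuple


def handover_chains(
    windows: Dict[str, Tuple[int, int]], tolerance: int = 0
) -> List[List[str]]:
    """Chain IPs whose access window starts when the previous window ends.

    windows: ip -> (start_minute, end_minute). Chains of length 1 are dropped.
    """
    order = sorted(windows, key=lambda ip: windows[ip])
    used: Set[str] = set()
    chains: List[List[str]] = []
    for ip in order:
        if ip in used:
            continue
        chain = [ip]
        used.add(ip)
        end = windows[ip][1]
        extended = True
        while extended:
            extended = False
            for nxt in order:
                if nxt not in used and abs(windows[nxt][0] - end) <= tolerance:
                    chain.append(nxt)
                    used.add(nxt)
                    end = windows[nxt][1]
                    extended = True
                    break
        if len(chain) > 1:
            chains.append(chain)
    return chains


windows = {
    "A": (10, 15),
    "B": (15, 20),
    "F": (20, 25),  # start minute 20 is an assumption; only the end is given
    "D": (18, 24),
}
chains = handover_chains(windows)
```

With these values A, B and F form one chain, while D overlaps B instead of following it and is therefore left out.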
Further, the second proxy IP address group may continue to be optimized according to access frequency, for example as follows:
The second proxy IP address groups are optimized according to the similarity of access frequencies. For example, if IP A runs at 10 QPS, IP B at 10 QPS, and IP C at 20 QPS, then after aggregation A and B are regarded as one group and C as a single-IP group; single-IP groups are removed, so only A and B are retained, yielding the optimized second proxy IP address group in the access-frequency dimension.
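The frequency-based adjustment can be sketched as follows. This is a minimal sketch under stated assumptions: the function name group_by_frequency and the tolerance parameter (how far apart two QPS values may be while still counting as similar, measured against the first member of a group) are hypothetical, since this embodiment does not define the similarity measure.

```python
from typing import Dict, List


def group_by_frequency(qps: Dict[str, float], tolerance: float = 0.0) -> List[List[str]]:
    """Group IPs whose QPS is within `tolerance` of the group's first member.

    Groups containing a single IP are dropped.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    for ip in sorted(qps, key=lambda k: (qps[k], k)):
        if current and qps[ip] - qps[current[0]] > tolerance:
            if len(current) > 1:
                groups.append(current)
            current = []
        current.append(ip)
    if len(current) > 1:
        groups.append(current)
    return groups


frequency_groups = group_by_frequency({"A": 10, "B": 10, "C": 20})
```

With the values from the example, A and B are kept as one group and the single-IP group containing C is dropped.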
In this embodiment, adaptive configuration of the access parameters is also supported, so that a user can define custom grouping dimensions (for example, the access parameter may also be configured as the total request duration) to further optimize the second proxy IP address groups.
S406: Form a Cookie-based proxy IP address pool from the first proxy IP address groups, and form a User_agent-field-based proxy IP address pool from the optimized and adjusted second proxy IP address groups.
After the optimized and adjusted second proxy IP address groups in the various access-parameter dimensions are obtained, the User-Agent time-dimension groups, the User-Agent access-frequency-dimension groups, and the User-Agent custom-dimension groups may be aggregated to obtain the optimized and adjusted second proxy IP address groups. The first proxy IP address groups are then used as the Cookie-based proxy IP address pool, and the optimized and adjusted second proxy IP address groups are used as the User-Agent-field-based proxy IP address pool. In addition, the source IP addresses removed from the first and second proxy IP address groups during optimization are placed in the source IP address list; the addresses supplemented into the source IP address list are not proxy IP addresses but are source IP addresses suspected of malicious crawler behavior.
In the embodiments of the present application, the manner of grouping the proxy IP addresses according to the Cookie and/or User_agent in combination with the grouping policy to form at least one proxy IP address pool is not limited to the above, as described below:
In some other embodiments, grouping the proxy IP addresses according to the Cookie and/or User_agent in combination with a grouping policy to form at least one proxy IP address pool may further include: determining whether the target key values corresponding to the Cookies of the proxy IP addresses are the same; dividing the proxy IP addresses having the same target key value into the same first proxy IP address group to obtain a plurality of first proxy IP address groups; and forming a Cookie-based proxy IP address pool from the first proxy IP address groups.
In some other embodiments, grouping the proxy IP addresses according to the Cookie and/or User_agent in combination with the grouping policy to form at least one proxy IP address pool may further include: determining whether the User_agents of the proxy IP addresses are the same; dividing the proxy IP addresses having the same User_agent into the same second proxy IP address group to obtain a plurality of second proxy IP address groups; optimizing and adjusting the second proxy IP address groups according to the access parameters of the proxy IP addresses in the second proxy IP address groups; and forming a User_agent-field-based proxy IP address pool from the optimized and adjusted second proxy IP address groups.
For the description of each step in these two embodiments, reference may be made to the embodiments described above; details are not repeated here.
That is, the embodiments of the present application support grouping the proxy IP addresses according to both the Cookie and the User_agent in combination with the grouping policy, according to the Cookie alone, or according to the User_agent alone. This provides a flexible grouping manner, and any of the above embodiments may be chosen according to actual requirements, which gives the crawler behavior detection method based on a proxy IP address pool provided by the embodiments of the present application better applicability.
Fig. 5 is a schematic structural diagram of a crawler behavior detection apparatus based on a proxy IP address pool according to an embodiment of the present application.
Referring to fig. 5, the apparatus 500 includes:
a first obtaining module 501, configured to obtain request data to be tested;
a determining module 502, configured to determine, when the source IP address of the request data to be tested is a proxy IP address to be tested, the target proxy IP address pool to which the proxy IP address to be tested belongs, where the target proxy IP address pool is one of at least one proxy IP address pool, and each proxy IP address pool has corresponding access behavior characteristics;
a detecting module 503, configured to detect, according to the target access behavior characteristics of the target proxy IP address pool, whether malicious crawler behavior exists in the request data to be tested.
Optionally, in some embodiments, referring to fig. 6, the apparatus 500 further comprises:
a second obtaining module 504, configured to obtain, before the request data to be tested is obtained, a plurality of pieces of request data for the web application, the request data including a source IP address, and to determine proxy IP addresses from the plurality of source IP addresses;
an analysis module 505, configured to parse the request data to obtain the corresponding Cookie and/or User_agent;
a grouping module 506, configured to group the proxy IP addresses according to the Cookie and/or User_agent in combination with a grouping policy to form at least one proxy IP address pool, where the proxy IP addresses in a proxy IP address pool correspond to the same user-side device.
Optionally, in some embodiments, the grouping module 506 is further configured to form a source IP address list from the source IP addresses, among the plurality of source IP addresses, other than the proxy IP addresses.
Optionally, in some embodiments, referring to fig. 6, the apparatus 500 further includes:
a judging module 507, configured to determine, after the request data to be tested is acquired, whether the source IP address of the request data to be tested is in the source IP address list;
the detecting module 503 is further configured to detect, when the source IP address of the request data to be tested is in the source IP address list, whether malicious crawler behavior exists in the request data to be tested by using a general malicious crawler behavior detection device.
Optionally, in some embodiments, the grouping module 506 is further configured to:
judging whether the target key values corresponding to the Cookies of the proxy IP addresses are the same or not;
dividing the proxy IP addresses with the same target key value into the same first proxy IP address group to obtain a plurality of first proxy IP address groups;
and forming a proxy IP address pool based on Cookie according to the first proxy IP address group.
Optionally, in some embodiments, the grouping module 506 is further configured to:
judging whether the User_agents of the proxy IP addresses are the same;
dividing the proxy IP addresses with the same User_agent into the same second proxy IP address group to obtain a plurality of second proxy IP address groups;
optimizing and adjusting the second proxy IP address groups according to the access parameters of the proxy IP addresses in the second proxy IP address groups;
and forming a User_agent-field-based proxy IP address pool according to the optimized and adjusted second proxy IP address groups.
Optionally, in some embodiments, the grouping module 506 is further configured to:
judging whether the target key values corresponding to the Cookies of the proxy IP addresses are the same, and judging whether the User_agents of the proxy IP addresses are the same;
dividing the proxy IP addresses with the same target key value into the same first proxy IP address group to obtain a plurality of first proxy IP address groups;
dividing the proxy IP addresses with the same User_agent into the same second proxy IP address group to obtain a plurality of second proxy IP address groups;
optimizing and adjusting the second proxy IP address groups according to the access parameters of the proxy IP addresses in the second proxy IP address groups;
and forming a Cookie-based proxy IP address pool according to the first proxy IP address groups, and forming a User_agent-field-based proxy IP address pool according to the optimized and adjusted second proxy IP address groups.
Optionally, in some embodiments, the grouping module 506 is further configured to:
for each second proxy IP address group, determining the access parameters of the request data to which each proxy IP address belongs;
and adjusting the proxy IP addresses having matching access parameters into the same second proxy IP address group.
Optionally, in some embodiments, the grouping module 506 is further configured to:
when the access parameter is the destination IP address of the request data, adjusting the proxy IP addresses having the same destination IP address into the same second proxy IP address group.
Optionally, in some embodiments, the grouping module 506 is further configured to:
when the access parameters are access time and access frequency, analyzing the access behavior characteristics corresponding to the proxy IP addresses according to the access time and the access frequency;
and adjusting the proxy IP addresses having similar access behavior characteristics into the same second proxy IP address group.
Optionally, in some embodiments, the grouping module 506 is further configured to:
and adaptively adjusting the access parameters according to the actual detection requirement.
Optionally, in some embodiments,
the grouping module 506 is further configured to, after dividing the proxy IP addresses with the same Cookie into the same first proxy IP address group to obtain a plurality of first proxy IP address groups, delete the proxy IP addresses whose Cookie is empty from the first proxy IP address groups and add them to the source IP address list.
It should be noted that, the explanation of the foregoing embodiment of fig. 1 to fig. 4 for the crawler behavior detection method based on the proxy IP address pool is also applicable to the crawler behavior detection apparatus 500 based on the proxy IP address pool of this embodiment, and the implementation principle thereof is similar, and is not described herein again.
In this embodiment, request data to be tested is obtained; when the source IP address of the request data to be tested is a proxy IP address to be tested, the target proxy IP address pool to which that proxy IP address belongs is determined, where the target proxy IP address pool is one of at least one proxy IP address pool and each proxy IP address pool has corresponding access behavior characteristics; and whether malicious crawler behavior exists in the request data to be tested is detected according to the target access behavior characteristics of the target proxy IP address pool. In this way, the decentralized characteristics of crawler behavior that uses an IP proxy pool can be effectively detected, so that malicious crawler behavior based on an IP proxy pool can be effectively identified and the detection effect for such behavior is improved.
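The cooperation of the determining, judging, and detecting modules can be sketched as a simple dispatch. This is an illustrative sketch only: the function name route_request, the dictionary-based request, and the two detector callables are assumptions, and the detectors are placeholder stubs because how access behavior characteristics are evaluated is not specified in this part of the description.

```python
from typing import Callable, Dict, Optional, Set

Request = Dict[str, str]


def route_request(
    request: Request,
    ip_to_pool: Dict[str, str],
    source_ip_list: Set[str],
    pool_detector: Callable[[Request, str], bool],
    general_detector: Callable[[Request], bool],
) -> Optional[bool]:
    """Return the detector's verdict, or None if the source IP is unknown."""
    ip = request["source_ip"]
    pool_id = ip_to_pool.get(ip)
    if pool_id is not None:
        # Proxy IP: judge against the characteristics of its target pool.
        return pool_detector(request, pool_id)
    if ip in source_ip_list:
        # Non-proxy suspect: fall back to general crawler detection.
        return general_detector(request)
    return None


def stub_pool_detector(request: Request, pool_id: str) -> bool:
    """Placeholder: a real detector would use the pool's behavior features."""
    return pool_id == "cookie-pool-1"


def stub_general_detector(request: Request) -> bool:
    """Placeholder for a general malicious crawler detector."""
    return False


ip_to_pool = {"198.51.100.7": "cookie-pool-1"}
source_ip_list = {"203.0.113.9"}
verdict = route_request(
    {"source_ip": "198.51.100.7"},
    ip_to_pool, source_ip_list, stub_pool_detector, stub_general_detector,
)
```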
Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Referring to fig. 7, a computer device 700 of the present embodiment includes a housing 701, a processor 702, a memory 703, a circuit board 704, and a power supply circuit 705, wherein the circuit board 704 is disposed inside a space surrounded by the housing 701, and the processor 702 and the memory 703 are disposed on the circuit board 704; a power supply circuit 705 for supplying power to the respective circuits or devices of the computer apparatus 700; the memory 703 is used to store executable program code; the processor 702 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 703, for performing:
acquiring request data to be tested;
if the source IP address of the request data to be tested is the proxy IP address to be tested, determining a target proxy IP address pool to which the proxy IP address to be tested belongs, wherein the target proxy IP address pool belongs to at least one proxy IP address pool, and the proxy IP address pool has corresponding access behavior characteristics;
and detecting, according to the target access behavior characteristics of the target proxy IP address pool, whether malicious crawler behavior exists in the request data to be tested.
It should be noted that the explanation of the foregoing crawler behavior detection method embodiment based on the proxy IP address pool in the embodiments of fig. 1 to fig. 4 is also applicable to the computer device 700 of this embodiment, and the implementation principle is similar, and is not described here again.
In this embodiment, request data to be tested is acquired; when the source IP address of the request data to be tested is a proxy IP address to be tested, the target proxy IP address pool to which that proxy IP address belongs is determined, where the target proxy IP address pool is one of at least one proxy IP address pool and each proxy IP address pool has corresponding access behavior characteristics; and whether malicious crawler behavior exists in the request data to be tested is detected according to the target access behavior characteristics of the target proxy IP address pool. In this way, the decentralized characteristics of crawler behavior that uses an IP proxy pool can be effectively detected, so that malicious crawler behavior based on an IP proxy pool can be effectively identified and the detection effect for such behavior is improved.
In order to implement the foregoing embodiments, an embodiment of the present application provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the crawler behavior detection method based on a proxy IP address pool of the foregoing method embodiment.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried out in the method of implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description of the present specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (22)

1. A crawler behavior detection method based on a proxy IP address pool, characterized by comprising the following steps:
acquiring request data to be detected;
if the source IP address of the request data to be detected is a proxy IP address to be detected, determining a target proxy IP address pool to which the proxy IP address to be detected belongs, wherein the target proxy IP address pool is one of at least one proxy IP address pool, and each proxy IP address pool has corresponding access behavior characteristics, the access behavior characteristics describing the overall access behavior of the proxy IP addresses in the proxy IP address pool;
detecting whether malicious crawler behavior exists in the request data to be detected according to the target access behavior characteristics of the target proxy IP address pool;
wherein, before the acquiring of the request data to be detected, the method further comprises:
obtaining a plurality of pieces of request data for a web application, the request data including a source IP address, and determining proxy IP addresses from the plurality of source IP addresses;
parsing the request data to obtain a corresponding Cookie and/or User_agent;
grouping the proxy IP addresses according to the Cookie and/or the User_agent in combination with a grouping strategy to form the at least one proxy IP address pool;
wherein the proxy IP addresses in a proxy IP address pool correspond to the same user-side device;
and wherein session data corresponding to the key value of the Cookie field in the request data is extracted as the Cookie.
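As an illustration of the detection step in claim 1, the following is a minimal sketch only, not the patented implementation. It assumes the pools have already been formed, uses the total request count of a pool as a stand-in for the "access behavior characteristics", and applies a fixed threshold; the names `pool_request_counts`, `detect`, and `threshold` are hypothetical.

```python
from collections import Counter

def pool_request_counts(request_ips, pools):
    """Aggregate request counts per pool (assumed pool-level feature).

    request_ips: iterable with one source IP per observed request.
    pools: dict mapping a pool id to the set of proxy IPs in that pool.
    """
    ip_to_pool = {ip: pool_id for pool_id, ips in pools.items() for ip in ips}
    counts = Counter()
    for ip in request_ips:
        pool_id = ip_to_pool.get(ip)
        if pool_id is not None:
            counts[pool_id] += 1
    return ip_to_pool, counts

def detect(source_ip, ip_to_pool, counts, threshold):
    """Return True/False for a proxy IP, or None if the IP is in no pool.

    The verdict uses the whole pool's count, not the single IP's count.
    """
    pool_id = ip_to_pool.get(source_ip)
    if pool_id is None:
        return None
    return counts[pool_id] > threshold

# Hypothetical data: each IP in pool "A" sends only 3 requests,
# but the pool as a whole sends 6.
pools = {"A": {"203.0.113.1", "203.0.113.2"}, "B": {"203.0.113.3"}}
log = ["203.0.113.1"] * 3 + ["203.0.113.2"] * 3 + ["203.0.113.3"] * 2 + ["198.51.100.9"]
ip_to_pool, counts = pool_request_counts(log, pools)
```

With a threshold of 5, neither address in pool "A" would be flagged on its own, but both are flagged once their requests are attributed to the shared pool, which is the point of judging by pool-level behavior.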
2. The method of claim 1, wherein after said determining proxy IP addresses from the plurality of source IP addresses, the method further comprises:
forming a source IP address list from the source IP addresses, among the plurality of source IP addresses, other than the proxy IP addresses;
and after the request data to be detected is obtained, the method further comprises:
judging whether the source IP address of the request data to be detected is in the source IP address list;
and if the source IP address of the request data to be detected is in the source IP address list, detecting whether malicious crawler behavior exists in the request data to be detected by using a general malicious crawler behavior detection method.
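The routing implied by claims 1 and 2 can be sketched as follows. This is an assumption-laden illustration: the function name `route` and the `"unknown"` outcome for addresses seen in neither structure are hypothetical, since the claims do not say how such addresses are handled.

```python
def route(source_ip, proxy_pools, source_ip_list):
    """Decide which detection path a request should take.

    proxy_pools: dict mapping a pool id to the set of proxy IPs in that pool.
    source_ip_list: set of non-proxy source IPs.
    """
    for pool_id, ips in proxy_pools.items():
        if source_ip in ips:
            # Proxy IP: detect using the pool's access behavior characteristics.
            return ("pool", pool_id)
    if source_ip in source_ip_list:
        # Ordinary source IP: fall back to a general detection method.
        return ("general", None)
    # Not covered by the claims; treated here as a separate case (assumption).
    return ("unknown", None)
```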
3. The method of claim 1, wherein said grouping the proxy IP addresses according to the Cookie and/or the User_agent in combination with a grouping strategy to form the at least one proxy IP address pool comprises:
judging whether the target key values corresponding to the Cookies of the proxy IP addresses are the same;
dividing the proxy IP addresses with the same target key value into the same first proxy IP address group to obtain a plurality of first proxy IP address groups;
and forming a Cookie-based proxy IP address pool according to the first proxy IP address groups.
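The Cookie-based grouping of claim 3 might look like the sketch below. It assumes the raw `Cookie` request header is available per request and that the session identifier is stored under a key named `JSESSIONID`; that key name, and the helper names `target_key_value` and `group_by_cookie`, are hypothetical and not taken from the patent.

```python
from collections import defaultdict
from http.cookies import SimpleCookie

SESSION_KEY = "JSESSIONID"  # hypothetical target key; the claims do not name one

def target_key_value(cookie_header, key=SESSION_KEY):
    """Extract the session value stored under `key` from a raw Cookie header."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    morsel = jar.get(key)
    return morsel.value if morsel is not None else ""

def group_by_cookie(records, key=SESSION_KEY):
    """Group proxy IPs whose requests carry the same target key value.

    records: iterable of (proxy_ip, cookie_header) pairs.
    Returns a dict mapping the key value to a set of proxy IPs;
    requests without the key fall under the empty string.
    """
    groups = defaultdict(set)
    for proxy_ip, cookie_header in records:
        groups[target_key_value(cookie_header, key)].add(proxy_ip)
    return dict(groups)

records = [
    ("203.0.113.1", "JSESSIONID=abc; theme=dark"),
    ("203.0.113.2", "theme=light; JSESSIONID=abc"),
    ("203.0.113.3", "JSESSIONID=xyz"),
    ("203.0.113.4", ""),
]
groups = group_by_cookie(records)
```

Here the first two addresses land in one group because they present the same session value, which is consistent with the claim's premise that addresses in a pool correspond to the same user-side device.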
4. The method of claim 1, wherein said grouping the proxy IP addresses according to the Cookie and/or the User_agent in combination with a grouping strategy to form the at least one proxy IP address pool comprises:
judging whether the User_agent of each proxy IP address is the same;
dividing the proxy IP addresses with the same User_agent into the same second proxy IP address group to obtain a plurality of second proxy IP address groups;
optimizing and adjusting the second proxy IP address groups according to the access parameters of each proxy IP address in the second proxy IP address groups;
and forming a User_agent-based proxy IP address pool according to the optimized and adjusted second proxy IP address groups.
5. The method of claim 1, wherein said grouping the proxy IP addresses according to the Cookie and/or the User_agent in combination with a grouping strategy to form the at least one proxy IP address pool comprises:
judging whether the target key values corresponding to the Cookie of each proxy IP address are the same, and judging whether the User_agent of each proxy IP address is the same;
dividing the proxy IP addresses with the same target key value into the same first proxy IP address group to obtain a plurality of first proxy IP address groups;
dividing the proxy IP addresses with the same User_agent into the same second proxy IP address group to obtain a plurality of second proxy IP address groups;
optimizing and adjusting the second proxy IP address groups according to the access parameters of each proxy IP address in the second proxy IP address groups;
and forming a Cookie-based proxy IP address pool according to the first proxy IP address groups, and forming a User_agent-based proxy IP address pool according to the optimized and adjusted second proxy IP address groups.
6. The method of claim 4 or 5, wherein said optimizing and adjusting the second proxy IP address groups according to the access parameters of each proxy IP address in the second proxy IP address groups comprises:
for each second proxy IP address group, determining an access parameter of the request data to which each proxy IP address belongs;
and adjusting the proxy IP addresses with matching access parameters into the same second proxy IP address group.
7. The method of claim 6, wherein said adjusting the proxy IP addresses with matching access parameters into the same second proxy IP address group comprises:
when the access parameter is the target IP address of the request data, adjusting the proxy IP addresses with the same target IP address into the same second proxy IP address group.
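One possible reading of claims 4, 6, and 7 is sketched below: proxy IPs are first grouped by User_agent, and each group is then refined so that only addresses that also share a target IP address stay together. The function names and record layout are hypothetical, and the claims leave room for other refinement strategies.

```python
from collections import defaultdict

def group_by_user_agent(records):
    """Initial second groups: proxy IPs that share a User_agent string.

    records: iterable of (proxy_ip, user_agent, target_ip) tuples.
    """
    groups = defaultdict(set)
    for proxy_ip, user_agent, _target_ip in records:
        groups[user_agent].add(proxy_ip)
    return dict(groups)

def refine_by_target_ip(records):
    """Refined second groups keyed by (User_agent, target IP).

    A proxy IP that accesses several targets appears in several groups;
    how to resolve that overlap is not specified here (assumption).
    """
    refined = defaultdict(set)
    for proxy_ip, user_agent, target_ip in records:
        refined[(user_agent, target_ip)].add(proxy_ip)
    return dict(refined)

records = [
    ("203.0.113.1", "UA-1", "192.0.2.10"),
    ("203.0.113.2", "UA-1", "192.0.2.10"),
    ("203.0.113.3", "UA-1", "192.0.2.20"),
    ("203.0.113.4", "UA-2", "192.0.2.10"),
]
coarse = group_by_user_agent(records)
fine = refine_by_target_ip(records)
```

The refinement matters because a common User_agent string is shared by many unrelated clients; adding the target IP address narrows each group toward addresses that plausibly act together.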
8. The method of claim 6, wherein said adjusting the proxy IP addresses with matching access parameters into the same second proxy IP address group comprises:
when the access parameters are access time and access frequency, analyzing the access behavior characteristics corresponding to each proxy IP address according to the access time and the access frequency;
and adjusting the proxy IP addresses with similar access behavior characteristics into the same second proxy IP address group.
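The claims do not define "similar access behavior characteristics". As a rough sketch under stated assumptions, the example below summarizes each proxy IP by its peak access hour (UTC) and a bucketed hourly request rate, and treats addresses with identical summaries as similar. The feature choice, the bucket width `rate_bucket`, and the function names are all hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

def behavior_key(timestamps, rate_bucket=10):
    """Summarize access time and frequency as (peak hour, rate bucket).

    timestamps: non-empty list of Unix timestamps (seconds) for one proxy IP.
    """
    hours = Counter(
        datetime.fromtimestamp(t, tz=timezone.utc).hour for t in timestamps
    )
    peak_hour = hours.most_common(1)[0][0]
    # Treat any observation window shorter than one hour as one hour.
    span_hours = max((max(timestamps) - min(timestamps)) / 3600, 1.0)
    rate = len(timestamps) / span_hours
    return peak_hour, int(rate // rate_bucket)

def group_by_behavior(access_log, rate_bucket=10):
    """Group proxy IPs whose behavior summaries are identical.

    access_log: dict mapping a proxy IP to its list of timestamps.
    """
    groups = defaultdict(set)
    for proxy_ip, timestamps in access_log.items():
        groups[behavior_key(timestamps, rate_bucket)].add(proxy_ip)
    return dict(groups)

access_log = {
    "203.0.113.1": [i * 60 for i in range(30)],            # 30 requests, hour 0
    "203.0.113.2": [i * 50 for i in range(35)],            # 35 requests, hour 0
    "203.0.113.3": [3 * 3600 + i * 600 for i in range(5)], # 5 requests, hour 3
}
groups = group_by_behavior(access_log)
```

Exact-match bucketing is a deliberately simple notion of similarity; a distance-based clustering over the same features would be another reasonable choice.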
9. The method of any one of claims 7-8, further comprising:
and adaptively adjusting the access parameters according to actual detection requirements.
10. The method of claim 3, wherein after dividing the proxy IP addresses with the same Cookie into the same first proxy IP address group to obtain a plurality of first proxy IP address groups, the method further comprises:
deleting the proxy IP addresses with an empty Cookie from the first proxy IP address groups, and adding the proxy IP addresses with an empty Cookie to the source IP address list.
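The clean-up step of claim 10 can be sketched as follows, assuming the first groups are stored as a dict keyed by Cookie value with the empty string standing for "no Cookie". The function name `prune_empty_cookie` and this representation are hypothetical.

```python
def prune_empty_cookie(first_groups, source_ip_list):
    """Drop empty-Cookie groups and move their IPs to the source IP list.

    first_groups: dict mapping a Cookie value to a set of proxy IPs.
    source_ip_list: set of source IPs handled by general detection.
    Returns (remaining groups, updated source IP list) without mutating inputs.
    """
    kept = {value: set(ips) for value, ips in first_groups.items() if value}
    moved = set()
    for value, ips in first_groups.items():
        if not value:
            moved |= ips
    return kept, set(source_ip_list) | moved
```

Addresses without a Cookie cannot be tied to a session, so sending them to the general detection path avoids forming a meaningless pool of unrelated clients.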
11. A crawler behavior detection device based on a proxy IP address pool, characterized by comprising:
a first acquisition module, configured to acquire request data to be detected;
a determining module, configured to determine, when the source IP address of the request data to be detected is a proxy IP address to be detected, a target proxy IP address pool to which the proxy IP address to be detected belongs, wherein the target proxy IP address pool is one of at least one proxy IP address pool, and each proxy IP address pool has corresponding access behavior characteristics, the access behavior characteristics describing the overall access behavior of the proxy IP addresses in the proxy IP address pool;
a detection module, configured to detect whether malicious crawler behavior exists in the request data to be detected according to the target access behavior characteristics of the target proxy IP address pool;
wherein the device further comprises:
a second acquisition module, configured to obtain, before the request data to be detected is acquired, a plurality of pieces of request data for a web application, the request data including a source IP address, and to determine proxy IP addresses from the plurality of source IP addresses;
a parsing module, configured to parse the request data to obtain a corresponding Cookie and/or User_agent;
a grouping module, configured to group the proxy IP addresses according to the Cookie and/or the User_agent in combination with a grouping strategy to form the at least one proxy IP address pool;
wherein the proxy IP addresses in a proxy IP address pool correspond to the same user-side device;
and wherein session data corresponding to the key value of the Cookie field in the request data is extracted as the Cookie.
12. The device of claim 11, wherein:
the grouping module is further configured to form a source IP address list from the source IP addresses, among the plurality of source IP addresses, other than the proxy IP addresses;
the device further comprises:
a judging module, configured to judge, after the request data to be detected is obtained, whether the source IP address of the request data to be detected is in the source IP address list;
and the detection module is further configured to detect, by using a general malicious crawler behavior detection device, whether malicious crawler behavior exists in the request data to be detected when the source IP address of the request data to be detected is in the source IP address list.
13. The device of claim 11, wherein the grouping module is further configured to:
judge whether the target key values corresponding to the Cookies of the proxy IP addresses are the same;
divide the proxy IP addresses with the same target key value into the same first proxy IP address group to obtain a plurality of first proxy IP address groups;
and form a Cookie-based proxy IP address pool according to the first proxy IP address groups.
14. The device of claim 11, wherein the grouping module is further configured to:
judge whether the User_agent of each proxy IP address is the same;
divide the proxy IP addresses with the same User_agent into the same second proxy IP address group to obtain a plurality of second proxy IP address groups;
optimize and adjust the second proxy IP address groups according to the access parameters of each proxy IP address in the second proxy IP address groups;
and form a User_agent-based proxy IP address pool according to the optimized and adjusted second proxy IP address groups.
15. The device of claim 11, wherein the grouping module is further configured to:
judge whether the target key values corresponding to the Cookie of each proxy IP address are the same, and judge whether the User_agent of each proxy IP address is the same;
divide the proxy IP addresses with the same target key value into the same first proxy IP address group to obtain a plurality of first proxy IP address groups;
divide the proxy IP addresses with the same User_agent into the same second proxy IP address group to obtain a plurality of second proxy IP address groups;
optimize and adjust the second proxy IP address groups according to the access parameters of each proxy IP address in the second proxy IP address groups;
and form a Cookie-based proxy IP address pool according to the first proxy IP address groups, and form a User_agent-based proxy IP address pool according to the optimized and adjusted second proxy IP address groups.
16. The device of claim 14 or 15, wherein the grouping module is further configured to:
for each second proxy IP address group, determine an access parameter of the request data to which each proxy IP address belongs;
and adjust the proxy IP addresses with matching access parameters into the same second proxy IP address group.
17. The device of claim 16, wherein the grouping module is further configured to:
when the access parameter is the target IP address of the request data, adjust the proxy IP addresses with the same target IP address into the same second proxy IP address group.
18. The device of claim 16, wherein the grouping module is further configured to:
when the access parameters are access time and access frequency, analyze the access behavior characteristics corresponding to each proxy IP address according to the access time and the access frequency;
and adjust the proxy IP addresses with similar access behavior characteristics into the same second proxy IP address group.
19. The device of any one of claims 17-18, wherein the grouping module is further configured to:
adaptively adjust the access parameters according to actual detection requirements.
20. The device of claim 13, wherein
the grouping module is further configured to, after the proxy IP addresses with the same Cookie are divided into the same first proxy IP address group to obtain a plurality of first proxy IP address groups, delete the proxy IP addresses with an empty Cookie from the first proxy IP address groups and add the proxy IP addresses with an empty Cookie to the source IP address list.
21. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the proxy IP address pool based crawler behavior detection method according to any one of claims 1-10.
22. A computer device comprising a housing, a processor, a memory, a circuit board, and a power supply circuit, wherein the circuit board is disposed inside a space enclosed by the housing, and the processor and the memory are disposed on the circuit board; the power supply circuit is used for supplying power to each circuit or component of the computer device; the memory is used for storing executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to execute the crawler behavior detection method based on the proxy IP address pool according to any one of claims 1 to 10.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011164587.1A CN112383513B (en) 2020-10-27 2020-10-27 Crawler behavior detection method and device based on proxy IP address pool and storage medium

Publications (2)

Publication Number Publication Date
CN112383513A CN112383513A (en) 2021-02-19
CN112383513B true CN112383513B (en) 2023-03-14

Family

ID=74577641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011164587.1A Active CN112383513B (en) 2020-10-27 2020-10-27 Crawler behavior detection method and device based on proxy IP address pool and storage medium

Country Status (1)

Country Link
CN (1) CN112383513B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113518077A (en) * 2021-05-26 2021-10-19 杭州安恒信息技术股份有限公司 Malicious web crawler detection method, device, equipment and storage medium
CN113536301A (en) * 2021-07-19 2021-10-22 北京计算机技术及应用研究所 Behavior characteristic analysis-based anti-crawling method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100464376B1 (en) * 2001-12-10 2005-01-05 삼성전자주식회사 Packet data service method for wireless telecommunication system
US20160344701A1 (en) * 2015-02-23 2016-11-24 Shadow SMS, LLC Systems and methods for a two-way common pool proxy to obscure communication routing
US9794281B1 (en) * 2015-09-24 2017-10-17 Amazon Technologies, Inc. Identifying sources of network attacks
CN105429977B (en) * 2015-11-13 2018-08-07 武汉邮电科学研究院 Deep packet inspection device abnormal flow monitoring method based on comentropy measurement
CN108345642B (en) * 2018-01-12 2020-10-16 深圳壹账通智能科技有限公司 Method, storage medium and server for crawling website data by proxy IP
CN109241733A (en) * 2018-08-07 2019-01-18 北京神州绿盟信息安全科技股份有限公司 Crawler Activity recognition method and device based on web access log
CN109582844A (en) * 2018-11-07 2019-04-05 北京三快在线科技有限公司 A kind of method, apparatus and system identifying crawler
CN110912902B (en) * 2019-11-27 2022-04-19 杭州安恒信息技术股份有限公司 Method, system, equipment and readable storage medium for processing access request
CN111064745B (en) * 2019-12-30 2022-06-03 厦门市美亚柏科信息股份有限公司 Self-adaptive back-climbing method and system based on abnormal behavior detection
CN111245838B (en) * 2020-01-13 2022-04-26 四川坤翔科技有限公司 Method for protecting key information by anti-crawler

Also Published As

Publication number Publication date
CN112383513A (en) 2021-02-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant