CN115065520A - Anti-crawler processing method and device, electronic equipment and readable storage medium - Google Patents

Anti-crawler processing method and device, electronic equipment and readable storage medium

Info

Publication number
CN115065520A
Authority
CN
China
Prior art keywords
crawler
historical
index
stage
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210649745.5A
Other languages
Chinese (zh)
Inventor
林海
陈家烁
王蕙蓉
谭成
马稼明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202210649745.5A
Publication of CN115065520A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425: Traffic logging, e.g. anomaly detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/02: Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227: Filtering policies
    • H04L63/0236: Filtering by address, protocol, port number or service, e.g. IP-address or URL

Abstract

Embodiments of the invention provide an anti-crawler processing method and device, electronic equipment, and a readable storage medium. In the method, in response to an access request, the crawler index corresponding to a target IP is looked up among the crawler indexes corresponding to first historical IPs and taken as the target crawler index. The first historical IPs and their corresponding crawler indexes are obtained from the IP access records of at least two second website nodes, the crawler index represents the probability that a first historical IP is a crawler IP, and the target IP is the IP used by the access request. A preset anti-crawler operation is executed when the target crawler index is greater than a preset index threshold. Because the crawler IP is identified using the access records of multiple second website nodes, the crawler IP can be identified quickly and the anti-crawler operation can be executed.

Description

Anti-crawler processing method and device, electronic equipment and readable storage medium
Technical Field
The invention belongs to the technical field of networks, and particularly relates to an anti-crawler processing method and device, an electronic device and a readable storage medium.
Background
With the continuous development of network technology, more and more websites provide data for users to access. A user may access a website using an Internet Protocol Address (IP) to view data provided by the website.
However, in practical applications, an illegal user may illegally obtain data provided by a website based on a crawler technology. Therefore, how to quickly identify a crawler IP, that is, identify an IP used when data is illegally crawled based on a crawler technology, and perform an anti-crawler operation on the IP becomes a technical problem to be urgently solved.
Disclosure of Invention
The invention provides an anti-crawler processing method and device, electronic equipment and a readable storage medium, and aims to solve the technical problem of how to quickly identify a crawler IP and execute anti-crawler operation on the crawler IP.
In a first aspect, the present invention provides an anti-crawler processing method, applied to a first website node, where the method includes:
in response to the access request, searching a crawler index corresponding to a target IP from crawler indexes corresponding to the first historical Internet protocol address IP, and taking the crawler index as the target crawler index; the first historical IP and a crawler index corresponding to the first historical IP are obtained according to IP access records of at least two second website nodes, the crawler index is used for representing the probability that the first historical IP belongs to the crawler IP, and the target IP is the IP used by the access request;
and executing a preset anti-crawler operation under the condition that the target crawler index is greater than a preset index threshold value.
In a second aspect, the present invention provides an anti-crawler processing apparatus, applied to a first website node, the apparatus including:
the searching module is used for responding to the access request, searching a crawler index corresponding to the target IP from the crawler index corresponding to the first historical Internet protocol address IP to be used as the target crawler index; the first historical IP and a crawler index corresponding to the first historical IP are obtained according to IP access records of at least two second website nodes, the crawler index is used for representing the probability that the first historical IP belongs to the crawler IP, and the target IP is the IP used by the access request;
and the execution module is used for executing a preset anti-crawler operation under the condition that the target crawler index is greater than a preset index threshold value.
In a third aspect, the present invention provides an anti-crawler processing system, where the system includes at least two first website nodes, and each first website node is configured to perform the following steps:
responding to the access request, and searching a crawler index corresponding to the target IP from crawler indexes corresponding to the first historical IP to serve as the target crawler index; the first historical IP and a crawler index corresponding to the first historical IP are obtained according to IP access records of the at least two first website nodes, the crawler index is used for representing the probability that the first historical IP belongs to the crawler IP, and the target IP is the IP used by the access request;
and executing a preset anti-crawler operation under the condition that the target crawler index is larger than a preset index threshold value.
In a fourth aspect, the present invention provides an electronic device comprising a processor, a memory, and a computer program that is stored on the memory and can run on the processor, wherein the processor implements the above anti-crawler processing method when executing the program.
In a fifth aspect, the present invention provides a readable storage medium, wherein instructions of the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the above anti-crawler processing method.
In the embodiment of the invention, in response to an access request, the crawler index corresponding to a target IP is looked up among the crawler indexes corresponding to first historical IPs and taken as the target crawler index. The first historical IPs and their corresponding crawler indexes are obtained from the IP access records of at least two second website nodes, the crawler index represents the probability that a first historical IP is a crawler IP, and the target IP is the IP used by the access request. A preset anti-crawler operation is executed when the target crawler index is greater than a preset index threshold. Because the access records of multiple second website nodes are used, more information is available for crawler IP identification. The first website node can identify a crawler IP simply by looking it up among the first historical IPs and crawler indexes obtained from those records, so the crawler IP can be identified quickly, to a certain extent, and the anti-crawler operation can be executed on it.
Moreover, compared with identification based only on a website's own historical data, the target IP may already have accessed some second website node even when it accesses the first website node for the first time. Because the first website node uses crawler indexes obtained from the IP access records of the at least two second website nodes, the probability of finding a crawler index for the target IP is increased to a certain extent. A target IP can therefore be identified quickly even on its first access, and the anti-crawler measure can be executed in time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart illustrating steps of a method for anti-crawler processing according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a scenario provided by an embodiment of the present invention;
FIG. 3 is a block diagram of an anti-crawler processing apparatus according to an embodiment of the present invention;
FIG. 4 is a structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of steps of an anti-crawler processing method according to an embodiment of the present invention, where the method is applied to a first website node, and as shown in fig. 1, the method may include:
step 101, responding to an access request, searching a crawler index corresponding to a target IP from crawler indexes corresponding to a first historical IP, and taking the crawler index as the target crawler index; and the first historical IP and a crawler index corresponding to the first historical IP are obtained according to IP access records of at least two second website nodes, the crawler index is used for representing the probability that the first historical IP belongs to the crawler IP, and the target IP is the IP used by the access request.
In the embodiment of the present invention, the first website node may parse the access request and extract the IP used by the access request as the target IP. The target IP is then matched against each first historical IP, and if a match is found, the crawler index corresponding to the matched first historical IP is determined as the target crawler index. The crawler index represents the probability that an IP is a crawler IP: the higher the crawler index, the greater that probability, and the lower the crawler index, the smaller that probability.
The IP access record may be an IP access log, and for an access request for accessing the second website node, information about an IP used by the access request may be recorded in the IP access log. The first historical IP involved in the crawler index corresponding to the first historical IP may be an IP accessing the second website node, and the crawler index of the first historical IP may be generated by integrating IP access records of a plurality of second website nodes.
The first website node and the second website node may be backend servers of the website. Different second website nodes may correspond to different websites.
Step 102, executing a preset anti-crawler operation under the condition that the target crawler index is greater than a preset index threshold value.
The preset index threshold value can be set according to actual requirements, and if the target crawler index is larger than the preset index threshold value, the target IP can be determined to be a crawler IP, and then preset anti-crawler operation can be executed. The preset index thresholds of different website nodes can be set according to actual requirements, and the preset index thresholds of different website nodes can be different or the same.
The preset anti-crawler operation can be set according to actual requirements. Illustratively, the preset anti-crawler operation can include adding the target IP to an access blacklist, returning prompt information indicating that access is prohibited to the sending end of the access request, requiring verification by a verification code, blocking the IP, and the like, so as to prevent the sending end from acquiring data in the website represented by the first website node based on the target IP.
Further, when the target crawler index is not greater than the preset index threshold, it may be determined that the target IP is not a crawler IP. Accordingly, data in the website represented by the first website node may be returned to the sending end, so that the user at the sending end can access the website normally and obtain and view its data. It should be noted that, if no target crawler index corresponding to the target IP is found, the current access operation of the target IP may be allowed, and the data in the website represented by the first website node may be returned to the sending end. When the target IP accesses the first website node again, the lookup is performed again based on the crawler indexes corresponding to the first historical IPs.
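The decision logic of steps 101 and 102 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the names `Decision` and `decide`, the in-memory mapping that stands in for the stored crawler indexes, and the example threshold are all assumptions introduced here.

```python
from enum import Enum
from typing import Mapping, Optional


class Decision(Enum):
    ALLOW = "allow"  # return the website data to the sending end
    BLOCK = "block"  # execute the preset anti-crawler operation


def decide(
    target_ip: str,
    crawler_index_by_ip: Mapping[str, float],  # hypothetical store of first historical IPs
    threshold: float,                          # preset index threshold of this node
) -> Decision:
    """Step 101: look up the target crawler index; step 102: compare with the threshold."""
    target_index: Optional[float] = crawler_index_by_ip.get(target_ip)
    if target_index is None:
        # No crawler index found for this IP: the current access is allowed.
        return Decision.ALLOW
    if target_index > threshold:
        return Decision.BLOCK
    # Not greater than the threshold: the IP is treated as a normal IP.
    return Decision.ALLOW
```

For example, with a threshold of 0.8, an IP stored with index 0.9 is blocked, while an IP with index 0.8 or an IP that does not appear in the mapping is allowed.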
In summary, in the anti-crawler processing method provided in the embodiment of the present invention, in response to an access request, the crawler index corresponding to a target IP is looked up among the crawler indexes corresponding to first historical IPs and taken as the target crawler index. The first historical IPs and their corresponding crawler indexes are obtained from the IP access records of at least two second website nodes, the crawler index represents the probability that a first historical IP is a crawler IP, and the target IP is the IP used by the access request. A preset anti-crawler operation is executed when the target crawler index is greater than a preset index threshold. Because the access records of multiple second website nodes are used, more information is available for crawler IP identification. The first website node can identify a crawler IP simply by looking it up among the first historical IPs and crawler indexes obtained from those records, so the crawler IP can be identified quickly, to a certain extent, and the anti-crawler operation can be executed on it.
Moreover, compared with identification based only on a website's own historical data, the target IP may already have accessed some second website node even when it accesses the first website node for the first time. Because the first website node uses crawler indexes obtained from the IP access records of the at least two second website nodes, the probability of finding a crawler index for the target IP is increased to a certain extent. A target IP can therefore be identified quickly even on its first access, and the anti-crawler measure can be executed in time.
Optionally, in the embodiment of the present invention, the crawler indexes corresponding to the first historical IPs are stored in a blockchain, the first website node is any one of the at least two second website nodes, and the blockchain is deployed locally on any first website node. Specifically, in contrast to an approach in which website nodes other than the second website nodes may also use the crawler indexes corresponding to the first historical IPs, in the embodiment of the present invention only a website node among the at least two second website nodes may execute the anti-crawler processing method. That is, only such a node may perform crawler IP identification based on the blockchain that stores the crawler indexes corresponding to the first historical IPs, which can improve data security to a certain extent. In other words, the at least two second website nodes may form a consortium chain, and each website node participating in the consortium chain may perform crawler identification based on the crawler indexes on the chain. Being a blockchain by nature, the consortium chain is highly robust, which helps ensure stability. Compared with a public chain that website nodes other than the second website nodes can also access, a consortium chain can, to a certain extent, reduce the usage cost of the participating nodes, improve controllability and data security, and increase the transaction speed on the chain, since verification is not required.
Correspondingly, because the blockchain is deployed locally on the first website node, the first website node can perform the lookup directly against the crawler indexes of the first historical IPs recorded in its local blockchain, which can improve lookup efficiency to a certain extent.
It should be noted that the IP whose crawler index is greater than the preset index threshold in the block chain may be listed in a blacklist, thereby facilitating the search.
Optionally, the following steps may also be executed in the embodiment of the present invention:
step S21, under the condition that a preset condition is met, according to the IP access record of the first website node within a preset time, determining a stage crawler index of a second historical IP within the preset time; the stage crawler index is used for representing the probability that the second historical IP belongs to the crawler within the preset time.
In the embodiment of the invention, the preset condition and the preset time can be set according to actual requirements. Illustratively, the preset condition may be that a preset execution time interval has elapsed, so that this step is executed multiple times and the crawler index is maintained dynamically. Alternatively, the preset condition may be that a calculation instruction is received, and so on. The preset time may be the last 3 days, the last 7 days, etc. Correspondingly, each time the preset time elapses, the stage crawler index of the second historical IP within that preset time is determined once from the IP access record within that time. The second historical IP may be an IP that accessed the first website node within the preset time, and it may also belong to the first historical IPs. For example, once the crawler index of the second historical IP has been uplinked, the second historical IP may be regarded as a first historical IP. The stage crawler index represents the probability that the second historical IP was a crawler within the preset time, whereas the crawler index of the second historical IP represents the probability that it is currently a crawler. That is, the stage crawler index evaluates an IP over a certain period of time, and the crawler index evaluates the IP as a whole. Illustratively, the stage crawler index may be denoted P and the crawler index P'. The crawler index may be regarded as the final crawler index.
Step S22, uplinking the stage crawler index to the blockchain in the first website node, generating a new crawler index for the second historical IP according to the stage crawler index and the historical stage crawler index of the second historical IP, and uplinking the new crawler index to the blockchain in the first website node; the historical stage crawler index comprises a stage crawler index previously uplinked to the blockchain.
In the embodiment of the invention, the order of these operations may vary. The stage crawler index may be uplinked first, after which a new crawler index is generated for the second historical IP according to the stage crawler index and the historical stage crawler index of the second historical IP and uplinked to the blockchain in the first website node. Alternatively, the new crawler index may be generated and uplinked first, and the stage crawler index uplinked afterwards. Alternatively, the new crawler index may be generated and then uplinked to the blockchain in the first website node together with the stage crawler index.
Specifically, the second historical IP and the stage crawler index may be packaged into a block, and an uplink operation may be performed on the block, so that the stage crawler index is uplinked to the local blockchain of the first website node. Likewise, the second historical IP and the new crawler index may be packaged into a block and an uplink operation performed on it, so that the new crawler index is uplinked to the local blockchain of the first website node. It should be noted that other information may also be packaged into a block. For example, the second historical IP, the stage crawler index, and the timestamp of the stage crawler index may be packaged into one block, and the second historical IP, the new crawler index, and the timestamp of the new crawler index into another. The timestamp represents the generation time of the stage crawler index or of the crawler index.
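The packaging and uplink operation described above can be sketched as follows. The description does not specify a block format or a hashing scheme, so the `Block` fields, the SHA-256 hash link to the previous block, and the identifiers `STAGE` and `FINAL` are assumptions made purely for illustration. Consensus and networking are omitted.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List

STAGE = "stage"    # hypothetical marker for a block holding a stage crawler index
FINAL = "crawler"  # hypothetical marker for a block holding a (final) crawler index


@dataclass(frozen=True)
class Block:
    kind: str         # STAGE or FINAL
    ip: str           # the historical IP the index belongs to
    index: float      # stage crawler index or crawler index
    timestamp: float  # generation time of the index
    prev_hash: str    # hash of the previous block (assumed chaining scheme)

    def digest(self) -> str:
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


class LocalChain:
    """In-memory stand-in for the blockchain deployed locally on a website node."""

    def __init__(self) -> None:
        self.blocks: List[Block] = []

    def uplink(self, kind: str, ip: str, index: float, timestamp: float) -> Block:
        # Package the IP, index and timestamp into a block and append it.
        prev_hash = self.blocks[-1].digest() if self.blocks else "0" * 64
        block = Block(kind, ip, index, timestamp, prev_hash)
        self.blocks.append(block)
        return block
```

A node would call `uplink(STAGE, ip, p, t)` for a stage crawler index and `uplink(FINAL, ip, p_new, t)` for the new crawler index, in whichever order the embodiment chooses.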
It should be noted that the position of steps S21 to S22 in the execution sequence is not fixed. For example, steps S21 to S22 may be executed before or after step 101, or after step 102, and the embodiment of the present invention is not limited in this respect. Steps S21 to S22 can also be executed repeatedly, so that the crawler index is maintained dynamically.
Over time, the nature of an IP may change. For example, an IP may begin to be used as a crawler IP at some point, while a former crawler IP may be released and then used by normal users. In the embodiment of the invention, any first website node periodically determines the stage crawler index of the second historical IP according to the IP access record within the preset time, and generates a new crawler index based on that stage crawler index and the previous stage crawler indexes. The crawler index of the second historical IP can therefore accurately represent the probability that the second historical IP is currently a crawler IP, which ensures the accuracy of subsequent identification.
Optionally, in this embodiment of the present invention, after the step of determining the stage crawler index of the second historical IP within the preset time, the following steps may be further performed:
step S31, synchronizing the stage crawler index to other first website nodes deployed with the blockchain, so that the other first website nodes generate a new crawler index for the second historical IP according to the stage crawler index and the historical stage crawler index of the second historical IP, and uplink the new crawler index and the received stage crawler index to the blockchain in the other first website nodes.
Specifically, the other first website nodes are those of the plurality of first website nodes other than the first website node that calculated the stage crawler index. The stage crawler index can be broadcast to the other first website nodes to achieve synchronization. Correspondingly, each other first website node can generate a new crawler index for the second historical IP, using the same calculation, from the received stage crawler index and the historical stage crawler index of the second historical IP, which it can read from its local blockchain. After the new crawler index is calculated, it is uplinked to the local blockchain. Further, the other first website nodes may also uplink the received stage crawler index of the second historical IP to their local blockchains. The order in which the received stage crawler index and the new crawler index are uplinked can be chosen as needed.
In the embodiment of the invention, the first website node only needs to synchronize the stage crawler index obtained by the first website node with other first website nodes provided with the block chains, so that the other first website nodes can link the new crawler index of the second historical IP in the local block chain, and further ensure the content synchronization of the block chains held by each first website node, thereby ensuring the crawler identification effect of each first website node based on the block chains.
Meanwhile, other first website nodes generate new crawler indexes by themselves and link the new crawler indexes generated by themselves to a local block chain, so that the data security can be ensured to a certain extent.
It should be noted that, in the embodiment of the present invention, the first website node may also synchronize the second historical IP and the calculated new crawler index to other first website nodes where the blockchain is deployed. Therefore, other first website nodes do not need to execute the operation of generating the new crawler index, and the received new crawler index can be directly linked up, so that the processing cost can be saved.
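The synchronization in step S31 can be sketched as follows. This is an illustrative model only: the `Node` class, the tuple-based chain, and direct method calls in place of a network broadcast are assumptions. The combining rule `mean_combine` is a placeholder chosen for the example and is not the calculation used by the invention. The only property the sketch relies on is that every node applies the same rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Record = Tuple[str, str, float]                  # (kind, ip, index)
Combine = Callable[[List[float], float], float]  # (historical stage indexes, new stage index)


@dataclass
class Node:
    name: str
    combine: Combine
    chain: List[Record] = field(default_factory=list)
    peers: List["Node"] = field(default_factory=list)

    def history(self, ip: str) -> List[float]:
        # Historical stage crawler indexes of this IP, read from the local chain.
        return [idx for kind, rec_ip, idx in self.chain if kind == "stage" and rec_ip == ip]

    def apply_stage_index(self, ip: str, stage_index: float) -> float:
        # Generate the new crawler index locally, then uplink both records.
        new_index = self.combine(self.history(ip), stage_index)
        self.chain.append(("stage", ip, stage_index))
        self.chain.append(("crawler", ip, new_index))
        return new_index

    def publish_stage_index(self, ip: str, stage_index: float) -> float:
        # Uplink locally, then synchronize only the stage index to the other nodes.
        new_index = self.apply_stage_index(ip, stage_index)
        for peer in self.peers:
            peer.apply_stage_index(ip, stage_index)
        return new_index


def mean_combine(history: List[float], current: float) -> float:
    """Placeholder rule (an assumption): plain average of all stage indexes."""
    values = history + [current]
    return sum(values) / len(values)
```

Because each node derives the new crawler index itself from the same inputs, the local chains stay identical without the new crawler index ever being transmitted. The alternative noted above, sending the new crawler index directly, would replace the `combine` call on the receiving side with a plain append.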
Optionally, the following steps may also be executed in the embodiment of the present invention:
step S41, if a stage crawler index of a third historical IP shared by another first website node is received, generating a new crawler index for the third historical IP according to the shared stage crawler index and the historical stage crawler index of the third historical IP, and uplinking the new crawler index and the received stage crawler index to the blockchain in the first website node.
In the embodiment of the invention, another first website node may likewise, when the preset condition is met, determine the stage crawler index of a third historical IP within the preset time according to its own IP access record within the preset time. The third historical IP may be the same as or different from the second historical IP, and the stage crawler index of the third historical IP may be determined in the same manner as that of the second historical IP. Accordingly, the other first website node can uplink the stage crawler index to its local blockchain, generate a new crawler index for the third historical IP according to the stage crawler index and the historical stage crawler index of the third historical IP, uplink the new crawler index to its local blockchain, and share the stage crawler index with the first website nodes other than itself. The new crawler index for the third historical IP may be generated in the same manner as the new crawler index for the second historical IP.
The first website node that receives the shared stage crawler index may generate a new crawler index for the third historical IP according to the received stage crawler index and the historical stage crawler index of the third historical IP, package the third historical IP and the new crawler index into a block, and perform an uplink operation. This keeps the contents of the blockchains held by the first website nodes synchronized and thus ensures the effectiveness of crawler identification based on the blockchain at each first website node. Further, the third historical IP and the received stage crawler index of the third historical IP may be packaged into a block, and the uplink operation may be performed on that block.
In the embodiment of the invention, the different websites represented by different first website nodes can correspond to different applications. Each of the plurality of first website nodes determines the stage crawler indexes of its own website's historical IPs according to its IP access records within the preset time, and through sharing, every first website node can uplink the newly generated crawler indexes of the historical IPs to its local blockchain, so that the blockchain is maintained jointly by multiple websites. Accordingly, crawler IP risk detection based on the crawler index is performed against a blockchain jointly maintained by multiple websites. Here, crawler IP risk detection based on the crawler index means identifying whether the target IP is a crawler IP based on the target crawler index that was found.
Optionally, the operation of generating a new crawler index for the second historical IP according to the stage crawler index and the historical stage crawler index of the second historical IP specifically may include:
step S51, obtaining the historical stage crawler index of the second historical IP from the uplink block of the block chain, and obtaining the generation time of the historical stage crawler index as the first time.
In the embodiment of the invention, a block storing a stage crawler index can be provided with a first identifier and a block storing a crawler index can be provided with a second identifier, the first identifier being different from the second identifier. Accordingly, if the identifier of an uplink block in the local block chain of the first website node is the first identifier and the historical IP stored in that block is the second historical IP, the stage crawler index in the block can be read as a historical stage crawler index of the second historical IP. There may be no historical stage crawler index for the second historical IP, or there may be one or more. Over time, the number of historical stage crawler indexes of the second historical IP may increase.
Further, a block storing a stage crawler index may also store the timestamp of that stage crawler index. Accordingly, the timestamp of an uplink block whose identifier is the first identifier and whose stored historical IP is the second historical IP may be read as the first time.
Step S52, generating a new crawler index for the second historical IP according to the historical stage crawler index, the first time, the stage crawler index and the second time; the second time is the generation time of the stage crawler index.
The generation time of the stage crawler index can be recorded when the stage crawler index is generated. The access behavior of an IP may change over time, that is, an IP may change from a normal IP to a crawler IP and vice versa. Therefore, in the embodiment of the invention, the first time of the historical stage crawler index and the second time of the stage crawler index are acquired, and the historical stage crawler index, the first time, the stage crawler index and the second time are combined to generate a new crawler index for the second historical IP, so that the accuracy of the generated crawler index can be ensured to a certain extent.
Optionally, the operation of generating a new crawler index for the second historical IP according to the historical stage crawler index, the first time, the stage crawler index, and the second time may specifically include:
step S61, respectively determining a first weight of the historical stage crawler index and a second weight of the stage crawler index according to the first time and the second time; the first weight is negatively correlated with a duration of the first time from the current time, and the second weight is negatively correlated with a duration of the second time from the current time.
In the embodiment of the present invention, if the generation time of the stage crawler index is closer to the current time, it may be determined that the stage crawler index is more capable of representing the current situation of the second historical IP. Therefore, in this step, a negative correlation method may be adopted to determine the first weight for each historical stage crawler index according to the first time and the second time, and determine the second weight for the stage crawler index.
Wherein the second weight may be greater than either of the first weights, since the stage crawler index was most recently generated. For a plurality of historical stage crawler indices, the earlier the generation time, the smaller the first weight of the historical stage crawler index may be.
For example, the calculation may be based on a function of a preset time decay. The weight is represented by X, and then the weight can be calculated based on the following formula:
$$X = e^{-\alpha\,(t - t_0)}$$
where α denotes a preset attenuation coefficient, t denotes a current time, and t0 denotes a generation time. t-t0 represents the length of time the generation time is from the current time. For the first weight, t0 is the first time, and for the second weight, t0 is the second time. Of course, in an actual application scenario, other attenuation formulas may also be used to perform attenuation, so as to calculate the weight, which is not limited in this embodiment.
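As an illustration of the time-decay weight above, the following is a minimal Python sketch. The function name `decay_weight`, the time unit (hours) and the value of the attenuation coefficient are assumptions made for the example and are not specified by this embodiment.

```python
import math


def decay_weight(generated_at: float, now: float, alpha: float) -> float:
    """Return X = exp(-alpha * (t - t0)), with t = now and t0 = generated_at."""
    elapsed = now - generated_at
    if elapsed < 0:
        raise ValueError("generation time is later than the current time")
    return math.exp(-alpha * elapsed)


# Hypothetical values: times in hours, alpha = 0.1 per hour.
first_weight = decay_weight(generated_at=0.0, now=5.0, alpha=0.1)   # older, historical index
second_weight = decay_weight(generated_at=4.0, now=5.0, alpha=0.1)  # newer, stage index
```

With these assumed values the more recently generated index receives the larger weight, which matches the negative correlation between the weight and the elapsed time described above.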
Step S62, generating the new crawler index according to the first weight, the second weight, the historical stage crawler index and the stage crawler index.
In the embodiment of the present invention, the product of the second weight and the stage crawler index may be calculated, and the product of each historical stage crawler index and the first weight of the historical stage crawler index may be calculated. A new crawler index is then determined from the sum of the products. For example, the sum of the products is directly determined as the new crawler index. Accordingly, the new crawler index may be expressed as:
$$P' = \sum_{n=1}^{y} X_n \, p_n$$
where $n \in \{1, \dots, y\}$, $p_n$ denotes the n-th risk index participating in the calculation, and $X_n$ denotes the weight of that risk index. Assuming that the first risk index participating in the calculation is the stage crawler index, $p_1$ denotes the stage crawler index and the remaining $p_2, \dots, p_y$ denote the historical stage crawler indexes. $y$ denotes the total number of risk indexes participating in the calculation, that is, the number of historical stage crawler indexes plus 1. As time passes, the earlier a stage crawler index was generated, the smaller its share of the final $P'$.
Alternatively, the ratio of the sum of the products to the total number may be determined as the new crawler index, which is not limited in the embodiment of the present invention.
In the embodiment of the invention, the weight is set according to how early or late the generation time is, so that a risk index generated earlier has a smaller influence on the finally generated crawler index. This ensures the accuracy of the generated crawler index to a certain extent and reduces the influence of elapsed time on the IP crawler risk judgment.
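The weighted combination of step S62 can be sketched as follows. This is a minimal Python sketch under stated assumptions: the exponential decay weight is used for every index, each index is passed as a hypothetical `(value, generation_time)` pair, and the `average` flag selects the alternative of dividing the sum of products by the total number of indexes.

```python
import math
from typing import List, Tuple


def new_crawler_index(
    indices: List[Tuple[float, float]],
    now: float,
    alpha: float,
    average: bool = False,
) -> float:
    """Combine the stage index and the historical stage indexes into P'.

    indices: (value, generation_time) pairs, one per risk index.
    """
    if not indices:
        raise ValueError("at least the stage crawler index is required")
    # Sum of products: weight of each index times the index value.
    total = sum(math.exp(-alpha * (now - t0)) * p for p, t0 in indices)
    return total / len(indices) if average else total
```

For example, `new_crawler_index([(0.8, 10.0), (0.4, 8.0)], now=10.0, alpha=0.5)` weights the stage index 0.8 by 1 and the historical index 0.4 by e^-1, so the older index contributes less to the result.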
Correspondingly, the operation of searching the crawler indexes corresponding to the first historical IPs for the crawler index corresponding to the target IP may specifically include: searching the block chain of the first website node for the most recently generated crawler index corresponding to the target IP. In this step, the stored crawler index and timestamp may be read from each uplink block whose identifier is the second identifier and whose stored historical IP is the target IP, and the crawler index whose timestamp represents the latest time is taken as the target crawler index. In the embodiment of the invention, searching for the most recently generated crawler index corresponding to the target IP means that the crawler index found is the one that best reflects the current situation, which ensures the accuracy of subsequent crawler identification and avoids unnecessary execution of the preset anti-crawler operation.
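The lookup of the most recently generated crawler index can be sketched as follows. This is a minimal Python sketch; the `Block` structure, its field names and the numeric identifier values are hypothetical and only model the information that the description says an uplink block stores.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

STAGE_INDEX_BLOCK = 1    # hypothetical value of the "first identifier"
CRAWLER_INDEX_BLOCK = 2  # hypothetical value of the "second identifier"


@dataclass(frozen=True)
class Block:
    identifier: int   # which kind of index the block stores
    ip: str           # historical IP stored in the block
    index: float      # stage crawler index or crawler index
    timestamp: float  # generation time of the stored index


def find_target_crawler_index(chain: Iterable[Block], target_ip: str) -> Optional[float]:
    """Return the most recently generated crawler index for target_ip, if any."""
    candidates = [
        b for b in chain
        if b.identifier == CRAWLER_INDEX_BLOCK and b.ip == target_ip
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda b: b.timestamp).index
```

Blocks carrying the first identifier are skipped even when they are newer, since they hold stage crawler indexes rather than crawler indexes.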
It should be noted that, for any first website node, the operations of determining the stage crawler index of the second historical IP within the preset time, chaining the stage crawler index to the block chain in the first website node, generating a new crawler index for the second historical IP, and synchronizing the stage crawler index to other first website nodes with the block chain may be performed for multiple times.
Optionally, the determining, according to the IP access record of the first website node within the preset time, the operation of the stage crawler index of the second historical IP within the preset time may specifically include:
Step S71, for a second historical IP within the preset time, determining, according to the IP access record, a first number of uses of the access source, a number of accesses, and a second number of uses of the access agent, which correspond to the second historical IP.
The access source may refer to the Referer parameter of a historical access request carrying the second historical IP; it may be, for example, the part of the request header (header) of the HTTP request that indicates the source from which the current resource was requested. The access agent may refer to the user agent (User-Agent). The user agent may be a special string header that serves as a first-level identifier of the requesting client.
For each historical access request, the IP used by the historical access request can be extracted as a second historical IP, and the corresponding access source and the access proxy are obtained and written into the IP access record. The implementation manner of obtaining the access source and the access proxy may be selected according to actual requirements, for example, the access source and the access proxy may be extracted from a request header of a history request.
Specifically, statistics may be compiled for the second historical IP according to the IP access record within the preset time. Multiple records may contain the same second historical IP, where one record corresponds to one historical access request, i.e. to one access. The access source and the access agent are extracted from each of the records corresponding to the second historical IP. These records may involve multiple access sources, and the number of occurrences of the most frequent access source is determined as the first number of uses. For example, if 10 access sources are extracted, 8 of which are access source 1 and 2 of which are access source 2, then 8 may be taken as the first number of uses. The records may likewise involve multiple access agents, and the number of occurrences of the most frequent access agent is determined as the second number of uses. For example, if 10 access agents are extracted, 7 of which are access agent 1 and 3 of which are access agent 2, then 7 may be taken as the second number of uses.
The number of access times corresponding to the second historical IP may refer to a total number of requests of the second historical IP within a preset time, and the number of the plurality of records corresponding to the second historical IP may be determined as the number of access times. The preset time can be set according to actual requirements, and for example, the preset time can be 1 hour. The number of accesses corresponding to the second historical IP may represent the frequency of accesses by the second historical IP.
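The counting in step S71 can be sketched as follows. This is a minimal Python sketch; the record layout (`AccessRecord` with `ip`, `referer` and `user_agent` fields) and the function name are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class AccessRecord(NamedTuple):
    ip: str
    referer: str      # access source
    user_agent: str   # access agent


class UsageCounts(NamedTuple):
    first_uses: int   # r: occurrences of the most frequent access source
    second_uses: int  # u: occurrences of the most frequent access agent
    accesses: int     # c: number of records for this IP


def usage_counts(records: Iterable[AccessRecord], ip: str) -> UsageCounts:
    """Count r, u and c for one historical IP within the preset time."""
    own = [rec for rec in records if rec.ip == ip]
    if not own:
        return UsageCounts(0, 0, 0)
    r = Counter(rec.referer for rec in own).most_common(1)[0][1]
    u = Counter(rec.user_agent for rec in own).most_common(1)[0][1]
    return UsageCounts(r, u, len(own))
```

Applied to the example in the text (10 records, 8 with access source 1 and 7 with access agent 1), the sketch yields a first number of uses of 8, a second number of uses of 7 and a number of accesses of 10.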
Step S72, calculating a stage crawler index of the second historical IP according to the first using times, the access times, the second using times and the total access times appearing in the preset time; the stage crawler index is positively correlated with the first using times and the times ratio, and is negatively correlated with the second using times, wherein the times ratio is the ratio of the visiting times to the total visiting times.
The total access times occurring within the preset time may be the total number of IP access records within the preset time. The first number of uses may be denoted as r and the second number of uses may be denoted as u. The number of accesses may be represented as c, and the total number of accesses occurring within the preset time may be represented as c'.
Illustratively, the stage crawler index P of the second historical IP may be expressed as:
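[Formula image BDA0003686852490000141 is not reproduced in this text: it expresses P in terms of r, u, c, c' and the preset weights Δr, Δu and Δc.]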
where Δr, Δu and Δc represent preset weights, and Δr + Δu + Δc = 1.
The more uniform the access source of the same IP, the greater the probability of a crawler. The more scattered the user agents used for access, the greater the probability of a crawler. The larger the number of accesses of an IP, the higher the probability of a crawler. The larger the first number of uses, the more frequently the request sender can be determined to use the same access source within the preset time. The smaller the second number of uses, the lower the concentration and the higher the dispersion of the access agents used by the request sender within the preset time. The larger the ratio of the number of accesses to the total number of accesses, the higher the access frequency and the higher the probability of a crawler. Therefore, in the embodiment of the invention, the accuracy of the calculated stage crawler index can be ensured by calculating it so that it is positively correlated with the first number of uses and the times ratio and negatively correlated with the second number of uses.
It should be noted that, in an actual application scenario, the P value may also be calculated in combination with other log dimensions of the access log, for example, further combining a domain name, and the like, and different log dimensions may be combined at will, which is not limited in this embodiment of the present invention.
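The text fixes only the direction of each correlation and the constraint that the three weights sum to 1. The following Python sketch therefore uses one assumed normalized form, P = Δr·(r/c) + Δu·(1 − u/c) + Δc·(c/c'), purely for illustration; it is not necessarily the formula of this embodiment, and the default weight values are hypothetical.

```python
def stage_crawler_index(
    r: int,
    u: int,
    c: int,
    c_total: int,
    w_r: float = 0.4,  # hypothetical value for the weight written as Δr
    w_u: float = 0.3,  # hypothetical value for the weight written as Δu
    w_c: float = 0.3,  # hypothetical value for the weight written as Δc
) -> float:
    """Assumed form of P: rises with r and c / c_total, falls with u."""
    if c <= 0 or c_total <= 0:
        raise ValueError("access counts must be positive")
    if abs(w_r + w_u + w_c - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    source_concentration = r / c    # share of the most frequent access source
    agent_dispersion = 1.0 - u / c  # small when one access agent dominates
    times_ratio = c / c_total       # share of all accesses made by this IP
    return w_r * source_concentration + w_u * agent_dispersion + w_c * times_ratio
```

Under this assumed form, an IP with r = 8, u = 7, c = 10 out of c' = 100 total accesses scores 0.4·0.8 + 0.3·0.3 + 0.3·0.1 = 0.44, and the score stays within [0, 1] because each term is a ratio.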
Optionally, in this embodiment of the present invention, before determining, according to the IP access record, the first number of times of use of the access source, the number of times of access, and the second number of times of use of the access agent, which correspond to the second historical IP, the following steps may be performed:
step S81, acquiring a target request proportion corresponding to the second historical IP; the target request proportion is the proportion of the historical access requests carrying the specified keywords corresponding to the second historical IP.
Specifically, the keyword may be a field characterizing the browser used when the historical access request was sent, and may be extracted from the user agent field. The historical access requests using the second historical IP can be determined from the IP access record within the preset time as the historical access requests corresponding to the second historical IP. Keywords are then extracted from the user agent fields of these historical access requests and compared with the preset specified keywords. If an extracted keyword matches a specified keyword, the historical access request is determined to carry the specified keyword. Finally, the number of historical access requests carrying the specified keywords and the number of historical access requests corresponding to the second historical IP are counted, and the ratio of the former to the latter is calculated to obtain the target request proportion.
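The target request proportion of step S81 can be sketched as follows. This is a minimal Python sketch; matching a specified keyword as a case-insensitive substring of the User-Agent string is an assumption made for the example, since the text does not fix how keywords are extracted and compared.

```python
from typing import Iterable, Sequence


def target_request_proportion(user_agents: Sequence[str], keywords: Iterable[str]) -> float:
    """Share of one IP's requests whose User-Agent carries a specified keyword."""
    if not user_agents:
        return 0.0
    lowered = [k.lower() for k in keywords]
    hits = sum(
        1 for ua in user_agents
        if any(k in ua.lower() for k in lowered)
    )
    return hits / len(user_agents)
```

Each element of `user_agents` stands for the user agent field of one historical access request of the second historical IP, so the denominator is the number of that IP's requests.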
Correspondingly, the operation of determining the first number of times of use of the access source, the number of times of access, and the second number of times of use of the access agent corresponding to the second historical IP according to the IP access record may specifically include:
Step S91, determining, according to the IP access record, a first number of uses of the access source, a number of accesses, and a second number of uses of the access agent, which correspond to the second historical IP, when the target request proportion is not greater than a preset proportion threshold.
If the target request proportion is greater than the preset proportion threshold, it may be determined that the second historical IP does not belong to the category of crawlers that need to be identified; it belongs to the crawler IPs that can be released, for example the crawlers of a search engine. Accordingly, the second historical IP may be released without being counted. Conversely, if the target request proportion is not greater than the preset proportion threshold, it can be determined that the second historical IP belongs to the category of crawlers to be identified. The operation of determining the first number of uses of the access source, the number of accesses and the second number of uses of the access agent corresponding to the second historical IP according to the IP access record can therefore be executed in that case, so that the P value of the second historical IP is calculated in the subsequent steps. In this way, unnecessary computation can be avoided, thereby saving processing resources.
Specifically, the second historical IP may be released when the target request proportion is greater than the preset proportion threshold and the related information of the second historical IP is the specified information. The related information can be set according to actual requirements and can be any other information that indicates whether the IP can be released. For example, the related information may be the IP itself, and the specified information may be an IP that does not belong to the crawlers to be identified. Alternatively, the related information may be the user agent field, and the specified information may be a user agent field that is not used by the crawler IPs to be identified. Releasing the second historical IP only when both conditions hold avoids, to a certain extent, mistakenly releasing a crawler IP and the inaccurate identification that would follow. Correspondingly, when the target request proportion is greater than the preset proportion threshold but the related information of the second historical IP is not the specified information, the operation of determining the first number of uses of the access source, the number of accesses and the second number of uses of the access agent corresponding to the second historical IP according to the IP access record is still executed, so as to calculate the P value of the second historical IP in the subsequent steps.
In the embodiment of the invention, the target request proportion can be obtained by the IP back-check module in order to judge whether the second historical IP needs to be released. The IP back-check module can be responsible for tracing the real source of a historical IP appearing in the IP log, so as to determine whether the historical IP is a web crawler IP that can be released, such as a crawler IP of a search engine. When the second historical IP uses a large number of User-Agents carrying the specified keywords, the IP back-check module can call the IP back-check interface to acquire the related information of the second historical IP and confirm whether the second historical IP needs to be released.
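The release decision described above can be sketched as follows. This is a minimal Python sketch; the function names are hypothetical, and `info_is_specified` stands for the outcome of whatever check is used to compare the related information of the IP with the specified information.

```python
def should_release(proportion: float, threshold: float, info_is_specified: bool) -> bool:
    """Release only if the proportion exceeds the threshold AND the
    related information of the IP is the specified information."""
    return proportion > threshold and info_is_specified


def should_compute_stage_index(proportion: float, threshold: float, info_is_specified: bool) -> bool:
    """The counts r, c and u (and then P) are computed for every IP that is not released."""
    return not should_release(proportion, threshold, info_is_specified)
```

With an assumed threshold of 0.8, an IP whose proportion is 0.9 but whose related information is not the specified information is still analyzed, which reflects the guard against mistakenly releasing a crawler IP.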
A specific application scenario of the embodiment of the present invention is described below. With the continuous development of information technology, the demand for network data acquisition keeps growing; accordingly, the crawler industry is growing rapidly, and illegally crawling data from platforms without authorization using crawler technology is increasingly common. Large volumes of crawler traffic may distort user access data, affect access by normal users, and waste website server resources. In addition, batch scraping of a website's data resources by illegal crawlers causes losses to the website. Therefore, it is very important to identify crawler IPs quickly, apply an anti-crawler policy to them, and limit illegal data crawling.
Illustratively, the embodiment of the invention can be implemented based on an access log analysis module, a smart contract interaction module, an IP back-check module and a smart contract module. The smart contract module can be deployed on the block chain, while the access log analysis module, the smart contract interaction module and the IP back-check module can be deployed off the chain. The four modules can be deployed together as a system in each participating node, and can be deployed based on cloud services. Further, in the embodiment of the present invention, a contract of an alliance chain may be established in advance. Each participating node can deploy the contract to its own server back end and monitor network requests. After successful deployment, the node can join the alliance chain as a new node. A participating node may refer to the aforementioned first website node.
The access log analysis module can be responsible for analyzing the IP access logs obtained from the application and calculating the stage crawler index. The IP back-check module may be used to determine whether a second historical IP appearing in the IP access log needs to be released. The smart contract interaction module may be used to interact with the smart contract module on the block chain: it calls the smart contract, sends the P value obtained by the access log analysis module and the timestamp corresponding to the P value to the smart contract, and provides an API (application programming interface) that the client of the website node can call to query the crawler index of the target IP. Meanwhile, the P value can be synchronized to other participating nodes. Further, the smart contract may be used to uplink received P values and is responsible for calculating P' on the chain. P' can be provided to the client through the smart contract interaction module so that the client can decide on a crawler policy.
Take 3 first website nodes as an example. Each of the 3 first website nodes is deployed with a block chain, which stores the crawler indexes corresponding to the first historical IPs obtained based on the IP access records of the 3 first website nodes. Fig. 2 is a schematic view of a scenario provided by an embodiment of the present invention. As shown in Fig. 2, the access log analysis module, the smart contract interaction module, the IP back-check module and the smart contract module deployed in each first website node may interact with each other. Application 1, application 2 and application 3 are the applications corresponding to the 3 first website nodes, respectively.
In one implementation, a user may access the website of a client, and the website may input the generated IP access log into the access log analysis module by data pushing or by calling an interface. The IP back-check module may determine whether a parsed second historical IP needs to be released; if so, the second historical IP is not processed further. If the second historical IP does not need to be released, the access log analysis module can generate a stage crawler index P for the second historical IP and call the smart contract interaction module, passing it the second historical IP, its P value and the timestamp. The smart contract interaction module calls the smart contract on the block chain and sends the received IP, its P value and the timestamp to the on-chain smart contract module, which stores them in the block chain and calculates P' from them. These steps may be performed multiple times.
Correspondingly, when a user subsequently accesses the client, the client can acquire the user IP, that is, the access IP of the access request, as the target IP, and call the smart contract interaction module, which in turn calls the on-chain smart contract to acquire the P' value of the target IP. According to the configured security policy, the anti-crawler policy is triggered if the P' value is greater than the preset index threshold and is not triggered otherwise.
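The client-side decision can be sketched as follows. This is a minimal Python sketch; the function name is hypothetical, and treating an IP for which no P' value is found as not triggering the policy is an assumption, since the text does not describe that case.

```python
from typing import Optional


def should_trigger_anti_crawler(target_index: Optional[float], index_threshold: float) -> bool:
    """Trigger the anti-crawler policy only when P' exceeds the threshold.

    target_index is None when no P' value is found for the target IP
    (assumed here to mean the policy is not triggered).
    """
    return target_index is not None and target_index > index_threshold
```

The comparison is strict, so a P' value exactly equal to the preset index threshold does not trigger the policy.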
In one implementation, the anti-crawler policy of a website is often applied after the fact: only after a crawler IP has accessed the website multiple times and thereby triggered the anti-crawler condition can anti-crawler measures be applied to requests from that IP. Moreover, the anti-crawler policies of different websites are relatively independent. As a result, the crawler IP cannot be identified in time.
According to the embodiment of the invention, multiple applications jointly upload access records from different IPs, the crawler index of each IP is calculated on the block chain by a smart contract, and the IP crawler indexes are stored in the block chain and shared, so that the IPs of suspected crawlers are identified using the IP access logs of multiple applications. Further, an alliance chain is built using block chain technology, IP behaviors are quantized, and the crawler indexes obtained through quantization are stored in the alliance chain. The participating nodes on the chain can jointly identify suspected crawler access requests through the alliance chain and thus identify crawler IPs rapidly. The crawler indexes are maintained jointly and made available to the different participating nodes, that is, each participating node can query the crawler index of an IP from the block chain. This makes it easier, to a certain extent, for each participating node to identify crawler IPs and to implement anti-crawler policies quickly.
Fig. 3 is a block diagram of an anti-crawler processing apparatus according to an embodiment of the present invention, where the apparatus may be applied to a first website node, and the apparatus 20 may include:
the searching module 201 is configured to search, in response to the access request, a crawler index corresponding to the target IP from crawler indexes corresponding to the first historical IP, as a target crawler index; the first historical IP and a crawler index corresponding to the first historical IP are obtained according to IP access records of at least two second website nodes, the crawler index is used for representing the probability that the first historical IP belongs to the crawler IP, and the target IP is the IP used by the access request;
an executing module 202, configured to execute a preset anti-crawler operation when the target crawler index is greater than a preset index threshold.
Optionally, the crawler index corresponding to the first historical IP is stored in a blockchain, the first website node is any one website node of the at least two second website nodes, and the blockchain is locally deployed in any one of the first website nodes.
Optionally, the apparatus 20 further includes:
the determining module is used for determining a stage crawler index of a second historical IP within a preset time according to an IP access record of the first website node within the preset time under the condition that a preset condition is met; the stage crawler index is used for representing the probability that the second historical IP belongs to a crawler within the preset time;
a chaining module for chaining the stage crawler index to a block chain in the first website node, and a first generating module for generating a new crawler index for the second historical IP according to the stage crawler index and the historical stage crawler index of the second historical IP, and chaining the new crawler index to the block chain in the first website node; the historical stage crawler index comprises a stage crawler index previously linked to the blockchain.
Optionally, the apparatus 20 further includes:
and the synchronization module is used for synchronizing the stage crawler index to other first website nodes with the block chains after the determination module determines the stage crawler index of the second historical IP within the preset time, so that the other first website nodes can generate a new crawler index for the second historical IP according to the stage crawler index and the historical stage crawler index of the second historical IP, and link the new crawler index and the received stage crawler index to the block chains in the other first website nodes.
Optionally, the apparatus 20 further includes:
and the second generation module is used for generating a new crawler index for the third historical IP according to the stage crawler index shared by other first website nodes and the historical stage crawler index of the third historical IP if the stage crawler index of the third historical IP shared by other first website nodes is received, and chaining the new crawler index and the received stage crawler index to the block chain in the first website node.
Optionally, the first generating module is specifically configured to:
acquiring a historical stage crawler index of the second historical IP from an uplink block of the block chain, and acquiring generation time of the historical stage crawler index as first time;
generating a new crawler index for the second historical IP according to the historical stage crawler index, the first time, the stage crawler index and the second time; the second time is the generation time of the stage crawler index.
Optionally, the first generating module is further specifically configured to:
according to the first time and the second time, respectively determining a first weight of the historical stage crawler index and a second weight of the stage crawler index; the first weight is negatively correlated with the duration of the first time from the current time, and the second weight is negatively correlated with the duration of the second time from the current time;
generating the new crawler index according to the first weight, the second weight, the historical stage crawler index and the stage crawler index;
the search module 201 is specifically configured to: and searching the newly generated crawler index corresponding to the target IP from the block chain of the first website node.
Optionally, the determining module is specifically configured to:
for a second historical IP within the preset time, determining, according to the IP access record, a first number of uses of the access source, a number of accesses, and a second number of uses of the access agent, which correspond to the second historical IP;
calculating a stage crawler index of the second historical IP according to the first using times, the access times, the second using times and the total access times appearing in the preset time; the stage crawler index is positively correlated with the first using times and the times ratio, and is negatively correlated with the second using times, wherein the times ratio is the ratio of the visiting times to the total visiting times.
Optionally, the apparatus 20 further includes:
an obtaining module, configured to obtain a target request proportion corresponding to the second historical IP before the determining module determines, according to the IP access record, a first number of uses of the access source, a number of accesses, and a second number of uses of the access agent, which correspond to the second historical IP; the target request proportion is the proportion of the historical access requests which correspond to the second historical IP and carry the specified keywords;
the determining module is further specifically configured to:
and under the condition that the target request duty ratio is not greater than a preset duty ratio threshold, determining the first using times and the accessing times of the access source corresponding to the second historical IP and the second using times of the access agent according to the IP access record.
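The keyword pre-check can be sketched as below. This is an illustration under stated assumptions: requests are modeled as plain strings, the keywords `"bot"` and `"spider"` and the threshold values are hypothetical, and the sketch only reports whether the stage index calculation should proceed, because this passage does not say what happens when the proportion exceeds the threshold.

```python
def target_request_proportion(requests: list[str],
                              keywords: tuple[str, ...]) -> float:
    """Share of an IP's historical requests that carry any specified keyword."""
    if not requests:
        return 0.0
    hits = sum(1 for request in requests
               if any(keyword in request for keyword in keywords))
    return hits / len(requests)


def should_compute_stage_index(requests: list[str],
                               keywords: tuple[str, ...],
                               threshold: float) -> bool:
    """True when the proportion is not greater than the preset threshold."""
    return target_request_proportion(requests, keywords) <= threshold
```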
In summary, the anti-crawler processing apparatus provided in the embodiment of the present invention, in response to an access request, searches the crawler indexes corresponding to the first historical IPs for the crawler index corresponding to the target IP and uses it as the target crawler index. The first historical IPs and their corresponding crawler indexes are obtained according to the IP access records of at least two second website nodes; the crawler index represents the probability that a first historical IP is a crawler IP, and the target IP is the IP used by the access request. A preset anti-crawler operation is executed in the case that the target crawler index is greater than a preset index threshold. Because the crawler IP is identified using the access records of a plurality of second website nodes, more information is available for identification. The first website node only needs to look up the first historical IPs and crawler indexes obtained from those records to identify a crawler IP, so the crawler IP can, to a certain extent, be identified quickly and the anti-crawler operation can be executed against it.
Meanwhile, compared with identification based only on the historical data of the website itself, even when the target IP accesses the first website node for the first time, the target IP may already have accessed one of the second website nodes. Therefore, by using crawler indexes obtained from the IP access records of the at least two second website nodes, the first website node is, to a certain extent, more likely to find a crawler index for the target IP, so that identification is performed quickly even on the first access of the target IP and anti-crawler measures are executed in time.
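The lookup-and-threshold flow summarized above can be sketched as follows. The shared index is modeled as an in-memory mapping from IP to crawler index, which is an assumption made for brevity; the function name `handle_request` and the callback `anti_crawler` are hypothetical stand-ins for the search module and the preset anti-crawler operation.

```python
from typing import Callable, Mapping


def handle_request(target_ip: str,
                   crawler_indexes: Mapping[str, float],
                   threshold: float,
                   anti_crawler: Callable[[str], None]) -> bool:
    """Look up the target IP and run the anti-crawler operation if needed.

    crawler_indexes stands in for the indexes built from the access
    records of at least two website nodes. Returns True when the
    operation was executed.
    """
    target_index = crawler_indexes.get(target_ip)
    if target_index is not None and target_index > threshold:
        anti_crawler(target_ip)
        return True
    return False
```

Because the mapping is built from several nodes' records, an IP that has never visited this node can still be found, which is the first-access benefit the text describes.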
The invention also provides an anti-crawler processing system, which comprises at least two first website nodes, wherein each first website node is used for executing the following steps: in response to the access request, searching a crawler index corresponding to the target IP from crawler indexes corresponding to the first historical IP to serve as the target crawler index; the first historical IP and a crawler index corresponding to the first historical IP are obtained according to IP access records of the at least two first website nodes, the crawler index is used for representing the probability that the first historical IP belongs to the crawler IP, and the target IP is the IP used by the access request; and executing a preset anti-crawler operation under the condition that the target crawler index is larger than a preset index threshold value. The implementation manner of each step and the achievable technical effect may refer to the foregoing related description, and are not described herein again.
The present invention also provides an electronic device, see fig. 4, comprising: a processor 901, a memory 902, and a computer program 9021 stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the anti-crawler processing method of the foregoing embodiments.
The present invention also provides a readable storage medium, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the anti-crawler processing method of the foregoing embodiment.
Since the device embodiment is basically similar to the method embodiment, its description is relatively brief; for relevant points, refer to the corresponding description of the method embodiment.
It should be noted that various information and data acquired in the embodiment of the present invention are acquired under the authorization of the information/data holder.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the anti-crawler processing apparatus according to the present invention. The present invention may also be embodied as an apparatus or device program for carrying out a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering. These words may be interpreted as names.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (13)

1. An anti-crawler processing method, applied to a first website node, the method comprising:
in response to the access request, searching a crawler index corresponding to a target IP from crawler indexes corresponding to the first historical Internet protocol address IP, and taking the crawler index as the target crawler index; the first historical IP and a crawler index corresponding to the first historical IP are obtained according to IP access records of at least two second website nodes, the crawler index is used for representing the probability that the first historical IP belongs to the crawler IP, and the target IP is the IP used by the access request;
and executing a preset anti-crawler operation under the condition that the target crawler index is larger than a preset index threshold value.
2. The method of claim 1, wherein the crawler index corresponding to the first historical IP is stored in a blockchain, wherein the first website node is any one of the at least two second website nodes, and wherein the blockchain is deployed locally at any one of the first website nodes.
3. The method of claim 2, further comprising:
under the condition that a preset condition is met, according to an IP access record of the first website node within a preset time, determining a stage crawler index of a second historical IP within the preset time; the stage crawler index is used for representing the probability that the second historical IP belongs to a crawler within the preset time;
linking the stage crawler index to a block chain in the first website node, generating a new crawler index for the second historical IP according to the stage crawler index and the historical stage crawler index of the second historical IP, and linking the new crawler index to the block chain in the first website node; the historical stage crawler index includes a stage crawler index that was previously linked up to the blockchain.
4. The method of claim 3, wherein after determining the stage crawler index for the second historical IP within the preset time, the method further comprises:
and synchronizing the stage crawler index to other first website nodes with the block chains, so that the other first website nodes can generate a new crawler index for the second historical IP according to the stage crawler index and the historical stage crawler index of the second historical IP, and chain the new crawler index and the received stage crawler index to the block chains in the other first website nodes.
5. The method of claim 3, further comprising:
if a stage crawler index of a third historical IP shared by other first website nodes is received, generating a new crawler index for the third historical IP according to the stage crawler index shared by the other first website nodes and the historical stage crawler index of the third historical IP, and linking the new crawler index and the received stage crawler index to a block chain in the first website node.
6. The method according to any one of claims 3-5, wherein the generating a new crawler index for the second historical IP according to the stage crawler index and the historical stage crawler index for the second historical IP comprises:
acquiring a historical stage crawler index of the second historical IP from a chain-linked block of the block chain, and acquiring generation time of the historical stage crawler index as first time;
generating a new crawler index for the second historical IP according to the historical stage crawler index, the first time, the stage crawler index and the second time; the second time is the generation time of the stage crawler index.
7. The method of claim 6, wherein generating a new crawler index for the second historical IP based on the historical stage crawler index, the first time, the stage crawler index, and a second time comprises:
according to the first time and the second time, respectively determining a first weight of the historical stage crawler index and a second weight of the stage crawler index; the first weight is negatively correlated with the duration of the first time from the current time, and the second weight is negatively correlated with the duration of the second time from the current time;
generating the new crawler index according to the first weight, the second weight, the historical stage crawler index and the stage crawler index;
the searching for the crawler index corresponding to the target IP from the crawler indexes corresponding to the first historical Internet protocol address IP comprises: searching the block chain of the first website node for the newly generated crawler index corresponding to the target IP.
8. The method according to any one of claims 3-5, wherein the determining the stage crawler index of the second historical IP within a preset time according to the IP access record of the first website node within the preset time comprises:
for a second historical IP within the preset time, determining, according to the IP access record, a first number of times of use of the access source, a number of times of access, and a second number of times of use of the access agent corresponding to the second historical IP;
calculating a stage crawler index of the second historical IP according to the first number of times of use, the number of times of access, the second number of times of use, and the total number of accesses occurring within the preset time; the stage crawler index is positively correlated with the first number of times of use and with the access ratio, and is negatively correlated with the second number of times of use, wherein the access ratio is the ratio of the number of times of access to the total number of accesses.
9. The method of claim 8, wherein before determining the first number of times of use of the access source, the number of times of access, and the second number of times of use of the access agent corresponding to the second historical IP from the IP access record, the method further comprises:
acquiring a target request proportion corresponding to the second historical IP; the target request proportion is the proportion of historical access requests corresponding to the second historical IP that carry a specified keyword;
the determining, according to the IP access record, of the first number of times of use of the access source, the number of times of access, and the second number of times of use of the access agent corresponding to the second historical IP comprises:
in the case that the target request proportion is not greater than a preset proportion threshold, determining, according to the IP access record, the first number of times of use of the access source, the number of times of access, and the second number of times of use of the access agent corresponding to the second historical IP.
10. An anti-crawler processing apparatus, applied to a first website node, the apparatus comprising:
the searching module is used for responding to the access request, searching a crawler index corresponding to the target IP from the crawler index corresponding to the first historical Internet protocol address IP to be used as the target crawler index; the first historical IP and a crawler index corresponding to the first historical IP are obtained according to IP access records of at least two second website nodes, the crawler index is used for representing the probability that the first historical IP belongs to the crawler IP, and the target IP is the IP used by the access request;
and the execution module is used for executing preset anti-crawler operation under the condition that the target crawler index is greater than a preset index threshold value.
11. An anti-crawler processing system, the system comprising at least two first website nodes, each first website node configured to perform the steps of:
responding to the access request, and searching a crawler index corresponding to the target IP from crawler indexes corresponding to the first historical IP to serve as the target crawler index; the first historical IP and a crawler index corresponding to the first historical IP are obtained according to IP access records of the at least two first website nodes, the crawler index is used for representing the probability that the first historical IP belongs to the crawler IP, and the target IP is the IP used by the access request;
and executing a preset anti-crawler operation under the condition that the target crawler index is larger than a preset index threshold value.
12. An electronic device, comprising:
a processor, a memory, and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of claims 1-9 when executing the program.
13. A readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210649745.5A CN115065520A (en) 2022-06-09 2022-06-09 Anti-crawler processing method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210649745.5A CN115065520A (en) 2022-06-09 2022-06-09 Anti-crawler processing method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115065520A true CN115065520A (en) 2022-09-16

Family

ID=83201222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210649745.5A Pending CN115065520A (en) 2022-06-09 2022-06-09 Anti-crawler processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115065520A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116032540A (en) * 2022-12-05 2023-04-28 杭州思律舟到科技有限公司 Network security management method and system based on data processing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7599920B1 (en) * 2006-10-12 2009-10-06 Google Inc. System and method for enabling website owners to manage crawl rate in a website indexing system
CN105187396A (en) * 2015-08-11 2015-12-23 小米科技有限责任公司 Method and device for identifying web crawler
CN109862018A (en) * 2019-02-21 2019-06-07 中国工商银行股份有限公司 Anti- crawler method and system based on user access activity
CN110958228A (en) * 2019-11-19 2020-04-03 用友网络科技股份有限公司 Crawler access interception method and device, server and computer readable storage medium
CN111597424A (en) * 2020-07-21 2020-08-28 平安国际智慧城市科技股份有限公司 Crawler identification method and device, computer equipment and storage medium
CN113364753A (en) * 2021-05-31 2021-09-07 平安国际智慧城市科技股份有限公司 Anti-crawler method and device, electronic equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN101582887B (en) Safety protection method, gateway device and safety protection system
CN105491053A (en) Web malicious code detection method and system
CN107341395B (en) Method for intercepting reptiles
CN103166917A (en) Method and system for network equipment identity recognition
CN102436564A (en) Method and device for identifying falsified webpage
CN103678321A (en) Webpage element determination method and device and user behavior route determination method and device
CN102663052B (en) Method and device for providing search results of search engine
CN108154029A (en) Intrusion detection method, electronic equipment and computer storage media
CN110071941A (en) A kind of network attack detecting method, equipment, storage medium and computer equipment
CN104184832A (en) Data submitting method and device in network application
US20180203927A1 (en) System and method for determining an authority rank for real time searching
CN110708339B (en) Correlation analysis method based on WEB log
CN111368227B (en) URL processing method and device
CN109241733A (en) Crawler Activity recognition method and device based on web access log
CN103905372A (en) Method and device for removing false alarm of phishing website
CN113518077A (en) Malicious web crawler detection method, device, equipment and storage medium
CN114915479A (en) Web attack phase analysis method and system based on Web log
CN109800364A (en) Amount of access statistical method, device, equipment and storage medium based on block chain
CN107800686A (en) A kind of fishing website recognition methods and device
CN112131507A (en) Website content processing method, device, server and computer-readable storage medium
CN105184156A (en) Security threat management method and system
CN115065520A (en) Anti-crawler processing method and device, electronic equipment and readable storage medium
CN109981533B (en) DDoS attack detection method, device, electronic equipment and storage medium
CN103684823A (en) Weblog recording method, network access path determining method and related devices
CN110933082A (en) Method, device and equipment for identifying lost host and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination