CN113132336A - Method, system and equipment for processing web crawler - Google Patents


Info

Publication number
CN113132336A
Authority
CN
China
Prior art keywords
crawler
crawlers
category
unknown
access flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010027254.8A
Other languages
Chinese (zh)
Inventor
朱传江
高力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yundun Information Technology Co ltd
Original Assignee
Shanghai Yundun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yundun Information Technology Co ltd
Priority to CN202010027254.8A
Publication of CN113132336A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/145 Countermeasures against malicious traffic, the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/147 Network analysis or design for predicting network behaviour
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/04 Processing captured monitoring data, e.g. for logfile generation
    • H04L43/045 Processing captured monitoring data, e.g. for logfile generation, for graphical visualisation of monitoring data
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H04L63/1466 Active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks
    • H04L63/20 Network architectures or network communication protocols for network security for managing network security; network security policies in general

Abstract

The method comprises: receiving access traffic, steering the access traffic to a protection node at the firewall edge, identifying the access traffic, and determining known crawlers and the crawler category to which each known crawler belongs; performing behavior analysis and identification on access traffic whose crawler category has not been determined, determining a threat value of an unknown crawler, and determining the crawler category to which the unknown crawler belongs according to the threat value; and visually displaying the access traffic according to the categories of the known crawlers and of the unknown crawlers, and adjusting the access traffic of the current service. Malicious crawlers and normal traffic are thereby identified efficiently and accurately and handled separately, while crawler data are displayed visually and intuitively, so that website administrators can conveniently perform targeted analysis of the data for each crawler category.

Description

Method, system and equipment for processing web crawler
Technical Field
The present application relates to the field of computers, and in particular, to a method, system, and device for processing web crawlers.
Background
According to statistics, 30-60% of the traffic on the Internet is generated by bots (web robots), and only part of the traffic comes from normal human access. Not all of this programmatic traffic is malicious: normal bot programs, such as search engine crawlers, advertisement programs, third-party partner programs, and programs that respect the Robots protocol, generate normal machine traffic.
However, malicious crawler bots can make the service of a business website unavailable, degrade user experience, expose website vulnerabilities, and cause business failures, for example through crawling of business data, abuse of interfaces, and CC attacks, which bring high risk and hard-to-estimate losses to enterprises. Malicious crawler bot traffic can even exceed 30% of overall network traffic. Large companies face a more severe threat from malicious bot traffic. The affected industries include online lottery, airlines, finance, medical care, ticketing and the like, and the malicious bot traffic targeting the e-commerce, medical and airline industries is more specialized. In the prior art, malicious crawler bots are identified inefficiently and protection is weak, malicious crawlers cannot be accurately distinguished from normal traffic, and website services and intellectual property are easily damaged, for example by marketing fraud, credential stuffing, seat hoarding for flights and travel, invalid operations, crawling of sensitive information, abuse of interfaces, and server overload.
Existing scheme one: a website typically blocks traffic by IP identification at a network firewall, or by comparison against a pre-built IP library. Its disadvantages are: 1) the probability of false positives is high; 2) information in a pre-built IP address library is synchronized slowly and is easily bypassed through proxy IPs, so the intended protection is not achieved.
Existing scheme two: access requests are limited by setting an access frequency for the website at key business nodes. Its disadvantages are: 1) it covers only the interfaces of some scenarios, and the probability of false positives is high; 2) it is tightly coupled with the business layer, costly to maintain, and cannot meet the requirements of the business system.
Existing scheme three: protection is enabled against known bot types. Its disadvantages are: 1) only a single disposition mode is available; 2) unknown bots cannot be discovered and summarized, and there is no visual page that displays crawler data intuitively, which hinders targeted analysis by website administrators.
Disclosure of Invention
An object of the present application is to provide a web crawler processing method, system and device, which solve the problems in the prior art that the identification efficiency of malicious crawlers is low, the protection is weak, malicious crawlers and normal traffic cannot be identified accurately, and crawler data cannot be displayed visually.
According to an aspect of the present application, there is provided a web crawler processing method, including:
receiving access traffic, steering the access traffic to a protection node at the firewall edge, identifying the access traffic, and determining known crawlers and the crawler category to which each known crawler belongs;
performing behavior analysis and identification on access traffic whose crawler category has not been determined, determining a threat value of an unknown crawler, and determining the crawler category to which the unknown crawler belongs according to the threat value;
and visually displaying the access traffic according to the categories of the known crawlers and of the unknown crawlers, and adjusting the access traffic of the current service.
Further, the identifying the access traffic includes:
crawler information of access traffic is identified according to historical crawler network addresses and behavior characteristic information, wherein the historical crawler network addresses are known crawler information.
Further, the crawler categories include malicious crawlers and legitimate crawlers, and the method includes:
and processing malicious crawlers in known crawlers and malicious crawlers in unknown crawlers according to preset processing actions to obtain processed clean flow, and returning the clean flow to a source station.
Further, the performing behavior analysis and identification on access traffic whose crawler category has not been determined includes:
constructing a data set from historical browsing data and historical crawler capture data, and training an artificial intelligence model with the data set to obtain a preset artificial intelligence model;
and performing, with the preset artificial intelligence model, behavior analysis and identification on the access traffic whose crawler category has not been determined.
Further, the visually displaying the access traffic according to the category to which the known crawler belongs and the category to which the unknown crawler belongs includes:
the information of the known crawlers and the unknown crawlers is counted according to the crawler categories to which the known crawlers belong and the categories to which the unknown crawlers belong, clustering analysis is carried out according to the counted information of the known crawlers and the unknown crawlers, and the crawler category information is determined, wherein the crawler category information comprises: the crawler request times, the information of known crawlers, the information of unknown crawlers, the information of legal crawlers and the information of malicious crawlers;
and visually displaying the access flow according to the corresponding crawler category information.
Further, the method comprises:
and visually displaying the unknown crawler according to the category of the unknown crawler, the equipment fingerprint, the network address field and the behavior of the website service.
Further, the preset treatment action comprises:
performing, according to the service identifier of the user, any one or a combination of the following: returning false data, observing, releasing, intercepting, human-machine verification, and applying a preset custom font library.
Further, the method further comprises:
a rule set consisting of a control policy is added to limit access traffic whose request frequency is not within a preset threshold.
According to another aspect of the present application, there is provided a web crawler processing system, wherein the system includes: a crawler detection module, a crawler identification module and a crawler display module, wherein,
the crawler detection module is used for receiving access traffic, steering the access traffic to a protection node at the firewall edge, identifying the access traffic, and determining known crawlers and the crawler category to which each known crawler belongs;
the crawler identification module is used for performing behavior analysis and identification on access traffic whose crawler category has not been determined, determining a threat value of an unknown crawler, and determining the crawler category to which the unknown crawler belongs according to the threat value;
and the crawler display module is used for visually displaying the access traffic according to the categories of the known crawlers and of the unknown crawlers, and adjusting the access traffic of the current service.
Further, the system comprises a crawler processing module, wherein the crawler processing module is used for processing malicious crawlers in known crawlers and malicious crawlers in unknown crawlers according to preset processing actions to obtain processed clean flow, and the clean flow is returned to the source station.
According to yet another aspect of the present application, there is provided a computer readable medium having stored thereon computer readable instructions executable by a processor to implement a web crawler processing method as described in any one of the preceding.
According to still another aspect of the present application, there is provided a web crawler processing apparatus, wherein the apparatus includes:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of a web crawler processing method as in any one of the preceding claims.
Compared with the prior art, in the present application access traffic is received, steered to a protection node at the firewall edge, and identified according to a historical crawler network address library and behavior characteristic information, so that known crawlers and the crawler category to which each known crawler belongs are determined; behavior analysis and identification are performed on access traffic whose crawler category has not been determined, a threat value of an unknown crawler is determined, and the crawler category to which the unknown crawler belongs is determined according to the threat value; and the access traffic is visually displayed according to the categories of the known crawlers and of the unknown crawlers, and the access traffic of the current service is adjusted. Malicious crawlers and normal traffic are thereby identified efficiently and accurately and handled separately, while crawler data are displayed visually and intuitively, so that website administrators can conveniently perform targeted analysis of the data for each crawler category.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a web crawler processing method provided in accordance with an aspect of the subject application;
FIG. 2 is a flow chart of a flow processing method in a preferred embodiment of the present application;
FIG. 3 illustrates a processing system framework diagram of a web crawler provided in accordance with another aspect of the subject application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transient media), such as modulated data signals and carrier waves.
Fig. 1 is a schematic flow chart illustrating a web crawler processing method according to an aspect of the present application. The method includes steps S11 to S13, wherein in step S11, access traffic is received and steered to a protection node at the firewall edge, the access traffic is identified, and known crawlers and the crawler category to which each known crawler belongs are determined;
step S12, performing behavior analysis and identification on access traffic whose crawler category has not been determined, determining a threat value of an unknown crawler, and determining the crawler category to which the unknown crawler belongs according to the threat value;
and step S13, visually displaying the access traffic according to the categories of the known crawlers and of the unknown crawlers, and adjusting the access traffic of the current service. Malicious crawlers and normal traffic are thereby identified efficiently and accurately and handled separately, while crawler data are displayed visually and intuitively, so that website administrators can conveniently perform targeted analysis of the data for each crawler category.
Specifically, in step S11, access traffic is received and steered to a protection node at the firewall edge, the access traffic is identified, and known crawlers and the crawler category to which each known crawler belongs are determined. Here, after the customer's service is connected to the firewall, the edge protection node provides the customer with a protection network protocol address (protection IP address) and steers all access traffic of the current service to the protection node at the firewall edge, thereby protecting the network protocol address of the website's source station. The behavior characteristic information is based on features in the User-Agent, such as identifiers of Python scripts or Golang programs. The access traffic includes all existing access types.
Preferably, the crawler information of the access traffic is identified according to historical crawler network addresses and the behavior characteristic information, wherein the historical crawler network addresses are known crawler information. The network address and behavior characteristic information of the access traffic are compared with those of historical crawlers. If they match, the known crawler in the access traffic and the crawler category to which it belongs are determined from the historical crawler network address and behavior characteristic information; if not, the traffic is treated as an undetermined crawler. Here, the historical crawler network addresses in the historical crawler network address library are known Internet Protocol (IP) threat intelligence from which the crawler type can be determined directly, including but not limited to search engine crawlers, partner crawlers, monitoring crawlers, aggregator crawlers, social network crawlers, advertisement crawlers, backlink crawlers, IDC data center crawlers, malicious UA crawlers, fake search engine crawlers, IP reputation library blacklists, and proxy pool crawlers. The behavior characteristic information includes, but is not limited to, the back-to-origin ratio, the dynamic resource ratio, the static resource ratio, the dynamic resource back-to-origin ratio, the static resource back-to-origin ratio, the total number of accesses, the malicious UA ratio, the access duration, the average number of accesses per minute, the deduplicated UA count, the deduplicated UA ratio, the deduplicated URL count, the deduplicated URL ratio, the average request response time, the average number of accesses per website IP, the total number of website IPs, and the ratio of the total number of accesses to the average number of accesses per website IP.
Whether a known crawler can be assigned to an existing category is then determined from its characteristics and behavior. If so, it is classified into that category, so that the set of known crawlers is continuously expanded; otherwise, manual intervention determines whether it is assigned to an existing category or a new category is added.
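The matching against the historical address library and User-Agent features described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the table contents, the documentation-range IP networks, the User-Agent markers, and the function name `identify_known_crawler` are all hypothetical.

```python
import ipaddress
from typing import Optional

# Hypothetical historical crawler address library: network -> crawler category.
KNOWN_CRAWLER_NETWORKS = {
    ipaddress.ip_network("203.0.113.0/24"): "search engine crawler",
    ipaddress.ip_network("198.51.100.0/24"): "proxy pool crawler",
}

# Hypothetical User-Agent markers for script-driven clients (assumed mapping).
UA_MARKERS = {
    "python-requests": "malicious UA crawler",
    "go-http-client": "malicious UA crawler",
}


def identify_known_crawler(ip: str, user_agent: str) -> Optional[str]:
    """Return the crawler category of a request, or None if undetermined."""
    addr = ipaddress.ip_address(ip)
    # First check the address against the historical crawler address library.
    for network, category in KNOWN_CRAWLER_NETWORKS.items():
        if addr in network:
            return category
    # Then fall back to User-Agent feature matching.
    ua = user_agent.lower()
    for marker, category in UA_MARKERS.items():
        if marker in ua:
            return category
    # No match: the request is passed on to behavior analysis (step S12).
    return None
```

A request that returns `None` here corresponds to an undetermined crawler and would be handed to the behavior analysis of step S12.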
In step S12, behavior analysis and identification are performed on access traffic whose crawler category has not been determined, the threat value of the unknown crawler is determined, and the crawler category to which the unknown crawler belongs is determined according to the threat value. Here, behaviors are classified according to the business type of the website, such as login behavior, submission behavior, or data acquisition behavior. Artificial intelligence can be used to perform the behavior analysis and identification on the undetermined access traffic. A threat value measuring the degree of threat of the unknown crawler is determined from the identification result and compared with a preset threshold: if the threat value is below the threshold, the unknown crawler is a legitimate crawler; otherwise, it is a malicious crawler.
In a preferred embodiment of the present application, the predicted value calculated by the artificial intelligence model is a floating point number between 0 and 1, and the closer it is to 1, the higher the threat level. The artificial intelligence model uses a machine learning algorithm, such as LightGBM, to perform pre-detection by analyzing visitor behavior. The pre-detected items include, but are not limited to, the back-to-origin ratio, the dynamic resource ratio, the static resource ratio, the total number of accesses, the malicious User Agent (UA) ratio, the access duration, the average number of accesses per minute, the deduplicated User Agent (UA) count, the deduplicated URL count, the average request response time, the average number of accesses per website IP, the average number of website IPs, and the ratio of the total number of accesses to the average number of accesses per website IP. Different scores are preset for different visitor behaviors, and scoring is carried out according to the observed visitor behaviors and the preset scores. With the preset threshold set to 0.7, an unknown crawler whose detected threat value is greater than or equal to 0.7 is determined to be a malicious crawler.
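The scoring and thresholding step can be illustrated with a small sketch. The 0.7 threshold comes from the embodiment above; the behavior names, the preset scores in `PRESET_SCORES`, and the additive scoring rule are assumptions made for illustration and stand in for the trained model (such as LightGBM), which is not reproduced here.

```python
from typing import Iterable

MALICIOUS_THRESHOLD = 0.7  # threshold given in the preferred embodiment

# Hypothetical preset scores per observed visitor behavior (assumed values).
PRESET_SCORES = {
    "high_request_rate": 0.5,
    "malicious_ua": 0.25,
    "single_url_focus": 0.125,
    "no_static_resources": 0.125,
}


def threat_value(observed_behaviors: Iterable[str]) -> float:
    """Sum the preset scores of the distinct observed behaviors, capped at 1."""
    total = sum(PRESET_SCORES.get(b, 0.0) for b in set(observed_behaviors))
    return min(total, 1.0)


def classify_unknown_crawler(value: float) -> str:
    """Map a threat value in [0, 1] to a crawler category."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("threat value must lie in [0, 1]")
    return "malicious" if value >= MALICIOUS_THRESHOLD else "legitimate"
```

In a deployment following the embodiment, `threat_value` would be replaced by the model's predicted value, while the comparison against the threshold would stay the same.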
In step S13, the access traffic is visually displayed according to the categories of the known crawlers and of the unknown crawlers, and the access traffic of the current service is adjusted. Here, the visual display of access traffic lets a user quickly learn how each service is currently being crawled and by which types of crawler, for example crawling by search engines, the ratio of dynamic to static resources in the crawled data, and the proportions of legitimate and malicious crawlers among all crawler data, which provides the most direct reference data for adjusting the traffic management strategy.
Preferably, the crawler categories include malicious crawlers and legitimate crawlers. Malicious crawlers among the known crawlers and among the unknown crawlers are handled according to preset disposition actions to obtain clean traffic, and the clean traffic is returned to the source station. Here, the preset disposition actions include, but are not limited to, returning false data, observing, releasing, intercepting, and human-machine verification. After the malicious crawlers are intercepted, the remaining clean traffic, that is, the legitimate traffic left once all malicious crawlers have been removed, is returned to the source station.
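A hedged sketch of how disposition actions might separate clean traffic from handled traffic is given below. The action names follow the text; the request representation (a dict with a `category` key), the policy mapping, and the choice that only released or observed requests are forwarded to the source station are assumptions of this example.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Action(Enum):
    RETURN_FALSE_DATA = "return false data"
    OBSERVE = "observe"
    RELEASE = "release"
    INTERCEPT = "intercept"
    HUMAN_VERIFICATION = "human-machine verification"


# Assumption: only these actions let a request continue to the source station.
FORWARDED = {Action.RELEASE, Action.OBSERVE}


def split_clean_traffic(
    requests: List[dict], policy: Dict[str, Action]
) -> Tuple[List[dict], List[Tuple[dict, Action]]]:
    """Split requests into clean traffic and (request, action) pairs handled at the edge."""
    clean, handled = [], []
    for req in requests:
        # Categories without a configured action are released by default.
        action = policy.get(req["category"], Action.RELEASE)
        if action in FORWARDED:
            clean.append(req)
        else:
            handled.append((req, action))
    return clean, handled
```

For example, a policy of `{"malicious": Action.INTERCEPT}` forwards every request except those categorized as malicious.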
Preferably, in step S12, a data set is constructed from historical browsing data and historical crawler capture data, and an artificial intelligence model is trained with the data set to obtain a preset artificial intelligence model; the preset artificial intelligence model then performs behavior analysis and identification on the access traffic whose crawler category has not been determined. Here, the behavior analysis and identification process is as follows. First, the historical normal browsing data and historical crawler capture data of the website are obtained and a data set is constructed. The data set is converted into data describing historical crawler traffic and quantified, that is, historical crawler traffic is evaluated through a series of numerical values. An artificial intelligence algorithm (AI algorithm) is then trained on these numerical values to obtain the preset artificial intelligence model. The preset artificial intelligence model is then used to perform behavior analysis and identification on the undetermined access traffic, for example to make predictions about website requests. Preferably, the known crawlers are updated according to the categories of unknown crawlers obtained through analysis, and each newly analyzed unknown crawler category is recorded among the known crawler categories.
In a preferred embodiment of the application, an artificial intelligence engine (AI engine) judges whether undetermined access traffic is an unknown crawler. If it is, the traffic is marked as an unknown crawler, and whether it can be assigned to a known crawler category is determined from its behavior characteristic information. If so, it is classified into that known crawler category, so that the set of known crawlers is continuously expanded; if not, manual intervention determines whether it is assigned to an existing category or a new category is added. If the AI engine judges that the traffic is not an unknown crawler, it is normal traffic.
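The quantification step, in which raw access records are turned into numerical features before training, might look like the sketch below. It computes a few of the features named in this description for one visitor; the `LogRecord` fields, the static-suffix list, and the one-minute floor on the observation window are assumptions of this example.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed list of suffixes that mark a static resource.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


@dataclass
class LogRecord:
    timestamp: float  # seconds since some epoch
    url: str
    user_agent: str


def quantify(records: List[LogRecord]) -> Dict[str, float]:
    """Turn one visitor's access records into numerical behavior features."""
    if not records:
        raise ValueError("at least one record is required")
    total = len(records)
    # A request counts as static when its path (query string removed) has a static suffix.
    static = sum(
        1 for r in records if r.url.split("?", 1)[0].lower().endswith(STATIC_SUFFIXES)
    )
    duration = max(r.timestamp for r in records) - min(r.timestamp for r in records)
    minutes = max(duration / 60.0, 1.0)  # floor of one minute avoids division by zero
    distinct_uas = len({r.user_agent for r in records})
    distinct_urls = len({r.url for r in records})
    return {
        "total_accesses": float(total),
        "static_resource_ratio": static / total,
        "dynamic_resource_ratio": (total - static) / total,
        "access_duration_seconds": duration,
        "avg_accesses_per_minute": total / minutes,
        "deduplicated_ua_count": float(distinct_uas),
        "deduplicated_ua_ratio": distinct_uas / total,
        "deduplicated_url_count": float(distinct_urls),
        "deduplicated_url_ratio": distinct_urls / total,
    }
```

Feature dictionaries of this kind, labeled as normal browsing or crawler traffic, would form the rows of the training data set.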
Preferably, in step S13, information on the known crawlers and the unknown crawlers is counted according to the categories to which they belong, and cluster analysis is performed on the counted information to determine crawler category information, where the crawler category information includes: the number of crawler requests, information on known crawlers, information on unknown crawlers, information on legitimate crawlers, and information on malicious crawlers. The access traffic is then visually displayed according to the corresponding crawler category information. Here, the request trend, traffic trend, dynamic and static resource request frequency trends, known crawler (known Bot) activity analysis, unknown crawler (unknown Bot) activity analysis, and the like of the crawlers (Bot) are displayed graphically according to the results of the cluster analysis. This lets a user quickly learn how each service is currently being crawled and by which types of crawler, for example crawling by search engines, the ratio of dynamic to static resources in the crawled data, and the proportions of legitimate and malicious crawlers, which provides the most direct reference for adjusting the traffic management strategy.
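The statistics that feed such a display can be sketched with plain counting. This example only aggregates counts and ratios and does not perform the cluster analysis mentioned above; the request fields (`category`, `known`, `malicious`) and the function name are assumptions.

```python
from collections import Counter
from typing import Dict, List


def summarize_crawler_traffic(requests: List[dict]) -> Dict[str, object]:
    """Aggregate crawler requests into the counts a dashboard could plot."""
    total = len(requests)
    known = sum(1 for r in requests if r["known"])
    malicious = sum(1 for r in requests if r["malicious"])
    return {
        "request_count": total,
        "known_count": known,
        "unknown_count": total - known,
        "malicious_count": malicious,
        "legitimate_count": total - malicious,
        "malicious_ratio": malicious / total if total else 0.0,
        # Number of requests per crawler category.
        "by_category": dict(Counter(r["category"] for r in requests)),
    }
```

The resulting dictionary could be serialized and passed to whatever charting front end renders the trend and proportion views.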
Preferably, the unknown crawlers are visually displayed according to the category to which they belong, the device fingerprint, the network address segment, and the behavior performed against the website service. That is, cluster analysis is performed on the unknown crawlers by category, device fingerprint, network address segment, and website-service behavior, and the result is displayed visually.
Preferably, any one or a combination of the following is performed according to the service identifier of the user: returning false data, observing, releasing, intercepting, human-machine verification, and applying a preset custom font library. The service identifier of the user is preferably a login user identifier, which enables account linkage: one set of disposition modes, consisting of one or a combination of the actions above, is configured for the same login user, so that all request data of that user can be managed together when handling crawlers.
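As a simple stand-in for the cluster analysis described above, the sketch below groups unknown crawlers by exact key rather than running a clustering algorithm. The request fields, the use of a /24 network as the "address segment", and the restriction to IPv4 addresses are assumptions of this example.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, List, Tuple

GroupKey = Tuple[str, str, str, str]


def group_unknown_crawlers(requests: List[dict]) -> Dict[GroupKey, List[dict]]:
    """Group requests by (category, device fingerprint, /24 segment, behavior)."""
    groups: Dict[GroupKey, List[dict]] = defaultdict(list)
    for req in requests:
        # strict=False masks the host bits, e.g. 192.0.2.17/24 -> 192.0.2.0/24.
        segment = ipaddress.ip_network(f"{req['ip']}/24", strict=False)
        key = (req["category"], req["fingerprint"], str(segment), req["behavior"])
        groups[key].append(req)
    return dict(groups)
```

Each group can then be rendered as one entry in the unknown crawler view, with the group size indicating how active that crawler is.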
Preferably, a rule set consisting of control policies is added to limit access traffic whose request frequency is not within a preset threshold. Here, a rule set is added in the console and control policies are added to the rule set. A control policy is a combined access control policy over the Uniform Resource Locator (URL), Internet Protocol address (IP), referrer (Referer), region, User Agent (UA), request type, request parameters, query string, request headers, request time, request method, device type, Internet Protocol type (IP type), suffix, and IP request frequency, and can effectively manage crawler traffic.
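A rule set of combined control policies might be modelled as below. This is only a sketch: the operator names, the `Condition`, `ControlPolicy` and `RuleSet` classes, the first-match-wins evaluation order and the default action are assumptions, since the application does not describe the console's data model.

```python
from dataclasses import dataclass, field

# Comparison operators a condition may use (hypothetical set).
OPERATORS = {
    "eq": lambda actual, expected: actual == expected,
    "prefix": lambda actual, expected: str(actual).startswith(expected),
    "contains": lambda actual, expected: expected in str(actual),
    "gt": lambda actual, expected: actual > expected,
}


@dataclass
class Condition:
    field_name: str  # e.g. "url", "ip", "referer", "ua", "ip_request_frequency"
    op: str          # key into OPERATORS
    value: object

    def matches(self, request):
        if self.field_name not in request:
            return False
        return OPERATORS[self.op](request[self.field_name], self.value)


@dataclass
class ControlPolicy:
    name: str
    conditions: list  # all conditions must match (combined policy)
    action: str

    def matches(self, request):
        return all(c.matches(request) for c in self.conditions)


@dataclass
class RuleSet:
    policies: list = field(default_factory=list)

    def decide(self, request, default="release"):
        """Return the action of the first matching policy, else the default."""
        for policy in self.policies:
            if policy.matches(request):
                return policy.action
        return default
```

Combining conditions with a logical AND keeps each policy narrow, so a broad rule such as a bare frequency limit can sit after more specific ones in the same rule set.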
In a preferred embodiment of the present application, taking the Internet Protocol request (IP request) frequency as an example, the comparison logic is "greater than", the time window is 3 s, the count is 100, and the handling action is blocking; that is, once the number of requests from an IP within 3 s exceeds 100, further requests from that IP are automatically limited, which effectively prevents malicious data crawling. Here, the handling of crawler traffic is not limited to malicious crawlers; traffic handling may also be applied to friendly crawlers. For example, when the performance of the customer's server cannot sustain a large volume of access, a search engine that initiates more than a certain number of requests may overload the server; in that case, the search engine traffic may also be controlled, for example by denying access.
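The 3 s / 100 requests example can be sketched as a per-IP sliding-window counter. The sliding-window technique, the class name `IpRateLimiter` and the choice to keep counting requests that are already being blocked are assumptions; the application states only the window, the count, the "greater than" logic and the blocking action.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Blocks an IP once it has made more than max_requests within window_s seconds."""

    def __init__(self, window_s=3.0, max_requests=100):
        self.window_s = window_s
        self.max_requests = max_requests
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now):
        """Record a request at time `now` (seconds) and return True if it is allowed."""
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        hits.append(now)
        # "Greater than" logic: the request is blocked only when the count exceeds the limit.
        return len(hits) <= self.max_requests
```

With the default settings, the first 100 requests inside any 3 s window pass and the 101st is blocked; once the IP stays quiet long enough for old timestamps to expire, it is allowed again.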
Fig. 2 is a schematic flow diagram of a traffic processing method in a preferred embodiment of the present application. Crawler management is built on a Web application firewall (WAF). All access traffic is diverted to an edge protection node on which the Web application firewall is deployed. On the edge protection node, known crawlers are quickly identified through the IP library and feature identification of known crawlers, and unknown crawlers are identified through artificial intelligence and behavior analysis. Through cluster analysis, the unknown crawlers are displayed by device fingerprint, IP segment, behavior, and the like, and are handled according to the configured handling action. After malicious traffic has been handled at the edge, the edge protection node returns the clean traffic to the source station.
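The flow of Fig. 2 can be summarised as a short decision function at the edge node. This is a simplified sketch: the function name and its parameters are hypothetical, the identification and scoring stages are passed in as callables rather than implemented, and every request not matched as a known crawler is treated as an unknown crawler.

```python
def handle_at_edge(request, lookup_known, threat_value, threshold, dispose):
    """Decide what the edge protection node does with one request.

    lookup_known(request) -> "legal", "malicious", or None if not a known crawler
    threat_value(request) -> numeric threat value from behavior analysis
    dispose(request)      -> result of the configured handling action
    """
    # Step 1: fast path, match against the known-crawler IP library and features.
    category = lookup_known(request)
    # Step 2: otherwise categorise by threat value (within the threshold means legal).
    if category is None:
        category = "legal" if threat_value(request) <= threshold else "malicious"
    # Step 3: malicious traffic is handled at the edge; the rest goes to the source station.
    if category == "malicious":
        return dispose(request)
    return "forward to source station"
```

Keeping the known-crawler lookup ahead of the behavior analysis mirrors the description, where known crawlers are identified quickly and only the remaining traffic needs the more expensive analysis.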
FIG. 3 illustrates a web crawler processing system, wherein the system comprises a crawler detection module 11, a crawler identification module 12 and a crawler display module 13. The crawler detection module 11 is configured to receive access traffic, divert the access traffic to an edge protection node of the firewall, identify the access traffic, and determine known crawlers and the crawler category to which each known crawler belongs. The crawler identification module 12 is configured to perform behavior analysis and identification on access traffic whose crawler category has not been determined, determine a threat value of an unknown crawler, and determine the crawler category to which the unknown crawler belongs according to the threat value. The crawler display module 13 is configured to display the access traffic visually according to the crawler category to which the known crawler belongs and the category to which the unknown crawler belongs, and to adjust the access traffic of the current service. In this way, malicious crawlers and normal traffic are identified efficiently and accurately and are handled separately, while crawler data is displayed intuitively and visually, so that website administrators can conveniently perform targeted analysis of the data for each crawler category.
Specifically, the crawler detection module 11 is configured to receive access traffic, divert the access traffic to an edge protection node of the firewall, identify the access traffic, and determine known crawlers and the crawler category to which each known crawler belongs. Here, the crawler detection module 11 diverts all access traffic of the current service to the protection node at the firewall edge and, preferably, identifies the access traffic according to a historical crawler network address library and behavior feature information. The crawler detection module 11 then checks whether the network address and behavior feature information in the access traffic are consistent with the network address library and behavior feature information of historical crawlers. If so, the known crawler in the access traffic and the crawler category to which it belongs are determined according to the network address library and behavior feature information of historical crawlers; if not, the crawler is treated as an undetermined crawler.
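The comparison against the historical crawler library can be sketched as follows. The library entries, their names and address ranges are invented for illustration, and a User-Agent token is used as a stand-in for the "behavior feature information", which the application does not define concretely.

```python
import ipaddress

# Hypothetical known-crawler library: network address range plus one feature token.
KNOWN_CRAWLERS = [
    {
        "name": "example-search-bot",
        "network": ipaddress.ip_network("192.0.2.0/24"),
        "ua_token": "examplebot",
        "category": "legal",
    },
    {
        "name": "example-scraper",
        "network": ipaddress.ip_network("198.51.100.0/24"),
        "ua_token": "scraper",
        "category": "malicious",
    },
]


def identify_known(ip, user_agent):
    """Return (name, category) if both address and feature match a library entry, else None."""
    addr = ipaddress.ip_address(ip)
    ua = user_agent.lower()
    for entry in KNOWN_CRAWLERS:
        if addr in entry["network"] and entry["ua_token"] in ua:
            return entry["name"], entry["category"]
    return None  # undetermined: left for behavior analysis
```

Requiring both the address and the feature to match means a request that merely claims a known crawler's User-Agent from an unlisted address stays undetermined and goes on to behavior analysis.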
The crawler identification module 12 is configured to perform behavior analysis and identification on access traffic whose crawler category has not been determined, determine a threat value of an unknown crawler, and determine the crawler category to which the unknown crawler belongs according to the threat value. Here, the crawler identification module 12 may use artificial intelligence to perform behavior analysis and identification on the access traffic whose crawler category has not been determined, and determines a threat value of the unknown crawler from the identification result to measure its degree of threat. It then judges whether the threat value is within a preset threshold: if so, the unknown crawler is determined to be a legal crawler; if not, the unknown crawler is determined to be a malicious crawler.
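The threshold decision can be illustrated with a toy scoring function. The application uses an artificial intelligence model to produce the threat value; the hand-written weighted sum below is only a stand-in for that model, and the feature names, weights and threshold of 0.5 are all assumptions.

```python
# Hypothetical behavior features, each normalised to the range [0, 1].
WEIGHTS = {
    "request_rate": 0.5,  # how fast the client requests pages
    "headless": 0.3,      # signs of an automated or headless client
    "no_referer": 0.2,    # share of requests without a Referer header
}
THRESHOLD = 0.5  # assumed preset threshold


def threat_value(features):
    """Weighted sum of behavior features; missing features count as 0."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())


def classify_unknown(features, threshold=THRESHOLD):
    """Within the threshold means legal, otherwise malicious, as in the description."""
    return "legal" if threat_value(features) <= threshold else "malicious"
```

In practice the weighted sum would be replaced by the output of the trained model; only the final comparison against the preset threshold follows the text directly.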
The crawler display module 13 is configured to display the access traffic visually according to the crawler category to which the known crawler belongs and the category to which the unknown crawler belongs, and to adjust the access traffic of the current service. Here, the crawler display module 13 presents all statistical data of the access traffic graphically, so that a user can quickly see how each service is currently being crawled and by which types of crawler, for example how it is crawled by search engines, the ratio of dynamic to static resources in the crawled data, and the proportions of legal and malicious crawlers among all crawler data, which provides the most direct reference data for adjusting the traffic control policy.
Further, the system comprises a crawler processing module 14, wherein the crawler processing module 14 is configured to handle the malicious crawlers among the known crawlers and the malicious crawlers among the unknown crawlers according to preset handling actions to obtain clean traffic, and to return the clean traffic to the source station. Here, the preset handling actions include, but are not limited to, returning false data, observing, releasing, intercepting, and human-machine verification. For example, after intercepting the malicious crawlers, the crawler processing module 14 obtains the clean traffic that remains once the crawlers have been handled, where the clean traffic is the legal traffic left after all malicious crawlers are removed, and then returns the clean traffic to the source station. In a preferred embodiment of the present application, the crawler processing module 14 is one of the modules of a Web application firewall (WAF), so that capabilities such as DDoS resistance, vulnerability protection, and performance optimization are readily available while the crawlers are being managed.
In addition, the embodiment of the present application further provides a computer readable medium, on which computer readable instructions are stored, and the computer readable instructions can be executed by a processor to implement the foregoing processing method for the web crawler.
According to still another aspect of the present application, there is also provided a web crawler processing apparatus, wherein the apparatus includes:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the web crawler processing method described above.
For example, the computer readable instructions, when executed, cause the one or more processors to:
receive access traffic, divert the access traffic to an edge protection node of the firewall, identify the access traffic, and determine a known crawler and the crawler category to which the known crawler belongs; perform behavior analysis and identification on access traffic whose crawler category has not been determined, determine a threat value of an unknown crawler, and determine the crawler category to which the unknown crawler belongs according to the threat value; and display the access traffic visually according to the category of the known crawler and the category of the unknown crawler, and adjust the access traffic of the current service.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (12)

1. A web crawler processing method, wherein the method comprises:
receiving access flow, drawing the access flow to a protection node provided with a firewall edge, identifying the access flow, and determining a known crawler and a crawler category to which the known crawler belongs;
performing behavior analysis and identification on access flow whose crawler category has not been determined, determining a threat value of an unknown crawler, and determining the crawler category to which the unknown crawler belongs according to the threat value;
and performing visual display of access flow according to the category of the known crawler and the category of the unknown crawler, and adjusting the access flow of the current service.
2. The method of claim 1, wherein the identifying the access traffic comprises:
crawler information of access traffic is identified according to historical crawler network addresses and behavior characteristic information, wherein the historical crawler network addresses are known crawler information.
3. The method of claim 1, wherein the crawler categories include malicious crawlers and legitimate crawlers, the method comprising:
and processing malicious crawlers in known crawlers and malicious crawlers in unknown crawlers according to preset processing actions to obtain processed clean flow, and returning the clean flow to a source station.
4. The method of claim 1, wherein the performing behavior analysis and identification on access flow whose crawler category has not been determined comprises:
constructing a data set according to historical browsing data and historical crawler capturing data, and training an artificial intelligence model by using the data set to obtain a preset artificial intelligence model;
and performing, by the preset artificial intelligence model, behavior analysis and identification on the access flow whose crawler category has not been determined.
5. The method of claim 1, wherein the visually presenting access traffic according to the crawler category to which the known crawler belongs and the category to which the unknown crawler belongs comprises:
the information of the known crawlers and the unknown crawlers is counted according to the crawler categories to which the known crawlers belong and the categories to which the unknown crawlers belong, clustering analysis is carried out according to the counted information of the known crawlers and the unknown crawlers, and the crawler category information is determined, wherein the crawler category information comprises: the crawler request times, the information of known crawlers, the information of unknown crawlers, the information of legal crawlers and the information of malicious crawlers;
and visually displaying the access flow according to the corresponding crawler category information.
6. The method of claim 1, wherein the method comprises:
and visually displaying the unknown crawler according to the category of the unknown crawler, the equipment fingerprint, the network address field and the behavior of the website service.
7. The method of claim 3, wherein the preset treatment action comprises:
performing any one or a combination of the following according to the service identification of the user: returning false data, observing, releasing, intercepting, human-machine verification, and applying a preset custom font library.
8. The method of claim 7, wherein the method further comprises:
a rule set consisting of a control policy is added to limit access traffic whose request frequency is not within a preset threshold.
9. A system for managing crawlers, wherein the system comprises: a crawler detection module, a crawler identification module and a crawler display module, wherein,
the crawler detection module is used for receiving access flow, drawing the access flow to a protection node provided with a firewall edge, identifying the access flow according to a historical crawler network address base and behavior characteristic information, and determining a known crawler and a crawler category to which the known crawler belongs;
the crawler identification module is used for performing behavior analysis and identification on access flow whose crawler category has not been determined, determining a threat value of an unknown crawler, and determining the crawler category to which the unknown crawler belongs according to the threat value;
and the crawler display module is used for visually displaying the access flow according to the crawler category to which the known crawler belongs and the category to which the unknown crawler belongs, and adjusting the access flow of the current service.
10. The system according to claim 9, wherein the system comprises a crawler processing module, and the crawler processing module is configured to process malicious crawlers in known crawlers and malicious crawlers in unknown crawlers according to preset processing actions to obtain the processed clean traffic, and return the clean traffic to the source station.
11. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the method of any one of claims 1 to 8.
12. A web crawler processing apparatus, wherein the apparatus comprises:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the method of any of claims 1 to 8.
CN202010027254.8A 2020-01-10 2020-01-10 Method, system and equipment for processing web crawler Pending CN113132336A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010027254.8A CN113132336A (en) 2020-01-10 2020-01-10 Method, system and equipment for processing web crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010027254.8A CN113132336A (en) 2020-01-10 2020-01-10 Method, system and equipment for processing web crawler

Publications (1)

Publication Number Publication Date
CN113132336A true CN113132336A (en) 2021-07-16

Family

ID=76770876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010027254.8A Pending CN113132336A (en) 2020-01-10 2020-01-10 Method, system and equipment for processing web crawler

Country Status (1)

Country Link
CN (1) CN113132336A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114726616A (en) * 2022-04-07 2022-07-08 京东科技信息技术有限公司 Website access request processing method and device
CN116108252A (en) * 2023-04-14 2023-05-12 深圳市和讯华谷信息技术有限公司 Limiting data grabbing method, limiting data grabbing system, limiting data grabbing computer equipment and limiting data grabbing storage medium
EP4343594A1 (en) * 2022-09-22 2024-03-27 Citrix Systems Inc. Systems and methods for autonomous program classification generation
CN114401104B (en) * 2021-11-30 2024-04-30 中国建设银行股份有限公司 Web crawler processing method, device, server and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388794A (en) * 2018-02-01 2018-08-10 金蝶软件(中国)有限公司 Page data guard method, device, computer equipment and storage medium
CN109818949A (en) * 2019-01-17 2019-05-28 济南浪潮高新科技投资发展有限公司 A kind of anti-crawler method neural network based
CN109862018A (en) * 2019-02-21 2019-06-07 中国工商银行股份有限公司 Anti- crawler method and system based on user access activity
CN110351248A (en) * 2019-06-14 2019-10-18 北京纵横无双科技有限公司 A kind of safety protecting method and device based on intellectual analysis and intelligent current limliting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40056801; Country of ref document: HK)
Country of ref document: HK