CN114710318B - Method, device, equipment and medium for limiting high-frequency access of crawler - Google Patents

Info

Publication number
CN114710318B
Authority
CN
China
Prior art keywords
access
address
user
identity information
target page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210208114.XA
Other languages
Chinese (zh)
Other versions
CN114710318A (en
Inventor
赵志庆
侯玉柱
陈佐相
董席峰
余毛猛
张昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rongxing Technology Co ltd
Original Assignee
Rongxing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rongxing Technology Co ltd
Priority to CN202210208114.XA
Publication of CN114710318A
Application granted
Publication of CN114710318B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/101 Access control lists [ACL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application discloses a method, a device, equipment and a medium for limiting high-frequency access of a crawler, which are used for solving the technical problem that existing web crawlers access a server endlessly and thereby cause the network requests of normal users to become abnormal. The method comprises the following steps: acquiring an access request of a user and determining a corresponding IP address and a historical access track sequence; analyzing the historical access track sequence to obtain historical access information; inputting the IP address, together with the access time interval, accumulated access times and single access duration of the IP address to a target page within a preset duration, into a pre-trained learning model, and outputting identity information corresponding to the IP address and a confidence corresponding to the identity information; acquiring the energy consumption corresponding to the IP address accessing the target page in the historical access track sequence and the energy consumption of the server for processing the access request, so as to determine target identity information corresponding to the IP address; and when the target identity information corresponding to the IP address is crawler user information, adding the IP address to the Nginx shielding file to limit access from the IP address.

Description

Method, device, equipment and medium for limiting high-frequency access of crawler
Technical Field
The present application relates to the field of data security technologies, and in particular, to a method, an apparatus, a device, and a medium for limiting high frequency access of a crawler.
Background
A web crawler, also known as a web spider or web robot, is a program or script that automatically crawls the World Wide Web according to certain rules. Web crawler technology underlies many Internet applications and is widely used in fields such as big data storage, data mining, network evidence collection, information aggregation and public opinion monitoring.
However, while web crawlers bring great convenience to users, they also pose a direct or indirect threat to the Internet environment, because a large number of crawler requests puts stress on the server. A web crawler can access the server endlessly and consume the bandwidth, memory, CPU and other resources of the target server, which makes the network requests of normal users abnormal. Therefore, high-frequency crawler access needs to be limited so that the network requests of normal users are served normally.
Disclosure of Invention
The embodiment of the application provides a method and equipment for limiting high-frequency access of a crawler, which are used for solving the technical problem that existing web crawlers access a server endlessly, consume resources such as the bandwidth, memory and CPU of the target server, and cause the network requests of normal users to become abnormal.
In one aspect, an embodiment of the present application provides a method for limiting high-frequency access of a crawler, including:
acquiring an access request of a user, and determining an IP address corresponding to the user and a historical access track sequence corresponding to the IP address according to the access request;
analyzing the historical access track sequence to obtain historical access information of the IP address within a preset duration, wherein the historical access information comprises at least one or more of the following items: the access time interval, the accumulated access times and the single access duration for the target page;
inputting the IP address and the access time interval, accumulated access times and single access time length of the IP address to a target page within a preset time length to a pre-trained learning model, and outputting identity information corresponding to the IP address and confidence coefficient corresponding to the identity information, wherein the identity information comprises crawler user information and normal user information;
under the condition that the confidence coefficient does not exceed a preset confidence coefficient threshold value, acquiring energy consumption corresponding to the access target page of the IP address in the historical access track sequence and energy consumption of a server corresponding to the target page for processing the access request;
determining target identity information corresponding to the IP address according to the magnitude relation between the energy consumption corresponding to the IP address accessing the target page and the energy consumption corresponding to the server processing the access request;
and when the target identity information corresponding to the IP address is the crawler user information, adding the IP address into an Nginx shielding file to limit the access of the IP address.
In one implementation manner of the present application, before the inputting the IP address and the access time interval of the IP address to the target page within the preset duration, the accumulated access times, and the single access duration into the pre-trained learning model, the method further includes:
inputting a plurality of IP addresses with the predetermined identity information being the user information of the crawler and the corresponding historical access information into a convolutional neural network for training;
extracting the access time interval, the accumulated access times and the single access time length of the IP address to the target page in the history access information through a convolution layer of the convolution neural network;
and processing the access time interval, the accumulated access times and the single access duration of the IP address to the target page through a pooling layer and a full-connection layer, and outputting the identity information corresponding to the IP address and the confidence coefficient corresponding to the identity information until the identity information corresponding to the IP address is output to be matched with the identity information determined in advance so as to complete training of the learning model.
In one implementation manner of the present application, after the outputting the identity information corresponding to the IP address and the confidence coefficient corresponding to the identity information, the method further includes:
under the condition that the confidence coefficient corresponding to the identity information exceeds a preset confidence coefficient threshold value, taking the identity information output by the learning model as target identity information corresponding to the IP address;
in the case that the confidence level does not exceed a preset confidence threshold, the method further comprises:
and if the energy consumption corresponding to the IP address access target page is not more than the energy consumption corresponding to the access request processed by the server, determining the target identity information corresponding to the IP address as the crawler user information.
In one implementation manner of the present application, the obtaining, when the confidence coefficient does not exceed a preset confidence coefficient threshold, energy consumption corresponding to the access target page by the IP address in the historical access track sequence and energy consumption corresponding to the target page by the server for processing the access request specifically includes:
determining the total number of the target pages accessed by the IP address in a preset time period and the total energy consumption of the user terminal corresponding to the IP address in the preset time period from the historical access track sequence, and determining the energy consumption corresponding to the target pages accessed by the IP address;
and determining a server corresponding to the target page according to the target page to be accessed by the IP address, and determining energy consumption corresponding to the server for processing the access request according to the total number of the access requests received by the server within the preset time period.
In one implementation manner of the present application, after the obtaining the access request of the user and determining the IP address corresponding to the user according to the access request, the method further includes:
comparing the IP address corresponding to the user with the IP address forbidden to be accessed in the Nginx shielding file;
directly rejecting the access request under the condition that the Nginx shielding file comprises the IP address corresponding to the user;
and under the condition that the Nginx shielding file does not comprise the IP address corresponding to the user, predicting the identity information corresponding to the IP address through a learning model.
In one implementation manner of the present application, after the adding the IP address into the Nginx shielding file to limit the access of the IP address when the target identity information corresponding to the IP address is the crawler user information, the method further includes:
controlling the offset corresponding to the target page in the cascading style sheet (CSS) so as to hide the cascading style corresponding to the target page;
rendering, for the offset target page, the characters in the target page through a preset custom font file, and displaying the target page according to the correspondence between the preset custom font file and the characters.
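The character correspondence in the last step can be illustrated with a minimal sketch. It assumes, purely for illustration, that the custom font file assigns the glyphs of ordinary characters to code points in the Unicode Private Use Area, so that the page source carries substitute code points while the rendered page looks unchanged. The names `build_mapping`, `obfuscate` and `deobfuscate` are hypothetical; the application does not specify how the font file is built.

```python
import random

PUA_START = 0xE000  # first code point of the Unicode Private Use Area


def build_mapping(alphabet, seed):
    """Assign each character of `alphabet` a distinct private-use code point.

    Assumption: a custom font file would draw the glyph of the original
    character at the assigned code point. Generating that font is out of
    scope for this sketch.
    """
    rng = random.Random(seed)
    codepoints = list(range(PUA_START, PUA_START + len(alphabet)))
    rng.shuffle(codepoints)
    return {ch: chr(cp) for ch, cp in zip(alphabet, codepoints)}


def obfuscate(text, mapping):
    """Replace mapped characters; leave every other character untouched."""
    return "".join(mapping.get(ch, ch) for ch in text)


def deobfuscate(text, mapping):
    """Invert the mapping, which is what the font does visually for a reader."""
    reverse = {v: k for k, v in mapping.items()}
    return "".join(reverse.get(ch, ch) for ch in text)


mapping = build_mapping("0123456789", seed=42)
hidden = obfuscate("Price: 1299", mapping)
assert deobfuscate(hidden, mapping) == "Price: 1299"
```

A crawler that reads the raw page text sees only the private-use code points, whereas a browser that loads the matching font displays the original digits.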
In one implementation manner of the present application, the obtaining an access request of a user, and determining, according to the access request, an IP address corresponding to the user, and a historical access track sequence corresponding to the IP address specifically includes:
receiving an access request of a user, and determining an IP address corresponding to the user and an ID of the user according to request information in the access request;
determining a historical access track sequence corresponding to the IP address according to the IP address corresponding to the user and the ID of the user;
the historical access track sequence further comprises: a plurality of access records of the IP address for the target page, the access records being stored in the historical access track sequence in order of access time.
On the other hand, the embodiment of the application also provides a device for limiting the high-frequency access of the crawler, which comprises:
the request acquisition unit is used for acquiring an access request of a user, and determining an IP address corresponding to the user and a historical access track sequence corresponding to the IP address according to the access request;
the analysis unit is used for analyzing the historical access track sequence to obtain historical access information of the IP address within a preset duration, wherein the historical access information comprises at least one or more of the following items: the access time interval, the accumulated access times and the single access duration for the target page;
the output unit is used for inputting the IP address and the access time interval, the accumulated access times and the single access time length of the IP address to the target page within the preset time length to a pre-trained learning model, and outputting identity information corresponding to the IP address and confidence coefficient corresponding to the identity information, wherein the identity information comprises crawler user information and normal user information;
the energy consumption acquisition unit is used for acquiring the energy consumption corresponding to the access target page of the IP address in the historical access track sequence and the energy consumption of the server corresponding to the target page for processing the access request under the condition that the confidence coefficient does not exceed a preset confidence coefficient threshold value;
the determining unit is used for determining target identity information corresponding to the IP address according to the size relation between the energy consumption corresponding to the IP address access target page and the energy consumption corresponding to the access request processed by the server;
and the adding unit is used for adding the IP address into the Nginx shielding file when the target identity information corresponding to the IP address is the crawler user information so as to limit the access of the IP address.
In another aspect, an embodiment of the present application further provides an apparatus for limiting high-frequency access of a crawler, where the apparatus includes:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of restricting crawler high frequency access as described above.
On the other hand, the embodiment of the application also provides a non-volatile computer storage medium, which stores computer executable instructions, wherein the computer executable instructions are executed to realize the method for limiting the high-frequency access of the crawler.
The embodiment of the application provides a method, a device, equipment and a medium for limiting high-frequency access of a crawler, which have at least the following beneficial effects: by acquiring a user access request, the IP address corresponding to the user and the historical access track sequence corresponding to the IP address are determined; by analyzing the historical access track sequence, the access time interval, the accumulated access times and the single access duration of the IP address to the target page within the preset duration are obtained; by inputting the IP address together with these three quantities into a pre-trained learning model, the identity information corresponding to the IP address and the confidence of the identity information are obtained; in the case that the confidence does not exceed a preset confidence threshold, the energy consumption corresponding to the IP address accessing the target page in the historical access track sequence is compared with the energy consumption of the server corresponding to the target page for processing the access request, so that the target identity information corresponding to the IP address can be determined; and when the target identity information corresponding to the IP address is crawler user information, the IP address is added into the Nginx shielding file, so that access from the IP address is effectively limited. In this way, normal access by normal users is guaranteed, crawler users are accurately identified, the efficiency of crawler identification is improved, and the waste of server resources such as bandwidth, memory and CPU is reduced.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flowchart of a method for limiting high frequency access of a crawler according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of an apparatus for limiting high-frequency access of a crawler according to an embodiment of the present application;
FIG. 3 is a schematic internal structure of a device for limiting high-frequency access of a crawler according to an embodiment of the present application.
Detailed Description
For the purposes, technical solutions and advantages of the present application, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The embodiment of the application provides a method, a device, equipment and a medium for limiting high-frequency access of a crawler. By acquiring a user access request, the IP address corresponding to the user and the historical access track sequence corresponding to the IP address are determined; by analyzing the historical access track sequence, the access time interval, the accumulated access times and the single access duration of the IP address to the target page within the preset duration are obtained; by inputting the IP address together with these three quantities into a pre-trained learning model, the identity information corresponding to the IP address and the confidence of the identity information are obtained; in the case that the confidence does not exceed a preset confidence threshold, the energy consumption corresponding to the IP address accessing the target page in the historical access track sequence is compared with the energy consumption of the server corresponding to the target page for processing the access request, so that the target identity information corresponding to the IP address can be determined; and when the target identity information corresponding to the IP address is crawler user information, the IP address is added into the Nginx shielding file, so that access from the IP address is effectively limited. This solves the technical problem that existing web crawlers access the target server endlessly, consume resources such as the bandwidth, memory and CPU of the target server, and cause the network requests of normal users to become abnormal.
The following describes in detail the technical solution proposed in the embodiments of the present application through the accompanying drawings.
FIG. 1 is a flowchart of a method for limiting high-frequency access of a crawler according to an embodiment of the present application. As shown in FIG. 1, a method for limiting high-frequency access of a crawler provided in an embodiment of the present application mainly includes the following steps:
Step 101, obtaining an access request of a user, and determining an IP address corresponding to the user and a historical access track sequence corresponding to the IP address according to the access request.
The access request includes basic information corresponding to the client, for example the user ID and the IP address, and may further include information such as the access time, the target page to be accessed and the server to be accessed. The client may be an intelligent terminal such as a smart phone, a tablet, a computer, a kiosk, a smart watch or another smart device. The access request refers to the request sent to the server corresponding to a target page when a user accesses the target page through a client. The server acquires the access request of the user, and determines the IP address corresponding to the user and the historical access track sequence corresponding to the IP address according to the access request.
Specifically, the server receives an access request of a user, and determines an IP address corresponding to the user and an ID of the user according to request information in the access request; and then finding a historical access track sequence corresponding to the IP address according to the ID of the user and the IP address corresponding to the user.
It should be noted that, in the embodiment of the present application, the historical access track sequence further includes several access records of the IP address for the target page, which are stored in the historical access track sequence in the order in which they were generated.
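As a minimal sketch of step 101, the following code keeps one time-ordered track per (user ID, IP address) pair and returns it on lookup. The names `AccessRecord` and `TrackStore` and the in-memory dictionary are assumptions made for illustration; the application does not state how the historical access track sequence is stored.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRecord:
    timestamp: float  # time at which the access request arrived, in seconds
    page: str         # target page that was requested
    duration: float   # length of this single access, in seconds


class TrackStore:
    """Hypothetical in-memory store of historical access track sequences."""

    def __init__(self):
        self._tracks = defaultdict(list)

    def record(self, user_id, ip, rec):
        """Append a record and keep the track ordered by access time."""
        track = self._tracks[(user_id, ip)]
        track.append(rec)
        track.sort(key=lambda r: r.timestamp)

    def history(self, user_id, ip):
        """Return a copy of the track found by user ID and IP address."""
        return list(self._tracks.get((user_id, ip), []))


store = TrackStore()
store.record("u1", "203.0.113.7", AccessRecord(20.0, "/list", 1.5))
store.record("u1", "203.0.113.7", AccessRecord(10.0, "/list", 0.5))
ordered = [r.timestamp for r in store.history("u1", "203.0.113.7")]
```

Sorting on insertion keeps the sketch short; a real store would more likely append records as they arrive, since requests are already received in time order.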
In one embodiment of the present application, after obtaining an access request of a user and determining the IP address corresponding to the user and the historical access track sequence corresponding to the IP address according to the access request, the server extracts the information corresponding to the current IP address from the historical access track sequence and compares the IP address corresponding to the access request with the IP addresses forbidden to access in the Nginx shielding file. In the case that the Nginx shielding file includes the IP address corresponding to the access request, the server directly rejects the current access request; in the case that the Nginx shielding file does not include that IP address, the server predicts the identity information corresponding to the IP address through the learning model.
It should be noted that the Nginx shielding file in the embodiment of the present application stores the IP addresses of crawler users that the server has determined from past accesses to the target page. When the server obtains a new access request, it first compares the IP address corresponding to the access request with the IP addresses forbidden to access in the Nginx shielding file and thereby determines whether the IP address belongs to a crawler user. Access by the crawler user is then prohibited directly, which avoids wasting server resources on repeatedly verifying the identity of a crawler user that requests access frequently.
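The comparison against the Nginx shielding file can be sketched as follows. The sketch assumes that the shielding file consists of standard Nginx `deny <address>;` directives, one per line; the application does not describe the file format, and the function names are hypothetical.

```python
import ipaddress
import re

# Matches lines such as "deny 203.0.113.7;" or "deny 198.51.100.0/24;".
_DENY_RE = re.compile(r"^\s*deny\s+([^;\s]+)\s*;")


def load_blocked(text):
    """Collect the networks named by `deny` directives in the file text."""
    networks = []
    for line in text.splitlines():
        match = _DENY_RE.match(line)
        if not match:
            continue
        try:
            networks.append(ipaddress.ip_network(match.group(1), strict=False))
        except ValueError:
            # Non-address arguments such as "all" are ignored in this sketch.
            continue
    return networks


def is_blocked(ip, networks):
    """Return True if `ip` falls inside any blocked address or network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


def deny_line(ip):
    """Format the directive that would be appended for a new crawler IP."""
    return f"deny {ipaddress.ip_address(ip)};\n"


conf = "# crawler IPs\ndeny 203.0.113.7;\ndeny 198.51.100.0/24;\n"
blocked = load_blocked(conf)
reject = is_blocked("203.0.113.7", blocked)
```

Because Nginx reads its configuration at start-up or reload, a deployment along these lines would also need to reload Nginx after appending a directive. That step is outside this sketch.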
Step 102, analyzing the historical access track sequence to obtain the historical access information of the IP address within the preset duration.
By analyzing the obtained historical access track sequence corresponding to the IP address, the server obtains the historical access information of the IP address within the preset duration. This reduces the resources and bandwidth required for analyzing the historical access track sequence when the IP address is used later, lowers resource consumption and improves the efficiency of obtaining information.
It should be noted that, in the embodiment of the present application, the history access information includes at least one or more of the following: access time interval, accumulated access times and single access duration to the target page.
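The three items of historical access information can be derived from the track sequence as sketched below. The sketch assumes that each access record is a `(timestamp, duration)` pair in seconds and that the interval and the duration are summarized by their means; both choices are illustrative and are not specified in the application.

```python
def summarize(records, now, window):
    """Summarize the accesses that fall within the last `window` seconds.

    records: iterable of (timestamp, duration) pairs, in any order.
    Returns the mean access time interval, the accumulated access times
    and the mean single access duration. Means are None when undefined.
    """
    recent = sorted((t, d) for t, d in records if now - window <= t <= now)
    times = [t for t, _ in recent]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "mean_interval": sum(gaps) / len(gaps) if gaps else None,
        "access_count": len(recent),
        "mean_duration": sum(d for _, d in recent) / len(recent) if recent else None,
    }


# Three accesses inside a 60-second window; the fourth lies outside it.
info = summarize([(0, 2.0), (10, 4.0), (30, 6.0), (1000, 1.0)], now=40, window=60)
```

Short, near-constant intervals combined with a high access count are the kind of pattern that the learning model in step 103 is expected to associate with crawler users.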
Step 103, inputting the IP address and the access time interval, the accumulated access times and the single access duration of the IP address to the target page within the preset duration into a pre-trained learning model, and outputting identity information corresponding to the IP address and confidence corresponding to the identity information.
The server inputs the acquired IP address and the access time interval, the accumulated access times and the single access time length of the current IP address to the target page within the preset time length to a pre-trained learning model, so that identity information corresponding to the IP address and confidence corresponding to the identity information can be output.
It should be noted that the identity information in the embodiment of the present application includes crawler user information and normal user information, and that the learning model is obtained by training on sample data comprising IP addresses and the access time interval, accumulated access times and single access duration of each IP address to the target page within the preset duration.
In one embodiment of the present application, before inputting the IP address and the access time interval, the accumulated access times, and the single access time length of the IP address to the target page within the preset time length into the pre-trained learning model, the following method may be further executed:
The server inputs a plurality of IP addresses whose identity information has been determined in advance to be crawler user information, together with the corresponding historical access information, into a convolutional neural network for training.
The convolution layer of the convolutional neural network extracts the access time interval, the accumulated access times and the single access duration of the IP address to the target page from the historical access information, and the pooling layer and the fully connected layer then process these quantities, so that the identity information corresponding to the IP address and the confidence corresponding to the identity information are output. Training of the learning model is completed when the identity information output for the IP address matches the predetermined identity information.
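The layer sequence described above (convolution, pooling, fully connected, output of identity and confidence) can be illustrated with a dependency-free forward pass. This is only a sketch: the weights below are hand-picked placeholders rather than trained values, training itself is omitted, and the use of ReLU, global max pooling and softmax is an assumption, since the application does not name the activation or output functions.

```python
import math

LABELS = ("normal", "crawler")


def conv1d(xs, kernel):
    """Valid 1-D convolution followed by ReLU; needs len(xs) >= len(kernel)."""
    k = len(kernel)
    return [max(0.0, sum(w * x for w, x in zip(kernel, xs[i:i + k])))
            for i in range(len(xs) - k + 1)]


def dense(features, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(intervals, kernels, fc_weights, fc_biases):
    """Return (identity, confidence) for a sequence of access intervals."""
    pooled = [max(conv1d(intervals, kernel)) for kernel in kernels]  # global max pool
    probs = softmax(dense(pooled, fc_weights, fc_biases))
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Placeholder parameters chosen by hand for illustration only.
kernels = [[0.5, 0.5], [1.0, -1.0]]
fc_weights = [[0.0, 2.0], [1.0, 0.0]]
fc_biases = [0.0, 0.0]
identity, confidence = classify([1.0, 1.0, 1.0, 1.0], kernels, fc_weights, fc_biases)
```

The confidence returned here is the softmax probability of the predicted class, which is one common way to obtain a value that can be compared with a preset confidence threshold.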
In one embodiment of the present application, the access information in the historical access track sequence of a normal user follows the user's daily habits and is therefore relatively regular, whereas the access information of a crawler user results from mechanical grabbing and is relatively irregular, so that the server can determine whether the user corresponding to the access request is a crawler user by simply judging the behavior of the user.
In one embodiment of the present application, after outputting the identity information corresponding to the IP address and the confidence of the identity information, the server uses the identity information output by the learning model as the target identity information corresponding to the IP address when the confidence corresponding to the identity information exceeds a preset confidence threshold. In the case that the confidence does not exceed the preset confidence threshold, the server needs to compare the energy consumption corresponding to the IP address accessing the target page with the energy consumption corresponding to the server processing the access request; if the former does not exceed the latter, the server can determine that the target identity information corresponding to the current IP address is crawler user information.
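The decision rule in this embodiment can be written as a short function. The function name and the string labels are hypothetical, and the final branch (returning normal user information when the client-side energy consumption exceeds the server-side energy consumption) is an assumption, because the text above states only the crawler case explicitly.

```python
def resolve_identity(predicted, confidence, threshold, client_energy, server_energy):
    """Combine the model output with the energy comparison.

    predicted: "crawler" or "normal", as output by the learning model.
    client_energy: energy attributed to the IP address accessing the page.
    server_energy: energy attributed to the server processing the request.
    """
    if confidence > threshold:
        # The model is confident enough; use its output directly.
        return predicted
    if client_energy <= server_energy:
        # Stated rule: client energy not exceeding server energy means crawler.
        return "crawler"
    # Assumption: otherwise the IP address is treated as a normal user.
    return "normal"


decision = resolve_identity("normal", 0.55, 0.9, client_energy=1.0, server_energy=4.0)
```

In the usage line the model's low-confidence "normal" prediction is overridden, because the client spends no more energy per access than the server spends processing the request.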
Step 104: when the confidence does not exceed the preset confidence threshold, acquire the energy consumption corresponding to the IP address accessing the target page in the historical access track sequence, and the energy consumption of the server corresponding to the target page in processing the access request.
When the confidence corresponding to the identity information output by the model does not exceed the preset confidence threshold, that is, when the identity information output by the model does not reach the specified accuracy, the server additionally acquires the energy consumption corresponding to the IP address accessing the target page and the energy consumption of the server corresponding to the target page in processing the access request.
Specifically, the server determines, from the historical access track sequence, the total number of times the IP address accessed the target page within a preset duration and the total energy consumption of the user terminal corresponding to the IP address within that duration, and from these determines the energy consumption corresponding to the IP address accessing the target page.
The server also determines the server corresponding to the target page according to the target page to be accessed by the IP address, and determines the energy consumption corresponding to that server processing the access request according to the total number of access requests the server received within the preset duration.
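The text names the inputs to the two energy figures but does not give formulas. The sketch below assumes, purely for illustration, that each figure is a per-access average over the preset duration (total energy divided by the number of accesses or requests); the function names and the availability of a server-side total energy figure are likewise assumptions.

```python
def energy_per_access(total_energy: float, access_count: int) -> float:
    """Average energy per access over the preset duration (assumed formula)."""
    if access_count <= 0:
        raise ValueError("access_count must be positive")
    return total_energy / access_count


def window_energies(client_total_energy: float, client_access_count: int,
                    server_total_energy: float, server_request_count: int):
    """Return (client energy per access, server energy per request)."""
    return (
        energy_per_access(client_total_energy, client_access_count),
        energy_per_access(server_total_energy, server_request_count),
    )


# Hypothetical figures for one preset duration.
client_energy, server_energy = window_energies(12.0, 4, 500.0, 1000)
```

The two returned values are the quantities whose size relation is compared in the following step.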
Step 105: determine the target identity information corresponding to the IP address according to the size relation between the energy consumption corresponding to the IP address accessing the target page and the energy consumption corresponding to the server processing the access request.
The server compares the energy consumption corresponding to the IP address accessing the target page with the energy consumption corresponding to the server processing the access request, and determines the target identity information corresponding to the IP address according to which of the two is larger.
Step 106: when the target identity information corresponding to the IP address is crawler user information, add the IP address to the Nginx shielding file to limit access from the IP address.
When the target identity information corresponding to the IP address is the crawler user information, the server adds the current IP address to the pre-established Nginx shielding file, so that the purpose of limiting the access of the IP address is achieved.
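One way to maintain such a shielding file is to append a `deny` directive (from Nginx's access module) for each crawler IP address to a file that the Nginx configuration pulls in with `include`, and then reload Nginx. The sketch below covers only the file handling and is an assumption about the implementation: the function names and the file name are hypothetical, and the reload step (for example `nginx -s reload`) is left to the caller.

```python
import ipaddress
from pathlib import Path


def blocked_ips(blockfile: Path) -> set[str]:
    """Return the addresses already listed as `deny <ip>;` in the file."""
    if not blockfile.exists():
        return set()
    ips = set()
    for line in blockfile.read_text().splitlines():
        line = line.strip()
        if line.startswith("deny ") and line.endswith(";"):
            ips.add(line[len("deny "):-1].strip())
    return ips


def block_ip(blockfile: Path, ip: str) -> bool:
    """Append a deny rule for `ip`; return False if it was already listed.

    Raises ValueError if `ip` is not a valid IPv4 or IPv6 address.
    """
    normalized = str(ipaddress.ip_address(ip))
    if normalized in blocked_ips(blockfile):
        return False
    with blockfile.open("a") as fh:
        fh.write(f"deny {normalized};\n")
    return True
```

Calling `blocked_ips` before running the learning model also gives a cheap first check, so that requests from addresses already in the file can be rejected directly.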
In one embodiment of the present application, after the server adds the IP address to the Nginx shielding file to limit access from the IP address when the target identity information corresponding to the IP address is crawler user information, the server further hides the real cascading style corresponding to the target page by controlling the offset corresponding to the target page in the cascading style sheet (CSS). For the offset target page, the server calls the TTF file of a preset custom font to render the text in the target page, so that the real content of the target page is displayed according to the correspondence between the TTF file and the text.
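The correspondence between the custom font file and the text can be pictured as a substitution table. The sketch below is a guess at one common form of this technique, not the patented mechanism: each real character is replaced in the page source by a code point in the Unicode Private Use Area, and the custom TTF is assumed to draw the real character's glyph at that code point, so a browser shows the true text while a crawler reading the markup sees placeholder characters. The function names are hypothetical, and the CSS offset part of the embodiment is not modeled.

```python
def build_glyph_map(charset: str, start: int = 0xE000) -> dict[str, str]:
    """Assign each distinct character a code point in the Private Use Area."""
    return {ch: chr(start + i) for i, ch in enumerate(sorted(set(charset)))}


def obfuscate(text: str, glyph_map: dict[str, str]) -> str:
    """Replace mapped characters; the page source would contain this string."""
    return "".join(glyph_map.get(ch, ch) for ch in text)


def deobfuscate(text: str, glyph_map: dict[str, str]) -> str:
    """Invert the mapping, as the custom font does visually when rendering."""
    reverse = {v: k for k, v in glyph_map.items()}
    return "".join(reverse.get(ch, ch) for ch in text)


# Only the digits are protected in this example.
digits = build_glyph_map("0123456789")
served = obfuscate("Price: 42", digits)
```

Here `served` keeps the label "Price: " readable but carries the digits as private-use characters that are meaningless without the font's mapping.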
The foregoing is a method embodiment presented herein. Based on the same inventive concept, the embodiment of the application also provides a device for limiting the high-frequency access of the crawler, and the structure of the device is shown in fig. 2.
Fig. 2 is a schematic structural diagram of an apparatus for limiting high-frequency access of a crawler according to an embodiment of the present application. As shown in fig. 2, the apparatus includes: a request acquisition unit 201, a parsing unit 202, an output unit 203, an energy consumption acquisition unit 204, a determination unit 205, and an addition unit 206;
the request acquisition unit 201 is configured to acquire an access request of a user, and determine, according to the access request, an IP address corresponding to the user and a historical access track sequence corresponding to the IP address;
the parsing unit 202 is configured to parse the historical access track sequence to obtain historical access information of the IP address within a preset duration, where the historical access information includes at least one or more of the following: an access time interval, an accumulated number of accesses, and a single access duration for the target page;
the output unit 203 is configured to input the IP address, together with the access time interval, the accumulated number of accesses, and the single access duration of the IP address for the target page within the preset duration, to a pre-trained learning model, and output identity information corresponding to the IP address and a confidence corresponding to the identity information, where the identity information includes crawler user information and normal user information;
the energy consumption acquisition unit 204 is configured to acquire, when the confidence does not exceed a preset confidence threshold, the energy consumption corresponding to the IP address accessing the target page in the historical access track sequence, and the energy consumption of the server corresponding to the target page in processing the access request;
the determination unit 205 is configured to determine target identity information corresponding to the IP address according to the size relation between the energy consumption corresponding to the IP address accessing the target page and the energy consumption corresponding to the server processing the access request;
and the adding unit 206 is configured to add the IP address to the Nginx shielding file to limit access from the IP address when the target identity information corresponding to the IP address is crawler user information.
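The work of the parsing unit can be sketched as a function that filters the historical access track sequence to the preset duration and derives the three kinds of historical access information. The record layout (`AccessRecord` with a timestamp and a per-access duration), the function name, and the returned dictionary keys are assumptions made for illustration; the source does not specify a data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRecord:
    timestamp: float  # seconds since the epoch when the request arrived
    duration: float   # seconds spent on the target page for this access


def extract_features(records, now: float, window: float) -> dict:
    """Summarize the accesses that fall within the last `window` seconds."""
    recent = sorted(
        (r for r in records if now - window <= r.timestamp <= now),
        key=lambda r: r.timestamp,
    )
    times = [r.timestamp for r in recent]
    return {
        # Gaps between consecutive accesses, in chronological order.
        "intervals": [later - earlier
                      for earlier, later in zip(times, times[1:])],
        # Accumulated number of accesses within the preset duration.
        "access_count": len(recent),
        # Single access duration of each access.
        "durations": [r.duration for r in recent],
    }


history = [AccessRecord(0.0, 5.0), AccessRecord(100.0, 1.0),
           AccessRecord(130.0, 1.0), AccessRecord(160.0, 1.5)]
features = extract_features(history, now=200.0, window=120.0)
```

In this example the first record falls outside the 120-second window, so three accesses remain, separated by two 30-second intervals.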
Fig. 3 is a schematic internal structure of a device for limiting high-frequency access of a crawler according to an embodiment of the present application. As shown in fig. 3, the apparatus includes:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to:
perform the method of restricting crawler high-frequency access according to any one of the above.
The embodiments of the present application also provide a nonvolatile computer storage medium storing computer executable instructions configured to:
perform the method of restricting crawler high-frequency access according to any one of the above.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, many improvements to method flows can now be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs a digital system onto a single PLD to "integrate" it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logical method flow can be readily obtained merely by performing a little logic programming of the method flow in one of the hardware description languages described above and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may also be regarded as structures within the hardware component. The means for performing the various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that the present description may be provided as a method, system, or computer program product. Accordingly, the present specification embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description embodiments may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described progressively; identical and similar parts of the embodiments may be understood by reference to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus, device, and non-volatile computer storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The foregoing is merely one or more embodiments of the present description and is not intended to limit the present description. Various modifications and alterations to one or more embodiments of this description will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of one or more embodiments of the present description, is intended to be included within the scope of the claims of the present description.

Claims (10)

1. A method of restricting crawler high frequency access, the method comprising:
acquiring an access request of a user, and determining an IP address corresponding to the user and a historical access track sequence corresponding to the IP address according to the access request;
analyzing the historical access track sequence to obtain historical access information of the IP address within a preset duration, wherein the historical access information comprises at least one of the following: an access time interval, an accumulated number of accesses, and a single access duration for a target page;
inputting the IP address and the access time interval, accumulated access times and single access time length of the IP address to a target page within a preset time length to a pre-trained learning model, and outputting identity information corresponding to the IP address and confidence coefficient corresponding to the identity information, wherein the identity information comprises crawler user information and normal user information;
under the condition that the confidence coefficient does not exceed a preset confidence coefficient threshold value, acquiring energy consumption corresponding to the access target page of the IP address in the historical access track sequence and energy consumption of a server corresponding to the target page for processing the access request;
determining target identity information corresponding to the IP address according to the size relation between the energy consumption corresponding to the IP address access target page and the energy consumption corresponding to the access request processed by the server;
and when the target identity information corresponding to the IP address is the crawler user information, adding the IP address into an Nginx shielding file to limit the access of the IP address.
2. The method for limiting high-frequency access of crawlers according to claim 1, wherein before the IP address and the access time interval, the accumulated access times and the single access time length of the IP address to the target page within the preset time length are input to the pre-trained learning model, the method further comprises:
inputting a plurality of IP addresses with the predetermined identity information being the user information of the crawler and the corresponding historical access information into a convolutional neural network for training;
extracting the access time interval, the accumulated access times and the single access duration of the IP address for the target page from the historical access information through a convolutional layer of the convolutional neural network;
and processing the access time interval, the accumulated access times and the single access duration of the IP address for the target page through a pooling layer and a fully connected layer, and outputting the identity information corresponding to the IP address and the confidence corresponding to the identity information, until the identity information output for the IP address matches the predetermined identity information, thereby completing training of the learning model.
3. The method for limiting high-frequency access of a crawler according to claim 1, wherein after the outputting the identity information corresponding to the IP address and the confidence level corresponding to the identity information, the method further comprises:
under the condition that the confidence coefficient corresponding to the identity information exceeds a preset confidence coefficient threshold value, taking the identity information output by the learning model as target identity information corresponding to the IP address;
in the case that the confidence level does not exceed a preset confidence threshold, the method further comprises:
and if the energy consumption corresponding to the IP address access target page is not more than the energy consumption corresponding to the access request processed by the server, determining the target identity information corresponding to the IP address as the crawler user information.
4. The method for limiting high-frequency access of a crawler according to claim 1, wherein the obtaining, if the confidence does not exceed a preset confidence threshold, energy consumption corresponding to accessing a target page by the IP address in the historical access track sequence and energy consumption corresponding to processing the access request by a server corresponding to the target page specifically includes:
determining the total number of the target pages accessed by the IP address in a preset time period and the total energy consumption of the user terminal corresponding to the IP address in the preset time period from the historical access track sequence, and determining the energy consumption corresponding to the target pages accessed by the IP address;
And determining a server corresponding to the target page according to the target page to be accessed by the IP address, and determining energy consumption corresponding to the server for processing the access request according to the total number of the access requests received by the server within the preset time period.
5. The method for limiting high-frequency access of a crawler according to claim 1, wherein after the access request of the user is obtained and the IP address corresponding to the user is determined according to the access request, the method further comprises:
comparing the IP address corresponding to the user with the IP address forbidden to be accessed in the Nginx shielding file;
directly rejecting the access request under the condition that the Nginx shielding file comprises the IP address corresponding to the user;
and under the condition that the Nginx shielding file does not comprise the IP address corresponding to the user, predicting the identity information corresponding to the IP address through a learning model.
6. The method for limiting high-frequency access of a crawler according to claim 1, wherein after the IP address is added to an Nginx shielding file to limit the access of the IP address when the target identity information corresponding to the IP address is crawler user information, the method further comprises:
controlling the offset corresponding to the target page in the cascading style sheet to hide the cascading style corresponding to the target page;
rendering, for the offset target page, the text in the target page through a preset custom font file, and displaying the target page according to the correspondence between the preset custom font file and the text.
7. The method for limiting high-frequency access of a crawler according to claim 1, wherein the obtaining an access request of a user, determining an IP address corresponding to the user according to the access request, and a historical access track sequence corresponding to the IP address specifically includes:
receiving an access request of a user, and determining an IP address corresponding to the user and an ID of the user according to request information in the access request;
determining a historical access track sequence corresponding to the IP address according to the IP address corresponding to the user and the ID of the user;
the historical access track sequence further comprises: the IP address aims at a plurality of access records of a target page, and the access records are stored in the historical access track sequence according to the access time sequence.
8. An apparatus for restricting high frequency access to a crawler, the apparatus comprising:
the request acquisition unit is used for acquiring an access request of a user, and determining an IP address corresponding to the user and a historical access track sequence corresponding to the IP address according to the access request;
the analysis unit is used for analyzing the historical access track sequence to obtain the historical access information of the IP address within the preset duration, wherein the historical access information comprises at least one of the following: an access time interval, an accumulated number of accesses, and a single access duration for a target page;
the output unit is used for inputting the IP address and the access time interval, the accumulated access times and the single access time length of the IP address to the target page within the preset time length to a pre-trained learning model, and outputting identity information corresponding to the IP address and confidence coefficient corresponding to the identity information, wherein the identity information comprises crawler user information and normal user information;
the energy consumption acquisition unit is used for acquiring the energy consumption corresponding to the access target page of the IP address in the historical access track sequence and the energy consumption of the server corresponding to the target page for processing the access request under the condition that the confidence coefficient does not exceed a preset confidence coefficient threshold value;
The determining unit is used for determining target identity information corresponding to the IP address according to the size relation between the energy consumption corresponding to the IP address access target page and the energy consumption corresponding to the access request processed by the server;
and the adding unit is used for adding the IP address into the Nginx shielding file when the target identity information corresponding to the IP address is the crawler user information so as to limit the access of the IP address.
9. An apparatus for restricting high frequency access to a crawler, the apparatus comprising:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to:
perform the method of restricting crawler high-frequency access as claimed in any one of claims 1 to 7.
10. A non-transitory computer storage medium storing computer executable instructions which, when executed, implement a method of restricting crawler high frequency access as claimed in any one of claims 1 to 7.
CN202210208114.XA 2022-03-03 2022-03-03 Method, device, equipment and medium for limiting high-frequency access of crawler Active CN114710318B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210208114.XA CN114710318B (en) 2022-03-03 2022-03-03 Method, device, equipment and medium for limiting high-frequency access of crawler

Publications (2)

Publication Number Publication Date
CN114710318A CN114710318A (en) 2022-07-05
CN114710318B true CN114710318B (en) 2024-03-22

Family

ID=82166096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210208114.XA Active CN114710318B (en) 2022-03-03 2022-03-03 Method, device, equipment and medium for limiting high-frequency access of crawler

Country Status (1)

Country Link
CN (1) CN114710318B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116668120A (en) * 2023-06-01 2023-08-29 泰州市野徐太丰防护用品厂 Network security protection system based on access habit analysis

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175278A (en) * 2019-05-24 2019-08-27 新华三信息安全技术有限公司 The detection method and device of web crawlers
CN110609937A (en) * 2019-08-15 2019-12-24 平安科技(深圳)有限公司 Crawler identification method and device
CN110650142A (en) * 2019-09-25 2020-01-03 腾讯科技(深圳)有限公司 Access request processing method, device, system, storage medium and computer equipment
CN110851274A (en) * 2019-10-29 2020-02-28 深信服科技股份有限公司 Resource access control method, device, equipment and storage medium
CN111428108A (en) * 2020-03-25 2020-07-17 山东浪潮通软信息科技有限公司 Anti-crawler method, device and medium based on deep learning
CN112989158A (en) * 2019-12-16 2021-06-18 顺丰科技有限公司 Method, device and storage medium for identifying webpage crawler behavior

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8161549B2 (en) * 2005-11-17 2012-04-17 Patrik Lahti Method for defending against denial-of-service attack on the IPV6 neighbor cache

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
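Resource access control scheme based on user trust in SDN; 魏占祯; 《信息网络安全》 (Netinfo Security); full text *
A fine-grained RBAC access control model framework based on trustworthiness; 刘宏月; 阎军智; 马建峰; 通信学报 (Journal on Communications) (Issue S1); full text *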

Also Published As

Publication number Publication date
CN114710318A (en) 2022-07-05

Similar Documents

Publication Publication Date Title
CN110163417B (en) Traffic prediction method, device and equipment
US20210326357A1 (en) Data processing methods, apparatuses, and devices
CN110457578B (en) Customer service demand identification method and device
CN108243032B (en) Method, device and equipment for acquiring service level information
US20150207691A1 (en) Preloading content based on network connection behavior
CN112347512A (en) Image processing method, device, equipment and storage medium
CN115618964B (en) Model training method and device, storage medium and electronic equipment
CN115238826B (en) Model training method and device, storage medium and electronic equipment
CN111783018A (en) Page processing method, device and equipment
CN114710318B (en) Method, device, equipment and medium for limiting high-frequency access of crawler
CN116049761A (en) Data processing method, device and equipment
CN108769152B (en) Service refresh policy registration method, service refresh request method, device and equipment
CN107368281B (en) Data processing method and device
CN111242195B (en) Model, insurance risk control model training method and device and electronic equipment
CN112307371B (en) Applet sub-service identification method, device, equipment and storage medium
CN111241395B (en) Recommendation method and device for authentication service
CN111652074B (en) Face recognition method, device, equipment and medium
CN111753328B (en) Private data leakage risk detection method and system
CN113344590A (en) Method and device for model training and complaint rate estimation
CN115688130B (en) Data processing method, device and equipment
CN110209746B (en) Data processing method and device for data warehouse
CN111680203B (en) Data acquisition method and device and electronic equipment
CN117369783B (en) Training method and device for security code generation model
CN113204746B (en) Identity recognition method and device, storage medium and electronic equipment
CN117592102A (en) Service execution method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant