CN110912902A - Method, system, equipment and readable storage medium for processing access request

Info

Publication number: CN110912902A
Application number: CN201911182484.5A
Authority: CN
Inventors: 叶亮; 范渊; 莫凡; 刘博�
Original assignee: Hangzhou Dbappsecurity Technology Co Ltd
Current assignee: Hangzhou Dbappsecurity Technology Co Ltd
Priority date: 2019-11-27
Filing date: 2019-11-27
Publication date: 2020-03-24
Anticipated expiration: 2039-11-27
Also published as: CN110912902B

Abstract

The application discloses a method for processing an access request, comprising: acquiring a website traffic log file; marking a log file in the website traffic log file whose IP address is a proxy IP as a suspicious log file; extracting suspicious characteristic information from the suspicious log file; when an access request is received, judging whether the characteristic information in the access request is suspicious characteristic information; and if so, redirecting the access request to a verification interface. By screening suspicious log files from the website traffic log file by proxy IP and extracting suspicious characteristic information from them, the method identifies suspicious logs from the perspective of the proxy IP and intercepts requests that, with high probability, come from crawlers simulating user behavior, greatly improving the efficiency and accuracy of crawler interception. The application also provides a system, a device and a readable storage medium for processing the access request, which have the same beneficial effects.

Description

Method, system, equipment and readable storage medium for processing access request

Technical Field

The present application relates to the field of access request processing, and in particular, to a method, a system, a device, and a readable storage medium for processing an access request.

Background

A web crawler is a script or program that sends requests to target web pages and, after receiving the server's response, parses the page content, extracts the required data, and stores it in a corresponding data set. Many search engines, in China and abroad, are built on crawlers: a crawler program traverses the links of websites on the World Wide Web, collects the information on each web page, and stores it in a database or other storage container. With the rise of the internet and the growth of the data era, crawler activity on the network increases day by day. Some malicious users write crawler scripts to attack websites for profit or for other purposes, which harms the health of the network. The direct impact is a worse experience for real users, and the affected websites indirectly lose revenue.

Most websites place restrictions on user requests to keep malicious crawlers out, such as: limiting the access frequency of an IP, checking whether fields such as the user agent and the request source link are present in the request, collecting download access statistics per IP, and detecting the type of resource accessed. Unsophisticated crawlers are often detected and either redirected to a verification page or blocked from subsequent access.

However, a highly hidden crawler that simulates user behavior usually sends its requests through proxy IPs. Because millions of proxy IPs are available on the network and they are continuously updated, such a crawler cannot be stopped by blocking IPs.

Therefore, how to intercept highly hidden crawlers that simulate user behavior is a technical problem that needs to be solved by those skilled in the art at present.

Disclosure of Invention

The application aims to provide an access request processing method, system, equipment and readable storage medium, which are used for intercepting highly-hidden crawlers simulating user behaviors.

In order to solve the above technical problem, the present application provides a method for processing an access request, including:

acquiring a website traffic log file;

marking the log file with the IP address as the proxy IP in the website traffic log file as a suspicious log file;

extracting suspicious characteristic information from the suspicious log file;

when an access request is received, judging whether the characteristic information in the access request is the suspicious characteristic information;

and if so, redirecting the access request to a verification interface.

Optionally, marking a log file in the website traffic log file, in which the IP address is the proxy IP, as a suspicious log file, includes:

collecting the proxy IP from a preset proxy website and storing the proxy IP into a proxy database;

extracting a source IP in the website traffic log file through a regular expression, and judging whether the source IP exists in the proxy database;

if yes, the log file corresponding to the source IP is marked as the suspicious log file.

Optionally, the method further includes:

periodically using a detection script to verify the availability of the proxy IPs in the proxy database;

and deleting the proxy IP which fails the availability verification.

Optionally, after redirecting the access request to the verification interface, the method further includes:

acquiring the number of accesses from the source IP of the access request;

and when the number of accesses is greater than a threshold value, prohibiting the source IP of the access request from initiating access.

Optionally, when the suspicious feature information includes a user agent, extracting suspicious feature information from the suspicious log file includes:

marking the user agents that correspond to a plurality of different proxy IPs in the suspicious log file as suspicious user agents;

judging whether the number of the requests initiated by the suspicious user agent exceeds a threshold value;

if yes, the suspicious user agent is marked as the suspicious characteristic information.

Optionally, when the suspicious feature information includes a web cookie, extracting suspicious feature information from the suspicious log file, including:

and marking the network cookies that correspond to a plurality of different proxy IPs in the suspicious log file as the suspicious characteristic information.

Optionally, when the suspicious feature information includes a request source link, extracting suspicious feature information from the suspicious log file, including:

and acquiring, from the suspicious log file, a request source link whose value is a uniform resource locator under a target website, and marking the request source link as the suspicious characteristic information.

The present application further provides a system for processing an access request, the system comprising:

the first acquisition module is used for acquiring a website traffic log file;

the marking module is used for marking the log file with the proxy IP address in the website traffic log file as a suspicious log file;

the extraction module is used for extracting suspicious characteristic information from the suspicious log file;

the judging module is used for judging whether the characteristic information in the access request is the suspicious characteristic information or not when the access request is received;

and the redirection module is used for redirecting the access request to a verification interface when the characteristic information in the access request is the suspicious characteristic information.

The present application also provides an access request processing apparatus, including:

a memory for storing a computer program;

a processor for implementing the steps of the method of access request processing described above when executing the computer program.

The present application also provides a readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of access request processing described above.

The method for processing the access request comprises the following steps: acquiring a website traffic log file; marking a log file in the website traffic log file whose IP address is a proxy IP as a suspicious log file; extracting suspicious characteristic information from the suspicious log file; when an access request is received, judging whether the characteristic information in the access request is suspicious characteristic information; and if so, redirecting the access request to a verification interface.

According to this technical scheme, suspicious log files are screened from the website traffic log file by proxy IP, suspicious characteristic information is extracted from the suspicious log files, and an access request carrying suspicious characteristic information is redirected to a verification interface, which completes the processing of the access request. Suspicious logs are thus identified from the perspective of the proxy IP, and requests that, with high probability, come from crawlers simulating user behavior are intercepted according to the extracted suspicious characteristic information, greatly improving the efficiency and accuracy of crawler interception. The present application also provides a system, a device and a readable storage medium for processing an access request, which have the same beneficial effects and are not described again here.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.

Fig. 1 is a flowchart of a method for processing an access request according to an embodiment of the present application;

Fig. 2 is a flowchart of one implementation of S102 of the method of access request processing shown in Fig. 1;

Fig. 3 is a block diagram of a system for processing an access request according to an embodiment of the present application;

Fig. 4 is a block diagram of another system for processing an access request according to an embodiment of the present application;

Fig. 5 is a block diagram of an access request processing device according to an embodiment of the present application.

Detailed Description

The core of the application is to provide a method, a system, equipment and a readable storage medium for processing an access request, which are used for intercepting a highly-hidden crawler simulating user behaviors.

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

To bypass the anti-crawler measures of most websites, a highly hidden crawler simulates the access behavior of real users when initiating requests. It acquires a large number of proxy IPs from the network to build a proxy IP pool, then uses IPs chosen at random from the pool to send intermittent requests (each subsequent request is sent after a random interval and through a random proxy IP), and adds fields such as the user agent, request source link, accept-encoding, and host to the request header.

A highly hidden crawler has most of the characteristics of a real user request, so it cannot be detected by checking the IP access frequency or whether fields are present in the request header. Detection methods based on other dimensions, such as the type of resource accessed, the proportion of 4xx response codes, and the proportion of download accesses per IP, all ultimately rely on blocking the corresponding IP. A highly hidden crawler that simulates user behavior usually sends its requests through proxy IPs, and because thousands upon thousands of proxy IPs are available on the network and they are continuously updated, such a crawler cannot be stopped by blocking IPs. If access by highly hidden crawlers cannot be blocked, important website resources may be stolen, and the high-frequency access occupies network bandwidth, so that the website server may fail to operate normally. Therefore, the present application provides a method for processing an access request to solve the above technical problems.

Referring to Fig. 1, Fig. 1 is a flowchart of a method for processing an access request according to an embodiment of the present application.

The method specifically comprises the following steps:

S101: acquiring a website traffic log file;

In the prior art, which limits the access frequency of an IP, a highly hidden crawler can acquire a large number of proxy IPs from the network to build a proxy IP pool and then use IPs chosen at random from the pool to send intermittent requests. Because thousands upon thousands of proxy IPs are available on the network and they are continuously updated, such a crawler cannot be stopped by blocking IPs. The present application instead screens suspicious log files by proxy IP in order to intercept highly hidden crawlers.

For example, the website traffic log file may be obtained by traversing the website traffic log of the last week.
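As a minimal sketch of this acquisition step, assuming the traffic logs are kept as `*.log` files in a single directory and that file modification time is an acceptable proxy for log age (neither detail is specified by the application), the files of the last week could be selected as follows. The function name `recent_log_files` is hypothetical.

```python
from datetime import datetime, timedelta
from pathlib import Path


def recent_log_files(log_dir, days=7, now=None):
    """Return the *.log files in log_dir modified within the last `days` days."""
    now = now or datetime.now()
    cutoff = (now - timedelta(days=days)).timestamp()
    return sorted(
        path
        for path in Path(log_dir).glob("*.log")
        # Assumption: modification time approximates the period the log covers.
        if path.is_file() and path.stat().st_mtime >= cutoff
    )
```

A site that rotates logs by date in the file name would more likely filter on the name than on the modification time.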

S102: marking a log file in the website traffic log file whose IP address is a proxy IP as a suspicious log file;

S103: extracting suspicious characteristic information from the suspicious log file;

In the present application, the log file whose IP address is a proxy IP is marked as a suspicious log file, suspicious characteristic information is extracted from it, and whether the current access request comes from a crawler is judged according to that information, which improves the efficiency and accuracy of crawler interception.

Optionally, characteristic information of each log in the suspicious log set, such as cookies (data that some websites store on the user's local terminal to identify the user and track the session), startTime (access time), user agent, httpReferer (request source link), and srreadaddress (source proxy IP), may be extracted by constructing a regular expression script that matches the format of the suspicious log file.
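The regular-expression extraction can be sketched as follows. The log layout is an assumption: a combined-style access log line with the cookie appended as a final quoted field. The application does not specify a log format, and the names `LOG_PATTERN` and `parse_log_line` and the dictionary keys are hypothetical.

```python
import re

# Assumed line layout (not specified by the application):
# ip - - [time] "request" status size "referer" "user agent" "cookie"
LOG_PATTERN = re.compile(
    r'(?P<src_ip>\S+) \S+ \S+ \[(?P<start_time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)" "(?P<cookie>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Many servers log "-" for an absent header; normalise it to an empty string.
    for key in ("referer", "user_agent", "cookie"):
        if record[key] == "-":
            record[key] = ""
    return record
```

Each returned record then carries the source IP, access time, user agent, request source link, and cookie needed by the later extraction steps.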

Optionally, the suspicious characteristic information mentioned herein may include a user agent, a web cookie, and a request source link, where:

when the suspicious characteristic information includes a user agent, extracting the suspicious characteristic information from the suspicious log file may specifically be:

marking the user agents that correspond to a plurality of different proxy IPs in the suspicious log file as suspicious user agents;

judging whether the number of requests initiated by the suspicious user agent exceeds a threshold value;

if so, marking the suspicious user agent as suspicious characteristic information.

For example, the deduplicated non-empty user agent values are extracted from the suspicious log file; the number of access requests initiated by each user agent is recorded as N1, N2, N3, …; and the number of unique source proxy IPs corresponding to each user agent is recorded as M1, M2, M3, …. A user agent access threshold is then set according to the actual access traffic of the website, and if Mi > 5 and Ni exceeds the preset threshold, the i-th user agent is marked as suspicious characteristic information.
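The counting in this example can be sketched as follows, assuming each suspicious log has already been parsed into a dictionary with hypothetical keys `src_ip` and `user_agent`. The default of 5 for the proxy IP count comes from the example above; the request threshold of 100 is only a placeholder, since the application leaves it to be set from the website's actual traffic.

```python
from collections import defaultdict


def suspicious_user_agents(records, min_proxy_ips=5, request_threshold=100):
    """Return user agents with M_i > min_proxy_ips and N_i > request_threshold."""
    request_count = defaultdict(int)  # N_i: requests per user agent
    proxy_ips = defaultdict(set)      # distinct source proxy IPs per user agent
    for rec in records:
        user_agent = rec.get("user_agent")
        if not user_agent:            # skip empty user agent values
            continue
        request_count[user_agent] += 1
        proxy_ips[user_agent].add(rec["src_ip"])
    return {
        ua
        for ua, n_i in request_count.items()
        if len(proxy_ips[ua]) > min_proxy_ips and n_i > request_threshold
    }
```

Requiring both conditions keeps a popular browser string seen through one or two proxies from being flagged merely because it is common.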

When the suspicious characteristic information includes a web cookie, extracting the suspicious characteristic information from the suspicious log file, which may specifically be:

marking the network cookies that correspond to a plurality of different proxy IPs in the suspicious log file as suspicious characteristic information.

For example, a deduplicated list of non-empty network cookies may be constructed and the source proxy IPs corresponding to each network cookie in the list counted; if the same network cookie corresponds to multiple different source proxy IPs, the network cookie is marked as suspicious characteristic information.
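A minimal sketch of this cookie check, assuming parsed records with hypothetical keys `src_ip` and `cookie`, and reading "multiple different" as "at least two" (the application does not give a number):

```python
from collections import defaultdict


def suspicious_cookies(records, min_proxy_ips=2):
    """Return cookies that were presented from at least min_proxy_ips proxy IPs."""
    ips_by_cookie = defaultdict(set)
    for rec in records:
        cookie = rec.get("cookie")
        if cookie:                    # ignore requests without a cookie
            ips_by_cookie[cookie].add(rec["src_ip"])
    return {
        cookie
        for cookie, ips in ips_by_cookie.items()
        if len(ips) >= min_proxy_ips
    }
```

The underlying idea is that one session cookie normally stays with one client, so the same cookie arriving through several proxies suggests a crawler rotating its proxy pool.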

When the suspicious feature information includes a request source link, extracting the suspicious feature information from the suspicious log file, which may specifically be:

acquiring, from the suspicious log file, a request source link whose value is a uniform resource locator of the target website, and marking the request source link as suspicious characteristic information.

For example, all URLs (Uniform Resource Locators) of the target website may be traversed and stored in a URL list, and the deduplicated request source link values and the number of requests for each value are extracted. If a large number of requests carry a request source link that is fixed to the URL of the target website or to one or two URLs in the URL list, that request source link is marked as suspicious characteristic information.
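One way to sketch this check is shown below. It assumes parsed records with a hypothetical `referer` key and interprets "a large number" as a minimum share of all suspicious requests that carry a request source link; the 30% default is an illustrative assumption, not a value from the application.

```python
from collections import Counter


def suspicious_referers(records, site_urls, min_share=0.3):
    """Return on-site referers that account for at least min_share of requests."""
    counts = Counter(rec["referer"] for rec in records if rec.get("referer"))
    total = sum(counts.values())
    if total == 0:
        return set()
    return {
        url
        for url, count in counts.items()
        # Only URLs of the target website are candidates.
        if url in site_urls and count / total >= min_share
    }
```

Because each flagged URL must hold at least `min_share` of the requests, at most a few URLs can qualify, which matches the "one or two URLs" pattern described above.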

S104: when an access request is received, judging whether the characteristic information in the access request is suspicious characteristic information or not;

if yes, go to step S105;

when the characteristic information in the access request is suspicious characteristic information, the access request is highly likely to have been initiated by a highly hidden crawler, and step S105 is entered, where the access request is redirected to a verification interface so as to intercept it;

optionally, when the characteristic information in the access request is not suspicious characteristic information, the access request is less likely to have been initiated by a highly hidden crawler, and the access request may proceed so that the access is completed.

Preferably, in a specific embodiment, an access request preprocessing script may be run on the server side to extract information such as the source IP, user agent, network cookie, and request source link from the access request. It is then determined whether the source IP exists in the proxy database; if so, it is further determined whether the user agent, network cookie, request source link, or similar information contains suspicious characteristic information; and if so, the access request is redirected to a manual verification interface, so that a detection person can verify the access request manually.
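The two-stage decision of such a preprocessing script can be sketched as follows. The request representation (a dictionary with keys `src_ip`, `user_agent`, `cookie`, `referer`), the `suspicious` dictionary of sets, and the function name `should_redirect` are all hypothetical; how the redirect itself is issued depends on the web server and is not shown.

```python
def should_redirect(request, proxy_db, suspicious):
    """Return True if the request should be sent to the verification interface."""
    # Stage 1: only requests whose source IP is a known proxy are examined further.
    if request.get("src_ip") not in proxy_db:
        return False
    # Stage 2: any one matching suspicious feature is enough to redirect.
    return (
        request.get("user_agent") in suspicious["user_agents"]
        or request.get("cookie") in suspicious["cookies"]
        or request.get("referer") in suspicious["referers"]
    )
```

Using sets for the proxy database and the suspicious features keeps each lookup constant-time, which matters because this check runs on every incoming request.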

Optionally, when the suspicious characteristic information includes the user agent, determining whether the characteristic information in the access request is suspicious characteristic information, which may specifically be:

and judging whether the user agent in the access request is a suspicious user agent marked as suspicious characteristic information.

Optionally, when the suspicious feature information includes a web cookie, determining whether the feature information in the access request is suspicious feature information, which may specifically be:

and judging whether the network cookie in the access request is the network cookie marked as suspicious characteristic information.

Optionally, when the suspicious feature information includes a request source link, determining whether the feature information in the access request is suspicious feature information, which may specifically be:

and judging whether the request source link in the access request is the request source link marked as suspicious characteristic information.

S105: the access request is redirected to a verification interface.

Preferably, after the access request is redirected to the verification interface, the number of accesses from the source IP of the access request may be acquired, and when the number of accesses is greater than a threshold value, the source IP of the access request is prohibited from initiating access, which improves detection efficiency and reduces the workload of detection personnel.
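A minimal in-memory sketch of this counting and banning step is given below. The class name `AccessLimiter` and its interface are hypothetical, and a real deployment would likely keep the counters in shared storage and expire them over time, which the application does not discuss.

```python
from collections import Counter


class AccessLimiter:
    """Count redirected accesses per source IP and ban IPs above a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()
        self.banned = set()

    def record(self, src_ip):
        """Record one access from src_ip; return True if the IP is banned."""
        if src_ip in self.banned:
            return True
        self.counts[src_ip] += 1
        # Ban only when the count is strictly greater than the threshold.
        if self.counts[src_ip] > self.threshold:
            self.banned.add(src_ip)
        return src_ip in self.banned
```

With `threshold=2`, the first two calls to `record` for an IP return `False` and the third returns `True`.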

Based on this technical scheme, the access request processing method provided by the application screens suspicious log files from the website traffic log file by proxy IP, extracts suspicious characteristic information from the suspicious log files, and redirects an access request carrying suspicious characteristic information to a verification interface to complete the processing of the access request. Suspicious logs are identified from the perspective of the proxy IP, and highly hidden crawlers simulating user behavior are intercepted according to the extracted suspicious characteristic information, so that the efficiency and accuracy of crawler interception are greatly improved.

For step S102 in the previous embodiment, marking a log file in the website traffic log file whose IP address is a proxy IP as a suspicious log file may specifically be performed through the steps shown in Fig. 2, which are described below.

Referring to Fig. 2, Fig. 2 is a flowchart of one implementation of S102 in the method of access request processing shown in Fig. 1.

The method specifically comprises the following steps:

S201: collecting proxy IPs from a preset proxy website and storing them in a proxy database;

The preset proxy websites mentioned here may include, but are not limited to, proxy websites such as attorney proxy, station grandfather, news proxy, and mushroom proxy.
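Collection and storage can be sketched as follows, assuming the proxy website's page text has already been downloaded and lists proxies as `ip:port` pairs. The page layout, the SQLite table `proxy`, and the function name `store_proxies` are assumptions for illustration; the application does not specify the database or the page format.

```python
import re
import sqlite3

# Assumed listing format: IPv4 address, a colon, then a port number.
PROXY_PATTERN = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{2,5})\b")


def store_proxies(conn, page_text):
    """Extract ip:port pairs from page_text and store them without duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS proxy ("
        "ip TEXT, port INTEGER, PRIMARY KEY (ip, port))"
    )
    pairs = [(ip, int(port)) for ip, port in PROXY_PATTERN.findall(page_text)]
    # INSERT OR IGNORE skips pairs that are already in the table.
    conn.executemany("INSERT OR IGNORE INTO proxy (ip, port) VALUES (?, ?)", pairs)
    conn.commit()
```

For example, calling `store_proxies(sqlite3.connect(":memory:"), page_text)` fills an in-memory database that later steps can query by source IP.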

Preferably, the method may further comprise:

periodically using a detection script to verify the availability of the proxy IPs in the proxy database;

deleting the proxy IPs that fail the availability verification.

Because proxy IPs are short-lived, the embodiment of the application periodically uses a detection script to verify the availability of the proxy IPs in the proxy database. For example, a proxy IP can be used to send a request to a stable large website and the type of the return code checked: if the return code is of the 2xx type, the proxy IP is available; otherwise it is deleted. Keeping the proxy database current in this way improves the accuracy of crawler interception.
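A detection script of this kind might look like the sketch below, which uses only the Python standard library. The test URL, the timeout, and the function names are assumptions; proxies are assumed to be given as endpoint URLs such as `http://203.0.113.5:8080`, and scheduling the periodic run (for example with cron) is left out.

```python
import http.client
import urllib.request


def proxy_is_available(proxy, test_url="http://example.com/", timeout=5):
    """Return True if a request sent through `proxy` gets a 2xx response."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts and HTTP error codes all count as failure.
        return False


def prune_proxies(proxies, check=proxy_is_available):
    """Return only the proxies that pass the availability check."""
    return {proxy for proxy in proxies if check(proxy)}
```

Passing the check function as a parameter lets the pruning logic be exercised with a stub, without sending any network traffic.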

S202: extracting a source IP from the website traffic log file through a regular expression, and judging whether the source IP exists in the proxy database;

if yes, go to step S203;

optionally, when the source IP does not exist in the proxy database, the current log file is unlikely to be suspicious and may be marked as a safe log file.

S203: and marking the log file corresponding to the source IP as a suspicious log file.

Based on this technical scheme, proxy IPs are collected from a preset proxy website and stored in a proxy database, the source IP in the website traffic log file is extracted through a regular expression, and whether the source IP exists in the proxy database is judged; if so, the log file corresponding to the source IP is marked as a suspicious log file. This improves the screening accuracy of suspicious log files, makes the obtained suspicious characteristic information more accurate, and further improves the accuracy of intercepting highly hidden crawlers.

Referring to Fig. 3, Fig. 3 is a block diagram of a system for processing an access request according to an embodiment of the present application.
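Steps S202 and S203 can be sketched as follows, assuming the source IP is an IPv4 address at the start of each log line and the proxy database has been loaded into a set of IP strings. The names `IP_PATTERN` and `split_logs` are hypothetical.

```python
import re

# Assumption: each log line begins with the IPv4 source address.
IP_PATTERN = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\b")


def split_logs(lines, proxy_db):
    """Split log lines into (suspicious, safe) by source IP membership in proxy_db."""
    suspicious, safe = [], []
    for line in lines:
        match = IP_PATTERN.match(line)
        if match and match.group(1) in proxy_db:
            suspicious.append(line)   # S203: source IP is a known proxy
        else:
            safe.append(line)         # optional branch: mark as safe
    return suspicious, safe
```

Lines whose source IP cannot be extracted fall into the safe list here; a stricter variant could collect them separately for inspection.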

The system may include:

a first obtaining module 100, configured to obtain a website traffic log file;

a marking module 200, configured to mark a log file in the website traffic log file whose IP address is a proxy IP as a suspicious log file;

an extraction module 300, configured to extract suspicious feature information from the suspicious log file;

a determining module 400, configured to determine whether feature information in an access request is suspicious feature information when the access request is received;

the redirecting module 500 is configured to redirect the access request to the verification interface when the feature information in the access request is suspicious feature information.

Referring to Fig. 4, Fig. 4 is a block diagram of another system for processing an access request according to an embodiment of the present application.

The marking module 200 may include:

the collection submodule is used for collecting the proxy IP from the preset proxy website and storing the proxy IP into the proxy database;

the extraction submodule is used for extracting the source IP in the website traffic log file through the regular expression and judging whether the source IP exists in the proxy database;

and the first marking submodule is used for marking the log file corresponding to the source IP as a suspicious log file when the source IP exists in the proxy database.

The marking module 200 may further include:

the verification submodule is used for regularly using the detection script to carry out availability verification on the proxy IP in the proxy database;

and the deleting submodule is used for deleting the proxy IP which fails the availability verification.

The system may further comprise:

the second acquisition module is used for acquiring the number of accesses from the source IP of the access request;

and the forbidding module is used for forbidding the source IP of the access request to initiate access when the number of accesses is greater than the threshold value.

The extraction module 300 may include:

the second marking submodule is used for marking the user agents that correspond to a plurality of different proxy IPs in the suspicious log file as suspicious user agents when the suspicious characteristic information comprises a user agent;

the judging submodule is used for judging whether the number of the requests initiated by the suspicious user agent exceeds a threshold value;

and the third marking submodule is used for marking the suspicious user agent as suspicious characteristic information when the number of the requests initiated by the suspicious user agent exceeds a threshold value.

The extraction module 300 may include:

and the fourth marking submodule is used for marking the network cookies that correspond to a plurality of different proxy IPs in the suspicious log file as suspicious characteristic information when the suspicious characteristic information comprises a network cookie.

The extraction module 300 may include:

and the fifth marking submodule is used for acquiring, from the suspicious log file, a request source link whose value is a uniform resource locator of the target website when the suspicious characteristic information comprises a request source link, and marking the request source link as the suspicious characteristic information.

The various components of the above system may be practically applied in the following embodiments:

The first acquisition module acquires a website traffic log file; the collection submodule collects proxy IPs from a preset proxy website and stores them in a proxy database; the extraction submodule extracts a source IP from the website traffic log file through a regular expression and judges whether the source IP exists in the proxy database; when the source IP exists in the proxy database, the first marking submodule marks the log file corresponding to the source IP as a suspicious log file. The verification submodule periodically uses the detection script to verify the availability of the proxy IPs in the proxy database; the deleting submodule deletes the proxy IPs that fail the availability verification.

The extraction module extracts suspicious characteristic information from the suspicious log file; when an access request is received, the judging module judges whether the characteristic information in the access request is suspicious characteristic information; and when the characteristic information in the access request is suspicious characteristic information, the redirection module redirects the access request to the verification interface. The second acquisition module acquires the number of accesses from the source IP of the access request; when the number of accesses is greater than the threshold value, the forbidding module forbids the source IP of the access request to initiate access.

Referring to Fig. 5, Fig. 5 is a structural diagram of an access request processing device according to an embodiment of the present application.

The access request processing device 600 may vary considerably with configuration or performance, and may include one or more central processing units (CPUs) 622 (e.g., one or more processors), a memory 632, and one or more storage media 630 (e.g., one or more mass storage devices) storing applications 642 or data 644. The memory 632 and the storage medium 630 may be transient or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown), each of which may include a sequence of instructions operating on the device. Further, the processor 622 may be configured to communicate with the storage medium 630 and to execute, on the access request processing device 600, the series of instruction operations stored in the storage medium 630.

The access request processing device 600 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input-output interfaces 658, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.

The steps in the method of access request processing described above in fig. 1 to 2 are implemented by the access request processing device based on the structure shown in fig. 5.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of modules is merely a division of logical functions, and an actual implementation may have another division, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
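For the software case, one way to picture this is that each logical module is an independent callable, and an integrated processing module simply runs them in sequence. The sketch below is only an illustration of that idea; the `integrate` helper and the two stand-in modules are hypothetical and are not modules named in the present application.

```python
from typing import Callable, Iterable, List

# A logical module: takes a list of log lines and returns a list of log lines.
Module = Callable[[List[str]], List[str]]


def integrate(modules: Iterable[Module]) -> Module:
    """Combine several logical modules into one processing module."""
    stages = list(modules)

    def integrated(lines: List[str]) -> List[str]:
        # Run each module in order, feeding its output to the next one.
        for stage in stages:
            lines = stage(lines)
        return lines

    return integrated


def drop_blank_lines(lines: List[str]) -> List[str]:
    """Stand-in module (hypothetical): discard empty log lines."""
    return [line for line in lines if line.strip()]


def normalize_case(lines: List[str]) -> List[str]:
    """Stand-in module (hypothetical): lower-case every log line."""
    return [line.lower() for line in lines]


if __name__ == "__main__":
    raw = ["GET /A", "", "POST /B"]
    # Separate use: each module is called on its own.
    separate = normalize_case(drop_blank_lines(raw))
    # Integrated use: the same modules behind a single entry point.
    combined = integrate([drop_blank_lines, normalize_case])(raw)
    print(separate == combined)
```

Either arrangement yields the same result; the choice only affects how the modules are packaged and deployed.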

The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a function calling device, or a network device) to execute all or part of the steps of the method of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The method, system, device and readable storage medium for processing an access request provided by the present application have been described in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that those skilled in the art may make several improvements and modifications to the present application without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present application.

It is further noted that, in the present specification, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Claims

1. A method of access request processing, comprising:

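acquiring a website traffic log file;

marking a log file in the website traffic log file with an IP address of proxy IP as a suspicious log file;

extracting suspicious feature information from the suspicious log file;

when an access request is received, judging whether feature information in the access request is the suspicious feature information;

and if so, redirecting the access request to a verification interface.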

2. The method of claim 1, wherein marking a log file in the website traffic log file with an IP address of proxy IP as a suspicious log file comprises:

3. The method of claim 2, further comprising:

and deleting the proxy IP which fails the availability verification.

4. The method of claim 1, after redirecting the access request to a verification interface, further comprising:

acquiring the access times of a source IP of the access request;

5. The method of claim 1, wherein when the suspicious feature information comprises a user agent, extracting suspicious feature information from the suspicious log file comprises:

6. The method of claim 1, wherein when the suspicious feature information comprises a web cookie, extracting suspicious feature information from the suspicious log file comprises:

7. The method of claim 1, wherein when the suspicious feature information comprises a request source link, extracting suspicious feature information from the suspicious log file comprises:

8. A system for access request processing, comprising:

a first acquisition module, configured to acquire a website traffic log file;

9. An access request processing apparatus, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the method of access request processing according to any one of claims 1 to 7 when executing the computer program.

10. A readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of access request processing according to any one of claims 1 to 7.