CN113221053B

CN113221053B - Anti-crawling method and device, electronic equipment and storage medium

Info

Publication number: CN113221053B
Application number: CN202110595366.8A
Authority: CN
Inventors: 果海涛; 罗港
Original assignee: Beijing Chengshi Wanglin Information Technology Co Ltd
Current assignee: Beijing Chengshi Wanglin Information Technology Co Ltd
Priority date: 2021-05-28
Filing date: 2021-05-28
Publication date: 2024-04-12
Anticipated expiration: 2041-05-28
Also published as: CN113221053A

Abstract

The invention provides an anti-crawling method, an anti-crawling device, an electronic device and a storage medium. The method comprises: after the traffic layer receives a page access request, in response to the page access request hitting neither a first interception policy preset in the traffic layer nor a first interception list stored in the traffic layer, sending the page access request to the service layer; and after the service layer receives the page access request, in response to the page access request hitting a second interception policy preset in the service layer, updating the first interception list based on the page access request, so that page access requests from the same source as the page access request can hit the first interception list. This improves the interception rate of the traffic layer, reduces data theft, and reduces the number of page access requests entering the service layer, which lowers the performance loss of the service layer, avoids the situation in which the machine providing the service becomes unavailable because crawling consumes so much traffic that the load becomes too high, and improves stability.

Description

Anti-crawling method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a crawling preventing method, a crawling preventing device, an electronic device, and a storage medium.

Background

With the development of internet technology, a wide variety of business services are delivered over the internet. A web page provided by a business service displays a large amount of data for users to browse. However, abnormal browsing behavior may occur in which the data on the web page is crawled and thereby stolen. Excessive crawling also consumes traffic, which may directly overload the machine providing the business service and make it unavailable.

Disclosure of Invention

The embodiment of the invention provides an anti-crawling method, an anti-crawling device, electronic equipment and a storage medium, which are used for solving the problems in the related art.

The invention aims at realizing the following technical scheme:

In a first aspect, an embodiment of the present invention provides an anti-crawling method applied to a server, where the server includes a traffic layer and a service layer, and the method includes:

after the traffic layer receives a page access request, in response to the page access request hitting neither a first interception policy preset in the traffic layer nor a first interception list stored in the traffic layer, sending the page access request to the service layer;

after the service layer receives the page access request, in response to the page access request hitting a second interception policy preset in the service layer, updating the first interception list based on the page access request, so that page access requests from the same source as the page access request can hit the first interception list.

In a possible implementation manner, the page access request carries at least one source identifier, the first interception list includes a sub-interception list corresponding to each category of source identifier, and the sub-interception list corresponding to a category of source identifier includes the source identifiers of that category that are to be intercepted;

updating the first interception list based on the page access request includes:

adding each source identifier carried by the page access request to the sub-interception list corresponding to the category to which that source identifier belongs, so as to update the first interception list.

In one possible embodiment, the method further comprises:

in response to any source identifier carried by the page access request hitting the sub-interception list corresponding to the category to which that source identifier belongs, determining that the page access request hits the first interception list.

In a possible implementation manner, the page access request carries at least one source identifier, and the second interception policy includes a sub-interception policy corresponding to each category of source identifier;

the method further includes:

in response to any source identifier carried by the page access request hitting the sub-interception policy corresponding to the category to which that source identifier belongs, determining that the page access request hits the second interception policy.

In one possible implementation, the sub-interception policy corresponding to a category of source identifier includes at least one of the following policies:

intercepting page access requests for which the frequency of accessing a target list page, for the source identifier, is greater than or equal to a first threshold;

intercepting page access requests for which the frequency of changing filter words, for the source identifier, is greater than or equal to a second threshold;

intercepting page access requests for which the frequency of changing cities, for the source identifier, is greater than or equal to a third threshold;

intercepting page access requests for which the frequency of accessing expired data, for the source identifier, is greater than or equal to a fourth threshold.

In one possible embodiment, the method further comprises:

in response to the source identifier carried by the page access request hitting any one of the policies included in the sub-interception policy corresponding to the category to which the source identifier belongs, determining that the source identifier hits the sub-interception policy corresponding to the category to which it belongs.

In one possible implementation, updating the first interception list based on the page access request includes:

updating a second interception list stored in the service layer based on the page access request;

periodically synchronizing the second interception list to the first interception list, so as to update the first interception list.

In a possible implementation manner, the first interception list is an interception list of the target service, and the method further includes:

synchronizing the first interception list to the interception lists of services other than the target service.

In a second aspect, an embodiment of the present invention provides an anti-crawling apparatus, applied to a server, where the server includes a traffic layer and a service layer, the apparatus includes:

a first interception module, configured to, after the traffic layer receives a page access request, in response to the page access request hitting neither a first interception policy preset in the traffic layer nor a first interception list stored in the traffic layer, send the page access request to the service layer;

a second interception module, configured to, after the service layer receives the page access request, in response to the page access request hitting a second interception policy preset in the service layer, update the first interception list based on the page access request, so that page access requests from the same source as the page access request can hit the first interception list.

the second interception module is specifically configured to:

In one possible implementation, the second interception module is further configured to:

the second interception module is further used for:

In a possible implementation manner, the second interception module is specifically configured to:

In one possible implementation manner, the first interception list is an interception list of the target service, and the apparatus further includes:

a sending module, configured to synchronize the first interception list to the interception lists of services other than the target service.

In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory and a computer program stored on the memory and executable on the processor, which when executed by the processor performs the steps of the anti-crawling method as in any of the first aspects above.

In a fourth aspect, embodiments of the present invention provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the anti-crawling method as in any of the first aspects.

The advantages or beneficial effects in the technical scheme at least comprise:

Because the traffic layer is preset with a first interception policy and also stores a first interception list, and the service layer is preset with a second interception policy, a page access request that hits neither the first interception policy nor the first interception list in the traffic layer but hits the second interception policy in the service layer can be used to update the first interception list of the traffic layer, so that subsequent page access requests from the same source can hit the first interception list. In other words, the first interception list of the traffic layer is derived from the second interception policy of the service layer. As a result, the interception rate of the traffic layer is greatly improved, data theft is reduced, and fewer page access requests enter the service layer, which lowers the performance loss of the service layer, avoids the situation in which the machine providing the service becomes unavailable because crawling consumes so much traffic that the load becomes too high, and improves stability.

The foregoing summary is for the purpose of the specification only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will become apparent by reference to the drawings and the following detailed description.

Drawings

FIG. 1 is a flow chart of a crawling prevention method in an embodiment of the invention;

FIG. 2 is a schematic diagram of an application scenario of an anti-crawling method in an embodiment of the present invention;

FIG. 3 is a schematic view of a crawling preventing device in an embodiment of the present invention;

FIG. 4 is a schematic view of a crawling preventing device in an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.

At present, it is common to crawl data on pages by means of crawlers and the like. A crawler is a program or script that automatically crawls page data according to certain rules.

In general, a machine that provides business services, such as a server, may be configured to include at least a traffic layer and a service layer. The traffic layer may be implemented based on Nginx, which is a high-performance HTTP and reverse proxy server. In practice, a user's page access request reaches the traffic layer first, and the traffic layer is responsible for distributing the specific request to the service layer. The service layer is the business logic layer that specifically handles the interface content seen by the user.

In order to prevent crawling of data on a page, in the related art a rate-limiting policy based on the Internet Protocol (IP) address and the User Agent (UA), that is, an anti-crawling policy, is set at the traffic layer, and a number of anti-crawling policies for specific service scenarios are set at the service layer. These anti-crawling policies are interception policies generated by analyzing the characteristics of crawling behavior and combining them with the characteristics of the service. The interception policy of the traffic layer costs little machine performance, but what it can intercept is limited and its interception rate is low. The interception policies of the service layer have a high interception rate, but the more of them there are, the greater the performance loss to the service system. If a malicious crawler attack or a surge in crawler traffic occurs, the whole machine cluster is likely to be overloaded and become unavailable.

In order to solve the above technical problems, an embodiment of the invention provides an anti-crawling method that uses the interception policy of the traffic layer and the interception policy of the service layer jointly and in synchronization. That is, when a page access request hits the interception policy of the service layer, the service layer synchronizes the interception result to the traffic layer in real time, and the traffic layer can subsequently intercept page access requests from the same source directly. This improves the interception rate of the traffic layer, reduces the performance loss of the service layer, avoids the situation in which the machine providing the service becomes unavailable because crawling consumes so much traffic that the load becomes too high, and improves stability. The anti-crawling method provided by the embodiment of the invention is described in detail below.

FIG. 1 is a flow chart of a crawling prevention method in an embodiment of the invention.

As shown in fig. 1, the crawling preventing method provided in this embodiment is applied to a server, where the server includes a traffic layer and a service layer, and the method at least includes the following steps:

Step 101, after the traffic layer receives a page access request, in response to the page access request hitting neither a first interception policy preset in the traffic layer nor a first interception list stored in the traffic layer, the page access request is sent to the service layer.

Step 102, after the service layer receives the page access request, in response to the page access request hitting a second interception policy preset in the service layer, the first interception list is updated based on the page access request, so that page access requests from the same source as the page access request can hit the first interception list.

In practice, the server may be a cluster of machines providing business services. The sending end of the page access request, such as a client, sends the page access request to the server, and the request reaches the traffic layer of the server first. The traffic layer is preset with a first interception policy, under which page access requests that are at risk of crawling page data and need to be intercepted are identified in real time by an algorithm in the traffic layer. If the page access request hits the first interception policy, which indicates a risk of crawling page data, it may be intercepted at the traffic layer. The traffic layer also stores a first interception list, which characterizes the sources of page access requests that are at risk of crawling page data. If the page access request hits the first interception list, which likewise indicates a risk of crawling page data, it may be intercepted at the traffic layer. If the page access request hits neither the first interception policy nor the first interception list of the traffic layer, the traffic layer forwards it to the service layer for processing.

A second interception policy is preset in the service layer. Under the second interception policy, page access requests at risk of crawling page data are intercepted in real time by an algorithm in the service layer. In practice, page data may be crawled from a sending end with a fixed source for a certain period of time. Thus, if a page access request is at risk of crawling page data, other page access requests from the same source may also be at risk of crawling page data. In that case, the first interception list may be updated based on the page access request, so that subsequent page access requests from the same source can hit the first interception list, that is, be intercepted at the traffic layer. Here, the source of a page access request refers to the sending end of the page access request.
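The interplay of the two layers described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Request`, `TrafficLayer`, `ServiceLayer` and `handle` are hypothetical, each request is assumed to carry a single source string, both interception policies are passed in as plain predicates, and a request that hits the second interception policy is assumed to be blocked as well as recorded.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Request:
    source: str  # hypothetical: one identifier for the sending end
    path: str


@dataclass
class TrafficLayer:
    first_policy: Callable[[Request], bool]        # first interception policy
    first_list: set = field(default_factory=set)   # first interception list

    def should_block(self, req: Request) -> bool:
        # Block if either the policy or the stored list is hit.
        return self.first_policy(req) or req.source in self.first_list


@dataclass
class ServiceLayer:
    second_policy: Callable[[Request], bool]       # second interception policy


def handle(req: Request, traffic: TrafficLayer, service: ServiceLayer) -> str:
    # Step 101: only requests that miss both traffic-layer checks go on.
    if traffic.should_block(req):
        return "blocked_at_traffic_layer"
    # Step 102: a hit in the service layer feeds the traffic-layer list.
    if service.second_policy(req):
        traffic.first_list.add(req.source)
        return "blocked_at_service_layer"
    return "served"
```

With this sketch, the first request from a crawling source is caught in the service layer, and every later request from the same source is stopped in the traffic layer without reaching the service layer.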

In this embodiment, the traffic layer is preset with a first interception policy and stores a first interception list, and the service layer is preset with a second interception policy. A page access request that hits neither the first interception policy nor the first interception list in the traffic layer but hits the second interception policy in the service layer can be used to update the first interception list of the traffic layer, so that subsequent page access requests from the same source can hit the first interception list. In other words, the first interception list of the traffic layer is derived from the second interception policy of the service layer. As a result, the interception rate of the traffic layer is greatly improved, data theft is reduced, and fewer page access requests enter the service layer, which lowers the performance loss of the service layer, avoids the situation in which the machine providing the service becomes unavailable because crawling consumes too much traffic, and improves stability.

The scheme of the embodiment can be suitable for various service scenes, such as a house renting service, a house buying service and the like.

In an exemplary embodiment, the first interception list may be the interception list of a target service. Here, the target service refers to the service at which the current page access request is directed. Correspondingly, the second interception policy is the interception policy of the target service. A source that poses a risk of crawling page data for the target service may pose the same risk for other services, so the interception list of the target service may also be applied to other services. Based on this, the above anti-crawling method may further include synchronizing the first interception list to the interception lists of services other than the target service. Interception policies are thereby reused across services, the interception policies available to each service become richer, and the security of service data is further improved.
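One way to picture the cross-service synchronization is the sketch below. It assumes, purely for illustration, that each service keeps its interception list as an in-memory set keyed by a service name, and that synchronization merges the target service's entries into the other lists rather than overwriting them; the source does not specify either point, and the function name is hypothetical.

```python
def sync_to_other_services(lists_by_service: dict, target: str) -> None:
    """Merge the target service's interception list into every other list."""
    target_list = lists_by_service[target]
    for name, blocked in lists_by_service.items():
        if name != target:
            # In-place union: existing entries of the other service are kept.
            blocked |= target_list
```

For example, with a rental service as the target, a source blocked for rental listings would also be blocked for a sales service after the call.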

In an exemplary embodiment, the above anti-crawling method may further include: after the traffic layer receives the page access request, determining whether the page access request hits a first interception policy, and determining whether the page access request hits a first interception list stored in the traffic layer in response to the page access request missing the first interception policy preset in the traffic layer.

The first interception policy of the traffic layer is processed in real time through an algorithm, and the page access request is processed through the first interception policy of the traffic layer, so that the timeliness is higher, and the latest page access request with the data crawling risk is intercepted in time.

Alternatively, it may first be determined whether the page access request hits the first interception list, and then, in response to the page access request not hitting the first interception list of the traffic layer, whether it hits the first interception policy of the traffic layer.

It will be appreciated that in response to a page access request hitting a first interception policy, the page access request may be intercepted. In response to the page access request hitting the first interception list, the page access request may be intercepted.
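The two possible check orders in the traffic layer can be illustrated with a short sketch. The function name, the `list_first` flag and the string labels are hypothetical; the policy is assumed to be a predicate over a single source string.

```python
from typing import Callable, Optional


def traffic_layer_check(
    source: str,
    policy_hit: Callable[[str], bool],
    first_list: set,
    list_first: bool = False,
) -> Optional[str]:
    """Return the name of the first check that is hit, or None if both miss."""
    checks = [
        ("first_policy", policy_hit),
        ("first_list", lambda s: s in first_list),
    ]
    if list_first:
        checks.reverse()  # consult the stored list before the policy
    for name, check in checks:
        if check(source):
            return name   # the request is intercepted; skip the other check
    return None           # both missed: forward to the service layer
```

Either order yields the same block or forward decision; the order only changes which check runs first and therefore which one is skipped on a hit.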

In an exemplary embodiment, the above anti-crawling method may further include: in response to the page access request hitting the second interception policy preset in the service layer, intercepting the page access request; and in response to the page access request not hitting the second interception policy preset in the service layer, acquiring the page requested by the page access request and sending the page to the sending end.

In an exemplary embodiment, the page access request may carry at least one source identifier. The first interception list may include a sub-interception list corresponding to each category of source identifier. The sub-interception list corresponding to a category of source identifier includes the source identifiers of that category that need to be intercepted. In step 102, updating the first interception list based on the page access request may include: adding each source identifier carried by the page access request to the sub-interception list corresponding to the category to which that source identifier belongs, so as to update the first interception list. In this embodiment, the first interception list intercepts on the basis of source identifiers, which is simple to implement, intercepts accurately, and is easy to maintain and update.

In an exemplary embodiment, the above anti-crawling method may further include: in response to any source identifier carried by the page access request hitting the sub-interception list corresponding to the category to which that source identifier belongs, determining that the page access request hits the first interception list. Because a hit by any single source identifier is treated as a hit on the first interception list, the interception rate can be further improved.

Alternatively, it may be determined that the page access request hits the first interception list in response to more than two source identifiers each hitting the sub-interception list corresponding to the category to which it belongs. Interception is then more accurate, and the browsing of normal users is not affected.
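The per-category sub-interception lists and the two hit rules can be sketched as below. The class name, the category keys (`"ip"`, `"ua"`) and the `min_hits` parameter are assumptions made for illustration; `min_hits=1` corresponds to the "any source identifier" rule, and a larger value models the stricter rule that requires several identifiers to hit.

```python
from collections import defaultdict


class FirstInterceptionList:
    def __init__(self) -> None:
        # One sub-interception list (a set) per category of source identifier.
        self.sub_lists = defaultdict(set)

    def add_request(self, identifiers: dict) -> None:
        """Add every identifier of a request to its category's sub-list."""
        for category, value in identifiers.items():
            self.sub_lists[category].add(value)

    def hit_count(self, identifiers: dict) -> int:
        """Count identifiers found in the sub-list of their own category."""
        return sum(
            1
            for category, value in identifiers.items()
            if value in self.sub_lists.get(category, ())
        )

    def hits(self, identifiers: dict, min_hits: int = 1) -> bool:
        return self.hit_count(identifiers) >= min_hits
```

Keeping one set per category means a value is only compared against identifiers of the same kind, so a UA string can never accidentally match an entry in the IP sub-list.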

In an exemplary embodiment, the page access request carries at least one source identifier, and the second interception policy may include a sub-interception policy corresponding to each category of source identifier. The above anti-crawling method may then further include: in response to any source identifier carried by the page access request hitting the sub-interception policy corresponding to the category to which that source identifier belongs, determining that the page access request hits the second interception policy.

In this embodiment, any source identifier in the page access request hits the sub-interception policy corresponding to the category to which the source identifier belongs, which can be considered that the page access request hits the second interception policy, and the interception rate can be further improved.

In addition, it may be determined that the page access request hits the second interception policy in response to more than two source identifiers hitting the sub-interception policy corresponding to the category to which the source identifier belongs. The method can enable the interception accuracy to be higher, and avoid affecting normal users.

In an exemplary embodiment, the sub-interception policy corresponding to a category of source identifier may include at least one of the following policies:

Policy 1: intercepting page access requests for which the frequency of accessing the target list page, for the source identifier, is greater than or equal to a first threshold.

Here, the target list page refers to a list page of a specific service. The frequency of accessing the target list page refers to the number of times the target list page is accessed per unit time. Taking a house renting service as an example, a crawler may continuously crawl the personal house source list pages, for example the sorted versions of those list pages, in order to obtain the latest personal house source data. If the same source identifier accesses these list pages too frequently within a short time, page data is probably being crawled and the requests can be intercepted.

The specific value of the first threshold may be set according to practical situations, which is not limited herein.

Policy 2: intercepting page access requests for which the frequency of changing filter words, for the source identifier, is greater than or equal to a second threshold.

The filter words are the keywords that are input. The frequency of changing filter words refers to the number of times the filter words are changed per unit time. Still taking a house renting service as an example, a crawler may continuously change the filter words while accessing the personal house source list page. If the same source identifier changes filter words too frequently within a short time, page data may be being crawled and the requests can be intercepted.

The specific value of the second threshold may be set according to practical situations, and is not limited herein.

Policy 3: intercepting page access requests for which the frequency of changing cities, for the source identifier, is greater than or equal to a third threshold.

The frequency of changing cities refers to the number of times the city being accessed is changed per unit time. Still taking a house renting service as an example, a crawler may continuously change the city while accessing the personal house source list page. If the same source identifier changes cities too frequently within a short time, page data may be being crawled and the requests can be intercepted. The specific value of the third threshold may be set according to practical situations, and is not limited herein.

Policy 4: intercepting page access requests for which the frequency of accessing expired data, for the source identifier, is greater than or equal to a fourth threshold.

Some of the data displayed on a page is time-sensitive: after a period of time it may no longer be displayed on the page, that is, it becomes expired data. The frequency of accessing expired data refers to the number of times expired data is accessed per unit time. A crawler may frequently access expired data through expired links in order to update the database. Taking a house renting service as an example, a crawler may frequently access expired house sources to determine whether each house source has expired, so as to update the database. If the same source identifier accesses expired data too frequently within a short time, page data is probably being crawled and the requests can be intercepted. The specific value of the fourth threshold may be set according to practical situations, and is not limited herein.
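Policies 1 and 4 both compare an access frequency with a threshold. A minimal sketch of such a check is a sliding-window counter per source identifier, shown below. The class name, the window length and the explicit `now` timestamp argument are assumptions for illustration; the source only says that the frequency is a count per unit time and leaves the thresholds open.

```python
from collections import defaultdict, deque


class AccessFrequencyPolicy:
    """Hit when a source makes `threshold` or more accesses within the window."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window = window_seconds
        self.events = defaultdict(deque)  # source identifier -> access times

    def record_and_check(self, source_id: str, now: float) -> bool:
        times = self.events[source_id]
        times.append(now)
        # Drop accesses that have fallen out of the window.
        while times and times[0] <= now - self.window:
            times.popleft()
        return len(times) >= self.threshold
```

The caller would invoke `record_and_check` only for the accesses the policy cares about, for example requests for the target list page (Policy 1) or requests that resolve to expired data (Policy 4).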

In this embodiment, a plurality of interception policies are set according to service characteristics, so that the interception policies of the service layer are enriched, and the interception policies of the service layer are more comprehensive, thereby further improving the interception rate of the service layer, and further improving the security of the data of the core service.
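Policies 2 and 3 count how often a value changes rather than how often a page is accessed. The sketch below tracks the last value seen per source identifier and counts changes inside a sliding window. The class name, the window length and the `now` argument are hypothetical, and treating the first observed value as "no change" is an assumption.

```python
from collections import defaultdict, deque


class ChangeFrequencyPolicy:
    """Hit when a source changes a value `threshold` or more times in the window."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window = window_seconds
        self.last_value = {}               # source identifier -> last value seen
        self.changes = defaultdict(deque)  # source identifier -> change times

    def record_and_check(self, source_id: str, value: str, now: float) -> bool:
        changes = self.changes[source_id]
        previous = self.last_value.get(source_id)
        # Only a switch to a different value counts as a change.
        if previous is not None and previous != value:
            changes.append(now)
        self.last_value[source_id] = value
        # Drop changes that have fallen out of the window.
        while changes and changes[0] <= now - self.window:
            changes.popleft()
        return len(changes) >= self.threshold
```

Here `value` would be the filter word for Policy 2 or the city for Policy 3, so one instance per policy keeps the two counts separate.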

It will be appreciated that the sub-interception policies listed above for a category of source identifier are merely examples; other sub-interception policies may be set according to service characteristics and are not enumerated here.

In an exemplary embodiment, the above anti-crawling method may further include: in response to the source identifier carried by the page access request hitting any one of the policies included in the sub-interception policy corresponding to the category to which the source identifier belongs, determining that the source identifier hits the sub-interception policy corresponding to the category to which it belongs. In this embodiment, a hit on any single policy is treated as the source identifier hitting the sub-interception policy of its category, so the interception rate can be further improved.

Alternatively, it may be determined that the source identifier hits the sub-interception policy corresponding to the category to which it belongs in response to the source identifier carried by the page access request hitting more than two of the policies included in that sub-interception policy. The interception condition is then stricter, which improves interception accuracy and avoids affecting users who are browsing normally.

In an exemplary embodiment, the at least one source identification may include an IP address, and/or UA, and/or a device number.

In general, a normal page access request carries the IP address, UA, and device number of the sending end, from which it can be determined which sending end the request comes from. The IP address, UA, and device number can therefore be used as source identifiers.
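As an illustration of collecting these three identifiers from an HTTP request, the sketch below builds a category-to-value mapping. The source does not say how the device number is transported, so the header name `X-Device-Id` is a purely hypothetical placeholder, as are the function name and the category keys.

```python
from typing import Mapping, Optional

# Hypothetical header name; the source does not specify how the
# device number is carried in the request.
DEVICE_HEADER = "X-Device-Id"


def extract_source_identifiers(
    remote_ip: Optional[str], headers: Mapping[str, str]
) -> dict:
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    candidates = {
        "ip": remote_ip,
        "ua": lowered.get("user-agent"),
        "device": lowered.get(DEVICE_HEADER.lower()),
    }
    # Keep only the identifiers that are actually present.
    return {category: value for category, value in candidates.items() if value}
```

A request that lacks one of the identifiers simply yields a smaller mapping, which matches the "at least one source identifier" wording.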

Page data may be crawled from a fixed IP address, UA, and device for a certain period of time. If a page access request is at risk of crawling page data, page access requests from the same IP address may also be at risk of crawling page data, and the same applies to page access requests from the same UA and to page access requests from the same device number. Therefore, in this embodiment, page access requests at risk of crawling page data can be accurately intercepted by means of the IP address, UA, and device number.

Correspondingly, the second interception policy may include a sub-interception policy corresponding to the IP address, and may further include a sub-interception policy corresponding to the UA and a sub-interception policy corresponding to the device number. Based on this, determining that the source identifier hits the sub-interception policy corresponding to the category to which it belongs, in response to the source identifier carried by the page access request hitting any one of the policies included in that sub-interception policy, may include: in response to the IP address in the page access request hitting any one of the policies included in the sub-interception policy corresponding to the IP address in the second interception policy, determining that the IP address hits the sub-interception policy corresponding to the category to which it belongs; in response to the UA in the page access request hitting any one of the policies included in the sub-interception policy corresponding to the UA in the second interception policy, determining that the UA hits the sub-interception policy corresponding to the category to which it belongs; and in response to the device number in the page access request hitting any one of the policies included in the sub-interception policy corresponding to the device number in the second interception policy, determining that the device number hits the sub-interception policy corresponding to the category to which it belongs.
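The two-level decision described above (policies within a category, then categories within a request) can be sketched as follows. The function name, the predicate type and the `min_policies` and `min_identifiers` parameters are assumptions for illustration; the defaults of 1 model the "any one policy" and "any source identifier" rules, and larger values model the stricter variants.

```python
from typing import Callable, Mapping, Sequence

Policy = Callable[[str], bool]  # hypothetical: predicate over one identifier


def request_hits_second_policy(
    identifiers: Mapping[str, str],
    sub_policies: Mapping[str, Sequence[Policy]],
    min_policies: int = 1,
    min_identifiers: int = 1,
) -> bool:
    hit_identifiers = 0
    for category, value in identifiers.items():
        # Count how many policies of this category the identifier hits.
        hits = sum(1 for policy in sub_policies.get(category, ()) if policy(value))
        if hits >= min_policies:
            hit_identifiers += 1
    return hit_identifiers >= min_identifiers
```

In a fuller version each predicate would wrap a stateful frequency check for that category; plain predicates keep the sketch focused on how the hits are combined.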

Correspondingly, the first interception list may include a sub-interception list corresponding to the IP address, which contains the IP addresses to be intercepted; a sub-interception list corresponding to the UA, which contains the UAs to be intercepted; and a sub-interception list corresponding to the device number, which contains the device numbers to be intercepted. Based on this, the above anti-crawling method may further include: in response to the IP address carried by the page access request being present in the sub-interception list corresponding to the IP address, determining that the IP address carried by the page access request hits the sub-interception list corresponding to the category to which it belongs; in response to the UA carried by the page access request being present in the sub-interception list corresponding to the UA, determining that the UA carried by the page access request hits the sub-interception list corresponding to the category to which it belongs; and in response to the device number carried by the page access request being present in the sub-interception list corresponding to the device number, determining that the device number carried by the page access request hits the sub-interception list corresponding to the category to which it belongs.
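By way of illustration only, the per-category sub-interception lists described above may be sketched as follows. The class name `InterceptionList`, the category keys `ip`, `ua` and `device`, and the rule that a hit on any one sub-list counts as a hit on the whole list are assumptions made for this sketch and are not prescribed by the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

# Hypothetical category keys for the three source identifiers in the text.
CATEGORIES = ("ip", "ua", "device")


@dataclass
class InterceptionList:
    """An interception list made of one sub-interception list per category."""

    sub_lists: Dict[str, Set[str]] = field(
        default_factory=lambda: {c: set() for c in CATEGORIES}
    )

    def add(self, category: str, identifier: str) -> None:
        # Put an identifier to be intercepted into the sub-list of its category.
        self.sub_lists[category].add(identifier)

    def hits(self, request: Dict[str, Optional[str]]) -> bool:
        # Assumption: the request hits the list if any identifier it carries
        # is present in the sub-list of that identifier's category.
        return any(
            request.get(c) is not None and request[c] in self.sub_lists[c]
            for c in CATEGORIES
        )
```

Keeping one set per category means that an IP address is only ever compared against IP addresses, a UA against UAs, and a device number against device numbers.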

If the page access request hits the second interception policy, the first interception list is updated based on the page access request. In one implementation, the first interception list is updated directly based on the page access request. In another implementation, a second interception list stored in the service layer is updated based on the page access request, and the second interception list is periodically synchronized to the first interception list to update the first interception list.

In practical applications, a second interception list may be set in the service layer. If the page access request hits the second interception policy, the second interception list stored in the service layer is first updated based on the page access request, and the second interception list is then periodically synchronized to the first interception list. Periodic updating reduces the number of updates and therefore the performance consumption of the service layer.

In implementation, a script file for updating the first interception list may be preset at the traffic layer. The service layer may update the second interception list stored in the service layer based on the page access request by means of an asynchronous log or a synchronous log. The traffic layer then periodically acquires the second interception list by executing the script file and replaces the current first interception list with the acquired second interception list, thereby obtaining the latest first interception list.
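As a minimal sketch of the replacement step performed by such a script, assuming the second interception list can be fetched as a mapping from category to identifiers, the following function may be considered. The names `sync_once` and `fetch_second_list` are hypothetical; how the list is actually transported between the layers is not specified here.

```python
from typing import Callable, Dict, Iterable, Set

InterceptionList = Dict[str, Set[str]]


def sync_once(
    fetch_second_list: Callable[[], Dict[str, Iterable[str]]],
    first_list: InterceptionList,
) -> None:
    """Replace the contents of the first list with the fetched second list."""
    # fetch_second_list is a hypothetical callable that returns the service
    # layer's current second interception list.
    latest = {category: set(ids) for category, ids in fetch_second_list().items()}
    # Replace rather than merge, as the text describes: entries that are no
    # longer in the second list disappear from the first list as well.
    first_list.clear()
    first_list.update(latest)
```

The function would be invoked on a fixed schedule, for example from a timer or a scheduled job; the interval is a deployment choice.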

Synchronous logging means that the log is recorded synchronously while the page access request is being processed. Asynchronous logging processes the page access request first and records the log afterwards, and thus responds to the user faster.
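One possible way to realize the asynchronous mode, given purely as a sketch, is to hand each record to a background worker through a queue so that request handling does not wait for the list update. The names `record_async`, `log_worker`, `log_queue` and `second_list` are assumptions introduced for this example.

```python
import queue
import threading
from typing import Dict, Optional, Set, Tuple

# Hypothetical in-memory second interception list of the service layer.
second_list: Dict[str, Set[str]] = {"ip": set(), "ua": set(), "device": set()}
log_queue: "queue.Queue[Optional[Tuple[str, str]]]" = queue.Queue()


def log_worker() -> None:
    # Background worker: applies queued records to the second interception list.
    while True:
        record = log_queue.get()
        try:
            if record is None:  # sentinel value used to stop the worker
                return
            category, identifier = record
            second_list[category].add(identifier)
        finally:
            log_queue.task_done()


def record_async(category: str, identifier: str) -> None:
    # Called on the request path: enqueue and return immediately, so the
    # response to the user does not wait for the list update.
    log_queue.put((category, identifier))


worker = threading.Thread(target=log_worker, daemon=True)
worker.start()
```

In the synchronous mode, by contrast, the request path would add the identifier to the second interception list directly before returning.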

In an exemplary embodiment, the first interception policy may include a parameter verification policy. The parameter verification policy may verify whether the page access request contains preset common parameters. If it does not, the verification fails, which indicates that the page access request is not legitimate and may be a malicious request constructed for crawling page data, so interception may be performed. If it does, the verification succeeds and interception is not performed.

For example, an ordinary page access request carries common parameters such as the IP address and the UA. In this case, it may be verified whether the page access request contains the IP address and the UA; if not, the verification fails, and if so, the verification succeeds.
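The check in this example may be sketched as follows, assuming the request is represented as a mapping and that an empty value is treated the same as a missing one. The names `REQUIRED_PARAMS` and `passes_parameter_check` are hypothetical.

```python
from typing import Mapping, Optional

# Hypothetical set of preset common parameters, following the example above.
REQUIRED_PARAMS = ("ip", "ua")


def passes_parameter_check(request: Mapping[str, Optional[str]]) -> bool:
    """Return True if every preset common parameter is present and non-empty."""
    return all(request.get(name) for name in REQUIRED_PARAMS)
```

A request for which this function returns False would hit the parameter verification policy and may be intercepted at the traffic layer.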

The first interception policy may further include a sub-interception policy corresponding to the IP address. The sub-interception policy corresponding to the IP address in the first interception policy may be to intercept page access requests whose IP address has a page access frequency exceeding a fifth threshold. If the page access frequency of the same IP address is too high, page data may be being crawled, and interception may be performed. The specific value of the fifth threshold may be set according to the actual situation and is not limited herein.

The first interception policy may further include a sub-interception policy corresponding to the UA. The sub-interception policy corresponding to the UA in the first interception policy may be to intercept page access requests whose UA has a page access frequency exceeding a sixth threshold. If the page access frequency of the same UA is too high, page data may be being crawled, and interception may be performed. The specific value of the sixth threshold may be set according to the actual situation and is not limited herein.
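A frequency-based sub-interception policy of this kind may be sketched with a sliding time window, as below. The window length, the class name `FrequencyPolicy` and the method `record_and_check` are assumptions for illustration; the embodiment leaves the thresholds and the counting method open.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class FrequencyPolicy:
    """Counts accesses per key (an IP address or a UA) in a sliding window."""

    def __init__(self, threshold: int, window_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.window = window_seconds  # assumed window length
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def record_and_check(self, key: str, now: float) -> bool:
        """Record one access at time `now`; return True if the number of
        accesses within the window exceeds the threshold."""
        times = self._hits[key]
        times.append(now)
        # Drop accesses that have fallen out of the window.
        while times and times[0] <= now - self.window:
            times.popleft()
        return len(times) > self.threshold
```

One instance could be kept for IP addresses with the fifth threshold and another for UAs with the sixth threshold.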

In an exemplary embodiment, the above anti-crawling method may further include: determining that the page access request hits the first interception policy in response to the page access request hitting any one policy included in the first interception policy. In this way, in this embodiment, if the page access request hits any one policy, the page access request is considered to hit the first interception policy and can be intercepted at the traffic layer, thereby improving the interception rate.

Of course, it is also possible to determine that the page access request hits the first interception policy in response to the page access request hitting two or more policies included in the first interception policy. In this case the interception conditions are stricter, so interception accuracy can be improved and users who browse normally are less likely to be affected.
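The two alternatives above differ only in how many individual policies must be hit. A minimal sketch, in which each policy is modelled as a predicate over the request and `min_hits` is a hypothetical parameter, is:

```python
from typing import Callable, Iterable, Mapping

Policy = Callable[[Mapping[str, str]], bool]


def hits_first_policy(
    request: Mapping[str, str],
    policies: Iterable[Policy],
    min_hits: int = 1,
) -> bool:
    """Return True if the request hits at least `min_hits` individual policies."""
    hit_count = sum(1 for policy in policies if policy(request))
    return hit_count >= min_hits
```

With `min_hits=1` the behaviour corresponds to the "any one policy" embodiment; with `min_hits=2` it corresponds to the stricter "two or more policies" variant.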

In an exemplary embodiment, the above anti-crawling method may further include: after the traffic layer receives the page access request, determining whether the page access request hits the parameter verification policy in the first interception policy; in response to the page access request missing the parameter verification policy in the first interception policy, determining whether the page access request hits the sub-interception policy corresponding to the IP address in the first interception policy; and in response to the page access request missing the sub-interception policy corresponding to the IP address in the first interception policy, determining whether the page access request hits the sub-interception policy corresponding to the UA in the first interception policy. Alternatively, in response to the page access request missing the parameter verification policy in the first interception policy, determining whether the page access request hits the sub-interception policy corresponding to the UA in the first interception policy; and in response to the page access request missing the sub-interception policy corresponding to the UA in the first interception policy, determining whether the page access request hits the sub-interception policy corresponding to the IP address in the first interception policy.

In this embodiment, the page access request is first processed by the parameter verification policy in the first interception policy of the traffic layer, so that illegitimate page access requests are filtered out directly and need not be processed by the subsequent interception policies, thereby improving processing efficiency.

It may be appreciated that the page access request is intercepted in response to the page access request hitting the parameter verification policy in the first interception policy; the page access request is intercepted in response to the page access request hitting the sub-interception policy corresponding to the IP address in the first interception policy; and the page access request is intercepted in response to the page access request hitting the sub-interception policy corresponding to the UA in the first interception policy.
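The ordered evaluation described above, in which a later policy is consulted only when the earlier ones are missed, may be sketched as follows. The function name `first_policy_hit` and the representation of each check as a (name, predicate) pair are assumptions for illustration.

```python
from typing import Callable, Mapping, Optional, Sequence, Tuple

Request = Mapping[str, str]
Check = Tuple[str, Callable[[Request], bool]]


def first_policy_hit(
    request: Request, ordered_checks: Sequence[Check]
) -> Optional[str]:
    """Evaluate the checks in order and stop at the first one that is hit.

    Returns the name of the policy that was hit, or None if all are missed.
    """
    for name, check in ordered_checks:
        if check(request):
            return name  # the request is intercepted; later checks are skipped
    return None
```

Placing the parameter verification check first means that malformed requests never reach the frequency checks.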

The anti-crawling method provided by the embodiment of the invention is described in more detail below, taking a house rental service scenario as an example.

In the house rental service scenario of this embodiment, many competitors or malicious crawlers crawl the data of the rental list page, which may cause the following problems: first, data is stolen; second, excessive crawler traffic directly affects the load of the cluster machines and the stability of the system.

This embodiment sets interception policies specific to the house rental service, and combines the interception policy of the Nginx traffic layer with the interception policy of the service layer of the server in a synchronized manner.

For the Nginx traffic layer, the original simple frequency interception is changed to a strategy that, on top of the original frequency interception, periodically and automatically acquires the interception results of the service layer and intercepts accordingly. For the service layer, interception results are generated by the interception policies specific to the house rental service and are then synchronized to the Nginx traffic layer.

The interception policy specific to the house rental service, namely the second interception policy, is formulated by analyzing crawler characteristics in the specific scenario of the house rental service. The specific policies are listed as follows:

1. Continuous access to the sorted list page of the personal house source list page:

In this case, interception policies are set for the same device number accessing more than 60 times in one minute, the same IP address accessing more than 400 times in one minute, and the same UA accessing more than 400 times in one minute.

2. Continuously changing filter words to access the personal house source list page:

In this case, interception policies are set for the same device number changing more than 30 filter words in one minute, the same IP address changing more than 200 filter words in one minute, and the same UA changing more than 200 filter words in one minute.

3. Continuously changing cities to access the personal house source list page:

In this case, interception policies are set for the same device number changing more than 10 cities in one minute, the same IP address changing more than 20 cities in one minute, and the same UA changing more than 20 cities in one minute.

4. Frequent access to expired house sources, which occurs because crawlers update the information they hold:

In this case, interception policies are set for the same device number accessing more than 50 expired house sources in one minute, the same IP address accessing more than 120 expired house sources in one minute, and the same UA accessing more than 120 expired house sources in one minute.
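The four policies above may be viewed as a table of per-minute thresholds indexed by policy and by source identifier category. The sketch below encodes the thresholds stated in the text; the policy names, the category keys and the function `hit_policies` are hypothetical, and the per-minute counts are assumed to be computed elsewhere.

```python
from typing import Dict, List, Mapping

# Per-minute thresholds taken from the four policies listed above.
THRESHOLDS: Dict[str, Dict[str, int]] = {
    "sorted_list_accesses": {"device": 60, "ip": 400, "ua": 400},
    "filter_word_changes": {"device": 30, "ip": 200, "ua": 200},
    "city_changes": {"device": 10, "ip": 20, "ua": 20},
    "expired_source_accesses": {"device": 50, "ip": 120, "ua": 120},
}


def hit_policies(category: str, counts: Mapping[str, int]) -> List[str]:
    """Return the names of the policies whose threshold is exceeded.

    `counts` maps a policy name to the count observed in the last minute for
    one source identifier of the given category (for example one device number).
    """
    return [
        name
        for name, limits in THRESHOLDS.items()
        if counts.get(name, 0) > limits[category]
    ]
```

For example, a device number that changed cities 11 times in the last minute exceeds the device threshold of 10, whereas the same count attributed to an IP address stays below the IP threshold of 20.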

At the traffic layer, the policies targeting the characteristics of general crawlers, namely the first interception policy, are listed as follows:

1. Frequency policy:

Specifically, interception policies are set for more than 500 accesses from the same IP address and more than 500 accesses from the same UA, namely the sub-interception policy corresponding to the IP address in the first interception policy and the sub-interception policy corresponding to the UA in the first interception policy.

2. Parameter verification policy:

Specifically, an interception policy is set for failure of common parameter verification.

Based on this, the crawling prevention flow provided in this embodiment is as follows:

First, as shown in fig. 2, after receiving a page access request from a client 202, the traffic layer 211 of the server 201 determines whether the page access request hits the first interception policy. If the page access request does not hit the first interception policy, the traffic layer 211 determines whether the page access request hits the first interception list. If the page access request hits the first interception list, interception is performed; if it does not hit the first interception list, the page access request is sent to the service layer 221.

Second, after receiving the page access request, the service layer 221 determines whether the page access request hits the second interception policy. If the page access request does not hit the second interception policy, the page requested by the page access request is acquired and sent to the client. If the page access request hits the second interception policy, the second interception list of the service layer 221 is updated based on the page access request by means of an asynchronous log.

Third, the traffic layer 211 periodically acquires the second interception list by executing a preset script file and replaces the first interception list currently stored by the traffic layer with the acquired second interception list to obtain the latest first interception list, so that subsequent page access requests with the same source as the page access request can hit the first interception list, that is, such requests are intercepted at the traffic layer.

The specific implementation of the first to third steps may refer to the above related embodiments, and will not be described herein.
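The three steps may be tied together in a compact in-memory sketch such as the following. The class `Server`, its methods, and the returned status strings are assumptions for illustration, the two policies are passed in as predicates, and the response returned when the second interception policy is hit is a placeholder because the steps above only state that the second interception list is updated.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Mapping, Set

Request = Mapping[str, str]
CATEGORIES = ("ip", "ua", "device")  # hypothetical category keys


def empty_list() -> Dict[str, Set[str]]:
    return {c: set() for c in CATEGORIES}


@dataclass
class Server:
    first_policy: Callable[[Request], bool]   # traffic layer policy
    second_policy: Callable[[Request], bool]  # service layer policy
    first_list: Dict[str, Set[str]] = field(default_factory=empty_list)
    second_list: Dict[str, Set[str]] = field(default_factory=empty_list)

    def handle(self, request: Request) -> str:
        # Step 1 (traffic layer): first interception policy, then first list.
        if self.first_policy(request):
            return "intercepted_by_policy"
        if any(request.get(c) in self.first_list[c] for c in CATEGORIES):
            return "intercepted_by_list"
        # Step 2 (service layer): second interception policy.
        if self.second_policy(request):
            # Add each carried identifier to the sub-list of its category.
            for c in CATEGORIES:
                if request.get(c):
                    self.second_list[c].add(request[c])
            return "flagged_by_service_layer"  # placeholder status
        return "page_served"

    def sync(self) -> None:
        # Step 3: the traffic layer replaces its list with the second list.
        self.first_list = {c: set(ids) for c, ids in self.second_list.items()}
```

Until `sync` runs, a repeated request from the same source still reaches the service layer; after it runs, any request sharing an IP address, UA or device number with a flagged request is stopped at the traffic layer.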

In this embodiment, the crawler characteristics of the house rental service are analyzed to obtain an anti-crawling policy for the house rental service, which provides strong support for improving the system stability of the house rental service and guaranteeing the user experience. Through the real-time linkage of the interception policy of the traffic layer and the interception policy of the service layer, the CPU load is reduced by more than 20%, the load pressure on the machines is reduced, and resources and cost are saved. The IP addresses, UAs, device numbers and the like generated by the interception policies can be reused in other core service scenarios of the house rental service, which can reduce leakage of core service data.

Fig. 3 is a schematic structural view of a crawling preventing device according to an embodiment of the present invention. As shown in fig. 3, the crawling preventing apparatus 300 is applied to a server, the server includes a traffic layer and a service layer, and the apparatus includes:

the first interception module 301 is configured to, after the traffic layer receives the page access request, send the page access request to the service layer in response to the page access request not hitting a first interception policy preset in the traffic layer and not hitting a first interception list stored in the traffic layer;

the second interception module 302 is configured to, after the service layer receives the page access request, update the first interception list based on the page access request in response to the page access request hitting a second interception policy preset by the service layer, so that a page access request with the same source as the page access request can hit the first interception list.

the second interception module 302 is specifically configured to:

In one possible implementation, the second interception module 302 is further configured to:

the second interception module 302 is further configured to:

In a possible implementation manner, the second interception module 302 is specifically configured to:

In a possible implementation manner, the first interception list is an interception list of the target service, as shown in fig. 4, and the apparatus further includes:

the sending module 303 is configured to synchronize the first interception list to an interception list of a service other than the target service.
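As a sketch of what such a sending module might do, assuming that synchronization means merging the identifiers of the target service's list into the lists of the other services (the text does not say whether entries are merged or replaced), the following helper may be considered. The name `propagate` is hypothetical.

```python
from typing import Dict, Iterable, Set

InterceptionList = Dict[str, Set[str]]


def propagate(
    first_list: InterceptionList, other_lists: Iterable[InterceptionList]
) -> None:
    """Merge every identifier of the target service's list into each other list."""
    for other in other_lists:
        for category, identifiers in first_list.items():
            # Assumption: merge rather than replace, so that entries the other
            # service already holds are kept.
            other.setdefault(category, set()).update(identifiers)
```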

An electronic device provided by an embodiment of the present invention includes a processor, a memory, and a computer program stored on the memory and executable on the processor. When executed by the processor, the computer program implements the steps of the anti-crawling method in any of the above embodiments and can achieve the same technical effects, which are not repeated here.

Fig. 5 is a schematic structural diagram of an exemplary electronic device according to an embodiment of the present invention. As shown in fig. 5, the electronic device may include: the device comprises a processor 501, a communication interface 502, a memory 503 and a communication bus 504, wherein the processor 501, the communication interface 502 and the memory 503 are in communication with each other through the communication bus 504. The processor 501 may call a computer program in the memory 503 to perform the anti-crawling method in any of the above embodiments.

The embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the crawling prevention method as in any of the above embodiments, and can achieve the same technical effects, so that repetition is avoided, and no further description is given here. Wherein the computer readable storage medium is selected from Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic disk or optical disk.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.

The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present invention and the scope of the claims, which are to be protected by the present invention.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, etc.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

1. A crawling prevention method, applied to a server, the server comprising a traffic layer and a service layer, the method comprising:

after the traffic layer receives a page access request, in response to the page access request missing a first interception policy preset by the traffic layer and missing a first interception list stored by the traffic layer, sending the page access request to the service layer;

after the service layer receives the page access request, in response to the page access request hitting a second interception policy preset by the service layer, updating the first interception list based on the page access request so that a page access request with the same source as the page access request can hit the first interception list, wherein the page access request carries at least one source identifier, the first interception list comprises a sub-interception list corresponding to the category of the source identifier, and the sub-interception list corresponding to the category of the source identifier comprises source identifiers to be intercepted;

The updating the first interception list based on the page access request comprises:

and adding each source identifier carried by the page access request to a sub-interception list corresponding to the category to which the source identifier belongs so as to update the first interception list.

2. The method according to claim 1, wherein the method further comprises:

3. The method according to claim 1, wherein the page access request carries at least one source identifier, and the second interception policy includes a sub-interception policy corresponding to a type of the source identifier;

the method further comprises the steps of:

in response to any source identifier carried by the page access request hitting the sub-interception policy corresponding to the category to which the source identifier belongs, determining that the page access request hits the second interception policy.

4. A method according to claim 3, wherein the sub-interception policy corresponding to the category of the source identifier comprises at least one of the following policies:

5. The method as recited in claim 4, further comprising:

in response to the source identifier carried by the page access request hitting any one policy included in the sub-interception policy corresponding to the category to which the source identifier belongs, determining that the source identifier hits the sub-interception policy corresponding to the category to which the source identifier belongs.

6. The method of claim 1, wherein updating the first interception list based on the page access request comprises:

the second interception list is synchronized to the first interception list periodically to update the first interception list.

7. The method according to any one of claims 1 to 6, wherein the first interception list is an interception list of a target service, the method further comprising:

and synchronizing the first interception list to the interception list of other businesses except the target business.

8. An anti-crawling apparatus, for use with a server, the server comprising a traffic layer and a service layer, the apparatus comprising:

the first interception module is configured to, after the traffic layer receives a page access request, send the page access request to the service layer in response to the page access request missing a first interception policy preset by the traffic layer and missing a first interception list stored by the traffic layer;

the second interception module is configured to, after the service layer receives the page access request, update the first interception list based on the page access request in response to the page access request hitting a second interception policy preset by the service layer, so that a page access request with the same source as the page access request can hit the first interception list, wherein the page access request carries at least one source identifier, the first interception list comprises a sub-interception list corresponding to the category of the source identifier, and the sub-interception list corresponding to the category of the source identifier comprises source identifiers to be intercepted;

The second interception module is further configured to add each source identifier carried by the page access request to a sub-interception list corresponding to a category to which the source identifier belongs, so as to update the first interception list.

9. The apparatus of claim 8, wherein the second interception module is further configured to:

10. The apparatus of claim 8, wherein the page access request carries at least one source identifier, and the second interception policy includes a sub-interception policy corresponding to a type of the source identifier;

the second interception module is further configured to:

11. The apparatus of claim 10, wherein the sub-interception policy corresponding to the category of the source identifier comprises at least one of:

12. The apparatus of claim 11, wherein the second interception module is further configured to:

13. The device according to claim 8, wherein the second interception module is specifically configured to:

14. The apparatus according to any one of claims 8 to 13, wherein the first interception list is an interception list of a target service, the apparatus further comprising:

And the sending module is used for synchronizing the first interception list to the interception list of other businesses except the target business.

15. An electronic device, comprising: a processor, a memory and a computer program stored on the memory and executable on the processor, which when executed by the processor performs the steps of the anti-crawling method of any one of claims 1 to 7.

16. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the anti-crawling method of any of claims 1 to 7.