CN109241733A

CN109241733A - Crawler behavior recognition method and device based on web access log

Info

Publication number: CN109241733A
Application number: CN201810889455.1A
Authority: CN
Inventors: 樊恒阳; 潘钧康
Original assignee: NSFOCUS Information Technology Co Ltd; Beijing NSFocus Information Security Technology Co Ltd
Current assignee: NSFOCUS Information Technology Co Ltd; Beijing NSFocus Information Security Technology Co Ltd
Priority date: 2018-08-07
Filing date: 2018-08-07
Publication date: 2019-01-18

Abstract

The present application discloses a crawler behavior recognition method and device based on a web access log. The method obtains the access log of an access originator to be identified and, according to the access time, the access URL, and the reference URL corresponding to the access URL in the access log, obtains access characteristic information of the individual pages accessed by the access originator to be identified, where an individual page is a page corresponding to an access URL whose out-degree is not 0. When the access characteristic information matches preset crawler access characteristic information, the access originator to be identified is determined to have crawler behavior. The application thus analyzes the obtained access log to derive the access characteristic information of the access originator to be identified and compares it with the preset crawler access characteristic information to determine that the access originator has crawler behavior, so that the user agent identifier in an access originator having crawler behavior can be tracked or intercepted, which improves the accuracy and security of identifying web crawler behavior.

Description

Crawler behavior recognition method and device based on web access log

Technical field

The present application relates to the field of network security, and in particular to a crawler behavior recognition method and device based on a web access log.

Background

With the development of various Web application technologies, a large number of web crawlers that automatically obtain Web page information have appeared on the network. Web crawlers are a basic component of search engine technology. A web crawler starts from the URLs (Uniform Resource Locators) of one or several initial web pages and obtains the URLs on those initial pages; while crawling web page information, it continually extracts new URLs from the current page according to its crawl strategy and puts them into a queue until a certain stop condition is met. The crawled web page information is then stored on the search engine's servers. Web crawlers may also crawl user data and business data in order to mine user privacy or to monitor public opinion.

The traditional method of identifying web crawlers is a threshold-based statistical method: count the total number of URL requests that a given IP generates against the target website within a period of time, and if that value exceeds a set threshold, consider the source IP to be a web crawler, that is, not a real user.

However, the above method treats each source IP as one access originator. In a shared-IP scenario, the same source IP represents several real users, and when many normal users share one IP, these normal users may be mistaken for a web crawler. For example, if many normal users share one IP while a website is running a promotional campaign, and the total number of URLs generated under that IP exceeds the set threshold, the real users under that IP are identified as a web crawler. The accuracy of the traditional method of identifying web crawlers is therefore not high.
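The limitation can be illustrated with a minimal Python sketch of the threshold method. The function name `flag_by_ip_threshold`, the sample addresses, and the threshold value are hypothetical and chosen only for illustration; they do not come from the application.

```python
from collections import Counter

def flag_by_ip_threshold(requests, threshold):
    """Return source IPs whose total request count exceeds the threshold.

    requests: iterable of (source_ip, url) pairs seen in one time period.
    """
    counts = Counter(ip for ip, _ in requests)
    return {ip for ip, total in counts.items() if total > threshold}

# Hypothetical data: three real users behind one shared IP, two pages each.
shared_ip_requests = [("198.51.100.1", f"/item/{user}-{n}")
                      for user in ("u1", "u2", "u3") for n in range(2)]
# Hypothetical data: one user on a separate IP requesting two pages.
single_user_requests = [("198.51.100.2", "/item/a"), ("198.51.100.2", "/item/b")]

flagged = flag_by_ip_threshold(shared_ip_requests + single_user_requests, threshold=4)
```

In this sketch the shared IP is flagged even though no individual user behind it requested more than two pages, which is the kind of misjudgment described above.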

Summary of the invention

The embodiments of the present application provide a crawler behavior recognition method and device based on a web access log, to improve the accuracy of identifying web crawlers.

In a first aspect, a crawler behavior recognition method based on a web access log is provided. The method may include:

obtaining the access log of an access originator to be identified, where the access originator to be identified is determined by a source IP address and the user agent identifier of a client;

obtaining, according to the access time, the access uniform resource locator (URL), and the reference URL corresponding to the access URL in the access log, access characteristic information of the individual pages accessed by the access originator to be identified within a preset time period, where an individual page is a page whose access URL has an out-degree that is not 0, and the access characteristic information is the access behavior information of the access originator to be identified accessing the individual pages;

determining, according to the access characteristic information, that the access originator to be identified has crawler behavior.

It can be seen that the application obtains the access characteristic information of the access originator to be identified by analyzing the access log, compares it with preset crawler access characteristic information, and determines that the access originator to be identified has crawler behavior, so that the user agent identifier in the access originator having crawler behavior can be tracked or intercepted, which improves the accuracy and security of identifying web crawler behavior.

In an optional implementation, the access characteristic information includes an access rate and at least one of access randomness and an access number;

obtaining, according to the access time, the access URL, and the reference URL corresponding to the access URL in the access log, the access characteristic information of the individual pages accessed by the access originator to be identified includes:

extracting the individual pages among the access URLs according to the access URLs and the reference URLs corresponding to the access URLs in the access log;

obtaining, according to the access time and the at least two extracted individual pages, the access rate within the preset time period and at least one of the access randomness and the access number.

The above scheme is one specific implementation of obtaining the access characteristic information of individual pages.

In an optional implementation, determining, according to the access characteristic information, that the access originator to be identified has crawler behavior includes:

if the access characteristic information meets a preset crawler access condition, determining that the access originator to be identified has crawler behavior;

where, when the access characteristic information includes the access rate and the access randomness, the preset crawler access condition includes: the access rate is greater than a preset access rate and the access has no randomness;

when the access characteristic information includes the access rate, the access randomness, and the access number, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access has no randomness, and the access number is greater than a preset access number;

when the access characteristic information includes the access rate and the access number, the preset crawler access condition includes: the access rate is greater than the preset access rate and the access number is greater than the preset access number.

It can be seen that this scheme includes three specific implementations for determining whether the access originator to be identified has crawler behavior; checking multiple pieces of access characteristic information improves the accuracy of identification.

In an optional implementation, the access characteristic information further includes a repeated access rate;

if the access characteristic information meets the preset crawler access condition, determining that the access originator to be identified has crawler behavior includes:

when the access characteristic information includes the access rate, the access randomness, the access number, and the repeated access rate, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access has no randomness, the access number is greater than the preset access number, and the repeated access rate is less than a preset repeated access rate;

when the access characteristic information includes the access rate, the access randomness, and the repeated access rate, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access has no randomness, and the repeated access rate is less than the preset repeated access rate;

when the access characteristic information includes the access rate, the access number, and the repeated access rate, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access number is greater than the preset access number, and the repeated access rate is less than the preset repeated access rate.

It can be seen that, to further improve the accuracy of the recognition method, the repeated access rate is introduced as access characteristic information, and this scheme includes three further specific implementations for determining whether the access originator to be identified has crawler behavior.

In an optional implementation, after determining that the access originator to be identified has crawler behavior, the method further includes:

sending prompt information to the client, where the prompt information is used to show the user that the access originator to be identified has crawler behavior.

In an optional implementation, before obtaining the access characteristic information of the individual pages accessed by the access originator to be identified within the preset time period, the method further includes:

querying the access log of the access originator to be identified and obtaining multiple access paths in the access log, where an access path is the path formed by the access URLs accessed by the access originator to be identified;

extracting, from the multiple access paths, the pages whose access URL has an out-degree that is not 0 as individual pages.

The above scheme is one way of obtaining individual pages.

In a second aspect, a crawler behavior recognition device is provided. The device may include:

an acquiring unit, configured to obtain the access log of an access originator to be identified, where the access originator to be identified is determined by a source IP address and the user agent identifier of a client;

a determination unit, configured to determine, according to the access characteristic information, that the access originator to be identified has crawler behavior.

The acquiring unit is specifically configured to extract at least two individual pages among the access URLs according to the access URLs and the reference URLs corresponding to the access URLs in the access log.

In an optional implementation, the determination unit is specifically configured to determine that the access originator to be identified has crawler behavior if the access characteristic information meets the preset crawler access condition.

The determination unit is specifically configured to determine that the access originator to be identified has crawler behavior if the access characteristic information meets the preset crawler access condition;

where, when the access characteristic information includes the access rate, the access randomness, the access number, and the repeated access rate, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access has no randomness, the access number is greater than the preset access number, and the repeated access rate is less than the preset repeated access rate.

In an optional implementation, the acquiring unit is further configured to, before obtaining the access characteristic information of the individual pages accessed by the access originator to be identified within the preset time period, query the access log of the access originator to be identified;

and obtain multiple access paths in the access log, where an access path is the path formed by the access URLs accessed by the access originator to be identified.

In a third aspect, an electronic device is provided. The electronic device includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;

the memory is configured to store a computer program;

the processor is configured to implement the method steps of any one of the first aspect above when executing the program stored in the memory.

In a fourth aspect, a computer-readable storage medium is provided. A computer program is stored in the computer-readable storage medium, and when executed by a processor, the computer program implements the method steps of any one of the first aspect above.

The embodiments of the present application disclose a crawler behavior recognition method and device based on a web access log. The method obtains the access log of an access originator to be identified and, according to the access time, the access URL, and the reference URL corresponding to the access URL in the access log, obtains access characteristic information of the individual pages accessed by the access originator to be identified, where an individual page is a page whose access URL has an out-degree that is not 0. When the access characteristic information matches preset crawler access characteristic information, the access originator to be identified is determined to have crawler behavior. The application thus analyzes the obtained access log to derive the access characteristic information of the access originator to be identified and compares it with the preset crawler access characteristic information to determine that the access originator has crawler behavior, so that the user agent identifier in an access originator having crawler behavior can be tracked or intercepted, which improves the accuracy and security of identifying web crawler behavior.

Description of the drawings

Fig. 1 is a schematic flowchart of a crawler behavior recognition method based on a web access log provided in an embodiment of the present invention;

Fig. 2 is a schematic flowchart of another crawler behavior recognition method based on a web access log provided in an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a crawler behavior recognition device provided in an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present invention.

Specific embodiments

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.

The crawler behavior recognition method based on a web access log provided in the embodiments of the present invention is applied on a server. To guarantee the accuracy of the crawler behavior recognition method, the server may have relatively strong computing capability.

For the shared-IP scenario, the crawler behavior recognition method of the present application jointly identifies one access originator by the source IP accessing the web page and the user agent identifier, such as the User-Agent, of the client used by each user, where the user agent identifier uniquely identifies the client that a user uses. When the same source IP accesses the target website using different client devices, it is determined that different access originators are involved. Two access originators are confirmed to be the same access originator if and only if both their source IP and their client User-Agent are identical.
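As a minimal sketch of this identification rule, the following Python groups log entries by the pair (source IP, User-Agent). The field names `ip`, `user_agent`, and `url`, the function name, and the sample values are hypothetical and are not taken from the application.

```python
from collections import defaultdict

def group_by_access_originator(entries):
    """Group log entries by (source IP, User-Agent); each pair is one access originator."""
    originators = defaultdict(list)
    for entry in entries:
        key = (entry["ip"], entry["user_agent"])
        originators[key].append(entry)
    return dict(originators)

# Hypothetical entries: one shared source IP, two different clients.
entries = [
    {"ip": "203.0.113.7", "user_agent": "BrowserA/1.0", "url": "/index.html"},
    {"ip": "203.0.113.7", "user_agent": "BrowserB/2.0", "url": "/index.html"},
    {"ip": "203.0.113.7", "user_agent": "BrowserA/1.0", "url": "/blog/2018.html"},
]
originators = group_by_access_originator(entries)
```

Here the single source IP yields two access originators, one per User-Agent value, so each can be evaluated separately.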

The crawler behavior recognition method based on a web access log obtains the access log of the access originator to be identified and, according to the access time, the access URL, and the reference URL corresponding to the access URL in the access log, obtains the access characteristic information of the individual pages accessed by the access originator to be identified, where the access characteristic information is the access behavior information of the access originator to be identified accessing the individual pages. When the access characteristic information meets the preset crawler access condition, the access originator to be identified is determined to have crawler behavior, and the user agent identifier in the access originator having crawler behavior can then be tracked or intercepted.

An individual page is a page whose access URL has an out-degree that is not 0, and the access URLs in the present application are URLs corresponding to non-style files.

Optionally, individual pages are extracted as follows: obtain the access log of the access originator, build a directed graph of access URLs from the access URLs and reference URLs, select from the directed graph the URLs whose out-degree is not 0 as nodes, and determine the URLs corresponding to those nodes to be individual pages. The directed graph formed by the individual-page nodes reflects the access path of the access originator.

By contrast, URLs of non-individual pages (out-degree of 0), such as image URLs and script file URLs, are requested only because the browser actively initiates requests in order to render the complete content of an individual page; their number is determined by factors such as the richness of the accessed website's content and the complexity of the website's design. While an access originator accesses a website, the number of non-individual pages is far greater than the number of individual pages, so this part of the data strongly affects the access characteristic information.

It should be noted that, in the prior art, statistics on the above access characteristics are computed over all access log entries generated by the access originator (individual pages and non-individual pages).

It can be seen that the application analyzes the obtained access log to derive the access characteristic information of the access originator to be identified and checks whether that access characteristic information meets the preset crawler access condition, thereby determining whether the access originator to be identified has crawler behavior. This method improves the accuracy of identifying web crawler behavior.

Preferred embodiments of the present application are described below with reference to the drawings. It should be understood that the preferred embodiments described here are only used to illustrate and explain the present invention and are not intended to limit it, and that, where there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with one another.

Fig. 1 is a schematic flowchart of a crawler behavior recognition method based on a web access log provided in an embodiment of the present invention. The method may be executed by a server. As shown in Fig. 1, the method may include:

Step 110, obtain the access log of the access originator to be identified, where the access originator to be identified is determined by a source IP address and the user agent identifier of a client.

Under the same source IP address, in order to accurately distinguish different access originators, the source IP address and the client's user agent identifier can be used together to identify the access originator to be identified.

The Web server log records various raw information, such as the client requests that the Web server receives and processes and runtime errors. By collecting statistics on, analyzing, and synthesizing the log, one can effectively grasp the server's running state, discover and eliminate the causes of errors, understand the access characteristics of clients, and so on, and thus better maintain and manage the system. The Web service pattern mainly has three steps:

Service request: contains much basic information about the client, such as the IP address, browser type, and target URL.

Service response: after receiving the request, the Web server runs the corresponding function as the user requires and returns the information to the user. If an error occurs, an error response code is returned.

Log append: the server saves the relevant information of the client's access into the log file.

Part of the content of the access log is shown in Table 1:

Table 1

Table 1 shows the access log entries generated on the Web server by four requests from the same client. The terminal sends each request to the server using the GET method, and the response code is 200, indicating that the page was fetched normally. For request number 1, the terminal accesses the URL "/index.html" on the server by GET at time T1, and this URL has no reference URL. For request number 2, the terminal accesses the URL "/images/logo.png" by GET at time T2, and the reference URL of this URL is "/index.html". For request number 3, the terminal accesses the URL "/blog/2018.html" by GET at time T3, and the reference URL of this URL is "/index.html". For request number 4, the terminal accesses the URL "/blog/2019.html" by GET at time T4, and the reference URL of this URL is "/index.html".

Step 120, according to the access time, the access URL, and the reference URL corresponding to the access URL in the access log, obtain the access characteristic information of the individual pages accessed by the access originator to be identified.

An individual page is a page whose access URL has an out-degree that is not 0. Taking a target access URL as the starting point, its out-degree is the number of access URLs entered from that target access URL.

Optionally, individual pages may be obtained as follows: query the access log of the access originator to be identified and obtain multiple access paths in the access log, where an access path is the path formed by the access URLs accessed by the access originator to be identified; then extract, from the multiple access paths, the pages whose access URL has an out-degree that is not 0 as individual pages.

For example, if an access URL has a reference URL, and that reference URL has itself appeared as an access URL in the access log, the out-degree of the reference URL is considered not to be 0. In Table 1, the reference URL "/index.html" of request 2 appeared as the access URL of request 1, so the out-degree of "/index.html" is not 0; that is, the access URL "/index.html" is an individual page.
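The out-degree rule can be sketched in Python as follows, using the four requests described for Table 1 as input. The function name `extract_individual_pages` and the tuple layout are assumptions made for illustration; the application does not prescribe a data structure.

```python
from collections import defaultdict

def extract_individual_pages(records):
    """Return the access URLs whose out-degree in the reference graph is not 0.

    records: list of (access_url, reference_url) pairs; reference_url is None
    when the request has no reference URL.
    """
    accessed = {url for url, _ in records}
    out_edges = defaultdict(set)  # reference URL -> access URLs entered from it
    for url, ref in records:
        # Only count an edge when the reference URL itself appears as an access URL.
        if ref is not None and ref in accessed:
            out_edges[ref].add(url)
    return {url for url in accessed if len(out_edges.get(url, ())) > 0}

# The four requests of Table 1 as (access URL, reference URL).
table_1 = [
    ("/index.html", None),
    ("/images/logo.png", "/index.html"),
    ("/blog/2018.html", "/index.html"),
    ("/blog/2019.html", "/index.html"),
]
individual_pages = extract_individual_pages(table_1)
```

Under this reading, only "/index.html" qualifies in the Table 1 sample, because none of the other three access URLs appears as the reference URL of another request.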

A real user accesses a website by manually clicking to visit pages and consulting the information on them, which takes reading time, and occasionally revisits the same page. A web crawler collects target data purposefully, can quickly crawl the data of all pages on the target website, and avoids repeated crawling.

For access to individual pages, the rate at which a real user accesses at least two individual pages is far lower than the rate at which a web crawler accesses the same at least two individual pages. Therefore, to distinguish real users from web crawlers, the access characteristic information of the access originator to be identified accessing at least two individual pages can be obtained. The access characteristic information of individual pages may include the access rate, access randomness, access number, repeated access rate, and so on.

The access rate is the rate at which the access originator accesses two different individual pages in succession.

Access with randomness refers to the uncertainty exhibited by each access event in the set of access events to different individual pages, each of which occurs with some probability; that is, access randomness shows up as the absence of a discoverable pattern in how different individual pages are accessed. Conversely, access without randomness means that access to different individual pages follows a discoverable pattern, for example the access URLs exhibit depth-traversal and/or breadth-traversal characteristics. For example, URL1 is accessed first, URL2 is accessed second, and URL1 is the reference URL of URL2.
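The application does not give a concrete test for randomness, so the following Python is only one possible heuristic, stated under explicit assumptions: a step counts as "regular" when the new page was reached from the previous page (depth traversal) or from the same reference URL as the previous page (breadth traversal), and access is treated as lacking randomness when the share of regular steps reaches a hypothetical threshold of 0.8. The function name and threshold are illustrative only.

```python
def lacks_randomness(records, threshold=0.8):
    """Heuristic check for traversal-like access (assumption, not the patented test).

    records: time-ordered list of (access_url, reference_url) pairs for
    individual pages; reference_url may be None.
    """
    if len(records) < 2:
        return False
    regular_steps = 0
    for (prev_url, prev_ref), (url, ref) in zip(records, records[1:]):
        depth_step = ref == prev_url                        # reached from the previous page
        breadth_step = ref is not None and ref == prev_ref  # sibling of the previous page
        if depth_step or breadth_step:
            regular_steps += 1
    return regular_steps / (len(records) - 1) >= threshold

# Hypothetical sequences for illustration.
traversal_like = [("/a", None), ("/a/1", "/a"), ("/a/2", "/a"), ("/a/3", "/a")]
irregular = [("/a", None), ("/x", None), ("/q", "/z"), ("/a", None)]
```

A production check would likely need a richer model of depth-first and breadth-first traversal; this sketch only shows how the reference URL can be used to detect the pattern given in the example above.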

The access number is the number of times the access originator accesses individual pages, including repeated accesses to the same individual page.

The repeated access rate is the proportion of the access number accounted for by the access originator's accesses to the same individual page.
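These two counts can be sketched in Python as below. The sketch assumes that a "repeated" access is any access to an individual page that the access originator has already visited; the application does not spell out this detail, and the function names are hypothetical.

```python
def access_number(urls):
    """Total accesses to individual pages, repeated accesses included."""
    return len(urls)

def repeated_access_rate(urls):
    """Share of accesses that revisit an already-visited individual page (assumed definition)."""
    if not urls:
        return 0.0
    repeats = len(urls) - len(set(urls))
    return repeats / len(urls)

# Hypothetical sequence of individual-page accesses by one access originator.
visited = ["/a", "/b", "/a", "/c"]
number = access_number(visited)
rate = repeated_access_rate(visited)
```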

The server extracts at least two individual pages among the access URLs according to the access URLs and the reference URLs corresponding to the access URLs in the access log, and then, according to the access time and the at least two extracted individual pages, obtains the access rate, access randomness, access number, and repeated access rate within the preset time period. The preset time period may be, for example, 1 hour or 10 minutes.
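A minimal Python sketch of computing the access rate within a preset time period follows. It assumes the rate is measured as transitions between consecutive individual-page accesses per second, which is one plausible reading of the definition; the function name, the timestamp unit (seconds), and the 600-second default window are hypothetical.

```python
def access_rate_in_window(timestamps, window_start, window_seconds=600):
    """Transitions between individual-page accesses per second inside the window."""
    in_window = sorted(t for t in timestamps
                       if window_start <= t < window_start + window_seconds)
    if len(in_window) < 2:
        return 0.0  # fewer than two individual pages: no rate can be measured
    elapsed = in_window[-1] - in_window[0]
    if elapsed == 0:
        return float("inf")  # all accesses share one timestamp
    return (len(in_window) - 1) / elapsed

# Hypothetical access times in seconds; the last one falls outside the window.
rate = access_rate_in_window([0, 10, 20, 700], window_start=0)
```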

Further, to improve identification accuracy, the server may obtain, from all of the above access characteristic information, the access rate and at least one of the access randomness and the access number; alternatively, the server may obtain the access rate, the repeated access rate, and at least one of the access randomness and the access number. The access rate can be used to find that an access originator is suspected of crawler behavior; the access randomness, access number, and repeated access rate can be used to confirm that the access originator has crawler behavior; and the repeated access rate can also be used to rule out crawler behavior. That is, for the access log generated by the same client, when the access rate exceeds the preset access rate, the access originator is considered a suspected crawler; when the access randomness and access number also cross their respective preset thresholds, the access originator can be confirmed to be a crawler; but when the repeated access rate is greater than the preset repeated access rate, the server treats the result "the access originator is confirmed to be a crawler" as a false positive.

Step 130, according to the access characteristic information, determine that the access originator to be identified has crawler behavior.

The preset crawler access condition may include: the access rate is greater than the preset access rate, the access has no randomness, and the access number is greater than the preset access number.

The server judges whether the access characteristic information meets the preset crawler access condition, and if it does, determines that the access originator to be identified has crawler behavior.

When the access characteristic information includes the access rate and the access randomness, the preset crawler access condition includes: the access rate is greater than the preset access rate and the access has no randomness.

When the access characteristic information includes the access rate, the access randomness, and the access number, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access has no randomness, and the access number is greater than the preset access number.

When the access characteristic information includes the access rate and the access number, the preset crawler access condition includes: the access rate is greater than the preset access rate and the access number is greater than the preset access number.

Optionally, to further improve the accuracy of identification, the access characteristic information further includes the repeated access rate, and the preset crawler access condition may further include: the repeated access rate is less than the preset repeated access rate. The repeated access rate can serve as identification error-correction information for the recognition result determined by the access rate and at least one of the access randomness and the access number; that is, when the repeated access rate is relatively large, the recognition result is judged to be a false positive.

Specifically, when the access characteristic information includes the access rate, the access randomness, and the repeated access rate, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access has no randomness, and the repeated access rate is less than the preset repeated access rate.

When the access characteristic information includes the access rate, the access number, and the repeated access rate, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access number is greater than the preset access number, and the repeated access rate is less than the preset repeated access rate.

When the access characteristic information includes the access rate, the access randomness, the access number, and the repeated access rate, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access has no randomness, the access number is greater than the preset access number, and the repeated access rate is less than the preset repeated access rate.

If the access characteristic information does not meet the preset crawler access condition, the access of the access originator to be identified is determined to be human behavior.
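The combinations of conditions listed above can be sketched as a single Python check in which every characteristic that is present must satisfy its condition. The dictionary keys (`rate`, `random`, `number`, `repeat_rate`), the function name, and the preset values are hypothetical and chosen only for illustration.

```python
def meets_crawler_condition(features, presets):
    """Check the preset crawler access condition for the characteristics supplied.

    features always contains "rate"; "random" (bool), "number", and
    "repeat_rate" are optional, mirroring the combinations in the text.
    """
    checks = [features["rate"] > presets["rate"]]
    if "random" in features:
        checks.append(not features["random"])  # access has no randomness
    if "number" in features:
        checks.append(features["number"] > presets["number"])
    if "repeat_rate" in features:
        checks.append(features["repeat_rate"] < presets["repeat_rate"])
    return all(checks)

# Hypothetical preset values.
presets = {"rate": 1.0, "number": 100, "repeat_rate": 0.1}
fast_systematic = {"rate": 5.0, "random": False, "number": 500, "repeat_rate": 0.01}
fast_but_repetitive = {"rate": 5.0, "random": False, "number": 500, "repeat_rate": 0.4}
```

With these sample values the first originator meets the condition, while the second is excluded by its high repeated access rate, matching the error-correction role described above.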

Optionally, after determining that the access originator to be identified has crawler behavior, the server may also match the source IP address in the access originator having crawler behavior against the IP addresses in a preset normal crawler library;

if the match succeeds, the access originator is determined to be a normal crawler, such as a white crawler, that is, a crawler that poses no threat.

The user agent identifier of the access originator is then checked to determine the type of data the access originator crawls, such as image data or text data;

if the match fails, the access originator is determined to be an abnormal crawler, such as a malicious crawler. The server may store the malicious crawler in a preset abnormal crawler library so that it can be tracked or intercepted. Optionally, after determining that the access originator to be identified has crawler behavior, the server may also send prompt information to the client, where the prompt information is used to show the client that the access originator to be identified has crawler behavior.
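The library matching can be sketched in Python as below. The sketch assumes both libraries are simple in-memory sets of IP address strings; the function name, the labels, and the sample addresses are hypothetical.

```python
def classify_crawler(source_ip, normal_crawler_ips, abnormal_crawler_library):
    """Match a crawler's source IP against the preset normal crawler library."""
    if source_ip in normal_crawler_ips:
        return "normal"
    # Match failed: store the originator so it can be tracked or intercepted later.
    abnormal_crawler_library.add(source_ip)
    return "abnormal"

# Hypothetical library contents.
normal_crawler_ips = {"192.0.2.10"}
abnormal_crawler_library = set()
first = classify_crawler("192.0.2.10", normal_crawler_ips, abnormal_crawler_library)
second = classify_crawler("192.0.2.99", normal_crawler_ips, abnormal_crawler_library)
```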

It include below that access rate, access randomness, access number and repetition are visited with the access characteristic information of individual page For asking rate.

Fig. 2 is the process of another crawler Activity recognition method based on web access log provided in an embodiment of the present invention Schematic diagram.As shown in Fig. 2, this method may include:

Step 201: determine the access originator to be identified from the source IP address and the user agent identifier of the client, and obtain the access log of the access originator to be identified.

Step 202: judge whether a reference URL extracted from the access log has previously appeared as an access URL. If so, the out-degree is not 0, and step 203 is executed; if not, the out-degree is 0, step 212 is executed, and the identification process ends.

Step 203: according to the access time, the access URL and the reference URL corresponding to the access URL in the access log, obtain the access rate A, access randomness B, access number C and repeated access rate D with which the access originator to be identified accesses individual pages.

Step 204: judge whether the access rate A with which the access originator to be identified accesses individual pages is greater than the preset access rate. If so, execute step 205; if not, execute step 211.

Step 205: mark the access originator to be identified as a suspected crawler.

Step 206: judge whether the number of individual pages accessed is greater than the preset access number. If so, execute step 207; if not, execute step 211.

Step 207: judge whether the access to individual pages is random. If not, execute step 208; if so, execute step 211.

Step 208: mark the access originator to be identified as a crawler.

Step 209: judge whether the repeated access rate for individual pages is less than the preset repeated access rate. If so, execute step 210; if not, execute step 211.

Step 210: confirm that the access originator to be identified is a crawler.

Step 211: mark the access originator to be identified as a real user.

Step 212: match the access originator confirmed as a crawler against the preset normal crawler library. If the match succeeds, execute step 213; if the match fails, execute step 214;

Step 213: determine that the access originator is a normal crawler, then execute step 215.

Step 214: determine that the access originator is an abnormal crawler, then execute step 215.

Step 215: end the process.
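As a non-limiting illustration, the judgments of steps 204 to 211 might be sketched in Python as follows. The names `Features`, `Thresholds` and `identify` are hypothetical, the access randomness B is reduced here to a yes/no value, and the preset values are left to the caller, since the embodiment does not fix them.

```python
from dataclasses import dataclass


@dataclass
class Features:
    access_rate: float    # A
    is_random: bool       # B, assumed here to be a yes/no judgment
    access_number: int    # C
    repeat_rate: float    # D


@dataclass
class Thresholds:
    access_rate: float    # preset access rate
    access_number: int    # preset access number
    repeat_rate: float    # preset repeated access rate


def identify(f: Features, t: Thresholds) -> str:
    """Return 'crawler' or 'real user' following steps 204 to 211."""
    if f.access_rate <= t.access_rate:       # step 204 fails, go to step 211
        return "real user"
    # Step 205: the originator is now a suspected crawler.
    if f.access_number <= t.access_number:   # step 206 fails, go to step 211
        return "real user"
    if f.is_random:                          # step 207: random access, go to step 211
        return "real user"
    # Step 208: the originator is marked as a crawler.
    if f.repeat_rate >= t.repeat_rate:       # step 209 fails, go to step 211
        return "real user"
    return "crawler"                         # step 210
```

In this sketch an originator is confirmed as a crawler only when all four judgments point the same way; any single failing judgment leads to the real-user branch of step 211.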

The above method provided by the embodiment of the present invention obtains the access log of the access originator to be identified and, according to the access time, the access URL and the reference URL corresponding to the access URL in the access log, obtains the access characteristic information of the individual pages accessed by the access originator to be identified, where an individual page is a page whose access URL has a non-zero out-degree. When the access characteristic information matches the preset crawler access characteristic information, it is determined that the access originator to be identified has crawler behavior. As can be seen, the present application analyzes the access log to obtain the access characteristic information of the access originator to be identified, compares the obtained access characteristic information with the preset crawler access characteristic information, and determines that the access originator to be identified has crawler behavior, so that the user agent identifier of an access originator having crawler behavior can be tracked or intercepted, which improves the accuracy and security of identifying web crawler behavior.

Corresponding to the above method, an embodiment of the present invention also provides a crawler behavior recognition device. As shown in Fig. 3, the device includes an acquiring unit 310 and a determination unit 320.

The acquiring unit 310 is configured to obtain the access log of the access originator to be identified, where the access originator to be identified is determined by the source IP address and the user agent identifier of the client;

The determination unit 320 is configured to determine, according to the access characteristic information, that the access originator to be identified has crawler behavior.
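As a non-limiting illustration of how an access originator may be determined by the source IP address together with the user agent identifier, the following Python sketch groups log entries by that pair. The `LogEntry` field names and the function `group_by_originator` are assumptions for this example; the embodiment does not prescribe a log format.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    source_ip: str
    user_agent: str
    timestamp: float          # access time, in seconds
    url: str                  # access URL
    referer: Optional[str]    # reference URL, None if absent


def group_by_originator(entries):
    """Group log entries by the (source IP, user agent) pair."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry.source_ip, entry.user_agent)].append(entry)
    # Keep each originator's log in access-time order for later analysis.
    for log in groups.values():
        log.sort(key=lambda e: e.timestamp)
    return dict(groups)
```

Under this sketch, two clients behind the same IP address but with different user agent identifiers are treated as two separate access originators.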

Optionally, the access characteristic information includes the access rate and at least one of access randomness and access number;

The acquiring unit 310 is specifically configured to extract at least two individual pages from the access URLs according to the access URLs and the reference URLs corresponding to the access URLs in the access log;

and to obtain, according to the access time and the at least two extracted individual pages, the access rate within a preset time period and at least one of access randomness and access number.
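As a non-limiting illustration, the access rate and access number within a preset time period might be computed as in the following Python sketch. It assumes, for this example only, that the access rate is the number of requests to individual pages per second within the period and that the access number is the count of distinct individual pages requested in that period; the function name `window_features` and its parameters are hypothetical.

```python
def window_features(accesses, individual_pages, start, length):
    """Compute (access_rate, access_number) for one time window.

    accesses: iterable of (timestamp_seconds, access_url) pairs.
    individual_pages: set of URLs already extracted as individual pages.
    start, length: window start time and duration, in seconds.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    end = start + length
    # Keep only requests that fall in the window and hit an individual page.
    hits = [url for ts, url in accesses
            if start <= ts < end and url in individual_pages]
    access_rate = len(hits) / length     # assumed definition: requests per second
    access_number = len(set(hits))       # assumed definition: distinct pages
    return access_rate, access_number
```

Other definitions, such as counting every request rather than distinct pages, would fit the same structure by changing only the last two assignments.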

Optionally, the determination unit 320 is specifically configured to determine that the access originator to be identified has crawler behavior if the access characteristic information meets the preset crawler access condition.

Optionally, the access characteristic information further includes the repeated access rate;

The determination unit 320 is specifically configured to determine that the access originator to be identified has crawler behavior if the access characteristic information meets the preset crawler access condition;

Wherein, when the access characteristic information includes the access rate, access randomness, access number and repeated access rate, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access is not random, the access number is greater than the preset access number, and the repeated access rate is less than the preset repeated access rate;

Optionally, the device further includes a transmission unit 330;

The transmission unit 330 is configured to send prompt information to the client; the prompt information is used to show the user that the access originator to be identified has crawler behavior.

Optionally, the acquiring unit 310 is further configured to query the access log of the access originator to be identified before obtaining the access characteristic information of the individual pages accessed by the access originator to be identified within the preset time period;

The functions of the functional units of the device provided by the above embodiment of the present invention can be implemented through the above method steps; therefore, the specific working process and beneficial effects of each unit in the device provided by the embodiment of the present invention are not repeated here.

An embodiment of the present invention also provides an electronic device. As shown in Fig. 4, the electronic device includes a processor 410, a communication interface 420, a memory 430 and a communication bus 440, where the processor 410, the communication interface 420 and the memory 430 communicate with one another through the communication bus 440.

The memory 430 is configured to store a computer program;

The processor 410 is configured to implement the following steps when executing the program stored in the memory 430:

Optionally, the access characteristic information includes the access rate and at least one of access randomness and access number;

According to the access URLs and the reference URLs corresponding to the access URLs in the access log, at least two individual pages are extracted from the access URLs;

Optionally, determining, according to the access characteristic information, that the access originator to be identified has crawler behavior includes:

If the access characteristic information meets the preset crawler access condition, it is determined that the access originator to be identified has crawler behavior.

Optionally, after it is determined that the access originator to be identified has crawler behavior, the method further includes:

Optionally, before the access characteristic information of the individual pages accessed by the access originator to be identified within the preset time period is obtained, the access log of the access originator to be identified is queried to obtain multiple access paths in the access log, where an access path is a path formed by the access URLs accessed by the access originator to be identified;

The communication bus mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the above electronic device and other devices.

The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), for example at least one disk storage. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

Since the implementation of each component of the electronic device in the above embodiment, the problems it solves and its beneficial effects can be realized with reference to the steps in the embodiment shown in Fig. 1, the specific working process and beneficial effects of the electronic device provided by the embodiment of the present invention are not repeated here.

In another embodiment provided by the present invention, a computer-readable storage medium is also provided. Instructions are stored in the computer-readable storage medium and, when run on a computer, cause the computer to execute the crawler behavior recognition method described in any of the above embodiments.

In another embodiment provided by the present invention, a computer program product containing instructions is also provided which, when run on a computer, causes the computer to execute the crawler behavior recognition method described in any of the above embodiments.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Although preferred embodiments of the embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.

Obviously, those skilled in the art can make various changes and variations to the embodiments of the present application without departing from the spirit and scope of those embodiments. Thus, if these modifications and variations of the embodiments of the present application fall within the scope of the claims of the embodiments of the present application and their equivalent technologies, the embodiments of the present application are also intended to include these modifications and variations.

Claims

1. A crawler behavior recognition method based on web access logs, characterized in that the method comprises:

obtaining the access log of an access originator to be identified, the access originator to be identified being determined by a source IP address and a user agent identifier of a client;

obtaining, according to the access time, the access uniform resource locator (URL) and the reference URL corresponding to the access URL in the access log, the access characteristic information of the individual pages accessed by the access originator to be identified within a preset time period, wherein an individual page is a page whose access URL has a non-zero out-degree, and the access characteristic information is the access behavior information of the individual pages accessed by the access originator to be identified;

determining, according to the access characteristic information, that the access originator to be identified has crawler behavior.

2. The method according to claim 1, characterized in that the access characteristic information includes the access rate and at least one of access randomness and access number;

obtaining the access characteristic information of the individual pages accessed by the access originator to be identified according to the access time, the access URL and the reference URL corresponding to the access URL in the access log comprises:

extracting at least two individual pages from the access URLs according to the access URLs and the reference URLs corresponding to the access URLs in the access log;

obtaining, according to the access time and the at least two extracted individual pages, the access rate within the preset time period and at least one of access randomness and access number.

3. The method according to claim 2, characterized in that determining, according to the access characteristic information, that the access originator to be identified has crawler behavior comprises:

if the access characteristic information meets the preset crawler access condition, determining that the access originator to be identified has crawler behavior;

wherein, when the access characteristic information includes the access rate and the access randomness, the preset crawler access condition includes: the access rate is greater than the preset access rate and the access is not random;

when the access characteristic information includes the access rate, the access randomness and the access number, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access is not random, and the access number is greater than the preset access number;

when the access characteristic information includes the access rate and the access number, the preset crawler access condition includes: the access rate is greater than the preset access rate and the access number is greater than the preset access number.

4. The method according to claim 3, characterized in that the access characteristic information further includes the repeated access rate;

determining that the access originator to be identified has crawler behavior if the access characteristic information meets the preset crawler access condition comprises:

when the access characteristic information includes the access rate, the access randomness, the access number and the repeated access rate, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access is not random, the access number is greater than the preset access number, and the repeated access rate is less than the preset repeated access rate;

when the access characteristic information includes the access rate, the access randomness and the repeated access rate, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access is not random, and the repeated access rate is less than the preset repeated access rate;

when the access characteristic information includes the access rate, the access number and the repeated access rate, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access number is greater than the preset access number, and the repeated access rate is less than the preset repeated access rate.
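As a non-limiting illustration, the combinations of conditions enumerated in claims 3 and 4 might be expressed with a single check in which only the characteristics actually present are tested. The function name `meets_preset_condition` and the dictionary keys are hypothetical choices for this Python sketch and do not appear in the claims.

```python
def meets_preset_condition(features: dict, presets: dict) -> bool:
    """Check the preset crawler access condition for the features present.

    features: always contains 'access_rate'; may contain 'is_random',
              'access_number' and 'repeat_rate'.
    presets:  preset values under the keys 'access_rate', 'access_number'
              and 'repeat_rate' (only the ones needed are read).
    """
    if "is_random" not in features and "access_number" not in features:
        # The access rate must be accompanied by at least one of
        # access randomness and access number.
        raise ValueError("need access randomness and/or access number")
    checks = [features["access_rate"] > presets["access_rate"]]
    if "is_random" in features:
        checks.append(not features["is_random"])
    if "access_number" in features:
        checks.append(features["access_number"] > presets["access_number"])
    if "repeat_rate" in features:
        checks.append(features["repeat_rate"] < presets["repeat_rate"])
    return all(checks)
```

Writing the condition this way keeps one code path for every listed combination, rather than one branch per combination.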

5. The method according to claim 1, characterized in that, after determining that the access originator to be identified has crawler behavior, the method further comprises:

sending prompt information to the client, the prompt information being used to show the user that the access originator to be identified has crawler behavior.

6. The method according to claim 1, characterized in that, before obtaining the access characteristic information of the individual pages accessed by the access originator to be identified within the preset time period, the method further comprises:

querying the access log of the access originator to be identified to obtain multiple access paths in the access log, an access path being a path formed by the access URLs accessed by the access originator to be identified;

extracting, as individual pages, the pages in the multiple access paths whose access URLs have a non-zero out-degree.
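As a non-limiting illustration, the extraction of individual pages by out-degree might be sketched in Python as follows. The sketch works directly on time-ordered pairs of access URL and reference URL rather than on explicit access paths, and counts a page's out-degree whenever a later request names it as the reference URL; the function name `out_degrees` and this counting rule are assumptions for the example.

```python
def out_degrees(records):
    """Count out-degrees from time-ordered (access_url, reference_url) pairs.

    A reference URL contributes to a page's out-degree only if that page
    was itself requested earlier by the same access originator.
    """
    seen = set()      # access URLs requested so far
    degree = {}       # page URL -> out-degree (only non-zero entries)
    for access_url, reference_url in records:
        if reference_url is not None and reference_url in seen:
            degree[reference_url] = degree.get(reference_url, 0) + 1
        seen.add(access_url)
    return degree


def individual_pages(records):
    """Individual pages are the pages whose out-degree is not 0."""
    return set(out_degrees(records))
```

For a log in which `/list` is reached from `/index` and two item pages are reached from `/list`, this sketch would report `/index` and `/list` as individual pages, while the item pages, having no outgoing visits, would not be included.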

7. A crawler behavior recognition device, characterized in that the device comprises:

an acquiring unit, configured to obtain the access log of an access originator to be identified, the access originator to be identified being determined by a source IP address and a user agent identifier of a client;

a determination unit, configured to determine, according to the access characteristic information, that the access originator to be identified has crawler behavior.

8. The device according to claim 7, characterized in that the access characteristic information includes the access rate and at least one of access randomness and access number;

the acquiring unit is specifically configured to extract at least two individual pages from the access URLs according to the access URLs and the reference URLs corresponding to the access URLs in the access log;

and to obtain, according to the access time and the at least two extracted individual pages, the access rate within a preset time period and at least one of access randomness and access number.

9. The device according to claim 8, characterized in that

the determination unit is specifically configured to determine that the access originator to be identified has crawler behavior if the access characteristic information meets the preset crawler access condition;

when the access characteristic information includes the access rate and the access number, the preset crawler access condition includes: the access rate is greater than the preset access rate and the access number is greater than the preset access number.

10. The device according to claim 9, characterized in that the access characteristic information further includes the repeated access rate;

wherein, when the access characteristic information includes the access rate, the access randomness, the access number and the repeated access rate, the preset crawler access condition includes: the access rate is greater than the preset access rate, the access is not random, the access number is greater than the preset access number, and the repeated access rate is less than the preset repeated access rate;

11. The device according to claim 7, characterized in that the device further comprises a transmission unit;

the transmission unit is configured to send prompt information to the client, the prompt information being used to show the user that the access originator to be identified has crawler behavior.

12. The device according to claim 7, characterized in that the acquiring unit is further configured to query the access log of the access originator to be identified before obtaining the access characteristic information of the individual pages accessed by the access originator to be identified within the preset time period;

and to obtain multiple access paths in the access log, an access path being a path formed by the access URLs accessed by the access originator to be identified;

13. An electronic device, characterized in that the electronic device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;

the memory is configured to store a computer program;

the processor is configured to implement the method steps of any one of claims 1-6 when executing the program stored in the memory.

14. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the method steps of any one of claims 1-6.