CN107818179B

CN107818179B - Crawler identification method based on information quantity theory

Info

Publication number: CN107818179B
Application number: CN201711183589.3A
Authority: CN
Inventors: 许凌川; 祝君; 安云鹏
Original assignee: Chengdu Knownsec Information Technology Co ltd
Current assignee: Chengdu Knownsec Information Technology Co ltd
Priority date: 2017-11-23
Filing date: 2017-11-23
Publication date: 2021-06-18
Anticipated expiration: 2037-11-23
Also published as: CN107818179A

Abstract

The invention discloses a crawler identification method based on an information quantity theory, comprising the following steps: determining the information web pages that crawlers target according to the site's own characteristics and past logs; counting logs in real time; summing the hourly statistics, plotting a chart of the totals, and monitoring the hourly request amount; computing, every hour, the quartile distribution of the request amount per IP and, when the data are unevenly distributed such that the 3/4 quantile is smaller than 50% of the maximum value, marking the top-ranked IPs as IPs crawling information; and computing the information quantity characteristic graphs of the suspected IP addresses and, if a batch of IP addresses each has a high total information quantity but mutually orthogonal characteristic graphs, treating them, with reference to other statistical information, as crawlers that share the same crawling purpose and cooperate with one another. The method can effectively identify slow crawlers, crawlers that simulate browser access, and cluster crawlers, and is an extremely effective supplement to current identification algorithms.

Description

Crawler identification method based on information quantity theory

Technical Field

The invention relates to the technical field of website construction and log scanning, in particular to a crawler identification method based on an information quantity theory.

Background

Existing crawler identification methods mainly scan log features and distinguish human from machine behavior by identifying coding-tool information, request frequency, requested resource files and the like in the log. The defects of the prior art are as follows: when a crawler fully simulates manual click behavior, or controls a browser to access the site, it cannot be recognized; when a crawler has a proxy pool, it cannot be identified either.

Target web page: a crawled information web page that displays different information of the same type depending on its parameters.

Information quantity characteristic graph: the parameters of the target web page are arranged by natural serial number when the number of pages is small, or hashed and grouped when the number of pages is very large, and are distributed over a space of two or more dimensions; the resulting graph is called an information quantity graph, and each point in it is regarded as 1 unit of information.

Output information amount: the amount of information that the whole website or the target web page outputs externally within a given time, calculated from the request logs of the whole website or the target web page.
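The definition of the information quantity characteristic graph can be illustrated with a minimal sketch. It assumes a two-dimensional grid and a single integer page parameter; the names `feature_cell` and `feature_map`, the grid side of 16, the `small_limit` cut-off and the use of SHA-256 are illustrative assumptions and are not specified in the source.

```python
from __future__ import annotations

import hashlib


def feature_cell(page_id: int, total_pages: int, side: int = 16,
                 small_limit: int = 256) -> tuple[int, int]:
    """Map one page parameter to a cell of a side x side grid (assumed layout)."""
    if total_pages <= small_limit:
        # Few pages: keep the natural serial-number order.
        index = page_id
    else:
        # Many pages: hash the parameter so that pages fall into groups.
        digest = hashlib.sha256(str(page_id).encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big")
    index %= side * side
    return divmod(index, side)  # (row, column)


def feature_map(page_ids, total_pages: int, side: int = 16,
                small_limit: int = 256) -> set[tuple[int, int]]:
    """Return the set of occupied cells; each cell counts as 1 unit of information."""
    return {feature_cell(p, total_pages, side, small_limit) for p in page_ids}
```

Under these assumptions, the total information quantity of one IP is simply `len(feature_map(...))` over the pages it requested.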

Disclosure of Invention

The technical problem to be solved by the invention is to provide a crawler identification method based on the information quantity theory that can effectively identify slow crawlers, crawlers that simulate browser access, and cluster crawlers; the method can statistically discover covert crawler behavior that other identification methods fail to detect.

In order to solve the above technical problem, the invention adopts the following technical solution:

A crawler identification method based on an information quantity theory comprises the following steps:

Step 1: determine the information web pages that crawlers target according to the site's own characteristics and past logs; estimate the average information content of each page from the average number of bytes of the web page's information content, so as to distinguish the information content of different web pages;

Step 2: count logs in real time and calculate the web page information output per IP, counting the same web page requested by the same IP only once;

Step 3: sum the hourly data counted in step 2, plot a chart of the totals, and monitor the hourly request amount; a period-over-period increase in the website's information output over some period indicates that a crawler event may be occurring;

Step 4: compute, every hour, the quartile distribution of the request amount per IP; when the data are unevenly distributed such that the 3/4 quantile is smaller than 50% of the maximum value, mark the top-ranked IPs as suspected IP addresses;

Step 5: compute the information quantity characteristic graphs of the suspected IP addresses; if a batch of IP addresses each has a high total information quantity but the characteristic graphs are mutually orthogonal, then, with reference to other statistical information including first access time and high-frequency access periods, treat them as crawlers that share the same crawling purpose and cooperate with one another.
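The uneven-distribution test of step 4 can be sketched as follows. This is a minimal illustration, assuming the per-IP request amounts for one hour are already available as a list of numbers; the function name `is_skewed` and the choice of the "inclusive" quantile method are assumptions, while the default ratio of 0.5 mirrors the 50% threshold stated above.

```python
import statistics


def is_skewed(per_ip_amounts, ratio=0.5):
    """Return True when the 3/4 quantile is below `ratio` times the maximum."""
    if len(per_ip_amounts) < 2:
        return False  # too few IPs to compute quartiles
    _q1, _q2, q3 = statistics.quantiles(per_ip_amounts, n=4, method="inclusive")
    return q3 < ratio * max(per_ip_amounts)
```

For example, `is_skewed([3, 4, 5, 4, 3, 5, 4, 200])` is true because one IP dominates the hour, whereas `is_skewed([10, 11, 12, 13])` is false.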

Further, step 3 also includes the system issuing a warning when a crawler event may be occurring.

Further, in step 4, the method for marking the suspected IP addresses is specifically:

1) let x take values from 0 to 1, and let y take values from 0 to the maximum amount of information requested by a single IP in the current time period; let list_y be the list of all y values, where a number n in brackets after the list denotes the nth item of the list, len denotes the length of a list, and math.floor denotes rounding down; the mapping is expressed as the lambda function lambda x: list_y[math.floor(len(list_y) * x)]; apply Gaussian blur to this hash mapping and differentiate the result, then take the x corresponding to the extreme value of the derivative; all IP addresses ranked in the top 100x% are strongly suspected of being crawlers;

2) take the second minimum after the extreme value in the derivative as z; IP addresses ranked between 100z% and 100x% are marked as suspected IP addresses.
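Sub-step 1) can be illustrated with a short sketch. It assumes that list_y holds the per-IP information amounts sorted in descending order (the source does not state the ordering), uses a hand-written one-dimensional Gaussian blur with clamped edges, and approximates the derivative by finite differences; the names `gaussian_blur` and `suspect_fraction` and the default `sigma` are hypothetical. The z threshold of sub-step 2) is not covered.

```python
import math


def gaussian_blur(values, sigma=1.0):
    """1-D Gaussian blur; indices beyond the ends are clamped to the edge value."""
    radius = max(1, math.ceil(3 * sigma))
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    total = sum(weights)
    last = len(values) - 1
    blurred = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in zip(offsets, weights):
            j = min(max(i + k, 0), last)
            acc += w * values[j]
        blurred.append(acc / total)
    return blurred


def suspect_fraction(list_y, sigma=1.0):
    """Return x in (0, 1]: the fraction of top-ranked IPs before the steepest change."""
    n = len(list_y)
    if n < 2:
        return 0.0
    mapping = lambda x: list_y[math.floor(n * x)]
    # Sample at bin centres so floating-point rounding cannot shift the index.
    samples = [mapping((i + 0.5) / n) for i in range(n)]
    smooth = gaussian_blur(samples, sigma)
    derivative = [smooth[i + 1] - smooth[i] for i in range(n - 1)]
    cut = max(range(n - 1), key=lambda i: abs(derivative[i]))
    return (cut + 1) / n
```

With `list_y = [900, 880, 860, 30, 28, 25, 22, 20, 18, 15]` the steepest change lies between the third and fourth entries, so the sketch returns 0.3 and the top 30% of IPs would be treated as strongly suspected.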

Compared with the prior art, the beneficial effects of the invention are as follows: the method can effectively identify slow crawlers, crawlers that simulate browser access, and cluster crawlers; it is an extremely effective supplement to current identification algorithms and fills the gap left by their inability to identify crawlers of these types. Its advantage lies not in immediate interception but in the statistical discovery of covert crawler behavior that other identification methods fail to detect.

Detailed Description

The present invention will be described in further detail with reference to the following embodiments.

First, initialization and determination of long-term slow crawlers

1. According to the site's own characteristics and past logs, determine the information web pages that crawlers target, for example the home page of a forum website, the novel pages of a novel website, or the document pages of a document website. From the average number of bytes of the web page's information content, estimate the approximate average information content of each page; its main purpose is to distinguish the information content of different web pages. This average may be scaled down proportionally before use.

2. Count logs in real time and calculate the web page information output per IP, counting the same web page requested by the same IP only once.
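The per-IP counting with deduplication described in item 2 can be sketched as below. The log is assumed to be already parsed into (ip, url) pairs, and `page_info` is a hypothetical lookup table holding the estimated information content of each target page from item 1; both names, and the function name `output_by_ip`, are illustrative.

```python
from collections import defaultdict


def output_by_ip(log_entries, page_info):
    """Sum the information output per IP, counting each (IP, page) pair once.

    log_entries: iterable of (ip, url) pairs (assumed pre-parsed log lines).
    page_info:   dict mapping url -> estimated information content of the page.
    """
    seen = set()
    totals = defaultdict(float)
    for ip, url in log_entries:
        if (ip, url) in seen:
            continue  # same page requested again by the same IP
        seen.add((ip, url))
        totals[ip] += page_info.get(url, 0.0)  # non-target pages contribute 0
    return dict(totals)
```

For instance, with `page_info = {"/a": 2.0, "/b": 3.0}` and the entries `("192.0.2.1", "/a")`, `("192.0.2.1", "/a")`, `("192.0.2.1", "/b")`, `("192.0.2.2", "/a")`, the result is `{"192.0.2.1": 5.0, "192.0.2.2": 2.0}`.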

Second, statistics of the information output trend of the whole website

Sum the hourly statistics and plot a chart of the totals to monitor the hourly request amount; when the website's information output shows a period-over-period increase over some period, a crawler event may be occurring and a warning is issued to the system.
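The period-over-period check on the hourly totals can be sketched as follows. The source only states that an increase triggers a warning, so the 50% growth `threshold` and the function name `ring_ratio_alerts` are assumptions made for illustration.

```python
def ring_ratio_alerts(hourly_totals, threshold=0.5):
    """Return the indices of hours whose total grew by more than `threshold`
    relative to the previous hour (period-over-period growth)."""
    alerts = []
    for i in range(1, len(hourly_totals)):
        prev, cur = hourly_totals[i - 1], hourly_totals[i]
        if prev > 0 and (cur - prev) / prev > threshold:
            alerts.append(i)
    return alerts
```

With the hourly totals `[100, 110, 105, 300, 310]`, only the fourth hour (index 3) exceeds the assumed threshold, so the sketch returns `[3]`.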

Third, crawler identification in the crawler alarm state

1. From the above statistics, compute the quartile distribution of the hourly request amount per IP. When the data are clearly unevenly distributed, for example when the three-quarter quantile is lower than one half of the maximum value (this threshold can be adjusted flexibly to the request characteristics of the website), a small number of IPs are crawling a large amount of information; an alarm is sent to the system and the top-ranked IPs are marked. An algorithm for marking suspected IPs is proposed here:

1) let x take values from 0 to 1, and let y take values from 0 to the maximum amount of information requested by a single IP in the current time period; let list_y be the list of all y values, where a number n in brackets after the list denotes the nth item of the list, len denotes the length of a list, and math.floor denotes rounding down; the mapping is expressed as the lambda function lambda x: list_y[math.floor(len(list_y) * x)]; apply Gaussian blur to this hash mapping and differentiate the result, then take the x corresponding to the extreme value of the derivative (that is, find the IP interval where the increase is largest and departs from the overall trend); the IP addresses ranked in the top 100x% are strongly suspected of being crawlers.

2. For suspected IP addresses ranked between 100z% and 100x%, compute their information quantity characteristic graphs; if a batch of IP addresses each has a high total information quantity but the characteristic graphs are mutually orthogonal, then, with reference to other statistical information such as first access time and high-frequency access periods, they can be regarded as crawlers that share the same crawling purpose and cooperate with one another.
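The orthogonality condition can be illustrated with a minimal sketch. It assumes each characteristic graph is represented as a set of occupied grid cells, so that the dot product of two graphs equals the size of their intersection; the names `overlap` and `cooperating_batch` and the thresholds `min_units` and `max_overlap` are hypothetical, and the additional checks on first access time and high-frequency access periods are not modelled.

```python
from itertools import combinations


def overlap(map_a, map_b):
    """Dot product of two binary characteristic graphs stored as sets of cells."""
    return len(map_a & map_b)


def cooperating_batch(maps_by_ip, min_units=3, max_overlap=0):
    """Return the sorted IPs whose graphs are each large and pairwise orthogonal.

    An empty list means no batch of at least two such IPs was found.
    """
    heavy = {ip: cells for ip, cells in maps_by_ip.items() if len(cells) >= min_units}
    if len(heavy) < 2:
        return []
    orthogonal = all(
        overlap(a, b) <= max_overlap for a, b in combinations(heavy.values(), 2)
    )
    return sorted(heavy) if orthogonal else []
```

The intuition behind the test is that cooperating crawlers divide the target pages among themselves, so each one covers many pages while almost never requesting a page that another member of the batch has already fetched.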

Claims

1. A crawler identification method based on an information quantity theory is characterized by comprising the following steps:

step 4: compute the distribution of the IP request amount at each quartile point in each hour; when the IP request amounts at any three of the four points are each less than 50% of the IP request amount at the remaining point, the distribution is considered uneven and IPs crawling information are present, and the top-ranked IPs are marked as suspected IP addresses;

the quartile points are obtained by dividing an hour into 4 equal parts, each part being a "point", namely 1/4, 2/4, 3/4 and 1;

step 5: for the suspected IP addresses, compute their information quantity characteristic graphs; if a batch of IP addresses each has a high total information quantity but the characteristic graphs are mutually orthogonal, then, with reference to the statistical information of first access time and high-frequency access periods, the batch of IP addresses is regarded as crawlers that share the same crawling purpose and cooperate with one another;

the information quantity characteristic graph is as follows: the parameters of the target web page are arranged by natural serial number when the number of pages is small, or hashed and grouped when the number of pages is very large, and are distributed over a space of two or more dimensions; the resulting graph is called an information quantity graph, and each point in it is regarded as 1 unit of information.

2. The crawler identification method based on the information quantity theory as claimed in claim 1, wherein step 3 further comprises the system issuing a warning when a crawler event may be occurring.

3. The crawler recognition method based on the information quantity theory as claimed in claim 1, wherein in the step 4, the method for marking the suspected IP address specifically comprises:

1) let x be any value between 0 and 1, and let y take values from 0 to the maximum amount of information requested by a single IP in the current time period; let list_y be the list of all y values, where a number n in brackets after the list denotes the nth item of the list, len denotes the length of a list, and math.floor denotes rounding down; the mapping is expressed as the lambda function lambda x: list_y[math.floor(len(list_y) * x)]; apply Gaussian blur to this hash mapping and differentiate the result, then take the x corresponding to the extreme value of the derivative; all IP addresses ranked in the top 100x% are strongly suspected of being crawlers;

2) take the second minimum after the extreme value in the derivative as z; IP addresses ranked between 100z% and 100x% are marked as suspected IP addresses.