CN107818179B - Crawler identification method based on information quantity theory - Google Patents

Publication number
CN107818179B
Authority
CN
China
Prior art keywords
information
taking
list
crawler
suspected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711183589.3A
Other languages
Chinese (zh)
Other versions
CN107818179A (en)
Inventor
许凌川
祝君
安云鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Knownsec Information Technology Co ltd
Original Assignee
Chengdu Knownsec Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Knownsec Information Technology Co ltd
Priority to CN201711183589.3A
Publication of CN107818179A
Application granted
Publication of CN107818179B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The invention discloses a crawler identification method based on information quantity theory, comprising the following steps: determining the information web pages that crawlers target according to the site's own characteristics and its past logs; counting logs in real time; summing the hourly statistics, plotting a total-volume chart, and monitoring the hourly request volume; computing, for each hour, the quartile distribution of the request volume per IP and, when the data are unevenly distributed such that the third quartile (3/4 point) is below 50% of the maximum value, concluding that some IPs are crawling information and marking the top-ranked IPs as suspected IP addresses; and computing the information quantity characteristic graphs of the suspected IP addresses and, if the characteristic graphs of a batch of IP addresses show a high total information quantity but are mutually orthogonal, regarding that batch, with reference to other statistical information, as crawlers that share the same crawling purpose and operate cooperatively. The method can effectively identify slow crawlers, crawlers that simulate browser access, and cluster crawlers, and is a highly effective supplement to current identification algorithms.

Description

Crawler identification method based on information quantity theory
Technical Field
The invention relates to the technical field of website construction and log scanning, in particular to a crawler identification method based on an information quantity theory.
Background
Existing crawler identification methods mainly scan log features, distinguishing humans from machines by the tool-identification information, request frequency, requested resource files, and similar attributes recorded in the log. The prior art has the following defects: when a crawler fully simulates manual clicking, or drives a browser to access the site, it cannot be identified; and when a crawler has a proxy pool, it cannot be identified either.
Target web page: an information web page that is crawled; depending on its parameters, it displays different information of the same type.
Information quantity characteristic graph: the parameters of the target web page are arranged by natural serial number when the number of pages is small, or hashed and grouped when the number of pages is too large, so that the parameters are distributed in a two-dimensional or higher-dimensional space. The resulting graph is called the information quantity graph, and each point in it is regarded as 1 unit of information.
Output information quantity: the quantity of information that the whole website or the target web page outputs to the outside within a given time, calculated from the request logs of the whole website or the target web page.
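The definition of the information quantity characteristic graph can be illustrated with a minimal sketch. The grid size, the use of SHA-256 as the hash, and the function names feature_point and feature_map are assumptions made for illustration; the patent does not specify them.

```python
import hashlib


def feature_point(page_id: int, total_pages: int, grid: int = 64) -> tuple:
    """Map one target-page parameter to a (row, col) point on a grid x grid graph.

    Assumption: a 2-D grid; the patent allows two or more dimensions.
    """
    if total_pages <= grid * grid:
        # Few pages: natural serial-number arrangement, one point per page.
        index = page_id
    else:
        # Too many pages: hash the parameter so that pages fall into groups.
        digest = hashlib.sha256(str(page_id).encode()).digest()
        index = int.from_bytes(digest[:8], "big")
    return divmod(index % (grid * grid), grid)


def feature_map(page_ids, total_pages: int, grid: int = 64) -> set:
    """Characteristic graph of one IP: the set of points it has requested.

    Each point counts as 1 unit of information, so len() of the result
    is the output information quantity for that IP.
    """
    return {feature_point(p, total_pages, grid) for p in page_ids}
```

Under these assumptions, repeated requests for the same page contribute a single point, which matches the rule that each point is 1 unit of information.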
Disclosure of Invention
The technical problem to be solved by the invention is to provide a crawler identification method based on information quantity theory that can effectively identify slow crawlers, crawlers that simulate browser access, and cluster crawlers; the method can statistically discover hidden crawler behavior that other identification methods fail to detect.
To solve the above technical problem, the invention adopts the following technical solution:
A crawler identification method based on information quantity theory comprises the following steps:
Step 1: determining the information web pages that crawlers target according to the site's own characteristics and its past logs; estimating the average information quantity of each page from the average number of bytes of the page's information content, so as to distinguish the information quantity of different web pages;
Step 2: counting logs in real time and calculating the web page information output for each IP, counting the same web page requested by the same IP only once;
Step 3: summing the hourly data counted in step 2, plotting a total-volume chart, and monitoring the hourly request volume; when the website's information output shows a period-over-period increase in a given period, a crawler event may have occurred;
Step 4: computing, for each hour, the quartile distribution of the request volume per IP; when the data are unevenly distributed such that the third quartile (3/4 point) is below 50% of the maximum value, some IPs are crawling information, and the top-ranked IPs are marked as suspected IP addresses;
Step 5: computing the information quantity characteristic graphs of the suspected IP addresses; if the characteristic graphs of a batch of IP addresses show a high total information quantity but are mutually orthogonal, that batch is regarded, with reference to other statistical information including first access time and high-frequency access periods, as crawlers that share the same crawling purpose and cooperate with one another.
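The unevenness test of step 4 can be sketched as follows. This is a minimal illustration that assumes the "3/4 point" means the statistical third quartile of the per-IP request volumes in one hour; the function names is_uneven and top_ips, the 0.5 default ratio, and the cut-off k are illustrative choices, not values fixed by the patent beyond the 50% figure.

```python
from statistics import quantiles


def is_uneven(requests_per_ip: dict, ratio: float = 0.5) -> bool:
    """Return True when the third quartile is below ratio * maximum.

    requests_per_ip maps an IP address to its request volume for one hour.
    """
    counts = list(requests_per_ip.values())
    if len(counts) < 2:
        return False  # quantiles() needs at least two data points
    third_quartile = quantiles(counts, n=4)[2]
    return third_quartile < ratio * max(counts)


def top_ips(requests_per_ip: dict, k: int = 10) -> list:
    """Return the k IPs with the highest request volume (k is an assumption)."""
    return sorted(requests_per_ip, key=requests_per_ip.get, reverse=True)[:k]
```

When most IPs request a handful of pages and one IP requests hundreds, the third quartile stays small while the maximum is large, so the check fires and the top-ranked IPs become candidates for marking.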
Further, step 3 also includes issuing a warning to the system when a possible crawler event is indicated.
Further, in step 4, the method for marking suspected IP addresses is specifically:
1) take x as a value from 0 to 1; take y as the maximum information quantity requested by a single IP from time 0 to the current period; let list_y be the list of all y values, where a number n in brackets after the list denotes the n-th item of the list, len denotes the length of the list, and math.floor denotes rounding down; the mapping is expressed as the lambda function lambda x: list_y[math.floor(len(list_y) * x)]; apply a Gaussian blur to this mapping and take its derivative, then take the x corresponding to the extreme value of the result; all IP addresses ranked in the top x% (x expressed as a percentage) are strongly suspected of being crawlers;
2) take z as the second minimum after the extreme value in the derivative; IP addresses ranked above z% and below x% are marked as suspected IP addresses.
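Item 1) of the marking method can be sketched in plain Python. This is one possible reading, not a definitive implementation: it assumes list_y is sorted in descending order so that "top-ranked" IPs come first, that the extreme value is the steepest drop of the smoothed curve, and that sigma is a free smoothing parameter. The helper names gaussian_blur and suspect_fraction are hypothetical. Item 2), the second threshold z, is left out because the text does not pin down how it is located.

```python
import math


def gaussian_blur(values, sigma=1.0):
    """Smooth a sequence with a normalized Gaussian kernel, clamping at the edges."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    total = sum(kernel)
    n = len(values)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, weight in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)  # repeat the edge value outside the list
            acc += weight * values[j]
        smoothed.append(acc / total)
    return smoothed


def suspect_fraction(amounts, sigma=1.0):
    """Return x in (0, 1]: IPs ranked in the top x share are strongly suspected."""
    list_y = sorted(amounts, reverse=True)  # assumption: descending order
    n = len(list_y)
    if n < 2:
        return 0.0
    # The mapping from the patent text, kept as a lambda to mirror its notation.
    mapping = lambda x: list_y[math.floor(len(list_y) * x)]
    # Sample at bin centres so floating-point rounding cannot shift the index.
    samples = [mapping((i + 0.5) / n) for i in range(n)]
    smooth = gaussian_blur(samples, sigma)
    derivative = [smooth[i + 1] - smooth[i] for i in range(n - 1)]
    steepest = min(range(n - 1), key=derivative.__getitem__)
    return (steepest + 1) / n
```

For ten IPs where two request hundreds of units and the rest only a few, the steepest drop falls between the second and third ranked IP, so the sketch returns 0.2 and the top two IPs are flagged.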
Compared with the prior art, the invention has the following beneficial effects: the method can effectively identify slow crawlers, crawlers that simulate browser access, and cluster crawlers; it is a highly effective supplement to current identification algorithms and fills the gap left by their inability to identify these types of crawlers. Its advantage lies not in immediate interception but in the statistical discovery of hidden crawler behavior that other identification methods fail to detect.
Detailed Description
The present invention will be described in further detail with reference to the following embodiments.
First, initialization and determination of long-term slow crawlers
1. Determine the information web pages that crawlers target according to the site's own characteristics and its past logs, for example the home page of a forum website, the novel pages of a novel website, or the document pages of a document website. Estimate the approximate average information quantity of each page from the average number of bytes of the page's information content; the main purpose is to distinguish the information quantity of different web pages. This average may be scaled down proportionally before use.
2. Count logs in real time and calculate the web page information output for each IP, counting the same web page requested by the same IP only once.
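The two initialization steps above can be sketched as a small counting routine. The log format (an iterable of (ip, page) pairs), the page_weight table of estimated per-page information quantities, and the function name information_output are assumptions made for this example.

```python
from collections import defaultdict


def information_output(log_entries, page_weight=None):
    """Compute the information output per IP from (ip, page) log entries.

    page_weight maps a page to its estimated average information quantity;
    pages without an entry count as 1 unit (an assumption of this sketch).
    """
    page_weight = page_weight or {}
    pages_by_ip = defaultdict(set)
    for ip, page in log_entries:
        # A set keeps each page once per IP, so repeats are not double counted.
        pages_by_ip[ip].add(page)
    return {
        ip: sum(page_weight.get(page, 1) for page in pages)
        for ip, pages in pages_by_ip.items()
    }
```

Because repeated requests for the same page add nothing, a browser that refreshes one page many times scores low, while a slow crawler that walks through many distinct pages accumulates a high total over time.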
Second, statistics on the information output trend of the whole website
Sum the hourly statistics, plot a total-volume chart, and monitor the hourly request volume. When the website's information output shows a period-over-period increase in a given period, a crawler event may have occurred, and a warning is issued to the system.
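The period-over-period check can be sketched as below. The text only says that an increase indicates a possible crawler event, so the 50% growth threshold and the function name growth_alerts are assumptions chosen for illustration.

```python
def growth_alerts(hourly_totals, threshold=0.5):
    """Return the indexes of hours whose output grew by more than threshold.

    hourly_totals is the summed information output of the whole site per hour;
    growth is measured against the immediately preceding hour.
    """
    alerts = []
    for hour in range(1, len(hourly_totals)):
        previous, current = hourly_totals[hour - 1], hourly_totals[hour]
        if previous > 0 and (current - previous) / previous > threshold:
            alerts.append(hour)
    return alerts
```

For example, with hourly totals of 100, 110, 105, 260, and 250, only the fourth hour (index 3) exceeds the assumed threshold and would trigger the warning.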
Third, crawler discrimination in the crawler alarm state
1. Based on the above statistics, compute the quartile distribution of the hourly request volume per IP. When the data are clearly unevenly distributed, for example when the third quartile is below one half of the maximum value (this threshold can be adjusted flexibly according to the request-volume characteristics of the website), a small number of IPs are crawling a large amount of information; an alarm is sent to the system and the top-ranked IPs are marked. The following algorithm is proposed for marking suspected IPs:
1) Take x as a value from 0 to 1; take y as the maximum information quantity requested by a single IP from time 0 to the current period; let list_y be the list of all y values, where a number n in brackets after the list denotes the n-th item of the list, len denotes the length of the list, and math.floor denotes rounding down. The mapping is expressed as the lambda function lambda x: list_y[math.floor(len(list_y) * x)]. Apply a Gaussian blur to this mapping and take its derivative, then take the x corresponding to the extreme value of the result (that is, find the IP interval with the largest increase that departs from the overall pattern). All IP addresses ranked in the top x% (x expressed as a percentage) are strongly suspected of being crawlers.
2) Take z as the second minimum after the extreme value in the derivative; IP addresses ranked above z% and below x% are marked as suspected IP addresses.
2. For the suspected IP addresses ranked above z% and below x%, compute their information quantity characteristic graphs. If a batch of IP addresses has a high total information quantity while their characteristic graphs are mutually orthogonal, then, with reference to other statistical information such as first access time and high-frequency access periods, they can be regarded as crawlers that share the same crawling purpose and cooperate with one another.
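The orthogonality test for a batch of suspected IPs can be sketched with sets of graph points. This sketch assumes each characteristic graph is represented as a set of points, reads "mutually orthogonal" as "pairwise (almost) non-overlapping", and uses illustrative thresholds min_total and max_overlap; none of these details are fixed by the patent, and the first-access-time and high-frequency-period checks are not modeled.

```python
from itertools import combinations


def overlap(a: set, b: set) -> float:
    """Share of the smaller graph that is also present in the other graph."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def cooperating_group(maps: dict, min_total: int = 100, max_overlap: float = 0.05) -> bool:
    """Return True when a batch of IPs looks like one cooperating crawler cluster.

    maps associates each suspected IP with its characteristic graph (a set of points).
    """
    if len(maps) < 2:
        return False
    total = len(set().union(*maps.values()))  # combined information quantity
    if total < min_total:
        return False
    return all(
        overlap(maps[a], maps[b]) <= max_overlap
        for a, b in combinations(maps, 2)
    )
```

The intuition is that cooperating crawlers split the target pages among themselves, so together they cover a large area while each pair shares almost no points, whereas independent human visitors tend to overlap on popular pages.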

Claims (3)

1. A crawler identification method based on information quantity theory, characterized by comprising the following steps:
Step 1: determining the information web pages that crawlers target according to the site's own characteristics and its past logs; estimating the average information quantity of each page from the average number of bytes of the page's information content, so as to distinguish the information quantity of different web pages;
Step 2: counting logs in real time and calculating the web page information output for each IP, counting the same web page requested by the same IP only once;
Step 3: summing the hourly data counted in step 2, plotting a total-volume chart, and monitoring the hourly request volume; when the website's information output shows a period-over-period increase in a given period, a crawler event may have occurred;
Step 4: counting the distribution of the IP request volume at each quartile point of every hour; when the IP request volumes at any three of the four points are each less than 50% of the IP request volume at the remaining point, the distribution is considered uneven and some IPs are crawling information, and the top-ranked IPs are marked as suspected IP addresses;
the quartile points are obtained by dividing an hour into 4 equal parts, each part being one "point", i.e., 1/4, 2/4, 3/4, and 1;
Step 5: for the suspected IP addresses, computing their information quantity characteristic graphs; if a batch of IP addresses has a high total information quantity while their characteristic graphs are mutually orthogonal, that batch is regarded, with reference to the statistical information of first access time and high-frequency access periods, as crawlers that share the same crawling purpose and cooperate with one another;
the information quantity characteristic graph is as follows: the parameters of the target web page are arranged by natural serial number when the number of pages is small, or hashed and grouped when the number of pages is too large, so that the parameters are distributed in a two-dimensional or higher-dimensional space; the resulting graph is called the information quantity graph, and each point in it is regarded as 1 unit of information.
2. The crawler identification method based on information quantity theory according to claim 1, wherein step 3 further comprises the system issuing a warning when a possible crawler event is indicated.
3. The crawler identification method based on information quantity theory according to claim 1, wherein in step 4 the method for marking suspected IP addresses is specifically:
1) taking any value between 0 and 1 as x; taking y as the maximum information quantity requested by a single IP from time 0 to the current period; letting list_y be the list of all y values, where a number n in brackets after the list denotes the n-th item of the list, len denotes the length of the list, and math.floor denotes rounding down; expressing the mapping as the lambda function lambda x: list_y[math.floor(len(list_y) * x)]; applying a Gaussian blur to this mapping and taking its derivative, then taking the x corresponding to the extreme value of the result; all IP addresses ranked in the top 100x% are strongly suspected of being crawlers;
2) taking z as the second minimum after the extreme value in the derivative; IP addresses ranked above 100z% and below 100x% are marked as suspected IP addresses.
CN201711183589.3A 2017-11-23 2017-11-23 Crawler identification method based on information quantity theory Active CN107818179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711183589.3A CN107818179B (en) 2017-11-23 2017-11-23 Crawler identification method based on information quantity theory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711183589.3A CN107818179B (en) 2017-11-23 2017-11-23 Crawler identification method based on information quantity theory

Publications (2)

Publication Number Publication Date
CN107818179A (en) 2018-03-20
CN107818179B (en) 2021-06-18

Family

ID=61610013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711183589.3A Active CN107818179B (en) 2017-11-23 2017-11-23 Crawler identification method based on information quantity theory

Country Status (1)

Country Link
CN (1) CN107818179B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108777687B (en) * 2018-06-05 2020-04-14 掌阅科技股份有限公司 Crawler intercepting method based on user behavior portrait, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202383A (en) * 2016-07-08 2016-12-07 武汉烽火普天信息技术有限公司 Dynamic prediction method and system for network bandwidth share, applied to web crawlers
CN106961444A (en) * 2017-04-26 2017-07-18 广东亿荣电子商务有限公司 Malicious web crawler detection method based on a hidden Markov model
CN107092660A (en) * 2017-03-28 2017-08-25 成都优易数据有限公司 Website server crawler identification method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070180399A1 (en) * 2006-01-31 2007-08-02 Honeywell International, Inc. Method and system for scrolling information across a display device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
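Entropy-based discovery of hot topics in online forums (基于熵值的网络论坛热点话题发现); Sun Yongli (孙永利) et al.; Computer Engineering (《计算机工程》); 2014-06-30; Vol. 40, No. 6; pp. 312-316 *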

Also Published As

Publication number Publication date
CN107818179A (en) 2018-03-20

Similar Documents

Publication Publication Date Title
CN102647421B (en) The web back door detection method of Behavior-based control feature and device
CN112434208B (en) Training of isolated forest and recognition method and related device of web crawler
US8832330B1 (en) Analysis of storage system latency by correlating activity of storage system components with latency measurements
CN104504200B (en) A kind of trend curve figure display methods for the monitoring of rotating machinery on-line vibration
CN105141598A (en) APT (Advanced Persistent Threat) attack detection method and APT attack detection device based on malicious domain name detection
CN107819783A (en) A kind of network security detection method and system based on threat information
CN104202291A (en) Anti-phishing method based on multi-factor comprehensive assessment method
CN104765874A (en) Method and device for detecting click-cheating
EP3329640B1 (en) Network operation
CN101826996A (en) Domain name system flow detection method and domain name server
CN112491784A (en) Request processing method and device of Web site and computer readable storage medium
CN111885086B (en) Malicious software heartbeat detection method, device and equipment and readable storage medium
CN114238959A (en) User access behavior evaluation method and system based on zero-trust security system
CN110912874A (en) Method and system for effectively identifying machine access behaviors
CN107818179B (en) Crawler identification method based on information quantity theory
CN111787002A (en) Method and system for analyzing service data network security
CN110519266A (en) A method of the cc attack detecting based on statistical method
CN117478433B (en) Network and information security dynamic early warning system
CN111159708B (en) Apparatus, method and storage medium for detecting web Trojan horse in server
CN111431883A (en) Web attack detection method and device based on access parameters
CN110580265B (en) ETL task processing method, device, equipment and storage medium
CN117040827A (en) Abnormal account detection method and device, storage medium and electronic equipment
CN110210221B (en) File risk detection method and device
CN114172707B (en) Fast-Flux botnet detection method, device, equipment and storage medium
CN115150206A (en) Intrusion detection safety early warning system and method for information safety

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 9/F, Block C, No. 28 Tianfu Avenue North Section, Chengdu High tech Zone, China (Sichuan) Pilot Free Trade Zone, Chengdu City, Sichuan Province, 610000

Patentee after: CHENGDU KNOWNSEC INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 610000, 11th floor, building 2, No. 219, Tianfu Third Street, hi tech Zone, Chengdu, Sichuan Province

Patentee before: CHENGDU KNOWNSEC INFORMATION TECHNOLOGY Co.,Ltd.