CN106936778B - Method and device for detecting abnormal website traffic - Google Patents

Publication number
CN106936778B
Authority
CN
China
Prior art keywords
access behavior
behavior data
ratio
websites
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201511019106.7A
Other languages
Chinese (zh)
Other versions
CN106936778A (en
Inventor
祁国晟
饶峰云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201511019106.7A
Publication of CN106936778A
Application granted
Publication of CN106936778B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection

Abstract

The application discloses a method and a device for detecting website traffic anomalies. The method comprises the following steps: acquiring access behavior data of a plurality of websites accessed by a plurality of browsers within a preset time period to obtain an access behavior data set; calculating a first ratio of the access behavior data of each browser in the access behavior data set to obtain a first access behavior data distribution; calculating a second ratio of the access behavior data of each browser in the access behavior data of each website in the plurality of websites to obtain a plurality of second access behavior data distributions in one-to-one correspondence with the plurality of websites; calculating the similarity between each of the second access behavior data distributions and the first access behavior data distribution to obtain a plurality of similarities in one-to-one correspondence with the plurality of websites; and determining a target website from the plurality of websites according to the calculated similarities. The method and the device solve the technical problem in the prior art that the accuracy of detecting abnormal website traffic is low.

Description

Method and device for detecting abnormal website traffic
Technical Field
The application relates to the field of computers, in particular to a method and a device for detecting website traffic abnormity.
Background
In a conventional traffic anomaly detection method, indices such as the number of requests for a website address (URL), traffic, and processing time of a server are generally selected as indices for analyzing a website traffic anomaly. In this method, a threshold value is simply set, and if the index exceeds the set threshold value, it is determined that the website traffic is abnormal.
In this method, the threshold has no basis in probability or statistics and is set manually by a programmer, so it is highly arbitrary and the result is unreliable. Moreover, the index itself varies over time: traffic on weekdays differs from traffic on holidays, and traffic at nine o'clock in the evening differs from traffic at four o'clock in the morning. Using a fixed threshold to decide whether website traffic is abnormal therefore inevitably leads to misjudgment.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the application provides a method and a device for detecting website traffic abnormity, and at least solves the technical problem that the accuracy rate of detecting the website traffic abnormity in the prior art is low.
According to an aspect of an embodiment of the present application, a method for detecting website traffic anomaly is provided, including: acquiring access behavior data of a plurality of websites accessed by a plurality of browsers within a preset time period to obtain an access behavior data set; calculating a first ratio of access behavior data using each browser in the access behavior data set to obtain first access behavior data distribution; calculating a second ratio of the access behavior data of each browser in the access behavior data of each website in the plurality of websites to obtain a plurality of second access behavior data distributions which are in one-to-one correspondence with the plurality of websites; calculating the similarity of each second access behavior data distribution in the second access behavior data distributions and the first access behavior data distribution to obtain a plurality of similarities corresponding to the websites one by one; and determining a target website from the plurality of websites according to the calculated similarity, wherein the target website is a website with abnormal flow.
Further, determining a target website from the plurality of websites according to the calculated similarity comprises: selecting, from the plurality of websites, websites whose similarity is smaller than a preset proportion threshold as the target websites; sorting the plurality of similarities in ascending order and selecting the websites corresponding to the first n similarities as the target websites, wherein n is a positive integer greater than or equal to 1; or sorting the plurality of similarities in ascending order and selecting the websites corresponding to the first m% of the similarities as the target websites, wherein m is a positive integer greater than or equal to 1 and less than or equal to 100.
Further, calculating a similarity between each second access behavior data distribution in the plurality of second access behavior data distributions and the first access behavior data distribution to obtain a plurality of similarities in one-to-one correspondence with the plurality of websites includes: calculating the similarity by a first formula (the formula image is not reproduced in this text), wherein $x_i$ is a first ratio in the first access behavior distribution, $y_i$ is a second ratio in the second access behavior distribution, $i$ takes the values 1 to $n$ in sequence, and $n$ is the number of first ratios and of second ratios; or calculating the similarity by the formula
Figure BDA0000894751580000022
wherein $x_i$ is a first ratio in the first access behavior distribution, $y_i$ is a second ratio in the second access behavior distribution, $i$ takes the values 1 to $n$ in sequence, and $n$ is the number of first ratios and of second ratios.
Further, before calculating a first ratio of access behavior data using each browser in the set of access behavior data to obtain a first access behavior data distribution, the method further includes: merging the multiple browsers according to the first ratio to obtain multiple target browsers; wherein calculating a first ratio of access behavior data using each browser in the set of access behavior data to obtain a first access behavior data distribution comprises: and calculating a first ratio of the access behavior data of each target browser in the target browsers in the access behavior data of the websites accessed by the target browsers to obtain the first access behavior data distribution.
Further, the multiple target browsers include a first target browser and a second target browser, and merging the multiple browsers according to the first ratio to obtain the multiple target browsers includes: sorting the first ratios in descending order; determining the browsers corresponding to the first k-1 first ratios as first target browsers, wherein k is a positive integer greater than or equal to 1; and merging the browsers corresponding to the remaining n-k+1 first ratios into the second target browser and summing those n-k+1 first ratios to obtain the ratio of the second target browser, wherein the ratio of the second target browser is smaller than the (k-1)-th first ratio.
According to another aspect of the embodiments of the present application, there is also provided a device for detecting website traffic abnormality, including: the acquisition unit is used for acquiring access behavior data of a plurality of websites accessed by a plurality of browsers within a preset time period to obtain an access behavior data set; the first calculation unit is used for calculating a first ratio of the access behavior data using each browser in the access behavior data set to obtain first access behavior data distribution; a second calculation unit, configured to calculate a second ratio of access behavior data of each browser used in the access behavior data of each website in the multiple websites, to obtain multiple second access behavior data distributions that are in one-to-one correspondence with the multiple websites; a third calculating unit, configured to calculate a similarity between each second access behavior data distribution in the plurality of second access behavior data distributions and the first access behavior data distribution, to obtain a plurality of similarities corresponding to the plurality of websites one to one; and the determining unit is used for determining a target website from the plurality of websites according to the calculated similarity, wherein the target website is a website with abnormal flow.
Further, the determining unit includes: a first selection module, configured to select, from the plurality of websites, websites whose similarity is smaller than a preset proportion threshold as the target websites; a second selection module, configured to sort the plurality of similarities in ascending order and select the websites corresponding to the first n similarities as the target websites, wherein n is a positive integer greater than or equal to 1; or a third selection module, configured to sort the plurality of similarities in ascending order and select the websites corresponding to the first m% of the similarities as the target websites, wherein m is a positive integer greater than or equal to 1 and less than or equal to 100.
Further, the third calculation unit includes: a first calculation module, configured to calculate the similarity by a first formula (the formula image is not reproduced in this text), wherein $x_i$ is a first ratio in the first access behavior distribution, $y_i$ is a second ratio in the second access behavior distribution, $i$ takes the values 1 to $n$ in sequence, and $n$ is the number of first ratios and of second ratios; or a second calculation module, configured to calculate the similarity by the formula
Figure BDA0000894751580000032
wherein $x_i$ is a first ratio in the first access behavior distribution, $y_i$ is a second ratio in the second access behavior distribution, $i$ takes the values 1 to $n$ in sequence, and $n$ is the number of first ratios and of second ratios.
Further, the apparatus further comprises: a merging unit, configured to merge the multiple browsers according to a first ratio to obtain multiple target browsers before the first computing unit computes a first ratio of access behavior data using each browser in the access behavior data set to obtain first access behavior data distribution; wherein the first calculation unit includes: and the calculation module is used for calculating a first ratio of the access behavior data of each target browser in the plurality of target browsers in the access behavior data set to obtain the first access behavior data distribution.
Further, the plurality of target browsers include a first target browser and a second target browser, and the merging unit includes: a sorting module, configured to sort the first ratios in descending order; a determining module, configured to determine the browsers corresponding to the first k-1 first ratios as first target browsers, wherein k is a positive integer greater than or equal to 1; and a merging module, configured to merge the browsers corresponding to the remaining n-k+1 first ratios into the second target browser and to sum those n-k+1 first ratios to obtain the ratio of the second target browser, wherein the ratio of the second target browser is smaller than the (k-1)-th first ratio.
In the embodiment of the application, access behavior data of a plurality of websites accessed by a plurality of browsers within a preset time period are acquired to obtain an access behavior data set; a first ratio of the access behavior data using each browser in the access behavior data set is calculated to obtain a first access behavior data distribution; a second ratio of the access behavior data of each browser in the access behavior data of each website in the plurality of websites is calculated to obtain a plurality of second access behavior data distributions in one-to-one correspondence with the plurality of websites; the similarity between each of the second access behavior data distributions and the first access behavior data distribution is calculated to obtain a plurality of similarities in one-to-one correspondence with the plurality of websites; and a target website is determined from the plurality of websites according to the calculated similarity, wherein the target website is a website with abnormal traffic. In other words, the first and second access behavior data distributions are calculated from the access behavior data, the similarity is calculated from the two distributions, and the website with abnormal traffic is determined according to the similarity. Compared with the prior-art method of manually checking for abnormal websites, this achieves the purpose of detecting websites with abnormal traffic quickly and accurately, solves the technical problem of low accuracy in detecting website traffic anomalies in the prior art, and thereby improves the efficiency of detecting websites with abnormal traffic.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flowchart of a method for detecting website traffic anomalies according to an embodiment of the present application; and
FIG. 2 is a schematic diagram of a device for detecting website traffic abnormality according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an embodiment of the present application, a method for detecting website traffic anomalies is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
Fig. 1 is a flowchart of a method for detecting website traffic abnormality according to an embodiment of the present application, and as shown in fig. 1, the method includes steps S102 to S110 as follows:
Step S102, acquiring access behavior data of a plurality of websites accessed by a plurality of browsers within a preset time period to obtain an access behavior data set.
The preset time period may be one day, one week or one month, and the browser may be an IE browser, a 360 browser, or another browser such as Chrome, Safari, Sogou or Firefox. The access behavior data of a website may take various forms; in this embodiment, it may be the number of visits to the website within the preset time period, the access traffic of the website within the preset time period, and the like.
Step S104, calculating a first ratio of the access behavior data using each browser in the access behavior data set to obtain a first access behavior data distribution.
For example, suppose the access behavior data of the plurality of websites accessed through the IE browser, 360 browser, Chrome browser, Safari browser, Sogou browser and Firefox browser amount to a, b, c, d, e and f records respectively. The first ratios of the access behavior data of these browsers are then $a/X$, $b/X$, $c/X$, $d/X$, $e/X$ and $f/X$ respectively, where $X = a + b + c + d + e + f$. The plurality of first ratios form the first access behavior data distribution, which is also called the benchmark distribution.
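The benchmark distribution described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record format (one `(website, browser)` pair per access) and the function name `benchmark_distribution` are hypothetical and chosen only for this example.

```python
from collections import Counter

def benchmark_distribution(records):
    """Return {browser: first ratio} over all websites.

    `records` is assumed to be an iterable of (website, browser) pairs,
    one pair per access record in the preset time period.
    """
    counts = Counter(browser for _, browser in records)
    total = sum(counts.values())  # X = a + b + c + ...
    if total == 0:
        raise ValueError("no access records")
    return {browser: n / total for browser, n in counts.items()}

# Hypothetical sample data: four access records across two websites.
records = [
    ("site_a", "IE"), ("site_a", "Chrome"),
    ("site_b", "IE"), ("site_b", "Firefox"),
]
first_distribution = benchmark_distribution(records)
```

The ratios sum to 1 because every record is counted once in the numerator of exactly one browser and once in the total.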
Step S106, calculating a second ratio of the access behavior data of each browser in the access behavior data of each website in the plurality of websites to obtain a plurality of second access behavior data distributions which are in one-to-one correspondence with the plurality of websites.
For example, suppose the access behavior data of any one website, "website A", accessed through the IE browser, 360 browser, Chrome browser, Safari browser, Sogou browser and Firefox browser amount to $a_2$, $b_2$, $c_2$, $d_2$, $e_2$ and $f_2$ records respectively. The second ratios of the access behavior data of website A for these browsers are then $a_2/X_2$, $b_2/X_2$, $c_2/X_2$, $d_2/X_2$, $e_2/X_2$ and $f_2/X_2$ respectively, where $X_2 = a_2 + b_2 + c_2 + d_2 + e_2 + f_2$. The plurality of second ratios form the second access behavior data distribution of website A.
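A per-website version of the same calculation might look like the sketch below. The `(website, browser)` record format, the fixed `browsers` ordering, and the function name `site_distributions` are assumptions for illustration; a fixed ordering is used so that each website's second ratios line up position by position with the benchmark ratios.

```python
from collections import Counter, defaultdict

def site_distributions(records, browsers):
    """Return {website: [second ratio per browser, in `browsers` order]}."""
    per_site = defaultdict(Counter)
    for site, browser in records:
        per_site[site][browser] += 1
    result = {}
    for site, counts in per_site.items():
        total = sum(counts.values())  # X_2 for this website
        # A browser with no records for this site gets ratio 0.
        result[site] = [counts[b] / total for b in browsers]
    return result

# Hypothetical sample data.
records = [
    ("site_a", "IE"), ("site_a", "Chrome"),
    ("site_b", "IE"), ("site_b", "Firefox"),
]
browsers = ["IE", "Chrome", "Firefox"]
second_distributions = site_distributions(records, browsers)
```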
Step S108, calculating the similarity of each second access behavior data distribution in the second access behavior data distributions and the first access behavior data distribution to obtain a plurality of similarities corresponding to the websites one by one.
Specifically, by calculating the similarity between the second access behavior data distribution and the first access behavior data distribution of each website, a website with abnormal traffic can be determined, and an access channel of the website with abnormal traffic, that is, in which browser the user accesses the website, can also be determined.
Step S110, determining a target website from a plurality of websites according to the calculated similarity, wherein the target website is a website with abnormal flow.
Specifically, in the embodiment of the present application, the smaller the calculated similarity is, the higher the probability that the website traffic is abnormal is.
In the embodiment of the application, the first access behavior data distribution and the second access behavior data distribution are calculated according to the access behavior data, the similarity value is calculated according to the first access behavior data distribution and the second access behavior data distribution, and the website with abnormal flow is determined according to the similarity.
In an optional embodiment, determining the target websites from the plurality of websites according to the similarity may be done in any one of the following ways:
The first method is as follows:
Selecting, from the plurality of websites, the websites whose similarity is smaller than a preset proportion threshold as target websites.
Specifically, a preset proportion threshold may be set to determine the target websites: each calculated similarity is compared with the preset proportion threshold, and a website whose similarity is smaller than the threshold is determined to be a website with abnormal traffic. The access channel of the website with abnormal traffic may also be determined.
The second method is as follows:
Sorting the plurality of similarities in ascending order and selecting the websites corresponding to the first n similarities as target websites, wherein n is a positive integer greater than or equal to 1.
Specifically, the calculated similarities may be sorted in ascending order to obtain a similarity sequence, and the websites corresponding to the first n (smallest) similarities in the sequence are selected as websites with abnormal traffic (i.e., target websites). The value of n may be determined by the user according to the actual types of browsers and the number of websites. For example, among 1000 websites, the websites corresponding to the first 10 or the first 15 similarities in the sequence are selected as target websites.
The third method is as follows:
Sorting the plurality of similarities in ascending order and selecting the websites corresponding to the first m% of the similarities as target websites, wherein m is a positive integer greater than or equal to 1 and less than or equal to 100.
Specifically, the calculated similarities may be sorted in ascending order to obtain a similarity sequence, and the websites corresponding to the first m% (smallest) of the similarities in the sequence are selected as websites with abnormal traffic. Here m% is a percentage, and the user may choose the value of m according to actual needs. For example, if the websites corresponding to the first 1% of the similarities are selected and there are 1000 websites, then 10 websites are identified as having abnormal traffic.
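The three selection methods above can be sketched as three small functions. The function names and the `{website: similarity}` input format are assumptions made for this example. The text does not say how to round when m% of the website count is not a whole number; rounding up with `math.ceil` is an assumption of this sketch.

```python
import math

def select_by_threshold(similarities, threshold):
    """Method 1: websites whose similarity is below the preset threshold."""
    return [site for site, s in similarities.items() if s < threshold]

def select_lowest_n(similarities, n):
    """Method 2: websites with the n smallest similarities."""
    ranked = sorted(similarities, key=similarities.get)  # ascending order
    return ranked[:n]

def select_lowest_percent(similarities, m):
    """Method 3: websites in the lowest m% of similarities (1 <= m <= 100)."""
    ranked = sorted(similarities, key=similarities.get)
    count = math.ceil(len(ranked) * m / 100)  # rounding rule is an assumption
    return ranked[:count]

# Hypothetical similarity values for four websites.
sims = {"a": 0.95, "b": 0.40, "c": 0.80, "d": 0.10}
below_threshold = select_by_threshold(sims, 0.5)
lowest_one = select_lowest_n(sims, 1)
lowest_half = select_lowest_percent(sims, 50)
```

Method 1 needs a meaningful absolute threshold, whereas methods 2 and 3 only rely on the ranking and so always return a fixed number or share of websites.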
Optionally, in the embodiment of the present application, the similarity may be calculated by any of several preset algorithms, for example the Pearson correlation coefficient algorithm or the KL divergence formula. Preferably, calculating the similarity between each of the plurality of second access behavior data distributions and the first access behavior data distribution to obtain a plurality of similarities in one-to-one correspondence with the plurality of websites includes:
calculating the similarity through a KL divergence formula, wherein the KL divergence formula is as follows:
Figure BDA0000894751580000071
wherein $x_i$ is a first ratio in the first access behavior distribution, $y_i$ is a second ratio in the second access behavior distribution, $i$ takes the values 1 to $n$ in sequence, and $n$ is the number of first ratios and of second ratios.
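Because the formula image is not reproduced in this text, the sketch below uses the standard textbook definition of KL divergence, D(x || y) = sum over i of x_i * ln(x_i / y_i). Whether the patent's formula takes the benchmark distribution x or the website distribution y as the first argument is not recoverable here, so the argument order, the `eps` guard against zero ratios, and the function name are all assumptions.

```python
import math

def kl_divergence(x, y, eps=1e-12):
    """Standard KL divergence D(x || y) for two discrete distributions.

    `eps` is a hypothetical guard so that a zero ratio in `y` does not
    cause a division by zero; it is not part of the source text.
    """
    if len(x) != len(y):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for xi, yi in zip(x, y):
        if xi == 0:
            continue  # the term 0 * ln(0 / y) is taken as 0 by convention
        total += xi * math.log(xi / max(yi, eps))
    return total

same = kl_divergence([0.5, 0.5], [0.5, 0.5])
different = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

Note that the standard KL divergence is zero for identical distributions and grows as they differ, so it behaves as a dissimilarity. How the raw value is mapped onto a similarity score (where smaller means more abnormal) is not stated in this text, and this sketch leaves that mapping to the caller.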
The similarity can also be calculated by the Pearson correlation coefficient algorithm, whose calculation formula is as follows:
Figure BDA0000894751580000072
wherein $x_i$ is a first ratio in the first access behavior distribution, $y_i$ is a second ratio in the second access behavior distribution, $i$ takes the values 1 to $n$ in sequence, $n$ is the number of first ratios and of second ratios, $\bar{x}$ is the average value of the first ratios in the first access behavior distribution, and $\bar{y}$ is the average value of the second ratios in the second access behavior distribution.
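As an illustration, the standard Pearson correlation coefficient, written in terms of the means defined above, can be computed as in the sketch below. The formula image itself is not reproduced in this text, so this follows the usual textbook definition; the function name and the error raised for constant distributions are assumptions.

```python
import math

def pearson(x, y):
    """Standard Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("distributions must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        # Undefined when every ratio in a distribution is identical.
        raise ValueError("correlation is undefined for a constant distribution")
    return cov / (norm_x * norm_y)

identical = pearson([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])
reversed_shape = pearson([0.6, 0.3, 0.1], [0.1, 0.3, 0.6])
```

The coefficient lies between -1 and 1, and a website whose browser mix follows the benchmark scores close to 1, which matches the rule that a smaller similarity indicates a more likely anomaly.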
In addition to the two calculation methods above, other methods may be used in the present application to calculate the similarity between each of the plurality of second access behavior data distributions and the first access behavior data distribution, such as the Mahalanobis distance algorithm or the Chebyshev distance algorithm.
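Of the alternatives named above, the Chebyshev distance is the simplest to illustrate: it is the largest absolute difference between corresponding ratios. The sketch below uses the standard definition; the function name is hypothetical, and since the text does not say how a distance is turned into a similarity, the function returns the raw distance only.

```python
def chebyshev_distance(x, y):
    """Largest absolute difference between corresponding ratios."""
    if len(x) != len(y) or not x:
        raise ValueError("distributions must be non-empty and of equal length")
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical benchmark and website distributions over three browsers.
distance = chebyshev_distance([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])
```

Because this is a distance, a larger value means the website's browser mix deviates more from the benchmark, so any ranking based on it would run in the opposite direction to a similarity ranking.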
Optionally, before calculating the first ratio of the access behavior data of each browser in the access behavior data set to obtain the first access behavior data distribution, the method further includes the following step S1: and merging the multiple browsers according to the first ratio to obtain a plurality of target browsers.
Specifically, if the number of browsers is large, the ratios calculated for some browsers are very small, which introduces a large error into the calculated similarity. Therefore, before the first access behavior data distribution is obtained, the multiple browsers may be merged according to the first ratio to obtain the merged browsers (i.e., the target browsers). For example, the number of browsers changes from 100 before merging to 10 after merging.
Calculating the first ratio of the access behavior data using each browser in the access behavior data set to obtain the first access behavior data distribution includes step S3: calculating a first ratio of the access behavior data of each target browser among the plurality of target browsers, relative to the access behavior data of the plurality of websites accessed using the plurality of target browsers, to obtain the first access behavior data distribution. Specifically, after the multiple browsers are merged into the plurality of target browsers, the first access behavior data distribution of the target browsers is obtained, and then a second ratio of the access behavior data of each target browser in the access behavior data of each website is calculated to obtain the second access behavior data distribution.
Optionally, merging the multiple browsers according to the first ratio includes the following steps S11 to S15:
Step S11, sorting the plurality of first ratios in descending order.
Step S13, determining the browsers corresponding to the first k-1 first ratios as first target browsers, wherein k is a positive integer greater than or equal to 1.
Step S15, merging the browsers corresponding to the remaining n-k+1 first ratios into a second target browser, and summing those n-k+1 first ratios to obtain the ratio of the second target browser, wherein the ratio of the second target browser is smaller than the (k-1)-th first ratio.
Assuming that the number n of browsers is 100, 100 first ratios $x_1$ to $x_{100}$ are obtained by calculation. The 100 first ratios may be sorted in descending order to obtain a sorted sequence of first ratios. The user may determine the value of k according to the calculated first ratios, for example selecting the browsers corresponding to the 9 largest first ratios (i.e., k = 10) as first target browsers, which gives 9 first target browsers. The last n-k+1 = 91 browsers in the sequence are then merged into a second target browser, and the sum of the first ratios of these 91 browsers is the ratio of the second target browser. After merging, the first ratios of the 10 target browsers are $x_1$, $x_2$, $x_3$, $x_4$, $x_5$, $x_6$, $x_7$, $x_8$, $x_9$ and $y_{10}$, where $y_{10} = x_{10} + x_{11} + x_{12} + \dots + x_{100}$.
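Steps S11 to S15 can be sketched as a single function. The function name, the `{browser: first ratio}` input format, and the label `"other"` for the merged second target browser are assumptions made for this example. The sketch does not check the stated condition that the merged ratio is smaller than the (k-1)-th first ratio; choosing k so that this holds is left to the caller.

```python
def merge_browsers(first_ratios, k, other_label="other"):
    """Keep the k-1 largest first ratios and merge the rest into one entry.

    Returns (merged ratios, names of the browsers that were merged).
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Step S11: sort the first ratios in descending order.
    ranked = sorted(first_ratios.items(), key=lambda item: item[1], reverse=True)
    # Step S13: the first k-1 browsers become first target browsers.
    merged = dict(ranked[:k - 1])
    # Step S15: the remaining n-k+1 browsers become one second target browser.
    rest = ranked[k - 1:]
    if rest:
        merged[other_label] = sum(ratio for _, ratio in rest)
    return merged, [name for name, _ in rest]

# Hypothetical first ratios for five browsers, merged down to k = 3 targets.
ratios = {"IE": 0.5, "Chrome": 0.3, "360": 0.1, "Safari": 0.06, "Firefox": 0.04}
merged, absorbed = merge_browsers(ratios, k=3)
```

In this sample the merged ratio (about 0.2) is smaller than the second-largest first ratio (0.3), so the condition in step S15 is satisfied.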
After the first ratios of the 10 target browsers are obtained, the first access behavior data distribution of the target browsers is obtained, and the ratios of "website A" across the 10 target browsers are then calculated. Suppose the access behavior data of "website A" in the 10 target browsers are $k_1$, $k_2$, $k_3$, $k_4$, $k_5$, $k_6$, $k_7$, $k_8$, $k_9$ and $k_{10}$ respectively. The second ratios are then $k_1/M$, $k_2/M$, ..., $k_9/M$ and $k_{10}/M$, where $M = k_1 + k_2 + \dots + k_{10}$, and the second access behavior data distribution obtained from the second ratios is $\{k_1/M, k_2/M, \dots, k_{10}/M\}$.
The similarity is then calculated from the first access behavior data distribution $\{x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, y_{10}\}$ and the second access behavior data distribution $\{k_1/M, k_2/M, \dots, k_{10}/M\}$, and the target website is finally determined according to the similarity.
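The per-website calculation over merged browsers can be sketched as follows. The function name, the `{browser: record count}` input format, and the `"other"` label for the second target browser are hypothetical; the idea is only that a website's counts for the merged browsers are summed before the second ratios are taken, so that both distributions have the same length.

```python
def target_site_distribution(site_counts, first_targets, other_label="other"):
    """Second ratios of one website over the target browsers.

    `first_targets` lists the browsers kept as first target browsers;
    every other browser is folded into the second target browser.
    """
    merged = {browser: 0 for browser in first_targets}
    merged[other_label] = 0
    for browser, count in site_counts.items():
        key = browser if browser in first_targets else other_label
        merged[key] += count
    total = sum(merged.values())  # M = k_1 + k_2 + ... for this website
    if total == 0:
        raise ValueError("website has no access records")
    order = list(first_targets) + [other_label]
    return [merged[b] / total for b in order]

# Hypothetical record counts for one website, with two first target browsers.
counts = {"IE": 6, "Chrome": 2, "Safari": 1, "Firefox": 1}
distribution = target_site_distribution(counts, ["IE", "Chrome"])
```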
According to the method for detecting the website traffic abnormality, traditional manual troubleshooting is not relied on, benchmark distribution is calculated through whole network data, the similarity between the distribution of each website and the benchmark distribution is calculated, and then the website with the traffic abnormality can be accurately and quickly determined according to the similarity.
The embodiment of the present application further provides a device for detecting website traffic abnormality, where the device is mainly used to execute the method for detecting website traffic abnormality provided in the foregoing embodiments of the present application, and the following description specifically describes the device for detecting website traffic abnormality provided in the embodiments of the present application.
Fig. 2 is a schematic diagram of a website traffic abnormality detection apparatus according to an embodiment of the present application, and as shown in fig. 2, the website traffic abnormality detection apparatus mainly includes an acquisition unit 10, a first calculation unit 20, a second calculation unit 30, a third calculation unit 40, and a determination unit 50, where:
the acquiring unit 10 is configured to acquire access behavior data of a plurality of websites accessed by using a plurality of browsers within a preset time period, so as to obtain an access behavior data set.
The preset time period may be one day, one week or one month, and the browser may be an IE browser, a 360 browser, or another browser such as Chrome, Safari, Sogou or Firefox. The access behavior data of a website may take various forms; in this embodiment, it may be the number of visits to the website within the preset time period, the access traffic of the website within the preset time period, and the like.
The first calculating unit 20 is configured to calculate a first ratio of access behavior data of each browser in the access behavior data set, so as to obtain a first access behavior data distribution.
For example, suppose the access behavior data of the plurality of websites in the IE, 360, Chrome, Safari, Sogou, and Firefox browsers amount to a, b, c, d, e, and f records, respectively. The first ratios of the access behavior data using each of these browsers are then $a/X$, $b/X$, $c/X$, $d/X$, $e/X$, and $f/X$, where $X = a + b + c + d + e + f$. These first ratios together form the first access behavior data distribution.
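The first-ratio calculation above can be sketched in Python as follows. This is only an illustrative sketch: the function name `first_distribution` and the representation of each access record as a `(site, browser)` pair are assumptions and do not come from the application.

```python
from collections import Counter


def first_distribution(records):
    """Return {browser: first ratio} over all records of all websites.

    `records` is an iterable of (site, browser) pairs, one pair per
    access record (an assumed input format).
    """
    counts = Counter(browser for _site, browser in records)
    total = sum(counts.values())  # X = a + b + c + ...
    if total == 0:
        return {}
    return {browser: n / total for browser, n in counts.items()}


# Hypothetical sample data: four access records across two websites.
sample = [
    ("site-a", "IE"), ("site-a", "Chrome"),
    ("site-b", "IE"), ("site-b", "IE"),
]
baseline = first_distribution(sample)
```

The resulting ratios sum to 1, so they can be treated as a discrete distribution over browsers.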
The second calculating unit 30 is configured to calculate a second ratio of the access behavior data of each browser in the access behavior data of each of the multiple websites, and obtain multiple second access behavior data distributions that are in one-to-one correspondence with the multiple websites.
For example, suppose the access behavior data of any one website, "website A", in the IE, 360, Chrome, Safari, Sogou, and Firefox browsers amount to $a_2$, $b_2$, $c_2$, $d_2$, $e_2$, and $f_2$ records, respectively. The second ratios of the access behavior data of website A using each of these browsers are then $a_2/X_2$, $b_2/X_2$, $c_2/X_2$, $d_2/X_2$, $e_2/X_2$, and $f_2/X_2$, where $X_2 = a_2 + b_2 + c_2 + d_2 + e_2 + f_2$. These second ratios together form the second access behavior data distribution of website A.
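A per-website version of the same calculation might look like the sketch below. The function name `second_distributions`, the `(site, browser)` record format, and the explicit `browsers` ordering argument are assumptions made for illustration; fixing the browser order lets each website's ratios be compared position by position with the first distribution.

```python
from collections import Counter, defaultdict


def second_distributions(records, browsers):
    """Return {site: [second ratio per browser, in `browsers` order]}."""
    per_site = defaultdict(Counter)
    for site, browser in records:
        per_site[site][browser] += 1

    result = {}
    for site, counts in per_site.items():
        site_total = sum(counts.values())  # X2 for this website
        # A browser with no records for this website gets a ratio of 0.
        result[site] = [counts[b] / site_total for b in browsers]
    return result
```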
The third calculating unit 40 is configured to calculate a similarity between each second access behavior data distribution in the plurality of second access behavior data distributions and the first access behavior data distribution, so as to obtain a plurality of similarities corresponding to the plurality of websites one to one.
Specifically, by calculating the similarity between the second access behavior data distribution and the first access behavior data distribution of each website, a website with abnormal traffic can be determined, and an access channel of the website with abnormal traffic, that is, in which browser the user accesses the website, can also be determined.
And the determining unit 50 is configured to determine a target website from the multiple websites according to the calculated similarity, where the target website is a website with abnormal traffic.
Specifically, in the embodiment of the present application, the smaller the calculated similarity is, the higher the probability that the website traffic is abnormal is.
In the embodiment of the application, the first access behavior data distribution and the second access behavior data distribution are calculated according to the access behavior data, the similarity value is calculated according to the first access behavior data distribution and the second access behavior data distribution, and the website with abnormal flow is determined according to the similarity.
Optionally, the determining unit includes: the first selection module is used for selecting websites with the similarity smaller than a preset proportion threshold value from the plurality of websites as target websites; the second selection module is used for sequencing the similarity degrees from small to large and selecting websites corresponding to the first n similarity degrees as target websites, wherein n is a positive integer greater than or equal to 1; or the third selecting module is used for sorting the similarity degrees from small to large and selecting the websites corresponding to the top m% of similarity degrees as target websites, wherein m is a positive integer which is greater than or equal to 1 and less than or equal to 100.
Specifically, a preset proportion threshold may be set to determine the target website: each calculated similarity is compared with the preset proportion threshold, and a website whose similarity is smaller than the threshold is determined to be a website with abnormal traffic. The access channel of that website may also be determined.
Alternatively, the calculated similarities may be sorted in ascending order to obtain a similarity sequence, and the second selection module is called to select the websites corresponding to the first n (smallest) similarities in the sequence as the websites with abnormal traffic (i.e., the target websites). The user can determine the value of n according to the actual browser types and the number of websites. For example, out of 1000 websites, the websites corresponding to the first 10 or the first 15 similarities in the similarity sequence are selected as target websites.
Alternatively, the calculated similarities may be sorted in ascending order to obtain a similarity sequence, and the third selection module is called to select the websites corresponding to the first m% (smallest) similarities in the sequence as the websites with abnormal traffic. Here m% is a percentage whose value the user can choose according to actual needs; for example, the websites corresponding to the first 1% of similarities in the similarity sequence are selected as websites with abnormal traffic. If there are 1000 websites, 10 websites are determined to have abnormal traffic.
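The three selection strategies described above can be sketched as follows. The function names and the `{site: similarity}` input format are illustrative assumptions. Rounding the m% count up with `math.ceil` is also an assumption, since the text does not say how a fractional count should be handled.

```python
import math


def select_below_threshold(similarities, threshold):
    """Websites whose similarity is smaller than the preset threshold."""
    return [site for site, s in similarities.items() if s < threshold]


def select_lowest_n(similarities, n):
    """Websites with the n smallest similarities, in ascending order."""
    ranked = sorted(similarities, key=similarities.get)
    return ranked[:n]


def select_lowest_percent(similarities, m):
    """Websites with the lowest m% of similarities (count rounded up)."""
    count = math.ceil(len(similarities) * m / 100)
    return select_lowest_n(similarities, count)
```

With 1000 websites and m = 1, `select_lowest_percent` returns 10 websites, matching the example in the text.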
Optionally, the third calculation unit comprises: a first calculating module for calculating the similarity by a formula, wherein $x_i$ is a first ratio in the first access behavior data distribution, $y_i$ is a second ratio in the second access behavior data distribution, i takes 1 to n in sequence, and n is the number of first ratios and second ratios; or a second calculating module for calculating the similarity by the formula
Figure BDA0000894751580000112
wherein $x_i$ is a first ratio in the first access behavior data distribution, $y_i$ is a second ratio in the second access behavior data distribution, i takes 1 to n in sequence, and n is the number of first ratios and second ratios.
The similarity may be calculated through the KL divergence formula, which is as follows:
Figure BDA0000894751580000121
where $x_i$ is a first ratio in the first access behavior data distribution, $y_i$ is a second ratio in the second access behavior data distribution, i takes 1 to n in sequence, and n is the number of first ratios and second ratios.
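A minimal sketch of a KL divergence calculation is given below. The formula image is not reproduced in this text, so the direction of the divergence and the logarithm base are assumptions: the sketch computes $\sum_i p_i \ln(p_i / q_i)$ with `p` as the per-website (second) distribution and `q` as the whole-network (first) distribution. That direction stays finite whenever the website's own records are included in the whole-network data. Note also that KL divergence grows as two distributions become less alike, so a caller that ranks websites by ascending similarity would need to map the divergence to a similarity score first, for example by negating it.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) between two discrete distributions.

    Assumed direction: p is the website's second distribution and
    q is the whole-network first distribution.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # convention: 0 * log(0 / q) = 0
        if q_i == 0:
            return math.inf  # p has mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total
```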
The similarity can also be calculated through the Pearson correlation coefficient algorithm, whose calculation formula is as follows:
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
where $x_i$ is a first ratio in the first access behavior data distribution, $y_i$ is a second ratio in the second access behavior data distribution, i takes 1 to n in sequence, n is the number of first ratios and second ratios, and $\bar{x}$ and $\bar{y}$ are the means of the first ratios and of the second ratios, respectively.
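The Pearson correlation coefficient between a first distribution and a second distribution can be sketched as below, using the standard definition of the coefficient. The function name `pearson` and the choice to return NaN when either distribution has zero variance (for example, a perfectly uniform distribution) are assumptions; the text does not address that case.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan  # correlation is undefined without variance
    return cov / math.sqrt(var_x * var_y)
```

Because the coefficient is high for similar distributions and low for dissimilar ones, it fits the rule that a smaller similarity indicates a higher probability of abnormal traffic.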
In addition to the two calculation methods above, other calculation methods, such as the Mahalanobis distance algorithm or the Chebyshev distance algorithm, may also be selected in the present application to calculate the similarity between each second access behavior data distribution in the plurality of second access behavior data distributions and the first access behavior data distribution.
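As an illustration of one of the alternatives named above, the Chebyshev distance between two distributions is the largest absolute difference between corresponding ratios. The function name below is an assumption. Like KL divergence, this is a distance, so a larger value means the distributions are less alike.

```python
def chebyshev_distance(x, y):
    """Largest absolute difference between corresponding ratios."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    return max(abs(a - b) for a, b in zip(x, y))
```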
Optionally, the detection apparatus further comprises a merging unit, configured to merge the multiple browsers according to the first ratio to obtain a plurality of target browsers before the first calculation unit calculates the first ratio of the access behavior data using each browser in the access behavior data set to obtain the first access behavior data distribution. Specifically, if the number of browsers is large, the ratios calculated for some browsers are small, which introduces a large error into the calculated similarity. Therefore, before the first access behavior data distribution is obtained, the multiple browsers may be merged according to the first ratio to obtain the merged browsers (i.e., the target browsers). For example, the number of browsers drops from 100 before merging to 10 after merging.
Wherein the first calculation unit includes: the calculation module is used for calculating a first ratio of the access behavior data of each target browser in the target browsers in the access behavior data of the websites accessed by the target browsers to obtain first access behavior data distribution. Specifically, after the multiple browsers are combined to obtain multiple target browsers, first access behavior data of the multiple target browsers are obtained, and then a second ratio of the access behavior data of each target browser in the access behavior data of each website in the multiple websites is calculated to obtain second access behavior data distribution.
Optionally, the plurality of target browsers includes a first target browser and a second target browser, and the merging unit includes: a sorting module, configured to sort the first ratios in descending order; a determining module, configured to determine that the browsers corresponding to the first k-1 first ratios are first target browsers, wherein k is a positive integer greater than or equal to 1; and a merging module, configured to merge the browsers corresponding to the remaining n-k+1 first ratios into a second target browser and to merge the n-k+1 first ratios into the ratio of the second target browser, wherein the ratio of the second target browser is smaller than the (k-1)-th first ratio.
Assuming that the number n of browsers is 100, 100 first ratios $x_1$ to $x_{100}$ are obtained by calculation. The sorting module is called to sort the 100 first ratios in descending order, yielding a sorted sequence of first ratios. The determining module is then called to determine the value of k according to the calculated first ratios; for example, the browsers corresponding to the 9 largest first ratios (i.e., k is 10) are selected as first target browsers, giving 9 first target browsers. Next, the merging module is called to merge the last n-k+1 = 91 browsers in the sequence into a second target browser and to calculate the sum of the first ratios of these 91 browsers; this sum is the ratio of the second target browser. After merging, the first ratios of the 10 target browsers are $x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9$, and $y_{10}$, where $y_{10} = x_{10} + x_{11} + x_{12} + \cdots + x_{100}$.
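The merging step can be sketched as follows. The function name `merge_browsers` and the label `"other"` for the merged second target browser are hypothetical; the text does not name the merged browser.

```python
def merge_browsers(first_ratios, k, merged_label="other"):
    """Keep the k-1 browsers with the largest first ratios and merge the rest.

    `first_ratios` maps browser name to first ratio. The remaining
    n-k+1 browsers are combined into one entry whose ratio is the sum
    of their first ratios. `merged_label` is a hypothetical name.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    ranked = sorted(first_ratios.items(), key=lambda item: item[1], reverse=True)
    kept = dict(ranked[:k - 1])   # first target browsers
    rest = ranked[k - 1:]         # browsers to merge
    if rest:
        kept[merged_label] = sum(ratio for _name, ratio in rest)
    return kept
```

With 100 browsers and k = 10, the result has 10 entries: 9 first target browsers and one merged entry, as in the example above.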
The device for detecting the website traffic abnormality comprises a processor and a memory, wherein the acquisition unit, the first calculation unit, the second calculation unit, the third calculation unit, the determination unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. The kernel can be set to be one or more than one, and the website with abnormal traffic can be quickly and accurately detected by adjusting the kernel parameters, so that the technical problem of low accuracy in detecting the website with abnormal traffic in the prior art is solved, and the technical effect of improving the detection efficiency of the website with abnormal traffic is realized.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
The present application further provides a computer program product adapted to perform program code for initializing the following method steps when executed on a data processing device: acquiring access behavior data of a plurality of websites accessed by a plurality of browsers within a preset time period to obtain an access behavior data set; calculating a first ratio of access behavior data using each browser in the access behavior data set to obtain first access behavior data distribution; calculating a second ratio of the access behavior data of each browser in the access behavior data of each website in the plurality of websites to obtain a plurality of second access behavior data distributions which are in one-to-one correspondence with the plurality of websites; calculating the similarity of each second access behavior data distribution in the second access behavior data distributions and the first access behavior data distribution to obtain a plurality of similarities corresponding to the websites one by one; and determining a target website from the plurality of websites according to the calculated similarity, wherein the target website is a website with abnormal flow.
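The method steps listed above can be strung together in a short end-to-end sketch. Everything here is illustrative: the `(site, browser)` record format, the function names, the choice of the Pearson correlation coefficient as the similarity measure, the use of 0.0 when the coefficient is undefined, and the selection of the n lowest similarities are all assumptions rather than the implementation of the application.

```python
import math
from collections import Counter, defaultdict


def _pearson(x, y):
    """Pearson correlation; returns 0.0 if undefined (an assumption)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def detect_anomalous_sites(records, n):
    """Return (n least similar websites, {site: similarity}).

    `records` is a list of (site, browser) pairs collected within the
    preset time period.
    """
    browsers = sorted({browser for _site, browser in records})

    # Step 1: first distribution over the whole access behavior data set.
    overall = Counter(browser for _site, browser in records)
    total = sum(overall.values())
    first = [overall[b] / total for b in browsers]

    # Step 2: second distribution for each website.
    per_site = defaultdict(Counter)
    for site, browser in records:
        per_site[site][browser] += 1

    # Step 3: similarity of each website's distribution to the first one.
    similarities = {}
    for site, counts in per_site.items():
        site_total = sum(counts.values())
        second = [counts[b] / site_total for b in browsers]
        similarities[site] = _pearson(first, second)

    # Step 4: the websites with the smallest similarity are the targets.
    ranked = sorted(similarities, key=similarities.get)
    return ranked[:n], similarities
```

The benchmark is only meaningful when most of the traffic is normal; if a single abnormal website dominates the data set, it also dominates the first distribution and may no longer stand out.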
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (8)

1. A method for detecting website traffic anomaly is characterized by comprising the following steps:
acquiring access behavior data of a plurality of websites accessed by a plurality of browsers within a preset time period to obtain an access behavior data set;
calculating a first ratio of access behavior data using each browser in the access behavior data set to obtain first access behavior data distribution;
calculating a second ratio of the access behavior data of each browser in the access behavior data of each website in the plurality of websites to obtain a plurality of second access behavior data distributions which are in one-to-one correspondence with the plurality of websites;
calculating the similarity of each second access behavior data distribution in the second access behavior data distributions and the first access behavior data distribution to obtain a plurality of similarities corresponding to the websites one by one;
determining a target website from the plurality of websites according to the calculated similarity, wherein the target website is a website with abnormal flow;
before calculating a first ratio of access behavior data of each browser in the access behavior data set to obtain a first access behavior data distribution, the method further comprises:
merging the multiple browsers according to the first ratio to obtain multiple target browsers;
wherein calculating a first ratio of access behavior data using each browser in the set of access behavior data to obtain a first access behavior data distribution comprises: and calculating a first ratio of the access behavior data of each target browser in the target browsers in the access behavior data of the websites accessed by the target browsers to obtain the first access behavior data distribution.
2. The method of claim 1, wherein determining a target web site from the plurality of web sites based on the calculated similarities comprises:
selecting websites with the similarity smaller than a preset proportion threshold value from the plurality of websites as the target websites;
sequencing the similarity degrees from small to large, and selecting websites corresponding to the first n similarity degrees as the target websites, wherein n is a positive integer greater than or equal to 1; or
And sequencing the similarity degrees from small to large, and selecting the websites corresponding to the top m% of similarity degrees as the target websites, wherein m is a positive integer which is greater than or equal to 1 and less than or equal to 100.
3. The method of claim 1, wherein calculating a similarity of each of the plurality of second access behavior data distributions to the first access behavior data distribution, and wherein obtaining a plurality of similarities corresponding to the plurality of websites one-to-one comprises:
by the formula
Figure FDA0002340366790000021
calculating the similarity, wherein $x_i$ is a first ratio in the first access behavior distribution, $y_i$ is a second ratio in the second access behavior distribution, i sequentially takes 1 to n, and n is the number of the first ratios and the second ratios; or
by the formula
Figure FDA0002340366790000022
calculating the similarity, wherein $x_i$ is a first ratio in the first access behavior distribution, $y_i$ is a second ratio in the second access behavior distribution, i sequentially takes 1 to n, and n is the number of the first ratios and the second ratios.
4. The method of claim 1, wherein the plurality of target browsers include a first target browser and a second target browser, and merging the plurality of browsers according to the first ratio to obtain the plurality of target browsers comprises:
sorting the first ratios in descending order;
determining the browsers corresponding to the first k-1 first ratios as the first target browser, wherein k is a positive integer greater than or equal to 1;
merging the browsers corresponding to the remaining n-k+1 first ratios into the second target browser, and merging the n-k+1 first ratios into the ratio of the second target browser, wherein the ratio of the second target browser is smaller than the (k-1)-th first ratio.
5. A device for detecting website traffic abnormality is characterized by comprising:
the acquisition unit is used for acquiring access behavior data of a plurality of websites accessed by a plurality of browsers within a preset time period to obtain an access behavior data set;
the first calculation unit is used for calculating a first ratio of the access behavior data using each browser in the access behavior data set to obtain first access behavior data distribution;
a second calculation unit, configured to calculate a second ratio of access behavior data of each browser used in the access behavior data of each website in the multiple websites, to obtain multiple second access behavior data distributions that are in one-to-one correspondence with the multiple websites;
a third calculating unit, configured to calculate a similarity between each second access behavior data distribution in the plurality of second access behavior data distributions and the first access behavior data distribution, to obtain a plurality of similarities corresponding to the plurality of websites one to one;
a determining unit, configured to determine a target website from the multiple websites according to the calculated similarity, where the target website is a website with abnormal traffic;
wherein the apparatus further comprises:
a merging unit, configured to merge the multiple browsers according to a first ratio to obtain multiple target browsers before the first computing unit computes a first ratio of access behavior data using each browser in the access behavior data set to obtain first access behavior data distribution;
wherein the first calculation unit includes: and the calculation module is used for calculating a first ratio of the access behavior data of each target browser in the plurality of target browsers in the access behavior data set to obtain the first access behavior data distribution.
6. The apparatus of claim 5, wherein the determining unit comprises:
the first selection module is used for selecting websites with the similarity smaller than a preset proportion threshold value from the plurality of websites as the target websites;
the second selection module is used for sequencing the similarity degrees from small to large and selecting websites corresponding to the first n similarity degrees as the target websites, wherein n is a positive integer greater than or equal to 1; or
And the third selection module is used for sequencing the similarity degrees from small to large and selecting the websites corresponding to the top m% of similarity degrees as the target websites, wherein m is a positive integer which is greater than or equal to 1 and less than or equal to 100.
7. The apparatus of claim 5, wherein the third computing unit comprises:
a first calculation module for calculating the similarity by the formula
Figure FDA0002340366790000031
wherein $x_i$ is a first ratio in the first access behavior distribution, $y_i$ is a second ratio in the second access behavior distribution, i sequentially takes 1 to n, and n is the number of the first ratios and the second ratios; or
a second calculation module for calculating the similarity by the formula
Figure FDA0002340366790000032
wherein $x_i$ is a first ratio in the first access behavior distribution, $y_i$ is a second ratio in the second access behavior distribution, i sequentially takes 1 to n, and n is the number of the first ratios and the second ratios.
8. The apparatus of claim 5, wherein the plurality of target browsers comprises a first target browser and a second target browser, and wherein the merging unit comprises:
the sorting module is used for sorting the first ratios in a descending order;
a determining module, configured to determine that the browsers corresponding to the first k-1 first ratios are the first target browser, where k is a positive integer greater than or equal to 1;
and the merging module is used for merging the browsers corresponding to the remaining n-k+1 first ratios into the second target browser and merging the n-k+1 first ratios into the ratio of the second target browser, wherein the ratio of the second target browser is smaller than the (k-1)-th first ratio.
CN201511019106.7A 2015-12-29 2015-12-29 Method and device for detecting abnormal website traffic Active CN106936778B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511019106.7A CN106936778B (en) 2015-12-29 2015-12-29 Method and device for detecting abnormal website traffic

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511019106.7A CN106936778B (en) 2015-12-29 2015-12-29 Method and device for detecting abnormal website traffic

Publications (2)

Publication Number Publication Date
CN106936778A CN106936778A (en) 2017-07-07
CN106936778B true CN106936778B (en) 2020-05-05

Family

ID=59441385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511019106.7A Active CN106936778B (en) 2015-12-29 2015-12-29 Method and device for detecting abnormal website traffic

Country Status (1)

Country Link
CN (1) CN106936778B (en)


Also Published As

Publication number Publication date
CN106936778A (en) 2017-07-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant