CN109787960B

CN109787960B - Abnormal flow data identification method, abnormal flow data identification device, abnormal flow data identification medium, and electronic device

Info

Publication number: CN109787960B
Application number: CN201811557182.7A
Authority: CN
Inventors: 孙家棣; 马宁; 谢波
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2018-12-19
Filing date: 2018-12-19
Publication date: 2022-09-02
Anticipated expiration: 2038-12-19
Also published as: CN109787960A

Abstract

The invention discloses an abnormal traffic data identification method, device, medium, and electronic device. The method comprises: acquiring a traffic data sample set from the traffic data of blacklist users and whitelist users; for each traffic data sample in the set, forming a vector from a second number of features selected from a predetermined feature set according to their chi-square values, and clustering all the vectors into a third number of classes; for each combination of the second number and the third number, determining the sum of the error numbers of the resulting classes, taking the combination with the smallest sum as the selected second number and selected third number, and taking the classes obtained under that combination as the selected classes; clustering the traffic data of pending users into the selected classes; and judging whether a pending user is abnormal based on the risk score of the pending user derived from the clustering result. The method screens out a clustering scheme and features suited to identifying abnormal traffic, which improves the accuracy of abnormal traffic identification in the field of network security.

Description

Abnormal flow data identification method, abnormal flow data identification device, abnormal flow data identification medium, and electronic device

Technical Field

The invention relates to the field of internet, in particular to an abnormal flow data identification method, an abnormal flow data identification device, an abnormal flow data identification medium and electronic equipment.

Background

With the advent of the internet era, hackers, speculators, underground-industry ("black production") practitioners, and even ordinary people all want to profit from virtual networks. When technology companies sell new products online, or when internet companies hand out a limited number of coupon cards and red envelopes on their own websites or clients, they often come under attack from abnormal traffic generated by such people. In addition, during the spring travel rush every year, railway ticketing websites are hit by traffic of unknown origin. This traffic usually comes from scalpers who buy tickets in bulk by technical means such as ticket-grabbing software, sometimes buying every ticket in a carriage. It puts traffic pressure on the railway ticketing websites and harms the interests of people who buy tickets normally.

In the prior art, abnormal traffic is identified on the basis of user-behavior tracking points (buried points) and an SDK (Software Development Kit): traffic features such as path repetition degree, the ratio of front-end to back-end tracking points of a device, the number of accesses per IP, and the number of accounts accessed per IP are compared with corresponding thresholds.

The prior art has the following defects. If every feature of a piece of traffic is examined to decide whether it is abnormal, a great amount of computing power is consumed, which is unrealistic. If a fixed set of traffic features is selected for the judgment, an attacker can often obtain the abnormal traffic identification method of the targeted server and reverse-engineer it, upgrading the means of attack and sometimes disguising the corresponding traffic features. The prior art therefore cannot determine which features are suited to identifying abnormal traffic, and its accuracy in identifying abnormal traffic is low.

Disclosure of Invention

In order to solve the technical problem of low accuracy in identifying abnormal flow data in the related art, the invention provides an abnormal flow data identification method, an abnormal flow data identification device, an abnormal flow data identification medium and electronic equipment.

According to an aspect of the present application, there is provided an abnormal flow data identification method, the method including:

dividing all users into blacklist users, white list users and undetermined users, wherein the users have flow data which has the characteristics in a preset characteristic set;

acquiring a first preset number of flow data from the flow data of the blacklist user and the white list user in a preset time period to be used as a flow data sample set, wherein each acquired flow data is a flow data sample;

for each traffic data sample in the traffic data sample set, determining the top second number of features in the predetermined feature set when the features are ranked by chi-square value from high to low, forming a vector of the second number of dimensions for each traffic data sample, and clustering the vectors of all traffic data samples into a third number of classes;

if a clustered class contains more traffic data samples of blacklist users than of whitelist users, the class is a black class; if a clustered class contains more traffic data samples of whitelist users than of blacklist users, the class is a white class; taking the number of traffic data samples of whitelist users in a black class as the error number of that black class; taking the number of traffic data samples of blacklist users in a white class as the error number of that white class;

determining, for different combinations of the second number and the third number, a sum of the number of errors in all classes grouped under each combination;

a combination of the second number and the third number, in which the sum of the error numbers is minimum, is used as a selected second number and a selected third number, and a class aggregated under the combination is used as a selected class;

aggregating each traffic data for a predetermined period of time for a pending user into one of the selected classes;

and taking the risk score of the cluster gathered by each flow data in the predetermined time period of the pending user as the risk score of the flow data, wherein the risk score of the cluster is calculated by the following formula:

wherein score represents the risk score of the class, N0 is the number of traffic data samples of white list users in all traffic data samples gathered by the class, and N1 is the number of traffic data samples of black list users in all traffic data samples gathered by the class;

determining a risk score of the pending user within a predetermined time period based on the risk score of each flow data within the predetermined time period of the pending user;

and determining whether the pending user is an abnormal user or not according to the risk score of the pending user in a preset time period.

According to another aspect of the present application, there is provided an abnormal flow data identifying apparatus, the apparatus including:

a user classification module configured to classify all users into blacklist users, whitelist users and pending users, the users having traffic data;

an acquisition module configured to acquire a first predetermined number of pieces of traffic data from the traffic data of blacklist users and whitelist users within a predetermined time period as a traffic data sample set;

a clustering module configured to determine, for each traffic data sample in the traffic data sample set, the top second number of features in the predetermined feature set when the features are ranked by chi-square value from high to low, form a vector of the second number of dimensions for each traffic data sample, and cluster the vectors of all traffic data samples into a third number of classes;

the judging module is configured to determine that the category is black if the quantity of the traffic data samples of the black list users in the aggregated category is greater than that of the traffic data samples of the white list users; if the quantity of the traffic data samples of the white list users in the aggregated class is larger than that of the traffic data samples of the black list users, the class is a white class;

the first determination module is configured to take the number of the traffic data samples of the white list users in the black class as the error number of the black class; taking the flow data sample number of the blacklist user in the white class as the error number of the white class; determining, for different combinations of the second number and the third number, a sum of the number of errors in all classes grouped under each combination;

a second determination module configured to take a combination of the second number and the third number, under which the sum of the error numbers is minimum, as the selected second number and the selected third number, and take the class aggregated under the combination as the selected class;

an abnormal user determination module configured to cluster each traffic data within a predetermined time period of a pending user into one of the selected classes, wherein the abnormal user determination module is further configured to:

taking the risk score of the class to which each flow data in a predetermined time period of the pending user is gathered as the risk score of the flow data;

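determining a risk score of the pending user within a predetermined time period based on the risk score of each flow data within the predetermined time period of the pending user; and

determining whether the pending user is an abnormal user according to the risk score of the pending user within the predetermined time period.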

According to another aspect of the present application, there is provided a computer readable program medium storing computer program instructions which, when executed by a computer, cause the computer to perform the method as previously described.

According to another aspect of the present application, there is provided an electronic apparatus including:

a processor;

a memory having computer readable instructions stored thereon which, when executed by the processor, implement the method as previously described.

The technical scheme provided by the embodiment of the invention can have the following beneficial effects:

The abnormal traffic data identification method provided by the invention comprises: dividing all users into blacklist users, whitelist users, and pending users, wherein the users have traffic data and the traffic data has features in a predetermined feature set; acquiring a first predetermined number of pieces of traffic data from the traffic data of the blacklist users and whitelist users within a predetermined time period as a traffic data sample set, each acquired piece being a traffic data sample; for each traffic data sample in the set, determining the top second number of features in the predetermined feature set when the features are ranked by chi-square value from high to low, forming a vector of the second number of dimensions for each sample, and clustering the vectors of all samples into a third number of classes; if a clustered class contains more traffic data samples of blacklist users than of whitelist users, the class is a black class; if a clustered class contains more traffic data samples of whitelist users than of blacklist users, the class is a white class; taking the number of whitelist-user samples in a black class as the error number of that black class, and the number of blacklist-user samples in a white class as the error number of that white class; for different combinations of the second number and the third number, determining the sum of the error numbers of all classes obtained under each combination; taking the combination with the smallest sum of error numbers as the selected second number and selected third number, and the classes obtained under that combination as the selected classes; clustering each piece of traffic data of a pending user within the predetermined time period into one of the selected classes; and taking the risk score of the class into which each piece of traffic data of the pending user is clustered as the risk score of that traffic data, wherein the risk score of the class is calculated by the following formula:

wherein score represents the risk score of the class, N0 is the number of traffic data samples of white list users in all traffic data samples gathered by the class, and N1 is the number of traffic data samples of black list users in all traffic data samples gathered by the class; determining a risk score of the pending user within a predetermined time period based on the risk score of each flow data within the predetermined time period of the pending user; and determining whether the pending user is an abnormal user or not according to the risk score of the pending user in a preset time period.

Under this method, the features suited to identifying abnormal traffic are screened out by chi-square value, and the clustering scheme best suited to judging abnormal traffic data is determined by traversing the combinations of the second number and the third number, which improves the accuracy of abnormal traffic identification in the field of network security.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a schematic diagram of an application environment illustrating a method for identifying anomalous traffic data in accordance with an exemplary embodiment;

FIG. 2 is a flow diagram illustrating a method of abnormal flow data identification in accordance with an exemplary embodiment;

FIG. 3 is a schematic diagram illustrating a clustering process in accordance with an exemplary embodiment;

FIG. 4 is a flowchart illustrating details of step 280 according to one embodiment illustrated in a corresponding embodiment in FIG. 2;

FIG. 5 is a flowchart illustrating details of step 290 according to one embodiment illustrated in a corresponding embodiment in FIG. 2;

FIG. 6 is a flowchart illustrating details of step 250 according to one embodiment illustrated in a corresponding embodiment of FIG. 2;

FIG. 7 is a schematic block diagram illustrating an abnormal flow data identification apparatus in accordance with an exemplary embodiment;

FIG. 8 is a block diagram illustrating an example electronic device for implementing the above-described method for identifying abnormal flow data, according to an example embodiment;

FIG. 9 illustrates a computer-readable storage medium for implementing the above-described abnormal traffic data identification method according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.

The present invention may be implemented on physical devices including, but not limited to, a server, a gateway device, a switch, a desktop computer, a workstation, a notebook computer, and a mobile terminal. The implementing terminal can communicate with the outside through a network and may comprise a combination of software, hardware, or firmware.

Fig. 1 is a schematic diagram of an application environment of a method for identifying abnormal traffic data according to an exemplary embodiment. As shown in Fig. 1, the application environment includes an internet service provider 110, a normal visitor 120, and an abnormal traffic generator 130. The internet service provider 110 provides information and services to visitors over the network, and its normal operation is affected when it encounters abnormal traffic. A normal visitor 120 is a user who does not put heavy traffic pressure on the internet service provider, whereas a user who generates heavy traffic is an abnormal traffic generator 130. Under normal conditions, the internet service provider 110 runs steadily and is rarely affected by heavy traffic. The abnormal traffic generator 130 is usually an illegal speculator seeking profit. When it generates abnormal traffic, a large number of repeated access requests reach the internet service provider 110 in a short time, which usually has a sudden and severe impact: it interferes with access by normal visitors 120 and also causes economic loss to the internet service provider 110.

The present disclosure first provides a method of abnormal traffic data identification. Abnormal traffic data refers to traffic that makes repeated accesses many times within a short period or occupies a large amount of bandwidth. Fig. 2 is a flowchart illustrating a method of identifying abnormal traffic data according to an exemplary embodiment. As shown in Fig. 2, the method comprises the following steps:

Step 110, dividing all users into blacklist users, whitelist users, and pending users, wherein the users have traffic data.

All users refers to the users who have accessed the internet service provider within a specific time period.

A blacklist user is a user known in advance to belong to the underground industry or to have exhibited abnormal traffic behavior. A whitelist user is a user, such as an internal employee or a senior user of the internet service provider, whom the provider has confirmed to be unlikely to exhibit abnormal traffic behavior. Pending users are all users other than blacklist users and whitelist users. Traffic data is the data that the internet service provider records about a user's access behavior when the user accesses it, such as the access account, access time, IP address, front-end and back-end login modes of the device, and mobile phone number segment.

In one embodiment, the internet service provider marks the accounts and IP addresses of internal staff and senior users as whitelist users, marks the accounts and IP addresses of users with a history of abnormal traffic behavior as blacklist users, and leaves pending users unmarked. All users are thereby divided into blacklist users, whitelist users, and pending users, and a server of the internet service provider stores the traffic data of all users.
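The marking scheme in this embodiment can be sketched in a few lines of Python. This is only an illustration: the function name `partition_users`, the use of plain string identifiers for accounts or IP addresses, and the rule that a blacklist mark takes precedence over a whitelist mark are assumptions that the text does not state.

```python
from typing import Dict, Iterable, List, Set


def partition_users(all_users: Iterable[str],
                    whitelist_marks: Set[str],
                    blacklist_marks: Set[str]) -> Dict[str, List[str]]:
    """Split user identifiers into blacklist, whitelist, and pending groups."""
    groups: Dict[str, List[str]] = {"blacklist": [], "whitelist": [], "pending": []}
    for user in all_users:
        if user in blacklist_marks:
            # Assumption: a blacklist mark wins if a user carries both marks.
            groups["blacklist"].append(user)
        elif user in whitelist_marks:
            groups["whitelist"].append(user)
        else:
            # Unmarked users are the pending users.
            groups["pending"].append(user)
    return groups
```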

In one embodiment, the flow data has a characteristic in a predetermined set of characteristics. The characteristics refer to technical parameters which are selected according to the statistical conditions of the traffic data of the whole Internet and can possibly represent whether the traffic data is abnormal traffic data or not. The predetermined feature set refers to a set of predetermined features.

In one embodiment, the predetermined feature set includes the following features: path repetition ranking, user risk-control parameter abnormality rate, back-end tracking point proportion, risk-control IP divergence rate, number of accounts accessed per risk-control IP, number of accesses per risk-control IP, number of risk-control IP_WIFI names, accumulated risk score of a risk-control IP, mean number of users per risk-control IP in a period, variance of the number of users per risk-control IP in a period, mean number of accesses per risk-control IP in a period, variance of the number of accesses per risk-control IP in a period, mean number of user logins per mobile phone number segment in a period, and variance of the number of user logins per mobile phone number segment in a period. Because abnormal traffic data is variable and can be disguised, a feature set composed of these features has the advantage that whether a piece of traffic data is abnormal can be judged from as many different dimensions as possible.

Step 120, obtaining a first predetermined number of traffic data from the traffic data of the blacklist user and the whitelist user in a predetermined time period, as a traffic data sample set.

Because an internet service provider is likely to have a very large amount of user traffic data, processing all of it would consume substantial resources and is neither practical nor necessary; sampling narrows the range of data acquired. The traffic data of blacklist users and whitelist users is chosen because it is already clear whether that data is abnormal, so it is reasonable to use it as the reference for judging abnormal traffic data. The traffic data is taken from a predetermined time period because the internet changes constantly, and the features of abnormal traffic data and the technical means of generating it change from day to day; sampling within a time period keeps the data timely and targeted, which improves the accuracy of identifying abnormal traffic data.

In one embodiment, the traffic data sample set contains the same number of pieces of traffic data from blacklist users as from whitelist users. In most cases there are far more whitelist users than blacklist users, so a sample set formed simply by acquiring a first predetermined number of pieces of traffic data would contain very different amounts of blacklist and whitelist traffic data. This sample imbalance can trap the method in a local optimum: the under-represented samples are too few to reasonably reflect the overall characteristics and regularities of the sample set. Using equal numbers allows the sample set to reflect those overall characteristics and regularities and avoids falling into a local optimum.
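A balanced sample set of this kind could be drawn as in the following minimal sketch. The record layout (plain dictionaries), the label convention (1 for blacklist-user traffic, 0 for whitelist-user traffic), the even split of the first predetermined number, and the name `balanced_sample` are all illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple


def balanced_sample(black_records: Sequence[dict],
                    white_records: Sequence[dict],
                    first_number: int,
                    seed: int = 0) -> List[Tuple[dict, int]]:
    """Draw first_number records, half from each list, as (record, label) pairs."""
    per_class = first_number // 2
    if per_class > min(len(black_records), len(white_records)):
        raise ValueError("not enough records to draw a balanced sample")
    rng = random.Random(seed)
    # Label 1 marks blacklist-user traffic, label 0 whitelist-user traffic.
    sample = [(r, 1) for r in rng.sample(list(black_records), per_class)]
    sample += [(r, 0) for r in rng.sample(list(white_records), per_class)]
    rng.shuffle(sample)
    return sample
```

Sampling without replacement means the scarcer group (usually the blacklist traffic) caps the size of the balanced set.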

Step 130, for each traffic data sample, determining the top second number of features in the predetermined feature set when the features are ranked by chi-square value from high to low, forming a vector of the second number of dimensions for each traffic data sample, and clustering the vectors of all traffic data samples into a third number of classes.

In one embodiment, there are 14 features in the predetermined feature set, and 1 feature, 2 features, ..., or 14 features can be selected from the feature set at a time, so the total number of possible feature selections is

$$\sum_{k=1}^{14} \binom{14}{k} = 2^{14} - 1 = 16383.$$

Generally, the number of clustered classes cannot be too large; otherwise there are too few similar samples in each class to represent the common characteristics of the class. If the samples are clustered into at most 20 classes, there are 19 possible class counts (2 through 20), so the number of combinations of selected features and class counts is

$$16383 \times 19 = 311277.$$

If every combination of features and class counts were tried, with the traffic data samples in the sample set clustered each time, the cost and workload of judging abnormal traffic data would increase greatly. Screening the clustering features by chi-square value has the advantage that it greatly reduces the number of feature selections and improves the efficiency of judging abnormal traffic data.
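The counting argument above can be checked with a short Python sketch. The constant names are illustrative. The figure for chi-square ranking (one candidate feature set per value of the second number, hence 14 choices) is our reading of the method rather than a number stated in the text.

```python
from math import comb

N_FEATURES = 14    # size of the predetermined feature set in this embodiment
MAX_CLASSES = 20   # upper bound on the number of clusters in this embodiment

# Every non-empty subset of the 14 features: C(14,1) + ... + C(14,14).
feature_subsets = sum(comb(N_FEATURES, k) for k in range(1, N_FEATURES + 1))

# Class counts 2, 3, ..., 20 give 19 choices.
class_count_choices = MAX_CLASSES - 1

exhaustive_combinations = feature_subsets * class_count_choices

# Assumption: with chi-square ranking, choosing the second number k fixes the
# feature set to the top-k features, so only 14 feature choices remain.
ranked_combinations = N_FEATURES * class_count_choices

print(feature_subsets, exhaustive_combinations, ranked_combinations)
```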

In one embodiment, the chi-square value is determined by the following equation:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where $\chi^2$ is the chi-square value, O is the observed count, and E is the theoretical (expected) count. The larger the chi-square value of a feature, the better the final clustering effect and the better the abnormal traffic data can be judged.

In one embodiment, 10000 pieces of traffic data are selected and three features are examined: path repetition degree, number of IP accesses, and number of accounts accessed per IP. Whenever the value of a feature exceeds a predetermined threshold and the traffic data with that feature is abnormal traffic data, the observed count for that feature is increased by 1. The statistics for this embodiment are shown in the following table. Path repetition degree has the largest chi-square value and is therefore the best of the three features for clustering.

Table: observed and theoretical counts with which each feature in the predetermined feature set characterizes abnormal traffic data

Feature	Observed count (O)	Theoretical count (E)	O-E	(O-E)²/E
Path repetition degree	8851	1000	7851	61638.201
Number of IP accesses	3564	1500	2064	2840.064
Number of accounts accessed per IP	4925	3000	1925	1235.208
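The last column of the table, and the resulting feature ranking, can be reproduced with the sketch below. The counts are taken from the table; the helper names `chi_square_term` and `top_features` are illustrative and not part of the source.

```python
from typing import Dict, List, Tuple

# (observed count O, theoretical count E) per feature, as listed in the table.
observed_expected: Dict[str, Tuple[int, int]] = {
    "path repetition degree": (8851, 1000),
    "number of IP accesses": (3564, 1500),
    "number of accounts accessed per IP": (4925, 3000),
}


def chi_square_term(observed: float, expected: float) -> float:
    """Return (O - E)^2 / E for one feature."""
    return (observed - expected) ** 2 / expected


def top_features(scores: Dict[str, float], second_number: int) -> List[str]:
    """Return the second_number features with the highest chi-square values."""
    return sorted(scores, key=scores.get, reverse=True)[:second_number]


scores = {name: chi_square_term(o, e) for name, (o, e) in observed_expected.items()}
ranked = top_features(scores, len(scores))
```

Here `top_features(scores, 2)` would keep path repetition degree and number of IP accesses as the two clustering dimensions.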

In one embodiment, the K-means algorithm is used to cluster the vectors of the traffic data samples. The K-means algorithm is a hard clustering algorithm and a typical prototype-based objective-function clustering method: it takes a distance from the data points to the prototypes as the objective function to be optimized and derives the update rule of the iteration by solving for the extremum of that function. The general flow of the K-means algorithm is as follows: randomly select k of the n data objects as initial cluster centers; compute the distance between every data object and each of the k cluster centers, and assign each object to the class represented by its nearest center; recompute the center of each class; and iterate with the recomputed centers until the clustering no longer changes or the number of iterations reaches a preset threshold.

Fig. 3 is a schematic diagram illustrating a clustering process according to an exemplary embodiment. Because the elements in Fig. 3 are clustered by distance in a two-dimensional space, the clustering there is based on two-dimensional vectors. The process first selects the number of clusters, i.e. the third number, which is 3 in this embodiment, so three initial cluster centers are selected (the black dots in Fig. 3). The vectors represented by all points are then clustered into three classes: the distance from each point to each of the three black dots is computed, and each point is assigned to the class of its nearest black dot. After the first round of clustering, the center point of each class (the cross marks in Fig. 3) is recomputed, and all points are clustered again around the new center points until the clustering no longer changes or the number of iterations reaches a preset threshold.
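The K-means flow described above can be sketched in plain Python as follows. This is a minimal illustration rather than the patented implementation: the function names, the fixed random seed, the use of Euclidean distance, and the choice to keep the previous center when a class becomes empty are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points: List[Vector]) -> Tuple[float, ...]:
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def k_means(vectors: List[Vector], k: int, max_iter: int = 100,
            seed: int = 0) -> Tuple[List[int], List[Tuple[float, ...]]]:
    """Cluster vectors into k classes; return (class index per vector, centers)."""
    rng = random.Random(seed)
    # Randomly pick k of the n data objects as the initial cluster centers.
    centers = [tuple(v) for v in rng.sample(vectors, k)]
    assignment = [-1] * len(vectors)
    for _ in range(max_iter):
        # Assign every vector to the class of its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(v, centers[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # the clustering no longer changes
        assignment = new_assignment
        # Recompute the center of each class.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # assumption: an empty class keeps its previous center
                centers[c] = centroid(members)
    return assignment, centers
```

The `max_iter` argument plays the role of the preset iteration threshold, and `k` corresponds to the third number.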

In one embodiment, the distance between a point and a center point is the Euclidean distance. For example, if a point represents the vector $(x_1, y_1)$ and the vector of the center point is $(x_2, y_2)$, the Euclidean distance between the two points is:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

In one embodiment, the features of a traffic data sample differ in unit, dimension, and magnitude, and the clustering process computes the distance between each feature vector and the center vector. If the features are on different scales, the resulting distance cannot objectively reflect the importance of each feature. Therefore, before the vector of the second number of dimensions is formed, each of the top second number of features is first normalized to [0,1] according to the following formula:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is any feature to be normalized among the top second number of features, $x_{\min}$ is the smallest value of that feature in the traffic data sample set, $x_{\max}$ is the largest value of that feature in the traffic data sample set, and $x'$ is the feature after normalization. This has the advantage of reflecting the actual contribution of each feature to the clustering.
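Min-max normalization of each feature to [0,1] over the sample set can be sketched as below. The function name and the row-per-sample layout are illustrative, and mapping a constant feature (largest value equal to smallest value) to 0.0 is an assumption made only to avoid division by zero; the text does not address that case.

```python
from typing import List, Sequence


def min_max_normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale every column (feature) of rows to [0, 1] using its own min and max."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # Assumption: a constant feature carries no information, so map it to 0.0.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return normalized
```

After this step a feature measured in thousands of accesses and a feature measured as a rate between 0 and 1 contribute to the distance on the same scale.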

Step 140, for different combinations of the second number and the third number, under each combination, determining the sum of the error numbers in all the classes to be aggregated.

If a clustered class contains more traffic data samples of blacklist users than of whitelist users, the class is a black class; if a clustered class contains more traffic data samples of whitelist users than of blacklist users, the class is a white class. The number of traffic data samples of whitelist users in a black class is taken as the error number of that black class, and the number of traffic data samples of blacklist users in a white class is taken as the error number of that white class.
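The labelling and error-counting rule can be sketched as follows. The label convention (1 for a blacklist-user sample, 0 for a whitelist-user sample) and the function names are assumptions. The text does not say how to treat a class with equal counts, so the sketch marks it as a tie and counts the whitelist samples as errors purely for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def class_errors(assignments: List[int], labels: List[int]) -> Dict[int, Tuple[str, int]]:
    """Map each class index to (class kind, error number)."""
    black: Counter = Counter()
    white: Counter = Counter()
    for cluster, label in zip(assignments, labels):
        (black if label == 1 else white)[cluster] += 1
    result: Dict[int, Tuple[str, int]] = {}
    for cluster in set(assignments):
        n1, n0 = black[cluster], white[cluster]
        if n1 > n0:
            result[cluster] = ("black", n0)  # whitelist samples in a black class
        elif n0 > n1:
            result[cluster] = ("white", n1)  # blacklist samples in a white class
        else:
            result[cluster] = ("tie", n0)    # assumption: ties are not covered by the text
    return result


def total_errors(assignments: List[int], labels: List[int]) -> int:
    """Sum of the error numbers over all classes for one clustering."""
    return sum(err for _, err in class_errors(assignments, labels).values())
```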

Step 150, the combination of the second number and the third number with the smallest sum of the error numbers is used as the selected second number and the selected third number, and the class aggregated under the combination is used as the selected class.

The smaller the error number is, the stronger the distinguishing capability of the clustering mode on the traffic data samples of the blacklist users and the traffic data samples of the white list users is, and the clustering mode is more suitable for distinguishing abnormal traffic data. This has the advantage that to a certain extent the combination of the second and third numbers and the aggregated class is selected which is most suitable for distinguishing between the abnormal flow data.
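The search over combinations in steps 140 and 150 amounts to a small grid search, sketched below. The callable `error_sum` stands in for "cluster with the top `second_number` features into `third_number` classes and sum the error numbers"; it and the other names are hypothetical. Keeping the first combination found when several share the minimum is an assumption, since the text does not specify tie-breaking.

```python
from typing import Callable, Iterable, Optional, Tuple


def select_combination(second_numbers: Iterable[int],
                       third_numbers: Iterable[int],
                       error_sum: Callable[[int, int], int]
                       ) -> Optional[Tuple[int, int, int]]:
    """Return (smallest error sum, selected second number, selected third number)."""
    third_numbers = list(third_numbers)
    best: Optional[Tuple[int, int, int]] = None
    for second_number in second_numbers:
        for third_number in third_numbers:
            errors = error_sum(second_number, third_number)
            # Strict comparison keeps the first combination on ties (assumption).
            if best is None or errors < best[0]:
                best = (errors, second_number, third_number)
    return best
```

With 14 features and class counts from 2 to 20, one would call `select_combination(range(1, 15), range(2, 21), error_sum)`.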

Step 160, aggregating each flow data in the predetermined time period of the pending user into one of the selected classes.

In one embodiment, there are multiple selected classes. This is because, although there is only one selected second number and one selected third number, the selected second number covers several different features and the clustering yields a relatively large number of classes.

In one embodiment, the aim is to determine whether the pending user is an abnormal user. The flow data of a user is not necessarily all normal, nor necessarily all abnormal. A time period is therefore selected, the class into which each flow data of the user in that time period is clustered is determined, and whether the pending user is an abnormal user is judged from the clustering result.
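Assigning a pending user's flow data to the selected classes can be sketched as a nearest-center lookup. This assumes each selected class is represented by a center vector and that Euclidean distance is used; the text mentions the distance to the center vector but does not name the metric, and the names below are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def nearest_class(vector: Sequence[float],
                  centers: Dict[int, Sequence[float]]) -> int:
    """Return the id of the class whose center is closest to `vector`."""
    return min(centers, key=lambda cid: math.dist(vector, centers[cid]))


def assign_flows(flows: Sequence[Sequence[float]],
                 centers: Dict[int, Sequence[float]]) -> List[int]:
    """Assign every flow data vector of one user to a selected class."""
    return [nearest_class(v, centers) for v in flows]


# Illustrative normalized vectors and centers, not data from the source.
centers = {0: [0.1, 0.1], 1: [0.9, 0.8]}
user_flows = [[0.2, 0.0], [0.8, 0.9], [0.95, 0.7]]
assigned = assign_flows(user_flows, centers)
```

The vectors are expected to use the same selected features and the same normalization as the training samples.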

Step 170, the risk score of the class into which each flow data of the pending user within the predetermined time period is clustered is taken as the risk score of that flow data.

In one embodiment, the risk score for the clustered class is calculated by the following formula:

where score represents the risk score of the class, N0 is the number of flow data samples of whitelist users among all flow data samples of the class, and N1 is the number of flow data samples of blacklist users among all flow data samples of the class.

For a fixed number of flow data samples in the flow data sample set, the larger the proportion of flow data samples of blacklist users in the class into which a flow data is clustered, the more likely that flow data is abnormal flow data. The advantage is that whether a flow data is abnormal flow data can be determined more objectively.
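The exact score formula is not reproduced in this text. One form consistent with the description, in which the score grows with the proportion of blacklist samples in the class, is score = N1 / (N0 + N1). The sketch below uses that form purely as an assumption; any scaling factor in the original formula is unknown.

```python
def class_risk_score(n0: int, n1: int) -> float:
    """Risk score of a class as the share of blacklist samples.

    n0: number of whitelist-user flow data samples in the class
    n1: number of blacklist-user flow data samples in the class
    Assumption: score = n1 / (n0 + n1); the source formula is not shown.
    """
    total = n0 + n1
    if total == 0:
        raise ValueError("class has no samples")
    return n1 / total


score_mostly_black = class_risk_score(n0=1, n1=3)
score_mostly_white = class_risk_score(n0=3, n1=1)
```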

Step 180, the risk score of the pending user within the predetermined time period is determined based on the risk score of each flow data of the pending user within the predetermined time period.

The risk score of a user may be determined from the risk scores of the user's flow data within the predetermined time period.

Step 190, whether the pending user is an abnormal user is determined according to the risk score of the pending user within the predetermined time period.

Fig. 4 is a flowchart showing the details of step 280 according to an embodiment based on the embodiment corresponding to Fig. 2. As shown in Fig. 4, step 280 includes the following steps:

Step 281, obtaining the average value of the risk scores of the flow data of the pending user within the predetermined time period.

Step 282, taking the average value as the risk score of the pending user within the predetermined time period.

In one embodiment, the risk scores of the individual flow data of the pending user within the predetermined time period differ greatly, and a single flow data with an especially large or small risk score could distort the risk score of the user. Taking the average value as the risk score of the pending user within the predetermined time period reflects, on the whole, the risk that the user generates abnormal flow.

In another embodiment, the maximum value of the risk scores of the traffic data of the pending user within a predetermined time period is taken as the risk score of the user. This has the advantage of identifying the at-risk user to the greatest extent possible, since as long as the user has traffic data with a greater risk score, it indicates that the user is at a greater risk.
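The two aggregation choices described above (average and maximum) can be compared in a short sketch. The function name and the `mode` parameter are hypothetical, and the scores are illustrative values.

```python
from statistics import mean
from typing import Sequence


def user_risk_score(flow_scores: Sequence[float], mode: str = "mean") -> float:
    """Aggregate per-flow risk scores of one user over the time period."""
    if not flow_scores:
        raise ValueError("user has no flow data in the period")
    if mode == "mean":
        return mean(flow_scores)  # overall risk of the user
    if mode == "max":
        return max(flow_scores)   # most cautious: worst flow decides
    raise ValueError(f"unknown mode: {mode}")


scores = [0.25, 0.5, 0.75]
avg_score = user_risk_score(scores, "mean")
max_score = user_risk_score(scores, "max")
```

The average dampens the effect of a single extreme flow, while the maximum flags a user as soon as one flow carries a high risk score.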

Fig. 5 is a flowchart showing the details of step 290 according to an embodiment based on the embodiment corresponding to Fig. 2. As shown in Fig. 5, step 290 includes the following steps:

Step 291, when the risk score of the pending user within the predetermined time period is greater than the risk score threshold, determining that the pending user is an abnormal user;

Step 292, when the risk score of the pending user within the predetermined time period is not greater than the risk score threshold, determining that the pending user is not an abnormal user.
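The decision rule of steps 291 and 292 is a strict comparison, so a score exactly equal to the threshold is treated as not abnormal. A minimal sketch; the threshold value 0.6 is a placeholder, not a value from the source.

```python
def is_abnormal_user(user_score: float, threshold: float) -> bool:
    """Step 291 / 292: abnormal only when the score exceeds the threshold."""
    return user_score > threshold


THRESHOLD = 0.6  # hypothetical value; the source derives it from historical data
at_threshold = is_abnormal_user(0.6, THRESHOLD)
above_threshold = is_abnormal_user(0.61, THRESHOLD)
```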

In one embodiment, the risk score threshold is determined from statistics over a large amount of historical data. When the risk score of a pending user is greater than the risk score threshold, the user can be determined to be an abnormal user, so that harm caused by the user can be prevented by measures such as blocking the user's IP.

The optimal clustering mode is screened by traversing all combinations of the number of clustered classes and the number of features selected by chi-square ranking. If the predetermined feature set has 14 features, there are 14 ways to choose the features by chi-square value; if the number of clusters ranges from a minimum of 2 to a maximum of 20, there are 19 ways to choose the number of clusters. The number of combinations of the second number and the third number is therefore 14 × 19 = 266, so determining the combination with the best clustering effect is still costly.

In one embodiment, "for each flow data sample in the flow data sample set" refers to: for each flow data sample in a predetermined proportion of the flow data sample set.
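The count of 266 combinations can be checked by enumerating them directly. A minimal sketch, assuming the second number ranges over 1 to 14 features and the third number over 2 to 20 classes; the variable names are illustrative.

```python
from itertools import product

feature_counts = range(1, 15)  # second number: top-1 ... top-14 features
cluster_counts = range(2, 21)  # third number: 2 ... 20 classes

# Every (second number, third number) pair that would have to be clustered.
combinations = list(product(feature_counts, cluster_counts))
n_combinations = len(combinations)
```

Each of these pairs requires one full clustering run, which is why the text goes on to cluster only a portion of the sample set first.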

In one embodiment, the predetermined ratio is 20%.

Fig. 6 is a flowchart showing the details of step 250 according to an embodiment based on the embodiment corresponding to Fig. 2. As shown in Fig. 6, step 250 includes:

Step 251, all combinations of the second number and the third number are sorted from small to large according to the sum of the error numbers in the clustered classes. As described above, the sum of the error numbers indicates well whether the classes clustered under a combination of the second number and the third number can distinguish abnormal flow data, so the combinations are sorted by this sum.

Step 252, the first fourth number of combinations of the second number and the third number in the sorted order are taken as candidate second numbers and candidate third numbers.

In this scheme, only a part of the flow data sample set is clustered at first, so a combination of the second number and the third number determined directly from that part would not be representative. Therefore, the combinations with better clustering effect are obtained first, and further screening is then performed.

Step 253, for each flow data sample in the flow data sample set, the first candidate second number of features, with chi-square values arranged from high to low in the predetermined feature set of the flow data sample, are determined to form a vector of the candidate second number dimension, and the vectors of all flow data samples are clustered into the candidate third number of classes corresponding to that candidate second number in the combination.

Because only a part of the flow data sample set was clustered to screen out the combinations with better clustering effect, the whole flow data sample set needs to be clustered only under the screened combinations, which greatly improves clustering efficiency.

Step 254, for the combinations of the different candidate second number and candidate third number, under each combination, the sum of the error numbers in all the clustered classes is determined.

Step 255, the combination of the candidate second number and the candidate third number with the smallest sum of the error numbers is used as the selected second number and the selected third number, and the class aggregated under the combination is used as the selected class.

The screened combinations of the second number and the third number only have relatively small sums of error numbers on the partial sample. A further round of clustering over the entire flow data sample set ensures to a greater extent the accuracy of the selected class finally used for judging abnormal flow data.

The advantage is that, by first clustering only a part of the flow data sample set, processing efficiency is greatly improved and the selection of the selected class is optimized, without greatly reducing the accuracy with which the finally determined selected class identifies abnormal flow data.
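Steps 251 to 255 describe a two-stage search: rank all combinations on a portion of the samples, keep the best few, then re-evaluate only those on the full set. The sketch below assumes a caller-supplied `evaluate(samples, second, third)` function that clusters the given samples and returns the sum of error numbers. That function, the random sampling, and the default values are hypothetical; only the 20% proportion appears in the text.

```python
import random
from itertools import product
from typing import Callable, List, Sequence, Tuple

Combo = Tuple[int, int]  # (second number, third number)


def two_stage_select(samples: Sequence,
                     combos: Sequence[Combo],
                     evaluate: Callable[[Sequence, int, int], float],
                     proportion: float = 0.2,
                     fourth_number: int = 5,
                     seed: int = 0) -> Combo:
    """Screen combinations on a subset, then pick the best on all samples."""
    rng = random.Random(seed)
    subset_size = max(1, int(len(samples) * proportion))
    subset = rng.sample(list(samples), subset_size)
    # Stage 1 (steps 251-252): sort by error sum on the subset, keep the top few.
    ranked: List[Combo] = sorted(combos, key=lambda c: evaluate(subset, c[0], c[1]))
    candidates = ranked[:fourth_number]
    # Stage 2 (steps 253-255): re-cluster the full set for the candidates only.
    return min(candidates, key=lambda c: evaluate(samples, c[0], c[1]))


def toy_evaluate(data: Sequence, second: int, third: int) -> float:
    """Stand-in for real clustering: error is lowest at (3, 4)."""
    return len(data) * (abs(second - 3) + abs(third - 4))


all_combos = list(product(range(1, 6), range(2, 7)))
best = two_stage_select(list(range(100)), all_combos, toy_evaluate)
```

With 25 combinations and a fourth number of 5, the full sample set is clustered 5 times instead of 25, at the cost of 25 cheaper runs on the 20% subset.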

The disclosure also provides an abnormal flow data identification device, and the following is an embodiment of the device.

Fig. 7 is a schematic block diagram illustrating an abnormal flow data identifying apparatus according to an exemplary embodiment. As shown in fig. 7, the apparatus 700 includes:

a user classification module 710 configured to classify all users into blacklist users, whitelist users and pending users, the users having traffic data;

a first obtaining module 720, configured to obtain, as a traffic data sample set, a first predetermined number of traffic data from the traffic data of the blacklisted user and the whitelisted user within a predetermined time period;

a clustering module 730 configured to determine, for each flow data sample in the set of flow data samples, a first second number of features in the predetermined feature set for each flow data sample, the chi-squared value of which is arranged from high to low, form a vector of a second number of dimensions for each flow data sample, and cluster the vectors of all flow data samples into a third number of classes;

a judging module 740 configured to determine that a clustered class is a black class if the number of flow data samples of blacklist users in the class is greater than the number of flow data samples of whitelist users, and that the class is a white class if the number of flow data samples of whitelist users in the class is greater than the number of flow data samples of blacklist users;

a first determining module 750 configured to take the number of traffic data samples of the white-listed user in the black class as the error number of the black class; taking the flow data sample number of the blacklist user in the white class as the error number of the white class; determining, for different combinations of the second number and the third number, a sum of the number of errors in all classes grouped under each combination;

a second determination module 760 configured to determine, as the selected second number and the selected third number, a combination of the second number and the third number at which the sum of the error numbers is minimum, under which the aggregated class is determined as the selected class;

an abnormal user judging module 770 configured to gather each traffic data within a predetermined time period of the pending user into one of the selected classes; taking the risk score of the class to which each flow data in a predetermined time period of the pending user is gathered as the risk score of the flow data; determining a risk score of the pending user within a predetermined time period based on the risk score of each flow data within the predetermined time period of the pending user; and determining whether the pending user is an abnormal user or not according to the risk score of the pending user in a preset time period.

According to a third aspect of the present disclosure, there is also provided an electronic device capable of implementing the above method.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."

An electronic device 800 according to this embodiment of the invention is described below with reference to fig. 8. The electronic device 800 shown in fig. 8 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present invention.

As shown in fig. 8, electronic device 800 is in the form of a general purpose computing device. The components of the electronic device 800 may include, but are not limited to: the at least one processing unit 810, the at least one memory unit 820, and a bus 830 that couples the various system components including the memory unit 820 and the processing unit 810.

The storage unit stores program code that can be executed by the processing unit 810, so that the processing unit 810 performs the steps according to the various exemplary embodiments of the present invention described in the "exemplary methods" section above in this specification.

The storage unit 820 may include readable media in the form of volatile storage units, such as a random access storage unit (RAM)821 and/or a cache storage unit 822, and may further include a read only storage unit (ROM) 823.

Storage unit 820 may also include a program/utility 824 having a set (at least one) of program modules 825, such program modules 825 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 830 may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 800 may also communicate with one or more external devices 1000 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 800 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 850. Also, the electronic device 800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 via the bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

From the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions that enable a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to execute the method according to the embodiments of the present disclosure.

According to a fourth aspect of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-mentioned method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.

Referring to fig. 9, a program product 900 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. An abnormal flow data identification method, characterized in that the method comprises:

for each flow data sample in the flow data sample set, determining the first second number of features of a preset feature set of each flow data sample, wherein chi-square values of the preset feature set of each flow data sample are arranged from high to low, forming a vector of a second number dimension of each flow data sample, and clustering vectors of all flow data samples into a third number of classes;

determining the sum of the error numbers in all the grouped classes for different combinations of the second number and the third number under each combination;

2. The method of claim 1, wherein determining the risk score for the pending user for the predetermined time period based on the risk score for each flow data for the pending user for the predetermined time period comprises in particular:

acquiring the average value of the risk scores of each flow data of the undetermined user in a preset time period;

and taking the average value as the risk score of the pending user in a preset time period.

3. The method of claim 1, wherein the determining whether the pending user is an abnormal user according to the risk score of the pending user within a predetermined time period specifically comprises:

when the risk score of the undetermined user in a preset time period is larger than a risk score threshold value, judging that the undetermined user is an abnormal user;

and when the risk score of the undetermined user in a preset time period is not larger than the risk score threshold value, judging that the undetermined user is not an abnormal user.

4. The method of claim 1, wherein prior to constructing the vector of the second number dimension, the method further comprises:

normalizing each of the first second number of features to [0,1] according to the following equation:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is any feature to be normalized among the first second number of features, $x_{\min}$ is the smallest value of that feature in the flow data sample set, $x_{\max}$ is the largest value of that feature in the flow data sample set, and $x'$ is the feature after normalization.

5. The method of claim 1, wherein the first predetermined number of traffic data includes an equal number of traffic data for blacklisted and whitelisted users.

6. The method according to claim 1, wherein the step of, for each traffic data sample in the set of traffic data samples, specifically comprises:

for each of a predetermined proportion of the flow data samples in the set of flow data samples;

the combination of the second number and the third number, which minimizes the sum of the error numbers, is used as the selected second number and the selected third number, and the class grouped under the combination is specifically used as the selected class, and includes:

sorting all combinations of the second number and the third number from small to large according to the sum of the error numbers in the clustered classes;

taking the first fourth number of combinations of the second number and the third number in the sorted order as candidate second numbers and candidate third numbers;

for each flow data sample in the flow data sample set, determining a candidate second number of features with chi-square values arranged from high to low in a preset feature set of the flow data sample to form a candidate second number of dimension vectors, and clustering the vectors of all the flow data samples into a candidate third number class corresponding to the candidate second number in a combination of the second number and the third number;

determining the sum of the error numbers in all the classes of the aggregate under each combination for the combination of the different candidate second number and the candidate third number;

and taking the combination of the candidate second number and the candidate third number with the minimum error number sum as the selected second number and the selected third number, and taking the class aggregated under the combination as the selected class.

7. The method of claim 1, wherein the predetermined feature set includes the following features: path repetition ranking, user risk-control parameter anomaly rate, back-end tracking point proportion, risk-control IP divergence rate, number of accounts accessed per risk-control IP, number of accesses per risk-control IP, number of WiFi names per risk-control IP, accumulated risk score of the risk-control IP, mean number of users per period for the risk-control IP, variance of the number of users per period for the risk-control IP, mean number of accesses per period for the risk-control IP, variance of the number of accesses per period for the risk-control IP, mean number of user logins per period for the mobile phone number segment, and variance of the number of user logins per period for the mobile phone number segment.

8. An abnormal flow data identification apparatus, characterized in that the apparatus comprises:

the judging module is configured to determine that the type is black if the quantity of the flow data samples of the blacklist users in the aggregated type is greater than that of the flow data samples of the white list users; if the quantity of the traffic data samples of the white list users in the aggregated class is larger than that of the traffic data samples of the black list users, the class is a white class;

the abnormal user judging module is configured to gather each flow data of the pending user in a preset time period into one of the selected classes; taking the risk score of the class to which each flow data in a predetermined time period of the pending user is gathered as the risk score of the flow data; determining a risk score of the pending user within a predetermined time period based on the risk score of each flow data within the predetermined time period of the pending user; and determining whether the pending user is an abnormal user or not according to the risk score of the pending user in a preset time period.

9. A computer-readable storage medium, characterized in that it stores computer program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 7.

10. An electronic device, characterized in that the electronic device comprises:

a processor;

a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of any one of claims 1 to 7.