CN109873832B - Flow identification method and device, electronic equipment and storage medium - Google Patents

Publication number
CN109873832B
Authority
CN
China
Prior art keywords
flow
abnormal
user data
calculating
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910199191.1A
Other languages
Chinese (zh)
Other versions
CN109873832A (en
Inventor
曹战徐
武金
刁士涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN201910199191.1A priority Critical patent/CN109873832B/en
Publication of CN109873832A publication Critical patent/CN109873832A/en
Application granted granted Critical
Publication of CN109873832B publication Critical patent/CN109873832B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Debugging And Monitoring (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application discloses a traffic identification method and apparatus, an electronic device, and a storage medium. The method comprises: monitoring the traffic of a target service, and calculating the traffic mean ratio of the target service; if the calculated traffic mean ratio exceeds a first threshold, determining a traffic surge moment; acquiring user data corresponding to the traffic surge moment, and screening out abnormal user data from the user data based on a graph semi-supervised method; and determining abnormal users according to the abnormal user data, and identifying the traffic of the abnormal users as abnormal traffic. In this technical solution, traffic surges are monitored using the idea of mean shift to determine that abnormal traffic has arisen; the graph semi-supervised learning algorithm can characterize the similarity of access behaviors that were not intercepted, is highly interpretable, places lower requirements on the data than an unsupervised algorithm, and yields more stable results; and omissions are avoided by first determining abnormal users and then identifying abnormal traffic.

Description

Flow identification method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of traffic identification technologies, and in particular, to a traffic identification method and apparatus, an electronic device, and a storage medium.
Background
Crawlers are a common means of harvesting resources from the network. Although a crawler can obtain a large amount of resources, for a network service provider it can cause traffic spikes and even bring the service down, degrading the experience of customers. Many network service providers therefore want to identify abnormal traffic caused by crawlers and the like, filter it out, and keep their services stable.
At present, crawler identification models built on big data and machine learning have achieved some preliminary results in predicting, detecting, and identifying abnormal traffic, but they suffer from drawbacks such as unstable results and narrow applicability, so traffic identification technology still has considerable room for improvement.
Disclosure of Invention
In view of the above, the present application is proposed to provide a traffic identification method, apparatus, electronic device and storage medium that overcome or at least partially solve the above-mentioned problems.
According to an aspect of the present application, there is provided a traffic identification method, including:
monitoring the traffic of a target service, and calculating the traffic mean ratio of the target service;
if the calculated traffic mean ratio exceeds a first threshold, determining a traffic surge moment;
acquiring user data corresponding to the traffic surge moment, and screening out abnormal user data from the user data based on a graph semi-supervised method;
and determining abnormal users according to the abnormal user data, and identifying the traffic of the abnormal users as abnormal traffic.
Optionally, the screening out abnormal user data from the user data based on the graph semi-supervised method includes:
calculating the degree of abnormality of each piece of user data according to reference data and the user data, and marking the user data whose degree of abnormality is greater than a second threshold as preliminary abnormal user data;
constructing a graph from the preliminary abnormal user data, and dividing the constructed graph into a plurality of subgroups;
judging each subgroup according to preset check rules, and determining abnormal groups; the preset check rules correspond to features of the subgroup, and the features include one or more of the following: the number of individuals, the individual degree of abnormality, and the overall degree of abnormality;
and taking the preliminary abnormal user data corresponding to the abnormal groups as the abnormal user data.
Optionally, the reference data includes the statistically obtained number of times whitelist users access each interface of the target service;
the user data includes the number of times a user accesses each interface of the target service;
the calculating the degree of abnormality of each piece of user data according to the reference data and the user data includes:
by the formula
Figure BDA0001996812000000021
calculating the degree of abnormality of each piece of user data;
where x_i is the number of times user x in the user data accesses the i-th interface, and X is the total number of times user x accesses all interfaces; y_i is the number of times user y in the reference data accesses the i-th interface, and Y is the total number of times user y accesses all interfaces.
Optionally, the dividing the constructed graph into a plurality of subgroups includes:
obtaining the plurality of subgroups by finding the connected subgraphs of the constructed graph,
or,
obtaining the plurality of subgroups by running a label propagation algorithm on the constructed graph.
Optionally, the method further comprises:
if the calculated traffic mean ratio is lower than a third threshold, determining a traffic drop moment, and determining an abnormal traffic period according to the traffic surge moment and the traffic drop moment;
and estimating the normal traffic in the abnormal traffic period, and calculating the estimated abnormal traffic from the actual traffic and the normal traffic in the abnormal traffic period.
Optionally, the calculating the traffic mean ratio of the target service includes:
according to the formula
Figure BDA0001996812000000022
calculating the traffic mean ratio at time t, where N is an empirical parameter, z_t is the actual traffic value of the target service at time t, and r(t) is the traffic mean ratio at time t;
the estimating the normal traffic in the abnormal traffic period includes:
determining a base line according to the actual traffic values of the target service at the traffic surge moment and the traffic drop moment;
fitting the traffic mean ratios over the N minutes before the traffic surge moment, the N minutes after the traffic drop moment, and the abnormal traffic period, respectively, to obtain a traffic fluctuation curve for each period, and selecting the most stable of the three traffic fluctuation curves as the simulated fluctuation curve;
and performing interpolation according to the simulated fluctuation curve and the base line to obtain the normal traffic curve for the abnormal traffic period.
Optionally, the method further comprises:
according to the formula
Figure BDA0001996812000000031
calculating a recall rate;
where A is the recall rate, m is the intercepted abnormal traffic, and n is the portion of the intercepted abnormal traffic that hits whitelist users; when calculating the real-time recall rate, k is the identified abnormal traffic; when calculating the offline recall rate, k is the estimated abnormal traffic.
According to another aspect of the present application, there is provided a traffic identification apparatus, including:
a traffic monitoring unit, configured to monitor the traffic of a target service and calculate the traffic mean ratio of the target service;
an abnormal data screening unit, configured to determine a traffic surge moment when the calculated traffic mean ratio exceeds a first threshold, acquire user data corresponding to the traffic surge moment, and screen out abnormal user data from the user data based on a graph semi-supervised method;
and an abnormal traffic identification unit, configured to determine abnormal users according to the abnormal user data and identify the traffic of the abnormal users as abnormal traffic.
Optionally, the abnormal data screening unit is configured to: calculate the degree of abnormality of each piece of user data according to reference data and the user data, and mark the user data whose degree of abnormality is greater than a second threshold as preliminary abnormal user data; construct a graph from the preliminary abnormal user data, and divide the constructed graph into a plurality of subgroups; judge each subgroup according to preset check rules, and determine abnormal groups, where the preset check rules correspond to features of the subgroup, and the features include one or more of the following: the number of individuals, the individual degree of abnormality, and the overall degree of abnormality; and take the preliminary abnormal user data corresponding to the abnormal groups as the abnormal user data.
Optionally, the reference data includes the statistically obtained number of times whitelist users access each interface of the target service;
the user data includes the number of times a user accesses each interface of the target service;
the abnormal data screening unit is configured to, by the formula
Figure BDA0001996812000000041
calculate the degree of abnormality of each piece of user data; where x_i is the number of times user x in the user data accesses the i-th interface, and X is the total number of times user x accesses all interfaces; y_i is the number of times user y in the reference data accesses the i-th interface, and Y is the total number of times user y accesses all interfaces.
Optionally, the abnormal data screening unit is configured to obtain the plurality of subgroups by finding the connected subgraphs of the constructed graph, or by running a label propagation algorithm on the constructed graph.
Optionally, the apparatus further comprises:
an abnormal traffic estimation unit, configured to: if the calculated traffic mean ratio is lower than a third threshold, determine a traffic drop moment, and determine an abnormal traffic period according to the traffic surge moment and the traffic drop moment; and estimate the normal traffic in the abnormal traffic period, and calculate the estimated abnormal traffic from the actual traffic and the normal traffic in the abnormal traffic period.
Optionally, the abnormal traffic estimation unit is configured to, according to the formula
Figure BDA0001996812000000042
calculate the traffic mean ratio at time t, where N is an empirical parameter, z_t is the actual traffic value of the target service at time t, and r(t) is the traffic mean ratio at time t; determine a base line according to the actual traffic values of the target service at the traffic surge moment and the traffic drop moment; fit the traffic mean ratios over the N minutes before the traffic surge moment, the N minutes after the traffic drop moment, and the abnormal traffic period, respectively, to obtain a traffic fluctuation curve for each period, and select the most stable of the three traffic fluctuation curves as the simulated fluctuation curve; and perform interpolation according to the simulated fluctuation curve and the base line to obtain the normal traffic curve for the abnormal traffic period.
Optionally, the apparatus further comprises:
a recall rate calculating unit, configured to, according to the formula
Figure BDA0001996812000000043
calculate a recall rate;
where A is the recall rate, m is the intercepted abnormal traffic, and n is the portion of the intercepted abnormal traffic that hits whitelist users; when calculating the real-time recall rate, k is the identified abnormal traffic; when calculating the offline recall rate, k is the estimated abnormal traffic.
In accordance with yet another aspect of the present application, there is provided an electronic device including: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform any one of the methods described above.
According to a further aspect of the application, there is provided a computer-readable storage medium storing one or more programs which, when executed by a processor, implement any one of the methods described above.
In this technical solution, traffic surges are monitored using the idea of mean shift to determine that abnormal traffic has arisen; the graph semi-supervised algorithm can characterize the similarity of access behaviors that were not intercepted, is highly interpretable, places lower requirements on the data than an unsupervised algorithm, and yields more stable identification results; and omissions are avoided by first determining abnormal users and then identifying abnormal traffic.
The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other objects, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are set out below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a schematic flowchart of a traffic identification method according to an embodiment of the present application;
FIG. 2 shows a schematic structural diagram of a traffic identification apparatus according to an embodiment of the present application;
FIG. 3 shows a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 4 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application;
FIG. 5a shows a histogram of access behavior for normal user 1;
FIG. 5b shows a histogram of access behavior for normal user 2.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In the course of working on traffic identification, a number of different approaches were tried; they are briefly introduced first:
First, traffic is predicted using machine learning models such as random forests, which can exploit the conversion rate between the training data and the access traffic; however, this approach is easily affected by holidays, its predictions fluctuate widely, and it cannot be used reliably in calculating the recall rate.
Second, abnormal traffic is identified using an unsupervised algorithm, such as the k-nearest neighbors (KNN) algorithm. Obtaining labeled data carries a cost, and an unsupervised algorithm needs no labeled data, which makes it attractive for detection; however, such algorithms require the data to be highly clustered or to contain only a small amount of anomalous data, so they place high demands on the data and the final detection results are not stable enough.
Third, abnormal user behavior is detected based on mouse, keyboard, and similar input. This approach identifies crawlers with high precision and recall, but it first requires collecting the movement traces of external interaction devices such as the mouse and keyboard, using crawler data samples and normal data as training data, and deploying the trained classification model online to identify crawlers, which is relatively complex to configure; moreover, it relies on front-end JavaScript to collect the information, so the model does not work well on the APP side.
The present application adopts a graph semi-supervised learning algorithm and uses labeled data provided by the business, which costs little yet greatly lowers the requirements on the data, gives more stable results, avoids the influence of holidays, and requires no complex front-end deployment. It provides a more stable traffic identification method that depends as little as possible on accumulated data and is not easily bypassed by crawlers.
Fig. 1 is a schematic flow chart of a traffic identification method according to an embodiment of the present application, and as shown in fig. 1, the method includes:
Step S110, monitoring the traffic of the target service, and calculating the traffic mean ratio of the target service. The monitoring and calculation can be done in real time, with the aim of detecting abnormal traffic through mean shift.
Step S120, if the calculated traffic mean ratio exceeds the first threshold, determining a traffic surge moment. For a network service, traffic fluctuates, but the fluctuation generally stays within a limit. This is explainable: apart from high-traffic events, users' access to a network service is even and continuous rather than bursty. Abnormal traffic generated by crawler activity is extra traffic on top of that generated by normal users, and a large number of crawlers usually flood in within a short time, which often causes a traffic surge.
Step S130, acquiring user data corresponding to the traffic surge moment, and screening out abnormal user data from the user data based on a graph semi-supervised method.
Step S140, determining abnormal users according to the abnormal user data, and identifying the traffic of the abnormal users as abnormal traffic.
Steps S130 to S140 first screen out abnormal user data, then determine the abnormal users, and finally identify the abnormal traffic, rather than determining abnormal traffic directly from the screened abnormal user data. This is because the screened abnormal user data does not necessarily cover all of the data of the abnormal users; determining the abnormal users first and then identifying the abnormal traffic gives a more accurate result.
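The distinction drawn in steps S130 to S140 can be illustrated with a minimal Python sketch. The record layout (dictionaries with a `user` key) and the function name `identify_abnormal_traffic` are assumptions made for illustration and do not come from the application.

```python
def identify_abnormal_traffic(abnormal_user_data, traffic_log):
    """Return (abnormal_users, abnormal_requests).

    abnormal_user_data: records screened at the surge moment (hypothetical layout).
    traffic_log: all request records of the target service (hypothetical layout).
    """
    # Step 1: the screened records only tell us WHO is abnormal.
    abnormal_users = {record["user"] for record in abnormal_user_data}
    # Step 2: every request from those users counts as abnormal traffic,
    # including requests that were not among the screened records.
    abnormal_requests = [r for r in traffic_log if r["user"] in abnormal_users]
    return abnormal_users, abnormal_requests


log = [
    {"user": "u1", "url": "/a"},
    {"user": "u2", "url": "/b"},
    {"user": "u1", "url": "/c"},
]
users, requests = identify_abnormal_traffic([{"user": "u1", "url": "/a"}], log)
print(users, len(requests))
```

Here only one record of `u1` was screened out, yet both of that user's requests end up identified as abnormal, which is the omission the two-stage approach is meant to avoid.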
In the prior art, crawlers can be detected and intercepted using a crawler detection model or interception policy obtained through machine learning or similar means, and the technical solution of the present application can be combined with such a model or policy. On the one hand, traffic generated by crawlers that were not intercepted can be identified; on the other hand, because that traffic is a good positive sample, the crawler detection model or interception policy can be updated with the identified abnormal traffic, providing data support for online learning and incremental learning, and the updated model or policy can then take part in further crawler detection and interception.
It can be seen that, in the method shown in Fig. 1, traffic surges are monitored using the idea of mean shift to determine that abnormal traffic has arisen; the graph semi-supervised learning algorithm can characterize the similarity of access behaviors that were not intercepted, is highly interpretable, places lower requirements on the data than an unsupervised algorithm, and yields more stable identification results; and omissions are avoided by first determining abnormal users and then identifying abnormal traffic.
In an embodiment of the present application, in the method, screening out abnormal user data from the user data based on a graph semi-supervised method includes: calculating the degree of abnormality of each piece of user data according to reference data and the user data, and marking the user data whose degree of abnormality is greater than a second threshold as preliminary abnormal user data; constructing a graph from the preliminary abnormal user data, and dividing the constructed graph into a plurality of subgroups; judging each subgroup according to preset check rules, and determining abnormal groups, where the preset check rules correspond to features of the subgroup, and the features include one or more of the following: the number of individuals, the individual degree of abnormality, and the overall degree of abnormality; and taking the preliminary abnormal user data corresponding to the abnormal groups as the abnormal user data.
The above is one example of screening out abnormal user data from user data with the graph semi-supervised method. The reference data is labeled data, mainly drawn from user data collected in actual services. The degree of abnormality is calculated from the reference data and the user data, and user data whose degree of abnormality exceeds the second threshold is marked as preliminary abnormal user data rather than directly as abnormal user data, because it may contain false detections, and further screening improves accuracy.
Constructing the graph and dividing it into subgroups is where the idea of graph semi-supervision is applied. Specifically, in one embodiment of the application, in the method, dividing the constructed graph into a plurality of subgroups comprises: obtaining the subgroups by finding the connected subgraphs of the constructed graph, or obtaining the subgroups by running a label propagation algorithm on the constructed graph.
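As a rough sketch of the connected-subgraph option, the following Python code groups users by breadth-first search. How edges are created is not specified in this text; the sketch simply assumes an edge list is already available (for example, linking users whose access behavior is similar), and the function name is hypothetical.

```python
from collections import defaultdict, deque


def connected_subgroups(nodes, edges):
    """Split an undirected graph into connected components (subgroups)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    groups = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        group = []
        while queue:
            node = queue.popleft()
            group.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        groups.append(sorted(group))
    return groups


print(connected_subgroups(["a", "b", "c", "d"], [("a", "b"), ("b", "c")]))
```

An isolated node such as `d` forms a subgroup of its own, which a later check on the number of individuals could discard.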
Both finding connected subgraphs and label propagation can divide the graph into subgroups; these two approaches were chosen on the basis of practical verification, and other embodiments may of course use other graph semi-supervised ways of dividing subgroups.
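For the label propagation option, a simplified and deterministic variant can be sketched in Python as follows. It is not the application's implementation: the update order, the tie-breaking rule (smallest label wins), the iteration cap, and the function name are all assumptions chosen to keep the example reproducible, and every edge endpoint is assumed to appear in `nodes`.

```python
from collections import Counter, defaultdict


def label_propagation(nodes, edges, max_iter=20):
    """Group nodes by repeatedly adopting the most common neighbor label."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    labels = {node: node for node in nodes}  # every node starts with its own label
    order = sorted(nodes)
    for _ in range(max_iter):
        changed = False
        for node in order:
            if not adjacency[node]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[n] for n in adjacency[node])
            best = max(counts.values())
            # Ties are broken by taking the smallest label (an assumption).
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break

    groups = defaultdict(list)
    for node in order:
        groups[labels[node]].append(node)
    return sorted(groups.values())


triangles = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f"), ("d", "f")]
print(label_propagation(list("abcdef"), triangles))
```

Unlike connected components, label propagation can split a single connected graph into several densely linked communities, which may matter when unrelated crawler groups share a few weak links.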
After the subgroups are divided, each subgroup is further judged by the check rules. Abnormal groups are selected mainly along three dimensions: the number of individuals in the subgroup, the degree of abnormality of the individuals, and the overall degree of abnormality; the preliminary abnormal user data corresponding to the abnormal groups is taken as the abnormal user data. In other words, the subgroups that are not selected are treated as false detections of the earlier abnormality calculation.
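A check rule over these three dimensions might look like the Python sketch below. The concrete thresholds, the direction of each comparison, and the use of the mean as the "overall" degree of abnormality are all assumptions for illustration; the text does not give the actual rules.

```python
def is_abnormal_group(scores, min_size=3, min_individual=0.5, min_overall=0.7):
    """Apply hypothetical check rules to one subgroup's abnormality scores."""
    if len(scores) < min_size:          # number of individuals
        return False
    if min(scores) < min_individual:    # individual degree of abnormality
        return False
    overall = sum(scores) / len(scores)  # overall degree (assumed: mean)
    return overall >= min_overall


def select_abnormal_users(subgroups, degree):
    """Keep only the users that belong to subgroups passing the check rules."""
    abnormal = []
    for group in subgroups:
        if is_abnormal_group([degree[user] for user in group]):
            abnormal.extend(group)
    return abnormal


degree = {"a": 0.9, "b": 0.8, "c": 0.75, "d": 0.95}
print(select_abnormal_users([["a", "b", "c"], ["d"]], degree))
```

In this example user `d` has the highest individual score but is dropped because its subgroup is too small, which corresponds to treating it as a false detection.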
In an embodiment of the present application, in the method, the reference data includes the statistically obtained number of times whitelist users access each interface of the target service; the user data includes the number of times a user accesses each interface of the target service; and calculating the degree of abnormality of each piece of user data from the reference data and the user data includes: by the formula
Figure BDA0001996812000000081
calculating the degree of abnormality of each piece of user data; where x_i is the number of times user x in the user data accesses the i-th interface, and X is the total number of times user x accesses all interfaces; y_i is the number of times user y in the reference data accesses the i-th interface, and Y is the total number of times user y accesses all interfaces.
Each piece of user data and each piece of reference data corresponds to one user, and the user data of different users can be distinguished by data that uniquely identifies a user, such as the IP, the proxy used, and the ID. In a specific example, whitelist users may be paying users, good users filtered out by rules, and the like, that is, normal users who use the network service normally. The reference data includes the statistically obtained number of times whitelist users access each interface of the target service, and the user data includes the number of times a user accesses each interface of the target service; both can be recorded as access behavior histograms. Figs. 5a and 5b show the access behavior histograms of two normal users, where url1, url2, url3, url4, and url5 are the interface addresses of different interfaces. These access behavior histograms can be understood as reference profiles that take part in subsequent calculations for comparison and screening, so the reference data can be regarded as a set of reference profiles. The above formula is one way of calculating the degree of abnormality that works well in practice; in other embodiments, the degree of abnormality of each piece of user data may be calculated in other ways.
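Because the formula itself is only available as an image, the following Python sketch uses a stand-in measure rather than the application's formula: it compares the normalized histograms x_i/X and y_i/Y by half their L1 distance and takes the distance to the closest whitelist profile. That choice of distance, the use of the minimum, and all function names are assumptions.

```python
def normalize(counts):
    """Turn per-interface access counts into access proportions (x_i / X)."""
    total = sum(counts.values())
    return {url: c / total for url, c in counts.items()} if total else {}


def distance(user_counts, ref_counts):
    """Half the L1 distance between two normalized histograms (0 = same, 1 = disjoint)."""
    p, q = normalize(user_counts), normalize(ref_counts)
    urls = set(p) | set(q)
    return 0.5 * sum(abs(p.get(u, 0.0) - q.get(u, 0.0)) for u in urls)


def abnormality_degree(user_counts, reference_profiles):
    """Assumed stand-in: distance to the most similar whitelist profile."""
    return min(distance(user_counts, ref) for ref in reference_profiles)


whitelist = [{"url1": 5, "url2": 5}, {"url1": 1, "url3": 9}]
print(abnormality_degree({"url1": 10, "url2": 10}, whitelist))  # same shape as a whitelist user
print(abnormality_degree({"url4": 7}, whitelist))               # hits only an unseen interface
```

Normalizing by the totals X and Y means a user is judged by the shape of the histogram rather than by raw volume, so a heavy but ordinary user is not flagged merely for being active.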
In an embodiment of the present application, the method further includes: if the calculated traffic mean ratio is lower than a third threshold, determining a traffic drop moment, and determining an abnormal traffic period according to the traffic surge moment and the traffic drop moment; and estimating the normal traffic in the abnormal traffic period, and calculating the estimated abnormal traffic from the actual traffic and the normal traffic in the abnormal traffic period.
In one specific example, a traffic surge is deemed to have occurred if the traffic mean ratio is above 1.2, and a traffic drop is deemed to have occurred if the traffic mean ratio is below 0.8. The abnormal traffic period is determined from the surge moment and the drop moment. Knowing the normal traffic in the abnormal traffic period facilitates business operations, statistics, and the like; and the estimated abnormal traffic obtained by calculation can be used to further calculate the recall rate.
In an embodiment of the present application, in the method, calculating the traffic mean ratio of the target service includes: according to the formula
Figure BDA0001996812000000091
calculating the traffic mean ratio at time t, where N is an empirical parameter, z_t is the actual traffic value of the target service at time t, and r(t) is the traffic mean ratio at time t. Estimating the normal traffic in the abnormal traffic period includes: determining a base line according to the actual traffic values of the target service at the traffic surge moment and the traffic drop moment; fitting the traffic mean ratios over the N minutes before the traffic surge moment, the N minutes after the traffic drop moment, and the abnormal traffic period, respectively, to obtain a traffic fluctuation curve for each period, and selecting the most stable of the three traffic fluctuation curves as the simulated fluctuation curve; and performing interpolation according to the simulated fluctuation curve and the base line to obtain the normal traffic curve for the abnormal traffic period.
The formula given above is also a preferred example, and other embodiments may use other formulas to calculate the traffic mean ratio at time t. As an empirical parameter, N can be chosen according to business requirements, for example set to 5 to 10.
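Since the exact expression for r(t) is not reproduced in this text, the Python sketch below assumes one plausible definition, namely z_t divided by the mean of the preceding N values, and applies the example thresholds of 1.2 and 0.8 to locate the surge and drop moments. The definition of the ratio and the function names are assumptions, not the application's formula.

```python
def mean_ratio(series, t, n):
    """Assumed ratio: value at t over the mean of the previous n values."""
    window = series[t - n:t]
    baseline = sum(window) / n
    return series[t] / baseline if baseline else float("inf")


def find_abnormal_period(series, n=5, surge=1.2, drop=0.8):
    """Return (surge_index, drop_index); either may be None if not found."""
    surge_t = None
    for t in range(n, len(series)):
        r = mean_ratio(series, t, n)
        if surge_t is None and r > surge:
            surge_t = t          # first moment the ratio exceeds the first threshold
        elif surge_t is not None and r < drop:
            return surge_t, t    # first later moment the ratio falls below the third threshold
    return surge_t, None


# Per-minute request counts: steady, then a burst, then back to normal.
traffic = [100] * 6 + [200] * 6 + [100] * 3
print(find_abnormal_period(traffic, n=5))
```

A larger N smooths the baseline and reduces false alarms from ordinary fluctuation, at the price of reacting more slowly once the traffic level has genuinely shifted.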
The traffic mean ratios over the N minutes before the surge moment, the N minutes after the drop moment, and the abnormal traffic period are fitted separately to obtain a fluctuation curve for each period, and the most stable of the three fluctuation curves is selected as the simulated fluctuation curve, so that the normal traffic curve can be estimated as accurately as possible. In practice, even with crawlers involved, the fluctuation curve of the abnormal traffic period may turn out to be the most stable, which is why it is also considered. A stable simulated fluctuation curve means that user access volume is more even during that period, so an estimate based on it deviates less.
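The text does not say how "most stable" is measured. As one possible reading, the Python sketch below skips the curve-fitting step, works directly on sampled ratio values, and picks the period with the smallest population variance; the variance criterion, the dictionary keys, and the function name are assumptions.

```python
from statistics import pvariance


def most_stable_curve(curves):
    """Return the name of the curve whose samples vary the least."""
    return min(curves, key=lambda name: pvariance(curves[name]))


curves = {
    "before_surge": [1.0, 1.1, 0.9, 1.0],
    "after_drop": [1.0, 1.0, 1.01, 0.99],
    "abnormal_period": [1.0, 1.5, 0.7, 1.2],
}
print(most_stable_curve(curves))
```

Other stability measures, such as the range or the mean absolute change between adjacent samples, would fit the same interface.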
A base line is determined from the actual traffic values of the target service at the surge moment and the drop moment: the slope of the line can be determined first, and then its equation. The interpolated value at time t can be obtained by multiplying r'(t) by the ordinate of the point at time t on the base line, where r'(t) is the ordinate of the point at time t on the simulated fluctuation curve. This yields the normal traffic curve for the abnormal traffic period.
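The interpolation just described can be sketched in Python as follows. The base line and the product r'(t) times the line's ordinate follow the text; representing the simulated fluctuation curve as a callable, using integer minute indices, and estimating the abnormal traffic as the clipped difference between actual and normal traffic are assumptions for illustration.

```python
def base_line(t_up, z_up, t_down, z_down):
    """Straight line through the traffic values at the surge and drop moments."""
    slope = (z_down - z_up) / (t_down - t_up)
    return lambda t: z_up + slope * (t - t_up)


def normal_traffic(t_up, z_up, t_down, z_down, fluctuation):
    """Normal traffic at each minute t: r'(t) multiplied by the base line at t."""
    line = base_line(t_up, z_up, t_down, z_down)
    return {t: fluctuation(t) * line(t) for t in range(t_up, t_down + 1)}


def estimated_abnormal(actual, normal):
    """Assumed estimate: actual minus normal traffic, never below zero."""
    return sum(max(actual[t] - normal[t], 0.0) for t in normal)


# A flat simulated fluctuation curve (r'(t) = 1) reduces to plain linear interpolation.
normal = normal_traffic(0, 100, 4, 120, lambda t: 1.0)
actual = {0: 100, 1: 300, 2: 310, 3: 315, 4: 120}
print(normal[2], estimated_abnormal(actual, normal))
```

With a non-constant fluctuation curve, the same code modulates the straight line so that the estimated normal traffic keeps the ordinary minute-to-minute wobble instead of being perfectly linear.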
Therefore, this interpolation-based estimation relies only on the fluctuation pattern of the traffic, that is, on how the traffic changes; it is little affected by the specific data of crawler traffic and normal traffic, and applies to a wide range of business scenarios.
In an embodiment of the present application, the method further includes: according to the formula
Figure BDA0001996812000000101
Calculating a recall rate; wherein, A is recall rate, m is intercepted abnormal traffic, and n is the traffic hitting white list users in the intercepted abnormal traffic; k is the identified abnormal flow when calculating the real-time recall rate; when calculating the offline recall rate, k is the estimated abnormal flow.
The above formula can be used for calculating the real-time recall rate and the off-line recall rate, and only the values of k are different. The specific k value may be determined in the manner as described in the above embodiments.
Other methods of calculating the recall rate may also be used, for example the formula
Figure BDA0001996812000000102
where u is the estimated overall abnormal flow. This approach has the disadvantage that in some cases the calculated recall rate is greater than 1, which is not as expected; the improved formula given earlier avoids this. The offline recall rate may be calculated at a T+1 level. The overall abnormal flow can be estimated by fitting the normal flow characteristics of white list users to obtain the normal flow in the abnormal flow period, in a manner similar to fitting the simulated fluctuation curve, and then estimating the overall abnormal flow from the actual flow and the normal flow in the abnormal flow period.
Fig. 2 is a schematic structural diagram of a flow identification device according to an embodiment of the present application. As shown in fig. 2, the flow identification device 200 includes:
a traffic monitoring unit 210, configured to monitor traffic of a target service and calculate a traffic mean ratio of the target service;
the abnormal data screening unit 220 is configured to determine a traffic surge time when the calculated traffic mean ratio exceeds a first threshold, acquire user data corresponding to the traffic surge time, and screen abnormal user data in the user data based on a graph semi-supervised method;
and an abnormal traffic identification unit 230, configured to determine an abnormal user according to the abnormal user data, and identify the traffic of the abnormal user as abnormal traffic.
It can be seen that, in the device shown in fig. 2, through the cooperation of the units, the idea of mean shift is used to monitor a sudden increase in flow and thereby determine that abnormal flow has been generated. The graph semi-supervised learning algorithm can characterize the similarity of access behaviors that were not intercepted, has strong interpretability, places lower requirements on data than an unsupervised algorithm, and yields more stable identification results. Omissions are avoided by determining abnormal users and then identifying their flow as abnormal flow.
In an embodiment of the present application, in the above apparatus, the abnormal data screening unit 220 is configured to: calculate an abnormal degree of each item of user data according to the reference data and the user data, and mark the user data whose abnormal degree is greater than a second threshold as preliminary abnormal user data; construct the preliminary abnormal user data into a graph, and divide the constructed graph into a plurality of subgroups; judge each subgroup according to a preset inspection rule to determine abnormal groups, where the preset inspection rule corresponds to characteristics of the subgroup, and the characteristics include one or more of the following: the number of individuals, the individual abnormal degree, and the overall abnormal degree; and take the preliminary abnormal user data corresponding to the abnormal groups as the abnormal user data.
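As a non-limiting illustration, the thresholding and subgroup inspection steps could be sketched as follows. The abnormal degrees and the subgroups are taken as given inputs. The inspection rule used here (a minimum number of individuals plus a minimum mean abnormal degree) is a hypothetical example of a preset inspection rule, not a rule stated in this application, and all names and values are assumptions.

```python
from statistics import mean


def preliminary_abnormal(degrees, second_threshold):
    """Keep the users whose abnormal degree is greater than the second threshold."""
    return {user: d for user, d in degrees.items() if d > second_threshold}


def abnormal_groups(subgroups, degrees, min_size, min_mean_degree):
    """Return the subgroups that satisfy a hypothetical inspection rule.

    Assumed rule: a subgroup is abnormal when it has at least `min_size`
    individuals and its mean abnormal degree (used here as the "overall
    abnormal degree") is at least `min_mean_degree`.
    """
    selected = []
    for group in subgroups:
        overall = mean(degrees[user] for user in group)
        if len(group) >= min_size and overall >= min_mean_degree:
            selected.append(group)
    return selected


# Hypothetical abnormal degrees per user.
degrees = {"u1": 0.9, "u2": 0.8, "u3": 0.85, "u4": 0.2, "u5": 0.7}
prelim = preliminary_abnormal(degrees, second_threshold=0.5)

# Hypothetical subgroups obtained by dividing the constructed graph.
subgroups = [["u1", "u2", "u3"], ["u5"]]
groups = abnormal_groups(subgroups, prelim, min_size=2, min_mean_degree=0.75)
abnormal_users = sorted(user for group in groups for user in group)
print(abnormal_users)
```

In this sample, the single-user subgroup is rejected by the size condition even though its member passed the second threshold, which illustrates how the group-level check can filter out isolated preliminary hits.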
In an embodiment of the present application, in the above apparatus, the reference data includes the statistically obtained number of times white list users access each interface of the target service; the user data includes the number of times a user accesses each interface of the target service; and the abnormal data screening unit 220 is configured to calculate, by the formula
Figure BDA0001996812000000111
the abnormal degree of each item of user data; wherein $x_i$ represents the number of times user x accesses the i-th interface in the user data, and X represents the total number of times user x accesses all interfaces; $y_i$ represents the number of times user y accesses the i-th interface in the reference data, and Y represents the total number of times user y accesses all interfaces.
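As a non-limiting illustration, the quantities named above can be arranged into per-interface access shares $x_i/X$ and $y_i/Y$ and compared. The formula of this embodiment is given only as a figure and is not reproduced in this sketch; the comparison used below (half the sum of absolute share differences) is a stand-in chosen purely for illustration and should not be read as the formula of this application. All names and sample counts are hypothetical.

```python
def access_shares(counts):
    """Convert per-interface access counts into shares of the total."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("total access count must be positive")
    return [c / total for c in counts]


def abnormal_degree_stand_in(user_counts, reference_counts):
    """Stand-in score in [0, 1]; NOT the formula of the application.

    Assumption: the score is half the sum of absolute differences between
    the user's access shares (x_i / X) and the reference shares (y_i / Y).
    """
    p = access_shares(user_counts)
    q = access_shares(reference_counts)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical counts over three interfaces.
reference = [50, 30, 20]     # white list statistics
normal_user = [5, 3, 2]      # same access pattern as the reference
crawler_like = [0, 0, 40]    # hits a single interface only

score_normal = abnormal_degree_stand_in(normal_user, reference)
score_crawler = abnormal_degree_stand_in(crawler_like, reference)
print(score_normal, score_crawler)
```

Normalizing by X and Y makes the comparison depend on how a user distributes accesses across interfaces rather than on the raw access volume.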
In an embodiment of the present application, in the above apparatus, the abnormal data screening unit 220 is configured to obtain the plurality of subgroups by finding the connected subgraphs of the constructed graph, or to obtain the plurality of subgroups by running a label propagation algorithm on the constructed graph.
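As a non-limiting illustration, dividing the constructed graph into subgroups by finding its connected subgraphs could be sketched with a breadth-first search. How edges between users are constructed is not described in this passage, so the edge list below is a hypothetical input, as are the function and variable names.

```python
from collections import defaultdict, deque


def connected_subgroups(nodes, edges):
    """Split an undirected graph into its connected subgraphs.

    `nodes` is a list of user identifiers and `edges` a list of pairs;
    both are hypothetical inputs of this sketch.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    groups = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue = deque([start])
        group = []
        while queue:
            node = queue.popleft()
            group.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        groups.append(sorted(group))
    return groups


nodes = ["u1", "u2", "u3", "u4", "u5", "u6"]
edges = [("u1", "u2"), ("u2", "u3"), ("u4", "u5")]
groups = connected_subgroups(nodes, edges)
print(groups)
```

A user with no edges forms a subgroup of size one, which a size-based inspection rule can then discard.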
In an embodiment of the present application, the apparatus further includes an abnormal flow estimation unit, configured to: determine a flow sudden-decrease moment if the calculated flow mean ratio is lower than a third threshold; determine an abnormal flow period according to the flow sudden-increase moment and the flow sudden-decrease moment; estimate the normal flow in the abnormal flow period; and calculate the estimated abnormal flow from the actual flow and the normal flow in the abnormal flow period.
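As a non-limiting illustration, locating the abnormal flow period from a series of flow mean ratios and estimating the abnormal flow could be sketched as follows. The ratio series is taken as given. Treating the estimated abnormal flow as the sum of the positive differences between actual and normal flow is an assumption of this sketch, and the names, thresholds, and sample values are hypothetical.

```python
def abnormal_period(ratios, first_threshold, third_threshold):
    """Return (surge_index, drop_index), or None if no full period is found.

    The surge moment is the first index whose ratio exceeds the first
    threshold; the drop moment is the first later index whose ratio is
    lower than the third threshold.
    """
    surge = next((i for i, r in enumerate(ratios) if r > first_threshold), None)
    if surge is None:
        return None
    drop = next(
        (i for i in range(surge + 1, len(ratios)) if ratios[i] < third_threshold),
        None,
    )
    if drop is None:
        return None
    return surge, drop


def estimated_abnormal_flow(actual, normal):
    """Assumption: abnormal flow = sum of (actual - normal), floored at zero."""
    return sum(max(a - n, 0.0) for a, n in zip(actual, normal))


# Hypothetical per-minute flow mean ratios.
ratios = [1.0, 1.1, 2.5, 1.0, 1.0, 0.4, 1.0]
period = abnormal_period(ratios, first_threshold=2.0, third_threshold=0.5)

# Hypothetical actual and estimated normal flow inside the abnormal period.
actual = [3000.0, 3100.0, 2900.0, 1200.0]
normal = [1000.0, 1050.0, 1100.0, 1150.0]
estimated = estimated_abnormal_flow(actual, normal)
print(period, estimated)
```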
In an embodiment of the present application, in the above apparatus, the abnormal flow estimation unit is configured to calculate the flow mean ratio at time t according to the formula
Figure BDA0001996812000000121
wherein N is an empirical parameter, $z_t$ is the actual flow value of the target service at time t, and r(t) is the flow mean ratio at time t; determine a basic straight line from the actual flow values of the target service at the flow sudden-increase moment and the flow sudden-decrease moment; fit the flow mean ratios separately for the N minutes before the flow sudden-increase moment, the N minutes after the flow sudden-decrease moment, and the abnormal flow period to obtain a flow fluctuation curve for each period, and select the most stable of the three flow fluctuation curves as the simulated fluctuation curve; and perform interpolation according to the simulated fluctuation curve and the basic straight line to obtain the normal flow curve of the abnormal flow period.
In an embodiment of the present application, the apparatus further includes: a recall rate calculating unit for calculating recall rate according to formula
Figure BDA0001996812000000122
Calculating a recall rate; wherein, A is recall rate, m is intercepted abnormal traffic, and n is the traffic hitting white list users in the intercepted abnormal traffic; k is the identified abnormal flow when calculating the real-time recall rate; when calculating the offline recall rate, k is the estimated abnormal flow.
It should be noted that, for the specific implementation of each apparatus embodiment, reference may be made to the specific implementation of the corresponding method embodiment, which is not described herein again.
To sum up, according to the technical scheme of the application, the flow of the target service is monitored and the flow mean ratio of the target service is calculated. If the calculated flow mean ratio exceeds a first threshold, a flow sudden-increase moment is determined, the user data corresponding to that moment is obtained, and abnormal user data is screened out of the user data based on a graph semi-supervised method. Abnormal users are then determined according to the abnormal user data, and the flow of the abnormal users is identified as abnormal flow. The idea of mean shift is used to monitor a sudden increase in flow and thereby determine that abnormal flow has been generated. The graph semi-supervised learning algorithm can characterize the similarity of access behaviors that were not intercepted, has strong interpretability, places lower requirements on data than an unsupervised algorithm, and yields more stable identification results. Omissions are avoided by determining abnormal users and then identifying their flow as abnormal flow.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the application and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in a flow identification device according to embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
For example, fig. 3 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device comprises a processor 310 and a memory 320 arranged to store computer executable instructions (computer readable program code). The memory 320 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. The memory 320 has a storage space 330 storing computer readable program code 331 for performing any of the method steps described above. For example, the storage space 330 for storing the computer readable program code may comprise respective computer readable program codes 331 for respectively implementing various steps in the above method. The computer readable program code 331 may be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, a Compact Disc (CD), a memory card or a floppy disk. Such a computer program product is typically a computer readable storage medium such as described in fig. 4. FIG. 4 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application. The computer readable storage medium 400 has stored thereon a computer readable program code 331 for performing the steps of the method according to the application, readable by a processor 310 of an electronic device 300, which computer readable program code 331, when executed by the electronic device 300, causes the electronic device 300 to perform the steps of the method described above, in particular the computer readable program code 331 stored on the computer readable storage medium may perform the method shown in any of the embodiments described above. The computer readable program code 331 may be compressed in a suitable form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names.

Claims (8)

1. A traffic identification method, comprising:
monitoring the flow of the target service, and calculating the flow mean ratio of the target service;
if the calculated flow mean ratio exceeds a first threshold value, determining a flow sudden increase moment;
acquiring user data corresponding to the traffic surge moment, and screening abnormal user data in the user data based on a graph semi-supervision method;
determining abnormal users according to the abnormal user data, and identifying the flow of the abnormal users as abnormal flow;
the method further comprises the following steps:
according to the formula
Figure FDA0002454010820000011
Calculating a recall rate;
wherein, A is recall rate, m is intercepted abnormal traffic, and n is the traffic hitting white list users in the intercepted abnormal traffic; k is the identified abnormal flow when calculating the real-time recall rate; when the offline recall rate is calculated, k is the estimated abnormal flow;
the screening of abnormal user data in the user data based on the graph semi-supervised method comprises the following steps:
calculating the abnormal degree of each user data according to the reference data and the user data, and marking the user data with the abnormal degree larger than a second threshold value as preliminary abnormal user data;
constructing the preliminary abnormal user data into a graph, and dividing the constructed graph into a plurality of subgroups;
judging each subgroup according to a preset inspection rule, and determining an abnormal group; the preset checking rule corresponds to the characteristics of the subgroup, and the characteristics comprise one or more of the following: the number of individuals, the individual abnormal degree and the overall abnormal degree;
and taking the preliminary abnormal user data corresponding to the abnormal group as the abnormal user data.
2. The method of claim 1,
the reference data comprises the access times of the white list user to each interface of the target service according to statistics;
the user data comprises the access times of the user to each interface of the target service;
the calculating the abnormal degree of each user data according to the reference data and the user data comprises:
by the formula
Figure FDA0002454010820000012
Calculating the abnormal degree of each user behavior data;
wherein $x_i$ represents the number of times user x accesses the i-th interface in the user data, and X represents the total number of times user x accesses all interfaces; $y_i$ represents the number of times user y accesses the i-th interface in the reference data, and Y represents the total number of times user y accesses all interfaces.
3. The method of claim 1, wherein the dividing the constructed graph into a plurality of subgroups comprises:
obtaining a plurality of subgroups by solving a connected subgraph of the constructed graph,
or,
and calculating the constructed graph according to a label propagation algorithm to obtain a plurality of subgroups.
4. The method of claim 1, further comprising:
if the calculated flow mean ratio is lower than a third threshold, determining a flow sudden-decrease moment, and determining a flow abnormal time interval according to the flow sudden-increase moment and the flow sudden-decrease moment;
and estimating the normal flow in the abnormal flow time period, and calculating the estimated abnormal flow according to the actual flow and the normal flow in the abnormal flow time period.
5. The method of claim 4, wherein the calculating the traffic-to-average ratio for the target service comprises:
according to the formula
Figure FDA0002454010820000021
Calculating the flow mean ratio at the time t, wherein N is an empirical parameter, $z_t$ is the actual flow value of the target service at time t, and r(t) is the flow mean ratio at time t;
the estimating of the normal flow during the abnormal flow period comprises:
determining a basic straight line according to the actual flow numerical values of the target service at the time of sudden increase of flow and the time of sudden decrease of flow;
respectively fitting the flow mean ratios of the N minutes before the flow sudden-increase moment, the N minutes after the flow sudden-decrease moment and the flow abnormal time period to obtain flow fluctuation curves corresponding to the time periods, and selecting the most stable of the three flow fluctuation curves as a simulated fluctuation curve;
and carrying out interpolation calculation according to the simulated fluctuation curve and the basic straight line to obtain a normal flow curve at the abnormal flow time period.
6. A flow rate identification device, comprising:
the flow monitoring unit is used for monitoring the flow of the target service and calculating the flow mean ratio of the target service;
the abnormal data screening unit is used for determining a traffic surge moment when the calculated traffic mean ratio exceeds a first threshold, acquiring user data corresponding to the traffic surge moment, and screening abnormal user data in the user data based on a graph semi-supervision method;
the abnormal flow identification unit is used for determining an abnormal user according to the abnormal user data and identifying the flow of the abnormal user as abnormal flow;
optionally, the apparatus further comprises:
a recall rate calculating unit for calculating recall rate according to formula
Figure FDA0002454010820000031
Calculating a recall rate;
wherein, A is recall rate, m is intercepted abnormal traffic, and n is the traffic hitting white list users in the intercepted abnormal traffic; k is the identified abnormal flow when calculating the real-time recall rate; when calculating the offline recall rate, k is the estimated abnormal flow.
7. An electronic device, wherein the electronic device comprises: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the method of any one of claims 1-5.
8. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-5.
CN201910199191.1A 2019-03-15 2019-03-15 Flow identification method and device, electronic equipment and storage medium Active CN109873832B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910199191.1A CN109873832B (en) 2019-03-15 2019-03-15 Flow identification method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN109873832A CN109873832A (en) 2019-06-11
CN109873832B true CN109873832B (en) 2020-07-31

Family

ID=66920517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910199191.1A Active CN109873832B (en) 2019-03-15 2019-03-15 Flow identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109873832B (en)






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant