CN109039797B

CN109039797B - Strong learning based large flow detection method

Info

Publication number: CN109039797B
Application number: CN201810594740.0A
Authority: CN
Inventors: 王雄; 潘志豪; 任婧; 徐世中; 王晟
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2018-06-11
Filing date: 2018-06-11
Publication date: 2021-11-23
Anticipated expiration: 2038-06-11
Also published as: CN109039797A

Abstract

The invention discloses a reinforcement learning-based large flow detection method, which comprises the following steps: s1: detecting a data stream to obtain stream detection data; s2: optimizing a detection data model by adopting a historical sample buffer pool; s3: judging the big flow of the flow detection data by adopting the optimized detection data model, and detecting the big flow again; s4: the flow inspection data is put into the history sample buffer pool, and S2, S3, and S4 are again sequentially performed until the inspection ends. The invention takes the link state of the network and the historical measurement information of the flow as the state, takes the measurement size of the flow as the reward value, adopts the reinforcement learning-based large flow detection method to detect the large flow in the network, can fully extract the characteristics of the correlation and the like of the flow, and can improve the accuracy of the large flow detection.

Description

Strong learning based large flow detection method

Technical Field

The invention relates to the technical field of computers, in particular to a big flow detection method based on reinforcement learning.

Background

Fine-grained network flow measurement is required for planning, operation management, charging and security audit of a data center. Both NetFlow and sFlow are flow-based measurement methods, which can provide fine-grained network traffic measurement, but they require specific network devices or specific functional support, for example, NetFlow can only be used on cisco devices. On the other hand, because the traffic information data to be measured in the actual network is huge, the flow-based measurement method usually needs to consume a lot of network resources (network bandwidth, node storage and computation, etc.), and the scalability is poor. For Netflow, the processing time of each packet is limited by network resources, and when a high-speed switch is used, the processing time of each data packet is limited. In SDN (software defined network), the restricted resources of the switch are ternary content addressable TCAM resources, each TCAM resource can only measure one flow. Due to the shortage of resources, NetFlow will use a sampling method to measure, which will reduce the accuracy of measurement. The FlowRaar provides a method for measuring each flow in real time in a fine-grained manner under the condition of switch resource limitation, and reduces the processing time of a packet and the occupation of network load by compressing a counter of each flow.

The iStamP provides a method for flow aggregation and decoupling, reduces the use of TCAM resources through flow aggregation, and provides fine-grained measurement for flows with larger flow through flow decoupling. Since the size of the traffic is variable, the decoupled flow also needs to be changed frequently, which requires the algorithm to find the current flow with larger traffic in the network and measure the current flow in real time.

The problem is essentially that of a multiple-arm gambling machine MAB (multi-armed bandit) in which there are many machines that look the same, each winning a different probability and varying with time. Each time a gambler shakes a slot machine, a certain cost is spent, and how to maximize the yield is the problem to be solved by the dobby gambling machine. When the gambler finds a slot machine with a higher probability of winning, he may choose to continue to shake the slot machine to obtain a steady return. However, there may be a slot machine with a higher winning rate, or the probability of winning the slot machine decreases over time, so another longer term is to lose a portion of the current prize and explore other slot machines. How to balance between greedy selection of current optima and exploration of other possibilities is a major problem to be solved by dobby slot machines.

Algorithms for solving the MAB have a lot, wherein a greedy strategy is more direct, greedy selection is carried out on the current optimal solution according to a certain probability, such as the probability of 0.95, and the probability of 0.05 is left to search other more optimal solutions. An obvious disadvantage of greedy policy is that contextual information is not fully exploited, such as the possibility that multiple slot machines may have a correlation in advance. From this idea, context-based dobby algorithms have come to mind. The context-based dobby game machine algorithm records a d-dimensional feature array, which is updated each time a selection is made in an iteration, and the d-dimensional feature array records context-dependent data. The purpose of the algorithm is to gather enough information to find the correlation between context and reward so that the optimal choice can be made each time, and the maximum benefit is obtained. Common context-based dobby gambling algorithms are the upper Confidence bound algorithm ucb (upper Confidence bound), neural networks and random forests.

The iSTAMP uses the mucb (modified upper Confidence bound) to detect large flows, but it does not exploit the correlation of flows and is less accurate.

At present, various algorithms for network flow measurement have the problem of low detection and measurement accuracy.

Disclosure of Invention

The invention aims to solve the technical problem that various algorithms for network flow measurement have low detection and measurement accuracy at present, and provides a reinforcement learning-based large flow detection method to solve the problem.

The invention is realized by the following technical scheme:

the method for detecting the large flow based on reinforcement learning comprises the following steps: s1: detecting a data stream to obtain stream detection data; s2: optimizing a detection data model by adopting a historical sample buffer pool; s3: judging the big flow of the flow detection data by adopting the optimized detection data model, and detecting the big flow again; s4: the flow inspection data is put into the history sample buffer pool, and S2, S3, and S4 are again sequentially performed until the inspection ends.

Further, step S3 includes the following sub-steps: s31: the detection data model scores the data stream according to the current state; s32: selecting k flows for detection to obtain new flow detection data and a new network state; s33: the currently detected reward value is derived from the new flow detection data and the new network state.

Further, step S32 includes the following sub-steps: setting a probability threshold epsilon_threshold(ii) a When the random probability is less than epsilon_thresholdTime of day, randomSelecting k streams for detection; when the random probability is greater than epsilon_thresholdAnd then, sorting the scores of the streams in a reverse order, and selecting the k streams with the highest scores for detection.

Further, the epsilon_thresholdObtained by the following formula:

wherein steps is the detection times; epsilon_sIs the probability upper bound; epsilon_eIs the lower probability limit; epsilon_delayIs a rate parameter.

Further, step S33 includes the following sub-steps: taking the proportion of the detected network traffic in all the network traffic as a reward value reward; reward is obtained by the following formula:

in the formula, action is the set of the current detected flow; last is the set of sizes of the last detected stream; measure is the set of sizes of currently detected streams.

Further, step S4 includes the following sub-steps: the stream detection data put into the historical sample buffer pool includes: the state of the current network; making a decision on according to the current network state; the state next _ state to which the decision is made; the prize value reward for each stream.

Further, step S2 further includes the following sub-steps: obtaining errors of the detection value and the model value according to the current network state and the next _ state of the transferred state after decision making; and optimizing the model according to the errors of the detection value and the model value.

Further, the model adopts a neural network model.

Compared with the prior art, the invention has the following advantages and beneficial effects:

the invention takes the link state of the network and the historical measurement information of the flow as the state, takes the measurement size of the flow as the reward value, adopts the reinforcement learning-based large flow detection method to detect the large flow in the network, can fully extract the characteristics of the correlation and the like of the flow, and can improve the accuracy of the large flow detection.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:

FIG. 1 is a schematic diagram of the process steps of the present invention;

FIG. 2 is a block diagram of the system of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.

Examples

As shown in fig. 1 and fig. 2, the reinforced learning-based large flow detection method of the present invention includes the following steps: s1: detecting a data stream to obtain stream detection data; s2: optimizing a detection data model by adopting a historical sample buffer pool; s3: judging the big flow of the flow detection data by adopting the optimized detection data model, and detecting the big flow again; s4: the flow inspection data is put into the history sample buffer pool, and S2, S3, and S4 are again sequentially performed until the inspection ends.

Step S3 includes the following substeps: s31: the detection data model scores the data stream according to the current state; s32: selecting k flows for detection to obtain new flow detection data and a new network state; s33: the currently detected reward value is derived from the new flow detection data and the new network state.

A Markov decision process on the theoretical basis of reinforcement learning, wherein the Markov decision process is in different states s_tNext, a different action a is performed_tDifferent prizes r(s) will be obtained_t,a_t). In different statesState s_tNext, a different action a is performed_tThe environment will be dynamic according to the probability p(s)_t+1|s_t,a_t) Change to a new state s_t+1. The goal of reinforcement learning is to learn a strategy pi_θ(s_t,a_t) I.e. in the current state s_tNext, what action a should be taken_tWherein theta is the strategy parameter of the user, and the reinforcement learning objective is to continuously optimize the strategy parameter theta. Action a_tNot only will the current benefit be directly affected, but the action will also have an effect on future benefits since the next state is also affected by the action, which is referred to as delayed benefits.

Step S32 includes the following substeps: setting a probability threshold epsilon_threshold(ii) a When the random probability is less than epsilon_thresholdThen, randomly selecting k streams for detection; when the random probability is greater than epsilon_thresholdAnd then, sorting the scores of the streams in a reverse order, and selecting the k streams with the highest scores for detection.

Said epsilon_thresholdObtained by the following formula:

Initially, a measurement is performed on all streams to obtain the current sizes of all streams. In the early stage of the algorithm operation, more exploration is needed to collect information, and in the later stage of the algorithm, the exploration probability can be reduced to obtain high benefits. To this end, we set a probability threshold epsilon_thresholdWhen lower than epsilon_thresholdWe randomly choose k streams to observe. If higher than epsilon_thresholdThen we sort the scores of each flow in reverse order according to the algorithm, and choose the k flows with the highest scores to observe. The formula of the probability threshold is:

as the algorithm runs, the threshold value is gradually changed from epsilon_sDown to epsilon_eTypically, the value is ε_s＝0.95，ε_eAt the beginning of the algorithm running, the algorithm has a high probability of randomly selecting to search, and at the later stage of the algorithm, the algorithm selects the current optimal strategy with a high probability, but still leaves a small probability to search. At this stage we get the current aggregate action that should measure the flow. By adjusting epsilon_delayThe value of (c) may adjust the rate of decrease.

Step S33 includes the following substeps: taking the proportion of the detected network traffic in all the network traffic as a reward value reward; reward is obtained by the following formula:

Step S4 includes the following substeps: the stream detection data put into the historical sample buffer pool includes: the state of the current network; making a decision on according to the current network state; the state next _ state to which the decision is made; the prize value reward for each stream.

The last measured size of each flow is retained, and if the flow is re-measured, the update is replaced with a new value, and if not, the last value is retained. And taking the ratio of the measured network traffic to all the network traffic as the quality of a selected result, and dividing the sum of the currently measured traffic by the sum of the estimated values of each flow to obtain an estimated score reward of the current strategy. The calculation formula is as follows:

the size of each flow over the past k measurement periods is chosen. All the measured values in the previous k periods are stored, and the model can extract more context information. By adjusting the size of k, we can extract features of different time scales. Storing the measurement condition under the current strategy, and storing (state, on, next _ state, reward) into a historical sample buffer pool, wherein the state is the state of the current network, and specifically the size of each stream in the past k measurement periods; the aciton is a decision made according to the current state, and stores and selects which flow to measure; next _ state is the state to which the decision is made to transition, i.e. the flow measurement data after the refinement; reward is an estimate of the prize value for each flow, which is measured as the amount of traffic in the total flow, and 0 otherwise.

Step S2 further includes the following sub-steps: obtaining errors of the detection value and the model value according to the current network state and the next _ state of the transferred state after decision making; and optimizing the model according to the errors of the detection value and the model value.

And optimizing the model by using the data in the historical sample buffer pool. The model obtains state _ values according to the state and obtains next _ state _ values according to the next _ state. The prize value received by the actual environment for a state is reward. γ in target _ state _ values γ + (1- γ) reward is a number from 0 to 1 for controlling the attenuation of potential reports. Ideally, the difference between state _ values and target should be small. And optimizing the model according to the error obtained by losss (state _ values, target).

The model adopts a neural network model.

The model used is a neural network, the input is the state action composed of the measured values of the past k cycles, the output is the estimated reward value rewarded corresponding to each flow, the value is the value after the normalization of the softmax, and the formula of the normalization of the softmax is

The loss evaluation function we used smooth l1loss, as follows:

taking the link state of the network and the historical measurement information of the flow as states, taking the measurement size of the flow as a reward value, and selecting k maximum flows for measurement in each strategy; the big flow detection method based on reinforcement learning is adopted to detect the big flow in the network, the characteristics of flow correlation and the like can be fully extracted, and the accuracy of big flow detection can be improved.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. The method for detecting the large flow based on reinforcement learning is characterized by comprising the following steps:

s1: detecting a data stream to obtain stream detection data;

s2: optimizing a detection data model by adopting a historical sample buffer pool;

s3: judging the big flow of the flow detection data by adopting the optimized detection data model, and detecting the big flow again;

s4: putting the stream detection data into a history sample buffer pool, and sequentially executing S2, S3 and S4 again until the detection is finished;

wherein, step S3 includes the following substeps:

s31: the detection data model scores the data stream according to the current state;

s32: selecting k flows for detection to obtain new flow detection data and a new network state;

s33: obtaining a currently detected reward value according to the new flow detection data and the new network state;

step S32 includes the following substeps:

setting a probability threshold epsilon_threshold；

When the random probability is less than the probability threshold epsilon_thresholdThen, randomly selecting k streams for detection;

when the random probability is greater than the probability threshold epsilon_thresholdThen, sorting the scores of the streams in a reverse order, and selecting k streams with the highest scores for detection;

the probability threshold epsilon_thresholdObtained by the following formula:

wherein steps is the detection times; epsilon_sIs a probability threshold epsilon_thresholdThe upper limit of (d); epsilon_eIs a probability threshold epsilon_thresholdThe lower limit of (d); epsilon_delayIs a rate parameter;

step S33 includes the following substeps:

taking the proportion of the detected network traffic in all the network traffic as a reward value reward;

reward is obtained by the following formula:

in the formula, action is the set of the current detected flow; last is the set of sizes of the last detected stream; measure is the set of the sizes of the currently detected streams, and n is the total number of the streams in the network;

step S4 includes the following substeps:

the stream detection data put into the historical sample buffer pool includes: the state of the current network; making a decision on according to the current network state; the state next _ state to which the decision is made; detecting a reward value reward for a stream;

step S2 further includes the following sub-steps:

obtaining errors of the detection value and the model value according to the current network state and the next _ state of the transferred state after decision making;

and optimizing the model according to the errors of the detection value and the model value.

2. The reinforcement learning-based large flow detection method according to claim 1, wherein the model is a neural network model.