CN116244612B - HTTP traffic clustering method and device based on self-learning parameter measurement - Google Patents


Info

Publication number
CN116244612B
CN116244612B
Authority
CN
China
Prior art keywords
distance
data
attribute
flow data
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310532962.0A
Other languages
Chinese (zh)
Other versions
CN116244612A (en)
Inventor
赵新建
张颂
陈石
徐晨维
张明明
夏飞
袁国泉
宋浒
王鹏飞
汤铭
朱佳佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Jiangsu Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Jiangsu Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Jiangsu Electric Power Co Ltd, Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd filed Critical State Grid Jiangsu Electric Power Co Ltd
Priority to CN202310532962.0A priority Critical patent/CN116244612B/en
Publication of CN116244612A publication Critical patent/CN116244612A/en
Application granted granted Critical
Publication of CN116244612B publication Critical patent/CN116244612B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/02: Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/50: Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate


Abstract

The application discloses an HTTP traffic clustering method based on self-learning parameter measurement. Step 1: set a class label attribute in the message header attributes of an HTTP message traffic data set T, where the data set T comprises a plurality of pieces of message traffic data. Step 2: train the distance parameters in the distance metric function. Step 3: set the clustering thresholds, comprising a neighborhood density threshold and a radius parameter. Step 4: aggregate the traffic data set T based on the neighborhood density threshold and the radius parameter, the class label attribute L of each cluster of data obtained after aggregation being the same. The application clusters the traffic data with the DBSCAN algorithm based on the dist metric function, improving the clustering accuracy of HTTP traffic data.

Description

HTTP traffic clustering method and device based on self-learning parameter measurement
Technical Field
The application relates to the field of data mining, in particular to an HTTP traffic clustering method and device based on self-learning parameter measurement.
Background
In recent years, with the continuous development of internet technology, the rapid growth of network applications has created a series of challenges for traffic classification. The earliest traffic classification method assigned the network traffic of a specific port to the corresponding network application according to the port mapping table specified by the Internet Assigned Numbers Authority (IANA). However, with the increasing popularity of web applications, a large amount of HTTP message data accumulates in a network, and these messages correspond to different network access behaviors; clustering large volumes of HTTP messages is therefore significant for identifying network access behaviors, recommending services and monitoring access.
Clustering belongs to unsupervised learning: the data are analyzed without prior knowledge, and data of unknown state are divided into groups so that data in the same group have similar state characteristics while data in different groups have different state characteristics. The DBSCAN algorithm is a classical clustering method in the field of data mining; it can find class clusters of different shapes and is widely applied to clustering relational data. The algorithm uses a density-based notion of a cluster (or class cluster), requiring that the number of objects (points or other spatial objects) contained within a certain region of the clustering space be no less than a given threshold. The DBSCAN algorithm has the notable advantages of high clustering speed, effective handling of noise points (outliers), and discovery of spatial clusters of arbitrary shape; the idea and flow of the DBSCAN algorithm are shown in figure 2.
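The flow just described can be made concrete with a minimal DBSCAN in pure Python. This is an illustrative sketch of the classical algorithm, not code from the application; the function names and toy points are invented:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point,
    -1 for noise, 0..k-1 for cluster ids."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # eps-neighborhood of point i (includes i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        labels[i] = cluster            # i is a core sample: start a new cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # former noise becomes a border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is also a core sample: expand
                queue.extend(j_neighbors)
        cluster += 1
    return labels

# two dense blobs plus one far-away outlier
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (20, 20)]
labels = dbscan(pts, eps=0.5, min_pts=3)
```

With eps=0.5 and min_pts=3 the two blobs form separate clusters and the outlier is labeled noise, mirroring the density-threshold idea described above.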
Unlike data in a fixed relational schema, HTTP message data lack an explicit class label attribute, and whereas relational data are distributed relatively uniformly across attributes, the numerical distribution across the attributes of HTTP message data is relatively sparse, so common data distance measures fail. At the same time, a small amount of class-label prior knowledge, including class labels and data-point division constraints, can be obtained from part of the HTTP message data. How to use this small amount of prior knowledge to perform cluster analysis on a large amount of data without prior knowledge has become a very important issue.
In practical traffic clustering work, statistical analysis of the traffic data shows that, in some clustering tasks, some data in the data set contain class-label attributes capable of locating class clusters, such as an IP address or a user userid. Prior studies, however, ignore the effect of this information on the grouping of traffic data. Applying the DBSCAN algorithm directly to HTTP message data clustering suffers from inaccurate measurement of data distance (difference), difficult parameter setting for the clustering algorithm, and failure to fully use the available partial class-label information, so the clusters obtained have poor accuracy and can hardly support subsequent decision services.
Disclosure of Invention
The application aims to solve the problems described in the background art and provides an HTTP traffic clustering method and device based on self-learning parameter measurement. The application designs a metric function dist for measuring the difference between two pieces of HTTP message data, optimizes the parameters of the metric function dist by introducing the ideas of logistic regression and gradient descent to improve measurement accuracy, and finally adopts the DBSCAN clustering algorithm to cluster the traffic data based on the dist metric function, improving the precision of clustering HTTP traffic data.
In order to achieve the technical purpose, the application adopts the following technical scheme:
a self-learning parameter measurement-based HTTP traffic semi-supervised clustering method comprises the following steps:
step 1, giving an HTTP message flow data set T which comprises a plurality of message flowsThe method comprises the steps of measuring data, setting class label attributes in message header attributes of each message flow data to be L, wherein L is expressed in a discrete variable form, and the L is partially empty, and forming a training set by the data with the class label attributes L which are not empty
Step 2, defining a distance measurement function for measuring two message flow data,/>Is the gap between (1);
Step 3, use a logistic regression model and the training set T_train to train the distance parameters in the distance metric function;
Step 4, set the clustering thresholds, comprising the neighborhood density threshold MinPts and the radius parameter ε;
Step 5, based on the neighborhood density threshold MinPts and the radius parameter ε, aggregate the traffic data set T; the class label attribute L of each cluster of data obtained after aggregation is the same.
The step 1 comprises the following steps:
Step 1.1, an HTTP message traffic data set T is given, containing k pieces of message traffic data, expressed as follows:

T = {x_1, x_2, ..., x_k},  x_m = (x_m^1, x_m^2, ..., x_m^n, y_m),  m = 1, ..., k

where x_k^n is the n-th attribute value of the HTTP header attributes of the k-th piece of message traffic data in the data set, y_k is the attribute value of the class label attribute L, and k and n are natural numbers;
Step 1.2, define the prediction set T_pre as the set containing all message traffic data whose L attribute is null; the remaining message traffic data, whose L attribute is not null, form the training set T_train.
The step 2 comprises the following steps:
The distance metric function is defined as follows:

dist(x_i, x_j) = 1 / (1 + e^(-(w · d_ij + b))), where d_ij = (d_ij^1, d_ij^2, ..., d_ij^n) and d_ij^r = |x_i^r - x_j^r| / s_r

where x_i and x_j are any two pieces of data; r denotes the r-th attribute; d_ij^r is the difference of x_i and x_j on attribute r; s_r is the standard deviation of attribute r over the message header attributes of the HTTP message traffic data set T and μ_r is its mean, i.e.

μ_r = (1/k) Σ_{m=1}^{k} x_m^r,  s_r = sqrt( (1/k) Σ_{m=1}^{k} (x_m^r - μ_r)^2 )

x_m^r denotes the attribute value of the r-th attribute of the m-th piece of data; e denotes the natural constant; w = (w_1, w_2, ..., w_n) are the distance weights of the distance metric function dist, w_r being the distance weight of the r-th attribute; and b is the deviation of the distance metric function dist. The distance weights w of the distance metric function dist and the deviation b of the distance metric function dist are collectively referred to as the distance parameters.
The step 3 comprises the following steps:
Step 3.1, the logistic regression model is expressed as follows:

f(z) = 1 / (1 + e^(-(w · z + b)))

where w denotes the weights of the logistic regression model, shared with the distance weights of dist, z denotes an input difference vector built from the training set T_train as described in step 3.2, and b denotes the bias of the logistic regression model;
Step 3.2, process the training set T_train by performing a difference operation on every pair of data in T_train, with the formulas:

z_ij^r = |x_i^r - x_j^r| / s_r,  y_ij = 0 if y_i = y_j, and y_ij = 1 otherwise

where z_ij = (z_ij^1, ..., z_ij^n) is the normalized difference of the two pieces of data x_i and x_j, s_r denotes the standard deviation of the r-th attribute over the training set T_train, x_i^r and x_j^r denote the attribute values of the r-th attribute of x_i and x_j, and y_ij indicates whether the class label attributes of the two pieces of data x_i and x_j are the same;
Step 3.3, the loss function of the logistic regression model is expressed as the mean squared error (MSE), i.e. the following formula:

J(w, b) = (1/M^2) Σ_{i=1}^{M} Σ_{j=1}^{M} ( f(z_ij) - y_ij )^2

where J(w, b) denotes the loss function of the distance weights w, (w*, b*) = argmin J(w, b) denotes its minimum, and M denotes the number of data entries contained in the training set T_train.
The distance weights w of the distance metric function dist are solved by the gradient descent method: first the distance weights w are randomly initialized, and then the distance weights w are optimized by successive approximation, the optimization being achieved by the following formulas:

w_r := w_r - α · ∂J(w, b)/∂w_r,  b := b - α · ∂J(w, b)/∂b

where α denotes the learning rate of gradient descent, which controls the speed at which the distance weights w converge;
Through multiple training iterations the distance weights w converge, thereby completing the training of the distance weights w in the distance metric function dist.
The step 4 comprises the following steps:
Step 4.1, perform statistical analysis on the training set T_train, and take the average intra-class distance as the radius parameter ε of the clustering algorithm: the data whose class label attribute L is the same are grouped into one class, the pairwise distances within each class are computed, and the average distance is taken; that is, ε is expressed by the following formula:

ε = (1/C) Σ_{c=1}^{C} [ 2 / (n_c (n_c - 1)) · Σ_{x_i, x_j ∈ T_c, i < j} dist(x_i, x_j) ]
Step 4.2, take the average neighborhood density or the minimum neighborhood density as the neighborhood density threshold MinPts of the clustering algorithm. The average neighborhood density is the average intra-cluster density over the various class clusters; when the average neighborhood density is used as the neighborhood density threshold MinPts of the clustering algorithm, MinPts is expressed by the following formula:

MinPts = ( 1 / Σ_{c=1}^{C} n_c ) · Σ_{c=1}^{C} Σ_{x_i ∈ T_c} Σ_{x_j ∈ T_c} f( dist(x_i, x_j) )

When the minimum neighborhood density is used as the neighborhood density threshold MinPts of the clustering algorithm, MinPts is expressed by the following formula:

MinPts = min_{1 ≤ c ≤ C} min_{x_i ∈ T_c} Σ_{x_j ∈ T_c} f( dist(x_i, x_j) )

In the above formulas, the training set T_train contains C class clusters in total, T_c denotes the c-th class cluster, and n_c denotes the size of the c-th class cluster, where c, i, j are natural numbers and x_i, x_j are the i-th and j-th data in cluster T_c; the function f is a piecewise function expressed by the following formula:

f(d) = 1, if d ≤ ε;  f(d) = 0, if d > ε.
the step 5 comprises the following steps:
Step 5.1, select one piece of traffic data x from the data set T, and find all traffic data whose distance to x is less than or equal to ε. If the number of traffic data within distance ε of x is less than MinPts, x is marked as noise; if it is not less than MinPts, x is marked as a core sample and assigned a new cluster label;
Step 5.2, visit all neighbours of x within distance ε; if a neighbour has not yet been assigned a cluster, assign it the new cluster label from step 5.1, and if it is a core sample, visit its neighbours in turn, and so on, so that the cluster gradually grows until there are no more core samples within distance ε of the cluster;
Step 5.3, select another piece of traffic data that has not yet been visited, and repeat steps 5.1 and 5.2;
Step 5.4, finally, all class clusters of the data set T are obtained after clustering. The traffic data within each class cluster are randomly assigned a class label that is identical within the cluster and distinct from the labels of the other class clusters.
The application also provides an HTTP traffic semi-supervised clustering device based on self-learning parameter measurement, comprising an HTTP message traffic data set module, a distance measurement module, a self-learning training module, a clustering threshold module and an aggregation module, connected in sequence by signals. The HTTP message traffic data set module is used for giving the HTTP message traffic data set T, which contains a plurality of pieces of message traffic data; the class label attribute in the message header attributes of each piece is set to L, L is expressed as a discrete variable and is partially empty, and the data whose class label attribute L is not empty form the training set T_train. The distance measurement module is used for measuring the gap between two pieces of message traffic data x_i and x_j; the self-learning training module is used for training the distance parameters in the distance metric function with a logistic regression model and the training set T_train; the clustering threshold module is used for setting the clustering thresholds; and the aggregation module is used for aggregating the traffic data set T based on the clustering thresholds, the class label attribute L of each cluster of data obtained after aggregation being the same.
The clustering threshold module comprises a neighborhood density threshold module and a radius parameter module; the neighborhood density threshold module is used for calculating the neighborhood density threshold MinPts, and the radius parameter module is used for calculating the radius parameter ε.
A computer readable storage medium storing a computer program which, when executed by a processor, performs the method steps described above.
An electronic device comprising a processor and a memory, said memory storing a computer program which, when executed by said processor, implements the method steps described above.
Compared with the prior art, the application has the following beneficial effects:
the application aims to solve the problems in the background art and provides an HTTP traffic clustering method and device based on self-learning parameter measurement. The application designs a distance measurement function dist for measuring the difference between two HTTP message data, introduces a semi-supervised learning idea, designs a distance parameter training method based on a logistic regression model, trains distance weights by using a small amount of marked data in a data set, and improves the accuracy of data measurement; designing a marked data clustering parameter calculation mechanism, calculating the average inter-class distance and neighborhood density of marked data clusters by using a distance measurement function as clustering parameters of a DBSCAN algorithm, and improving the adaptability and clustering precision of the clustering algorithm to HTTP traffic data sets.
Drawings
FIG. 1 is a flow chart of the present application;
FIG. 2 is a schematic and flow chart of a DBSCAN algorithm;
FIG. 3 is a visualization of the training set T_train and prediction set T_pre data in an embodiment;
FIG. 4 is a visualization of the clustering result in an embodiment.
Detailed Description
Embodiments of the present application are described in further detail below with reference to the accompanying drawings.
It should be noted that terms such as "upper", "lower", "left", "right", "front" and "rear" are used for descriptive purposes only and are not intended to limit the scope in which the application may be practiced; the relative relationships they denote may be altered or adjusted without materially changing the technical content of the application.
As shown in FIG. 1, the HTTP traffic semi-supervised clustering method based on self-learning parameter measurement provided by the application comprises the following steps:
step 1, given an HTTP message flow data set T, HTTP messageThe message flow data set T comprises a plurality of pieces of message flow data, the class label attribute in the message header attribute of each piece of message flow data is set to be L, L is expressed in the form of discrete variables, and the L is partially empty, and the data with the class label attribute L not being empty form a training set
Step 1.1, an HTTP message flow data set T is given, wherein the data set T contains k pieces of message flow data and is expressed as follows:
wherein ,an nth attribute value of HTTP header attribute of kth message flow data in the data set,/->Is the attribute value of the class label attribute L, wherein, < ->,/>All are natural numbers;
Step 1.2, define the prediction set T_pre as the set containing all message traffic data whose L attribute is null; the remaining message traffic data, whose L attribute is not null, form the training set T_train.
Step 2, a distance metric function dist is defined for measuring the gap between two pieces of message traffic data x_i and x_j;
the specific contents are as follows:
The distance metric function is defined as follows:

dist(x_i, x_j) = 1 / (1 + e^(-(w · d_ij + b))), where d_ij = (d_ij^1, d_ij^2, ..., d_ij^n) and d_ij^r = |x_i^r - x_j^r| / s_r

where x_i and x_j are any two pieces of data; r denotes the r-th attribute; d_ij^r is the difference of x_i and x_j on attribute r; s_r is the standard deviation of attribute r over the message header attributes of the HTTP message traffic data set T and μ_r is its mean, i.e.

μ_r = (1/k) Σ_{m=1}^{k} x_m^r,  s_r = sqrt( (1/k) Σ_{m=1}^{k} (x_m^r - μ_r)^2 )

x_m^r denotes the attribute value of the r-th attribute of the m-th piece of data; e denotes the natural constant; w = (w_1, w_2, ..., w_n) are the distance weights of the distance metric function dist, w_r being the distance weight of the r-th attribute; and b is the deviation of the distance metric function dist. The distance weights w of the distance metric function dist and the deviation b of the distance metric function dist are collectively referred to as the distance parameters.
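As a sketch, the metric can be written out directly, assuming the logistic form dist = 1/(1 + e^(-(w·d + b))) over standard-deviation-normalized attribute gaps; the function names and example weights below are illustrative, not taken from the application:

```python
import math

def normalized_gaps(x_i, x_j, stds):
    """Per-attribute gap |x_i^r - x_j^r| scaled by that attribute's
    standard deviation s_r."""
    return [abs(a - b) / s for a, b, s in zip(x_i, x_j, stds)]

def dist(x_i, x_j, w, b, stds):
    """Learned metric: logistic squash of the weighted normalized gaps,
    so the distance always falls strictly between 0 and 1."""
    d = normalized_gaps(x_i, x_j, stds)
    z = sum(wr * dr for wr, dr in zip(w, d)) + b
    return 1.0 / (1.0 + math.exp(-z))

# with a negative deviation b, identical points get a near-zero distance
d_near = dist((0.0, 0.0), (0.0, 0.0), w=[1.0, 1.0], b=-4.0, stds=[1.0, 1.0])
d_far = dist((0.0, 0.0), (3.0, 3.0), w=[1.0, 1.0], b=-4.0, stds=[1.0, 1.0])
```

The strict 0-to-1 range of this form is consistent with the distance matrix shown in Table 4 of the embodiment.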
Step 3, a logistic regression model and the training set T_train are used to train the distance parameters in the distance metric function;
Step 3.1, the logistic regression model is expressed as follows:

f(z) = 1 / (1 + e^(-(w · z + b)))

where w denotes the weights of the logistic regression model, shared with the distance weights of dist, z denotes an input difference vector built from the training set T_train as described in step 3.2, and b denotes the bias of the logistic regression model;
Step 3.2, process the training set T_train by performing a difference operation on every pair of data in T_train, with the formulas:

z_ij^r = |x_i^r - x_j^r| / s_r,  y_ij = 0 if y_i = y_j, and y_ij = 1 otherwise

where z_ij = (z_ij^1, ..., z_ij^n) is the normalized difference of the two pieces of data x_i and x_j, s_r denotes the standard deviation of the r-th attribute over the training set T_train, x_i^r and x_j^r denote the attribute values of the r-th attribute of x_i and x_j, and y_ij indicates whether the class label attributes of the two pieces of data x_i and x_j are the same;
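The pairwise difference operation of step 3.2 can be sketched as follows; this is an illustrative reading in which each labeled pair yields a normalized gap vector plus a 0/1 target recording whether the two class labels differ (the names are invented):

```python
import statistics

def difference_pairs(data, labels):
    """Step-3.2 sketch: for every pair (i, j) of labeled rows, emit the
    per-attribute gaps |x_i^r - x_j^r| / s_r together with a target that
    is 0 for a same-class pair and 1 for a different-class pair."""
    n_attr = len(data[0])
    stds = [statistics.pstdev(row[r] for row in data) for r in range(n_attr)]
    pairs = []
    for i in range(len(data)):
        for j in range(i):                     # lower triangle, as in Table 3
            z = [abs(data[i][r] - data[j][r]) / stds[r] for r in range(n_attr)]
            t = 0 if labels[i] == labels[j] else 1
            pairs.append((z, t))
    return pairs

# four labeled points, two per class
pairs = difference_pairs([(0, 0), (2, 0), (0, 2), (2, 2)], [0, 0, 1, 1])
```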
Step 3.3, the loss function of the logistic regression model is expressed as the mean squared error (MSE), i.e. the following formula:

J(w, b) = (1/M^2) Σ_{i=1}^{M} Σ_{j=1}^{M} ( f(z_ij) - y_ij )^2

where J(w, b) denotes the loss function of the distance weights w, (w*, b*) = argmin J(w, b) denotes its minimum, and M denotes the number of data entries contained in the training set T_train.
The distance weights w of the distance metric function dist are solved by the gradient descent method: first the distance weights w are randomly initialized, and then the distance weights w are optimized by successive approximation, the optimization being achieved by the following formulas:

w_r := w_r - α · ∂J(w, b)/∂w_r,  b := b - α · ∂J(w, b)/∂b

where α denotes the learning rate of gradient descent, which controls the speed at which the distance weights w converge;
Through multiple training iterations the distance weights w converge, thereby completing the training of the distance weights w in the distance metric function dist.
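The weight training of step 3 can be sketched end to end as below: an MSE loss over all labeled pairs, with target 0 for same-class and 1 for different-class pairs, minimized by plain gradient descent. This is an illustrative reading with invented names and hyperparameters, not the application's reference implementation:

```python
import itertools
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_distance_weights(data, labels, stds, lr=0.5, epochs=200, seed=0):
    """Fit the distance weights w and deviation b of dist by gradient
    descent on the pairwise MSE loss."""
    rng = random.Random(seed)
    n = len(data[0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(n)]   # random initialization
    b = 0.0
    pairs = list(itertools.combinations(range(len(data)), 2))
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for i, j in pairs:
            d = [abs(a - c) / s for a, c, s in zip(data[i], data[j], stds)]
            p = sigmoid(sum(wr * dr for wr, dr in zip(w, d)) + b)
            t = 0.0 if labels[i] == labels[j] else 1.0
            g = 2.0 * (p - t) * p * (1.0 - p)        # d(MSE term)/dz
            for r in range(n):
                grad_w[r] += g * d[r]
            grad_b += g
        m = len(pairs)
        w = [wr - lr * gr / m for wr, gr in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b

# two tight same-class pairs, far from each other
data = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.2, 5.1)]
class_labels = [0, 0, 1, 1]
w, b = train_distance_weights(data, class_labels, stds=(1.0, 1.0))
d_same = sigmoid(w[0] * 0.2 + w[1] * 0.1 + b)   # first same-class pair
d_diff = sigmoid(w[0] * 5.0 + w[1] * 5.0 + b)   # a cross-class pair
```

After training, same-class pairs come out measurably closer than cross-class pairs, which is what the subsequent clustering step relies on.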
Step 4, the clustering thresholds are set, comprising the neighborhood density threshold MinPts and the radius parameter ε.
Step 4.1 training setStatistical analysis is performed, and the average inter-class distance is used as the radius parameter of the clustering algorithm>The data with the average inter-class distance L and the same attribute are classified into one class, the data distance of the same class is calculated, and the distance average value is calculated, namely +.>Is expressed by the following formula:
Step 4.2, take the average neighborhood density or the minimum neighborhood density as the neighborhood density threshold MinPts of the clustering algorithm. The average neighborhood density is the average intra-cluster density over the various class clusters; when the average neighborhood density is used as the neighborhood density threshold MinPts of the clustering algorithm, MinPts is expressed by the following formula:

MinPts = ( 1 / Σ_{c=1}^{C} n_c ) · Σ_{c=1}^{C} Σ_{x_i ∈ T_c} Σ_{x_j ∈ T_c} f( dist(x_i, x_j) )

When the minimum neighborhood density is used as the neighborhood density threshold MinPts of the clustering algorithm, MinPts is expressed by the following formula:

MinPts = min_{1 ≤ c ≤ C} min_{x_i ∈ T_c} Σ_{x_j ∈ T_c} f( dist(x_i, x_j) )

In the above formulas, the training set T_train contains C class clusters in total, T_c denotes the c-th class cluster, and n_c denotes the size of the c-th class cluster, where c, i, j are natural numbers and x_i, x_j are the i-th and j-th data in cluster T_c; the function f is a piecewise function expressed by the following formula:

f(d) = 1, if d ≤ ε;  f(d) = 0, if d > ε.
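Steps 4.1 and 4.2 can be sketched as follows, here with the average-neighborhood-density variant of MinPts; the helper names are invented, and any metric (including a trained dist) can be passed in:

```python
def cluster_parameters(data, labels, metric):
    """Estimate DBSCAN parameters from the labeled subset (a sketch):
    eps    = mean pairwise distance within each class, averaged over classes;
    MinPts = average number of same-class points within eps of a point
             (counting the point itself).
    Each class is assumed to contain at least two points."""
    classes = sorted(set(labels))
    per_class_means = []
    for c in classes:
        idx = [i for i, lab in enumerate(labels) if lab == c]
        dists = [metric(data[i], data[j])
                 for k, i in enumerate(idx) for j in idx[k + 1:]]
        per_class_means.append(sum(dists) / len(dists))
    eps = sum(per_class_means) / len(per_class_means)
    densities = []
    for c in classes:
        idx = [i for i, lab in enumerate(labels) if lab == c]
        for i in idx:
            densities.append(sum(1 for j in idx
                                 if metric(data[i], data[j]) <= eps))
    min_pts = sum(densities) / len(densities)
    return eps, min_pts

# toy labeled set with a Manhattan metric standing in for dist
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
eps, min_pts = cluster_parameters(data, labels, manhattan)
```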
step 5, based on neighborhood density thresholdAnd radius parameter->The stream data set T is aggregated, and the class label attribute L of each cluster of data obtained after aggregation is the same.
Step 5.1 selecting one flow data from the data set T,/>Then find distance +.>Less than or equal to->If the distance is +.>Is at +.>The amount of flow data within is less thanThen->Marked as noise if the distance is +.>The flow data quantity in the system is more than +.>Then/>Marked as core samples and assigned a new cluster;
Step 5.2, visit all neighbours of x within distance ε; if a neighbour has not yet been assigned a cluster, assign it the new cluster label from step 5.1, and if it is a core sample, visit its neighbours in turn, and so on, so that the cluster gradually grows until there are no more core samples within distance ε of the cluster;
Step 5.3, select another piece of traffic data that has not yet been visited, and repeat steps 5.1 and 5.2;
Step 5.4, finally, all class clusters of the data set T are obtained after clustering. The traffic data within each class cluster are randomly assigned a class label that is identical within the cluster and distinct from the labels of the other class clusters.
The application also provides a self-learning parameter measurement-based HTTP traffic semi-supervised clustering device, which comprises an HTTP message traffic data set module, a distance measurement module, a self-learning training module, a clustering threshold module and an aggregation module, wherein the HTTP message traffic data set module, the distance measurement module, the self-learning training module, the clustering threshold module and the aggregation module are connected by signals in sequence,
the HTTP message flow data set module is used for giving an HTTP message flow data set T and HTTP messagesThe flow data set T comprises a plurality of pieces of message flow data, the class label attribute in the message header attribute of each piece of message flow data is set to be L, L is expressed in a discrete variable form, and the L is partially empty, and the data with the class label attribute L not being empty form a training set
The distance measurement module is used for measuring the gap between two pieces of message traffic data x_i and x_j;
the self-learning training module is used for training a set by using a logistic regression modelTraining distance parameters in a distance measurement function;
the clustering threshold module is used for setting a clustering threshold;
the aggregation module is used for aggregating the flow data set T based on the clustering threshold value, and the class label attribute L of each cluster of data obtained after aggregation is the same.
The clustering threshold module comprises a neighborhood density threshold module and a radius parameter module; the neighborhood density threshold module is used for calculating the neighborhood density threshold MinPts, and the radius parameter module is used for calculating the radius parameter ε.
A computer readable storage medium storing a computer program which, when executed by a processor, performs the method steps described above.
An electronic device comprising a processor and a memory, said memory storing a computer program which, when executed by said processor, implements the method steps described above.
To illustrate the technical scheme of the application more clearly, an example of the HTTP traffic semi-supervised clustering method based on self-learning parameter measurement is provided. First, the HTTP message traffic data set T is divided into two tables: the training set T_train is Table 1 and the prediction set T_pre is Table 2; a visualization of the T_train and T_pre data is shown in FIG. 3.
TABLE 1
X1 X2 Y X1 X2 Y X1 X2 Y
2.24 2.78 1 -3.81 0.08 2 1.18 -4.55 3
1.95 3.69 1 -4.17 -0.68 2 -0.2 -4.72 3
2.04 1.94 1 -4.32 -1.47 2 -0.2 -4.2 3
1.22 2.61 1 -4.97 -0.4 2 -0.28 -3.75 3
-0.28 2.52 1 -6.83 -0.77 2 -0.78 -2.4 3
1.75 3.32 1 -5.29 0.17 2 -0.57 -3.78 3
1.9 2.92 1 -3.9 -0.8 2 0.67 -3.25 3
3.28 2.17 1 -3.92 -0.78 2 -1.19 -5.31 3
2.41 0.6 1 -4.71 -0.29 2 -0.4 -3.34 3
0.53 1.98 1 -3.4 0.81 2 0.8 -1.97 3
0.76 2.6 1 -4.36 -0.7 2 0.16 -4.35 3
1.37 2.96 1 -5.64 0.01 2 -1.05 -2.77 3
2.24 1.57 1 -6.16 -0.07 2 0.3 -4.23 3
3.5 2.63 1 -4.97 1.43 2 0.78 -4.34 3
3.48 2.51 1 -2.39 0.88 2 1.24 -4.76 3
TABLE 2
X1 X2 Y X1 X2 Y X1 X2 Y
1.98 2.3 1 -3.51 -1.22 2 0.81 -2.98 3
2.88 3.23 1 -4.63 0.48 2 -0.57 -3.56 3
2.38 1.41 1 -3.87 -0.78 2 -1 -4.84 3
0.9 2.84 1 -4.43 0.09 2 0.52 -3.54 3
0.58 2.36 1 -1.99 2.31 2 0.43 -2.33 3
2.52 0.91 1 -3.42 -0.46 2 -0.12 -3.05 3
1.88 -0.34 1 -3.82 1.28 2 0.38 -2.73 3
3.01 1.03 1 -3.22 -0.03 2 -1.31 -2.97 3
2.51 2.03 1 -3.85 -1.37 2 0.2 -6.63 3
1.9 1.84 1 -3.78 0.71 2 0.43 -5.44 3
1.32 5.36 1 -5.17 0.23 2 -0.14 -2.53 3
0.71 1.57 1 -3.84 0.82 2 0.69 -3.7 3
2.68 0.3 1 -4.91 -1 2 0.48 -4 3
0.38 3.07 1 -3.33 -0.8 2 -1.23 -3.55 3
1.35 2.39 1 -4.81 -0.17 2 -0.02 -4.43 3
The data fall into three classes, each piece consisting of two attributes X1 and X2 and a class label attribute Y.

The distance weights w of dist are first initialized randomly.

By the operation shown in step 3, the training set T_train is first processed: the difference operation is performed on every pair of data in T_train (owing to space limits, only the leading 5 × 5 submatrix is shown), as shown in Table 3 below.
TABLE 3 Table 3
(0.0; 0.0) (0.0; 0.0) (0.0; 0.0) (0.0; 0.0) (0.0; 0.0) ……
(0.14; 0.13) (0.0; 0.0) (0.0; 0.0) (0.0; 0.0) (0.0; 0.0) ……
(0.06; 0.13) (0.08; 0.26) (0.0; 0.0) (0.0; 0.0) (0.0; 0.0) ……
(0.17; 0.08) (0.32; 0.06) (0.24; 0.2) (0.0; 0.0) (0.0; 0.0) ……
(0.22; 0.01) (0.37; 0.12) (0.29; 0.13) (0.05; 0.07) (0.0; 0.0) ……
…… …… …… …… …… ……
The training set T_train is used to optimize the distance weights w; through multiple training iterations, better distance weights are obtained, and the trained metric function dist is thereby obtained.
Then, by the operation shown in step 4, the neighborhood density threshold MinPts and the radius parameter ε are computed from the training set T_train, giving the clustering thresholds.
Finally, the data set T is clustered through step 5, with the following results. The pairwise distances over the data set T are first computed, giving a 90 × 90 distance matrix, as shown in Table 4 below.
TABLE 4 Table 4
0.00 0.31 0.23 0.27 …… 0.98 0.99 1.00 0.99
0.31 0.00 0.42 0.39 …… 1.00 1.00 1.00 1.00
0.23 0.42 0.00 0.52 …… 0.97 0.98 0.99 0.99
0.27 0.39 0.52 0.00 …… 0.97 0.98 0.99 0.99
0.23 0.55 0.47 0.16 …… 0.96 0.97 0.99 0.98
0.34 0.50 0.15 0.65 …… 0.96 0.97 0.99 0.99
0.53 0.83 0.41 0.78 …… 0.83 0.88 0.95 0.94
0.40 0.43 0.18 0.71 …… 0.97 0.98 0.99 0.99
…… …… …… …… …… …… …… …… ……
1.00 1.00 1.00 1.00 …… 0.66 0.56 0.82 0.45
1.00 1.00 0.99 1.00 …… 0.36 0.27 0.65 0.25
0.97 0.99 0.96 0.96 …… 0.34 0.37 0.36 0.37
0.98 1.00 0.97 0.97 …… 0.00 0.13 0.33 0.24
0.99 1.00 0.98 0.98 …… 0.13 0.00 0.35 0.17
1.00 1.00 0.99 0.99 …… 0.33 0.35 0.00 0.35
0.99 1.00 0.99 0.99 …… 0.24 0.17 0.35 0.00
Clustering this distance matrix yields the results shown in Table 5 below; a visualization of the clustering result is shown in Fig. 4 (square points are noise points).
TABLE 5
X1 X2 Y X1 X2 Y X1 X2 Y
2.24 2.78 1 -3.81 0.08 0 1.18 -4.55 2
1.95 3.69 1 -4.17 -0.68 0 -0.2 -4.72 2
2.04 1.94 1 -4.32 -1.47 0 -0.2 -4.2 2
1.22 2.61 1 -4.97 -0.4 0 -0.28 -3.75 2
-0.28 2.52 1 -6.83 -0.77 0 -0.78 -2.4 2
1.75 3.32 1 -5.29 0.17 0 -0.57 -3.78 2
1.9 2.92 1 -3.9 -0.8 0 0.67 -3.25 2
3.28 2.17 1 -3.92 -0.78 0 -1.19 -5.31 2
2.41 0.6 1 -4.71 -0.29 0 -0.4 -3.34 2
0.53 1.98 1 -3.4 0.81 0 0.8 -1.97 2
0.76 2.6 1 -4.36 -0.7 0 0.16 -4.35 2
1.37 2.96 1 -5.64 0.01 0 -1.05 -2.77 2
2.24 1.57 1 -6.16 -0.07 0 0.3 -4.23 2
3.5 2.63 1 -4.97 1.43 0 0.78 -4.34 2
3.48 2.51 1 -2.39 0.88 0 1.24 -4.76 2
1.98 2.3 1 -3.51 -1.22 0 0.81 -2.98 2
2.88 3.23 1 -4.63 0.48 0 -0.57 -3.56 2
2.38 1.41 1 -3.87 -0.78 0 -1 -4.84 2
0.9 2.84 1 -4.43 0.09 0 0.52 -3.54 2
0.58 2.36 1 -1.99 2.31 -1 0.43 -2.33 2
2.52 0.91 1 -3.42 -0.46 0 -0.12 -3.05 2
1.88 -0.34 1 -3.82 1.28 0 0.38 -2.73 2
3.01 1.03 1 -3.22 -0.03 0 -1.31 -2.97 2
2.51 2.03 1 -3.85 -1.37 0 0.2 -6.63 2
1.9 1.84 1 -3.78 0.71 0 0.43 -5.44 2
1.32 5.36 -1 -5.17 0.23 0 -0.14 -2.53 2
0.71 1.57 1 -3.84 0.82 0 0.69 -3.7 2
2.68 0.3 1 -4.91 -1 0 0.48 -4 2
0.38 3.07 1 -3.33 -0.8 0 -1.23 -3.55 2
1.35 2.39 1 -4.81 -0.17 0 -0.02 -4.43 2
In the Y label, -1 denotes a noise point, while 0-2 denote the different clusters.
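For intuition, the three-clusters-plus-noise layout of Table 5 can be reproduced on a toy 2-D data set with a simplified density pass. This is an illustrative sketch only: the coordinates, eps = 2.5 and pts_min = 1 are invented for the example, and plain Euclidean distance stands in for the learned metric dist (with pts_min = 1, expanding from every reached point coincides with the core-sample expansion of step 5).

```python
import math

def euclid(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

points = [(2.2, 2.8), (1.9, 3.7), (2.0, 1.9),       # blob around (2, 2.5)
          (-3.8, 0.1), (-4.2, -0.7), (-4.3, -1.5),  # blob around (-4, -0.5)
          (1.2, -4.6), (-0.2, -4.7), (-0.2, -4.2),  # blob around (0, -4.5)
          (9.0, 9.0)]                               # isolated point -> noise

eps, pts_min = 2.5, 1
labels = [-1] * len(points)   # -1 = unassigned / noise, as in Table 5
cluster = 0
for i, p in enumerate(points):
    if labels[i] != -1:
        continue
    neigh = [j for j, q in enumerate(points) if j != i and euclid(p, q) <= eps]
    if len(neigh) < pts_min:
        continue              # stays -1, like the Y = -1 rows
    labels[i] = cluster       # start a new cluster from this core point
    stack = list(neigh)
    while stack:              # grow the cluster through the eps-neighbourhood
        j = stack.pop()
        if labels[j] != -1:
            continue
        labels[j] = cluster
        stack += [k for k, q in enumerate(points)
                  if labels[k] == -1 and euclid(points[j], q) <= eps]
    cluster += 1
```

The resulting labels list assigns 0-2 to the three blobs and leaves the isolated point at -1, mirroring the Y column of Table 5.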
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the application can be realized in various computer languages, such as the object-oriented programming language Java and the scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (9)

1. An HTTP traffic semi-supervised clustering method based on self-learning parameter measurement, characterized by comprising the following steps:
step 1, giving an HTTP message flow data set T which contains a plurality of message flow data, wherein the class label attribute in the header attribute of each message flow data set is set to be L, L is expressed in the form of discrete variables, part of L is empty, and the data with the class label attribute L not being empty form a training set T Train
Step 2, defining a distance metric function dist for measuring the gap between two pieces of message flow data x_i, x_j;
step 3, using a logistic regression model in the training set T Train Training distance parameters in a distance measurement function;
step 4, setting a clustering threshold, wherein the clustering threshold comprises a neighborhood density threshold Pts min And radius parameter Eps, for training set T Train Carrying out statistical analysis, taking the average inter-class distance as a radius parameter Eps of a clustering algorithm, classifying the data with the same L attribute as one class, calculating the data distance of the same class and calculating a distance average value;
step 5, based on the neighborhood density threshold Pts min And the radius parameter Eps is aggregated with the flow data set T, and the class label attribute L of each cluster of data obtained after aggregation is the same.
2. The HTTP traffic semi-supervised clustering method based on self-learning parameter metrics of claim 1, wherein the step 1 includes:
step 1.1, an HTTP message flow data set T is given, wherein the data set T contains k pieces of message flow data and is expressed as follows:
T={x 1 ,x 2 ,…,x k },
wherein ,an nth attribute value, y, of an HTTP message header attribute of kth message flow data in the dataset k Is the attribute value of the class label attribute L, wherein n and k are natural numbers;
step 1.2 define a prediction set T Pred For the message flow data containing all the L attributes being null, the rest message flow data not being null form a training set T Train
3. The HTTP traffic semi-supervised clustering method based on self-learning parameter metrics according to claim 2, wherein the step 2 includes:
defining a distance metric function as follows:

dist(x_i, x_j) = 1 / (1 + e^(−z)), z = w · x_ij + b,

wherein

x_ij^(r) = |x_i^(r) − x_j^(r)| / s^(r), s^(r) = sqrt( (1/k) · Σ_{m=1..k} ( x_m^(r) − x̄^(r) )² ),

wherein x_i, x_j are any two pieces of data, r denotes the rth attribute, |x_i^(r) − x_j^(r)| is the difference between the rth attributes of x_i and x_j, s^(r) is the standard deviation of attribute r over the message header attributes of the HTTP message flow data set T, x̄^(r) is the mean of attribute r over the message header attributes of T, m ∈ [1, k], x_m^(r) is the attribute value of the rth attribute of the mth piece of data, z denotes the argument of the sigmoid, w = <w_1, w_2, ..., w_r, ..., w_n> is the distance weight of the distance metric function dist, w_r is the distance weight of the rth attribute, b is the bias of the distance metric function dist, and the distance weight w and the bias b are collectively called the distance parameters.
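As a non-normative illustration of the metric in this claim, the following Python sketch evaluates dist from a weight vector w, bias b and per-attribute standard deviations. The helper names pair_diff and dist, and the explicit sigmoid form, are assumptions drawn from the surrounding text rather than from the claim itself.

```python
import math

def pair_diff(xi, xj, std):
    # Standardized absolute difference per attribute: |xi_r - xj_r| / s_r
    return [abs(a - b) / s for a, b, s in zip(xi, xj, std)]

def dist(xi, xj, w, b, std):
    # Learned metric: sigmoid of the weighted standardized difference plus
    # bias, so the output is a dissimilarity score in (0, 1).
    z = sum(wr * d for wr, d in zip(w, pair_diff(xi, xj, std))) + b
    return 1.0 / (1.0 + math.exp(-z))
```

With a negative bias b, identical inputs score below 0.5, and larger standardized differences push the score toward 1.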
4. The HTTP traffic semi-supervised clustering method based on self-learning parameter metrics according to claim 3, wherein the step 3 includes:
step 3.1 logistic regression model is expressed as follows:
y=sigmod(w′x k +b′),
wherein W' represents the weight of the logistic regression model, x k Representing training set T Train The k-th flow, b' represents the deviation of the logistic regression model;
step 3.2 processing training set T Train Will T Train The difference operation is carried out on any data in the data, and the formula is as follows:
wherein ,xij Represents x i ,x j Two data are subjected to difference operation and standardization, s (n) Representing training set T Train Standard deviation of the nth attribute,represents x i Attribute value of the nth attribute, +.>Represents x j Attribute value, y of the nth attribute of (2) ij Represents x i ,x j Whether the class label attributes of the two pieces of data are the same;
step 3.3 represents the loss function of the logistic regression model by averaging the square error sum MSE, i.e. the following formula:
wherein J (w) represents a loss function of the distance weight w, argmin (w,b) Minimum value for w, b, l represents training set T Train The number of data entries to be included in the data stream,
solving the distance weight w of the distance measurement function dist of the formula by a gradient descent method, randomly initializing w, optimizing the distance weight w by an approximation mode, and realizing the optimization by the following formula:
wherein α represents a learning rate of gradient descent, which controls a rate of convergence of the distance weight w;
and after multiple training iterations, the distance weight w converges, and further training of the distance weight w in the distance measurement function dist is completed.
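The gradient-descent training of step 3.3 can be sketched as below. This is a minimal illustration under stated assumptions: an MSE loss over labelled difference vectors, with y_ij encoded as 0 for same-class pairs and 1 for different-class pairs; the function name train_metric and the hyperparameter values are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_metric(pairs, labels, alpha=0.5, iters=2000):
    # Gradient descent on the MSE loss of step 3.3.
    # pairs  : standardized absolute-difference vectors x_ij
    # labels : y_ij = 0 if the two messages share a class label, 1 otherwise
    n = len(pairs[0])
    w, b = [0.0] * n, 0.0
    for _ in range(iters):
        gw, gb = [0.0] * n, 0.0
        for x, y in zip(pairs, labels):
            p = sigmoid(sum(wr * xr for wr, xr in zip(w, x)) + b)
            g = 2.0 * (p - y) * p * (1.0 - p)   # d/dz of (p - y)^2
            gw = [gi + g * xr for gi, xr in zip(gw, x)]
            gb += g
        w = [wr - alpha * gi / len(pairs) for wr, gi in zip(w, gw)]
        b -= alpha * gb / len(pairs)
    return w, b
```

On a toy set where small differences mean "same class" and large ones mean "different class", the learned weights become positive, so larger differences map to larger distances.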
5. The HTTP traffic semi-supervised clustering method based on self-learning parameter metrics according to claim 4, wherein the step 4 includes:
step 4.1 training set T Train Carrying out statistical analysis, taking the average inter-class distance as a radius parameter Eps of a clustering algorithm, classifying the data with the same L attribute as one class, calculating the data distance of the same class, and calculating a distance average value, wherein Eps is expressed by the following formula:
step 4.2, counting average neighborhood density or minimum neighborhood density as a neighborhood density threshold Pts of a clustering algorithm min The average neighborhood density is the average intra-cluster density in various clusters, and when the average neighborhood density is used as a neighborhood density threshold Pts of a clustering algorithm min Pts when min Is expressed by the following formula:
when the minimum neighborhood density is used as the neighborhood density threshold of the clustering algorithm, the neighborhood density threshold Pts of the clustering algorithm is used as the neighborhood density threshold Pts of the clustering algorithm min Pts when min Is expressed by the following formula:
in the above formula, training set T Train Q clusters of class C h Represents the h class cluster in T, ||C h I represents the size of the h class cluster, where h ε [1, q]Q is a natural number, x' i ,x′ j Is cluster C h The i, j data in (a); function f (dist (x' i ,x′ j ) Eps) is a piecewise function, expressed by the following formula:
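Steps 4.1 and 4.2 can be sketched as follows, assuming Eps is the mean intra-class pairwise distance averaged over the classes and Pts_min the average number of same-class neighbours within Eps; the helper name cluster_thresholds and the scalar toy metric in the usage note are illustrative, not taken from the claims.

```python
def cluster_thresholds(groups, dist):
    # groups: list of classes, each a list of samples with the same L label
    # dist  : learned metric function of two samples
    # Eps: mean pairwise distance inside each class, averaged over the classes
    intra = []
    for g in groups:
        ds = [dist(a, b) for i, a in enumerate(g) for b in g[i + 1:]]
        if ds:
            intra.append(sum(ds) / len(ds))
    eps = sum(intra) / len(intra)
    # Pts_min: average number of same-class neighbours within Eps
    counts = []
    for g in groups:
        for i, a in enumerate(g):
            counts.append(sum(1 for j, b in enumerate(g)
                              if j != i and dist(a, b) <= eps))
    pts_min = sum(counts) / len(counts)
    return eps, pts_min
```

For example, with two tight scalar classes and absolute difference as the metric, Eps comes out as the average intra-class gap and Pts_min as the mean neighbour count.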
6. the HTTP traffic semi-supervised clustering method based on self-learning parameter metrics according to claim 5, wherein the step 5 includes:
step 5.1 selecting one flow data x from the data set T i ,i∈[1,k]Then find the distance x i All flow data less than or equal to Eps, if distance x i The number of flow data within the Eps is smaller than Pts min Then x i Marked as noise if the amount of flow data within Eps is greater than Pts min X is then i Marked as core samples and assigned a new cluster;
step 5.2 Access x i All neighbors within a distance Eps, if they have not been assigned a cluster, then assign the new cluster label in step 5.1 to them, if they are core samples, access their neighbors in turn, and so on, cluster gradually increasing until there are no more core samples within the Eps distance of the cluster;
step 5.3, selecting another flow data which is not accessed yet, and repeating the steps 5.1 and 5.2;
Step 5.4, finally obtaining all class clusters after clustering the data set T:

<C_1, C_2, ..., C_q | C_h ∈ T, h ∈ [1, q]>,

wherein the flow data within each class cluster C_h are randomly assigned a class label that is the same within the cluster and different from those of the other class clusters.
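Steps 5.1 to 5.4 describe a DBSCAN-style expansion. Below is a compact sketch under the convention that -1 marks noise and cluster labels start from 0 (matching Table 5 of the description); the function name dbscan and the queue-based expansion are implementation choices, not claim language.

```python
from collections import deque

def dbscan(data, dist, eps, pts_min):
    # Density clustering per steps 5.1-5.4: -1 marks noise, 0..q-1 clusters.
    labels = [None] * len(data)

    def region(i):
        return [j for j in range(len(data))
                if j != i and dist(data[i], data[j]) <= eps]

    cluster = 0
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        neigh = region(i)
        if len(neigh) < pts_min:
            labels[i] = -1            # provisionally noise (step 5.1)
            continue
        labels[i] = cluster           # core sample starts a new cluster
        queue = deque(neigh)
        while queue:                  # expand the cluster (step 5.2)
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster   # border point reclaimed from noise
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = region(j)
            if len(nj) >= pts_min:    # j is itself a core sample
                queue.extend(nj)
        cluster += 1
    return labels
```

On one-dimensional toy data with absolute difference as the metric, two dense runs become clusters 0 and 1 and an isolated value is left as noise.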
7. An HTTP traffic semi-supervised clustering device based on self-learning parameter measurement, characterized by comprising an HTTP message flow data set module, a distance metric module, a self-learning training module, a clustering threshold module and an aggregation module, connected in sequence by signals, wherein the HTTP message flow data set module is used for giving an HTTP message flow data set T containing a plurality of pieces of message flow data, wherein the class label attribute in the message header attributes of each piece of message flow data is denoted L, L is expressed as a discrete variable, part of the values of L are empty, and the data whose class label attribute L is not empty form the training set T_Train; the distance metric module is used for measuring the gap between two pieces of message flow data x_i, x_j; the self-learning training module is used for training the distance parameters in the distance metric function on the training set T_Train using a logistic regression model; the clustering threshold module is used for setting the clustering thresholds, and the aggregation module is used for clustering the flow data set T based on the clustering thresholds, the class label attribute L being the same within each cluster of data obtained after clustering; the clustering threshold module comprises a neighborhood density threshold module and a radius parameter module, the neighborhood density threshold module is used for computing the neighborhood density threshold, and the radius parameter module is used for statistically analyzing the training set T_Train and taking the average intra-class distance as the radius parameter Eps of the clustering algorithm: the data with the same L attribute are grouped into one class, the distances between data of the same class are computed, and their average value is taken.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which, when executed by a processor, implements the method steps of any of claims 1-6.
9. An electronic device comprising a processor and a memory, the memory storing a computer program, which, when executed by the processor, performs the method steps of any of claims 1-6.
CN202310532962.0A 2023-05-12 2023-05-12 HTTP traffic clustering method and device based on self-learning parameter measurement Active CN116244612B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310532962.0A CN116244612B (en) 2023-05-12 2023-05-12 HTTP traffic clustering method and device based on self-learning parameter measurement

Publications (2)

Publication Number Publication Date
CN116244612A CN116244612A (en) 2023-06-09
CN116244612B true CN116244612B (en) 2023-08-29

Family

ID=86635285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310532962.0A Active CN116244612B (en) 2023-05-12 2023-05-12 HTTP traffic clustering method and device based on self-learning parameter measurement

Country Status (1)

Country Link
CN (1) CN116244612B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145827A (en) * 2017-04-01 2017-09-08 浙江大学 Across the video camera pedestrian recognition methods again learnt based on adaptive distance metric
CN107819698A (en) * 2017-11-10 2018-03-20 北京邮电大学 A kind of net flow assorted method based on semi-supervised learning, computer equipment
CN109067612A (en) * 2018-07-13 2018-12-21 哈尔滨工程大学 A kind of online method for recognizing flux based on incremental clustering algorithm
CN110222782A (en) * 2019-06-13 2019-09-10 齐鲁工业大学 There are supervision two-category data analysis method and system based on Density Clustering
TW202019140A (en) * 2018-11-01 2020-05-16 財團法人資訊工業策進會 Suspicious packet detection device and suspicious packet detection method thereof
CN112822121A (en) * 2019-11-15 2021-05-18 中兴通讯股份有限公司 Traffic identification method, traffic determination method and knowledge graph establishment method
CN112926045A (en) * 2021-02-24 2021-06-08 北京通付盾人工智能技术有限公司 Group control equipment identification method based on logistic regression model
CN113242207A (en) * 2021-04-02 2021-08-10 河海大学 Iterative clustering network flow abnormity detection method
CN115600128A (en) * 2022-10-26 2023-01-13 中国农业银行股份有限公司(Cn) Semi-supervised encrypted traffic classification method and device and storage medium
CN116028832A (en) * 2023-02-06 2023-04-28 中国烟草总公司贵州省公司 Sample clustering processing method and device, storage medium and electronic equipment
WO2023073092A1 (en) * 2021-11-01 2023-05-04 Koninklijke Philips N.V. Managing a model trained using a machine learning process
CN116090454A (en) * 2022-12-26 2023-05-09 南京航空航天大学 Method for extracting critical features of dangerous sources of air management system based on semi-supervised learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019064598A1 (en) * 2017-09-29 2019-04-04 Nec Corporation Regression apparatus, regression method, and computer-readable storage medium
US11494669B2 (en) * 2018-10-30 2022-11-08 Diveplane Corporation Clustering, explainability, and automated decisions in computer-based reasoning systems


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on flooding attack detection based on HTTP traffic; Yang Jian; China Masters' Theses Full-text Database, Information Science and Technology; I139-84 *


Similar Documents

Publication Publication Date Title
Wu et al. Contextual bandits in a collaborative environment
US10354201B1 (en) Scalable clustering for mixed machine learning data
US20190147297A1 (en) System for time-efficient assignment of data to ontological classes
CN110196908A (en) Data classification method, device, computer installation and storage medium
US20220058222A1 (en) Method and apparatus of processing information, method and apparatus of recommending information, electronic device, and storage medium
CN113468227B (en) Information recommendation method, system, equipment and storage medium based on graph neural network
WO2022247955A1 (en) Abnormal account identification method, apparatus and device, and storage medium
CN103559504A (en) Image target category identification method and device
CN112668579A (en) Weak supervision semantic segmentation method based on self-adaptive affinity and class distribution
CN107633257B (en) Data quality evaluation method and device, computer readable storage medium and terminal
Du et al. Parallel processing of improved KNN text classification algorithm based on Hadoop
CN108540988B (en) Scene division method and device
CN114418035A (en) Decision tree model generation method and data recommendation method based on decision tree model
CN110929028A (en) Log classification method and device
CN110458600A (en) Portrait model training method, device, computer equipment and storage medium
US20230342606A1 (en) Training method and apparatus for graph neural network
CN115204416A (en) Heterogeneous client-oriented joint learning method based on hierarchical sampling optimization
CN115562940A (en) Load energy consumption monitoring method and device, medium and electronic equipment
CN115456093A (en) High-performance graph clustering method based on attention-graph neural network
CN108154380A (en) The method for carrying out the online real-time recommendation of commodity to user based on extensive score data
CN114995964A (en) Combination service reconstruction method, device, equipment and computer readable medium
CN114417095A (en) Data set partitioning method and device
CN116244612B (en) HTTP traffic clustering method and device based on self-learning parameter measurement
WO2020147259A1 (en) User portait method and apparatus, readable storage medium, and terminal device
CN110764907A (en) Cloud computing resource map construction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant