CN116244612B - HTTP traffic clustering method and device based on self-learning parameter measurement - Google Patents


Info

Publication number
CN116244612B
CN116244612B
Authority
CN
China
Prior art keywords
distance
data
attribute
flow data
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310532962.0A
Other languages
Chinese (zh)
Other versions
CN116244612A (en)
Inventor
赵新建
张颂
陈石
徐晨维
张明明
夏飞
袁国泉
宋浒
王鹏飞
汤铭
朱佳佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Jiangsu Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Jiangsu Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Jiangsu Electric Power Co Ltd, Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd filed Critical State Grid Jiangsu Electric Power Co Ltd
Priority to CN202310532962.0A priority Critical patent/CN116244612B/en
Publication of CN116244612A publication Critical patent/CN116244612A/en
Application granted granted Critical
Publication of CN116244612B publication Critical patent/CN116244612B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/02: Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/50: Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate


Abstract

The application discloses an HTTP traffic clustering method based on self-learning parameter measurement. Step 1: set a class label attribute in the message header attributes of an HTTP message traffic data set T, where the data set T comprises a plurality of pieces of message traffic data. Step 2: train the distance parameters in the distance metric function. Step 3: set the clustering thresholds, comprising a neighborhood density threshold and a radius parameter. Step 4: aggregate the traffic data set T based on the neighborhood density threshold and the radius parameter, the class label attribute L of each cluster of data obtained after aggregation being the same. The application clusters the traffic data with the DBSCAN algorithm based on the dist metric function, improving the clustering accuracy of HTTP traffic data.

Description

HTTP traffic clustering method and device based on self-learning parameter measurement
Technical Field
The application relates to the field of data mining, in particular to an HTTP traffic clustering method and device based on self-learning parameter measurement.
Background
In recent years, with the continuous development of internet technology, the rapid growth of network applications has created a series of challenges for traffic classification. The earliest traffic classification method assigned the network traffic of a specific port to the corresponding network application according to the port mapping table specified by the Internet Assigned Numbers Authority (IANA). However, with the increasing popularity of web applications, a large amount of HTTP message data accumulates in a network, and these messages correspond to different network access behaviors; clustering large volumes of HTTP messages is therefore significant for identifying network access behaviors, recommending services and monitoring access.
Clustering belongs to unsupervised learning: the data are analyzed without prior knowledge, and data of unknown state are divided into groups so that data in the same group have similar state characteristics while data in different groups have different state characteristics. The DBSCAN algorithm is a classical clustering method in the field of data mining; it can find class clusters of different shapes and is widely applied to clustering relational data. The algorithm uses a density-based notion of a cluster (or class cluster), requiring that the number of objects (points or other spatial objects) contained within a certain region of the clustering space be no less than a given threshold. The DBSCAN algorithm has the notable advantages of high clustering speed, effective handling of noise points (outliers), and discovery of spatial clusters of arbitrary shape; the idea and flow of the DBSCAN algorithm are shown in figure 2.
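The flow just described can be made concrete with a minimal DBSCAN in pure Python. This is an illustrative sketch of the classical algorithm, not code from the application; the function names and toy points are invented:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point,
    -1 for noise, 0..k-1 for cluster ids."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # eps-neighborhood of point i (includes i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        labels[i] = cluster            # i is a core sample: start a new cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # former noise becomes a border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is also a core sample: expand
                queue.extend(j_neighbors)
        cluster += 1
    return labels

# two dense blobs plus one far-away outlier
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (20, 20)]
labels = dbscan(pts, eps=0.5, min_pts=3)
```

With eps=0.5 and min_pts=3 the two blobs form separate clusters and the outlier is labeled noise, mirroring the density-threshold idea described above.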
Unlike data in a fixed relational schema, HTTP message data lack an explicit class label attribute, and whereas relational data are distributed relatively uniformly across attributes, the numerical distribution across the attributes of HTTP message data is relatively sparse, so common data distance measures fail. At the same time, a small amount of class-label prior knowledge, including class labels and data-point division constraints, can be obtained from part of the HTTP message data. How to use this small amount of prior knowledge to perform cluster analysis on a large amount of data without prior knowledge has become a very important issue.
In practical traffic clustering work, statistical analysis of the traffic data shows that, in some clustering tasks, some data in the data set contain class-label attributes capable of locating class clusters, such as an IP address or a user userid. Prior studies, however, ignore the effect of this information on the grouping of traffic data. Applying the DBSCAN algorithm directly to HTTP message data clustering suffers from inaccurate measurement of data distance (difference), difficult parameter setting for the clustering algorithm, and failure to fully use the available partial class-label information, so the clusters obtained have poor accuracy and can hardly support subsequent decision services.
Disclosure of Invention
The application aims to solve the problems described in the background art and provides an HTTP traffic clustering method and device based on self-learning parameter measurement. The application designs a metric function dist for measuring the difference between two pieces of HTTP message data, optimizes the parameters of the metric function dist by introducing the ideas of logistic regression and gradient descent to improve measurement accuracy, and finally adopts the DBSCAN clustering algorithm to cluster the traffic data based on the dist metric function, improving the precision of clustering HTTP traffic data.
In order to achieve the technical purpose, the application adopts the following technical scheme:
a self-learning parameter measurement-based HTTP traffic semi-supervised clustering method comprises the following steps:
step 1, giving an HTTP message flow data set T which comprises a plurality of message flowsThe method comprises the steps of measuring data, setting class label attributes in message header attributes of each message flow data to be L, wherein L is expressed in a discrete variable form, and the L is partially empty, and forming a training set by the data with the class label attributes L which are not empty
Step 2, defining a distance measurement function for measuring two message flow data,/>Is the gap between (1);
Step 3, use a logistic regression model and the training set T_train to train the distance parameters in the distance metric function;
Step 4, set the clustering thresholds, comprising the neighborhood density threshold MinPts and the radius parameter ε;
Step 5, based on the neighborhood density threshold MinPts and the radius parameter ε, aggregate the traffic data set T; the class label attribute L of each cluster of data obtained after aggregation is the same.
The step 1 comprises the following steps:
Step 1.1, an HTTP message traffic data set T is given, containing k pieces of message traffic data, expressed as follows:

T = {x_1, x_2, ..., x_k},  x_m = (x_m^1, x_m^2, ..., x_m^n, y_m),  m = 1, ..., k

where x_k^n is the n-th attribute value of the HTTP header attributes of the k-th piece of message traffic data in the data set, y_k is the attribute value of the class label attribute L, and k and n are natural numbers;
Step 1.2, define the prediction set T_pre as the set containing all message traffic data whose L attribute is null; the remaining message traffic data, whose L attribute is not null, form the training set T_train.
The step 2 comprises the following steps:
The distance metric function is defined as follows:

dist(x_i, x_j) = 1 / (1 + e^(-(w · d_ij + b))), where d_ij = (d_ij^1, d_ij^2, ..., d_ij^n) and d_ij^r = |x_i^r - x_j^r| / s_r

where x_i and x_j are any two pieces of data; r denotes the r-th attribute; d_ij^r is the difference of x_i and x_j on attribute r; s_r is the standard deviation of attribute r over the message header attributes of the HTTP message traffic data set T and μ_r is its mean, i.e.

μ_r = (1/k) Σ_{m=1}^{k} x_m^r,  s_r = sqrt( (1/k) Σ_{m=1}^{k} (x_m^r - μ_r)^2 )

x_m^r denotes the attribute value of the r-th attribute of the m-th piece of data; e denotes the natural constant; w = (w_1, w_2, ..., w_n) are the distance weights of the distance metric function dist, w_r being the distance weight of the r-th attribute; and b is the deviation of the distance metric function dist. The distance weights w of the distance metric function dist and the deviation b of the distance metric function dist are collectively referred to as the distance parameters.
The step 3 comprises the following steps:
Step 3.1, the logistic regression model is expressed as follows:

f(z) = 1 / (1 + e^(-(w · z + b)))

where w denotes the weights of the logistic regression model, shared with the distance weights of dist, z denotes an input difference vector built from the training set T_train as described in step 3.2, and b denotes the bias of the logistic regression model;
Step 3.2, process the training set T_train by performing a difference operation on every pair of data in T_train, with the formulas:

z_ij^r = |x_i^r - x_j^r| / s_r,  y_ij = 0 if y_i = y_j, and y_ij = 1 otherwise

where z_ij = (z_ij^1, ..., z_ij^n) is the normalized difference of the two pieces of data x_i and x_j, s_r denotes the standard deviation of the r-th attribute over the training set T_train, x_i^r and x_j^r denote the attribute values of the r-th attribute of x_i and x_j, and y_ij indicates whether the class label attributes of the two pieces of data x_i and x_j are the same;
Step 3.3, the loss function of the logistic regression model is expressed as the mean squared error (MSE), i.e. the following formula:

J(w, b) = (1/M^2) Σ_{i=1}^{M} Σ_{j=1}^{M} ( f(z_ij) - y_ij )^2

where J(w, b) denotes the loss function of the distance weights w, (w*, b*) = argmin J(w, b) denotes its minimum, and M denotes the number of data entries contained in the training set T_train.
The distance weights w of the distance metric function dist are solved by the gradient descent method: first the distance weights w are randomly initialized, and then the distance weights w are optimized by successive approximation, the optimization being achieved by the following formulas:

w_r := w_r - α · ∂J(w, b)/∂w_r,  b := b - α · ∂J(w, b)/∂b

where α denotes the learning rate of gradient descent, which controls the speed at which the distance weights w converge;
Through multiple training iterations the distance weights w converge, thereby completing the training of the distance weights w in the distance metric function dist.
The step 4 comprises the following steps:
Step 4.1, perform statistical analysis on the training set T_train, and take the average intra-class distance as the radius parameter ε of the clustering algorithm: the data whose class label attribute L is the same are grouped into one class, the pairwise distances within each class are computed, and the average distance is taken; that is, ε is expressed by the following formula:

ε = (1/C) Σ_{c=1}^{C} [ 2 / (n_c (n_c - 1)) · Σ_{x_i, x_j ∈ T_c, i < j} dist(x_i, x_j) ]
Step 4.2, take the average neighborhood density or the minimum neighborhood density as the neighborhood density threshold MinPts of the clustering algorithm. The average neighborhood density is the average intra-cluster density over the various class clusters; when the average neighborhood density is used as the neighborhood density threshold MinPts of the clustering algorithm, MinPts is expressed by the following formula:

MinPts = ( 1 / Σ_{c=1}^{C} n_c ) · Σ_{c=1}^{C} Σ_{x_i ∈ T_c} Σ_{x_j ∈ T_c} f( dist(x_i, x_j) )

When the minimum neighborhood density is used as the neighborhood density threshold MinPts of the clustering algorithm, MinPts is expressed by the following formula:

MinPts = min_{1 ≤ c ≤ C} min_{x_i ∈ T_c} Σ_{x_j ∈ T_c} f( dist(x_i, x_j) )

In the above formulas, the training set T_train contains C class clusters in total, T_c denotes the c-th class cluster, and n_c denotes the size of the c-th class cluster, where c, i, j are natural numbers and x_i, x_j are the i-th and j-th data in cluster T_c; the function f is a piecewise function expressed by the following formula:

f(d) = 1, if d ≤ ε;  f(d) = 0, if d > ε.
the step 5 comprises the following steps:
Step 5.1, select one piece of traffic data x from the data set T, and find all traffic data whose distance to x is less than or equal to ε. If the number of traffic data within distance ε of x is less than MinPts, x is marked as noise; if it is not less than MinPts, x is marked as a core sample and assigned a new cluster label;
Step 5.2, visit all neighbours of x within distance ε; if a neighbour has not yet been assigned a cluster, assign it the new cluster label from step 5.1, and if it is a core sample, visit its neighbours in turn, and so on, so that the cluster gradually grows until there are no more core samples within distance ε of the cluster;
Step 5.3, select another piece of traffic data that has not yet been visited, and repeat steps 5.1 and 5.2;
Step 5.4, finally, all class clusters of the data set T are obtained after clustering. The traffic data within each class cluster are randomly assigned a class label that is identical within the cluster and distinct from the labels of the other class clusters.
The application also provides an HTTP traffic semi-supervised clustering device based on self-learning parameter measurement, comprising an HTTP message traffic data set module, a distance measurement module, a self-learning training module, a clustering threshold module and an aggregation module, connected in sequence by signals. The HTTP message traffic data set module is used for giving the HTTP message traffic data set T, which contains a plurality of pieces of message traffic data; the class label attribute in the message header attributes of each piece is set to L, L is expressed as a discrete variable and is partially empty, and the data whose class label attribute L is not empty form the training set T_train. The distance measurement module is used for measuring the gap between two pieces of message traffic data x_i and x_j; the self-learning training module is used for training the distance parameters in the distance metric function with a logistic regression model and the training set T_train; the clustering threshold module is used for setting the clustering thresholds; and the aggregation module is used for aggregating the traffic data set T based on the clustering thresholds, the class label attribute L of each cluster of data obtained after aggregation being the same.
The clustering threshold module comprises a neighborhood density threshold module and a radius parameter module; the neighborhood density threshold module is used for calculating the neighborhood density threshold MinPts, and the radius parameter module is used for calculating the radius parameter ε.
A computer readable storage medium storing a computer program which, when executed by a processor, performs the method steps described above.
An electronic device comprising a processor and a memory, said memory storing a computer program which, when executed by said processor, implements the method steps described above.
Compared with the prior art, the application has the following beneficial effects:
the application aims to solve the problems in the background art and provides an HTTP traffic clustering method and device based on self-learning parameter measurement. The application designs a distance measurement function dist for measuring the difference between two HTTP message data, introduces a semi-supervised learning idea, designs a distance parameter training method based on a logistic regression model, trains distance weights by using a small amount of marked data in a data set, and improves the accuracy of data measurement; designing a marked data clustering parameter calculation mechanism, calculating the average inter-class distance and neighborhood density of marked data clusters by using a distance measurement function as clustering parameters of a DBSCAN algorithm, and improving the adaptability and clustering precision of the clustering algorithm to HTTP traffic data sets.
Drawings
FIG. 1 is a flow chart of the present application;
FIG. 2 is a schematic and flow chart of a DBSCAN algorithm;
FIG. 3 is a visualization of the training set T_train and prediction set T_pre data in an embodiment;
FIG. 4 is a visualization of the clustering result in an embodiment.
Detailed Description
Embodiments of the present application are described in further detail below with reference to the accompanying drawings.
It should be noted that terms such as "upper", "lower", "left", "right", "front" and "rear" are used for descriptive purposes only and are not intended to limit the scope in which the application may be practiced; the relative relationships they denote may be altered or adjusted without materially changing the technical content of the application.
As shown in FIG. 1, the HTTP traffic semi-supervised clustering method based on self-learning parameter measurement provided by the application comprises the following steps:
step 1, given an HTTP message flow data set T, HTTP messageThe message flow data set T comprises a plurality of pieces of message flow data, the class label attribute in the message header attribute of each piece of message flow data is set to be L, L is expressed in the form of discrete variables, and the L is partially empty, and the data with the class label attribute L not being empty form a training set
Step 1.1, an HTTP message flow data set T is given, wherein the data set T contains k pieces of message flow data and is expressed as follows:
wherein ,an nth attribute value of HTTP header attribute of kth message flow data in the data set,/->Is the attribute value of the class label attribute L, wherein, < ->,/>All are natural numbers;
Step 1.2, define the prediction set T_pre as the set containing all message traffic data whose L attribute is null; the remaining message traffic data, whose L attribute is not null, form the training set T_train.
Step 2, a distance metric function dist is defined for measuring the gap between two pieces of message traffic data x_i and x_j;
the specific contents are as follows:
The distance metric function is defined as follows:

dist(x_i, x_j) = 1 / (1 + e^(-(w · d_ij + b))), where d_ij = (d_ij^1, d_ij^2, ..., d_ij^n) and d_ij^r = |x_i^r - x_j^r| / s_r

where x_i and x_j are any two pieces of data; r denotes the r-th attribute; d_ij^r is the difference of x_i and x_j on attribute r; s_r is the standard deviation of attribute r over the message header attributes of the HTTP message traffic data set T and μ_r is its mean, i.e.

μ_r = (1/k) Σ_{m=1}^{k} x_m^r,  s_r = sqrt( (1/k) Σ_{m=1}^{k} (x_m^r - μ_r)^2 )

x_m^r denotes the attribute value of the r-th attribute of the m-th piece of data; e denotes the natural constant; w = (w_1, w_2, ..., w_n) are the distance weights of the distance metric function dist, w_r being the distance weight of the r-th attribute; and b is the deviation of the distance metric function dist. The distance weights w of the distance metric function dist and the deviation b of the distance metric function dist are collectively referred to as the distance parameters.
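As a sketch, the metric can be written out directly, assuming the logistic form dist = 1/(1 + e^(-(w·d + b))) over standard-deviation-normalized attribute gaps; the function names and example weights below are illustrative, not taken from the application:

```python
import math

def normalized_gaps(x_i, x_j, stds):
    """Per-attribute gap |x_i^r - x_j^r| scaled by that attribute's
    standard deviation s_r."""
    return [abs(a - b) / s for a, b, s in zip(x_i, x_j, stds)]

def dist(x_i, x_j, w, b, stds):
    """Learned metric: logistic squash of the weighted normalized gaps,
    so the distance always falls strictly between 0 and 1."""
    d = normalized_gaps(x_i, x_j, stds)
    z = sum(wr * dr for wr, dr in zip(w, d)) + b
    return 1.0 / (1.0 + math.exp(-z))

# with a negative deviation b, identical points get a near-zero distance
d_near = dist((0.0, 0.0), (0.0, 0.0), w=[1.0, 1.0], b=-4.0, stds=[1.0, 1.0])
d_far = dist((0.0, 0.0), (3.0, 3.0), w=[1.0, 1.0], b=-4.0, stds=[1.0, 1.0])
```

The strict 0-to-1 range of this form is consistent with the distance matrix shown in Table 4 of the embodiment.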
Step 3, a logistic regression model and the training set T_train are used to train the distance parameters in the distance metric function;
Step 3.1, the logistic regression model is expressed as follows:

f(z) = 1 / (1 + e^(-(w · z + b)))

where w denotes the weights of the logistic regression model, shared with the distance weights of dist, z denotes an input difference vector built from the training set T_train as described in step 3.2, and b denotes the bias of the logistic regression model;
Step 3.2, process the training set T_train by performing a difference operation on every pair of data in T_train, with the formulas:

z_ij^r = |x_i^r - x_j^r| / s_r,  y_ij = 0 if y_i = y_j, and y_ij = 1 otherwise

where z_ij = (z_ij^1, ..., z_ij^n) is the normalized difference of the two pieces of data x_i and x_j, s_r denotes the standard deviation of the r-th attribute over the training set T_train, x_i^r and x_j^r denote the attribute values of the r-th attribute of x_i and x_j, and y_ij indicates whether the class label attributes of the two pieces of data x_i and x_j are the same;
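The pairwise difference operation of step 3.2 can be sketched as follows; this is an illustrative reading in which each labeled pair yields a normalized gap vector plus a 0/1 target recording whether the two class labels differ (the names are invented):

```python
import statistics

def difference_pairs(data, labels):
    """Step-3.2 sketch: for every pair (i, j) of labeled rows, emit the
    per-attribute gaps |x_i^r - x_j^r| / s_r together with a target that
    is 0 for a same-class pair and 1 for a different-class pair."""
    n_attr = len(data[0])
    stds = [statistics.pstdev(row[r] for row in data) for r in range(n_attr)]
    pairs = []
    for i in range(len(data)):
        for j in range(i):                     # lower triangle, as in Table 3
            z = [abs(data[i][r] - data[j][r]) / stds[r] for r in range(n_attr)]
            t = 0 if labels[i] == labels[j] else 1
            pairs.append((z, t))
    return pairs

# four labeled points, two per class
pairs = difference_pairs([(0, 0), (2, 0), (0, 2), (2, 2)], [0, 0, 1, 1])
```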
Step 3.3, the loss function of the logistic regression model is expressed as the mean squared error (MSE), i.e. the following formula:

J(w, b) = (1/M^2) Σ_{i=1}^{M} Σ_{j=1}^{M} ( f(z_ij) - y_ij )^2

where J(w, b) denotes the loss function of the distance weights w, (w*, b*) = argmin J(w, b) denotes its minimum, and M denotes the number of data entries contained in the training set T_train.
The distance weights w of the distance metric function dist are solved by the gradient descent method: first the distance weights w are randomly initialized, and then the distance weights w are optimized by successive approximation, the optimization being achieved by the following formulas:

w_r := w_r - α · ∂J(w, b)/∂w_r,  b := b - α · ∂J(w, b)/∂b

where α denotes the learning rate of gradient descent, which controls the speed at which the distance weights w converge;
Through multiple training iterations the distance weights w converge, thereby completing the training of the distance weights w in the distance metric function dist.
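The weight training of step 3 can be sketched end to end as below: an MSE loss over all labeled pairs, with target 0 for same-class and 1 for different-class pairs, minimized by plain gradient descent. This is an illustrative reading with invented names and hyperparameters, not the application's reference implementation:

```python
import itertools
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_distance_weights(data, labels, stds, lr=0.5, epochs=200, seed=0):
    """Fit the distance weights w and deviation b of dist by gradient
    descent on the pairwise MSE loss."""
    rng = random.Random(seed)
    n = len(data[0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(n)]   # random initialization
    b = 0.0
    pairs = list(itertools.combinations(range(len(data)), 2))
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for i, j in pairs:
            d = [abs(a - c) / s for a, c, s in zip(data[i], data[j], stds)]
            p = sigmoid(sum(wr * dr for wr, dr in zip(w, d)) + b)
            t = 0.0 if labels[i] == labels[j] else 1.0
            g = 2.0 * (p - t) * p * (1.0 - p)        # d(MSE term)/dz
            for r in range(n):
                grad_w[r] += g * d[r]
            grad_b += g
        m = len(pairs)
        w = [wr - lr * gr / m for wr, gr in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b

# two tight same-class pairs, far from each other
data = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.2, 5.1)]
class_labels = [0, 0, 1, 1]
w, b = train_distance_weights(data, class_labels, stds=(1.0, 1.0))
d_same = sigmoid(w[0] * 0.2 + w[1] * 0.1 + b)   # first same-class pair
d_diff = sigmoid(w[0] * 5.0 + w[1] * 5.0 + b)   # a cross-class pair
```

After training, same-class pairs come out measurably closer than cross-class pairs, which is what the subsequent clustering step relies on.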
Step 4, the clustering thresholds are set, comprising the neighborhood density threshold MinPts and the radius parameter ε.
Step 4.1 training setStatistical analysis is performed, and the average inter-class distance is used as the radius parameter of the clustering algorithm>The data with the average inter-class distance L and the same attribute are classified into one class, the data distance of the same class is calculated, and the distance average value is calculated, namely +.>Is expressed by the following formula:
Step 4.2, take the average neighborhood density or the minimum neighborhood density as the neighborhood density threshold MinPts of the clustering algorithm. The average neighborhood density is the average intra-cluster density over the various class clusters; when the average neighborhood density is used as the neighborhood density threshold MinPts of the clustering algorithm, MinPts is expressed by the following formula:

MinPts = ( 1 / Σ_{c=1}^{C} n_c ) · Σ_{c=1}^{C} Σ_{x_i ∈ T_c} Σ_{x_j ∈ T_c} f( dist(x_i, x_j) )

When the minimum neighborhood density is used as the neighborhood density threshold MinPts of the clustering algorithm, MinPts is expressed by the following formula:

MinPts = min_{1 ≤ c ≤ C} min_{x_i ∈ T_c} Σ_{x_j ∈ T_c} f( dist(x_i, x_j) )

In the above formulas, the training set T_train contains C class clusters in total, T_c denotes the c-th class cluster, and n_c denotes the size of the c-th class cluster, where c, i, j are natural numbers and x_i, x_j are the i-th and j-th data in cluster T_c; the function f is a piecewise function expressed by the following formula:

f(d) = 1, if d ≤ ε;  f(d) = 0, if d > ε.
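Steps 4.1 and 4.2 can be sketched as follows, here with the average-neighborhood-density variant of MinPts; the helper names are invented, and any metric (including a trained dist) can be passed in:

```python
def cluster_parameters(data, labels, metric):
    """Estimate DBSCAN parameters from the labeled subset (a sketch):
    eps    = mean pairwise distance within each class, averaged over classes;
    MinPts = average number of same-class points within eps of a point
             (counting the point itself).
    Each class is assumed to contain at least two points."""
    classes = sorted(set(labels))
    per_class_means = []
    for c in classes:
        idx = [i for i, lab in enumerate(labels) if lab == c]
        dists = [metric(data[i], data[j])
                 for k, i in enumerate(idx) for j in idx[k + 1:]]
        per_class_means.append(sum(dists) / len(dists))
    eps = sum(per_class_means) / len(per_class_means)
    densities = []
    for c in classes:
        idx = [i for i, lab in enumerate(labels) if lab == c]
        for i in idx:
            densities.append(sum(1 for j in idx
                                 if metric(data[i], data[j]) <= eps))
    min_pts = sum(densities) / len(densities)
    return eps, min_pts

# toy labeled set with a Manhattan metric standing in for dist
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
eps, min_pts = cluster_parameters(data, labels, manhattan)
```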
step 5, based on neighborhood density thresholdAnd radius parameter->The stream data set T is aggregated, and the class label attribute L of each cluster of data obtained after aggregation is the same.
Step 5.1 selecting one flow data from the data set T,/>Then find distance +.>Less than or equal to->If the distance is +.>Is at +.>The amount of flow data within is less thanThen->Marked as noise if the distance is +.>The flow data quantity in the system is more than +.>Then/>Marked as core samples and assigned a new cluster;
Step 5.2, visit all neighbours of x within distance ε; if a neighbour has not yet been assigned a cluster, assign it the new cluster label from step 5.1, and if it is a core sample, visit its neighbours in turn, and so on, so that the cluster gradually grows until there are no more core samples within distance ε of the cluster;
Step 5.3, select another piece of traffic data that has not yet been visited, and repeat steps 5.1 and 5.2;
Step 5.4, finally, all class clusters of the data set T are obtained after clustering. The traffic data within each class cluster are randomly assigned a class label that is identical within the cluster and distinct from the labels of the other class clusters.
The application also provides a self-learning parameter measurement-based HTTP traffic semi-supervised clustering device, which comprises an HTTP message traffic data set module, a distance measurement module, a self-learning training module, a clustering threshold module and an aggregation module, wherein the HTTP message traffic data set module, the distance measurement module, the self-learning training module, the clustering threshold module and the aggregation module are connected by signals in sequence,
the HTTP message flow data set module is used for giving an HTTP message flow data set T and HTTP messagesThe flow data set T comprises a plurality of pieces of message flow data, the class label attribute in the message header attribute of each piece of message flow data is set to be L, L is expressed in a discrete variable form, and the L is partially empty, and the data with the class label attribute L not being empty form a training set
The distance measurement module is used for measuring the gap between two pieces of message traffic data x_i and x_j;
the self-learning training module is used for training a set by using a logistic regression modelTraining distance parameters in a distance measurement function;
the clustering threshold module is used for setting a clustering threshold;
the aggregation module is used for aggregating the flow data set T based on the clustering threshold value, and the class label attribute L of each cluster of data obtained after aggregation is the same.
The clustering threshold module comprises a neighborhood density threshold module and a radius parameter module; the neighborhood density threshold module is used for calculating the neighborhood density threshold MinPts, and the radius parameter module is used for calculating the radius parameter ε.
A computer readable storage medium storing a computer program which, when executed by a processor, performs the method steps described above.
An electronic device comprising a processor and a memory, said memory storing a computer program which, when executed by said processor, implements the method steps described above.
To illustrate the technical scheme of the application more clearly, an example of the HTTP traffic semi-supervised clustering method based on self-learning parameter measurement is provided. First, the HTTP message traffic data set T is divided into two tables: the training set T_train is Table 1 and the prediction set T_pre is Table 2; a visualization of the T_train and T_pre data is shown in FIG. 3.
TABLE 1
X1 X2 Y X1 X2 Y X1 X2 Y
2.24 2.78 1 -3.81 0.08 2 1.18 -4.55 3
1.95 3.69 1 -4.17 -0.68 2 -0.2 -4.72 3
2.04 1.94 1 -4.32 -1.47 2 -0.2 -4.2 3
1.22 2.61 1 -4.97 -0.4 2 -0.28 -3.75 3
-0.28 2.52 1 -6.83 -0.77 2 -0.78 -2.4 3
1.75 3.32 1 -5.29 0.17 2 -0.57 -3.78 3
1.9 2.92 1 -3.9 -0.8 2 0.67 -3.25 3
3.28 2.17 1 -3.92 -0.78 2 -1.19 -5.31 3
2.41 0.6 1 -4.71 -0.29 2 -0.4 -3.34 3
0.53 1.98 1 -3.4 0.81 2 0.8 -1.97 3
0.76 2.6 1 -4.36 -0.7 2 0.16 -4.35 3
1.37 2.96 1 -5.64 0.01 2 -1.05 -2.77 3
2.24 1.57 1 -6.16 -0.07 2 0.3 -4.23 3
3.5 2.63 1 -4.97 1.43 2 0.78 -4.34 3
3.48 2.51 1 -2.39 0.88 2 1.24 -4.76 3
TABLE 2
X1 X2 Y X1 X2 Y X1 X2 Y
1.98 2.3 1 -3.51 -1.22 2 0.81 -2.98 3
2.88 3.23 1 -4.63 0.48 2 -0.57 -3.56 3
2.38 1.41 1 -3.87 -0.78 2 -1 -4.84 3
0.9 2.84 1 -4.43 0.09 2 0.52 -3.54 3
0.58 2.36 1 -1.99 2.31 2 0.43 -2.33 3
2.52 0.91 1 -3.42 -0.46 2 -0.12 -3.05 3
1.88 -0.34 1 -3.82 1.28 2 0.38 -2.73 3
3.01 1.03 1 -3.22 -0.03 2 -1.31 -2.97 3
2.51 2.03 1 -3.85 -1.37 2 0.2 -6.63 3
1.9 1.84 1 -3.78 0.71 2 0.43 -5.44 3
1.32 5.36 1 -5.17 0.23 2 -0.14 -2.53 3
0.71 1.57 1 -3.84 0.82 2 0.69 -3.7 3
2.68 0.3 1 -4.91 -1 2 0.48 -4 3
0.38 3.07 1 -3.33 -0.8 2 -1.23 -3.55 3
1.35 2.39 1 -4.81 -0.17 2 -0.02 -4.43 3
The data fall into three classes, each piece consisting of two attributes X1 and X2 and a class label attribute Y.

The distance weights w of dist are first initialized randomly.

By the operation shown in step 3, the training set T_train is first processed: the difference operation is performed on every pair of data in T_train (owing to space limits, only the leading 5 × 5 submatrix is shown), as shown in Table 3 below.
TABLE 3 Table 3
(0.0; 0.0) (0.0; 0.0) (0.0; 0.0) (0.0; 0.0) (0.0; 0.0) ……
(0.14; 0.13) (0.0; 0.0) (0.0; 0.0) (0.0; 0.0) (0.0; 0.0) ……
(0.06; 0.13) (0.08; 0.26) (0.0; 0.0) (0.0; 0.0) (0.0; 0.0) ……
(0.17; 0.08) (0.32; 0.06) (0.24; 0.2) (0.0; 0.0) (0.0; 0.0) ……
(0.22; 0.01) (0.37; 0.12) (0.29; 0.13) (0.05; 0.07) (0.0; 0.0) ……
…… …… …… …… …… ……
The training set T_train is used to optimize the distance weights w; through multiple training iterations, better distance weights are obtained, and the trained metric function dist is thereby obtained.
Then, by the operation shown in step 4, the neighborhood density threshold MinPts and the radius parameter ε are computed from the training set T_train, giving the clustering thresholds.
Finally, the data set T is clustered through step 5, with the following results. The pairwise distances over the data set T are first computed, giving a 90 × 90 distance matrix, as shown in Table 4 below.
TABLE 4 Table 4
0.00 0.31 0.23 0.27 …… 0.98 0.99 1.00 0.99
0.31 0.00 0.42 0.39 …… 1.00 1.00 1.00 1.00
0.23 0.42 0.00 0.52 …… 0.97 0.98 0.99 0.99
0.27 0.39 0.52 0.00 …… 0.97 0.98 0.99 0.99
0.23 0.55 0.47 0.16 …… 0.96 0.97 0.99 0.98
0.34 0.50 0.15 0.65 …… 0.96 0.97 0.99 0.99
0.53 0.83 0.41 0.78 …… 0.83 0.88 0.95 0.94
0.40 0.43 0.18 0.71 …… 0.97 0.98 0.99 0.99
…… …… …… …… …… …… …… …… ……
1.00 1.00 1.00 1.00 …… 0.66 0.56 0.82 0.45
1.00 1.00 0.99 1.00 …… 0.36 0.27 0.65 0.25
0.97 0.99 0.96 0.96 …… 0.34 0.37 0.36 0.37
0.98 1.00 0.97 0.97 …… 0.00 0.13 0.33 0.24
0.99 1.00 0.98 0.98 …… 0.13 0.00 0.35 0.17
1.00 1.00 0.99 0.99 …… 0.33 0.35 0.00 0.35
0.99 1.00 0.99 0.99 …… 0.24 0.17 0.35 0.00
Clustering this distance matrix yields the results shown in Table 5 below; a visualization of the clustering result is shown in Fig. 4 (square points are noise points).
TABLE 5
X1 X2 Y X1 X2 Y X1 X2 Y
2.24 2.78 1 -3.81 0.08 0 1.18 -4.55 2
1.95 3.69 1 -4.17 -0.68 0 -0.2 -4.72 2
2.04 1.94 1 -4.32 -1.47 0 -0.2 -4.2 2
1.22 2.61 1 -4.97 -0.4 0 -0.28 -3.75 2
-0.28 2.52 1 -6.83 -0.77 0 -0.78 -2.4 2
1.75 3.32 1 -5.29 0.17 0 -0.57 -3.78 2
1.9 2.92 1 -3.9 -0.8 0 0.67 -3.25 2
3.28 2.17 1 -3.92 -0.78 0 -1.19 -5.31 2
2.41 0.6 1 -4.71 -0.29 0 -0.4 -3.34 2
0.53 1.98 1 -3.4 0.81 0 0.8 -1.97 2
0.76 2.6 1 -4.36 -0.7 0 0.16 -4.35 2
1.37 2.96 1 -5.64 0.01 0 -1.05 -2.77 2
2.24 1.57 1 -6.16 -0.07 0 0.3 -4.23 2
3.5 2.63 1 -4.97 1.43 0 0.78 -4.34 2
3.48 2.51 1 -2.39 0.88 0 1.24 -4.76 2
1.98 2.3 1 -3.51 -1.22 0 0.81 -2.98 2
2.88 3.23 1 -4.63 0.48 0 -0.57 -3.56 2
2.38 1.41 1 -3.87 -0.78 0 -1 -4.84 2
0.9 2.84 1 -4.43 0.09 0 0.52 -3.54 2
0.58 2.36 1 -1.99 2.31 -1 0.43 -2.33 2
2.52 0.91 1 -3.42 -0.46 0 -0.12 -3.05 2
1.88 -0.34 1 -3.82 1.28 0 0.38 -2.73 2
3.01 1.03 1 -3.22 -0.03 0 -1.31 -2.97 2
2.51 2.03 1 -3.85 -1.37 0 0.2 -6.63 2
1.9 1.84 1 -3.78 0.71 0 0.43 -5.44 2
1.32 5.36 -1 -5.17 0.23 0 -0.14 -2.53 2
0.71 1.57 1 -3.84 0.82 0 0.69 -3.7 2
2.68 0.3 1 -4.91 -1 0 0.48 -4 2
0.38 3.07 1 -3.33 -0.8 0 -1.23 -3.55 2
1.35 2.39 1 -4.81 -0.17 0 -0.02 -4.43 2
In the Y label, -1 denotes a noise point, while 0-2 denote the different clusters.
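For intuition, the three-clusters-plus-noise layout of Table 5 can be reproduced on a toy 2-D data set with a simplified density pass. This is an illustrative sketch only: the coordinates, eps = 2.5 and pts_min = 1 are invented for the example, and plain Euclidean distance stands in for the learned metric dist (with pts_min = 1, expanding from every reached point coincides with the core-sample expansion of step 5).

```python
import math

def euclid(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

points = [(2.2, 2.8), (1.9, 3.7), (2.0, 1.9),       # blob around (2, 2.5)
          (-3.8, 0.1), (-4.2, -0.7), (-4.3, -1.5),  # blob around (-4, -0.5)
          (1.2, -4.6), (-0.2, -4.7), (-0.2, -4.2),  # blob around (0, -4.5)
          (9.0, 9.0)]                               # isolated point -> noise

eps, pts_min = 2.5, 1
labels = [-1] * len(points)   # -1 = unassigned / noise, as in Table 5
cluster = 0
for i, p in enumerate(points):
    if labels[i] != -1:
        continue
    neigh = [j for j, q in enumerate(points) if j != i and euclid(p, q) <= eps]
    if len(neigh) < pts_min:
        continue              # stays -1, like the Y = -1 rows
    labels[i] = cluster       # start a new cluster from this core point
    stack = list(neigh)
    while stack:              # grow the cluster through the eps-neighbourhood
        j = stack.pop()
        if labels[j] != -1:
            continue
        labels[j] = cluster
        stack += [k for k, q in enumerate(points)
                  if labels[k] == -1 and euclid(points[j], q) <= eps]
    cluster += 1
```

The resulting labels list assigns 0-2 to the three blobs and leaves the isolated point at -1, mirroring the Y column of Table 5.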
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the application can be realized in various computer languages, such as the object-oriented programming language Java and the scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (9)

1. An HTTP traffic semi-supervised clustering method based on self-learning parameter measurement, characterized by comprising the following steps:
step 1, giving an HTTP message flow data set T which contains a plurality of message flow data, wherein the class label attribute in the header attribute of each message flow data set is set to be L, L is expressed in the form of discrete variables, part of L is empty, and the data with the class label attribute L not being empty form a training set T Train
Step 2, defining a distance metric function dist for measuring the gap between two pieces of message flow data x_i, x_j;
step 3, using a logistic regression model in the training set T Train Training distance parameters in a distance measurement function;
step 4, setting a clustering threshold, wherein the clustering threshold comprises a neighborhood density threshold Pts min And radius parameter Eps, for training set T Train Carrying out statistical analysis, taking the average inter-class distance as a radius parameter Eps of a clustering algorithm, classifying the data with the same L attribute as one class, calculating the data distance of the same class and calculating a distance average value;
step 5, based on the neighborhood density threshold Pts min And the radius parameter Eps is aggregated with the flow data set T, and the class label attribute L of each cluster of data obtained after aggregation is the same.
2. The HTTP traffic semi-supervised clustering method based on self-learning parameter metrics of claim 1, wherein the step 1 includes:
step 1.1, an HTTP message flow data set T is given, wherein the data set T contains k pieces of message flow data and is expressed as follows:
T={x 1 ,x 2 ,…,x k },
wherein ,an nth attribute value, y, of an HTTP message header attribute of kth message flow data in the dataset k Is the attribute value of the class label attribute L, wherein n and k are natural numbers;
step 1.2 define a prediction set T Pred For the message flow data containing all the L attributes being null, the rest message flow data not being null form a training set T Train
3. The HTTP traffic semi-supervised clustering method based on self-learning parameter metrics according to claim 2, wherein the step 2 includes:
defining a distance metric function as follows:

dist(x_i, x_j) = 1 / (1 + e^(−z)), z = w · x_ij + b,

wherein

x_ij^(r) = |x_i^(r) − x_j^(r)| / s^(r), s^(r) = sqrt( (1/k) · Σ_{m=1..k} ( x_m^(r) − x̄^(r) )² ),

wherein x_i, x_j are any two pieces of data, r denotes the rth attribute, |x_i^(r) − x_j^(r)| is the difference between the rth attributes of x_i and x_j, s^(r) is the standard deviation of attribute r over the message header attributes of the HTTP message flow data set T, x̄^(r) is the mean of attribute r over the message header attributes of T, m ∈ [1, k], x_m^(r) is the attribute value of the rth attribute of the mth piece of data, z denotes the argument of the sigmoid, w = <w_1, w_2, ..., w_r, ..., w_n> is the distance weight of the distance metric function dist, w_r is the distance weight of the rth attribute, b is the bias of the distance metric function dist, and the distance weight w and the bias b are collectively called the distance parameters.
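As a non-normative illustration of the metric in this claim, the following Python sketch evaluates dist from a weight vector w, bias b and per-attribute standard deviations. The helper names pair_diff and dist, and the explicit sigmoid form, are assumptions drawn from the surrounding text rather than from the claim itself.

```python
import math

def pair_diff(xi, xj, std):
    # Standardized absolute difference per attribute: |xi_r - xj_r| / s_r
    return [abs(a - b) / s for a, b, s in zip(xi, xj, std)]

def dist(xi, xj, w, b, std):
    # Learned metric: sigmoid of the weighted standardized difference plus
    # bias, so the output is a dissimilarity score in (0, 1).
    z = sum(wr * d for wr, d in zip(w, pair_diff(xi, xj, std))) + b
    return 1.0 / (1.0 + math.exp(-z))
```

With a negative bias b, identical inputs score below 0.5, and larger standardized differences push the score toward 1.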
4. The HTTP traffic semi-supervised clustering method based on self-learning parameter metrics according to claim 3, wherein the step 3 includes:
step 3.1 logistic regression model is expressed as follows:
y=sigmod(w′x k +b′),
wherein W' represents the weight of the logistic regression model, x k Representing training set T Train The k-th flow, b' represents the deviation of the logistic regression model;
step 3.2 processing training set T Train Will T Train The difference operation is carried out on any data in the data, and the formula is as follows:
wherein ,xij Represents x i ,x j Two data are subjected to difference operation and standardization, s (n) Representing training set T Train Standard deviation of the nth attribute,represents x i Attribute value of the nth attribute, +.>Represents x j Attribute value, y of the nth attribute of (2) ij Represents x i ,x j Whether the class label attributes of the two pieces of data are the same;
step 3.3 represents the loss function of the logistic regression model by averaging the square error sum MSE, i.e. the following formula:
wherein J (w) represents a loss function of the distance weight w, argmin (w,b) Minimum value for w, b, l represents training set T Train The number of data entries to be included in the data stream,
solving the distance weight w of the distance measurement function dist of the formula by a gradient descent method, randomly initializing w, optimizing the distance weight w by an approximation mode, and realizing the optimization by the following formula:
wherein α represents a learning rate of gradient descent, which controls a rate of convergence of the distance weight w;
and after multiple training iterations, the distance weight w converges, and further training of the distance weight w in the distance measurement function dist is completed.
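The gradient-descent training of step 3.3 can be sketched as below. This is a minimal illustration under stated assumptions: an MSE loss over labelled difference vectors, with y_ij encoded as 0 for same-class pairs and 1 for different-class pairs; the function name train_metric and the hyperparameter values are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_metric(pairs, labels, alpha=0.5, iters=2000):
    # Gradient descent on the MSE loss of step 3.3.
    # pairs  : standardized absolute-difference vectors x_ij
    # labels : y_ij = 0 if the two messages share a class label, 1 otherwise
    n = len(pairs[0])
    w, b = [0.0] * n, 0.0
    for _ in range(iters):
        gw, gb = [0.0] * n, 0.0
        for x, y in zip(pairs, labels):
            p = sigmoid(sum(wr * xr for wr, xr in zip(w, x)) + b)
            g = 2.0 * (p - y) * p * (1.0 - p)   # d/dz of (p - y)^2
            gw = [gi + g * xr for gi, xr in zip(gw, x)]
            gb += g
        w = [wr - alpha * gi / len(pairs) for wr, gi in zip(w, gw)]
        b -= alpha * gb / len(pairs)
    return w, b
```

On a toy set where small differences mean "same class" and large ones mean "different class", the learned weights become positive, so larger differences map to larger distances.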
5. The HTTP traffic semi-supervised clustering method based on self-learning parameter metrics according to claim 4, wherein the step 4 includes:
step 4.1 training set T Train Carrying out statistical analysis, taking the average inter-class distance as a radius parameter Eps of a clustering algorithm, classifying the data with the same L attribute as one class, calculating the data distance of the same class, and calculating a distance average value, wherein Eps is expressed by the following formula:
step 4.2, counting average neighborhood density or minimum neighborhood density as a neighborhood density threshold Pts of a clustering algorithm min The average neighborhood density is the average intra-cluster density in various clusters, and when the average neighborhood density is used as a neighborhood density threshold Pts of a clustering algorithm min Pts when min Is expressed by the following formula:
when the minimum neighborhood density is used as the neighborhood density threshold of the clustering algorithm, the neighborhood density threshold Pts of the clustering algorithm is used as the neighborhood density threshold Pts of the clustering algorithm min Pts when min Is expressed by the following formula:
in the above formula, training set T Train Q clusters of class C h Represents the h class cluster in T, ||C h I represents the size of the h class cluster, where h ε [1, q]Q is a natural number, x' i ,x′ j Is cluster C h The i, j data in (a); function f (dist (x' i ,x′ j ) Eps) is a piecewise function, expressed by the following formula:
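Steps 4.1 and 4.2 can be sketched as follows, assuming Eps is the mean intra-class pairwise distance averaged over the classes and Pts_min the average number of same-class neighbours within Eps; the helper name cluster_thresholds and the scalar toy metric in the usage note are illustrative, not taken from the claims.

```python
def cluster_thresholds(groups, dist):
    # groups: list of classes, each a list of samples with the same L label
    # dist  : learned metric function of two samples
    # Eps: mean pairwise distance inside each class, averaged over the classes
    intra = []
    for g in groups:
        ds = [dist(a, b) for i, a in enumerate(g) for b in g[i + 1:]]
        if ds:
            intra.append(sum(ds) / len(ds))
    eps = sum(intra) / len(intra)
    # Pts_min: average number of same-class neighbours within Eps
    counts = []
    for g in groups:
        for i, a in enumerate(g):
            counts.append(sum(1 for j, b in enumerate(g)
                              if j != i and dist(a, b) <= eps))
    pts_min = sum(counts) / len(counts)
    return eps, pts_min
```

For example, with two tight scalar classes and absolute difference as the metric, Eps comes out as the average intra-class gap and Pts_min as the mean neighbour count.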
6. the HTTP traffic semi-supervised clustering method based on self-learning parameter metrics according to claim 5, wherein the step 5 includes:
step 5.1 selecting one flow data x from the data set T i ,i∈[1,k]Then find the distance x i All flow data less than or equal to Eps, if distance x i The number of flow data within the Eps is smaller than Pts min Then x i Marked as noise if the amount of flow data within Eps is greater than Pts min X is then i Marked as core samples and assigned a new cluster;
step 5.2 Access x i All neighbors within a distance Eps, if they have not been assigned a cluster, then assign the new cluster label in step 5.1 to them, if they are core samples, access their neighbors in turn, and so on, cluster gradually increasing until there are no more core samples within the Eps distance of the cluster;
step 5.3, selecting another flow data which is not accessed yet, and repeating the steps 5.1 and 5.2;
Step 5.4, finally obtaining all class clusters after clustering the data set T:

<C_1, C_2, ..., C_q | C_h ∈ T, h ∈ [1, q]>,

wherein the flow data within each class cluster C_h are randomly assigned a class label that is the same within the cluster and different from those of the other class clusters.
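Steps 5.1 to 5.4 describe a DBSCAN-style expansion. Below is a compact sketch under the convention that -1 marks noise and cluster labels start from 0 (matching Table 5 of the description); the function name dbscan and the queue-based expansion are implementation choices, not claim language.

```python
from collections import deque

def dbscan(data, dist, eps, pts_min):
    # Density clustering per steps 5.1-5.4: -1 marks noise, 0..q-1 clusters.
    labels = [None] * len(data)

    def region(i):
        return [j for j in range(len(data))
                if j != i and dist(data[i], data[j]) <= eps]

    cluster = 0
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        neigh = region(i)
        if len(neigh) < pts_min:
            labels[i] = -1            # provisionally noise (step 5.1)
            continue
        labels[i] = cluster           # core sample starts a new cluster
        queue = deque(neigh)
        while queue:                  # expand the cluster (step 5.2)
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster   # border point reclaimed from noise
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = region(j)
            if len(nj) >= pts_min:    # j is itself a core sample
                queue.extend(nj)
        cluster += 1
    return labels
```

On one-dimensional toy data with absolute difference as the metric, two dense runs become clusters 0 and 1 and an isolated value is left as noise.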
7. An HTTP traffic semi-supervised clustering device based on self-learning parameter measurement, characterized by comprising an HTTP message flow data set module, a distance metric module, a self-learning training module, a clustering threshold module and an aggregation module, connected in sequence by signals, wherein the HTTP message flow data set module is used for giving an HTTP message flow data set T containing a plurality of pieces of message flow data, wherein the class label attribute in the message header attributes of each piece of message flow data is denoted L, L is expressed as a discrete variable, part of the values of L are empty, and the data whose class label attribute L is not empty form the training set T_Train; the distance metric module is used for measuring the gap between two pieces of message flow data x_i, x_j; the self-learning training module is used for training the distance parameters in the distance metric function on the training set T_Train using a logistic regression model; the clustering threshold module is used for setting the clustering thresholds, and the aggregation module is used for clustering the flow data set T based on the clustering thresholds, the class label attribute L being the same within each cluster of data obtained after clustering; the clustering threshold module comprises a neighborhood density threshold module and a radius parameter module, the neighborhood density threshold module is used for computing the neighborhood density threshold, and the radius parameter module is used for statistically analyzing the training set T_Train and taking the average intra-class distance as the radius parameter Eps of the clustering algorithm: the data with the same L attribute are grouped into one class, the distances between data of the same class are computed, and their average value is taken.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which, when executed by a processor, implements the method steps of any of claims 1-6.
9. An electronic device comprising a processor and a memory, the memory storing a computer program, which, when executed by the processor, performs the method steps of any of claims 1-6.
CN202310532962.0A 2023-05-12 2023-05-12 HTTP traffic clustering method and device based on self-learning parameter measurement Active CN116244612B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310532962.0A CN116244612B (en) 2023-05-12 2023-05-12 HTTP traffic clustering method and device based on self-learning parameter measurement

Publications (2)

Publication Number Publication Date
CN116244612A CN116244612A (en) 2023-06-09
CN116244612B true CN116244612B (en) 2023-08-29

Family

ID=86635285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310532962.0A Active CN116244612B (en) 2023-05-12 2023-05-12 HTTP traffic clustering method and device based on self-learning parameter measurement

Country Status (1)

Country Link
CN (1) CN116244612B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145827A (en) * 2017-04-01 2017-09-08 浙江大学 Across the video camera pedestrian recognition methods again learnt based on adaptive distance metric
CN107819698A (en) * 2017-11-10 2018-03-20 北京邮电大学 A kind of net flow assorted method based on semi-supervised learning, computer equipment
CN109067612A (en) * 2018-07-13 2018-12-21 哈尔滨工程大学 A kind of online method for recognizing flux based on incremental clustering algorithm
CN110222782A (en) * 2019-06-13 2019-09-10 齐鲁工业大学 There are supervision two-category data analysis method and system based on Density Clustering
TW202019140A (en) * 2018-11-01 2020-05-16 財團法人資訊工業策進會 Suspicious packet detection device and suspicious packet detection method thereof
CN112822121A (en) * 2019-11-15 2021-05-18 中兴通讯股份有限公司 Traffic identification method, traffic determination method and knowledge graph establishment method
CN112926045A (en) * 2021-02-24 2021-06-08 北京通付盾人工智能技术有限公司 Group control equipment identification method based on logistic regression model
CN113242207A (en) * 2021-04-02 2021-08-10 河海大学 Iterative clustering network flow abnormity detection method
CN115600128A (en) * 2022-10-26 2023-01-13 中国农业银行股份有限公司(Cn) Semi-supervised encrypted traffic classification method and device and storage medium
CN116028832A (en) * 2023-02-06 2023-04-28 中国烟草总公司贵州省公司 Sample clustering processing method and device, storage medium and electronic equipment
WO2023073092A1 (en) * 2021-11-01 2023-05-04 Koninklijke Philips N.V. Managing a model trained using a machine learning process
CN116090454A (en) * 2022-12-26 2023-05-09 南京航空航天大学 Method for extracting critical features of dangerous sources of air management system based on semi-supervised learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019064598A1 (en) * 2017-09-29 2019-04-04 Nec Corporation Regression apparatus, regression method, and computer-readable storage medium
US11494669B2 (en) * 2018-10-30 2022-11-08 Diveplane Corporation Clustering, explainability, and automated decisions in computer-based reasoning systems


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on flooding attack detection based on HTTP traffic; Yang Jian; China Masters' Theses Full-text Database, Information Science and Technology; I139-84 *


Similar Documents

Publication Publication Date Title
Wu et al. Contextual bandits in a collaborative environment
US10354201B1 (en) Scalable clustering for mixed machine learning data
US20190147297A1 (en) System for time-efficient assignment of data to ontological classes
CN110196908A (en) Data classification method, device, computer installation and storage medium
US20220058222A1 (en) Method and apparatus of processing information, method and apparatus of recommending information, electronic device, and storage medium
CN113468227B (en) Information recommendation method, system, equipment and storage medium based on graph neural network
WO2022247955A1 (en) Abnormal account identification method, apparatus and device, and storage medium
CN103559504A (en) Image target category identification method and device
CN112668579A (en) Weak supervision semantic segmentation method based on self-adaptive affinity and class distribution
CN107633257B (en) Data quality evaluation method and device, computer readable storage medium and terminal
Du et al. Parallel processing of improved KNN text classification algorithm based on Hadoop
CN108540988B (en) Scene division method and device
CN114418035A (en) Decision tree model generation method and data recommendation method based on decision tree model
CN110929028A (en) Log classification method and device
CN110458600A (en) Portrait model training method, device, computer equipment and storage medium
US20230342606A1 (en) Training method and apparatus for graph neural network
CN115204416A (en) Heterogeneous client-oriented joint learning method based on hierarchical sampling optimization
CN115562940A (en) Load energy consumption monitoring method and device, medium and electronic equipment
CN115456093A (en) High-performance graph clustering method based on attention-graph neural network
CN108154380A (en) The method for carrying out the online real-time recommendation of commodity to user based on extensive score data
CN114995964A (en) Combination service reconstruction method, device, equipment and computer readable medium
CN114417095A (en) Data set partitioning method and device
CN116244612B (en) HTTP traffic clustering method and device based on self-learning parameter measurement
WO2020147259A1 (en) User portait method and apparatus, readable storage medium, and terminal device
CN110764907A (en) Cloud computing resource map construction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant