WO2023173790A1 - System for classifying encrypted traffic based on data packet - Google Patents

System for classifying encrypted traffic based on data packet

Info

Publication number
WO2023173790A1
Authority
WO
WIPO (PCT)
Prior art keywords
traffic
data packets
data
packet
features
Prior art date
Application number
PCT/CN2022/133120
Other languages
English (en)
French (fr)
Inventor
仇晶
丁杰
李鉴明
周玲
贾焰
田志宏
韩伟红
顾钊铨
王乐
李树栋
鲁辉
苏申
Original Assignee
广州大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州大学
Publication of WO2023173790A1
Priority to US18/386,251 (US20240064107A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441 Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2483 Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection

Definitions

  • The invention relates to the field of network data technology, specifically to an encrypted traffic classification system based on data packets.
  • Traffic classification is the task of dividing network traffic into corresponding categories. It has crucial applications in many software programs such as network intrusion detection systems. Traffic classification can trace the source of traffic, such as the application, operating system, or browser type. In cyberspace security, the main category of concern is the security nature of traffic, such as normal versus malicious traffic, so as to identify traffic generated by malware and help firewalls identify malicious connections and intrusions. Malware can be defined as programs intended to damage computer systems, and it is the biggest threat to information security today.
  • Encrypted traffic is produced by encrypting the original data messages before transmitting them over the network. Encryption is designed to protect the security and confidentiality of network communications and to protect user privacy. However, attackers often exploit this concealment to hide and deploy malicious code, to perform remote command and control, or to leak data. How to classify encrypted traffic without decrypting data packets has therefore become a major issue in cyberspace security research.
  • The TLS protocol is mainly used to encrypt traffic on the network.
  • The TLS protocol is designed for Internet communication security. It performs a handshake between two communicating applications to establish a reliable encrypted communication channel.
  • Traditional methods such as port-based classification and deep packet inspection rely on plaintext data when classifying traffic, so they are ineffective against encrypted traffic. This has led to the use of machine learning to classify encrypted traffic, which has gradually attracted the attention of most scholars. Machine learning algorithms have been shown to be the most suitable method for encrypted traffic classification.
  • Machine learning algorithms have advantages over traditional traffic classification methods: they can handle encrypted traffic classification and achieve high accuracy. Machine-learning-based encrypted traffic detection statistically analyzes flow metadata and constructs a combination of statistical attributes of the encrypted traffic as a fingerprint to classify and identify it. Kim et al. proposed a new method to automatically generate service signatures from encrypted traffic payload data: the certificate issuance information fields from the certificate exchange are used to sign services, and a list of certificate, session ID, and IP address pairs is constructed to match the traffic category. Building a mapping table is a common method in rule-based encrypted traffic detection.
  • Shbair et al. compared the Server Name Indication (SNI) with the domain name information corresponding to the IP address.
  • The above methods require manual filtering of field features and classify encrypted traffic by matching extracted rules.
  • This approach is also called rule-based encrypted traffic detection.
  • It has the advantages of being lightweight, fast, and easy to construct, but it has disadvantages: the feature fields must be screened manually, and a mapping table can be constructed and matched only for known categories of traffic. Attackers can easily bypass it by splicing or forging packet fields, and it has a high false positive rate.
  • Cisco provided a method to identify threats in encrypted traffic based on TLS handshake metadata and contextual analysis. It uses time series and statistical data as features and can classify both unencrypted and encrypted traffic.
  • For encrypted traffic, the encryption protocol is analyzed to accomplish traffic classification.
  • The purpose of the present invention is to provide an encrypted traffic classification system based on data packets to solve the problems raised in the background above.
  • The system is composed of three functions: capturing traffic, analyzing data packets, and classifying traffic.
  • Capturing traffic: all data packets transmitted between two IP addresses and their corresponding port numbers form a network flow. A flow includes many important packets related to information exchange, as well as many timeout retransmissions, out-of-order packets, and erroneous packets. The system screens and filters the packets in a flow by identifying the IP addresses, port numbers, protocol category, flag bits, and other information of the flow, thereby obtaining reliable flow data.
  • Protocol information such as TLS, HTTP, DNS and related fields can be extracted from the flow.
  • The system also extracts information about the packets in the flow and performs cluster analysis on important information such as packet size, direction, and delay. Unlike the packet statistics used in previous studies, the extracted packet information uses packet behavior as an important feature that is input into the model for training.
  • The system extracts four kinds of features from the network flow: spatiotemporal features, header features, payload features, and statistical features.
  • Spatiotemporal features generally refer to the time and space attributes of packets sent normally during network traffic transmission.
  • Header features include the traffic five-tuple and DNS and HTTP information.
  • Before a secure channel is established, the client and the server need to exchange data messages to confirm each other's identity. In the TLS protocol, the client and the server exchange mutually supported cipher suites to choose an appropriate encryption algorithm with which to encrypt data messages.
  • To authenticate the client and the server, the server sends a certificate to the client to verify the identity of both communicating parties.
  • Statistical features: a single data packet has three attributes, namely number of bytes, transmission direction, and inter-packet delay. The ratio of upstream to downstream packet counts and the ratio of upstream to downstream byte counts are then calculated.
  • Normal traffic can contain abnormal packets, and malicious traffic can contain normal packets.
  • The k-means clustering method is used to separate normal packets from malicious packets.
  • C_1 and C_2 are the labels of normal traffic and malicious traffic respectively.
  • First, two samples are randomly selected from the data set D as the centroid set {μ_1, μ_2}, where μ_j is a centroid of the set; then the distance between each sample x_i and each centroid μ_j is calculated.
  • The distance calculation method is:
  • A centroid and its assigned samples represent a cluster. After all samples are assigned, if no centroid vector has changed, the clustering result is output. Finally, in our system, the output clusters are C = {C_1, C_2}.
  • LightGBM is a framework for implementing the GBDT algorithm.
  • The main idea is to iteratively train weak classifiers to obtain the best model. It supports efficient computation and offers faster training speed and higher accuracy. LightGBM uses the Gini coefficient instead of the information gain ratio; the smaller the Gini coefficient, the lower the impurity and the better the feature.
  • The Gini coefficient of the probability distribution is Gini(p) = 2p(1 - p).
  • p is the probability of normal traffic.
  • The loss function we use is the log-likelihood loss function, and its calculation formula is as follows:
  • The time attributes of a data packet include the packet sending time and the inter-packet delay.
  • The spatial features of a data packet include packet length, packet sending direction, and the number of packets.
  • DNS features include the DNS domain name, return code, DNS address, and TTL. Other features can also be extracted from DNS, such as the popularity ranking of the website (for example its Alexa ranking) and the length and character distribution of the domain name (for example the Gaussian distribution of domain names).
  • HTTP is the most widely used protocol and is commonly used in web browsers and SMTP mail services.
  • Features that can be extracted include HTTP protocol type, request method, status code and Content-Type field.
  • The direction of traffic can be defined simply according to whether the data is sent by the client or by the server.
  • Upstream traffic is the traffic sent by the client to the server.
  • Downstream traffic is the traffic received by the client from the server.
  • From the attributes of the packets themselves, attributes such as average packet length, maximum packet length, and average inter-packet delay can be calculated.
  • L is the loss function
  • N is the number of samples
  • The system aims to provide an effective method to use the information in the original PCAP file: it collects network flow packets, builds a machine learning model, classifies encrypted traffic, and intercepts malicious traffic.
  • When the feature matrix is constructed, packet behavior features are also proposed. The behavior of packets shows the difference between normal traffic and malicious traffic.
  • The present invention focuses on the differences between versions of encryption protocols, especially the TLS protocol, and introduces them into the model for analysis, thereby improving the system's ability to classify encrypted traffic.
  • Figure 1 is a schematic diagram of the traffic packets in the data set of the present invention.
  • Figure 2 is a diagram of the data packet feature fields of the present invention.
  • Figure 3 is a model architecture diagram of the present invention.
  • The present invention provides a technical solution: an encrypted traffic classification system based on data packets.
  • The system is composed of three functions: capturing traffic, analyzing data packets, and classifying traffic.
  • Capturing traffic: all data packets transmitted between two IP addresses and their corresponding port numbers form a network flow. A flow includes many important packets related to information exchange, as well as many timeout retransmissions, out-of-order packets, and erroneous packets. The system screens and filters the packets in a flow by identifying the IP addresses, port numbers, protocol category, flag bits, and other information of the flow, thereby obtaining reliable flow data.
  • Protocol information such as TLS, HTTP, DNS and related fields can be extracted from the flow.
  • The system also extracts information about the packets in the flow and performs cluster analysis on important information such as packet size, direction, and delay. Unlike the packet statistics used in previous studies, the extracted packet information uses packet behavior as an important feature that is input into the model for training.
  • The system extracts four kinds of features from the network flow, in order: spatiotemporal features, header features, payload features, and statistical features.
  • Spatiotemporal features generally refer to the time and space attributes of packets sent normally during network traffic transmission.
  • The time attributes of packets include the packet sending time, the inter-packet delay, and so on.
  • The spatial features of packets include packet length, packet sending direction, the number of packets, and so on.
  • Header features include the traffic five-tuple and DNS and HTTP information.
  • DNS features include the DNS domain name, return code, DNS address, and TTL.
  • Other features can also be extracted from DNS, such as the popularity ranking of the website (for example its Alexa ranking) and the length and character distribution of the domain name (for example the Gaussian distribution of domain names).
  • HTTP is the most widely used protocol and is commonly used in web browsers and SMTP mail services. Features that can be extracted include the HTTP protocol type, request method, status code, Content-Type field, and so on.
  • Payload features refer to the content encapsulated in the flow data, such as encryption protocols.
  • Before establishing a secure encrypted communication channel, the client and server must exchange data messages to confirm each other's identity. This process is usually called the handshake phase of the encryption protocol.
  • In the TLS protocol, the client and the server exchange mutually supported cipher suites to select an appropriate encryption algorithm with which to encrypt data messages.
  • To authenticate the client and server, the server sends a certificate to the client to verify the identity of both communicating parties.
  • Most current research on encrypted traffic is based on the TLS 1.2 protocol, while TLS 1.3 has become popular. Compared with TLS 1.2, the TLS 1.3 handshake sends fewer messages and needs fewer exchanges, which also brings more challenges to the encrypted traffic classification task.
  • Statistical features can be obtained from the traffic.
  • A single data packet has three attributes: number of bytes, transmission direction, and inter-packet delay. From the attributes of the packets themselves, attributes such as average packet length, maximum packet length, and average inter-packet delay can be calculated. The direction of traffic can be defined simply according to whether the data is sent by the client or by the server.
  • Upstream traffic is the traffic sent by the client to the server, and downstream traffic is the traffic received by the client from the server. From this, the ratio of upstream to downstream packet counts and the ratio of upstream to downstream byte counts can be further calculated.
  • Statistical features show numerically the difference between normal traffic and malicious traffic, and they are an important feature for classifying encrypted traffic. However, to obtain statistical features the classifier must collect many packets from one flow or session, so these features can only be used for offline classification.
  • The model proposed here focuses on packet behavior features. We believe that different data packets behave differently: normal traffic can contain abnormal packets, and malicious traffic can contain normal packets.
  • This paper uses the k-means clustering method to separate normal packets from malicious packets.
  • A centroid and its assigned samples represent a cluster. After all samples are assigned, if no centroid vector has changed, the clustering result is output. Finally, in our system, the output clusters are C = {C_1, C_2}.
  • LightGBM is a framework for implementing the GBDT algorithm.
  • The main idea is to iteratively train weak classifiers to obtain the best model. It supports efficient computation and offers faster training speed and higher accuracy. LightGBM uses the Gini coefficient instead of the information gain ratio; the smaller the Gini coefficient, the lower the impurity and the better the feature.
  • The Gini coefficient of the probability distribution is Gini(p) = 2p(1 - p).
  • p is the probability of normal traffic.
  • The loss function we use is the log-likelihood loss function, and its calculation formula is as follows:
  • L is the loss function
  • N is the number of samples
  • This packet-based encrypted traffic classification system integrates encryption protocol and packet features to improve the accuracy of identifying malicious traffic; it supports multiple versions of the TLS protocol.
  • Encryption protocols are constantly iterating; for example, TLS 1.3 has been widely used since it was proposed in 2018. Supporting multiple versions ensures that the system can adapt to iterations of encryption protocols and is highly applicable in the current network environment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

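The present invention relates to the technical field of network data and discloses a system for classifying encrypted traffic based on data packets. The system comprises three functional parts: traffic capture, data packet analysis, and traffic classification. In traffic capture, all data packets transmitted between two IP addresses and their corresponding port numbers constitute a network flow. The system aims to provide an effective way to use the information in raw PCAP files: it collects network flow packets, builds a machine learning model, classifies encrypted traffic, and intercepts malicious traffic. When the feature matrix is built, packet behavior features are proposed in addition to the basic spatiotemporal features, header features, payload data, and statistical features; packet behavior reflects the difference between normal and malicious traffic. The invention also focuses on the differences between versions of encryption protocols, especially the TLS protocol, and introduces them into the model for analysis, thereby improving the system's ability to classify encrypted traffic.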

Description

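System for classifying encrypted traffic based on data packet
Technical Field
The present invention relates to the technical field of network data, and specifically to a system for classifying encrypted traffic based on data packets.
Background
Traffic classification is the task of assigning network traffic to corresponding categories, and it is a crucial component of many software systems such as network intrusion detection systems. Traffic classification can trace the source that generated the traffic, such as the application, the operating system, or the browser type. In cyberspace security, the categories of main interest concern the security nature of traffic, such as normal versus malicious traffic, so that traffic generated by malware can be identified and firewalls can be helped to recognize malicious connections and intrusions. Malware can be defined as a program intended to damage computer systems, and it is the greatest threat in today's information security field.
With the development of networks, traffic encryption has been widely adopted, and the body of messages transmitted over the network has become ciphertext encrypted with the TLS/SSL protocols. Traffic classification methods that relied on plaintext content are used less and less. Encrypted traffic is produced by encrypting the original data messages before they are transmitted over the network. Encryption is intended to protect the security and confidentiality of network communication and to safeguard user privacy, but attackers often exploit this concealment to hide and deploy malicious code, to carry out remote command and control, or to cause data leakage. How to classify encrypted traffic without decrypting packets has therefore become a major problem in cyberspace security research.
Traffic on the network is mainly encrypted with the TLS protocol. TLS was designed for secure Internet communication: it performs a handshake between two communicating applications to establish a reliable encrypted channel. Traditional methods such as port-based classification and deep packet inspection rely on plaintext data, so they are ineffective against encrypted traffic. As a result, machine learning approaches to encrypted traffic classification have gradually attracted the attention of most researchers, and machine learning algorithms have been shown to be the most suitable approach to the encrypted traffic classification task.
Compared with traditional traffic classification methods, machine learning algorithms have several advantages: they can handle encrypted traffic and achieve high accuracy. Machine-learning-based encrypted traffic detection statistically analyzes flow metadata and builds a combination of statistical attributes of the encrypted traffic as a fingerprint with which to classify and identify it. Kim et al. proposed a new method for automatically generating service signatures from encrypted traffic payload data; it signs services using the certificate issuance information fields from the certificate exchange, and builds a list of certificate, session ID, and IP address pairs to match traffic categories. Building a mapping table is a common technique in rule-based encrypted traffic detection. Shbair et al. compared the Server Name Indication (SNI) with the domain name information corresponding to the IP address, relying on a trusted DNS service to verify that the real destination server is consistent with the declared SNI value, and thereby monitored HTTPS traffic. Papadogiannaki et al. proposed a pattern language that identifies the type of encrypted traffic by matching fixed patterns in the encrypted traffic, such as the frequency of occurrence of relevant packets or the position of packets.
The methods above require manual filtering of field features and classify encrypted traffic by matching extracted rules; this approach is also called rule-based encrypted traffic detection. It is lightweight, fast, and easy to build, but the feature fields must be selected manually, a mapping table can be constructed and matched only for known traffic categories, attackers can easily bypass it by splicing or forging packet fields, and it has a high false positive rate. In 2016, Cisco presented a method for identifying threats in encrypted traffic based on TLS handshake metadata and contextual analysis. Using time series and statistical data as features, it can classify unencrypted traffic and can also classify encrypted traffic by analyzing its encryption protocol, because during the TLS handshake phase the data sent by both parties is still transmitted in plaintext until the encrypted channel is established. This method attends to the characteristics of the traffic data itself and to the information exchanged during the encrypted handshake rather than simply building a rule mapping table; it is highly extensible and accurate and can adapt to complex and changing network environments.
Summary of the Invention
The purpose of the present invention is to provide a system for classifying encrypted traffic based on data packets, so as to solve the problems raised in the background above.
To achieve this purpose, the present invention provides the following technical solution: a system for classifying encrypted traffic based on data packets, comprising three functional parts in total: traffic capture, data packet analysis, and traffic classification.
Traffic capture: all data packets transmitted between two IP addresses and their corresponding port numbers constitute a network flow. A flow contains many important packets related to information exchange, but also many timeout retransmissions, out-of-order packets, and erroneous packets. The system screens and filters the packets in a flow by identifying the flow's IP addresses, port numbers, protocol type, flag bits, and other information, thereby obtaining reliable flow data.
Traffic analysis
Protocol information such as TLS, HTTP, and DNS, together with related fields, can be extracted from a flow. The system additionally extracts information about the packets in the flow and performs cluster analysis on important attributes such as packet size, direction, and delay. Unlike the packet statistics used in previous studies, the extracted packet information feeds packet behavior into the model as an important feature for training. The system extracts four kinds of features from a network flow: spatiotemporal features, header features, payload features, and statistical features.
Spatiotemporal features generally refer to the time and space attributes of packets sent normally during network traffic transmission.
Header features include the flow five-tuple and DNS and HTTP information.
Payload features: before a secure encrypted channel is established, the client and server need to exchange data messages to confirm each other's identity. In the TLS protocol, the client and server exchange the cipher suites they each support in order to select a suitable encryption algorithm for encrypting data messages. To authenticate the client and server, the server sends a certificate to the client to verify the identities of the communicating parties.
Statistical features: a single packet has three attributes, namely byte count, transmission direction, and inter-packet delay. From these, the ratio of upstream to downstream packet counts and the ratio of upstream to downstream byte counts are further computed.
Normal traffic may contain abnormal packets, and malicious traffic may contain normal packets; the k-means clustering method is used to separate normal packets from malicious packets.
The input data set has the form D = {x_1, x_2, ..., x_m} and the output is the classification result C = {C_1, C_2}, where C_1 and C_2 are the labels of normal and malicious traffic respectively. First, two samples are randomly selected from the data set D as the centroid set {μ_1, μ_2}, where μ_j is a centroid of the set. The distance between each sample x_i and each centroid μ_j is then computed as: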
Figure PCTCN2022133120-appb-000001
The centroids of the set C are then recomputed according to the following formula:
Figure PCTCN2022133120-appb-000002
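The two formulas referenced here appear only as figure placeholders in this text. As a hedged reading, assuming the standard k-means formulation with Euclidean distance (an assumption, not something the text confirms), the distance and the centroid update would take the following form, where |C_j| denotes the number of samples currently assigned to cluster C_j:

```latex
d_{ij} = \lVert x_i - \mu_j \rVert_2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```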
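The distance between each sample and the two centroids is computed, and each object is assigned to its nearest centroid; a centroid together with its assigned samples represents a cluster. After all samples have been assigned, if none of the centroid vectors has changed, the clustering result is output. Finally, in our system, the output clusters are:
C = {C_1, C_2}.
After obtaining the classes of normal-behavior packets and abnormal packets in the traffic, we further compute the proportions of normal-behavior packets and abnormal packets in the traffic and the ratio between the two, and add them to the feature matrix as parameters. This finally yields the sample set S = {S_1, S_2 | x_i ∈ S}, where x_i is a sample in S.
After the feature set is input, this paper uses the LightGBM model for classification. LightGBM is a framework that implements the GBDT algorithm; its main idea is to iteratively train weak classifiers to obtain the best model. It supports efficient computation and offers faster training and higher accuracy. LightGBM uses the Gini coefficient rather than the information gain ratio; the smaller the Gini coefficient, the lower the impurity and the better the feature. The Gini coefficient of the probability distribution is:
Gini(p) = 2p(1 - p)
where p is the probability of belonging to normal traffic. The loss function we use is the log-likelihood loss, computed as follows: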
Figure PCTCN2022133120-appb-000003
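The loss formula is likewise present only as a figure placeholder. Assuming it is the usual binary log-likelihood (cross-entropy) loss, and writing y_i for the true class of instance i and p_i for its predicted probability of being normal traffic (both symbol names are assumptions introduced here), it would read:

```latex
L = -\frac{1}{N} \sum_{i=1}^{N}
    \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
```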
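Preferably, the time attributes of a packet include the packet sending time and the inter-packet delay.
Preferably, the spatial features of a packet include the packet length, the packet sending direction, and the number of packets.
Preferably, the DNS features include the DNS domain name, return code, DNS address, and TTL; other features can be extracted from DNS, such as the popularity ranking of the website (for example its Alexa ranking) and the length and character distribution of the domain name (for example the Gaussian distribution of domain names).
Preferably, HTTP is the most widely used protocol and is commonly used in web browsers and SMTP mail services; the features that can be extracted include the HTTP protocol type, request method, status code, and Content-Type field.
Preferably, the direction of traffic can be defined simply according to whether the data is sent by the client or by the server: upstream traffic is the traffic sent by the client to the server, and downstream traffic is the traffic the client receives from the server.
Preferably, attributes such as average packet length, maximum packet length, and average inter-packet delay can be computed from the attributes of the packets themselves.
Preferably, L is the loss function, N is the number of samples, and the two remaining terms of the formula are the true class of the input instance and the predicted probability that the input instance belongs to the normal-traffic class.
Compared with the prior art, the beneficial effects of the present invention are as follows. The system aims to provide an effective way to use the information in raw PCAP files: it collects network flow packets, builds a machine learning model, classifies encrypted traffic, and intercepts malicious traffic. When the feature matrix is built, packet behavior features are proposed in addition to the basic spatiotemporal features, header features, payload data, and statistical features; packet behavior reflects the difference between normal and malicious traffic. The invention also focuses on the differences between versions of encryption protocols, especially the TLS protocol, and introduces them into the model for analysis, thereby improving the system's ability to classify encrypted traffic.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the traffic packets in the data set of the present invention;
Figure 2 is a diagram of the packet feature fields of the present invention;
Figure 3 is a model architecture diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to Figures 1 to 3, the present invention provides a technical solution: a system for classifying encrypted traffic based on data packets, comprising three functional parts in total: traffic capture, data packet analysis, and traffic classification.
Traffic capture: all data packets transmitted between two IP addresses and their corresponding port numbers constitute a network flow. A flow contains many important packets related to information exchange, but also many timeout retransmissions, out-of-order packets, and erroneous packets. The system screens and filters the packets in a flow by identifying the flow's IP addresses, port numbers, protocol type, flag bits, and other information, thereby obtaining reliable flow data.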
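The capture step can be sketched roughly as follows. This is a minimal illustration, assuming packets have already been parsed into simple records. The `Packet` fields, the `flow_key` helper, and the rule of dropping a data packet whose sender and TCP sequence number were already seen (a stand-in for retransmission filtering) are all assumptions made for illustration; the source does not specify the filtering rules.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    # Hypothetical parsed-packet record; field names are illustrative only.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str   # e.g. "TCP"
    seq: int        # TCP sequence number
    length: int     # payload bytes
    timestamp: float


def flow_key(pkt):
    """Direction-independent key: both endpoints plus the protocol."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    return (min(a, b), max(a, b), pkt.protocol)


def group_flows(packets):
    """Group packets into flows in time order, dropping assumed retransmissions."""
    flows = defaultdict(list)
    seen = set()
    for pkt in sorted(packets, key=lambda p: p.timestamp):
        marker = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.seq)
        if pkt.length > 0 and marker in seen:
            continue  # same sender and sequence number seen before
        seen.add(marker)
        flows[flow_key(pkt)].append(pkt)
    return dict(flows)
```

Sorting by timestamp also restores the order of packets that were captured out of order; a production filter would additionally inspect TCP flags and checksums.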
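Traffic analysis
Protocol information such as TLS, HTTP, and DNS, together with related fields, can be extracted from a flow. The system additionally extracts information about the packets in the flow and performs cluster analysis on important attributes such as packet size, direction, and delay. Unlike the packet statistics used in previous studies, the extracted packet information feeds packet behavior into the model as an important feature for training. The system extracts four kinds of features from a network flow, in order: spatiotemporal features, header features, payload features, and statistical features.
Spatiotemporal features generally refer to the time and space attributes of packets sent normally during network traffic transmission. The time attributes of a packet include the packet sending time, the inter-packet delay, and so on; the spatial features of a packet include the packet length, the packet sending direction, the number of packets, and so on.
Header features include the flow five-tuple and DNS and HTTP information. DNS features include the DNS domain name, return code, DNS address, and TTL. Other features can also be extracted from DNS, such as the popularity ranking of the website (for example its Alexa ranking) and the length and character distribution of the domain name (for example the Gaussian distribution of domain names). HTTP is the most widely used protocol and is commonly used in web browsers and SMTP mail services; the features that can be extracted include the HTTP protocol type, request method, status code, Content-Type field, and so on.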
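As an illustration of the domain-name features mentioned above (length and character distribution), the sketch below computes a few lexical features with the standard library. The feature names and the choice of Shannon entropy as a summary of the character distribution are assumptions for illustration; the source does not say how the distribution is measured.

```python
import math
from collections import Counter


def domain_features(domain):
    """Return simple lexical features of a domain name (illustrative only)."""
    name = domain.lower().rstrip(".")
    total = len(name)
    if total == 0:
        return {"length": 0, "digit_ratio": 0.0, "char_entropy": 0.0}
    counts = Counter(name)
    # Shannon entropy of the character distribution, in bits per character.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    digits = sum(ch.isdigit() for ch in name)
    return {
        "length": total,
        "digit_ratio": digits / total,
        "char_entropy": entropy,
    }
```

Algorithmically generated domains tend to have higher character entropy than human-chosen names, which is one reason such features are often fed to a classifier.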
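Payload features refer to the content encapsulated in the flow data, such as the encryption protocol. Before a secure encrypted channel is established, the client and server must exchange data messages to confirm each other's identity; this process is usually called the handshake phase of the encryption protocol. In the TLS protocol, the client and server exchange the cipher suites they each support in order to select a suitable encryption algorithm for encrypting data messages. To authenticate the client and server, the server sends a certificate to the client to verify the identities of the communicating parties. It is worth noting that most current research on encrypted traffic is based on TLS 1.2, while TLS 1.3 has already become popular. Compared with TLS 1.2, the TLS 1.3 handshake sends fewer messages and needs fewer exchanges, which also poses more challenges for the encrypted traffic classification task.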
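Because the start of the handshake travels in plaintext, the first bytes of a TCP payload are enough to recognize a handshake record. The sketch below parses the 5-byte TLS record header (content type, two version bytes, length) and the handshake message type that follows. The function name and the returned dictionary are hypothetical; the source does not describe its parser.

```python
import struct

TLS_HANDSHAKE = 22  # record content type for handshake messages
HANDSHAKE_TYPES = {1: "ClientHello", 2: "ServerHello", 11: "Certificate"}


def parse_tls_record_header(data):
    """Parse the TLS record header and the handshake message type.

    Returns None when the bytes do not look like a TLS handshake record.
    """
    if len(data) < 6:
        return None
    content_type, major, minor, length = struct.unpack("!BBBH", data[:5])
    if content_type != TLS_HANDSHAKE or major != 3:
        return None
    return {
        "record_version": (major, minor),
        "record_length": length,
        "handshake_type": HANDSHAKE_TYPES.get(data[5], "other"),
    }
```

Note that the record-layer version is not a reliable indicator of the negotiated version: TLS 1.3 keeps a legacy value there and also encrypts the server certificate, so fewer plaintext fields are available than in TLS 1.2.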
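Statistical features can be obtained from the traffic. A single packet has three attributes: byte count, transmission direction, and inter-packet delay. From the attributes of the packets themselves, attributes such as average packet length, maximum packet length, and average inter-packet delay can be computed. The direction of traffic can be defined simply according to whether the data is sent by the client or by the server: upstream traffic is the traffic sent by the client to the server, and downstream traffic is the traffic the client receives from the server. From this, the ratio of upstream to downstream packet counts and the ratio of upstream to downstream byte counts can be further computed. Statistical features show numerically the difference between normal and malicious traffic, and they are an important feature for classifying encrypted traffic. However, to obtain statistical features the classifier must collect many packets from a flow or session, so these features can only be used for offline classification.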
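The statistical features listed above can be computed in a few lines. The sketch assumes each packet has been reduced to a `(timestamp, byte_count, direction)` tuple with direction `"up"` or `"down"`; that representation and the feature names are assumptions for illustration.

```python
from statistics import mean


def flow_statistics(packets):
    """Compute per-flow statistics from a non-empty list of
    (timestamp, byte_count, direction) tuples."""
    packets = sorted(packets)
    sizes = [size for _, size, _ in packets]
    times = [ts for ts, _, _ in packets]
    delays = [b - a for a, b in zip(times, times[1:])]
    up = [size for _, size, d in packets if d == "up"]      # client to server
    down = [size for _, size, d in packets if d == "down"]  # server to client
    return {
        "mean_packet_length": mean(sizes),
        "max_packet_length": max(sizes),
        "mean_inter_packet_delay": mean(delays) if delays else 0.0,
        # Ratios are None when there is no downstream traffic to divide by.
        "up_down_packet_ratio": len(up) / len(down) if down else None,
        "up_down_byte_ratio": sum(up) / sum(down) if sum(down) else None,
    }
```

For example, a flow with two upstream packets of 100 and 200 bytes and one downstream packet of 300 bytes has a packet ratio of 2.0 and a byte ratio of 1.0.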
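Classification algorithm
The model proposed in this paper focuses on packet behavior features. We hold that different packets behave differently: normal traffic may contain abnormal packets, and malicious traffic may contain normal packets. This paper uses the k-means clustering method to separate normal packets from malicious packets. The input data set has the form D = {x_1, x_2, ..., x_m} and the output is the classification result C = {C_1, C_2}, where C_1 and C_2 are the labels of normal and malicious traffic respectively. First, two samples are randomly selected from the data set D as the centroid set {μ_1, μ_2}, where μ_j is a centroid of the set. The distance between each sample x_i and each centroid μ_j is then computed as: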
Figure PCTCN2022133120-appb-000004
The centroids of the set C are then recomputed according to the following formula:
Figure PCTCN2022133120-appb-000005
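The distance between each sample and the two centroids is computed, and each object is assigned to its nearest centroid; a centroid together with its assigned samples represents a cluster. After all samples have been assigned, if none of the centroid vectors has changed, the clustering result is output. Finally, in our system, the output clusters are:
C = {C_1, C_2}.
After obtaining the classes of normal-behavior packets and abnormal packets in the traffic, we further compute the proportions of normal-behavior packets and abnormal packets in the traffic and the ratio between the two, and add them to the feature matrix as parameters. This finally yields the sample set S = {S_1, S_2 | x_i ∈ S}, where x_i is a sample in S.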
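The clustering loop and the derived ratio features can be sketched as below. This is a plain two-cluster k-means following the steps in the text: random initial centroids taken from the data, nearest-centroid assignment, mean update, and a stop when no centroid changes. Euclidean distance and the rule that the larger cluster is treated as normal-behavior packets are assumptions for illustration; the source does not state how clusters are mapped to labels.

```python
import math
import random


def kmeans_two_clusters(samples, seed=0, max_iter=100):
    """Split samples (a list of equal-length tuples of floats) into two clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(samples, 2)  # two random samples as initial centroids
    for _ in range(max_iter):
        clusters = ([], [])
        for x in samples:
            # Assign each sample to its nearest centroid (Euclidean distance).
            j = min((0, 1), key=lambda k: math.dist(x, centroids[k]))
            clusters[j].append(x)
        new_centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # no centroid moved: clustering has converged
        centroids = new_centroids
    return clusters


def behavior_ratios(clusters):
    """Assumption: the larger cluster is treated as normal-behavior packets."""
    normal, abnormal = sorted(clusters, key=len, reverse=True)
    total = len(normal) + len(abnormal)
    return {
        "normal_fraction": len(normal) / total,
        "abnormal_fraction": len(abnormal) / total,
        "normal_to_abnormal": len(normal) / len(abnormal) if abnormal else None,
    }
```

The three values returned by `behavior_ratios` correspond to the proportions and the ratio that the text adds to the feature matrix before classification.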
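After the feature set is input, this paper uses the LightGBM model for classification. LightGBM is a framework that implements the GBDT algorithm; its main idea is to iteratively train weak classifiers to obtain the best model. It supports efficient computation and offers faster training and higher accuracy. LightGBM uses the Gini coefficient rather than the information gain ratio; the smaller the Gini coefficient, the lower the impurity and the better the feature. The Gini coefficient of the probability distribution is:
Gini(p) = 2p(1 - p)
where p is the probability of belonging to normal traffic. The loss function we use is the log-likelihood loss, computed as follows: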
Figure PCTCN2022133120-appb-000006
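L is the loss function, N is the number of samples, and the two remaining terms of the formula are the true class of the input instance and the predicted probability that the input instance belongs to the normal-traffic class.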
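Both quantities are easy to check numerically. The sketch below implements the Gini expression exactly as given in the text and a log-likelihood loss in its usual binary cross-entropy form; the latter is an assumption, since the formula itself is available only as a figure. The function names and the clipping constant `eps` are illustrative.

```python
import math


def gini(p):
    """Gini impurity of a two-class distribution with P(normal) = p."""
    return 2 * p * (1 - p)


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log-likelihood loss (assumed standard form).

    y_true: 1 for normal traffic, 0 otherwise; p_pred: predicted P(normal).
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

The Gini value peaks at 0.5 when p = 0.5 (maximum impurity) and falls to 0 when p is 0 or 1, which matches the statement that a smaller value means lower impurity.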
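The system integrates encryption protocol features and packet features, which improves the accuracy of identifying malicious traffic. It supports multiple versions of the TLS protocol. As networks develop, encryption protocols keep iterating; for example, TLS 1.3 has been widely used since it was introduced in 2018. Supporting multiple versions ensures that the system can adapt to the iteration of encryption protocols and is highly applicable in the current network environment.
Although embodiments of the present invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the invention; the scope of the invention is defined by the appended claims and their equivalents.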

Claims (8)

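  1. A system for classifying encrypted traffic based on data packets, characterized in that the system comprises three functional parts in total: traffic capture, data packet analysis, and traffic classification;
    traffic capture: all data packets transmitted between two IP addresses and their corresponding port numbers constitute a network flow, which contains many important packets related to information exchange and also many timeout retransmissions, out-of-order packets, and erroneous packets; the system screens and filters the packets in a flow by identifying the flow's IP addresses, port numbers, protocol type, flag bits, and other information, thereby obtaining reliable flow data;
    traffic analysis
    protocol information such as TLS, HTTP, and DNS, together with related fields, can be extracted from a flow; the system additionally extracts information about the packets in the flow and performs cluster analysis on important attributes such as packet size, direction, and delay; unlike the packet statistics used in previous studies, the extracted packet information feeds packet behavior into the model as an important feature for training; the system extracts four kinds of features from a network flow: spatiotemporal features, header features, payload features, and statistical features,
    spatiotemporal features generally refer to the time and space attributes of packets sent normally during network traffic transmission;
    header features include the flow five-tuple and DNS and HTTP information;
    payload features: before a secure encrypted channel is established, the client and server need to exchange data messages to confirm each other's identity; in the TLS protocol, the client and server exchange the cipher suites they each support in order to select a suitable encryption algorithm for encrypting data messages; to authenticate the client and server, the server sends a certificate to the client to verify the identities of the communicating parties;
    statistical features: a single packet has three attributes, namely byte count, transmission direction, and inter-packet delay; the ratio of upstream to downstream packet counts and the ratio of upstream to downstream byte counts are then further computed;
    normal traffic may contain abnormal packets, and malicious traffic may contain normal packets; the k-means clustering method is used to separate normal packets from malicious packets;
    the input data set has the form D = {x_1, x_2, ..., x_m} and the output is the classification result C = {C_1, C_2}, where C_1 and C_2 are the labels of normal and malicious traffic respectively; first, two samples are randomly selected from the data set D as the centroid set {μ_1, μ_2}, where μ_j is a centroid of the set; the distance between each sample x_i and each centroid μ_j is then computed as: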
    Figure PCTCN2022133120-appb-100001
    the centroids of the set C are then recomputed according to the following formula:
    Figure PCTCN2022133120-appb-100002
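    the distance between each sample and the two centroids is computed, and each object is assigned to its nearest centroid; a centroid together with its assigned samples represents a cluster; after all samples have been assigned, if none of the centroid vectors has changed, the clustering result is output; finally, in our system, the output clusters are:
    C = {C_1, C_2}.
    after obtaining the classes of normal-behavior packets and abnormal packets in the traffic, we further compute the proportions of normal-behavior packets and abnormal packets in the traffic and the ratio between the two, and add them to the feature matrix as parameters, finally obtaining the sample set S = {S_1, S_2 | x_i ∈ S}, where x_i is a sample in S,
    after the feature set is input, this paper uses the LightGBM model for classification; LightGBM is a framework that implements the GBDT algorithm, whose main idea is to iteratively train weak classifiers to obtain the best model; it supports efficient computation and offers faster training and higher accuracy; LightGBM uses the Gini coefficient rather than the information gain ratio; the smaller the Gini coefficient, the lower the impurity and the better the feature; the Gini coefficient of the probability distribution is:
    Gini(p) = 2p(1 - p)
    where p is the probability of belonging to normal traffic; the loss function we use is the log-likelihood loss, computed as follows: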
    Figure PCTCN2022133120-appb-100003
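  2. The system for classifying encrypted traffic based on data packets according to claim 1, characterized in that the time attributes of a packet include the packet sending time and the inter-packet delay.
  3. The system for classifying encrypted traffic based on data packets according to claim 1, characterized in that the spatial features of a packet include the packet length, the packet sending direction, and the number of packets.
  4. The system for classifying encrypted traffic based on data packets according to claim 1, characterized in that the DNS features include the DNS domain name, return code, DNS address, and TTL; other features can be extracted from DNS, such as the popularity ranking of the website (for example its Alexa ranking) and the length and character distribution of the domain name (for example the Gaussian distribution of domain names).
  5. The system for classifying encrypted traffic based on data packets according to claim 1, characterized in that HTTP is the most widely used protocol and is commonly used in web browsers and SMTP mail services, and the features that can be extracted include the HTTP protocol type, request method, status code, and Content-Type field.
  6. The system for classifying encrypted traffic based on data packets according to claim 1, characterized in that the direction of traffic can be defined simply according to whether the data is sent by the client or by the server; upstream traffic is the traffic sent by the client to the server, and downstream traffic is the traffic the client receives from the server.
  7. The system for classifying encrypted traffic based on data packets according to claim 1, characterized in that attributes such as average packet length, maximum packet length, and average inter-packet delay can be computed from the attributes of the packets themselves.
  8. The system for classifying encrypted traffic based on data packets according to claim 1, characterized in that L is the loss function, N is the number of samples, and the two remaining terms of the formula are the true class of the input instance and the predicted probability that the input instance belongs to the normal-traffic class.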
PCT/CN2022/133120 2022-03-18 2022-11-21 System for classifying encrypted traffic based on data packet WO2023173790A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/386,251 US20240064107A1 (en) 2022-03-18 2023-11-01 System for classifying encrypted traffic based on data packet

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210271454.7A CN114866486B (zh) 2022-03-18 2022-03-18 System for classifying encrypted traffic based on data packet
CN202210271454.7 2022-03-18

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/386,251 Continuation-In-Part US20240064107A1 (en) 2022-03-18 2023-11-01 System for classifying encrypted traffic based on data packet

Publications (1)

Publication Number Publication Date
WO2023173790A1 (zh) 2023-09-21

Family

ID=82628112

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/133120 WO2023173790A1 (zh) 2022-03-18 2022-11-21 System for classifying encrypted traffic based on data packet

Country Status (3)

Country Link
US (1) US20240064107A1 (zh)
CN (1) CN114866486B (zh)
WO (1) WO2023173790A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114866486B (zh) * 2022-03-18 2024-06-18 广州大学 System for classifying encrypted traffic based on data packet
US20230319101A1 (en) * 2022-03-30 2023-10-05 Ecolux Technology Co., Ltd. Artificial intelligence system and method thereof for defending against cyber attacks
CN115941600B (zh) * 2023-03-14 2023-05-26 鹏城实验室 Packet distribution method, system, and computer-readable storage medium
CN117955734A (zh) * 2024-03-21 2024-04-30 道普信息技术有限公司 Method for re-analyzing PCAP metadata of an encryption protocol

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
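CN110113349A (zh) * 2019-05-15 2019-08-09 北京工业大学 Method for analyzing features of malicious encrypted traffic
CN110868409A (zh) * 2019-11-08 2020-03-06 中国科学院信息工程研究所 Method and system for passive operating system identification based on TCP/IP protocol stack fingerprints
CN111277587A (zh) * 2020-01-19 2020-06-12 武汉思普崚技术有限公司 Method and system for detecting malicious encrypted traffic based on behavior analysis
CN111277578A (zh) * 2020-01-14 2020-06-12 西安电子科技大学 Feature extraction method for encrypted traffic analysis, system, storage medium, and security device
WO2020119662A1 (zh) * 2018-12-14 2020-06-18 深圳先进技术研究院 Network traffic classification method
CN111565156A (zh) * 2020-04-27 2020-08-21 南京烽火星空通信发展有限公司 Method for identifying and classifying network traffic
CN112738039A (zh) * 2020-12-18 2021-04-30 北京中科研究院 Method, system, and device for detecting malicious encrypted traffic based on traffic behavior
CN113705619A (zh) * 2021-08-03 2021-11-26 广州大学 Malicious traffic detection method, system, computer, and medium
WO2022041394A1 (zh) * 2020-08-28 2022-03-03 南京邮电大学 Method and apparatus for identifying encrypted network traffic
CN114866486A (zh) * 2022-03-18 2022-08-05 广州大学 System for classifying encrypted traffic based on data packet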

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2546236C2 (ru) * 2013-08-05 2015-04-10 Государственное казенное образовательное учреждение высшего профессионального образования Академия Федеральной службы охраны Российской Федерации (Академия ФСО России) Method for analyzing an information flow and determining the network security state based on adaptive forecasting, and device for implementing it
CN105227408A (zh) * 2015-10-22 2016-01-06 蓝盾信息安全技术股份有限公司 Intelligent Trojan identification device and method
CN108809989B (zh) * 2018-06-14 2021-04-23 北京中油瑞飞信息技术有限责任公司 Botnet detection method and device
CN110505630A (zh) * 2019-03-12 2019-11-26 杭州海康威视数字技术股份有限公司 Wireless network intrusion detection method, apparatus, and electronic device
CN110348471B (zh) * 2019-05-23 2023-09-01 平安科技(深圳)有限公司 Abnormal object identification method, apparatus, medium, and electronic device
CN113452656B (zh) * 2020-03-26 2022-10-11 百度在线网络技术(北京)有限公司 Method, apparatus, electronic device, and computer-readable medium for identifying abnormal behavior
CN111131335B (zh) * 2020-03-30 2020-08-28 腾讯科技(深圳)有限公司 Artificial-intelligence-based network security protection method, apparatus, and electronic device
CN111611280A (zh) * 2020-04-29 2020-09-01 南京理工大学 Encrypted traffic identification method based on CNN and SAE
CN112087450B (zh) * 2020-09-09 2022-11-04 北京明略昭辉科技有限公司 Abnormal IP identification method, system, and computer device
CN113329023A (zh) * 2021-05-31 2021-08-31 西北大学 Method and system for building an encrypted-traffic maliciousness detection model and performing detection

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
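WO2020119662A1 (zh) * 2018-12-14 2020-06-18 深圳先进技术研究院 Network traffic classification method
CN110113349A (zh) * 2019-05-15 2019-08-09 北京工业大学 Method for analyzing features of malicious encrypted traffic
CN110868409A (zh) * 2019-11-08 2020-03-06 中国科学院信息工程研究所 Method and system for passive operating system identification based on TCP/IP protocol stack fingerprints
CN111277578A (zh) * 2020-01-14 2020-06-12 西安电子科技大学 Feature extraction method for encrypted traffic analysis, system, storage medium, and security device
CN111277587A (zh) * 2020-01-19 2020-06-12 武汉思普崚技术有限公司 Method and system for detecting malicious encrypted traffic based on behavior analysis
CN111565156A (zh) * 2020-04-27 2020-08-21 南京烽火星空通信发展有限公司 Method for identifying and classifying network traffic
WO2022041394A1 (zh) * 2020-08-28 2022-03-03 南京邮电大学 Method and apparatus for identifying encrypted network traffic
CN112738039A (zh) * 2020-12-18 2021-04-30 北京中科研究院 Method, system, and device for detecting malicious encrypted traffic based on traffic behavior
CN113705619A (zh) * 2021-08-03 2021-11-26 广州大学 Malicious traffic detection method, system, computer, and medium
CN114866486A (zh) * 2022-03-18 2022-08-05 广州大学 System for classifying encrypted traffic based on data packet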

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG, YANZHAO ET AL.: "Encrypted Traffic Classification Based on Packet Characteristics", JOURNAL OF GUANGZHOU UNIVERSITY (NATURAL SCIENCE EDITION), vol. 21, no. 2, 30 June 2022 (2022-06-30), pages 60 - 66, XP009548873 *

Also Published As

Publication number Publication date
CN114866486A (zh) 2022-08-05
CN114866486B (zh) 2024-06-18
US20240064107A1 (en) 2024-02-22

Similar Documents

Publication Publication Date Title
WO2023173790A1 (zh) System for classifying encrypted traffic based on data packet
Anderson et al. Deciphering malware’s use of TLS (without decryption)
Anderson et al. Identifying encrypted malware traffic with contextual flow data
US11399288B2 (en) Method for HTTP-based access point fingerprint and classification using machine learning
Siddharthan et al. Senmqtt-set: An intelligent intrusion detection in iot-mqtt networks using ensemble multi cascade features
Liu et al. Maldetect: A structure of encrypted malware traffic detection
Yan et al. Identifying wechat red packets and fund transfers via analyzing encrypted network traffic
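CN111147394A (zh) Multi-level classification and detection method for remote desktop protocol traffic behavior
CN116346418A (zh) DDoS detection method and device based on federated learning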
Liu et al. A survey on encrypted traffic identification
WO2024065956A1 (zh) Method for detecting abnormal network behavior based on multi-dimensional entropy fingerprints of data
Nair et al. A study on botnet detection techniques
Zheng et al. Detecting malicious tls network traffic based on communication channel features
Cheng et al. ACER: detecting Shadowsocks server based on active probe technology
Zou et al. A flow classifier with tamper-resistant features and an evaluation of its portability to new domains
Cui et al. CBSeq: A Channel-level Behavior Sequence For Encrypted Malware Traffic Detection
Lee et al. An SSH predictive model using machine learning with web proxy session logs
Zeng et al. Toward identifying malicious encrypted traffic with a causality detection system
AT&T h6.ps
RU183015U1 (ru) Intrusion detection tool
Zhao et al. A novel malware encrypted traffic detection framework based on ensemble learning
Rai et al. Packet-based Anomaly Detection using n-gram Approach
Lee et al. Encrypted malware traffic detection using TLS features and random forest
Battalov et al. Network traffic analyzing algorithms on the basis of machine learning methods
Aafa et al. A survey on network traffic classification techniques

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22931802

Country of ref document: EP

Kind code of ref document: A1