CN111885059A - A method for detecting and locating anomaly in industrial network traffic

Info

Publication number: CN111885059A
Application number: CN202010716056.2A
Authority: CN
Inventors: 赵曦滨; 陆犇圆; 高跃
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2020-07-23
Filing date: 2020-07-23
Publication date: 2020-11-03
Anticipated expiration: 2040-07-23
Also published as: CN111885059B

Abstract

The invention discloses an industrial network anomaly detection algorithm comprising the following steps. Step 1: deploy switches at the nodes where industrial network traffic is exchanged. Step 2: read traffic data through a network interface and pass it to a protocol parsing algorithm for real-time layered protocol analysis, extracting network behavior features. Step 3: handle missing features, convert the data features into numeric form, and select a suitable feature combination from the data features for model training. Step 4: establish a network behavior model for each protocol to judge whether an anomaly has occurred. Step 5: use normal traffic data to establish a network behavior model for the protocol features of each layer of the OSI network model, input the anomalous traffic into this model for further analysis, and output the anomaly localization result for the traffic. Step 6: at regular intervals, update the training data of the network behavior model and replace the original model. The anomaly detection algorithm uses machine learning for discrimination, which enables the detection of unknown anomalies and overcomes the inability of traditional methods to identify new types of anomalies.

Description

A method for detecting and locating anomaly in industrial network traffic

Technical field

The invention relates to the field of industrial network security, and in particular to an anomaly detection and localization method based on network traffic analysis.

Background

With the development of network technology, the Internet has become integrated with production across many industries and plays an increasingly important role. However, while the industrial Internet brings convenience to production, it also introduces many security risks and threats. China faces serious and complex cyberattacks from abroad, the threat to the security of industrial control systems has risen markedly, and security incidents caused by malicious programs occur frequently.

Industrial network intrusion detection methods target networks composed of industrial control devices. They are algorithms for detecting behavior that endangers computer system security, such as gathering vulnerability information, causing denial of access, and attempting to gain illegitimate control of a system.

Detection methods fall into two categories according to how detection is performed: misuse detection and anomaly detection. Misuse detection is based on string matching against anomaly signatures: traffic is compared with the known anomaly signatures stored in a database. It achieves high accuracy and a low false positive rate. However, extracting signatures from anomalous data and writing them into the database in a specific format is labor-intensive, and as the database grows the detection time increases significantly, resulting in poor real-time performance.

Anomaly detection abandons database comparison. It builds a model of normal behavior from the feature values of network traffic, possibly in combination with the machine learning algorithms that have emerged in recent years, evaluates the features of the traffic under test against the model, and classifies data that deviates markedly from the normal model as anomalous traffic. This greatly increases detection speed, although it sometimes produces a certain rate of false alarms. Anomaly detection can also detect new anomalies without manual maintenance of a signature library.

However, both approaches have shortcomings. Because misuse detection relies on database comparison, it cannot detect new anomalies that are absent from the signature database. Anomaly detection methods mostly use machine learning algorithms; although they achieve high detection accuracy, they lack interpretability, so they can only report that an anomaly occurred and cannot pinpoint where the anomaly is or what caused it.

Summary of the invention

To overcome the drawbacks of existing industrial network intrusion detection methods, the invention proposes a real-time anomaly detection and localization method that can quickly detect anomalies as the data stream arrives and locate their cause.

The technical solution of the invention is an industrial network anomaly detection algorithm comprising the following steps:

Step 1. Deploy a switch at a node where industrial network traffic is exchanged, capture the data passing through the switch, and mirror it to a network interface of a server;

Step 2. Read the traffic data through the network interface and pass it to a protocol parsing algorithm for real-time layered protocol analysis, extracting network behavior features;

The protocol parsing algorithm parses the traffic data according to the five-layer TCP/IP model and extracts network behavior features from the data link layer, network layer, and transport layer according to the protocol;

When network behavior features are extracted for the differently formatted data of a given application-layer protocol, the behavior feature keywords common to all formats and the representative behavior feature keywords of each format are extracted, and the common and representative keywords together serve as the keywords of that protocol;

Step 3. Handle missing features, convert the data features into numeric form, and select a suitable feature combination from the data features for model training;

Step 4. Establish a network behavior model for each protocol to judge whether an anomaly has occurred; raise an alarm for traffic whose anomaly score exceeds a threshold;

Step 5. Use normal traffic data to establish a network behavior model for the protocol features of each layer of the OSI network model, input the anomalous traffic into this model for further analysis, and output the anomaly localization result for the traffic;

Step 6. At regular intervals, update the training data of the network behavior model and replace the original model;

A time window t is set. Within each time window, the data passes in turn through the foreground detection module and the background training and localization modules; while the detection report is generated, the detection sub-model and localization sub-model for the new time window are trained and replace the original network behavior model.

Further, in step 3, for the COTP protocol, if the 5th position of the payload data is not 2, the destination reference is the number represented by positions 7-8 of the payload data; otherwise the destination reference field is absent and is recorded as 0;

For the URL field of the HTTP protocol, the URL is converted to ASCII codes and handled as a numeric field.

Further, in step 4, the IForest (isolation forest) algorithm is used to establish the network behavior model and to detect subsequent traffic. The anomaly score in the isolation forest of each feature group x of network behavior traffic feature values under test is calculated as follows:

$$s(x, \psi) = 2^{-\frac{E(h(x))}{C(\psi)}}$$

where E(h(x)) is the mean path length of the data feature in the isolation forest, ψ is the number of training samples of a single isolation tree, and C(ψ) is the average path length of a binary isolation tree built from ψ records. C(ψ) is calculated as follows:

$$C(\psi) = 2H(\psi - 1) - \frac{2(\psi - 1)}{\psi}$$

H(ψ-1) = ln(ψ-1) + Euler, where Euler is Euler's constant (approximately 0.5772).

Further, step 5 specifically includes:

Step 5.1. Determine whether traffic data of the same type has the same network behavior feature fields. When the value of a field fluctuates widely, the field is considered anomalous, and the coefficient of variation is used to measure the degree of anomaly of the traffic data:

$$CV = \frac{\sigma}{\mu}, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

where x_i is the i-th data value of the field, N is the number of values, μ is the mean of the field, σ is the standard deviation, and CV is the ratio of the standard deviation to the mean;

Step 5.2. Construct the data matrix, taking the normal data of the preceding 10 minutes as the input matrix A of PCA:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

where m is the number of historical records, n is the number of fields, and each column holds the data of one field;

Step 5.3. Standardize the original input matrix and use the standardized matrix to obtain eigenvalues and eigenvectors;

Because the fields have different value ranges, the original input matrix A must be standardized to obtain the standardized matrix Z:

$$z_{ij} = \frac{a_{ij} - \bar{a}_j}{s_j}$$

where \bar{a}_j and s_j are the mean and standard deviation of column j of A;

Next, compute the covariance matrix Σ of Z:

$$\Sigma = \begin{pmatrix} \mathrm{cov}(Z_1, Z_1) & \cdots & \mathrm{cov}(Z_1, Z_n) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(Z_n, Z_1) & \cdots & \mathrm{cov}(Z_n, Z_n) \end{pmatrix}$$

where Z_i is the i-th column of Z and cov(Z_i, Z_j) = E((Z_i - E(Z_i))(Z_j - E(Z_j))); finally, compute the eigenvalues λ₁, λ₂, …, λ_n and eigenvectors μ₁, μ₂, …, μ_n of the covariance matrix;

Step 5.4. Sort the eigenvalues λ₁, λ₂, …, λ_n in descending order, then compute the cumulative percentage of variance explained by the principal components and keep the first k components such that

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \ge \beta$$

where β is a preset threshold;

Step 5.5. Sort the selected principal components, select the principal components carrying the most information, and take the fields they represent as the candidate set of root-cause fields;

Step 5.6. Define the candidate set of k root-cause fields as the current anomaly pattern and compare it with the known anomaly patterns; if an anomaly pattern of the same type already exists, report that pattern and its details; otherwise, update the anomaly pattern library and report the anomaly.

The beneficial effects of the invention are:

(1) High efficiency: model training is separated from detection. The detection algorithm is trained offline and the resulting model is then used for online traffic detection, which makes anomaly detection efficient. The method can detect anomalies not only in static data but also in data streams.

(2) High accuracy: the anomaly detection algorithm uses machine learning for discrimination, which enables the detection of unknown anomalies and overcomes the inability of traditional methods to identify new types of anomalies.

(3) Interpretability: after anomaly detection is complete, the anomalous data is analyzed further to determine what caused the anomaly, and a detection report is generated. With automated anomaly localization, the location and cause of an anomaly can be seen clearly, and no staff need to be assigned to trace anomaly alerts manually.

(4) Automatic updating: to suit the detection of streaming data, the detection algorithm updates itself automatically as the characteristics of the real-time data change. The updated model replaces the original detection model and is used to detect anomalies in subsequently arriving data.

Description of drawings

Fig. 1 is the system data flow diagram of the method;

Fig. 2 is a flowchart of the per-protocol and per-layer detection of industrial control protocols in the method.

Detailed description

All features disclosed in this specification, and all steps of any method or process disclosed, may be combined in any way, except for mutually exclusive features and/or steps. Unless expressly stated otherwise, any feature disclosed in this specification may be replaced by an alternative feature that is equivalent or serves a similar purpose. That is, unless expressly stated otherwise, each feature is only one example of a series of equivalent or similar features.

Preferred embodiments of the invention are described in further detail below.

As shown in Fig. 1, this embodiment provides an industrial network anomaly detection algorithm comprising the following steps:

Step 1. Deploy a switch at a node where industrial network traffic is exchanged, capture the data passing through the switch, and mirror it to a network interface of a server.

As shown in Fig. 1, the captured data includes the traffic exchanged between the PLC devices and the traffic sent from the host computer to the PLC devices.

Step 2. Read the traffic data through the network interface and pass it to a protocol parsing algorithm for real-time layered protocol analysis, extracting network behavior features.

As shown in Fig. 1, the protocol parsing algorithm parses the traffic data according to the five-layer TCP/IP model and extracts network behavior features from the data link layer, network layer, and transport layer according to the protocol.

Because application-layer protocol formats are complex, when network behavior features are extracted for the differently formatted data of a given application-layer protocol, the network behavior keywords common to all formats and the representative network behavior keywords of each format are extracted, and the common and representative keywords together serve as the keywords of that protocol.

For example, for all HTTP traffic, the common keywords include the source and destination MAC addresses at the link layer, the source and destination IP addresses at the network layer, and the source port, destination port, window size, and so on at the transport layer. Because HTTP traffic consists of request messages and response messages, representative keywords must be extracted for both message types. The representative keywords of HTTP include the HTTP message type, status code, method, and URL.
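As a rough illustration of the layered parsing in step 2, the following Python sketch pulls the common keywords named above (MAC addresses, IP addresses, ports, window size) out of a raw frame. The function name `parse_frame` and the dictionary keys are hypothetical, and the sketch assumes untagged Ethernet II frames carrying IPv4 and TCP; it is not the parser used in the patent.

```python
import struct

def parse_frame(frame: bytes) -> dict:
    """Extract link-, network- and transport-layer fields from one frame."""
    features = {}
    # Data link layer: destination MAC, source MAC, EtherType (14 bytes).
    dst_mac, src_mac, ethertype = struct.unpack("!6s6sH", frame[:14])
    features["dst_mac"] = dst_mac.hex(":")
    features["src_mac"] = src_mac.hex(":")
    if ethertype != 0x0800:          # not IPv4: stop at the link layer
        return features
    # Network layer: IPv4 header; the low nibble of byte 0 is the header
    # length in 32-bit words, byte 9 is the protocol number.
    ip = frame[14:]
    header_len = (ip[0] & 0x0F) * 4
    features["src_ip"] = ".".join(str(b) for b in ip[12:16])
    features["dst_ip"] = ".".join(str(b) for b in ip[16:20])
    if ip[9] != 6:                   # not TCP: stop at the network layer
        return features
    # Transport layer: TCP ports, flags, window size.
    tcp = ip[header_len:]
    src_port, dst_port = struct.unpack("!HH", tcp[:4])
    data_offset = (tcp[12] >> 4) * 4
    features["src_port"] = src_port
    features["dst_port"] = dst_port
    features["tcp_flags"] = tcp[13]
    features["window"] = struct.unpack("!H", tcp[14:16])[0]
    # Whatever follows the TCP header is handed to the application-layer parser.
    features["payload"] = tcp[data_offset:]
    return features
```

The returned `payload` would then go to a protocol-specific parser (HTTP, COTP, Modbus, and so on) that extracts the representative keywords.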

Step 3. Handle missing features, convert the data features into numeric form, and select a suitable feature combination from the data features for model training.

For example, for the COTP protocol, if the 5th position of the payload data is not 2, the destination reference is the number represented by positions 7-8 of the payload data; otherwise the destination reference field is absent and is recorded as 0. The URL field of the HTTP protocol is a string field; to make it usable as algorithm input, it is converted to ASCII codes and handled as a numeric field. Because the number of fields is large, a few main fields can be selected as input to improve efficiency, for example by dropping one of the source IP and source MAC fields, which carry the same information, or by applying a dimensionality reduction algorithm such as PCA or t-SNE for field selection.

As shown in Fig. 1, data processing includes filling in field values that are absent according to the actual distribution of that field's values, for example with 0 or -1, and digitizing the strings among the keywords by hashing or encoding, so that all feature values are expressed numerically.
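The preprocessing rules in step 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the "positions" of the COTP payload are read as 1-indexed bytes of the TCP payload (so position 5 is `payload[4]`), the URL is encoded as a fixed-width list of ASCII codes, missing values are filled with -1, and strings are hashed with CRC32. The function names, the width of 16, and the choice of CRC32 are hypothetical and not taken from the patent.

```python
import zlib

MISSING = -1  # assumed fill value; the text also allows 0

def cotp_destination_reference(payload: bytes) -> int:
    """Return the destination reference, or 0 when the field is absent.

    Assumption: positions are 1-indexed bytes, so position 5 is payload[4]
    and positions 7-8 are payload[6:8].
    """
    if len(payload) >= 8 and payload[4] != 2:
        return int.from_bytes(payload[6:8], "big")
    return 0

def url_to_ascii_codes(url: str, width: int = 16) -> list:
    """Encode a URL as a fixed-width list of ASCII codes, zero-padded."""
    codes = list(url.encode("ascii", errors="replace"))[:width]
    return codes + [0] * (width - len(codes))

def to_numeric(record: dict) -> dict:
    """Fill missing values and hash remaining strings to integers."""
    out = {}
    for key, value in record.items():
        if value is None:
            out[key] = MISSING
        elif isinstance(value, str):
            # CRC32 is stable across runs, unlike Python's built-in hash().
            out[key] = zlib.crc32(value.encode("utf-8"))
        else:
            out[key] = value
    return out
```

A stable hash matters here because the model is retrained and reused across processes; a per-process randomized hash would map the same string to different numbers.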

Step 4. Establish a network behavior model for each protocol to judge whether an anomaly has occurred; raise an alarm for traffic whose anomaly score exceeds a threshold.

As shown in Fig. 2, the IForest (isolation forest) algorithm is used to establish the network behavior model and to detect subsequent traffic.

The detection method is to calculate the anomaly score in the isolation forest of each feature group x of network behavior traffic feature values under test, using the following formula:

$$s(x, \psi) = 2^{-\frac{E(h(x))}{C(\psi)}}$$

where E(h(x)) is the mean path length of x in the isolation forest, ψ is the number of training samples of a single isolation tree, and C(ψ) is the average path length of a binary isolation tree built from ψ records.

For example, for HTTP traffic, historical sample data is selected, and fields such as the source MAC, destination MAC, source port, destination port, TCP flag bits, HTTP protocol type, and status code are chosen as the input of the IForest algorithm. A threshold for the anomaly score is preset according to the results on the samples and is then applied to the detection of subsequent streaming data.
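The scoring step can be sketched in Python as below. The sketch assumes the standard isolation forest score, 2^(-E(h(x)) / C(ψ)) with C(ψ) = 2H(ψ-1) - 2(ψ-1)/ψ and H(i) = ln(i) + Euler's constant, which matches the symbols defined in this document. It only computes the score from a mean path length that is assumed to come from an already trained forest; the function names and the example threshold are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant

def average_path_length(psi: int) -> float:
    """C(psi): average path length of an isolation tree built from psi samples."""
    if psi < 2:
        raise ValueError("psi must be at least 2")
    if psi == 2:
        return 1.0
    harmonic = math.log(psi - 1) + EULER_GAMMA      # H(psi - 1)
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi

def anomaly_score(mean_path_length: float, psi: int) -> float:
    """s(x, psi) = 2 ** (-E(h(x)) / C(psi)); values near 1 indicate anomalies."""
    return 2.0 ** (-mean_path_length / average_path_length(psi))

def is_anomalous(mean_path_length: float, psi: int, threshold: float) -> bool:
    """Alarm when the score exceeds the preset threshold (threshold is a tuning choice)."""
    return anomaly_score(mean_path_length, psi) > threshold
```

Short paths (points isolated after few splits) give scores close to 1, while a point whose mean path length equals C(ψ) scores exactly 0.5, which is why thresholds are typically chosen above 0.5.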

Step 5. Use normal traffic data to establish a network behavior model for the protocol features of each layer of the OSI network model, input the anomalous traffic into this model for further analysis, and output the anomaly localization result for the traffic.

Step 5.1. Determine whether traffic data of the same type, such as HTTP packets or Modbus packets, has the same network behavior feature fields; the values of these fields are normally relatively stable. If the value of a field fluctuates widely, the field is considered anomalous. The coefficient of variation is used to measure the degree of anomaly of the traffic data:

$$CV = \frac{\sigma}{\mu}, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

where x_i is the i-th data value of the field, N is the number of values, μ is the mean of the field, σ is the standard deviation, and CV is the ratio of the standard deviation to the mean.
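A minimal Python sketch of the coefficient of variation for one field over a time window follows. It assumes the population standard deviation (dividing by the number of values) and takes the absolute value of the mean so that the result is non-negative; both choices, and the function name, are assumptions rather than details stated in the patent.

```python
import math
import statistics

def coefficient_of_variation(values) -> float:
    """CV = standard deviation / mean for one field's values in a window."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)   # population standard deviation (assumed)
    if mu == 0:
        # Undefined ratio: treat any spread around a zero mean as unbounded.
        return math.inf if sigma > 0 else 0.0
    return sigma / abs(mu)
```

For instance, a destination port that is always 80 gives CV = 0, while a port that jumps between many values in the same window gives a large CV and flags the field for further localization.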

A large coefficient of variation CV indicates that the packets were anomalous during that time period, and further localization analysis is performed. In multivariate statistical analysis, when there are more than 10 variables, most of the discarded dimensions are correlated with the others and therefore redundant, so principal component analysis (PCA) is used to reduce the n-dimensional data to p principal components (p < n) that still represent the information in the original data. This reduces the data dimensionality and narrows the scope of localization.

Step 5.2. Construct the data matrix, taking the normal data of the preceding 10 minutes as the input matrix A of PCA:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

where m is the number of historical records, n is the number of fields, and each column holds the data of one field.

Step 5.3. Standardize the original input matrix and use the standardized matrix to obtain eigenvalues and eigenvectors.

Because the fields have different value ranges, the original input matrix A must be standardized to obtain the standardized matrix Z:

$$z_{ij} = \frac{a_{ij} - \bar{a}_j}{s_j}$$

where \bar{a}_j and s_j are the mean and standard deviation of column j of A.

Next, compute the covariance matrix Σ of Z:

$$\Sigma = \begin{pmatrix} \mathrm{cov}(Z_1, Z_1) & \cdots & \mathrm{cov}(Z_1, Z_n) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(Z_n, Z_1) & \cdots & \mathrm{cov}(Z_n, Z_n) \end{pmatrix}$$

where Z_i is the i-th column of Z and cov(Z_i, Z_j) = E((Z_i - E(Z_i))(Z_j - E(Z_j))). Finally, the eigenvalues λ₁, λ₂, …, λ_n and eigenvectors μ₁, μ₂, …, μ_n of the covariance matrix are computed.
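The standardization and covariance computation of step 5.3 can be sketched in plain Python as follows. The sketch assumes column-wise z-scores with the population standard deviation and a covariance normalized by the number of rows m; constant columns are left at zero instead of dividing by zero. The function names are hypothetical, and the eigendecomposition itself is left to a numerical library.

```python
import statistics

def standardize(A):
    """Column-wise z-scores: z_ij = (a_ij - mean_j) / std_j."""
    columns = list(zip(*A))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero spread; use 1.0 so its z-scores are all 0.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(row[j] - means[j]) / stds[j] for j in range(len(columns))]
            for row in A]

def covariance_matrix(Z):
    """n x n matrix with entries cov(Z_i, Z_j) = E((Z_i - E Z_i)(Z_j - E Z_j))."""
    m = len(Z)
    columns = list(zip(*Z))
    means = [statistics.fmean(col) for col in columns]
    n = len(columns)
    return [[sum((columns[i][r] - means[i]) * (columns[j][r] - means[j])
                 for r in range(m)) / m
             for j in range(n)]
            for i in range(n)]
```

Because Z is already standardized, the resulting matrix is the correlation matrix of the original fields, so fields with large raw ranges (such as port numbers) do not dominate the principal components.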

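Step 5.4. Sort the eigenvalues λ₁, λ₂, …, λ_n in descending order, then compute the cumulative percentage of variance explained by the principal components and keep the first k components such that

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \ge \beta$$

where β is a preset threshold.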
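Selecting the number of principal components in step 5.4 can be sketched as below, assuming the usual cumulative explained-variance criterion: keep the smallest k whose leading eigenvalues account for at least a fraction β of the total. The function name is hypothetical.

```python
def select_component_count(eigenvalues, beta: float) -> int:
    """Smallest k such that the k largest eigenvalues explain at least beta of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= beta:
            return k
    return len(ordered)
```

With eigenvalues 3.0, 1.0, 0.5, 0.5 the first component explains 60% and the first two explain 80%, so a threshold of 0.75 keeps two components.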

A principal component is in fact a linear combination of the original dimensions, and its coefficient vector is the corresponding eigenvector. Each coefficient expresses the correlation between the principal component and an original data dimension: the larger the coefficient, the more that dimension contributes to the principal component, and the field corresponding to that dimension is the main field causing the anomaly.

Step 5.5. Sort the selected principal components, select the principal components carrying the most information, and take the fields they represent as the candidate set of root-cause fields.

The anomaly localization algorithm first takes the k principal components p₁, p₂, …, p_k and sorts them by their eigenvalues λ₁, λ₂, …, λ_k in descending order; the higher a component ranks, the more significant it is. Next, for each principal component in turn, it takes the largest coefficient and the field corresponding to it and computes a weight factor for that field, yielding k fields with their weight factors. Finally, the k selected fields are sorted by weight factor in descending order and serve as the root-cause fields for anomaly localization.
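The localization procedure above might look like the following Python sketch. The patent does not define the weight factor, so the sketch assumes one plausible choice: the component's share of the total variance multiplied by the absolute value of its largest coefficient. The eigenpairs are assumed to have been computed elsewhere, and the function and field names are hypothetical.

```python
def rank_root_cause_fields(eigenpairs, field_names, k):
    """Return (field, weight) pairs, heaviest first.

    eigenpairs: list of (eigenvalue, eigenvector) for the covariance matrix.
    Weight factor (assumed): eigenvalue share * |largest coefficient|.
    """
    total = sum(lam for lam, _ in eigenpairs)
    # Keep the k most significant principal components.
    top = sorted(eigenpairs, key=lambda pair: pair[0], reverse=True)[:k]
    weights = {}
    for lam, vector in top:
        # Field with the largest absolute coefficient in this component.
        j = max(range(len(vector)), key=lambda idx: abs(vector[idx]))
        weight = (lam / total) * abs(vector[j])
        name = field_names[j]
        # If two components point at the same field, keep the larger weight.
        weights[name] = max(weights.get(name, 0.0), weight)
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)
```

If two components select the same field, the result contains fewer than k fields; how such ties should be handled is not specified in the text.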

Step 5.6. Define the candidate set of k root-cause fields as the current anomaly pattern and compare it with the known anomaly patterns. If an anomaly pattern of the same type already exists, report that pattern and its details; otherwise, update the anomaly pattern library and report the anomaly.
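A minimal sketch of the pattern-library lookup in step 5.6 follows. It assumes that two anomaly patterns are "of the same type" when they consist of the same set of root-cause fields, and it stores only an occurrence count as the pattern's details; both are simplifying assumptions, and the function name is hypothetical.

```python
def report_anomaly(root_cause_fields, pattern_library: dict) -> dict:
    """Match the current pattern against the library, adding it if it is new."""
    pattern = frozenset(root_cause_fields)   # field order does not matter
    known = pattern in pattern_library
    if known:
        pattern_library[pattern]["count"] += 1
    else:
        pattern_library[pattern] = {"count": 1}
    return {
        "known": known,
        "fields": sorted(pattern),
        "details": dict(pattern_library[pattern]),
    }
```

In a real deployment the details would likely carry more context, such as timestamps or the affected hosts, so that a recurring pattern can be reported together with its history.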

Step 6. At regular intervals, update the training data of the network behavior model and replace the original model.
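The periodic replacement in step 6 can be illustrated with the following Python sketch. It assumes a caller-supplied `train` function that turns the records buffered during one time window into a callable model, and it swaps the model in when the window elapses. The class name, the timestamp-driven design, and retraining on all buffered records are assumptions made for illustration.

```python
class WindowedDetector:
    """Scores records with the current model and retrains once per time window."""

    def __init__(self, train, window_seconds: float):
        self.train = train              # hypothetical: list of records -> model callable
        self.window = window_seconds
        self.model = None               # no model until the first window completes
        self.buffer = []
        self.window_start = None

    def process(self, timestamp: float, features):
        if self.window_start is None:
            self.window_start = timestamp
        # When the window has elapsed, train on its data and replace the old model.
        if timestamp - self.window_start >= self.window and self.buffer:
            self.model = self.train(self.buffer)
            self.buffer = []
            self.window_start = timestamp
        self.buffer.append(features)
        # Detection result for this record, or None while no model exists yet.
        return self.model(features) if self.model is not None else None
```

In practice training would run in the background so that detection is not blocked, in line with the separation of foreground detection and background training described in this document.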

Although the invention has been disclosed in detail with reference to the accompanying drawings, it should be understood that these descriptions are merely exemplary and are not intended to limit the application of the invention. The scope of protection of the invention is defined by the appended claims and may include various variations, modifications, and equivalents made to the invention without departing from its scope and spirit.

Claims

1. An industrial network anomaly detection algorithm, comprising the steps of:

step 1, deploying a switch at a node where industrial network traffic is exchanged, capturing the data passing through the switch, and mirroring the data to a network interface of a server;

step 2, reading the traffic data through the network interface, passing the traffic data to a protocol parsing algorithm for real-time layered protocol analysis, and extracting network behavior features;

the protocol parsing algorithm comprises parsing the traffic data according to the five-layer TCP/IP model and extracting network behavior features from the data link layer, the network layer, and the transport layer according to the protocol;

when network behavior features are extracted, for data in the different formats of a given application-layer protocol, the behavior feature keywords common to all formats and the representative behavior feature keywords of each format are extracted, and the common keywords and the representative keywords are used as the keywords of that protocol;

step 3, handling missing features, converting the data features into numeric form, and selecting a suitable feature combination from the data features for model training;

step 4, establishing a network behavior model for each protocol to judge whether an anomaly has occurred; raising an alarm for traffic whose anomaly score is greater than a threshold;

step 5, establishing a network behavior model for the protocol features of each layer of the OSI network model by using normal traffic data, inputting anomalous traffic into the network behavior model for further anomaly analysis, and outputting the anomaly localization result for the traffic;

step 6, updating the training data of the network behavior model at regular intervals and replacing the original model;

and setting a time window t, passing the data in each time window in turn through a foreground detection module and background training and localization modules, generating a detection report while completing the training of the detection sub-model and the localization sub-model corresponding to the new time window, and replacing the original network behavior model.

2. The industrial network anomaly detection algorithm of claim 1, wherein:

in step 3, for the COTP protocol, if the 5th position of the payload data is not 2, the destination reference is the number represented by positions 7-8 of the payload data; otherwise, the destination reference field is absent and is recorded as 0;

for the URL field of the HTTP protocol, the URL is converted into ASCII codes and handled as a numeric field.

3. The industrial network anomaly detection algorithm of claim 1, wherein:

in step 4, a network behavior model is established by using the IForest (isolation forest) algorithm, subsequent traffic is detected, and the anomaly score in the isolation forest of each feature group x of network behavior traffic feature values under test is calculated as follows:

$$s(x, \psi) = 2^{-\frac{E(h(x))}{C(\psi)}}$$

wherein E(h(x)) represents the mean path length of the data feature in the isolation forest, ψ represents the number of training samples of a single isolation tree, C(ψ) represents the average path length of a binary isolation tree built from ψ records, and C(ψ) is calculated as follows:

$$C(\psi) = 2H(\psi - 1) - \frac{2(\psi - 1)}{\psi}$$

H(ψ-1) = ln(ψ-1) + Euler, Euler being Euler's constant.

4. The industrial network anomaly detection algorithm of claim 1, wherein: the step 5 specifically comprises the following steps:

step 5.1, determining whether traffic data of the same type has the same network behavior feature fields; when the value of a field fluctuates widely, considering the field anomalous, and measuring the degree of anomaly of the traffic data by the coefficient of variation:

$$CV = \frac{\sigma}{\mu}, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

wherein x_i is the i-th data value of the field, N is the number of values, μ is the mean of the field, σ is the standard deviation, and CV is the ratio of the standard deviation to the mean;

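step 5.2, constructing the data matrix, taking the normal data of the preceding 10 minutes as the input matrix A of PCA:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

wherein m is the number of historical records, n is the number of fields, and each column holds the data of one field;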

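step 5.3, standardizing the original input matrix and obtaining eigenvalues and eigenvectors from the standardized matrix;

because the fields have different value ranges, the original input matrix A needs to be standardized to obtain the standardized matrix Z:

$$z_{ij} = \frac{a_{ij} - \bar{a}_j}{s_j}$$

wherein \bar{a}_j and s_j are the mean and standard deviation of column j of A;

next, the covariance matrix Σ of the matrix Z is obtained:

$$\Sigma = \begin{pmatrix} \mathrm{cov}(Z_1, Z_1) & \cdots & \mathrm{cov}(Z_1, Z_n) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(Z_n, Z_1) & \cdots & \mathrm{cov}(Z_n, Z_n) \end{pmatrix}$$

wherein Z_i is the i-th column of Z and cov(Z_i, Z_j) = E((Z_i - E(Z_i))(Z_j - E(Z_j))); finally, the eigenvalues λ₁, λ₂, …, λ_n and eigenvectors μ₁, μ₂, …, μ_n of the covariance matrix are obtained;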

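step 5.4, sorting the eigenvalues λ₁, λ₂, …, λ_n in descending order, then calculating the cumulative percentage of variance explained by the principal components and keeping the first k components such that

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \ge \beta$$

wherein β is a preset threshold;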

step 5.5, sorting the selected principal components, selecting the principal components carrying the most information, and taking the fields they represent as the candidate set of root-cause fields;

step 5.6, defining the candidate set of k root-cause fields as the current anomaly pattern, and comparing the current anomaly pattern with the known anomaly patterns; if an anomaly pattern of the same type exists, reporting that anomaly pattern and its details; otherwise, updating the anomaly pattern library and reporting the anomaly.