CN110445776A

CN110445776A - A method for constructing an unknown attack feature extraction model based on machine learning

Info

Publication number: CN110445776A
Application number: CN201910692688.7A
Authority: CN
Inventors: 左晓军; 董立勉; 陈泽; 侯波涛; 赵建斌; 刘欣; 常杰; 董娜; 郗波; 王春璞; 刘惠颖; 张君艳; 刘伟娜; 王颖; 郭禹伶; 冯海燕
Original assignee: Electric Power Research Institute of State Grid Hebei Electric Power Co Ltd; State Grid Hebei Energy Technology Service Co Ltd
Current assignee: Electric Power Research Institute of State Grid Hebei Electric Power Co Ltd; State Grid Hebei Energy Technology Service Co Ltd
Priority date: 2019-07-30
Filing date: 2019-07-30
Publication date: 2019-11-12

Abstract

The invention discloses a method for constructing an unknown attack feature extraction model based on machine learning. The method computes features algorithmically and offers high robustness and interpretability. In experiments, the unknown attack features obtained by this method achieved a higher detection success rate for the corresponding attacks than manual detection and common automatic attack feature extraction techniques, and could detect the vast majority of attack behaviors. At the same time, the feature extraction speed, the ease of use of the features, and the implementability of the method itself all reach a relatively high level, which effectively improves the feasibility of the method.

Description

A method for constructing an unknown attack feature extraction model based on machine learning

Technical field

The invention relates to the technical field of machine learning, and specifically to a method for constructing an unknown attack feature extraction model based on machine learning.

Background art

As networks grow in scale, the number of network attacks grows with them. Ensuring the normal and stable operation of network systems has become the foremost topic in network security, and attack detection based on attack features has become the most common detection approach. An attack feature is a summary description of an attack behavior. Normally, an attack feature should be a characteristic unique to the traffic data generated by that attack; with such a feature, an attack can be discovered and identified directly without significantly affecting daily production and life. For an unknown attack behavior, its features must be analyzed and extracted so that this type of attack can later be warned of and defended against. Attack feature extraction is tedious and complicated. Relying on penetration-testing experts to extract attack features is slow and highly subjective, and the validity of the extracted features cannot be established. An efficient technique for automatically extracting attack features is therefore needed.

Existing automatic attack feature extraction techniques fall into network-based and host-based techniques. Network-based techniques use attack information on the network and extract attack features from it algorithmically; host-based techniques make certain changes to the system environment, obtain the relevant attack information on the attacked host, and analyze it to derive features. The two classes of methods each have advantages and disadvantages, to varying degrees, in accuracy, feature extraction speed, ease of use of the features, and the method itself.

Summary of the invention

The purpose of the present invention is to provide a method for constructing an unknown attack feature extraction model based on machine learning, so as to solve the problems raised in the background art above.

To achieve the above purpose, the present invention provides the following technical solution: a method for constructing an unknown attack feature extraction model based on machine learning, comprising the following steps:

Step 1: Collect the relevant data of the unknown attack to be detected and the relevant safe data for comparison;

Step 2: Perform feature pre-extraction on the data to form feature data;

Step 3: Convert any character data in the feature data into numeric form and output it as a numeric matrix;

Step 4: Initialize the size and content of the parameter matrix used in machine learning;

Step 5: Input the attack data into the model and multiply it by the parameter matrix to obtain the predicted value;

Step 6: Calculate the deviation between the predicted value and the actual value to obtain the error value;

Step 7: Determine whether the error value satisfies the condition; if not, update the parameters in the parameter matrix according to the error value and return to step 5; if so, proceed to the next step;

Step 8: Complete the training and output the matrix parameters, then output the unknown attack features according to the absolute values of the parameters.

Preferably, the relevant features in the attack data are pre-extracted and converted in data type to facilitate training of the machine learning model.

Preferably, the machine learning model is used to train the feature parameters on the converted attack data.

Preferably, after training is completed, the weights of the different features are obtained from the feature parameters, and the unknown attack features are then extracted.

Compared with the prior art, the beneficial effects of the present invention are as follows. In this method for constructing an unknown attack feature extraction model based on machine learning, features are computed algorithmically, which gives high robustness (the robustness of computer software is its ability to avoid hanging or crashing under input errors, disk failures, network overload, or deliberate attack) and interpretability. In experiments, the unknown attack features obtained by this method achieved a higher detection success rate for the corresponding attacks than manual detection and common automatic attack feature extraction techniques, and could detect the vast majority of attack behaviors. At the same time, the feature extraction speed, the ease of use of the features, and the implementability of the method itself all reach a relatively high level, which effectively improves the feasibility of the method.

Description of the drawings

FIG. 1 is a flow chart of the unknown attack feature extraction of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Referring to FIG. 1, the present invention provides a technical solution: a method for constructing an unknown attack feature extraction model based on machine learning, comprising the following steps:

Attack data collection: for an attack whose features are unknown, we first collect the relevant attack data. This includes attack data that can be captured on the network, such as attack traffic packets, and attack data on the target host, such as files left behind by the attack and the related log files. Relevant safe data should also be collected so that it can serve as a comparison during the later training experiments.

Data feature pre-extraction: from the collected data we extract the relevant features it contains, for example the length of each message in a traffic packet, the time of the message, sensitive characters in the message, the time of each log entry on the target host, and sensitive information in the logs. The more kinds of features extracted here, the more accurate the final extraction of attack behavior features will be.
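The pre-extraction step can be illustrated with a minimal sketch. The packet layout (a dict with `timestamp` and `payload` keys), the `SENSITIVE_KEYWORDS` list, and the `pre_extract` function are all assumptions made for illustration; the patent does not specify a data format or a keyword list.

```python
# Hypothetical keyword list; the patent does not enumerate sensitive strings.
SENSITIVE_KEYWORDS = ("select", "union", "../", "<script")

def pre_extract(packet):
    """Turn one captured packet (a dict with assumed keys) into a feature dict."""
    payload = packet["payload"]
    lowered = payload.lower()
    # First keyword, in list order, that occurs in the payload; "none" otherwise.
    keyword = next((k for k in SENSITIVE_KEYWORDS if k in lowered), "none")
    return {
        "length": len(payload),                    # message length
        "hour": packet["timestamp"] // 3600 % 24,  # coarse time-of-day feature
        "keyword": keyword,                        # character-valued feature
    }

packets = [
    {"timestamp": 7200, "payload": "GET /index.html"},
    {"timestamp": 10800, "payload": "GET /item?id=1 UNION SELECT password"},
]
features = [pre_extract(p) for p in packets]
```

The `keyword` feature is still a string at this point, which is why a conversion step is needed before training.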

Data conversion: the extracted feature data may contain characters. We label these characters, marking each distinct character with a number in turn, so that different numbers stand in for different characters. This facilitates the later training of the machine learning model.
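A minimal sketch of this labeling, assuming that "marking different characters with numbers in turn" means assigning consecutive integers in order of first appearance. The function name `encode_column` and the sample values are illustrative, not from the patent.

```python
def encode_column(values):
    """Assign 0, 1, 2, ... to distinct strings in order of first appearance."""
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)  # next unused number
        encoded.append(codes[v])
    return encoded, codes

keywords = ["none", "select", "none", "<script"]
encoded, codes = encode_column(keywords)
```

Note that such integer codes impose an arbitrary ordering on the characters, which a linear model will treat as meaningful; the patent does not discuss this trade-off.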

Parameter matrix initialization: suppose m parameters were extracted during the earlier feature pre-extraction and n attack records were extracted in total (m and n each denote some number). We then initialize the parameter matrix with size m*1 and set every parameter in it to 1.
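As a small illustration, the m*1 matrix can be represented as a list of single-element rows. The representation and the name `init_parameters` are assumptions; any matrix type would serve.

```python
def init_parameters(m):
    """Return an m*1 parameter matrix (list of one-element rows), all set to 1."""
    return [[1.0] for _ in range(m)]

params = init_parameters(3)
```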

Prediction: each attack record is a 1*m matrix, and we multiply it by the parameter matrix to obtain a single number. If the number is greater than 0.5, we consider the record to be an attack; if it is less than or equal to 0.5, we consider the record to be safe behavior. The n attack records thus yield n prediction results.
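The multiplication and the 0.5 threshold can be sketched as follows. The function names and the sample records are illustrative assumptions; the feature values are chosen small so that the all-ones initial parameters give scores on both sides of the threshold.

```python
def score(record, params):
    """Multiply a 1*m record by the m*1 parameter matrix, giving one number."""
    return sum(x * row[0] for x, row in zip(record, params))

def classify(record, params, threshold=0.5):
    """Return 1 (attack) if the score exceeds the threshold, else 0 (safe)."""
    return 1 if score(record, params) > threshold else 0

params = [[1.0], [1.0], [1.0]]
records = [[0.2, 0.1, 0.4], [0.1, 0.0, 0.2]]
labels = [classify(r, params) for r in records]
```

A score of exactly 0.5 falls on the safe side, matching the "less than or equal to 0.5" wording.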

Error calculation: after the predictions are computed, we compare the n prediction results with their actual results (attack behavior is recorded as 1, safe behavior as 0) and calculate the difference α between them, which is the error.
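A sketch of the error calculation under two assumptions that the patent leaves open: the difference is taken between the raw score (before thresholding) and the 0/1 label, and the n per-record differences are summarized as a mean absolute error for comparison with the preset value.

```python
def errors(scores, actual):
    """Per-record difference between the predicted value and the actual label."""
    return [s - y for s, y in zip(scores, actual)]

def mean_abs_error(diffs):
    """Assumed aggregate: the patent does not say how n differences are combined."""
    return sum(abs(d) for d in diffs) / len(diffs)

scores = [0.9, 0.2, 0.6]   # illustrative predicted values
actual = [1, 0, 0]         # 1 = attack, 0 = safe
diffs = errors(scores, actual)
overall = mean_abs_error(diffs)
```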

Error judgment: if the error α is greater than a preset value, we update the parameters according to the error: b_i → b_i - kα_i, where b_i denotes the i-th parameter and k denotes the learning rate, and then return to step 5 to compute the predictions again. If the error α is less than the preset value, or the number of parameter updates exceeds a threshold, training ends.
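The loop over steps 5 to 7 can be sketched as below. The patent does not define how the per-parameter error α_i is derived from the per-record differences, so this sketch substitutes the standard delta rule (the mean over records of difference times the i-th feature value); that choice, the mean-absolute-error stopping test, the default values of `k`, `tolerance`, and `max_updates`, and the toy data are all assumptions.

```python
def train(records, actual, k=0.5, tolerance=0.05, max_updates=5000):
    """Repeat predict / measure error / update until the error is small enough."""
    n, m = len(records), len(records[0])
    b = [1.0] * m  # every parameter starts at 1
    for _ in range(max_updates):  # cap on the number of parameter updates
        scores = [sum(x * w for x, w in zip(r, b)) for r in records]
        diffs = [s - y for s, y in zip(scores, actual)]
        if sum(abs(d) for d in diffs) / n < tolerance:
            break  # error below the preset value: training ends
        for i in range(m):
            # Assumed definition of alpha_i (delta rule), not stated in the patent.
            alpha_i = sum(d * r[i] for d, r in zip(diffs, records)) / n
            b[i] -= k * alpha_i  # b_i -> b_i - k * alpha_i
    return b

# Toy data: feature 0 separates attack from safe records, feature 1 is noise.
records = [[1.0, 0.5], [1.0, 0.4], [0.0, 0.5], [0.0, 0.6]]
actual = [1, 1, 0, 0]
b = train(records, actual)
```

On this toy data the parameter for the informative feature ends up with the larger absolute value. With unscaled features such as raw packet lengths, an update of this form may diverge, so some feature scaling would likely be needed in practice.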

Finally, parameter analysis is performed: the trained parameters are taken out and analyzed. A parameter with a large absolute value indicates that its corresponding feature plays a larger role, and a parameter with a small absolute value indicates that its corresponding feature plays a smaller role. From this, the weights of the relevant features in the unknown attack are obtained and the unknown attack features are finally extracted. The above analysis yields the specific method of the present invention.
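Reading features off the trained parameters amounts to sorting by absolute value, as in this sketch. The feature names and the trained values are made-up illustrative numbers, not results from the patent.

```python
def rank_features(names, params):
    """Sort (name, parameter) pairs by absolute parameter value, largest first."""
    return sorted(zip(names, params), key=lambda p: abs(p[1]), reverse=True)

names = ["length", "hour", "keyword"]
trained = [0.12, -0.03, -0.85]  # illustrative values only
ranking = rank_features(names, trained)
```

Here `keyword` ranks first even though its parameter is negative, because only the magnitude is used as the weight.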

Specifically, the relevant features in the attack data are pre-extracted and converted in data type to facilitate training of the machine learning model.

Specifically, the machine learning model is used to train the feature parameters on the converted attack data.

Specifically, after training is completed, the weights of the different features are obtained from the feature parameters, and the unknown attack features are then extracted.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A method for constructing an unknown attack feature extraction model based on machine learning, characterized in that the method comprises the following steps:

Step 1: collecting the relevant data of the unknown attack to be detected and the relevant safe data for comparison;

Step 2: performing feature pre-extraction on the data to form feature data;

Step 3: converting any character data in the feature data into numeric form and outputting it as a numeric matrix;

Step 4: initializing the size and content of the parameter matrix used in machine learning;

Step 5: inputting the attack data into the model and multiplying it by the parameter matrix to obtain the predicted value;

Step 6: calculating the deviation between the predicted value and the actual value to obtain the error value;

Step 7: determining whether the error value satisfies the condition; if not, updating the parameters in the parameter matrix according to the error value and returning to step 5; if so, proceeding to the next step;

Step 8: completing the training and outputting the matrix parameters, then outputting the unknown attack features according to the absolute values of the parameters.

2. The method for constructing an unknown attack feature extraction model based on machine learning according to claim 1, characterized in that the relevant features in the attack data are pre-extracted and converted in data type to facilitate training of the machine learning model.

3. The method for constructing an unknown attack feature extraction model based on machine learning according to claim 1, characterized in that the machine learning model is used to train the feature parameters on the converted attack data.

4. The method for constructing an unknown attack feature extraction model based on machine learning according to claim 1, characterized in that, after training is completed, the weights of the different features are obtained from the feature parameters, and the unknown attack features are then extracted.