CN107395640B

CN107395640B - Intrusion detection system and method based on division and characteristic change

Info

Publication number: CN107395640B
Application number: CN201710760156.3A
Authority: CN
Inventors: 郭华平; 周俊; 杨乐; 邬长安; 祁传达
Original assignee: Xinyang Normal University
Current assignee: Xinyang Normal University
Priority date: 2017-08-30
Filing date: 2017-08-30
Publication date: 2020-05-12
Anticipated expiration: 2037-08-30
Also published as: CN107395640A

Abstract

The invention discloses an intrusion detection system and method based on division and characteristic change. In the training stage, a training set of normal data packets is divided into a plurality of clusters by a K-means clustering method, and each cluster is combined with the training set of network intrusion data packets to form a plurality of new training sets; for each training set D_i, a feature transformation matrix Q_i is learned, and a k-nearest neighbor model M_i is learned in the space defined by Q_i. In the prediction stage, the learned K-means model is used to select, for a data packet to be predicted, the k-nearest neighbor prediction model in the corresponding space, and that model predicts whether the data packet is an intrusion data packet. The invention can effectively determine whether a data packet is an intrusion packet and can maintain high accuracy on both normal and intrusion samples, and therefore has wide engineering application value.

Description

Intrusion detection system and method based on division and characteristic change

Technical Field

The invention belongs to the technical field of data analysis, and relates to an intrusion detection system and method based on division and characteristic change.

Background

The appearance and wide application of the network bring convenience to the life and work of people, but also bring many safety problems, and various viruses, bugs and attacks cause huge loss to the society. How to protect information from being attacked and leaked and maintain the integrity, availability and confidentiality of the information is a focus of current research.

In view of the current state of network security, measures such as access control, data encryption, identity authentication, firewalls and intrusion detection are mainly adopted to secure networks and information systems. Intrusion detection collects information from operating systems, system programs, application programs, network data packets and the like, and discovers behaviors in the monitored system or network that violate security policies or endanger system security; it is an effective means of securing systems and networks.

The machine learning method simulates the learning activities of human beings by using a computer, researches how to learn the existing knowledge through the computer, discovers new knowledge, and improves the learning effect through continuous improvement. The machine learning includes a large number of data preprocessing and classification methods, and is related to subjects such as statistics, artificial intelligence, information theory and the like. The basic process is to further classify or predict unknown samples by learning and constructing a learning machine from the existing experience.

Network intrusion samples form the minority class: their proportion is about 0.001, and most data packets are normal communication packets. With a traditional classification method, prediction accuracy is high even if every communication packet is judged to be normal. Such accuracy, however, has no practical value for identifying intrusion samples, which belong to the minority class; correctly identifying intrusion samples is the actual goal. In the traditional k-NN algorithm, the difference between samples is usually measured by a simple Euclidean distance, but because the features of a sample carry different weights, this measurement is inaccurate and impairs the detection of intrusion samples. A classification method designed for intrusion detection is therefore needed.
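The imbalance argument above can be made concrete with a small calculation. The following is a minimal sketch, assuming a hypothetical set of 10,000 packets with the stated intrusion ratio of 0.001 and a trivial classifier that labels every packet as normal; the function name `accuracy_and_recall` and the label encoding (1 for intrusion, 0 for normal) are illustrative choices, not part of the source.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (overall accuracy, recall on the positive class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    positives = sum(t == positive for t in y_true)
    recall = true_pos / positives if positives else 0.0
    return correct / len(y_true), recall

# Hypothetical data: 10 intrusion packets among 10,000 (ratio 0.001).
y_true = [1] * 10 + [0] * 9990
# Trivial classifier: every packet is judged to be normal.
y_pred = [0] * 10000

accuracy, intrusion_recall = accuracy_and_recall(y_true, y_pred)
print(accuracy, intrusion_recall)
```

The accuracy is 0.999 while not a single intrusion packet is found, which is why overall accuracy alone is a poor target for this task.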

Disclosure of Invention

The invention aims to provide an intrusion detection system and method based on division and characteristic change. The method can effectively determine whether a sample is an intrusion sample while keeping high accuracy on both normal and intrusion samples.

The specific technical scheme is as follows:

an intrusion detection system based on division and characteristic change comprises a data acquisition module, a learning module and a prediction module,

the data acquisition module: inputting network data packet data as basic training data for learning an intrusion detection model based on division and feature transformation;

the learning module: dividing normal data packets into a plurality of clusters by using a K-means method, and combining the data packets of each cluster and an intrusion packet into a new training set for training a characteristic transformation matrix and a corresponding K-nearest neighbor intrusion detection model;

the prediction module: given a prediction packet sample x, a K-means method is used to project the x into a corresponding feature transformation space, and a corresponding K-nearest neighbor intrusion detection model is used to predict the class of the sample x.
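The partitioning performed by the learning module and the routing performed by the prediction module can be sketched as follows. This is an illustrative sketch only: it uses a plain Lloyd's K-means with a simple deterministic initialization (an assumption, since the source does not state how centroids are initialized), toy two-dimensional data, and hypothetical names (`kmeans`, `build_training_sets`, `nearest_centroid`). Labels 0 and 1 stand for normal and intrusion packets.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nearest_centroid(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], x))

def kmeans(points, K, max_iter=100):
    """Plain Lloyd's K-means. Initial centroids are evenly spaced picks
    from the input list (an assumption made for reproducibility)."""
    centroids = [list(points[i * len(points) // K]) for i in range(K)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(K)]
        for p in points:
            clusters[nearest_centroid(centroids, p)].append(p)
        updated = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[k]
            for k, c in enumerate(clusters)
        ]
        if updated == centroids:  # assignments no longer change
            break
        centroids = updated
    return centroids

def build_training_sets(D_maj, D_min, K):
    """Split the normal set into K clusters and merge each cluster
    with the full intrusion set, giving K labelled training sets."""
    centroids = kmeans(D_maj, K)
    clusters = [[] for _ in range(K)]
    for p in D_maj:
        clusters[nearest_centroid(centroids, p)].append(p)
    training_sets = [
        [(p, 0) for p in cluster] + [(q, 1) for q in D_min]
        for cluster in clusters
    ]
    return centroids, training_sets

# Toy data: two groups of normal packets and one intrusion packet.
D_maj = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
D_min = [[5.0, 5.0]]
centroids, training_sets = build_training_sets(D_maj, D_min, K=2)

# Prediction-time routing: pick the model whose cluster centre is nearest.
k = nearest_centroid(centroids, [0.5, 0.5])
```

Each resulting training set pairs one region of normal traffic with all intrusion samples, so the imbalance inside each set is lower than in the full data.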

An intrusion detection method based on partitioning and feature variation comprises the following steps:

step 1, establishing an attribute table of a sample; acquiring a training sample, and processing the training sample according to the attribute table;

step 2, dividing the majority-class training set D_maj with a K-means clustering algorithm to obtain clusters D_{maj,1}, D_{maj,2}, ..., D_{maj,K};

Step 3, combining each of the clusters D_{maj,1}, D_{maj,2}, ..., D_{maj,K} with the minority-class training set D_min to obtain new training sets D_1, D_2, ..., D_K;

Step 4, learning a transformation matrix Q_k from each new training set D_k, specifically: in D_k, for a sample point x_i, the probability that another sample point x_j influences its classification result is:

With maximization of the leave-one-out (LOO) accuracy as the goal, the probability that x_i is correctly classified by all samples other than itself is:

where Ω_i is the index set of the samples that belong to the same class as x_i.

The optimization goal is then:

the gradient is as follows:

The corresponding feature transformation matrix Q_k is obtained by solving the above objective with a gradient descent method.

Step 5, processing the sample x to be classified according to the attribute table of step 1, transforming the training set corresponding to x with the transformation matrix Q_k obtained in step 4, and classifying x with the k-NN algorithm in the transformed feature space, thereby determining whether x is a normal sample or an intrusion sample.
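The formulas that Step 4 refers to are not reproduced in this text. The wording (a pairwise influence probability, a leave-one-out objective and its gradient) matches Neighbourhood Components Analysis. Assuming that formulation, and assuming the mapping x ↦ Q_k^T x used in the prediction step, the four expressions would read as below. This is a reconstruction under those assumptions, not the patent's verbatim notation; the summation index l and the shorthand x_{ij} are introduced here for illustration.

```latex
% Assumed NCA-style reconstruction of the Step 4 formulas.
\begin{align}
p_{ij} &= \frac{\exp\!\left(-\lVert Q_k^{T}x_i - Q_k^{T}x_j\rVert^{2}\right)}
               {\sum_{l \neq i}\exp\!\left(-\lVert Q_k^{T}x_i - Q_k^{T}x_l\rVert^{2}\right)},
  \qquad p_{ii} = 0 \\
p_i &= \sum_{j \in \Omega_i} p_{ij} \\
f(Q_k) &= \sum_{i=1}^{|D_k|} p_i
        = \sum_{i=1}^{|D_k|}\sum_{j \in \Omega_i} p_{ij} \\
\frac{\partial f}{\partial Q_k}
  &= 2\sum_{i=1}^{|D_k|}\Bigl(p_i\sum_{l \neq i} p_{il}\,x_{il}x_{il}^{T}
     - \sum_{j \in \Omega_i} p_{ij}\,x_{ij}x_{ij}^{T}\Bigr)Q_k,
  \qquad x_{ij} = x_i - x_j
\end{align}
```

Under this reading f(Q_k) is maximized, so the "gradient descent" of Step 4 would either step along the gradient of f or descend on -f.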

Preferably, the attribute table divides the types of the data packet variables into numerical values and discrete values.

Preferably, the normal data packets are divided into K clusters.

Preferably, the samples are mapped into the optimized space before the prediction samples are predicted using the k-NN algorithm.

Still further, x is mapped to the corresponding feature transform space and the class of the sample is predicted using k neighbors with the function:

where v is a class label, I(true) = 1, I(false) = 0, Q_i is the learned space transformation matrix, Q_i^T x denotes mapping x into the feature space defined by Q_i, and D_z denotes the set of samples x_i found for x in the feature space defined by Q_i by searching with the Euclidean distance, where the Euclidean distance formula is as follows:
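Neither the prediction function nor the distance formula survives in this text. Under the definitions just given, together with y_i (the true class label of x_i), x_{id} (the value of x_i in dimension d) and n (the feature dimension) as defined in claim 5, a consistent reading is the majority vote and distance below. The function name h is an assumption, and the distance is understood to be applied to samples after the transformation by Q_i.

```latex
% Assumed reconstruction: majority vote over the neighbours in D_z,
% and the usual Euclidean distance over n feature dimensions.
\begin{align}
h(x) &= \arg\max_{v} \sum_{(x_i,\,y_i) \in D_z} I(y_i = v) \\
d(x_i, x_j) &= \sqrt{\sum_{d=1}^{n} \left(x_{id} - x_{jd}\right)^{2}}
\end{align}
```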

Further, the sample attribute table includes: connection duration; protocol type; network service type of the target host; normal or error status of the connection; number of bytes from the source host to the target host; number of bytes from the target host to the source host; whether the connection is from/to the same host and port; number of erroneous fragments; number of urgent packets; number of accesses to sensitive areas; number of failed login attempts; number of successful logins; number of root-user accesses; number of file creation operations; number of shell uses; number of accesses to control files; whether the login belongs to the "hot" list; whether the login is a guest login; number of connections with the same target host as the current connection; number of connections with the same service as the current connection; number of connections, among the previous 100 connections, with the same target host as the current connection; percentage of connections with SYN errors among the previous 100 connections with the same target host as the current connection; and percentage of connections with REJ errors among the previous 100 connections with the same service as the current connection.
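An attribute table that separates numerical from discrete attributes can drive the conversion of a packet record into a feature vector. The sketch below is one possible reading, assuming min-max normalization for numerical attributes and one-hot encoding for discrete ones; the source only states that the data are normalized and that attributes are typed, so both choices are assumptions. The three field names are hypothetical stand-ins for entries of the list above.

```python
# Hypothetical excerpt of an attribute table: (name, value type).
ATTRIBUTE_TABLE = [
    ("duration", "numeric"),
    ("protocol_type", "discrete"),
    ("src_bytes", "numeric"),
]

def fit_encoder(records):
    """Collect per-attribute statistics from the training records:
    (min, max) for numerical attributes, sorted categories for discrete ones."""
    encoder = {}
    for name, kind in ATTRIBUTE_TABLE:
        values = [r[name] for r in records]
        if kind == "numeric":
            encoder[name] = (min(values), max(values))
        else:
            encoder[name] = sorted(set(values))
    return encoder

def encode(record, encoder):
    """Turn one record into a flat feature vector."""
    vector = []
    for name, kind in ATTRIBUTE_TABLE:
        if kind == "numeric":
            lo, hi = encoder[name]
            # Min-max normalization; constant attributes map to 0.0.
            vector.append(0.0 if hi == lo else (record[name] - lo) / (hi - lo))
        else:
            # One-hot encoding over the categories seen in training.
            vector.extend(1.0 if record[name] == c else 0.0 for c in encoder[name])
    return vector

training_records = [
    {"duration": 0, "protocol_type": "tcp", "src_bytes": 100},
    {"duration": 10, "protocol_type": "udp", "src_bytes": 300},
]
encoder = fit_encoder(training_records)
features = encode({"duration": 5, "protocol_type": "udp", "src_bytes": 200}, encoder)
```

The same fitted encoder would be applied to the sample to be classified in step 5, so that training and prediction share one feature space.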

Compared with the prior art, the invention has the beneficial effects that:

the invention can effectively analyze whether the data packet sample belongs to the intrusion sample, and can keep high accuracy in predicting normal and intrusion samples.

Drawings

FIG. 1 is a flow chart illustrating an intrusion detection method based on partitioning and feature variation according to the present invention;

FIG. 2 is a schematic diagram of clustering the majority class by using a K-means clustering algorithm;

FIG. 3 is a schematic diagram of the classification of the test sample x in the optimized space by using the k-NN algorithm.

Detailed Description

The technical solution of the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

As shown in fig. 1, an intrusion detection method based on partitioning and feature transformation includes the following steps:

step 1, capturing data packets from a network with a data acquisition module as the basic training set D for the model, dividing them into a normal data packet training set D_maj and an intrusion data packet training set D_min, preprocessing the data (for example, checking the rationality and integrity of the selected data, supplementing and correcting missing and abnormal data, and performing normalization), and establishing an attribute table of the sample, wherein the value types of the attributes are divided into numerical values and discrete values;

step 2, dividing the normal data packet training set D_maj with a K-means clustering algorithm to obtain clusters D_{maj,1}, D_{maj,2}, ..., D_{maj,K};

Step 3, combining each of the clusters D_{maj,1}, D_{maj,2}, ..., D_{maj,K} with the intrusion data packet training set D_min to obtain new training sets D_1, D_2, ..., D_K;

The optimization goal is then:

the gradient is as follows:

Step 5, processing the sample x to be classified according to the sample attribute table of step 1, transforming the training set corresponding to x with the transformation matrix Q_k obtained in step 4, and classifying x with the k-NN algorithm in the transformed feature space, thereby determining whether x is a normal sample or an intrusion sample.
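Learning a transformation matrix and then classifying in the transformed space can be sketched as follows. The sketch assumes the Neighbourhood Components Analysis reading of the method (softmax neighbour probabilities over squared distances in the space x ↦ Q^T x), which the description suggests but does not spell out in this text. It replaces the plain gradient method with gradient ascent plus step halving so that the objective never decreases; that safeguard, the identity initialization, the toy data and all function names are assumptions made for illustration.

```python
import math

def project(Q, x):
    """Map x into the transformed space: Q^T x (Q stored as a list of rows)."""
    return [sum(Q[r][c] * x[r] for r in range(len(x))) for c in range(len(Q[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def neighbour_probs(Q, X):
    """p[i][j]: softmax of negative squared distances after projection; p[i][i] = 0."""
    Z = [project(Q, x) for x in X]
    P = []
    for i, zi in enumerate(Z):
        d = [sq_dist(zi, zj) for zj in Z]
        shift = min(d[j] for j in range(len(Z)) if j != i)  # numerical stability
        w = [0.0 if j == i else math.exp(shift - d[j]) for j in range(len(Z))]
        total = sum(w)
        P.append([v / total for v in w])
    return P

def objective(Q, X, y):
    """Expected number of samples classified correctly under leave-one-out."""
    P = neighbour_probs(Q, X)
    n = len(X)
    return sum(P[i][j] for i in range(n) for j in range(n) if y[i] == y[j])

def gradient(Q, X, y):
    """Gradient of the objective with respect to Q."""
    n, dim = len(X), len(Q)
    P = neighbour_probs(Q, X)
    S = [[0.0] * dim for _ in range(dim)]
    for i in range(n):
        p_i = sum(P[i][j] for j in range(n) if y[i] == y[j])
        for j in range(n):
            coeff = P[i][j] * (p_i - (1.0 if y[i] == y[j] else 0.0))
            diff = [a - b for a, b in zip(X[i], X[j])]
            for r in range(dim):
                for c in range(dim):
                    S[r][c] += coeff * diff[r] * diff[c]
    return [[2.0 * sum(S[r][t] * Q[t][c] for t in range(dim))
             for c in range(len(Q[0]))] for r in range(dim)]

def learn_transform(X, y, steps=200, lr=0.5):
    """Gradient ascent from the identity; the step is halved whenever it fails to improve."""
    dim = len(X[0])
    Q = [[float(r == c) for c in range(dim)] for r in range(dim)]
    best = objective(Q, X, y)
    for _ in range(steps):
        G = gradient(Q, X, y)
        trial = [[Q[r][c] + lr * G[r][c] for c in range(dim)] for r in range(dim)]
        value = objective(trial, X, y)
        if value > best:
            Q, best = trial, value
        else:
            lr *= 0.5
    return Q

def knn_predict(Q, X, y, x, k=3):
    """Majority vote among the k nearest training samples in the transformed space."""
    z = project(Q, x)
    ranked = sorted(range(len(X)), key=lambda i: sq_dist(project(Q, X[i]), z))
    votes = [y[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)

# Toy training set D_k: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.0], [0.1, 1.0], [0.2, -1.0], [0.5, 0.1], [0.6, 1.1], [0.4, -0.9]]
y = [0, 0, 0, 1, 1, 1]  # 0 = normal, 1 = intrusion
identity = [[1.0, 0.0], [0.0, 1.0]]
Q = learn_transform(X, y)
label = knn_predict(Q, X, y, [0.05, 0.9], k=3)
```

In the full method one such matrix would be learned per training set D_k, and the matrix used at prediction time would be the one belonging to the cluster selected for x.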

The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto, and any simple modifications or equivalent substitutions of the technical solutions that can be obviously obtained by those skilled in the art within the technical scope of the present invention are within the scope of the present invention.

Claims

1. An intrusion detection method based on division and characteristic change is characterized by comprising a data acquisition module, a learning module and a prediction module,

the prediction module: given a prediction packet sample x, projecting the x to a corresponding feature transformation matrix by using a K-means method, predicting the class of the sample x by using a corresponding K-nearest neighbor intrusion detection model,

step 1, capturing data packets from a network with the data acquisition module as the basic training set D for the model, dividing them into a normal data packet training set D_maj and an intrusion data packet training set D_min, and preprocessing the normal data packet training set D_maj and the intrusion data packet training set D_min: checking the rationality and integrity of the selected data, supplementing and correcting abnormal data, and performing normalization; and establishing an attribute table of the sample, wherein the value types of all attributes are divided into numerical values and discrete values;

step 2, dividing the normal data packet training set D_maj into K clusters with a K-means clustering algorithm to obtain clusters D_{maj,1}, D_{maj,2}, ..., D_{maj,K};

Step 3, combining each of the clusters D_{maj,1}, D_{maj,2}, ..., D_{maj,K} with the intrusion data packet training set D_min to obtain K new training sets D_1, D_2, ..., D_K;

where i = 1, 2, ..., |D_k|, j = 1, 2, ..., |D_k|, i ≠ j, and |D_k| is the size of the training set D_k;

taking maximization of the leave-one-out (LOO) accuracy as the goal, the probability that the sample point x_i is correctly classified by all samples other than itself is:

wherein Ω_i denotes the index set of the other samples in the training set D_k that belong to the same class as the sample point x_i;

the optimization goal is then:

the gradient is as follows:

solving the objective function f(Q_k) by a gradient descent method using the above formula to obtain the corresponding feature transformation matrix Q_k;

Step 5, processing the sample x to be classified according to the sample attribute table of step 1, selecting, by means of the K-means clustering of step 2, the corresponding transformation matrix Q_k learned in step 4, transforming the training set corresponding to x, and classifying x with k-nearest neighbors in the transformed feature space, thereby determining whether x is a normal sample or an intrusion sample.

2. The intrusion detection method according to claim 1, wherein the sample attribute table classifies types of packet variables into numeric values and discrete values.

3. The intrusion detection method according to claim 1, wherein the normal data packet training set D_maj is divided into K clusters.

4. The partition and feature change based intrusion detection method according to claim 1, wherein the samples are mapped into the optimized space before prediction of the predicted samples using K neighbors.

5. The method of claim 1, wherein x is mapped to a corresponding eigen transformation matrix and k neighbors are used to predict classes of samples, using the function:

wherein v is a class label, I(true) = 1, I(false) = 0, Q_k is the learned corresponding feature transformation matrix, Q_k^T x_i is the sample x_i after feature transformation, y_i is the true class label of the sample x_i, I(·) is an indicator function, and D_z denotes the set of samples x_i found by searching with the Euclidean distance, wherein the Euclidean distance formula is as follows:

wherein x_id is the value of the sample x_i in the d-th dimension, x_jd is the value of the sample x_j in the d-th dimension, and n is the feature dimension.

6. The partition and feature change based intrusion detection method according to claim 2, wherein the attribute table of the sample includes: connection duration; protocol type; network service type of the target host; normal or error status of the connection; number of bytes from the source host to the target host; number of bytes from the target host to the source host; whether the connection is from/to the same host and port; number of erroneous fragments; number of urgent packets; number of accesses to sensitive areas; number of failed login attempts; number of successful logins; number of root-user accesses; number of file creation operations; number of shell uses; number of accesses to control files; whether the login belongs to the "hot" list; whether the login is a guest login; number of connections with the same target host as the current connection; number of connections with the same service as the current connection; number of connections, among the previous 100 connections, with the same target host as the current connection; percentage of connections with SYN errors among the previous 100 connections with the same target host as the current connection; and percentage of connections with REJ errors among the previous 100 connections with the same service as the current connection.