CN109286622B - Network intrusion detection method based on learning rule set

Info

Publication number: CN109286622B
Application number: CN201811122445.1A
Authority: CN
Inventors: 王劲松; 杨传印; 黄玮; 莫敬涛
Original assignee: Tianjin University of Technology
Current assignee: Tianjin University of Technology
Priority date: 2018-09-26
Filing date: 2018-09-26
Publication date: 2021-04-20
Anticipated expiration: 2038-09-26
Also published as: CN109286622A

Abstract

A network intrusion detection method based on a learning rule set comprises: preprocessing network connection data selected from the international standard data set KDDCup99; removing redundant data items and extracting classification rules with an improved FOIL algorithm; and classifying network connection test data according to the classification rules to determine whether a network connection is an attack connection and, if so, its specific attack type. The method is verified experimentally on the network connection data in KDDCup99, and the original FOIL algorithm is improved to suit the characteristics of this data set. The experimental results show that the improved algorithm increases the efficiency of classification rule extraction and of test data classification, improves the accuracy of the detection result to a certain extent, and overcomes the low classification efficiency and high false alarm rate of traditional intrusion detection systems.

Description

Network intrusion detection method based on learning rule set

Technical Field

The invention relates to the field of network intrusion detection systems, and in particular to a network intrusion detection method based on a learning rule set.

Background

An intrusion detection system is an important supplement to a firewall. Without affecting the performance of the network system, it collects and analyzes information from several key points in a computer network or computer system and determines whether there is evidence that the network or system has been intruded, thereby protecting the network system. It therefore plays an important role in network system security.

Intrusion detection based on data mining has become a research hotspot, and many results have been produced in China and abroad. A common problem remains, however: intrusion detection methods based on data mining still need improvement in detection accuracy, false alarm rate and real-time performance. In particular, models that fit data mining techniques closely to intrusion detection need to be studied in order to improve the accuracy and timeliness of intrusion detection.

Disclosure of Invention

To address the low classification efficiency and high false alarm rate of traditional intrusion detection systems, the invention provides a network intrusion detection method based on a learning rule set, which improves the timeliness and accuracy of intrusion detection by processing network connection data with an improved FOIL algorithm. Tests on the KDDCup99 experimental data set show that, compared with the original FOIL algorithm, the improved algorithm is feasible for intrusion detection.

Applying the learning rule set algorithm to intrusion detection is mainly a view centered on data mining and data processing; the acquisition and processing of the network connection data is outside the scope of the invention. The invention takes the international standard network connection data set KDDCup99 as an example and classifies intrusive network connections using data mining as its theoretical basis.

The technical scheme of the invention is as follows:

A network intrusion detection method based on a learning rule set comprises the following steps:

Step 1: network connection data selected from the international standard data set KDDCup99 is divided into a training set and a test set, and each data item in the training set and test set is preprocessed so that each data item in the network connection data is specialized.

Step 2: redundant attribute data items in the training set are removed using an improved FOIL algorithm, i.e. a learning rule set algorithm; the remaining attribute data items are used for training, classification rules are extracted, and the resulting classification rules are stored in a classification rule base.

Step 3: the network connection data in the test set is matched one by one against the classification rules in the classification rule base; the average accuracy of each matched classification rule is calculated from the samples it covers in the training set, and the accuracies of the classification rules are accumulated per connection type according to the connection types in the classification rules.

Step 4: the accuracies of the different connection types from step 3 are stored, and the connection type with the highest accuracy is taken as the classification result of the network connection test data, i.e. the final detection result. After the test set data has been classified, the correctly classified test set data and its detection results are added to the training set as a data source for subsequent classification rule extraction, so that the classification rules can be updated dynamically to adapt to changes in network connections.

The data set preprocessing in step 1 comprises the following steps:

Step 1.1: using a cross validation method, 60% of the network connection data in the KDDCup99 data set is used as the training set and the remaining 40% as the test set.

Step 1.2: a sequence parameter is added to each data item in each piece of network connection data, which specializes each data item and makes the data easier to tell apart. The KDDCup99 data set contains many identical data items; for example, a single network connection may contain several "0" values, while each column has its own specific meaning. When processing a piece of network connection data, the original FOIL algorithm treats identical data items as the same data, so using it on this data set reduces both the speed of classification rule extraction and the accuracy of the classification result. To compensate for this drawback, a sequence parameter is added to the data items of each column during the data preprocessing stage. This both distinguishes identical data within each network connection and preserves the specific meaning of the data.

Removing redundant data items in the training data set and extracting classification rules with the improved FOIL algorithm in step 2 comprises the following steps:

Step 2.1: the preprocessed training set data is divided into positive examples and negative examples according to network connection type, and the attribute data items in the positive example set are collected. All network connection types in the training set are enumerated; the network connection data of one connection type is taken as the positive examples and the connection data of all other types as the negative examples. The distinct attribute data items in the positive example set are added to a positive example attribute data item set Vset, and the antecedent of a classification rule r is set to empty.

Step 2.2: the gain of each attribute data item v in the positive example attribute data item set Vset is calculated, redundant data items that do not satisfy the constraint are removed, and the attribute data item with the maximum gain that satisfies the constraint is added to the antecedent of the classification rule r to obtain a new classification rule r'. The gain of the attribute data item v is calculated by the following formula:
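
The formula itself is not reproduced in this text. Assuming the method uses the standard FOIL gain, which is consistent with the symbols P, N, P* and N* defined in the following paragraphs, it would read (the logarithm base only scales the gain and does not change which item is largest):

```latex
\[
\mathrm{Gain}(v) = P^{*}\left(\log\frac{P^{*}}{P^{*}+N^{*}} - \log\frac{P}{P+N}\right)
\]
```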

When the antecedent of the classification rule r is empty, P and N in the formula are the numbers of samples in the positive example set and the negative example set respectively, and P* and N* are the numbers of samples in the positive and negative example sets covered by the new classification rule r' obtained by adding the attribute data item v to the antecedent of r. In this case, the gains of all attribute data items are calculated and compared, and the attribute data item with the largest gain is added to the antecedent of the classification rule r.

When the antecedent of the classification rule r is not empty, P and N in the formula are the numbers of samples in the positive example set and the negative example set covered by r, and P* and N* are the numbers of samples in the positive and negative example sets covered by the new classification rule r' obtained by adding the attribute data item v to the antecedent of r. In this case, the attribute data item with the maximum gain may be added to the antecedent of r only if the following constraint is satisfied: the new classification rule r' must cover fewer samples in the negative example set than r, i.e. N* < N. If adding the attribute data item with the maximum gain leaves the number of covered negative samples unchanged, and its attribute value is the same as that of an item already in the antecedent of r, the data item is considered redundant and is deleted from the positive example attribute data item set Vset; the attribute data item with the maximum gain is then sought among the remaining attribute data items and added to the antecedent of r, subject to the same constraint, to specialize the classification rule.
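
As a minimal sketch of the gain calculation and the constraint in step 2.2, the following Python assumes the standard FOIL gain, P* × (log2(P*/(P* + N*)) − log2(P/(P + N))), which matches the symbols defined above. The function names `foil_gain` and `is_admissible` are illustrative and do not come from the patent, and the redundancy test is simplified to the N* < N check alone.

```python
import math

def foil_gain(p, n, p_star, n_star):
    """Assumed standard FOIL gain of extending a rule r to r'.

    p, n           -- positive / negative samples covered by r
    p_star, n_star -- positive / negative samples covered by r'
    """
    if p_star == 0:
        return 0.0  # r' covers no positive sample, so there is nothing to gain
    before = math.log2(p / (p + n))
    after = math.log2(p_star / (p_star + n_star))
    return p_star * (after - before)

def is_admissible(n, n_star, antecedent_is_empty):
    """Step 2.2 constraint: a non-empty antecedent may only be extended
    by an item that makes the rule cover fewer negative samples."""
    return antecedent_is_empty or n_star < n
```

An item that leaves the negative coverage unchanged, for example `is_admissible(2, 2, False)`, is rejected regardless of its gain.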

Step 2.3: the classification rule r' from step 2.2 is stored, and all samples in the negative example set that are not covered by r' are deleted. That is, all samples in the negative example set are traversed, and every sample that does not contain the antecedent of r' is deleted. If all samples in the negative example set have been deleted, r' can be used as a classification rule. If samples remain in the negative example set, the other attribute data items of the positive examples that contain the antecedent of r' are collected, and the method returns to step 2.2: the attribute data item with the maximum gain that satisfies the constraint is added to the antecedent of r' to obtain a new classification rule R, all samples in the negative example set not covered by R are deleted, and the process is repeated until all samples in the negative example set have been deleted.

Step 2.4: the classification rule R (or r') obtained in step 2.3 is stored, and all samples in the positive example set covered by R (or r') are deleted. That is, all samples in the positive example set are traversed and compared one by one to determine whether they contain the antecedent of R (or r'), and every sample that contains it is deleted. If all samples in the positive example set have been deleted, all classification rules for this connection type have been extracted. If samples remain in the positive example set, all attribute data items in the remaining samples are collected and the method returns to step 2.2, repeating the preceding steps until all samples in the positive example set have been deleted, at which point the extraction of classification rules for this connection type is complete. Each classification rule that deletes positive examples can be used as a classification rule for this type of sample; the consequent of these rules is the network connection type of the samples, and the rules are stored in the classification rule base.

Step 2.5: the method returns to step 2.1 and extracts the classification rules of the next type of sample, until the classification rules of all types of samples have been found, which completes the extraction of classification rules from the training set.
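
Steps 2.1 to 2.4 can be sketched for a single connection type as below. This is an illustration under stated assumptions rather than the patented implementation: it assumes the standard FOIL gain, represents each sample as a set of (place, data) pairs, breaks ties by sorted order, and stops specializing a rule when no item satisfies N* < N. The names `learn_rules` and `foil_gain` are hypothetical.

```python
import math

def foil_gain(p, n, p_star, n_star):
    # Assumed standard FOIL gain; p_star is always >= 1 for the candidates below.
    return p_star * (math.log2(p_star / (p_star + n_star)) - math.log2(p / (p + n)))

def learn_rules(samples, target):
    """Extract rules for one connection type.

    samples -- list of (items, connection_type), items being a set of
               (place, data) attribute data items
    Returns a list of (antecedent, target) pairs.
    """
    positives = [items for items, t in samples if t == target]
    negatives = [items for items, t in samples if t != target]
    rules = []
    while positives:                      # step 2.4: until every positive is covered
        antecedent = set()
        pos, neg = positives, negatives   # samples still covered by the rule
        while neg:                        # step 2.3: until no negative is covered
            candidates = set().union(*pos) - antecedent
            best, best_gain = None, None
            for v in sorted(candidates):  # sorted only to make ties deterministic
                p_star = sum(v in s for s in pos)
                n_star = sum(v in s for s in neg)
                if antecedent and n_star >= len(neg):
                    continue              # step 2.2 constraint: N* < N
                gain = foil_gain(len(pos), len(neg), p_star, n_star)
                if best is None or gain > best_gain:
                    best, best_gain = v, gain
            if best is None:
                break                     # no admissible item left (assumed stop)
            antecedent.add(best)
            pos = [s for s in pos if best in s]
            neg = [s for s in neg if best in s]
        rules.append((frozenset(antecedent), target))
        positives = [s for s in positives if not antecedent <= s]
    return rules
```

Step 2.5 would then correspond to calling `learn_rules` once per connection type found in the training set and collecting the results in the rule base.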

Calculating the average accuracy of the matched classification rules in step 3 comprises the following steps:

Step 3.1: the test set data is read, each piece of network connection data in the test set is compared with each classification rule in the classification rule base, and the matched classification rules are recorded. Each piece of network connection data in the KDDCup99 data set has many data items, and the antecedent of a classification rule extracted in step 2 may contain several attribute data items, so a piece of network connection data of unknown type in the test set may match several classification rules; all matched classification rules are recorded.

Step 3.2: for the m matched classification rules, if their consequents are all the same, the network connection data of unknown type has the connection type given in that consequent. If the consequents differ, the average accuracy of each matched classification rule is calculated; the average accuracy of a classification rule R_i is calculated by the following formula:
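
The formula is not reproduced in this text. The advantages section names a Laplace accuracy estimation formula; assuming its usual form, and using the symbols k, n and e defined in the next paragraph, it would read:

```latex
\[
\mathrm{LaplaceAccuracy}(R_i) = \frac{e + 1}{n + k}
\]
```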

where k is the number of distinct connection types of the network connection data in the training set, n is the number of samples in the training set that contain the antecedent of the classification rule R_i, and e is the number of samples in the training set that contain the antecedent of R_i and whose connection type is the consequent of R_i. After the average accuracy of each matched classification rule has been obtained, the average accuracies are accumulated per connection type to obtain the accuracy of the connection type t_i for the network connection test data:
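
This formula is also missing from the text. Read literally, accumulating the average accuracies of the s matched rules whose consequent is t_i gives the following sum; the indexing R_1, …, R_s over those s rules is an assumption made here for notation:

```latex
\[
\mathrm{Accuracy}(t_i) = \sum_{j=1}^{s} \mathrm{LaplaceAccuracy}(R_j)
\]
```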

that is, Accuracy(t_i) is the accumulated accuracy of the s classification rules, among the m matched classification rules, whose consequent is the connection type t_i.
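
Step 3.2 and the comparison in step 4.1 can be sketched as follows, assuming the Laplace estimate (e + 1)/(n + k) for the per-rule accuracy. The names `laplace_accuracy` and `classify` are hypothetical, and returning `None` when no rule matches is an assumption, since the text does not cover that case.

```python
from collections import defaultdict

def laplace_accuracy(e, n, k):
    """Assumed Laplace accuracy estimate of one rule."""
    return (e + 1) / (n + k)

def classify(matched, training_set):
    """Pick a connection type from the matched rules (steps 3.2 and 4.1).

    matched      -- list of (antecedent, connection_type) rules that matched
    training_set -- list of (items, connection_type) samples
    Returns None when no rule matched (behaviour assumed, not from the source).
    """
    if not matched:
        return None
    types = {t for _, t in matched}
    if len(types) == 1:
        return types.pop()                 # all consequents agree
    k = len({t for _, t in training_set})  # number of connection types
    totals = defaultdict(float)
    for antecedent, rule_type in matched:
        covered = [t for items, t in training_set if antecedent <= items]
        n = len(covered)                   # samples containing the antecedent
        e = sum(t == rule_type for t in covered)
        totals[rule_type] += laplace_accuracy(e, n, k)
    return max(totals, key=totals.get)     # connection type with highest accuracy
```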

Obtaining the classification result from the accuracies of the connection types of the matched rules and adding the classified test data to the training set in step 4 comprises the following steps:

Step 4.1: the accuracies of the different network connection types calculated in step 3 are stored and compared, and the connection type with the highest accuracy is the final classification result of the network connection test data.

Step 4.2: this step gives the method its self-learning capability and keeps the classification rules up to date. Because actual network conditions change dynamically, classification rules obtained from a single round of training may not suit constantly changing network data. In this method, the classified test data and the corresponding classification results are therefore added to the training set for retraining, which generates new classification rules and updates the classification rule base.

The invention has the following advantages:

Taking the KDDCup99 international standard data set as an example, the data set is first divided into a training set and a test set by cross validation, and sequence parameters are added to the 41 attribute data items in the training set and the test set. Redundant data items in the training set are then removed by the improved FOIL algorithm, and classification rules are extracted by training. Finally, the average accuracy of the classification rules matched by each piece of network connection data of unknown type in the test set is calculated with the Laplace accuracy estimation formula, and the classification result is obtained by comparing the accuracies. At the same time, the test set data and the corresponding classification results are added to the training set, updating the training set in real time to generate new classification rules, which gives the method good adaptive and self-learning characteristics. By adopting the improved FOIL algorithm, the invention avoids the large amount of repeated traversal and calculation that the original FOIL algorithm performs on the KDDCup99 data set, reduces the time complexity of the algorithm, greatly speeds up classification rule extraction and classification, and improves the accuracy of the classification result for network connection data; its adaptive and self-learning characteristics also make it more stable.

Drawings

Fig. 1 is a flow chart of the network intrusion detection method based on the learning rule set according to the present invention.

Detailed Description

The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.

Applying the learning rule set algorithm to intrusion detection is mainly a data-centric view; the acquisition and processing of the network connection data is outside the scope of the invention. The invention takes the international standard network connection data set KDDCup99 as an example and classifies the network connection data using data mining as its theoretical basis.

Fig. 1 illustrates the steps of a network intrusion detection method based on a learning rule set in detail. The method provided by the invention comprises the following steps:

Step 1.1: using a cross validation method, 60% of the network connection data in the KDDCup99 data set is used as the training set and the remaining 40% as the test set. Specifically, 10 pieces of network connection data are randomly drawn from the KDDCup99 data set to form a group; 6 pieces are then randomly selected from each group and added to the training set, and the remaining 4 pieces are added to the test set.
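
The grouping described in step 1.1 can be sketched as below. The function name `split_60_40`, the `seed` parameter and the treatment of a final group with fewer than 10 connections (60% of it, rounded) are assumptions made for illustration.

```python
import random

def split_60_40(connections, seed=0):
    """Split connections into training and test sets in groups of 10:
    6 of each group go to the training set, the other 4 to the test set."""
    rng = random.Random(seed)
    shuffled = list(connections)
    rng.shuffle(shuffled)                 # random order = random groups of 10
    train, test = [], []
    for start in range(0, len(shuffled), 10):
        group = shuffled[start:start + 10]
        cut = round(len(group) * 0.6)     # 6 for a full group of 10
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test
```

Because each group is already in random order, taking its first 6 elements is equivalent to selecting 6 of them at random.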

Step 1.2: a sequence parameter is added to each data item in each piece of network connection data, which specializes each data item and makes the data easier to tell apart. The KDDCup99 data set contains many identical data items; for example, a piece of network connection data may contain several "0" or "1" values, while each column of data has its own specific meaning. When processing a piece of network connection data, the original FOIL algorithm treats identical data items as the same data, so using it on this data set reduces both the speed of classification rule extraction and the accuracy of the classification result. To make up for this drawback, a sequence parameter place, i.e. the column in which the data item is located, is added to the data items of each column during the data preprocessing stage, so that each data item in the data set can be represented by a structure DataItem { int place, string data }. Consider, for example, the following randomly selected piece of network connection data:

Table 1. A randomly selected piece of network connection data

0	tcp	http	SF	279	1129	......	0	normal

In the data preprocessing stage, sequence parameters are added to the data items as follows: (1, 0), (2, tcp), (3, http), (4, SF), (5, 279), …, (40, 0), (41, 0). This processing preserves the position of each data item, so that the meaning of the data is not confused. In addition, the subsequent classification rule extraction only needs to traverse the data in the corresponding column of the data set, which greatly reduces the amount of data traversed and markedly improves the efficiency of classification rule extraction.
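
A minimal sketch of this preprocessing follows, modelling the DataItem { int place, string data } structure as a Python named tuple. It assumes the raw record is one comma-separated line whose last field is the connection type followed by a full stop, as in the public KDDCup99 files; the function name `add_sequence_parameters` is hypothetical.

```python
from typing import NamedTuple

class DataItem(NamedTuple):
    place: int   # sequence parameter: 1-based column of the data item
    data: str    # the value found in that column

def add_sequence_parameters(record):
    """Turn one raw record into (attribute data items, connection type)."""
    fields = record.strip().rstrip(".").split(",")
    *attributes, connection_type = fields
    items = [DataItem(place, data)
             for place, data in enumerate(attributes, start=1)]
    return items, connection_type
```

Two identical values in different columns, such as the "0" in column 1 and a "0" in column 40, become the distinct items `DataItem(1, "0")` and `DataItem(40, "0")`.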

Step 2: redundant data items in the training set are removed using the improved FOIL algorithm, the remaining network connection data is used for training, classification rules are extracted, and the classification rules are stored in the classification rule base.

First, the following terms are defined:

Network connection: each line of data in the international standard data set KDDCup99 is a network connection. Each connection has 42 data items: the first 41 are attributes of the network connection and the last is its connection type. A network connection whose connection type is normal is a normal network connection; all other connection types are attack types.

Sample: the training set selected from the international standard data set is a collection of network connection data of several connection types, and each piece of network connection data is a sample.

Attribute data item: each data item in a sample, after the sequence parameter has been added, is called an attribute data item; it consists of the sequence parameter and the data at that position.

Classification rule: a classification rule consists of an antecedent and a consequent. The antecedent consists of one or more attribute data items, and the consequent is a connection type.

Coverage: a sample whose data items include all the data items in the antecedent of a classification rule is said to be covered by that classification rule.

Gain: when extracting classification rules, the FOIL algorithm selects attribute data items to specialize the antecedent of a classification rule, and the gain is the difference in information encoding before and after an attribute data item is added to the antecedent. An attribute data item with a larger gain contributes more to reducing the information encoding, and the FOIL algorithm uses the gain as the main criterion for selecting the attribute data item to add to the antecedent.

Step 2.1: the preprocessed training set data is divided into positive examples and negative examples according to network connection type, and the attribute data items in the positive example set are collected. All network connection types in the training set are enumerated; the network connection data of one connection type is taken as the positive examples and the connection data of all other types as the negative examples. The distinct attribute data items in the positive example set are added to a positive example attribute data item set Vset, and the antecedent of a classification rule r is set to empty.

Step 2.2: the gain of each attribute data item v in the positive example attribute data item set Vset is calculated, redundant data items that do not satisfy the constraint are removed, and the attribute data item with the maximum gain that satisfies the constraint is added to the antecedent of the classification rule r to obtain a new classification rule r'. The gain of the attribute data item v is calculated by the following formula:

When the antecedent of the classification rule r is empty, P and N in the formula are the numbers of samples in the positive example set and the negative example set respectively, and P* and N* are the numbers of samples in the positive and negative example sets covered by the new classification rule r' obtained by adding the attribute data item v to the antecedent of r. In this case, the gains of all attribute data items are calculated and compared, and the attribute data item with the largest gain can be added directly to the antecedent of the classification rule.

When the antecedent of the classification rule r is not empty, P and N in the formula are the numbers of samples in the positive example set and the negative example set covered by r, and P* and N* are the numbers of samples in the positive and negative example sets covered by the new classification rule r' obtained by adding the attribute data item v to the antecedent of r. In this case, the attribute data item with the maximum gain may be added to the antecedent of r only if the following constraint is satisfied: the new classification rule r' must cover fewer samples in the negative example set than r, i.e. N* < N. If adding the attribute data item with the maximum gain leaves the number of covered negative samples unchanged, and its attribute value is the same as that of an item already in the antecedent of r, the data item is considered redundant and is deleted from the positive example attribute data item set Vset; the attribute data item with the maximum gain is then sought among the remaining attribute data items and added to the antecedent of r, subject to the same constraint, to specialize the classification rule.

Redundant attribute data items are removed in this step mainly because, when the training set is small, a redundant attribute data item may otherwise be added to the antecedent of a classification rule; such an item contributes nothing to classification and can reduce classification accuracy. For example:

Table 2. Eight pieces of data selected to illustrate the removal of redundant attribute data items

	F1	F2	F3	F4		Type
...	0	0	0	0	...	normal
...	0	0	1	0	...	land
...	0	1	0	0	...	ipsweep
...	0	1	1	0	...	normal
...	1	0	0	1	...	teardrop
...	1	0	1	1	...	normal
...	1	1	0	1	...	normal
...	1	1	1	1	...	back

In these 8 samples, the numbers of samples with connection type normal (positive examples) and non-normal (negative examples) are equal, and each attribute column takes the same values the same number of times. Every value of attribute F4 is identical to the corresponding value of F1, so F4 is a redundant attribute. When extracting the classification rules for connections of type normal, the gain of every attribute data item is the same. If the attribute data item (F1, 0) is added to the antecedent first, and the second attribute data item is then chosen on gain alone, the redundant attribute data item (F4, 0) may be introduced into the antecedent, where it contributes nothing to subsequent classification. Adding the constraint N* < N ensures that the attribute data item with the maximum gain selected each time is not a redundant attribute data item.
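
The effect of the constraint on the Table 2 data can be checked with a short sketch. Only the four visible columns F1 to F4 are used, and the helper name `coverage` is hypothetical.

```python
# Rows of Table 2: (F1, F2, F3, F4) values and connection type.
TABLE_2 = [
    ((0, 0, 0, 0), "normal"), ((0, 0, 1, 0), "land"),
    ((0, 1, 0, 0), "ipsweep"), ((0, 1, 1, 0), "normal"),
    ((1, 0, 0, 1), "teardrop"), ((1, 0, 1, 1), "normal"),
    ((1, 1, 0, 1), "normal"), ((1, 1, 1, 1), "back"),
]

def coverage(antecedent, target="normal"):
    """Return (P, N): positive and negative rows covered by the antecedent,
    given as a dict mapping a 0-based column index to the required value."""
    covered = [t for row, t in TABLE_2
               if all(row[col] == val for col, val in antecedent.items())]
    p = sum(t == target for t in covered)
    return p, len(covered) - p

p, n = coverage({0: 0})            # rule with antecedent (F1, 0)
p4, n4 = coverage({0: 0, 3: 0})    # extended with the redundant (F4, 0)
p2, n2 = coverage({0: 0, 1: 0})    # extended with (F2, 0)
```

Adding (F4, 0) leaves the negative coverage at `n4 == n`, so it fails N* < N, whereas (F2, 0) reduces it to `n2 < n` and is admissible.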

Step 2.3: the classification rule r' from step 2.2 is stored, and all samples in the negative example set that are not covered by r' are deleted. That is, all samples in the negative example set are traversed, and every sample that does not contain the antecedent of r' is deleted. If all samples in the negative example set have been deleted, r' can be used as a classification rule. If samples remain in the negative example set, the other attribute data items of the positive examples that contain the antecedent of r' are collected, and the method returns to step 2.2: the attribute data item with the maximum gain that satisfies the constraint is added to the antecedent of r' to obtain a new classification rule R, all samples in the negative example set not covered by R are deleted, and the process is repeated until all samples in the negative example set have been deleted.

Step 2.4: the classification rule R (or r') obtained in step 2.3 is stored, and all samples in the positive example set covered by R (or r') are deleted. That is, all samples in the positive example set are traversed and compared one by one to determine whether they contain the antecedent of R (or r'), and every sample that contains it is deleted. If all samples in the positive example set have been deleted, all classification rules for this connection type have been extracted. If samples remain in the positive example set, all attribute data items in the remaining samples are collected and the method returns to step 2.2, repeating the preceding steps until all samples in the positive example set have been deleted, at which point the extraction of classification rules for this connection type is complete. Each classification rule that deletes positive examples can be used as a classification rule for this type of sample; the consequent of these rules is the network connection type of the samples, and the rules are stored in the classification rule base.

The above process is described below by way of an example. The following data is randomly chosen from KDDCup99:

Table 3. Five randomly selected pieces of data from the KDDCup99 data set

0	tcp	http	SF	54540	8314	……	0.04	0.04	back
14	tcp	http	RSTR	33580	7300	……	1	1	back
0	icmp	eco_i	SF	18	0	……	0	0	ipsweep
0	tcp	http	SF	321	480	……	1	1	normal
0	tcp	http	SF	277	3410	……	0	0	normal

Take these 5 samples as the training set. After sequence parameters have been added, the samples with connection type back are taken as positive examples and all other samples as negative examples, and the classification rules of the positive examples are trained first. The attribute data item set Vset(back) composed of the distinct attribute data items in the positive examples is: { (1, 0), (2, tcp), (3, http), (4, SF), (5, 54540), (6, 8314), (40, 0.04), (41, 0.04), (1, 14), (4, RSTR), (5, 33580), (6, 7300), (40, 1), (41, 1) }. The antecedent of the classification rule r is set to empty and the gain of each attribute data item is calculated; the gains of (5, 54540), (6, 8314), (40, 0.04), (41, 0.04), (1, 14), (4, RSTR), (5, 33580), (6, 7300), (40, 1) and (41, 1) are the largest. Suppose (40, 1) is added to the antecedent of r. The negative examples not covered by the classification rule r: { (40, 1) } → back are then deleted, but the samples in the negative example set cannot all be deleted.

The gains of all attribute data items in the samples whose connection type is back and that are covered by r are then calculated again. If the gain of (41, 1) is the largest, the attribute data item (41, 1) should be deleted as a redundant data item, because the new classification rule r' obtained by adding it to the antecedent of r would not cover fewer samples in the negative example set, and its attribute value is the same as that of the attribute data item already in r. The attribute data items with the largest gain are then (1, 14), (4, RSTR), (5, 33580) and (6, 7300); one of them is selected and added to the antecedent, giving the following classification rule for connection type back: r: { (40, 1), (1, 14) } → back. All negative examples not covered by the antecedent of r are deleted; at this point all negative examples have been deleted, and one classification rule for the positive examples has been found. The positive examples covered by the antecedent of r are then deleted. Since the positive examples have not all been deleted, further classification rules for the positive examples are generated by the same steps: { (40, 0.04) } → back. Finally, once all classification rules for the positive examples have been found, the samples of the other connection types are taken in turn as positive examples to generate the corresponding classification rules. The following classification rules are finally obtained: { (40, 1), (1, 14) } → back; { (40, 0.04) } → back; { (2, icmp) } → ipsweep; { (5, 321) } → normal; { (6, 3410) } → normal. These classification rules are stored in the classification rule base.

Step 3.1: read the test set data, compare each piece of network connection data in the test set with every classification rule in the classification rule base, and record the matched classification rules. Each piece of network connection data in the KDDCup99 data set has many data items, and the antecedent of a classification rule extracted in step 2 may contain several attribute data items. When a piece of network connection data of unknown type in the test set is classified with the extracted classification rules, it may therefore match multiple classification rules; all matched classification rules are recorded.
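The matching in step 3.1 can be sketched as a subset test between a rule antecedent and the items of a test record. The rule base below copies the five rules obtained in the embodiment. The record layout (a dictionary keyed by sequence number), the function name `match_rules` and the sample record are assumptions made only for illustration.

```python
# Rule base from the embodiment: (antecedent items, connection type).
RULE_BASE = [
    ({(40, 1), (1, 14)}, "back"),
    ({(40, 0.04)}, "back"),
    ({(2, "icmp")}, "ipsweep"),
    ({(5, 321)}, "normal"),
    ({(6, 3410)}, "normal"),
]

def match_rules(record, rule_base):
    """Return every rule whose antecedent items all occur in the record."""
    items = set(record.items())
    return [(ante, label) for ante, label in rule_base if ante <= items]

# Illustrative test record keyed by sequence number; elided items are omitted.
record = {1: 14, 2: "tcp", 3: "http", 4: "SF", 5: 321, 6: 3410, 40: 1, 41: 0}

matched = match_rules(record, RULE_BASE)
for antecedent, label in matched:
    print(sorted(antecedent), "->", label)
```

A record can match rules with different connection types, as it does here (one back rule and two normal rules), which is why all matches are recorded before a decision is made.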

Step 4: store the accuracy of each connection type obtained in step 3 and find the connection type with the maximum accuracy, which is the classification result of the network connection test data. In addition, to give the method good self-learning characteristics, once the test set data have been classified according to the classification rules, the test set data and the corresponding classification results are added to the training set data. This provides a new training data source for subsequent classification rule extraction and ensures that the classification rules are dynamically updated.

Step 4.2: ensure the self-learning characteristic of the method and the dynamic updating of the classification rules. Because actual network conditions are dynamic, classification rules obtained from a single round of training may not adapt to constantly changing network data. In the method, the classified test data and the corresponding classification results are therefore added to the training set for retraining, so as to generate new classification rules and update the classification rule base.
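The self-learning update of step 4.2 can be sketched as follows. The two callables `classify` and `extract_rules` are hypothetical stand-ins for the classification of steps 3 to 4.1 and the rule extraction of step 2. Skipping records that receive no label is an assumption of this sketch, not something the method specifies.

```python
def self_learning_update(training, test_records, classify, extract_rules):
    """Add classified test records to the training set and retrain.

    `classify` maps a record to a connection type (or None), and
    `extract_rules` maps a training set to a new rule base; both are
    hypothetical placeholders supplied by the caller.
    """
    newly_labelled = []
    for record in test_records:
        label = classify(record)
        if label is not None:              # assumption: unmatched records are skipped
            newly_labelled.append((record, label))
    updated_training = list(training) + newly_labelled
    return updated_training, extract_rules(updated_training)

# Stub callables, used only to show the data flow.
training = [({1: 0, 2: "icmp"}, "ipsweep")]
tests = [{1: 14, 2: "tcp"}, {1: 3, 2: "udp"}]
stub_classify = lambda r: "normal" if r[2] == "tcp" else None
stub_extract = lambda t: sorted({label for _, label in t})

updated, rules = self_learning_update(training, tests, stub_classify, stub_extract)
print(len(updated), rules)
```

Because the added labels are predictions rather than ground truth, a practical deployment would probably want some check on them before retraining; the text above does not describe such a check.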

To illustrate steps 3 and 4, one piece of network connection data is selected from the KDDCup99 data with connection types back, ipsweep and normal, as follows:

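Table 4. One piece of network connection data selected from the KDDCup99 data of the above three connection types

Sequence number | 1 | 2 | 3 | 4 | 5 | 6 | ...... | 40 | 41 | Connection type
Value | 14 | tcp | http | SF | 321 | 3410 | ...... | 1 | 0 | normal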

Matching this data against the classification rules in the classification rule base yields 3 matched classification rules: { (5, 321) } → normal; { (6, 3410) } → normal; { (40, 1), (1, 14) } → back. Because the consequents of the 3 matched classification rules are not all the same, the accuracy of each of the two connection types (normal and back) corresponding to the three rules must be calculated separately.

The connection type with the higher accuracy is normal, so the classification result of the network connection test data is normal, that is, a normal network connection.
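The decision between normal and back can be sketched as below. The per-rule accuracy is assumed to be the Laplace estimate (e + 1) / (n + k), which fits the quantities k, n and e named in the method, but the formula itself is not reproduced in this passage, so it should be read as an assumption. Accumulation is taken to be a plain sum per connection type. The training samples are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def rec(d):
    """Turn {sequence number: value} into a set of (sequence, value) items."""
    return frozenset(d.items())

def rule_accuracy(antecedent, consequent, training, k):
    """Assumed Laplace estimate (e + 1) / (n + k) for one matched rule."""
    n = sum(antecedent <= items for items, _ in training)
    e = sum(antecedent <= items and label == consequent
            for items, label in training)
    return Fraction(e + 1, n + k)

def classify(matched_rules, training):
    """Sum rule accuracies per connection type and return the best type."""
    k = len({label for _, label in training})   # number of connection types
    scores = defaultdict(Fraction)
    for antecedent, consequent in matched_rules:
        scores[consequent] += rule_accuracy(antecedent, consequent, training, k)
    if not scores:
        return None, {}                          # no rule matched
    return max(scores, key=scores.get), dict(scores)

# Hypothetical training samples (not the samples of the embodiment).
TRAINING = [
    (rec({1: 14, 5: 33580, 6: 7300, 40: 1}), "back"),
    (rec({1: 0, 5: 54540, 6: 8314, 40: 0.04}), "back"),
    (rec({1: 0, 5: 8, 6: 0, 40: 1}), "ipsweep"),
    (rec({1: 0, 5: 321, 6: 3410, 40: 0}), "normal"),
    (rec({1: 0, 5: 215, 6: 3410, 40: 0}), "normal"),
]

# The three rules matched by the test record in the embodiment.
MATCHED = [
    (frozenset({(5, 321)}), "normal"),
    (frozenset({(6, 3410)}), "normal"),
    (frozenset({(40, 1), (1, 14)}), "back"),
]

label, scores = classify(MATCHED, TRAINING)
print(label, scores)
```

With these assumed counts the two normal rules contribute 1/2 and 3/5, the back rule contributes 1/2, and normal is selected, in line with the result stated above. Exact fractions are used only to keep the arithmetic easy to check by hand.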

To compare the performance of the improved FOIL algorithm with that of the original FOIL algorithm when applied to a network intrusion detection system, the following verification experiment was performed. Experimental environment: a PC with an Intel Core i7-4770 3.4 GHz CPU, 8 GB of memory and a 1 TB hard disk, running the Visual Studio 2013 software environment. Experimental data: network connection data of the various connection types were randomly selected from the KDDCup99 data set in different proportions, ensuring that the amount of data of each connection type does not exceed 50, and 2150 pieces were selected in total. Using a cross-validation method, 60% of the data were taken as training set data and 40% as test set data, and 5 experiments were carried out on the FOIL algorithm before and after improvement. The experimental results are shown in Table 5. The average accuracy of current network intrusion detection algorithms was obtained from the published literature, and the comparison is shown in Table 6.
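The 60%/40% division used in the five experiments can be sketched as a seeded random split. This is only an illustration of the proportions described above: the function name, the use of one seed per experiment and the integer stand-ins for the 2150 records are assumptions.

```python
import random

def split_60_40(records, seed):
    """Shuffle a copy of the records and split it 60% / 40%."""
    rng = random.Random(seed)          # seeded so a run can be repeated
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 60 // 100    # integer arithmetic avoids rounding issues
    return shuffled[:cut], shuffled[cut:]

records = list(range(2150))            # stand-ins for the selected connections

# Five experiments, each with its own split (seeds are arbitrary).
splits = [split_60_40(records, seed) for seed in range(5)]
for train, test in splits:
    print(len(train), len(test))
```

Each split gives 1290 training records and 860 test records.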

Table 5. Comparison of the FOIL algorithm before and after improvement, verified on the international standard data set KDDCup99

Table 6. Comparison of the average accuracy of current network intrusion detection algorithms in classifying KDDCup99 network connection data

The experimental results show that, compared with the original FOIL algorithm, the intrusion detection method greatly improves execution time, and that the average accuracy of its classification results is better than that of the other algorithms.

Claims

1. A network intrusion detection method based on a learning rule set is characterized by comprising the following steps:

step 1, dividing network connection data selected from an international standard data set KDDCup99 into training set data and testing set data, and then preprocessing each data item in the training set and the testing set to specialize each data item in the network connection data;

step 2, removing redundant attribute data items in the training set data by adopting an improved FOIL algorithm, namely a learning rule set algorithm, training the remaining attribute data items, extracting classification rules, and storing the obtained classification rules into a classification rule base; the specific method comprises the following steps:

step 2.1, dividing the preprocessed training set data into positive examples and negative examples according to network connection type, and counting the attribute data items in the positive example set; specifically, counting all network connection types in the training set, taking the network connection data of one connection type as positive examples and the data of all other connection types as negative examples, adding the distinct attribute data items in the positive example set to the positive example attribute data item set Vset, and setting the antecedent of the classification rule r to empty;

step 2.2, calculating the gain of each attribute data item v in the positive example attribute data item set Vset, removing redundant attribute data items that do not satisfy the constraint, and adding the attribute data item that satisfies the constraint and has the maximum gain to the antecedent of the classification rule r to obtain a new classification rule r'; the gain calculation formula of the attribute data item v is as follows:

when the antecedent of the classification rule r is empty, P and N in the formula represent the number of samples in the positive example set and the negative example set respectively, and P* and N* represent the number of samples in the positive example set and the negative example set covered by the new classification rule r' obtained after the attribute data item v is added to the antecedent of the classification rule r; in this case, the gains of all the attribute data items are calculated and compared, and the attribute data item with the maximum gain is added to the antecedent of the classification rule r;

when the antecedent of the classification rule r is not empty, P and N in the formula represent the number of samples covered by the classification rule r in the positive example set and the negative example set respectively, and P* and N* represent the number of samples in the positive example set and the negative example set covered by the new classification rule r' obtained after the attribute data item v is added to the antecedent of the classification rule r; in this case, adding the attribute data item with the maximum gain to the antecedent of the classification rule r must satisfy the following constraint: the new classification rule r' so obtained covers fewer samples in the negative example set, namely N* < N; if the number of samples of the negative example set covered by the new classification rule r' does not change, and the attribute value of the attribute data item with the maximum gain is the same as the attribute value of an item already in the antecedent of the classification rule r, the data item is considered redundant and is deleted from the positive example attribute data item set Vset, after which the attribute data item with the maximum gain is sought among the remaining attribute data items and added to the antecedent of the classification rule r as required, so as to specialize the classification rule;

step 2.3, storing the classification rule r' obtained in step 2.2, and deleting all samples in the negative example set that are not covered by the classification rule r'; that is, traversing all samples in the negative example set and deleting every sample that does not contain the antecedent of the classification rule r'; if all samples in the negative example set are deleted, the classification rule r' can be used as a classification rule, and the consequent of the classification rule r' is the network connection type of the positive examples; if samples remain in the negative example set, counting the other attribute data items in the positive examples that contain the antecedent of the classification rule r', then returning to step 2.2 and adding the attribute data item with the maximum gain that meets the requirement to the antecedent of the classification rule r' to obtain a new classification rule R, then continuing to delete all samples in the negative example set that are not covered by the new classification rule R, and repeating the above process until all samples in the negative example set are deleted;

step 2.4, storing the classification rule R or r' obtained in step 2.3, and deleting all samples in the positive example set that are covered by the antecedent of the classification rule R or r'; that is, traversing all samples in the positive example set, checking one by one whether each sample contains the antecedent of the classification rule R or r', and deleting every sample that does; if all samples in the positive example set are deleted, all classification rules of this type have been extracted; if samples remain in the positive example set, counting all attribute data items in the remaining samples, returning to step 2.2, and repeating all the preceding steps until all samples in the positive example set are deleted, at which point the extraction of the classification rules of this type is finished; each classification rule used to delete positive examples can be used as a classification rule for this type of network connection, the consequent of these classification rules is the network connection type of this type of sample, and the classification rules are stored in the classification rule base;

step 2.5, returning to the step 2.1, extracting the classification rules of the second type of samples until the classification rules of all types of samples are found, and ending the process of extracting the classification rules by the training set;

step 3, matching the network connection data in the test set with classification rules in a classification rule base one by one, respectively calculating the average accuracy of each classification rule according to the condition that different matched classification rules cover the samples in the training set, and respectively accumulating the average accuracy of the classification rules according to different connection types in the classification rules;

2. The learning rule set-based network intrusion detection method according to claim 1, wherein: the method for preprocessing the data set in the step 1 comprises the following steps:

step 1.1, adopting a cross validation method to take 60% of network connection data in KDDCup99 data set as a training set, and the remaining 40% of network connection data as a test set;

step 1.2, adding a sequence parameter to each data item in each piece of network connection data, so as to specialize each data item in the network connection data, increase the distinguishability of the data and preserve the specific meaning of each data item.

3. The learning rule set-based network intrusion detection method according to claim 1, wherein: the method for calculating the average accuracy of the matched different classification rules in step 3 is as follows:

step 3.1, reading the data of the test set, comparing each piece of network connection data in the test set with every classification rule in the classification rule base, and recording the matched classification rules; each piece of network connection data in the KDDCup99 data set has a plurality of data items, and the antecedent of a classification rule extracted in step 2 may include a plurality of attribute data items, so that when a piece of network connection data of unknown type in the test set is classified according to the extracted classification rules, a plurality of classification rules may be matched, and all matched classification rules are recorded:

where k is the number of different connection types of the network connection data in the training set, n is the number of samples in the training set that contain the antecedent of the classification rule R_i, and e is the number of samples in the training set whose connection type is the consequent of the classification rule R_i and that contain the antecedent of the classification rule R_i; after the average accuracy of each matched classification rule is obtained, the average accuracies are accumulated by connection type to obtain the accuracy of the connection type t_i corresponding to the network connection test data:

4. The learning rule set-based network intrusion detection method according to claim 1, wherein: the method in step 4 for obtaining the classification result from the accuracy of the different connection types matched by the classification rules and adding the classified test data to the training set comprises the following steps:

step 4.1, storing the accuracy of different network connection types calculated in the step 3, and comparing to obtain the connection type with the maximum accuracy, namely the final classification result of the network connection test data;