CN112217822A - Detection method for intrusion data - Google Patents

Detection method for intrusion data

Info

Publication number
CN112217822A
CN112217822A (application CN202011088479.0A; granted as CN112217822B)
Authority
CN
China
Prior art keywords
sample
classifier
data
samples
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011088479.0A
Other languages
Chinese (zh)
Other versions
CN112217822B (en
Inventor
任午令
张晓冰
Current Assignee
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN202011088479.0A priority Critical patent/CN112217822B/en
Publication of CN112217822A publication Critical patent/CN112217822A/en
Application granted granted Critical
Publication of CN112217822B publication Critical patent/CN112217822B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/142Network analysis or design using statistical or mathematical methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/20Network architectures or network communication protocols for network security for managing network security; network security policies in general

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a detection method for intrusion data, which specifically comprises the following steps: 1) a balanced-data-set acquisition step; 2) a data classification step; and 3) a classifier evaluation step. The method improves data protection and the generalization performance of intrusion detection.

Description

Detection method for intrusion data
Technical Field
The invention relates to the technical field of network intrusion detection, and in particular to a detection method for intrusion data.
Background
With the development of the internet industry, preventing information systems from being intruded has become an important issue, and the unbalanced distribution of data is one obstacle to improving information protection effectively. A data set is unbalanced when the number of samples in one or several classes far exceeds the number of samples in the other classes; classes with a small share of the data are called minority classes, and classes with a large share are called majority classes. Network attack types are varied. Some attack types are common, such as DDoS, brute-force cracking, and ARP spoofing, while others occur rarely, such as unauthorized local superuser privileged access and unauthorized access from a remote host, so samples of these attack types are far fewer than samples of the common types. A DDoS attack may damage the entire network, degrade service performance, or interrupt services, and unauthorized remote-host access may allow the host to be controlled for illicit activities. Existing classification methods have a high recognition rate for majority-class sample points, so minority-class samples are misclassified, which causes serious problems and weakens protection against minority-class attacks. It is therefore important to process unbalanced data sets and improve the generalization performance of data intrusion detection models.
Disclosure of Invention
The invention overcomes the defects of the prior art by providing a detection method for intrusion data that improves data protection and the generalization performance of detection.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a detection method for intrusion data specifically comprises the following steps:
1) Balanced-data-set acquisition step: using a rough clustering method, compute the Euclidean distance from each sample point of the training data to the cluster center points and divide the training data into several cluster subsets; treat cluster subsets that contain few sample points and lie far away as noise points, and delete the noise data; then randomly sample the different types of intrusion data and reduce their dimensionality, reducing overfitting of the models for the different intrusion-data types, and perform intra-class balancing on the training set with an oversampling method, increasing the number of samples of some classes to obtain a balanced data set;
2) Data classification step: the classifier classifies the balanced data set and adjusts the weights of misclassified samples, improving the generalization performance of the classification model. The weak classifiers are trained over repeated iterations with the AdaBoost M1 method, and the weak classifier trained in each round participates in the next round of training. Specifically, according to the result of the previous iteration, the weights of the misclassified sample points in the training set are increased while the weights of the correctly classified sample points are decreased before entering the next iteration, which improves the classification performance of the classifier; the classifier generated in the next iteration thus focuses more on the samples misclassified by the previous one, increasing classification accuracy. Finally, the classification result is decided by voting over the classifiers generated in all iterations;
3) Classifier evaluation step: evaluate the classifier through a confusion matrix, the missing-report rate, the accuracy rate, and the ROC curve; the confusion matrix compares the classification results with the actual results and visually depicts the classifier's performance;
the missing-report rate and the accuracy rate use the following formulas:
Missing-report rate = FN/(TP + FN)
Accuracy rate = (TP + TN)/(TP + TN + FN + FP)
where FP is a sample determined to be positive but actually negative; FN is a sample determined to be negative but actually positive; TN is a sample determined to be negative that is actually negative; TP is a sample determined to be positive that is actually positive;
the ROC curve is commonly used to represent the performance of a classifier; in the best case the curve lies toward the upper-left corner, meaning a high true positive rate at a low false positive rate; the horizontal axis of the ROC curve is the false positive rate FPR and the vertical axis is the true positive rate TPR;
TPR = TP/(TP + FN)
the true positive rate is the proportion of actual positive (attack) samples that the model correctly predicts as positive;
FPR = FP/(FP + TN)
the false positive rate is the proportion of actual negative (normal) samples that the model incorrectly predicts as attacks.
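As a quick sanity check, the four measures above can be computed directly from the confusion-matrix counts. The sketch below is illustrative, not from the patent; it takes the missing-report rate to be FN/(TP + FN), i.e. the share of real attacks that go undetected, with "positive" meaning an attack sample.

```python
def detection_metrics(tp, fp, tn, fn):
    """Evaluation measures from confusion-matrix counts (positive = attack)."""
    return {
        "miss_rate": fn / (tp + fn),                  # attacks that slip through
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # overall correctness
        "tpr": tp / (tp + fn),                        # true positive rate
        "fpr": fp / (fp + tn),                        # false positive rate
    }

m = detection_metrics(tp=40, fp=10, tn=45, fn=5)
```

Note that miss_rate + tpr = 1 by construction, matching the intuition that the missing-report rate is the complement of the detection rate.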
Furthermore, each classifier generated per iteration is weighted in the final strong classifier according to its classification error rate; the lower the classification error rate, the higher the weight.
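The reweighting loop and the error-rate-based classifier weights described above can be sketched as follows. This is a minimal, binary-label (±1) illustration with a hand-rolled decision stump as the weak learner; the function names and the stump are my own choices, not the patent's setup, which uses a random forest base classifier on multi-class data.

```python
import numpy as np

def stump_fit(X, y, w):
    """Weak learner: best weighted single-feature threshold stump."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(sign * (X[:, j] - t) >= 0, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, (j, t, sign))
    return best[1], best[0]

def stump_predict(params, X):
    j, t, sign = params
    return np.where(sign * (X[:, j] - t) >= 0, 1, -1)

def adaboost_m1(X, y, n_rounds=10):
    """AdaBoost.M1 sketch: shrink weights of correctly classified samples each
    round, so misclassified samples gain weight after renormalization."""
    w = np.full(len(y), 1.0 / len(y))        # uniform initial sample weights
    ensemble = []
    for _ in range(n_rounds):
        params, eps = stump_fit(X, y, w)
        if eps >= 0.5:                       # weak learner must beat chance
            break
        beta = max(eps, 1e-10) / (1.0 - eps)
        pred = stump_predict(params, X)
        w[pred == y] *= beta                 # de-emphasize correct samples
        w /= w.sum()                         # renormalize the weights
        # lower error rate -> larger vote weight log(1/beta)
        ensemble.append((params, np.log(1.0 / beta)))
    return ensemble

def ensemble_predict(ensemble, X):
    """Weighted vote over the classifiers produced by all iterations."""
    score = sum(a * stump_predict(p, X) for p, a in ensemble)
    return np.where(score >= 0, 1, -1)
```

On a toy one-dimensional data set the boosted vote reproduces the labels after a few rounds; the `log(1/beta)` vote weight is exactly the "lower error rate, higher weight" rule stated above.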
Further, the rough clustering method uses the Euclidean distance to compute the distance from each sample point to the centroids, compares it with the set distance thresholds T1 and T2, screens out interference points in the data set according to the number of sample points in each category and the distances between the sample points and each centroid, and deletes the noise sample points; the method specifically comprises the following steps:
1.1.1) Randomly arrange the original sample set into a sample list L = {x1, x2, …, xn}, and set initial distance thresholds T1 and T2 (T1 > T2) according to cross-validation parameters;
1.1.2) Randomly select a sample point xi, i ∈ (1, n), from the list L as the centroid of the first Canopy cluster, and delete xi from the list;
1.1.3) Randomly select a sample point xp, p ∈ (1, n), p ≠ i, from the list L, compute the distances from xp to all centroids, and find the minimum distance Dmin:
if T2 ≤ Dmin ≤ T1, give xp a weak mark, indicating that xp belongs to the Canopy cluster achieving Dmin, and add xp to that cluster; if Dmin ≤ T2, give xp a strong mark, indicating that xp belongs to that Canopy cluster and is close to its centroid, and delete xp from the list; if Dmin > T1, let xp form a new cluster, and delete xp from the list;
1.1.4) Repeat step 1.1.3) until the list is empty; then delete the Canopy clusters that contain few sample points, and delete as noise those minority-class sample points near which the majority-class sample points outnumber the minority-class ones.
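A compact sketch of steps 1.1.1)–1.1.3) is given below, assuming Euclidean distance and the T1 > T2 thresholds defined above. The function and variable names are my own, and the pruning of small clusters in step 1.1.4) is left out for brevity.

```python
import numpy as np

def canopy_rough_cluster(points, t1, t2, seed=0):
    """Rough clustering: mark each point strong or weak relative to the
    nearest Canopy centroid, or promote it to a new centroid (Dmin > T1)."""
    assert t1 > t2, "step 1.1.1) requires T1 > T2"
    pts = np.asarray(points, dtype=float)
    order = list(np.random.default_rng(seed).permutation(len(pts)))
    centroids = [pts[order.pop(0)]]          # step 1.1.2): first centroid
    strong = [[]]                            # point indices with a strong mark
    weak = [[]]                              # point indices with a weak mark
    while order:                             # step 1.1.3), repeated (1.1.4)
        i = order.pop(0)
        dists = [float(np.linalg.norm(pts[i] - c)) for c in centroids]
        k = int(np.argmin(dists))
        dmin = dists[k]
        if dmin > t1:                        # far from every centroid
            centroids.append(pts[i])
            strong.append([]); weak.append([])
        elif dmin <= t2:                     # close to a centroid
            strong[k].append(i)
        else:                                # T2 < Dmin <= T1
            weak[k].append(i)
    return centroids, strong, weak
```

On two well-separated groups of points this yields two centroids, with every remaining point strongly marked and no weak marks.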
Further, in the oversampling method, minority-class sample points are selected at random, a point is picked among the k nearest neighbors of each selected sample point, and repeated interpolation forms several new minority-class samples that are added to the data set; the method specifically comprises the following steps:
1.2.1) Select a minority-class sample i from the data set, with feature vector xi, i ∈ {1, …, T};
1.2.2) Find the k nearest neighbors of sample xi among all T minority-class samples, denoted xi(near), near ∈ {1, …, k};
1.2.3) Randomly select one sample xi(nn) from the k neighbors, then generate a random number λ1 between (0, 1) to synthesize a new sample xi1:
xi1 = xi + λ1 · (xi(nn) − xi)    (1)
1.2.4) Repeat step 1.2.3) N times to synthesize N new samples: xinew, new ∈ {1, …, N}.
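Steps 1.2.1)–1.2.4) are a SMOTE-style interpolation following equation (1). Below is a minimal sketch, assuming Euclidean nearest neighbors and the x_new = x_i + λ(x_nn − x_i) form; the function name and parameters are mine, not the patent's.

```python
import numpy as np

def oversample_minority(minority, n_new, k=5, seed=0):
    """Synthesize n_new samples by interpolating each randomly chosen
    minority point toward one of its k nearest minority neighbors."""
    X = np.asarray(minority, dtype=float)
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))                 # step 1.2.1): pick x_i
        d = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]       # step 1.2.2): k neighbors
        nn = rng.choice(neighbors)               # step 1.2.3): pick x_i(nn)
        lam = rng.random()                       # lambda_1 in (0, 1)
        synthetic.append(X[i] + lam * (X[nn] - X[i]))  # equation (1)
    return np.array(synthetic)                   # step 1.2.4): N new samples
```

Each synthetic point lies on the segment between a minority sample and one of its neighbors, so the oversampling never leaves the local region of the minority class.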
Compared with the prior art, the invention has the advantages that:
the classifier is formed by fusing data preprocessing and reinforcement learning, a rough clustering method, an oversampling method and an intrusion detection classification method of Adaboost M1 are provided, a Canpoy cluster is used for rough clustering to obtain noise points, the noise points are removed, meanwhile, a down-sampling method is used for reducing the number of most categories, model overfitting is reduced, then a few categories of sample points are linearly synthesized by the rough clustering method, the number of the few categories is increased, imbalance among the categories is reduced, a balanced data set is formed, the balanced sample can well make up the defect that the number of the few categories of classified samples is insufficient, and the problem that important information is lost during random sampling is solved. The method is combined with an AdaBoost M1 classifier, a random forest is used as a base classifier, the performance of a feature subset is randomly selected to reduce the influence of data dimensionality on classification, a local optimal weak classifier is obtained in the process of each iteration, then the weight of a sample is updated, although training time is increased, compared with the result of an original unbalanced data set on an Adaboost M1 classifier, the accuracy of a few classes can be effectively improved, and the average missing report rate is reduced.
Drawings
FIG. 1 is a flow chart of a balanced data set construction of the present invention;
FIG. 2 is a flow diagram of the AdaBoostM1 framework;
FIG. 3 is a comparison of U2R on the ROC curve for an example of the present invention;
FIG. 4 is a comparison of R2L on the ROC curve for the examples of the present invention.
Detailed Description
The following specific embodiments are given to further illustrate the present invention.
As shown in fig. 1 to 4, a method for detecting intrusion data specifically includes the following steps:
1) Balanced-data-set acquisition step: using a rough clustering method, compute the Euclidean distance from each sample point of the training data to the cluster center points and divide the training data into several cluster subsets; treat cluster subsets that contain few sample points and lie far away as noise points, and delete the noise data; then randomly sample the different types of intrusion data and reduce their dimensionality, reducing overfitting of the models for the different intrusion-data types, and perform intra-class balancing on the training set with an oversampling method, increasing the number of samples of some classes to obtain a balanced data set.
The rough clustering method uses the Euclidean distance to compute the distance from each sample point to the centroids, compares it with the set distance thresholds T1 and T2, screens out interference points in the data set according to the number of sample points in each category and the distances between the sample points and each centroid, and deletes the noise sample points; the method specifically comprises the following steps:
1.1.1) Randomly arrange the original sample set into a sample list L = {x1, x2, …, xn}, and set initial distance thresholds T1 and T2 (T1 > T2) according to cross-validation parameters;
1.1.2) Randomly select a sample point xi, i ∈ (1, n), from the list L as the centroid of the first Canopy cluster, and delete xi from the list;
1.1.3) Randomly select a sample point xp, p ∈ (1, n), p ≠ i, from the list L, compute the distances from xp to all centroids, and find the minimum distance Dmin:
if T2 ≤ Dmin ≤ T1, give xp a weak mark, indicating that xp belongs to the Canopy cluster achieving Dmin, and add xp to that cluster; if Dmin ≤ T2, give xp a strong mark, indicating that xp belongs to that Canopy cluster and is close to its centroid, and delete xp from the list; if Dmin > T1, let xp form a new cluster, and delete xp from the list;
1.1.4) Repeat step 1.1.3) until the list is empty; then delete the Canopy clusters that contain few sample points, and delete as noise those minority-class sample points near which the majority-class sample points outnumber the minority-class ones.
In the oversampling method, minority-class sample points are selected at random, a point is picked among the k nearest neighbors of each selected sample point, and repeated interpolation forms several new minority-class samples that are added to the data set; the method specifically comprises the following steps:
1.2.1) Select a minority-class sample i from the data set, with feature vector xi, i ∈ {1, …, T};
1.2.2) Find the k nearest neighbors of sample xi among all T minority-class samples, denoted xi(near), near ∈ {1, …, k};
1.2.3) Randomly select one sample xi(nn) from the k neighbors, then generate a random number λ1 between (0, 1) to synthesize a new sample xi1:
xi1 = xi + λ1 · (xi(nn) − xi)    (1)
1.2.4) Repeat step 1.2.3) N times to synthesize N new samples: xinew, new ∈ {1, …, N}.
2) Data classification step: the classifier classifies the balanced data set and adjusts the weights of misclassified samples, improving the generalization performance of the classification model. The weak classifiers are trained over repeated iterations with the AdaBoost M1 method, and the weak classifier trained in each round participates in the next round of training. Specifically, according to the result of the previous iteration, the weights of the misclassified sample points in the training set are increased while the weights of the correctly classified sample points are decreased before entering the next iteration, which improves the classification performance of the classifier; the classifier generated in the next iteration thus focuses more on the samples misclassified by the previous one, increasing classification accuracy. Finally, the classification result is decided by voting over the classifiers generated in all iterations.
3) Classifier evaluation step: evaluate the classifier through a confusion matrix, the missing-report rate, the accuracy rate, and the ROC curve; the confusion matrix compares the classification results with the actual results and visually depicts the classifier's performance;
the missing-report rate and the accuracy rate use the following formulas:
Missing-report rate = FN/(TP + FN)
Accuracy rate = (TP + TN)/(TP + TN + FN + FP)
where FP is a sample determined to be positive but actually negative; FN is a sample determined to be negative but actually positive; TN is a sample determined to be negative that is actually negative; TP is a sample determined to be positive that is actually positive;
the ROC curve is commonly used to represent the performance of a classifier; in the best case the curve lies toward the upper-left corner, meaning a high true positive rate at a low false positive rate; the horizontal axis of the ROC curve is the false positive rate FPR and the vertical axis is the true positive rate TPR;
TPR = TP/(TP + FN)
the true positive rate is the proportion of actual positive (attack) samples that the model correctly predicts as positive;
FPR = FP/(FP + TN)
the false positive rate is the proportion of actual negative (normal) samples that the model incorrectly predicts as attacks.
Each classifier generated per iteration is weighted in the final strong classifier according to its classification error rate; the lower the classification error rate, the higher the weight.
Specifically, the experiment uses the KDD CUP 99 data set, which contains a large amount of network traffic data: more than 5,000,000 network connection records, along with about 2,000,000 test records. To keep the data volume manageable, the data set is randomly sampled at a ratio of 10%, the sampled result is used as the training set, and 10% of the test data is used as the test set; this effectively reduces the time needed to build the model with little effect on precision. The training data set used in this experiment contains 49,399 records. The data set has 41 features and 4 attack types: denial-of-service attacks (DOS), remote-to-local privilege-acquisition attacks (R2L), port-monitoring scanning attacks (PROBE), and user-to-root privilege-escalation attacks (U2R). The distribution of attack types in the data set is shown in Table 1 below:
Data type    Training data    Test set
Normal 97278 6059
DoS 391458 2298
Probe 4107 416
R2L 1126 161
U2R 52 228
TABLE 1
Because the samples in the original data set are unbalanced (the number of U2R records is far smaller than those of DOS and Normal), both the results and the generalization performance of the model are affected. Canopy is therefore used to remove noise points; the oversampling method increases the number of samples of the minority classes U2R and R2L, while the DOS and Normal types, which have many records, are down-sampled; the synthetic U2R and R2L data, the down-sampled records, and the Probe data are then mixed to form a new balanced data set. The data distribution of the balanced data set processed by this scheme is shown in Table 2 below:
Data type    Training data
Normal 9727
DoS 7829
Probe 4107
R2L 1126
U2R 1093
TABLE 2
Specifically, 10-fold cross-validation is first performed on the unbalanced training set obtained by random sampling: the data is divided into 10 equal parts, one part is used as the validation set and the other nine for training, iterating 10 times in turn; the average result of the 10 models is taken as the result of the whole model, which is then tested on the test set. The model trained on this training set is named AdaboostM1. The balanced data set is likewise subjected to 10-fold cross-validation with a random forest as the base classifier; a random forest can process high-dimensional data without feature selection and can balance errors on unbalanced data sets, so it combines well with Adaboost M1. The model trained on the balanced data set is named SAdaboostM1, and the model trained on the original data set is named AdaboostM1.
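The training setup described above (10-fold cross-validation with a random forest as the base classifier inside AdaBoost) can be approximated with scikit-learn. This is an illustrative sketch on synthetic stand-in data, not the KDD CUP 99 experiment itself; note that sklearn's AdaBoostClassifier implements the SAMME family rather than AdaBoost.M1 exactly, and the parameter sizes here are deliberately tiny.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for the preprocessed, balanced feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

base = RandomForestClassifier(n_estimators=10, random_state=0)
try:
    # scikit-learn >= 1.2 names the base learner 'estimator'
    model = AdaBoostClassifier(estimator=base, n_estimators=5, random_state=0)
except TypeError:
    # older releases used 'base_estimator'
    model = AdaBoostClassifier(base_estimator=base, n_estimators=5,
                               random_state=0)

scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
model.fit(X, y)
```

The mean of `scores` plays the role of the averaged 10-fold result described above; on the real experiment the folds would come from the balanced KDD subset instead of this synthetic matrix.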
The two models were tested separately on the test set: the SAdaboostM1 model yields the confusion matrix shown in Table 3 below, and the AdaboostM1 model the confusion matrix shown in Table 4 below. Table 5 shows the missing-report rate per category for the AdaboostM1 and SAdaboostM1 models on the test set, and Table 6 the accuracy per category.
TABLE 3 (confusion matrix of the SAdaboostM1 model; rendered as an image in the original document)
TABLE 4 (confusion matrix of the AdaboostM1 model; rendered as an image in the original document)
Classifier R2L DOS PROBE U2R
AdaboostM1 84.4 2.9 36.1 88.7
SAdaboostM1 58.6 2.6 15.5 67.2
TABLE 5
Classifier R2L DOS PROBE U2R
AdaboostM1 15.6 97.1 73.9 13.3
SAdaboostM1 30.3 97.4 81.6 42.8
TABLE 6
Tables 5 and 6 show that, after the samples are processed with this scheme, noise is reduced and the misclassification of minority-class samples caused by sample imbalance is mitigated. Without changing the original overall accuracy, the accuracy on U2R and R2L is greatly improved while the missing-report rate is reduced.
The ROC curves for U2R and R2L on the two models are compared below. The ROC curve of the minority class U2R on the SAdaboostM1 model is shown on the left of FIG. 3 (AUC = 0.9779), and on the AdaboostM1 model on the right of FIG. 3. The ROC curve of the minority class R2L on the SAdaboostM1 model is shown on the left of FIG. 4 (AUC = 0.7091), and on the AdaboostM1 model on the right of FIG. 4 (AUC = 0.6486).
In summary, attack behaviors in a network environment are varied and the numbers of collected samples of the different attack types are unbalanced, making minority-class attack behaviors difficult to identify. Canopy is therefore used to remove noise points, reducing errors when synthesizing minority-class sample points; the oversampling method artificially synthesizes data for the categories with few attack records (R2L and U2R), increasing their proportion, while the categories with many records (DOS and Normal) are reduced in number. The Adaboost M1 classifier is then trained on the balanced data set and compared with the model trained on the original data set. Without reducing the accuracy on the whole data set, this scheme improves the accuracy on the minority-class U2R attacks by 29% and on the R2L attacks by 15%, while reducing the average missing-report rate by 28%.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and improvements without departing from the spirit of the present invention, and these modifications and improvements should also be regarded as falling within the scope of the present invention.

Claims (4)

1. A detection method for intrusion data is characterized by comprising the following steps:
1) a balanced-data-set acquisition step: using a rough clustering method, computing the Euclidean distance from each sample point of the training data to the cluster center points and dividing the training data into several cluster subsets; treating cluster subsets that contain few sample points and lie far away as noise points, and deleting the noise data; then randomly sampling the different types of intrusion data and reducing their dimensionality, reducing overfitting of the models for the different intrusion-data types, and performing intra-class balancing on the training set with an oversampling method, increasing the number of samples of some classes to obtain a balanced data set;
2) a data classification step: the classifier classifies the balanced data set and adjusts the weights of misclassified samples, improving the generalization performance of the classification model; the weak classifiers are trained over repeated iterations with the AdaBoost M1 method, and the weak classifier trained in each round participates in the next round of training; specifically, according to the result of the previous iteration, the weights of the misclassified sample points in the training set are increased while the weights of the correctly classified sample points are decreased before entering the next iteration, which improves the classification performance of the classifier; the classifier generated in the next iteration thus focuses more on the samples misclassified by the previous one, increasing classification accuracy; finally, the classification result is decided by voting over the classifiers generated in all iterations;
3) a classifier evaluation step: evaluating the classifier through a confusion matrix, the missing-report rate, the accuracy rate, and the ROC curve; the confusion matrix compares the classification results with the actual results and visually depicts the classifier's performance;
the missing-report rate and the accuracy rate use the following formulas:
Missing-report rate = FN/(TP + FN)
Accuracy rate = (TP + TN)/(TP + TN + FN + FP)
where FP is a sample determined to be positive but actually negative; FN is a sample determined to be negative but actually positive; TN is a sample determined to be negative that is actually negative; TP is a sample determined to be positive that is actually positive;
the ROC curve is commonly used to represent the performance of a classifier; in the best case the curve lies toward the upper-left corner, meaning a high true positive rate at a low false positive rate; the horizontal axis of the ROC curve is the false positive rate FPR and the vertical axis is the true positive rate TPR;
TPR = TP/(TP + FN)
the true positive rate is the proportion of actual positive (attack) samples that the model correctly predicts as positive;
FPR = FP/(FP + TN)
the false positive rate is the proportion of actual negative (normal) samples that the model incorrectly predicts as attacks.
2. The detection method for intrusion data according to claim 1, characterized in that: each classifier generated per iteration is weighted in the final strong classifier according to its classification error rate; the lower the classification error rate, the higher the weight.
3. The detection method for intrusion data according to claim 1, characterized in that: the rough clustering method uses the Euclidean distance to compute the distance from each sample point to the centroids, compares it with the set distance thresholds T1 and T2, screens out interference points in the data set according to the number of sample points in each category and the distances between the sample points and each centroid, and deletes the noise sample points; the method specifically comprises the following steps:
1.1.1) Randomly arrange the original sample set into a sample list L = {x1, x2, …, xn}, and set initial distance thresholds T1 and T2 (T1 > T2) according to cross-validation parameters;
1.1.2) Randomly select a sample point xi, i ∈ (1, n), from the list L as the centroid of the first Canopy cluster, and delete xi from the list;
1.1.3) Randomly select a sample point xp, p ∈ (1, n), p ≠ i, from the list L, compute the distances from xp to all centroids, and find the minimum distance Dmin:
if T2 ≤ Dmin ≤ T1, give xp a weak mark, indicating that xp belongs to the Canopy cluster achieving Dmin, and add xp to that cluster; if Dmin ≤ T2, give xp a strong mark, indicating that xp belongs to that Canopy cluster and is close to its centroid, and delete xp from the list; if Dmin > T1, let xp form a new cluster, and delete xp from the list;
1.1.4) Repeat step 1.1.3) until the list is empty; then delete the Canopy clusters that contain few sample points, and delete as noise those minority-class sample points near which the majority-class sample points outnumber the minority-class ones.
4. A method for intrusion data detection according to claim 1, characterized in that: the oversampling method randomly selects minority-class sample points, picks a point among the k nearest minority-class neighbors of each selected sample point, and repeatedly interpolates to form a number of new minority-class samples, which are added to the data set; the method specifically comprises the following steps:
1.2.1) select a minority-class sample i from the data set, with feature vector xi, i ∈ {1, …, T};
1.2.2) find the k nearest neighbors of xi among all T minority-class samples, denoted xi(near), near ∈ {1, …, k};
1.2.3) randomly select a sample xi(nn) from the k nearest neighbors, then generate a random number λ1 between (0, 1) to synthesize a new sample xi1:
xi1 = xi + λ1·(xi(nn) − xi) (1)
1.2.4) repeat step 1.2.3) N times to synthesize N new samples: xinew, new ∈ {1, …, N}.
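The interpolation of equation (1) can be sketched as a SMOTE-style oversampler. Function name and parameters are illustrative; neighbor search is done by brute force for clarity:

```python
import numpy as np

def smote_oversample(minority, n_new, k=5, seed=0):
    """Sketch of the oversampling in claim 4 (steps 1.2.1-1.2.4).

    minority: (T, d) array of minority-class feature vectors.
    For each new sample: pick a random minority point xi, pick one of its
    k nearest minority neighbours x_nn, and interpolate per equation (1):
    x_new = xi + lambda * (x_nn - xi), lambda drawn uniformly at random.
    """
    rng = np.random.default_rng(seed)
    T = len(minority)
    k = min(k, T - 1)                       # cannot have more neighbours than samples
    out = []
    for _ in range(n_new):
        i = rng.integers(T)                 # step 1.2.1: random minority sample
        xi = minority[i]
        # step 1.2.2: k nearest minority neighbours of xi (excluding xi itself)
        d = np.linalg.norm(minority - xi, axis=1)
        neigh = np.argsort(d)[1:k + 1]
        nn = minority[rng.choice(neigh)]    # step 1.2.3: random neighbour
        lam = rng.random()                  # lambda in [0, 1), approximating (0, 1)
        out.append(xi + lam * (nn - xi))    # equation (1)
    return np.array(out)                    # step 1.2.4: N new samples
```

Each synthetic sample lies on the segment between a minority point and one of its minority-class neighbors, so the new points stay inside the minority region rather than being placed arbitrarily.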
CN202011088479.0A 2020-10-13 2020-10-13 Detection method for intrusion data Active CN112217822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011088479.0A CN112217822B (en) 2020-10-13 2020-10-13 Detection method for intrusion data


Publications (2)

Publication Number Publication Date
CN112217822A true CN112217822A (en) 2021-01-12
CN112217822B CN112217822B (en) 2022-05-27

Family

ID=74053817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011088479.0A Active CN112217822B (en) 2020-10-13 2020-10-13 Detection method for intrusion data

Country Status (1)

Country Link
CN (1) CN112217822B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080114888A1 (en) * 2006-11-14 2008-05-15 Fmr Corp. Subscribing to Data Feeds on a Network
CN103716204A (en) * 2013-12-20 2014-04-09 中国科学院信息工程研究所 Abnormal intrusion detection ensemble learning method and apparatus based on Wiener process
WO2016172262A1 (en) * 2015-04-21 2016-10-27 Placemeter, Inc. Systems and methods for processing video data for activity monitoring
CN107220732A (en) * 2017-05-31 2017-09-29 福州大学 A kind of power failure complaint risk Forecasting Methodology based on gradient boosted tree
CN110674846A (en) * 2019-08-29 2020-01-10 南京理工大学 Genetic algorithm and k-means clustering-based unbalanced data set oversampling method
CN111626336A (en) * 2020-04-29 2020-09-04 南京理工大学 Subway fault data classification method based on unbalanced data set


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANH LE ET AL.: "On Optimizing Load Balancing of Intrusion Detection and Prevention Systems", 《IEEE》 *
WULING REN ET AL.: "Application of Network Intrusion Detection Based on Fuzzy C-Means Clustering Algorithm", 《ISIITA》 *
REN WULING ET AL.: "Network Defense Strategy Based on Attack Behavior Prediction", 《Journal of Zhejiang University (Engineering Science)》 *
ZHANG XINJIE ET AL.: "Research on an Intrusion Detection Method Based on Fisher-PCA and Deep Learning", 《JDAP》 *
GUO CHAOYOU: "An Improved SMOTE Algorithm Fusing Canopy and K-means for Imbalanced Data Sets", 《Science Technology and Engineering》 *

Also Published As

Publication number Publication date
CN112217822B (en) 2022-05-27

Similar Documents

Publication Publication Date Title
CN108874927B (en) Intrusion detection method based on hypergraph and random forest
CN106973038B (en) Network intrusion detection method based on genetic algorithm oversampling support vector machine
CN110135167B (en) Edge computing terminal security level evaluation method for random forest
CN109522926A (en) Method for detecting abnormality based on comentropy cluster
CN109729091A (en) A kind of LDoS attack detection method based on multiple features fusion and CNN algorithm
CN107579846B (en) Cloud computing fault data detection method and system
CN112039903B (en) Network security situation assessment method based on deep self-coding neural network model
CN112491796A (en) Intrusion detection and semantic decision tree quantitative interpretation method based on convolutional neural network
CN113516228B (en) Network anomaly detection method based on deep neural network
CN112950445B (en) Compensation-based detection feature selection method in image steganalysis
CN111507385B (en) Extensible network attack behavior classification method
CN110011976B (en) Network attack destruction capability quantitative evaluation method and system
CN112633337A (en) Unbalanced data processing method based on clustering and boundary points
CN111753299A (en) Unbalanced malicious software detection method based on packet integration
CN114399029A (en) Malicious traffic detection method based on GAN sample enhancement
CN111695597A (en) Credit fraud group recognition method and system based on improved isolated forest algorithm
CN115577357A (en) Android malicious software detection method based on stacking integration technology
CN109981672B (en) Multilayer intrusion detection method based on semi-supervised clustering
CN113420291B (en) Intrusion detection feature selection method based on weight integration
CN113098862A (en) Intrusion detection method based on combination of hybrid sampling and expansion convolution
CN113269200A (en) Unbalanced data oversampling method based on minority sample spatial distribution
CN117478390A (en) Network intrusion detection method based on improved density peak clustering algorithm
CN112217822B (en) Detection method for intrusion data
CN111914930A (en) Density peak value clustering method based on self-adaptive micro-cluster fusion
CN113792141A (en) Feature selection method based on covariance measurement factor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant