CN116614313A - Network intrusion protection system and method based on data identification

Info

Publication number: CN116614313A
Application number: CN202310882684.1A
Authority: CN
Inventors: 谭腊梅
Original assignee: Yiyang Tianjin Intelligent Technology Co ltd
Current assignee: Yiyang Tianjin Intelligent Technology Co ltd
Priority date: 2023-07-19
Filing date: 2023-07-19
Publication date: 2023-08-18

Abstract

The invention relates to the technical field of network protection, and in particular to a network intrusion protection system and method based on data identification. A data collection and arrangement module collects and organizes existing data and sends the resulting data set to a characteristic engineering module for feature processing. A modeling module trains a model on the processed data set using a machine learning algorithm, and a real-time monitoring and response module uses the trained model to monitor actual network traffic. Data that was judged to be normal flow data but still allowed the system to be intruded upon goes through data collection, feature processing and model training again, so that the model is continuously updated and improved and can adaptively identify novel intrusion behavior.

Description

Network intrusion protection system and method based on data identification

Technical Field

The invention relates to the technical field of network protection, in particular to a network intrusion protection system and method based on data identification.

Background

A network intrusion protection system uses network security equipment such as firewalls to filter data packets entering and leaving the network, preventing unauthorized access and malicious activity. However, traditional network intrusion protection systems suffer from problems such as low identification accuracy and a lack of automatic repair. A network intrusion protection system and method based on data identification is therefore needed.

Disclosure of Invention

The invention aims to provide a network intrusion protection system and method based on data identification, so as to solve the problems in the background technology.

In order to achieve the above purpose, the present invention provides the following technical solutions: a network intrusion protection system and method based on data identification comprises a data collection and arrangement module, a characteristic engineering module, a modeling module and a real-time monitoring and response module, wherein:

the data collection and arrangement module collects normal flow data and intrusion behavior data and sends them to the characteristic engineering module; the characteristic engineering module performs feature processing on the normal flow data and the intrusion behavior data and sends the processed data features to the modeling module; the modeling module trains a model on the processed data features using a machine learning algorithm; the real-time monitoring and response module receives the trained model, analyzes the flow data entering the system according to the model, distinguishes normal flow data from intrusion behavior data according to the analysis result, sends intrusion behavior data that was not successfully identified to the data collection and arrangement module for data analysis and modeling again, and automatically repairs the damaged environment at the intrusion position according to the intrusion position.

As a further improvement of the technical scheme, the data collection and arrangement module comprises a network traffic collection unit and a database unit, wherein the network traffic collection unit is used for dividing traffic into known normal traffic data and known intrusion behavior data; the database unit is used for designing the characteristics of the flow data and storing the flow data collected by the network flow collection unit in the database to form a data set.

As a further improvement of the technical scheme, the characteristic engineering module comprises a data cleaning unit, a characteristic conversion unit and a characteristic standardization unit, wherein the data cleaning unit is used for receiving a data set in a database unit and processing missing values and repeated values of the data set; the characteristic conversion unit is used for converting the characteristics of the data set after being cleaned by the data cleaning unit into a numerical form; the characteristic standardization unit is used for standardizing the characteristics of the data set.

As a further improvement of the technical scheme, the modeling module comprises a data splitting unit, a model training unit, a model evaluation unit and a model tuning unit, the data splitting unit receives a data set standardized by the characteristic standardization unit and is used for dividing the characteristics and the labels of the data set, eighty percent of the characteristic data and the label data are used as training sets, twenty percent are used as test sets, the training sets are sent to the model training unit, and the test sets are sent to the model evaluation unit; the model training unit performs model training according to the training set by using a support vector machine algorithm in a machine learning algorithm, and sends the trained model to the model evaluation unit; the model evaluation unit receives the test set and the model sent by the data splitting unit and the model training unit respectively and is used for calculating the accuracy of the model on the test set, wherein:

when the accuracy rate is more than 90 percent, the model is sent to a monitoring unit in the real-time monitoring and response module;

and when the accuracy rate is less than or equal to 90 percent, sending the model to a model tuning unit for tuning, wherein the model tuning unit receives the model sent by the model evaluation unit and is used for adjusting parameters in the model.

As a further improvement of the technical scheme, the real-time monitoring and responding module comprises a monitoring unit, a responding unit and a system state unit, wherein the monitoring unit receives a model sent by a model evaluation unit and is used for judging whether the network flow entering the system is normal flow data or intrusion behavior data through the model; the response unit is used for intercepting intrusion behavior data from entering the system; the system state unit is used for detecting whether the system is attacked.

The second object of the present invention is to provide a network intrusion protection method based on data identification, which uses any one of the above network intrusion protection systems based on data identification and includes the following steps:

S1, collecting known normal network flow data and known intrusion behavior data in log records;

S2, carrying out feature processing on a data set formed by the normal network flow data and the known intrusion behavior data;

S3, carrying out model training on the feature-processed data set by adopting a support vector machine algorithm;

S4, integrating the trained model into the system, the system using the model to predict the type of the network traffic;

S5, intercepting traffic that is judged to be intrusion behavior data, and allowing traffic that is judged to be normal flow data to enter the system;

S6, analyzing and processing the flow data which was judged to be normal flow data but caused the system to be intruded upon, so that the system automatically repairs the intruded position.

As a further improvement of the technical scheme, the characterizing processing of the data set formed by the normal network traffic data and the known intrusion behavior data specifically includes:

converting the IP address into a digital form: splitting an IP address into four bytes, and converting each byte into an integer form;

converting the timestamp into a digital form: i.e., converting the time stamp into seconds;

converting port numbers into numerical form: that is, all non-duplicate port numbers appearing in the data are sequentially found, and a unique integer label is assigned to each non-duplicate port number.

As a further improvement of the technical scheme, intercepting traffic that is judged to be intrusion behavior data and allowing traffic that is judged to be normal flow data to enter the system specifically comprises the following steps:

when the output result of the model is 1, judging that the network flow data at the moment is intrusion behavior data, intercepting the intrusion behavior data and sending the intrusion behavior data to a log record;

when the output result of the model is 0, the network flow data at the moment is judged to be normal flow data, and relevant flow data is recorded in the log record and the network data is continuously monitored.

As a further improvement of the technical scheme, the analyzing and processing are performed on the flow data which is judged to be normal flow data but causes the system to be invaded, so that the system automatically repairs the invaded position, and the method specifically comprises the following steps:

when the system detects that the system is attacked, firstly, extracting intrusion behavior data in a log record according to an intrusion position, and sending the intrusion behavior data to a protection system for further data processing and model training;

isolating the invasive position from other files, and removing the invasive position from the production environment by temporarily disconnecting the network and cutting off access rights;

the backup data of the system is then used to restore the affected location and invoke antivirus software to identify and delete malicious files and malicious access points on the affected location.

Compared with the prior art, the invention has the beneficial effects that:

the network intrusion protection system and method based on data identification use machine learning to identify and prevent network intrusion behavior, monitoring network data entering the system in real time and analyzing and judging it according to the characteristics of normal network data and intrusion behavior data. First, the data collection and arrangement module collects known normal flow data and known intrusion behavior data and sends the data to the characteristic engineering module for feature processing. The processed data set is sent to the modeling module, which trains a model on the data set using a machine learning algorithm. The trained model is sent to the real-time monitoring and response module for actual network flow monitoring, and data judged to be intrusion behavior is intercepted. Data that was judged to be normal flow data but still allowed the system to be intruded upon is collected again and sent to the data collection and arrangement module for feature processing and model training again, so that the model is continuously updated and improved and can adaptively identify novel intrusion behavior;

in order to reduce the need for manual intervention and to speed up both the response to intrusion behavior and the recovery of the intrusion position, the system determines the affected position from the intrusion position in the intrusion behavior data, isolates the affected position from other files, and removes it from the production environment by temporarily disconnecting the network and cutting off access rights. The system then restores the affected position from its backup data, invokes antivirus software to identify and delete malicious files and malicious access points at the affected position, and, to address possible security vulnerabilities, installs and applies the latest security patches and updates to repair disclosed vulnerabilities, thereby realizing automatic repair of the affected position.

Drawings

FIG. 1 is a schematic diagram of the overall module of the present invention;

FIG. 2 is a schematic diagram of each module unit in the present invention;

FIG. 3 is a schematic flow chart of the overall method of the present invention;

in the figure: 100. a data collection and arrangement module; 101. a network traffic collection unit; 102. a database unit; 200. a feature engineering module; 201. a data cleaning unit; 202. a feature conversion unit; 203. a feature normalization unit; 300. a modeling module; 301. a data splitting unit; 302. a model training unit; 303. a model evaluation unit; 304. a model tuning unit; 400. a real-time monitoring and responding module; 401. a monitoring unit; 402. a response unit; 403. a system state unit.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1-3, the present invention provides a technical solution: a network intrusion protection system and method based on data identification comprises a data collection and arrangement module 100, a feature engineering module 200, a modeling module 300 and a real-time monitoring and response module 400.

S1, firstly, collecting traffic by using the network traffic collection unit 101 in the data collection and arrangement module 100, and dividing the traffic into known normal traffic data and known intrusion behavior data according to log records, wherein:

normal flow data: selecting a time period which runs normally and does not have an intrusion event to acquire data representing normal network traffic;

intrusion behavior data: based on the existing security threat intelligence or historical intrusion event data, known intrusion behavior data is obtained.

A network packet capture tool captures data packets on the traffic monitoring device and stores them in a table in the database unit 102. The table contains columns for the source IP address, destination IP address, port number, timestamp, communication frequency and tag; the tag of normal flow data is set to 0 and the tag of intrusion behavior data is set to 1.
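The table layout described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table name, column names, column types and the choice of SQLite are assumptions, since the source lists only which fields the table holds.

```python
import sqlite3

# Hypothetical schema: names and types are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS traffic (
    src_ip    TEXT    NOT NULL,  -- source IP address
    dst_ip    TEXT    NOT NULL,  -- destination IP address
    port      INTEGER NOT NULL,  -- port number
    ts        TEXT    NOT NULL,  -- timestamp as captured
    comm_freq REAL    NOT NULL,  -- communication frequency
    label     INTEGER NOT NULL CHECK (label IN (0, 1))  -- 0 = normal, 1 = intrusion
)
"""


def open_traffic_db(path=":memory:"):
    """Open (or create) the database and make sure the traffic table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_packet(conn, src_ip, dst_ip, port, ts, comm_freq, label):
    """Insert one captured packet record with its 0/1 tag."""
    conn.execute(
        "INSERT INTO traffic VALUES (?, ?, ?, ?, ?, ?)",
        (src_ip, dst_ip, port, ts, comm_freq, label),
    )
    conn.commit()


conn = open_traffic_db()
store_packet(conn, "192.168.1.10", "10.0.0.5", 443, "2023-07-19 08:00:00", 12.0, 0)
store_packet(conn, "203.0.113.7", "10.0.0.5", 22, "2023-07-19 08:00:03", 250.0, 1)
rows = conn.execute(
    "SELECT label, COUNT(*) FROM traffic GROUP BY label ORDER BY label"
).fetchall()
```

The CHECK constraint is one way to enforce that only the two tag values named in the text can be stored.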

S2, a data cleaning unit 201 in the feature engineering module 200 receives the data set of the normal flow data and the intrusion behavior data in the database unit 102, cleans the data set, and processes the missing value and the repeated value in the data set, wherein:

processing the missing values: when the number of missing values is less than 2 percent of the total number, directly deleting the row containing the missing values; otherwise, filling the missing value by using the value with the highest occurrence frequency;

processing the repeated values: the row containing the repeated value is deleted directly.
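The cleaning rules above can be sketched in plain Python as follows. Several details are assumptions made for illustration: missing values are represented by None, the 2 percent ratio is taken as rows with a missing value divided by total rows, and for repeated rows the first occurrence is kept while later copies are dropped.

```python
from collections import Counter


def clean_rows(rows, missing_ratio_threshold=0.02):
    """Handle missing values, then remove repeated rows.

    rows: list of tuples, with None marking a missing value (assumed encoding).
    """
    total = len(rows)
    rows_with_missing = sum(1 for r in rows if any(v is None for v in r))

    if total and rows_with_missing / total < missing_ratio_threshold:
        # Few missing values: delete the affected rows outright.
        rows = [r for r in rows if all(v is not None for v in r)]
    else:
        # Otherwise fill each gap with the most frequent value in its column.
        n_cols = len(rows[0]) if rows else 0
        modes = []
        for c in range(n_cols):
            observed = [r[c] for r in rows if r[c] is not None]
            modes.append(Counter(observed).most_common(1)[0][0] if observed else None)
        rows = [
            tuple(modes[c] if v is None else v for c, v in enumerate(r))
            for r in rows
        ]

    # Remove repeated rows, keeping the first occurrence and the original order.
    seen = set()
    unique = []
    for r in rows:
        if r not in seen:
            seen.add(r)
            unique.append(r)
    return unique
```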

The feature conversion unit 202 receives the data set cleaned by the data cleaning unit 201, and in order for the data to be more suitable for the use of the machine learning algorithm, it is required to perform feature conversion on features in the data set, which is converted as follows:

IP address translation: converting the IP address into a numerical form, namely splitting the IP address into four bytes, and converting each byte into an integer form;

timestamp conversion: converting the time stamp into a digital form, namely converting the time stamp into seconds;

port number conversion: the port numbers are converted into numerical form, i.e. all non-duplicate port numbers appearing in the data are found out in sequence, and each non-duplicate port number is assigned a unique integer label, e.g. port number data is [80, 443, 22, 80, 22, 8080], and the converted integer labels are [0, 1, 2, 0, 2, 3].
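The three conversions above can be sketched as small Python helpers. The function names are hypothetical, IPv4 dotted-decimal input is assumed, and "seconds" is interpreted here as seconds since the Unix epoch in UTC, which the source does not specify.

```python
from datetime import datetime, timezone


def ip_to_octets(ip):
    """Split a dotted IPv4 address into its four bytes as integers."""
    parts = [int(p) for p in ip.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return parts


def timestamp_to_seconds(ts, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a timestamp string to whole seconds (assumed: Unix epoch, UTC)."""
    dt = datetime.strptime(ts, fmt).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def encode_ports(ports):
    """Assign each distinct port an integer label in order of first appearance."""
    labels = {}
    encoded = []
    for p in ports:
        if p not in labels:
            labels[p] = len(labels)
        encoded.append(labels[p])
    return encoded
```

Applied to the port list from the text, `encode_ports([80, 443, 22, 80, 22, 8080])` returns `[0, 1, 2, 0, 2, 3]`.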

The feature conversion unit 202 sends the converted data to the feature normalization unit 203, and the feature normalization unit 203 performs normalization processing on each feature to ensure that the features have similar scales and to prevent any single feature from having an excessive influence on the model, wherein the normalization processing uses a mean and standard deviation (z-score) method: the mean and standard deviation of each column are calculated, then the mean is subtracted from each feature value and the result is divided by the standard deviation.
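A minimal sketch of this column-wise standardization, using only the Python standard library. Two choices are assumptions not stated in the source: the population standard deviation is used, and a constant column (standard deviation of zero) is mapped to 0.0 to avoid division by zero.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Return rows with every column rescaled to (value - mean) / std."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [(v - mu) / sigma if sigma else 0.0 for v, (mu, sigma) in zip(row, stats)]
        for row in rows
    ]
```

For example, the column `[1.0, 3.0]` has mean 2.0 and standard deviation 1.0, so it becomes `[-1.0, 1.0]`.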

The feature normalization unit 203 sends the processed data to the data splitting unit 301 in the modeling module 300, and the data splitting unit 301 splits the features and the labels of the data and uses eighty percent of the feature data and the label data as a training set and twenty percent as a test set. The data splitting unit 301 sends the training set to the model training unit 302 for training, and sends the test set to the model evaluation unit 303 for detection evaluation.
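The eighty/twenty split can be sketched as below. The shuffle and the fixed seed are assumptions added for illustration; the source only states the proportions.

```python
import random


def split_dataset(features, labels, train_fraction=0.8, seed=0):
    """Split features and labels into training and test parts."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # assumed: shuffle before splitting
    cut = int(len(indices) * train_fraction)
    train_idx, test_idx = indices[:cut], indices[cut:]
    x_train = [features[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, y_train, x_test, y_test
```

The same index permutation is applied to features and labels so that each sample stays paired with its tag.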

S3, the model training unit 302 performs model training according to the training set by using a support vector machine algorithm, and sends the trained model to the model evaluation unit 303, wherein the process of training the training set by using the support vector machine algorithm is as follows:

selection of a kernel function: a Gaussian radial basis function kernel is selected, because the characteristics of network data are complex and a purely linear kernel function cannot capture nonlinear relations, whereas the Gaussian radial basis function can;

setting hyperparameters: the C parameter of the support vector machine algorithm is set to 0.01, and the gamma parameter is set to 1/(feature dimension × feature variance), wherein: the C parameter controls the degree of penalty for misclassified samples, and a smaller C tolerates more misclassifications and can make the model too simple; the gamma parameter controls the range of influence of a single training sample, and a larger gamma value makes the decision boundary more tortuous and fits the training set closely, but is prone to overfitting;

constructing an objective function: the goal of the support vector machine algorithm is to find a decision boundary (hyperplane) that separates samples of different classes to the maximum extent with a small generalization error. To achieve this goal, an objective function is defined: minimize $\frac{1}{2}\lVert w \rVert^2 + C \sum_i \max\left(0,\ 1 - y_i\left(w^T \phi(x_i) + b\right)\right)$, wherein: minimize is the keyword indicating minimization; $\lVert w \rVert^2$ is the square of the norm of the weight vector $w$ and controls the complexity of the model; $C$ is a regularization parameter that balances the regularization term against the penalty for misclassified samples; $\sum_i$ denotes summation over all samples; $y_i$ is the true class label of the i-th sample; $\phi(x_i)$ is the new feature vector obtained by passing the feature vector $x_i$ through a mapping function $\phi$; and $b$ is a bias term;

training the model and selecting support vectors: the objective function is solved by using a gradient descent algorithm to find an optimal solution, and the samples in the training data set that play a decisive role, namely the support vectors, are selected; the support vectors are the sample points closest to the hyperplane, and they determine the position and shape of the hyperplane;

classification and result output: one side of the hyperplane is normal flow data, for which the output result is 0, and the other side of the hyperplane is intrusion behavior data, for which the output result is 1.
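The training procedure above can be illustrated with a deliberately simplified sketch: a soft-margin objective with hinge loss, minimized by (sub)gradient descent. To stay short and dependency-free, this sketch assumes the mapping function is the identity (a linear kernel rather than the kernel described in the text), maps the 0/1 tags to -1/+1 for the hinge loss, and uses an illustrative learning rate, epoch count and C value that do not come from the source.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hinge_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum(max(0, 1 - y_i * (w . x_i + b))), y_i in {-1, +1}."""
    reg = 0.5 * dot(w, w)
    loss = sum(max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y))
    return reg + C * loss


def train_linear_svm(X, labels01, C=1.0, lr=0.01, epochs=200):
    """Minimize the hinge objective by batch subgradient descent."""
    y = [1 if label == 1 else -1 for label in labels01]
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5 * ||w||^2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:  # sample violates the margin
                for j, xij in enumerate(xi):
                    grad_w[j] -= C * yi * xij
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return 1 (intrusion side of the hyperplane) or 0 (normal side)."""
    return 1 if dot(w, x) + b >= 0 else 0


# Toy, linearly separable data for illustration only.
X_demo = [[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5], [2.0, 1.0], [1.0, 2.0], [1.5, 1.5]]
y_demo = [0, 0, 0, 1, 1, 1]
w_demo, b_demo = train_linear_svm(X_demo, y_demo)
```

A kernelized version would replace the explicit weight vector with a weighted sum of kernel evaluations against the training samples; that is beyond this sketch.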

The model evaluation unit 303 performs detection evaluation on the test set by using a trained model, which is specifically as follows:

The number of correctly predicted samples is counted by comparing the label array predicted by the model with the label array of the test set; this number is divided by the number of samples in the test set and multiplied by 100 percent to obtain the accuracy. If the accuracy is more than 90 percent, the model has been trained successfully and is sent to the monitoring unit 401 in the real-time monitoring and response module 400 for application; otherwise, the model is sent to the model tuning unit 304 for tuning.
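The accuracy calculation and the 90 percent routing rule can be sketched as follows. The function names and the "deploy"/"tune" return values are hypothetical labels for the two destinations named in the text.

```python
def accuracy_percent(predicted, actual):
    """Percentage of test samples whose predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label arrays must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual) * 100.0


def route_model(accuracy, threshold=90.0):
    """Send the model on for monitoring only when accuracy exceeds the threshold."""
    return "deploy" if accuracy > threshold else "tune"
```

Note that an accuracy of exactly 90 percent is routed to tuning, matching the "less than or equal to 90 percent" condition stated earlier in the document.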

The model tuning unit 304 adjusts the hyperparameters C and gamma of the model and finds the best hyperparameter combination by using cross-validation and grid search, and the steps are as follows:

firstly, the value ranges of C and gamma are each defined as a range of consecutive powers of 10;

next, a hyperparameter combination space is created by the grid search method, listing all possible combinations of C and gamma, with one parameter combination for each pairing of a C value with a gamma value;

for each parameter combination, cross-validation is used to evaluate the performance of the model. Cross-validation divides the data set into k subsets and then iterates k times: each time, k-1 subsets are used as the training set and the remaining subset is used as the test set. For each parameter combination, the accuracy of the model on the test set is calculated, and the parameter combination with the best performance is selected according to the cross-validation result;

finally, the model is sent to the model training unit 302 for further training until the evaluation is qualified, that is, the accuracy is greater than 90%, and the trained model is sent to the monitoring unit 401 in the real-time monitoring and response module 400.
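The tuning steps above can be sketched as a generic grid search with k-fold cross-validation. Everything named here is hypothetical: `fit_and_score` stands for any caller-supplied function that trains a model with the given C and gamma on the training part and returns its accuracy on the held-out part, the exponent bounds are illustrative, and the interleaved assignment of samples to folds is an assumption.

```python
from itertools import product


def power_of_ten_range(low_exp, high_exp):
    """Consecutive powers of 10 from 10**low_exp to 10**high_exp inclusive."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]


def k_fold_indices(n_samples, k):
    """Yield (train_indices, held_out_indices) for each of the k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def grid_search(X, y, c_values, gamma_values, k, fit_and_score):
    """Return the (C, gamma) pair with the best mean cross-validation score."""
    best_params, best_score = None, float("-inf")
    for C, gamma in product(c_values, gamma_values):
        scores = []
        for train_idx, held_idx in k_fold_indices(len(X), k):
            scores.append(
                fit_and_score(
                    [X[i] for i in train_idx], [y[i] for i in train_idx],
                    [X[i] for i in held_idx], [y[i] for i in held_idx],
                    C, gamma,
                )
            )
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = (C, gamma), mean_score
    return best_params, best_score
```

With five values each for C and gamma, the grid holds 25 combinations, and each is evaluated k times, so the cost grows quickly as the ranges widen.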

S4, a monitoring unit 401 in the real-time monitoring and responding module 400 receives the trained model, monitors network traffic entering the system by using a traffic monitoring device, feeds that traffic into the trained model, and judges the category of the network traffic according to the model's output result, wherein:

S5, when the output result of the model is 1, judging that the network flow data at the moment is intrusion behavior data, and sending a warning instruction to the response unit 402, wherein the response unit 402 intercepts the intrusion behavior data according to the warning instruction and sends the intrusion behavior data to a log record;
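The interception decision can be sketched as a small dispatch function. The handling of output 0 follows the summary of the method given earlier in this document (the traffic is logged and allowed while monitoring continues); the function name, the list-based log and the packet representation are assumptions for illustration.

```python
def handle_prediction(model_output, packet, log):
    """Log the packet and return True if it may enter the system.

    model_output: 1 for intrusion behavior data, 0 for normal flow data.
    log: a list standing in for the log record (assumed representation).
    """
    if model_output == 1:
        log.append(("intercepted", packet))
        return False
    if model_output == 0:
        log.append(("allowed", packet))
        return True
    raise ValueError(f"unexpected model output: {model_output!r}")
```

Logging both outcomes matters for the later repair step, which extracts traffic from the log record after an intrusion has been detected.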

S6, a system state unit 403 in the real-time monitoring and responding module 400 detects whether the system is attacked. When the system state unit 403 detects an attack, it sends an alarm instruction and the intrusion position to the responding unit 402. The responding unit 402 extracts the corresponding intrusion behavior data from the log record according to the intrusion position and sends it to the network flow collecting unit 101 in the data collecting and arranging module 100, where data processing and model building are carried out again. The model can thus be continuously updated and improved according to real-time flow data and intrusion detection results, so that novel intrusion behavior can be adaptively identified. For automated repair of an affected location, the response unit 402 performs analysis processing on the intrusion behavior data, which includes the following steps:

first, the affected position is determined according to the intrusion position sent by the system state unit 403, in order to prevent further damage and propagation and provide a repaired security environment, the affected position is isolated from other files, and the affected position is moved out of the production environment by temporarily disconnecting the network and cutting off the access rights;

the backup data of the system is then used to restore the affected location, antivirus software is invoked to identify and delete malicious files and malicious access points on the affected location, and, to address possible security vulnerabilities, the latest security patches and updates are installed and applied to repair disclosed vulnerabilities.
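The order of the response steps can be sketched as a simple orchestration function. All names are hypothetical: the `actions` mapping stands in for whatever isolation, backup, antivirus and patching tooling a deployment actually has, and log entries are assumed to be dictionaries carrying a "location" key.

```python
def remediate(intrusion_location, log_records, actions):
    """Run the response steps for one detected intrusion, in order.

    actions: dict of caller-supplied callables (hypothetical interface) with keys
    "retrain", "isolate", "restore_backup", "scan_and_clean", "apply_patches".
    """
    # Extract the logged traffic tied to the intrusion position and feed it
    # back for data processing and model training.
    evidence = [r for r in log_records if r.get("location") == intrusion_location]
    actions["retrain"](evidence)

    # Isolate the affected position (disconnect network, cut access rights).
    actions["isolate"](intrusion_location)

    # Restore from backup, then remove malicious files and access points.
    actions["restore_backup"](intrusion_location)
    actions["scan_and_clean"](intrusion_location)

    # Install the latest security patches and updates.
    actions["apply_patches"](intrusion_location)
    return evidence
```

Passing the steps in as callables keeps the sequence testable without touching a real production environment.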

The real-time monitoring and responding module 400 reduces the need of manual intervention through timely and automatic processing of the intrusion behavior data, and accelerates the response to the intrusion behavior processing and the recovery speed of the intrusion position.

The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that the above-described embodiments and descriptions are only preferred embodiments of the present invention, and are not intended to limit the invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A network intrusion protection system based on data identification, characterized by comprising: a data collection and arrangement module (100), a feature engineering module (200), a modeling module (300) and a real-time monitoring and response module (400), wherein:

the data collection and arrangement module (100) collects normal flow data and intrusion behavior data and sends them to the characteristic engineering module (200); the characteristic engineering module (200) performs feature processing on the normal flow data and the intrusion behavior data and sends the processed data features to the modeling module (300); the modeling module (300) trains a model on the processed data features using a machine learning algorithm; the real-time monitoring and response module (400) receives the trained model, analyzes the flow data entering the system according to the model, distinguishes normal flow data from intrusion behavior data according to the analysis result, sends intrusion behavior data that was not successfully identified to the data collection and arrangement module (100) for data analysis and modeling again, and automatically repairs the damaged environment at the intrusion position according to the intrusion position.

2. The data identification-based network intrusion prevention system according to claim 1, wherein: the data collection and arrangement module (100) comprises a network traffic collection unit (101) and a database unit (102), wherein the network traffic collection unit (101) is used for dividing traffic into known normal traffic data and known intrusion behavior data; the database unit (102) is used for designing the characteristics of the flow data and storing the flow data collected by the network flow collection unit (101) in a database to form a data set.

3. The data identification based network intrusion prevention system according to claim 2, wherein: the feature engineering module (200) comprises a data cleaning unit (201), a feature conversion unit (202) and a feature standardization unit (203), wherein the data cleaning unit (201) receives a data set in the database unit (102) and is used for processing missing values and repeated values of the data set; the feature conversion unit (202) is used for converting the feature of the data set cleaned by the data cleaning unit (201) into a digital form; the feature normalization unit (203) is configured to normalize the data set features.

4. A data identification based network intrusion protection system according to claim 3 wherein: the modeling module (300) comprises a data splitting unit (301), a model training unit (302), a model evaluation unit (303) and a model tuning unit (304), wherein the data splitting unit (301) receives a data set standardized by the characteristic standardization unit (203) and is used for dividing the characteristics and the labels of the data set, eighty percent of the characteristic data and the label data are used as training sets, twenty percent are used as test sets, the training sets are sent to the model training unit (302), and the test sets are sent to the model evaluation unit (303); the model training unit (302) performs model training according to the training set by using a support vector machine algorithm in a machine learning algorithm, and sends the trained model to the model evaluation unit (303); the model evaluation unit (303) receives the test set and the model respectively sent by the data splitting unit (301) and the model training unit (302) and is used for calculating the accuracy of the model on the test set, wherein:

when the accuracy rate is more than 90 percent, the model is sent to a monitoring unit (401) in the real-time monitoring and responding module (400);

and when the accuracy rate is less than or equal to 90 percent, sending the model to a model tuning unit (304) for tuning, wherein the model tuning unit (304) receives the model sent by the model evaluation unit (303) and is used for adjusting parameters in the model.

5. The data identification based network intrusion prevention system according to claim 4, wherein: the real-time monitoring and responding module (400) comprises a monitoring unit (401), a responding unit (402) and a system state unit (403), wherein the monitoring unit (401) receives a model sent by the model evaluating unit (303) and is used for judging whether the network traffic entering the system is normal traffic data or intrusion behavior data through the model; the response unit (402) is used for intercepting intrusion behavior data from entering the system; the system state unit (403) is used for detecting whether the system is attacked.

6. A method of using the data recognition-based network intrusion protection system according to claim 5, the method comprising the steps of:

7. The method of using a data recognition based network intrusion prevention system according to claim 6, wherein: the data set formed by the normal network flow data and the known intrusion behavior data is characterized, and the method specifically comprises the following steps:

converting the timestamp into a digital form: converting the time stamp into seconds;

converting port numbers into numerical form: and sequentially finding out all non-repeated port numbers appearing in the data, and allocating a unique integer label for each non-repeated port number.

8. The method of using a data recognition based network intrusion prevention system according to claim 6, wherein: intercepting the data which is judged to be the flow of the intrusion behavior, and judging that the data of the normal flow enters the system, wherein the method specifically comprises the following steps of:

9. The method of using a data recognition based network intrusion prevention system according to claim 6, wherein: the analyzing and processing the flow data which is judged to be normal flow data but causes the system to be invaded, so that the system automatically repairs the invaded position, and the method specifically comprises the following steps: