CN112055007A

CN112055007A - Software and hardware combined threat situation perception method based on programmable nodes

Info

Publication number: CN112055007A
Application number: CN202010889682.1A
Authority: CN
Inventors: 程光; 赵玉宇; 吴桦; 袁帅; 张慰慈
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2020-08-28
Filing date: 2020-08-28
Publication date: 2020-12-08
Anticipated expiration: 2040-08-28
Also published as: CN112055007B

Abstract

The invention provides a software and hardware combined threat situation perception method based on programmable nodes, comprising the following steps: summarizing the flow information, extracting summary information from the message flow and transmitting the summary information to a database; the database calculates the entropy of each type of summary information stored by the processor and reports the calculation results to the decision server; the decision server trains a machine learning classifier model on a training set to obtain a classifier capable of identifying the entropy values of threat traffic, the training set being constructed by mixing generated abnormal traffic with normal traffic; the decision server receives the message summary entropy results transmitted from the database, classifies them with the trained classifier, identifies whether the traffic is threat traffic, and displays the detailed threat information through a dynamic interface; and the classifier is updated according to elapsed time and the messages received. The method can accurately and effectively identify threat traffic in the network and improve network security.

Description

Software and hardware combined threat situation perception method based on programmable nodes

Technical Field

The invention belongs to the technical field of network space security, relates to a situation awareness technology for perceiving network environment threats, and particularly relates to a software and hardware combined threat situation awareness method based on programmable nodes.

Background

With the rapid development of computer technology and the steady improvement of hardware manufacturing, networks have become an important foundation and driving force of the information age. Networks are growing in scale, their topologies are becoming more complex and the types of data they carry are multiplying, all of which pose challenges for network security technology. To keep the network space environment secure, network security situation awareness has become one of the research hot spots in the network security field.

Network security situation awareness models studied abroad mainly include the JDL model, the Endsley model and the Tim Bass model. Those studied in China mainly include a Netflow-based network security situation awareness model, an information-fusion-based network security situation assessment model and a security situation awareness model for large-scale networks, but each of these models has shortcomings and none performs ideally:

(1) Netflow-based network security situation awareness model

The Netflow-based network security situation awareness model consists of stream data acquisition, event response, situation display and other parts. Because the system processes massive amounts of data and concentrates on visualizing the security situation, its performance optimization requires further study.

(2) Network security situation assessment model based on information fusion

By introducing an improved D-S evidence theory, this security situation assessment model fuses vulnerability information and service information from multiple data sources, judges the security situation in the network environment by situation element fusion and node situation fusion, and predicts the development trend of the network security situation by time series analysis. However, the model assumes that the log information of every node is accurate and error-free, and it cannot perceive the threat situation in the network.

(3) Security situation awareness model for large-scale network

Because the original network security situation awareness system NSSAS (Network Security Situation Awareness System) had limited processing capacity, the network security situation awareness model YHSSAS was proposed, consisting of four parts: data integration, association analysis, system indexes and event prediction. Its shortcomings are nevertheless evident: the model targets threat situation analysis of large-scale networks and cannot effectively distinguish threat traffic, threat types and similar information within the network.

Disclosure of Invention

Targeting network edge devices, the invention provides a network threat situation awareness method that, through programmable hardware devices, uses machine learning and multi-core CPU optimized database scheduling to identify threat information in the traffic reported by the device. The method can effectively perceive the threat situation in the network and identify the attack types present in the traffic.

In order to achieve the purpose, the invention provides the following technical scheme:

a software and hardware combined threat situation perception method based on programmable nodes comprises the following steps:

(1) the programmable device uses 4 CPUs to process the traffic: CPU No. 0 receives and forwards the passing traffic, and the other three CPUs summarize the traffic information, extracting the source address, destination address, source port, destination port, protocol type, message length and the TCP control flags URG, ACK, PSH, RST, SYN and FIN from the message flow, and transmit the summary information to the database;

(2) the database calculates the entropy of each type of summary information stored by the processor and reports the calculation results to the decision server;

(3) the decision server trains a machine learning classifier model on a training set to obtain a classifier capable of identifying the entropy values of threat traffic; the training set is constructed by mixing generated abnormal traffic with normal traffic; the abnormal traffic includes host scanning traffic, port scanning traffic, SYN Flood traffic, ACK Flood traffic, UDP Flood traffic and HTTP Flood traffic;

(4) the decision server receives the message summary entropy results transmitted from the database, classifies them with the classifier trained in step (3), identifies whether the traffic is threat traffic, and displays the detailed threat information through a dynamic interface; the classifier is updated according to elapsed time and the messages received, to guard against concept drift in the traffic; abnormal traffic is then cleaned based on IP address; the identified threat traffic types include port scanning, host scanning, TCP_ack, SYN flooding, UDP flooding and HTTP flooding.

Further, the step (1) specifically includes the following sub-steps:

(1.1) CPU No. 0 receives and forwards the data messages;

(1.2) processor task scheduling is optimized: task time is divided into one-second intervals, and every three seconds form one processor task period; in each task period, CPU No. 1 receives messages in the first second; in the second second, CPU No. 1 summarizes the message information, extracting the source address, destination address, source port, destination port, protocol type, message length and the TCP control flags URG, ACK, PSH, RST, SYN and FIN from the messages, while CPU No. 2 receives messages; in the third second, CPU No. 1 uploads the summary information to the database, CPU No. 2 summarizes its messages, and CPU No. 3 starts receiving messages;

(1.3) the programmable node creates a database thread, connects to the database, empties the selected table, reads the calculated entropy information from the database, stores it in the table and displays it in the table.

Further, when the message information is summarized in step (1.2), each processor first determines the IP version of the received message's address (IPv4 or IPv6) and then extracts the message summary according to that address type.
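As a minimal sketch of this version check and field extraction, the following Python assumes raw IP packets without a link-layer header and uses only the standard library. The function name `summarize_packet` and the dictionary layout are illustrative assumptions, not part of the invention, and IPv6 extension headers are not handled.

```python
import socket
import struct

TCP, UDP = 6, 17
# Bit masks of the six TCP control flags named in the text.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}

def summarize_packet(pkt: bytes) -> dict:
    """Return the summary fields of one raw IP packet (hypothetical layout)."""
    version = pkt[0] >> 4
    if version == 4:
        ihl = (pkt[0] & 0x0F) * 4            # IPv4 header length in bytes
        proto = pkt[9]
        src = socket.inet_ntop(socket.AF_INET, pkt[12:16])
        dst = socket.inet_ntop(socket.AF_INET, pkt[16:20])
        l4 = pkt[ihl:]
    elif version == 6:
        proto = pkt[6]                       # Next Header; extension headers ignored
        src = socket.inet_ntop(socket.AF_INET6, pkt[8:24])
        dst = socket.inet_ntop(socket.AF_INET6, pkt[24:40])
        l4 = pkt[40:]
    else:
        raise ValueError("not an IPv4 or IPv6 packet")
    summary = {"src": src, "dst": dst, "proto": proto, "length": len(pkt),
               "sport": None, "dport": None, "flags": {}}
    if proto in (TCP, UDP) and len(l4) >= 4:
        summary["sport"], summary["dport"] = struct.unpack("!HH", l4[:4])
    if proto == TCP and len(l4) >= 14:
        bits = l4[13]                        # TCP flags byte
        summary["flags"] = {name: bool(bits & mask) for name, mask in FLAG_BITS.items()}
    return summary
```

Each returned dictionary corresponds to one row of summary information that a processor would upload to the database.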

Further, in the step (2), the entropy calculation is performed by using the following formula:

wherein T is the length of the set over which the entropy is calculated, n is the number of distinct elements in the set, and the distinct elements {a₁, a₂, …, a_n} of the set occur with the corresponding counts {d₁, d₂, …, d_n}.
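The formula itself does not appear in this text. Assuming it is the standard Shannon entropy over the relative frequencies d_i/T, which is consistent with the symbols defined above (the base-2 logarithm is likewise an assumption), it would read:

```latex
H = -\sum_{i=1}^{n} \frac{d_i}{T} \log_2 \frac{d_i}{T}, \qquad T = \sum_{i=1}^{n} d_i
```

Under this form, a set consisting of one repeated element gives H = 0, and a set of n equally frequent elements gives H = \log_2 n.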

Further, the step (3) specifically includes the following sub-steps:

(3.1) first, normal traffic and different types of abnormal traffic are acquired, the abnormal traffic being generated with open source tools: SYN Flood traffic, ACK Flood traffic, host scanning traffic, UDP Flood traffic, HTTP Flood traffic and port scanning traffic;

(3.2) the normal traffic and the abnormal traffic are mixed proportionally in units of seconds, the two kinds of traffic being spliced second by second to construct the training set;

(3.3) the training set is fed into a machine learning classifier for training, yielding a classifier capable of identifying the entropy values of threat traffic.

Further, in step (3.3), an AdaBoost ensemble method based on Gini decision trees is adopted. Each sample in the training data is first assigned a weight, all weights being equal initially. After the first weak learner is trained, its error rate is measured and the weight of that learner is calculated from the error rate. After each round of learning, the sample weights are readjusted so that samples misclassified in the previous round receive more emphasis in the next round. After multiple rounds of learning, the final AdaBoost classifier is obtained.

Further, the step (4) comprises the following sub-steps:

(4.1) starting a blocking UDP server thread by the decision server, monitoring a database data sending port, and storing an entropy calculation result, time information and a source address sent by the database;

(4.2) initializing a machine learning classifier thread, checking whether a currently running classifier thread exists or not, and classifying the stored entropy information by using the classifier when the currently running classifier thread is detected; if no classifier is running in the thread, reading a training set in the host, and learning by using the training set to obtain the classifier;

(4.3) when the classifier detects the threat flow, the thread displays the threat flow information through a dynamic interface;

(4.4) training a new classifier by using a sample data set newly constructed based on the latest flow when the classifier is judged to meet the updating condition;

(4.5) the server end can input an instruction through the control window to inquire thread information and control the running of the thread;

(4.6) the server issues an instruction for cleaning abnormal traffic, instructing the programmable node to forward only normal traffic to the protected node.
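One way to picture the IP-address-based cleaning of step (4.6) is a source-address blocklist built from the classifier's detections. The sketch below is only an illustration under that assumption; the function names, the `"normal"` label and the packet dictionary layout are hypothetical, and the actual cleaning instruction format is not given in the text.

```python
from ipaddress import ip_address

def build_blocklist(detections):
    """Collect source addresses of flows classified as threats.

    `detections` is an iterable of (source_address, label) pairs; the label
    "normal" is an assumed name for non-threat traffic.
    """
    return {ip_address(src) for src, label in detections if label != "normal"}

def forward_normal(packets, blocklist):
    """Yield only the packets whose source address is not on the blocklist."""
    for pkt in packets:
        if ip_address(pkt["src"]) not in blocklist:
            yield pkt
```

A programmable node applying such a list would drop traffic from flagged sources and forward everything else to the protected node.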

Further, in the step (4.3), the threat traffic information includes: threat type, source address, destination address, reporting time.

Further, the updating conditions in the step (4.4) are as follows: when the operation time of the classifier exceeds a threshold value, or when the data magnitude classified by the classifier reaches a set threshold value.

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) The invention provides a network threat awareness method based on traffic entropy features that combines rapid threat discovery with accurate identification by ensemble learning. It can accurately and effectively identify threat traffic in a network, so that a network administrator can discover threat traffic and its details promptly, efficiently and clearly at a specified time granularity, thereby improving network security.

(2) Using a combination of software and hardware, the invention provides, for the hardware part, a multiprocessor task scheduling scheme for calculating the traffic summaries and traffic entropy values, which greatly improves the utilization of processor resources.

(3) The threat traffic awareness method provided by the invention reduces the dimensionality of the traffic twice: the first reduction compresses the traffic into the summary information of the data messages, and the second compresses the summary information into the various entropy values.

(4) The blocking UDP server effectively reduces the load that a polling mode of receiving data would place on the server; updating the classifier guards against concept drift in the traffic.

Drawings

FIG. 1 is a structural framework of a threat situation awareness system;

FIG. 2 is a flow entropy based threat awareness implementation framework;

FIG. 3 is a schematic flow diagram of an AdaBoost classifier;

FIG. 4 is a sample entropy calculation of the host scan after mixing normal flows, label2 indicating the host scan flow;

FIG. 5 is a determination of machine learning hyper-parameters based on accuracy;

FIG. 6 is a machine learning dataset construction method;

FIG. 7 is an experimental topology build-up graph;

FIG. 8 is a schematic diagram of the operation of processors 1-3 in the programmable device;

FIG. 9 is a final threat information presentation interface.

Detailed Description

The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.

Research indicates that network traffic exhibits self-similarity, long-range dependence and heavy-tailed distributions, and entropy values can describe these characteristics. The invention therefore determines whether network traffic contains threat information by calculating the entropy values of its characteristic attributes, including message features such as the source address, destination address, source port and destination port.

The principle of detecting abnormal flow according to the flow entropy value is as follows:

Compared with normal traffic, the entropy pattern of abnormal traffic shows a pronounced rise or fall. Network behaviors such as DDOS attacks, port scanning and host scanning change the entropy of the characteristic attributes of the network traffic, and each behavior corresponds to a different pattern of entropy change. Comparing the entropy changes of the overall flow, the source and destination IP, the source and destination ports, the protocol type, the message length and the Flags bits gives the entropy characteristics of common typical abnormal traffic shown in Table 1:

TABLE 1 Entropy characteristics of abnormal traffic

Columns: Anomaly name; H(flow); H(srcIP); H(srcPort); H(dstIP); H(dstPort); H(proto); H(len); H(flags)

Port scanning: Increase; Decrease; Increase; Decrease; Increase; Essentially unchanged

Host scanning: Increase; Decrease; Increase; Decrease; Essentially unchanged; Decrease

TCP DDOS: Decrease

UDP flood: Decrease; Increase; Decrease

HTTP flood: Decrease; Increase; Decrease; Essentially unchanged

The entries of each row are listed in the order in which they appear; empty cells are not shown, so the position of an entry within its row does not by itself identify its column.

In Table 1, H(flow) is the overall flow entropy, H(srcIP) is the source IP entropy, H(srcPort) is the source port entropy, H(dstIP) is the destination IP entropy, H(dstPort) is the destination port entropy, H(proto) is the protocol type entropy, H(len) is the message length entropy, and H(flags) is the Flags entropy. Within TCP DDOS, the message length entropy of SYN flood increases compared with normal traffic while the flag-bit entropy H(flags) is essentially unchanged; for ACK flood, the message length entropy and the flag-bit entropy H(flags) both decrease.

Based on these entropy characteristics, the invention provides a network threat awareness method in which rapid threat discovery cooperates with accurate threat identification by machine learning. Its implementation framework, shown in FIG. 1, comprises programmable devices, a server and protected network nodes. The main implementation flow is shown in FIG. 2: a programmable node first extracts the network traffic features; the network traffic entropy values are then calculated; the traffic information, reduced in dimension to entropy values, is uploaded to the decision server; and the decision server discovers and identifies threats using a machine learning method trained on a training set.

The hardware used by the invention is a network processor carrying a programmable FPGA. By exploiting the programmability of the device to change how the underlying hardware processes data packets, the processor can, as traffic passes through it, extract the five-tuple and other information of the traffic and store it in a database. The database calculates the entropy of each characteristic attribute of the traffic and reports the calculated entropy values to the server. The decision server classifies the received entropy information with the trained machine learning classifier, identifies threat information in the traffic and displays the detailed threat information through visualization.

The machine learning classifier used in the invention is an AdaBoostClassifier. As shown in FIG. 5, the accuracy of entropy value classification was compared for decision trees split on gini impurity and on entropy, and for the bagging decision forest, boarding decision forest, random forest, AdaBoost boosting and GBRT gradient boosting ensemble methods built on each of them. The AdaBoost classification algorithm based on gini decision trees, with an accuracy of about 0.99962, was finally selected.

Specifically, as shown in FIG. 2, the method of the present invention comprises the following steps:

(1) The programmable device uses 4 CPUs to process the traffic: CPU No. 0 receives and forwards the passing traffic, and the other three CPUs summarize the traffic information, extracting the source address, destination address, source port, destination port, protocol type, message length and the TCP control flags URG, ACK, PSH, RST, SYN and FIN from the message flow, and pass this summary information to the database.

The specific process of the step is as follows:

(1.1) CPU No. 0 receives and forwards the data messages.

(1.2) Processor task scheduling is optimized: task time is divided into one-second intervals, and every three seconds form one processor task period. In the first second, CPU No. 1 receives messages; in the second second, CPU No. 1 summarizes the message information, extracting the five-tuple and other information, while CPU No. 2 receives messages; in the third second, CPU No. 1 uploads the summary information to the database, CPU No. 2 summarizes its messages, and CPU No. 3 starts receiving messages. The specific workflow is shown in FIG. 8. Each processor first determines the IP version of the received message's address (IPv4 or IPv6), then extracts the message summary according to that address type, namely the source address, destination address, source port, destination port, protocol type, message length and the TCP control flags URG, ACK, PSH, RST, SYN and FIN, and uploads the summary information to the database.
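The staggered schedule of CPUs No. 1 to No. 3 can be sketched as a three-stage pipeline. The text describes only the first three seconds; the sketch below assumes the cycle simply repeats afterwards, so that from the third second on one CPU is in each stage. The function name and the `"idle"` start-up state are illustrative assumptions.

```python
STAGES = ("receive", "summarize", "upload")

def cpu_stage(cpu: int, second: int) -> str:
    """Stage of CPU `cpu` (1..3) during `second` (1-based)."""
    if cpu not in (1, 2, 3) or second < 1:
        raise ValueError("cpu must be 1..3 and second must be >= 1")
    if second < cpu:
        return "idle"  # pipeline still filling during the first seconds
    # Each CPU starts one second after the previous one and cycles every 3 s.
    return STAGES[(second - cpu) % 3]

def schedule(second: int) -> dict:
    """Map each of CPUs 1..3 to its stage in the given second."""
    return {cpu: cpu_stage(cpu, second) for cpu in (1, 2, 3)}
```

For example, `schedule(3)` gives CPU No. 1 uploading, CPU No. 2 summarizing and CPU No. 3 receiving, which matches the third second described above.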

(2) The database calculates the entropy of each item of summary information stored by the processor and reports the calculation results to the decision server, as shown in FIG. 4.

In this step, the method for calculating the entropy of the summary information by the database is as follows:

the database uses the abstract information of the flow stored by the processor to respectively calculate the entropy values, and the formula of the entropy value calculation is as follows:
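The formula is not reproduced in this text. As a hedged sketch of the per-attribute calculation, the Python below assumes the standard Shannon entropy with a base-2 logarithm over the relative frequency of each distinct value; the function names and the dictionary-based summary rows are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(values) -> float:
    """Shannon entropy (base 2, assumed) of the values in one attribute column."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) summed over the distinct values, with p = count / total.
    return sum((d / total) * math.log2(total / d) for d in counts.values())

def summary_entropies(summaries, fields):
    """Entropy of each named field over a batch of summary rows (dicts)."""
    return {f: entropy(s[f] for s in summaries) for f in fields}
```

Calling `summary_entropies` on one second of summary rows yields one entropy value per attribute, which is the reduced representation reported to the decision server.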

(3) The decision server trains a machine learning classifier model on a training set. Normal traffic and different types of abnormal traffic are obtained with existing tools; the abnormal traffic generated with the tools includes host scanning traffic, port scanning traffic, SYN Flood traffic, ACK Flood traffic, UDP Flood traffic and HTTP Flood traffic. The generated abnormal traffic is mixed with the normal traffic to form the training set, and a classifier capable of identifying the entropy values of threat traffic is trained on it.

The method specifically comprises the following steps:

(3.1) First, normal traffic and different types of abnormal traffic are acquired. Data captured by the MAWI Working Group in Japan from 2:00 pm to 2:15 pm on 13 May 2020 is used as the normal traffic data, and the SYN Flood, ACK Flood, host scanning, UDP Flood, HTTP Flood and port scanning traffic is generated with open source tools.

(3.2) The normal traffic and the abnormal traffic are mixed proportionally in units of seconds, the two kinds of traffic being spliced second by second. As shown in FIG. 6, s1 is the generated pure abnormal traffic data set. To better reflect the actual conditions of various network attacks, the method divides this data set into per-second segments s1, s2, ..., sn and inserts them into the per-second segments t1, t2, ..., tn of the normal traffic, where n is the duration of the abnormal traffic data set. This completes the construction of the training set for the machine learning method.
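A minimal sketch of the per-second splicing follows. It assumes each capture is a list of (timestamp, packet) records, aligns both captures to their own first record, and labels a second with the anomaly type whenever it contains abnormal records; the labeling rule, the function names and the record layout are assumptions for illustration, and the mixing proportion is not modeled.

```python
from collections import defaultdict

def split_by_second(records):
    """Group (timestamp, packet) records into per-second segments.

    Seconds are counted from the earliest timestamp in `records`.
    """
    segments = defaultdict(list)
    if not records:
        return segments
    start = min(ts for ts, _ in records)
    for ts, pkt in records:
        segments[int(ts - start)].append(pkt)
    return segments

def mix(normal, anomaly, anomaly_label, normal_label="normal"):
    """Insert anomaly segment s_i into normal segment t_i, second by second."""
    t = split_by_second(normal)
    s = split_by_second(anomaly)
    mixed = []
    for sec in sorted(set(t) | set(s)):
        packets = t.get(sec, []) + s.get(sec, [])
        label = anomaly_label if sec in s else normal_label
        mixed.append((sec, packets, label))
    return mixed
```

Each resulting (second, packets, label) entry can then be summarized and reduced to entropy values to form one labeled training sample.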

(3.3) The training set is fed into a machine learning classifier for training, yielding a classifier capable of identifying the entropy values of threat traffic. An AdaBoost ensemble method based on gini decision trees is used, as shown in FIG. 3. The method first assigns a weight to each sample in the training data, all weights being equal initially. After the first weak learner is trained, its error rate is measured and the weight of that learner is calculated from the error rate. After each round of learning, the sample weights are readjusted so that samples misclassified in the previous round receive more emphasis in the next round. After multiple rounds of learning, the final AdaBoostClassifier is obtained.
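The weight-update loop described in step (3.3) can be illustrated with a small from-scratch sketch. This is not the AdaBoostClassifier named in the text: it is a simplified binary variant (labels +1 and -1) using depth-1 trees chosen by weighted Gini impurity, written in plain Python under those assumptions, with all function names hypothetical.

```python
import math

def gini(pos: float, neg: float) -> float:
    """Gini impurity of a node holding weight `pos` of +1 and `neg` of -1."""
    total = pos + neg
    if total == 0:
        return 0.0
    p = pos / total
    return 2 * p * (1 - p)

def fit_stump(X, y, w):
    """Pick the (feature, threshold) split minimizing weighted Gini impurity."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            sides = {True: [0.0, 0.0], False: [0.0, 0.0]}  # [weight of +1, weight of -1]
            for row, label, wi in zip(X, y, w):
                sides[row[j] <= thr][0 if label == 1 else 1] += wi
            score = sum(sum(s) * gini(*s) for s in sides.values())
            if best is None or score < best[0]:
                # Each side predicts its weighted-majority label.
                left = 1 if sides[True][0] >= sides[True][1] else -1
                right = 1 if sides[False][0] >= sides[False][1] else -1
                best = (score, j, thr, left, right)
    return best[1:]

def stump_predict(stump, row):
    j, thr, left, right = stump
    return left if row[j] <= thr else right

def adaboost_fit(X, y, rounds=10):
    n = len(X)
    w = [1.0 / n] * n                          # equal initial sample weights
    model = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        preds = [stump_predict(stump, row) for row in X]
        err = sum(wi for wi, p, t in zip(w, preds, y) if p != t)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # learner weight from its error rate
        model.append((alpha, stump))
        # Raise the weights of misclassified samples, lower the rest, renormalize.
        w = [wi * math.exp(-alpha * p * t) for wi, p, t in zip(w, preds, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, row):
    """Sign of the alpha-weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1
```

In this sketch each row of `X` would hold the entropy values of one second of traffic, and a multi-class setting such as the one in the text would need a multi-class AdaBoost variant rather than this binary form.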

(4) The decision server receives the message summary entropy results transmitted from the database, classifies them with the trained classifier, and identifies whether the traffic is threat traffic. The detailed threat information is displayed through a dynamic interface built with an open source tool. The threat types that the invention can identify include port scanning, host scanning, TCP_ack, SYN flooding, UDP flooding and HTTP flooding. The classifier is updated according to elapsed time and the messages received, to guard against concept drift in the traffic. Abnormal traffic is then cleaned based on IP address.

The method specifically comprises the following steps:

and (4.1) starting a blocking UDP server thread by the decision server, monitoring a data sending port of the database, and storing the entropy calculation result, the time information and the source address sent by the database. The blocking UDP server can effectively reduce the load increased by the polling mode for receiving data by the server.
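A blocking receive loop of this kind can be sketched with Python's standard `socket` module. The JSON message layout (`time` and `entropy` keys), the function name and the use of an empty datagram as a stop signal are assumptions made for illustration; the text does not specify the wire format.

```python
import json
import socket
import threading

def serve_entropy(sock: socket.socket, store: list) -> None:
    """Block on recvfrom and store each entropy report.

    An empty datagram is treated as a stop signal (an assumed convention).
    """
    while True:
        data, addr = sock.recvfrom(65535)   # blocks in the kernel; no busy polling
        if not data:
            break
        report = json.loads(data.decode("utf-8"))
        store.append({
            "source": addr[0],              # address of the reporting node
            "time": report["time"],
            "entropy": report["entropy"],
        })

def start_server(host: str, port: int, store: list):
    """Bind a UDP socket and run the receive loop in a background thread."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    thread = threading.Thread(target=serve_entropy, args=(sock, store), daemon=True)
    thread.start()
    return sock, thread
```

Because `recvfrom` sleeps until a datagram arrives, the thread consumes no CPU between reports, which is the load reduction the step refers to.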

And (4.2) initializing a machine learning classifier thread, checking whether a currently running classifier thread exists or not, and classifying the stored entropy value information by using the classifier when the currently running classifier thread is detected. And if the thread does not have a classifier which is running, reading a training set in the csv format in the host, and training by using the training set to obtain the AdaBoostClassifier classifier.

(4.3) When the classifier detects threat traffic, the thread displays the threat traffic information through a dynamic interface built with an open source tool; the interface shows detailed threat information such as the threat type, source address, destination address and reporting time. The result display interface is shown in FIG. 9.

And (4.4) judging the updating condition of the classifier, and training a new classifier by using a sample data set newly constructed based on the latest flow to prevent the flow from generating concept drift when the operation time of the classifier exceeds a threshold value or the data magnitude of the classifier reaches a set threshold value.
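The two update conditions can be expressed as a small policy object. This is a sketch only: the class name, the threshold parameters and the injectable clock are assumptions, and the text does not give concrete threshold values.

```python
import time

class UpdatePolicy:
    """Decide when the classifier should be retrained (hypothetical helper)."""

    def __init__(self, max_age_s: float, max_samples: int, clock=time.monotonic):
        self.max_age_s = max_age_s        # running-time threshold in seconds
        self.max_samples = max_samples    # threshold on classified data volume
        self.clock = clock
        self.reset()

    def reset(self) -> None:
        """Call after a new classifier has been trained."""
        self.started = self.clock()
        self.classified = 0

    def record(self, n: int = 1) -> None:
        """Count `n` newly classified entropy records."""
        self.classified += n

    def should_update(self) -> bool:
        too_old = self.clock() - self.started > self.max_age_s
        too_many = self.classified >= self.max_samples
        return too_old or too_many
```

The classifier thread would call `record` after each classification and retrain on the most recent samples whenever `should_update` returns true, then call `reset`.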

And (4.5) the server end can input an instruction through the control window to inquire the thread information and control the thread to run.

In order to verify the effectiveness of the threat situation awareness method, an experiment topology based on the combination of software and hardware is established in an experiment.

Fig. 7 shows an experimental topology building framework capable of implementing the invention, in which a host carrying a machine learning stream entropy classification algorithm is used as a server, and a next-generation network processor prototype system carrying a programmable FPGA and a multi-core CPU is used as a network flow monitoring device.

Software in the server is a training module and a classifier module of machine learning, the training module is responsible for training a classifier by utilizing a known threat sample, and the classifier module is responsible for receiving stream entropy information and visualizing a classification result. CPU 0 in the programmable device is responsible for receiving and transmitting data messages, CPU 1-3 is responsible for abstracting and transmitting data messages, and the flow entropy calculation module is responsible for calculating message abstract information into a flow entropy and uploading calculation results to the server. The server is directly connected with the programmable device, and the programmable device is placed in a network topology needing sensing and is responsible for a port forwarding function.

The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims

1. A software and hardware combined threat situation perception method based on programmable nodes is characterized by comprising the following steps:

(3) the decision server trains a machine learning classifier model by using a training set to train a classifier capable of identifying the entropy value of the threat flow; the training set is constructed by mixing the generated abnormal flow and the normal flow; the abnormal traffic includes: host scanning traffic, port scanning traffic, SYN Flood traffic, ACK Flood traffic, UDP Flood traffic and HTTP Flood traffic;

2. The programmable node-based software and hardware combined threat situation awareness method according to claim 1, wherein the step (1) specifically comprises the following sub-steps:

3. The programmable node-based software and hardware combined threat situation awareness method according to claim 2, wherein when the message information is summarized in step (1.2), each processor first determines the IP version of the received message's address (IPv4 or IPv6) and then extracts the message summary according to that address type.

4. The programmable node-based software and hardware combined threat situation awareness method according to claim 1, wherein in the step (2), the entropy calculation is performed by using the following formula:

wherein T is the length of the set over which the entropy is calculated, n is the number of distinct elements in the set, and the distinct elements {a₁, a₂, ..., a_n} of the set occur with the corresponding counts {d₁, d₂, ..., d_n}.

5. The programmable node-based software and hardware combined threat situation awareness method according to claim 1, wherein the step (3) specifically comprises the following sub-steps:

6. The software and hardware combined threat situation awareness method based on the programmable node as claimed in claim 5, wherein in the step (3.3), an AdaBoost integration method based on a gini decision tree is adopted, each sample weight in training data is firstly given, each sample weight is equal initially, the error rate is counted after the training data is learned by using a first weak learning algorithm, the weight of the algorithm is calculated according to the error rate, after each learning is completed, the weight of the sample is readjusted to enable the weight of the sample which is wrongly classified in the previous classification to be learned with emphasis in the next learning, and the AdaBoostClassifier algorithm classifier is finally obtained after a plurality of rounds of learning.

7. The programmable node-based hardware and software combined threat situation awareness method according to claim 1, wherein the step (4) comprises the sub-steps of:

8. The programmable node-based hardware and software combined threat situation awareness method according to claim 7, wherein in the step (4.3), the threat traffic information comprises: threat type, source address, destination address, reporting time.

9. The programmable node-based software and hardware combined threat situation awareness method according to claim 7, wherein the updating conditions in the step (4.4) are as follows: when the operation time of the classifier exceeds a threshold value, or when the data magnitude classified by the classifier reaches a set threshold value.