CN113141375A

CN113141375A - Network security monitoring method and device, storage medium and server

Info

Publication number: CN113141375A
Application number: CN202110498132.1A
Authority: CN
Inventors: 马保银; 杨全才; 刘征; 谢君鹏; 孙蒙; 冯继强; 王刚; 李一波; 白凌; 雷宇
Original assignee: Kashgar Power Supply Co Of State Grid Xinjiang Electric Power Co ltd
Current assignee: Kashgar Power Supply Co Of State Grid Xinjiang Electric Power Co ltd
Priority date: 2021-05-08
Filing date: 2021-05-08
Publication date: 2021-07-20

Abstract

The invention discloses a network security monitoring method, a device, a storage medium and a server, belonging to the technical field of internet security. The network security monitoring method comprises the following steps: filtering encrypted traffic in a public network with a filter to collect the traffic; extracting feature categories and features from the collected traffic; and training on sample data according to the feature categories and the features with a preset algorithm to generate a model classification accuracy result. The method requires no interceptor, which reduces cost and computing power; it does not decrypt the traffic and does not affect network performance, and it preserves privacy in the traffic communication process.

Description

Network security monitoring method and device, storage medium and server

Technical Field

The invention belongs to the technical field of internet security, and particularly relates to a network security monitoring method, a network security monitoring device, a storage medium and a server.

Background

Computer networks are an important means for people to learn about society and obtain information through modern information technology. Network security management is the fundamental guarantee that people can use the internet safely and healthily.

To ensure communication security and privacy and to counter eavesdropping and man-in-the-middle attacks, HTTPS is becoming widespread and a growing share of network traffic is encrypted. Attackers, however, can hide their own information and whereabouts in the same way: by wrapping malware communication in a layer of TLS/SSL, they disguise it as normal traffic and evade detection while attacking and infecting targets.

In recent years, detection of encrypted malicious traffic has been a focus of attention in the field of network security. The inventors of the present invention found that, in the prior art, industrial gateway devices mainly detect attacks by decrypting the traffic. This decryption approach consumes a large amount of resources and is costly, and the decryption process is also strictly limited by laws and regulations on privacy protection.

Disclosure of Invention

In order to at least solve the technical problems, the invention provides a network security monitoring method, a network security monitoring device, a storage medium and a server.

According to a first aspect of the present invention, there is provided a network security monitoring method, including:

filtering encrypted traffic in a public network with a filter to collect the traffic;

extracting feature categories and features from the collected traffic;

and training on sample data according to the feature categories and the features with a preset algorithm to generate a model classification accuracy result.

Further, in the above-mentioned case,

the filtering of encrypted traffic in the public network with a filter to collect the traffic comprises:

using Wireshark as the filter, capturing network data packets according to a preset filtering rule, and generating a pcap file as the collected traffic.

Further, in the above-mentioned case,

extracting information logs from the captured HTTPS traffic by deep analysis of the traffic packets, wherein the information logs comprise a connection log, an SSL protocol log and a certificate log.

Further, in the above-mentioned case,

the extracting of feature categories and features from the collected traffic comprises:

obtaining the features of the collected traffic by analyzing the header information of the HTTPS data packets, capturing network data packets with Wireshark, and generating a pcap file to obtain the traffic feature categories.

Further, in the above-mentioned case,

the extracting of features from the collected traffic comprises:

creating connection 4-tuples from the data of the connection log, the SSL protocol log and the certificate log, and extracting features.

Further, in the above-mentioned case,

the training on sample data according to the feature categories and the features with a preset algorithm to generate a model classification accuracy result comprises: training on the sample data with a suitable machine learning algorithm as the classifier to generate a corresponding classification model, and calculating the accuracy on the sample data based on the classification model to obtain the classification accuracy result;

the sample data includes encrypted malicious traffic and encrypted benign traffic.

Further, in the above-mentioned case,

the preset algorithm comprises: L1-regularized logistic regression, support vector machine, random forest, and extreme gradient boosting.

According to a second aspect of the present invention, a network security monitoring apparatus comprises:

the acquisition module is used for filtering encrypted traffic in a public network with a filter to collect the traffic;

the feature extraction module is used for extracting feature categories and features from the collected traffic;

and the effect analysis module is used for training on sample data according to the feature categories and the features with a preset algorithm to generate a model classification accuracy result.

According to a third aspect of the present invention, a network security monitoring server comprises a memory, a processor, and a computer program stored on the memory and executable on the processor,

the processor, when executing the program, performs the steps of the method of any of the first aspect.

According to a fourth aspect of the invention, a computer readable storage medium stores a program which, when executed, is capable of implementing a method as defined in any one of the above.

The beneficial effects of the invention are as follows: encrypted traffic in a public network is filtered with a filter and collected, and feature categories and features are extracted from it; sample data are then trained on according to the feature categories and the features with a preset algorithm to generate a model classification accuracy result. The method requires no interceptor, which reduces cost and computing power; it does not decrypt the traffic and does not affect network performance, and it preserves privacy in the traffic communication process.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which,

fig. 1 is a flowchart of a network security monitoring method provided in the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.

In order to more clearly illustrate the invention, the invention is further described below with reference to preferred embodiments and the accompanying drawings. Similar parts in the figures are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.

In a first aspect of the present invention, a network security monitoring method is provided, as shown in fig. 1, including:

Step 201: filtering encrypted traffic in a public network with a filter to collect the traffic;

In the disclosure, a tool such as Wireshark is used as the filter: network data packets are captured according to a preset filtering rule, and a pcap file is generated as the filtered traffic.

Furthermore, with a tool such as Wireshark as the filter, encrypted benign traffic and encrypted malicious traffic in the public network are obtained separately according to preset filtering rules, and the captured encrypted malicious traffic is saved as a pcap file, so that the detection task is converted into a binary classification problem in machine learning.

In another embodiment of the invention, Wireshark is used to collect the traffic of HTTPS packets.
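As an illustration of a preset filtering rule, the following minimal sketch assembles a capture command for tshark, the command-line companion of Wireshark. The interface name, the output file name and the filter expression `tcp port 443` are assumptions chosen for the example; the text does not specify the actual rule.

```python
import shlex


def build_capture_command(interface, output_path, capture_filter="tcp port 443"):
    """Return the argument list for a tshark capture limited by a capture filter.

    -i selects the interface, -f sets the capture filter and -w writes the
    captured packets to a file. All values here are illustrative assumptions.
    """
    return ["tshark", "-i", interface, "-f", capture_filter, "-w", output_path]


# Hypothetical interface and file name, for illustration only.
cmd = build_capture_command("eth0", "https_traffic.pcap")
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run` unchanged, which avoids shell quoting problems with the filter expression.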

Deep analysis of the traffic packets can be used to extract sufficient information logs from the captured HTTPS traffic, including a connection log, an SSL protocol log and a certificate log.

From these 3 logs the following information can be obtained: connection records, SSL records and certificate records.

Each row of the connection log is a connection record that aggregates a set of packets and describes the connection between two endpoints. A connection record contains information such as IP addresses, ports, protocol, connection state, number of packets and label.

The SSL records describe the SSL/TLS handshake and the establishment of the encrypted connection. They contain the SSL/TLS version, the cipher used, the server name, the certificate path, the subject, the certificate issuer, and so on.

Each row of the certificate log is a certificate record that describes certificate information, such as the certificate serial number, common name, validity period, subject, signature algorithm and key length in bits.

Deep analysis of the traffic packets generates the log data, and each row in any log has a unique key for linking it to rows in the other logs.

The unique key of a connection record can be used to associate it with the record having the same unique key in the SSL protocol log.

The SSL protocol log contains a column of comma-separated id key values, through which the certificate record corresponding to each id can be found in the certificate log.

After traffic analysis, the certificate path is stored in the certificate path column of the SSL protocol log; it holds the id key values of all certificates, and each comma-separated id value corresponds to one certificate record in the certificate log.
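The linking of the three logs can be sketched as follows. This is a minimal illustration only: the column names `uid`, `cert_path` and `id` are hypothetical stand-ins for the unique key, the certificate path column and the certificate id, and the rows are assumed to be already parsed into dictionaries.

```python
def link_records(conn_rows, ssl_rows, cert_rows):
    """Join connection, SSL and certificate records through their key columns.

    Column names ("uid", "cert_path", "id") are assumptions for this sketch.
    """
    ssl_by_uid = {row["uid"]: row for row in ssl_rows}
    cert_by_id = {row["id"]: row for row in cert_rows}
    linked = []
    for conn in conn_rows:
        ssl = ssl_by_uid.get(conn["uid"])
        if ssl is None:
            continue  # connection without a matching SSL record
        # The certificate path column holds comma-separated certificate ids.
        chain_ids = [c for c in (ssl.get("cert_path") or "").split(",") if c]
        certs = [cert_by_id[c] for c in chain_ids if c in cert_by_id]
        linked.append({"conn": conn, "ssl": ssl, "certs": certs})
    return linked


# Made-up rows, for illustration only.
conn_rows = [{"uid": "C1", "src_ip": "10.0.0.5"}, {"uid": "C2", "src_ip": "10.0.0.6"}]
ssl_rows = [{"uid": "C1", "cert_path": "F1,F2"}]
cert_rows = [{"id": "F1"}, {"id": "F2"}, {"id": "F3"}]
linked = link_records(conn_rows, ssl_rows, cert_rows)
```

Building the two dictionaries first keeps each lookup constant-time, so the join stays linear in the number of log rows.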

Step 202: extracting feature types and features of the collected flow;

In the disclosure, the features of the collected traffic can be obtained by analyzing the header information of the HTTPS packets. The TLS handshake messages that carry this information are transmitted in clear text over the network, so a tool such as Wireshark can be used to capture network packets and generate a pcap file, from which the traffic feature categories are obtained.

The extracted traffic feature categories can be divided into data element statistical features, TLS features and contextual data features.

The data element statistical features comprise packet sizes, the sequence of arrival times, and the byte distribution.

The TLS features include the cipher suites and TLS extensions offered by the client, the client public key length, the cipher suite selected by the server, and certificate information. The certificate information includes whether the certificate is self-signed rather than CA-signed, the number of entries in the X.509 SAN extension, the validity period, and the like.

The contextual data features can be subdivided into DNS data flow features and HTTP data flow features. The DNS features concern the domain name length in the DNS response, the ratio of digit to non-digit characters in the domain name, the TTL value, the number of IP addresses returned by the DNS response, and the ranking of the domain name in the Alexa list. The HTTP features concern various fields of inbound and outbound HTTP and the HTTP response code. These fields include Set-Cookie, Location, Expires, Content-Type, Server, etc.
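The DNS features listed above can be illustrated with a short sketch. The function name and the dictionary keys are hypothetical, and counting the dots of the domain name as non-digit characters is an assumption; the Alexa ranking lookup is omitted because it needs external data.

```python
def dns_features(domain, ttl, addresses):
    """Compute simple DNS response features for one domain name.

    domain: queried name, ttl: TTL of the response, addresses: returned IPs.
    Dots are counted as non-digit characters (an assumption of this sketch).
    """
    name = domain.rstrip(".")
    digits = sum(ch.isdigit() for ch in name)
    non_digits = len(name) - digits
    return {
        "name_length": len(name),
        # -1.0 marks a ratio that cannot be computed (no non-digit characters).
        "digit_ratio": digits / non_digits if non_digits else -1.0,
        "ttl": ttl,
        "num_addresses": len(addresses),
    }


# Made-up response values, for illustration only.
features = dns_features("abc123.com", 300, ["192.0.2.10", "192.0.2.11"])
```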

Joy is used to extract data features from live network traffic or from a pcap file, including information such as clientHello, serverHello, certificate and clientKeyExchange, and represents these data features in JSON. Joy also includes an analysis tool (sleuth) that can be applied to these data files; sleuth is used to extract the specific feature information required.

In the invention, besides the traditional traffic features such as packet sizes and time-related parameters, Joy analyzes the initial packets of the encrypted connection and makes full use of the unencrypted fields to extract data feature elements from the encrypted traffic.

For example, Joy's configuration commands in an ubuntu environment are as follows:

sudo apt-get install build-essential libssl-dev libpcap-dev libcurl4-openssl-dev

git clone https://github.com/cisco/joy.git

cd joy

./config

make

In the method, Joy is used to extract tls/dns/http data features from the pcap files in the data directory, and the extracted JSON result files are stored in the feature directory. For example:

./joy output=features bidir=1 tls=1 dns=1 http=1 data/*

The specific feature information required is then extracted with sleuth:

./sleuth bin/features/* --select "tls{cs,c_extensions,c_key_length,s_cs,s_extensions,s_cert[{validity_not_before,validity_not_after}]}"
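The same selection can be sketched in Python for readers who post-process the JSON output themselves. This is a minimal sketch under stated assumptions: each flow is taken to be one JSON object per line with a nested `tls` object whose `s_cert` entry is a list, a layout inferred only from the select expression above, and the sample record is made up.

```python
import json

# Field names taken from the sleuth select expression in the text.
TLS_FIELDS = ("cs", "c_extensions", "c_key_length", "s_cs", "s_extensions")
CERT_FIELDS = ("validity_not_before", "validity_not_after")


def select_tls_fields(json_line):
    """Return the selected TLS fields of one flow record, or None without TLS."""
    flow = json.loads(json_line)
    tls = flow.get("tls")
    if tls is None:
        return None
    selected = {name: tls.get(name) for name in TLS_FIELDS}
    selected["s_cert"] = [
        {name: cert.get(name) for name in CERT_FIELDS}
        for cert in tls.get("s_cert", [])
    ]
    return selected


# Made-up flow record, for illustration only.
sample = json.dumps({
    "tls": {
        "cs": ["c02b"], "c_extensions": [], "c_key_length": 520,
        "s_cs": "c02b", "s_extensions": [],
        "s_cert": [{"serial_number": "01",
                    "validity_not_before": "2020-01-01",
                    "validity_not_after": "2021-01-01"}],
    }
})
selected = select_tls_fields(sample)
```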

In another embodiment of the invention, connection 4-tuples are created from the data of the connection log, the SSL protocol log and the certificate log, and features are extracted for machine learning model training.

Further, the connection log is joined with the SSL protocol log on their id; then, from the certificate path in the resulting conn_ssl.log, the first key is taken and associated with the id in the certificate log. In the joined result, records sharing the same connection 4-tuple (source IP, destination IP, destination port and protocol) are grouped and aggregated, and features are then extracted for each connection 4-tuple from the aggregation result.
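The grouping step can be sketched as follows. The key fields assumed here are source IP, destination IP, destination port and protocol, and the dictionary keys `src_ip`, `dst_ip`, `dst_port` and `proto` are hypothetical names for them.

```python
from collections import defaultdict


def group_by_four_tuple(records):
    """Group joined records by their connection 4-tuple.

    The 4-tuple layout and the field names are assumptions of this sketch.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["src_ip"], rec["dst_ip"], rec["dst_port"], rec["proto"])
        groups[key].append(rec)
    return dict(groups)


# Made-up joined records, for illustration only.
records = [
    {"src_ip": "10.0.0.5", "dst_ip": "192.0.2.1", "dst_port": 443, "proto": "tcp", "duration": 1.2},
    {"src_ip": "10.0.0.5", "dst_ip": "192.0.2.1", "dst_port": 443, "proto": "tcp", "duration": 0.4},
    {"src_ip": "10.0.0.5", "dst_ip": "192.0.2.9", "dst_port": 443, "proto": "tcp", "duration": 3.0},
]
groups = group_by_four_tuple(records)
```

The source port is left out of the key on purpose in this sketch, so that repeated connections from one client to the same service fall into one group.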

For each connection 4-tuple, 37 features are extracted. The features were created based on a thorough analysis of malware data. They are divided into 3 groups: connection features, SSL features and certificate features.

The connection features are based on the connection records and describe the common behavior of the communication flow, independent of certificates and encryption.

The SSL features are based on the SSL records and describe the SSL handshake and information about the encrypted communication.

The certificate features are based on the certificate records and describe the certificate that the web server presents during the SSL handshake. Each feature is a floating-point value, set to -1 if it cannot be computed for lack of information.

Further, there are 12 connection features in total, including the following.

Number of aggregated and connection records: the sum of the SSL aggregations and connection records contained in each connection 4-tuple.

Mean duration: the average of the duration parameter over the connections of each connection 4-tuple.

Standard deviation of duration: the standard deviation of the duration parameter over the connections of each connection 4-tuple.

Ratio of durations outside the standard deviation range: the percentage of all duration values of each connection 4-tuple that fall outside the range whose upper limit is the mean plus the standard deviation and whose lower limit is the mean minus the standard deviation.

Total size of packets sent: the number of payload bytes sent over all connections of each connection 4-tuple.
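The three duration features can be illustrated with a short sketch. It assumes the population standard deviation and expresses the out-of-range share as a fraction between 0 and 1; the text does not fix either choice, and the function name is hypothetical.

```python
import statistics


def duration_features(durations):
    """Mean, standard deviation and out-of-range ratio of connection durations.

    Returns -1.0 for every feature when no duration is available, following
    the convention that a feature that cannot be computed is set to -1.
    """
    if not durations:
        return {"mean": -1.0, "std": -1.0, "out_of_range_ratio": -1.0}
    mean = statistics.fmean(durations)
    std = statistics.pstdev(durations)  # population standard deviation (assumption)
    upper, lower = mean + std, mean - std
    outside = sum(1 for d in durations if d > upper or d < lower)
    return {"mean": mean, "std": std, "out_of_range_ratio": outside / len(durations)}


# Made-up durations in seconds, for illustration only.
stats = duration_features([1.0, 2.0, 3.0, 4.0, 10.0])
```

In this example only the 10-second connection lies outside the range of mean plus or minus one standard deviation.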

In another embodiment of the present invention, there are 10 SSL features in total, including the following.

Ratio of SSL connections in the connection records: the ratio between the number of non-SSL connections and SSL connections in the connection 4-tuple.

Ratio of TLS to SSL: the distribution of TLS versions in the connection 4-tuple.

SNI ratio: the proportion of records in the connection 4-tuple whose server_name is not empty.

SNI is IP: the proportion of records in the connection 4-tuple whose server_name is an IP address.
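The two SNI features can be sketched as follows. Using the total number of SSL records as the denominator of both ratios is an assumption, and the function names are hypothetical.

```python
import ipaddress


def is_ip_literal(name):
    """Return True if the server name is an IPv4 or IPv6 address literal."""
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False


def sni_features(server_names):
    """SNI ratio and SNI-is-IP ratio for the SSL records of one 4-tuple.

    server_names holds one entry per SSL record; None or "" means no SNI.
    """
    if not server_names:
        return {"sni_ratio": -1.0, "sni_is_ip_ratio": -1.0}
    total = len(server_names)
    present = [name for name in server_names if name]
    return {
        "sni_ratio": len(present) / total,
        "sni_is_ip_ratio": sum(is_ip_literal(name) for name in present) / total,
    }


# Made-up server names, for illustration only.
sni = sni_features(["example.com", "", "10.0.0.1", None])
```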

In another embodiment of the present invention, there are 15 certificate features in total, including the following.

Public key mean: the average of the exponent of all certificates in the connection 4-tuple.

Mean certificate validity period: the average number of days of validity of all certificates in the connection 4-tuple.

Standard deviation of certificate validity period: the standard deviation of the number of days of validity of all certificates in the connection 4-tuple.

Validity of certificates during capture: the proportion of all certificates in the connection 4-tuple that had not expired at capture time.
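A minimal sketch of the validity-period features follows. It assumes that each certificate is given as a pair of `datetime` values and that a certificate counts as valid at capture time when the capture time lies between its two validity bounds; the population standard deviation is again an assumption.

```python
import statistics
from datetime import datetime


def certificate_validity_features(certs, capture_time):
    """Validity-period features for the certificates of one 4-tuple.

    certs: list of (not_before, not_after) datetime pairs.
    """
    if not certs:
        return {"mean_days": -1.0, "std_days": -1.0, "valid_ratio": -1.0}
    days = [(after - before).total_seconds() / 86400 for before, after in certs]
    valid = sum(1 for before, after in certs if before <= capture_time <= after)
    return {
        "mean_days": statistics.fmean(days),
        "std_days": statistics.pstdev(days),
        "valid_ratio": valid / len(certs),
    }


# Made-up certificates (30 and 100 days of validity), for illustration only.
certs = [
    (datetime(2020, 1, 1), datetime(2020, 1, 31)),
    (datetime(2020, 1, 1), datetime(2020, 4, 10)),
]
validity = certificate_validity_features(certs, datetime(2020, 2, 15))
```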

Step 203: training on sample data according to the feature categories and the features with a preset algorithm to generate a model classification accuracy result.

In the method, a preset, suitable machine learning algorithm is used as the classifier to train on the sample data, which comprise encrypted malicious traffic and encrypted benign traffic; a corresponding classification model is generated, and the accuracy on the sample data is calculated based on the classification model to obtain the classification accuracy result. The preset algorithm comprises: support vector machine (SVM) and random forest.

In the invention, training on the sample data according to the feature categories and the features with the support vector machine algorithm comprises:

taking the feature values of the labeled training data as input, and performing model training with the SVM classifier of LibSVM.

Further, the feature values of the labeled training data are calculated as follows: the collected traffic file is F; any K consecutive bytes in the file are taken as one element, and the entropy of the set S' formed by all runs of K consecutive bytes in the file is calculated. The relative entropy corresponding to the set is $h_k$, where $m_{ik}$ is the frequency of occurrence of the i-th element in the set $f_k$. This calculation yields $h_0$, $h_1$, $h_2$ and $h_3$.
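The entropy formula itself is not reproduced in the text above. The following minimal sketch therefore assumes the standard Shannon entropy over the empirical frequencies of the k-byte elements, taken with a sliding window of step one byte; both the formula and the windowing are assumptions, as is the function name.

```python
import math
from collections import Counter


def kgram_entropy(data: bytes, k: int) -> float:
    """Shannon entropy (in bits) of the k-byte elements of a byte string.

    Elements are taken with a sliding window; the count of each distinct
    element plays the role of the frequency of the i-th element.
    """
    if k < 1 or len(data) < k:
        return 0.0
    counts = Counter(data[i:i + k] for i in range(len(data) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Four entropy values for element lengths 1 to 4 (the mapping to h_0..h_3
# is an assumption of this sketch).
payload = bytes(range(256)) * 4
entropies = [kgram_entropy(payload, k) for k in (1, 2, 3, 4)]
```

Encrypted payloads tend to have byte distributions close to uniform, so their entropy per element is close to the maximum, which is what makes these values usable as features.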

For each data file to be processed, a Monte Carlo estimate of π is calculated by taking each 48-bit group of the stream as one point, with the first 24 bits as montex and the last 24 bits as montey. Montex and montey are used to determine whether the point falls inside the circular area, the Monte Carlo π value is estimated from the number of points falling inside the circular area, and the difference between the Monte Carlo π value and the true value of π is taken as the error $P_{error}$ of the π value estimated by Monte Carlo simulation. The values $h_0$, $h_1$, $h_2$, $h_3$ and $P_{error}$ are used as the feature values of the labeled training data.
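The Monte Carlo step can be sketched as follows. The sketch assumes big-endian 24-bit coordinates, a quarter circle of radius 2^24 - 1 and the absolute difference as the error; these details are not given in the text, and the function name is hypothetical.

```python
import math


def monte_carlo_pi_error(data: bytes) -> float:
    """Absolute error of a Monte Carlo estimate of pi drawn from a byte stream.

    Every 6 bytes (48 bits) form one point: the first 3 bytes are the x
    coordinate and the last 3 bytes the y coordinate. Returns -1.0 when
    the data holds no complete point.
    """
    radius = 2 ** 24 - 1
    inside = total = 0
    for offset in range(0, len(data) - 5, 6):
        x = int.from_bytes(data[offset:offset + 3], "big")
        y = int.from_bytes(data[offset + 3:offset + 6], "big")
        total += 1
        if x * x + y * y <= radius * radius:
            inside += 1
    if total == 0:
        return -1.0
    # Points are uniform over a square; the quarter circle covers pi/4 of it.
    return abs(4.0 * inside / total - math.pi)


# Two points at the origin both fall inside the circle, so the estimate is 4.
error = monte_carlo_pi_error(bytes(12))
```

For well-encrypted data the points are close to uniformly distributed and the error is small, whereas structured plaintext usually gives a larger error.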

The traffic is analyzed and classified with the classification model generated in the classifier training stage, and the classification result is evaluated. The evaluation includes the model classification accuracy result $P_r = T_P / (T_P + F_P)$, where $T_P$ is the number of encrypted samples that are correctly labeled and $F_P$ is the number of non-encrypted samples that are mislabeled as encrypted.

Furthermore, the evaluation may also include the recall $R_e$ and the comprehensive evaluation $F_m$, where $R_e = T_P / (T_P + F_N)$ and $F_m = 2 P_r R_e / (P_r + R_e)$. The model classification accuracy result and the recall reflect the recognition performance of the method, and the comprehensive evaluation combines accuracy and recall into a more complete assessment: the higher the comprehensive evaluation, the better the encrypted traffic classification performed by the method.
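The three evaluation measures can be computed directly from the counts. In this sketch F_N is taken, as usual, to be the number of encrypted samples that are not labeled as encrypted, which the text does not define explicitly; returning 0.0 for an empty denominator is also an assumption.

```python
def evaluate(tp, fp, fn):
    """Return (P_r, R_e, F_m) from true positives, false positives and
    false negatives, following the formulas given in the text."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up counts, for illustration only.
p_r, r_e, f_m = evaluate(tp=90, fp=10, fn=30)
```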

In another embodiment of the present invention, training on the sample data according to the feature categories and the features with a random forest algorithm to generate a model classification accuracy result comprises:

constructing the random forest with Bagging, specifically:

Step a1: for the sample data, constructing a sample by random sampling with replacement; for example, starting from the data $(X_1, Y_1), \ldots, (X_n, Y_n)$, sampling n times at random to construct a bootstrap sample;

Step a2: constructing a decision tree for each bootstrap sample;

Step a3: repeating step a1 and step a2 to obtain a plurality of decision trees;

Step a4: letting each decision tree vote on the input vector X, counting all votes, taking the class with the highest number of votes as the classification label of the vector X, and taking the proportion of labels that differ from the correct classification label as the misclassification rate of the random forest;

Step a5: calculating the true positives TP and the false positives FP on the sample data, and calculating the model classification accuracy result from the obtained TP and FP.
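Steps a1 to a4 can be sketched in plain Python as follows. To keep the example short, each tree is simplified to a one-level decision tree (a single threshold on one feature); this simplification, the function names and the toy data are assumptions of the sketch, and a practical random forest would grow deeper trees and usually also sample features at random.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Step a1: draw len(data) items at random with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Step a2 (simplified): pick the single-feature threshold split with
    the fewest training errors and return it as a prediction function."""
    best = None
    for f in range(len(sample[0][0])):
        for x, _ in sample:
            t = x[f]
            left = [y for xi, y in sample if xi[f] <= t]
            right = [y for xi, y in sample if xi[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda x: left_label if x[f] <= t else right_label


def train_forest(data, n_trees, seed=0):
    """Step a3: repeat steps a1 and a2 to obtain several trees."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_trees)]


def predict(forest, x):
    """Step a4: every tree votes; the class with the most votes wins."""
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]


def misclassification_rate(forest, data):
    """Share of samples whose predicted label differs from the true label."""
    return sum(predict(forest, x) != y for x, y in data) / len(data)


# Toy one-feature data set, for illustration only.
data = [((0.1,), "benign"), ((0.2,), "benign"), ((0.3,), "benign"),
        ((0.8,), "malicious"), ((0.9,), "malicious"), ((1.0,), "malicious")]
forest = train_forest(data, n_trees=25)
rate = misclassification_rate(forest, data)
```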

Further, the number of samples of actual type i that the classification model predicts correctly is $TP_i = n_{ii}$, and this result is taken as the true positives TP; the number of samples of actual type i that the classification model misjudges is taken as the false negatives FN; the number of samples whose actual type is not i but which the classification model misjudges as type i is $FP_i = \sum_{j \neq i} n_{ji}$, and this result is taken as the false positives FP. The model classification accuracy result OA is then calculated from these quantities.
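The per-class counts can be read off a confusion matrix, as in the following minimal sketch. It assumes that entry [a][b] of the matrix is the number of samples of actual type a predicted as type b, and it takes the overall accuracy to be the share of samples on the diagonal, a common definition used here because the formula for OA is not given in the text.

```python
def per_class_counts(matrix, i):
    """Return (TP_i, FN_i, FP_i) for class i of a confusion matrix.

    matrix[a][b] = number of samples of actual type a predicted as type b.
    """
    k = len(matrix)
    tp = matrix[i][i]
    fn = sum(matrix[i][j] for j in range(k) if j != i)
    fp = sum(matrix[j][i] for j in range(k) if j != i)
    return tp, fn, fp


def overall_accuracy(matrix):
    """Share of all samples that lie on the diagonal (assumed definition)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up two-class confusion matrix, for illustration only.
matrix = [[50, 10],
          [5, 35]]
counts = per_class_counts(matrix, 0)
accuracy = overall_accuracy(matrix)
```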

In another embodiment of the invention, the data set is selected as follows. For the negative samples, a recent batch of 100,000 malware samples is run in a sandbox and the traffic generated by the malware is captured. For the positive samples, one part uses normal traffic from a daily office network; in addition, a crawler visits the 10,000 most visited websites in the Alexa ranking, and the traffic generated is collected as the other part of the data set.

It should be noted that, in the present invention, the preset algorithm may also be one of L1-regularized logistic regression (l1-logistic regression) and extreme gradient boosting (XGBoost).

In a second aspect of the present invention, there is provided a network security monitoring apparatus, comprising:

In the disclosure, the acquisition module is configured to use a tool such as Wireshark as the filter, capture network data packets according to a preset filtering rule, and generate a pcap file as the filtered traffic.

Further, the acquisition module is configured to use a tool such as Wireshark as the filter, obtain encrypted benign traffic and encrypted malicious traffic in the public network separately according to preset filtering rules, and save the captured encrypted malicious traffic as a pcap file, so that the detection task is converted into a binary classification problem in machine learning.

In the disclosure, the feature extraction module is configured to obtain the features of the collected traffic by analyzing the header information of the HTTPS packets. The TLS handshake messages that carry this information are transmitted in clear text over the network, so a tool such as Wireshark can be used to capture network packets and generate a pcap file, from which the traffic feature categories are obtained.

The feature extraction module is further configured to extract traffic feature categories, which can be divided into data element statistical features, TLS features and contextual data features.

The effect analysis module is configured to train on the collected traffic according to the feature categories and the features with a preset algorithm to generate a model classification accuracy result.

In the disclosure, the effect analysis module is configured to use a suitable machine learning algorithm as the classifier, train on the collected traffic as sample data, generate a corresponding classification model, classify the encrypted traffic based on the classification model, and screen out the encrypted malicious traffic. The classification algorithm comprises: support vector machine (SVM) and random forest.

In the invention, the effect analysis module trains on the sample data according to the feature categories and the features with the support vector machine algorithm, comprising:

For each data file to be processed, a Monte Carlo estimate of π is calculated by taking each 48-bit group of the stream as one point, with the first 24 bits as montex and the last 24 bits as montey. Montex and montey are used to determine whether the point falls inside the circular area, the Monte Carlo π value is estimated from the number of points falling inside the circular area, and the difference between the Monte Carlo π value and the true value of π is taken as the error $P_{error}$ of the π value estimated by Monte Carlo simulation; it is used as a feature value of the labeled training data.

The traffic is analyzed and classified with the classification model generated in the classifier training stage, and the classification result is evaluated. The evaluation includes the model classification accuracy result $P_r = T_P / (T_P + F_P)$, where $T_P$ is the number of encrypted samples that are correctly labeled and $F_P$ is the number of non-encrypted samples that are mislabeled as encrypted.

In another embodiment of the present invention, the effect analysis module trains on the sample data according to the feature categories and the features with a random forest algorithm to generate a model classification accuracy result, comprising:

the effect analysis module is configured to perform step a1: for the sample data, constructing a sample by random sampling with replacement; for example, starting from the data $(X_1, Y_1), \ldots, (X_n, Y_n)$, sampling n times at random to construct a bootstrap sample;

the effect analysis module is further configured to perform step a2: constructing a decision tree for each bootstrap sample;

the effect analysis module is further configured to perform step a3: repeating step a1 and step a2 to obtain a plurality of decision trees;

the effect analysis module is further configured to perform step a4: letting each decision tree vote on the input vector X, counting all votes, taking the class with the highest number of votes as the classification label of the vector X, and taking the proportion of labels that differ from the correct classification label as the misclassification rate of the random forest;

the effect analysis module is further configured to perform step a5: calculating the true positives TP and the false positives FP on the sample data, and calculating the model classification accuracy result from the obtained TP and FP.

In another embodiment of the present invention, the preset algorithm may be one of L1-regularized logistic regression (l1-logistic regression) and extreme gradient boosting (XGBoost).

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

It should be understood that the above detailed description of the technical solution of the present invention with the help of preferred embodiments is illustrative and not restrictive. On the basis of reading the description of the invention, a person skilled in the art can modify the technical solutions described in the embodiments, or make equivalent substitutions for some technical features; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A network security monitoring method is characterized by comprising the following steps:

extracting feature types and features of the collected flow;

2. The method of claim 1,

3. The method of claim 1,

4. The method of claim 1,

5. The method of claim 3,

the extracting features of the collected flow comprises the following steps:

6. The method of claim 1,

7. The method of claim 1,

8. A network security monitoring apparatus, comprising:

9. A network security monitoring server comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein,

the processor, when executing the program, performs the steps of the method of any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program which, when executed, is capable of implementing the method according to any one of claims 1-7.