CN108595667B

CN108595667B - Method for analyzing relevance of network abnormal data

Info

Publication number: CN108595667B
Application number: CN201810402502.5A
Authority: CN
Inventors: 姜文婷; 亢中苗; 陈燕; 施展; 赵瑞峰; 陈飞鹏
Original assignee: Guangdong Power Grid Co Ltd; Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Current assignee: Guangdong Power Grid Co Ltd; Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Priority date: 2018-04-28
Filing date: 2018-04-28
Publication date: 2020-06-09
Anticipated expiration: 2038-04-28
Also published as: CN108595667A

Abstract

The invention relates to a method for analyzing the relevance of network abnormal data, comprising the following steps: collecting abnormal data of the power communication network; preprocessing the collected abnormal data to obtain preprocessed abnormal data; calculating weights from the preprocessed abnormal data by principal component analysis; calculating the similarity of the abnormal data to generate a transaction database; and completing the relevance analysis on the generated transaction database based on the Apriori algorithm. The method calculates the weights of the abnormal traffic data variables by principal component analysis and reduces their dimension, then analyzes and mines the relevance of the abnormal traffic data of the power communication network with Apriori association rules. It takes both the complexity and the real state of the abnormal traffic of the power communication network fully into account and better reflects the similarity of the abnormal traffic.

Description

Method for analyzing relevance of network abnormal data

Technical Field

The invention relates to the technical field of network abnormal data processing, in particular to a method for analyzing relevance of network abnormal data.

Background

With the advance of smart grid research and practice, the power grid in the traditional sense is gradually merging with information communication systems and monitoring and control systems. The security of the power communication network is closely tied to the operational safety of the power grid and is a top priority for grid security. During the Twelfth Five-Year Plan period, the power industry continuously strengthened network security and kept improving a network security protection system tailored to the industry.

The power communication network is complex and dynamic and has certain vulnerabilities. Security events such as denial-of-service attacks, network scanning, network spoofing, viruses and trojans, and information leakage emerge one after another, yet methods for analyzing and processing the abnormal data of the power grid communication network in a timely manner are lacking, and internal and external security risks put great pressure on network security work.

Disclosure of Invention

The invention provides a method for analyzing the relevance of network abnormal data, aiming at solving the technical defects that the prior art lacks a method for analyzing and processing the abnormal data of the power communication network and brings great pressure to the network safety work.

To achieve this purpose, the technical scheme is as follows:

A method for analyzing the relevance of network abnormal data comprises the following steps:

S1: collecting abnormal data of the power communication network;

S2: preprocessing the collected abnormal data to obtain preprocessed abnormal data;

S3: calculating weights from the preprocessed abnormal data by principal component analysis;

S4: calculating the similarity of the abnormal data to generate a transaction database;

S5: completing the relevance analysis on the generated transaction database based on the Apriori algorithm.

Step S1 specifically includes: selecting key network information data over a fixed time length from the original records of abnormal traffic data collected in the power communication network, where each record comprises four classes of attributes: the time attribute TIME, i.e. the collection time period, denoted A; host-related attributes in the network, comprising the host importance level HOSTS_LEVEL, denoted B₁, and the security level of each host SECURE_LEVEL, denoted B₂; running-information attributes, comprising the total number of services running on each host SERVICE_NUM, denoted C₁, and the importance level of each service SERVICE_LEVEL, denoted C₂; and IP attributes, comprising the source address SIP, denoted D₁, and the destination address DIP, denoted D₂. The data features are therefore A, B₁, B₂, C₁, C₂, D₁, D₂. The descriptive text of B₁, B₂ and C₂ is quantized into numbers.

Specifically, in step S2, the data is cleaned to remove data records containing missing values.
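The quantization of the descriptive levels in S1 and the cleaning in S2 can be sketched as follows. This is a minimal illustration only: the text-to-number mapping `LEVEL_CODES`, the record layout and the function name `preprocess` are assumptions made for the example and are not specified in the source.

```python
# Hypothetical mapping from descriptive text to numbers (not given in the source).
LEVEL_CODES = {"low": 1, "medium": 2, "high": 3}

FIELDS = ("TIME", "HOSTS_LEVEL", "SECURE_LEVEL", "SERVICE_NUM",
          "SERVICE_LEVEL", "SIP", "DIP")
TEXT_FIELDS = ("HOSTS_LEVEL", "SECURE_LEVEL", "SERVICE_LEVEL")


def preprocess(records):
    """Drop records with missing values, then quantize the text-valued levels."""
    cleaned = []
    for rec in records:
        # Step S2: discard any record in which a field is missing or empty.
        if any(rec.get(f) in (None, "") for f in FIELDS):
            continue
        row = dict(rec)
        # Step S1 (last sentence): turn descriptive levels into numbers.
        for f in TEXT_FIELDS:
            row[f] = LEVEL_CODES[row[f]]
        cleaned.append(row)
    return cleaned


records = [
    {"TIME": 10, "HOSTS_LEVEL": "high", "SECURE_LEVEL": "medium",
     "SERVICE_NUM": 4, "SERVICE_LEVEL": "low",
     "SIP": "10.0.0.1", "DIP": "10.0.0.9"},
    {"TIME": 12, "HOSTS_LEVEL": "low", "SECURE_LEVEL": None,
     "SERVICE_NUM": 2, "SERVICE_LEVEL": "high",
     "SIP": "10.0.0.2", "DIP": "10.0.0.9"},
]
cleaned = preprocess(records)
```

In this toy input the second record lacks a security level and is therefore removed, leaving a single quantized record.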

Wherein the step S3 includes:

S31: data standardization. Standardization scales the data so that it falls into a small, fixed interval; it removes the unit limitation of the data and converts it into dimensionless pure values, so that indices of different units or orders of magnitude can be compared and weighted. Min-max (0-1) normalization is adopted here, which is a linear transformation of the original data with the transformation function:

$$X^{*} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

where $X_{\max}$ is the maximum value of the sample data, $X_{\min}$ is the minimum value of the sample data, $X$ is the collected abnormal data, and $X^{*}$ is the converted abnormal data;
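As a small illustration of the 0-1 normalization in S31, the sketch below scales one column of values into [0, 1]. The function name `min_max_normalize` and the handling of a constant column (mapped to zeros) are assumptions of this example; the source does not say how a column with identical values is treated.

```python
def min_max_normalize(column):
    """Scale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no information and maps to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Example: SERVICE_NUM values of four abnormal traffic records.
service_num = [2, 4, 10, 6]
normalized = min_max_normalize(service_num)
```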

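S32: perform principal component analysis on $X^{*}$, with the following calculation steps:

calculate the correlation coefficient matrix R:

$$R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{pmatrix}$$

where $r_{ij}$ (i, j = 1, 2, ..., p) is the correlation coefficient of the original variables $x_i$ and $x_j$, with $r_{ij} = r_{ji}$, calculated as

$$r_{ij} = \frac{\sum_{k=1}^{m}(x_{ki}-\bar{x}_i)(x_{kj}-\bar{x}_j)}{\sqrt{\sum_{k=1}^{m}(x_{ki}-\bar{x}_i)^{2}\,\sum_{k=1}^{m}(x_{kj}-\bar{x}_j)^{2}}}$$

where $\bar{x}_i$ and $\bar{x}_j$ denote the means of the i-th and j-th variables of the $X^{*}$ matrix over its m records, from which the correlation coefficient matrix of A, B₁, B₂, C₁, C₂, D₁, D₂ is obtained;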
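The correlation coefficient matrix can be computed directly from the normalized records. The following is a minimal pure-Python sketch using the standard Pearson correlation coefficient; the names `correlation_matrix` and `samples` are hypothetical, and the sketch assumes that no variable is constant (a constant column would make the denominator zero).

```python
import math


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of `rows` (one row per record)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    # Center every column on its mean.
    cols = [[r[j] - means[j] for r in rows] for j in range(p)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in cols]
    R = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            R[i][j] = dot / (norms[i] * norms[j])
    return R


# Three records with three variables each (toy data).
samples = [[0.0, 0.0, 1.0], [0.5, 1.0, 0.5], [1.0, 2.0, 0.0]]
R = correlation_matrix(samples)
```

In the toy data the second variable rises with the first and the third falls with it, so the corresponding coefficients are close to +1 and -1.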

calculate the eigenvalues and eigenvectors by solving the characteristic equation:

|λI-R|＝0

The eigenvalues are obtained by the Jacobi method and arranged in descending order, λ₁ ≥ λ₂ ≥ ... ≥ λ_p ≥ 0; the eigenvector e_i (i = 1, 2, ..., p) corresponding to each eigenvalue λ_i is then calculated, with the requirement ||e_i|| = 1, i.e.

$$\sum_{j=1}^{p} e_{ij}^{2} = 1$$

where e_ij denotes the j-th component of the vector e_i;
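The Jacobi method named above diagonalizes a symmetric matrix by repeated plane rotations. The sketch below is one common cyclic variant, given as an illustration rather than as the exact procedure of the invention; the function name `jacobi_eigen`, the tolerance and the sweep limit are assumptions.

```python
import math


def jacobi_eigen(matrix, tol=1e-12, max_sweeps=100):
    """Eigenvalues (descending) and unit eigenvectors of a symmetric matrix."""
    n = len(matrix)
    A = [row[:] for row in matrix]
    V = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the off-diagonal entry A[p][q].
                theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # A <- A J (columns p and q)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- J^T A (rows p and q)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # accumulate the rotations: V <- V J
                    vkp, vkq = V[k][p], V[k][q]
                    V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    order = sorted(range(n), key=lambda i: A[i][i], reverse=True)
    eigenvalues = [A[i][i] for i in order]
    # Eigenvector i is the i-th column of V; the columns stay orthonormal.
    eigenvectors = [[V[k][i] for k in range(n)] for i in order]
    return eigenvalues, eigenvectors


values, vectors = jacobi_eigen([[2.0, 1.0], [1.0, 2.0]])
```

For the 2×2 example the eigenvalues are 3 and 1, and each returned eigenvector has unit length, matching the requirement ||e_i|| = 1.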

calculate the principal component contribution rate and the cumulative contribution rate:

contribution rate:

$$\frac{\lambda_i}{\sum_{k=1}^{p}\lambda_k}$$

cumulative contribution rate:

$$\frac{\sum_{k=1}^{i}\lambda_k}{\sum_{k=1}^{p}\lambda_k}$$

where λ_i and λ_k are the non-negative eigenvalues, i = 1, 2, ..., p, and p is the number of non-negative eigenvalues;
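The contribution rate and cumulative contribution rate follow directly from the sorted eigenvalues. In the sketch below the function names and the 0.85 cut-off used to pick the number of components are assumptions for illustration; the source determines the number n of principal components by analyzing the principal component scores and does not state a cut-off.

```python
def contribution_rates(eigenvalues):
    """Return (contribution rates, cumulative contribution rates)."""
    total = sum(eigenvalues)
    rates = [lam / total for lam in eigenvalues]
    cumulative, running = [], 0.0
    for rate in rates:
        running += rate
        cumulative.append(running)
    return rates, cumulative


def components_needed(cumulative, target=0.85):
    """Smallest number of components whose cumulative rate reaches `target`.

    The 0.85 default is a hypothetical cut-off, not a value from the source.
    """
    for count, value in enumerate(cumulative, start=1):
        if value >= target:
            return count
    return len(cumulative)


rates, cumulative = contribution_rates([3.0, 1.0, 0.5, 0.5])
n_components = components_needed(cumulative)
```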

calculate the principal component loads l_ij:

$$l_{ij} = \sqrt{\lambda_i}\, e_{ij}$$

where e_ij is the unit vector component. From l_ij the component matrix Z and the principal component scores are obtained.

determine the weights by principal component analysis: the variance contribution rate of each principal component is taken as its weight, and the weight of an index equals the normalized weighted average of the coefficients of that index in the linear combinations of the principal components. Determining the index weights therefore takes three steps:

calculate the coefficients in the principal component linear combinations: divide each load in the component matrix obtained from the principal component loads by the square root of the corresponding eigenvalue, i.e. $l_{ij}/\sqrt{\lambda_i}$.

This gives the coefficients of the linear combination of each principal component. The number of principal components, determined by analyzing the principal component scores, is denoted n (n ≤ 7), and n sets of linear combination coefficients of A, B₁, B₂, C₁, C₂, D₁, D₂ are obtained, one for each linear combination F₁, F₂, ..., F_n,

where x₁, x₂, ..., x₇ correspond to A, B₁, B₂, C₁, C₂, D₁, D₂;

calculate the variance contribution rate of the principal components: the larger the variance contribution rate, the more important the principal component, so the variance contribution rate is taken as the weight of each principal component. The original data are replaced by the n principal components, and the coefficients in the linear combinations are averaged with weights given by each principal component's share of the variance contribution rate:

F＝c₁F₁+c₂F₂+…+c_nF_n

where c₁, c₂, ..., c_n are the shares of F₁, F₂, ..., F_n in the total variance contribution rate;

combining coefficients in the principal component linear combination to obtain:

F＝w₁x₁+w₂x₂+…+w₇x₇

where w₁, w₂, ..., w₇ are the weights, which are then normalized;

when the weight of the data variable is lower than the weight threshold, the data variable is considered to be low in association degree with the abnormal flow data analysis, and the data variable is deleted.
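The three weighting steps and the threshold-based deletion can be sketched as below. The function names and the toy numbers are assumptions; the 0.05 default is the weight threshold stated in claim 1. The sketch also assumes that the combined coefficients are positive, since the source does not say how negative coefficients would be handled before normalization.

```python
import math


def index_weights(eigenvalues, loadings, n_components):
    """Normalized index weights; loadings[k][j] is the load of variable j on component k."""
    total = sum(eigenvalues[:n_components])
    # c_k: share of each retained component in the retained variance contribution.
    shares = [eigenvalues[k] / total for k in range(n_components)]
    raw = []
    for j in range(len(loadings[0])):
        # Coefficient of variable j in component k is load / sqrt(eigenvalue);
        # the coefficients are averaged with the shares c_k as weights.
        raw.append(sum(shares[k] * loadings[k][j] / math.sqrt(eigenvalues[k])
                       for k in range(n_components)))
    scale = sum(raw)
    return [w / scale for w in raw]


def drop_low_weight(names, weights, threshold=0.05):
    """Keep only the variables whose weight reaches the threshold."""
    return [name for name, w in zip(names, weights) if w >= threshold]


# Toy example with three variables and two retained components.
eigenvalues = [4.0, 1.0]
loadings = [[1.6, 1.2, 0.04],
            [0.3, 0.4, 0.02]]
weights = index_weights(eigenvalues, loadings, n_components=2)
kept = drop_low_weight(["A", "B1", "B2"], weights)
```

Here the third variable receives a weight of about 0.016, below the 0.05 threshold, so only the first two variables are kept.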

Wherein the step S4 includes the steps of:

calculate the similarity between abnormal traffic data:

similarity δ₁ of time information:

where t₁, t₂ are the detection times of abnormal flows A and B, and T_win is the designed reference time;

similarity δ₂ of host-related information:

where S₁, S₂ are the host importance levels of abnormal flows A and B, and N_S is the number of importance levels;

similarity δ₃ of the host security protection level:

where C₁, C₂ are the host protection levels of abnormal flows A and B, and N_C is the total number of protection levels;

similarity δ₄ of the total number of running services:

where I₁, I₂ are the total numbers of services running on the hosts of abnormal flows A and B, and N_I is the total number of importance levels of the services running on the host;

similarity δ₅ of the importance level of running services:

where l₁, l₂ are the importance levels of the services running on the hosts of abnormal flows A and B, and N_l is the total number of importance levels of the services running on the host;

similarity δ of IP-related information₆: let the binary numbers of the IP addresses of the two abnormal traffic devices A, B be IP1 and IP2, respectively, XOR the IP addresses to obtain diff as IP1 XOR IP2, start scanning from the left side of diff, encounter 1 and stop, and define the variable p as the number of 0 encountered in scanning, then the IP similarity function is:

respectively calculating source IP addresses delta according to the obtained similarity function₆；
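The variable p defined above is the length of the common prefix of the two addresses. The sketch below computes p for IPv4 addresses; the normalization p/32 used in `ip_similarity` is an assumption made for illustration, since the similarity function itself is not reproduced in this text, and both function names are hypothetical.

```python
import ipaddress


def common_prefix_length(ip1, ip2):
    """Number of leading 0 bits in IP1 XOR IP2, i.e. the shared prefix length p."""
    diff = int(ipaddress.IPv4Address(ip1)) ^ int(ipaddress.IPv4Address(ip2))
    # bit_length() is the position of the first 1 counted from the right,
    # so 32 - bit_length() is the count of zeros scanned from the left.
    return 32 - diff.bit_length()


def ip_similarity(ip1, ip2):
    """Assumed similarity: share of the 32 address bits that form a common prefix."""
    return common_prefix_length(ip1, ip2) / 32


p_same_subnet = common_prefix_length("192.168.1.10", "192.168.1.20")
p_identical = common_prefix_length("10.0.0.1", "10.0.0.1")
p_far_apart = common_prefix_length("10.0.0.1", "138.0.0.1")
```

Addresses in the same subnet share a long prefix (27 bits in the first example), while addresses that differ in the first bit share none.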

Calculating the similarity of the abnormal flow η:

obtaining the similarity between each abnormal flow, so as to convert the operation on the variable in the abnormal flow data into the operation on each abnormal flow;

generating a transaction database according to the similarity:

set the similarity thresholds: the thresholds are set according to the calculated similarity between abnormal flows; based on an analysis of experimental results, the maximum threshold is set to 0.5, the minimum threshold to 0.1 and the discard threshold to 0.05;

generate the transaction database D according to the similarity thresholds: when the similarity between two abnormal flows is below 0.05, it is considered too low for any correlation to exist; when it is above 0.5, the degree of association is considered high and the two flows form a transaction data item containing 2 abnormal flows; starting from a transaction data item whose similarity is above 0.5, if the similarity between its two abnormal flows and another abnormal flow is above 0.48, a transaction data item containing 3 abnormal flows is generated, and so on: each time an abnormal flow is added to a transaction data item the required similarity decreases by 0.02, but no abnormal flow can be added once the similarity is below 0.1.
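The threshold schedule can be sketched as follows. The function names, the greedy order in which candidate flows are tried, and the reading that a new flow must exceed the required similarity with every flow already in the item are assumptions of this example; the toy similarity table `ETA` is made up, and the 0.05 discard threshold is not modelled.

```python
def required_similarity(size, start=0.5, step=0.02, floor=0.1):
    """Similarity required for a transaction item that holds `size` abnormal flows."""
    return max(round(start - step * (size - 2), 2), floor)


def build_transaction(seed_pair, candidates, sim):
    """Grow a transaction item from a pair whose similarity already exceeds 0.5."""
    item = list(seed_pair)
    for flow in candidates:
        if flow in item:
            continue
        need = required_similarity(len(item) + 1)
        # Assumption: the new flow must be similar enough to every current member.
        if all(sim(flow, member) > need for member in item):
            item.append(flow)
    return item


# Toy pairwise similarities between four abnormal flows.
ETA = {
    frozenset(("f1", "f2")): 0.62,
    frozenset(("f1", "f3")): 0.49,
    frozenset(("f2", "f3")): 0.51,
    frozenset(("f1", "f4")): 0.30,
    frozenset(("f2", "f4")): 0.47,
    frozenset(("f3", "f4")): 0.47,
}


def eta(a, b):
    return ETA[frozenset((a, b))]


item = build_transaction(("f1", "f2"), ["f3", "f4"], eta)
```

Flow f3 clears the 0.48 requirement against both seed flows and joins the item; f4 would need 0.46 against every member but only reaches 0.30 with f1, so it is left out.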

Wherein the step S5 includes the steps of:

S51: set the minimum support threshold min_sup and the minimum confidence threshold min_conf, where the minimum support threshold is 20% and the minimum confidence threshold is 80%;

S52: generate frequent itemsets by iterating over the transaction database D: in the first iteration of the algorithm, the transaction database D is scanned once, the number of occurrences of each item contained in D is counted, and the set C₁ of candidate 1-itemsets is generated;

S53: according to the set minimum support, determine the set L₁ of frequent 1-itemsets from C₁, and proceed by analogy to obtain the set L_k of frequent k-itemsets, where k is 7;

S54: on the basis of the frequent itemsets, generate the rules that satisfy the minimum confidence; these rules are called strong association rules. The correlation of the abnormal data of the power communication network is thereby obtained, which in turn enables situation awareness and prediction for the power communication data network.
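Step S54 can be illustrated with a short sketch that derives strong association rules from already computed itemset supports. The function name `strong_rules`, the dictionary layout and the toy supports are assumptions; the confidence of a rule X → Y is taken, as usual, to be support(X ∪ Y) / support(X), and the 0.8 default is the minimum confidence given in S51.

```python
from itertools import combinations


def strong_rules(support, min_conf=0.8):
    """Rules (antecedent, consequent, confidence) that meet the minimum confidence.

    `support` maps each frequent itemset (a frozenset) to its support; by the
    Apriori property every subset of a frequent itemset is also present.
    """
    rules = []
    for itemset, sup in support.items():
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                confidence = sup / support[lhs]
                if confidence >= min_conf:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


support = {
    frozenset({"f1"}): 0.5,
    frozenset({"f2"}): 0.4,
    frozenset({"f1", "f2"}): 0.35,
}
rules = strong_rules(support)
```

With these supports the rule f2 → f1 has confidence 0.875 and is kept, while f1 → f2 has confidence 0.7 and is rejected.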

In the above scheme, the strategy of the Apriori-based relevance analysis is as follows:

Join step: the set C_k of candidate k-itemsets is generated by joining the set L_{k-1} of frequent (k-1)-itemsets with itself. Apriori assumes that the items within an itemset are sorted in lexicographic order. If the first (k-2) items of two elements (itemsets) itemset1 and itemset2 of L_{k-1} are identical, itemset1 and itemset2 are said to be joinable, and the itemset resulting from joining them is {itemset1[1], itemset1[2], …, itemset1[k-1], itemset2[k-1]};

Pruning strategy: by the Apriori property, any infrequent (k-1)-itemset cannot be a subset of a frequent k-itemset. Therefore, if a (k-1)-subset of a candidate k-itemset in C_k is not in L_{k-1}, the candidate cannot be frequent and can be deleted from C_k, yielding the compressed C_k;

Deletion strategy: based on the compressed C_k, scan all transactions, count each itemset in C_k, and delete the itemsets that do not meet the minimum support, thereby obtaining the frequent k-itemsets;
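The join, pruning and deletion strategies fit together as in the following minimal sketch of frequent itemset mining. The function name `apriori` and the toy transactions are assumptions; the default minimum support of 0.2 is the value given in the text, while the example call uses 0.4 only because the toy database has just five transactions.

```python
from itertools import combinations


def apriori(transactions, min_sup=0.2):
    """Return a dict mapping each frequent itemset (frozenset) to its support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Deletion strategy: count every candidate and drop those below min_sup.
        kept = {}
        for cand in candidates:
            sup = sum(1 for t in transactions if cand <= t) / n
            if sup >= min_sup:
                kept[cand] = sup
        return kept

    level = frequent({frozenset([i]) for t in transactions for i in t})
    all_frequent = dict(level)
    k = 2
    while level:
        prev = sorted(tuple(sorted(s)) for s in level)
        candidates = set()
        for a, b in combinations(prev, 2):
            # Join step: two (k-1)-itemsets sharing their first k-2 items.
            if a[:k - 2] == b[:k - 2]:
                cand = frozenset(a) | frozenset(b)
                # Pruning strategy: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in level
                       for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
        level = frequent(candidates)
        all_frequent.update(level)
        k += 1
    return all_frequent


transactions = [
    {"f1", "f2", "f3"},
    {"f1", "f2"},
    {"f1", "f3"},
    {"f2", "f4"},
    {"f1", "f2"},
]
freq = apriori(transactions, min_sup=0.4)
```

In this example the candidate {f1, f2, f3} is removed by pruning, because its subset {f2, f3} is not frequent, before any transaction is scanned for it.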

According to the scheme, data processing is carried out on abnormal flow data acquired from a power communication network, the weight of abnormal flow data variables is calculated based on a principal component analysis method and dimension reduction is carried out, the similarity between abnormal flows is calculated by using the weight to generate a transaction type database, and then the transaction type database is associated based on an Apriori association rule algorithm to generate a strong association rule.

Compared with the prior art, the invention has the beneficial effects that:

the invention provides a relevance analysis method of network abnormal data, which is characterized in that the weight of abnormal flow data variables is calculated by using a principal component analysis method, dimension reduction is carried out, and the similarity between abnormal flows is calculated by using the weight to generate a transaction type database; the relevance of the abnormal traffic data of the power communication network is analyzed and mined by using an Apriori association rule, the complexity of the abnormal traffic of the power communication network is fully considered, the real state of the abnormal traffic of the network is comprehensively considered, and the similarity of the abnormal traffic of the network is better reflected.

Drawings

Fig. 1 is a schematic flow chart of a method for analyzing relevance of network abnormal data.

FIG. 2 is a flowchart of an algorithm of a method for analyzing relevance of network anomaly data;

FIG. 3 is a comparison of average temporal complexity.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent;

the invention is further illustrated below with reference to the figures and examples.

Example 1

As shown in fig. 1 and fig. 2, a method for analyzing relevance of network abnormal data includes the following steps:

s1: collecting abnormal data of the power communication network;

More specifically, step S1 includes: selecting key network information data over a fixed time length from the original records of abnormal traffic data collected in the power communication network, where each record comprises four classes of attributes: the time attribute TIME, i.e. the collection time period, denoted A; host-related attributes in the network, comprising the host importance level HOSTS_LEVEL, denoted B₁, and the security level of each host SECURE_LEVEL, denoted B₂; running-information attributes, comprising the total number of services running on each host SERVICE_NUM, denoted C₁, and the importance level of each service SERVICE_LEVEL, denoted C₂; and IP attributes, comprising the source address SIP, denoted D₁, and the destination address DIP, denoted D₂. The data features are therefore A, B₁, B₂, C₁, C₂, D₁, D₂. The descriptive text of B₁, B₂ and C₂ is quantized into numbers.

More specifically, in step S2, the data is cleaned to remove the data records containing missing values.

More specifically, the step S3 includes:

where X_max is the maximum value of the sample data, X_min is the minimum value of the sample data, X is the collected abnormal data, and X* is the converted abnormal data;

S32: perform principal component analysis on X*, with the following calculation steps:

|λI-R|＝0

where e_ij denotes the j-th component of the vector e_i;

contribution rate:

cumulative contribution rate:

calculate the principal component loads l_ij:

$$l_{ij} = \sqrt{\lambda_i}\, e_{ij}$$

F＝c₁F₁+c₂F₂+...+c_nF_n

combining coefficients in the principal component linear combination to obtain:

F＝w₁x₁+w₂x₂+...+w₇x₇

Wherein the step S4 includes the steps of:

calculating the similarity between abnormal flow data:

similarity δ₁ of time information:

similarity δ₂ of host-related information:

similarity δ₃ of the host security protection level:

similarity δ₄ of the total number of running services:

similarity δ₅ of the importance level of running services:

Calculating the similarity of the abnormal flow η:

generating a transaction database according to the similarity:

More specifically, the step S5 includes the following steps:

S52: generate frequent itemsets by iterating over the transaction database D: in the first iteration of the algorithm, the transaction database D is scanned once, the number of occurrences of each item contained in D is counted, and the set C₁ of candidate 1-itemsets is generated;

In the specific implementation process, the algorithm strategy based on Apriori algorithm relevance analysis is as follows:

Deletion strategy: based on the compressed C_k, scan all transactions, count each itemset in C_k, and delete the itemsets that do not meet the minimum support, thereby obtaining the frequent k-itemsets;

In the specific implementation process, data processing is performed on abnormal flow data acquired from a power communication network, the weight of abnormal flow data variables is calculated based on a principal component analysis method, dimension reduction is performed, the similarity between abnormal flows is calculated by using the weight to generate a transaction-type database, and then the transaction-type database is associated based on an Apriori association rule algorithm to generate a strong association rule.

In the specific implementation process, as shown in Fig. 3, data preprocessing is performed first to remove records with missing values. The abnormal traffic data variables are then processed by principal component analysis, which yields the weights of the data variables while reducing their dimension. On the basis of the simplified abnormal traffic data, the similarity between abnormal flows is calculated, and the abnormal traffic of the power communication network is turned into a transaction database accordingly. The association rules of the abnormal traffic are then obtained through step S5.

In the specific implementation process, owing to an inherent weakness of Apriori association rule mining, the space and time complexity of running Apriori grow as the data grows, because Apriori must access and iterate over the database many times. With current data processing platforms, the time cost of the iterative algorithm can be effectively reduced by programming on the Spark parallel computing framework; Spark also supports caching data that is used repeatedly, which to some extent relieves the pressure of repeated database access.

In the specific implementation process, principal component analysis is used as the data processing method and Apriori is used to obtain the association rules. Principal component analysis reduces the data processing workload, and determining the weights of the abnormal traffic data variables by principal component analysis is objective and reasonable. Apriori association rules have been widely applied in fields such as business and network security, where the associations in data are analyzed and mined to extract useful information.

It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims

1. A method for analyzing relevance of network abnormal data is characterized in that: the method comprises the following steps:

s1: collecting abnormal data of the power communication network;

s5: according to the generated transaction database, relevance analysis is completed based on an Apriori algorithm;

wherein the step S3 includes:

s31: and (3) data standardization treatment: the standardization of the data is to scale the data, so that the data falls into a small specific interval, which is mainly used for removing unit limitation of the data and converting the unit limitation into a dimensionless pure numerical value, thereby facilitating the comparison and weighting of indexes of different units or orders of magnitude; the extreme value normalization method, namely 0-1normalization, is adopted here, and is a linear transformation on the original data, and the specific expression of the conversion function is as follows:

|λI-R|＝0

where e_ij denotes the j-th component of the vector e_i;

contribution rate:

cumulative contribution rate:

calculate the principal component loads l_ij:

$$l_{ij} = \sqrt{\lambda_i}\, e_{ij}$$

F＝c₁F₁+c₂F₂+...+c_nF_n

combining coefficients in the principal component linear combination to obtain:

F＝w₁x₁+w₂x₂+...+w₇x₇

where w₁, w₂, ..., w₇ are the weights, which are then normalized;

while determining the weights by principal component analysis, a weight threshold of 0.05 is set for the data variables; when the weight of a data variable is below the weight threshold, the variable is considered to have a low degree of association with the abnormal traffic data analysis and is deleted;

the step S4 includes the steps of:

calculating the similarity between abnormal flow data:

similarity δ₁ of time information:

similarity δ₂ of host-related information:

similarity δ₃ of the host security protection level:

similarity δ₄ of the total number of running services:

where I₁, I₂ are the total numbers of services running on the hosts of abnormal flows A and B, and N_I is the total number of importance levels of the services running on the host;

similarity δ₅ of the importance level of running services:

Calculating the similarity of the abnormal flow η:

generating a transaction database according to the similarity:

2. The method according to claim 1, wherein step S1 specifically includes: selecting key network information data over a fixed time length from the original records of abnormal traffic data collected in the power communication network, where each record comprises four classes of attributes: the time attribute TIME, i.e. the collection time period, denoted A; host-related attributes in the network, comprising the host importance level HOSTS_LEVEL, denoted B₁, and the security level of each host SECURE_LEVEL, denoted B₂; running-information attributes, comprising the total number of services running on each host SERVICE_NUM, denoted C₁, and the importance level of each service SERVICE_LEVEL, denoted C₂; and IP attributes, comprising the source address SIP, denoted D₁, and the destination address DIP, denoted D₂; the data features are A, B₁, B₂, C₁, C₂, D₁, D₂; and the descriptive text of B₁, B₂ and C₂ is quantized into numbers.

3. The method according to claim 2, wherein step S2 specifically comprises cleaning the data and removing the data records containing missing values.

4. The method according to claim 3, wherein the method comprises the following steps: the step S5 includes the steps of: