CN108595667B - Method for analyzing relevance of network abnormal data - Google Patents
Publication number: CN108595667B
Application number: CN201810402502.5A
Authority: CN (China)
Prior art keywords: data, abnormal, similarity, principal component, weight
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
Abstract
The invention relates to a method for analyzing the relevance of network abnormal data, comprising the following steps: collecting abnormal data of the power communication network; preprocessing the collected abnormal data to obtain preprocessed abnormal data; calculating weight values from the preprocessed abnormal data by principal component analysis; calculating the similarity of the abnormal data to generate a transaction database; and completing the relevance analysis on the generated transaction database based on the Apriori algorithm. In this method, the weights of the abnormal traffic data variables are calculated by principal component analysis and the dimension is reduced, and the relevance of the abnormal traffic data of the power communication network is analyzed and mined using Apriori association rules. The method takes into account both the complexity and the real state of abnormal traffic in the power communication network, and better reflects the similarity between abnormal traffic flows.
Description
Technical Field
The invention relates to the technical field of network abnormal data processing, in particular to a method for analyzing relevance of network abnormal data.
Background
With the advance of research and practice on smart power grids, the power grid in the traditional sense is gradually merging with information communication systems and monitoring control systems. The security of the power communication network is closely tied to the operational safety of the power grid and is central to power grid security. During the 12th Five-Year Plan period, the power industry continuously strengthened network security and improved a network security protection system tailored to the characteristics of the industry.
The power communication network system is complex and dynamic and has certain vulnerabilities. Security events such as denial-of-service attacks, network scanning, network spoofing, viruses and Trojans, and information leakage emerge one after another, yet there is a lack of methods for analyzing and processing the abnormal data of the power communication network in time, so internal and external security risks put great pressure on network security work.
Disclosure of Invention
The invention provides a method for analyzing the relevance of network abnormal data, aiming to overcome the technical defect that the prior art lacks a method for analyzing and processing the abnormal data of the power communication network, which puts great pressure on network security work.
To achieve this purpose, the technical scheme is as follows:
A method for analyzing the relevance of network abnormal data comprises the following steps:
S1: collecting abnormal data of the power communication network;
S2: preprocessing the collected abnormal data to obtain preprocessed abnormal data;
S3: calculating weight values from the preprocessed abnormal data by principal component analysis;
S4: calculating the similarity of the abnormal data to generate a transaction database;
S5: completing the relevance analysis on the generated transaction database based on the Apriori algorithm.
Wherein the step S1 specifically includes: selecting key network information data of a fixed time length from the original records of abnormal traffic data collected in the power communication network, wherein each piece of data comprises 4 attributes: the TIME attribute, i.e. the collection time period, denoted $A$; host-related attributes in the network, including the host importance level HOSTS_LEVEL, denoted $B_1$, and the security level SECURE_LEVEL of each host, denoted $B_2$; running information attributes, including the total number of services running on each host, SERVICE_NUM, denoted $C_1$, and the importance level SERVICE_LEVEL of each service, denoted $C_2$; and IP attributes: the source address SIP, denoted $D_1$, and the destination address DIP, denoted $D_2$. The data features are $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$; the descriptive text of $B_1$, $B_2$ and $C_2$ is quantized to numbers.
Specifically, in step S2, the data are cleaned to remove data records containing missing values.
Wherein the step S3 includes:
S31: Data standardization: the data are scaled so that they fall into a small specific interval. This removes the unit limitation of the data and converts them into dimensionless pure numerical values, so that indexes of different units or orders of magnitude can be compared and weighted. Extreme value normalization (0-1 normalization), a linear transformation of the original data, is adopted here; the transformation function is:
$$X^{*} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
wherein $X_{\max}$ is the maximum value of the sample data, $X_{\min}$ is the minimum value of the sample data, $X$ is the collected abnormal data, and $X^{*}$ is the converted abnormal data;
S32: Principal component analysis is performed on $X^{*}$; the calculation steps are as follows:
Calculating the correlation coefficient matrix $R$:
$$R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{pmatrix}$$
wherein $r_{ij}$ ($i, j = 1, 2, \ldots, p$) is the correlation coefficient of the original variables $x_i$ and $x_j$, with $r_{ij} = r_{ji}$, calculated as
$$r_{ij} = \frac{\sum_{k=1}^{m}(x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{m}(x_{ki} - \bar{x}_i)^2 \sum_{k=1}^{m}(x_{kj} - \bar{x}_j)^2}}$$
wherein $\bar{x}_i$ and $\bar{x}_j$ are the means of the variables $x_i$ and $x_j$ over the $m$ rows of the data matrix, from which the correlation coefficient matrix of $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$ is obtained;
Calculating the eigenvalues and eigenvectors by solving the characteristic equation:
$$|\lambda I - R| = 0$$
The eigenvalues are obtained by the Jacobi method and arranged in descending order, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$; the eigenvector $e_i$ ($i = 1, 2, \ldots, p$) corresponding to each eigenvalue $\lambda_i$ is then calculated, requiring $\|e_i\| = 1$, i.e. $\sum_{j=1}^{p} e_{ij}^2 = 1$, wherein $e_{ij}$ is the $j$-th component of the vector $e_i$;
Calculating the principal component contribution rate and the cumulative contribution rate:
$$\frac{\lambda_i}{\sum_{k=1}^{p}\lambda_k} \quad\text{and}\quad \frac{\sum_{k=1}^{i}\lambda_k}{\sum_{k=1}^{p}\lambda_k}$$
wherein $\lambda_i$, $\lambda_k$ are the non-negative eigenvalues (characteristic roots), $i = 1, 2, \ldots, p$, and $p$ is the number of non-negative characteristic roots;
Calculating the principal component loads $l_{ij}$:
$$l_{ij} = \sqrt{\lambda_i}\, e_{ij}$$
wherein $e_{ij}$ is a unit vector component; from $l_{ij}$ the component matrix $Z$ and the principal component scores are obtained;
Determining the weights by principal component analysis: the weight of an index equals the normalized weighted average of the coefficients of that index in the linear combinations of the principal components, with the variance contribution rates of the principal components as the weights. Determining the index weights therefore requires three steps:
Calculating the coefficients in the linear combinations of the principal components: each load in the component matrix obtained from the principal component loads is divided by the square root of the corresponding eigenvalue, i.e. $l_{ij}/\sqrt{\lambda_i}$, giving the coefficients of the linear combinations. The number of principal components is determined by analyzing the principal component scores and is denoted $n$ ($n \le 7$), giving $n$ sets of linear combination coefficients of $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$:
$$F_i = \sum_{j=1}^{7} \frac{l_{ij}}{\sqrt{\lambda_i}}\, x_j, \quad i = 1, 2, \ldots, n$$
wherein $x_1, x_2, \ldots, x_7$ correspond to $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$;
Calculating the variance contribution rates of the principal components: the greater the variance contribution rate, the more important the principal component, so the variance contribution rate is taken as the weight of each principal component. The original data are replaced by the $n$ principal components, and the coefficients in the linear combinations are averaged with the principal components weighted by their share of the variance contribution rate:
$$F = c_1F_1 + c_2F_2 + \cdots + c_nF_n$$
wherein $c_1, c_2, \ldots, c_n$ are the shares of $F_1, F_2, \ldots, F_n$ in the variance contribution rate;
Combining the coefficients in the linear combinations of the principal components gives:
$$F = w_1x_1 + w_2x_2 + \cdots + w_7x_7$$
wherein $w_1, w_2, \ldots, w_7$ are the weights, which are then normalized;
When the weight of a data variable is lower than the weight threshold, the variable is considered to have a low degree of association with the abnormal traffic data analysis and is deleted.
Wherein the step S4 includes the following steps:
Calculating the similarity between abnormal traffic data:
Similarity of time information $\delta_1$, wherein $t_1$, $t_2$ are the detection times of abnormal flows A and B, and $T_{win}$ is the reference design time;
Similarity of host related information $\delta_2$, wherein $S_1$, $S_2$ are the host importance levels of abnormal flows A and B, and $N_S$ is the number of importance levels;
Similarity of host security protection level $\delta_3$, wherein $C_1$, $C_2$ are the host protection levels of abnormal flows A and B, and $N_C$ is the total number of protection levels;
Similarity of the total number of running services $\delta_4$, wherein $I_1$, $I_2$ are the total numbers of services running on the hosts of abnormal flows A and B, and $N_I$ is the total number of grades for services running on the host;
Similarity of running service importance levels $\delta_5$, wherein $l_1$, $l_2$ are the importance levels of the services running on the hosts of abnormal flows A and B, and $N_l$ is the total number of importance grades for services running on the host;
Similarity of IP related information $\delta_6$: let the binary IP addresses of the two abnormal flows A and B be IP1 and IP2; XOR them to obtain diff = IP1 XOR IP2; scan diff from the left and stop at the first 1; define the variable $p$ as the number of 0s encountered during the scan; the IP similarity function is then expressed in terms of $p$;
The similarity $\delta_6$ of the source IP addresses is calculated according to the obtained similarity function;
Calculating the similarity $\eta$ of the abnormal flows, which gives the similarity between every pair of abnormal flows, so that operations on the variables in the abnormal traffic data are converted into operations on the abnormal flows themselves;
Generating a transaction database according to the similarity:
Setting similarity thresholds: thresholds are set according to the calculated similarity between abnormal flows. Based on analysis of experimental results, the maximum threshold is set to 0.5, the minimum threshold to 0.1, and the discard threshold to 0.05;
Generating the transaction database D according to the similarity thresholds: when the similarity is lower than 0.05, the similarity between the abnormal flows is considered too low for any correlation to exist; when the similarity between two abnormal flows is higher than 0.5, their degree of association is considered high, and the two form a transaction data item containing 2 abnormal flows; given such a transaction data item, if the similarity between its two flows and another abnormal flow is higher than 0.48, a transaction data item containing 3 abnormal flows is generated; and so on, with the required similarity reduced by 0.02 each time an abnormal flow is added to the transaction data item, except that no abnormal flow can be added when its similarity is lower than 0.1.
Wherein the step S5 includes the following steps:
S51: Setting a minimum support threshold min_sup and a minimum confidence threshold min_conf, wherein the minimum support threshold is 20% and the minimum confidence threshold is 80%;
S52: Generating frequent item sets by iterating over the transaction database D: in the first iteration of the algorithm, the transaction database D is scanned once, the number of occurrences of each item contained in D is counted, and the candidate 1-item set $C_1$ is generated;
S53: According to the set minimum support, the frequent 1-item set $L_1$ is determined from $C_1$; by analogy, the frequent item set $L_k$ is obtained, where k is 7;
S54: Rules satisfying the minimum confidence are generated from the frequent item sets; such rules are called strong association rules. They give the correlation of the abnormal data of the power communication network and thereby support situation awareness and prediction for the power communication data network.
In the above scheme, the algorithm strategy of the Apriori-based relevance analysis is as follows:
Join step: the candidate k-item set $C_k$ is generated by joining the frequent (k-1)-item set $L_{k-1}$ with itself. Apriori assumes that the items in an item set are sorted in lexicographic order. If the first (k-2) items of two elements (item sets) itemset1 and itemset2 of $L_{k-1}$ are identical, itemset1 and itemset2 are said to be joinable. The item set produced by joining itemset1 with itemset2 is {itemset1[1], itemset1[2], …, itemset1[k-1], itemset2[k-1]};
Pruning strategy: by the Apriori property, any infrequent (k-1)-item set cannot be a subset of a frequent k-item set. Thus, if any (k-1)-item subset of a candidate k-item set in $C_k$ is not in $L_{k-1}$, the candidate cannot be frequent and can be deleted from $C_k$, giving a compressed $C_k$;
Deletion strategy: based on the compressed $C_k$, all transactions are scanned, each item set in $C_k$ is counted, and the item sets that do not satisfy the minimum support are deleted, giving the frequent k-item set.
According to the scheme, the abnormal traffic data collected from the power communication network are processed as follows: the weights of the abnormal traffic data variables are calculated by principal component analysis and the dimension is reduced; the similarity between abnormal flows is calculated using the weights to generate a transaction database; and the transaction database is then mined with the Apriori association rule algorithm to generate strong association rules.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a method for analyzing the relevance of network abnormal data, in which the weights of the abnormal traffic data variables are calculated by principal component analysis, the dimension is reduced, and the similarity between abnormal flows is calculated using the weights to generate a transaction database. The relevance of the abnormal traffic data of the power communication network is analyzed and mined using Apriori association rules. The method takes into account both the complexity and the real state of abnormal traffic in the power communication network, and better reflects the similarity between abnormal traffic flows.
Drawings
Fig. 1 is a schematic flow chart of a method for analyzing relevance of network abnormal data.
FIG. 2 is a flowchart of an algorithm of a method for analyzing relevance of network anomaly data;
FIG. 3 is a comparison of average temporal complexity.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
The invention is further illustrated below with reference to the figures and examples.
Example 1
As shown in Fig. 1 and Fig. 2, a method for analyzing the relevance of network abnormal data includes the following steps:
S1: collecting abnormal data of the power communication network;
S2: preprocessing the collected abnormal data to obtain preprocessed abnormal data;
S3: calculating weight values from the preprocessed abnormal data by principal component analysis;
S4: calculating the similarity of the abnormal data to generate a transaction database;
S5: completing the relevance analysis on the generated transaction database based on the Apriori algorithm.
More specifically, the step S1 includes: selecting key network information data of a fixed time length from the original records of abnormal traffic data collected in the power communication network, wherein each piece of data comprises 4 attributes: the TIME attribute, i.e. the collection time period, denoted $A$; host-related attributes in the network, including the host importance level HOSTS_LEVEL, denoted $B_1$, and the security level SECURE_LEVEL of each host, denoted $B_2$; running information attributes, including the total number of services running on each host, SERVICE_NUM, denoted $C_1$, and the importance level SERVICE_LEVEL of each service, denoted $C_2$; and IP attributes: the source address SIP, denoted $D_1$, and the destination address DIP, denoted $D_2$. The data features are $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$; the descriptive text of $B_1$, $B_2$ and $C_2$ is quantized to numbers.
More specifically, in step S2, the data are cleaned to remove data records containing missing values.
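Steps S1 and S2 can be illustrated with a minimal sketch. The label-to-number mapping `LEVEL_CODES`, the function name `preprocess`, and the sample records are hypothetical; the patent does not specify how the descriptive text is quantized.

```python
# Hypothetical mapping from descriptive level text to numbers (not given in the patent).
LEVEL_CODES = {"low": 1, "medium": 2, "high": 3}

# Field names follow the attribute identifiers used in step S1.
FIELDS = ["TIME", "HOSTS_LEVEL", "SECURE_LEVEL", "SERVICE_NUM",
          "SERVICE_LEVEL", "SIP", "DIP"]
TEXT_FIELDS = ["HOSTS_LEVEL", "SECURE_LEVEL", "SERVICE_LEVEL"]  # B1, B2, C2


def preprocess(records):
    """Drop records with missing values (S2), then quantize text levels (S1)."""
    cleaned = []
    for rec in records:
        if any(rec.get(f) in (None, "") for f in FIELDS):
            continue  # discard records containing missing values
        row = dict(rec)
        for f in TEXT_FIELDS:
            row[f] = LEVEL_CODES[row[f]]  # descriptive text -> number
        cleaned.append(row)
    return cleaned


raw = [
    {"TIME": 1, "HOSTS_LEVEL": "high", "SECURE_LEVEL": "medium",
     "SERVICE_NUM": 4, "SERVICE_LEVEL": "low",
     "SIP": "10.0.0.1", "DIP": "10.0.0.2"},
    {"TIME": 2, "HOSTS_LEVEL": "low", "SECURE_LEVEL": None,
     "SERVICE_NUM": 2, "SERVICE_LEVEL": "low",
     "SIP": "10.0.0.3", "DIP": "10.0.0.4"},
]
clean = preprocess(raw)
```

Only the first record survives, because the second has no SECURE_LEVEL value.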
More specifically, the step S3 includes:
S31: Data standardization: the data are scaled so that they fall into a small specific interval. This removes the unit limitation of the data and converts them into dimensionless pure numerical values, so that indexes of different units or orders of magnitude can be compared and weighted. Extreme value normalization (0-1 normalization), a linear transformation of the original data, is adopted here; the transformation function is:
$$X^{*} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
wherein $X_{\max}$ is the maximum value of the sample data, $X_{\min}$ is the minimum value of the sample data, $X$ is the collected abnormal data, and $X^{*}$ is the converted abnormal data;
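The 0-1 normalization of step S31 can be sketched for a single data column as follows. The function name and the handling of a constant column (mapped to 0.0) are assumptions made for this illustration.

```python
def min_max_normalize(column):
    """Scale one column of values to [0, 1] by extreme value normalization."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no information and maps to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


service_counts = [2, 4, 10]          # e.g. SERVICE_NUM values
normalized = min_max_normalize(service_counts)
```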
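S32: Principal component analysis is performed on $X^{*}$; the calculation steps are as follows:
Calculating the correlation coefficient matrix $R$:
$$R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{pmatrix}$$
wherein $r_{ij}$ ($i, j = 1, 2, \ldots, p$) is the correlation coefficient of the original variables $x_i$ and $x_j$, with $r_{ij} = r_{ji}$, calculated as
$$r_{ij} = \frac{\sum_{k=1}^{m}(x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{m}(x_{ki} - \bar{x}_i)^2 \sum_{k=1}^{m}(x_{kj} - \bar{x}_j)^2}}$$
wherein $\bar{x}_i$ and $\bar{x}_j$ are the means of the variables $x_i$ and $x_j$ over the $m$ rows of the data matrix, from which the correlation coefficient matrix of $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$ is obtained;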
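As a sketch of the correlation coefficient matrix, the following pure-Python function computes the Pearson correlation between every pair of columns. It assumes that no column is constant (a constant column would cause a division by zero); the function name and sample rows are illustrative only.

```python
import math


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of `rows` (list of records)."""
    m, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(p)]

    def cross(i, j):
        # Sum of products of deviations from the column means.
        return sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)

    return [[cross(i, j) / math.sqrt(cross(i, i) * cross(j, j))
             for j in range(p)] for i in range(p)]


# Column 1 equals column 0; column 2 is 1 minus column 0.
rows = [[0.0, 0.0, 1.0],
        [0.5, 0.5, 0.5],
        [1.0, 1.0, 0.0]]
R = correlation_matrix(rows)
```

Identical columns give a coefficient of 1 and mirrored columns give -1, and the matrix is symmetric as the text requires.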
Calculating the eigenvalues and eigenvectors by solving the characteristic equation:
$$|\lambda I - R| = 0$$
The eigenvalues are obtained by the Jacobi method and arranged in descending order, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$; the eigenvector $e_i$ ($i = 1, 2, \ldots, p$) corresponding to each eigenvalue $\lambda_i$ is then calculated, requiring $\|e_i\| = 1$, i.e. $\sum_{j=1}^{p} e_{ij}^2 = 1$, wherein $e_{ij}$ is the $j$-th component of the vector $e_i$;
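The text names the Jacobi method but gives no implementation. The following is a minimal sketch of a cyclic Jacobi eigenvalue iteration for a symmetric matrix; the function name, tolerance and sweep limit are assumptions. It returns the eigenvalues in descending order together with the matching unit eigenvectors.

```python
import math


def jacobi_eigen(matrix, tol=1e-12, max_sweeps=100):
    """Eigenvalues (descending) and unit eigenvectors of a symmetric matrix."""
    n = len(matrix)
    A = [row[:] for row in matrix]                       # work on a copy
    V = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break                                        # effectively diagonal
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):                       # A <- A J (columns)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):                       # A <- J^T A (rows)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):                       # V <- V J
                    vkp, vkq = V[k][p], V[k][q]
                    V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    pairs = sorted(((A[i][i], [V[k][i] for k in range(n)]) for i in range(n)),
                   key=lambda t: t[0], reverse=True)
    return [lam for lam, _ in pairs], [vec for _, vec in pairs]


eigvals, eigvecs = jacobi_eigen([[1.0, 0.5], [0.5, 1.0]])
```

For the 2 x 2 correlation matrix above, the eigenvalues are 1.5 and 0.5, and the leading eigenvector has two components of equal magnitude.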
Calculating the principal component contribution rate and the cumulative contribution rate:
$$\frac{\lambda_i}{\sum_{k=1}^{p}\lambda_k} \quad\text{and}\quad \frac{\sum_{k=1}^{i}\lambda_k}{\sum_{k=1}^{p}\lambda_k}$$
wherein $\lambda_i$, $\lambda_k$ are the non-negative eigenvalues (characteristic roots), $i = 1, 2, \ldots, p$, and $p$ is the number of non-negative characteristic roots;
Calculating the principal component loads $l_{ij}$:
$$l_{ij} = \sqrt{\lambda_i}\, e_{ij}$$
wherein $e_{ij}$ is a unit vector component; from $l_{ij}$ the component matrix $Z$ and the principal component scores are obtained;
Determining the weights by principal component analysis: the weight of an index equals the normalized weighted average of the coefficients of that index in the linear combinations of the principal components, with the variance contribution rates of the principal components as the weights. Determining the index weights therefore requires three steps:
Calculating the coefficients in the linear combinations of the principal components: each load in the component matrix obtained from the principal component loads is divided by the square root of the corresponding eigenvalue, i.e. $l_{ij}/\sqrt{\lambda_i}$, giving the coefficients of the linear combinations. The number of principal components is determined by analyzing the principal component scores and is denoted $n$ ($n \le 7$), giving $n$ sets of linear combination coefficients of $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$:
$$F_i = \sum_{j=1}^{7} \frac{l_{ij}}{\sqrt{\lambda_i}}\, x_j, \quad i = 1, 2, \ldots, n$$
wherein $x_1, x_2, \ldots, x_7$ correspond to $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$;
Calculating the variance contribution rates of the principal components: the greater the variance contribution rate, the more important the principal component, so the variance contribution rate is taken as the weight of each principal component. The original data are replaced by the $n$ principal components, and the coefficients in the linear combinations are averaged with the principal components weighted by their share of the variance contribution rate:
$$F = c_1F_1 + c_2F_2 + \cdots + c_nF_n$$
wherein $c_1, c_2, \ldots, c_n$ are the shares of $F_1, F_2, \ldots, F_n$ in the variance contribution rate;
Combining the coefficients in the linear combinations of the principal components gives:
$$F = w_1x_1 + w_2x_2 + \cdots + w_7x_7$$
wherein $w_1, w_2, \ldots, w_7$ are the weights, which are then normalized;
When the weight of a data variable is lower than the weight threshold, the variable is considered to have a low degree of association with the abnormal traffic data analysis and is deleted.
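The three-step weight determination can be sketched as follows, taking precomputed eigenvalues and unit eigenvectors as input. Several choices here are assumptions not stated in the text: the number of principal components n is chosen by a cumulative contribution threshold of 0.85, absolute values of the coefficients are used so that the weights are non-negative, and the weight threshold for deleting a variable is 0.05. The eigen-decomposition values in the example are made up for illustration.

```python
import math


def pca_weights(eigvals, eigvecs, cum_threshold=0.85):
    """Normalized variable weights from eigenvalues (descending) and unit eigenvectors."""
    total = sum(eigvals)
    # Assumption: keep the smallest n whose cumulative contribution reaches the threshold.
    n, cumulative = 0, 0.0
    for lam in eigvals:
        n += 1
        cumulative += lam / total
        if cumulative >= cum_threshold:
            break
    kept = sum(eigvals[:n])
    p = len(eigvecs[0])
    raw = [0.0] * p
    for i in range(n):
        c_i = eigvals[i] / kept                          # share of variance contribution
        for j in range(p):
            load = math.sqrt(eigvals[i]) * eigvecs[i][j]  # load l_ij
            coeff = load / math.sqrt(eigvals[i])          # coefficient in F_i
            raw[j] += c_i * abs(coeff)                    # assumption: absolute value
    s = sum(raw)
    return [w / s for w in raw]                          # normalization


# Made-up decomposition for three variables.
names = ["A", "B1", "B2"]
eigvals = [2.0, 0.9, 0.1]
eigvecs = [[0.6, 0.8, 0.0], [0.8, -0.6, 0.0], [0.0, 0.0, 1.0]]
weights = pca_weights(eigvals, eigvecs)

WEIGHT_THRESHOLD = 0.05                                  # hypothetical threshold
kept_vars = [v for v, w in zip(names, weights) if w >= WEIGHT_THRESHOLD]
```

In this example two principal components are retained, and the third variable receives zero weight and is deleted.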
More specifically, the step S4 includes the following steps:
Calculating the similarity between abnormal traffic data:
Similarity of time information $\delta_1$, wherein $t_1$, $t_2$ are the detection times of abnormal flows A and B, and $T_{win}$ is the reference design time;
Similarity of host related information $\delta_2$, wherein $S_1$, $S_2$ are the host importance levels of abnormal flows A and B, and $N_S$ is the number of importance levels;
Similarity of host security protection level $\delta_3$, wherein $C_1$, $C_2$ are the host protection levels of abnormal flows A and B, and $N_C$ is the total number of protection levels;
Similarity of the total number of running services $\delta_4$, wherein $I_1$, $I_2$ are the total numbers of services running on the hosts of abnormal flows A and B, and $N_I$ is the total number of grades for services running on the host;
Similarity of running service importance levels $\delta_5$, wherein $l_1$, $l_2$ are the importance levels of the services running on the hosts of abnormal flows A and B, and $N_l$ is the total number of importance grades for services running on the host;
Similarity of IP related information $\delta_6$: let the binary IP addresses of the two abnormal flows A and B be IP1 and IP2; XOR them to obtain diff = IP1 XOR IP2; scan diff from the left and stop at the first 1; define the variable $p$ as the number of 0s encountered during the scan; the IP similarity function is then expressed in terms of $p$;
The similarity $\delta_6$ of the source IP addresses is calculated according to the obtained similarity function;
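The XOR scan for the IP similarity can be sketched as below for IPv4 addresses. The count p of leading zeros follows the description; mapping p to a similarity value as p / 32 is an assumption made for this illustration, since the similarity function itself is not reproduced in the text.

```python
import ipaddress


def ip_similarity(ip1, ip2):
    """Similarity of two IPv4 addresses from the length of their common prefix."""
    a = int(ipaddress.IPv4Address(ip1))
    b = int(ipaddress.IPv4Address(ip2))
    diff = a ^ b                        # diff = IP1 XOR IP2
    p = 32 - diff.bit_length()          # zeros met when scanning from the left
    return p / 32                       # assumption: similarity is p / 32


same = ip_similarity("10.0.0.1", "10.0.0.1")
near = ip_similarity("192.168.1.10", "192.168.1.20")
far = ip_similarity("10.0.0.1", "138.0.0.1")
```

Addresses in the same subnet share a long prefix and score close to 1, while addresses that differ in the first bit score 0.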
Calculating the similarity $\eta$ of the abnormal flows, which gives the similarity between every pair of abnormal flows, so that operations on the variables in the abnormal traffic data are converted into operations on the abnormal flows themselves;
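One plausible way to combine the per-attribute similarities is sketched below. Both forms are assumptions: a level similarity of 1 - |a - b| / N, and an overall similarity η taken as the weighted sum of the attribute similarities using the weights from principal component analysis. The text does not reproduce the actual formulas, so this only illustrates the idea.

```python
def level_similarity(a, b, n_levels):
    """Assumed form: 1 - |a - b| / N, clipped at 0."""
    return max(0.0, 1.0 - abs(a - b) / n_levels)


def overall_similarity(deltas, weights):
    """Assumed form: weighted sum of attribute similarities with PCA weights."""
    return sum(w * d for w, d in zip(weights, deltas))


delta_host = level_similarity(1, 3, 4)                   # two of four levels apart
eta = overall_similarity([1.0, delta_host], [0.25, 0.75])
```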
Generating a transaction database according to the similarity:
Setting similarity thresholds: thresholds are set according to the calculated similarity between abnormal flows. Based on analysis of experimental results, the maximum threshold is set to 0.5, the minimum threshold to 0.1, and the discard threshold to 0.05;
Generating the transaction database D according to the similarity thresholds: when the similarity is lower than 0.05, the similarity between the abnormal flows is considered too low for any correlation to exist; when the similarity between two abnormal flows is higher than 0.5, their degree of association is considered high, and the two form a transaction data item containing 2 abnormal flows; given such a transaction data item, if the similarity between its two flows and another abnormal flow is higher than 0.48, a transaction data item containing 3 abnormal flows is generated; and so on, with the required similarity reduced by 0.02 each time an abnormal flow is added to the transaction data item, except that no abnormal flow can be added when its similarity is lower than 0.1.
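The threshold rules can be sketched as a greedy construction over a pairwise similarity matrix. The interpretation that a new flow must exceed the current threshold against every flow already in the transaction, and the order in which candidates are tried, are assumptions; the function name and the sample matrix are illustrative.

```python
from itertools import combinations

MAX_T, MIN_T, STEP = 0.5, 0.1, 0.02     # thresholds stated in the text


def build_transactions(sim):
    """Build transactions (sets of flow indices) from a symmetric similarity matrix."""
    n = len(sim)
    transactions = set()
    for i, j in combinations(range(n), 2):
        if sim[i][j] <= MAX_T:
            continue                     # a seed pair needs similarity above 0.5
        items = [i, j]
        for k in range(n):
            if k in items:
                continue
            # Required similarity drops by 0.02 per added flow, never below 0.1.
            required = max(MIN_T, MAX_T - STEP * (len(items) - 1))
            # Assumption: the new flow must be similar to every flow in the item.
            if all(sim[k][m] > required for m in items):
                items.append(k)
        transactions.add(frozenset(items))
    return transactions


sim = [
    [1.00, 0.60, 0.49, 0.03],
    [0.60, 1.00, 0.55, 0.20],
    [0.49, 0.55, 1.00, 0.04],
    [0.03, 0.20, 0.04, 1.00],
]
D = build_transactions(sim)
```

Flows 0 and 1 seed a transaction, flow 2 joins at the 0.48 threshold, and flow 3 is never added because its similarities stay far below the required level.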
More specifically, the step S5 includes the following steps:
S51: Setting a minimum support threshold min_sup and a minimum confidence threshold min_conf, wherein the minimum support threshold is 20% and the minimum confidence threshold is 80%;
S52: Generating frequent item sets by iterating over the transaction database D: in the first iteration of the algorithm, the transaction database D is scanned once, the number of occurrences of each item contained in D is counted, and the candidate 1-item set $C_1$ is generated;
S53: According to the set minimum support, the frequent 1-item set $L_1$ is determined from $C_1$; by analogy, the frequent item set $L_k$ is obtained, where k is 7;
S54: Rules satisfying the minimum confidence are generated from the frequent item sets; such rules are called strong association rules. They give the correlation of the abnormal data of the power communication network and thereby support situation awareness and prediction for the power communication data network.
In the specific implementation process, the algorithm strategy of the Apriori-based relevance analysis is as follows:
Join step: the candidate k-item set $C_k$ is generated by joining the frequent (k-1)-item set $L_{k-1}$ with itself. Apriori assumes that the items in an item set are sorted in lexicographic order. If the first (k-2) items of two elements (item sets) itemset1 and itemset2 of $L_{k-1}$ are identical, itemset1 and itemset2 are said to be joinable. The item set produced by joining itemset1 with itemset2 is {itemset1[1], itemset1[2], …, itemset1[k-1], itemset2[k-1]};
Pruning strategy: by the Apriori property, any infrequent (k-1)-item set cannot be a subset of a frequent k-item set. Thus, if any (k-1)-item subset of a candidate k-item set in $C_k$ is not in $L_{k-1}$, the candidate cannot be frequent and can be deleted from $C_k$, giving a compressed $C_k$;
Deletion strategy: based on the compressed $C_k$, all transactions are scanned, each item set in $C_k$ is counted, and the item sets that do not satisfy the minimum support are deleted, giving the frequent k-item set.
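The join, pruning and deletion strategies, followed by strong rule generation, can be sketched in a compact form. The function names, the flow labels `f1` to `f4` and the sample transactions are illustrative only; the thresholds of 20% support and 80% confidence are those stated in the text.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset (sorted tuple): support} for all frequent item sets."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if set(itemset) <= t) / n

    items = sorted({i for t in transactions for i in t})
    level = [(i,) for i in items if support((i,)) >= min_sup]   # L1 from C1
    frequent, k = {}, 1
    while level:
        for s in level:
            frequent[s] = support(s)
        prev = set(level)
        candidates = []
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1]:                  # join: first k-1 items identical
                cand = a + (b[-1],)
                # Prune: every k-item subset must itself be frequent.
                if all(sub in prev for sub in combinations(cand, k)):
                    candidates.append(cand)
        # Deletion: scan the transactions and drop candidates below min support.
        level = [c for c in candidates if support(c) >= min_sup]
        k += 1
    return frequent


def strong_rules(frequent, min_conf):
    """Rules (lhs, rhs, confidence) whose confidence reaches min_conf."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                conf = sup / frequent[lhs]
                if conf >= min_conf:
                    rhs = tuple(i for i in itemset if i not in lhs)
                    rules.append((lhs, rhs, conf))
    return rules


D = [{"f1", "f2", "f3"}, {"f1", "f2"}, {"f1", "f2", "f3"},
     {"f2", "f4"}, {"f1", "f2", "f3"}]
frequent = apriori(D, min_sup=0.2)
rules = strong_rules(frequent, min_conf=0.8)
```

In this example the candidate (f2, f3, f4) is pruned without scanning the database, because its subset (f3, f4) is not frequent.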
In the specific implementation process, data processing is performed on abnormal flow data acquired from a power communication network, the weight of abnormal flow data variables is calculated based on a principal component analysis method, dimension reduction is performed, the similarity between abnormal flows is calculated by using the weight to generate a transaction-type database, and then the transaction-type database is associated based on an Apriori association rule algorithm to generate a strong association rule.
In the specific implementation process, as shown in fig. 3, data preprocessing is performed first to remove data with missing values. And (3) processing the abnormal flow data variable by adopting a principal component analysis method, and reducing the dimension of the data variable while obtaining the weight of the data variable. On the basis of simplifying abnormal flow data, the similarity between abnormal flows is obtained, and accordingly the abnormal flows of the power communication network are generated into a transaction type database. The association rule of the abnormal traffic is completed by step S5.
In a specific implementation process, due to inherent defects of Apriori association rules, spatial complexity and temporal complexity for implementing Apriori increase with the increase of data, and the reason for the increase of the spatial complexity and the temporal complexity is that Apriori needs to perform multiple accesses and iterations on a database. With the development of the data processing platform at present, the time complexity of an iterative algorithm can be effectively solved by adopting spark parallel computing framework programming; meanwhile, the spark supports caching of data used for multiple times in a cache mode, and pressure of multiple access to the database is relieved to a certain extent.
In a specific implementation, principal component analysis serves as the data processing method and Apriori derives the association rules. Principal component analysis reduces the data processing workload, and the weights of the abnormal traffic data variables are determined from it, which makes the weighting objective and reasonable. Apriori association rules are widely applied in fields such as business and network security, where associations in the data are analyzed and mined to extract useful information.
It should be understood that the above embodiments of the present invention are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.
Claims (4)
1. A method for analyzing the relevance of abnormal network data, characterized in that the method comprises the following steps:
S1: collecting abnormal data of the power communication network;
S2: preprocessing the collected abnormal data to obtain preprocessed abnormal data;
S3: calculating weights by principal component analysis from the preprocessed abnormal data;
S4: calculating the similarity of the abnormal data to generate a transaction database;
S5: completing the relevance analysis based on the Apriori algorithm according to the generated transaction database;
wherein step S3 includes:
S31: data standardization: standardization scales the data so that it falls into a small, specific interval; it is mainly used to remove the units of the data and convert the data into dimensionless pure numbers, so that indexes with different units or orders of magnitude can be compared and weighted; extreme-value normalization, i.e. 0-1 normalization, is adopted here, which is a linear transformation of the original data with the conversion function:
$$X^{*} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
where $X_{\max}$ is the maximum value of the sample data, $X_{\min}$ is the minimum value of the sample data, $X$ is the collected abnormal data, and $X^{*}$ is the abnormal data after conversion;
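The 0-1 normalization of step S31 can be sketched in Python as follows. The function name, the handling of a constant column, and the sample values are illustrative assumptions and are not taken from the patent.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly into [0, 1] (0-1 normalization)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumption: a constant column carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]


# Hypothetical SERVICE_NUM values for four abnormal-traffic records.
service_counts = [2, 4, 10, 6]
print(min_max_normalize(service_counts))  # [0.0, 0.25, 1.0, 0.5]
```

Each variable (column) would be normalized separately, since the maximum and minimum are taken per variable.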
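S32: performing principal component analysis on $X^{*}$, with the following calculation steps:
calculating the correlation coefficient matrix $R$:
$$R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{pmatrix}$$
where $r_{ij}$ ($i, j = 1, 2, \ldots, p$) is the correlation coefficient of the original variables $x_i$ and $x_j$, with $r_{ij} = r_{ji}$, calculated as
$$r_{ij} = \frac{\sum_{k}(x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k}(x_{ki} - \bar{x}_i)^{2}\,\sum_{k}(x_{kj} - \bar{x}_j)^{2}}}$$
where $\bar{x}_i$ and $\bar{x}_j$ are the averages of the corresponding variables in the $X$ matrix, from which the correlation coefficient matrix of $A, B_1, B_2, C_1, C_2, D_1, D_2$ can be derived;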
calculating the eigenvalues and eigenvectors by solving the characteristic equation:
$$|\lambda I - R| = 0$$
the eigenvalues are obtained by the Jacobi method and arranged in descending order, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$; the eigenvector $e_i$ ($i = 1, 2, \ldots, p$) corresponding to each eigenvalue $\lambda_i$ is then calculated, requiring $\|e_i\| = 1$, i.e. $\sum_{j=1}^{p} e_{ij}^{2} = 1$, where $e_{ij}$ is the $j$-th component of the vector $e_i$;
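calculating the principal component contribution rate and the cumulative contribution rate:
$$\text{contribution rate} = \frac{\lambda_i}{\sum_{k=1}^{p}\lambda_k}, \qquad \text{cumulative contribution rate} = \frac{\sum_{k=1}^{i}\lambda_k}{\sum_{k=1}^{p}\lambda_k}$$
where $\lambda_i$, $\lambda_k$ are non-negative eigenvalues, $i = 1, 2, \ldots, p$, and $p$ is the number of non-negative characteristic roots;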
calculating the principal component loadings $l_{ij}$:
$$l_{ij} = \sqrt{\lambda_i}\, e_{ij}$$
where $e_{ij}$ is a unit vector component; from $l_{ij}$ the component matrix $Z$ can be obtained, with the principal component scores as follows:
determining the weights by principal component analysis: the variance contribution rates of the principal components are taken as weights, and the weight of an index equals the normalized weighted average of that index's coefficients in the linear combinations of the principal components; determining the index weights therefore takes three steps:
calculating the coefficients in the principal component linear combinations: each loading in the component matrix obtained from the principal component loadings is divided by the square root of the corresponding eigenvalue, i.e. $l_{ij}/\sqrt{\lambda_i}$, which gives the coefficients of the linear combinations; the number of principal components, obtained by analyzing the principal component scores, is denoted $n$ ($n \le 7$), which yields $n$ sets of linear combination coefficients $F$ for the data $A, B_1, B_2, C_1, C_2, D_1, D_2$
where $x_1, x_2, \ldots, x_7$ correspond to $A, B_1, B_2, C_1, C_2, D_1, D_2$;
calculating the variance contribution rates of the principal components: the larger the variance contribution rate, the more important the principal component, so the variance contribution rates are taken as the weights of the different principal components; the original data are replaced by the $n$ principal components, and the coefficients in the linear combinations are averaged, weighted by each principal component's share of the variance contribution rate,
$$F = c_1 F_1 + c_2 F_2 + \cdots + c_n F_n$$
where $c_1, c_2, \ldots, c_n$ are the shares of $F_1, F_2, \ldots, F_n$ in the variance contribution rate,
combining the coefficients in the principal component linear combinations gives:
$$F = w_1 x_1 + w_2 x_2 + \cdots + w_7 x_7$$
where $w_1, w_2, \ldots, w_7$ are the weights, which are then normalized;
while the weights are determined by principal component analysis, a weight threshold of 0.05 is set for the data variables; a data variable whose weight is below the threshold is considered to have little relevance to the abnormal traffic data analysis and is deleted;
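The weight calculation of step S32 can be sketched in pure Python as below. This is a minimal illustration under several assumptions that the patent does not state: the number of principal components n is chosen by a cumulative contribution rate of 0.85 (`cum_threshold`), each eigenvector is sign-fixed so that its components sum to a non-negative value, the combined coefficients are made non-negative with an absolute value before normalization, and no input column is constant. The function names and the sample rows are hypothetical.

```python
import math


def correlation_matrix(rows):
    """Pearson correlation matrix R of the columns of `rows` (no constant columns)."""
    m, p = len(rows), len(rows[0])
    cols = [[row[j] for row in rows] for j in range(p)]
    dev = [[x - sum(c) / m for x in c] for c in cols]
    norm = [math.sqrt(sum(d * d for d in dv)) for dv in dev]
    return [[sum(a * b for a, b in zip(dev[i], dev[j])) / (norm[i] * norm[j])
             for j in range(p)] for i in range(p)]


def jacobi_eigen(matrix, tol=1e-12, max_sweeps=100):
    """Eigen-decomposition of a symmetric matrix by cyclic Jacobi rotations.

    Returns (eigenvalues, eigenvectors) sorted by descending eigenvalue;
    eigenvectors[i] is the unit eigenvector belonging to eigenvalues[i].
    """
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for mat in (a, v):      # right-multiply A and V by the rotation
                    for k in range(n):
                        mkp, mkq = mat[k][p], mat[k][q]
                        mat[k][p], mat[k][q] = c * mkp - s * mkq, s * mkp + c * mkq
                for k in range(n):      # left-multiply A by the transposed rotation
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    order = sorted(range(n), key=lambda i: a[i][i], reverse=True)
    return [a[i][i] for i in order], [[v[k][i] for k in range(n)] for i in order]


def pca_weights(rows, cum_threshold=0.85, weight_threshold=0.05):
    """Variable weights from PCA; returns (weights, indices of kept variables)."""
    eigvals, eigvecs = jacobi_eigen(correlation_matrix(rows))
    p, total = len(eigvals), sum(eigvals)
    n, cum = 0, 0.0
    while n < p and cum / total < cum_threshold:  # assumption: pick n by cumulative rate
        cum += eigvals[n]
        n += 1
    combined = []
    for j in range(p):
        w = 0.0
        for lam, vec in zip(eigvals[:n], eigvecs[:n]):
            sign = 1.0 if sum(vec) >= 0 else -1.0     # assumption: fix eigenvector sign
            loading = math.sqrt(lam) * sign * vec[j]  # l_ij = sqrt(lambda_i) * e_ij
            w += (lam / cum) * loading / math.sqrt(lam)  # c_i times the coefficient
        combined.append(abs(w))                       # assumption: non-negative weights
    weights = [w / sum(combined) for w in combined]
    kept = [j for j, w in enumerate(weights) if w >= weight_threshold]
    return weights, kept


# Hypothetical normalized data: 5 records, 3 variables.
rows = [[0.0, 0.1, 0.9], [0.2, 0.3, 0.4], [0.5, 0.4, 0.6],
        [0.7, 0.8, 0.1], [1.0, 0.9, 0.5]]
weights, kept = pca_weights(rows)
print([round(w, 3) for w in weights], kept)
```

Because the loading is the eigenvector component multiplied by the square root of the eigenvalue, dividing it by that square root returns the eigenvector component itself; the sketch keeps both steps only to mirror the wording of the claim.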
step S4 includes the following steps:
calculating the similarity between abnormal traffic data:
similarity of time information $\delta_1$:
where $t_1$, $t_2$ are the detection times of abnormal flows A and B, and $T_{win}$ is the reference design time;
similarity of host-related information $\delta_2$:
where $S_1$, $S_2$ are the host importance levels of abnormal flows A and B, and $N_S$ is the number of importance levels;
similarity of host security protection level $\delta_3$:
where $C_1$, $C_2$ are the host protection levels of abnormal flows A and B, and $N_C$ is the total number of protection levels;
similarity of the total number of running services $\delta_4$:
where $I_1$, $I_2$ are the total numbers of services running on the hosts of abnormal flows A and B, and $N_I$ is the number of levels for the total number of services running on a host;
similarity of the importance level of running services $\delta_5$:
where $l_1$, $l_2$ are the importance levels of the services running on the hosts of abnormal flows A and B, and $N_l$ is the number of importance levels for services running on a host;
similarity of IP-related information $\delta_6$: let the binary representations of the IP addresses of the two abnormal flows A and B be IP1 and IP2; XOR them to obtain diff = IP1 XOR IP2; scan diff from the left and stop at the first 1; define the variable p as the number of 0s encountered during the scan; the IP similarity function is then:
calculating $\delta_6$ for the source IP addresses with the similarity function obtained;
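The scan that defines the variable p for the IP similarity can be sketched in Python as follows. The sketch assumes IPv4 addresses and only computes p; the function that maps p to $\delta_6$ is not reproduced here. The function name and the sample addresses are illustrative.

```python
import ipaddress


def common_prefix_bits(ip_a, ip_b):
    """Number p of leading zero bits in (ip_a XOR ip_b) for two IPv4 addresses."""
    diff = int(ipaddress.IPv4Address(ip_a)) ^ int(ipaddress.IPv4Address(ip_b))
    # bit_length() is the position of the first 1 counted from the right,
    # so the count of zeros seen from the left is 32 minus that position.
    return 32 - diff.bit_length()


print(common_prefix_bits("192.168.1.10", "192.168.1.200"))  # 24
```

Two identical addresses give diff = 0, so the scan never meets a 1 and p is 32.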
calculating the similarity $\eta$ of the abnormal flows:
the similarity between every pair of abnormal flows is obtained, so that operations on the variables of the abnormal traffic data are converted into operations on the abnormal flows themselves;
generating a transaction database according to the similarity:
setting the similarity thresholds: the similarity thresholds are set according to the calculated similarity between abnormal flows; based on an analysis of the experimental results, the maximum threshold is set to 0.5, the minimum threshold to 0.1, and the discard threshold to 0.05;
generating the transaction database D according to the similarity thresholds: when the similarity is lower than 0.05, the similarity between the abnormal flows is considered too low for any correlation to exist; when the similarity of two abnormal flows is higher than 0.5, their degree of association is considered high, and the two form a transaction data item containing 2 abnormal flows; starting from a transaction data item with similarity higher than 0.5, if the similarity between its two abnormal flows and another abnormal flow is higher than 0.48, a transaction data item containing 3 abnormal flows is generated, and so on: the required similarity is reduced by 0.02 for each abnormal flow added to the transaction data item, but an abnormal flow cannot be added to a transaction data item when the similarity is lower than 0.1.
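The threshold schedule for building transaction data items can be sketched as below. This is one possible reading rather than the claimed procedure: it assumes that a candidate flow must exceed the required similarity with every flow already in the item, that items are grown greedily in sorted order, and that comparisons are strict. The function names, the flow labels A to D, and the similarity values are hypothetical.

```python
def required_similarity(size, start=0.5, step=0.02, floor=0.1):
    """Similarity needed to add one more flow to a transaction item of `size` flows."""
    return max(floor, start - step * (size - 1))


def build_transactions(sim, start=0.5, step=0.02, floor=0.1):
    """Grow transaction items from pairwise similarities.

    sim maps frozenset({flow_a, flow_b}) to the similarity of the two flows.
    Returns a sorted list of transactions, each a sorted tuple of flows.
    """
    flows = sorted({f for pair in sim for f in pair})
    transactions = set()
    for pair, value in sim.items():
        if value <= start:          # only pairs above the maximum threshold seed an item
            continue
        members = set(pair)
        grown = True
        while grown:
            grown = False
            need = required_similarity(len(members), start, step, floor)
            for cand in flows:
                if cand in members:
                    continue
                # Assumption: the candidate must be similar enough to every member.
                if all(sim.get(frozenset((cand, m)), 0.0) > need for m in members):
                    members.add(cand)
                    grown = True
                    break
        transactions.add(tuple(sorted(members)))
    return sorted(transactions)


def pair(a, b):
    return frozenset((a, b))


sim = {pair("A", "B"): 0.62, pair("A", "C"): 0.49, pair("B", "C"): 0.55,
       pair("A", "D"): 0.30, pair("B", "D"): 0.20, pair("C", "D"): 0.51}
print(build_transactions(sim))  # [('A', 'B', 'C'), ('C', 'D')]
```

Pairs below the discard threshold of 0.05 never reach any of the comparisons above, so they drop out without a separate check.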
2. The method according to claim 1, characterized in that step S1 specifically comprises: selecting network key information data of a fixed time length from the original records of abnormal traffic data collected in the power communication network, wherein each piece of data comprises 4 kinds of attributes: the TIME attribute, i.e. the collection time period, denoted $A$; host-related attributes in the network, including the host importance level HOSTS_LEVEL, denoted $B_1$, and the security level of each host SECURE_LEVEL, denoted $B_2$; running information attributes, including the total number of services running on each host SERVICE_NUM, denoted $C_1$, and the importance level of each service SERVICE_LEVEL, denoted $C_2$; and IP attributes: the source address SIP, denoted $D_1$, and the destination address DIP, denoted $D_2$; the data features are $A, B_1, B_2, C_1, C_2, D_1, D_2$; the descriptive text of $B_1$, $B_2$, $C_2$ is quantized into numbers.
3. The method according to claim 2, characterized in that step S2 specifically comprises cleaning the data and removing the data records that contain missing values.
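The cleaning step of S2 can be sketched in Python as follows. The field names follow the attributes listed in claim 2; the dictionary representation of a record, the treatment of empty strings as missing, and the sample records are assumptions for illustration.

```python
FIELDS = ["TIME", "HOSTS_LEVEL", "SECURE_LEVEL",
          "SERVICE_NUM", "SERVICE_LEVEL", "SIP", "DIP"]


def drop_incomplete(records, fields=FIELDS):
    """Keep only records in which every listed field is present and non-empty."""
    return [r for r in records
            if all(r.get(f) not in (None, "") for f in fields)]


# Hypothetical records; the second one lacks a destination address.
records = [
    {"TIME": 1, "HOSTS_LEVEL": 3, "SECURE_LEVEL": 2, "SERVICE_NUM": 5,
     "SERVICE_LEVEL": 1, "SIP": "10.0.0.1", "DIP": "10.0.0.9"},
    {"TIME": 2, "HOSTS_LEVEL": 1, "SECURE_LEVEL": 2, "SERVICE_NUM": 0,
     "SERVICE_LEVEL": 2, "SIP": "10.0.0.2", "DIP": None},
]
print(len(drop_incomplete(records)))  # 1
```

A value of 0, as in SERVICE_NUM above, is a legitimate value and is not treated as missing.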
4. The method according to claim 3, characterized in that step S5 includes the following steps:
S51: setting a minimum support threshold min_sup and a minimum confidence threshold min_conf, wherein the minimum support threshold is 20% and the minimum confidence threshold is 80%;
S52: generating the frequent item sets by iterating over the transaction database D: in the first iteration of the algorithm, the transaction database D is scanned once, the number of occurrences of each item contained in D is counted, and the candidate 1-item set $C_1$ is generated;
S53: according to the set minimum support, determining the frequent 1-item set $L_1$ from $C_1$, and so on, until the frequent item set $L_k$ is obtained, where k = 7;
S54: generating the rules that meet the minimum confidence on the basis of the frequent item sets, such rules being called strong association rules, so that the correlation among the abnormal data of the power communication network is obtained and situational awareness and prediction for the power communication data network are supported.
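Steps S51 to S54 can be sketched with a small pure-Python Apriori, shown below with the thresholds named in the claim (20% support, 80% confidence). The function name, the representation of transactions as tuples of flow labels, and the sample transactions are illustrative assumptions; a production system would more likely use a library or the Spark implementation mentioned in the description.

```python
from itertools import combinations


def apriori(transactions, min_sup=0.2, min_conf=0.8):
    """Return (frequent itemsets with support, strong rules) for a list of transactions."""
    n = len(transactions)
    tsets = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(itemset <= t for t in tsets) / n

    items = sorted({i for t in tsets for i in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must be frequent, then check the support.
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
                 and support(c) >= min_sup]
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = sup / frequent[lhs]   # confidence of lhs -> itemset - lhs
                if conf >= min_conf:
                    rules.append((tuple(sorted(lhs)),
                                  tuple(sorted(itemset - lhs)), sup, conf))
    return frequent, sorted(rules)


# Hypothetical transaction database D; each item is an abnormal flow label.
D = [("F1", "F2", "F3"), ("F1", "F2"), ("F1", "F2", "F4"),
     ("F3", "F4"), ("F1", "F2", "F3")]
frequent, rules = apriori(D)
for lhs, rhs, sup, conf in rules:
    print(lhs, "->", rhs, "support", sup, "confidence", conf)
```

Every rule printed satisfies both thresholds, which is what the claim calls a strong association rule.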
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810402502.5A CN108595667B (en) | 2018-04-28 | 2018-04-28 | Method for analyzing relevance of network abnormal data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108595667A (en) | 2018-09-28 |
CN108595667B (en) | 2020-06-09 |
Family
ID=63619304
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810402502.5A Active CN108595667B (en) | 2018-04-28 | 2018-04-28 | Method for analyzing relevance of network abnormal data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108595667B (en) |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109767618B (en) * | 2018-12-20 | 2020-10-09 | 北京航空航天大学 | Comprehensive study and judgment method and system for abnormal data of public security traffic management service |
CN109450955B (en) * | 2018-12-30 | 2022-04-05 | 北京世纪互联宽带数据中心有限公司 | Traffic processing method and device based on network attack |
CN109828549A (en) * | 2019-01-28 | 2019-05-31 | 中国石油大学(华东) | A kind of industry internet equipment fault prediction technique based on deep learning |
CN110322357A (en) * | 2019-05-30 | 2019-10-11 | 深圳壹账通智能科技有限公司 | Anomaly assessment method, apparatus, computer equipment and the medium of data |
CN110392046B (en) * | 2019-06-28 | 2021-12-24 | 平安科技(深圳)有限公司 | Method and device for detecting abnormity of network access |
CN111626461A (en) * | 2019-09-18 | 2020-09-04 | 东莞灵虎智能科技有限公司 | Safety risk prediction method |
CN110928718B (en) * | 2019-11-18 | 2024-01-30 | 上海维谛信息科技有限公司 | Abnormality processing method, system, terminal and medium based on association analysis |
CN111025144A (en) * | 2020-03-06 | 2020-04-17 | 广东电网有限责任公司佛山供电局 | High-voltage circuit breaker health level early warning method |
CN111650898B (en) * | 2020-05-13 | 2023-10-20 | 大唐七台河发电有限责任公司 | Distributed control system and method with high fault tolerance performance |
CN111698302A (en) * | 2020-05-29 | 2020-09-22 | 深圳壹账通智能科技有限公司 | Data early warning method and device, electronic equipment and medium |
CN111858662A (en) * | 2020-06-01 | 2020-10-30 | 广东恒睿科技有限公司 | Method, system and storage medium for identifying underlying network potential danger data |
CN111983469B (en) * | 2020-08-24 | 2023-08-22 | 哈尔滨理工大学 | Lithium battery safety degree estimation method and device based on voltage safety boundary and temperature safety boundary |
CN112087350B (en) * | 2020-09-17 | 2022-03-18 | 中国工商银行股份有限公司 | Method, device, system and medium for monitoring network access line flow |
CN112131284A (en) * | 2020-09-30 | 2020-12-25 | 国网智能科技股份有限公司 | Transformer substation holographic data slicing method and system |
CN112231392A (en) * | 2020-10-29 | 2021-01-15 | 广东机场白云信息科技有限公司 | Civil aviation customer source data analysis method, electronic equipment and computer readable storage medium |
CN112487053B (en) * | 2020-11-27 | 2022-07-08 | 重庆医药高等专科学校 | Abnormal control extraction working method for mass financial data |
CN112583825B (en) * | 2020-12-07 | 2022-09-27 | 四川虹微技术有限公司 | Method and device for detecting abnormality of industrial system |
CN112714462A (en) * | 2020-12-25 | 2021-04-27 | 南京邮电大学 | Electric wireless private network specific network attack monitoring method based on improved Apriori algorithm |
CN113537590A (en) * | 2021-07-14 | 2021-10-22 | 深圳供电局有限公司 | Data anomaly prediction method and system |
CN113469567A (en) * | 2021-07-21 | 2021-10-01 | 东营市城市管理服务中心 | Digital urban management system operation comprehensive evaluation method based on principal component analysis |
CN114090413B (en) * | 2022-01-21 | 2022-04-19 | 成都市以太节点科技有限公司 | System data anomaly detection method and system, electronic equipment and storage medium |
CN114598527A (en) * | 2022-03-08 | 2022-06-07 | 江苏大学 | Abnormal network flow detection method based on maximum frequent pattern non-similarity |
CN115953073A (en) * | 2023-01-06 | 2023-04-11 | 国能信控互联技术有限公司 | Data correlation analysis method and system based on thermal power production index management |
CN117097578B (en) * | 2023-10-20 | 2024-01-05 | 杭州烛微智能科技有限责任公司 | Network traffic safety monitoring method, system, medium and electronic equipment |
CN117454120B (en) * | 2023-12-20 | 2024-03-15 | 山西思极科技有限公司 | Method for collecting and analyzing data of power communication system |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101694744A (en) * | 2009-10-28 | 2010-04-14 | 北京交通大学 | Method and system for evaluating road emergency evacuation capacity and method and system for grading road emergency evacuation capacity |
CN105046376A (en) * | 2015-09-06 | 2015-11-11 | 河海大学 | Reservoir group flood control scheduling scheme optimization method taking index correlation into consideration |
CN105303468A (en) * | 2015-11-20 | 2016-02-03 | 国网天津市电力公司 | Comprehensive evaluation method of smart power grid construction based on principal component cluster analysis |
CN105303302A (en) * | 2015-10-12 | 2016-02-03 | 国家电网公司 | Power grid evaluating indicator correlation analysis method, apparatus and computing apparatus |
CN105427053A (en) * | 2015-12-07 | 2016-03-23 | 广东电网有限责任公司江门供电局 | Relative influence analysis model applied to evaluation of distribution network construction and renovation schemes and power supply quality indexes |
CN105677759A (en) * | 2015-12-30 | 2016-06-15 | 国家电网公司 | Alarm correlation analysis method in communication network |
CN105868928A (en) * | 2016-04-29 | 2016-08-17 | 西南石油大学 | High-dimensional evaluating method for oil field operational risk |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6654763B2 (en) * | 2001-06-14 | 2003-11-25 | International Business Machines Corporation | Selecting a function for use in detecting an exception in multidimensional data |
Non-Patent Citations (5)
Title |
---|
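Nonlinear process fault pattern recognition using statistics kernel PCA similarity factor; Xiaogang Deng et al.; Neurocomputing; 20131209; pp. 298-308 * |
An implementation of network anomaly detection based on the principal component analysis algorithm; Fu Qiang et al.; Journal of Nanjing Normal University (Engineering and Technology Edition); 20081220; pp. 13-16 * |
A method for determining index weights based on principal component analysis; Han Xiaohai et al.; Journal of Sichuan Ordnance; 20121025; pp. 124-126 * |
Comprehensive evaluation of smart grid construction based on principal component cluster analysis; Gao Xinhua et al.; Power System Technology; 20130805; pp. 2238-2243 * |
Application of the fuzzy-principal component analysis comprehensive evaluation method in groundwater quality assessment; Du Junkai et al.; Environmental Monitoring in China; 20150815; pp. 75-81 * |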
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||