CN108595667B - Method for analyzing relevance of network abnormal data - Google Patents

Publication number
CN108595667B
Authority
CN
China
Prior art keywords
data
abnormal
similarity
principal component
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810402502.5A
Other languages
Chinese (zh)
Other versions
CN108595667A (en)
Inventor
姜文婷
亢中苗
陈燕
施展
赵瑞峰
陈飞鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Electric Power Dispatch Control Center of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities

Abstract

The invention relates to a method for analyzing the relevance of network abnormal data, which comprises the following steps: collecting abnormal data of the power communication network; preprocessing the collected abnormal data to obtain preprocessed abnormal data; calculating weights from the preprocessed abnormal data by principal component analysis; calculating the similarity of the abnormal data to generate a transaction database; and completing the relevance analysis with the Apriori algorithm on the generated transaction database. In this method, the weights of the abnormal traffic data variables are calculated by principal component analysis and dimension reduction is carried out, and the relevance of the abnormal traffic data of the power communication network is analyzed and mined using Apriori association rules. The method takes full account of the complexity and the real state of abnormal traffic in the power communication network and better reflects the similarity of abnormal network traffic.

Description

Method for analyzing relevance of network abnormal data
Technical Field
The invention relates to the technical field of network abnormal data processing, in particular to a method for analyzing relevance of network abnormal data.
Background
As research on and practice of smart power grids advance, power grids in the traditional sense are gradually merging with information communication systems and monitoring and control systems. The security of power communication networks is closely tied to the operational safety of the power grid and is a top priority of power grid security. During the 12th Five-Year Plan period the power industry continuously strengthened network security, and a network security protection system with the characteristics of the power industry was continuously improved.
The power communication network system is complex and dynamic and has certain vulnerabilities. Security events such as denial-of-service attacks, network scanning, network spoofing, viruses and trojans, and information leakage emerge endlessly, yet there is a lack of methods for analyzing and processing abnormal data of the power communication network in time, so internal and external security risks put great pressure on network security work.
Disclosure of Invention
The invention provides a method for analyzing the relevance of network abnormal data, aiming at solving the technical defects that the prior art lacks a method for analyzing and processing the abnormal data of the power communication network and brings great pressure to the network safety work.
To achieve this purpose, the technical scheme is as follows:
A method for analyzing the relevance of network abnormal data comprises the following steps:
S1: collecting abnormal data of the power communication network;
S2: preprocessing the collected abnormal data to obtain preprocessed abnormal data;
S3: calculating weights from the preprocessed abnormal data by principal component analysis;
S4: calculating the similarity of the abnormal data to generate a transaction database;
S5: completing the relevance analysis with the Apriori algorithm on the generated transaction database.
Wherein, the step S1 specifically includes: selecting key network information data of a fixed time length from the original records of abnormal traffic data collected in the power communication network, wherein each piece of data comprises 4 classes of attributes: the TIME attribute, i.e. the collection time period, denoted $A$; host-related attributes in the network, including the host importance level HOSTS_LEVEL, denoted $B_1$, and the security level of each host SECURE_LEVEL, denoted $B_2$; running-information attributes, including the total number of services running on each host SERVICE_NUM, denoted $C_1$, and the importance level of each service SERVICE_LEVEL, denoted $C_2$; and IP attributes, including the source address SIP, denoted $D_1$, and the destination address DIP, denoted $D_2$. The data features are therefore $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$ and $D_2$. The descriptive text of $B_1$, $B_2$ and $C_2$ is quantized into numbers.
Specifically, in step S2, the data is cleaned to remove data records containing missing values.
Wherein the step S3 includes:
S31: data standardization: standardizing the data means scaling it so that it falls into a small specific interval; this removes the unit limitation of the data and converts it into dimensionless pure numerical values, so that indexes of different units or orders of magnitude can be compared and weighted conveniently. The extreme value normalization method (0-1 normalization) is adopted here, which is a linear transformation of the original data; the transformation function is:
$$X^{*} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
wherein $X_{\max}$ is the maximum value of the sample data, $X_{\min}$ is the minimum value of the sample data, $X$ is the collected abnormal data, and $X^{*}$ is the converted abnormal data;
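As an illustration only, the 0-1 normalization of S31 can be sketched in Python as follows. The function name `min_max_normalize`, the sample SERVICE_NUM values and the handling of a constant column are assumptions and are not taken from the patent.

```python
def min_max_normalize(column):
    """Scale a list of numbers linearly into the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical SERVICE_NUM readings for four abnormal-traffic records.
service_num = [2, 10, 6, 4]
normalized = min_max_normalize(service_num)
print(normalized)  # [0.0, 1.0, 0.5, 0.25]
```

Each feature column would be normalized separately, so that every column ends up on the same dimensionless scale.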
S32: performing principal component analysis on $X^{*}$, wherein the calculation steps are as follows:
calculating the correlation coefficient matrix $R$, wherein the calculation formula is:
$$R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{pmatrix}$$
wherein $r_{ij}$ ($i, j = 1, 2, \ldots, p$) is the correlation coefficient of the original variables $x_i$ and $x_j$, with $r_{ij} = r_{ji}$, calculated as
$$r_{ij} = \frac{\sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{m} (x_{ki} - \bar{x}_i)^2 \sum_{k=1}^{m} (x_{kj} - \bar{x}_j)^2}}$$
wherein $m$ is the number of data records and $\bar{x}_i$, $\bar{x}_j$ denote the averages of the $i$-th and $j$-th variables of the data matrix; from this the correlation coefficient matrix of $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$ can be derived;
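The correlation coefficient matrix can be sketched in plain Python as below. This is a minimal illustration, assuming each record is a list of already normalized feature values and that no feature column is constant (a constant column would make the denominator zero); the function name and the sample records are hypothetical.

```python
import math


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of `rows` (one record per row)."""
    m, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(p)]
    # One centered list per variable (column).
    centered = [[r[j] - means[j] for r in rows] for j in range(p)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    R = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            dot = sum(a * b for a, b in zip(centered[i], centered[j]))
            R[i][j] = dot / (norms[i] * norms[j])
    return R


# Three hypothetical records with three normalized features each.
rows = [
    [0.0, 0.0, 1.0],
    [0.5, 0.25, 0.5],
    [1.0, 1.0, 0.0],
]
R = correlation_matrix(rows)
```

In this sample the first and third features move in exactly opposite directions, so their coefficient is -1 up to rounding, while every diagonal entry is 1.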
calculating the eigenvalues and eigenvectors by solving the characteristic equation:
$$|\lambda I - R| = 0$$
the eigenvalues are obtained by the Jacobi method and arranged in descending order, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$; the eigenvector $e_i$ ($i = 1, 2, \ldots, p$) corresponding to each eigenvalue $\lambda_i$ is then calculated, requiring $\|e_i\| = 1$, i.e.
$$\sum_{j=1}^{p} e_{ij}^{2} = 1$$
wherein $e_{ij}$ represents the $j$-th component of the vector $e_i$;
calculating the principal component contribution rate and the cumulative contribution rate, wherein the calculation formulas are as follows:
contribution rate:
$$\frac{\lambda_i}{\sum_{k=1}^{p} \lambda_k} \quad (i = 1, 2, \ldots, p)$$
cumulative contribution rate:
$$\frac{\sum_{k=1}^{i} \lambda_k}{\sum_{k=1}^{p} \lambda_k} \quad (i = 1, 2, \ldots, p)$$
wherein $\lambda_i$ and $\lambda_k$ are non-negative eigenvalues, $i = 1, 2, \ldots, p$, and $p$ represents the number of non-negative eigenvalues;
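A short Python sketch of the two contribution rates follows. The eigenvalues are invented for illustration, and the helper `components_needed` with its 0.85 target is an assumption: the patent states only that the number of principal components is chosen by analysis, without naming a cut-off.

```python
from itertools import accumulate


def contribution_rates(eigenvalues):
    """Return (contribution rates, cumulative contribution rates)."""
    total = sum(eigenvalues)
    rates = [lam / total for lam in eigenvalues]
    return rates, list(accumulate(rates))


def components_needed(cumulative, target):
    """Smallest number of leading components whose cumulative rate reaches target."""
    for count, value in enumerate(cumulative, start=1):
        if value >= target:
            return count
    return len(cumulative)


eigenvalues = [4.0, 2.0, 1.0, 1.0]      # hypothetical, already in descending order
rates, cumulative = contribution_rates(eigenvalues)
print(rates)        # [0.5, 0.25, 0.125, 0.125]
print(cumulative)   # [0.5, 0.75, 0.875, 1.0]
n = components_needed(cumulative, target=0.85)   # hypothetical target
```

With these sample values three components are enough to reach the assumed target.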
calculating the principal component loads $l_{ij}$, wherein the calculation formula is as follows:
$$l_{ij} = \sqrt{\lambda_i}\, e_{ij} \quad (i, j = 1, 2, \ldots, p)$$
wherein $e_{ij}$ is a unit eigenvector component; from $l_{ij}$ a component matrix $Z$ can be obtained, with the principal component scores as follows:
Figure GDA0002383260630000034
determining the weights by principal component analysis: the weight of an index equals the normalized weighted average of the coefficients of that index in the linear combinations of the principal components, with the variance contribution rates of the principal components used as the weights; three steps are therefore required to determine the index weights:
calculating the coefficients in the principal component linear combinations: each load in the component matrix obtained from the principal component loads is divided by the square root of the corresponding eigenvalue, i.e.
$$a_{ij} = \frac{l_{ij}}{\sqrt{\lambda_i}}$$
which gives the coefficients of the principal component linear combinations; the number of principal components, obtained by analyzing the principal component scores, is denoted $n$ ($n \le 7$), so that $n$ linear combinations $F$ of the data $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$ are obtained:
$$F_i = a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{i7} x_7 \quad (i = 1, 2, \ldots, n)$$
wherein $x_1, x_2, \ldots, x_7$ correspond to $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$ respectively;
calculating the variance contribution rates of the principal components: the greater the variance contribution rate, the more important the principal component, so the variance contribution rate is regarded as the weight of each principal component. The original data are replaced by the $n$ principal components, and the coefficients in the linear combinations are averaged, weighted by each principal component's share of the variance contribution rate:
$$F = c_1 F_1 + c_2 F_2 + \cdots + c_n F_n$$
wherein $c_1, c_2, \ldots, c_n$ are the shares of the variance contribution rate accounted for by $F_1, F_2, \ldots, F_n$;
combining the coefficients in the principal component linear combinations gives:
$$F = w_1 x_1 + w_2 x_2 + \cdots + w_7 x_7$$
wherein $w_1, w_2, \ldots, w_7$ are the weights, which are then normalized;
when the weight of a data variable is lower than the weight threshold, the variable is considered to have a low degree of association with the abnormal traffic data analysis and is deleted.
Wherein the step S4 includes the steps of:
calculating the similarity between abnormal traffic data:
similarity of time information $\delta_1$:
Figure GDA0002383260630000041
wherein $t_1$ and $t_2$ are the detection times of abnormal flows A and B, and $T_{win}$ is the reference design time;
similarity of host-related information $\delta_2$:
Figure GDA0002383260630000042
wherein $S_1$ and $S_2$ are the host importance levels of abnormal flows A and B, and $N_S$ is the total number of importance levels;
similarity of host security protection level $\delta_3$:
Figure GDA0002383260630000043
wherein $C_1$ and $C_2$ are the host protection levels of abnormal flows A and B, and $N_C$ is the total number of protection levels;
similarity of the total number of running services $\delta_4$:
Figure GDA0002383260630000044
wherein $I_1$ and $I_2$ are the total numbers of services running on the hosts of abnormal flows A and B, and $N_I$ is the corresponding total number of levels for the services running on a host;
similarity of running-service importance levels $\delta_5$:
Figure GDA0002383260630000045
wherein $l_1$ and $l_2$ are the importance levels of the services running on the hosts of abnormal flows A and B, and $N_l$ is the total number of importance levels of the services running on a host;
similarity of IP-related information $\delta_6$: let the binary representations of the IP addresses of the two abnormal traffic devices A and B be IP1 and IP2 respectively; XOR them to obtain diff = IP1 XOR IP2; scan diff from the left and stop at the first 1; define the variable $p$ as the number of 0s encountered during the scan; the IP similarity function is then:
Figure GDA0002383260630000051
the similarity $\delta_6$ of the source IP addresses is calculated according to this similarity function;
Calculating the similarity of the abnormal flow η:
Figure GDA0002383260630000052
obtaining the similarity between each abnormal flow, so as to convert the operation on the variable in the abnormal flow data into the operation on each abnormal flow;
generating a transaction database according to the similarity:
setting the similarity thresholds: the similarity thresholds are set according to the calculated similarities between abnormal flows; based on an analysis of experimental results, the maximum threshold is set to 0.5, the minimum threshold to 0.1 and the discard threshold to 0.05;
generating the transaction database D according to the similarity thresholds: when the similarity between two abnormal flows is lower than 0.05, it is considered too low for any correlation to exist; when the similarity between two abnormal flows is higher than 0.5, their degree of association is considered high, and the two form a transaction data item containing 2 abnormal flows; starting from a transaction data item whose similarity is higher than 0.5, if the similarity between its two abnormal flows and another abnormal flow is higher than 0.48, a transaction data item containing 3 abnormal flows is generated; and so on, the required similarity decreasing by 0.02 each time an abnormal flow is added to the transaction data item, but an abnormal flow cannot be added when its similarity is lower than 0.1.
Wherein the step S5 includes the steps of:
S51: setting a minimum support threshold min_sup and a minimum confidence threshold min_conf, wherein the minimum support threshold is 20% and the minimum confidence threshold is 80%;
S52: generating frequent item sets by iterating over the transaction database D: in the first iteration of the algorithm, the transaction database D is scanned once, the number of occurrences of each item contained in D is counted, and the set $C_1$ of candidate 1-item sets is generated;
S53: according to the set minimum support, determining the set $L_1$ of frequent 1-item sets from $C_1$, and so on, until the frequent item set $L_k$ is obtained, where k = 7;
S54: generating rules that satisfy the minimum confidence on the basis of the frequent item sets, the generated rules being called strong association rules; the correlation of the abnormal data of the power communication network is thereby obtained, which in turn enables situation awareness and prediction for the power communication cell data network.
In the above scheme, the algorithm strategy of the Apriori-based relevance analysis is as follows:
Join step: the candidate k-item sets $C_k$ are generated by joining the frequent (k-1)-item sets $L_{k-1}$ with themselves. Apriori assumes that the items in an item set are sorted in lexicographic order. If the first (k-2) items of two elements (item sets) itemset1 and itemset2 of $L_{k-1}$ are identical, itemset1 and itemset2 are said to be joinable. The item set produced by joining itemset1 with itemset2 is {itemset1[1], itemset1[2], …, itemset1[k-1], itemset2[k-1]};
Pruning strategy: by the Apriori property, no infrequent (k-1)-item set can be a subset of a frequent k-item set. Therefore, if a (k-1)-item subset of a candidate k-item set in $C_k$ is not in $L_{k-1}$, the candidate cannot be frequent and can be deleted from $C_k$, yielding a compressed $C_k$;
Deletion strategy: based on the compressed $C_k$, all transactions are scanned, each item set in $C_k$ is counted, and the item sets that do not meet the minimum support are deleted, yielding the frequent k-item sets;
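The join, pruning and deletion strategies above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: item sets are represented as sorted tuples, and the function names, the flow identifiers f1 to f5 and the ten sample transactions are all hypothetical.

```python
from itertools import combinations


def join_step(frequent_prev, k):
    """Join (k-1)-item sets that share their first k-2 items."""
    prev = sorted(frequent_prev)
    return [a + (b[-1],) for a, b in combinations(prev, 2) if a[:k - 2] == b[:k - 2]]


def prune_step(candidates, frequent_prev, k):
    """Drop candidates having a (k-1)-item subset that is not frequent."""
    prev = set(frequent_prev)
    return [c for c in candidates if all(s in prev for s in combinations(c, k - 1))]


def count_and_filter(candidates, transactions, min_sup):
    """Scan all transactions and keep candidates meeting the minimum support."""
    frequent = {}
    for c in candidates:
        support = sum(1 for t in transactions if set(c) <= t) / len(transactions)
        if support >= min_sup:
            frequent[c] = support
    return frequent


def apriori(transactions, min_sup):
    """Return every frequent item set with its support."""
    items = sorted({i for t in transactions for i in t})
    level = count_and_filter([(i,) for i in items], transactions, min_sup)
    all_frequent = dict(level)
    k = 2
    while level:
        candidates = prune_step(join_step(level, k), level, k)
        level = count_and_filter(candidates, transactions, min_sup)
        all_frequent.update(level)
        k += 1
    return all_frequent


# Ten hypothetical transactions; each item is the identifier of an abnormal flow.
transactions = [
    {"f1", "f2", "f3"}, {"f1", "f2", "f3"}, {"f1", "f2"}, {"f1", "f2"},
    {"f1", "f3"}, {"f2", "f3"}, {"f4"}, {"f1"}, {"f2", "f4"}, {"f3", "f5"},
]
frequent = apriori(transactions, min_sup=0.2)
```

With the 20% minimum support, f5 (one occurrence in ten) is dropped in the first pass, the pair (f2, f4) is dropped in the second, and the only frequent 3-item set is (f1, f2, f3).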
According to the scheme, data processing is carried out on abnormal flow data acquired from a power communication network, the weight of abnormal flow data variables is calculated based on a principal component analysis method and dimension reduction is carried out, the similarity between abnormal flows is calculated by using the weight to generate a transaction type database, and then the transaction type database is associated based on an Apriori association rule algorithm to generate a strong association rule.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a relevance analysis method for network abnormal data, in which the weights of the abnormal traffic data variables are calculated by principal component analysis and dimension reduction is carried out, and the similarity between abnormal flows is calculated using the weights to generate a transaction database; the relevance of the abnormal traffic data of the power communication network is analyzed and mined using Apriori association rules. The method takes full account of the complexity and the real state of abnormal traffic in the power communication network and better reflects the similarity of abnormal network traffic.
Drawings
Fig. 1 is a schematic flow chart of a method for analyzing relevance of network abnormal data.
FIG. 2 is a flowchart of an algorithm of a method for analyzing relevance of network anomaly data;
FIG. 3 is a comparison of average temporal complexity.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
The invention is further illustrated below with reference to the figures and examples.
Example 1
As shown in fig. 1 and fig. 2, a method for analyzing relevance of network abnormal data includes the following steps:
S1: collecting abnormal data of the power communication network;
S2: preprocessing the collected abnormal data to obtain preprocessed abnormal data;
S3: calculating weights from the preprocessed abnormal data by principal component analysis;
S4: calculating the similarity of the abnormal data to generate a transaction database;
S5: completing the relevance analysis with the Apriori algorithm on the generated transaction database.
More specifically, the step S1 specifically includes: selecting key network information data of a fixed time length from the original records of abnormal traffic data collected in the power communication network, wherein each piece of data comprises 4 classes of attributes: the TIME attribute, i.e. the collection time period, denoted $A$; host-related attributes in the network, including the host importance level HOSTS_LEVEL, denoted $B_1$, and the security level of each host SECURE_LEVEL, denoted $B_2$; running-information attributes, including the total number of services running on each host SERVICE_NUM, denoted $C_1$, and the importance level of each service SERVICE_LEVEL, denoted $C_2$; and IP attributes, including the source address SIP, denoted $D_1$, and the destination address DIP, denoted $D_2$. The data features are therefore $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$ and $D_2$. The descriptive text of $B_1$, $B_2$ and $C_2$ is quantized into numbers.
More specifically, in step S2, the data is cleaned to remove the data records containing missing values.
More specifically, the step S3 includes:
S31: data standardization: standardizing the data means scaling it so that it falls into a small specific interval; this removes the unit limitation of the data and converts it into dimensionless pure numerical values, so that indexes of different units or orders of magnitude can be compared and weighted conveniently. The extreme value normalization method (0-1 normalization) is adopted here, which is a linear transformation of the original data; the transformation function is:
$$X^{*} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
wherein $X_{\max}$ is the maximum value of the sample data, $X_{\min}$ is the minimum value of the sample data, $X$ is the collected abnormal data, and $X^{*}$ is the converted abnormal data;
S32: performing principal component analysis on $X^{*}$, wherein the calculation steps are as follows:
calculating the correlation coefficient matrix $R$, wherein the calculation formula is:
$$R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{pmatrix}$$
wherein $r_{ij}$ ($i, j = 1, 2, \ldots, p$) is the correlation coefficient of the original variables $x_i$ and $x_j$, with $r_{ij} = r_{ji}$, calculated as
$$r_{ij} = \frac{\sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{m} (x_{ki} - \bar{x}_i)^2 \sum_{k=1}^{m} (x_{kj} - \bar{x}_j)^2}}$$
wherein $m$ is the number of data records and $\bar{x}_i$, $\bar{x}_j$ denote the averages of the $i$-th and $j$-th variables of the data matrix; from this the correlation coefficient matrix of $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$ can be derived;
calculating the eigenvalues and eigenvectors by solving the characteristic equation:
$$|\lambda I - R| = 0$$
the eigenvalues are obtained by the Jacobi method and arranged in descending order, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$; the eigenvector $e_i$ ($i = 1, 2, \ldots, p$) corresponding to each eigenvalue $\lambda_i$ is then calculated, requiring $\|e_i\| = 1$, i.e.
$$\sum_{j=1}^{p} e_{ij}^{2} = 1$$
wherein $e_{ij}$ represents the $j$-th component of the vector $e_i$;
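For illustration, the Jacobi method named above can be sketched in Python as a cyclic sequence of plane rotations applied to the symmetric matrix R. The function name `jacobi_eigen`, the tolerance, the sweep limit and the sample matrix are assumptions; a production system would more likely call a numerical library.

```python
import math


def jacobi_eigen(R, tol=1e-12, max_sweeps=50):
    """Eigen-decomposition of a symmetric matrix by cyclic Jacobi rotations.

    Returns (values, vectors): values in descending order, and vectors[i]
    the unit eigenvector e_i that belongs to values[i].
    """
    n = len(R)
    A = [row[:] for row in R]                                  # driven towards diagonal form
    V = [[float(i == j) for j in range(n)] for i in range(n)]  # accumulated rotations
    for _ in range(max_sweeps):
        off = math.sqrt(sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes A[p][q]: tan(2*theta) = 2*a_pq / (a_qq - a_pp)
                theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):          # A <- A J (columns p and q)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):          # A <- J^T A (rows p and q)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):          # V <- V J
                    vkp, vkq = V[k][p], V[k][q]
                    V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    order = sorted(range(n), key=lambda i: A[i][i], reverse=True)
    values = [A[i][i] for i in order]
    vectors = [[V[k][i] for k in range(n)] for i in order]
    return values, vectors


# Hypothetical 3x3 correlation matrix; its eigenvalues are 1.5, 1.0 and 0.5.
R = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
values, vectors = jacobi_eigen(R)
```

Because every rotation is orthogonal, the returned eigenvectors already satisfy the unit-length requirement without a separate normalization step.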
calculating the principal component contribution rate and the cumulative contribution rate, wherein the calculation formulas are as follows:
contribution rate:
$$\frac{\lambda_i}{\sum_{k=1}^{p} \lambda_k} \quad (i = 1, 2, \ldots, p)$$
cumulative contribution rate:
$$\frac{\sum_{k=1}^{i} \lambda_k}{\sum_{k=1}^{p} \lambda_k} \quad (i = 1, 2, \ldots, p)$$
wherein $\lambda_i$ and $\lambda_k$ are non-negative eigenvalues, $i = 1, 2, \ldots, p$, and $p$ represents the number of non-negative eigenvalues;
calculating the principal component loads $l_{ij}$, wherein the calculation formula is as follows:
$$l_{ij} = \sqrt{\lambda_i}\, e_{ij} \quad (i, j = 1, 2, \ldots, p)$$
wherein $e_{ij}$ is a unit eigenvector component; from $l_{ij}$ a component matrix $Z$ can be obtained, with the principal component scores as follows:
Figure GDA0002383260630000092
determining the weights by principal component analysis: the weight of an index equals the normalized weighted average of the coefficients of that index in the linear combinations of the principal components, with the variance contribution rates of the principal components used as the weights; three steps are therefore required to determine the index weights:
calculating the coefficients in the principal component linear combinations: each load in the component matrix obtained from the principal component loads is divided by the square root of the corresponding eigenvalue, i.e.
$$a_{ij} = \frac{l_{ij}}{\sqrt{\lambda_i}}$$
which gives the coefficients of the principal component linear combinations; the number of principal components, obtained by analyzing the principal component scores, is denoted $n$ ($n \le 7$), so that $n$ linear combinations $F$ of the data $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$ are obtained:
$$F_i = a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{i7} x_7 \quad (i = 1, 2, \ldots, n)$$
wherein $x_1, x_2, \ldots, x_7$ correspond to $A$, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$ respectively;
calculating the variance contribution rates of the principal components: the greater the variance contribution rate, the more important the principal component, so the variance contribution rate is regarded as the weight of each principal component. The original data are replaced by the $n$ principal components, and the coefficients in the linear combinations are averaged, weighted by each principal component's share of the variance contribution rate:
$$F = c_1 F_1 + c_2 F_2 + \cdots + c_n F_n$$
wherein $c_1, c_2, \ldots, c_n$ are the shares of the variance contribution rate accounted for by $F_1, F_2, \ldots, F_n$;
combining the coefficients in the principal component linear combinations gives:
$$F = w_1 x_1 + w_2 x_2 + \cdots + w_7 x_7$$
wherein $w_1, w_2, \ldots, w_7$ are the weights, which are then normalized;
when the weight of a data variable is lower than the weight threshold, the variable is considered to have a low degree of association with the abnormal traffic data analysis and is deleted.
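The three weighting steps can be sketched in Python as below. This is a minimal illustration under stated assumptions: the eigenvalues and unit eigenvectors are taken as given, the combined coefficients are assumed to come out positive (eigenvector signs are arbitrary, and the patent does not say how negative coefficients are treated), and the function names, the two-variable sample data and the 0.4 weight threshold are hypothetical.

```python
import math


def pca_weights(eigenvalues, eigenvectors, n_components):
    """Normalized index weights from the first n_components principal components."""
    lam = eigenvalues[:n_components]
    total = sum(lam)
    shares = [value / total for value in lam]          # c_i: share of variance contribution
    n_vars = len(eigenvectors[0])
    raw = []
    for j in range(n_vars):
        w = 0.0
        for i in range(n_components):
            load = math.sqrt(lam[i]) * eigenvectors[i][j]   # principal component load
            coeff = load / math.sqrt(lam[i])                # load / sqrt(eigenvalue)
            w += shares[i] * coeff
        raw.append(w)
    scale = sum(raw)
    return [w / scale for w in raw]


def drop_low_weight(names, weights, threshold):
    """Keep only the variables whose weight reaches the threshold."""
    return [name for name, w in zip(names, weights) if w >= threshold]


# Hypothetical two-variable case: eigen-decomposition of [[1, 0.5], [0.5, 1]].
s = 1.0 / math.sqrt(2.0)
eigenvalues = [1.5, 0.5]
eigenvectors = [[s, s], [s, -s]]
weights = pca_weights(eigenvalues, eigenvectors, n_components=2)
kept = drop_low_weight(["A", "B1"], weights, threshold=0.4)   # hypothetical threshold
```

In this sample the shares are 0.75 and 0.25, so the normalized weights come out as roughly 2/3 and 1/3, and only the first variable survives the assumed threshold.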
Wherein the step S4 includes the steps of:
calculating the similarity between abnormal traffic data:
similarity of time information $\delta_1$:
Figure GDA0002383260630000101
wherein $t_1$ and $t_2$ are the detection times of abnormal flows A and B, and $T_{win}$ is the reference design time;
similarity of host-related information $\delta_2$:
Figure GDA0002383260630000102
wherein $S_1$ and $S_2$ are the host importance levels of abnormal flows A and B, and $N_S$ is the total number of importance levels;
similarity of host security protection level $\delta_3$:
Figure GDA0002383260630000103
wherein $C_1$ and $C_2$ are the host protection levels of abnormal flows A and B, and $N_C$ is the total number of protection levels;
similarity of the total number of running services $\delta_4$:
Figure GDA0002383260630000104
wherein $I_1$ and $I_2$ are the total numbers of services running on the hosts of abnormal flows A and B, and $N_I$ is the corresponding total number of levels for the services running on a host;
similarity of running-service importance levels $\delta_5$:
Figure GDA0002383260630000105
wherein $l_1$ and $l_2$ are the importance levels of the services running on the hosts of abnormal flows A and B, and $N_l$ is the total number of importance levels of the services running on a host;
similarity of IP-related information $\delta_6$: let the binary representations of the IP addresses of the two abnormal traffic devices A and B be IP1 and IP2 respectively; XOR them to obtain diff = IP1 XOR IP2; scan diff from the left and stop at the first 1; define the variable $p$ as the number of 0s encountered during the scan; the IP similarity function is then:
Figure GDA0002383260630000111
the similarity $\delta_6$ of the source IP addresses is calculated according to this similarity function;
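The scan that produces the variable p can be sketched in Python as follows, assuming 32-bit IPv4 addresses; the function name and the sample addresses are hypothetical. The sketch stops at p, because the function that maps p to the similarity is given only in the formula image above.

```python
import ipaddress


def leading_common_bits(ip_a, ip_b):
    """Number of leading zero bits in IP1 XOR IP2, i.e. the variable p."""
    diff = int(ipaddress.IPv4Address(ip_a)) ^ int(ipaddress.IPv4Address(ip_b))
    # bit_length() is the position of the highest 1 bit; the bits above it are the zeros scanned.
    return 32 - diff.bit_length()


p_same_subnet = leading_common_bits("192.168.1.10", "192.168.1.200")   # 24
p_identical = leading_common_bits("10.0.0.1", "10.0.0.1")              # 32
p_far_apart = leading_common_bits("10.0.0.1", "138.0.0.1")             # 0
```

A larger p means the two addresses share a longer common prefix, which is what makes it a usable input for an IP similarity measure.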
Calculating the similarity of the abnormal flow η:
Figure GDA0002383260630000112
obtaining the similarity between each abnormal flow, so as to convert the operation on the variable in the abnormal flow data into the operation on each abnormal flow;
generating a transaction database according to the similarity:
setting the similarity thresholds: the similarity thresholds are set according to the calculated similarities between abnormal flows; based on an analysis of experimental results, the maximum threshold is set to 0.5, the minimum threshold to 0.1 and the discard threshold to 0.05;
generating the transaction database D according to the similarity thresholds: when the similarity between two abnormal flows is lower than 0.05, it is considered too low for any correlation to exist; when the similarity between two abnormal flows is higher than 0.5, their degree of association is considered high, and the two form a transaction data item containing 2 abnormal flows; starting from a transaction data item whose similarity is higher than 0.5, if the similarity between its two abnormal flows and another abnormal flow is higher than 0.48, a transaction data item containing 3 abnormal flows is generated; and so on, the required similarity decreasing by 0.02 each time an abnormal flow is added to the transaction data item, but an abnormal flow cannot be added when its similarity is lower than 0.1.
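One possible reading of this threshold rule is sketched below in Python. The interpretation is an assumption: a new flow is added only if its similarity to every flow already in the item exceeds the current threshold, flows are tried in sorted order, identical items are merged, and a missing pair counts as similarity 0. The function name, the flow identifiers and the similarity values are hypothetical. Pairs below the discard threshold of 0.05 never enter an item, since they already fail the 0.1 floor.

```python
def build_transactions(sim, max_thr=0.5, step=0.02, min_thr=0.1):
    """Group abnormal flows into transaction items by pairwise similarity.

    sim maps frozenset({a, b}) to the similarity of flows a and b.
    """
    flows = sorted({f for pair in sim for f in pair})

    def s(a, b):
        return sim.get(frozenset((a, b)), 0.0)

    def grow(item):
        while True:
            # The required similarity drops by `step` for each flow already added.
            threshold = round(max_thr - step * (len(item) - 1), 10)
            if threshold < min_thr:
                return item
            extra = next((c for c in flows if c not in item
                          and all(s(c, m) > threshold for m in item)), None)
            if extra is None:
                return item
            item.append(extra)

    transactions = set()
    for i, a in enumerate(flows):
        for b in flows[i + 1:]:
            if s(a, b) > max_thr:           # seed: a pair above the maximum threshold
                transactions.add(frozenset(grow([a, b])))
    return transactions


sim = {
    frozenset(("f1", "f2")): 0.80,
    frozenset(("f1", "f3")): 0.49,
    frozenset(("f2", "f3")): 0.60,
    frozenset(("f2", "f4")): 0.20,
    frozenset(("f3", "f4")): 0.07,
    frozenset(("f1", "f4")): 0.03,
    frozenset(("f4", "f5")): 0.70,
}
transactions = build_transactions(sim)
```

Here the pair (f1, f2) seeds an item, f3 joins it because both of its similarities exceed 0.48, and f4 and f5 form a separate two-flow item.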
More specifically, the step S5 includes the following steps:
S51: setting a minimum support threshold min_sup and a minimum confidence threshold min_conf, wherein the minimum support threshold is 20% and the minimum confidence threshold is 80%;
S52: generating frequent item sets by iterating over the transaction database D: in the first iteration of the algorithm, the transaction database D is scanned once, the number of occurrences of each item contained in D is counted, and the set $C_1$ of candidate 1-item sets is generated;
S53: according to the set minimum support, determining the set $L_1$ of frequent 1-item sets from $C_1$, and so on, until the frequent item set $L_k$ is obtained, where k = 7;
S54: generating rules that satisfy the minimum confidence on the basis of the frequent item sets, the generated rules being called strong association rules; the correlation of the abnormal data of the power communication network is thereby obtained, which in turn enables situation awareness and prediction for the power communication cell data network.
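The rule generation of S54 can be sketched in Python as below, using the usual definition confidence(X → Y) = support(X ∪ Y) / support(X). The function name and the support values are hypothetical, and the sketch assumes that the dictionary of frequent item sets holds every frequent subset as a sorted tuple, which the Apriori property guarantees for a complete result.

```python
from itertools import combinations


def strong_rules(frequent, min_conf):
    """Return (antecedent, consequent, confidence) for rules meeting min_conf."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                confidence = support / frequent[antecedent]
                if confidence >= min_conf:
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    rules.append((antecedent, consequent, confidence))
    return rules


# Hypothetical frequent item sets with their supports.
frequent = {("f1",): 0.6, ("f2",): 0.4, ("f1", "f2"): 0.4}
rules = strong_rules(frequent, min_conf=0.8)
print(rules)  # [(('f2',), ('f1',), 1.0)]
```

The rule f1 → f2 has confidence 0.4 / 0.6, which falls below 80% and is discarded, while f2 → f1 has confidence 1.0 and is kept as a strong association rule.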
In the specific implementation process, the algorithm strategy of the Apriori-based relevance analysis is as follows:
Join step: the candidate k-item sets $C_k$ are generated by joining the frequent (k-1)-item sets $L_{k-1}$ with themselves. Apriori assumes that the items in an item set are sorted in lexicographic order. If the first (k-2) items of two elements (item sets) itemset1 and itemset2 of $L_{k-1}$ are identical, itemset1 and itemset2 are said to be joinable. The item set produced by joining itemset1 with itemset2 is {itemset1[1], itemset1[2], …, itemset1[k-1], itemset2[k-1]};
Pruning strategy: by the Apriori property, no infrequent (k-1)-item set can be a subset of a frequent k-item set. Therefore, if a (k-1)-item subset of a candidate k-item set in $C_k$ is not in $L_{k-1}$, the candidate cannot be frequent and can be deleted from $C_k$, yielding a compressed $C_k$;
Deletion strategy: based on the compressed $C_k$, all transactions are scanned, each item set in $C_k$ is counted, and the item sets that do not meet the minimum support are deleted, yielding the frequent k-item sets;
The minimum support threshold min_sup and the minimum confidence threshold min_conf are set; the minimum support threshold is 20% and the minimum confidence threshold is 80%;
The frequent itemsets are generated by iterating over the transaction database D: in the first iteration of the algorithm, D is scanned once, the number of occurrences of each item contained in D is counted, and the candidate 1-itemset $C_1$ is generated;
According to the set minimum support, the frequent 1-itemset $L_1$ is determined from $C_1$, and the same procedure is continued to obtain the frequent itemset $L_k$, where k = 7;
The rules that satisfy the minimum confidence are generated on the basis of the frequent itemsets; these rules are called strong association rules. They give the relevance of the abnormal data of the power communication network, which in turn supports situational awareness and prediction for the power communication cell data network.
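The join, pruning, deletion and rule-generation steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patent's implementation: the function name `apriori`, the representation of transactions as sets of item labels, and the sample labels `f1` to `f4` are assumptions made for the example. The default thresholds are the 20% support and 80% confidence stated in the text.

```python
from itertools import combinations


def apriori(transactions, min_sup=0.2, min_conf=0.8):
    """Return (frequent itemsets with support, strong association rules)."""
    tsets = [frozenset(t) for t in transactions]
    n = len(tsets)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in tsets) / n

    # First pass: candidate 1-itemsets C_1 -> frequent 1-itemsets L_1.
    frequent = {}
    level = []
    for item in sorted({i for t in tsets for i in t}):
        s = support(frozenset([item]))
        if s >= min_sup:
            level.append((item,))
            frequent[frozenset([item])] = s

    k = 2
    while level:
        prev = set(level)
        candidates = []
        # Join step: two sorted (k-1)-itemsets sharing their first k-2 items.
        for a in level:
            for b in level:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    cand = a + (b[-1],)
                    # Pruning: every (k-1)-subset must itself be frequent.
                    if all(sub in prev for sub in combinations(cand, k - 1)):
                        candidates.append(cand)
        # Deletion: count support and keep candidates meeting min_sup.
        level = []
        for cand in candidates:
            s = support(frozenset(cand))
            if s >= min_sup:
                level.append(cand)
                frequent[frozenset(cand)] = s
        k += 1

    # Strong rules: lhs -> rhs with confidence = sup(lhs U rhs) / sup(lhs).
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = s / frequent[lhs]
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, s, conf))
    return frequent, rules


# Hypothetical transactions: each set lists abnormal flows grouped together.
sample = [{"f1", "f2", "f3"}, {"f1", "f2"}, {"f1", "f2", "f4"},
          {"f3", "f4"}, {"f1", "f2", "f3"}]
frequent_sets, strong_rules = apriori(sample)
```

The sketch rescans the transaction list for every candidate, which mirrors the repeated database access the description mentions; a production version would count all candidates of one level in a single pass.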
In a specific implementation, the abnormal flow data collected from the power communication network is first processed; the weights of the abnormal flow data variables are calculated by principal component analysis and the dimensionality is reduced; the weights are then used to calculate the similarity between abnormal flows and to generate a transaction database; finally, association analysis is performed on the transaction database with the Apriori association rule algorithm to generate strong association rules.
In a specific implementation, as shown in Fig. 3, the data is first preprocessed to remove records with missing values. The abnormal flow data variables are then processed by principal component analysis, which yields the weights of the data variables and reduces their dimensionality at the same time. On the basis of the simplified abnormal flow data, the similarity between abnormal flows is obtained, and the abnormal flows of the power communication network are assembled into a transaction database accordingly. The association rules of the abnormal flows are then obtained in step S5.
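As a small illustration of the preprocessing step, the following sketch drops every record that has a missing value. It assumes each record is a Python dict keyed by the attribute names given in the claims (TIME, HOSTS_LEVEL, SECURE_LEVEL, SERVICE_NUM, SERVICE_LEVEL, SIP, DIP); the dict representation, the function name `drop_incomplete` and the choice to treat `None` and the empty string as "missing" are assumptions for the example.

```python
FIELDS = ["TIME", "HOSTS_LEVEL", "SECURE_LEVEL",
          "SERVICE_NUM", "SERVICE_LEVEL", "SIP", "DIP"]


def drop_incomplete(records, fields=FIELDS):
    """Keep only records in which every field is present and non-empty."""
    return [
        r for r in records
        if all(r.get(f) not in (None, "") for f in fields)
    ]


# Hypothetical records: the second one lacks SECURE_LEVEL and is removed.
records = [
    {"TIME": 10, "HOSTS_LEVEL": 3, "SECURE_LEVEL": 2, "SERVICE_NUM": 5,
     "SERVICE_LEVEL": 1, "SIP": "10.0.0.1", "DIP": "10.0.0.9"},
    {"TIME": 12, "HOSTS_LEVEL": 1, "SECURE_LEVEL": None, "SERVICE_NUM": 0,
     "SERVICE_LEVEL": 2, "SIP": "10.0.0.2", "DIP": "10.0.0.9"},
]
clean = drop_incomplete(records)
```

A value of 0 (for example SERVICE_NUM = 0) is kept, since it is a valid measurement rather than a missing one.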
In a specific implementation, an inherent drawback of Apriori association rule mining is that its space complexity and time complexity grow as the data grows, because Apriori has to access and iterate over the database many times. With current data processing platforms, programming against the Spark parallel computing framework can effectively reduce the running time of the iterative algorithm; in addition, Spark supports caching data that is used repeatedly (cache), which relieves, to some extent, the pressure of repeated database access.
In a specific implementation, principal component analysis is used as the data processing method and Apriori is used to obtain the association rules. Principal component analysis reduces the data processing workload, and determining the weights of the abnormal flow data variables by principal component analysis makes the weights objective and reasonable. Apriori association rules are widely applied in fields such as business and network security, where the associations in data are analyzed and mined to extract useful information.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (4)

1. A method for analyzing the relevance of network abnormal data, characterized in that the method comprises the following steps:
S1: collecting abnormal data of the power communication network;
S2: preprocessing the collected abnormal data to obtain preprocessed abnormal data;
S3: calculating weights from the preprocessed abnormal data by principal component analysis;
S4: calculating the similarity of the abnormal data to generate a transaction database;
S5: completing the relevance analysis on the generated transaction database based on the Apriori algorithm;
wherein the step S3 comprises:
S31: data standardization: standardizing the data means scaling it so that it falls into a small specific interval; this is mainly used to remove the unit restrictions of the data and convert it into dimensionless pure numbers, so that indexes of different units or orders of magnitude can be compared and weighted; extreme value normalization, namely 0-1 normalization, is adopted here, which is a linear transformation of the original data with the following conversion function:
$$X^* = \frac{X - X_{min}}{X_{max} - X_{min}}$$
where $X_{max}$ is the maximum value of the sample data, $X_{min}$ is the minimum value of the sample data, $X$ is the collected abnormal data, and $X^*$ is the abnormal data after conversion;
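A minimal sketch of the 0-1 normalization for one data variable, assuming the usual min-max form (each value minus the sample minimum, divided by the sample range). The function name `min_max_normalize` and the handling of a constant column (all values mapped to 0.0) are assumptions for the example; the text does not say what to do when the maximum equals the minimum.

```python
def min_max_normalize(values):
    """Linearly rescale a list of numbers into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant variable carries no information, map to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical SERVICE_NUM column of three abnormal-flow records.
scaled = min_max_normalize([2, 4, 6])  # -> [0.0, 0.5, 1.0]
```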
S32: performing principal component analysis on $X^*$, with the following calculation steps:
calculating the correlation coefficient matrix R, using the formula:
Figure FDA0002383260620000012
where $r_{ij}$ (i, j = 1, 2, ..., p) is the correlation coefficient of the original variables $x_i$ and $x_j$, with $r_{ij} = r_{ji}$, calculated as
Figure FDA0002383260620000013
where
Figure FDA0002383260620000014
represents the averages over the rows and columns of the X matrix, from which the correlation coefficient matrix of A, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$ can be derived;
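The formula images for the correlation coefficient are not reproduced in this text, so the following sketch assumes the standard Pearson correlation coefficient between two columns of the normalized data. The function name `correlation_matrix` and the row-per-record layout are assumptions for the example, and the sketch requires that no column is constant (otherwise the denominator is zero).

```python
import math


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of `rows` (list of records)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sum of squared deviations per column (must be non-zero).
    ss = [sum(c[j] ** 2 for c in centered) for j in range(p)]
    return [
        [sum(c[i] * c[j] for c in centered) / math.sqrt(ss[i] * ss[j])
         for j in range(p)]
        for i in range(p)
    ]


# Hypothetical data: 3 records, 3 variables.
R = correlation_matrix([[1, 2, 3], [2, 4, 1], [3, 6, 2]])
```

By construction the result is symmetric with ones on the diagonal, matching the property $r_{ij} = r_{ji}$ stated above.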
calculating the eigenvalues and eigenvectors by solving the characteristic equation:
$$|\lambda I - R| = 0$$
the eigenvalues are obtained by the Jacobi method and arranged in descending order, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$; the eigenvector $e_i$ (i = 1, 2, ..., p) corresponding to each eigenvalue $\lambda_i$ is then calculated, requiring $\|e_i\| = 1$, i.e.
$$\sum_{j=1}^{p} e_{ij}^2 = 1$$
where $e_{ij}$ denotes the j-th component of the vector $e_i$;
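The text names the Jacobi method for the eigenvalue problem. The following pure-Python sketch shows one common variant, the cyclic Jacobi rotation scheme for a real symmetric matrix; the function name `jacobi_eigen`, the tolerance and the sweep limit are assumptions for the example, and the patent does not say which variant it uses. The returned eigenvectors have unit length and the pairs are sorted by descending eigenvalue.

```python
import math


def jacobi_eigen(matrix, tol=1e-20, max_sweeps=100):
    """Eigen-decompose a symmetric matrix; return [(eigenvalue, unit vector)]."""
    n = len(matrix)
    a = [list(map(float, row)) for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        # Stop once the squared off-diagonal mass is negligible.
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes a[p][q]; |theta| <= pi/4.
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # columns p, q: A <- A J
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rows p, q: A <- J^T A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # accumulate eigenvectors: V <- V J
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    # Eigenvalues sit on the diagonal; eigenvectors are the columns of V.
    pairs = [(a[i][i], [v[k][i] for k in range(n)]) for i in range(n)]
    pairs.sort(key=lambda t: t[0], reverse=True)
    return pairs


# Hypothetical 2x2 correlation matrix with eigenvalues 1.5 and 0.5.
eigen_pairs = jacobi_eigen([[1.0, 0.5], [0.5, 1.0]])
```

Because every rotation is orthogonal, the accumulated columns stay orthonormal, so the unit-norm requirement on each $e_i$ holds without an extra normalization pass.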
calculating the contribution rate and the cumulative contribution rate of the principal components, using the following formulas:
contribution rate:
$$\frac{\lambda_i}{\sum_{k=1}^{p} \lambda_k}$$
cumulative contribution rate:
$$\frac{\sum_{k=1}^{i} \lambda_k}{\sum_{k=1}^{p} \lambda_k}$$
where $\lambda_i$, $\lambda_k$ are the non-negative characteristic roots (eigenvalues), i = 1, 2, ..., p, and p is the number of non-negative characteristic roots;
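As a short worked illustration, the contribution rate of each principal component is its eigenvalue divided by the sum of all eigenvalues, and the cumulative rate is the running sum of those ratios. The function name `contribution_rates` is an assumption for the example, and the eigenvalues are expected in descending order.

```python
def contribution_rates(eigenvalues):
    """Return (contribution rates, cumulative contribution rates)."""
    total = sum(eigenvalues)
    rates = [lam / total for lam in eigenvalues]
    cumulative = []
    running = 0.0
    for r in rates:
        running += r
        cumulative.append(running)
    return rates, cumulative


# Hypothetical eigenvalues of a 2x2 correlation matrix.
rates, cumulative = contribution_rates([1.5, 0.5])
```

For the eigenvalues 1.5 and 0.5 the first component alone accounts for three quarters of the total variance.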
calculating the principal component loadings $l_{ij}$, using the formula:
Figure FDA0002383260620000024
where $e_{ij}$ is a component of the unit eigenvector; from $l_{ij}$ the component matrix Z can be obtained, and the principal component scores are as follows:
Figure FDA0002383260620000025
determining the weights by principal component analysis: the weight of an index equals the normalized weighted average of that index's coefficients in the linear combinations of the principal components, with the variance contribution rates of the principal components used as the weights; determining the index weights therefore requires three steps:
calculating the coefficients in the linear combinations of the principal components: each loading in the component matrix obtained from the principal component loadings is divided by the square root of the corresponding eigenvalue, i.e.
Figure FDA0002383260620000026
This gives the coefficients of the linear combinations of the principal components. The number of principal components, denoted n (n ≤ 7), is determined by analyzing the principal component scores, and n sets of linear combination coefficients F over the data A, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$ are obtained:
Figure FDA0002383260620000027
where $x_1, x_2, ..., x_7$ correspond to A, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$;
calculating the variance contribution rates of the principal components: the greater the variance contribution rate, the more important the principal component, so the variance contribution rate is taken as the weight of each principal component; the original data is replaced by the n principal components, and the coefficients in the linear combinations are averaged with each principal component weighted by its share of the variance contribution rate,
$$F = c_1 F_1 + c_2 F_2 + \dots + c_n F_n$$
where $c_1, c_2, ..., c_n$ are the shares of the variance contribution rate accounted for by $F_1, F_2, ..., F_n$,
combining the coefficients in the linear combinations of the principal components gives:
$$F = w_1 x_1 + w_2 x_2 + \dots + w_7 x_7$$
where $w_1, w_2, ..., w_7$ are the weights, which are then normalized;
while the weights are determined by principal component analysis, a weight threshold of 0.05 is set for the data variables; when the weight of a data variable is lower than the weight threshold, the data variable is considered to have a low degree of association with the abnormal flow data analysis and is deleted;
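The three weighting steps can be sketched as follows, under stated assumptions: the coefficient of variable j in principal component i is taken as the loading divided by the square root of the eigenvalue, the share $c_i$ is the eigenvalue of component i divided by the sum of the retained eigenvalues, and the combined coefficients are normalized by dividing by their sum (which is assumed to be positive). The function name `pca_index_weights` and the sample numbers are hypothetical.

```python
import math


def pca_index_weights(eigenvalues, loadings, threshold=0.05):
    """Return (normalized weights, indexes of variables kept).

    eigenvalues: the n retained eigenvalues.
    loadings[i][j]: loading of variable j on principal component i.
    """
    # Step 1: coefficient = loading / sqrt(eigenvalue) for each component.
    coefs = [[l / math.sqrt(lam) for l in row]
             for lam, row in zip(eigenvalues, loadings)]
    # Step 2: share of each component in the retained variance contribution.
    total = sum(eigenvalues)
    shares = [lam / total for lam in eigenvalues]
    # Weighted average of the coefficients, one value per variable.
    p = len(loadings[0])
    raw = [sum(c * row[j] for c, row in zip(shares, coefs)) for j in range(p)]
    # Step 3: normalize so the weights sum to 1 (assumes a positive sum).
    s = sum(raw)
    weights = [w / s for w in raw]
    # Variables whose weight is below the threshold are deleted.
    kept = [j for j, w in enumerate(weights) if w >= threshold]
    return weights, kept


# Hypothetical example: 2 retained components, 3 variables.
weights, kept = pca_index_weights(
    [4.0, 1.0],
    [[2.0, 2.0, 0.1], [0.0, 0.0, 0.1]],
)
```

In this example the third variable ends up with a weight of about 0.036, below the 0.05 threshold, so only the first two variables are kept.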
the step S4 comprises the following steps:
calculating the similarity between abnormal flow data:
similarity of time information $\delta_1$:
Figure FDA0002383260620000031
where $t_1$, $t_2$ are the detection times of abnormal flows A and B, and $T_{win}$ is the reference design time;
similarity of host-related information $\delta_2$:
Figure FDA0002383260620000032
where $S_1$, $S_2$ are the host importance levels of abnormal flows A and B, and $N_S$ is the number of importance levels;
similarity of host security protection level $\delta_3$:
Figure FDA0002383260620000033
where $C_1$, $C_2$ are the host protection levels of abnormal flows A and B, and $N_C$ is the total number of protection levels;
similarity of the total number of running services $\delta_4$:
Figure FDA0002383260620000034
where $I_1$, $I_2$ are the total numbers of services running on the hosts of abnormal flows A and B, and $N_I$ is the total number of importance levels of the services running on the host;
similarity of the importance levels of running services $\delta_5$:
Figure FDA0002383260620000041
where $l_1$, $l_2$ are the importance levels of the services running on the hosts of abnormal flows A and B, and $N_l$ is the total number of importance levels of the services running on the host;
similarity of IP-related information $\delta_6$: let the binary representations of the IP addresses of the two abnormal flows A and B be IP1 and IP2; XOR them to obtain diff = IP1 XOR IP2; scan diff from the left and stop at the first 1; define the variable p as the number of 0s encountered during the scan; the IP similarity function is then:
Figure FDA0002383260620000042
the similarity $\delta_6$ of the source IP addresses is calculated according to the obtained similarity function;
calculating the abnormal flow similarity $\eta$:
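The variable p defined above is the length of the common leading bit prefix of the two addresses. A minimal sketch, assuming 32-bit IPv4 addresses given as dotted strings (the text does not state the address family); the function name `common_prefix_bits` is hypothetical. The mapping from p to $\delta_6$ is given by the formula image, which is not reproduced here, so the sketch stops at p.

```python
import ipaddress


def common_prefix_bits(ip1, ip2):
    """Number of leading zero bits in IP1 XOR IP2 (IPv4, 32 bits)."""
    a = int(ipaddress.IPv4Address(ip1))
    b = int(ipaddress.IPv4Address(ip2))
    diff = a ^ b
    # bit_length() is the position of the highest 1 bit; the bits to its
    # left in a 32-bit word are the zeros met before the scan stops.
    return 32 - diff.bit_length()


# Hypothetical source addresses of two abnormal flows.
p = common_prefix_bits("192.168.1.1", "192.168.1.2")  # -> 30
```

Identical addresses give diff = 0, so the scan meets 32 zeros and p = 32.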
Figure FDA0002383260620000043
obtaining the similarity between every pair of abnormal flows, so that operations on the variables in the abnormal flow data are converted into operations on the abnormal flows themselves;
generating a transaction database according to the similarity:
setting the similarity thresholds: the thresholds are set according to the calculated similarities between the abnormal flows; based on analysis of the experimental results, they are set to a maximum threshold of 0.5, a minimum threshold of 0.1 and a discard threshold of 0.05;
generating the transaction database D according to the similarity thresholds: when the similarity is lower than 0.05, the similarity between the abnormal flows is considered too low for any correlation to exist; when the similarity of two abnormal flows is higher than 0.5, their degree of association is considered high, and the two form a transaction data item containing 2 abnormal flows; given a transaction data item whose similarity is higher than 0.5, if the similarity between each of its two abnormal flows and another abnormal flow is higher than 0.48, a transaction data item containing 3 abnormal flows is generated, and so on: each time an abnormal flow is added to a transaction data item, the required similarity is reduced by 0.02, but no abnormal flow can be added to a transaction data item once the similarity is lower than 0.1.
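The threshold rules above leave some details open, such as the order in which further abnormal flows are tried. The following sketch is one possible reading, not the patent's definitive procedure: every pair with similarity above 0.5 seeds a transaction, and further flows are added greedily in index order as long as their similarity to every flow already in the transaction exceeds the current requirement, which starts at 0.48 and drops by 0.02 per added flow, never going below 0.1. The function name `build_transactions` and the symmetric matrix input are assumptions. The 0.05 discard threshold needs no separate check here because it lies below the 0.1 minimum.

```python
def build_transactions(sim, max_thr=0.5, min_thr=0.1, step=0.02):
    """Group abnormal flows into transactions from a similarity matrix.

    sim[i][j] is the similarity eta between abnormal flows i and j.
    Returns a sorted list of transactions, each a sorted list of indexes.
    """
    n = len(sim)
    transactions = set()
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] <= max_thr:
                continue  # only pairs above the maximum threshold seed items
            items = [i, j]
            required = round(max_thr - step, 10)
            grown = True
            while grown and required >= min_thr:
                grown = False
                for k in range(n):
                    # A new flow must be similar enough to every member.
                    if k not in items and all(
                        sim[k][m] > required for m in items
                    ):
                        items.append(k)
                        required = round(required - step, 10)
                        grown = True
                        break
            transactions.add(frozenset(items))
    return sorted(sorted(t) for t in transactions)


# Hypothetical similarities between four abnormal flows.
sim = [
    [1.00, 0.60, 0.49, 0.04],
    [0.60, 1.00, 0.50, 0.30],
    [0.49, 0.50, 1.00, 0.20],
    [0.04, 0.30, 0.20, 1.00],
]
database = build_transactions(sim)  # -> [[0, 1, 2]]
```

Flows 0 and 1 seed the transaction, flow 2 joins because both of its similarities exceed 0.48, and flow 3 is left out because its similarity to flow 0 is far below the next requirement of 0.46.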
2. The method according to claim 1, characterized in that the step S1 specifically comprises: selecting network key information data of a fixed time length from the original records of abnormal flow data collected in the power communication network, where each piece of data comprises 4 kinds of attributes: the time attribute, i.e. the collection time period TIME, denoted A; the host-related attributes in the network, including the host importance level HOSTS_LEVEL, denoted $B_1$, and the security level of each host SECURE_LEVEL, denoted $B_2$; the running information attributes, including the total number of services running on each host SERVICE_NUM, denoted $C_1$, and the importance level of each service SERVICE_LEVEL, denoted $C_2$; and the IP attributes: the source address SIP, denoted $D_1$, and the destination address DIP, denoted $D_2$; the data features are A, $B_1$, $B_2$, $C_1$, $C_2$, $D_1$, $D_2$; the descriptive text of $B_1$, $B_2$, $C_2$ is quantized into numbers.
3. The method according to claim 2, characterized in that the step S2 specifically comprises cleaning the data and removing the data records that contain missing values.
4. The method according to claim 3, characterized in that the step S5 comprises the following steps:
S51: setting the minimum support threshold min_sup and the minimum confidence threshold min_conf, where the minimum support threshold is 20% and the minimum confidence threshold is 80%;
S52: generating the frequent itemsets by iterating over the transaction database D: in the first iteration of the algorithm, D is scanned once, the number of occurrences of each item contained in D is counted, and the candidate 1-itemset $C_1$ is generated;
S53: according to the set minimum support, determining the frequent 1-itemset $L_1$ from $C_1$, and continuing in the same way to obtain the frequent itemset $L_k$, where k = 7;
S54: generating the rules that satisfy the minimum confidence on the basis of the frequent itemsets, the generated rules being called strong association rules, thereby obtaining the relevance of the abnormal data of the power communication network and further realizing situational awareness and prediction for the power communication cell data network.
CN201810402502.5A 2018-04-28 2018-04-28 Method for analyzing relevance of network abnormal data Active CN108595667B (en)

Publications (2)

Publication Number Publication Date
CN108595667A CN108595667A (en) 2018-09-28
CN108595667B true CN108595667B (en) 2020-06-09

Family

ID=63619304

Country Status (1)

Country Link
CN (1) CN108595667B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant