CN112714080A

CN112714080A - Interconnection relation classification method and system based on spark graph algorithm

Info

Publication number: CN112714080A
Application number: CN202011543625.4A
Authority: CN
Inventors: 陶景龙; 梁淑云; 刘胜; 马影; 王启凡; 魏国富; 殷钱安; 余贤喆; 周晓勇
Original assignee: Information and Data Security Solutions Co Ltd
Current assignee: Information and Data Security Solutions Co Ltd
Priority date: 2020-12-23
Filing date: 2020-12-23
Publication date: 2021-04-27
Anticipated expiration: 2040-12-23
Also published as: CN112714080B

Abstract

The invention relates to an interconnection relation classification method based on a Spark graph algorithm, comprising: generating a node data table V and a relation data table E; applying a Spark graph algorithm to the node data table V and the relation data table E to generate a graph relation G; performing communication group discovery with the Louvain algorithm; setting group labels in combination with the service and classifying the groups; and screening out free relations and recording them as a free relation table P. A system based on the method is also provided. The method maps the power terminal communication network to a directed weighted network in which IPs are nodes, communication relations are edges and the similarity of the instructions sent by the IPs is the edge weight. It applies the classic community detection algorithm Louvain for community mining and uses a relation weight based on the similarity of device communication instructions as a salient feature. This improves the classification effect of the algorithm, so that the interconnection relations among communication devices are classified accurately and efficiently into clearly separated communication groups.

Description

Interconnection relation classification method and system based on spark graph algorithm

Technical Field

The invention relates to the technical field of network traffic classification, in particular to a spark graph algorithm-based interconnection relationship classification method and system.

Background

Terminal devices in the power distribution networks of the power industry communicate with one another. As construction requirements and services develop, the number of communication devices and facilities keeps growing, networking relations become more complex, and it becomes harder to tell which communication function category a given interconnection relation between devices belongs to. At present this requirement is met by manual verification: traffic logs between devices are screened and classified with keyword matching or fixed rules. This traditional approach is time-consuming and labor-intensive, and its accuracy is hard to guarantee. As communication functions grow more complex, the previously defined keywords and fixed rules need constant maintenance, and classification results may not be produced in time. To improve efficiency and accuracy and to reduce cost, automatic classification of the interconnection relations of communication devices by data mining is becoming increasingly important.

Manual verification depends mainly on how well the staff involved understand the power distribution networking service and how well they know the communication types of power distribution terminals. On that basis a keyword matching library and fixed rule entries are established, the communication traffic logs between devices are screened and analyzed accordingly, and the classification categories are finally obtained. This traditional method of classifying device interconnection relations has low efficiency, low accuracy and high maintenance cost.

For example, application number CN202010537734.9 discloses a device and a method for classifying network traffic based on Spark performance optimization, which belong to the technical field of network traffic classification. The data preprocessing module acquires the original traffic data and extracts time-related features from it; the model training module is used for classifying the network traffic; the real-time classification module loads the classification model trained by the model training module onto the data processed by the preprocessing module and classifies the data under the Topic; and the Spark performance optimization module provides performance optimization support for the model training module and the real-time classification module. That invention classifies network traffic quickly and accurately by means of a Spark Shuffle performance optimization architecture and a weighted random forest algorithm, allows a network service provider to offer different service strategies for different application scenarios, further improves network service quality and provides strong support for network security. Although it uses a random forest algorithm to classify network traffic, it cannot be applied directly to the classification of power network communication data, because such data differ greatly from general network traffic data.

In summary, the prior-art methods for classifying device interconnection relations cannot classify the interconnection relations among power distribution terminal devices in the power industry accurately, efficiently and at low cost. Automatic classification of these relations by data mining is therefore increasingly needed to improve efficiency and accuracy and to reduce cost.

Disclosure of Invention

The technical problem to be solved by the invention is that the interconnection relationship among the power distribution terminal devices in the power industry cannot be accurately, efficiently and cheaply classified in the prior art.

The invention solves the technical problems through the following technical means:

An interconnection relation classification method based on a Spark graph algorithm comprises the following steps:

S101, data acquisition: acquiring inter-network communication traffic data of power distribution terminals;

S102, data processing: generating a node data table V and a relation data table E;

S103, graph relation generation: based on the node data table V and the relation data table E, mapping the power terminal communication network to a directed weighted network in which IPs are nodes, communication relations are edges and the similarity of the instructions sent by the IPs is the edge weight, and applying a Spark graph algorithm to generate a graph relation G;

S104, communication group discovery: performing group discovery on the graph relation G with the Louvain algorithm;

S105, according to the result of step S104, setting group labels in combination with the service and classifying the groups; and screening out free relations and recording them as a free relation table P.

The method maps the power terminal communication network to a directed weighted network in which IPs are nodes, communication relations are edges and the similarity of the instructions sent by the IPs is the edge weight. It applies the classic community detection algorithm Louvain for community mining and uses a relation weight based on the similarity of device communication instructions as a salient feature, which improves the classification effect of the algorithm. The interconnection relations among communication devices are thereby classified accurately and efficiently into clearly separated communication groups, and group type labels designed in combination with the power network service logic yield power distribution terminal network communication relation groups with actual service types. This solves the problems of low efficiency, low accuracy and high maintenance cost in the prior art.

Further, the method in step S101 is:

acquiring inter-network traffic data of the power distribution terminals from the data management and control department of an electric power company, covering a period of one week; the data contain a five-tuple structure and comprise at least the source IP, source port, destination IP, destination port, communication type, communication instruction, source device information and destination device information.

Further, the method in step S102 is:

S1021, extracting all source IPs and destination IPs of the inter-network traffic data together with their corresponding instruction sets, and, after grouping, aggregation and de-duplication, forming a data table with two columns, denoted the node data table V;

S1022, extracting all communication records between source IPs and destination IPs of the inter-network traffic data; obtaining the one-to-one correspondence between source IP and destination IP after de-duplication; obtaining the instruction set of each IP from the node data table V produced in step S1021; calculating the similarity between the instruction sets of the source IP and the destination IP with the Levenshtein distance algorithm to generate the field "instruction similarity"; and finally obtaining a relation data table with four columns: source IP, destination IP, communication relation and instruction similarity. This table is denoted the relation data table E.
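Steps S1021 and S1022 can be sketched in plain Python as follows. This is a minimal illustration rather than the patented implementation: the record keys `src_ip`, `dst_ip` and `instruction` are hypothetical names, the instruction set of an IP is assumed to be the set of instructions it sends, and the similarity function is passed in so that any Levenshtein-based score can be plugged in.

```python
from collections import defaultdict


def build_tables(flows, similarity):
    """Build node table V and relation table E from flow records.

    flows: iterable of dicts with the hypothetical keys
           'src_ip', 'dst_ip' and 'instruction'.
    similarity: callable(str, str) -> number, for example a
           Levenshtein-based score.
    """
    instructions = defaultdict(set)
    pairs = set()
    for f in flows:
        # Assumption: an instruction is attributed to the IP that sent it.
        instructions[f["src_ip"]].add(f["instruction"])
        instructions[f["dst_ip"]]  # touch the key so destination-only IPs get a row
        pairs.add((f["src_ip"], f["dst_ip"]))  # de-duplicated directed pairs

    # V: one row per IP; the instruction set is flattened to a sorted string.
    V = {ip: ",".join(sorted(s)) for ip, s in instructions.items()}
    # E: source IP, destination IP, communication relation, instruction similarity.
    E = [(src, dst, "communication", similarity(V[src], V[dst]))
         for src, dst in sorted(pairs)]
    return V, E


flows = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "instruction": "read"},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "instruction": "write"},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.3", "instruction": "read"},
]
# Stand-in similarity for illustration only: 1 if the strings match, else 0.
V, E = build_tables(flows, lambda a, b: int(a == b))
```

Keeping V as a mapping from IP to a single string makes the later string-distance computation straightforward; a production job on Spark would more likely express the same grouping and de-duplication as DataFrame operations.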

Further, the method in step S104 is: based on the graph relationship G generated in step S103, a Louvain community discovery algorithm is applied to perform community mining, so as to obtain a communication group with an obvious classification.

Further, the method in step S105 is: based on the communication groups calculated in step S104, screening node relationships that are far away from or isolated from each group, and generating a free relationship table; and establishing a communication group category label by combining the specific communication service type and characteristics.

The invention also provides an interconnection relation classification system based on a Spark graph algorithm, which comprises:

the data acquisition module, used for acquiring inter-network communication traffic data of power distribution terminals;

the data processing module, used for generating a node data table V and a relation data table E;

the graph relation generating module, used for applying a Spark graph algorithm to the node data table V and the relation data table E to generate a graph relation G;

the communication group discovery module, used for discovering communication groups with the Louvain algorithm;

the group classification module, used for setting group labels in combination with the service according to the result of the communication group discovery module and classifying the groups, and for screening out free relations and recording them as a free relation table P.

Further, the acquisition process of the data acquisition module is as follows: inter-network traffic data of the power distribution terminals are acquired from the data management and control department of an electric power company, covering a period of one week; the data contain a five-tuple structure and comprise at least the source IP, source port, destination IP, destination port, communication type, communication instruction, source device information and destination device information.

Further, the method for generating the node data table V and the relationship data table E in the data processing module includes:

Further, the method for group discovery in the communication group discovery module is as follows: based on the graph relation G, a Louvain community discovery algorithm is applied for community mining to obtain clearly separated communication groups.

Further, the group classification module screens node relationships far away from or isolated from each group based on the obtained communication groups to generate a free relationship table; and establishing a communication group category label by combining the specific communication service type and characteristics.

The invention has the advantages that:

Drawings

FIG. 1 is a block flow diagram of a classification method in an embodiment of the invention;

FIG. 2 is a graph relationship G generated by a node data table V and a relationship data table E in the embodiment of the present invention;

FIG. 3 shows the communication groups G-C obtained by community mining on the graph relation G in the embodiment of the present invention;

FIG. 4 is a free relation table P generated based on the communication groups G-C in the embodiment of the present invention;

FIG. 5 is a communication relation classification result table R generated based on the communication groups G-C in the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

This embodiment provides an interconnection relation classification method based on a Spark graph algorithm, which addresses the problem that the prior art cannot classify the interconnection relations among power distribution terminal devices in the power industry accurately, efficiently and at low cost.

As shown in fig. 1, a spark graph algorithm-based interconnection relationship classification method specifically includes the following steps:

S102, data processing: generating a node data table V and a relation data table E from the inter-network communication traffic data;

S103, graph relation generation: applying a Spark graph algorithm to generate a graph relation G;

S104, communication group discovery: performing group discovery with the Louvain algorithm;

The contents of each step are specifically described as follows:

the method in S101 comprises the following steps:

acquiring inter-network flow data of the power distribution terminal from a data management and control department of an electric power company, wherein the time period is at least one week; the data typically comprises a five-tuple structure including, but not limited to: source IP, source port, destination IP, destination port, communication type, communication instruction, source device information, destination device information, etc.;

the method in S102 comprises the following steps:

extracting contents aiming at the acquired internetwork flow data, and respectively generating a node data table V and a relation data table E;

S1021 extracts all source IPs and destination IPs together with their corresponding instruction sets and, after grouping, aggregation and de-duplication, forms a data table with two columns, denoted the node data table V, whose content takes the form shown in Table 1.

Table 1 data node table

S1022 extracts all communication records between source IPs and destination IPs, obtains the one-to-one correspondence between source IP and destination IP after de-duplication, obtains the instruction set of each IP from the node data table V produced in S1021, calculates the similarity between the instruction sets of the source IP and the destination IP with the Levenshtein distance algorithm to generate the field "instruction similarity", and finally obtains a relation data table with four columns: source IP, destination IP, communication relation and instruction similarity. This table is denoted the relation data table E; its content takes the form shown in Table 2, in which the instructions and IPs have been desensitized.

TABLE 2 data relationship Table

The Levenshtein distance algorithm, also called the edit distance algorithm, computes the minimum number of editing operations required to convert one character string into another. The permitted editing operations are replacing one character with another, inserting a character and deleting a character. In general, the smaller the edit distance, the greater the similarity of the two strings. The value stored in the relation table has been processed so that a larger value indicates a greater similarity.
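As an illustration of the edit distance described above, the following sketch implements the standard dynamic-programming recurrence. The text does not state how the distance is converted into the "larger means more similar" value stored in table E, so `instruction_similarity` and its 0 to 100 scale are assumptions made only for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]


def instruction_similarity(a: str, b: str) -> int:
    """Hypothetical normalisation: 100 for identical strings, 0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))
```

Only two rows of the distance matrix are kept at a time, so memory use grows with the shorter dimension of the row rather than with the product of both string lengths.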

The method in S103 is as follows: the power terminal communication network is mapped to a directed weighted network in which IPs are nodes, communication relations are edges and the similarity of the instructions sent by the IPs is the edge weight. Specifically, the GraphFrames graph algorithm package is imported into the Spark big data computing platform environment, and a graph relation, denoted graph relation G, is generated from tables V and E, as shown in FIG. 2.

Spark is a fast, general-purpose computing engine designed for large-scale data processing. It has grown into a rapidly developing and widely used ecosystem and is characterized by speed, ease of use and generality.

GraphFrames is a graph processing library built on Spark DataFrames. It benefits from the scalability and performance of DataFrames, provides a uniform graph processing API for Scala, Java and Python, and offers powerful queries, graph saving and loading, and easy portability.
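The construction of graph relation G from tables V and E might look like the sketch below. It assumes a running SparkSession and an installed `graphframes` package; GraphFrames requires the vertex column to be named `id` and the edge columns `src` and `dst`, while the remaining column names and the helper functions are hypothetical.

```python
def to_graphframe_rows(V, E):
    """Shape tables V and E into row lists for GraphFrames.

    V: mapping IP -> instruction string; E: list of
    (src, dst, relationship, similarity) tuples.
    """
    vertex_rows = [(ip, instr) for ip, instr in sorted(V.items())]
    edge_rows = [(src, dst, rel, float(sim)) for src, dst, rel, sim in E]
    return vertex_rows, edge_rows


def build_graph(spark, V, E):
    """Create graph relation G on an existing SparkSession (not run in this sketch)."""
    from graphframes import GraphFrame  # lazy import: needs the graphframes package
    vertex_rows, edge_rows = to_graphframe_rows(V, E)
    vertices = spark.createDataFrame(vertex_rows, ["id", "instructions"])
    edges = spark.createDataFrame(
        edge_rows, ["src", "dst", "relationship", "similarity"])
    return GraphFrame(vertices, edges)


# Row preparation alone runs without Spark; the instruction strings are made up.
vertex_rows, edge_rows = to_graphframe_rows(
    {"31.119.0.219": "cmdB", "31.119.0.185": "cmdA"},
    [("31.119.0.185", "31.119.0.219", "communication", 35)],
)
```

Because every edge row carries its own `src` and `dst`, the direction of each communication relation is preserved, and the `similarity` column is available as the edge weight for later steps.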

The method in S104 is as follows: based on the graph relation G generated in step S103, the classic Louvain community discovery algorithm is applied for community mining to obtain clearly separated communication groups G-C, in the form shown in FIG. 3 (community relationship graph G-C).

The Louvain algorithm is a modularity-based algorithm. Its principle is to evaluate a partition by the difference between the cohesion within the modules of that partition and the cohesion of a random partition, and to search for the partition with the best module cohesion. The specific algorithm and formula are as follows:

The nodes of the network are traversed repeatedly. Each node is taken out of its original community, the modularity gain of adding it to each community is calculated, and the node is added to the community with the largest gain; this continues until no node can be moved. Each community is then merged into a super node, and the above steps are repeated until the modularity no longer increases. The modularity gain is the change in modularity when a node is taken out of its original community and added to another community, and the calculation formula is as follows:

In the formula, Σ_in denotes the sum of the weights of all edges inside community C; C denotes the community that node i is to join; i denotes the node to be moved; k_i,in denotes the sum of the weights of all edges from node i to community C; and Σ_tot denotes the sum of the weights of all edges connected to community C.
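For reference, the modularity gain as printed in the original Louvain paper (Blondel et al., 2008) uses exactly these symbols. That this is the formula intended here is an assumption. The symbols k_i (the sum of the weights of all edges incident to node i) and m (the sum of all edge weights in the network) do not appear in the definitions above and are taken from that paper.

```latex
\Delta Q =
\left[ \frac{\Sigma_{in} + k_{i,in}}{2m}
     - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^{2} \right]
-
\left[ \frac{\Sigma_{in}}{2m}
     - \left( \frac{\Sigma_{tot}}{2m} \right)^{2}
     - \left( \frac{k_i}{2m} \right)^{2} \right]
```

Some later presentations write 2k_{i,in} in the first numerator, which is what a direct expansion of the modularity definition yields. In either form the gain grows with k_{i,in} and shrinks with the product of Σ_tot and k_i.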

The method in S105 is as follows: based on the communication groups calculated in step S104, node relations that are far away from every group or isolated are screened out to generate a free relation table P, whose content takes the form shown in FIG. 4; communication group category labels are established in combination with the specific communication service types and characteristics, and a communication relation classification result table R is generated, whose content takes the form shown in FIG. 5. This completes the classification of the interconnection relations among power distribution terminal devices in the power industry.
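One possible reading of step S105 is sketched below. The text does not give the screening criterion, so the rule used here is an assumption: a relation is treated as free when either endpoint lies in a community smaller than the hypothetical threshold `min_size`, or when its endpoints fall in different communities. The `labels` mapping stands for the business labels assigned to each group.

```python
from collections import Counter


def classify_relations(E, community, labels, min_size=3):
    """Split relation table E into free relations P and classified relations R.

    E: list of (src, dst, relationship, similarity) tuples.
    community: mapping node -> community id (result of step S104).
    labels: mapping community id -> business label (hypothetical).
    """
    size = Counter(community.values())  # number of nodes per community
    P, R = [], []
    for src, dst, rel, sim in E:
        c_src, c_dst = community[src], community[dst]
        isolated = size[c_src] < min_size or size[c_dst] < min_size
        if isolated or c_src != c_dst:
            P.append((src, dst, rel, sim))  # free relation
        else:
            R.append((src, dst, labels.get(c_src, "unlabelled")))
    return P, R


E = [("a", "b", "communication", 30),
     ("b", "c", "communication", 28),
     ("c", "x", "communication", 5)]
community = {"a": 0, "b": 0, "c": 0, "x": 1}
P, R = classify_relations(E, community, {0: "telemetry"})
```

In this example node `x` forms a community of its own, so the relation between `c` and `x` lands in P while the other two relations receive the label of group 0.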

The invention also provides an interconnection relation classification system based on spark graph algorithm, which comprises:

the data processing module is used for generating a node data table V and a relation data table E according to the internetwork communication traffic data;

the graph relation generating module is used for applying a Spark graph algorithm to generate a graph relation G;

the communication group discovery module is used for performing group discovery with the Louvain algorithm;

the group classification module is used for performing group label setting by combining services according to the result obtained by the communication group discovery module and performing group classification; and screening the free relation to be recorded as a free relation table P.

The contents of each module are specifically described as follows:

the method in the data acquisition module comprises the following steps:

the method in the data processing module comprises the following steps:

Table 1 data node table

TABLE 2 data relationship Table

src	dst	relationship	similarity
31.119.0.185	31.119.0.219	communication	35
31.118.224.103	31.118.244.171	communication	32
31.118.244.171	31.118.188.7	communication	27
31.14.128.12	31.119.0.185	communication	18
31.119.0.185	31.119.0.177	communication	49
31.118.168.13	31.118.224.103	communication	29
31.119.0.177	31.118.224.107	communication	27
31.118.244.171	31.119.0.177	communication	26
31.119.0.185	31.118.140.24	communication	26
31.118.224.107	31.118.140.24	communication	28
31.14.128.12	31.119.0.177	communication	35
31.118.188.7	31.118.84.13	communication	44
31.119.0.177	31.119.0.185	communication	49
31.118.152.51	31.119.2.77	communication	29
31.118.140.24	31.119.0.219	communication	37

The method in the graph relation generation module is as follows: the power terminal communication network is mapped to a directed weighted network in which IPs are nodes, communication relations are edges and the similarity of the instructions sent by the IPs is the edge weight. Specifically, the GraphFrames graph algorithm package is imported into the Spark big data computing platform environment, and a graph relation, denoted graph relation G, is generated from tables V and E, as shown in FIG. 2.

The method in the communication group discovery module is as follows: based on the generated graph relation G, the classic Louvain community discovery algorithm is applied for community mining to obtain clearly separated communication groups G-C, in the form shown in FIG. 3 (community relationship graph G-C).
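The node-moving phase of Louvain described earlier can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: edges are treated as undirected although graph relation G is directed, self-loops are ignored, and the merging of communities into super nodes is left out. Candidate communities are ranked by k_i,in − Σ_tot · k_i / (2m), which equals the exact modularity change up to a positive constant factor.

```python
from collections import defaultdict


def local_moving(edges):
    """Louvain node-moving phase on an undirected weighted graph.

    edges: iterable of (u, v, weight). Returns {node: community label}.
    """
    adj = defaultdict(lambda: defaultdict(float))
    for u, v, w in edges:
        if u == v:
            continue  # self-loops are ignored in this sketch
        adj[u][v] += w
        adj[v][u] += w
    degree = {n: sum(nb.values()) for n, nb in adj.items()}  # k_i
    two_m = sum(degree.values())                             # 2m
    community = {n: n for n in adj}  # every node starts in its own community
    tot = dict(degree)               # Sigma_tot of each community

    moved = True
    while moved:
        moved = False
        for i in sorted(adj):
            k_i = degree[i]
            k_in = defaultdict(float)  # weight from i to each neighbouring community
            for j, w in adj[i].items():
                k_in[community[j]] += w
            old = community[i]
            tot[old] -= k_i  # take i out of its current community
            best, best_gain = old, k_in[old] - tot[old] * k_i / two_m
            for c in sorted(k_in):
                gain = k_in[c] - tot[c] * k_i / two_m
                if gain > best_gain:  # move only on a strict improvement
                    best, best_gain = c, gain
            tot[best] += k_i
            if best != old:
                community[i] = best
                moved = True
    return community


# Two triangles joined by one weak edge should end up as two groups.
groups = local_moving([
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 0.1),
])
```

A full Louvain run would repeat this phase on the graph obtained by collapsing each community into a single node, until the modularity stops increasing.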

The method in the group classification module is as follows: based on the communication groups obtained above, node relations that are far away from every group or isolated are screened out to generate a free relation table P, whose content takes the form shown in FIG. 4; communication group category labels are established in combination with the specific communication service types and characteristics, and a communication relation classification result table R is generated, whose content takes the form shown in FIG. 5. This completes the classification of the interconnection relations among power distribution terminal devices in the power industry.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An interconnection relation classification method based on spark graph algorithm is characterized by comprising the following steps:

2. The interconnection relationship classification method based on spark graph algorithm as claimed in claim 1, wherein the method in step S101 is:

3. The interconnection relationship classification method based on spark graph algorithm as claimed in claim 1, wherein the method in step S102 is:

4. The interconnection relationship classification method based on spark graph algorithm as claimed in any one of claims 1 to 3, wherein the method in step S104 is: based on the graph relationship G generated in step S103, a Louvain community discovery algorithm is applied to perform community mining, so as to obtain clearly separated communication groups.

5. The interconnection relationship classification method based on spark graph algorithm as claimed in any one of claims 1 to 3, wherein the method in step S105 is: based on the communication groups calculated in step S104, screening node relationships that are far away from or isolated from each group, and generating a free relationship table; and establishing a communication group category label by combining the specific communication service type and characteristics.

6. An interconnection relation classification system based on spark graph algorithm is characterized by comprising

7. The spark graph algorithm-based interconnection relationship classification system according to claim 6, wherein the acquisition process of the data acquisition module is as follows: acquiring inter-network traffic data of the power distribution terminals from the data management and control department of an electric power company, covering a period of one week; the data contain a five-tuple structure and comprise at least the source IP, source port, destination IP, destination port, communication type, communication instruction, source device information and destination device information.

8. The spark graph algorithm-based interconnection relationship classification system according to claim 6, wherein the method for generating the node data table V and the relationship data table E in the data processing module is as follows:

9. The interconnection relationship classification system based on spark graph algorithm as claimed in claim 6, wherein the method for group discovery in the communication group discovery module is: based on the graph relation G, a Louvain community discovery algorithm is applied for community mining to obtain clearly separated communication groups.

10. The spark graph algorithm-based interconnection relationship classification system according to claim 6, wherein the group classification module screens node relationships far away from or isolated from each group based on the obtained communication groups to generate a free relationship table; and establishing a communication group category label by combining the specific communication service type and characteristics.