CN112714080B

CN112714080B - Interconnection relation classification method and system based on spark graph algorithm

Info

Publication number: CN112714080B
Application number: CN202011543625.4A
Authority: CN
Inventors: 陶景龙; 梁淑云; 刘胜; 马影; 王启凡; 魏国富; 殷钱安; 余贤喆; 周晓勇
Original assignee: Information and Data Security Solutions Co Ltd
Current assignee: Information and Data Security Solutions Co Ltd
Priority date: 2020-12-23
Filing date: 2020-12-23
Publication date: 2023-10-17
Anticipated expiration: 2040-12-23
Also published as: CN112714080A

Abstract

The invention relates to an interconnection relation classification method based on a Spark graph algorithm. The method comprises: generating a node data table V and a relation data table E; applying a Spark graph algorithm to the node data table V and the relation data table E to generate a graph relation G; performing communication group discovery using the Louvain community discovery algorithm; setting group labels in combination with the service and classifying the groups; and screening out free relations and recording them as a free relation table P. A system based on the method is also provided. The invention models the communication network as a directed weighted network in which IPs are nodes, communication relations are edges and the similarity of the instructions sent by the IPs is the edge weight, and mines communities with the classical Louvain community detection algorithm. Using a relation weight based on the similarity of device communication instructions as a salient feature improves the classification effect of the algorithm, so that the interconnection relations among communication devices can be classified accurately and efficiently into clearly separated communication groups.

Description

Interconnection relation classification method and system based on spark graph algorithm

Technical Field

The invention relates to the technical field of network traffic classification, in particular to an interconnection relation classification method and system based on a spark graph algorithm.

Background

Distribution network terminal devices in the electric power industry communicate with one another. As construction requirements and services develop, the number of communication devices and facilities grows, networking relations become more complex, and the communication function category to which each interconnection relation belongs becomes harder to distinguish. At present this task is done by manual verification: the traffic logs between devices are screened and classified by keyword matching or fixed rules. This traditional approach is time-consuming and labor-intensive, and its accuracy is difficult to guarantee. As communication functions become more complex, the previously defined keywords and fixed rules must be maintained continually, and classification results may not be produced in time. To improve efficiency and accuracy and to reduce cost, automatic classification of communication device interconnection relations by data mining techniques has become increasingly important.

Manual verification depends mainly on the staff's understanding of the power distribution networking service and their grasp of the communication types of power distribution terminals. On that basis a keyword matching library and a set of fixed rules are established, the communication traffic logs between devices are screened and analyzed accordingly, and the classification type is finally obtained. This traditional method of classifying device interconnection relations has low efficiency, low accuracy and high maintenance cost.

A prior-art invention discloses a network traffic classification device and classification method based on Spark performance optimization, belonging to the technical field of network traffic classification. Its data preprocessing module collects the original traffic data and extracts time-related features from it; its model training module is used for classifying the network traffic; its real-time classification module loads the classification model trained by the model training module and classifies the data under a Topic after that data has been processed by the preprocessing module; and its Spark performance optimization module provides performance optimization support for the model training module and the real-time classification module. By constructing a Spark Shuffle performance optimization architecture and a weighted random forest algorithm, that invention classifies network traffic rapidly and accurately, allows network service providers to offer different service strategies for different application scenarios, further improves network service quality and safeguards network security. That method classifies network traffic with a random forest algorithm, but it cannot be applied directly to the classification of power network communication data, because power network communication data differs greatly from general network traffic data.

In summary, prior-art methods for classifying device interconnection relations cannot classify the interconnection relations between power distribution terminal devices in the power industry accurately, efficiently and at low cost. Automatic classification of communication device interconnection relations by data mining techniques is therefore increasingly important.

Disclosure of Invention

The technical problem to be solved by the invention is that the interconnection relation between the power distribution terminal equipment in the power industry cannot be classified accurately, efficiently and at low cost in the prior art.

The invention solves the technical problems by the following technical means:

An interconnection relation classification method based on a Spark graph algorithm comprises the following steps:

S101, data acquisition: acquiring communication traffic data among power distribution terminals;

S102, data processing: generating a node data table V and a relation data table E;

S103, graph relation generation: based on the node data table V and the relation data table E, mapping the power terminal communication network into a directed weighted network with IPs as nodes, communication relations as edges and the similarity of the instructions sent by the IPs as edge weights, and applying a Spark graph algorithm to generate a graph relation G;

S104, communication group discovery: performing communication group discovery on the graph relation G using the Louvain community discovery algorithm;

S105, according to the result of step S104, setting group labels in combination with the service and classifying the groups; and screening out free relations and recording them as a free relation table P.

The invention models the communication network as a directed weighted network in which IPs are nodes, communication relations are edges and the similarity of the instructions sent by the IPs is the edge weight, and mines communities with the classical Louvain community detection algorithm. Using a relation weight based on the similarity of device communication instructions as a salient feature improves the classification effect of the algorithm, so that the interconnection relations between communication devices can be classified accurately and efficiently into clearly separated communication groups. Group type labels are then designed in combination with the power network business logic, yielding power distribution terminal network communication relation groups with actual business types. This solves the problems of low efficiency, low accuracy and high maintenance cost in the prior art.

Further, the method in step S101 is as follows:

Traffic data among power distribution terminals is acquired from the data management and control department of an electric power company, covering a period of one week. The data has a five-tuple structure and comprises at least: source IP, source port, destination IP, destination port, communication type, communication instruction, source device information and destination device information.

Further, the method in step S102 is as follows:

S1021, extracting all source IPs and destination IPs of the inter-network traffic data together with their corresponding instruction sets, and, after grouping, aggregation and de-duplication, forming a data table with two columns, recorded as the node data table V;

S1022, extracting all communication records between source IPs and destination IPs of the inter-network traffic data and de-duplicating them to obtain the one-to-one correspondence between source IP and destination IP; joining with the node data table V obtained in step S1021 to obtain the instruction set corresponding to each IP; then calculating the similarity of the instruction sets of the source IP and the destination IP using the Levenshtein distance algorithm to generate a field "instruction similarity"; and finally obtaining a relation data table with four columns, whose fields are: source IP, destination IP, communication relation, instruction similarity. This table is recorded as the relation data table E.

Further, the method in step S104 is as follows: based on the graph relationship G generated in the step S103, applying a Louvain community discovery algorithm to conduct community mining to obtain a communication group with obvious classification.

Further, the method in step S105 is as follows: based on the communication group calculated in the step S104, screening node relations far away from each group or isolated, and generating a free relation table; and establishing a communication group category label by combining specific communication service types and characteristics.

The invention also provides an interconnection relation classification system based on the Spark graph algorithm, which comprises:

a data acquisition module, used for acquiring communication traffic data among the power distribution terminal networks;

a data processing module, used for generating a node data table V and a relation data table E;

a graph relation generation module, used for generating a graph relation G by applying a Spark graph algorithm to the node data table V and the relation data table E;

a communication group discovery module, used for performing communication group discovery using the Louvain community discovery algorithm;

a group classification module, used for setting group labels according to the result of the communication group discovery module and classifying the groups in combination with the service, and for screening out free relations and recording them as a free relation table P.

Further, the acquisition process of the data acquisition module is as follows: traffic data among power distribution terminals is acquired from the data management and control department of an electric power company, covering a period of one week. The data has a five-tuple structure and comprises at least: source IP, source port, destination IP, destination port, communication type, communication instruction, source device information and destination device information.

Further, the method for generating the node data table V and the relationship data table E in the data processing module is as follows:

Further, the method for group discovery in the communication group discovery module comprises the following steps: based on the graph relation G, applying a Louvain community discovery algorithm to carry out community mining to obtain a communication group with obvious classification.

Further, the group classification module screens node relations far away from each group or isolated based on the obtained communication groups to generate a free relation table; and establishing a communication group category label by combining specific communication service types and characteristics.

The invention has the advantages that:

Drawings

FIG. 1 is a flow chart of a classification method according to an embodiment of the invention;

FIG. 2 is a graph relationship G generated by the node data table V and the relationship data table E in an embodiment of the present invention;

FIG. 3 is a diagram illustrating a communication group G-C obtained by community mining based on the graph relationship G in an embodiment of the present invention;

FIG. 4 is a table P of free relations generated based on communication groups G-C in an embodiment of the invention;

FIG. 5 is a table R of communication relationship classification results generated based on communication groups G-C in an embodiment of the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

This embodiment provides an interconnection relation classification method based on a Spark graph algorithm, which addresses the problem that the prior art cannot classify the interconnection relations between power distribution terminal devices in the power industry accurately, efficiently and at low cost.

As shown in FIG. 1, the interconnection relation classification method based on the Spark graph algorithm comprises the following steps:

S101, data acquisition: acquiring communication traffic data among power distribution terminals;

S102, data processing: generating a node data table V and a relation data table E from the inter-network communication traffic data;

S103, graph relation generation: applying a Spark graph algorithm to generate a graph relation G;

S104, communication group discovery: performing group discovery using the Louvain algorithm;

S105, according to the result of step S104, setting group labels in combination with the service and classifying the groups; and screening out free relations and recording them as a free relation table P.

The contents of each step are specifically described below:

The method in S101 is:

Traffic data among the power distribution terminal networks is acquired from the data management and control department of an electric power company, covering a period of at least one week. The data typically has a five-tuple structure, including but not limited to: source IP, source port, destination IP, destination port, communication type, communication instruction, source device information and destination device information.
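As an illustration of the record structure listed above, the following is a minimal sketch of one traffic log entry. The class name, field names and the `parse_record` helper are hypothetical; the source only names the fields in prose and does not specify a storage format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FlowRecord:
    """One communication traffic log entry (all field names are assumptions)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    comm_type: str
    instruction: str
    src_device: str
    dst_device: str


def parse_record(row: dict) -> FlowRecord:
    """Build a FlowRecord from a dict, failing loudly when a field is missing."""
    missing = [f.name for f in fields(FlowRecord) if f.name not in row]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return FlowRecord(
        src_ip=row["src_ip"], src_port=int(row["src_port"]),
        dst_ip=row["dst_ip"], dst_port=int(row["dst_port"]),
        comm_type=row["comm_type"], instruction=row["instruction"],
        src_device=row["src_device"], dst_device=row["dst_device"],
    )
```

Rejecting incomplete rows at this stage keeps the later grouping and similarity steps from silently working on partial records.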

The method in S102 is:

Content is extracted from the acquired inter-network traffic data to generate the node data table V and the relation data table E respectively.

S1021: all source IPs, destination IPs and their corresponding instruction sets are extracted and, after grouping, formed into a data table with two columns, recorded as the node data table V, whose content takes the form shown in Table 1.

Table 1 data node table
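The grouping in S1021 can be sketched in plain Python as follows. This is an illustrative sketch rather than the Spark job itself: the function name is hypothetical, and it assumes that an instruction is attributed to the IP that sent it, which the text suggests ("similarity of IP sending instructions") but does not state outright.

```python
def build_node_table(records):
    """Group instructions by IP and de-duplicate them.

    records: iterable of (src_ip, dst_ip, instruction) tuples.
    Returns the two-column table V as a list of (ip, [instructions]) rows.
    Assumption: an instruction belongs to the IP that sent it.
    """
    table = {}
    for src_ip, dst_ip, instruction in records:
        table.setdefault(src_ip, set()).add(instruction)
        # Destination-only IPs still get a row, with an empty instruction set.
        table.setdefault(dst_ip, set())
    return [(ip, sorted(instrs)) for ip, instrs in sorted(table.items())]
```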

S1022: all communication records between source IPs and destination IPs are extracted and de-duplicated to obtain the one-to-one correspondence between source IP and destination IP. These are joined with the node data table V obtained in S1021 to obtain the instruction set corresponding to each IP. The similarity of the instruction sets of the source IP and the destination IP is then calculated using the Levenshtein distance algorithm to generate the field "instruction similarity". The result is a relation data table with four columns, whose fields are: source IP, destination IP, communication relation, instruction similarity. This table is recorded as the relation data table E, whose content takes the form shown in Table 2; the instructions and IPs in Table 2 have been anonymized.

Table 2 data relationship table
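The construction of the four-column table E in S1022 can be sketched as below. The function and parameter names are hypothetical, and the similarity measure is passed in as a callable so that the sketch does not commit to a particular scoring rule.

```python
def build_relation_table(pairs, node_table, similarity):
    """Build the relation data table E.

    pairs: iterable of (src_ip, dst_ip) communication records.
    node_table: dict mapping ip -> list of instructions (table V).
    similarity: callable taking two instruction lists, returning a score.
    Returns rows of (src_ip, dst_ip, relationship, similarity).
    """
    rows = []
    for src_ip, dst_ip in sorted(set(pairs)):  # de-duplicate directed pairs
        score = similarity(node_table.get(src_ip, []), node_table.get(dst_ip, []))
        rows.append((src_ip, dst_ip, "communication", score))
    return rows
```

Direction is preserved here: (a, b) and (b, a) remain separate rows, which matches the directed network described for S103.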

The Levenshtein distance algorithm, also called the edit distance algorithm, computes the minimum number of editing operations required to change one string into another. The permitted editing operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the greater the similarity of the two strings. In the processed relation table, by contrast, a larger value represents a greater similarity.
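A minimal sketch of the edit distance described above, using the standard dynamic-programming formulation. The text does not say how an instruction set is turned into a string or how the distance is converted into the "larger is more similar" value stored in table E, so `similarity_score` (a 0 to 100 scaling) and `instruction_similarity` (sorted instructions joined with commas) are assumptions made only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity_score(a: str, b: str) -> int:
    """Hypothetical scaling: 100 for identical strings, lower as distance grows."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


def instruction_similarity(set_a, set_b) -> int:
    """Assumed serialization: sorted instructions joined with commas."""
    return similarity_score(",".join(sorted(set_a)), ",".join(sorted(set_b)))
```

For example, `levenshtein("kitten", "sitting")` returns 3 (two substitutions and one insertion).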

The method in S103 is: the power terminal communication network is mapped into a directed weighted network with IPs as nodes, communication relations as edges and the similarity of the instructions sent by the IPs as edge weights. Specifically, the GraphFrames graph algorithm package is imported into the Spark big data computing platform environment, and the graph relation of tables V and E is generated and recorded as the graph relation G, as shown in FIG. 2.

Spark is a fast, general-purpose computing engine designed for large-scale data processing. It has grown into a rapidly developing and widely used ecosystem, and is characterized by high speed, ease of use and generality.

The GraphFrames graph algorithm package is a library built on Spark DataFrames. It takes advantage of the scalability and performance of DataFrames and provides a unified graph processing API for Scala, Java and Python. Its advantages include a unified API, powerful queries, graph storage and loading, and easy portability.
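To make the directed weighted network concrete, the sketch below builds it as a plain-Python adjacency map from rows shaped like table E. It is not GraphFrames code and the function names are hypothetical. In GraphFrames itself, the vertex DataFrame must carry an `id` column and the edge DataFrame must carry `src` and `dst` columns, which is why table E uses those column names.

```python
def build_graph(vertices, edges):
    """Directed weighted graph as {src: {dst: weight}}.

    vertices: iterable of IP strings (table V keys).
    edges: iterable of (src, dst, relationship, similarity) rows (table E).
    """
    graph = {v: {} for v in vertices}
    for src, dst, _relationship, weight in edges:
        graph.setdefault(src, {})[dst] = weight
        graph.setdefault(dst, {})  # make sure every endpoint is a node
    return graph


def out_strength(graph, node):
    """Sum of the weights on the edges leaving a node."""
    return sum(graph[node].values())


# Three rows taken from Table 2 of this document.
rows = [
    ("31.119.0.185", "31.119.0.219", "communication", 35),
    ("31.119.0.185", "31.119.0.177", "communication", 49),
    ("31.119.0.177", "31.119.0.185", "communication", 49),
]
g = build_graph([], rows)
```

With these three rows, `out_strength(g, "31.119.0.185")` is 35 + 49 = 84, and `31.119.0.219` is a node with no outgoing edges.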

The method in S104 is: based on the graph relation G generated in step S103, the classical Louvain community discovery algorithm is applied to perform community mining, yielding clearly separated communication groups G-C in the form shown in FIG. 3 (community relation graph G-C).

The Louvain algorithm is a modularity-based algorithm. Its principle is to evaluate a partition by the difference between the within-module cohesion of that partition and the cohesion of a random partition, in order to find the partition with the best module cohesion. The specific algorithm and formula are as follows:

The algorithm repeatedly traverses the nodes in the network. Each node is taken out of its original community, the modularity gain produced by adding it to each community is calculated, and the node is added to the community with the largest gain; this continues until no node can be moved. Each community is then merged into a super node, and the above steps are repeated until the modularity no longer increases. The modularity gain is the change in modularity after a node is taken out of its original community and added to another community, and is calculated as follows:

In the formula, Σ_in represents the sum of the weights of all edges inside community C; C represents the community that node i is to join; i represents the node to be moved; k_i,in represents the sum of the weights of all edges from node i to community C; Σ_tot represents the sum of the weights of all edges connected to community C.
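The formula itself is not reproduced in this text. As a hedged illustration, the gain from placing an isolated node i into community C can be written, in the usual undirected formulation of Louvain, as ΔQ = k_i,in / m − Σ_tot · k_i / (2m²). Here m (the total edge weight of the graph) and k_i (the total weight of the edges incident to i) are symbols not defined in the text above and are assumptions of this sketch; Σ_in cancels out of this simplified form. The sketch treats edges as undirected, whereas the network in this document is directed, so it should be read as an illustration of the quantity rather than as the exact formula used.

```python
def modularity(edges, community):
    """Undirected weighted modularity Q of a partition.

    edges: list of (u, v, w) with each undirected edge listed once, no self loops.
    community: dict mapping node -> community label.
    """
    m = sum(w for _, _, w in edges)
    internal, degree = {}, {}
    for u, v, w in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(internal.get(c, 0.0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


def modularity_gain(k_i_in, sigma_tot, k_i, m):
    """Change in Q when an isolated node i joins community C (simplified form)."""
    return k_i_in / m - sigma_tot * k_i / (2 * m * m)
```

A quick consistency check is to compute `modularity` before and after a move on a small graph and compare the difference with `modularity_gain`; the two agree up to floating-point rounding.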

The method in S105 is: from the communication groups calculated in step S104, node relations that are far from every group or isolated are screened out to generate a free relation table P, whose content takes the form shown in FIG. 4. Communication group category labels are established in combination with the specific communication service types and characteristics to generate a communication relation classification result table R, whose content takes the form shown in FIG. 5. This completes the classification of the interconnection relations among power distribution terminal devices in the power industry.
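The screening in S105 can be sketched as follows. The text does not define precisely what counts as "far from every group or isolated", so the rule used here is only one possible reading: a relation is treated as free when its two endpoints fall in different communities, or when its community has fewer than `min_size` members. The function name, the `min_size` threshold and the label mapping are all hypothetical.

```python
from collections import Counter


def split_relations(edges, community, labels, min_size=3):
    """Split relations into a free table P and a classified table R.

    edges: iterable of (src, dst, weight).
    community: dict mapping node -> community id (output of group discovery).
    labels: dict mapping community id -> business label chosen by an analyst.
    Assumed rule: cross-community or small-community relations are free.
    """
    size = Counter(community.values())
    free, classified = [], []
    for src, dst, weight in edges:
        c_src, c_dst = community[src], community[dst]
        if c_src != c_dst or size[c_src] < min_size:
            free.append((src, dst, weight))
        else:
            classified.append((src, dst, weight, labels.get(c_src, "unlabeled")))
    return free, classified
```

The business labels themselves come from domain knowledge of the communication services, so they are supplied as input rather than derived in the code.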

The invention also provides an interconnection relation classification system based on the Spark graph algorithm, which comprises the following modules:

a data acquisition module, used for acquiring communication traffic data among the power distribution terminal networks;

a data processing module, used for generating a node data table V and a relation data table E from the inter-network communication traffic data;

a graph relation generation module, used for generating a graph relation G by applying a Spark graph algorithm;

a communication group discovery module, used for performing group discovery using the Louvain algorithm;

a group classification module, used for setting group labels according to the result obtained by the communication group discovery module and classifying the groups in combination with the service, and for screening out free relations and recording them as a free relation table P.

The contents of each module are specifically described below:

the method in the data acquisition module is as follows:

the method in the data processing module is as follows:

Table 1 data node table

Table 2 data relationship table

src             dst             relationship   similarity
31.119.0.185    31.119.0.219    communication  35
31.118.224.103  31.118.244.171  communication  32
31.118.244.171  31.118.188.7    communication  27
31.14.128.12    31.119.0.185    communication  18
31.119.0.185    31.119.0.177    communication  49
31.118.168.13   31.118.224.103  communication  29
31.119.0.177    31.118.224.107  communication  27
31.118.244.171  31.119.0.177    communication  26
31.119.0.185    31.118.140.24   communication  26
31.118.224.107  31.118.140.24   communication  28
31.14.128.12    31.119.0.177    communication  35
31.118.188.7    31.118.84.13    communication  44
31.119.0.177    31.119.0.185    communication  49
31.118.152.51   31.119.2.77     communication  29
31.118.140.24   31.119.0.219    communication  37

The method in the graph relation generation module is as follows: the power terminal communication network is mapped into a directed weighted network with IPs as nodes, communication relations as edges and the similarity of the instructions sent by the IPs as edge weights. Specifically, the GraphFrames graph algorithm package is imported into the Spark big data computing platform environment, and the graph relation of tables V and E is generated and recorded as the graph relation G, as shown in FIG. 2.

The method in the communication group discovery module is as follows: based on the generated graph relation G, the classical Louvain community discovery algorithm is applied to perform community mining, yielding clearly separated communication groups G-C in the form shown in FIG. 3 (community relation graph G-C).

In the formula, Σ_in represents the sum of the weights of all edges inside community C; C represents the community that node i is to join; i represents the node to be moved; k_i,in represents the sum of the weights of all edges from node i to community C; Σ_tot represents the sum of the weights of all edges connected to community C.
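The node-moving phase that the communication group discovery module relies on can be sketched as below. This is a simplified illustration under stated assumptions, not the module's implementation: it covers only the first Louvain phase (moving single nodes until no move improves modularity) and omits the merging of communities into super nodes; it treats edges as undirected; and the symbols m and k_i (total edge weight and node degree) are not defined in the text above. The function name is hypothetical.

```python
def local_moving(nodes, edges):
    """First Louvain phase: greedily move nodes between communities.

    nodes: list of node ids; edges: list of (u, v, w), undirected, no self loops.
    Returns a dict mapping node -> community label.
    """
    adj = {n: {} for n in nodes}
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w
    m = sum(w for _, _, w in edges)
    degree = {n: sum(adj[n].values()) for n in nodes}
    community = {n: n for n in nodes}  # every node starts alone
    sigma_tot = dict(degree)           # total degree per community

    def gain(k_in, tot, k_i):
        return k_in / m - tot * k_i / (2 * m * m)

    moved = True
    while moved:
        moved = False
        for i in nodes:
            old = community[i]
            sigma_tot[old] -= degree[i]  # take i out of its community
            k_in = {}                    # weight from i to each neighbouring community
            for j, w in adj[i].items():
                k_in[community[j]] = k_in.get(community[j], 0.0) + w
            best = old
            best_gain = gain(k_in.get(old, 0.0), sigma_tot[old], degree[i])
            for c, k in k_in.items():
                g = gain(k, sigma_tot[c], degree[i])
                if g > best_gain:
                    best, best_gain = c, g
            community[i] = best
            sigma_tot[best] += degree[i]
            if best != old:
                moved = True
    return community
```

On a graph made of two triangles joined by a single edge, this phase groups each triangle into its own community.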

The method in the group classification module is as follows: from the communication groups obtained by the above calculation, node relations that are far from every group or isolated are screened out to generate a free relation table P, whose content takes the form shown in FIG. 4. Communication group category labels are established in combination with the specific communication service types and characteristics to generate a communication relation classification result table R, whose content takes the form shown in FIG. 5. This completes the classification of the interconnection relations among power distribution terminal devices in the power industry.

The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The interconnection relation classification method based on the spark graph algorithm is characterized by comprising the following steps of:

S105, according to the result of step S104, setting group labels in combination with the service, and classifying the groups; screening out free relations and recording them as a free relation table P;

the method in step S102 is as follows:

S1022, extracting all communication records between source IPs and destination IPs of the inter-network traffic data and de-duplicating them to obtain the one-to-one correspondence between source IP and destination IP; joining with the node data table V obtained in step S1021 to obtain the instruction set corresponding to each IP; then calculating the similarity of the instruction sets of the source IP and the destination IP using the Levenshtein distance algorithm to generate a field "instruction similarity"; and finally obtaining a relation data table with four columns, whose fields are: source IP, destination IP, communication relation, instruction similarity; this table is recorded as the relation data table E;

the method in step S105 is: based on the communication group calculated in the step S104, screening node relations far away from each group or isolated, and generating a free relation table; and establishing a communication group category label by combining specific communication service types and characteristics.

2. The interconnection relationship classification method based on spark graph algorithm according to claim 1, wherein the method in step S101 is:

3. The interconnection relationship classification method based on spark graph algorithm according to claim 1 or 2, wherein the method in step S104 is: based on the graph relationship G generated in the step S103, applying a Louvain community discovery algorithm to conduct community mining to obtain a communication group with obvious classification.

4. An interconnection relation classification system based on spark graph algorithm is characterized by comprising

the group classification module is used for carrying out group label setting according to the discovery result of the communication group discovery module and carrying out group classification in combination with the service; screening the free relation and marking the free relation as a free relation table P;

the method for generating the node data table V and the relation data table E in the data processing module comprises the following steps:

the group classification module screens node relations far away from each group or isolated based on the obtained communication groups to generate a free relation table; and establishing a communication group category label by combining specific communication service types and characteristics.

5. The interconnection relation classification system based on spark graph algorithm according to claim 4, wherein the data acquisition module specifically acquires: acquiring flow data among power distribution terminals from a data management and control department of an electric company, wherein the time period is one week; the data comprises a five-tuple structure, and at least comprises a source IP, a source port, a destination IP, a destination port, a communication type, a communication instruction, source equipment information and destination equipment information.

6. The interconnection relation classification system based on spark graph algorithm according to claim 4, wherein the method of group discovery in the communication group discovery module is as follows: based on the graph relation G, applying a Louvain community discovery algorithm to carry out community mining to obtain a communication group with obvious classification.