CN112714080A - Interconnection relation classification method and system based on spark graph algorithm - Google Patents

Interconnection relation classification method and system based on spark graph algorithm

Info

Publication number
CN112714080A
Authority
CN
China
Prior art keywords
relation
communication
algorithm
data table
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011543625.4A
Other languages
Chinese (zh)
Other versions
CN112714080B (en)
Inventor
陶景龙
梁淑云
刘胜
马影
王启凡
魏国富
殷钱安
余贤喆
周晓勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd filed Critical Information and Data Security Solutions Co Ltd
Priority to CN202011543625.4A priority Critical patent/CN112714080B/en
Publication of CN112714080A publication Critical patent/CN112714080A/en
Application granted granted Critical
Publication of CN112714080B publication Critical patent/CN112714080B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441 Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to an interconnection relation classification method based on a Spark graph algorithm, comprising: generating a node data table V and a relation data table E; applying a Spark graph algorithm to the node data table V and the relation data table E to generate a graph relation G; performing communication group discovery with the Louvain algorithm; setting group labels with reference to the service and classifying the groups; and screening out free relations and recording them as a free relation table P. A system based on the method is also provided. The method maps the power terminal communication network to a directed weighted network in which IPs are nodes, communication relations are edges, and the similarity of the instructions sent by the IPs is the edge weight. It applies the classic Louvain community detection algorithm for community mining and uses a relation weight based on the similarity of device communication instructions as a salient feature, which improves the classification effect of the algorithm, so that the interconnection relations among communication devices are classified accurately and efficiently into clearly separated communication groups.

Description

Interconnection relation classification method and system based on spark graph algorithm
Technical Field
The invention relates to the technical field of network traffic classification, and in particular to an interconnection relation classification method and system based on a Spark graph algorithm.
Background
Terminal devices in power distribution networks communicate with one another. As construction requirements and services develop, communication devices and facilities become more numerous and networking relations more complex, and it becomes harder to tell which communication function category an interconnection relation between devices belongs to. At present this need is met by manual verification: traffic logs between devices are screened and classified using keyword matching or fixed rules. This traditional approach is time-consuming and labor-intensive, its accuracy is hard to guarantee, and as communication functions grow more complex the configured keywords and fixed rules must be maintained continuously, so classification results may not be produced in time. To improve efficiency and accuracy and to reduce cost, automatic classification of the interconnection relations of communication devices by data mining is becoming increasingly important.
Manual verification depends mainly on how well the staff understand the power distribution networking service and the communication types of power distribution terminals. On that basis a keyword matching library and fixed rule entries are established, the communication traffic logs between devices are screened and analyzed, and the classification categories are finally obtained. This traditional method of classifying device interconnection relations has low efficiency, low accuracy and high later maintenance cost.
For example, application number CN202010537734.9 discloses a device and method for classifying network traffic based on Spark performance optimization, belonging to the technical field of network traffic classification. Its data preprocessing module acquires the original traffic data and extracts time-related features; its model training module classifies the network traffic; its real-time classification module loads the classification model trained by the model training module onto the data processed by the preprocessing module and classifies the data under the Topic; and its Spark performance optimization module provides performance optimization support for the model training module and the real-time classification module. Through a Spark Shuffle performance optimization architecture and a weighted random forest algorithm, that invention classifies network traffic quickly and accurately, allows a network service provider to offer different service strategies for different application scenarios, improves network service quality, and supports network security. Although that method classifies network traffic with a random forest algorithm, it cannot be applied directly to power network communication data, which differs greatly from general network traffic data.
In summary, prior-art methods for classifying device interconnection relations cannot classify the interconnection relations among power distribution terminal devices in the power industry accurately, efficiently and at low cost. Automatic classification of these relations by data mining is therefore increasingly important.
Disclosure of Invention
The technical problem to be solved by the invention is that the interconnection relationship among the power distribution terminal devices in the power industry cannot be accurately, efficiently and cheaply classified in the prior art.
The invention solves the technical problems through the following technical means:
An interconnection relation classification method based on a Spark graph algorithm comprises the following steps:
S101, data acquisition: acquiring inter-network communication traffic data of power distribution terminals;
S102, data processing: generating a node data table V and a relation data table E;
S103, graph relation generation: based on the node data table V and the relation data table E, mapping the power terminal communication network to a directed weighted network in which IPs are nodes, communication relations are edges, and the similarity of the instructions sent by the IPs is the edge weight, and applying a Spark graph algorithm to generate a graph relation G;
S104, communication group discovery: performing communication group discovery on the graph relation G with the Louvain algorithm;
S105, group classification: according to the result of step S104, setting group labels with reference to the service and classifying the groups; and screening out free relations and recording them as a free relation table P.
The method maps the power terminal communication network to a directed weighted network in which IPs are nodes, communication relations are edges, and the similarity of the instructions sent by the IPs is the edge weight. It applies the classic Louvain community detection algorithm for community mining and uses a relation weight based on the similarity of device communication instructions as a salient feature, which improves the classification effect of the algorithm. The interconnection relations among communication devices are thereby classified accurately and efficiently into clearly separated communication groups, and group type labels designed with reference to power network service logic yield power distribution terminal network communication relation groups with actual service types. This solves the problems of low efficiency, low accuracy and high later maintenance cost in the prior art.
Further, the method in step S101 is:
acquiring inter-network flow data of the power distribution terminal from a data management and control department of an electric power company, wherein the time period is one week; the data contains a five-tuple structure, which at least comprises a source IP, a source port, a destination IP, a destination port, a communication type, a communication instruction, source equipment information and destination equipment information.
Further, the method in step S102 is:
S1021, extracting all source IPs and destination IPs of the inter-network traffic data together with their corresponding instruction sets, and, after grouping, aggregation and de-duplication, forming a two-column data table recorded as the node data table V;
S1022, extracting all communication records between source IPs and destination IPs of the inter-network traffic data; de-duplicating them to obtain the one-to-one correspondence between source IP and destination IP; obtaining the instruction set of each IP from the node data table V produced in step S1021; calculating the instruction set similarity of the source IP and the destination IP with the Levenshtein distance algorithm to generate the field "instruction similarity"; and finally obtaining a four-column relation data table with the fields source IP, destination IP, communication relation and instruction similarity, recorded as the relation data table E.
Further, the method in step S104 is: based on the graph relationship G generated in step S103, a Louvain community discovery algorithm is applied to perform community mining, so as to obtain a communication group with an obvious classification.
Further, the method in step S105 is: based on the communication groups calculated in step S104, screening node relationships that are far away from or isolated from each group, and generating a free relationship table; and establishing a communication group category label by combining the specific communication service type and characteristics.
The invention also provides an interconnection relation classification system based on spark graph algorithm, which comprises
The data acquisition module is used for acquiring the data of the communication flow between the power distribution terminal networks;
the data processing module is used for generating a node data table V and a relation data table E;
the graph relation generating module is used for generating a graph relation G by applying a Spark graph algorithm based on the node data table V and the relation data table E;
the communication group discovery module is used for discovering the communication group by using the Louvain algorithm;
the group classification module is used for performing group label setting by combining services according to the discovery result of the communication group discovery module and performing group classification; and screening the free relation to be recorded as a free relation table P.
Further, the data acquisition module specifically acquires the following processes: acquiring inter-network flow data of the power distribution terminal from a data management and control department of an electric power company, wherein the time period is one week; the data contains a five-tuple structure, which at least comprises a source IP, a source port, a destination IP, a destination port, a communication type, a communication instruction, source equipment information and destination equipment information.
Further, the method for generating the node data table V and the relationship data table E in the data processing module includes:
S1021, extracting all source IPs and destination IPs of the inter-network traffic data together with their corresponding instruction sets, and, after grouping, aggregation and de-duplication, forming a two-column data table recorded as the node data table V;
S1022, extracting all communication records between source IPs and destination IPs of the inter-network traffic data; de-duplicating them to obtain the one-to-one correspondence between source IP and destination IP; obtaining the instruction set of each IP from the node data table V produced in step S1021; calculating the instruction set similarity of the source IP and the destination IP with the Levenshtein distance algorithm to generate the field "instruction similarity"; and finally obtaining a four-column relation data table with the fields source IP, destination IP, communication relation and instruction similarity, recorded as the relation data table E.
Further, the method for group discovery in the communication group discovery module is as follows: and based on the graph relation G, carrying out community mining by applying a Louvain community discovery algorithm to obtain a communication group with obvious classification.
Further, the group classification module screens node relationships far away from or isolated from each group based on the obtained communication groups to generate a free relationship table; and establishing a communication group category label by combining the specific communication service type and characteristics.
The invention has the advantages that:
The method maps the power terminal communication network to a directed weighted network in which IPs are nodes, communication relations are edges, and the similarity of the instructions sent by the IPs is the edge weight. It applies the classic Louvain community detection algorithm for community mining and uses a relation weight based on the similarity of device communication instructions as a salient feature, which improves the classification effect of the algorithm. The interconnection relations among communication devices are thereby classified accurately and efficiently into clearly separated communication groups, and group type labels designed with reference to power network service logic yield power distribution terminal network communication relation groups with actual service types. This solves the problems of low efficiency, low accuracy and high later maintenance cost in the prior art.
Drawings
FIG. 1 is a block flow diagram of a classification method in an embodiment of the invention;
FIG. 2 is the graph relation G generated from the node data table V and the relation data table E in an embodiment of the present invention;
FIG. 3 is the communication group G-C obtained by community mining on the graph relation G in an embodiment of the present invention;
FIG. 4 is the free relation table P generated based on the communication group G-C in an embodiment of the present invention;
FIG. 5 is the communication relation classification result table R generated based on the communication group G-C in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
This embodiment provides an interconnection relation classification method based on a Spark graph algorithm, which addresses the prior-art problem that the interconnection relations among power distribution terminal devices in the power industry cannot be classified accurately, efficiently and at low cost.
As shown in fig. 1, a spark graph algorithm-based interconnection relationship classification method specifically includes the following steps:
S101, data acquisition: acquiring inter-network communication traffic data of power distribution terminals;
S102, data processing: generating a node data table V and a relation data table E from the inter-network communication traffic data;
S103, graph relation generation: applying a Spark graph algorithm to generate a graph relation G;
S104, communication group discovery: performing group discovery with the Louvain algorithm;
S105, group classification: according to the result of step S104, setting group labels with reference to the service and classifying the groups; and screening out free relations and recording them as a free relation table P.
The contents of each step are specifically described as follows:
the method in S101 comprises the following steps:
acquiring inter-network flow data of the power distribution terminal from a data management and control department of an electric power company, wherein the time period is at least one week; the data typically comprises a five-tuple structure including, but not limited to: source IP, source port, destination IP, destination port, communication type, communication instruction, source device information, destination device information, etc.;
the method in S102 comprises the following steps:
extracting contents aiming at the acquired internetwork flow data, and respectively generating a node data table V and a relation data table E;
S1021: extract all source IPs, destination IPs and their corresponding instruction sets, and, after grouping, aggregation and de-duplication, form a two-column data table, recorded as the node data table V, whose content has the form shown in Table 1.
Table 1 Data node table
[Table 1 appears as an image in the original publication and is not reproduced here.]
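The grouping, aggregation and de-duplication of step S1021 can be sketched in plain Python. This is a minimal sketch under stated assumptions: the record layout `(source IP, destination IP, instruction)`, the sample values, and the choice to attribute each instruction to the IP that sent it are illustrative and are not taken from the patent's data.

```python
# Hypothetical flow records: (source IP, destination IP, instruction).
# Layout and values are illustrative assumptions, not the patent's data.
flows = [
    ("10.0.0.1", "10.0.0.2", "READ"),
    ("10.0.0.1", "10.0.0.2", "READ"),   # duplicate record
    ("10.0.0.1", "10.0.0.3", "WRITE"),
    ("10.0.0.2", "10.0.0.3", "READ"),
]

def build_node_table(records):
    """Group instructions by IP and de-duplicate them (node table V)."""
    instructions = {}
    for src, dst, instruction in records:
        # Assumption: an instruction belongs to the IP that sent it.
        instructions.setdefault(src, set()).add(instruction)
        # Destination-only IPs still get a row, with an empty set.
        instructions.setdefault(dst, set())
    # Two columns: the IP and its sorted, de-duplicated instruction set.
    return [(ip, sorted(cmds)) for ip, cmds in sorted(instructions.items())]

V = build_node_table(flows)
for row in V:
    print(row)
```

Using a set per IP makes de-duplication implicit; sorting only serves to make the output deterministic.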
S1022: extract all communication records between source IPs and destination IPs; de-duplicate them to obtain the one-to-one correspondence between source IP and destination IP; obtain the instruction set of each IP from the node table V produced in S1021; calculate the instruction set similarity of the source IP and the destination IP with the Levenshtein distance algorithm to generate the field "instruction similarity"; and finally obtain a four-column relation data table with the fields source IP, destination IP, communication relation and instruction similarity. This table is recorded as the relation table E, whose content is shown in Table 2; the instructions and IPs in Table 2 have been desensitized.
TABLE 2 Data relationship table
[Table 2 appears as an image at this point in the original publication and is not reproduced here.]
The Levenshtein distance, also called the edit distance, is the minimum number of editing operations required to convert one character string into another. The permitted editing operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings. The values in the relation table have been processed so that a larger value indicates greater similarity.
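The edit distance described above can be computed with the usual dynamic-programming recurrence. The sketch below is illustrative only: the text does not say how the distance is turned into the "larger means more similar" value stored in table E, so the `similarity` normalisation and the comma-joined encoding of an instruction set are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,                # delete ca
                current[j - 1] + 1,             # insert cb
                previous[j - 1] + cost,         # substitute ca by cb (or keep)
            ))
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

# Assumption: an instruction set is compared as a comma-joined string.
src_instructions = ["READ", "WRITE"]
dst_instructions = ["READ"]
score = similarity(",".join(src_instructions), ",".join(dst_instructions))
print(levenshtein("kitten", "sitting"), round(score, 2))
```

Keeping only two rows of the distance matrix keeps memory proportional to the shorter dimension, which matters when the comparison runs once per edge of table E.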
The method in S103 is as follows: the power terminal communication network is mapped to a directed weighted network in which IPs are nodes, communication relations are edges, and the similarity of the instructions sent by the IPs is the edge weight. Concretely, the GraphFrames graph algorithm package is imported into the Spark big data computing platform environment, and a graph relation, recorded as the graph relation G, is generated from the V and E tables, as shown in FIG. 2.
Spark is a fast, general-purpose computing engine designed for large-scale data processing. It has grown into a rapidly developing and widely used ecosystem, and is characterized by high speed, ease of use and generality.
The GraphFrames graph algorithm package is a library built on Spark DataFrames. It benefits from the scalability and performance of DataFrames, provides a uniform graph processing API for Scala, Java and Python, and offers a unified API, powerful queries, graph saving and loading, and easy portability.
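GraphFrames builds a graph from a vertex DataFrame with an `id` column and an edge DataFrame with `src` and `dst` columns. The sketch below mirrors that layout with plain Python rows so that it runs without a Spark cluster; it is not GraphFrames code, and the helper `build_graph` and the sample rows are hypothetical.

```python
def build_graph(vertices, edges):
    """Return {src: {dst: weight}} for a directed weighted graph."""
    graph = {v["id"]: {} for v in vertices}
    for e in edges:
        for endpoint in (e["src"], e["dst"]):
            if endpoint not in graph:
                raise KeyError(f"edge endpoint {endpoint} is missing from the vertex table")
        # The instruction similarity is used as the edge weight.
        graph[e["src"]][e["dst"]] = e["similarity"]
    return graph

# Rows shaped like tables V and E; the values are hypothetical.
vertices = [{"id": "10.0.0.1"}, {"id": "10.0.0.2"}, {"id": "10.0.0.3"}]
edges = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "relationship": "communication", "similarity": 35},
    {"src": "10.0.0.2", "dst": "10.0.0.1", "relationship": "communication", "similarity": 35},
    {"src": "10.0.0.2", "dst": "10.0.0.3", "relationship": "communication", "similarity": 18},
]

G = build_graph(vertices, edges)
out_degree = {ip: len(targets) for ip, targets in G.items()}
print(out_degree)
```

Because the graph is directed, the two edges between `10.0.0.1` and `10.0.0.2` are stored separately, one per direction.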
The method in S104 is as follows: based on the graph relation G generated in step S103, the classic Louvain community discovery algorithm is applied for community mining to obtain clearly classified communication groups G-C, whose form is shown in FIG. 3 (community relation graph G-C).
The Louvain algorithm is a modularity-based algorithm. It evaluates a partition by the difference between the cohesion of the modules in that partition and the cohesion expected of a random partition, and searches for the partition with the best module cohesion. The specific algorithm and formula are as follows:
The nodes in the network are traversed repeatedly. Each node is taken out of its original community, the modularity increment produced by adding it to each community is calculated, and the node is added to the community with the largest increment; this continues until no node can move. Each community is then merged into a super node, and the above steps are repeated until the modularity no longer increases. The modularity increment is the change in modularity when a node is taken out of its original community and added to another community, and is calculated as follows:
[Formula for the modularity increment; it appears as an image in the original publication and is not reproduced here.]
In the formula, Σin denotes the sum of the weights of all edges inside community C; C denotes the community that node i is to join; i denotes the node to be moved; k_{i,in} denotes the sum of the weights of all edges from node i to community C; and Σtot denotes the sum of the weights of all edges connected to community C.
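The node-moving phase described above can be sketched as follows. This is a minimal sketch, not the patent's implementation. It assumes edge directions are ignored and weights are symmetric, and it omits the merging of communities into super nodes. The gain expression is derived from the definition of modularity for an isolated node joining community C; it may be written differently from the formula image in the publication, and the symbols `k_i` (weighted degree of node i) and `m` (total edge weight) are not defined in the text.

```python
from collections import defaultdict

def modularity_gain(k_i_in, sigma_tot, k_i, m):
    """Modularity change when an isolated node i joins community C."""
    return k_i_in / m - sigma_tot * k_i / (2.0 * m * m)

def local_moving(graph):
    """One node-moving phase on a symmetric graph {node: {neighbour: weight}}."""
    degree = {n: sum(nbrs.values()) for n, nbrs in graph.items()}   # k_i
    m = sum(degree.values()) / 2.0                                   # total edge weight
    community = {n: n for n in graph}                                # singleton start
    sigma_tot = dict(degree)                                         # per-community total
    moved = True
    while moved:
        moved = False
        for node in graph:
            links = defaultdict(float)            # k_i,in for each neighbouring community
            for nbr, w in graph[node].items():
                links[community[nbr]] += w
            old = community[node]
            sigma_tot[old] -= degree[node]        # take the node out of its community
            best = old
            best_gain = modularity_gain(links[old], sigma_tot[old], degree[node], m)
            for com, k_in in links.items():
                gain = modularity_gain(k_in, sigma_tot[com], degree[node], m)
                if gain > best_gain:              # strict, so every move raises modularity
                    best, best_gain = com, gain
            sigma_tot[best] += degree[node]
            if best != old:
                community[node] = best
                moved = True
    return community

def undirected(edge_list):
    """Build a symmetric adjacency dict from (a, b, weight) triples."""
    graph = {}
    for a, b, w in edge_list:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph

# Two triangles joined by a single bridge edge c-d (hypothetical data).
triangles = undirected([
    ("a", "b", 1), ("a", "c", 1), ("b", "c", 1),
    ("d", "e", 1), ("d", "f", 1), ("e", "f", 1),
    ("c", "d", 1),
])
partition = local_moving(triangles)
print(partition)
```

On this toy graph the phase ends with each triangle in its own community. A full Louvain run would then collapse each community into a super node and repeat.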
The method in S105 is as follows: from the communication groups calculated in step S104, node relations that are far from every group or isolated are screened out to generate the free relation table P, whose content has the form shown in FIG. 4. Communication group category labels are then established according to the specific communication service types and characteristics, and the communication relation classification result table R is generated, whose content has the form shown in FIG. 5. This completes the classification of interconnection relations among power distribution terminal devices in the power industry.
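The screening and labelling of step S105 might look like the sketch below. The text does not state how "far from every group or isolated" is decided, so the rule used here is an assumption: an edge is kept as grouped only if both endpoints share a community with at least `min_size` members, and every other edge is treated as a free relation. The community assignment, the threshold and the label text are all hypothetical.

```python
from collections import Counter

def split_free_relations(edges, community, min_size=3):
    """Split edges into free relations (table P) and grouped relations."""
    sizes = Counter(community.values())
    free, grouped = [], []
    for src, dst in edges:
        same_group = community[src] == community[dst]
        if same_group and sizes[community[src]] >= min_size:
            grouped.append((src, dst, community[src]))
        else:
            # Assumed rule: small groups and cross-group links count as free.
            free.append((src, dst))
    return free, grouped

# Hypothetical community assignment and edge list.
community = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
labels = {0: "group-0 (hypothetical service label)"}

P, grouped = split_free_relations(edges, community)
# Classification result table R: each grouped edge with its group label.
R = [(src, dst, labels[group]) for src, dst, group in grouped]
print(P)
print(R)
```

In practice the labels would come from the business review of each group, as the text describes; the dictionary only stands in for that manual step.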
The method maps the power terminal communication network to a directed weighted network in which IPs are nodes, communication relations are edges, and the similarity of the instructions sent by the IPs is the edge weight. It applies the classic Louvain community detection algorithm for community mining and uses a relation weight based on the similarity of device communication instructions as a salient feature, which improves the classification effect of the algorithm. The interconnection relations among communication devices are thereby classified accurately and efficiently into clearly separated communication groups, and group type labels designed with reference to power network service logic yield power distribution terminal network communication relation groups with actual service types. This solves the problems of low efficiency, low accuracy and high later maintenance cost in the prior art.
The invention also provides an interconnection relation classification system based on spark graph algorithm, which comprises:
the data acquisition module is used for acquiring the data of the communication flow between the power distribution terminal networks;
the data processing module is used for generating a node data table V and a relation data table E according to the internetwork communication traffic data;
the graph relation generating module is used for applying a Spark graph algorithm to generate a graph relation G;
the communication group discovery module is used for performing group discovery by using the Louvain algorithm;
the group classification module is used for performing group label setting by combining services according to the result obtained by the communication group discovery module and performing group classification; and screening the free relation to be recorded as a free relation table P.
The contents of each module are specifically described as follows:
the method in the data acquisition module comprises the following steps:
acquiring inter-network flow data of the power distribution terminal from a data management and control department of an electric power company, wherein the time period is at least one week; the data typically comprises a five-tuple structure including, but not limited to: source IP, source port, destination IP, destination port, communication type, communication instruction, source device information, destination device information, etc.;
the method in the data processing module comprises the following steps:
extracting contents aiming at the acquired internetwork flow data, and respectively generating a node data table V and a relation data table E;
S1021: extract all source IPs, destination IPs and their corresponding instruction sets, and, after grouping, aggregation and de-duplication, form a two-column data table, recorded as the node data table V, whose content has the form shown in Table 1.
Table 1 Data node table
[Table 1 appears as an image in the original publication and is not reproduced here.]
S1022: extract all communication records between source IPs and destination IPs; de-duplicate them to obtain the one-to-one correspondence between source IP and destination IP; obtain the instruction set of each IP from the node table V produced in S1021; calculate the instruction set similarity of the source IP and the destination IP with the Levenshtein distance algorithm to generate the field "instruction similarity"; and finally obtain a four-column relation data table with the fields source IP, destination IP, communication relation and instruction similarity. This table is recorded as the relation table E, whose content is shown in Table 2; the instructions and IPs in Table 2 have been desensitized.
TABLE 2 Data relationship table
src              dst              relationship    similarity
31.119.0.185     31.119.0.219     communication   35
31.118.224.103   31.118.244.171   communication   32
31.118.244.171   31.118.188.7     communication   27
31.14.128.12     31.119.0.185     communication   18
31.119.0.185     31.119.0.177     communication   49
31.118.168.13    31.118.224.103   communication   29
31.119.0.177     31.118.224.107   communication   27
31.118.244.171   31.119.0.177     communication   26
31.119.0.185     31.118.140.24    communication   26
31.118.224.107   31.118.140.24    communication   28
31.14.128.12     31.119.0.177     communication   35
31.118.188.7     31.118.84.13     communication   44
31.119.0.177     31.119.0.185     communication   49
31.118.152.51    31.119.2.77      communication   29
31.118.140.24    31.119.0.219     communication   37
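The rows of Table 2 can be loaded as a weighted edge list and inspected before any community detection is run. The sketch below copies the rows from the table and computes the connected components with edge directions ignored; the helper `weak_components` is hypothetical, and treating a small separate component as a candidate free relation is an assumption rather than the patent's stated rule.

```python
TABLE_2 = """\
31.119.0.185 31.119.0.219 communication 35
31.118.224.103 31.118.244.171 communication 32
31.118.244.171 31.118.188.7 communication 27
31.14.128.12 31.119.0.185 communication 18
31.119.0.185 31.119.0.177 communication 49
31.118.168.13 31.118.224.103 communication 29
31.119.0.177 31.118.224.107 communication 27
31.118.244.171 31.119.0.177 communication 26
31.119.0.185 31.118.140.24 communication 26
31.118.224.107 31.118.140.24 communication 28
31.14.128.12 31.119.0.177 communication 35
31.118.188.7 31.118.84.13 communication 44
31.119.0.177 31.119.0.185 communication 49
31.118.152.51 31.119.2.77 communication 29
31.118.140.24 31.119.0.219 communication 37
"""

# Parse each row into (src, dst, similarity); the relationship column is constant.
edges = [
    (src, dst, int(sim))
    for src, dst, _relationship, sim in (line.split() for line in TABLE_2.splitlines())
]

def weak_components(edge_list):
    """Connected components of the graph with edge directions ignored."""
    neighbours = {}
    for src, dst, _weight in edge_list:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                      # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components

components = weak_components(edges)
for component in components:
    print(len(component), sorted(component))
```

With these rows, 31.118.152.51 and 31.119.2.77 communicate only with each other and form their own component, separate from the other eleven addresses.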
The Levenshtein distance, also called the edit distance, is the minimum number of editing operations required to convert one character string into another. The permitted editing operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings. The values in the relation table have been processed so that a larger value indicates greater similarity.
The method in the graph relation generation module is as follows: the power terminal communication network is mapped into a directed weighted network in which each IP is a node, each communication relation is an edge, and the similarity of the instructions sent by the IPs is the edge weight. Specifically, the GraphFrames graph algorithm package is imported into the Spark big data computing platform environment, and a graph relation, denoted G, is generated from tables V and E, as shown in FIG. 2.
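The mapping described above can be sketched without a Spark cluster. The following plain-Python stand-in (not the GraphFrames API) builds a directed weighted adjacency structure from a few rows copied from Table 2; the names `build_graph` and `out_edges` are illustrative only. In GraphFrames itself, a graph is constructed from a vertex DataFrame with an `id` column and an edge DataFrame with `src` and `dst` columns, which matches the column names of Table 2.

```python
from collections import defaultdict

# A few rows of relationship table E, copied from Table 2:
# (src, dst, relationship, similarity)
E = [
    ("31.119.0.185", "31.119.0.219", "communication", 35),
    ("31.119.0.185", "31.119.0.177", "communication", 49),
    ("31.119.0.177", "31.119.0.185", "communication", 49),
    ("31.14.128.12", "31.119.0.185", "communication", 18),
]


def build_graph(edges):
    """Return (nodes, out_edges), where out_edges[src][dst] is the edge weight."""
    out_edges = defaultdict(dict)
    nodes = set()
    for src, dst, _relationship, similarity in edges:
        out_edges[src][dst] = similarity  # direction src -> dst is preserved
        nodes.update((src, dst))
    return nodes, dict(out_edges)


nodes, G = build_graph(E)
print(sorted(nodes))
print(G["31.119.0.185"])
```

Keeping `src -> dst` and `dst -> src` as separate entries preserves the direction of each communication record, which is what makes the network directed rather than undirected.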
Here, Spark refers to a fast, general-purpose computing engine designed for large-scale data processing. It has grown into a rapidly developing and widely used ecosystem, and is characterized by speed, ease of use, and generality.
Here, the GraphFrames graph algorithm package refers to a graph processing library built on Spark DataFrames. It benefits from the scalability and performance of DataFrames, provides a uniform graph processing API for Scala, Java, and Python, and offers powerful queries, saving and loading of graphs, and easy portability.
The method in the communication group discovery module is as follows: based on the generated graph relation G, the classical Louvain community discovery algorithm is applied to perform community mining, yielding clearly separated communication groups G-C, whose form is shown in FIG. 3 (community relationship graph G-C).
Here, the Louvain algorithm refers to a modularity-based algorithm. Its principle is to evaluate a partition by the difference between the within-module cohesion of that partition and the cohesion of a random partition, and to search for the partition with the best module cohesion. The specific algorithm and formula are as follows:
Traverse the nodes of the network repeatedly. For each node, take it out of its original community, calculate the modularity increment produced by adding it to each community, and add the node to the community with the largest increment; continue until no node can be moved. Then merge each community into a super node and repeat these steps until the modularity no longer increases. The modularity increment is the change in modularity when a node is taken out of its original community and added to another community; its calculation formula is as follows:
Figure BDA0002854604610000111
In the formula, Σ_in is the sum of the weights of all edges inside community C; C is the community that node i is to join; i is the node to be moved; k_i,in is the sum of the weights of all edges from node i to community C; Σ_tot is the sum of the weights of all edges connected to community C.
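The formula itself survives here only as a figure reference. For orientation, the modularity increment of the Louvain method as commonly stated in the literature is given below; that the figure shows exactly this form is an assumption. The symbols k_i (the sum of the weights of all edges attached to node i) and m (the sum of all edge weights in the network) do not appear in the definitions above and are introduced here as part of that assumption.

```latex
\Delta Q =
  \left[ \frac{\Sigma_{in} + k_{i,in}}{2m}
       - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^{2} \right]
- \left[ \frac{\Sigma_{in}}{2m}
       - \left( \frac{\Sigma_{tot}}{2m} \right)^{2}
       - \left( \frac{k_i}{2m} \right)^{2} \right]
= \frac{k_{i,in}}{2m} - \frac{\Sigma_{tot}\, k_i}{2m^{2}}
```

The simplified right-hand side shows that the gain depends only on how strongly node i is tied to community C (k_i,in) relative to what a random assignment of weight would predict (Σ_tot · k_i / m).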
The method in the group classification module is as follows: based on the communication groups obtained above, node relations that lie far from every group, or that are isolated, are screened out to generate a free relation table P, whose content is shown in FIG. 4. Communication group category labels are then established according to the specific communication service types and characteristics, and a communication relation classification result table R is generated, whose content is shown in FIG. 5. This completes the classification of interconnection relations among power distribution terminal devices in the power industry.
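The screening and labeling step can be sketched as follows. The source does not state the criterion for a relation being "far from" the groups, so this sketch assumes one: a relation is free if its two endpoints are in different communities, or if their community has fewer than `min_group_size` members. The community assignment, the threshold, and the label text "group A" are all hypothetical placeholders; only the edge rows are taken from Table 2.

```python
from collections import Counter


def classify_relations(edges, community_of, labels, min_group_size=3):
    """Split edges into a labeled result table R and a free relation table P."""
    sizes = Counter(community_of.values())
    R, P = [], []
    for src, dst, relationship, similarity in edges:
        c_src = community_of.get(src)
        c_dst = community_of.get(dst)
        in_same_group = c_src is not None and c_src == c_dst
        if in_same_group and sizes[c_src] >= min_group_size:
            # Attach the service label of the community, if one was defined.
            R.append((src, dst, relationship, similarity, labels.get(c_src, "unlabeled")))
        else:
            P.append((src, dst, relationship, similarity))
    return R, P


edges = [
    ("31.119.0.185", "31.119.0.219", "communication", 35),
    ("31.119.0.185", "31.119.0.177", "communication", 49),
    ("31.118.152.51", "31.119.2.77", "communication", 29),
]
# Hypothetical output of the community discovery step: IP -> community id.
community_of = {
    "31.119.0.185": 0, "31.119.0.219": 0, "31.119.0.177": 0,
    "31.118.152.51": 1, "31.119.2.77": 1,
}
labels = {0: "group A"}  # hypothetical service label

R, P = classify_relations(edges, community_of, labels)
print(R)
print(P)
```

With these inputs the two relations inside community 0 land in R with the label "group A", while the pair forming the two-member community 1 falls below the size threshold and is recorded in P.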
The method maps the power terminal communication network into a directed weighted network in which each IP is a node, each communication relation is an edge, and the similarity of the instructions sent by the IPs is the edge weight, and applies the classical Louvain community classification algorithm to perform community mining. Using a relation weight based on the similarity of device communication instructions as a distinguishing feature improves the classification effect of the algorithm, so that the interconnection relations between communication devices are classified accurately and efficiently into clearly separated communication groups. Group type labels designed according to the power network service logic then yield power distribution terminal network communication relation groups with actual service types. This solves the problems of low efficiency, low accuracy, and high later maintenance cost in the prior art.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An interconnection relation classification method based on a Spark graph algorithm, characterized by comprising the following steps:
S101, data acquisition: acquiring inter-network communication flow data of power distribution terminals;
S102, data processing: generating a node data table V and a relation data table E;
S103, graph relation generation: based on the node data table V and the relation data table E, mapping the power terminal communication network into a directed weighted network in which each IP is a node, each communication relation is an edge, and the similarity of the instructions sent by the IPs is the edge weight, and generating a graph relation G by applying a Spark graph algorithm;
S104, performing communication group discovery on the graph relation G by using the Louvain algorithm;
S105, according to the result of step S104, setting group labels in combination with the service and performing group classification; and screening out free relations and recording them as a free relation table P.
2. The interconnection relationship classification method based on spark graph algorithm as claimed in claim 1, wherein the method in step S101 is:
acquiring inter-network flow data of the power distribution terminal from a data management and control department of an electric power company, wherein the time period is one week; the data contains a five-tuple structure, which at least comprises a source IP, a source port, a destination IP, a destination port, a communication type, a communication instruction, source equipment information and destination equipment information.
3. The interconnection relationship classification method based on spark graph algorithm as claimed in claim 1, wherein the method in step S102 is:
S1021, extracting all source IPs and destination IPs of the inter-network flow data together with the instruction sets corresponding to them, and, after grouping, aggregation, and deduplication, forming a data table with two columns, recorded as the node data table V;
S1022, extracting all communication records between source IPs and destination IPs of the inter-network flow data; obtaining the one-to-one correspondence between source IP and destination IP after deduplication; obtaining the instruction set corresponding to each IP from the node data table V obtained in step S1021; calculating the instruction-set similarity of the source IP and the destination IP by using the Levenshtein distance algorithm to generate the field "instruction similarity"; and finally obtaining a relation data table with four columns, whose fields are: source IP, destination IP, communication relation, and instruction similarity; this table is denoted relation data table E.
4. The interconnection relationship classification method based on spark graph algorithm according to any one of claims 1 to 3, wherein the method in step S104 is: based on the graph relationship G generated in step S103, a Louvain community discovery algorithm is applied to perform community mining, so as to obtain communication groups with an obvious classification.
5. The interconnection relationship classification method based on spark graph algorithm as claimed in any one of claims 1 to 3, wherein the method in step S105 is: based on the communication groups calculated in step S104, screening node relationships that are far away from or isolated from each group, and generating a free relationship table; and establishing a communication group category label by combining the specific communication service type and characteristics.
6. An interconnection relation classification system based on spark graph algorithm is characterized by comprising
The data acquisition module is used for acquiring the data of the communication flow between the power distribution terminal networks;
the data processing module is used for generating a node data table V and a relation data table E;
the graph relation generating module is used for generating a graph relation G by applying a Spark graph algorithm based on the node data table V and the relation data table E;
the communication group discovery module is used for discovering the communication group by using the Louvain algorithm;
the group classification module is used for performing group label setting by combining services according to the discovery result of the communication group discovery module and performing group classification; and screening the free relation to be recorded as a free relation table P.
7. The spark graph algorithm-based interconnection relationship classification system according to claim 6, wherein the data acquisition module acquires data as follows: acquiring inter-network flow data of the power distribution terminal from a data management and control department of an electric power company, wherein the time period is one week; the data contains a five-tuple structure, which at least comprises a source IP, a source port, a destination IP, a destination port, a communication type, a communication instruction, source equipment information and destination equipment information.
8. The spark graph algorithm-based interconnection relationship classification system according to claim 6, wherein the method for generating the node data table V and the relationship data table E in the data processing module is as follows:
S1021, extracting all source IPs and destination IPs of the inter-network flow data together with the instruction sets corresponding to them, and, after grouping, aggregation, and deduplication, forming a data table with two columns, recorded as the node data table V;
S1022, extracting all communication records between source IPs and destination IPs of the inter-network flow data; obtaining the one-to-one correspondence between source IP and destination IP after deduplication; obtaining the instruction set corresponding to each IP from the node data table V obtained in step S1021; calculating the instruction-set similarity of the source IP and the destination IP by using the Levenshtein distance algorithm to generate the field "instruction similarity"; and finally obtaining a relation data table with four columns, whose fields are: source IP, destination IP, communication relation, and instruction similarity; this table is denoted relation data table E.
9. The interconnection relationship classification system based on spark graph algorithm as claimed in claim 6, wherein the method for group discovery in the communication group discovery module is: and based on the graph relation G, carrying out community mining by applying a Louvain community discovery algorithm to obtain a communication group with obvious classification.
10. The spark graph algorithm-based interconnection relationship classification system according to claim 6, wherein the group classification module screens node relationships far away from or isolated from each group based on the obtained communication groups to generate a free relationship table; and establishing a communication group category label by combining the specific communication service type and characteristics.
CN202011543625.4A 2020-12-23 2020-12-23 Interconnection relation classification method and system based on spark graph algorithm Active CN112714080B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011543625.4A CN112714080B (en) 2020-12-23 2020-12-23 Interconnection relation classification method and system based on spark graph algorithm

Publications (2)

Publication Number Publication Date
CN112714080A true CN112714080A (en) 2021-04-27
CN112714080B CN112714080B (en) 2023-10-17

Family

ID=75543998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011543625.4A Active CN112714080B (en) 2020-12-23 2020-12-23 Interconnection relation classification method and system based on spark graph algorithm

Country Status (1)

Country Link
CN (1) CN112714080B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018014610A1 (en) * 2016-07-20 2018-01-25 武汉斗鱼网络科技有限公司 C4.5 decision tree algorithm-based specific user mining system and method therefor
CN108509551A (en) * 2018-03-19 2018-09-07 西北大学 A kind of micro blog network key user digging system under the environment based on Spark and method
WO2018222064A1 (en) * 2017-05-29 2018-12-06 Huawei Technologies Co., Ltd. Systems and methods of hierarchical community detection in graphs
CN109446395A (en) * 2018-09-29 2019-03-08 上海派博软件有限公司 A kind of method and system of the raising based on Hadoop big data comprehensive inquiry engine efficiency
CN110647942A (en) * 2019-09-25 2020-01-03 广东电网有限责任公司 Intrusion detection method, device and equipment for satellite network
US10581851B1 (en) * 2019-07-17 2020-03-03 Capital One Services, Llc Change monitoring and detection for a cloud computing environment
CN111178587A (en) * 2019-12-06 2020-05-19 广东工业大学 Spark framework-based short-term power load rapid prediction method
CN111917600A (en) * 2020-06-12 2020-11-10 贵州大学 Spark performance optimization-based network traffic classification device and classification method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAO WANG; YING LIU; WEI SU: "《Real-Time Classification Method of Network Traffic Based on Parallelized CNN》", 《2019 IEEE INTERNATIONAL CONFERENCE ON POWER, INTELLIGENT COMPUTING AND SYSTEMS》 *
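Liu Zhaolu, Zhao Ying, Liu Shumei (刘兆禄,赵英,刘淑梅): "Research on a Spark-based network traffic classification method" (基于Spark的网络流量分类方法研究), Journal on Communications (《通信学报》) *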

Also Published As

Publication number Publication date
CN112714080B (en) 2023-10-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant