CN112714080B - Interconnection relation classification method and system based on spark graph algorithm - Google Patents

Info

Publication number
CN112714080B
Authority
CN
China
Prior art keywords
communication
relation
data table
data
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011543625.4A
Other languages
Chinese (zh)
Other versions
CN112714080A (en)
Inventor
陶景龙
梁淑云
刘胜
马影
王启凡
魏国富
殷钱安
余贤喆
周晓勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd filed Critical Information and Data Security Solutions Co Ltd
Priority to CN202011543625.4A priority Critical patent/CN112714080B/en
Publication of CN112714080A publication Critical patent/CN112714080A/en
Application granted granted Critical
Publication of CN112714080B publication Critical patent/CN112714080B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00: Traffic control in data switching networks
    • H04L47/10: Flow control; Congestion control
    • H04L47/24: Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441: Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14: Network analysis or design
    • H04L41/145: Network analysis or design involving simulating, designing, planning or modelling of a network

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to an interconnection relation classification method based on a Spark graph algorithm, comprising: generating a node data table V and a relation data table E; generating a graph relation G from the node data table V and the relation data table E by applying a Spark graph algorithm; performing communication group discovery with the Louvain community discovery algorithm; setting group labels in combination with the service and classifying the groups; and screening out the free relationships and recording them as a free relation table P. A system based on the method is also provided. The invention models the communication network as a directed weighted network with IPs as nodes, communication relations as edges, and the similarity of the instructions sent by the IPs as edge weights, and mines communities with the classical Louvain community detection algorithm. Using a relation weight based on the similarity of device communication instructions as a salient feature effectively improves the classification result, so that the interconnection relations among communication devices can be classified accurately and efficiently, yielding clearly separated communication groups.

Description

Interconnection relation classification method and system based on spark graph algorithm
Technical Field
The invention relates to the technical field of network traffic classification, in particular to an interconnection relation classification method and system based on a spark graph algorithm.
Background
Distribution network terminal devices in the electric power industry communicate with one another. As construction requirements and services develop, the number of communication devices and facilities grows, networking relations become more complex, and the communication function category to which each interconnection relation belongs becomes harder to distinguish. At present this task is done by manual verification: the traffic logs between devices are screened and classified by keyword matching or fixed rules. This traditional approach is time-consuming and labor-intensive, and its accuracy is difficult to guarantee; as communication functions grow more complex, the previously defined keywords and fixed rules must be maintained continually, and classification results may not be produced in time. To improve efficiency and accuracy and to reduce cost, automatic classification of communication device interconnection relations by data mining techniques has become increasingly important.
Manual verification depends mainly on how well the staff understand the power distribution networking service and the communication types of power distribution terminals. On that basis a keyword matching library and a set of fixed rules are established, the communication traffic logs between devices are screened and analyzed accordingly, and the classification type is finally obtained. The traditional device interconnection relation classification method therefore has low efficiency, low accuracy and high later maintenance cost.
One prior-art invention discloses a network traffic classification device and classification method based on Spark performance optimization, belonging to the technical field of network traffic classification. Its data preprocessing module collects the raw traffic data and extracts time-related features; its model training module is used to classify the network traffic; its real-time classification module loads the classification model trained by the model training module and classifies the preprocessed data under a Topic; and its Spark performance optimization module provides performance optimization support for the model training module and the real-time classification module. By constructing a Spark Shuffle performance optimization architecture and a weighted random forest algorithm, that invention classifies network traffic rapidly and accurately, so that network service providers can offer different service strategies for different application scenarios, further improving network service quality and safeguarding network security. That method classifies network traffic with a random forest algorithm, but because power network communication data differs greatly from general network traffic data, it cannot be applied directly to the classification of power network communication data.
In summary, the device interconnection relationship classification method in the prior art cannot accurately, efficiently and inexpensively classify the interconnection relationship between the power distribution terminal devices in the power industry. Therefore, to improve efficiency, accuracy, and reduce cost, the need to achieve automatic classification of communication device interconnections by data mining techniques has become increasingly important.
Disclosure of Invention
The technical problem to be solved by the invention is that the interconnection relation between the power distribution terminal equipment in the power industry cannot be classified accurately, efficiently and at low cost in the prior art.
The invention solves the technical problems by the following technical means:
An interconnection relation classification method based on a Spark graph algorithm comprises the following steps:
S101, data acquisition: acquiring communication traffic data among power distribution terminals;
S102, data processing: generating a node data table V and a relation data table E;
S103, graph relationship generation: based on the node data table V and the relation data table E, mapping the power terminal communication network into a directed weighted network with IPs as nodes, communication relationships as edges and the similarity of the instructions sent by the IPs as edge weights, and generating a graph relationship G by applying a Spark graph algorithm;
S104, communication group discovery: performing communication group discovery on the graph relationship G with the Louvain community discovery algorithm;
S105, group classification: according to the result of step S104, setting group labels in combination with the service and classifying the groups; and screening out the free relationships and recording them as a free relationship table P.
The invention models the communication network as a directed weighted network with IPs as nodes, communication relations as edges, and the similarity of the instructions sent by the IPs as edge weights, and mines communities with the classical Louvain community detection algorithm. Designing the relation weight on the similarity of device communication instructions, as a salient feature, effectively improves the classification result of the algorithm, so that the interconnection relations between communication devices can be classified accurately and efficiently and clearly separated communication groups are obtained. Group type labels are then designed in combination with the power network business logic, yielding power distribution terminal network communication relation groups with actual business types. This solves the problems of low efficiency, low accuracy and high later maintenance cost in the prior art.
Further, the method in step S101 is as follows:
traffic data among power distribution terminals is acquired from the data management and control department of an electric company, covering a period of one week; the data comprises a five-tuple structure and includes at least a source IP, a source port, a destination IP, a destination port, a communication type, a communication instruction, source device information and destination device information.
Further, the method in step S102 is as follows:
S1021, extracting all source IPs and destination IPs of the inter-network traffic data together with their corresponding instruction sets, and, after grouping, aggregation and de-duplication, forming a data table with two columns, recorded as the node data table V;
S1022, extracting all communication records between source IPs and destination IPs of the inter-network traffic data and de-duplicating them to obtain the one-to-one correspondence between source IP and destination IP; combining the node data table V obtained in step S1021 to obtain the instruction set corresponding to each IP; then calculating the similarity between the instruction sets of the source IP and the destination IP with the Levenshtein distance algorithm to generate the field "instruction similarity"; and finally obtaining a relation data table with four columns, whose fields are: source IP, destination IP, communication relationship and instruction similarity; this table is recorded as the relation data table E.
Further, the method in step S104 is as follows: based on the graph relationship G generated in step S103, the Louvain community discovery algorithm is applied to perform community mining, obtaining clearly separated communication groups.
Further, the method in step S105 is as follows: based on the communication groups calculated in step S104, node relations that are far from every group or isolated are screened out to generate a free relation table; and communication group category labels are established in combination with the specific communication service types and characteristics.
The invention also provides an interconnection relation classification system based on the Spark graph algorithm, comprising:
a data acquisition module, used for acquiring communication traffic data among the power distribution terminal networks;
a data processing module, which generates a node data table V and a relation data table E;
a graph relation generating module, used for generating a graph relation G by applying a Spark graph algorithm based on the node data table V and the relation data table E;
a communication group discovery module, used for performing communication group discovery with the Louvain community discovery algorithm;
a group classification module, used for setting group labels according to the discovery result of the communication group discovery module and classifying the groups in combination with the service, and for screening out the free relationships and recording them as a free relationship table P.
Further, the specific acquisition process of the data acquisition module is as follows: traffic data among power distribution terminals is acquired from the data management and control department of an electric company, covering a period of one week; the data comprises a five-tuple structure and includes at least a source IP, a source port, a destination IP, a destination port, a communication type, a communication instruction, source device information and destination device information.
Further, the method for generating the node data table V and the relation data table E in the data processing module is as follows:
S1021, extracting all source IPs and destination IPs of the inter-network traffic data together with their corresponding instruction sets, and, after grouping, aggregation and de-duplication, forming a data table with two columns, recorded as the node data table V;
S1022, extracting all communication records between source IPs and destination IPs of the inter-network traffic data and de-duplicating them to obtain the one-to-one correspondence between source IP and destination IP; combining the node data table V obtained in step S1021 to obtain the instruction set corresponding to each IP; then calculating the similarity between the instruction sets of the source IP and the destination IP with the Levenshtein distance algorithm to generate the field "instruction similarity"; and finally obtaining a relation data table with four columns, whose fields are: source IP, destination IP, communication relationship and instruction similarity; this table is recorded as the relation data table E.
Further, the method for group discovery in the communication group discovery module is as follows: based on the graph relation G, the Louvain community discovery algorithm is applied to perform community mining, obtaining clearly separated communication groups.
Further, based on the obtained communication groups, the group classification module screens out node relations that are far from every group or isolated to generate a free relation table, and establishes communication group category labels in combination with the specific communication service types and characteristics.
The invention has the advantages that:
the invention models the communication network as a directed weighted network with IPs as nodes, communication relations as edges, and the similarity of the instructions sent by the IPs as edge weights, and mines communities with the classical Louvain community detection algorithm. Designing the relation weight on the similarity of device communication instructions, as a salient feature, effectively improves the classification result of the algorithm, so that the interconnection relations between communication devices can be classified accurately and efficiently and clearly separated communication groups are obtained. Group type labels are then designed in combination with the power network business logic, yielding power distribution terminal network communication relation groups with actual business types. This solves the problems of low efficiency, low accuracy and high later maintenance cost in the prior art.
Drawings
FIG. 1 is a flow chart of a classification method according to an embodiment of the invention;
FIG. 2 is a graph relationship G generated by the node data table V and the relationship data table E in an embodiment of the present invention;
FIG. 3 is a diagram illustrating a communication group G-C obtained by community mining based on the graph relationship G in an embodiment of the present invention;
FIG. 4 is a table P of free relations generated based on communication groups G-C in an embodiment of the invention;
FIG. 5 is a table R of communication relationship classification results generated based on communication groups G-C in an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
This embodiment provides an interconnection relation classification method based on a Spark graph algorithm, which solves the problem that the prior art cannot classify the interconnection relations between power distribution terminal devices in the power industry accurately, efficiently and at low cost.
As shown in FIG. 1, the interconnection relation classification method based on a Spark graph algorithm specifically comprises the following steps:
S101, data acquisition: acquiring communication traffic data among power distribution terminals;
S102, data processing: generating a node data table V and a relation data table E from the inter-network communication traffic data;
S103, graph relation generation: applying a Spark graph algorithm to generate a graph relation G;
S104, communication group discovery: performing group discovery with the Louvain algorithm;
S105, group classification: according to the result of step S104, setting group labels in combination with the service and classifying the groups; and screening out the free relationships and recording them as a free relationship table P.
The contents of each step are specifically described below:
The method in S101 is:
traffic data among power distribution terminal networks is acquired from the data management and control department of an electric company, covering a period of at least one week; the data typically comprises a five-tuple structure, including but not limited to: source IP, source port, destination IP, destination port, communication type, communication instruction, source device information and destination device information.
The method in S102 is:
content is extracted from the acquired inter-network traffic data to generate the node data table V and the relation data table E, respectively;
S1021, extracting all source IPs and destination IPs together with their corresponding instruction sets and, after grouping, forming a data table with two columns, recorded as the node data table V, whose content has the form shown in Table 1.
Table 1 data node table
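The grouping and de-duplication of step S1021 can be sketched in plain Python as follows. This is only an illustrative sketch, not the patented implementation: the record keys `src_ip`, `dst_ip` and `instruction`, the function name `build_node_table` and the sample addresses are hypothetical, and the sketch assumes that an instruction is attributed to the source IP that sent it.

```python
from collections import defaultdict


def build_node_table(records):
    """Group flow records by IP and collect each IP's de-duplicated instructions.

    `records` is an iterable of dicts with the hypothetical keys
    'src_ip', 'dst_ip' and 'instruction'.
    Returns a two-column table as a list of (ip, sorted_instructions) rows.
    """
    instructions = defaultdict(set)
    for rec in records:
        # Assumption: an instruction belongs to the source IP that sent it.
        instructions[rec["src_ip"]].add(rec["instruction"])
        # A destination IP is still a node, possibly with an empty set.
        instructions.setdefault(rec["dst_ip"], set())
    return [(ip, sorted(cmds)) for ip, cmds in sorted(instructions.items())]


# Made-up sample records; the second one duplicates the first.
records = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "instruction": "READ"},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "instruction": "READ"},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.3", "instruction": "WRITE"},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.1", "instruction": "ACK"},
]
V = build_node_table(records)
```

On a Spark cluster the same grouping would more likely be expressed as a DataFrame aggregation; the plain-Python form is used here only to keep the sketch self-contained.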
S1022, extracting all communication records between source IPs and destination IPs and de-duplicating them to obtain the one-to-one correspondence between source IP and destination IP; combining the node table V obtained in S1021 to obtain the instruction set corresponding to each IP; then calculating the similarity between the instruction sets of the source IP and the destination IP with the Levenshtein distance algorithm to generate the field "instruction similarity"; and finally obtaining a relation data table with four columns, whose fields are: source IP, destination IP, communication relationship and instruction similarity. This table is recorded as the relation table E, whose content has the form shown in Table 2; the instructions and IPs in Table 2 have been desensitized.
Table 2 data relationship table
Here the Levenshtein distance algorithm, also called the edit distance algorithm, refers to: the minimum number of editing operations required to transform one string into another. The permitted editing operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the greater the similarity of the two strings. In the processed relationship table, however, a larger value represents a greater similarity.
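As an illustration of the edit distance described above, the following is a minimal dynamic-programming sketch. The helper `instruction_string`, which serializes an instruction set into one string before comparison, is an assumption; the text does not state how instruction sets are turned into strings, nor how the distance is converted into the similarity values stored in the relationship table.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character substitutions, insertions and
    deletions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string inside
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb (or keep it)
            ))
        previous = current
    return previous[-1]


def instruction_string(instructions) -> str:
    """Hypothetical serialization: sorted instructions joined by commas."""
    return ",".join(sorted(instructions))


distance = levenshtein(instruction_string({"READ", "WRITE"}),
                       instruction_string({"READ"}))
```

Only two rows of the table are kept in memory at a time, so the memory cost is proportional to the length of the shorter string.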
The method in S103 is: the power terminal communication network is mapped into a directed weighted network with IPs as nodes, communication relationships as edges and the similarity of the instructions sent by the IPs as edge weights. Specifically, the GraphFrames graph algorithm package is imported in a Spark big-data computing platform environment, and the graph relationship of tables V and E is generated and recorded as graph relationship G, as shown in FIG. 2.
Here Spark refers to: a fast, general-purpose computing engine designed for large-scale data processing, which has grown into a rapidly developing and widely used ecosystem and is characterized by high speed, ease of use and generality.
Here the GraphFrames graph algorithm package refers to: a library built on Spark DataFrames, which exploits the scalability and performance of DataFrames and provides a unified graph processing API for Scala, Java and Python; its advantages include a unified API, powerful queries, saving and loading of graphs, and ease of porting.
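A hedged sketch of how tables V and E might be handed to GraphFrames is given below. GraphFrames expects a vertex DataFrame with an `id` column and an edge DataFrame with `src` and `dst` columns; the other column names, the helper names `to_graph_rows` and `build_graph`, and the placeholder instruction names are assumptions. The three edge rows are taken from Table 2. `build_graph` needs a running SparkSession with the `graphframes` package installed, so the import is done lazily and the row-preparation helper can be tried without Spark.

```python
VERTEX_COLUMNS = ["id", "instructions"]
EDGE_COLUMNS = ["src", "dst", "relationship", "similarity"]


def to_graph_rows(node_table, relation_table):
    """Turn tables V and E into plain tuples matching the column lists above."""
    # Instructions are joined into one string so that empty sets stay typed.
    vertices = [(ip, ",".join(cmds)) for ip, cmds in node_table]
    edges = [(src, dst, rel, float(sim)) for src, dst, rel, sim in relation_table]
    return vertices, edges


def build_graph(spark, node_table, relation_table):
    """Create graph relationship G as a GraphFrame (requires Spark + graphframes)."""
    from graphframes import GraphFrame  # lazy import: only needed on a cluster

    vertices, edges = to_graph_rows(node_table, relation_table)
    v_df = spark.createDataFrame(vertices, VERTEX_COLUMNS)
    e_df = spark.createDataFrame(edges, EDGE_COLUMNS)
    return GraphFrame(v_df, e_df)


# Placeholder instruction names; the real (desensitized) sets are not given.
node_table = [
    ("31.119.0.185", ["cmd_a", "cmd_b"]),
    ("31.119.0.177", ["cmd_a"]),
    ("31.119.0.219", []),
]
# Three rows from Table 2.
relation_table = [
    ("31.119.0.185", "31.119.0.219", "communication", 35),
    ("31.119.0.185", "31.119.0.177", "communication", 49),
    ("31.119.0.177", "31.119.0.185", "communication", 49),
]
vertices, edges = to_graph_rows(node_table, relation_table)
```

With a session available, `build_graph(spark, node_table, relation_table)` would return the graph object; the edge direction is carried by `src` and `dst`, and the weight by the `similarity` column.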
The method in S104 is: based on the graph relationship G generated in step S103, the classical Louvain community discovery algorithm is applied to perform community mining, obtaining clearly separated communication groups G-C, in the form shown in FIG. 3 (community relationship graph G-C).
Here the Louvain algorithm refers to: an algorithm based on modularity. Its principle is to evaluate a partition by the difference between the within-module cohesion of that partition and the cohesion of a random partition, and to search for the partition with the best module cohesion. The specific algorithm and formula are as follows:
The points in the network are traversed repeatedly. Each point is taken out of its original community, the modularity increment that would result from adding it to each community is calculated, and the point is added to the community with the largest modularity increment; this continues until no point can be moved. Each community is then merged into a super point, and the above steps are repeated until the modularity no longer increases. The modularity increment is the change in the value of the modularity after a point is taken out of its original community and added to another community, and the calculation formula is as follows:
In the formula, Σin represents the sum of all edge weights within community C; C represents the community to which point i is to be added; i represents the node to be moved; k_{i,in} represents the sum of the weights of all edges from point i to community C; Σtot represents the sum of the weights of all edges connected to community C.
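For orientation, the symbols defined above match the standard modularity gain of the Louvain method, ΔQ = [(Σin + k_{i,in})/(2m) − ((Σtot + k_i)/(2m))²] − [Σin/(2m) − (Σtot/(2m))² − (k_i/(2m))²], which for an isolated node i reduces to ΔQ = k_{i,in}/m − Σtot·k_i/(2m²) when Σin and k_{i,in} are counted so that each internal edge contributes to both of its endpoints. Whether this is exactly the formula intended in the original is an assumption, and the symbols k_i (total weight of the edges incident to i) and m (total edge weight of the graph) are not defined in the text. The sketch below treats the graph as undirected for simplicity, which is a further assumption, and checks the simplified gain against a direct modularity computation.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of an undirected weighted graph without self-loops.

    edges: list of (u, v, weight); community: dict mapping node -> label.
    """
    m = sum(w for _, _, w in edges)      # total edge weight
    degree = defaultdict(float)          # weighted degree of each node
    internal = defaultdict(float)        # edge weight inside each community
    for u, v, w in edges:
        degree[u] += w
        degree[v] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    total = defaultdict(float)           # sum of degrees per community
    for node, k in degree.items():
        total[community[node]] += k
    return sum(internal[c] / m - (total[c] / (2 * m)) ** 2 for c in total)


def modularity_gain(k_i_in, sigma_tot, k_i, m):
    """Gain from moving an isolated node i into community C (simplified form)."""
    return k_i_in / m - sigma_tot * k_i / (2 * m * m)


# Toy graph: triangle a-b-c plus a pendant edge c-d, all weights 1.
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0), ("c", "d", 1.0)]
before = {"a": "A", "b": "A", "c": "c_alone", "d": "d_alone"}
after = {"a": "A", "b": "A", "c": "A", "d": "d_alone"}

direct = modularity(edges, after) - modularity(edges, before)
# Node c: two unit edges into A, degree 3; community A has total degree 4; m = 4.
gain = modularity_gain(k_i_in=2.0, sigma_tot=4.0, k_i=3.0, m=4.0)
```

Both `direct` and `gain` evaluate to 0.125 on this toy graph, which is the quantity the local-move step compares across candidate communities.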
The method in S105 is: from the communication groups calculated in step S104, node relations that are far from the groups or isolated are screened out to generate the free relation table P, whose content has the form shown in FIG. 4; communication group category labels are established in combination with the specific communication service types and characteristics to generate the communication relation classification result table R, whose content has the form shown in FIG. 5. This completes the classification of the interconnection relations among power distribution terminal devices in the power industry.
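The screening of free relationships can be sketched as below. The text does not define what "far from the groups or isolated" means quantitatively, so the criterion used here is purely an assumption: an edge counts as free when its endpoints lie in different communities, or when their shared community has fewer than `min_size` members. The function name, the node names and the community labels are hypothetical.

```python
from collections import Counter


def split_free_relations(edges, community, min_size=3):
    """Split edges into (grouped, free) rows.

    Assumed criterion (not from the source): an edge is 'free' if its two
    endpoints are in different communities, or if their common community
    has fewer than `min_size` members.
    """
    sizes = Counter(community.values())  # number of nodes per community
    grouped, free = [], []
    for src, dst, weight in edges:
        label = community[src]
        if label == community[dst] and sizes[label] >= min_size:
            grouped.append((src, dst, weight, label))
        else:
            free.append((src, dst, weight))
    return grouped, free


# Hypothetical community assignment produced by a Louvain run.
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
edges = [("a", "b", 35), ("b", "c", 27), ("c", "d", 18), ("d", "e", 44)]
R_rows, P_rows = split_free_relations(edges, community)
```

In this sketch the grouped rows would feed the classification result table R once each community label is mapped to a business category, while the remaining rows form the free relation table P.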
The invention models the communication network as a directed weighted network with IPs as nodes, communication relations as edges, and the similarity of the instructions sent by the IPs as edge weights, and mines communities with the classical Louvain community detection algorithm. Designing the relation weight on the similarity of device communication instructions, as a salient feature, effectively improves the classification result of the algorithm, so that the interconnection relations between communication devices can be classified accurately and efficiently and clearly separated communication groups are obtained. Group type labels are then designed in combination with the power network business logic, yielding power distribution terminal network communication relation groups with actual business types. This solves the problems of low efficiency, low accuracy and high later maintenance cost in the prior art.
The invention also provides an interconnection relation classification system based on the Spark graph algorithm, comprising the following modules:
a data acquisition module, used for acquiring communication traffic data among the power distribution terminal networks;
a data processing module, which generates a node data table V and a relation data table E from the inter-network communication traffic data;
a graph relation generating module, used for generating a graph relation G by applying a Spark graph algorithm;
a communication group discovery module, used for performing group discovery with the Louvain algorithm;
a group classification module, used for setting group labels according to the result obtained by the communication group discovery module and classifying the groups in combination with the service, and for screening out the free relationships and recording them as a free relationship table P.
The contents of each module are specifically described below:
The method in the data acquisition module is as follows:
traffic data among power distribution terminal networks is acquired from the data management and control department of an electric company, covering a period of at least one week; the data typically comprises a five-tuple structure, including but not limited to: source IP, source port, destination IP, destination port, communication type, communication instruction, source device information and destination device information.
The method in the data processing module is as follows:
content is extracted from the acquired inter-network traffic data to generate the node data table V and the relation data table E, respectively;
S1021, extracting all source IPs and destination IPs together with their corresponding instruction sets and, after grouping, forming a data table with two columns, recorded as the node data table V, whose content has the form shown in Table 1.
Table 1 data node table
S1022, extracting all communication records between source IPs and destination IPs and de-duplicating them to obtain the one-to-one correspondence between source IP and destination IP; combining the node table V obtained in S1021 to obtain the instruction set corresponding to each IP; then calculating the similarity between the instruction sets of the source IP and the destination IP with the Levenshtein distance algorithm to generate the field "instruction similarity"; and finally obtaining a relation data table with four columns, whose fields are: source IP, destination IP, communication relationship and instruction similarity. This table is recorded as the relation table E, whose content has the form shown in Table 2; the instructions and IPs in Table 2 have been desensitized.
Table 2 data relationship table
src             dst             relationship   similarity
31.119.0.185    31.119.0.219    communication  35
31.118.224.103  31.118.244.171  communication  32
31.118.244.171  31.118.188.7    communication  27
31.14.128.12    31.119.0.185    communication  18
31.119.0.185    31.119.0.177    communication  49
31.118.168.13   31.118.224.103  communication  29
31.119.0.177    31.118.224.107  communication  27
31.118.244.171  31.119.0.177    communication  26
31.119.0.185    31.118.140.24   communication  26
31.118.224.107  31.118.140.24   communication  28
31.14.128.12    31.119.0.177    communication  35
31.118.188.7    31.118.84.13    communication  44
31.119.0.177    31.119.0.185    communication  49
31.118.152.51   31.119.2.77     communication  29
31.118.140.24   31.119.0.219    communication  37
Wherein the lywenstein distance algorithm refers to: levenshtein Distance algorithm, also called Edit Distance algorithm, refers to the minimum number of editing operations required to change from one string to another between two strings. The permitted editing operations include replacing one character with another, inserting one character, and deleting one character. In general, the smaller the edit distance, the greater the similarity of the two strings. The larger the value in the relationship table, the greater the similarity is represented by the processed relationship table.
The method in the graph relation generation module is as follows: mapping the power terminal communication network into a directional weighting network with IP as a node, a communication relationship as an edge and the similarity of IP sending instructions as an edge weight, specifically, importing a graphframe graph algorithm package into a spark big data computing platform environment, generating a graph relationship of a V, E table, and recording the graph relationship G as shown in fig. 2.
Spark here refers to a fast, general-purpose computing engine designed for large-scale data processing. It has grown into a rapidly developing, widely used ecosystem and is characterized by high speed, ease of use, and generality.
The GraphFrames graph algorithm package is a library built on Spark DataFrames. It takes advantage of the scalability and performance of DataFrames, provides a unified graph processing API for Scala, Java, and Python, and offers powerful queries, graph saving and loading, and easy portability.
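The mapping from the relation data table E to a directed weighted graph can be sketched in plain Python, which keeps the example runnable without a Spark cluster. The rows are copied from the relation data table E above; the names `E_ROWS` and `build_graph` are chosen for this illustration. In GraphFrames itself, a graph is built as `GraphFrame(vertices, edges)` from a vertex DataFrame with an `id` column and an edge DataFrame with `src` and `dst` columns; the exact column mapping used by the invention is not stated here.

```python
# Rows from relation data table E: (source IP, destination IP, relation, weight).
E_ROWS = [
    ("31.119.0.185", "31.119.0.219", "communication", 35),
    ("31.14.128.12", "31.119.0.185", "communication", 18),
    ("31.119.0.185", "31.119.0.177", "communication", 49),
    ("31.119.0.177", "31.119.0.185", "communication", 49),
]


def build_graph(e_rows):
    """Return (vertices, out_edges) for a directed weighted graph.

    vertices  -- set of every IP that appears as a source or destination
    out_edges -- dict mapping source IP -> {destination IP: weight}
    """
    vertices = set()
    out_edges = {}
    for src, dst, _relationship, weight in e_rows:
        vertices.update((src, dst))
        out_edges.setdefault(src, {})[dst] = weight
    return vertices, out_edges


VERTICES, OUT_EDGES = build_graph(E_ROWS)
```

Because the graph is directed, the edge from 31.119.0.185 to 31.119.0.177 and the edge in the opposite direction are stored separately, even though both carry the weight 49.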
The method in the communication group discovery module is as follows: based on the generated graph relationship G, the classical Louvain community discovery algorithm is applied to perform community mining, yielding clearly separated communication groups G-C, in the form shown in FIG. 3 (community relationship graph G-C).
The Louvain algorithm is a modularity-based algorithm. Its principle is to evaluate a partition by the difference between the cohesion of the modules in that partition and the cohesion of a random partition, and to search for the partition with optimal module cohesion. The specific algorithm and formula are as follows:
The points in the network are traversed repeatedly. Each point is taken out of its original community, the modularity increment produced by adding it to each community is calculated, and the point is added to the community with the largest modularity increment; this continues until no point can be moved. Each community is then merged into a super point, and the steps are repeated until the modularity no longer increases. The modularity increment is the change in the value of the modularity after a point is taken out of its original community and added to another community. The calculation formula is as follows:
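The formula itself is not present in this text. Assuming the invention uses the standard Louvain modularity gain (Blondel et al., 2008), which is consistent with the symbols defined next, it reads:

```latex
\[
\Delta Q =
\left[ \frac{\Sigma_{in} + k_{i,in}}{2m}
     - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^{2} \right]
-
\left[ \frac{\Sigma_{in}}{2m}
     - \left( \frac{\Sigma_{tot}}{2m} \right)^{2}
     - \left( \frac{k_i}{2m} \right)^{2} \right]
\]
```

The symbols k_i (the sum of the weights of all edges incident to point i) and m (the sum of all edge weights in the network) come from that standard formulation and are assumptions here, since the source text does not define them.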
In the formula, Σ_in represents the sum of the weights of all edges inside community C; C represents the community to which point i is to be added; i represents the node to be moved; k_i,in represents the sum of the weights of all edges from point i to community C; Σ_tot represents the sum of the weights of all edges connected to community C.
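One step of the traversal described above, taking a point out of its community and choosing the community with the largest modularity increment, can be sketched as follows. This is a sketch under stated assumptions, not the invention's implementation: it uses the simplified standard Louvain gain ΔQ = k_i,in/(2m) - Σ_tot·k_i/(2m²), where k_i (total edge weight at point i) and m (total edge weight in the network) are symbols from the standard formulation, and it treats edges as undirected because the source does not say how edge direction is handled. The toy graph and all names are hypothetical.

```python
from collections import defaultdict


def modularity_gain(k_i_in, sigma_tot, k_i, m):
    """Simplified standard Louvain gain for adding an isolated point i to C."""
    return k_i_in / (2.0 * m) - sigma_tot * k_i / (2.0 * m * m)


def best_community(node, edges, community):
    """Return (best community id, gains per candidate community) for node.

    edges     -- list of (u, v, weight), treated as undirected (assumption)
    community -- dict mapping each point to its current community id
    """
    degree = defaultdict(float)
    m = 0.0
    for u, v, w in edges:
        degree[u] += w
        degree[v] += w
        m += w
    # Sigma_tot per community, computed with the node taken out of its community.
    sigma_tot = defaultdict(float)
    for n, c in community.items():
        if n != node:
            sigma_tot[c] += degree[n]
    # k_i,in per neighbouring community.
    k_i_in = defaultdict(float)
    for u, v, w in edges:
        if u == node and v != node:
            k_i_in[community[v]] += w
        elif v == node and u != node:
            k_i_in[community[u]] += w
    if not k_i_in:  # isolated point: nothing to join
        return community[node], {}
    gains = {c: modularity_gain(k, sigma_tot[c], degree[node], m)
             for c, k in k_i_in.items()}
    return max(gains, key=gains.get), gains


# Hypothetical toy graph: two triangles joined by the bridge c-d,
# with point c initially placed in the wrong community.
EDGES = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1), ("c", "d", 1)]
COMMUNITY = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1, "f": 1}
BEST, GAINS = best_community("c", EDGES, COMMUNITY)
```

In this toy graph, point c has two edges into community 0 and one into community 1, so the gain for community 0 is positive and the gain for staying in community 1 is negative; c is moved to community 0.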
The method in the group classification module is as follows: based on the communication groups obtained above, node relationships that are far from the groups or isolated are screened out to generate a free relation table P, whose content takes the form shown in FIG. 4. Communication group category labels are then established in combination with specific communication service types and characteristics to generate a communication relation classification result table R, whose content takes the form shown in FIG. 5. This completes the classification of interconnection relationships among power distribution terminal devices in the power industry.
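The screening and labelling step can be sketched as below. The source does not define how "far from the groups or isolated" is measured, so this sketch makes a loudly stated assumption: a point counts as free when its community has fewer than `min_size` members. The threshold, the community ids, and the label names are all hypothetical; only the IP addresses and weights are taken from the relation data table E above.

```python
from collections import Counter


def split_free_relations(edges, community, min_size=3):
    """Split edges into (free, grouped).

    An edge is 'free' when either endpoint belongs to a community with
    fewer than min_size members (an assumed criterion, not the patent's).
    """
    sizes = Counter(community.values())
    free_nodes = {n for n, c in community.items() if sizes[c] < min_size}
    free, grouped = [], []
    for src, dst, weight in edges:
        target = free if src in free_nodes or dst in free_nodes else grouped
        target.append((src, dst, weight))
    return free, grouped


def label_groups(community, labels):
    """Attach a service label to every point via its community id."""
    return {n: labels.get(c, "unlabelled") for n, c in community.items()}


EDGES = [
    ("31.119.0.185", "31.119.0.219", 35),
    ("31.119.0.185", "31.119.0.177", 49),
    ("31.119.0.177", "31.119.0.185", 49),
    ("31.118.152.51", "31.119.2.77", 29),
]
# Hypothetical community assignment, as a community discovery step might return it.
COMMUNITY = {
    "31.119.0.185": 0, "31.119.0.219": 0, "31.119.0.177": 0,
    "31.118.152.51": 1, "31.119.2.77": 1,
}
FREE_P, GROUPED = split_free_relations(EDGES, COMMUNITY)
RESULT_R = label_groups(COMMUNITY, {0: "service-group-A"})  # hypothetical label
```

With these inputs, the pair 31.118.152.51 and 31.119.2.77 forms a two-member community, so its single edge lands in the free relation table, while the three remaining edges stay grouped and their points receive the hypothetical label.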
The invention maps the network to a directed weighted network with IPs as nodes, communication relationships as edges, and the similarity of the instructions sent by the IPs as edge weights, and applies the classical Louvain community discovery algorithm for community mining. Designing the relationship weights from the similarity of device communication instructions gives the algorithm a distinctive feature and effectively improves its classification effect. The interconnection relationships between communication devices can thus be classified accurately and efficiently into clearly separated communication groups; group category labels designed from power network service logic then yield power distribution terminal network communication relationship groups with actual service types. This addresses the low efficiency, low accuracy, and high later maintenance cost of the prior art.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. An interconnection relation classification method based on a Spark graph algorithm, characterized by comprising the following steps:
S101, data acquisition: acquiring communication traffic data among power distribution terminals;
S102, data processing: generating a node data table V and a relation data table E;
S103, graph relationship generation: based on the node data table V and the relation data table E, mapping the power terminal communication network to a directed weighted network with IPs as nodes, communication relationships as edges, and the similarity of the instructions sent by the IPs as edge weights, and generating a graph relationship G by applying a Spark graph algorithm;
S104, performing communication group discovery with the Louvain community discovery algorithm based on the graph relationship G;
S105, according to the result of step S104, setting group labels in combination with the service and classifying the groups; screening out the free relations and recording them as a free relation table P;
wherein the method in step S102 is as follows:
S1021, extracting all source IPs and destination IPs of the inter-network traffic data together with their corresponding instruction sets, and, after grouping, aggregation, and de-duplication, forming a two-column data table, recorded as the node data table V;
S1022, extracting all communication records between source IPs and destination IPs of the inter-network traffic data and de-duplicating them to obtain the one-to-one correspondences between source IPs and destination IPs; combining these with the node data table V obtained in step S1021 to obtain the instruction set corresponding to each IP; then calculating the similarity of the instruction sets of the source IP and the destination IP with the Levenshtein distance algorithm to generate the field "instruction similarity"; and finally obtaining a four-column relation data table whose fields are: source IP, destination IP, communication relationship, and instruction similarity; this table is recorded as the relation data table E;
and the method in step S105 is: based on the communication groups calculated in step S104, screening out node relationships that are far from every group or isolated to generate a free relation table; and establishing communication group category labels in combination with specific communication service types and characteristics.
2. The interconnection relation classification method based on a Spark graph algorithm according to claim 1, wherein the method in step S101 is:
acquiring traffic data among power distribution terminals from the data management and control department of an electric power company, covering a period of one week; the data comprises a five-tuple structure and includes at least a source IP, a source port, a destination IP, a destination port, a communication type, a communication instruction, source equipment information, and destination equipment information.
3. The interconnection relation classification method based on a Spark graph algorithm according to claim 1 or 2, wherein the method in step S104 is: based on the graph relationship G generated in step S103, applying the Louvain community discovery algorithm to perform community mining to obtain clearly separated communication groups.
4. An interconnection relation classification system based on a Spark graph algorithm, characterized by comprising:
a data acquisition module for acquiring communication traffic data among power distribution terminal networks;
a data processing module for generating a node data table V and a relation data table E;
a graph relation generation module for generating a graph relationship G by applying a Spark graph algorithm based on the node data table V and the relation data table E;
a communication group discovery module for performing communication group discovery with the Louvain community discovery algorithm;
a group classification module for setting group labels according to the discovery result of the communication group discovery module and classifying the groups in combination with the service; screening out the free relations and recording them as a free relation table P;
wherein the method for generating the node data table V and the relation data table E in the data processing module comprises the following steps:
S1021, extracting all source IPs and destination IPs of the inter-network traffic data together with their corresponding instruction sets, and, after grouping, aggregation, and de-duplication, forming a two-column data table, recorded as the node data table V;
S1022, extracting all communication records between source IPs and destination IPs of the inter-network traffic data and de-duplicating them to obtain the one-to-one correspondences between source IPs and destination IPs; combining these with the node data table V obtained in step S1021 to obtain the instruction set corresponding to each IP; then calculating the similarity of the instruction sets of the source IP and the destination IP with the Levenshtein distance algorithm to generate the field "instruction similarity"; and finally obtaining a four-column relation data table whose fields are: source IP, destination IP, communication relationship, and instruction similarity; this table is recorded as the relation data table E;
and the group classification module, based on the obtained communication groups, screens out node relationships that are far from every group or isolated to generate a free relation table, and establishes communication group category labels in combination with specific communication service types and characteristics.
5. The interconnection relation classification system based on a Spark graph algorithm according to claim 4, wherein the data acquisition module specifically performs the following: acquiring traffic data among power distribution terminals from the data management and control department of an electric power company, covering a period of one week; the data comprises a five-tuple structure and includes at least a source IP, a source port, a destination IP, a destination port, a communication type, a communication instruction, source equipment information, and destination equipment information.
6. The interconnection relation classification system based on a Spark graph algorithm according to claim 4, wherein the method of group discovery in the communication group discovery module is as follows: based on the graph relationship G, applying the Louvain community discovery algorithm to perform community mining to obtain clearly separated communication groups.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011543625.4A CN112714080B (en) 2020-12-23 2020-12-23 Interconnection relation classification method and system based on spark graph algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011543625.4A CN112714080B (en) 2020-12-23 2020-12-23 Interconnection relation classification method and system based on spark graph algorithm

Publications (2)

Publication Number Publication Date
CN112714080A CN112714080A (en) 2021-04-27
CN112714080B true CN112714080B (en) 2023-10-17

Family

ID=75543998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011543625.4A Active CN112714080B (en) 2020-12-23 2020-12-23 Interconnection relation classification method and system based on spark graph algorithm

Country Status (1)

Country Link
CN (1) CN112714080B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018014610A1 (en) * 2016-07-20 2018-01-25 武汉斗鱼网络科技有限公司 C4.5 decision tree algorithm-based specific user mining system and method therefor
CN108509551A (en) * 2018-03-19 2018-09-07 西北大学 A kind of micro blog network key user digging system under the environment based on Spark and method
WO2018222064A1 (en) * 2017-05-29 2018-12-06 Huawei Technologies Co., Ltd. Systems and methods of hierarchical community detection in graphs
CN109446395A (en) * 2018-09-29 2019-03-08 上海派博软件有限公司 A kind of method and system of the raising based on Hadoop big data comprehensive inquiry engine efficiency
CN110647942A (en) * 2019-09-25 2020-01-03 广东电网有限责任公司 Intrusion detection method, device and equipment for satellite network
US10581851B1 (en) * 2019-07-17 2020-03-03 Capital One Services, Llc Change monitoring and detection for a cloud computing environment
CN111178587A (en) * 2019-12-06 2020-05-19 广东工业大学 Spark framework-based short-term power load rapid prediction method
CN111917600A (en) * 2020-06-12 2020-11-10 贵州大学 Spark performance optimization-based network traffic classification device and classification method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xiao Wang ; Ying Liu ; Wei Su.《Real-Time Classification Method of Network Traffic Based on Parallelized CNN》.《2019 IEEE International Conference on Power, Intelligent Computing and Systems》.2020, *
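Research on a Spark-based network traffic classification method (基于Spark的网络流量分类方法研究); Liu Zhaolu, Zhao Ying, Liu Shumei; Journal on Communications (《通信学报》); 2018-09-30; full text *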

Also Published As

Publication number Publication date
CN112714080A (en) 2021-04-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant